Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
6335
Aijun An Pawan Lingras Sheila Petty Runhe Huang (Eds.)
Active Media Technology 6th International Conference, AMT 2010 Toronto, Canada, August 28-30, 2010 Proceedings
Volume Editors Aijun An York University Department of Computer Science and Engineering Toronto, ON, M3J 1P3, Canada E-mail:
[email protected] Pawan Lingras Saint Mary’s University Department of Mathematics and Computing Science Halifax, NS, B3H 3C3, Canada E-mail:
[email protected] Sheila Petty University of Regina Faculty of Fine Arts Regina, SK, S4S 0A2, Canada E-mail:
[email protected] Runhe Huang Hosei University Faculty of Computer and Information Sciences Tokyo 184-8584, Japan E-mail:
[email protected]
Library of Congress Control Number: 2010933076
CR Subject Classification (1998): H.4, I.2, H.5, C.2, J.1, I.2.11
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN: 0302-9743
ISBN-10: 3-642-15469-7 Springer Berlin Heidelberg New York
ISBN-13: 978-3-642-15469-0 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
This volume contains the papers selected for presentation at the 2010 International Conference on Active Media Technology (AMT 2010), jointly held with the 2010 International Conference on Brain Informatics (BI 2010), at York University, Toronto, Canada, during August 28-30, 2010. Organized by the Web Intelligence Consortium (WIC) and IEEE Computational Intelligence Society Task Force on Brain Informatics (IEEE-CIS TF-BI), this conference was the sixth in the AMT series since its debut conference at Hong Kong Baptist University in 2001 (followed by AMT 2004 in Chongqing, China, AMT 2005 in Kagawa, Japan, AMT 2006 in Brisbane, Australia, AMT 2009 in Beijing, China). Active media technology (AMT) is a new area of research and development in intelligent information technology and computer science. It emphasizes the proactive, adaptive and seamless roles of interfaces and systems as well as new media in all aspects of digital life. Over the past few years, we have witnessed rapid developments of AMT technologies and applications ranging from business and communication to entertainment and learning. Examples include Facebook, Twitter, Flickr, YouTube, Moodle, Club Penguin and Google Latitude. Such developments have greatly changed our lives by enhancing the way we communicate and do business. The goal of the AMT conferences is to provide an international forum for exchanging scientific research and technological achievements in building AMTbased systems. AMT 2010 featured a selection of the latest research work and applications from the following areas related to AMT: active computer systems and intelligent interfaces, adaptive Web systems and information foraging agents, AMT for the Semantic Web, data mining, ontology mining and Web reasoning, e-commerce and Web services, entertainment and social applications of active media, evaluation of active media and AMT-based systems, intelligent information retrieval, machine learning and human-centered robotics, multi-agent systems, multi-modal processing, detection, recognition, and expression analysis, semantic computing for active media and AMT-based systems, smart digital media, Web-based social networks, and Web mining and intelligence. All the papers submitted to AMT 2010 were rigorously reviewed by three committee members and external reviewers. The selected papers offered new insights into the research challenges and development of AMT systems. AMT 2010 (together with BI 2010) also featured four keynote talks given by Ben Shneiderman of the University of Maryland, Jianhua Ma of Hosei University, Yingxu Wang of the University of Calgary, and Vinod Goel of York University. They spoke on their recent research in technology-mediated social participation, active smart u-things and cyber individuals, cognitive informatics and denotational mathematical means for brain informatics, and fractionating the rational
brain, respectively. The abstracts of the first two keynote talks, which were on AMT, are included in this volume. AMT 2010 could not be successful without a team effort. We would like to thank all the authors who contributed to this volume. We also thank the Program Committee members and external reviewers for their dedicated contribution in the paper selection process. Our special thanks go to Tetsuya Yoshida and Yue Xu for organizing a special session on text analysis and utilization, and to Hanmin Jung, Li Chen and Sung-Pil Choi for organizing a special session on technology intelligence. We are grateful to the Chairs and members of the Organizing Committee for their significant contribution to the organization of the conference. In particular, we would like to acknowledge the generous help received from Ning Zhong, Jimmy Huang, Vivian Hu, Jessie Zhao, Jun Miao, Ellis Lau and Heather Bai. Our appreciation also goes to Juzhen Dong for her excellent technical support of the AMT 2010 conference management system and its websites. Last but not the least, we thank Alfred Hofmann and Anna Kramer of Springer for their help in coordinating the publication of this special volume in an emerging and interdisciplinary research area. We appreciate the support and sponsorship from York University and the University of Regina. August 2010
Aijun An Pawan Lingras Sheila Petty Runhe Huang
Conference Organization
Conference General Chairs Sheila Petty Runhe Huang
University of Regina, Canada Hosei University, Japan
Program Chairs Aijun An Pawan Lingras
York University, Canada Saint Mary’s University, Halifax, Canada
Organizing Chair Jimmy Huang
York University, Canada
Publicity Chairs Daniel Tao Jian Yang
Queensland University of Technology, Australia International WIC Institute/BJUT, China
WIC Co-chairs/Directors Ning Zhong Jiming Liu
Maebashi Institute of Technology, Japan Hong Kong Baptist University, Hong Kong
IEEE-CIS TF-BI Chair Ning Zhong
Maebashi Institute of Technology, Japan
WIC Advisory Board Edward A. Feigenbaum Setsuo Ohsuga Benjamin Wah Philip Yu L.A. Zadeh
Stanford University, USA University of Tokyo, Japan University of Illinois, Urbana-Champaign, USA University of Illinois, Chicago, USA University of California, Berkeley, USA
WIC Technical Committee Jeffrey Bradshaw Nick Cercone Dieter Fensel Georg Gottlob Lakhmi Jain Jianchang Mao Pierre Morizet-Mahoudeaux Hiroshi Motoda Toyoaki Nishida Andrzej Skowron Jinglong Wu Xindong Wu Yiyu Yao
UWF/Institute for Human and Machine Cognition, USA York University, Canada University of Innsbruck, Austria Oxford University, UK University of South Australia, Australia Yahoo! Inc., USA Compiegne University of Technology, France Osaka University, Japan Kyoto University, Japan Warsaw University, Poland Okayama University, Japan University of Vermont, USA University of Regina, Canada
Program Committee Bill Andreopoulos Pradeep Atrey Virendra Bhavsar Jiannong Cao Sharat Chandran Li Chen Sung-Pil Choi Chin-Wan Chung Tharam Dillon Abdulmotaleb El Saddik Alexander Felfernig William Grosky Daryl Hepting Jiajin Huang Wolfgang Huerst Joemon Jose Hanmin Jung Brigitte Kerherv´e Yang Liu Yijuan Lu Brien Maguire Wenji Mao Wee Keong Ng
Technische Universit¨ at Dresden, Germany University of Winnipeg, Canada University of New Brunswick, Canada Hong Kong Polytechnic University, Hong Kong Indian Institute of Technology, Bombay, India Hong Kong Baptist University, Hong Kong Korea Institute of Science and Technology Information, Korea Korea Advanced Institute of Science and Technology (KAIST), Korea Curtin University of Technology, Australia University of Ottawa, Canada Graz University of Technology, Austria University of Michigan, USA University of Regina, Canada Beijing University of Technology, China Utrecht University, The Netherlands University of Glasgow, UK Korea Institute of Science and Technology Information, Korea Universit´e du Qu´ebec `a Montr´eal, Canada Shandong University, China Texas State University, USA University of Regina, Canada Institute of Automation, Chinese Academy of Sciences, China Nanyang Technological University, Singapore
Yoshihiro Okada Eugene Santos Dominik Slezak Xijin Tang
Hiroyuki Tarumi Ruizhi Wang Yue Xu Rong Yan Jian Yang Tetsuya Yoshida Mengjie Zhang Shichao Zhang Zili Zhang William Zhu
Kyushu University, Japan University of Connecticut, USA University of Warsaw and Infobright Inc., Poland Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, China Kagawa University, Japan Tongji University, China Queensland University of Technology, Australia Facebook, USA International WIC Institute, Beijing University of Technology, China Hokkaido University, Japan Victoria University of Wellington, New Zealand University of Technology, Australia Southwest University, China University of Electronic Science and Technology of China, China
External Reviewers Ansheng Ge Mehdi Kargar Mauricio Orozco
Damon Sotoudeh-Hosseini Peng Su Karthikeyan Vaiapury
Table of Contents
Keynote Talks Technology-Mediated Social Participation: Deep Science and Extreme Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ben Shneiderman Active Smart u-Things and Cyber Individuals . . . . . . . . . . . . . . . . . . . . . . . Jianhua Ma
1 5
Active Computer Systems and Intelligent Interfaces A Case for Content Distribution in Peer-to-Peer Networks . . . . . . . . . . . . . Morteza Analoui and Mohammad Hossein Rezvani
6
Interactive Visualization System for DES . . . . . . . . . . . . . . . . . . . . . . . . . . . Mohamed S. Asseisah, Hatem M. Bahig, and Sameh S. Daoud
18
Intelligent Implicit Interface for Wearable Items Suggestion . . . . . . . . . . . . Khan Aasim, Aslam Muhammad, and A.M. Martinez-Enriquez
26
Adaptive Web Systems and Information Foraging Agents Folksonomy-Based Ontological User Interest Profile Modeling and Its Application in Personalized Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaogang Han, Zhiqi Shen, Chunyan Miao, and Xudong Luo
34
Visualizing Threaded Conversation Networks: Mining Message Boards and Email Lists for Actionable Insights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Derek L. Hansen, Ben Shneiderman, and Marc Smith
47
AMT for Semantic Web and Web 2.0 A Spatio-temporal Framework for Related Topic Search in Micro-Blogging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shuangyong Song, Qiudan Li, and Nan Zheng
63
Exploiting Semantic Hierarchies for Flickr Group . . . . . . . . . . . . . . . . . . . . Dongyuan Lu and Qiudan Li
74
Understanding a Celebrity with His Salient Events . . . . . . . . . . . . . . . . . . . Shuangyong Song, Qiudan Li, and Nan Zheng
86
User Interests: Definition, Vocabulary, and Utilization in Unifying Search and Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yi Zeng, Yan Wang, Zhisheng Huang, Danica Damljanovic, Ning Zhong, and Cong Wang Ontology Matching Method for Efficient Metadata Integration . . . . . . . . . Pyung Kim, Dongmin Seo, Mikyoung Lee, Seungwoo Lee, Hanmin Jung, and Won-Kyung Sung
98
108
Data Mining, Ontology Mining and Web Reasoning Multiagent Based Large Data Clustering Scheme for Data Mining Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . T. Ravindra Babu, M. Narasimha Murty, and S.V. Subrahmanya Fractal Based Video Shot Cut/Fade Detection and Classification . . . . . . . Zeinab Zeinalpour-Tabrizi, Amir Farid Aminian-Modarres, Mahmood Fathy, and Mohammad Reza Jahed-Motlagh
116
128
Performance Evaluation of Constraints in Graph-Based Semi-supervised Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tetsuya Yoshida
138
Analysis of Research Keys as Temporal Patterns of Technical Term Usages in Bibliographical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hidenao Abe and Shusaku Tsumoto
150
Natural Language Query Processing for Life Science Knowledge . . . . . . . . Jin-Dong Kim, Yasunori Yamamoto, Atsuko Yamaguchi, Mitsuteru Nakao, Kenta Oouchida, Hong-Woo Chun, and Toshihisa Takagi
158
E-Commerce and Web Services A Semantic Web Services Discovery Algorithm Based on QoS Ontology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Baocai Yin, Huirong Yang, Pengbin Fu, and Xiaobo Chen Implementation of an Intelligent Product Recommender System in an e-Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Seyed Ali Bahrainian, Seyed Mohammad Bahrainian, Meytham Salarinasab, and Andreas Dengel Recommendation of Little Known Good Travel Destinations Using Word-of-Mouth Information on the Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kouzou Ohara, Yu Fujimoto, and Tomofumi Shiina
166
174
183
Entertainment and Social Applications of Active Media The Influence of Ubiquity on Screen-Based Interfaces . . . . . . . . . . . . . . . . . Sheila Petty and Luigi Benedicenti
191
Perception of Parameter Variations in Linear Fractal Images . . . . . . . . . . Daryl H. Hepting and Leila Latifi
200
Music Information Retrieval with Temporal Features and Timbre . . . . . . Angelina A. Tzacheva and Keith J. Bell
212
Evaluation of Active Media and AMT Based Systems Towards Microeconomic Resources Allocation in Overlay Networks . . . . . Morteza Analoui and Mohammad Hossein Rezvani
220
Investigating Perceptions of a Location-Based Annotation System . . . . . . Huynh Nhu Hop Quach, Khasfariyati Razikin, Dion Hoe-Lian Goh, Thi Nhu Quynh Kim, Tan Phat Pham, Yin-Leng Theng, Ee-Peng Lim, Chew Hung Chang, Kalyani Chatterjea, and Aixin Sun
232
Apollon13: A Training System for Emergency Situations in a Piano Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yuki Yokoyama and Kazushi Nishimoto
243
Intelligent Information Retrieval Exploring Social Annotation Tags to Enhance Information Retrieval Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zheng Ye, Xiangji Jimmy Huang, Song Jin, and Hongfei Lin
255
A Hybrid Chinese Information Retrieval Model . . . . . . . . . . . . . . . . . . . . . . Zhihan Li, Yue Xu, and Shlomo Geva
267
Term Frequency Quantization for Compressing an Inverted Index . . . . . . Lei Zheng and Ingemar J. Cox
277
Chinese Question Retrieval System Using Dependency Information . . . . . Jing Qiu, Le-Jian Liao, and Jun-Kang Hao
288
Machine Learning and Human-Centered Robotics A Novel Automatic Lip Reading Method Based on Polynomial Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Meng Li and Yiu-ming Cheung
296
An Approach for the Design of Self-conscious Agent for Robotics . . . . . . . Antonio Chella, Massimo Cossentino, Valeria Seidita, and Calogera Tona K-Means Clustering as a Speciation Mechanism within an Individual-Based Evolving Predator-Prey Ecosystem Simulation . . . . . . . Adam Aspinall and Robin Gras
306
318
Improving Reinforcement Learning Agents Using Genetic Algorithms . . . Akram Beigi, Hamid Parvin, Nasser Mozayani, and Behrouz Minaei
330
Robust and Efficient Change Detection Algorithm . . . . . . . . . . . . . . . . . . . . Fei Yu, Michael Chukwu, and Q.M. Jonathan Wu
338
Multi-Agent Systems Building Users’ Profiles from Clustering Resources in Collaborative Tagging Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maya Rupert and Salima Hassas
345
Some Optimizations in Maximal Clique Based Distributed Coalition Formation for Collaborative Multi-Agent Systems . . . . . . . . . . . . . . . . . . . . Predrag T. Toˇsi´c and Naveen K.R. Ginne
353
Multi-Modal Processing, Detection, Recognition, and Expression Analysis Enhanced Intra Coding of H.264/AVC Advanced Video Coding Standard with Adaptive Number of Modes . . . . . . . . . . . . . . . . . . . . . . . . . . Mohammed Golam Sarwer and Q.M. Jonathan Wu Extracting Protein Sub-cellular Localizations from Literature . . . . . . . . . . Hong-Woo Chun, Jin-Dong Kim, Yun-Soo Choi, and Won-Kyung Sung
361 373
Semantic Computing for Active Media and AMT Based Systems Enhancing Content-Based Image Retrieval Using Machine Learning Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qinmin Vivian Hu, Zheng Ye, and Xiangji Jimmy Huang
383
Modeling User Knowledge from Queries: Introducing a Metric for Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Frans van der Sluis and Egon L. van den Broek
395
Computer-Assisted Interviewing with Active Questionnaires . . . . . . . . . . . Seon-Ah Jang, Jae-Gun Yang, and Jae-Hak J. Bae
403
Smart Digital Media Assessing End-User Programming for a Graphics Development Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lizao Fang and Daryl H. Hepting
411
Visual Image Browsing and Exploration (Vibe): User Evaluations of Image Search Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Grant Strong, Orland Hoeber, and Minglun Gong
424
Web Based Social Networks Contextual Recommendation of Social Updates, a Tag-Based Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Adrien Joly, Pierre Maret, and Johann Daigremont Semantic Web Portal: A Platform for Better Browsing and Visualizing Semantic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ying Ding, Yuyin Sun, Bin Chen, Katy Borner, Li Ding, David Wild, Melanie Wu, Dominic DiFranzo, Alvaro Graves Fuenzalida, Daifeng Li, Stasa Milojevic, ShanShan Chen, Madhuvanthi Sankaranarayanan, and Ioan Toma NicoScene: Video Scene Search by Keywords Based on Social Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yasuyuki Tahara, Atsushi Tago, Hiroyuki Nakagawa, and Akihiko Ohsuga
436
448
461
Web Mining, Wisdom Web and Web Intelligence Social Relation Based Search Refinement : Let Your Friends Help You! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xu Ren, Yi Zeng, Yulin Qin, Ning Zhong, Zhisheng Huang, Yan Wang, and Cong Wang An Empirical Approach for Opinion Detection Using Significant Sentences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anil Kumar K.M. and Suresha
475
486
Extracting Concerns and Reports on Crimes in Blogs . . . . . . . . . . . . . . . . . Yusuke Abe, Takehito Utsuro, Yasuhide Kawada, Tomohiro Fukuhara, Noriko Kando, Masaharu Yoshioka, Hiroshi Nakagawa, Yoji Kiyota, and Masatoshi Tsuchiya
498
Automatically Extracting Web Data Records . . . . . . . . . . . . . . . . . . . . . . . . Dheerendranath Mundluru, Vijay V. Raghavan, and Zonghuan Wu
510
Web User Browse Behavior Characteristic Analysis Based on a BC Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dingrong Yuan and Shichao Zhang
522
Clustering Web Users Based on Browsing Behavior . . . . . . . . . . . . . . . . . . . Tingshao Zhu
530
Privacy Preserving in Personalized Mobile Marketing . . . . . . . . . . . . . . . . . Yuqing Sun and Guangjun Ji
538
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
547
Technology-Mediated Social Participation: Deep Science and Extreme Technology Ben Shneiderman Dept of Computer Science & Human-Computer Interaction Lab, University of Maryland, College Park, MD 20742 USA
[email protected]
Abstract. The dramatic success of social media such as Facebook, Twitter, YouTube, Flickr, blogs, and traditional discussion groups empowers individuals to become active in local and global communities. With modest redesign, these technologies can be harnessed to support national priorities such as healthcare/wellness, disaster response, community safety, energy sustainability, etc. This talk describes a research agenda for these topics that develops deep science questions and extreme technology challenges. Keywords: social media, participation, motivation, reader-to-leader framework.
1 Introduction

The remarkable expansion of social media use has produced dramatic entrepreneurial successes and high expectations for the future. Beyond these commercial successes, many observers see the potential for social transformations in economic, political, social, educational, medical, and many other domains. Understanding how to increase the motivations for participation is a deep science question that will occupy researchers for many decades. Similarly, building scalable technological foundations that are secure and reliable will challenge software designers and implementers. The goal of these deep science and extreme technologies is to provide billions of users with the capacity to share information, collaborate on ambitious projects, and organize successful governance structures, while coping with malicious attacks, providing high levels of security, and ensuring reliability.
2 Deep Science

The enduring questions of raising human motivation have taken on new importance in the age of social media. Wikipedia is a great success story because of its innovative strategies for motivating users to contribute knowledge and to collaborate with others. But even in this success story, only one in a thousand readers becomes a registered contributor, and even fewer become regular collaborators who work together over weeks and months. Similarly, while there are billions of viewers of YouTube, the number of contributors of content is small.
Fig. 1. The Reader-to-Leader Framework suggests that the typical path for social media participation moves from reading online content to making contributions, initially small edits, but growing into more substantive contributions. The user-generated content can be edits to a wiki, comments in a discussion group, ratings of movies, photos, music, animations, or videos. Collaborators work together over periods of weeks or months to make more substantial contributions, and leaders act to set policies, deal with problems, and mentor new users [1].
Motivation or persuasion is an ancient human notion, but the capacity to study it on a global scale is just becoming a reality. The move from controlled laboratory experiments to interventions in working systems is happening because designers and researchers have enabled the capture of usage patterns on a scale never before possible. The Reader-to-Leader Framework [1] (Fig. 1) provides an orderly way of discussing the strategies and conducting research. At each stage innovative entrepreneurs and researchers have developed these strategies such as showing the number of views of a video, enabling ratings of contributions, honoring richer collaborations, and empowering leaders. Many other theories and frameworks are being proposed as commercial, government, and academic researchers rapidly expand their efforts. Traditional social science theories are being adapted to understand, predict, and guide designers who seek to increase trust, empathy, responsibility, and privacy in the online world. Similarly, mathematical theories of network analysis are being enhanced to accommodate the distinctly human dynamics of online social systems. The shift from descriptive and explanatory theories that are based on statistical analyses to predictive and prescriptive theories that provide guidance for community managers is happening rapidly, but much work remains to be done.
3 Extreme Technology

The past 40 years of computing technology have produced remarkable progress. Strong credit goes to the chip developers who made the rapid and sustained strides characterized by Moore’s Law – doubling of chip density, speed, and capacity every 18 months. Equal credit goes to the user interface designers who opened the doors to billions of users by creating direct manipulation interfaces based on carefully designed menus, effective graphical interfaces, convenient input devices, and comprehensible visual presentations.
The current agenda is rapidly moving to encompass large-scale social media communities, such as the half billion users of Facebook and the four billion users of cell phones. Newer services such as Twitter have acquired more than 100 million users with billions of exchanges per month, but that is just the beginning. As individuals, organizations, companies, and governments increase their usage, the volume and pace of activity will grow, bringing benefits to many users, but so will the impacts of service outages, privacy violations, and malicious attacks. Developers now recognize the primacy of the user interface in determining outcomes, so there is increased research, training, and exploratory design. Simultaneously, there is a growth in tools to track, analyze, and intervene in social media networks so as to promote more positive outcomes. One such effort is the free and open source NodeXL Project (Network Overview for Discovery and Exploration in Excel), which was initially supported by Microsoft Research (www.codeplex.com/nodexl). This tool enables importing of social media networks from Outlook, Twitter, YouTube, Flickr, WWW, etc. into Excel 2007/2010, and then gives users powerful analysis tools, plus rich visualization support [2, 3] (Fig. 2). NodeXL was designed to speed learning by social-media savvy business professionals who already use Excel, as well as by undergraduate and graduate students who
Fig. 2. This NodeXL screenshot shows the U.S. Senate voting patterns during 2007. The 100 Senators are linked to each other by edges whose strength is related to the number of similar votes. By restricting edges to those greater than 65% similarity and using a force directed layout algorithm, the clusters of Democrats (blue nodes on lower right) and Republicans (red nodes on upper left) become visible.
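A rough sense of this kind of analysis can be conveyed in a few lines of Python using NetworkX and Matplotlib. This is only an illustrative sketch, not NodeXL itself (which runs inside Excel), and the vote-similarity values below are hypothetical:

import networkx as nx
import matplotlib.pyplot as plt

# similarity[(a, b)] = fraction of identical votes between two senators (hypothetical values)
similarity = {("Sen. A", "Sen. B"): 0.82, ("Sen. A", "Sen. C"): 0.40, ("Sen. B", "Sen. C"): 0.71}

G = nx.Graph()
for (a, b), s in similarity.items():
    if s > 0.65:                                       # keep only edges with more than 65% vote similarity
        G.add_edge(a, b, weight=s)

pos = nx.spring_layout(G, weight="weight", seed=42)    # force-directed layout
nx.draw_networkx(G, pos, node_color="lightgray")
plt.show()

With real voting data, the same thresholding and layout steps make the two party clusters of Fig. 2 emerge visually.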
are learning social network analysis. By providing easy import of data from important social media tools, NodeXL dramatically expands the community of users who can carry out analyses that lead to actionable business insights and research studies. NodeXL provides a rich set of visualization controls to select color, size, opacity, and other attributes of vertices and edges. The variety of layout algorithms and dynamic query filters allows users to tune the display to their needs. Varied centrality metrics for directed and undirected graphs, as well as a growing number of clustering algorithms, support exploration and discovery. NodeXL is an ongoing project that will be supported through the emerging Social Media Research Foundation (www.smrfoundation.org). Acknowledgments. Thanks to the NodeXL team (www.codeplex.com/nodexl), the community of U.S. National Science Foundation workshop participants (www.tmsp.umd.edu), the University of Maryland Human-Computer Interaction Lab, and Jennifer Preece.
References 1. Preece, J., Shneiderman, B.: The Reader-to-Leader Framework: Motivating technology-mediated social participation. AIS Transactions on Human-Computer Interaction 1(1), 13–32 (2009) 2. Smith, M., Shneiderman, B., Milic-Frayling, N., Mendes-Rodrigues, E., Barash, V., Dunne, C., Capone, T., Perer, A., Gleave, E.: Analyzing (social media) networks with NodeXL. In: Proc. Communities & Technologies Conference (2009) 3. Hansen, M., Shneiderman, B., Smith, M.A.: Analyzing Social Media Networks with NodeXL: Insights from a Connected World. Morgan Kaufmann Publishers, San Francisco (2010)
Active Smart u-Things and Cyber Individuals Jianhua Ma Laboratory of Multimedia Ubiquitous Smart Environment, Department of Digital Media, Faculty of Computer and Information Sciences, Hosei University, Tokyo 184-8584, Japan
[email protected]
Abstract. Due to the continuing miniaturization of chips and availability of wired/wireless communications, many kinds/forms of devices can be integrated into physical objects and ambient environments. The u-things, as opposed to pure digital e-things existing on computers/Web/Internet, are ordinary physical things with attached, embedded or blended computers, networks, and/or some other devices such as sensors, actors, e-tags and so on. Active smart u-things are ones that can, more or less, sense, compute, communicate, and may take some actions according to their goals, situated contexts, users’ needs, etc. Active smart u-things can be with different levels of intelligence from low to high, and in various intelligent forms, e.g., aware, context-aware, interactive, reactive, proactive, assistive, adaptive, automated, autonomic, sentient, perceptual, organic, life-like, cognitive, thinking, etc. Active smart u-things may cover innumerable types of physical things in the real world. They can be roughly classified into three categories, i.e., smart object, smart space and smart system, according to their appearances and functions. The grand challenge is how to enable these smart u-things to offer desired services to all people in right time, right place and right means with ubisafe guarantee. Furthermore, the essential and existence of human in cyber-physical combined spaces should be re-examined. The Cyber Individual, with a short term ‘Cyber-I’, is a real individual’s counterpart in cyberspace. Cyber-I can be seen as a comprehensive description of a real individual including one’s physical status, physiological states, psychological behaviors, personal features, social relations, history experiences, etc. Such kind of individual description and modeling is fundamental to offer personalized services to different users according to their needs and situations. Keywords: u-thing, sensor, actuator, tag, robot, smart object, space and system, ubiquitous intelligence, cyberspace, cyber individual, user modeling.
A Case for Content Distribution in Peer-to-Peer Networks Morteza Analoui and Mohammad Hossein Rezvani Department of Computer Engineering, Iran University of Science and Technology (IUST) 16846-13114, Hengam Street, Resalat Square, Narmak, Tehran, Iran {analoui,rezvani}@iust.ac.ir
Abstract. In large-scale peer-to-peer networks, it is impossible to perform a query request by visiting all peers. There are some works that try to find the location of resources probabilistically (i.e., non-deterministically). They have all used inefficient protocols for finding the probable location of the peers who manage the resources. This paper presents a more efficient protocol that is proximity-aware in the sense that it is able to cache and replicate popular queries in proportion to latency distance. The protocol dictates that the farther the resources are located from the origin of a query, the higher the probability of their replication in the caches of intermediate peers should be. We have validated the proposed distributed caching scheme by running it on a simulated peer-to-peer network using the well-known Gnutella system parameters. The simulation results show that proximity-aware distributed caching can improve the efficiency of peer-to-peer resource location services.
1 Introduction

Most of the current P2P systems, such as Gnutella, KaZaA, and Pastry [1], fall within the category of P2P "content distribution" systems. A typical P2P content distribution system creates a distributed storage medium and provides services, such as searching for and retrieving query messages, which are known as "resource location" services. The area of "content distribution systems" has a large overlap with the issue of "resource location services" in the literature. In general, there are two strands of work concerning the proximity-aware methodology. First, there are works on content distribution via constructing the P2P topology [2]. Second, there are works on resource location services [3, 4, 5]. These works assume a given topology setting, such as a mesh or a tree, for the P2P system. It has been shown in [6, 7] that finding an optimal-bandwidth topology for a P2P network is an NP-complete problem. So, we shall not try to solve the NP-complete problem of topology construction here. Instead, we will try to optimize the proximity-aware resource location problem within the given topology setting of the P2P system. In this paper, we are concerned with the design of a resource location service via a scalable proximity-aware distributed caching mechanism. We define the resource location service as "given a resource name, find, with a proximity probability, the location of peers that manage the resource." We use Round Trip Time (RTT) latency distance as the criterion for the probabilistic caching of each query. Each peer, upon
receiving a query, first searches its local cache. If the query is found, the peer returns it to the original requesting peer along the reverse path traversed by the query. Along the way, the query is cached in the memory of each intermediate node using a replication method based on the proposed proximity-aware distributed caching mechanism. The probability of replicating the resource and updating the caches in each intermediate node is proportional to the latency distance between that node and the location where the resource is found. The rest of the paper is organized as follows. We discuss related research in Section 2. Section 3 presents our proposed proximity-aware resource location mechanism. Section 4 presents the performance evaluation of the proposed mechanism. Finally, we conclude in Section 5.
2 Related Work

Significant research on proximity-aware resource location services in typical Gnutella-based unstructured P2P systems has been done in [4, 8]. Several query-search broadcasting policies for the Gnutella system have been proposed in [8], and their performance has been compared. The proximity metric in [4] is the Time to Live (TTL) of the query messages. Forwarding of queries is done with a fixed probability. When a query message reaches a peer, its TTL is decremented. Forwarding of a query message stops when its TTL reaches zero. The search technique proposed in [9] is similar to the local-indices technique proposed in [8], with a different routing policy for query messages. In the other techniques mentioned in [8], each node maintains "hints" as to which nodes contain data that answer certain queries, and routes messages via local decisions based on these hints. This idea is similar to the philosophy of hints used by Menasce et al. in [4]. Pastry [1] is an example of a system with "strong guarantees" that employs search techniques; such systems can locate an object by its global identifier within a limited number of hops. Zhao et al. [10] provide a priority-aware and consumption-guided dynamic probabilistic allocation method for a typical cache memory. The utilization of a sample of the cache memory is measured for each priority level of a computer system, and the allocation probabilities for each priority level are updated based on the measured consumption/utilization, i.e., allocation is reduced for priority levels consuming too much of the cache and increased for priority levels consuming too little of it. Another valuable work in the area of proximity caching in P2P systems is presented by Jung et al. [11]. They propose a simple caching protocol that intuitively obtains information about the physical network structure. Their caching protocol utilizes the Internet address (IP), i.e., the first 16 bits of the IP address. The metadata used in their caching protocol is exchanged using a piggy-back mechanism, and a useful IP prefix set is extracted by using an RTT threshold value. The protocol is deployed on Chord, a well-known distributed hash table-based lookup protocol. Their results show a genuine relationship between the physical and logical network structures.
3 Proximity-Aware Distributed Caching

Each pair of nodes (s, r) is associated with a latency distance lat(s, r) representing the RTT experienced by communication between them. The latency distance corresponding
to a specific pair of nodes may be measured either directly through ping messages, or estimated approximately through a virtual coordinate service. Due to space limitations, we do not explain the details of the virtual coordinate service here. Interested readers can refer to [12] for it. Every super-peer in our system has a Local Index Table (LIT) that points to locally managed resources (such as files, Web pages, processes, and devices). Each resource has a location-independent Globally Unique Identifier (GUID) that can be provided by developers of the P2P network using different means. For example, in a distributed online bookstore application, developers could use ISBNs as GUIDs [4]. Each superpeer has a directory cache (DC) that points to the presumed location of resources managed by other super-peers. An entry in the DC is a pair (id, loc) in which id is the GUID of a resource and loc is the network address of a super-peer who might store the resource locally. Each peer s has a local neighborhood N (s ) defined as the set of super-peers who have connected to it. Tables 1 and 2 provide a high-level description of the proposed proximity-aware distributed caching mechanism. The QuerySearch (QS) procedure describes the operations in which a source s is looking for a resource, namely res . The string path
s1 , ..., s m is the sequence of super-peers that
have received this message so far. This sequence is used as a reverse path to the source. The header of each query message contains a TTL field which is used to control the depth of the broadcast tree. For example, Gnutella has been implemented with a TTL parameter equal to 7. The QueryFound (QF) procedure indicates that the resource res being searched by the source s has been found at super-peer v . In this procedure, the max_latency is the latency distance between the super-peer who manages res and the farthest super-peer in the reverse path. Each super-peer, upon receiving the QS message, at first searches within its LIT. If it finds the resource in the LIT, it will return a QF message. The QF message is forwarded to the source following the reverse path which has been used by the QS message. It updates the DCs corresponding to each of the intermediate nodes as well. The contribution of our work emerges at this point where the QF message updates the LIT in each of the intermediate nodes using replication of resources based on the proposed proximityaware distributed caching mechanism. The probability of resource replication and updating the LIT corresponding to each intermediate node is proportional to the latency distance between that node and the location where the resource has been found. To this end, each intermediate node r performs the following actions with a probability that is proportional to the latency distance between itself and the node which has been found as the manager of the resource: 1) establishing a TCP connection with the super-peer who manages the resource, 2) downloading the resource object and saving it in the client c who has enough available space, and 3) updating the LIT via adding the entry ( res , c ) . If the super-peer does not find the resource in its LIT but finds it in the DC, it will send a QS message to the super-peer who is pointed to by that DC. If this super-peer no longer has the resource, the search process will be continued from that point forward. If a super-peer does not find the resource neither in its LIT nor DC, it will forward the request to each super-peer in its neighborhood with a certain probability p which is called the "broadcasting probability." This probability could vary with the length of the path that the request traverses.
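For illustration only, the per-super-peer state and the latency-proportional broadcasting decision described above could be organized as in the following Python sketch. This is our own sketch, not part of the paper, and all identifiers are assumptions:

import random

class SuperPeer:
    """Minimal model of a super-peer's local state as described above."""
    def __init__(self, address, neighbors, latency):
        self.address = address
        self.neighbors = neighbors    # N(s): addresses of neighboring super-peers
        self.latency = latency        # latency[v]: measured or estimated RTT to super-peer v
        self.lit = {}                 # LIT: GUID -> local client that stores the resource
        self.dc = {}                  # DC:  GUID -> presumed super-peer address, i.e. (id, loc) entries

    def broadcast_probability(self, v):
        """Probability of forwarding a QuerySearch to neighbor v, proportional to its latency distance."""
        max_latency = max(self.latency[n] for n in self.neighbors)
        return self.latency[v] / max_latency

    def should_forward(self, v):
        return random.random() < self.broadcast_probability(v)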
Table 1. QuerySearch message received by super-peer r

QuerySearch(source, res, (s1, …, sm), TTL)
begin
    if res ∈ LIT then
    begin
        max_latency = max{lat(r, s1), ..., lat(r, sm)}
        send QueryFound(source, res, max_latency, (s1, …, sm-1), r) to sm
    end
    else if (res, loc) ∈ DC then    /* send request to presumed location */
        send QuerySearch(source, res, (s1, …, sm, r), TTL-1) to loc
    else if (TTL > 0) then
        for vi = v1 to vm do        /* vi ∈ N(r) */
        begin
            max_latency = max{lat(r, v1), ..., lat(r, vm)}
            send QuerySearch(source, res, (s1, …, sm, r), TTL-1) to vi with probability p
                /* probability p is proportional to lat(r, vi) / max_latency */
        end for
    end if
    end if
end
Table 2. QueryFound message received by super-peer r

QueryFound(source, res, max_latency, (s1, …, sm), v)
begin
    if r ≠ source then
    begin
        add (res, v) to DC with probability proportional to lat(r, v) / max_latency do
        begin
            connect to super-peer v to get res from it
            find local client c    /* finds a local client with enough available memory */
            add (res, c) to LIT
        end
        send QueryFound(source, res, max_latency, (s1, …, sm-1), v) to sm
    end
    else    /* end of query search process */
        connect to super-peer v to get res from it
end
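As a concrete illustration of the replication rule in Table 2, the following Python sketch is our own (not the authors' implementation); the latency values are assumed, and the actual download of the object is reduced to a placeholder:

import random

def handle_query_found(res, provider, reverse_path, latency, lit, dc, max_latency):
    """Each intermediate super-peer on the reverse path caches res with a
    probability proportional to its latency distance from the provider."""
    for node in reverse_path[:-1]:                        # the last entry is the original source
        p = latency[(node, provider)] / max_latency
        if random.random() < p:
            dc.setdefault(node, {})[res] = provider       # update the node's directory cache (DC)
            lit.setdefault(node, {})[res] = "local replica"  # stand-in for the object stored at a client
    return reverse_path[-1]                               # the source then fetches res from the provider

# Toy run over a reverse path like that of Fig. 2 (assumed latencies, in milliseconds):
latency = {("S6", "S13"): 50.0, ("S2", "S13"): 180.0}
lit, dc = {}, {}
handle_query_found("res", "S13", ["S6", "S2", "S1"], latency, lit, dc, max_latency=250.0)
print(dc)

In this toy run, the node closest to the provider (S6) replicates with the lowest probability, matching the behavior illustrated in Fig. 2.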
Fig. 1 illustrates how a QS message would be propagated in the network. In the figure, the maximum number of nodes to be traversed by a QS message is defined to be 3 hops (apart from the source node itself). Similar to Gnutella, our system uses a Breadth-First-Search (BFS) mechanism in which the depth of the broadcast tree is limited by the TTL criterion. The difference is that in Gnutella every query recipient forwards the message to all of its neighbors, while in our proposal the propagation is performed probabilistically and only if the query is found in neither the LIT nor the DC of a node. In Fig. 1, the QS message originating from source S1 is probabilistically sent to super-peers S2, S3, and S4 in response to the search for the resource res. The super-peer S3 finds the resource in its LIT, but S2 and S4 do not find such an entry and hence probabilistically forward the message to the peers who are registered in their DCs. Note that the super-peer S4 does not forward the message to S10 because, for example, in this case the forwarding probability is randomly selected to be zero.
Fig. 1. Forwarding a QS message using maximum hop-count equal to 3
Fig. 2. Forwarding the QF message through the reversed path
Figure 2 illustrates an example of returning QF messages along the reverse path, from the location where the resource res is found to the node from which the query originated. The QF message is routed to the source (node S1) following the reverse path used by the QS message. The QF message updates the corresponding DC of each intermediate node based on the proposed proximity-aware distributed caching mechanism. The probability of replicating and caching the resource object in the LIT of each intermediate node is proportional to the latency distance between that node and the location where the resource is found. The closer the intermediate node is to the discovered resource, the lower the probability of caching the resource in the node’s LIT. This probability is shown graphically with partially boldfaced circles. In the sequence of nodes consisting of S1, S2, S6, and S13, the node S6 caches the address of the resource res with the least probability, whereas the node S1 caches it with the most probability. The probability of caching the resource res by S2 is larger than that of S6 and smaller than that of S1.
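As a worked illustration with assumed numbers (not values from the paper): if lat(S6, S13) = 50 ms, lat(S2, S13) = 180 ms, and lat(S1, S13) = max_latency = 250 ms, then S6 would cache res with probability 50/250 = 0.2 and S2 with probability 180/250 = 0.72, while S1, being the source of the query, obtains the resource itself in any case.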
4 Experimental Analysis

We have performed a large number of experiments to validate the effectiveness of our proximity-aware distributed caching scheme. We have evaluated the performance of the system with a file-sharing application based on several metrics. These metrics include the fraction of super-peers involved in the query search, the probability of finding an entry in the DCs, the average number of hops needed to perform the query requests, and the system load. Among these metrics, the load metric is defined as the amount of work that an entity must do per unit of time. It is measured in terms of two resource types: incoming bandwidth and outgoing bandwidth. Since the availability of the incoming and the outgoing bandwidths is often asymmetric, we have treated them as separate resources. Also, due to the heterogeneity of the system, it is useful to study the aggregate load, i.e., the sum of the loads of all the nodes in the system. All of the results are averaged over 10 runs of experiments and are reported with 95% confidence intervals. We followed the general routine devised in [13] for the efficient design of the P2P network. So, as the first step, we had to generate an instance topology based on a power-law distribution. We used the PLOD algorithm presented in [14] to generate the power-law topology for the network. The second step was calculating the expected cost of actions. Among the three "macro" actions, i.e., query, join, and update, which exist in the cost model of [13], we have restricted our attention to the query operations. Each of these actions is composed of smaller "atomic" actions for which the costs are given in [13]. In terms of bandwidth, the cost of an action is the number of bytes being transferred. We used the message sizes specified for the Gnutella protocol as defined in [13]. For example, the query messages in Gnutella include a 22-byte Gnutella header, a 2-byte field for flags, and a null-terminated query string. The total size of a query message, including Ethernet and TCP/IP headers, is therefore 82 bytes plus the query string length. Some values, such as the size of a metadata record, are not specified by the protocol but rather are functions of the type of the data being shared. To determine the number of results returned to a super-peer r, we have used the query model developed in [15], which is applicable to super-peer file-sharing systems as well. The number of files in the super-peer's index depends on the particular generated instance topology I. We have used this query model to determine the expected number of returned results, i.e., E[N_r | I]. Since the cost of the
query is a linear function of (N_r | I), and since the load is a linear function of the cost of the queries, we can use these expected values to calculate the expected load of the system [13]. In the third step, we must calculate the system load using the actions. For a given query originating from node s and terminating at node r, we can calculate the expected cost, namely C_sr. Then, we need to know the rate at which the query action occurs. The default value for the query rate is 9.26 × 10^-3, which is taken from the general statistics provided in [13]. The query requests in our experiments have been generated by a workload generator. The parameters of the workload generator can be set up to produce uniform or non-uniform distributions. Considering the cost and the rate of each query action, we can now calculate the expected load incurred by node r for the given network instance I as follows
E[M_r | I] = ∑_{s ∈ Network} E[C_sr | I] · E[F_s]        (1)
where F_s is the number of queries submitted by node s per unit of time, and E[F_s] is simply the query rate per user. Let us define Q as the set of all super-peer nodes. Then the expected load of all such nodes, namely M_Q, is defined as follows:

E[M_Q | I] = ( ∑_{n ∈ Q} E[M_n | I] ) / |Q|        (2)

Also, the aggregate load is defined as follows:

E[M | I] = ∑_{n ∈ network} E[M_n | I]        (3)
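To illustrate how Eqs. (1)–(3) combine per-pair query costs and per-user query rates, here is a small Python sketch of ours; the cost and rate numbers are hypothetical, and this is not the authors' simulation code:

def expected_loads(cost, query_rate, super_peers):
    """cost[(s, r)]: expected cost C_sr of a query from s answered at r;
    query_rate[s]: E[F_s], queries per second submitted by node s."""
    nodes = {n for pair in cost for n in pair}
    load = {r: sum(cost[(s, r)] * query_rate[s]                     # Eq. (1): E[M_r | I]
                   for s in nodes if (s, r) in cost)
            for r in nodes}
    load_super = sum(load[n] for n in super_peers) / len(super_peers)   # Eq. (2): E[M_Q | I]
    aggregate = sum(load.values())                                  # Eq. (3): E[M | I]
    return load, load_super, aggregate

# Toy example with three nodes, one of which ("A") is a super-peer:
cost = {("B", "A"): 82.0, ("C", "A"): 82.0, ("A", "A"): 0.0}
rate = {"A": 9.26e-3, "B": 9.26e-3, "C": 9.26e-3}
print(expected_loads(cost, rate, super_peers=["A"]))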
We ran the simulation over several topology instances and averaged E[ M | I ] over these trials to calculate E[ E[ M | I ]] = E[ M ] . We came up with 95% confidence intervals for E[ M | I ] . In our experiments, the network size was fixed at 10000 nodes. As mentioned before, the generated network has a power-law topology with the average out-degree of 3.1 and TTL=7. These parameters reflect Gnutella topology specifications which has been used by many researchers so far. For each pair of the super-peers ( s , r ) , the latency distance lat ( s , r ) was generated using a normal distribution with an average μ = 250 ms and a variance δ = 0.1 [12]. Then, to find the pair-wise latency estimation, namely est ( s , r ) , we ran the virtual coordinate service method over the generated topology. The Least Frequency Used (LFU) is a typical frequency-based caching policy which has been proved to be an efficient policy in the area of distributed systems [16]. In LFU, the decision to replace an object from the cache is proportional to the frequency of the references to that object. All objects in the cache maintain the reference count and the object with the smallest reference count will be replaced. The criterion for replacing an object from the cache is computed as follows Cost Object = frequency Object × recency Object
(4)
Where, frequency Objeect and recency Object denote the "access frequency" and the "elapsed time from recent access", respectively. If the cache has enough room, LFU will store the new object in itself. Otherwise, LFU selects a candidate object which has the lowest Cost Object value among all cached objects. Then, LFU will replace the
candidate object by the new object if the Cost Object of the new object is higher than that of the candidate object. Otherwise, no replacement occurs. Figure 3 shows the experimental results concerning the effect of the resource replication on the fraction of participating super-peers, namely F , and the probability of finding objects, namely Pf , versus different broadcasting probabilities. It can be seen from the figure
that Pf attains high values for much smaller values of p . By adjusting the broadcasting probability, one can tune the probability of finding the resource. In the case of using resource replication, Pf achieves larger values in comparison with the case in which the resource replication is not used. On the other hand, when we use resource replication method, F achieves smaller values in comparison with the case in which the resource replication is not used. Thus, the behavior of F is not similar to that of Pf . The reason lies in the fact that in the case of using the resource replication method, some intermediate nodes replicate the queries in their local disks (i.e., they cache the queries into their LIT); leading to a decrease in the LITs miss ratio, thus an increase in the probability of finding the queries. Such nodes do not need to propagate the QuerySearch message to other super-peers anymore.
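As a minimal sketch of the LFU replacement decision of Eq. (4) applied to these caches (our illustration, not the authors' code; how frequency and recency are measured follows the description above):

def lfu_replace(cache, new_key, new_cost):
    """cache maps object -> Cost_Object (frequency x recency, Eq. 4).
    Returns the evicted object, or None if no replacement occurs."""
    candidate = min(cache, key=cache.get)      # cached object with the lowest cost
    if new_cost > cache[candidate]:            # replace only if the new object costs more
        del cache[candidate]
        cache[new_key] = new_cost
        return candidate
    return None

# Example with assumed costs: the lowest-cost entry "x" is evicted for a costlier newcomer.
cache = {"x": 0.4, "y": 2.5, "z": 1.1}
print(lfu_replace(cache, "res42", new_cost=0.9))   # -> "x"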
Fig. 3. The effect of resource replication on the fraction of participating peers and the probability of finding objects for various broadcasting probabilities
Figure 4 shows the average number of hops required to find the resource, namely H, normalized by the total number of super-peers (excluding the original source). The figure shows the effect of the resource replication method for various broadcasting probabilities. It can be seen in both curves of Fig. 4 that the average number of hops initially increases until it reaches a maximum point and then begins to decrease. A higher broadcasting probability means that super-peers located further away from the original source are contacted, and the resource tends to be found further away from the original source. As p continues to increase, the
Fig. 4. The effect of resource replication on hop ratio for various broadcasting probabilities
growing hit ratio of the intermediate DCs allows the resource to be found in locations closer to the original source, hence causing a decrease in the value of H. It is clear from Fig. 4 that the use of resource replication reduces the number of hops needed to find the resource. For example, at a practical broadcasting probability such as 0.7, it yields a 31% improvement, with the hop ratio decreasing from 0.08 to 0.055. Figure 5 shows the effect of resource replication on the total required bandwidth of the system, i.e., the required incoming and outgoing bandwidth of the super-peers, for various broadcasting probabilities. By increasing the broadcasting probability, some additional costs are imposed on the system. The most important costs include the cost of sending queries to each super-peer, a startup cost for each super-peer as it processes the query, and the overhead of additional packet headers for individual query responses. Some of these factors are mentioned in the literature by prior researchers. Interested readers can find useful hints in [13]. The upper curve in Fig. 5 shows the required bandwidth in the absence of resource replication. In this case, as the broadcasting probability p increases, the required bandwidth of the super-peers increases and reaches 7.7 × 10^8 bps at p equal to 0.8. From this point forward, the bandwidth grows more slowly until it reaches 7.9 × 10^8 bps at p equal to 1. The lower curve in Fig. 5 shows an improvement in the required bandwidth in the presence of resource replication. In this case, the required bandwidth decreases to 6.6 × 10^8 bps at p equal to 0.8, resulting in a 14% improvement in comparison with the same point on the upper curve.
Fig. 5. The effect of resource replication on total bandwidth for various broadcasting probabilities
5 Conclusions

In this paper we have targeted the proximity-aware location service for peer-to-peer systems. The proposed protocol provides a scalable distributed caching mechanism to find the peers who manage a given resource, and it achieves an enhancement by replicating objects based on the latency distance factor, resulting in less aggregate load over the system. The simulation results showed that using the probabilistic resource discovery service in peer-to-peer systems, combined with latency-aware probabilistic resource replication, improves the overall performance of the system in terms of aggregate load, throughput, and the number of peers involved in the search process.
References
1. Rowstron, A., Druschel, P.: Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems. In: Guerraoui, R. (ed.) Middleware 2001. LNCS, vol. 2218, pp. 329–350. Springer, Heidelberg (2001)
2. Dai, L., Cao, Y., Cui, Y., Xue, Y.: On Scalability of Proximity-Aware Peer-to-Peer Streaming. Computer Communications 32(1), 144–153 (2009)
3. Menascé, D.A., Kanchanapalli, L.: Probabilistic Scalable P2P Resource Location Services. ACM Sigmetrics Performance Evaluation Rev. 30(2), 48–58 (2002)
4. Menascé, D.: Scalable P2P Search. IEEE Internet Computing 7(2) (2003)
5. Zhu, Y., Hu, Y.: Efficient, Proximity-Aware Load Balancing for DHT-Based P2P Systems. IEEE Transactions on Parallel and Distributed Systems 16(1), 349–361 (2005)
6. Zhu, Y., Li, B., Pu, K.Q.: Dynamic Multicast in Overlay Networks with Linear Capacity Constraints. IEEE Transactions on Parallel and Distributed Systems 20(7), 925–939 (2009)
7. Zhu, Y., Li, B.: Overlay Networks with Linear Capacity Constraints. IEEE Transactions on Parallel and Distributed Systems 19(2), 159–173 (2008)
8. Yang, B., Garcia-Molina, H.: Improving Search in Peer-to-Peer Networks. In: The 22nd International Conference on Distributed Computing Systems (ICDCS 2002), Vienna, Austria (2002)
9. Adamic, L., Lukose, R., Puniyani, A., Huberman, B.: Search in Power-Law Networks (2001), http://www.parc.xerox.com/istl/groups/iea/papers/plsearch/
10. Zhao, L., Newell, D., Iyer, R., Milekal, R.: Priority Aware Selective Cache Allocation. Patent (2009)
11. Jung, H., Yeom, H.Y.: Efficient Lookup Using Proximity Caching for P2P Networks. In: Proceeding of International Conference on Grid and Cooperative Computing (GCC), Wuhan, China, pp. 567–574 (2004)
12. Jesi, G.P., Montresor, A., Babaoglu, O.: Proximity-Aware Superpeer Overlay Topologies. IEEE Transactions on Network and Service Management (2007)
13. Yang, B., Garcia-Molina, H.: Designing a Super-Peer Network. In: Proc. Int'l Conf. Data Eng. (ICDE), pp. 49–63 (2003)
14. Palmer, C., Steffan, J.: Generating network topologies that obey power laws. In: GLOBECOM (2000)
15. Yang, B., Garcia-Molina, H.: Comparing Hybrid Peer-to-Peer Systems. In: Proc. 27th Int. Conf. on Very Large Data Bases, Rome (2001)
16. Song, J.W., Park, K.S., Yang, S.B.: An Effective Cooperative Cache Replacement Policy for Mobile P2P Environments. In: Proceeding of IEEE International Conference on Hybrid Information Technology (ICHIT 2006), Korea, vol. 2, pp. 24–30 (2006)
Interactive Visualization System for DES

Mohamed S. Asseisah, Hatem M. Bahig, and Sameh S. Daoud

Computer Science Division, Department of Mathematics, Faculty of Science, Ain Shams University, Cairo, Egypt
[email protected]
Abstract. The Data Encryption Standard (DES) is a secret-key encryption scheme adopted as a standard in the USA in 1977. Most cryptographic courses and textbooks cover DES and its variations. Interaction and visualization are key factors supporting the learning process. We present a dynamic, interactive educational system that visualizes DES and its variations. The aim of the system is to facilitate teaching and learning DES and its variants for undergraduate and postgraduate students. The system has been used at Ain Shams University, Faculty of Science, in the course “Cryptography”. The analysis of the data emerging from the evaluation study of our system has shown that students found the system attractive and easy to use. On the whole, student interactions within the system helped them become aware of DES, conceptualize it, overcome learning difficulties, and correct themselves.
1 Introduction

The exponential growth of information that characterizes the modern age makes the need for learning more important than ever. But the sheer volume of what we have to learn, and the speed at which we must learn it, can be daunting. Meeting this challenge requires new thinking about how we acquire knowledge and skills, and how we deploy learning resources that can keep up with the growth of knowledge. Interaction and visualization are key factors supporting the learning process. They support learning while doing tasks [7]. Interaction is a vital part of the learning process, and the level of interaction has an impact on the quality of the learning experience. Instructional designers should make learners active participants, not passive spectators, in the process. Interaction shifts the instructional focus from the facilitator and materials to the learner, who must actively engage with peers, materials, and the instructor. A review of the literature reveals other reasons for using interaction: higher levels of interaction have been shown to be associated with improved achievement and positive learning. E-learning [1] is learning or training that is delivered by electronic technology. Visualization is the most cutting-edge e-learning technique. It offers a radical departure from the highly criticized page-turners, drill-and-practice programs, and online workbooks. Visualization software promises to engage learners by making them active participants in real-world problem solving and allowing them to engage in role plays, providing a safe environment for exploration. These promises have captured the attention of the
instructional designers and their clients. Visualization must be evaluated in the context of problem-based design. Using technology-based visualization, learners have the opportunity to experiment and to try a variety of strategies in ways that are often not practical or financially feasible in traditional classroom-based simulations. Most educational systems have been designed to simplify understanding of the main ideas of particular problems or of overall course materials. There are few visualization systems or applets [2, 3, 4, 5, 6, 9, 10, 11, 12, 14] that help students understand cryptographic protocols. To the best of our knowledge, the systems in [2, 5, 9, 10, 14] do not include both DES and AES. GRACE [4] includes a simple visualization of how to use DES, one variant of DES, and AES, but it does not include any details of DES and AES. The applets in [11] give a simple visualization of DES and its triply-encrypted variant. Cryptool [6] visualizes DES and AES with a fixed message and key; it does not allow the learner to input the message and the key, nor does it include any variation of DES. This paper introduces a highly interactive educational system for DES, one of the two most important symmetric-key encryption schemes. The main features of the system are:
1. Dynamic: the user can input the plaintext/ciphertext and the key in different formats.
2. Step-by-step: there are controls for stepping the process backward and forward, and for restarting the whole process at any time.
3. Tracing: it allows the learner to see in detail what happens in the encryption and decryption processes.
4. Animation: it has interesting animations and graphics.
5. Standalone descriptive text and voice: each step has accompanying text and voice to help the learner understand each step of encryption/decryption.
6. Easy to use: the system is easy to use.
The programming language used to develop the system is C# 2.0. In this paper we present the visualization of DES; the other part, the variants of DES, is similar. The paper is organized as follows. Section 2 contains a brief description of DES. Section 3 presents the visualization of DES in some detail. Section 4 shows the outcomes of the system being tested by students. Finally, conclusions and future work are presented in Section 5.
2 DES

The Data Encryption Standard [8, 13] operates on blocks of 64 bits using a secret key that is 56 bits long. The original proposal used a secret key that was 64 bits long; it is widely believed that the removal of these 8 bits from the key was done to make it possible for U.S. government agencies to secretly crack messages. Encryption of a block of the message takes place in 16 stages or rounds. From the input key, sixteen 48-bit keys are generated, one for each round. The block of the message is divided into two halves. The right half is expanded from 32 to 48 bits using a fixed table. The result is combined with the subkey for that round using the XOR operation. Using the S-boxes, the resulting 48 bits are then transformed back to 32 bits, which are subsequently permuted using yet another fixed table. This
thoroughly shuffled right half is then combined with the left half using the XOR operation. In the next round, this combination is used as the new left half. In real-life applications, the message to be enciphered is of variable size and normally much larger than 64 bits. Five modes of operation have been devised to encipher messages of any size: electronic codebook, cipher block chaining, cipher feedback, output feedback, and counter mode [8].
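To make the round structure just described concrete, the following is a minimal Python sketch of a generic Feistel round. The round function used here is a placeholder, not the real DES f with its standard expansion, S-box, and permutation tables, and the half-swapping convention shown is one common way of expressing the rounds.

    # Minimal sketch of the Feistel round structure described above.
    # toy_f is a placeholder round function, NOT the real DES f
    # (which expands 32 to 48 bits, XORs with the subkey, applies the
    # eight S-boxes, and then a fixed permutation).

    def feistel_round(left, right, subkey, f):
        """One round: the old right half becomes the new left half,
        and the new right half is left XOR f(right, subkey)."""
        return right, left ^ f(right, subkey)

    def toy_f(half, subkey):
        return (half ^ subkey) & 0xFFFFFFFF  # placeholder round function

    def encrypt_block(block64, subkeys, f=toy_f):
        left, right = block64 >> 32, block64 & 0xFFFFFFFF
        for k in subkeys:              # sixteen rounds in DES
            left, right = feistel_round(left, right, k, f)
        # the last round omits the swap, so undo it when reassembling
        return (right << 32) | left

    print(hex(encrypt_block(0x0123456789ABCDEF, list(range(16)))))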
3 Visualizing DES

We have two models: an encryption model and a decryption model. In the encryption model, the system takes plaintext and a key as inputs and returns the ciphertext. After the system launches, the user is presented with three textboxes. The first textbox (on the left) is for inputting the plaintext. The second textbox (in the middle) is for inputting the key. The third textbox (on the right) displays the output of the cipher. There are also some disabled buttons, such as the encryption button, see Fig. 1. In order to enable these buttons, the learner should input plaintext and a key into their respective textboxes. The key can be entered in two formats: either a binary representation of exactly 64 bits, or text with a length of at least eight characters.
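To illustrate the two key formats, the following is a hypothetical sketch of how such input could be validated and turned into a 64-bit key; the paper does not describe the system's exact validation rules, so this logic is an assumption.

    # Hypothetical validation of the two key-input formats described above:
    # a 64-character binary string, or text of at least eight characters
    # (whose first eight bytes supply the 64-bit key).
    def parse_key(user_input):
        s = user_input.strip()
        if len(s) == 64 and set(s) <= {"0", "1"}:
            return int(s, 2)                                      # binary format
        if len(s) >= 8:
            return int.from_bytes(s.encode("utf-8")[:8], "big")   # text format
        raise ValueError("key must be 64 binary digits or at least 8 characters")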
Fig. 1. DES interface before the encryption in details
After this preparation, the learner can perform the encryption in one of two ways:
1. Encrypt the whole plaintext at once: this is the usual way. By clicking on the "Encrypt" button, the learner gets the resulting ciphertext, see Fig. 1.
2. Encrypt the plaintext step by step: in this way, the learner can move through the plaintext's blocks (each block contains 8 characters) and stop on the desired block, which is highlighted in yellow. By clicking on the "Begin" button, the learner can step through the encryption process in full detail, i.e., trace the DES algorithm, see Fig. 1.
The main feature of our visualization system is that it traces the encryption/decryption process in a dynamic, interactive and step-by-step way. The DES cipher consists of sixteen rounds. The first fifteen rounds consist of three main functions, f, XOR, and Swap, while the last round consists only of f and XOR. In addition, there are two permutations, one before the first round and another after the last round. The learner can use the “MainPanel”, see Fig. 2, to navigate through these functions and learn in detail about each step of the encryption. In MainPanel, gray color indicates that a label is clickable. Every button in the MainPanel interface takes an input and then passes it into the animated region. In order to move to the next or previous round, the user can click on the forward or backward button, respectively. These buttons are beneath MainPanel, see Fig. 2. We now explain each function in detail.
Fig. 2. MainPanel interface
3.1 f Label

This label triggers the most important interface. When a learner clicks on it, a new panel (named the f method panel) expands along the left side of the window (beside MainPanel). The f method panel consists of two subpanels. The first one contains four clickable labels: the Expansion P-Box method, the XOR method, the S-Boxes method, and the Straight P-Box method. The second subpanel exhibits the simulation of the current method.
• Expansion P-Box Label: it expands the binary input using a permutation called the P-Box permutation, see Fig. 3.
• XOR Label: the output of the previous label is added to the round’s key using the XOR operation, as we can see in Fig. 4.
• S-Boxes Label: the S-Boxes perform an important function in the encryption, so the learner must grasp them very well; for that reason we have designed a dedicated interface to simulate the process. The interface comprises eight S-Boxes, as we can see in Fig. 5. The output of the XOR label (48 bits) is divided into 8 binary strings of length 6, so each S-Box gets 6 bits as input. Each S-Box’s simulation runs as follows.
Fig. 3. f method interface
Fig. 4. XOR interface
i. Get the equivalent decimal value, say r, of the two bits on the sides of the 6-bit string.
ii. According to the value of r, a blue arrow points to row number r of the S-Box, see Fig. 5.
iii. Get the equivalent decimal value, say v, of the four bits in the middle of the 6-bit string.
iv. According to the value of v, a blue arrow points to column number v of the S-Box, see Fig. 5.
v. The two arrows intersect at the resulting value, which is highlighted in blue.
vi. The resulting value is then converted to 4-bit format, which is the final output of the S-Box.
Finally, we get a 32-bit string as the result of all the S-Boxes.
• Straight P-Box Label: a simple permutation is performed on the input, i.e., the output of the S-Boxes. This permutation is called the Straight P-Box permutation, see Fig. 6.
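The row/column lookup in steps i–iv can be sketched as follows; the S-box table used here is a hypothetical placeholder, not one of the eight real DES S-boxes.

    # Sketch of the S-box lookup described in steps i-vi. SBOX is a
    # placeholder 4x16 table; real DES defines eight specific tables.
    SBOX = [[(r * 16 + c) % 16 for c in range(16)] for r in range(4)]

    def sbox_lookup(six_bits, sbox=SBOX):
        """six_bits: an integer 0..63. The two outer bits select the row,
        the four middle bits select the column; the entry is 4 bits."""
        row = ((six_bits >> 5) & 1) << 1 | (six_bits & 1)
        col = (six_bits >> 1) & 0b1111
        return sbox[row][col]

    # Example: 0b101101 -> row = 0b11 = 3, column = 0b0110 = 6
    print(sbox_lookup(0b101101))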
Fig. 5. S-Boxes interface
Fig. 6. Straight P-Box interface
3.2 Main XOR Label

The main XOR operation is applied to the output of the function f and the left half of the current round’s input. The visualization of this operation is similar to Fig. 4.

3.3 Swap Region

It swaps the output of the previous Main XOR label with the right half of the round’s input. Thus, the left becomes the right and vice versa. After that, the two halves are passed to the next round as its input, see Fig. 2.

3.4 Initial and Final Permutation Labels

Each of the two permutations takes 64 bits and permutes them according to a predefined table. The initial permutation is executed before the first round, whereas the final permutation is executed after the last round. The visualization of the initial and final permutations is similar to Fig. 6.
3.5 The Key Generation Process

The system also visualizes how to generate a key (called a sub-key) of length 48 bits for each round of encryption from a binary string input of length 64 bits (called the master key). Each sub-key is used in its corresponding round in the MainPanel, specifically in the XOR method included in the f method. As we can see in Fig. 7, the main operations in each round of the key generation are the split operation, the left-shift operation, and the compression P-Box operation. Before the first round, the input passes through the Parity Drop permutation, which shrinks the input string (presented in binary format) from 64 bits to 56 bits. By clicking on the Parity Drop label, we can see the visualization interface of the Parity Drop operation.
Fig. 7. Key Generation interface
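The split / left-shift / compression structure shown in Fig. 7 can be sketched as follows. The shift schedule is the standard DES one, but the compression step here simply truncates to 48 bits as a placeholder instead of applying the real PC-2 table.

    # Sketch of the per-round key-schedule structure described above:
    # split the 56-bit key, rotate both halves left, then compress to 48 bits.
    SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]  # standard DES shifts

    def rotate_left(bits, n):
        return bits[n:] + bits[:n]

    def subkeys(key56):
        """key56: a list of 56 bits, i.e. the master key after Parity Drop."""
        c, d = key56[:28], key56[28:]            # split into two 28-bit halves
        keys = []
        for s in SHIFTS:                         # one sub-key per round
            c, d = rotate_left(c, s), rotate_left(d, s)
            keys.append((c + d)[:48])            # placeholder for PC-2 compression
        return keys

    ks = subkeys([0, 1] * 28)
    print(len(ks), len(ks[0]))   # -> 16 48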
4 Evaluation of the System

The system has been used for teaching cryptography to 25 undergraduate (fourth-year) and pre-master students in the Computer Science Division, Department of Mathematics, Faculty of Science, Ain Shams University. The evaluation of the system was conducted in two steps: the first was a questionnaire, and the second was an oral exam. The main function of the questionnaire was to capture the students’ impression of the system. The results of each question, in terms of percentage response, are shown in Table 1. The oral exam serves as evidence supporting the results concluded from the questionnaire. In fact, the second author conducted a face-to-face interview with each student, during which the student was asked some questions. From the students’ answers, we verified that the answers in the questionnaire are credible.
Table 1. Percentages of students response to the system

Question                                     Strongly agree   Agree   Neutral   Disagree   Strongly disagree
Enhances student's learning                  80               15      5         -          -
Increases effectiveness of the course        87               13      -         -          -
Has interesting animations and graphics      13               69      17        1          -
Has a suitable concept of help guides        11               32      50        7          -
Has an easy navigation system                20               76      4         -          -
5 Conclusions and Future Work

In this paper, an interactive, step-by-step visualization system has been presented to support understanding of DES. Visualizing the variants of DES is similar. This system can be extended to the Advanced Encryption Standard (AES), the winner of the contest held by the US Government starting in 1997, after DES was found too weak because of its small key size and technological advances in processor power. We also intend to extend the system to support cryptanalysis of DES and AES.
References
1. Armstrong, M.: A handbook of human resource management practice (2003)
2. Asseisah, M., Bahig, H.: Visual Exploration of Classical Encryption on the Web. In: The Ninth IASTED International Conference on Web-based Education, March 15-17 (2010)
3. Bishop, D.: Introduction to Cryptography with Java Applets. Jones and Bartlett Publishers, USA (2003)
4. Cattaneo, G., De Santis, A., Ferraro Petrillo, U.: Visualization of cryptographic protocols with GRACE. Journal of Visual Languages and Computing 19, 258–290 (2008)
5. Cryptography demos, http://nsfsecurity.pr.erau.edu/crypto/index.html
6. Cryptool, http://www.cryptool.org
7. Hsi, S., Soloway, E.: Learner-Centered Design: Addressing, Finally, the Unique Needs of Learners. Proceedings of Computer Human Interaction 98, 211–212 (1998)
8. Menezes, A., Van Oorschot, P., Vanstone, S.: Handbook of Applied Cryptography, 2nd edn. CRC Press, Boca Raton (2001)
9. Protoviz, a simple protocol visualization, http://www.cs.chalmers.se/_elm/courses/security
10. RSA demo applet, http://cisnet.baruch.cuny.edu/holowczak/classes/9444/rsademo
11. Schweitzer, D., Baird, L.: The design and use of interactive visualization applets for teaching ciphers. In: Proceedings of the 7th IEEE Workshop on Information Assurance. US Military Academy, West Point (2006)
12. Spillman, R.: A software tool for teaching classical cryptology. In: Proceedings of the 6th National Colloquium on Information System Security Education, Redmond, Washington, USA
13. Stinson, D.: Cryptography: Theory and Practice. CRC Press, Boca Raton (2004)
14. Zaitseva, J.: TECP: Tutorial Environment for Cryptographic Protocols. Master’s thesis, Institute of Computer Science, University of Tartu (2003)
Intelligent Implicit Interface for Wearable Items Suggestion

Khan Aasim1, Aslam Muhammad1, and A.M. Martinez-Enriquez2

1 Department of CS & E., U.E.T., Lahore, Pakistan
[email protected], [email protected]
2 Department of Computer Science, CINVESTAV-IPN, Mexico
[email protected]
Abstract. In daily life, people frequently perform computer-aided physical activities, explicitly shifting from the real world to the virtual world and back. In this shift, and in order to get recommendations, people are asked for personal information. Normally, people do not know how to organize their ideas to answer an automatic inquiry, or they are reluctant to disclose their particulars. These issues limit the use of computers for obtaining suggestions about wearable items. We tackle the problem by developing an intelligent interface that helps customers choose items such as dresses, shoes, and hair styles. The system is based on the concept of implicit Human Computer Interaction and on Artificial Intelligence techniques, particularly knowledge-based systems. The developed system gathers customer information such as height, weight, waist, and skin color in order to facilitate the selection of daily-life commodities. The performance of the system is very encouraging when applied to suggesting dresses for a businessman, shoes for professional players, and party hair styles for men. Keywords: Implicit HCI, Knowledge based systems, Ubiquitous computing.
1 Introduction

In order to escape from stress or unpleasant situations, humans perform certain physical activities and increasingly seek aid from computer resources to lead better and more comfortable lives. Nevertheless, there is a gap between the real world and the computer environment. The main reason is that, to receive automatic assistance from an activity recommender system, people have to shift explicitly from the physical to the virtual computer environment. This shifting not only slows down human performance but also considerably decreases the actual use of computers in performing physical activities. Nowadays, computer applications offer only a limited Human Computer Interface (HCI) that is unaware of the physical environment. Normally, to get computer assistance, users are required to provide information by explicit input. The large gap between the physical and virtual environments increases the workload on people performing computer-assisted activities. These issues touch different research areas such as context-aware computing [3], tangible interaction [4], and multi-modal interaction [2]. Moreover, in our society, not many people are comfortable using
computers; they might hesitate to give confidential information or be put off by tedious repetition. In order to address this problem, we propose an implicit HCI interface that gathers a customer’s information seamlessly and then suggests or advises him regarding his getup. The rest of the paper is organized as follows: related work is presented in Section 2; Section 3 describes the proposed system in detail; a case study is described in Section 4; and in Section 5, conclusions and some future perspectives are given.
2 Related Work

When we read an article from the British Broadcasting Corporation (http://www.bbc.com) on adjustable undergarments, we learned that people normally feel uncomfortable when their undergarments are not well fitted. To overcome this, an “adjustable undergarments” concept is presented, in which a microcontroller adjusts the wearable item to attain customer comfort. Chen et al. [1] introduced a shoe-integrated system for human gait detection: normal gait, toe in, toe out, over-supination, and heel-walking gait. They introduce an inertial measurement unit (IMU) consisting of three-dimensional gyroscopes; an accelerometer is used to measure the angular velocity and acceleration of the foot. In their work on wearable computers and sensor interfaces, Xu et al. [9] argue that the major benefit of wearable intelligent devices is the close proximity they maintain with users. They therefore propose an intelligent shoe system consisting of a microcontroller, a suite of sensors for acquiring physiological, motion, and force information, and a wireless transmitter-receiver set. The data gathered from this intelligent shoe-integrated platform is used for real-time health and gait monitoring, real-time motion (activity) identification, and real-time user localization. New sensing clothes [7] allow the simultaneous recording of physiological signals by integrating sensors into fabrics; the sensing devices gather user information for health monitoring, providing direct feedback to the user, raising awareness, and allowing better control of the user’s condition. With sensor technology ranging from temperature sensors to complex acceleration sensors now available, Witt et al. [5], for instance, present the concept of comfortable wearable clothes that are worn continually to support users during their work. Touch interaction on a mobile phone is a natural form of HCI, but it suffers from the occlusion problem: a “big finger” may hide a high percentage of the information presented on a small screen [6]. To the best of our knowledge, no existing system helps customers select wearable items in this way.
3 A Wearable Items Recommender System

The developed infrastructure is composed of hardware such as screens and pointing devices, functionalities satisfying requirements such as a friendly interface, control mechanisms for handling information requests from the interface,
and the relationships among several tools and the work process. Our system consists of five subsystems (see Fig. 1):

3.1 Information Gathering (IG)

This subsystem obtains user information for the first time, before the user passes through the magical room. There are different ways to accomplish this step: 1) an operator inquires about and records the user's preferences in order to promote implicit HCI; 2) a dedicated information-gathering functionality in which the user answers questions related to his preferences, interacting through a speaker and microphone; 3) multi-mode information gathering (manual plus implicit HCI). The automatic mode is intended for technical users who have IT knowledge and Internet access. Instead of inquiring the user, his information is retrieved from the Internet from each available source, e.g., when the user has a web site, a Facebook account, or any public profile; this information can be gathered based on tags.
Fig. 1. System Information Flow
IG saves user attributes such as gender, age, address, Facebook name, height, and others. The website or Facebook username can be used to implicitly capture the user’s data and fields such as favorite celebrity and profession. Some of these attributes are asked for explicitly, and others, such as the user’s skin color and height, are captured implicitly by the system. Implicitly captured information comes from the camera and is updated by the “Information Refinement & Profile Update” component.

3.2 Event Category Selection (ECS)

Users select the event category for which a getup is to be chosen. The category can be business meeting, wedding dress, sportswear, casual, etc., as shown in Figure 2. For each category a button appears, aligned horizontally. The user selects an event category so that the system can show him information matching the occasion when he reaches the Magical Selection. In order to go to the magical room, the user has to pass through the door button. By default the door is closed; it opens when the user presses the button, sliding horizontally like an automatic door. The category the user has selected is shown as the default category while he is in the magical room.
Fig. 2. Category Selection
3.3 Information Refinement and Profile Update (IR&PU)

This is one of the implicit processes of the system: user information is gathered without physically involving the user in feeding the data that updates his profile. When the user walks toward the category selection door, a revolving camera captures his information. The camera is mounted on a rail so that it can move back and forth, and on a circular mount that lets it revolve 360 degrees. While the user is walking towards the door, the camera revolves around him and gathers his facial characteristics. This information is used for getup selection in the magical room. This step is important because the system must work with the user’s latest information. The system uses the Trace transform [8], a generalization of the Radon transform, which helps to recognize objects under transformations.

3.4 Magical Selection System (MS)

MS is composed of the following modules:

a) Gait detection
Walking styles are deduced by our walking device, since different walking styles exist depending on the kind of work performed, the place, status, a particular freestyle walk, etc. While walking, people place their feet at different angles. In addition, they have different physical characteristics, such as large or flat feet. The walking device contains pressure sensors laid on a metallic surface. The customer walks on this device barefoot, so that the force of different parts of his feet reaches the sensors, which detect and record the user’s gait information. The walking device has different lights beneath it: green when nobody is walking on the device, blue when someone is walking on it and the system is receiving input without any error, and red when someone is walking but the system is not receiving the input correctly. The red light tells the person to walk on the device again, and a voice message saying “Repeat again” is also sent through the speakers.

b) Camera system
Although the system has already gathered the user’s facial information during the Information Refinement step, an optional camera is also placed in the magical room; it can rotate 360 degrees. This camera is used when the user is not satisfied with the quality of the earlier pictures or when he wants to view his getup in a new style. Placing this system in the magical room/cabin means the user does not need to go back, and new photographs can be captured for information refinement.
c) Screen
The other hardware used is a computer screen on which information is displayed. Its main advantage is that it is a touch screen, which enables users to interact with the device more naturally; however, occlusion of information can sometimes occur, which is a disadvantage. In order to provide usable interaction with touch systems, we introduce the concept of a “Gripable Needle”, similar to a pen but with the feature that it can grip the user’s finger. When the user puts his finger inside the needle and the finger reaches a certain depth, touching the inner side of the needle, two grips move and hold the finger. Thus, the Gripable Needle sticks to the user’s finger and acts as a single unit, so the user no longer needs three fingers to hold the needle and can easily interact with touch systems. Moreover, needles of different sizes are placed at the side of the touch screen, so the user can choose a needle according to the thickness of his finger. Another interesting aspect of the computer screen is that the user’s appearance is shown on it; thus, he can provide feedback, and all controls are placed there as well. In addition, a big screen can be used on which these filters are applied, showing the user’s view. In this way, users view themselves live while selecting particular items; the user does not need to go to a mirror or ask someone to comment on whether he looks all right. He can view himself on the big screen and decide what to do next.

d) Dedicated Functionality
The dedicated functionality works on top of a knowledge base consisting of rules written in first-order logic. The user information constitutes the premise part of a rule, and the system's suggestion/recommendation forms the action part.

3.5 Order Placement System (OP)

When the user presses the “Place Order” button, either on the touch screen or on the keyboard, this process is launched. First, the system gathers all the items that the user has added to the cart during this trip. When all selected items are available, the bill is generated. When an item is not available, a search is triggered within the store or in other outlets. When an item will only be available at a later time, the system generates the bill, marks the booking time, and records the user’s address for delivery.
4 Case Study

As a case study, let us consider a businessman who comes to get a getup for a meeting. The customer has gone through all the previous steps and is now going to use the current software process. We consider the dress to be the most important article, so dresses are shown first by default; this sequence can also be changed. Note that the filters are already populated with specific values: gender, age, and height are populated from the businessman’s profile information, and the category drop-down is populated from the category he selected during the event category selection step. If the businessman changes his mind, he can also select items from a different category, which provides flexibility since users can shop for items from multiple categories at the same time. Item, stuff, and brand information is populated based on the user’s history.
Fig. 3. Dress Selection
For instance, if the user has most often selected suits and cotton stuff, then these values are automatically populated. However, the user can change this information at any time to see different items and stuffs. The screen also contains a navigation pane with several types of links. The businessman can organize the views by clicking on the appropriate link: newly arrived stuff; a particular style such as plain, check, or lining; most ordered; most viewed; or preferred color. When he has selected a color, the system only displays items in the selected color. Let us consider that the businessman has selected a particular brand, cotton as the stuff, and newly arrived as the category (see Figure 3). The customer can try different suits and view himself on the big screen. After viewing different suits, he can select a particular color; suits in that color are then shown to him. He can add a selected suit to the compare cart by clicking the “Compare” button. Now suppose the user changes the color option and wants to view suits of the same variety in a different color. After viewing a couple of suits, he settles on a particular one and wants to compare it with the previously selected suit. He can click the “Comparison Mode” option: all the selected suits are shown, and the user can select all or some of them for comparison. He can keep them in the comparison cart or delete them from it to narrow down the selection. If the user has settled on a particular suit, he can click the buy button to add it to the final cart. When the user wants a recommendation from the system, he clicks on the “suggest suit” link. Our system uses a knowledge-based system to recommend suitable clothing according to the customer's selected social event. An example of a rule is as follows (an illustrative sketch of evaluating such a rule is given after this paragraph):

StartRule "Winter Dress Suit Suggestion"
If
    nature(x) = customer                      /* 'x' is a customer */
    color(x) = "fair"                         /* customer color is fair */
    favoriteColor(x) = "black"                /* customer favorite is black */
    newFashionColor(selected_item) = "brown"
Then
    Suggest(x) <== display(color(selected_item(brown)))
    Suggest(x) <== display(color(selected_item(dark)))
EndRule

After finalizing the selection of the dress, the businessman can click the shoes option in the left navigation pane. As the user selected the business meeting category earlier, business meeting stuff is available by default, as shown in Figure 4.
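Returning to the rule shown above, the following is a small hypothetical Python sketch of how such a rule could be evaluated against a customer profile. The attribute names mirror the rule, but the data structures and function are our own illustration, not the system's actual implementation.

    # Hypothetical evaluation of the "Winter Dress Suit Suggestion" rule:
    # the customer attributes form the premise part, and the colors to
    # display form the action part.
    def winter_dress_rule(customer, selected_item):
        premises = (
            customer.get("nature") == "customer"
            and customer.get("color") == "fair"
            and customer.get("favoriteColor") == "black"
            and selected_item.get("newFashionColor") == "brown"
        )
        return ["brown", "dark"] if premises else []

    customer = {"nature": "customer", "color": "fair", "favoriteColor": "black"}
    item = {"newFashionColor": "brown"}
    print(winter_dress_rule(customer, item))   # -> ['brown', 'dark']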
The user can now navigate between different shoes, using either the touch screen or keyboard input, to select the shoes of his choice. The businessman can select different options from the filter, e.g., change the type of item or the type of stuff wanted. Suppose the user has selected a particular shoe and clicks “Sole Selection Mode”; the sole selection interface is shown. Since the system has already gathered the gait information, it suggests a sole. Both the suggested sole and the selected shoe are shown to the user. Moreover, the final shape of the shoe with the suggested sole is displayed so that the user can get an idea of the final shape of the shoes. Once the dress and shoes are selected, the user can choose the hair style option. As the system has already captured the user’s images implicitly, his own hairstyle is shown by default. The user can then navigate between different hair styles (see Figure 5) to select the hairstyle of his choice using either the touch screen or the keyboard. When the customer has finished, he clicks the “Place Order” button to place the order. The system checks the selected items in the cart and whether they are available. If they are available, the system generates the bill. The user is notified of items that are not available, along with the probable delivery time and the option to have them sent to his registered address. The user can pay the bill either in cash or by credit card.
Fig. 4. Shoes Selection
Fig. 5. Hair Style Selection
5 Conclusions and Future Work

Due to their explicit interfaces, many systems become a bottleneck: processes are slow, and not all people are familiar with electronic devices and computer interfaces. We therefore designed and implemented a system with implicit HCI that helps customers select wearable items and styles in a comfortable way. In our system, implicit HCI was achieved by including cameras to collect
facial and hair-style information, and pressure sensors to recover the customer’s gait and feet characteristics. In addition, a knowledge-based system is integrated within our infrastructure, with rules dedicated to recommending suitable clothing according to social events, the customer’s characteristics, and current fashion styles. The magical room has multiple sensors and cameras, and the customer behaves naturally there, so the gathered information can be trusted. Encouraged by the benefits of an implicit interface, we continue working to overcome certain limitations of our system.
References
1. Chen, M., Huang, B., Xu, Y.: Intelligent shoes for abnormal gait detection. In: Proc. of IEEE Int. Conf. on Robotics and Automation (ICRA 2008), CF, USA, pp. 2019–2024 (2008)
2. Coutaz, J.: Multimedia and Multimodal User Interfaces: A Taxonomy for Software Engineering Research Issues. In: Proc. of the Second East-West HCI Conf., St Petersburg, pp. 229–240 (1992)
3. Dey, A.K., Salber, D., Abowd, G.D.: A Conceptual framework and a toolkit for supporting the rapid prototyping of context-aware applications. Human-Computer Interaction Journal 16(2-4), 97–166 (2001)
4. Ishii, H., Ullmer, B.: Tangible Bits: Towards Seamless Interfaces between People, Bits and Atoms. In: Proc. of the SIGCHI Conf. on Human Factors in Computing Systems (CHI 1997), pp. 234–241. ACM Press, New York (1997)
5. Witt, H., Nicolai, T., Kenn, H.: The WUI-Toolkit: A Model-Driven UI Development Framework for Wearable User Interfaces. In: Proc. of 27th Int. Conf. on Distributed Computing Systems Workshops, ICDCSW 2007 (2007)
6. Jenabi, M., Reiterer, H.: Finger Interaction with Mobile Phone. Human-Computer Interaction Group, University of Konstanz, D-78464 Konstanz, Germany
7. Paradiso, R., Loriga, G., Taccini, N.: A Wearable Health Care System Based on Knitted Integrated Sensors. IEEE Transactions on Information Technology in Biomedicine 9(3) (September 2005)
8. Srisuk, S., Petrou, M., Kurutach, W., Kadyrov, A.: Face Authentication using the Trace Transform. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, USA, June 16-22, pp. 305–312 (2003)
9. Xu, Y., Li, W.J., Lee, K.K.: Intelligent Wearable Interfaces. Wiley-Interscience, USA (January 2, 2008)
Folksonomy-Based Ontological User Interest Profile Modeling and Its Application in Personalized Search

Xiaogang Han1, Zhiqi Shen2, Chunyan Miao1, and Xudong Luo1

1 School of Computer Engineering, Nanyang Technological University, Nanyang Ave, Singapore 639798
{hanx0009,ascymiao,xdluo}@ntu.edu.sg
2 School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang Ave, Singapore 639798
[email protected]
Abstract. Information overload on the Internet is becoming more and more insufferable. The accurate representation of user interests is crucial to a successful information filtering system, the kind of system used to address information overload. To model users’ interests more effectively, this paper investigates how to collect user tags from a folksonomy and map them onto an existing domain ontology. An experiment that integrates our user interest profile model into a Web search engine shows that our approach can accurately capture a user’s multiple interests at the semantic level, and thus the personalized search performance is significantly improved compared with state-of-the-art approaches. Keywords: User Modeling, User Interest Profile, Folksonomy, Ontology.
1 Introduction
The amount of information resources on the Web nowadays is so enormous that the retrieval of relevant information is becoming more and more difficult. To address this challenge, various personalization, recommendation, and information filtering techniques have been developed. The main idea behind these techniques is to adapt relevant information to users according to their short- and long-term interests [7]. With the popularity of Web 2.0, users not only consume content from the Internet but also contribute content to it. Thus, social tagging systems, such as Delicious (http://delicious.com/), Last.fm (http://www.last.fm/), and Flickr (http://www.flickr.com/), have been developed to enable users to tag online resources (bookmarks, music, images, and so on) with freely chosen non-hierarchical annotations (i.e., tags). Tags provide a textual vector representation of the resource’s features, regardless of the type of the resource. Different
from automatically generated metadata and metadata annotated by the authors of the resources, user-created tags are a reflection of the user’s own interest space for online resources. As users share their tagging with the public, the collaboratively created tagging data in a social tagging system constitute a folksonomy, which explicitly provides abundant information about their interests. For example, [21] analyzed personal data in folksonomies and investigated how to generate and represent multiple user interests, and [17] harvested user interests across multiple social networking sites. Although there are many studies applying user interest profile models to various information overload problems, e.g., recommendation systems [2] and personalized search [13][20][19], most of them do not exploit the semantics of the tags in the folksonomy very well. As users can freely choose tags from their own vocabulary, the resulting metadata can include homonyms (tags with the same spelling but different meanings) and synonyms (a group of different tags with the same meaning). This may lead to misconnections among different concepts and inefficient searches for information about a topic [6]. Another challenge for folksonomy-based user interest models is how to model the multiple interests that most users have [21]. Although various clustering methods [21][15] have been proposed to split user interests into individual clusters, their performance is not very promising because of their unsupervised nature. To remove these limitations in semantics and multiple-interest modeling, this paper presents a novel method for user interest profile modeling, which can accurately build user interests and solve the redundancy problems of pure folksonomy-based approaches. More specifically, in our approach, to build an ontological user interest profile, user tags in the Delicious social bookmarking system are mapped onto a Web topic ontology (the Open Directory Project taxonomy, http://www.dmoz.org/). Thus, the semantics of the tags in the folksonomy are modeled. Our approach takes advantage of both the folksonomy and the domain ontology. As the tags in the folksonomy are assigned by the user to the resources, they reflect the user’s real interests; otherwise the user would not have labeled the resources with the tags at all. However, the tags in a folksonomy carry no semantic information about themselves or their relationships. On the other hand, as the ontology is predefined by domain experts, the concepts in the ontology are accurate and easily spread to related concepts in the hierarchical tree. In addition, the evaluation of the proposed user interest profile model shows that our user model can accurately represent user interests and enables semantic reasoning in the ontology; when applied to personalized Web search, our user interest profile modeling algorithm shows significant performance improvement compared with previous state-of-the-art approaches. The rest of the paper is structured as follows. Section 2 recaps the basic concepts and notations of folksonomies. Section 3 develops our algorithm for folksonomy-based ontological user interest profile modeling. Section 4 applies our model to personalized search, and the evaluation result is presented in Section 5.
Section 6 discusses the related work. Finally, Section 7 concludes the paper and identifies possible future work.
2 Folksonomy
In this section, we recap the basic concepts and notations of folksonomies. Basically, a folksonomy is a set of user-contributed data aggregated by collaborative tagging systems, in which users can freely choose terms to describe their favorite Web resources. More specifically, a folksonomy generally consists of at least three sets of elements: users, tags, and resources (although there exist different kinds of resources, in this work we focus on Web documents like those bookmarked in del.icio.us). Formally, a folksonomy is defined as follows [12]:

Definition 1. A folksonomy, denoted as F, is a tuple (U, T, D, A), where U is a set of users, T is a set of tags, D is a set of Web documents, and A ⊆ U × T × D is a set of annotations.

For a particular user in the folksonomy, the data of interest are the documents in the user’s collection and the tags that the user assigned to each of the documents. Such a set of data is called a personomy [21], which is formally defined as follows:

Definition 2. A personomy, denoted as Pu, of user u is the restriction of a folksonomy F to u, i.e., Pu = (Tu, Du, Au), where
– Au is the set of annotations of user u: Au = {(t, d) | (u, t, d) ∈ A},
– Tu is the tag set of user u: Tu = {t | (t, d) ∈ Au}, and
– Du is the document set of user u: Du = {d | (t, d) ∈ Au}.

To facilitate the following discussions, we need two more notations. Firstly, T(u,d) denotes the set of tags assigned to document d by user u, i.e., T(u,d) = {t | (t, d) ∈ Au}. Secondly, we denote the set of tags assigned to document d by all users in the folksonomy as Td = {t | (u, t, d) ∈ A for some u ∈ U}.
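As an illustration of Definitions 1 and 2, the following is a minimal Python sketch of these structures; the annotation data shown are purely hypothetical.

    # Minimal sketch of a folksonomy (Definition 1) as a set of
    # (user, tag, document) annotations, and a personomy (Definition 2)
    # as its restriction to one user. The sample data are hypothetical.
    annotations = {
        ("alice", "python",   "http://docs.python.org"),
        ("alice", "tutorial", "http://docs.python.org"),
        ("bob",   "python",   "http://docs.python.org"),
    }

    def personomy(user, A):
        """Return (Tu, Du, Au): the user's tags, documents and annotations."""
        Au = {(t, d) for (u, t, d) in A if u == user}
        Tu = {t for (t, _) in Au}
        Du = {d for (_, d) in Au}
        return Tu, Du, Au

    def tags_for(user, doc, A):
        """T_(u,d): the tags the given user assigned to the given document."""
        return {t for (u, t, d) in A if u == user and d == doc}

    print(personomy("alice", annotations))
    print(tags_for("alice", "http://docs.python.org", annotations))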
3 Mapping Tags onto Ontology
In this section we present our method for constructing the user interest profile model. The basic idea of the method is: first, represent the user interest profile generated from the folksonomy as term vectors of tags without semantic information between these tags; then, map the tags onto a domain ontology to gain their semantics. In this process, the Web topic ontology we use is the Open Directory Project (ODP) taxonomy. It is the largest and most comprehensive Web directory and is well maintained by a global community of volunteer editors. Moreover, it is widely used as the basis for various research projects in the area of Web personalization, so it is a reasonable choice for our method. The main challenge in mapping tags in a folksonomy to concepts in ODP is that some tags may belong to multiple categories in the ontology. For example, in
Fig. 1. The tree representation of the concepts python and java in ODP categories
Fig. 1, both python and java are mapped to three categories, each of which has a distinct meaning. Thus, the mapping problem can be stated precisely as: given a set of tags in a user’s personomy, how do we map each tag onto a single appropriate category in ODP? Clearly, if no context is provided for a specific tag, it is impossible to map it onto the ontology correctly. So we use the co-tagging relations in the user’s personomy as the context for the ontological mapping. Users normally assign 6-7 tags to each bookmark in their personomy, namely the co-tag set defined in the previous section as T(u,d). Suppose the co-tag set is T(u,d) = {t1, ..., tN}. We represent the ODP categories for this set of tags as a hierarchical tree Tree, in which each category is a path from the root to a leaf. Each tag ti ∈ T(u,d) can be mapped to Mi categories {C(ti, 1), ..., C(ti, Mi)} in Tree, where each category has a probability P(C(ti, j)). We calculate P(C(ti, j)) from the number of documents indexed under the corresponding concept in ODP, normalized over the tag's candidate categories. The category mapping for each tag is determined by recursively visiting Tree in a breadth-first manner to calculate the probability of the tag belonging to each category. For example, suppose the co-tag set is {python, java}; as shown in Fig. 1, each of them is mapped to three categories in the tree, so there are six categories in the figure in total. For python, the first category C(python, 1) is the path Computers-Programming-Languages-Python-python, and P(C(python, 1)) = 314/(314+28+10). At each level l of the tree, the probability of a tag ti belonging to a category is calculated as the global occurrence of the concept among the categories at level l multiplied by the probability of the category itself. For example, in Fig. 1, at the first level, the global occurrence of Computers is 3, as there are three concepts categorized under Computers. Therefore, P(python ∈ Computers, level = 1) = 314/(314+28+10) × 3/6. The category with the largest probability is marked as the desired category for the tag, while the other categories for the tag are discarded. The process runs recursively until a leaf is reached for each tag. A formal description of the algorithm is given in Fig. 2.
For ti ∈ T(u,d) do
    For node ∈ child(root(Tree)) do
        P(node) = (number of categories bypassing node) / (total number of categories)
        node_score = P(node) × P(C(ti, j))
    end for
    node* = arg max_node node_score
    Tree = Subtree(Tree, node*)
end for

Fig. 2. Algorithm for mapping a set of tags onto ODP
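For concreteness, the following is a simplified Python sketch of this level-by-level selection for a single tag, under the assumption that each candidate ODP category is given as a path of concept names together with its document count. The paper's full algorithm additionally pools the categories of all co-tags in T(u,d), which this sketch omits; the toy data mirror the python example of Fig. 1.

    # Simplified sketch of the Fig. 2 mapping for one tag: walk the tree
    # level by level, keeping the child with the highest weighted score.
    from collections import Counter

    def map_tag(categories):
        """categories: list of (path, doc_count) pairs for one tag."""
        total = sum(count for _, count in categories)
        candidates = [(path, count / total) for path, count in categories]
        level = 0
        while len(candidates) > 1:
            # P(node): fraction of candidate categories passing through the node
            passing = Counter(path[level] for path, _ in candidates)
            best = max(
                candidates,
                key=lambda c: (passing[c[0][level]] / len(candidates)) * c[1],
            )
            node = best[0][level]
            candidates = [c for c in candidates if c[0][level] == node]
            level += 1
        return candidates[0][0]

    python_cats = [
        (("Computers", "Programming", "Languages", "Python"), 314),
        (("Recreation", "Pets", "Reptiles", "Snakes", "Boas and Pythons"), 28),
        (("Arts", "Movies", "Titles", "Monty Python and the Holy Grail"), 10),
    ]
    print(map_tag(python_cats))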
An overview of our user interest profile modeling method is shown in Fig. 3. The user-contributed tags are mapped to ODP using the proposed probability-based algorithm shown in Fig. 2. The result of the mapping is a hierarchical user interest tree (as shown in Fig. 3), in which each tag is denoted by a leaf. Semantic inference can be performed to spread the user interest profile by activating other nodes in the hierarchical ontology. After the mapping from the folksonomy onto ODP, we can apply the generated ontological user interest profile model to various real-world scenarios.
Fig. 3. Overview of Our User Interest Profile Modeling
4 Search Personalization
In this section, we apply our user interest profile model to personalized search. The basic idea behind personalized Web search discussed here is to re-rank
the results returned by a Web search engine for a query. The re-ranking can be defined in terms of the similarity between the query q and each document d in D (a set of documents), while the degree of personalization is measured by calculating the similarity between the user u and each document d. The problem with previous user interest profile modeling algorithms, such as that of Noll and Meinel [13], is that a uniform tag frequency vector representation of user interests is not adaptive to the specific context. For example, when we issue a query for personalized Web search, the top-ranked results are expected to be similar not only to the user's interests but also to the search context. However, when applying Noll and Meinel’s approach to Web search, results that are more similar to the user but do not fit the current search context might be ranked high. Therefore, the key challenge is contextual user interest activation, and this section develops such a user interest activation algorithm.
4.1 Contextual User Interests Activation
User interests are contextual, and a single event can activate only part of a user’s interests [16]. When users query a Web search engine for relevant information, the immediate context is the set of query keywords. In the following, we propose a contextual user interest activation algorithm based on the current query context. We have obtained the ontological representation of the user interest profile model in the previous section, where each user interest is a leaf node in the ontological tree T. Now suppose the user has N interests, denoted In = {In_i | i = 1, ..., N}, each of which is a path from the tree root to the corresponding leaf. We further represent the search context as the set of K query keywords, which are also mapped onto ODP as concept vectors Q = {q_j | j = 1, ..., K}. For each user interest In_i ∈ In, we compute the activation score of In_i in the context Q as:

    activation_score(In_i, Q) = (1/K) Σ_{j=1}^{K} δ(In_i, q_j)    (1)

Here, the similarity measure δ is the one developed in [4] to calculate the similarity between two nodes of a hierarchical tree:

    δ(l1, l2) = 2 × depth(LCAU(l1, l2)) / (depth(l1) + depth(l2))    (2)

where l1 and l2 are two nodes of a tree, LCAU(l1, l2) is their lowest common ancestor, and depth(l1) and depth(l2) are the depths (from the root) of the two nodes in the tree, respectively. After the activation score of each In_i in the context Q is obtained, we sort the scores in descending order, and only the top k user interests are activated.
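A small illustrative sketch of Eqs. (1) and (2) follows, under the assumption that each interest and each query concept is represented as a path of ODP concept names from the root; the names and data are our own.

    # Sketch of contextual interest activation: Eq. (2) as a path similarity,
    # Eq. (1) as the average similarity to the query concepts, then top-k.
    def lca_depth(p1, p2):
        """Depth of the lowest common ancestor of two tree paths."""
        d = 0
        for a, b in zip(p1, p2):
            if a != b:
                break
            d += 1
        return d

    def delta(p1, p2):
        """Eq. (2): 2*depth(LCA) / (depth(p1) + depth(p2))."""
        return 2 * lca_depth(p1, p2) / (len(p1) + len(p2))

    def activation_score(interest, query_concepts):
        """Eq. (1): average semantic similarity to the K query concepts."""
        return sum(delta(interest, q) for q in query_concepts) / len(query_concepts)

    def activate(interests, query_concepts, k=3):
        """Keep only the top-k interests for the current query context."""
        scored = [(activation_score(i, query_concepts), i) for i in interests]
        return sorted(scored, reverse=True)[:k]

    interests = [("Computers", "Programming", "Languages", "Python"),
                 ("Recreation", "Pets", "Reptiles")]
    query = [("Computers", "Programming")]
    print(activate(interests, query, k=1))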
4.2 User-Document Similarity
The user-document similarity presented by Noll and Meinel [13] is a dimensionless score defined as θ(u, d) = P_u · P_d, where P_u and P_d are the vector representations of the interest profile of user u and the profile of document d, respectively. Each element in P_u represents the frequency of a tag labeled by user u, and each element in P_d denotes the frequency of a tag assigned to document d by all users. All document tag frequencies are normalized to 1. Vallet [19] measures the user-document similarity by using the term frequency-inverse document frequency (tf-idf) weighting scheme, in which the tf-idf weights in both the user space U and the document space D are calculated: similarity(u, d) = \sum_i tf\text{-}idf_{u,i} \times tf\text{-}idf_{d,i}. This weighting scheme eliminates the user and document length normalization factors of the classic tf-idf vector space model, because the popularity score of a document, which is a good source of relevancy [7], would be penalized by introducing a length normalization factor.

The difference between our similarity measure and keyword-based similarity measures such as those of Noll and Meinel [13] and Vallet [19] is that we take the search context and the semantic similarity into consideration to calculate the semantic similarity between the activated user interests and the document in the search result list. Our similarity measure is an extension of the Generalized Cosine-Similarity Measure (GCSM) developed in [4], which is used to compute similarity between concepts in a hierarchical structure. GCSM is an expansion of the tf-idf vector space model. Suppose the activated user interest is denoted as Active_In and a document in the result list is denoted as d. GCSM defines Active_In · d as:

\theta(Active\_In, d) = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j \, (Active\_In_i \cdot d_j) \qquad (3)

where a_i and b_j are the tf-idf weights for Active_In_i and d_j, respectively. Finally, the normalized GCSM similarity of these two vectors is given as:

\theta_n(Active\_In, d) = \frac{Active\_In \cdot d}{\sqrt{Active\_In \cdot Active\_In}\,\sqrt{d \cdot d}} \qquad (4)

In our similarity measure, we also eliminate the length normalization factor for the same reason as in Vallet [19]. At the same time, we introduce α_i, the activation score for Active_In_i in the query context Q. Finally, our similarity measure is defined as:

\theta'(Active\_In, d) = \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i a_i b_j \, (Active\_In_i \cdot d_j) \qquad (5)
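A small sketch of the context-weighted similarity in Eq. (5) is shown below. The tf-idf weights a_i and b_j, the activation scores α_i, and the concept-to-concept product (any tree-based similarity such as δ of Eq. (2)) are assumed to be supplied by the caller; the structure and names are illustrative only.

```python
# Hedged sketch of Eq. (5): a context-weighted, GCSM-style similarity.

def context_similarity(active_interests, doc_concepts, concept_similarity):
    """
    active_interests: list of (interest_concept, tfidf_a_i, activation_alpha_i)
    doc_concepts:     list of (doc_concept, tfidf_b_j)
    concept_similarity(c1, c2): inner product of two concepts, e.g. Eq. (2)
    """
    score = 0.0
    for in_concept, a_i, alpha_i in active_interests:
        for d_concept, b_j in doc_concepts:
            # alpha_i biases the sum toward interests activated by the query context
            score += alpha_i * a_i * b_j * concept_similarity(in_concept, d_concept)
    return score
```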
5 Experimental Evaluation

In this section, we present our evaluation experiment.

5.1 Evaluation Method of Personalized Search
In this subsection we discuss how to evaluate the performance of our user interest profile modeling algorithm by applying it to personalized Web search. Vallet and Cantador [19] have set up a reasonable evaluation framework to measure folksonomy-based Web search personalization approaches, so we can evaluate our work within their framework and compare the performance of our approach with the personalization approach proposed in [19]. In the framework, the user's bookmarks are split into two sets. The first set is used to generate the user interest profile, as illustrated in the previous sections, and the second is used as the testing set to evaluate the user profiling performance. For each bookmark in the testing set, the most popular tags are extracted from the social bookmarking systems. These top tags can reflect the features of the bookmark. We then launch a Web search with these top tags and collect the search results. It is assumed that the bookmarks in a user's testing set are relevant to the user and that the user is interested in these bookmarks; otherwise the user would not store the bookmarks at all. With an effective user interest profiling algorithm, the bookmarks which match the user better are expected to be re-ranked upward. The topic generation and evaluation are conducted in the following steps [19]. For each document d in the testing set: 1) we generate a topic description using the top k most popular tags associated with the document; 2) we query the topic on a Web search engine and return the top R documents as the topic's result list; 3) if document d is found in the result list, we apply the user interest profile modeling algorithm to each document in the result list; and 4) we calculate the original ranking position γ_1 and the new ranking position γ_2 of d and compute the Mean Reciprocal Rank (MRR), which is the average of the reciprocal ranking positions, over all documents in the testing set. In our experiments, we use a query size of 3 tags and a result list size of 500 documents, the same settings as in [19], as we want to compare the performance of our approach with that of [19]. There is of course a chance that document d does not appear in the result list. In this case, the document is discarded for topic generation.

To measure the user interest profiling performance, we combine the ranking of the result list produced by the search engine with the ranking produced by the personalization approach in order to re-rank the search results. We use the CombSUM rank-based aggregation method [14] to aggregate the original ranking score and the user interest profile ranking score. Formally, let τ denote the result list returned by the search engine for a Web query, τ(i) denote the position of item i in τ, and s^τ(i) denote the score assigned by the ranker to item i ∈ τ. The normalizations in [14] are calculated as below:

– Score normalization: for an item i ∈ τ,

\omega_S^{\tau}(i) = \frac{s^{\tau}(i) - \min_{j \in \tau} s^{\tau}(j)}{\max_{j \in \tau} s^{\tau}(j) - \min_{j \in \tau} s^{\tau}(j)} \qquad (6)
– Rank normalization: for an item i ∈ τ,

\omega_R^{\tau}(i) = 1 - \frac{\tau(i) - 1}{|\tau|} \qquad (7)
– The final CombSUM score:

s_C^{\tau}(i) = \omega_S^{\tau}(i) + \omega_R^{\tau}(i) \qquad (8)
For the search engine ranking, we can only obtain the result position τ(i) but not the underlying score, so rank normalization is used to normalize the weight. For the user interest profile ranking, the score s^τ(i) can be obtained, and therefore score normalization is used.
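A compact sketch of the CombSUM aggregation (Eqs. 6-8) and the MRR computation used in the evaluation is given below. The variable names (engine_ranking, profile_scores) are illustrative assumptions, not identifiers from the paper.

```python
# Sketch of CombSUM re-ranking (Eqs. 6-8) plus the MRR metric.

def rank_normalize(engine_ranking):
    """Eq. (7): omega_R(i) = 1 - (tau(i) - 1) / |tau|, for a 1-based position."""
    n = len(engine_ranking)
    return {doc: 1.0 - i / n for i, doc in enumerate(engine_ranking)}

def score_normalize(profile_scores):
    """Eq. (6): min-max normalization of the personalization scores."""
    lo, hi = min(profile_scores.values()), max(profile_scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in profile_scores.items()}

def combsum_rerank(engine_ranking, profile_scores):
    """Eq. (8): sum the two normalized weights and sort descending."""
    omega_r = rank_normalize(engine_ranking)
    omega_s = score_normalize(profile_scores)
    combined = {doc: omega_s.get(doc, 0.0) + omega_r[doc] for doc in engine_ranking}
    return sorted(engine_ranking, key=lambda doc: combined[doc], reverse=True)

def mean_reciprocal_rank(target_positions):
    """target_positions: 1-based rank of the bookmarked document per topic."""
    return sum(1.0 / pos for pos in target_positions) / len(target_positions)
```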
5.2 Experimental Data Sets
We created a test data set of 100 Delicious users using the Delicious API (http://delicious.com/help/api). The users have at least 300 bookmarks. We use 80% of the bookmarks to create the user interest profile, and the remaining 20% to generate the evaluation topics as described in the previous section. To increase the number of valid documents in the testing set, when performing the splitting we extract the top popular tags for each document d in a user's bookmarks, launch a Web search with those top tags, and check whether d is in the result set. Only those cases where d is in the result set are included in the testing set. As the experimental system for personalized Web search, we chose the Yahoo search engine with the Yahoo BOSS search API (http://developer.yahoo.com/search/boss/), because the BOSS API allows unlimited queries per day and re-ordering of the search results, which provides an ideal platform for personalized search experiments. For ODP, we perform searches on the ODP website and extract the candidate categories via HTTP requests.

5.3 Validity of Our Approach
As the tags in a folksonomy are assigned by users with any words they like, it cannot be guaranteed that all tags can be mapped to ODP. We denote the percentage of tags that can be mapped to ODP as the mapping rate. In our experiment, the mapping rate surpasses 80% for 98% of all users, so we regard our mapping as valid.

5.4 Comparing Results of Personalized Web Search
We now investigate the performance of the personalization approaches when used in combination with a Web search engine. Fig. 4 shows the MRR values of the existing personalization approaches and the one developed in this paper. The existing ones are those in [13] and [19], denoted Noll and Vallet in Fig. 4; TagOnto (an abbreviation of our combined tag- and ontology-based algorithm) denotes our algorithm. The white bars, denoted MRR1, are the MRR scores for each algorithm without being combined with the ranking of the Web search engine, whereas the black bars, denoted MRR2, are the MRR scores for each algorithm combined with the Web search ranking. From Fig. 4, we can see that the performance of the Noll algorithm is lower than the Baseline. The reason is that the uniform term vector representation of user interests may give priority to resources that are more similar to the user without considering the search context. The problem also exists with the Vallet algorithm, but not with ours: although the tf-idf weighting scheme biases each element of the user interests and the resource features, it is not directly based on the search context. Further, semantic similarity information cannot be measured in their approaches. The performance of the approach proposed in this work, TagOnto, is better than that of Noll and Vallet in terms of MRR.
Fig. 4. Performance Comparison
6 Related Work
Many researchers have studied user interest profiling based on various implicit or explicit information-harvesting approaches. For example, Letizia [8] assists users in browsing the WWW by suggesting and displaying relevant web pages, which are inferred from the users' interests in their previous browsing history. The work of [18] collects a wide range of implicit user activities, such as search queries, browsing history, emails and documents, and then utilizes this information to re-rank Web search results. Compared with these works, our method harvests user interests from user-contributed tags, which more accurately reflect users' real interests, especially in terms of semantics.

More recently, as social tagging systems are becoming more popular, many studies have been carried out to construct user interest profile models from folksonomies. For example, a user interest profile model based on the annotations that users assign to their bookmarks is developed in [3]. The user interests are represented as a keyword frequency vector, in which each element is the frequency of a tag assigned by the user to a document. In [10], three different approaches for constructing user interest profile models from a folksonomy are proposed. One of those approaches is an adaptive algorithm, called Add-A-Tag, which takes account of the structural and temporal nature of tagging data by reducing the weights on edges connecting two tags as time passes. In addition, Noll and Meinel [13] discuss the issue of constructing a user interest profile model from a folksonomy in the context of personalized Web search. In their approach, a user interest profile model is represented in the form of a weighted vector with m components, each of which denotes the total count of a tag in the user's bookmark collection. However, none of these approaches can handle the semantics of tags, whereas our work can, and our experiment shows that associating semantics with each tag can significantly improve personalized search.

Moreover, the issue of multiple interests has not been addressed very well in the above work, yet it is an important one. In fact, [5] surveyed user profiling in personal information agents, and one of the main observations is that a user always has multiple interests and a single term frequency vector is not an optimal form of user interest profile representation. Thus, Yeung et al. [21] developed an algorithm to generate user profiles with multiple interests. Their approach is to construct a network of documents for a particular user, apply community-discovery algorithms to divide the nodes into clusters, and extract sets of tags which act as signatures of the clusters to reflect the interests of the users. However, it is hard to determine both the number of clusters and the semantic distance measure. In our work, the problem is solved by introducing a comprehensive domain ontology: by mapping tags onto appropriate categories of the ontology, the user's multiple interests and the semantic relations between them are well integrated and utilized.

Further, as opposed to folksonomy-based approaches, ontological user interest profile modeling based on the knowledge contained in domain ontologies has been proposed as well. In these approaches, a user interest profile is represented in terms of concepts in a hierarchical tree. The semantic information in the ontology makes it easy to infer and activate related concepts in the hierarchical structure. The homonym and synonym problems can be solved by evaluating the semantic similarity between different concepts in the hierarchy, which is impossible in folksonomy-based approaches. For example, Middleton et al. [11] develop two algorithms in which user profiles are modeled using a research paper topic ontology. Similar approaches have also been proposed to construct user profiles to enhance personalized Web search [16] or collaborative recommendations [1]. The user interests in these systems are harvested by implicit reasoning on the resources, extracting significant terms as representations of the resource features. However, as the terms are not labeled by the users themselves, the real interests of users cannot be captured very well. In contrast, this problem is solved in our work by integrating the folksonomy and ontology-based semantic approaches.
Correctly mapping each tag in a folksonomy onto the appropriate categories in ODP is a preliminary step before the semantic information in the ontology can be utilized for personalized search. Ma et al. [9] employed four approaches to map a user's interests onto appropriate ODP categories by introducing natural language processing techniques. However, their work does not incorporate the statistical information in ODP (e.g., the number of indexed webpages under the concepts) or the inter-related interests under a common topic, both of which provide abundant information for precise mapping. In our approach, the co-occurrences of tags supply exactly such statistical data for the ontological mapping, and therefore our mapping accuracy is greatly improved. More recently, Vallet and Cantador [19] presented an evaluation framework for folksonomy-based Web search personalization approaches, and compared their approach with that of Noll and Meinel [13]. Following the same assumptions as in [13] and [19], our evaluation experiment compares our proposal with these existing approaches and shows that our work is more effective.
7 Conclusions and Future Work
In this paper, we proposed a folksonomy- and ontology-based model of the user interest profile. More specifically, a probability tree model is developed to map user tags onto an ontology. The proposed model can be applied to personalized Web search through our semantic similarity measures. The experiment shows that personalized search utilizing our user interest profile model outperforms a number of previous approaches (especially the one in [19] published in 2010). This improvement is mainly because we introduce semantic inference by mapping the folksonomy onto a domain ontology; therefore, our model of a user's multiple interests is more accurate and semantically inferable. In the future, we will explore how to build robust user interest profiles with cross-network user information, and apply the model to other fields such as recommender systems and information filtering.
References

1. Anand, S., Kearney, P., Shapcott, M.: Generating semantically enriched user profiles for web personalization. ACM Transactions on Internet Technology, Article 22, 7(4) (2007)
2. Chirita, P., Costache, S., Nejdl, W., Handschuh, S.: P-tag: Large scale automatic generation of personalized annotation tags for the web. In: Proceedings of the 16th International Conference on World Wide Web, pp. 845–854 (2007)
3. Diederich, J., Iofciu, T.: Finding communities of practice from user profiles based on folksonomies. In: Proceedings of the 1st International Workshop on Building Technology Enhanced Learning solutions for Communities of Practice, pp. 288–297 (2006)
4. Ganesan, P., Garcia-Molina, H., Widom, J.: Exploiting hierarchical domain structure to compute similarity. ACM Transactions on Information Systems 21(1), 64–93 (2003)
5. Godoy, D., Amandi, A.: User profiling in personal information agents: A survey. Knowledge Engineering Review 20(4), 329–361 (2006)
6. Golder, S., Huberman, B.A.: The structure of collaborative tagging systems. Journal of Information Science 32(2), 198–208 (2005)
7. Hotho, A., Jaschke, R., Schmitz, C., Stumme, G.: Information retrieval in folksonomies: Search and ranking. In: The Semantic Web: Research and Applications, pp. 411–426 (2006)
8. Lieberman, H., et al.: Letizia: An agent that assists web browsing. In: Proceedings of International Joint Conference on Artificial Intelligence, vol. 14, pp. 924–929 (1995)
9. Ma, Z., Pant, G., Sheng, O.: Interest-based personalized search. ACM Transactions on Information Systems 25(1) (2007)
10. Michlmayr, E., Cayzer, S.: Learning user profiles from tagging data and leveraging them for personalized information access. In: Proceedings of the Workshop on Tagging and Metadata for Social Information Organization, 16th International World Wide Web Conference (2007)
11. Middleton, S., Shadbolt, N., De Roure, D.: Ontological user profiling in recommender systems. ACM Transactions on Information Systems 22(1), 54–88 (2004)
12. Mika, P.: Ontologies are us: A unified model of social networks and semantics. Web Semantics: Science, Services and Agents on the World Wide Web 5(1), 5–15 (2007)
13. Noll, M., Meinel, C.: Web search personalization via social bookmarking and tagging. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 367–380. Springer, Heidelberg (2007)
14. Renda, M., Straccia, U.: Web metasearch: rank vs. score based rank aggregation methods. In: Proceedings of the 2003 ACM Symposium on Applied Computing, p. 846 (2003)
15. Shepitsen, A., Gemmell, J., Mobasher, B., Burke, R.: Personalized recommendation in social tagging systems using hierarchical clustering. In: Proceedings of the 2008 ACM Conference on Recommender Systems, pp. 259–266 (2008)
16. Sieg, A., Mobasher, B., Burke, R.: Web search personalization with ontological user profiles. In: Proceedings of the 16th ACM Conference on Information and Knowledge Management, pp. 525–534 (2007)
17. Szomszor, M., Alani, H., Cantador, I., O'Hara, K., Shadbolt, N.: Semantic modelling of user interests based on cross-folksonomy analysis. In: 7th International Semantic Web Conference, pp. 632–648 (2008)
18. Teevan, J., Dumais, S., Horvitz, E.: Personalizing search via automated analysis of interests and activities. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 449–456 (2005)
19. Vallet, D., Cantador, I., Joemon, M.: Personalizing web search with folksonomy-based user and document profiles. In: Gurrin, C., He, Y., Kazai, G., Kruschwitz, U., Little, S., Roelleke, T., Rüger, S., van Rijsbergen, K. (eds.) Advances in Information Retrieval. LNCS, vol. 5993, pp. 420–431. Springer, Heidelberg (2010)
20. Xu, S., Bao, S., Fei, B., Su, Z., Yu, Y.: Exploring folksonomy for personalized search. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 155–162 (2008)
21. Yeung, C., Gibbins, N., Shadbolt, N.: A study of user profile generation from folksonomies. In: Proceedings of the Workshop on Social Web and Knowledge Management (2008)
Visualizing Threaded Conversation Networks: Mining Message Boards and Email Lists for Actionable Insights*

Derek L. Hansen1, Ben Shneiderman2, and Marc Smith3

1 College of Information Studies & Center for the Advanced Study of Communities and Information, University of Maryland, College Park, Maryland, USA
[email protected]
2 Dept. of Computer Science & Human-Computer Interaction Lab, University of Maryland, College Park, Maryland, USA
[email protected]
3 Connected Action Consulting Group, Silicon Valley, California, USA
[email protected]

* This paper is a revised version of a chapter from "Analyzing Social Media Networks with NodeXL: Insights from a Connected World" by Hansen, Shneiderman, and Smith to be published by Morgan Kaufmann Publishers in Fall 2010.
Abstract. Analyzing complex online relationships is a difficult job, but new information visualization tools are enabling a wider range of users to make actionable insights from the growing volume of online data. This paper describes the challenges and methods for conducting analyses of threaded conversations such as found in enterprise message boards, email lists, and forums. After defining threaded conversation, we characterize the types of networks that can be extracted from them. We then provide 3 mini case studies to illustrate how actionable insights for community managers can be gained by applying the network analysis metrics and visualizations available in the free, open source NodeXL tool, which is a powerful, yet easy-to-use tool embedded in Excel 2007/2010.
1 Introduction

Threads are the things that hold the net together. Since the inception of the Internet most virtual communities have relied on asynchronous threaded conversation platforms as a main channel of communication. Usenet newsgroups, email lists, web boards, and discussion forums all contain collections of messages in reply to one another. The natural conversation style supported by the basic post-and-reply threaded message structure has proven enormously versatile, serving communities ranging widely in focus and goals. Cancer survivors and those seeking technical support or religious guidance are as likely to use a threaded discussion as a corporate workgroup. Modern incarnations of threaded conversation are embedded in social networking site wall posts, blog comments, Google Wave threads, YouTube or Flickr comments, and Twitter 'reply to' (RT) tweets. Traditional forums now include profile pages, participation statistics, reputation systems, and private messaging.
Despite the differences in types of threaded conversation, the common structure lends itself well to network analysis, due to its easily identifiable reply structure that captures communication patterns between people. Unfortunately, most threaded conversation systems do not make this networked data easily accessible. The majority of threaded message content is not easily accessible due to the number of different software platforms used and the fact that many groups only make content accessible to subscribed members. Many threaded message systems do report participation statistics and ratings (e.g., top 10 contributors), which are important metrics but fail to capture the social connections between members – a critical component of virtual communities and corporate communities of practice. This paper considers how to analyze threaded conversations from a network perspective. We begin by defining threaded conversation and characterizing some of the most important networks that can be created from threaded conversation. We then include several brief case studies that demonstrate the value of taking a network approach. The major contribution is to demonstrate novel analysis and visualization approaches that provide users with powerful methods for extracting actionable insights. We rely upon a novel, open source network analysis tool called NodeXL (www.codeplex.com/nodexl), which enables a wider range of analysts to make discoveries and visual presentations that previously required a higher degree of technical skills. These analysts can apply their rich domain knowledge and understanding of social and organizational structures to handle larger datasets and make appropriate business decisions.
2 Definition and Structure of Threaded Conversation

Threaded conversation is a commonly used design theme that enables online discussion between multiple participants using the ubiquitous post-reply-reply structure. It shows up in many forms from email lists to web discussion forums to photo sharing and customer review sites. The key properties of threaded conversation were enumerated in Resnick et al. [1] and are listed here with some modification:

• Topics. A set of topics, groups, or spaces, sometimes hierarchically organized to aid users in discovering interesting groups to "join." Topics or groups are persistent, though their contents may change over time. Fig. 1 includes two topics: TOPIC 1: Social Media and TOPIC 2: NodeXL.
• Threads. Within each topic or group, there are top-level messages and responses to those messages. Sometimes further nesting – responses to responses – is permitted. The top-level message and the entire tree of responses to it are called a thread. In Fig. 1, there are 5 unique threads. Thread A includes only 2 messages, while Thread B includes 6 messages. Thread D includes only a single message.
• Single Authored. Each message contributed to a thread is authored by a single user. Typically, the person's username or email address is shown alongside the post so people know who is talking. In Fig. 1, the author of each message and the time of their post are indicated. Users may post to multiple threads (e.g., Beth) or multiple times within a thread (e.g., Cathy).
• Permanence. In many threaded conversations including email lists and Usenet, once a message has been posted it cannot be re-written or edited. A new message
may be posted, but no matter how much someone may wish it, an original post often cannot be retracted. In some discussion boards and newer systems like Google Wave, original posts can be modified after initial contribution.
• Homogeneous View. The partitioning of messages into topics is a feature shared by many discussion interfaces. Moreover, in most systems users all see the same view of the messages in a topic, either in chronological or reverse chronological order. Messages are often sorted into threads (e.g., Fig. 1). In some cases, the system will keep track of which messages a user has previously viewed, so that it can highlight unread messages, but that is the only personalization of how people view the messages.

Fig. 1. Threaded Conversation Diagram showing 5 Threads that are part of two different Topics. Each post includes a subject (e.g., Thread A), a single author (e.g., Adam), and a timestamp (e.g., 12/10/2010 2:30pm). Indenting indicates placement in the reply structure. Darker posts initiate new threads (i.e., they are top-level threads), while lighter posts reply to earlier messages in the same thread.
3 Threaded Conversation Research

Research on communities that use threaded conversation began in the early days of Bulletin Board Systems (BBS) and Usenet. Many of the same themes continue to be explored today. For example, Kollock and Smith's book "Communities in Cyberspace" [2] included chapters on identity online, deviant behavior and conflict management, social order and control, community structure and dynamics, visualization, and collective action. All of these topics are still being explored in new contexts and with new technologies such as social networking sites, blogs, microblogging, and wikis. Early books by Preece [3], Kim [4], and Powazek [5] provided some enduring, practical advice and inspiration for those managing online communities. One persistent finding
is the skewed pattern of participation in threaded conversations wherein a few core members contribute the majority of content, many peripheral members contribute infrequently, and a large number of lurkers [6] benefit by overhearing the conversations of others [7]. While most early research on threaded conversations used content analysis, counts of participation patterns, and interviews, a few early researchers applied social network analysis to examine online interactions [e.g., 8-9]. Network analysis approaches are now common, particularly at technical conferences such as the International AAAI Conference on Weblogs and Social Media (ICWSM) that work with large datasets. However, analysis of large-scale networks by academics differs significantly from analysis of bounded networks by community administrators and corporate managers trying to gain insights relevant to their day-to-day actions. In the past couple of years network analysis tools such as NodeXL have made it possible for those without advanced degrees or specialized training to collect, analyze, and visualize networked data from social media sources [10-11]. This has prompted a great need for applied research that clarifies how network analysis techniques can be used to gain actionable insights – the focus of this article.
4 What Questions Can Be Answered?

There are many reasons to explore networks that form within large collections of conversations. New employees or community members need to rapidly catch up with the "story so far" to get to a point that they can make useful contributions. Community managers need tools to help them serve as metaphorical fire rangers and game wardens for huge populations of discussion contributors and the mass of content they produce. When outsiders such as researchers or competitors peer into a set of relationships, social network analysis can point out people, documents, and events that are most notable. A few of the specific questions that can be addressed with network analysis of community conversations are described below:

• Individuals. Who are important individuals within the community? Who are the question answerers, discussion starters, and administrators? Who are the topic experts? Who would be a good replacement for an outgoing administrator? Who fills a unique niche?
• Groups. Who makes up the core members of the community? How interconnected are the core group members? Are there subgroups within the larger community? If so, how are the subgroups interconnected? How do they differ?
• Temporal Comparisons. How have participation patterns and overall structural characteristics of the community changed over time? What does the progression of an individual from peripheral participant to core participant look like and who has made that transition well? How is the community structure affected by a major event like a new administrative team, the leaving of a prominent member, or an initiative to bring in new members?
• Structural Patterns. What network properties are related to community sustainability? What are the common social roles that reoccur among community members (e.g., answer person, discussion starter, questioner, administrator)?
5 Threaded Conversation Networks

Two primary types of networks can be created from threaded conversations: reply networks and affiliation networks, each of which is discussed here and illustrated later in the article with examples.

5.1 Reply Networks

Each time someone replies to another person's message, she creates a directed tie to that other person. If she replies to the same person multiple times, a stronger weighted tie is created. A reply graph treats the message authors as the graph vertices and the reply connections as the graph edges. There are two types of reply networks, depending on how you determine what constitutes a reply. The direct reply network connects a replier to the person they are immediately replying to in the course of a thread (see Fig. 2). In contrast, a top level reply network connects all repliers within a thread to the original thread author.

NodeXL is a free and open source plugin for Microsoft Excel [10]. Network data about edges and vertices are stored in the spreadsheet, while network visualizations are displayed in the graph pane. The spreadsheet portion includes separate worksheets for the Edges (shown in Fig. 2), the Vertices (which includes a unique list of each vertex in the network and visual properties associated with them), and other data of interest such as clusters and overall graph metrics. Different visual properties such as edge width, color, and opacity can be mapped to data properties such as edge weight (i.e., number of messages exchanged) or edge type. Similarly, vertex size, color, opacity, and shape can be mapped to graph metrics (e.g., degree, betweenness centrality) or other attribute data (e.g., demographics). Vertex and edge labels can be displayed in multiple ways. Advanced features allow analysts to import data from social media tools (e.g., email, Twitter, YouTube, Flickr), automatically identify vertex clusters, layout the vertices according to different algorithms, calculate sub-graph images, and dynamically filter out edges and vertices using sliders.

A top level reply network emphasizes those who start threads (i.e., post the top-level message), while de-emphasizing conversations that occur midway through a thread. In some communities with short threads where all replies are typically directed at the original poster, such as email based Q&A communities, this network can better reflect the underlying dynamics. However, in discussion communities or forums with longer threads, the direct reply network is typically preferred since people later in the thread are often replying to each other. A top level reply network based on data in Fig. 1 would have Dave, Beth, and Ethan all pointing to Cathy who started the longest thread, thus emphasizing her importance. It would also include a self-loop from Cathy to Cathy, which are more common in these types of networks since people like Cathy reply to those who have replied to them.
Fig. 2. Direct reply network graph based on data in Fig. 1. The network is constructed by creating an edge pointing from each replier to the person they replied to, and then merging duplicate edges to create an Edge Weight column. Notice that Beth has replied directly to Dave twice, so the edge connecting them is thicker. Fiona replied to her own message so there is a self-loop. Greg started a thread but was not replied to so he is not connected to anyone else.
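The sketch below is a small illustration (not the NodeXL implementation) of how the two reply networks described above can be derived from threaded messages. Each message is assumed to carry an id, an author, a thread id, and the id of the message it replies to; the toy data loosely mirrors the style of Fig. 1.

```python
# Building direct and top-level reply edge lists from a list of messages.
from collections import Counter

messages = [  # toy data; reply_to is None for a top-level post
    {"id": 1, "author": "Adam",  "thread": "A", "reply_to": None},
    {"id": 2, "author": "Beth",  "thread": "A", "reply_to": 1},
    {"id": 3, "author": "Cathy", "thread": "B", "reply_to": None},
    {"id": 4, "author": "Dave",  "thread": "B", "reply_to": 3},
    {"id": 5, "author": "Cathy", "thread": "B", "reply_to": 4},
]
by_id = {m["id"]: m for m in messages}
thread_starter = {m["thread"]: m["author"] for m in messages if m["reply_to"] is None}

# Direct reply network: replier -> author of the message being replied to.
direct_edges = Counter(
    (m["author"], by_id[m["reply_to"]]["author"])
    for m in messages if m["reply_to"] is not None)

# Top-level reply network: every replier in a thread -> the thread starter.
top_level_edges = Counter(
    (m["author"], thread_starter[m["thread"]])
    for m in messages if m["reply_to"] is not None)

print(direct_edges)      # counts are the merged edge weights
print(top_level_edges)   # note Cathy -> Cathy appears as a self-loop
```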
5.2 Affiliation Networks

Affiliation networks are bi-modal networks that connect people to a set of groups, events, or places. For example, a traditional affiliation network may connect a group of executives to companies for whom they serve on the board of directors. Vertices represent both people and companies (which is why it is a bi-modal network), while edges represent affiliations between them. Affiliation networks for threaded conversations typically connect authors to Topics or Threads. The edges are undirected since there is only one possible direction (a person can post to a thread, but a thread can't post to a person). They are weighted based on the number of times a person posted to a Topic or Thread. For example, an edge would connect Cathy to Thread B with a weight of 2, since she posted to that thread twice. Beth would be connected with a weight of 1 to Thread A, Thread B, and Thread C since she posted to each of them once. This network is ideal for identifying boundary spanners and Forums or Threads that share authors. Other affiliation networks connect authors to items that conversations are associated with (e.g., YouTube videos, Flickr photos, blogs). These networks are related to recommender systems, in this case identifying "people who commented on this also comment on that" relationships. Each affiliation network can be transformed into 2 additional unimodal, weighted networks: a user-to-user network connecting people based on the number of threads (or forums) they both contribute to, and a thread-to-thread (or forum-to-forum) network connecting threads together based on the number of contributors they share (or, in the case of videos, connecting videos based on the number of shared authors). These networks are good for creating overview graphs of large communities with many threads or forums. They help to identify content clusters that share many of the same authors, as well as clusters of users that hang out together in similar threads or forums.
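A hedged sketch of the author-to-thread affiliation network and its two unimodal projections is shown below. The (author, thread) post list is toy data chosen to match the worked example above, not data from the paper.

```python
# Bimodal affiliation network and its user-to-user / thread-to-thread projections.
from collections import Counter
from itertools import combinations

posts = [("Adam", "A"), ("Beth", "A"), ("Beth", "B"), ("Cathy", "B"),
         ("Cathy", "B"), ("Dave", "B"), ("Beth", "C"), ("Ethan", "B")]

# Bimodal affiliation edges, weighted by the number of posts per (author, thread).
affiliation = Counter(posts)

threads_of = {}   # author -> set of threads posted to
authors_of = {}   # thread -> set of contributing authors
for author, thread in affiliation:
    threads_of.setdefault(author, set()).add(thread)
    authors_of.setdefault(thread, set()).add(author)

# User-to-user projection: weight = number of threads both contributed to.
user_to_user = {(a, b): len(threads_of[a] & threads_of[b])
                for a, b in combinations(sorted(threads_of), 2)
                if threads_of[a] & threads_of[b]}

# Thread-to-thread projection: weight = number of shared contributors.
thread_to_thread = {(s, t): len(authors_of[s] & authors_of[t])
                    for s, t in combinations(sorted(authors_of), 2)
                    if authors_of[s] & authors_of[t]}

print(affiliation[("Cathy", "B")])   # 2, as in the worked example above
print(user_to_user)
print(thread_to_thread)
```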
6 Analyzing a Technical Support Email List: CSS-D

There are a host of technical support groups that use email lists, Usenet newsgroups, or web discussion forums to help individuals solve problems and make sense of a specific technology like JAVA, a product such as the iPhone, or a topic such as web design. Many companies host these forums to learn about problems with existing products, resolve customer concerns, generate new ideas on future improvements, and build a loyal customer community. To meet these goals it is often important to understand which individuals play important roles within the community, something that can be challenging when managing multiple, active communities. This section describes how to identify key members of the CSS-D email list devoted to the effective use of Cascading Style Sheets (CSS) in web design. It is a highly active list with around 50 messages sent each day. There are a handful of administrators who keep the conversation friendly and encourage contributors to follow the guidelines. See [12] for a complete description of the community and some of the strategies they use that make them so effective.

6.1 Preparing Email List Network Data

Creating network data from email lists such as CSS-D poses a few challenges. Email lists often have people registered with multiple email addresses, making it necessary to combine duplicate addresses for the same person. This process is called deduplication and is an active area of research [13]. Another problem is that inferring who is replying to whom is not always obvious. By definition, all messages sent to an email list are sent to a single email list address (e.g.,
[email protected]). The result is a star network connecting all contributors' email addresses to the list email address. Messages that begin a new thread (i.e., initial posts) will be sent to the list address and rarely will Cc other individuals. Replies to initial posts are handled differently depending on email list configuration choices. Some lists, like css-d, set the default Reply-To address to that of the original sender. Users who click "Reply" to the initial email will send directly to the person, whereas users who click "Reply to All" will send to the initial person in the To field and the email list in the Cc field. This configuration is good for network analysis because it can use the information in the Cc field to identify who is replying to whom. It does encourage more private messages however, which are missed by the email list and are thus absent in the network analysis. Other email lists set the default so that when users click on "Reply To" it sends to the list and they must choose "Reply To All" to explicitly Cc the initial sender. This configuration makes it more likely that people just send to the list and don't copy in the person they directly reply to. The result is that analysts may need to look at subject lines and email header information to reconstruct who is replying to whom. The NodeXL tool includes an email import feature where analysts can generate email-based networks based on the To, Cc, and Bcc fields of an email corpus stored on a Windows indexed machine. It allows users to filter based on a time range, an email folder of interest, text in the subject line or body of the message, email features such as size or containing an attachment, and individual email addresses. It generates standard direct reply networks. The analysis of CSS-D is based on data from Jan-Feb of 2007.
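As an illustration of the first configuration described above, the sketch below turns list email headers into direct reply edges. The list address is a placeholder (the real CSS-D address is not given here), the sample headers are invented, and the de-duplication step is deliberately naive; real address de-duplication is considerably harder, as noted above.

```python
# Deriving direct reply edges from To/Cc headers of a "Reply to All" style list.
LIST_ADDRESS = "list@example.org"   # placeholder, not the real list address

emails = [  # each email reduced to the header fields used for the network
    {"from": "Ann <ann@a.org>", "to": ["list@example.org"], "cc": []},
    {"from": "Bob <bob@b.org>", "to": ["ann@a.org"], "cc": ["list@example.org"]},
]

def canonical(address):
    """Naive de-duplication: strip the display name, lower-case the address."""
    return address.split("<")[-1].rstrip(">").strip().lower()

edges = []
for msg in emails:
    sender = canonical(msg["from"])
    for recipient in map(canonical, msg["to"] + msg["cc"]):
        if recipient not in (LIST_ADDRESS, sender):
            edges.append((sender, recipient))   # sender replied to recipient

print(edges)   # [('bob@b.org', 'ann@a.org')]
```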
6.2 Identifying Important People and Social Roles at CSS-D

In an online community, users contribute in different patterns and styles. In other words, community members fill different social roles. Understanding the composition of social roles within a community can provide many insights that make for more effective community managers. Unfortunately, simple activity and participation metrics are unable to capture the different types of contributions in discussion forums. In contrast, social network analysis provides metrics that can be used to automatically identify those who fill unique social roles and track their prevalence over time. This can help community managers:

• Identify high-value contributors of different types: Which community members are the most important question answerers or question starters? Who connects many other users together? Answering these questions can help community managers to know who to thank (and for what) and how to support individuals' needs.
• Determine if a community has the right mix of people: Is this community attracting enough Question Answerers? Are there enough Connectors to hold the community together? Is discussion crowding out Q&A? Is a discussion space dissolving into Q&A? Knowing the answers to these questions can help community managers know who to recruit or encourage more, as well as what policies may be needed.
• Recognize changes and vulnerabilities in the social space: How has the community composition changed as it has grown? What effect is a certain prominent member leaving the community going to have? Which members are currently irreplaceable in the type of work they do? What is the effect of a policy change or change in settings on the community dynamics (e.g., changing the default Reply To behavior to send to the Sender versus the entire list)? Answering these questions can help community managers prepare for change, understand the effects of prior decisions and events, and cultivate important relationships.

This section shows how to identify important individuals and social roles within the CSS-D community by using NodeXL's subgraph images (i.e., ego-networks of CSS-D members) and creating a composite metric that helps identify the 2 most important social roles within Q&A communities like CSS-D: Answer People and Discussion People. This metric makes possible visualizations that show the relationships between these individuals, as will be shown. The first step in identifying important contributors to the CSS-D email list is to remove the overwhelming effects of the email list address by removing it from the graph. In NodeXL this is accomplished easily by choosing "Skip" in the Visibility column, which assures that the list email address will not be included in future analysis, such as the calculation of graph metrics, or visualizations where it would just clutter up the graph. The next step is to create ego-networks of each contributor, which are called 1.5 Subgraph Images in NodeXL. In the examples provided in this section we use the Harel-Koren Fast Multiscale layout to automatically position the vertices in a meaningful organization [14]. NodeXL stores subgraph images of desired size in the spreadsheet itself or in a separate folder where they can be browsed. Once created,
Fig. 3. NodeXL Subgraph Images (1.5 Degree; vertex and incident edges are red/lighter) for 6 CSS-D contributors that fill 3 different social roles within the CSS-D community
analysts can use Excel's built-in features to sort vertices based on graph metrics such as In-Degree (who receives messages from the most people) and Out-Degree (who sends messages to the most people) to bring differently connected individuals to the top. Sorting by centrality measures like Eigenvector reveals the core members of the community because they are active participants and talk to other active participants. Scanning through the Subgraph Images of CSS-D contributors shows the different social roles that exist within the email list community. Fig. 3 shows examples of 3 types of contributors (Question People, Answer People, and Discussion Starters) along with some of the metrics that could be used to identify them. Question people post a question and receive a reply from one or two individuals who are likely to be Answer People. Answer People mostly send messages (arrows point toward other vertices) to individuals who are not well connected themselves [15]. Discussion Starters mostly receive messages (arrows point toward them), often from people who are well-connected to each other. While the Fig. 3 graphs help identify the different types of social roles, metrics can also be used to classify individuals automatically. Question People are easy to detect because of their low degree. To identify people along the Answer Person / Discussion Starter spectrum we create a single Answer Person score by multiplying the percent out-degree by the inverse of the clustering coefficient (defined as the percent of neighbors who are connected). Those who score high are Answer People because they reply to others more than they are replied to and those they reply to are primarily isolates (i.e., question people). Those who score low are Discussion Starters since they are replied to often and by others who are well-connected. We only apply this metric to those with an out-degree + in-degree of 15 or higher to focus on active members. In Fig. 4 those with high Answer Person scores are darker disks, those with low scores are lighter disks, and those with a low degree (mostly Question People) are circles that are not filled in.
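Below is a rough sketch of the Answer Person score just described, assuming the reply network is available as a networkx DiGraph. The paper computes these metrics with NodeXL inside Excel; here the "inverse" of the clustering coefficient is taken as (1 − clustering coefficient), which is an interpretation rather than a detail spelled out in the text.

```python
# Composite Answer Person score: percent out-degree x (1 - clustering coefficient),
# applied only to vertices with total degree of at least 15.
import networkx as nx

def answer_person_scores(reply_graph: nx.DiGraph, min_degree: int = 15):
    clustering = nx.clustering(reply_graph.to_undirected())
    scores = {}
    for v in reply_graph.nodes():
        out_deg = reply_graph.out_degree(v)
        in_deg = reply_graph.in_degree(v)
        total = out_deg + in_deg
        if total < min_degree:
            continue  # low-degree vertices are mostly Question People
        pct_out = out_deg / total
        scores[v] = pct_out * (1.0 - clustering.get(v, 0.0))
    # High values suggest Answer People, low values Discussion Starters.
    return scores
```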
Fig. 4. Two NodeXL graphs of the CSS-D email list network for Jan-Feb of 2007. Answer People (darker) and Discussion Starters (lighter) are identified by the calculated Answer Person Score. Circle vertices (filtered out of the graph on the right) have a total degree of fewer than 15 and mostly consist of Question People. Vertex size is mapped to Eigenvector Centrality. Edge weight (i.e., number of messages sent) is mapped to both edge Size and Opacity, applying a logarithmic scale and ignoring outliers.
The specific social roles and their prevalence within a particular community will depend on the nature of that community. Since the CSS-D community is primarily a Q&A community, it consists of mostly Question Askers, a handful of prominent Answer People, and a small number of Discussion Starters. Other more discussion-based communities would have many more Discussion Starters as well as other social roles such as Flame Warriors, Commentators, and Connectors. Tracking the ratio of people that play different social roles can be a good way to assure that a community is healthy. For example, if the CSS-D community had too few Answer People or an influx of many Question People it could not function as effectively. Viewing the entire reply network for the CSS-D email list (left graph in Fig. 4) provides some general insights about the composition of its population, although the size of the network makes it challenging to interpret without filtering. Larger nodes have a higher Eigenvector Centrality suggesting they are connected to many people and others who are well-connected. The binned layout is used to identify isolates along the bottom, of which there are many since the email list address itself was removed from the network. Isolates represent those who posted to the list and didn’t receive a response (e.g., they posted an announcement) or in some cases those who replied to the list without copying in the address of the person who they were replying to. Overall the entire reply network shows many individuals connected primarily through a handful of central question answerers and a small, but stable core group of members that interact with one another regularly. To better focus in on the core members of the community and their relationship to one another, analysts can filter out vertices with a total degree of less than 15. The
graph on the right side of Fig. 4 shows the resulting network after manually positioning the vertices. The edge weights, represented in the edge width and opacity, provide a good sense of who interacted with whom during the 2-month time period and who is thus likely to know each other and perhaps have similar interests. Note that even among these core members, Discussion Starters (light vertices) rarely reply to other Discussion Starters. Also notice that the largest vertex, while categorized as an Answer Person, receives many messages from the core members. This suggests that he plays multiple important roles within the community. In fact, if he were removed from the network there would be considerably fewer connections between the core members. Community administrators should make sure this individual is adequately appreciated and encouraged to remain in the community since his removal would seriously disrupt the community.
7 Finding a New Community Admin for the ABC-D Email List

Administrators of online conversations play pivotal roles in maintaining social order, encouraging participation, and making communities feel like home [3]. They are typically among the most active members of a community [16] and can function better when they are known and respected by the members of the community. Because of the importance of administrators, when one leaves or steps down it has the potential to disrupt the community. In this section we look at how network analysis can help in identifying a potential replacement for an administrator that is going to step down. Data for this analysis comes from an email list we will call ABC-D, based around a specific profession. It is a classic example of a community of practice that spans multiple institutions. Unlike CSS-D, ABC-D encourages in-depth discussion about the community's domain and is not primarily about questions and answers.
Fig. 5. NodeXL maps of ABC-D’s email list direct reply network, with the current Admin (left) and without the current Admin (right). The most central members are labeled including Admin in the left side image. Larger vertices have a higher eigenvector centrality and darker vertices have a higher betweenness centrality.
The network is a direct reply network. An arrow pointing from person A to person B indicates that person A replied to a message of person B. Data from ABC-D was collected for a two-week period by Chad Doran, a graduate student at the University of Maryland College of Information Studies, who also came up with the administrator replacement scenario. A more complete analysis would include a longer time period (e.g., 2 months) and include edge weights, but the current dataset is sufficient to illustrate the key idea. All data, including the name of the community, has been anonymized to respect the privacy of the group members. The graph on the left side of Fig. 5 shows the entire reply network with a few key individuals (as identified by graph metrics) labeled. The graph on the right is the same graph after removing the Admin and recalculating the graph metrics. The networks show that individuals are almost all connected in one large component, but the degree of any one individual is relatively low (e.g., the maximum total degree is 14 and the average total degree for an individual is about 3). The result is a fairly spread-out network. Graph metrics were calculated and used to identify the most central individuals, who presumably are in the best position to serve as an administrator replacement. Darker vertices have a high betweenness centrality, suggesting that they are important at connecting different vertices and integrating the network as a whole. Larger vertices have a higher eigenvector centrality, which in this case suggests the person is well connected to others who are themselves well connected. As expected, the current administrator (labeled Admin in the left-side graph) has the highest betweenness centrality and a high eigenvector centrality. Interestingly, the individual with the highest eigenvector centrality (User32) is not directly connected to the Admin; in the time period of our data collection neither of them replied to the other. Another important individual is User11, who has a high betweenness centrality because he was the only link to several vertices, but a relatively low eigenvector centrality since most of his connections were with individuals who rarely posted. All of the labeled individuals scored high on the metrics and may be good candidates to replace the administrator. Of course other characteristics not captured in the network structure, such as their willingness to serve, their friendliness, and their experience, would also be key determinants. A key question is how the community would change if the administrator were removed from the network. The key network metrics, betweenness centrality and eigenvector centrality, will change for the remaining individuals because they are dependent on the network properties of other vertices. Thus, looking at the graph without the admin (on the right-hand side of Fig. 5) can help more accurately assess individuals' potential as a replacement. It also helps analysts to see how the network as a whole may be impacted. For example, removing the admin changes the average Closeness centrality from 3.2 to 3.5, suggesting that people will not be as directly connected with others once the admin is gone. Analysts may also notice certain subgroups within the network that lose an important connection to other subgroups, such as the large group at the top of the graph. These differences can be more easily noticed when the location of the vertices has been fixed in both graphs as in Fig. 5. Looking at the right-hand graph in Fig.
5 confirms that the initial individuals identified as possible replacements are good candidates. It also suggests that if certain candidates were chosen, such as User32, there may be subgroups of the community that would not be as well connected (e.g., the group of nodes at the top of the graph).
The fear is that these individuals may feel alienated by a new administrator they either don’t know or don’t converse with often. The graph also points out individuals who may be able to keep them involved: User11 and User22. Armed with this information the outgoing administrator may be wise to recommend that User32 and User22 jointly serve the role of administrator, or that whoever is chosen should foster a relationship with these individuals to link to those who may feel alienated.
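A hedged sketch of this kind of "what if the admin leaves" analysis follows: recompute betweenness and eigenvector centrality on the reply network with the current admin removed, then list the most central remaining members. It uses networkx rather than NodeXL, and ranking by betweenness alone is a simplification of the judgment described above.

```python
# Recomputing centralities after removing the current administrator.
import networkx as nx

def candidates_without_admin(reply_graph: nx.DiGraph, admin: str, top_n: int = 5):
    remaining = reply_graph.copy()
    remaining.remove_node(admin)                      # simulate the admin stepping down
    betweenness = nx.betweenness_centrality(remaining)
    eigenvector = nx.eigenvector_centrality(remaining.to_undirected(), max_iter=1000)
    ranked = sorted(remaining.nodes(), key=lambda v: betweenness[v], reverse=True)
    # Return both metrics so low-eigenvector bridges (like User11) stand out.
    return [(v, betweenness[v], eigenvector[v]) for v in ranked[:top_n]]
```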
8 Understanding Groups at Ravelry

Ravelry (http://www.ravelry.com) is a thriving online community for anyone passionate about yarn. As of January 2010, there were over 600,000 knitters and crocheters registered on the site. Users organize their projects, yarn stashes, and needles; share and discover designs, ideas, and techniques; and form friendships through discussions and exploration of shared interests. In this section, the Ravelry community administrator works with data on the top 20 posters to 3 discussion forums created for different groups. The data and initial network analysis for this section was developed by Rachel Collins, a graduate student at Maryland's iSchool. Imagine a community manager is assigned 3 group discussion forums to monitor and help develop. They are highly active groups, making it hard to keep up with all the messages and get a better sense of how the most important community members relate to one another, as well as how the groups differ. This understanding helps the community manager to recommend the best group for a newcomer to join, as well as identify individuals with certain expertise or social relations. The 3 groups (whose names have been changed for privacy reasons) include one common-interest group (Apathetic, Funloving Crafters [AFC]), one Meet-Up (Chicago Fiber Arts), and one Knit-Along (Project Needy). They are 3 of hundreds of similar groups. Discussion forums for each group serve as their central hubs. Individuals can participate in as many forum groups as they desire. The data includes project output, discussion board usage, blog activity, and community roles for the top 20 posters in each group. This lets the community admin relate many different activities together in a single analysis, focusing attention on the most active members who are typically the most important.

Fig. 6 shows a bi-modal affiliation network of the 3 forums/groups (shown in text boxes) connected to individuals who have posted to them. Edge thickness is based on the number of forum posts (using a logarithmic mapping). The thinnest lines connect users to groups that they are members of, but have not yet posted to. Other visual properties are used to convey individuals' level of activity in other parts of the community as described in the Fig. 6 caption. The graph identifies important individuals, such as those who post to multiple groups or have certain color/size/shape combinations. It also enables comparison of the three groups. For example, the graph makes clear that the AFC forum is very active, includes many bloggers, and includes relatively few people who complete a large number of projects (perhaps explaining the "Apathetic" in their title). In contrast, the Project Needy group includes many highly productive members, many of whom are both administrators and bloggers. Meanwhile, the Chicago Fiber Arts group has fewer bloggers and less project activity. Administrators could use a graph like Fig. 6 to identify potential candidates for Volunteer Editors or identify clusters of boundary spanners with which to form new
Fig. 6. Bi-modal affiliation network connecting 3 Ravelry groups (i.e., forums AFC, Chicago Fiber Arts & Project Needy) to contributors represented as circles. Edge width is based on number of posts (with logarithmic mapping). Vertex size is based on number of completed Ravelry projects. Maroon/lighter vertices have a blog and solid circles are either Community Moderators or Volunteer Editors. The network helps identify important boundary spanners (e.g., those connected to multiple groups), as well as compare groups.
groups because of shared interests. Providing graphs like this one to the groups themselves can also prompt self-reflection and potentially foster new connections. They can also be used to better understand how the activities on the site relate to one another, although use of statistics may be needed to more systematically validate initial claims. For example, Fig. 6 shows that location-based groups have a lower percentage of active members who blog and people who complete many projects seem to cluster into project groups. Finally, simplified versions of this graph may help newcomers to Ravelry get a sense of which group(s) they may want to join, as well as identify some of the prominent members they may want to follow or meet.
9 Conclusion and Future Work

Network analysis and visual presentation of online communities that use threaded conversations can produce valuable insights. In this article we have defined threaded conversation and characterized the different types of networks created by it: the directed, weighted direct-reply and top-level-reply networks; the undirected, weighted affiliation network connecting threads (or forums) to the individuals who posted to them; and the undirected, weighted unimodal networks derived from the affiliation network, including the user-to-user and thread-to-thread networks.
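As a rough illustration of how these network types can be derived from raw threaded messages, here is a hedged sketch using networkx bipartite projections; the message records and field layout are hypothetical, and the projection weights simply count shared threads or shared posters.

```python
# Sketch of deriving the network types listed above from threaded messages.
import networkx as nx
from networkx.algorithms import bipartite

messages = [  # (author, thread, replied_to_author or None) -- toy records
    ("ann", "t1", None), ("ben", "t1", "ann"),
    ("ann", "t2", None), ("cat", "t2", "ann"), ("ben", "t2", "cat"),
]

# Directed, weighted reply network.
reply = nx.DiGraph()
for author, _thread, target in messages:
    if target is not None:
        w = reply[author][target]["weight"] + 1 if reply.has_edge(author, target) else 1
        reply.add_edge(author, target, weight=w)

# Undirected, weighted affiliation (user-thread) network.
affiliation = nx.Graph()
for author, thread, _ in messages:
    if affiliation.has_edge(author, thread):
        affiliation[author][thread]["weight"] += 1
    else:
        affiliation.add_edge(author, thread, weight=1)

# Unimodal projections: user-to-user (shared threads) and thread-to-thread (shared users).
users = {a for a, _, _ in messages}
threads = {t for _, t, _ in messages}
user_net = bipartite.weighted_projected_graph(affiliation, users)
thread_net = bipartite.weighted_projected_graph(affiliation, threads)

# Metrics of the kind discussed below, e.g. for choosing a replacement administrator.
print(nx.betweenness_centrality(user_net))
print(nx.eigenvector_centrality(user_net, max_iter=500))
```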
We have also demonstrated how new analysis tools such as NodeXL can be used by community administrators to gain actionable insights about the communities they serve. The analysis of the CSS-D technical support community showed how to identify important social roles and the individuals who fill them, including Answer People, Discussion Starters, and Questioners. The analysis of the ABC-D discussion-based email list showed how to identify good candidates to replace a community administrator based on network metrics such as Betweenness and Eigenvector Centrality. And the analysis of Ravelry showed how to use a bi-modal affiliation network to understand how forum-based groups are connected, identify important boundary spanners, and relate non-discussion network metrics (e.g., blog activity; project activity) to group discussion activity. We hope these mini case studies provide inspiration for other focused network analyses aimed at gaining actionable insights about online interaction.

Research on threaded conversation communities has a long history as outlined in Section 3, yet there remain many interesting research questions to explore. As threaded conversations become embedded within more complex social spaces with multiple interaction technologies, it is increasingly important to understand how they all interact. For example, Hansen found that technical and patient support groups benefit from combining a threaded conversation (i.e., email list) with a more permanent wiki repository [12]. The Ravelry example showed strategies that have not yet been widely used by the research community to understand how network position relates to use of other tools (i.e., blogs) or activities (i.e., projects). Network-based research is also needed to better understand the determinants of successful online communities. For example, we do not know what mixtures of Answer People, Discussion Starters, and Questioners lead to better outcomes or what overall network statistics (e.g., clustering coefficient) are correlated with success. From a design perspective, there are many fascinating opportunities to enhance the threaded conversation model as evidenced by Google Wave and other prototype systems. Many opportunities remain to advance techniques to visualize online conversation spaces [17] and threaded conversation networks as demonstrated in this article.
References 1. Resnick, P., Hansen, D., Riedl, J., Terveen, L., Ackerman, M.: Beyond Threaded Conversation. In: CHI 2005 Extended Abstracts on Human Factors in Computing Systems, pp. 2138–2139. ACM, New York (2005) 2. Smith, M., Kollock, P. (eds.): Communities in Cyberspace. Routledge, London (1999) 3. Preece, J.: Online Communities: Designing Usability and Supporting Sociability. John Wiley & Sons, Inc., New York (2000) 4. Kim, A.J.: Community Building on the Web: Secret Strategies for Successful Online Communities. Peachpit Press, Berkeley (2000) 5. Powazek, D.: Designing for Community. Waite Group Press, Corte Madera (2001) 6. Nonnecke, B., Preece, J.: Lurker Demographics: Counting the Silent. In: Proceedings of the SIG-CHI Conference on Human Factors in Computing Systems, pp. 73–80. ACM, New York (2000)
7. Hansen, D.L.: Overhearing the Crowd: an Empirical Examination of Conversation Reuse in a Technical Support Community. In: Proceedings of the Fourth International Conference on Communities and Technologies, pp. 155–164. ACM, New York (2009) 8. Garton, L., Haythornthwaite, C., Wellman, B.: Studying Online Social Networks. J. Comput-Mediat Comm. 3(1) (1997) 9. Wellman, B., Salaff, J., Dimitrova, D., Garton, L., Gulia, M., Haythornthwaite, C.: Computer Networks as Social Networks: Collaborative Work, Telework, and Virtual Community. Annual Review of Sociology 22, 213–238 (1996) 10. Smith, M.A., Shneiderman, B., Milic-Frayling, N., Rodrigues, E.M., Barash, V., Dunne, C., Capone, T., Perer, A., Gleave, E.: Analyzing (social media) networks with NodeXL. In: Proceedings of the Fourth International Conference on Communities and Technologies, pp. 255–264. ACM, New York (2009) 11. Hansen, D.L., Rotman, D., Bonsignore, E., Milic-Frayling, N., Mendes Rodrigues, E., Smith, M., Shneiderman, B., Capone, T.: Do You Know the Way to SNA? A Process Model for Analyzing and Visualizing Social Media Data. HCIL-2009-17 Tech Report (2009) 12. Hansen, D.: Knowledge Sharing, Maintenance, and Use in Online Support Communities. Unpublished Dissertation, University of Michigan (2007) 13. Bilgic, M., Licamele, L., Getoor, L., Shneiderman, B.: D-Dupe: An Interactive Tool for Entity Resolution in Social Networks. In: Proceedings of IEEE Symposium on Visual Analytics Science and Technology. IEEE, Los Alamitos (2006) 14. Harel, D., Koren, Y.: A Fast Multi-scale Method for Drawing Large Graphs. In: Marks, J. (ed.) GD 2000. LNCS, vol. 1984, pp. 183–196. Springer, Heidelberg (2001) 15. Welser, H.T., Gleave, H., Fisher, D., Smith, M.: Visualizing the signatures of social roles in online discussion groups. Journal of Social Structure 8(2) (2007) 16. Butler, B., Sproull, L., Kiesler, S., Kraut, R.E.: Community Effort in Online Groups: Who Does the Work and Why? In: Weisband, S., Atwater, L. (eds.) Leadership at a Distance. Lawrence Erlbaum Associates Inc., Mahwah (2005) 17. Turner, T.C., Smith, M.A., Fisher, D., Welser, H.T.: Picturing Usenet: Mapping Computer-Mediated Collective Action. J. Comput-Mediat Comm. 10(4) (2005)
A Spatio-temporal Framework for Related Topic Search in Micro-Blogging Shuangyong Song, Qiudan Li, and Nan Zheng Laboratory of Complex Systems and Intelligence Science Institute of Automation, Chinese Academy of Sciences 95 Zhongguancun East Road, Beijing, China 100190 {shuangyong.song,qiudan.li,nan.zheng}@ia.ac.cn
Abstract. With the rapid development of Web 2.0, micro-blogging services such as Twitter are increasingly becoming an important source of up-to-date topics about what is happening in the world. By analyzing topic trend sequences and identifying relations among topics, we can gain insights into topic associations and thereby provide better services for micro-bloggers. This paper proposes a novel framework that mines the associations among topic trends in Twitter by considering both temporal and location information. The framework consists of the extraction of topics' spatio-temporal information and the calculation of the similarity among topics. The experimental results show that our method can find related topics effectively and accurately.
1 Introduction

Micro-blogging is a Web 2.0 technology and a new form of blogging. Compared to regular blogging, micro-blogging enables an even faster mode of communication [10]. It allows users to publish short message updates through different channels, including the Web, SMS, e-mail or instant messaging clients. Every update is usually limited to 140-200 characters; sometimes images and audio are added to enrich its content. Users of micro-blogging services are dubbed micro-bloggers, and the short message updates published by the users are dubbed micro-blogs. Unlike other social network services, in micro-blogging a user A is allowed to "follow" other users without seeking any permission, and the updates of the followed users are sent to A automatically in real time. If A is following B, B is called A's friend, and A is called B's follower. Thus friendship can be either reciprocated or one-way.
During the last few years, micro-blogging has become one of the most popular Web 2.0 services, and it is still growing rapidly. Take Twitter, a popular micro-blogging website, as an example: Nielsen.com reports that the total number of Twitter users increased from 530,000 in Sep. 2007 to 2,360,000 in Sep. 2009 [1]. A great deal of media attention has been focused on Twitter. On April 17, 2009, the day Oprah Winfrey joined Twitter by sending a tweet from her Friday TV show, the share of US-based visits to the Twitter site increased by 24% and some 1.2 million new users signed up for Twitter on that day alone [9]. In May 2009, the White House began posting short messages on Twitter [11].
http://www.twitter.com
Figure 1 shows a snapshot of Twitter's user interface. Like the interfaces of regular blogging services, the user's registration information and her posts are shown on her homepage, where the latest tweet she published is displayed in a bigger font than the previous ones. In addition, some other application modules are shown in this interface: the numbers of friends, followers, related lists and published tweets; the users she is following; and whether she is followed by the user who is currently looking at her homepage. These differences between micro-blogging and regular blogging arise from the novel interaction mode between users in micro-blogging: 'follow'. Having changed its official question from "What are You Doing?" to "What's Happening?", Twitter has become an important source of up-to-date topics about what is happening in the world.
Fig. 1. An example of Twitter’s user interface
Fig. 2. Content areas of Twitter
2 http://www.twitter.com/songshuangyong
Fig. 3. The geographical distribution of users in Twitter
Fig. 4. The trends of ‘Toyota’, ‘IBM’ and ‘Microsoft’ in Twitter
We analyze the content of tweets in Twitter with statistical data given in [19], which is shown in figure 2. The results are interesting: Conversational tweets (such as questions or polls) and Pointless Babble (tweets like "I am eating a sandwich now") account for more than 3/4 of all content. This result shows that Twitter is mainly used to note what is happening around the users and to communicate with other people, and that most of the topics on Twitter are about the users' daily life, such as food, music and electronic products.
The geographical distribution of users in Twitter is given in figure 3, which was downloaded from Alexa.com. From figure 3, we can see that more than one third of Twitter users are from the United States, where Twitter was launched. Twitter is most popular in the US, Europe and Asia (mainly India and Japan). Trends of three topics, 'Toyota', 'IBM' and 'Microsoft', are given in figure 4 to describe the temporal information of topics in Twitter, which we will use to calculate the similarity between topics. We can see that 'Toyota' and 'Microsoft' always got more attention than 'IBM' from Feb. 7th, 2010 to Mar. 8th, 2010. On Feb. 16th, 2010, the tweets about 'Microsoft' accounted for 0.3% of all tweets. This is because Microsoft publicly previewed Windows Phone 7 for the first time on that day.
As shown, real-time topics have significant temporal and location aspects, which are not taken into consideration by most micro-blog search engines. Through the spatio-temporal information, we can easily analyze the periods of time during which topics attracted attention and the regional distributions of the micro-bloggers who have posted micro-blogs on those topics. Since every topic can be defined and described by its temporal and spatial information [6], the relationships among topics can also be detected by calculating the similarity between any two of them through their spatio-temporal information.
In this paper, we propose a novel framework to identify correlations among topics with their temporal and location components. We present a similarity-based searching and pattern matching algorithm that identifies spatio-temporal series data with similar temporal dynamics and location statistics (the location information of the micro-bloggers who have posted micro-blogs) in a specific period of time. By analyzing topic trend sequences and identifying relations among topics, we can gain insights into topic associations and thereby provide better services for micro-bloggers.
3 http://www.alexa.com/siteinfo/twitter.com#demographics
4 http://trendistic.com/toyota/ibm/microsoft/_30-days
The rest of this paper is organized as follows: In section 2, we provide a brief review of related work. In section 3, we introduce our method. In section 4, we analyze our experimental results. Finally, we conclude and discuss our plans for future work in section 5.
2 Related Work

2.1 Related Topic Detection

Related topic detection has attracted a lot of attention from researchers with the growth of online search, and the related techniques have been explored in previous work such as query expansion and keyword suggestion. In [3], a new framework was proposed for producing better semantically related search suggestions with complex semantic relatedness, using real-time Wikipedia-based social network structures. A frequently used approach, co-occurrence, was tested on a dataset collected from the eBay website (www.ebay.com) in [14] to recommend to sellers relevant and informative terms for title expansion. In addition, three particular features, including concept term, description relevance and chance-to-be-viewed, were taken into account in the application scenario. In [16], Ribeiro-Neto et al. proposed several strategies and a term expansion method to seek the relationship between advertisements and web pages. Recently, utilizing search engines to help find ontology alignments has become another type of method to weight approximate topic matches [20]. For example, Gligorov et al. [20] proposed a method based on search results from Google to find the alignment between topics.
Another application of related topic detection is query expansion, which is applied when users expect additional keywords to retrieve relevant documents and filter out irrelevant ones. In [17], Buckley et al. extracted terms from known relevant documents or the top retrieved documents to add terms to the original query. In [18], Xu and Croft proposed local context analysis, which combined the advantages of global and local expansion techniques. In [7], Mitra et al. retrieved an initial set of possibly relevant documents and discovered correlated features to expand the query. In [8], Qiu and Frei expanded queries by adding those terms that are most similar to the concept of the query, rather than selecting terms that were similar to the query terms. In [5], Cui et al. tried to find co-occurrences with the seed term in query logs.

2.2 Spatio-temporal Model

Mining topics from web-based text data and analyzing their spatio-temporal patterns have applications in multiple domains. In [12], Lu et al. proposed a novel spatio-temporal model for collaborative filtering applications, which was based on low-rank matrix factorization and used a spatio-temporal filtering approach to estimate user and item factors. In [6], Li et al. proposed a probabilistic model to detect retrospective news events by explaining the generation of the "four Ws" - who (persons), when (time), where (locations) and what (keywords) - from each news article. However, their work considered time and location as independent variables, and aimed at discovering the
recurring peaks of events rather than extracting the spatio-temporal patterns of themes. Model construction for mining spatio-temporal theme patterns from weblog data was also investigated in [13]. The authors used a probabilistic approach to model subtopic themes and spatio-temporal theme patterns simultaneously. In [15], Syeda-Mahmood et al. tried to automatically characterize the spatio-temporal patterns in cardiac echo videos for disease discrimination using prior knowledge of the region layout.
Different from the previous work, we apply the spatio-temporal model to detect related topics in micro-blogging. We describe topics with their temporal and spatial information, and detect the relationships among them by calculating the similarity between any two of them through their spatio-temporal information.
3 Our Approach

Figure 5 shows the system architecture of our correlated topic search framework. 'Tweets' are the posts broadcast by users in Twitter about the small things happening in their daily life, such as what they are thinking and experiencing, and the 'User information' is filled in by users with their names, regions or other personal information. These two sources are used to generate the location information of topics by computing the region distributions of the users who post on them. The 'Trends data' are Twitter's daily trending topics about what is happening in the world, which are used to generate the temporal information. We detect topics related to the topic queried by a user by comparing their spatio-temporal information. The most important parts of this framework are the extraction and representation of topic information and the similarity calculation among topics.
Fig. 5. Spatio-temporal framework for related topic search
3.1 Data Crawler

We use the Twitter trends API to download statistics of the everyday trending topics on Twitter, where topics are given in the form of popular queries.
http://apiwiki.twitter.com/Twitter-Search-API-Method:-trends-daily
From Jan. 1st to Dec. 31st, 2009, there were 171,735 queries containing 17,619 topics. The most popular topic, 'Red', appeared 2,414 times, while many topics appeared only once. The distribution of the frequencies per topic is shown in figure 6. We use these data to generate the everyday frequency of the chosen topics, which stands for their temporal dynamics. Then we download the tweets dataset and the users' information dataset published by Munmun De Choudhury to generate the location information of each topic. We extract all location names and count their frequencies.
Fig. 6. Distribution of the frequencies per topic
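As an illustration of this aggregation step, the following sketch shows one way the daily topic frequencies and region counts could be tallied once the trend and user records have been downloaded; the record format and example values are our own assumptions, not the actual layout of the dumps.

```python
# Sketch of the aggregation described in Sect. 3.1, applied to records that are
# assumed to be already downloaded and parsed; the field layout is hypothetical.
from collections import Counter, defaultdict

# (date, topic) pairs from the daily-trends dump, and (user_region, topic)
# pairs joined from the tweet and user-information datasets.
trend_records = [("2009-01-01", "Red"), ("2009-01-01", "NBA"), ("2009-01-02", "Red")]
tweet_records = [("Pacific Time (US & Canada)", "Microsoft"), ("London", "Microsoft")]

daily_freq = defaultdict(Counter)   # topic -> {date: count}   (temporal dynamics)
for date, topic in trend_records:
    daily_freq[topic][date] += 1

region_freq = defaultdict(Counter)  # topic -> {region: count} (location statistics)
for region, topic in tweet_records:
    region_freq[topic][region] += 1

print(daily_freq["Red"], region_freq["Microsoft"])
```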
3.2 Topic Information Extraction and Representation

A spatio-temporal topic has temporal and location aspects. The temporal information refers to the frequency behavior of a topic during predefined time periods, while the location information refers to the regional distribution of the micro-bloggers who talked about the topic. Inspired by the idea in [2], we define a topic as the following state series:
$\mathrm{Topic}_k = [t_{k1}, t_{k2}, \ldots, t_{ki}, \ldots, t_{kT}, s_{k(T+1)}, s_{k(T+2)}, \ldots, s_{k(T+j)}, \ldots, s_{k(T+S)}]$   (1)
where t_ki represents the frequency state of Topic_k at timestamp i, and s_k(T+j) represents the frequency state of the micro-bloggers who have posted tweets on Topic_k in region j. T denotes the total number of days in the chosen period of time, and S the number of regions from which tweets on Topic_k have been posted in the same period. In figure 7, the temporal dynamics and regional distributions of the topic 'Microsoft' are transformed into a state series, each dimension of which is an integer between 0 and 5. From the curve, we can see that the temporal frequency of 'Microsoft' was consistently high in the first two months of 2009, and the Pacific Time Zone is the region from which the largest number of tweets about 'Microsoft' was posted in this period. By representing topics as state series as in equation 1 and calculating the similarity among them, we do not take into account a topic's possible co-existence with another topic in the same tweet [2, 4].
http://www.public.asu.edu/~mdechoud/
Fig. 7. State series of ‘Microsoft’
Considering the spatio-temporal similarity of topics enables us to find implicit associations among them even though they may not appear in the same document.

3.3 Similarity Calculation among Topics

We design a similarity measurement, TSαED, to calculate the similarity between two topics by comparing both their temporal dynamics and their regional distributions. Correlated topics have similar spatio-temporal characteristics, and to some extent, our method accounts for topics' possible co-existence with each other in the same micro-blog. The Euclidean distance metric is adopted to compute the similarity between two topics A and B, which have been represented as state series. The final formula is given as follows:
$TS\alpha ED(A, B) = \alpha \sqrt{\sum_{t=1}^{T} (A_t - B_t)^2} + (1 - \alpha) \sqrt{\sum_{s=1}^{S} (A_{T+s} - B_{T+s})^2}$   (2)
where A_t denotes the frequency of topic A on the t-th day and A_(T+s) denotes the number of users in the s-th region who have posted tweets about topic A. The parameter α is used to adjust the relative weight of the sequence (temporal) similarity and the location similarity; α is finally set to 0.59, chosen by a cyclic iterative method.
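To make Eq. (2) concrete, here is a small sketch of the similarity computation as reconstructed above; the discretization of raw counts into integer states 0-5 mirrors the description of Fig. 7, but the equal-width binning and the toy numbers are our own assumptions.

```python
# Sketch of the TSαED similarity of Eq. (2). The binning of raw frequencies
# into integer states 0-5 is an assumed scheme, not the paper's exact mapping.
import math

def to_states(values, levels=6):
    """Map raw frequencies to integer states 0..levels-1 by equal-width bins."""
    hi = max(values) or 1
    return [min(levels - 1, int(v / hi * levels)) for v in values]

def ts_alpha_ed(a, b, T, alpha=0.59):
    """a, b: state series of length T + S (temporal part first, then spatial)."""
    temporal = math.sqrt(sum((a[t] - b[t]) ** 2 for t in range(T)))
    spatial = math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(T, len(a))))
    return alpha * temporal + (1 - alpha) * spatial

# Toy example: 5 days of frequencies plus 2 regions for two topics.
topic_a = to_states([10, 40, 35, 5, 0]) + to_states([120, 30])
topic_b = to_states([8, 45, 30, 4, 1]) + to_states([100, 20])
print(ts_alpha_ed(topic_a, topic_b, T=5))   # smaller value = more related
```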
4 Experiments and Analysis

We give a visualized example to compare our experimental results with real-life events. When a user types "Microsoft", our system can detect and show topics with similar trends, such as "Tweetdeck" illustrated in figure 8, to help people better understand the topics they are interested in. Here, 'Microsoft' and 'Tweetdeck' have both similar temporal dynamics and similar regional distributions in January and February 2009. Through our survey, we found that in early 2009 Microsoft released a test version of Windows 7, but users complained that Tweetdeck (an application) could not run properly on Windows 7, which triggered a hot discussion. We continued to follow these
two topics, and found that in September Microsoft released a concept plan for an application called 'Next-Generation Newspaper', which is somewhat similar to Tweetdeck. This event also caused hot discussions, leading to another period of high similarity between these two topics from mid-September to early October. Due to space constraints, we do not give the corresponding curves.
Fig. 8. State series of 'Tweetdeck'

Table 1. Topics related to 'Kobe' in 2009

Kobe          | The Possible Cause
NBA           | National Basketball Association (NBA), USA.
LeBron        | James's name, Kobe's opponent.
Orlando       | Refers to the Magic, one of the teams in the NBA.
Lakers        | Kobe's team.
Conan         | A TV anchor; Kobe attended his TV show.
Memorial Day  | There were many tweets about Kobe losing games around Memorial Day, May 25, 2009.
Father's Day  | The Lakers won the championship on the same day, a perfect gift for Kobe.
Cavs          | LeBron's team, Kobe's opponent.
Kris Allen    | 'American Idol' Kris Allen sang the national anthem at the Lakers game on Sunday, June 7, 2009.
Letterman     | A word to describe a good player.
Another interesting example we find in Twitter topics concerns Kobe, a famous basketball player in the National Basketball Association (NBA), USA. After applying our spatio-temporal method to the topics which appeared more than 100 times during 2009, the 10 topics most relevant to 'Kobe' are discovered; they are shown in table 1. It can be seen from table 1 that most of these topics have clear links with Kobe, and the others have implicit associations with him, like "Father's Day" and "Kris Allen". These topics include his team, his opponents, other teams in the NBA and the TV anchor whose program Kobe participated in. For users who are interested in Kobe, these related topics can not only increase their understanding of Kobe, but also reveal some interesting anecdotes about him. Accordingly, in table 2, we list the top ten topics related to 'Kobe' in every month of 2009. In table 2, the number after each month is the frequency with which 'Kobe' turned up in that month.
Table 2. Topics related to 'Kobe' in every month, 2009

January (5):    Hulu, USA, Bing, Michael Jackson, Cowboys, Spring, MySpace, LeBron, Iran, NBA
February (18):  NBA, World Series, Jimmy Fallon, Bing, Yankees, Inauguration, Beyonce, Megan Fox, White House, Follow Friday
March (7):      Coraline, Google Voice, LeBron, Santa, #nowplaying, Superbowl, Arkham Asylum, Vampire Diaries, #teaparty, Slumdog Millionaire
April (18):     Lakers, BBQ, NBA, Follow Friday, Rihanna, Earth Hour, Church, New Moon, CES, #MM
May (117):      Lakers, Google Wave, LeBron, NBA, California, MTV, BBQ, Dodgers, Orlando, Texas
June (121):     NBA, True Blood, LeBron, Lakers, Lost, Michael Vick, Orlando, Easter, CES, Fridays
July (7):       NBA, Cavs, Megan Fox, Lakers, US Open, Kris Allen, Santa, ODST, Zombieland, SXSW
August (0):     (no related record)
September (0):  (no related record)
October (4):    Celtics, Eminem, White House, Easter, LeBron, Thanksgiving, Spring, Lost, Drake, Black Friday
November (8):   Miami, Celtics, England, Chris Brown, Conan, BBQ, Harry Potter, District 9, Summer, Slumdog Millionaire
December (33):  Italy, Lakers, Florida, LeBron, Paris, Cavs, Michael Jackson, BBQ, Halloween, Swine Flu
Notably, there is no related record for Kobe in August or September. Compared with the results in table 1, the accuracy of the related topics found in table 2 is lower. Comparing the results of different months with each other, we can also see that in May and June the accuracy is higher than in other months. We can therefore draw two conclusions: 1. the longer the period over which we compare two topics with our spatio-temporal model, the higher the accuracy we can get; 2. we can also get high accuracy during the burst period of those topics. The above two examples give a simple description of our experimental results, through which we can see that our proposed method can effectively detect the correlations among topics in micro-blogging, and through these associations, some interesting web content can be presented to micro-bloggers.
5 Conclusions and Future Work

In this paper we formulate the task of mining potential correlations among topics in micro-blogs as a problem of detecting similar spatio-temporal state series. From
the experimental results, we can see that our similarity-based searching method can effectively discover potential correlations among topics in micro-blogs. This similarity-based method can also be applied to query expansion, topic recommendation, time series clustering, etc. According to the analysis in section 4, we plan to add burst detection prior to the detection of related topics. Future work also includes building the GroupTopic model in Twitter to mine groups of topics which have strong correlations.
Acknowledgments. This research is partly supported by the projects 863 (No. 2006AA010106), 973 (No. 2007CB311007), NSFC (No. 60703085).
References 1. Lenhart, A., Fox, S.: Twitter and status updating. Pew Internet & American Life Project (February 2009) 2. Platakis, M., Kotsakos, D., Gunopulos, D.: Searching for Events in the Blogosphere. In: WWW 2009, pp. 1225–1226 (2009) 3. Shieh, J.R., Hsieh, Y.H., Yeh, Y.T., Su, T.C., Lin, C.Y., Wu, J.L.: Building term suggestion relational graphs from collective intelligence. In: WWW 2009, pp. 1091–1092 (2009) 4. Platakis, M., Kotsakos, D., Gunopulos, D.: Discovering Hot Topics in the Blogosphere. In: EUREKA 2008, pp. 122–132 (2008) 5. Cui, H., Wen, J., Nie, J., Ma, W.: Probabilistic Query Expansion using Query Logs. In: Proceedings of the 11th International Conference on World Wide Web, pp. 325–332. ACM, New York (2002) 6. Li, Z., Wang, B., Li, M., Ma, W.-Y.: A probabilistic model for retrospective news event detection. In: Proceedings of SIGIR 2005, pp. 106–113 (2005) 7. Mitra, M., Singhal, A., Buckley, C.: Improving Automatic Query Expansion. In: Proceedings of the 21th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1998) 8. Qiu, Y., Frei, H.-P.: Concept based Query Expansion. In: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1993) 9. The Oprah Winfrey Effect on Twitter, April 21 (2009), http://www.labnol.org/internet/oprah-winfrey-effect-ontwitter/8274/ 10. Java, A., Song, X., Finin, T., Tseng, B.: Why we twitter: understanding microblogging usage and communities. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, WebKDD/SNA-KDD 2007, pp. 56–65. ACM, New York (2007) 11. Ben-Ari, E.: Twitter: What’s All the Chirping About? BioScience 59(7) (July/August 2009), doi:10.1525/bio.2009.59.7.19. 12. Lu, Z., Agarwal, D., Dhillon, I.S.: A spatio-temporal approach to collaborative filtering. In: RecSys 2009, pp. 13–20 (2009) 13. Mei, Q., Liu, C., Su, H., Zhai, C.X.: A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In: WWW 2006, pp. 533–542 (2006) 14. Huang, S., Wu, X., Bolivar, A.: The effect of title term suggestion on e-commerce sites. In: WIDM 2008, pp. 31–38 (2008)
15. Syeda-Mahmood, T.F., Wang, F., Beymer, D., London, M., Reddy, R.: Characterizing Spatio-temporal Patterns for Disease Discrimination in Cardiac Echo Videos. In: MICCAI (1), pp. 261–269 (2007) 16. Ribeiro-Neto, B., Cristo, M., Golgher, P.B., Moura, E.S.d.: Impedance Coupling in Content-targeted Advertising. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2005) 17. Buckley, C., Salton, G., Allan, J., Singhal, A.: Automatic Query Expansion Using SMART: TREC 3. In: Overview of the Third Text REtrieval Conference, TREC-3 (1994) 18. Xu, J., Croft, W.B.: Improving the Effectiveness of Information Retrieval with Local Context Analysis. ACM Press, City (2000) 19. Kellt, R.: Twitter Study - August 2009. In: Twitter Study Reveals Interesting Results About Usage. Pear Analytics, San Antonio (2009) 20. Gligorov, R.R., Aleksovski, Z., Kate, W.T., Harmelen, F.V.: Using Google Distance to Weight Approximate Ontology Matches. In: Proc. Int’l Conf. World Wide Web 2007, pp. 767–776 (2007) 21. Han, J.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco (2005)
Exploiting Semantic Hierarchies for Flickr Group Dongyuan Lu and Qiudan Li Institute of Automation, Chinese Academy of Sciences, No. 95 Zhong Guan Cun East Road, Haidian District, Beijing, China, 100190 {dongyuan.lu,qiudan.li}@ia.ac.cn
Abstract. The development of Web 2.0 provides a convenient platform for online members to exchange information, keep in contact with others and express themselves. The Flickr group, as a representative example, is a user-organized, user-managed community. However, the rapidly increasing number of groups hampers users from browsing them efficiently, and thus poses challenges for how they are organized. As the hierarchy used in other systems (e.g., the Library of Congress) has proven its efficiency in helping users browse, it also indicates potential significance for organizing Flickr groups. In this paper, we focus on exploiting semantic hierarchies for Flickr groups. Our proposed method involves two main phases. Firstly, we extract hidden topics from groups to construct a topic-hierarchy. Then, by mapping the groups onto the topic-hierarchy, a group-hierarchy is constructed. To evaluate the efficiency of the solution, we perform experiments on a real-world dataset crawled from Flickr.com. Experimental results verify the feasibility of deriving semantic hierarchies from Flickr groups, which facilitates users' browsing experience.
1 Introduction

With the development of Web 2.0, online communities play a variety of roles for their members, such as providing easy access to information exchange, keeping in touch with others and enabling self-expression. Flickr.com, in particular, provides an opportunity for users to construct self-organized communities called Flickr groups [1]. Sharing photos with groups offers flexibility in managing one's own photos as well as making them more accessible to the public [1]. In return, the behavior of sharing photos with groups also carries extra information, which offers viable alternatives for image content understanding [2]. However, the huge number of groups hampers users from browsing conveniently, and thus further hampers them from participating in the groups they are interested in. How to organize numerous groups in order to provide a convenient way of browsing for Flickr members is a challenging task.
Existing studies on Flickr groups mainly concentrate on discovering hyper-groups based on similarity relations [3]. Since the hierarchical structure well established in organizational systems, such as the Library of Congress and the personal file cabinet, has proven its helpfulness for users to browse books or files conveniently, it also has potential significance for organizing Flickr groups. Figure 1 illustrates three organization examples.
http://www.flickr.com/
Imagine one Flickr user navigating the groups within them. In (a), no organizing structure is imposed, so the user navigates each group in a random order. Suppose the user lands in a group of interest (say "bird-1") and would like to browse some other similar groups. Unfortunately, the next one may be "elephant". In (b), as groups in the same cluster share a common topic, the previous shortcoming is well avoided. However, once the user is interested in "bird-1" and would like to navigate to finer-grained groups, this structure cannot satisfy that need either. In (c), groups are organized in a semantic hierarchical structure, where groups at higher levels hold broader semantics than those at lower levels, such as bird and gull. In this manner, the user can browse the sub-class groups of "bird" along the interconnections within the hierarchical structure. Iteratively, the user can further narrow his or her interests and finally reach the goal. It can be seen from this analysis that a hierarchical structure over Flickr groups plays a significant role in helping users browse them conveniently. We thus focus on exploiting semantic hierarchies for Flickr groups in this paper.
Fig. 1. Three organization examples for Flickr group. (a) random organization. (b) clustering-based organization. (c) hierarchy-based organization.
As one group consists of photos associated with tags, and a common usage pattern of tags is defined as a hidden topic [4], groups can thus be viewed as being constituted by hidden topics. By constructing the hidden topics into a hierarchy, a group-hierarchy can be obtained by mapping the groups onto the topic-hierarchy. We summarize the research questions as follows:
1. How do we derive hidden topics from groups? And how do we construct a hidden topic-hierarchy?
2. How do we map the groups onto the constructed topic-hierarchy?
In this paper, we answer these questions by performing experiments on a real-world dataset. To answer the first question, we adopt the hierarchical topic model [4], which uses the nested Chinese Restaurant Process (nCRP) as a prior. The topic model assumes that hidden topics consist of common usage patterns of tags. Further, with the nCRP such topics are organized into a hierarchy. To evaluate its efficiency, we build two evaluation setups to measure the results qualitatively and quantitatively. To answer the second question, we propose a mapping strategy named the Bilaterally Integrated rule. "Bilaterally" means that we take both the posterior topic probability and the group's contribution into account. The performance is evaluated by comparison with human judgment.
The remainder of this paper is organized as follows. We first review the related literature in Section 2. Then, a detailed description of the proposed method is given in Section 3. In Section 4, we perform experiments to verify the validity of our method. Finally, Section 5 concludes this paper.
2 Related Work

2.1 Research Work on Flickr Group

As a representative online community, the Flickr group has drawn increasing attention in recent years. The analysis by Negoescu et al. [1] was among the first to focus on examining Flickr groups. They presented an in-depth analysis of Flickr groups from the perspective of the photo-sharing practices of their members. Their main findings are: 1) Sharing photos with groups is an important part of the photo sharing practices of users. 2) The size of a user's photo collection weakly influences the percentage of shared photos. 3) Users' group loyalty is low. 4) Most users share the same photos in a rather limited number of groups. In order to offer a further analysis at a higher semantic level, Negoescu et al. [5] proposed a topic-based approach to represent Flickr users and groups. They jointly analyzed Flickr groups and users from the perspective of their tagging patterns, and their analysis reveals a number of fundamental similarities between users and groups, although differences also exist. Based on such a representation, Negoescu [3] further proposed a novel approach to group searching through hyper-group discovery. They used two sources of information, content and relations, and thus built three topic-based representations. Based on a novel clustering algorithm, they discovered cohesive hyper-groups. The proposed solution allows users to find potentially obscure groups which are still relevant to their search. Their efforts serve as an important precursor for our work here. As their studies are all based on a flat structure, we further attempt to explore whether a hierarchical structure is essential for Flickr groups.
Other research on Flickr groups involves group recommendation [6, 7] and group activity prediction [8]. Chen et al. [6] proposed a system named SheepDog, which first extracts photos' visual features and then uses an SVM to learn a probability-based model for concepts. When users upload one or more photos to the system, it returns related popular groups/tags. Sharing photos with groups thus becomes more convenient and easier. Liu et al. [7] proposed a tag ranking scheme. They aimed at ranking the associated tags according to their relevance to the image content. For each uploaded photo, they use the top tags in its ranked list to search for related Flickr groups and recommend them to users. The work introduced in [8] is interesting. The authors observed that although there are huge numbers of groups online, many of them are inactive. Therefore, they proposed a probabilistic framework to predict the activity of a Flickr group based on the participation and interaction of its members. Their work offers an alternative mechanism by which users can choose groups.

2.2 Research Work on Exploiting Hierarchies

There has been previous research on the construction of term hierarchies and ontologies in the information retrieval and semantic web communities [9-11]. A representative work is that of Sanderson et al. [9], who presented a method for automatically extracting a hierarchical organization of concepts from a set of documents using a type of co-occurrence known as subsumption. The subsumption hierarchy uses the co-occurrence information to identify a pair of terms that are related, and measures term specificity using document frequency.
Fig. 2. Framework of hierarchy discovery method
Some later studies utilize this subsumption model to automatically organize retrieved images into a hierarchy based on image captions [12] and to induce hierarchical relations among Flickr tags [13]. However, these works are interested in the hierarchical relations among terms, while we aim at exploiting the hierarchical relations among the extracted hidden topics and Flickr groups.
Some other related work aims at bridging folksonomies and traditional hierarchical ontologies [13-17]. Li et al. [14] proposed an algorithm for effectively browsing large-scale social annotations, which has a similar purpose to our work. They organized annotations considering both semantic similarity and hierarchical relations. In addition, a sampling method is deployed for efficient browsing. For discovering hierarchical relations, several features of tags are explored; then, a decision tree is derived from manually labeled data to predict the sub-tag relations. Instead of considering the relations among tags themselves, Plangprasopchok et al. [16] exploited the collection and set relations contributed by users on Flickr. By exploring two statistical frameworks, they aggregated many shallow individual hierarchies into a common deeper folksonomy which reflects how a community organizes knowledge. Again, different from their works, which are interested in annotations, we are interested in communities.
There is another line of research that focuses on exploiting hierarchies among documents [18, 19, 11]. Fotzo et al. [18] studied how to automatically construct concept hierarchies from a document collection, and they further generate a document hierarchy from those concept hierarchies. Our approach is in the same spirit as their work. However, they permit one document to belong to different nodes if it concerns different concepts, which differs from our work. Although we assume that one group consists of multiple hidden topics, we still consider each group to concentrate on one particular subject. For example, a group on "macaw" may consist of topics on "bird", "parrot" and "macaw". However, what it concentrates on is "macaw". Therefore, assigning it to "bird" or "parrot" is improper.
3 Proposed Method Figure 2 illustrates the workflow of the proposed method. There are two main steps: the “topic-hierarchy construction” component derives hidden topics from Flickr groups and further constructs them into a hierarchy; and the “group mapping” component
maps the groups onto the constructed topic-hierarchy to build a group-hierarchy. We detail the two main components in the following subsections.

3.1 Topic-Hierarchy Construction

As the tags associated with a photo describe it, and numerous such photos constitute one group, we represent each group by aggregating all the tags associated with the photos in its pool. All the unique tags in the collection constitute the vocabulary. Formally, each group in the collection is denoted by a V-vector G_i = (t_i1, ..., t_ij, ..., t_iV), where V is the vocabulary size and t_ij is the number of occurrences of tag j in group i. For all the G_i in the collection, we want to extract hidden topics to construct a topic-hierarchy in which more general tags appear at higher levels and more specific tags appear at lower levels. A hidden topic consists of a common usage pattern of tags, such as nature and wildlife, which often appear together. The hierarchical Latent Dirichlet Allocation (hLDA) model [4] is adopted. We fix the tree at a depth L. Firstly, the nCRP is used to generate the topic tree structure. We pick the first group G_1 randomly and use it to generate an initial path c with L nodes. Then each subsequent group G_i is either assigned to one of the existing paths c or to a new path branching off at any existing node of the tree, controlled by a probability γ; a larger γ gives a higher chance of generating a new path. When every G_i has been assigned to a path c_i (paths may be shared), the tree structure T is constructed. Next, we assume each group G_i in the collection is drawn from the following generative process (let c_g denote the path through the tree for the g-th group):
(1) For each node n ∈ T in the tree, draw a topic β_k ~ Dirichlet(η).
For each group:
(2) Pick a path c_g from the root of the tree T to a leaf, c_g ~ nCRP(γ).
(3) Draw an L-vector θ_g of topic proportions from an L-dimensional Dirichlet distribution, θ_g | {m, π} ~ GEM(m, π).
(4) Sample the tags of group G_i from a mixture of the topic distributions along the path c_g:
    - Choose a level z_{g,n} | θ_g ~ Mult(θ_g).
    - Choose a tag Tag_{g,n} | {z_{g,n}, c_g, β} ~ Mult(β_{c_g[z_{g,n}]}), which is parameterized by the topic at position z_{g,n} on the path c_g.
Note that the hyper-parameters {η, γ, m, π} affect the characteristics of the tree, so tuning them helps obtain a suitable tree shape. For approximate inference, we use the Gibbs sampling algorithm. As what we really care about in this research is not the inference procedure, we only introduce the main process briefly; a more detailed algorithm and description of the notation can be found in [20, 4]. The task is to sample the per-group paths c_g and the per-tag level allocations z_{g,n} to topics on those paths. Two main steps are performed for each group G_i in the collection:
(1) Randomly draw c_g^(t+1) from:
$p(c_g \mid \mathbf{w}, \mathbf{c}_{-g}, \mathbf{z}, \eta, \gamma) \propto p(c_g \mid \mathbf{c}_{-g}, \gamma)\, p(\mathbf{t}_g \mid \mathbf{c}, \mathbf{w}_{-g}, \mathbf{z}, \eta)$   (1)
Fig. 3. Example when max-topic-proportion rule fails
(2) For each tag, randomly draw z_{g,n}^(t+1) from:

$p(z_{g,n} \mid \mathbf{z}_{g,-n}, \mathbf{c}, \mathbf{w}, m, \pi, \eta) \propto p(z_{g,n} \mid \mathbf{z}_{g,-n}, m, \pi)\, p(t_{g,n} \mid \mathbf{z}, \mathbf{c}, \mathbf{w}_{-(g,n)}, \eta)$   (2)
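To illustrate the nCRP prior that drives the path choice in the generative process and in Eq. (1), the sketch below shows the seating decision at a single node; it is only our own illustration of the prior in isolation, whereas the sampler above combines it with the likelihood term over a whole path.

```python
# Our illustration (not the paper's code) of the nested CRP choice at one node:
# an arriving group follows an existing child in proportion to how many groups
# already passed through it, or opens a new branch with probability
# proportional to gamma.
import random

def ncrp_choose_child(child_counts, gamma, rng=random.Random(0)):
    """child_counts: {child_id: number of groups already routed through it}."""
    total = sum(child_counts.values()) + gamma
    r = rng.random() * total
    for child, count in child_counts.items():
        r -= count
        if r < 0:
            return child            # reuse an existing branch
    return "new_branch"             # a larger gamma makes this more likely

# Example: three existing branches under the root; gamma controls branching.
print(ncrp_choose_child({"parrot": 40, "seagull": 25, "heron": 10}, gamma=5.0))
```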
Running the inference, a set of group topics z is generated and assigned to the nodes of a tree to form a topic-hierarchy; in addition, each group is assigned a single path from the root to a leaf.

3.2 Group Mapping and Group-Hierarchy Construction

Although a group is assigned broad-to-narrow topics along a path, each group usually concerns one particular topic. Our task in this subsection is to map each group onto one particular topic so as to construct a group-hierarchy. In view of a group's distribution over topics, one straightforward way is to map a group onto the topic with the maximum posterior topic probability p(z_l | G_i) (we call this the max-topic-proportion rule). However, a group that is poorly distributed over a topic may nevertheless contribute greatly to it. Figure 3 shows one such example: vertically along the path, the group "parakeets" holds a low proportion on the 3rd-level topic (see the top right of Figure 3), but the tags generated by the 3rd-level topic largely derive from this group, implying that this group plays a significant role in inferring the topic. Here we define a group's contribution to a specific topic as:
$C(G_i, z_{li}) = \dfrac{\sum_{j=1}^{V} t_{ij}\, p(z_{li} \mid G_i)}{\sum_{G_k \in S_i} \sum_{j=1}^{V} t_{kj}\, p(z_{li} \mid G_k)}$   (3)
where z_li denotes the l-th-level topic on group G_i's path, and S_i is the set of groups assigned the same topic as group G_i. Clearly, C(G_i, z_l) deserves to be taken into consideration. We therefore propose a Bilaterally Integrated mapping rule: vertically, it considers the topic proportion p(z_l | G_i); horizontally, it considers the group's contribution C(G_i, z_l). A bottom-up searching strategy is used to apply this rule, which is summarized in Algorithm 1.
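Algorithm 1 itself is not reproduced here, so the following is only a hedged sketch of how a bottom-up Bilaterally Integrated mapping could be realized; the thresholds and tie-breaking are our own assumptions and are not specified by the paper.

```python
# Hedged sketch of a bottom-up Bilaterally Integrated mapping. The thresholds
# tau_p and tau_c are hypothetical; Algorithm 1 in the paper defines the
# actual rule.
def map_group_to_level(topic_props, contribs, tau_p=0.3, tau_c=0.3):
    """
    topic_props[l]: posterior proportion p(z_l | G_i) of the level-l topic on the path.
    contribs[l]:    contribution C(G_i, z_l) of the group to that topic (Eq. 3).
    Search from the leaf upward and keep the deepest level at which the group
    is salient either vertically (proportion) or horizontally (contribution).
    """
    L = len(topic_props)
    for level in range(L - 1, -1, -1):          # bottom-up: leaf first
        if topic_props[level] >= tau_p or contribs[level] >= tau_c:
            return level
    return 0                                     # fall back to the root topic

# The 'parakeets' case of Fig. 3: a low leaf proportion but a dominant
# contribution, so the group is still mapped to the 3rd-level (leaf) topic.
print(map_group_to_level(topic_props=[0.6, 0.3, 0.1], contribs=[0.1, 0.2, 0.9]))
```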
4 Experiments

To evaluate the efficiency of the proposed method, we first evaluate the quality of the topic-hierarchy: two measurement methods combining quantitative and qualitative assessment are performed. Secondly, we evaluate the group-hierarchy using a similarity measurement.

4.1 Dataset

To conduct the experiments, we collected 200+ groups on one of the most popular subjects: bird. After removing the groups with fewer than 40 participants, we finally obtained a collection of 146 groups that contain about 0.7 million photos with over 250,000 unique tags. As the tags used by only a few users reflect personal purposes or particular vocabularies, they contribute little to the collective description of groups. In the preprocessing, we therefore removed the tags used by fewer than ten users. Finally, 5,836 unique tags constitute the vocabulary.

4.2 Topic-Hierarchy Evaluation

Quantitative Evaluation. We used predictive held-out likelihood as a measure of performance, which is well established for evaluating topic models [4]. Figure 4 shows how varying the depth of the tree (L) affects the mean held-out value. As L increases from 2 to 3, the likelihood value decreases gradually; then the value quickly decreases as L is increased from 3 to 6. As L increases further, the value decreases slowly. It can be seen from the figure that, with L varying from 2 to 8, the predictive held-out likelihood values vary within an acceptable interval. The results reveal the reliability of the model. When L is 2 or 3, it provides better predictive performance. To illustrate a deeper structure, we thus set L = 3 in the following evaluations.
Fig. 4. The distribution of the held-out likelihood value over the depth of the tree
Fig. 5. Two hierarchy examples for illustrating the measure method

Table 1. Manually evaluating the topic-hierarchy according to three qualitative measures

         | Isolation             | Hierarchy | Coherence
         | 2nd-level | 3rd-level |           | Root-level | 2nd-level | 3rd-level
Average  | 5.0       | 4.2       | 5.2       | 6.4        | 5.2       | 3.2
Qualitative Evaluation. Since the held-out likelihood offers little human-interpretable meaning [21], we also deployed a human-interpretation-based evaluation metric. Five postgraduate students with a background in information systems participated in the evaluation task. They were invited to browse the topic-hierarchy for 10 minutes, and then asked to answer a questionnaire. Our questionnaire was developed based on the instrument employed by [11]. It involves the assessment of the following three factors:
- Isolation: judge whether the auto-generated topic nodes at the same level are distinguishable and their semantics do not subsume one another.
- Hierarchy: judge whether the generated topic hierarchy is traversed from tags with broader semantics at the higher levels to tags with narrower semantics at the lower levels.
- Coherence: judge whether the tags in a particular topic node reflect coherent semantics.
A score of 7 represents the most satisfactory performance and 0 represents the worst. Evaluation results are shown in Table 1. Most average values are above 5, except the isolation value and the coherence value at the 3rd level. This happens because, as the level deepens, topics with narrower semantics are more difficult to extract, which degrades the lower-level topic quality and leads to more ambiguous topics. Therefore, one extension of our work is to pay particular attention to lower-level topic extraction.
Table 2. Similarity value between the automatically constructed group-hierarchy and the manually constructed group-hierarchy

Inclusion(A, M) | Inclusion(M, A) | Similarity(A, M)
0.422           | 0.246           | 0.334
As a whole, the result indicates that the topic modeling method constructs a reliable topic-hierarchy.

4.3 Group-Hierarchy Evaluation

Since the group-hierarchy is not easy to evaluate by an objectively derived value, we wish to measure the degree of similarity between the automatically constructed hierarchy and an "ideal" hierarchy, where by "ideal" we mean the hierarchy users would expect. Our similarity measure is based on the inclusion degree introduced in [18], which considers the "brotherhood" correspondence degree and the "parent-child" correspondence degree between two hierarchies. Specifically, we refer to the relationship between two groups assigned to the same node as "brotherhood". We similarly refer to the relationship between two groups assigned to a parent node and its child node as "parent-child". To illustrate, let us look at Figure 5. For "brotherhood", we first notice that in (A) G3 and G4 are assigned to the same node, and so they are in (B). That means (A) discovered a "brother" couple corresponding to (B).
This couple thus contributes 1 to the degree. Similarly, a second brother couple contributes 1, while another contributes nothing. Next, for "parent-child", we see that G1 and G3 are assigned to nodes on the same branch in (A), and so they are in (B). That reflects that (A) discovered a "parent-child" couple corresponding to (B); this couple also contributes 1 to the degree. The remaining couples' contributions can be calculated in the same manner. It is important to realize that, if the number of couples in (A) is large enough, (A) has a greater chance of finding the same couples in (B). Therefore, the inclusion degree is further divided by the total number of "brother" and "parent-child" couples. Note that inclusion(A, B) is thus different from inclusion(B, A). Consequently, we use the average of the two values to calculate the similarity between A and B. To define this measure formally, let N_f(A->B) (N_p(A->B)) denote the number of couples of "brothers" ("parent-child") in A which also belong to B, and let |F_A| (|P_A|) denote the number of couples of "brother" ("parent-child") groups in A. The similarity degree is defined as follows [18]:
$\mathrm{Similarity}(A, B) = \dfrac{1}{2}\left(\dfrac{N_f(A \to B) + N_p(A \to B)}{|F_A| + |P_A|} + \dfrac{N_f(B \to A) + N_p(B \to A)}{|F_B| + |P_B|}\right)$   (4)
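The following sketch computes the inclusion-based similarity of Eq. (4) for two small hierarchies; the (group-to-node, node-to-parent) representation and the toy example are our own assumptions, and the averaging follows the description in the text.

```python
# Sketch of the inclusion-based similarity of Eq. (4). Hierarchies are given as
# (group -> node, node -> parent) mappings; this representation is assumed.
from itertools import combinations

def couples(group_node, parent):
    """Collect 'brother' and 'parent-child' group couples of one hierarchy."""
    brothers, parent_child = set(), set()
    for g1, g2 in combinations(list(group_node), 2):
        n1, n2 = group_node[g1], group_node[g2]
        if n1 == n2:
            brothers.add(frozenset((g1, g2)))
        elif parent.get(n2) == n1:
            parent_child.add((g1, g2))          # g1's node is the parent of g2's
        elif parent.get(n1) == n2:
            parent_child.add((g2, g1))
    return brothers, parent_child

def inclusion(a, b):
    fa, pa = couples(*a)
    fb, pb = couples(*b)
    found = len(fa & fb) + len(pa & pb)
    return found / max(1, len(fa) + len(pa))

def similarity(a, b):
    return 0.5 * (inclusion(a, b) + inclusion(b, a))

# Toy hierarchies in the style of Fig. 5.
A = ({"G1": "bird", "G3": "gull", "G4": "gull"}, {"gull": "bird"})
M = ({"G1": "bird", "G3": "gull", "G4": "gull"}, {"gull": "bird"})
print(similarity(A, M))   # identical hierarchies give 1.0
```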
In our evaluation, four assessors were invited. Given the collection, they were asked to construct a hierarchical structure through face-to-face discussion with each other. We then used their consensual assessment as the "ideal" structure and measured the proposed hierarchy by its similarity with this "ideal" one. A score of 0.334 shows that our proposed method provides satisfactory results, although there is still room for improvement.
Fig. 6. A portion of topic-hierarchy and group hierarchy
Fig. 7. Tag occurrences for three example groups in the collection
4.4 Discussions

Figure 6 illustrates a portion of the 3-level hierarchy learned from the dataset, with each topic node annotated with its 4 most probable tags. We also present examples of groups associated with the nodes, which constitute the group-hierarchy. It can be seen that the method discovers the most general tags at the root level, e.g., bird, birds, nature. In the second level, more specialized tags, e.g., seagull, parrot, and heron, which reflect narrower semantics, capture some of the major subcategories of birds. The third level provides a further subdivision into more concrete topics, e.g., cockatoo, Africangray and lovebird, which divide parrot into finer categories. From the group-hierarchy we can observe similar broad-to-narrow relations among groups.
We took a deeper look by examining three groups along one path. Their tag occurrences are shown in Figure 7(a)-(c). It can be observed that some of the top tags in group (c) have trivial weights in (a) and (b) (e.g., cockatoo, moluccan), while the tags that dominate in group (a) have equally significant weights in (b) and (c) (e.g., bird, birds). This observation provides a good explanation for the fact that these three groups concern broad-to-narrow topics (from bird to parrot to cockatoo), which strongly supports the proposed group-hierarchy discovery framework.
5 Conclusion and Future Work

In this paper, we investigate the usefulness of a hierarchical structure for Flickr groups. To exploit the semantic hierarchies, we introduce a two-phase solution. We first adopt the hierarchical topic model to construct a topic-hierarchy as a "bridge". Then we introduce a bottom-up searching strategy based on a Bilaterally Integrated mapping rule; mapping the groups onto the topic-hierarchy yields a group-hierarchy. Experimental results reveal that the aggregated tags within groups contain massive collective knowledge, which provides useful information for analyzing the group structure on Flickr at a higher semantic level. Some groups concern broader topics, while others concern narrower ones. Such a hierarchical structure reflects the fact that human cognition in the real world relies on a hierarchical representation of semantics. It therefore suggests a mechanism by which users could browse groups in accordance with their conventional behavior. In this paper, we performed the evaluation on a collection of groups on the subject of bird. Further experiments on a broader subject (e.g., animal) and on other subjects (e.g., travel) will be conducted.
The methods developed in this paper could be very useful in group management and organization applications. For example, once users have positioned themselves in a group of interest, they can browse more groups by exploring the group-hierarchy, especially when they are unsure about their goal in advance. Besides, our hierarchical construction method has the potential to be integrated into social-media content understanding systems. "As the automated systems are largely incapable of understanding the semantic content of photographs, the prospects of 'make sense' of these photo collections are largely dependent on metadata manually assigned to the photos by the users" [22]. Previous studies concerned exploiting metadata like tags, location and time, while the behavior of sharing with groups was ignored. In other words, the massive collective knowledge contained in groups, which our experiments revealed, can be further utilized to assist in "making sense" of the photos. Our future work will also investigate how to leverage such information to help understand the content of photos.
Acknowledgments. This research is supported by the projects 863 (No. 2006AA010106), 973 (No. 2007CB311007), NSFC (No. 60703085).
References 1. Negoescu, R.A., Gatica-Perez, D.: Analyzing Flickr Groups. In: Proceedings of CIVR 2008, pp. 417–426. ACM Press, Niagara Falls (2008) 2. Abbasi, R., Chernov, S., Nejdl, W.: Exploiting Flickr Tags and Groups for Finding Landmark Photos. In: Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval, pp. 654-661. Toulouse, France (2009) 3. Negoescu, R.A., Adams, B., et al.: Flickr Hypergroups. In: Proceedings of ACM International Conference on Multimedia, pp. 813–816. ACM Press, Beijing (2009) 4. Blei, D., Griffiths, T., Jordan, M.: The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM (to appear)
5. Negoescu, R.A., Gatica-Perez, D.: Topickr: Flickr Groups and Users Reloaded. In: Proceeding of The 16th ACM International Conference on Multimedia, pp. 857–860. ACM Press, Vancouver (2008) 6. Chen, H.M., Chang, M.H., Chang, P.C., Tien, M.C.: SheepDog: group and tag recommendation for flickr photos by automatic search-based learning. In: Proceeding of The 16th ACM International Conference on Multimedia, pp. 737–740. ACM Press, Vancouver (2008) 7. Liu, D., Hua, X.S., Yang, L.: Tag Ranking. In: Proceedings of The 18th International Conference on World Wide Web, Madrid, pp. 351–360 (2009) 8. Choudhury, M.D.: Modeling and predicting group activity over time in online. In: Proceedings of the 20th ACM Conference on Hypertext and Hypermedia, Torino, Italy, pp. 349–350 (2009) 9. Sanderson, M., Croft, B.: Deriving Concept Hierarchies from Text. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information, pp. 206–213. ACM Press, Berkeley (1999) 10. Bloehdorn, S., Volkel, M.: TagFS — Tag Semantics for Hierarchical File Systems. In: Proceedings of The 6th International Conference on Knowledge Management, Graz, Austria (2006) 11. Chuang, S.L., Chien, L.F.: A practical web-based approach to generating topic hierarchy for text segments. In: Proceedings of The Thirteenth ACM International Conference on Information and Knowledge Management, Washington, D.C., USA (2004) 12. Sanderson, M., Tian, J., Clough, P.: Automatic organisation of retrieved images into a hierarchy. In: International Workshop OntoImage 2006: Language Resources for ContentBased Image Retrieval, Genoa, Italy (2006) 13. Schmitz, P.: Inducing ontology from Flickr tags. In: Proceedings of the Collaborative Web Tagging Workshop (2006) 14. Li, R., Bao, S., Fei, B., Su, Z., Yu, Y.: Towards Effective Browsing of Large Scale Social Annotations. In: Proceedings of The 16th International Conference on World Wide Web, Canada, pp. 943–952 (2007) 15. Zhou, M., Bao, S., Wu, X., Yu, Y.: An unsupervised model for exploring hierarchical semantics from social annotations. In: Proc. of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC/ASWC 2007), pp. 673–686. Springer, Busan (2007) 16. Plangprasopchok, A., Lerman, K.: Constructing folksonomies from user-specified relations on flickr. In: Proceedings of The 18th International Conference on World Wide Web, Madrid, pp. 781–790 (2009) 17. Lalwani, S., Huhns, M.N.: Deriving ontological structure from a folksonomy. In: Proceedings of the 47th Annual Southeast Regional Conference. ACM Press, Clemson (2009) 18. Fotzo, H.N., Gallinari, P.: Learning Topic Hierarchies and Thematic Annotations from Document Collections. In: Learning Methods for Text Understanding and Mining Workshop, Grenoble, France (2004) 19. Yang, C.C., Wang, F.L.: Hierarchical summarization of large documents. Journal of the American Society for Information Science and Technology 59, 887–902 (2008) 20. Liu, J.: The collapsed Gibbs sampler in Bayesian computations with application to a gene regulation problem. Journal of the American Statistical Association 89, 958–966 (1994) 21. Jordan, B.G., Chang, J., Gerrish, S., Wang, C., Blei, D.: Reading Tea Leaves: How Humans Interpret Topic Models. In: Advances in Neural Information Processing Systems (2009) 22. Kennedy, L., Naaman, M., Ahern, S., Nair, R., Rattenbury, T.: How flickr helps us make sense of the world: context and content in community-contributed media collections. 
In: Proceedings of The 15th International Conference on Multimedia. ACM Press, Augsburg (2007)
Understanding a Celebrity with His Salient Events Shuangyong Song, Qiudan Li, and Nan Zheng Laboratory of Complex Systems and Intelligence Science Institute of Automation, Chinese Academy of Sciences 95 Zhongguancun East Road, Beijing, China 100190 {shuangyong.song,qiudan.li,nan.zheng}@ia.ac.cn
Abstract. The Internet has become a rich platform for people to collect information. In particular, it has become one of the main ways to learn about a celebrity. However, the huge volume of information makes it difficult for people to find what they really want. Filtering out needless information from massive data and forming a brief review of a celebrity has therefore become necessary for people to understand the person. In this paper, we propose a novel solution for understanding a celebrity by summarizing his most salient historical events, and a framework is outlined. The framework contains three main components: attention tracking, event mining from News, and event summarization. First, by comparing users’ attention and media attention on a celebrity, the News corpus is shown to be able to represent users’ attention. Second, keywords are extracted from the News according to different time periods for choosing summary sentences. Third, a final event description of the celebrity is given. Finally, we show the user interface of our system. Our experimental results show that the proposed solution can effectively process the news corpus and provide an accurate description of the celebrity.
1 Introduction With the development of Web 2.0, people can easily find abundant information about a celebrity, but the complexity and redundancy of the information make it difficult to obtain the most necessary content. For example, about 152,000,000 web pages are returned to a user who queries ‘Obama’ on Google.com. Those web pages may belong to different online media (e.g., News, Blog, Micro-Blog, etc.) and have various types of text formats. So it is hard for people to classify them by topic or by period. Describing a celebrity with his most salient historical events, referred to as character description, thus becomes important for users to understand the celebrity quickly and conveniently. A summarized text presenting event evolution is necessary for general users to review events about a celebrity. By analyzing the salient historical events about a celebrity and summarizing these events on a timeline with appropriate sentences, we are able to understand the celebrity’s life easily. Chieu and Lee proposed a query-based event extraction model to summarize events about the query along a timeline [4]. In this model, they rank the sentences retrieved from a news corpus by interest and burstiness features to represent different events about the query. However, those sentences are too brief to describe events in detail, and some sentences have content relationships with each
other, which sometimes makes the defined ‘events’ indistinguishable. On the other hand, Platakis et al. proposed a novel method of event summarization, in which an event was defined as a group of correlated terms [6]. They detected the frequent terms with similar temporal dynamics in a given period, and described the events in this period with those terms. Through this method, events can be extracted more accurately, and some implicit associations among those terms can be found. However, the result of this method is also oversimplified, and without a timeline, users cannot fully understand the events of a celebrity. In this paper, we design a new summarization method to describe a celebrity with his salient historical events. First, by comparing users’ attention and media attention on a celebrity, the News corpus is shown to be able to represent users’ attention. Therefore, we track users’ attention to a celebrity, which is based on the amount of News relevant to him, and detect his salient events by finding bursts in the stream of the News corpus. Second, keywords with similar temporal dynamics are extracted from the News corpus according to different burst time periods, which makes the summary of each event more appropriate. Third, we extract sentences which contain the keywords we have obtained, and then delete redundant sentences by calculating their content similarity, helping to get a more accurate description of the celebrity. Compared with the other two summarization methods mentioned above, this form of summary extracts the most salient events about a celebrity from the related web pages and provides a more detailed description of each event. The rest of this paper is organized as follows. In Section 2, we provide a brief review of the related work. Our method is introduced in Section 3. In Section 4, some analysis of our experimental results and a screenshot of our system are given. Finally, we draw conclusions and discuss our plans for future work in Section 5.
2 Related Work Our work is related to a series of studies on character description, stream data mining, and natural language processing. The work on character description aims to represent a character with his characteristics, societal attributes, and the events occurring to him, which is an important part of our work. Expert finding has aroused the interest of many researchers [7, 11, 1]. Balog et al. [1] proposed two general strategies for expert finding: one is to model an expert’s knowledge based on the documents they are associated with, and the other is to locate documents on a topic and then find the associated expert. Both methods achieve good performance compared with other unsupervised techniques, indicating the importance of forming reliable associations in expert finding systems. Chen et al. [11] focused not only on the extensive knowledge about an expert but also on the strong social links with him. They modeled the social network as a graph, in which the vertices indicate persons and the edges represent the relationships between persons. In this way, the problem of finding the “starring authors” in a social network turns into detecting the vertices that have large weights and strong associations with others in a graph. Zhu et al. [7] used multiple levels of associations to solve this problem, but the basic idea is consistent with that of Chen et al. [11]. In Chieu and Lee [4], a method for extracting the events about a person along a timeline was proposed. An event was
represented by a single sentence, and a series of sentences with the highest interest scores was ranked by time to describe the person through his historical events. Our work is more related to [4], but we focus on the person’s most salient events and summarize them with a detailed description instead of a single sentence. The work on stream data mining [6, 9, 10, 5] considers a single event but not the events that have causal relationships with it or the events that share the same attributes with it. Platakis et al. attempt to discover bursty terms and correlations between them during a time interval, and describe an event with those terms [6]. Kumar et al. identify bursty communities from Weblog graphs by taking the considerable attention they receive into account [9]. In [10], a topic evolution graph is built and used to trace topic transitions, i.e., changes in the cluster labels rather than the clusters themselves. Suhara et al. proposed a method to extract the action relation between two keywords from blog articles when two such related keywords are given [5], and represented the evolution of an event over time for a single concept or sentiment using those correlated words. In this paper, taking a celebrity as the center, we summarize the events about him along a timeline instead of considering a single event. Finally, some work on Chinese text summarization has been done. Kuo and Chen [2] proposed a sentence reduction algorithm based on informative words, event words and temporal words to deal with both length constraints and information coverage. In [8], a kernel-words-based approach for sentence extraction in text summarization is proposed, which achieves a high accuracy rate. Lin et al. proposed a method to calculate the affix similarity between two sentences by identifying all the common substrings between them [3]. The method in [3] is modified and introduced into our system in the event summarization part. Redundant sentences can be deleted effectively by this method to generate a more accurate description of the celebrity.
3 Proposed Algorithm Figure 1 shows the system architecture of our proposed events summarizer (summary generation system). The input of the system is a celebrity’s name and the output is the summaries of his most salient events. The system performs the summarization in three main steps, as mentioned above: (1) tracking the public attention to a celebrity and detecting the bursts in the curve of attention; (2) mining the events from the News (or Weblog) data in the period around the bursts detected in step 1; (3) summarizing the results. These steps are performed in multiple sub-steps. We download the news or blogs related to the celebrity as our stream data, and plot them as a curve of intensity over time. From the curve, we can intuitively understand what a burst means. “A sequence of events is considered bursty if the fraction of relevant events alternates between periods in which it is large and long periods in which it is small” [9]. So whenever the popularity of a specific keyword dramatically and unexpectedly increases, a burst is marked [6]. Here, we assume that a burst arises as a result of a hot event, or at least an event receiving obviously more attention than other events in the same period. The keywords representing the bursts (or events) are first extracted using frequency statistics, and then selected by comparing the similarity between them. Those words are then used to rank the sentences for summarizing events. We discuss each of the sub-steps in the following sections.
Fig. 1. System architecture of the events summarizer (a queried celebrity is matched against the News corpus and the Blog corpus to obtain the degree of media attention and the degree of users’ attention; the most attended events are then extracted, and each event (Event 1 … Event n) is summarized (Summary 1 … Summary n) to form the final summary returned to users)
3.1 Attention Tracking The “users’ attention” and “media attention” are defined as described in Baidu-Index (http://index.baidu.com/). “The degree of users' attention is evaluated on the statistic basis of millions of internet users' searching frequency in Baidu, targeted on keywords, analyzed and calculated by the weights of number of various words' searching frequency in Baidu search web, and finally illustrated by curve graph.” “The degree of media attention is based on the amount of news most relevant to the keywords in Baidu news search in the last 30 days. After being weighted, the final data were obtained and displayed in the form of surface map.”
Fig. 2. The degree of users’ attention and media attention in a quarter
The curves of users’ attention and media attention obtained from the Baidu-Index are given in Figure 2, which shows the degree of users’ attention and media attention over a quarter. We can see that the media attention and the users’ attention on a celebrity follow almost the same trend. Therefore we can use either of them to characterize the overall changing trend. Here, we choose the news data downloaded from Xinhua (http://www.xinhuanet.com/), a popular Chinese news website, since the degree of media attention is based
on the News corpus and the News corpus has a structured format. Given a query (a celebrity’s name), the website returns all the news about it. We take the query Liu Xiang as an example. The returned pages are downloaded as our corpus, which contains more than twenty thousand sentences from 1841 independent news articles. Figure 3 shows the changing curve of the number of News articles in Xinhua. It can be seen that there are 5 rapid increases in this curve, which are defined as bursts. The fourth one was around 12-03-2008, and the last one was around 03-09-2009. They are the same dates as shown in Figure 2. This similar trend proves again that News data can characterize the whole changing trend of the degree of attention about a celebrity.
Fig. 3. The changing curve of News number
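To make the burst notion concrete, here is a minimal sketch of how such bursts could be flagged automatically from a daily article-count series; the trailing-window rule, its window size and its threshold are illustrative assumptions, not the detection procedure actually used by the authors.

```python
# Illustrative burst detection over a daily news-count series: a day is marked
# as a burst when its count exceeds the mean of a trailing window by a chosen
# multiple of that window's standard deviation.
from statistics import mean, stdev

def detect_bursts(daily_counts, window=14, factor=2.0):
    """Return indices of days whose count rises sharply above recent levels."""
    bursts = []
    for day in range(window, len(daily_counts)):
        history = daily_counts[day - window:day]
        mu, sigma = mean(history), stdev(history)
        if daily_counts[day] > mu + factor * max(sigma, 1e-9):
            bursts.append(day)
    return bursts

if __name__ == "__main__":
    # Synthetic example: mostly quiet days with two sharp rises.
    counts = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3, 5, 4, 3, 4, 30, 28,
              3, 4, 2, 3, 4, 3, 5, 4, 3, 2, 45, 40, 6, 3]
    print(detect_bursts(counts))
```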
3.2 Data Preprocessing The input raw texts need to be processed by a Chinese word segmenter and a part-of-speech (POS) tagger. 1) Word segmentation and annotation. Because of the flexibility of the grammar and the particular expressions in Chinese sentences, we first need to do some primary processing on the data. We use the Chinese lexical analysis system ICTCLAS (http://ictclas.org) as a lexical analyzer to process the input raw texts. ICTCLAS includes word segmentation, POS tagging and unknown-word recognition. Its segmentation precision is 97.58%, and the recall rates of unknown words using role tagging reach more than 90% [11]. These steps prepare for the word statistics and the deletion of redundant segments in our experiments. 2) Resolving Chinese temporal expressions. The same temporal expression, for example ‘yesterday’, may denote different times in different documents, so the temporal expressions in documents should be converted into calendrical forms [2]. The two most familiar forms of date are ‘5/4/2004’ and ‘May 4th 2004’. We use the latter in this paper and convert all temporal expressions into this form.
Fig. 4. The explanation of the 31-dimensional vector we defined
3.3 Mining Events from News There are several methods to denote an event: 1) describing an event with features such as the time, the place and the people involved [12]; 2) characterizing an event using a small subset of keywords that are able to describe one or more real-life events occurring during the period of study [6]; 3) representing each event with a sentence extracted from a collection of related documents [4]. We use the second method. We find all the bursts in our stream of News data, and take every burst as a salient event of the celebrity. For each event, we choose the News in the 31 days (roughly one month) around the burst time as its sub-corpus. Then we count the most frequent words in this sub-corpus and represent each of them as a vector, in which every dimension reflects the number of times the word appears on the corresponding day. In this way, every candidate keyword is turned into a 31-dimensional vector. We give an example to explain the 31-dimensional vector we defined. In Figure 4, we can see that the term frequency of “Liu Xiang” was normalized into a 0 to 5 state series. Accordingly, every dimension in the vector of “Liu Xiang” is an integer between 0 and 5. This processing enables us to calculate the time similarity between terms more conveniently.

Dis(V_1, V_2) = \sqrt{\sum_{n=1}^{31} (V_{1n} - V_{2n})^2}    (1)
Furthermore, we adopt the Euclidean-based distance metric in formula (1) to calculate the distance between two vectors. If the distance between every two vectors in a group of terms is less than 4 (a threshold we set empirically), this group of terms is defined as similar-timeline keywords, which are further used to choose sentences for describing an event. In formula (1), Dis(V_1, V_2) denotes the distance between the two vectors V_1 and V_2; we calculate it as the Euclidean-based distance between those two 31-dimensional vectors.
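The following sketch shows one way the 31-dimensional term profiles, the 0-to-5 normalization and the distance of formula (1) could be implemented; the exact normalization scheme and the greedy grouping step are assumptions made for illustration, not the authors' precise procedure.

```python
import math

def to_state_vector(daily_freq, levels=5):
    """Normalize a 31-day term-frequency profile into integer states 0..levels."""
    peak = max(daily_freq) or 1
    return [round(levels * f / peak) for f in daily_freq]

def distance(v1, v2):
    """Euclidean distance between two equally long state vectors (formula (1))."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def similar_timeline_groups(term_profiles, threshold=4.0):
    """Greedily group terms whose state vectors are pairwise closer than the threshold."""
    groups = []
    for term, vec in term_profiles.items():
        for group in groups:
            if all(distance(vec, term_profiles[other]) < threshold for other in group):
                group.append(term)
                break
        else:
            groups.append([term])
    return groups
```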
3.4 Events Summarization
We divide this task into two subtasks: sentence extraction and redundant sentence deletion. In sentence extraction, the extracted keywords are used to rank the sentences as aforementioned. Yang et al. [8] proposed a method for ranking Chinese
sentences by keywords, which assumes that the larger the in-degree and out-degree of an entity, the more important the sentences containing this entity. So we use the extracted keywords to choose the sentences. In this paper, we delete redundant sentences by calculating their content similarity, so as to provide a concise description of events. The method in [3] is used to calculate the content similarity between two Chinese sentences; an English example is shown here for easier understanding. Suppose the two sentences are S’1: “It will be fine tomorrow” and S’2: “Tomorrow will be sunny”. First, we separate “Tomorrow will be sunny” into {“Tomorrow will be sunny”, “will be sunny”, “be sunny”, “sunny”}, and then compare them with “It will be fine tomorrow”. As shown in Figure 5, “Tomorrow will be sunny” has a common prefix of word length 1 with S’1, i.e., “tomorrow”; “will be sunny” has a common prefix of word length 2 with S’1; “be sunny” has a common prefix of word length 1 with S’1; and “sunny” has a common prefix of word length 0 with S’1. Finally, the content similarity of the two sentences ConSim(S’1, S’2) is the sum of {1, 2, 1, 0}, which is 4 for the given example. Normalizing this value into a number between 0 and 1, we further define (ConSim(S’1, S’2) + ConSim(S’2, S’1)) / (ConSim(S’1, S’1) + ConSim(S’2, S’2)) as the final content similarity between them. The content similarity between two Chinese sentences is thus defined as:

Sim(S_1, S_2) = \frac{ConSim(S_1, S_2) + ConSim(S_2, S_1)}{ConSim(S_1, S_1) + ConSim(S_2, S_2)}    (2)
Fig. 5. The content similarity of the two sentences
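Below is a small sketch that reproduces the prefix-overlap similarity described above and the worked example; it operates on word lists (standing in for the Chinese word segmentation used in the paper), and whether it matches every detail of the method in [3] is an assumption. Sentences whose normalized similarity exceeds a chosen threshold would then be treated as redundant and one of them dropped.

```python
def contains_run(words, phrase):
    """True if `phrase` occurs as a contiguous run inside `words`."""
    n, m = len(words), len(phrase)
    return any(words[i:i + m] == phrase for i in range(n - m + 1)) if m else True

def con_sim(s1, s2):
    """ConSim(S1, S2): for every suffix of S2, add the length of its longest
    prefix that also occurs contiguously in S1."""
    total = 0
    for start in range(len(s2)):
        suffix, best = s2[start:], 0
        for length in range(1, len(suffix) + 1):
            if contains_run(s1, suffix[:length]):
                best = length
            else:
                break
        total += best
    return total

def similarity(s1, s2):
    """Normalized content similarity of formula (2)."""
    return (con_sim(s1, s2) + con_sim(s2, s1)) / (con_sim(s1, s1) + con_sim(s2, s2))

# The worked example from the text, case-folded.
a = "it will be fine tomorrow".split()
b = "tomorrow will be sunny".split()
print(con_sim(a, b))     # 4, matching the example
print(similarity(a, b))  # 0.32, the normalized score
```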
4 Experimental Results 4.1 Experiments on Event Mining
We take the 5th burst in Figure 3 as an example to explain the whole process. Figure 6 presents the changing curve of the number of News articles in the month around this burst. After preprocessing the data, we count the occurrences of every word in our corpus, and choose those whose count is more than 62 (twice the number of days) as our keyword candidates. We represent them with 31-dimensional vectors, normalize every dimension into a 0 to 5 state, and calculate the similarity between two terms by the Euclidean-based distance metric mentioned in Section 3.3.
Fig. 6. The changing curve of numbers of news in a month around 2009-3-10
We follow two rules for choosing the final groups of keywords: 1) every term in the group has a timeline similar to the changing number of News articles in the same period, where this similarity is again defined by the distance between the vectors of the terms in the group and the vector of the news counts, whose dimensions are also normalized into a 0 to 5 state; 2) the distance between every two vectors in the group of terms is less than a threshold. Rule 1 ensures that the extracted keywords are related to the event we want to describe, and rule 2 indicates that those terms exhibit similar activity, which also means that they have a close connection with each other. The values of the thresholds in rule 1 and rule 2 are set between 4 and 5 empirically.
Fig. 7. The timeline-curve of the three words “training”, “recovering” and “homecoming”
Fig. 8. The timeline-curve of the two words “meeting” and “two sessions”
As seen in Figure 7, the three words “training”, “recovering” and “homecoming” have very similar time curves, as do “meeting” and “two sessions” shown in Figure 8. These words all became bursty around March 10th, 2009. To evaluate the accuracy of this result, we found that on March 8th, 2009, Liu Xiang arrived at Shanghai Pudong International Airport from the United States after his surgery, and then participated in the “two sessions” in China. 4.2 Sentence Extraction and Summary Formation
We first delete duplicate sentences from our database, which are abundant because of reprints among News articles. Then we take the sports star Liu Xiang as an example to explain how to choose the sentence candidates using the keywords extracted above. Algorithm 1 shows the procedure of choosing sentence candidates.
SENTENCE-CHOOSING(S)
Input: Set of sentences S.
 1  for each s ∈ S
 2      if s contains ("Liu Xiang")
 3         and contains (("training" and "recovering")
 4             or ("homecoming" and "recovering")
 5             or ("training" and "homecoming")
 6             or ("two sessions")
 7             or ("meeting"))
 8         and contains some time mark
 9          preserve s.
10      else
11          delete s.
12  end
Algorithm 1. Sentence Choosing
S is the set of sentences in our one-month News corpus about Liu Xiang. In line 8 of Algorithm 1, we add the condition that the sentences we keep must contain some time mark, in order to delete the News titles. After such processing, the number of resulting sentences is reduced to 16. We then carry out a k-means cluster analysis on the 16 sentences, with k set to 4. The distance between two sentences is the reciprocal of the similarity between them, which can be calculated by formula (2). Then we choose the 4 “center sentences”, one closest to the center point of each cluster, as our ultimate summary sentences. The final summarization result of the last burst event in Figure 3 is given below: When the "two sessions" opened, Liu Xiang was still in the United States, carrying out training for recovery after his operation. The “Trapeze” once again being absent from the "two sessions" aroused some controversy. Liu Xiang finished his recovery training in Houston, United States, and flew back home on March 7th. On March 10th, Liu Xiang went to his first public training after a 3-month-long recovery training in the United States. On the evening of March 10th, despite being a little tired, he insisted on flying to Beijing to attend the meeting of the CPPCC National Committee, performing his duties.
The italicized part above is the summary of one event about Liu Xiang, which describes the event “Liu Xiang came back from United States after his training recovery”. 4.3 Discussion of the Results
Empirically, our result is able to characterize a person accurately and succinctly. In this section, we evaluate our experimental result from both objective and subjective points of view. We first perform a small statistical experiment to evaluate our system by examining whether the event we extracted reflects the interest of News readers. We download 654 News articles, in the same way as we downloaded pages from the Xinhua website, from another famous News website, Sina (http://news.sina.com.cn/), over the same period as in our previous experiments. Two annotators annotate those News articles
independently as “related to the event we extract” or not, and then meet to compare the doubly annotated files. The final annotation result after their agreement shows that 506 News articles are related to the event we extract, while the other 148 are not. We believe that the number of comments on a News article can represent users’ interest in it, so we compared the number of comments on News articles in the related part to those in the unrelated part. As shown in Table 1, most of the News articles in both parts have fewer than 100 comments. But News articles with more than 100 comments account for an obviously larger proportion of the related part than of the unrelated part. One News article even has more than 1000 comments, 2605 to be exact. The average number of comments on News articles in the related part is 171.21, while that of the unrelated part is 46.95. From this result, we can conclude that our summary of events largely reflects users’ interest. Table 1. Number of News related/unrelated to the event we extracted (NC means “Number of Comments”)
             NC<10    10<NC<100    100<NC<1000    NC>1000    Total
Related        149        177            179           1      506
Unrelated       51         68             29           0      148
To further evaluate our proposed method, we conduct a comparison against the other two types of character description generated by Chieu’s method [4] and Platakis’s method [6]. For the same event mentioned above – “Liu Xiang came back from the United States after his recovery training” – we give the results of those two methods below as an example: Chieu’s method: On March 10, Liu Xiang went to his first public training after a 3-month-long recovery training in the United States. Platakis’s method: training, recovering, homecoming, two sessions, meeting.
50 celebrities are chosen as our queries, whose descriptions are summarized by Chieu’s method, Platakis’s method and our method.

Table 2. The score of each method

Questions                                        Chieu’s method   Platakis’s method   Our method
Q1: Can it describe the events accurately?            0.823            0.954             0.952
Q2: Can it describe the events succinctly?             0.722            0.720             0.612
Q3: Can it describe an event all-sidedly?              0.845            0.612             0.842
Q4: Can it describe a celebrity all-sidedly?           0.812            0.785             0.754
Q5: Can it emphasize the most salient events?          0.708            0.912             0.855
Q6: Does it have a good timeline?                      0.901            0.284             0.970
Q7: Can it separate each event distinctly?             0.774            0.925             0.934
Five Ph.D. students are employed to score our experimental results with a number between 0 and 1 (0, 0.2, 0.4, 0.6, 0.8, 1.0) on the seven aspects listed in Table 2. In general, a score close to 1 indicates that the automatically generated description is of good quality in that aspect. Table 2 shows the final average score of each method on every aspect. From Table 2, we can see that our method performs well and evenly on all the aspects discussed. 4.4 User Interface
Figure 9 shows our system interface, which allows obtaining web content on any given celebrity by sifting the News stream. Users can specify the celebrity of interest with a search string, and the system shows the periods in which events occurred for the celebrity and outputs the summarization result for those events. By searching via such an interface, people can easily understand a celebrity through his salient events.
Fig. 9. An example of a celebrity’s description
5 Conclusion and Future Work This paper proposes a novel method to extract and summarize the most salient events of a celebrity from a Chinese News corpus. With this method, we first extract keywords that describe an event, and then rank the sentences and remove redundant sentences according to these keywords. The experimental results show that our summary can concisely and accurately describe a celebrity. Currently, this system works independently of any search engine. It is our intention to integrate it with a search engine so that it can work in real time on user queries. Based on this work, we are going to address the issue of how to find association rules between events, and event prediction is also a key point of our future research work. Acknowledgments. This research is partly supported by the projects 863 (No. 2006AA010106), 973 (No. 2007CB311007), and NSFC (No. 60703085).
References 1. Balog, K., Azzopardi, L., Rijke, M.: Formal models for expert finding in enterprise corpora. In: International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–50 (2006) 2. Kuo, J.J., Chen, H.H.: Multi-document Summary Generation using Informative and Event Words. TALIP 7(1), 1–23 (2008) 3. Lin, K.H., Yang, C., Chen, H.H.: Emotion Classification of Online News Articles from the Reader’s Perspective. In: Proceedings of International Conference on Web Intelligence, Institute of Electrical and Electronics Engineers, Sydney, AU, pp. 220–226 (2008) 4. Chieu, H.L., and Lee, Y.K., Query Based Event Extraction along a Timeline. In: International ACM SIGIR Conference on Research and development in Information Retrieval, Sheffield, UK, pp. 425–432 (2004). 5. Suhara, Y., Toda, H., Sakurai, A.: Event Mining from the Blogosphere Using Topic Words. In: Proceedings of the 1st International Conference on Weblogs and Social Media (ICWSM 2007), Boulder, Colorado, USA (2007) 6. Platakis, M., Kotsakos, D., Gunopulos, D.: Searching for Events in the Blogosphere. In: Proc. Int’l Conf. World Wide Web, WWW 2009, pp. 1225–1226 (2009) 7. Zhu, J., Song, D., Rüger, S.: Integrating Multiple Windows and Document Features for Expert Finding. JASIST 60(4), 694–715 (2009) 8. Yang, W., Dai, R., Cui, X.: A Novel Chinese Text Summarization Approach Using Sentence Extraction Based on Kernel Words Recognition. In: FSKD 2008, pp. 134–139 (2008) 9. Kumar, R., Novak, J., Raghavan, P., Tomkins, A.: On the Bursty Evolution of Blogspace. In: Proc. Int’l Conf. World Wide Web, WWW 2003, pp. 159–178 (2003) 10. Mei, Q., Zhai, C.: Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In: Proceeding of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, USA, August 21-24, pp. 198–207 (2005) 11. Chen, D., Tang, J., Li, J., Zhou, L.: Discovering the Starring People from Social Networks. In: Proc. Int’l Conf. World Wide Web, WWW 2009, pp. 1219–1220 (2009) 12. Zhao, X., Qin, B., Che, W., Liu, T.: Research on Chinese Event Extraction. Journal of Chinese Information Processing 22(01), 3–8 (2008) (in Chinese)
User Interests: Definition, Vocabulary, and Utilization in Unifying Search and Reasoning Yi Zeng1 , Yan Wang1 , Zhisheng Huang2 , Danica Damljanovic3 , Ning Zhong1,4 , and Cong Wang1 1
International WIC Institute, Beijing University of Technology, China [email protected] 2 Vrije University Amsterdam, The Netherlands [email protected] 3 University of Sheffield, United Kingdom [email protected] 4 Maebashi Institute of Technology, Japan [email protected]
Abstract. Consistent description and representation methods for user interests are required for personalized Web applications. In this paper, we provide a formal definition and the “e-foaf:interest” vocabulary for describing user interests based on RDF/OWL and the FOAF vocabulary. As an application, under the framework of unifying search and reasoning (ReaSearch), we propose interests-based unification of search and reasoning (I-ReaSearch) to address the personalization and scalability requirements of Web-scale data processing. We illustrate how user interests can be used to refine literature search on the Web. Evaluation from the scalability point of view shows that the proposed method provides a practical way of Web-scale problem solving. Keywords: user interests, interest vocabulary, Web search refinement, unifying search and reasoning.
1 Introduction
User interests are of vital importance and have an impact on various real-world applications, especially on the Web. Based on the idea of Linked Data [1], it would be very useful if user interests data could be interoperable across various applications. However, interoperability in this context requires consistent description and representation of user interests. In this paper, we address this problem by providing a formal definition of user interests and the “e-foaf:interest” vocabulary based on RDF/OWL and the Friend of a Friend (FOAF) vocabulary. Further on, we show how we apply this vocabulary in the context of ReaSearch – the framework of Unifying Search and Reasoning [2] aimed at removing the scalability barriers of Web-scale reasoning. ReaSearch emphasizes searching for the most relevant sub-dataset before the reasoning process. User interests can be considered as contextual constraints that may help to find what the users really want when the original query is vague or there are too many query results
that the user has to wade through to find the most relevant ones [3]. Hence, we propose a concrete method to implement the “ReaSearch” framework, namely, interests-based unification of search and reasoning (I-ReaSearch). As an application domain of user interests and “I-ReaSearch”, we investigate how they can be used in literature search on the Web. We also make a comparative study of the scalability of the proposed method.
2 The Definition and Vocabulary of User Interests
In this section, firstly, we give a formal definition of the user interest. Secondly, we propose an RDF/OWL based vocabulary so that various data sources and applications can interoperate with each other based on the proposed vocabulary.
2.1 A Definition of User Interests
In the Cambridge Advanced Learner’s Dictionary, interest is defined as “the activities that you enjoy doing and the subjects that you like to spend time learning about” [4]. From our point of view, a user interest is not only about a topic or a subject, but also about time (for example, when the interest appeared, or when the user lost the interest) and its value. Hence, here we give a formal definition of user interest. A user interest is the subject that an agent (the agent can be a specific user, a group of users, or another type of intelligent agent) wants to get to know, learn about, or be involved in. It can be described as a five-tuple: < InterestURI, AgentURI, Property(i), Value(i), Time(i) >, where InterestURI denotes the URI that is used to represent the interest, and AgentURI denotes the agent that has the specified interest. Property(i) describes the name of the i-th property of the specified interest (here we assume that there are n properties used to describe the interest from different perspectives, and i ∈ [1, n]). Value(i) denotes the value of Property(i). Time(i) is the time at which Value(i) is acquired for Property(i). In real-world applications, knowledge representation languages are needed to describe user interests based on the above definition. In order to have better integration with the Web of Linked Data [1], we propose to describe user interests based on RDF/OWL.
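For illustration only, the five-tuple can be held in code roughly as follows; the URIs, the property label and the values shown are made-up placeholders rather than terms from the actual vocabulary.

```python
from collections import namedtuple

# Field names follow the five-tuple above; the concrete values are placeholders.
Interest = namedtuple("Interest", ["interest_uri", "agent_uri", "property", "value", "time"])

example = Interest(
    interest_uri="http://example.org/interest/semantic_web",  # placeholder URI
    agent_uri="http://example.org/agent/alice",               # placeholder URI
    property="cumulative interest value",                     # one of n possible properties
    value=12,
    time="2010-06-01",
)
print(example)
```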
2.2 The e-foaf:interest Vocabulary
We title the vocabulary as the “e-foaf:interest Vocabulary”, as it is aimed at extending the FOAF vocabulary on user interests evaluated from different perspectives. It focuses on extending “foaf:interest” by providing more details with regards to user interests. The e-foaf:interest vocabulary has 3 versions, namely: “e-foaf:interest Basic”, “e-foaf:interest Complement”, and “e-foaf:interest Complete”. They are composed of a set of class vocabularies and a set of property vocabularies.
Table 1. The “e-foaf:interest” vocabulary list

Vocabulary Branch            Vocabulary                             Type
e-foaf:interest Basic        e-foaf:interest                        Class
                             e-foaf:interest value                  Property
                             e-foaf:interest value updatetime       Property
                             e-foaf:interest appeared in            Property
                             e-foaf:interest appear time            Property
                             e-foaf:interest has synonym            Property
                             e-foaf:interest co-occur with          Property
e-foaf:interest Complement   e-foaf:cumulative interest value       Property
                             e-foaf:retained interest value         Property
                             e-foaf:interest longest duration       Property
                             e-foaf:interest cumulative duration    Property
The “e-foaf:interest Complete” is the union of the sets of vocabularies from “e-foaf:interest Basic” and “e-foaf:interest Complement”. Here we give some details on the definition of each vocabulary (the namespaces are omitted for brevity). – e-foaf:interest (Class) Definition: e-foaf:interest is a class that is used to represent the agent’s interest. – e-foaf:interest value (Property) Definition: “e-foaf:interest value” represents the value of an interest. The value of a specified interest is an arbitrary real number. The number represents the degree of interest that a user has in a specific topic. If the agent is interested in a topic, the interest value is greater than zero (namely a positive number). If the agent is not interested in a topic, the interest value of the topic is smaller than zero (namely a negative number). Note: This property is meant to cover interest values from any perspective. Some possible perspectives are defined in “e-foaf:interest Complement”, namely, the perspectives of “cumulative interest value”, “retained interest value”, “interest lasting time”, “interest appear time”, etc. It can also be a user-defined value.
– e-foaf:interest value updatetime (Property) Definition: “e-foaf:interest value updatetime” represents the update time of the interest value. It may be the time when the user specifies the interest value, or the time when an algorithm updates the value of the interest.
– e-foaf:interest appeared in (Property) Definition: “e-foaf:interest appeared in” represents where the interest appeared. Note: This property is used to preserve the original resources where the interests came from, so that these can be reused for the calculation of interest values when needed. The design of this property is inspired by “from” in the Attention Profiling Markup Language (APML 2009), which is based on XML. – e-foaf:interest appear time (Property) Definition: “e-foaf:interest appear time” is the time when the interest appears in a certain kind of scenario. – e-foaf:interest has synonym (Property) Definition: “e-foaf:interest has synonym” represents that the subject and the object of this property are synonyms, such as “search” and “retrieval”. Note: In some use cases, synonyms need to be merged together or marked as semantically very related; this property is very useful in such scenarios. – e-foaf:interest co-occur with (Property) Definition: “e-foaf:interest co-occur with” represents that the subject and the object of this predicate co-occur with each other in some cases.
– e-foaf:cumulative interest value (Property) Definition: “e-foaf:cumulative interest value” is a sub property of “e-foaf: interest value” representing the cumulative value of the number of times an interest appears in a certain kind of scenario.
– e-foaf:retained interest value (Property) Definition: “e-foaf:retained interest value” represents the retained interest value of an interest in a specific time.
Note: The retained interest value can be calculated based on an interest retention function such as the one proposed in [5]. – e-foaf:interest longest duration (Property) Definition: “e-foaf:interest longest duration” is used to represent, up to a specified time, the longest duration of the interest (between its appearance and disappearance). Note: For example, if the interest appears in the following years: 1990, 1991, 1995, 1996, 1997, 1998, 2001, the longest duration is 4 years.
– e-foaf:interest cumulative duration (Property) Definition: “e-foaf:interest cumulative duration” is used to represent the cumulative duration of an interest, i.e., the total time during which the interest has appeared since its first appearance. Note: For example, if the interest appears in the following years: 1990, 1991, 1995, 1996, 1997, 1998, 2001, then its cumulative duration is 7 years.
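The two duration notions can be computed directly from the list of appearance years given in the notes; the sketch below follows that year-level granularity (finer time units would work the same way).

```python
def longest_duration(years):
    """Longest unbroken run of consecutive years in which the interest appears."""
    if not years:
        return 0
    ys = sorted(set(years))
    best = run = 1
    for prev, cur in zip(ys, ys[1:]):
        run = run + 1 if cur == prev + 1 else 1
        best = max(best, run)
    return best

def cumulative_duration(years):
    """Total number of distinct years in which the interest appears."""
    return len(set(years))

appearances = [1990, 1991, 1995, 1996, 1997, 1998, 2001]
print(longest_duration(appearances))     # 4 (1995-1998), as in the note above
print(cumulative_duration(appearances))  # 7
```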
We should emphasize that each corresponding interest value is calculated based on a specific function, and the update times might not be consistent. Hence, each interest value (including the cumulative interest and the retained interest) has a specific update time. An illustrative example using the e-foaf:interest vocabulary and some working SPARQL queries can be found on the “e-foaf:interest” specification web site (http://wiki.larkc.eu/efoaf:interest); the vocabulary specification is an ongoing effort in the EU FP-7 framework project LarKC. The interest profile can be considered as contextual information when a specific user queries a scientific literature system (e.g., CiteSeerX, DBLP) or uses other types of applications. More refined results can be acquired by adding user interests as implicit constraints to the original (vague) query [3]. In the following section, we investigate how user interests can be involved in unifying search and reasoning and provide a corresponding logical framework.
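As a rough illustration of how an interest description along these lines could be written down in RDF, the following uses rdflib; the namespace URI and the underscore spellings of the property names are placeholders, since the authoritative terms are those published on the specification site.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
# Placeholder namespace and property spellings, not the official ones.
EFOAF = Namespace("http://example.org/e-foaf/")

g = Graph()
g.bind("foaf", FOAF)
g.bind("efoaf", EFOAF)

agent = URIRef("http://example.org/agent/alice")               # made-up agent URI
interest = URIRef("http://example.org/interest/semantic_web")  # made-up interest URI

g.add((agent, FOAF.interest, interest))
g.add((interest, EFOAF.interest_value, Literal(3.5, datatype=XSD.float)))
g.add((interest, EFOAF.interest_value_updatetime, Literal("2010-06-01", datatype=XSD.date)))
g.add((interest, EFOAF.cumulative_interest_value, Literal(12, datatype=XSD.integer)))

print(g.serialize(format="turtle"))
```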
3 Unifying Search and Reasoning with User Interests
“ReaSearch”, proposed in [2], is aimed at solving the problem of scalability for Web-scale reasoning. Its core philosophy is to select an appropriate subset of the semantic data required for reasoning. ReaSearch tries to solve the scalability issue by incomplete reasoning, as the dataset acquired from the Web itself is very likely to be incomplete anyway. In [6], the author argues that for Web search based on large-scale data, more relevant results can be found by adding logic. This effort can also be considered as unifying search and reasoning. Nevertheless, more concrete strategies need to be developed. In this section, we first introduce a concrete approach to “ReaSearch” based on user interests (I-ReaSearch), followed by two concrete strategies to implement I-ReaSearch.
3.1 The I-ReaSearch Framework
When users are trying to find useful knowledge on the Web, bridging the query topic with the users’ background knowledge can help them understand the query results and is convenient for human learning [7]. User interests can be considered as a special type of background knowledge from users; hence they can be considered as a context for literature search on the Web. In this paper, we propose to unify search and reasoning based on user interests. Following the notion in [2], we title this effort “I-ReaSearch”, which means unifying reasoning and search with Interests. The process of I-ReaSearch can be described by the following rule:
hasInterests(U, I), hasQuery(U, Q), executesOver(Q, D), ¬contains(Q, I) → IReaSearch(I, Q, D), where hasInterests(U, I) represents that the user “U” has a list of interests “I”, hasQuery(U, Q) represents that there is a query “Q” input by the user “U”, executesOver(Q, D) denotes that the query “Q” is executed over the dataset “D”, ¬contains(Q, I) represents that the query “Q” does not contain the list of interests “I”, and IReaSearch(I, Q, D) represents that, by utilizing the interest list “I” and the query “Q”, the process of unifying selection and reasoning is applied to the dataset “D”. Currently, there are two strategies under “I-ReaSearch”. Both utilize user interests as the context, but their processing mechanisms are different.
3.2 Interests-Based Query Refinement
For the strategy of user-interests-based query refinement, the idea is to add more constraints to the user’s input query based on the user interests extracted from some historical sources (such as previous publications, visiting logs, etc.). The process can be described by the following rule: hasInterests(U, I), hasQuery(U, Q), executesOver(Q, D), ¬contains(Q, I) → refinedAs(Q, Q′), contains(Q′, I), executesOver(Q′, D). In this rule, refinedAs(Q, Q′) represents that the original query “Q” is refined into “Q′” by using the list of interests. contains(Q′, I) denotes that “Q′” contains the list of interests “I”. executesOver(Q′, D) represents that the refined query “Q′” executes over the dataset “D”. Namely, in interests-based query refinement, “refinedAs(Q, Q′), contains(Q′, I), executesOver(Q′, D)” implements IReaSearch(I, Q, D) in the I-ReaSearch general framework. Based on this rule, we emphasize that this approach utilizes the user context to provide a rewritten query so that more relevant results can be acquired.
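A small sketch of the refinedAs(Q, Q′) step is given below; the SPARQL shape, the dc:title predicate and the use of regex FILTERs on interests are illustrative choices, not the paper's actual rewriting over the SwetoDBLP data.

```python
def refine_query(keyword, interests):
    """Build a refined query Q' that keeps the user's keyword but additionally
    requires the title to mention at least one interest (illustrative only;
    dc:title stands in for whatever predicate the real dataset uses)."""
    interest_filter = " || ".join(
        'regex(?title, "{0}", "i")'.format(term) for term in interests
    )
    lines = [
        "PREFIX dc: <http://purl.org/dc/elements/1.1/>",
        "SELECT ?pub ?title WHERE {",
        "  ?pub dc:title ?title .",
        '  FILTER regex(?title, "{0}", "i")'.format(keyword),
        "  FILTER ({0})".format(interest_filter),
        "}",
    ]
    return "\n".join(lines)

print(refine_query("retrieval", ["semantic web", "user interests", "reasoning"]))
```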
3.3 Querying with Interests-Based Selection
One of the ways to achieve Web-scale reasoning is to perform a selection step beforehand – this step would identify only those statements which are necessary, enabling the reasoner to finish all tasks in real time [2]. The strategy of querying with interests-based selection builds on this idea of selection: the assumption is that user interests might help to find a relevant subset so that the reasoner does not have to process the large amounts of data, but just those parts which are necessary. The process can be described by the following rule: hasInterests(U, I), hasQuery(U, Q), executesOver(Q, D), ¬contains(Q, I) → Select(D′, D, I), executesOver(Q, D′), where Select(D′, D, I) represents the selection of a sub-dataset “D′” from the original dataset “D” based on the interest list “I”, and executesOver(Q, D′) represents that the query is executed over the selected sub-dataset “D′”. Namely, in
querying with interests-based selection, “Select(D′, D, I), executesOver(Q, D′)” implements IReaSearch(I, Q, D) in the I-ReaSearch general framework.
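Analogously, Select(D′, D, I) can be sketched as filtering the triples of D by the interest list before the query runs; the string-matching relevance test below is a deliberately crude stand-in for whatever selection mechanism is actually used.

```python
def select_subset(triples, interests):
    """Select(D', D, I): keep only triples whose textual parts mention one of
    the user's interests (a crude relevance test, for illustration only)."""
    wanted = [term.lower() for term in interests]
    def relevant(triple):
        text = " ".join(str(part).lower() for part in triple)
        return any(term in text for term in wanted)
    return [t for t in triples if relevant(t)]

# Toy dataset D as (subject, predicate, object) strings.
D = [
    ("paper:1", "dc:title", "Scalable reasoning on the semantic web"),
    ("paper:2", "dc:title", "A survey of image compression"),
    ("paper:3", "dc:title", "User interests in web search"),
]
D_prime = select_subset(D, ["semantic web", "user interests"])
print(D_prime)  # only paper:1 and paper:3 remain; the query then runs on D_prime
```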
4 Interests-Based ReaSearch from Different Perspectives
When the query is vague or incomplete, research interests can serve as constraints that can be used to refine the queries. Research interests can be evaluated from various perspectives, and each perspective reflects one unique characteristic. When the user is not satisfied with a specific perspective, the interest list used for interests-based ReaSearch is changed. This process can be described by the following rule: IReaSearch(I, Q, D), ¬satisfies(U, R) → IReaSearch(I′, Q, D), where IReaSearch(I, Q, D) denotes that the ReaSearch process is based on the interest list “I” and the query “Q” over the dataset “D”, ¬satisfies(U, R) denotes that the user “U” is not satisfied with the query results “R” from IReaSearch(I, Q, D), and IReaSearch(I′, Q, D) denotes that the interest list is changed from “I” to “I′” and the ReaSearch process is based on the new interest list “I′”. In this paper, four perspectives on the evaluation of user interests are considered, namely the cumulative interest value, the retained interest value, the longest interest duration, and the cumulative interest duration, which are introduced in Section 2.2. Some illustrative examples of how user interests can help to get more refined results are available from [8] and http://wiki.larkc.eu/csri-rdf.
5 Evaluation
In [8], the evaluation results from user studies have shown that the user prefers the refined results with user interests in comparison to the unrefined ones. In this section, we evaluate the proposed method from the perspective of scalability. We present a comparative study on the query effectiveness among three different strategies: 1. Query based on the original user input (no refinement). 2. Interests-based query refinement (introduced in Section 3.2). 3. Querying with Interests-based selection (introduced in Section 3.3). As an illustrative example, we take the SwetoDBLP dataset [9] which is divided into 22 sub-datasets. We evaluate the 3 implemented strategies by using these datasets at different scales. A comparative study is provided in Figure 1. Two users are taken as examples, namely Frank van Harmelen and Ricardo Baeza-Yates. Top 9 retained interests for each of them are acquired based on the retained interest function (introduced in [3]) and used to unify the selection and reasoning process. The above three different kinds of querying strategies are
Fig. 1. Scalability on query time for the three different strategies
Fig. 2. Interests-based refined query time with 3, 6, and 9 interests
performed on the gradually growing dataset (each time adding 2 subsets of the same size, around 55M each, 1.08G in total). As shown in Figure 1, strategy 2 considers more constraints than strategy 1 and therefore requires more processing time – as the size of the dataset grows, the processing time grows very rapidly, which means that this method does not scale well if we consider only the required time. However, the quality of the acquired query results is much better than with strategy 1. Since strategy 3 selects a relevant sub-dataset in advance, the query time is significantly reduced. As the size of the dataset grows, the query time increases, but not as fast as is the case for strategy 2. Meanwhile, the quality of the query results is the same as with strategy 2. Hence, this method scales better. Further on, for strategy 2 and strategy 3, we examined the impact of the number of constraints in a query, so that one can immediately see the necessity of balancing query refinement and processing time. Figure 2 shows the query processing time with 3, 6, and 9 interest constraints for the user “Frank van Harmelen”. We can conclude that the number of constraints in the query is positively correlated with the query processing time. Hence, even though strategies 2 and 3 yield better results than strategy 1, one should be cautious about adding too many constraints to the original query each time.
6 Conclusion
In order to support user interests based applications on the linked data Web, in this paper, we give a formal definition of the user interest, and define an extended vocabulary of FOAF focusing on user interests. The aim of this vocabulary is to make the description and representation of user interests in a more consistent way so that various applications can share user interests data.
As an application of user interests, interests-based unification of search and reasoning (I-ReaSearch) is proposed to solve the scalability and diversity problems of Web-scale reasoning. Two types of strategies are introduced, namely interests-based query refinement and querying with interests-based selection. From the result-quality perspective, they perform equally well. From the scalability perspective, the latter scales better than the former. This effort can be considered as a foundation towards user-centered knowledge retrieval on the Web [10].
Acknowledgments This study is supported by a research grant from the European Union 7th framework project FP7-215535 LarKC (http://www.larkc.eu).
References 1. Bizer, C.: The emerging web of linked data. IEEE Intelligent Systems 24(5), 87–92 (2009) 2. Fensel, D., van Harmelen, F.: Unifying reasoning and search to web scale. IEEE Internet Computing 11(2), 94–95 (2007) 3. Zeng, Y., Yao, Y., Zhong, N.: Dblp-sse: A dblp search support engine. In: Proceedings of the 2009 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 626–630 (2009) 4. Cambridge Advanced Learner’s Dictionary, 3rd edn. Cambridge University Press, Cambridge (2008) 5. Zeng, Y., Yao, Y., Zhong, N.: Dblp-sse: A dblp search support engine. In: Proceedings of the 2009 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 626–630 (2009) 6. Berners-Lee, T., Fischetti, M.: Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. Harper, SanFrancisco (1999) 7. Bransford, J., Brown, A., Cocking, R.: How People Learn: Brain, Mind, Experience, and School. National Academy Press, Washington (2000) 8. Zeng, Y., Ren, X., Qin, Y., Zhong, N., Huang, Z., Wang, Y.: Social relation based scalable semantic search refinement. In: The 1st Asian Workshop on Scalable Semantic Data Processing (AS2DP 2009), co-located with the 2009 Asian Semantic Web Conference (ASWC 2009) (2009) 9. Aleman-Meza, B., Hakimpour, F., Arpinar, I.B., Sheth, A.P.: Swetodblp ontology of computer science publications. Journal of Web Semantics 5(3), 151–155 (2007) 10. Yao, Y., Zeng, Y., Zhong, N., Huang, X.: Knowledge retrieval (kr). In: Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 729–735 (2007)
Ontology Matching Method for Efficient Metadata Integration Pyung Kim, Dongmin Seo, Mikyoung Lee, Seungwoo Lee, Hanmin Jung, and Won-Kyung Sung Knowledge Information Center, Korea Institute of Science and Technology Information, Korea {pyung,dmseo,swlee,jerryis,jhm,wksung}@kisti.re.kr
Abstract. We describe an ontology matching method for efficient metadata integration in a digital contents management system. This approach uses semantic integration methods at the schema level but converts data from several data sources into a single queriable format at the data level. Every data schema is represented by an ontology, and the user then generates mapping rules between metadata using the ontology mapping browser. Finally, all data are converted to a single data format; architecturally, this offers a tightly coupled approach because the data reside together in a single repository. Throughout these processes, our study provides user-friendly and efficient methods for creating relationships between metadata with Semantic Web technologies.
1 Introduction The database market has grown rapidly since the 1960s, and databases are now widely used everywhere data needs to be managed [1]. The need to share or integrate existing data sources has increased exponentially, and many studies [2,3,4,5,6] have been conducted to solve the known tasks [7]. In general, information systems are not designed for integration. Thus, whenever integrated access to different source systems is desired, the sources and their data that do not fit together have to be coalesced by additional adaptation and reconciliation functionality [8]. Our institution has provided an information service on science and technology literature with a digital contents management system (DCMS). Previously, DCMS allowed only mapping relations from an individual source to the standard metadata, which reduced the flexibility of the mapping rules. If the user did not know the standard metadata, the user could not create mapping rules, because every piece of metadata had to be mapped to the standard metadata. Likewise, if the user wants to distribute the existing metadata in different forms, the same problem occurs. In this work we propose an ontology matching method for efficient metadata integration that enables the user to create mapping rules between non-standard metadata, one of which is pre-mapped to the standard metadata. We provide ontology browsing with a GUI, and the user can create a mapping rule by selecting two metadata elements and describing the details of the data conversion. The user can grasp the mapping status between metadata at a glance and change mapping rules easily.
2 Related Work

Most approaches are divided into tightly coupled and loosely coupled data integration [1]. Tightly coupled data integration has been a popular solution and involves data warehousing. The warehouse system [2] extracts, transforms, and loads data (ETL) from several sources into a single queriable schema. In this approach, problems can occur with synchronization with the sources: for example, when source data is changed, the warehouse still stores the older data and the ETL process needs to be re-executed. There are also difficulties in constructing data warehouses when a system provides only a query interface to its data sources and no access to the full data. Loosely coupled data integration provides a single access interface over a mediated schema and is oriented to loosely coupled data over multiple data resources [4]. It provides a single query interface over a virtual mediated schema, and a "wrapper" is designed for each original data source. In this approach, the "freshness" of data is not a problem because it provides real-time synchronization with the data sources. However, the administrator must rewrite the view for the mediated schema whenever a new source is integrated or an existing source changes its schema, and there is some delay in query execution due to query transformation and multiple accesses to diverse sources. Semantic integration [9] uses ontologies to resolve semantic conflicts between heterogeneous data sources. There can be conflicts in data type or meaning between multiple sources, and semantic integration focuses on acting as a mediator between heterogeneous data sources which may conflict not only in structure but also in context or value. A common strategy for the resolution of such problems involves the use of ontologies which explicitly define schema terms and thus help to resolve semantic conflicts. This approach represents ontology-based data integration. The alternative is to perform semantic integration in aggregator data centers (e.g., Google, Microsoft, etc.) [10].
3 Digital Contents Management System

Our institution provides an information service on science and technology literature through DCMS. Data integration in DCMS involves data warehousing, which extracts, transforms, and loads data from several sources into a single repository. If an information service has to be provided over a large data set and user queries over many heterogeneous data sources have to be processed quickly, then data warehousing is more suitable than loosely coupled data integration. In DCMS, every mapping rule used to be stored and managed manually, and all data can be accessed in a single repository. It is hard to maintain relationships between diverse sources because every source schema has to be matched to a standard schema, and there are over 30 heterogeneous data sources. We therefore have to improve the efficiency of mapping between data sources and provide flexibility in transforming the standard schema into other schemas. This year we are redesigning the entire process and standardizing each step from data acquisition to distribution, so that DCMS has more flexibility and efficiency with diverse sources.
Fig. 1. Data Process in DCMS
Figure 1 shows the overview of DCMS for managing data sources. We divide the process into seven stages:

1) Data Acquiring: we regularly collect data from international and domestic publishers in various formats (XML, CSV, DB, etc.). The data is delivered through FTP or on CD. A loader has been developed for each data source and performs the loading task using a TEMP DB. In this step, information about the data acquisition is recorded in a History DB.
2) Loading: in order to ensure the integrity of the data, the type and value of each data item are checked during loading. We also check whether the mandatory items are filled.
3) Transforming: to unify the formats of the data sources, mapping rules between data sources are used to transform the data. Previously, we created and managed mapping rules in a DB table. From now on, every schema will be represented as an ontology and mapping rules will be created by ontology matching.
4) Approving: an administrator makes sure that every item is correct and approves the loading of the transformed data.
5) Managing: we add management information, including id, loading date and manager, to the data source.
6) Indexing: in order to provide a retrieval service, the search engine indexes the items.
7) Distributing: it is hard to control the conversion of existing standard data into other formats. We will apply ontology matching in this step as well and provide more flexibility in data conversion.
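As a rough illustration of how these stages can chain together, the following Python sketch models the process as an ordered list of stage functions. All function names and bodies here are hypothetical placeholders, not the actual DCMS implementation.

```python
# Illustrative sketch only: stage names follow the seven steps described above,
# but every function body is a placeholder, not DCMS code.

def acquire(record):     # 1) Data Acquiring: collect raw records, log to History DB
    return record

def load(record):        # 2) Loading: check data types, values and mandatory items
    assert "id" in record, "mandatory item missing"
    return record

def transform(record):   # 3) Transforming: apply mapping rules to unify the schema
    return {k.lower(): v for k, v in record.items()}

def approve(record):     # 4) Approving: administrator confirms the transformed data
    return record

def manage(record):      # 5) Managing: attach management information (id, date, manager)
    return dict(record, manager="admin")

def index(record):       # 6) Indexing: hand the record to the search engine
    return record

def distribute(record):  # 7) Distributing: convert standard data to other target formats
    return record

PIPELINE = [acquire, load, transform, approve, manage, index, distribute]

def process(record):
    for stage in PIPELINE:
        record = stage(record)
    return record

print(process({"id": "A-001", "Title": "Sample article"}))
```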
4 Ontology Matching Method

Our solution to the problem of creating mapping rules is not only to provide indirect links between data sources, but also to make it easy to understand all the relationships between data sources at a glance. There is similar research on using ontologies in data integration and on ontology matching with GUI tools. However, we focus on more practical methods for creating the ontology of a data source and the mapping rules between ontologies via indirect links, for transforming a data source to standard metadata, and for distributing standard metadata to other services. Ontology properties, a reasoner and GUI tools are used in every process, and the ontology has to contain all information about the data source, such as data format, data type, meaning, access path, etc. We define additional properties and the design of the GUI tools. At the data level, we have to store all data sources into a
Fig. 2. Process of Ontology Matching
single repository in order to provide a faster service over many diverse data sources. The ontology matching process consists of five steps, as shown in Figure 2. In this section, we explain the ontology matching method using two data sources which provide science and technology literature to our system: one is the British Library (BL) and the other is Springer. We design a super metadata schema which can contain all metadata of every data source; it is also represented by an ontology. We use OntoReasoner [11], a rule-based reasoner built by KISTI, to expand the mapping rules with ontology properties.

4.1 Ontology Properties

An ontology is a formal representation of knowledge as a set of concepts within a domain and the relationships between those concepts. It is used to reason about the properties of that domain, and may be used to describe the domain. An ontology is useful for defining the meaning of a schema and provides various properties to clarify the relationships between data sources. Ontology languages, including RDF, RDFS and OWL, define object properties and datatype properties. The properties rdfs:subPropertyOf, rdf:Seq and owl:equivalentProperty are used to describe the relationships between metadata. We add extra information about data conversion in rdfs:comment, which contains information that cannot be represented by a simple mapping.

• rdfs:subPropertyOf, owl:equivalentProperty: these properties are widely used to set the relationships between data sources and are useful for expanding the mapping rules due to their transitive nature.
• rdf:Seq: RDF provides primitives to build containers and collections that list things. Containers are open groups and contain resources or literals, possibly with duplicates. rdf:Seq is for an ordered group of resources or literals that may contain duplicate values. This property is useful when several data items compose a single data item in a given order.
• kisti:appliedRule: there can be problems in transforming source data directly to standard data, because the type of the source data does not fit that of the standard data or a normalization function must be applied to some data. If additional information is needed for converting values, this property contains the rules or functions to be applied.
• kisti:originalPath: original data formats vary and the transforming program has to access the original value. kisti:originalPath is used to record the fields of a DB or the access path in an XML document.
• kisti:dataFormat: in order to express a more sophisticated data format, this property includes a regular expression for the data value.

DCMS gains flexibility by using semantic web technologies such as ontologies and a reasoner. The reasoner expands the mapping rules based on the attributes of the properties. When a data source is newly added or updated, the administrator can choose any of the existing data sources to create mapping rules against. All items of the existing data sources are connected with ontology properties, and the reasoner can expand the mapping rules to the new data source.

4.2 Creating Ontology

The original source is delivered in various formats such as XML documents, CSV and DB. The ontology represents all information about the data source. If the format of the data source is DB, then the ontology includes the name and data type of the fields. If the format of the data source is XML, then the ontology includes the access path to the elements or attributes. BL provides data in a DB and Springer provides data in XML documents. The ontologies are minimized to include only the necessary information. Every property mapped to an original data item includes the following information: the meaning, data type and data format of the metadata. This mapping process between a data source and its ontology is currently executed manually and will be supported by GUI tools. We are developing a GUI tool which will aid the user in filling in the values of the properties based on the original data, as shown in Figure 3.
Fig. 3. Mapping Springer to Ontology
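As a minimal sketch of how the properties from Section 4.1 could be attached when a source ontology is created (Section 4.2), the following Python fragment uses the rdflib library, which is not mentioned in the paper; the namespace URIs and the specific property values are illustrative assumptions, not the ontology actually used in DCMS.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDFS, OWL

KISTI = Namespace("http://example.org/kisti#")   # hypothetical namespace for the custom properties
STD   = Namespace("http://example.org/std#")     # standard (super) metadata
SPR   = Namespace("http://example.org/springer#")

g = Graph()
g.bind("kisti", KISTI)
g.bind("std", STD)
g.bind("spr", SPR)

# The Springer title is equivalent to the standard title; the kisti: properties
# record where the original value lives and how to convert it.
g.add((SPR.title, OWL.equivalentProperty, STD.title))
g.add((SPR.title, KISTI.originalPath, Literal("/Article/ArticleTitle")))
g.add((SPR.title, KISTI.dataFormat,  Literal(".+")))
g.add((SPR.title, KISTI.appliedRule, Literal("lower()")))

# A source field that is only a component of a standard field.
g.add((SPR.volumeEnd, RDFS.subPropertyOf, STD.volume))
g.add((SPR.volumeEnd, KISTI.appliedRule, Literal("concat('-')")))

print(g.serialize(format="turtle"))
```

Serializing the graph to Turtle keeps the mapping human-readable, which is in line with the browser-oriented design described in the next subsection.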
Fig. 4. Mapping Data Source to Ontology
4.3 Ontology Matching with Ontology Browser

The ontology browser is designed to display all information required to set up the mapping relationships between ontology properties. When the mouse hovers over a property, the data type, definition and data format of the property pop up; rdfs:comment is used for the definition of the property. Figure 4 shows the relationships between ontology properties and the design of the GUI tool, which is under development. Every property contains all the information necessary to understand the meaning of a data item and to convert it to another data item. owl:equivalentProperty can be used to connect two properties which have the same meaning. In the past, the user had to create mapping rules between a metadata item and the standard metadata. Now, the user can select any of the ontology properties, because the reasoner can create the final mapping rules from a metadata item to the standard metadata. When one value is separated into two values, or two values are merged into one value, the system solves these cases with an intermediate node, shown as a circle in Figure 4. An intermediate node plays the role of a blank node and is represented by rdf:Seq, which can fix the order of the data values. If "name" is to be separated into "givenName" and "familyName", we can specify the order of these properties with rdf:first and rdf:rest. The rule or function to be applied to separate such a value is given in kisti:appliedRule. If there is a conflict in data type or data format, the user is requested to input kisti:appliedRule. When a property is a component of another property, rdfs:subPropertyOf can be used.

4.4 Generating Mapping Rules

The final mapping rules are relationships between a data source and the standard metadata. If a property is connected to another property that is not a property of the standard metadata, then
the reasoner generates the final mapping rules by using attributes of ontology properties such as rdfs:subPropertyOf and owl:equivalentProperty. The reasoner traces the route to the standard ontology using the transitive nature of these properties.

4.5 Storing Mapping Rules

The final mapping rules are stored in a DB table per data source and include the original data, the standard data and the applied rules. Every function is pre-defined by the users, and multiple rules can be applied in order. If one value is separated into two values, the "substring" function is usually applied. If two values are merged into one value, the original data are mapped to extension fields of the standard data in order, and the last extension field merges these values using the "concat" function with the delimiter '-'. Table 1 shows the final mapping rules of BL and Springer.

Table 1. Final Mapping Rules of BL and Springer

  Original Data               Standard Data       Applied Rules
  BL.Article.controlID        Article.ID
  BL.Article.title            Article.title       lower()
  ...                         ...                 ...
  BL.Article.page             Article.spage       substring(0,2)
  BL.Article.page             Article.lpage       substring(2,2)
  BL.Author.name              Author.givenName    normalizedName(); getGivenName()
  BL.Author.name              Author.familyName   normalizedName(); getFamilyName()
  Springer.Article.volumeEnd  Article.volume.2    concat('-')
  Springer.Article.title      Article.title       lower()
  ...                         ...                 ...
4.6 Transforming Data Source

The system converts the data source to standard metadata based on the final mapping rules. At execution time, the data format, data type and value of the changed data must be checked. In order to execute quickly, every mapping rule is loaded into memory before the transformation process runs.
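A simplified sketch of this rule execution is given below: the mapping rules are loaded into an in-memory table and applied in order, mirroring a subset of Table 1. The rule table and the helper functions are illustrative assumptions, not the production DCMS code.

```python
def lower(values):            return [v.lower() for v in values]
def substring(values, a, b):  return [v[a:a + b] for v in values]
def concat(values, sep):      return [sep.join(values)]

# (source field, standard field, list of (function, args)) -- hypothetical subset of Table 1
RULES = [
    ("BL.Article.controlID", "Article.ID",    []),
    ("BL.Article.title",     "Article.title", [(lower, ())]),
    ("BL.Article.page",      "Article.spage", [(substring, (0, 2))]),
    ("BL.Article.page",      "Article.lpage", [(substring, (2, 2))]),
]

def transform(source_record):
    standard = {}
    for src, std, funcs in RULES:
        values = [source_record[src]] if src in source_record else []
        for func, args in funcs:          # multiple rules can be applied in order
            values = func(values, *args)
        if values:
            standard[std] = values[0]
    return standard

print(transform({"BL.Article.controlID": "BL001",
                 "BL.Article.title": "Ontology Matching",
                 "BL.Article.page": "0815"}))
```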
5 Conclusions and Future Work

In this paper, we proposed an ontology matching method for efficient metadata integration in a digital contents management system. Our approach uses semantic integration methods at the schema level but converts data from several data sources into a single queriable format at the data level. Every data schema is represented by an ontology, and the user then generates mapping rules between metadata using the ontology mapping browser. The reasoner generates the final mapping rules from a data source to the standard metadata, and these mapping rules are stored in a DB table. The system executes the mapping rules to convert the values of a data source into values of the standard metadata.
DCMS is under development and the standard metadata is not fixed yet, but our proposed method will be applied to DCMS. The ontology matching method can improve the efficiency of mapping between data sources because it provides multiple paths to the standard metadata and the user can select any of the data items. We use the attributes of ontology properties and design a GUI tool which supports creating relationships and settings intuitively. Our proposed method will be tested and improved by applying it to over 30 data sources, and we will evaluate its efficiency and coverage. The ontology matching system will be completed and then applied to DCMS. DCMS will control diverse data sources effectively and provide standard metadata to other services by using our proposed method.
References 1. Data Integration in Wikipedia, http://en.wikipedia.org/wiki/Data_integration 2. Chaudury, S., Dayal, U.: An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record 26(1), 65–74 (1997) 3. Lenzerini, M.: Data Integration: A Theoretical Perspective. In: Proceedings of The 21th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 233–246 (2002) 4. Hull, R., Zhou, G.: A framework for optimizing data integration using the materialized and virtual approaches. Technical report, Computer Science Department, University of Colorado (1996) 5. Halevy, A., Rajaraman, A., Ordille, J.: Data integration: the teenage years. In: Proceedings of The 32nd International Conference on Very Large Data Bases, Korea (2006) 6. Kolaitis, P.G.: Schema mappings, data exchange, and metadata management. In: Proceedings of The 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Maryland (2005) 7. Ziegler, P., Dittrich, K.R.: Three Decades of Data Integration - All Problems Solved? In: Jacquart, R. (ed.) 18th IFIP World Computer Congress, vol. 12, pp. 3–12. Kluwer, Toulouse (2004) 8. Ziegler, P., Dittrich, K.R.: Data Integration — Problems, Approaches, and Perspectives. In: Conceptual Modelling in Information Systems Engineering, pp. 39–58. Springer, Heidelberg (2007) 9. Giunchiglia, F., Yatskevich, M., Shvaiko, P.: Semantic Matching: Algorithms and Implementation. In: Spaccapietra, S., Atzeni, P., Fages, F., Hacid, M.-S., Kifer, M., Mylopoulos, J., Pernici, B., Shvaiko, P., Trujillo, J., Zaihrayeu, I. (eds.) Journal on Data Semantics IX. LNCS, vol. 4601, pp. 1–38. Springer, Heidelberg (2007) 10. Al-Fares, M., Loukissas, A., Vahdat, A.: A scalable, commodity data center network architecture. In: ACM SIGCOMM Computer Communication Review, vol. 38(4) (2008) 11. Lee, S., Lee, M., Kim, P., Jung, H., Sung, W.: OntoFrame S3: Semantic Web-Based Academic Research Information Portal Service Empowered by STAR-WIN. In: Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) The Semantic Web: Research and Applications. LNCS, vol. 6089, pp. 401–405. Springer, Heidelberg (2010)
Multiagent Based Large Data Clustering Scheme for Data Mining Applications T. Ravindra Babu1 , M. Narasimha Murty2 , and S.V. Subrahmanya1 1
E-Comm Research Lab, Infosys Technologies Limited, Bangalore - 560 100 {ravindrababu t,subrahmanyasv}@infosys.com 2 Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore - 560 012 [email protected] Abstract. Multiagent Systems consist of multiple computing elements called agents, which in order to achieve a given objective, can act on their own, react to the inputs, pro-act and cooperate. Data Mining deals with large data. Large data clustering is a data mining activity wherein efficient clustering algorithms select a subset of original dataset as representative patterns. In the current work we propose a multi-agent based clustering scheme that combines multiple agents, each capable of generating a set of prototypes using an independent prototype selection algorithm. Each prototype set is used to predict the labels of unseen data. The results of these agents are combined by another agent resulting in a high classification accuracy. Such a scheme is of high practical utility in dealing with large datasets.
1 Introduction
An Intelligent Agent is a computing element that is autonomous, with the ability to react to the environment, pro-act to achieve a given task and cooperate with other agents. Multiagent systems employ a number of agents to achieve a given objective. Distributed multiagent systems [17,18] have applications in multiple domains such as artificial intelligence, economics, sociology, management science, philosophy and data mining, and the applications of such multiagent systems have been increasing continuously in recent times. Data Mining deals with the activity of extracting novel, general and valid abstractions from large data. It is an interdisciplinary subject having an overlap with a number of domains such as Pattern Recognition, Statistics, Database Management and Artificial Intelligence. The mining methodology issues encompass clustering, classification, data characterization, etc. Large data clustering is an important data mining activity. A number of applications are found in practice that integrate agents and data mining [6,2,11,1]. Also, several groups focus on agent-data mining interaction [3]. In the current work, we present a multi-agent system consisting of agents, each capable of carrying out large data clustering using a different prototype generation scheme. The abstraction generated by each scheme is validated using the
k-Nearest Neighbour Classifier. Another agent integrates the results obtained by each classifier and provides the final assignment of labels to the test data. This results in a highly accurate classification of the given data. The proposed system integrates the following aspects:

– Multiagent systems [17,18,12]
– Multiple large data clustering or prototype selection approaches:
  1. Leader clustering algorithm [16]
  2. Frequent item approach for feature selection [7,8]
  3. Random sampling [4]
– Classification [5]
– Combining results of different clustering algorithms [10]

The contents are organized in the following manner. Section 2 contains the motivation for the study and includes a discussion of the current literature. Section 3 contains a description of the large dataset under study and the results of a preliminary analysis of the data. Section 4 contains a description of the proposed system. A discussion of the experimentation and results is provided in Section 5. The work is summarized in Section 6.
2 Motivation
Abstraction generation from large data is an important data mining activity. A dominant portion of data mining research concentrates on finding efficient, effective and scalable algorithms. Since the data size is large, the algorithms have to generate an abstraction in a single scan, or in as few scans as possible, of the database under study. When high-dimensional data of petabyte scale is involved, it is beneficial to consider such a data mining system as a multiagent system, in which each agent is capable of carrying out a data mining activity and the resulting abstractions can be combined in some efficient manner for further use. Broadly, agent and data mining integration [15] can be classified into (a) agents supporting data mining and (b) data mining supporting agents. In the case of data mining supporting agents, several works deal with clustering of agents that have similarities in some sense, say, goals. Gurruzzo and Rosaci [6] propose a grouping of agents based on a semantic negotiation protocol that exploits lexical, structural and semantic aspects for similarity. Buccafurri et al. [2] propose clustering of agents by deriving semantic similarities existing among concepts associated with agents. Ogston et al. [11] suggest clustering of agents that reside on a network, based on similarity of objectives or data. Examples of agents supporting data mining can be found in the works of Agogino and Tumer [1] and Park and Oh [13]. Agogino and Tumer [1] focus on fault-tolerance aspects of agent-based cluster ensembles. Park and Oh [13] propose a multi-agent system for intelligent clustering by unsupervised learning, consisting of a clustering agent and a clustering evaluation agent. Clustering of high-dimensional large datasets has remained a challenging area. The challenges relate to clustering methods, their scalability and efficiency. Repeated scans of the database are prohibitive for large datasets. Ideally, a clustering
scheme should be able to generate an abstraction in one or the least possible number of scans, and at the same time it should provide high accuracy. Different clustering schemes provide different abstractions of the same given data [9]. Without a priori knowledge of the underlying data, it is difficult to select an appropriate clustering algorithm. The final objective is to generate an ideal set of prototypes that classify unseen patterns with high accuracy. With the above motivation, we propose a system that combines three different prototype selection or clustering methodologies that together provide high classification accuracy. We discuss the proposed system in Section 4. In Section 3 we provide a description of the large dataset under study and provide insight into the nature of the data through a preliminary analysis.
3 Description and Preliminary Study of Data
A large handwritten digit dataset is considered for the study. The data consists of 100030 labeled patterns from 10 classes, with labels 0 to 9. The total number of patterns is divided into 66700 training patterns and 33330 test patterns. A part of the training set, 6700 patterns, is set aside as a validation dataset, leaving the remaining 60000 patterns for training. Each pattern is represented by 192 binary-valued features. Figure 1 shows sample handwritten data. Each pattern is depicted as a 16×12 matrix, which leads to 192 features. The sample patterns indicate the variability of the data in shape, orientation and completeness. From the figure as well as from the study, the following inferences can be drawn.

– Some patterns of the data are incomplete; for example, in the figure, pattern no. 3 of class '0', pattern no. 1 of class '1', and pattern no. 1 of class '2'.
– Shapes are non-uniform and occur in multiple orientations, such as slanted left, straight and slanted right.
– The patterns occupy different regions of the 16×12 structure, leading to occupancy of different features for different patterns.
– It should also be noted from the figure that patterns belonging to different classes can appear close, which is a potential source of misclassification.

In order to understand the inter-pattern distances within a class, within-class Euclidean distances are computed by drawing a random sample of 1000 patterns from the training data belonging to each class. The computations are carried out on the sample. Table 1 contains the intra-class statistics; the values are rounded to two decimal points. The table contains statistics of the distance computed between each pattern and every other pattern within the considered sample. The columns contain class-wise values of the mean, standard deviation, minimum and maximum of such distances. The following summary can be drawn from the table.

– The mean distance among the patterns of class 1 is the least among the classes.
– The minimum distance between two patterns within a class is nearly the same for all classes other than '1'.
Fig. 1. Sample Handwritten Digit Data. The figure brings out challenges in clustering and classifying the given data, bringing out variability within classes and similarity across the classes.
– The maximum distance between two patterns within a class is nearly 11.

Similarly, inter-class Euclidean distance statistics are computed by considering two random samples of 1000 patterns from the training data belonging to each of the classes. The distances are computed between patterns of one class and another, considering two classes at a time. Table 2 contains the results; its columns contain the arithmetic mean values of such distances. Table 3 contains the corresponding standard deviations. The following observations can be made from the tables.
Table 1. Intra-Class Euclidean Distance Statistics

  Class Label   Mean   S.D.   Min    Max
  0             7.39   1.05   2.45   11.14
  1             5.49   0.89   2.00    8.31
  2             7.67   0.91   2.45   10.86
  3             7.14   0.85   2.45   10.54
  4             7.32   0.85   2.45   10.54
  5             7.62   0.91   2.45   11.14
  6             7.15   0.92   2.45   10.49
  7             6.77   0.97   2.45   10.54
  8             7.38   0.90   2.45   10.58
  9             7.01   0.90   2.45   10.30

Table 2. Inter-Class Euclidean Mean Distance Statistics

  Class  1     2     3     4     5     6     7     8     9
  0      8.3   8.2   7.9   8.3   8.1   7.9   8.1   8.1   8.1
  1      -     7.7   7.6   7.5   7.8   7.6   7.3   7.7   7.6
  2      -     -     7.9   8.3   8.2   7.9   7.9   7.9   8.1
  3      -     -     -     8.1   7.8   8.0   7.6   7.7   7.8
  4      -     -     -     -     8.0   8.0   7.7   7.8   7.5
  5      -     -     -     -     -     8.0   8.0   7.9   7.9
  6      -     -     -     -     -     -     8.2   7.9   8.2
  7      -     -     -     -     -     -     -     7.7   7.2
  8      -     -     -     -     -     -     -     -     7.6

Table 3. Inter-Class Standard Deviation Statistics of Euclidean Distance

  Class  1     2     3     4     5     6     7     8     9
  0      0.8   0.8   0.7   0.8   0.8   0.9   0.7   0.8   0.8
  1      -     0.8   0.7   0.8   0.7   0.8   0.9   0.9   0.8
  2      -     -     0.8   0.7   0.7   0.7   0.9   0.8   0.8
  3      -     -     -     0.7   0.8   0.7   0.7   0.8   0.8
  4      -     -     -     -     0.7   0.7   0.8   0.7   0.8
  5      -     -     -     -     -     0.7   0.7   0.8   0.8
  6      -     -     -     -     -     -     0.6   0.7   0.7
  7      -     -     -     -     -     -     -     0.8   0.8
  8      -     -     -     -     -     -     -     -     0.8
– The inter-class distances between class 0 and every other class are high.
– The average distance between classes 7 and 9 is the least.
– The average distance between classes 6 and 7 is high, which can also be visualized from their apparent dissimilarity.
– It can be noticed from Table 3 that the standard deviation of the inter-class distances is similar across class pairs and is of the order of 1.
Statistics like correlation between patterns belonging to two classes are not provided, as the above given statistics are sufficient for choosing distance thresholds for the clustering algorithms of the proposed system. The proposed system is discussed in Section 4.
4 Description of Proposed System
Large data offers a number of challenges in terms of storage of the original set of patterns and also of the representative patterns. The choice of clustering algorithms for large data depends on how many scans of the given dataset are required by the algorithm, its time complexity, the nature of the clusters generated, etc. A good discussion on clustering of large datasets is provided by Jain, Murty and Flynn [9]. A single clustering algorithm or approach is not necessarily adequate to solve every clustering problem [9]; different clustering algorithms capture different aspects of the data. With the objective of finding good representatives of the given large dataset, it is useful to find representative patterns using different algorithms. We combine the results of three prototype generation approaches to obtain good prediction accuracy. The proposed system is shown in Figure 2. The overall scheme, with reference to the figure, is summarized below.

– Representative patterns are identified from the given large dataset using three different schemes. In the block diagram, three Clustering Agents 1, 2 and 3 are shown. The agents correspond to three prototype selection algorithms and run concurrently.
– Classification Agents classify validation/test patterns based on the corresponding prototype sets provided by the Clustering Agents. The classification agents
Fig. 2. Multiagent System for large data clustering and classification. The figure indicates multiple agents for carrying out different concurrent as well as sequential activities.
label a test pattern using the k-Nearest Neighbour Classifier [5] based on the representative patterns obtained from each of the schemes.
– The Combiner Agent combines the results provided by each Classifier Agent.

We elaborate each of these activities in the following subsections.

4.1 Leader Clustering
The leader clustering algorithm [16] starts with a distance threshold and a random pattern as the first leader. The algorithm is given below in a few steps. Its complexity is O(n) and it requires a single database scan. The thresholds are chosen based on the results of the preliminary analysis.

1. Choose a dissimilarity threshold. Start with an arbitrary pattern; call it Leader-k, where k = 1.
2. For i = 1 to n (the total number of patterns in the training data):
   – For j = 1, ..., k, find the dissimilarity between Leader-j and the training pattern.
   – If the distance is less than the threshold, place the pattern in the already-classified set. Else, consider the pattern as a new leader and update k.
3. Repeat step 2 until the end of the training data.

In the current study, we follow a divide and conquer approach [9] to cluster the 10-class data. This requires individual distance thresholds for each of the classes for leader clustering.
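A minimal sketch of the leader algorithm described above is given below, assuming numeric feature vectors and Euclidean dissimilarity; the threshold and data are placeholders rather than the values used in the study.

```python
import numpy as np

def leader_clustering(patterns, threshold):
    """Single-scan O(n) prototype selection: each pattern either joins an
    existing leader (distance below threshold) or becomes a new leader."""
    leaders = []
    for x in patterns:
        for lead in leaders:
            if np.linalg.norm(x - lead) < threshold:
                break                      # covered by an existing leader
        else:
            leaders.append(x)              # no leader close enough: new leader
    return np.array(leaders)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.integers(0, 2, size=(1000, 192)).astype(float)  # toy 192-bit patterns
    prototypes = leader_clustering(data, threshold=7.0)
    print(len(prototypes), "leaders selected")
```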
4.2 Random Sampling
The second scheme consists of drawing a simple random sample with replacement from the given set of training data. It involves a pseudo-random number generator and can be summarized as follows.

– Choose the sample size.
– Generate pseudo-random numbers within the range of the input training data size with a good random generator. This selects the serial numbers of the training patterns with replacement until the given sample size is reached.
– Select those patterns from the input training data. This forms the prototype set.
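A brief sketch of this selector is given below, assuming the patterns are held in a NumPy array; the sample size shown is arbitrary and not the value reported later.

```python
import numpy as np

def random_prototypes(patterns, sample_size, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(patterns), size=sample_size)  # serial numbers drawn with replacement
    idx = np.unique(idx)                                    # leave out repeated picks
    return patterns[idx]

# Usage (hypothetical): prototypes = random_prototypes(training_data, sample_size=11000)
```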
4.3 Frequent Item Set Approach with Leader Algorithm
We use the concept of frequent itemsets [7,14] to improve classification. The given patterns are represented by binary-valued features. The procedure and its salient points can be summarized as follows.

– We compute the number of times a feature occurs across the entire training dataset, which we call its support.
– It was found earlier [14] that with increasing support the classification accuracy improves up to a certain limit and reduces thereafter.
– For each of the training patterns, we consider only those features above a certain support; these features are recorded.
– Only this set of selected features is used when comparing with the validation and test datasets.
– The training dataset is formed with only those selected features.
– Leader clustering is performed on this dataset.
– The same clustering algorithm is preferred for its advantage of O(n) time complexity and abstraction generation with a single database scan.
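For binary features, the support of a feature is simply its column sum over the training set, so the selection step can be sketched as follows; the data here is synthetic and the threshold value only echoes the experiments reported later.

```python
import numpy as np

def frequent_features(binary_patterns, min_support):
    support = binary_patterns.sum(axis=0)          # occurrences of each feature across the data
    keep = np.where(support >= min_support)[0]     # features above the support threshold
    return keep, binary_patterns[:, keep]          # reduced training set

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    train = rng.integers(0, 2, size=(60000, 192))  # synthetic binary patterns
    kept, reduced = frequent_features(train, min_support=5000)
    print("features kept:", len(kept), "reduced shape:", reduced.shape)
```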
4.4 Classifier Agent
The validation and test data are classified using the k-Nearest Neighbour Classifier [5]: for each test pattern, the k nearest neighbours are identified and their labels are recorded, and a label is assigned to the test pattern according to the majority of labels among the k neighbours. The assigned label is compared with the true label of the test pattern to compute the classification accuracy. A table of assigned labels is maintained for each prototype set-classifier combination.
4.5 Combiner Agent
The combiner collects the assignments of the classifiers and assigns a label to the given test pattern. We use majority voting as the combiner [10]. We present a summary of the experimentation and results in Section 5.
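A compact sketch of the classification and combiner agents is given below: each prototype set labels the test data with a k-NN rule and the combiner takes a majority vote over the resulting label tables. The data, the value of k and the prototype sets are illustrative only.

```python
import numpy as np

def knn_predict(prototypes, proto_labels, test, k=1):
    preds = []
    for x in test:
        d = np.linalg.norm(prototypes - x, axis=1)
        nearest = proto_labels[np.argsort(d)[:k]]
        preds.append(np.bincount(nearest).argmax())   # majority label among the k neighbours
    return np.array(preds)

def combine_by_majority(label_tables):
    stacked = np.vstack(label_tables)                 # one row per classification agent
    return np.array([np.bincount(col).argmax() for col in stacked.T])

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    protos = [rng.integers(0, 2, (50, 192)).astype(float) for _ in range(3)]
    labels = [rng.integers(0, 10, 50) for _ in range(3)]
    test = rng.integers(0, 2, (5, 192)).astype(float)
    tables = [knn_predict(p, l, test, k=3) for p, l in zip(protos, labels)]
    print(combine_by_majority(tables))
```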
5 Description of Experimentation and Results
The large, high-dimensional dataset is considered for prototype selection. Table 4 contains a summary of the data under study.

Table 4. Data Summary

  Data Set     No. of patterns
  Training     60000
  Validation    6700
  Test         33330

Experiments are conducted on the training data to select prototypes using each prototype selection algorithm. The representability of each prototype set is validated by labeling the validation data, and the parameter set that provides the best classification accuracy on the validation dataset is used for classifying the test data. Prototype Selection Algorithm 1 is the leader clustering algorithm. The experiments with this algorithm are summarized below.

– The initial study involves selecting prototypes based on various distance thresholds.
Table 5. Classification Accuracy (C.A.) with Leaders

  No. of leaders   C.A. with validation dataset   C.A. with test dataset
  6402             99.0%                          89.42%
– The initial set of thresholds for the experimentation is chosen based on the preliminary results given in Tables 1, 2 and 3.
– For the chosen thresholds, prototype sets are generated using the leader algorithm.
– Each of these sets is used to classify the validation dataset.
– The representative set that provides the best classification accuracy on the validation dataset is used for classifying the test patterns. The lists of assignments for various k values of the kNNC are recorded for use by the Combiner Agent.

Table 5 contains the results with the leader clustering algorithm; the result of the kNNC for k = 1 is presented. It should be noted that the number of prototypes is about 11% of the entire training dataset, and it provides a classification accuracy of 89.42% on the test data.

Prototype Selection 2 is a large random sample. The experiments carried out are summarized below.

– Experiments are carried out starting from a sample size of 1000 and increasing it in steps of 1000.
– The selected sample forms the representative set.
– Using this set, the validation dataset is labeled.
– The sample that provides the best classification accuracy on the validation dataset is used for classifying the test dataset.
– The classification history is preserved for further use by the Combiner Agent.

Table 6 contains the results. The sample size that provides the best results with the validation dataset is 10922, after leaving out patterns repeated due to simple random sampling with replacement. The classification accuracy in the table corresponds to k = 1 of the kNNC; the classification accuracy with this prototype set is 88.9%.

Table 6. Classification Accuracy (C.A.) with Random Sampling

  Sample size   C.A. with validation dataset   C.A. with test dataset
  10922         97.5%                          88.9%
Prototype Selection 3 involves identifying frequent features among the 192 features of each pattern. The salient points of the experimentation are summarized below.
Table 7. Classification Accuracy (C.A.) with Frequent Item Approach

  Representative set size   C.A. with validation dataset   C.A. with test dataset
  6269                      98.59%                         89.67%
– On analysis it is found that in the training dataset of 60000 patterns, feature number 192 occurs the least number of times, 797, and feature number 32 occurs the maximum number of times, 44356, with all other features occurring in between the two.
– By choosing a support value of, say, 798, feature 192 gets excluded.
– Based on this preliminary analysis, we choose different support values for the features.
– With the chosen support threshold, the training data is generated with the selected features.
– The leader clustering algorithm is used on this training data, resulting in a prototype set. It should be noted that the distance thresholds for the leader clustering algorithm are again found by experimentation.
– The prototype set is used to classify the validation dataset. The experiments are repeated with various values of the support threshold, and the support threshold which provides the best classification accuracy with the validation dataset is chosen.
– The test dataset is classified with the help of the prototype set, and the classification history is preserved for use by the Combiner Agent.

With experimentation, we note that with a support threshold of 5000 we reduce the number of features by 55. The results obtained with the corresponding prototype set are summarized in Table 7: with a support threshold of 5000, the leader prototype set has 6269 prototypes and the classification accuracy on the test data is 89.67%. After obtaining the results from the three prototype selection algorithms and the classifiers, the Combiner Agent combines the results. Combining the results of the three approaches using majority voting, the overall classification accuracy on the given test data is 94.45% for k = 7 using the kNNC. The result obtained is better than that reported earlier on a smaller handwritten digit dataset [14].
6 Summary and Conclusions
Large data offers a number of challenges, and in order to deal with it one resorts to abstraction. There exist a number of methods for generating an abstraction. Depending on the nature of the data and the algorithm that generates the abstraction, the sets of representative patterns can differ. Taking these issues into account, we propose a multi-agent system that concurrently generates the representative patterns as well as the classification results obtained with each set of representative patterns. The Combiner Agent combines the results by majority voting and provides the final result. We employ three algorithms, viz., the leader clustering algorithm, a large random sample and the frequent item approach. The labels provided by each of
the representative pattern sets are combined by majority voting and the final result is provided. The classification accuracies provided by the individual prototype sets are 89.42%, 88.9% and 89.67%, while the combined accuracy obtained by majority voting is 94.45%. It is clear that such a scheme provides classification accuracy better than that of each of the individual schemes. Further, since we deal with a prototype set instead of the entire data, the scheme is more efficient than working with the entire data. Different clustering algorithms focus on different aspects of the data; thus, combining multiple prototype selection algorithms through a combiner agent ensures that proper representatives are identified, albeit at a small additional cost in complexity. Further work involves comparison with other combiner approaches, including adaptive boosting, and the use of various other clustering schemes with which the classification can be improved beyond the current value. The current work focuses on data reduction as a means to handle data explosion. Other directions for further exploration include VC-dimension based trade-off studies and weighting [19] of dynamically arriving new data, which equivalently addresses the order dependence of the data.
References 1. Agogino, A., Tumer, K.: Efficient Agent-Based Clustering Ensembles. In: AAMAS 2006 (2006) 2. Buccafurri, F., Rosacci, D., Sarne, G.M.L., Ursino, D.: An agent-based hierarchical clustering approach for e-commerce environments. In: Bauknecht, K., Tjoa, A.M., Quirchmayr, G. (eds.) EC-Web 2002. LNCS, vol. 2455, pp. 109–118. Springer, Heidelberg (2002) 3. Agent-Mining Interaction and Integration(AMII), http://www.agentmining.org 4. Cochran, W.: Sampling Techniques. John Wiley & Sons, New York (1963) 5. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley & Sons, Wiley-interscience (2000) 6. Gurruzzo, S., Rosaci, D.: Agent Clustering Based on Semantic Negotiation. ACM Trans. on Autonomous and Adaptive Systems (Article 7) 3(2), 7:1–7:40 (2008) 7. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proc. of ACM SIGMOD International Conference of Management of Data (SIGMOD 2000), Texas, pp. 1–12 (2000) 8. Liu, H., Motoda, H. (eds.): Computational Methods in Feature Selection. Chapman & Hall, CRC, FL (2008) 9. Jain, A.K., Murty, M.N., Flynn, P.J.: Data Clustering: A Review. ACM Computing Review 31(3), 264–323 (1999) 10. Kittler, J., Hatef, M., Duin, P.W., Matas, J.: On Combining Classifiers. IEEE Trans. on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998) 11. Ogston, E., Overreinder, R., van Steen, M., Brazier, F.: A method for decentralizing clustering in large multi-agent systems. In: Proc. of AAMAS 2003, ACM-SIGART, pp. 789–796 (2003) 12. Ouchiyama, H., Hunag, R., Ma, J.: An Evoluationary Rule-based Multi-agents System. In: Emergent Intelligence of Networked Agents. SCI, vol. 56, pp. 203–215. Springer, Heidelberg (2007) 13. Park, J., Oh, K.: Multi-Agent Systems for Intelligent Clustering. Proc. of World Academy of Science, Engineering and Technology 11, 97–102 (2006)
14. Ravindra Babu, T., Narasimha Murty, M., Agrawal, V.K.: On simultaneous selection of prototypes and features in large data. In: Pal, S.K., Bandyopadhyay, S., Biswas, S. (eds.) PReMI 2005. LNCS, vol. 3776, pp. 595–600. Springer, Heidelberg (2005) 15. Ravindra Babu, T., Narasimha Murty, M., Subrahmanya, S.V.: Multiagent Systems for Large Data Clustering. In: Cao, L. (ed.) Data Mining and Multi-agent Interaction, ch.15, pp. 219–238. Springer, Heidelberg (2009) 16. Spath, H.: Cluster Analysis - Algorithms for Data Reduction and Classification of Objects. Ellis Horwood Limited., West Sussex (1980) 17. Weiss, G. (ed.): Multiagent Systems - A modern approach to Distributed Artificial Intelligence. The MIT Press, Cambridge (2000) 18. Wooldridge, M., Jennings, N.R.: Towards a theory of cooperative problem solving. In: Proc. of the Workshop of Distributed Software Agents and Applications, Denmark, pp. 40–53 (1994) 19. Zhang, S., Zhang, C., Yan, X.: Post-mining: maintenance of association rules by weighting. Information systems 28, 691–707 (2003)
Fractal Based Video Shot Cut/Fade Detection and Classification Zeinab Zeinalpour-Tabrizi1, Amir Farid Aminian-Modarres2, Mahmood Fathy1, and Mohammad Reza Jahed-Motlagh1 1
Computer Engineering Faculty ,Iran University of Science and Technology, Tehran, Iran 2 Sadjad Institute of Higher Education, Mashhad, Iran [email protected], {afamodarres,mahfathi,jahedmr}@iust.ac.ir
Abstract. Video segmentation plays an important role in video indexing, content-based video coding and retrieval. In this paper, we propose a new method for cut and fade detection using the fractal dimension. We also classify frames into three categories: "CUT", "FADE IN/OUT", and "None SHOT". To test our method, we used 20 videos containing more than 33,000 frames on different subjects and including different types of shot boundaries. The method was also successfully compared to two other shot boundary detection methods. The experimental results show the improved precision and recall of the proposed method in recognizing fade in/out frames in our database.

Keywords: Detection accuracy, fractal dimension, shot boundary detection, video analysis, video frame classification.
1 Introduction

Today, indexing and retrieval of digital videos is an active research area, and shot boundary detection is a fundamental step for the organization of large video data. Therefore, the issue of analyzing and automatically indexing video content by retrieving highly representative information (e.g., shot boundaries) has been raised in the research community. The video has to be analyzed to achieve high accuracy in video processing. In content-based video processing, shot boundary detection is usually considered with respect to the different levels of video structure (frame, shot, scene and scenario) [1, 2]. A video shot is defined as a sequence of frames captured by one camera in a single continuous action in time and space [3]. According to whether the transition between shots is abrupt or gradual, shot boundaries can be categorized into two types: cut (CUT) and gradual transition (GT). The GT can be further classified into dissolve, wipe, fade out/in (FOI), etc., according to the characteristics of the different editing effects [4]. The different types of shot boundaries are illustrated in Fig. 1. A large number of shot boundary detection methods have been proposed. Pair-wise comparison of pixels (also called template matching) evaluates the differences in color or light intensity between corresponding pixels in two sequential
frames. Although some irrelevant frame differences can be filtered out, these approaches are still sensitive to the movements of objects and the camera. In [5, 6] this method has been used to detect shot boundaries.
Fig. 1. Different types of shot boundaries: (a) abrupt shot, (b) fade in, (c) fade out, (d) dissolve
A block-based method has been proposed in [7], in which each frame is divided into 12 non-overlapping blocks and, for each of them, the best match in the corresponding neighbouring blocks of the previous image is found based on light intensity. In [8], segmenting each frame into 4×4 regions and comparing the color histograms of corresponding regions has been proposed. In [9], the idea of spatial video sampling has been extended to both the temporal and the spatial domain. In [10], two features, the histogram difference and the pair-wise comparison of pixels, have been combined in a clustering method, with the result that when these filtered features are complementary, they lead to the recognition of existing shots with higher accuracy. The first work analyzing a video directly in the compressed domain was presented in [11], in which a method to recognize abrupt shots based on the discrete cosine transform coefficients of frames is proposed. In [12], so-called DC images were created and compared. DC images are spatially reduced versions of the original images: the (i,j) pixel of the DC image is equal to the average of block (i,j) of the original image. In [13], shot boundaries are detected by comparing the color histograms of the DC images of sequential frames; these images are formed from the DC terms of the discrete cosine transform coefficients of a frame. In [14], the authors used information theory to recognize abrupt and dissolve shot boundaries. In [15], the authors used a genetic algorithm for video segmentation; the
inverse of the frame similarity value, which is calculated from color histograms, is used to compute the fitness function value. For shot boundary detection, several surveys also exist [16], [17], [18]. In this paper, a novel and highly accurate method is proposed to detect abrupt and fade in/out shot boundaries using, for the first time, the fractal dimension; its performance compared with previous methods is similar for cuts and remarkable for fades. The proposed method is elaborated in the next section. The results obtained with the proposed method on the data collection described in [19], [20] are given in Section 3 and are compared with the method proposed in [14] on our dataset. Finally, a summary and suggestions for future work are presented.
2 Proposed Method

The proposed algorithm performs abrupt and fade in/out shot boundary detection by classifying frames using their fractal dimension. We compute the fractal dimension of each frame and of the pair-wise frame differences in order to classify frames into three classes: "CUT", "FADE IN/OUT" and "None SHOT". The flowchart of our method is illustrated in Fig. 2. In the following subsections, we explain each step of our method in detail.

2.1 Preprocess

In this step, the content of each frame is prepared for computing its fractal dimension. Assume that V is a grayscale video with p frames, defined as follows:
V = { v^(1), v^(2), ..., v^(p) }    (1)
where v^(k) is the k'th frame of the video, for 1 ≤ k ≤ p, and the size of each frame is M×N:
v^(k) = { v^(k)_{i,j} }_{M×N},   1 ≤ i ≤ M,  1 ≤ j ≤ N,  1 ≤ k ≤ p    (2)
where v^(k)_{i,j} is the gray value of the pixel at horizontal position i and vertical position j of the k'th frame. As shown in Fig. 2, all of the above computations must also be performed on the pair-wise frame differences, defined as follows:
DV = { d^(1), d^(2), ..., d^(p-1) }    (3)
where DV denotes the pair-wise frame differences of the video V, with:
d^(k) = { v^(k+1)_{i,j} − v^(k)_{i,j} }_{M×N},   1 ≤ i ≤ M,  1 ≤ j ≤ N,  1 ≤ k ≤ p−1    (4)
Fig. 2. Proposed method’s flowchart
To prepare the frames for computing their fractal dimension, we transform each grayscale image into a three-dimensional surface. The frames and their pair-wise differences are transformed into three-dimensional surfaces using the following formulas:
v̂^(k) = { v̂^(k)_{i,j,l} }_{M×N×256}    (5)

d̂^(k) = { d̂^(k)_{i,j,l} }_{M×N×256}    (6)
which satisfy the following relations:

v̂^(k)_{i,j,l} = 1 if l = v^(k)_{i,j} + 1, and 0 otherwise    (7)

d̂^(k)_{i,j,l} = 1 if l = d^(k)_{i,j} + 1, and 0 otherwise    (8)
where 1 ≤ i ≤ M, 1 ≤ j ≤ N, 1 ≤ l ≤ 256. A sample of the three-dimensional surface of a frame and of a pair-wise frame difference is illustrated in Fig. 3. The next step of preprocessing is transforming the three-dimensional images to the frequency domain using the discrete Fourier transform. The self-similarity property of fractals corresponds to the frequency behaviour of the signal in the frequency domain; thus, the fractal dimension calculated in the frequency domain has a perceptual relationship with the temporal variations of the frames. Therefore, we calculate the Fourier transform of the frames and of their pair-wise differences at this step. Using the definition of the discrete Fourier transform [21], the three-dimensional discrete Fourier transform of each frame is calculated as follows:
(9)
in which:

VT^(k)_{x,y,z} = (1 / (M·N·256)) Σ_{i=1..M} Σ_{j=1..N} Σ_{l=1..256} v̂^(k)_{i,j,l} · exp[ −j2π( xi/M + yj/N + zl/256 ) ]    (10)
In a similar way, the frame differences are transformed to the frequency domain using the following formula:
(11)
In which:
ሺሻ ǡǡ ൌ
ʹͷ
ͳ ሺሻ ሾǦjʹɎቀ ቁሿ ʹͷ ሿ ሾǡǡ ò òòʹͷ ൌͳ ൌͳ ൌͳ
(12)
In this way, the preprocessing steps required for calculating the fractal domain are completed.
a
(b)
(c)
Fig. 3. Operation of making the grayscale frame and their differences with next frame, three dimension (a) current frame (right side) and next frame ( left side), (b) three dimension of current frame (c) three dimension differences of two frame
2.2 Calculating Fractal Dimension In this step, the fractal dimension of Fourier transform of three dimensional frames
and their pair-wise difference ݀ܶ T is calculated. For Fourier coefficients of each ܸܶ F(k) frames, the power spectral density (PSD) are computed:
S^(k)(u,v) = || VT^(k)(u,v) ||²    (13)
The coordinate system is then transformed to polar coordinates in order to compute the power spectrum density needed for the fractal dimension computation:

f = √(u² + v²)    (14)

θ = tan⁻¹(v/u)    (15)
After this change of variables, S^(k)(u,v) is converted to S^(k)(f,θ). S^(k)(f,θ) is then aggregated over θ for every radial frequency f:

S^(k)(f) = Σ_θ S^(k)(f,θ)    (16)
In [25] it is proved that the power spectrum of such a surface shows a linear variation between the logarithm of S(f) and the logarithm of the frequency:

S^(k)(f) = c · |f|^(−β)    (17)
In other words:

log( S^(k)(f) ) = log(c) − β · log|f|    (18)
The slope of the linear regression line of these changes is related to the fractal dimension of the image as follows:

FD = (8 − β) / 2    (19)
Fig. 4. Two signals from the same video, containing a fade-out and an abrupt shot: (a) fractal dimension signal of the video frames, (b) fractal dimension signal of the pair-wise frame differences
where FD denotes the fractal dimension of the image. We have now prepared the two fractal features of the video, which are illustrated in Fig. 4. As can be seen in the first part of Fig. 4, the fractal feature peaks at fade in/out shots, while for the abrupt shots, illustrated in the second part, clear valleys appear in the signal.
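A simplified sketch of this spectral estimate is given below. For brevity it works on the 2-D FFT of a grayscale frame rather than the 3-D binary volume of Section 2.1, and the slope-to-dimension constant follows Eq. (19) as reconstructed above; both simplifications are assumptions rather than the authors' exact procedure.

```python
import numpy as np

def spectral_fractal_dimension(frame):
    F = np.fft.fftshift(np.fft.fft2(frame))
    S = np.abs(F) ** 2                                     # power spectral density, cf. Eq. (13)
    h, w = S.shape
    v, u = np.indices((h, w))
    f = np.sqrt((u - w // 2) ** 2 + (v - h // 2) ** 2).astype(int)  # radial frequency, cf. Eq. (14)
    counts = np.bincount(f.ravel())
    sums = np.bincount(f.ravel(), weights=S.ravel())
    valid = counts > 0
    radial = sums[valid] / counts[valid]                   # aggregate S over angle, cf. Eq. (16)
    freqs = np.arange(len(counts))[valid]
    mask = freqs > 0                                       # drop the DC term before the log-log fit
    slope, _ = np.polyfit(np.log(freqs[mask]), np.log(radial[mask] + 1e-12), 1)
    beta = -slope                                          # S(f) ~ c * |f|^(-beta), Eqs. (17)-(18)
    return (8.0 - beta) / 2.0                              # Eq. (19), as reconstructed above

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.integers(0, 256, size=(128, 128)).astype(float)
    print("estimated fractal dimension:", spectral_fractal_dimension(frame))
```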
2.3 Shot Boundary Detection and Classification
By examining the changes of the two calculated fractal features over the video, we can easily recognize their characteristic properties. More precisely, the fractal dimension value of a frame shows a clear peak at fade in/out shot boundaries, while the fractal dimension value of the pair-wise frame differences shows distinct valleys at abrupt shot boundaries. On this basis, in the last stage of the algorithm, two adaptive thresholds are computed, one for the fractal dimension of the frames using (20) and one for the fractal dimension of the pair-wise frame differences using (21), and the frames are divided into three classes: "CUT", "FADE IN/OUT" and "None SHOT".
MeanFD_VT × α₁ ≤ FD_VT ≤ MaxFD_VT × β₁    (20)

MinFD_dT × β₂ ≤ FD_dT ≤ MeanFD_dT × α₂    (21)
in which MaxFD_VT indicates the maximum of the fractal dimension of the discrete Fourier transform of the video frames, MeanFD_VT the average of the fractal dimension of the discrete Fourier transform of the video frames, MinFD_dT the minimum of the fractal dimension of the discrete Fourier transform of the pair-wise frame differences, and MeanFD_dT the average of the fractal dimension of the discrete Fourier transform of the pair-wise frame differences. The coefficients α₁, α₂, β₁ and β₂ have been determined experimentally. When one of the conditions (20) and (21) is met, the relevant frame belongs to one of the shot boundary classes: if condition (20) is met, the frame belongs to the "FADE IN/OUT" class; if it meets condition (21), the frame belongs to the "CUT" class; and if neither is met, the frame belongs to the "None SHOT" class.
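A hedged sketch of this final classification step is given below; the threshold conditions follow (20) and (21) as reconstructed above, and the coefficient values are placeholders, since the paper only states that they were tuned experimentally.

```python
import numpy as np

def classify_frames(fd_frames, fd_diffs, a1=1.08, b1=1.0, a2=0.9, b2=1.0):
    labels = []
    fade_lo, fade_hi = fd_frames.mean() * a1, fd_frames.max() * b1   # condition (20)
    cut_lo, cut_hi = fd_diffs.min() * b2, fd_diffs.mean() * a2       # condition (21)
    for k in range(len(fd_diffs)):
        if fade_lo <= fd_frames[k] <= fade_hi:
            labels.append("FADE IN/OUT")
        elif cut_lo <= fd_diffs[k] <= cut_hi:
            labels.append("CUT")
        else:
            labels.append("None SHOT")
    return labels

# Example with synthetic fractal-dimension signals:
rng = np.random.default_rng(3)
fd_frames = rng.normal(2.6, 0.05, 300); fd_frames[120] = 2.9   # a fade-like peak
fd_diffs = rng.normal(2.4, 0.05, 299);  fd_diffs[200] = 2.0    # a cut-like valley
print([(i, l) for i, l in enumerate(classify_frames(fd_frames, fd_diffs)) if l != "None SHOT"])
```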
3 Experimental Results

In this section, the proposed method is tested on 20 videos collected from [19] and from the Islamic Republic of Iran News Network (IRINN) [20]. The collected videos contain many commercials, characterized by significant camera parameter changes such as zoom-ins/outs, pans and abrupt camera movements, as well as significant object and camera motion inside single shots. The number of shots per video is between 0 and 19, with an average of 8 shots per video. We successfully compared our method with two other shot boundary detection methods. In order to evaluate the performance of our shot detection method presented in the last section, measures (22) and (23) were used, inspired by the receiver operating
characteristics in statistical detection theory [23], [24]. "Precision" is the fraction of detected shots that are true shots, and "Recall" is the fraction of all true shots in the video that are detected:

Precision = true detected shots / detected shots    (22)

Recall = true detected shots / all true shots    (23)
We use two other features so that we can compare the performance of the proposed algorithm with previous work. One of these features is the pixel-wise difference between consecutive frames. This method shows high variance on sequences with high motion in such segments of the video; on the other hand, segments that change only in luminance show less variation in the pixel-wise differences. Mutual information between two consecutive frames was the other feature tested on our dataset. For computing the mutual information between two frames, the following formula is used:

I^R_{t,t+1} = Σ_{i=0..N-1} Σ_{j=0..M-1} C^R_{t,t+1}(i,j) · log[ C^R_{t,t+1}(i,j) / ( C^R_t(i) · C^R_{t+1}(j) ) ]    (24)
where M and N indicate the size of the frames, and C^R_{t,t+1}(i,j) stands for the number of pixels whose color (here, the red band) changes from value i in frame f_t to value j in frame f_{t+1}. For colored videos, this value is computed separately for each RGB band and the results are finally added together. The results obtained with the proposed method and the two methods described above, using an adaptive threshold for classifying frames, are given in Table 1.
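For reference, a short sketch of this mutual-information baseline for a single band is given below; it uses the common joint-histogram formulation and is an assumption about the exact implementation used for the comparison.

```python
import numpy as np

def mutual_information(frame_a, frame_b, levels=256):
    joint, _, _ = np.histogram2d(frame_a.ravel(), frame_b.ravel(),
                                 bins=levels, range=[[0, levels], [0, levels]])
    joint /= joint.sum()                          # joint gray-level distribution C_{t,t+1}(i, j)
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0                                # accumulate only over non-zero cells
    return float((joint[nz] * np.log(joint[nz] / (pa[:, None] * pb[None, :])[nz])).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    a = rng.integers(0, 256, (64, 64))
    print("MI(a, a)        =", mutual_information(a, a))
    print("MI(a, shuffled) =", mutual_information(a, rng.permutation(a.ravel()).reshape(64, 64)))
```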
Table 1. Average of precision and recall of the described methods

                         Fade in/out              Cut
                         Recall     Precision     Recall     Precision
Mutual Information       73.87%     92.41%        88.63%     90.21%
Proposed Method          96.17%     100%          88.89%     94.23%
Pixel-wise Difference    41.40%     50.37%        90.58%     95.43%
Table 1 shows that the precision and recall of the proposed method are better for fade in/out transitions, while for cuts it also achieves acceptable performance. The main drawback of the proposed method is its high computational time complexity, which grows exponentially with the length of the video. In future work we will try to reduce the method's time complexity and to improve its performance for "dissolve" as another type of gradual shot boundary.
4 Conclusion
A novel technique for shot transition detection has been presented. We described a new method for classifying video frames with respect to shot boundaries, using the fractal dimension as a feature. To our knowledge, this is the first time that the fractal dimension of Fourier transform coefficients has been used for shot boundary detection.
Our method consists of two steps. First, we compute the fractal dimension of the Fourier coefficients of each gray-level frame and of the pairwise frame differences. Then, we classify each frame of the input video into one of three categories: "CUT", "FADE IN/OUT" or "None SHOT". Thus our method detects shot boundaries and also determines the type of the detected shots. A video dataset containing 20 videos, including film, sport, studio news, advertisements, political talks and TV series logos, was used for evaluation. As illustrated in the previous section, the performance of the proposed algorithm is better than that of two common methods (pixel difference and mutual information) on our dataset. We conclude that the fractal dimension is a more suitable feature than mutual information for shot boundary detection and classification.
References 1. Smoliar, S.W., Zhang, H.-J.: Content-based video indexing and retrieval. IEEE Multimedia 1(2), 62–72 (1994) 2. Lienhart, R., Pfeiffer, S., Effelsberg, W.: Video abstracting. ACM Commun. 40(12), 55– 62 (1997) 3. Cabedo, X.U., Bhattacharjee, S.K.: Shot detection tools in digital video. In: Proc. NonLinear Model Based Image Analysis, pp. 121–126 (1998) 4. Kobla, V., DeMenthon, D., Doermann, D.: Special effect edit detection using videotrails: a comparison with existing techniques. In: Proc. SPIE Conf. Storage Retrieval Image Video Databases VII, pp. 302–313 (January 1999) 5. Nagasaka, A., Tanaka, Y.: Automatic Video Indexing and Full-Video Search for Object Appearances. In: Proceedings of the IFIP TC2/WG 2.6 Second Working Conference on Visual Database Systems II, pp. 113–127. North- Holland Publishing Co., Amsterdam (1992) 6. Zhang, H., Kankanhalli, A., Smoliar, S.W.: Automatic partitioning of full-motion video. Multimedia Systems 1, 10–28 (1993) 7. Shahraray, B.: Scene change detection and content-based sampling of video sequences (April 1995) 8. Nagasaka, A., Tanaka, Y.: Automatic Video Indexing and Full-Video Search for Object Appearances. In: Proceedings of the IFIP TC2/WG 2.6 Second Working Conference on Visual Database Systems II, pp. 113–127. North- Holland Pblishing Co., Amsterdam (1992) 9. Xiong, W., Lee, J.C.-M.: Efficient Scene Change Detection and Camera Motion Annotation for Video Classification. Computer Vision and Image Understanding 71(2), 166–181 (1998) 10. Ferman, A.M., Tekalp, A.M.: Efficient Filtering and Clustering Methods for Temporal Video Segmentation and Visual Summarization. Journal of Visual Communication and Image Representation 9, 336–351 (1998) 11. Arman, F., Hsu, A., Chiu, M.: Image processing on compressed data for large video databases. In: Proceedings of the First ACM International Conference on Multimedia, pp. 267– 272. ACM, Anaheim (1993) 12. Yeo, B., Liu, B.: Rapid scene analysis on compressed video. IEEE Transactions on Circuits and Systems for Video Technology 5, 533–544 (1995)
13. Shen, K., Delp, E.J.: A fast algorithm for video parsing using MPEG compressed sequences. In: Proceedings of the 1995 International Conference on Image Processing, vol. 2, pp. 252–255. IEEE Computer Society, Los Alamitos (1995) 14. Cernekova, Z., Pitas, I., Nikou, C.: Information theory-based shot cut/fade detection and video summarization. IEEE Transactions on Circuits and Systems for Video Technology 16, 82–91 (2006) 15. Chiu, P., Girgensohn, A., Polak, W., Rieffel, E., Wilcox, L.: A genetic algorithm for video segmentation and summarization. In: IEEE International Conference on Multimedia and Expo., ICME 2000, vol. 3, pp. 1329–1332 (2000) 16. Lienhart, R.: Reliable transition detection in videos: a survey and practitioner’s guide. Int. J. Image Graph. 1(3), 469–486 (2001) 17. Albanese, M., Chianese, A., Moscato, V., Sansone, L.: A Formal Model for Video Shot Segmentation and its Application via Animate Vision. Multimedia Tools and Applications 24, 253–272 (2004) 18. Yuan, J., Wang, H., Xiao, L., Zheng, W., Li, J., Lin, F., Zhang, B.: A Formal Study of Shot Boundary Detection. IEEE Transactions on Circuits and Systems for Video Technology 17, 168–186 (2007) 19. The Open Video Project (a shared digital video collection), http://www.open-video.org/index.php 20. Iran News Network, http://www.irinn.ir/ 21. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Prentice-Hall, Englewood Cliffs (2002) 22. Falconer, K.: Fractal Geometry: Mathematical Foundations and Applications. John Wiley & Sons, Chichester (1997) 23. Browne, P., Smeaton, A.F., Murphy, N., O’Connor, N., Marlow, S., Berrut, C.: Evaluation and combining digital video shot boundary detection algorithms. Presented at the 4th Irish Machine Vision and Information Processing Conf., Belfast, Ireland (2000) 24. Metz, C.E.: Basic principles of ROC analysis. Seminars in Nuclear Medicine 8, 283–298 (1978) 25. Russ, J.C.: Fractal Surfaces. Plenum Press, New York (1994)
Performance Evaluation of Constraints in Graph-Based Semi-supervised Clustering

Tetsuya Yoshida

Grad. School of Information Science and Technology, Hokkaido University, N-14 W-9, Sapporo 060-0814, Japan
[email protected]
Abstract. Semi-supervised learning has been attracting much interest as a way to cope with vast amounts of data. When similarities among instances are specified, the entire data can be represented as an edge-weighted graph by connecting each pair of instances with an edge. Based on this graph representation, we have proposed a graph-based approach for semi-supervised clustering, which modifies the graph structure using contraction from graph theory and the graph Laplacian from spectral graph theory. In this paper we conduct extensive experiments over various document datasets and report its performance evaluation with respect to the type of constraints as well as the number of constraints. We also compare it with other state-of-the-art methods in terms of accuracy and running time, and the results are encouraging. In particular, our approach can leverage a small amount of pairwise constraints to increase the performance.
1 Introduction
Recently, semi-supervised learning, i.e., learning from both labeled and unlabeled data, has been attracting much interest in the data mining and machine learning communities [3]. One of the reasons is that, in addition to a small amount of supervised information such as labeled instances, unlabeled data, which is relatively easy to collect, can be utilized to improve the performance of learning systems significantly. [2] showed the PAC learnability of semi-supervised learning, and other approaches tried to extend this work [9]. On the other hand, as "unsupervised" learning that does not require supervised information such as labeled data, clustering has been studied in statistics and machine learning. The objective of clustering is to create groups of instances (called clusters) such that instances in the same cluster are similar to each other and instances in different clusters are not. Although labeled data is not required in clustering, constraints on data assignment are sometimes available as domain knowledge about the data to be clustered. In such a situation, it is desirable to utilize the available constraints as semi-supervised information to improve the performance of clustering [14]. By regarding constraints on data assignment as supervised information, various research efforts have been conducted on semi-supervised clustering
[14,18,1,15,11]. Among them, [15] proposed a feature projection approach for handling high-dimensional data, and reported that it outperformed other existing methods. However, although pairwise constraints among instances are dealt with, pairwise relations among instances are not explicitly utilized. When similarities among instances are specified, by connecting each pair of instances with an edge, the entire data can be represented as an edge-weighted graph. Based on pairwise relations among instances, we have proposed a graph-based approach for semi-supervised clustering (GBSSC) and reported a preliminary evaluation [20]. In GBSSC, the graph structure for the entire data is modified by contraction in graph theory [7] and graph Laplacian in spectral graph theory [4,16] to reflect the pairwise constraints. The entire data is projected onto a subspace which is constructed via the modified graph, and clustering is conducted over the projected representation. In this paper we conduct extensive experiments over various document data using 20 Newsgroup and TREC datasets (which have been widely utilized as benchmark datasets), and report the performance evaluation of GBSSC with respect to the type of constraints as well as the number of constraints. We also compare it with other approaches [15,11] over various document datasets in terms of accuracy of cluster assignment and running time. The results are encouraging and shows its effectiveness in terms of the balance between the accuracy of cluster assignment and running time. Especially, it was shown that GBSSC can leverage small amount of pairwise constraints to increase the performance. 1.1
Problem Setting
We use a bold italic capital letter to denote a set. Let X be a set of instances. For a set X, |X| represents its cardinality. When supervised information on clustering is specified as a set of constraints, the semi-supervised clustering problem is described as follows. Problem 1 (Semi-Supervised Clustering). For a given set of data X and specified constraints, find a partition (a set of clusters) T = {t1 , . . . , tk } which satisfies the specified constraints. There can be various forms of constraints. Based on the previous work [17,18,15,11], we consider the following two kinds of constraints in this paper: must-link constraints and cannot-link constraints. Definition 1 (Pairwise Constraints). For a given set of data X and a partition (a set of clusters) T = {t1 , . . . , tk }, must-link constraints C ML and cannotlink constraints C CL are sets of pairs such that: ∃(xi , xj ) ∈ C ML ⇒ ∃t ∈ T , (xi ∈ t ∧ xj ∈ t)
(1)
∃(xi , xj ) ∈ C CL ⇒ ∃ta , tb ∈ T , ta ≠ tb , (xi ∈ ta ∧ xj ∈ tb )
(2)
Must-link constraints specifies the pairs of instances in the same cluster, and cannot-link constraints specifies the pairs of instances in different clusters. Organization. Section 2 explains the details of our approach for clustering under pairwise constraints. Section 3 reports the performance evaluation over various document datasets. Section 4 briefly discusses related work. Section 5 summarizes our contributions and suggests future directions.
2 Graph-Based Semi-supervised Clustering
2.1 Preliminaries
Let X be a set of instances. For a set X, |X| represents its cardinality. A graph G(V , E) consists of a finite set of vertices V , a set of edges E over V × V . The set E can be interpreted as representing a binary relation on V . A pair of vertices (vi , vj ) is in the binary relation defined by a graph G(V , E) if and only if the pair (vi , vj ) ∈ E. An edge-weighted graph G(V , E, W ) is defined as a graph G(V , E) with the weight on each edge in E. When |V | = n, the weights in W can be represented as an n by n matrix W1 , where wij in W stands for the weight on the edge for the pair (vi , vj ) ∈ E. We set wij = 0 for pairs (vi , vj ) ∈ E. In addition, we assume that G(V , E, W ) is an undirected, simple graph without self-loop. Thus, the weight matrix W is symmetric and its diagonal elements are zeros. 2.2
A Graph-Based Approach
By assuming that some similarity measure for the pairs of instances X is specified, we have proposed a graph-based approach for constrained clustering problem. Based on the similarities, the entire data X can be represented as an edge-weighted graph G(V , E, W ) where wij represents the similarity between a pair (xi , xj ). Since each data object x ∈ X corresponds to a vertex v ∈ V in G, we abuse the symbol X to denote the set of vertices in G in the rest of the paper. We assume that all wij is non-negative. Definition 1 specifies two kinds of constraints. For C ML , our approach utilizes a method based on graph contraction in graph theory [7] and treat it as hard constraints (Sections 2.3); for C CL , our approach utilizes a method based on graph Laplacian in spectral graph theory [4,16] and treat it as soft constraints under the optimization framework (Section 2.4). The overview of our approach is illustrated in Fig. 1. 2.3
Graph Contraction for Must-Link Constraints
For must-link constraints C ML in eq.(1), the transitive law holds; i.e., for any two pairs (xi , xj ) and (xj , xl ) ∈ C ML , xi and xl should also be in the same cluster. In order to enforce the transitive law in C ML , we utilize graph contraction [7] to 1
A bold italic symbol W denotes a set, while a bold symbol W denotes a matrix.
[Figure 1 appears here: the original graph (X, W) is transformed by contraction for the must-link constraints into (X′, W′), then by the graph Laplacian for the cannot-link constraints into (X′, W″), and clustering is applied to the result.]
Fig. 1. Overview of graph-based projection approach
the graph G for a data set X. By contracting an edge e = (xi, xj) into a new vertex xe, the new vertex becomes adjacent to all the former neighbors of xi and xj. Recursive application of contraction guarantees that the transitive law in C ML is sustained in the cluster assignment.
In our approach, the entire data X is represented as an edge-weighted graph G. Thus, after contracting an edge e = (xi, xj) ∈ C ML into the vertex xe, it is necessary to define the weights in the contracted graph G/e. The weights in G represent the similarities among vertices. The original similarities should at least be sustained after contracting an edge in C ML, since must-link constraints are meant to enforce the similarities, not to reduce them. Based on the above observation, we define the weights in G/e as:

w(xe, u) = max(w(xi, u), w(xj, u))   if (xi, u) ∈ E or (xj, u) ∈ E     (3)
w(u, v) = w(u, v)   otherwise (all other weights are kept unchanged)     (4)

Fig. 2. Contraction for must-link constraints
In eq.(3), the max function realizes the above requirement and guarantees the monotonic (non-decreasing) property of the similarities (weights) after contraction. For each pair of edges in C ML, we apply graph contraction and define the weights in the contracted graph based on eqs.(3) and (4). This results in modifying the original graph G into G′(X′, E′, W′), where n′ = |X′| (see Fig. 2).
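A minimal sketch of this contraction step (our own illustration, using networkx and assuming non-negative edge weights stored under the key "weight"); chained must-links that refer to an already-merged vertex are simply skipped here:

```python
import networkx as nx

def contract_must_links(G, must_links):
    """Contract each must-link edge (x_i, x_j), keeping for every neighbour u
    the larger of the two weights, as in eqs. (3)-(4)."""
    G = G.copy()
    for xi, xj in must_links:
        if xi not in G or xj not in G or xi == xj:
            continue  # one endpoint was already merged by an earlier contraction
        for u, attrs in list(G[xj].items()):
            if u == xi:
                continue
            w_ju = attrs.get("weight", 0.0)
            w_iu = G[xi][u]["weight"] if G.has_edge(xi, u) else 0.0
            G.add_edge(xi, u, weight=max(w_iu, w_ju))   # w(x_e, u) = max(...)
        G.remove_node(xj)  # x_i now plays the role of the contracted vertex x_e
    return G
```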
2.4 Graph Laplacian for Cannot-Link Constraints
To reflect cannot-link constraints in the clustering process, we formalize clustering under constraints as an optimization problem and consider the minimization of the following objective function:

J = (1/2) { Σ_{i,j} w_ij ||f_i − f_j||² − λ Σ_{(u,v)∈CCL} w_uv ||f_u − f_v||² }     (5)

where i and j run over the vertices in the contracted graph G′, and C CL stands for the cannot-link constraints in G′. f_i stands for the value assigned to data x_i, and λ ∈ [0, 1] is a hyper-parameter. The first term corresponds to the smoothness of the assigned values in spectral graph theory, and the second term represents the influence of C CL in the optimization. Note that by setting λ ∈ [0, 1], the objective function in (5) is guaranteed to be convex. From the objective function in eq.(5), we can derive the following unnormalized graph Laplacian L′ which incorporates C CL:

J = (1/2) { Σ_{i,j} w_ij ||f_i − f_j||² − λ Σ_{(u,v)∈CCL} w_uv ||f_u − f_v||² } = f^t L′ f     (6)
where f^t stands for the transposition of the vector f, ∘ stands for the Hadamard product (element-wise multiplication) of two matrices, and:

(C)_uv = 1 if (x_u, x_v) ∈ C CL, and 0 otherwise     (7)

W_c = C ∘ W,   W′ = W − λ W_c     (8)

d_i = Σ_{j=1}^{n} w_ij,   d_i^c = Σ_{j=1}^{n} w_ij^c     (9)

D′ = diag(d′_1, . . . , d′_n),   d′_i = d_i − λ d_i^c     (10)

L′ = D′ − W′     (11)
The above process amounts to modifying the representation of the graph G′ into G″, where the modified weights W′ are defined in eq.(8). Thus, as illustrated in Fig. 1, our approach modifies the original graph G into G′ with must-link constraints and then into G″ with cannot-link constraints and similarities. Furthermore, we utilize the following normalized objective function

J_sym = Σ_{i,j} w′_ij || f_i/√(d′_i) − f_j/√(d′_j) ||²     (12)

over the graph G″. Minimizing J_sym in eq.(12) amounts to solving the generalized eigen-problem L′h = αD′h, where h corresponds to the generalized eigenvector and α corresponds to the eigenvalue.
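The construction of eqs. (7)–(11) and the solution of the generalized eigen-problem can be sketched as follows (our own illustration with numpy/scipy; W is assumed to be a symmetric non-negative similarity matrix and D′ positive definite):

```python
import numpy as np
from scipy.linalg import eigh

def constrained_embedding(W, cannot_links, dim, lam=0.5):
    """Build C, W_c, W', D' and L' = D' - W' (eqs. (7)-(11)) and return the
    eigenvectors of L'h = alpha D'h for the `dim` smallest eigenvalues."""
    n = W.shape[0]
    C = np.zeros((n, n))
    for u, v in cannot_links:
        C[u, v] = C[v, u] = 1.0
    W_c = C * W                                      # Hadamard product, eq. (8)
    W_mod = W - lam * W_c                            # W'
    d_mod = W.sum(axis=1) - lam * W_c.sum(axis=1)    # d'_i, eq. (10)
    D_mod = np.diag(d_mod)
    L_mod = D_mod - W_mod                            # L', eq. (11)
    # eigh solves the symmetric generalized problem; eigenvalues are ascending
    _, vecs = eigh(L_mod, D_mod)
    return vecs[:, :dim]
```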
2.5 Clustering
The generalized eigenvectors obtained via the modified graph correspond to the embedded representation of the whole data. A clustering method (currently spherical k-means) is then applied to obtain the clusters.
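As an illustration of this last step (not the authors' code), spherical k-means is often approximated by normalizing the embedded points to unit length and running ordinary k-means:

```python
import numpy as np
from sklearn.cluster import KMeans

def spherical_kmeans(embedding, n_clusters, seed=0):
    """Normalize the embedded points to unit length and run ordinary k-means,
    a common approximation of spherical k-means (which uses cosine similarity)."""
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    unit = embedding / np.maximum(norms, 1e-12)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(unit)
```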
3 Performance Evaluations
3.1 Experimental Settings
Datasets. Based on the previous work [6,15], we evaluated GBSSC on 20 Newsgroup data (20NG) and TREC datasets. Clustering of these datasets corresponds to document clustering, and each document is represented in the standard vector space model based on the occurrences of terms. Note that GBSSC is generic and not specific to document clustering. Since the number of terms is huge in general, these are high-dimensional sparse datasets.
As described in [6,15], we created three datasets for 20NG (Multi5, Multi10 and Multi15, with 5, 10 and 15 clusters). 50 documents were sampled from each group (cluster) in order to create a sample for one dataset, and 10 samples were created for each dataset. For each sample, we conducted stemming using the Porter stemmer and MontyTagger, removed stop words, and selected 2,000 words with large mutual information [5].
For the TREC datasets, we utilized the 9 datasets in Table 1. We followed the same procedure as for 20NG and created 10 samples for each dataset. Since these datasets are already preprocessed and represented as count data, we did not conduct stemming or tagging.

Table 1. TREC datasets (original representation)

dataset   # attr.   #classes   #data
hitech    126372    6          2301
reviews   126372    5          4069
sports    126372    7          8580
la1       31372     6          3204
la2       31372     6          3075
la12      31372     6          6279
k1b       21839     6          2340
ohscal    11465     10         11162
fbis      2000      17         2463

Evaluation Measures. For each dataset, the cluster assignment was evaluated with respect to Normalized Mutual Information (NMI) [13,15]. Let T and T̂ stand for the random variables over the true and assigned clusters. NMI is defined as NMI = I(T̂; T) / ((H(T̂) + H(T)) / 2) (∈ [0, 1]), where H(·) is the Shannon entropy. NMI corresponds to the accuracy of assignment; the larger the NMI, the better the result. The running time (CPU time in seconds) for representation construction was measured on a computer with Windows Vista, an Intel Core2 Quad Q8200 2.33 GHz, and 2 GB of memory.
Comparison. We compared our approach with SCREEN [15] and PCP [11] (details are described in Section 4). Since all the compared methods are partitioning-based clustering methods, we assume that the number of clusters k is specified. In addition, since SCREEN does not scale to high-dimensional data, following the original procedure in [15], the datasets were first pre-processed by PCA (Principal Component Analysis) using 100 eigenvectors before applying SCREEN.
2 http://people.csail.mit.edu/~jrennie/20Newsgroups/. 20news-18828 was utilized.
3 http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download
4 http://www.tartarus.org/~martin/PorterStemmer
5 http://web.media.mit.edu/~hugo/montytagger
6 On fbis, 35 data were sampled for each class.
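For reference, the NMI measure defined above can be computed as in the following sketch (our own code; natural logarithms are used, which cancel in the ratio):

```python
import numpy as np

def nmi(true_labels, pred_labels):
    """NMI = I(T_hat; T) / ((H(T_hat) + H(T)) / 2), computed from the
    contingency table of the two labelings."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    n = len(true_labels)
    t_vals, t_idx = np.unique(true_labels, return_inverse=True)
    p_vals, p_idx = np.unique(pred_labels, return_inverse=True)
    cont = np.zeros((len(p_vals), len(t_vals)))
    for i, j in zip(p_idx, t_idx):
        cont[i, j] += 1
    p_joint = cont / n
    p_pred = p_joint.sum(axis=1, keepdims=True)
    p_true = p_joint.sum(axis=0, keepdims=True)
    nz = p_joint > 0
    mi = np.sum(p_joint[nz] * np.log(p_joint[nz] / (p_pred @ p_true)[nz]))
    h_pred = -np.sum(p_pred[p_pred > 0] * np.log(p_pred[p_pred > 0]))
    h_true = -np.sum(p_true[p_true > 0] * np.log(p_true[p_true > 0]))
    return mi / ((h_pred + h_true) / 2)
```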
Parameters. The parameters under the pairwise constraints in Definition 1 are: 1) the number of constraints, and 2) the pairs of instances for constraints. As for 2), pairs of instances were randomly sampled from each dataset to generate the constraints. Thus, the main parameter is 1), the number of constraints, for C ML and C CL . We set |C ML | = |C CL |, and varied the number of constraints. Each data x in the dataset was normalized such that xt x = 1, and Euclidian distance was utilized for SCREEN as in [15]. With this normalization, cosine similarity, which is widely utilized as the standard similarity measure in document processing, was utilized for GBSSC and PCP, and the initial edge-weighted graph for each dataset was constructed with the similarities. The dimension l of the subspace was set to the number of clusters k. In addition, following the procedure in [11], m-nearest neighbor graph was constructed for PCP with m = 10. λ in eq.(5) was set to 0.5. Evaluation Procedure. For each number of constraints, the pairwise constraints C ML and C CL were generated randomly based on the ground-truth labels in the datasets, and clustering was conducted with the generated constraints. Clustering with the same number of constraints was repeated 10 times with different initial configuration in clustering. In addition, the above process was also repeated 10 times for each number of constraints. Thus, for each dataset and the number of constraints, 100 runs were conducted. Furthermore, this process was repeated over 10 samples for each dataset. Thus, the average of 1,000 runs is reported for each dataset. 3.2
Evaluation of Graph-Based Projection Approach
Our approach modifies the representation of the dataset according to the specified constraints. Especially, the weights among instances are modified. The other possible approach would be to set the weight (similarity) as: i) each pair (xi , xj ) ∈ CML to the maximum similarity ii) each pair (xi , xj ) ∈ CCL to the minimum similarity First, we compared our approach for the handling of must-links in Section 2.3 with the above approaches on Multi10 datasets. The results are summarized in Fig. 3. In Fig. 3, horizontal axis corresponds to the number of constraints; vertical one corresponds to NMI. In the legend, max stands for i), min stands for ii), and max&min stands for both i) and ii). and GBSSC stands for our approach. no constraints stands for the situation where no constraints are utilized in clustering. The results in Fig. 3 show that GBSSC outperformed others and that it is effective in terms of the weight modification of the graph structure. One of the reasons for the results in Fig. 3 is that, when i) (max) is utilized, only the instances connected with must-links are affected, and thus they tend to be collected into a smaller “isolated” cluster. Creating rather small clusters makes the performance degraded. On the other hand, in our approach, instances adjacent to must-links are also affected via contraction.
[Figures 3–6 appear here. Fig. 3 plots NMI against the number of constraints (20–100) for no constraints, GBSSC, max, min and max&min; Fig. 4 plots NMI against λ (0.0–1.0) for #PC = 10 to #PC = 100; Figs. 5 and 6 illustrate the treatment of must-link constraints.]

Fig. 3. Weight modification comparison
Fig. 4. Influence of λ
Fig. 5. Contraction of must-link constraints
Fig. 6. Weight modification of must-link constraints
As for ii) (min), the instances connected with cannot-links are by definition dissimilar with each other and their weights would be small in the original representation. Thus, setting their weights to the minimal value does not affect the overall performance so much. These are illustrated in Fig. 5 and Fig. 6. Next, in order to evaluate the handling of cannt-links in Section 2.4, we varied the value of hyper-parameter λ in eq.(5) and analyzed its influence. The results are summarized in Fig. 4. In Fig. 4, horizontal axis corresponds to the value of λ, and #PC corresponds to the number of constraints. The performance of GBSSC was not so much affected by the value of λ. Thus, GBSSC can be said as relatively robust with respect to this parameter. In addition, the accuracy (NMI) increased monotonically as the number of constraints increased. Thus, it can be concluded that GBSSC reflects the pairwise constraints and improves the performance based on semi-supervised information. 3.3
Evaluation on Document Datasets
We report the comparison of GBSSC in Section 2 with other methods. In the reported figures, horizontal axis corresponds to the number of constraints; vertical one corresponds to either NMI or CPU time (in sec.). In the legend in the figures, red lines correspond to GBSSC, black dotted lines to SCREEN, green lines to PCP. GBSSC+PCP (with purple lines) corresponds to the situation where must-links were handled by contraction in Section 2.3 and cannot-links by PCP.
[Figure 7 appears here: NMI and CPU time (sec) plotted against the number of constraints (20–100) for the Multi5, Multi10 and Multi15 datasets, with GBSSC, GBSSC+PCP, SCREEN and SDP in the legend.]
Fig. 7. Results on 20-Newsgroup
20 Newsgroup datasets. The results for 20NG datasets are summarized in Figs. 7. The results indicate that GBSSC outperformed other methods with respect to NMI when l=k 7 . For Multi5, although the performance of PCP got close to that of GBSSC as the number of constraints increased, GBSSC was faster more than two orders of magnitude (100 times faster). Likewise, GBSSC+PCP and PCP were almost the same with respect to NMI, but the former was faster with more than one order (10 times faster). Although SCREEN was two to five times faster than GBSSC, it was inferior with respect to NMI. Utilization of PCA as the pre-processing enables this speed-up for SCREEN, in compensation for the accuracy (NMI). TREC datasets. The results for TREC datasets are summarized in Fig. 8. On the whole, the results were quite similar to those in 20NG. GBSSC outperformed SCREEN with respect to NMI. It also outperformed PCP in most datasets, however, as the number of constraints increased, the latter showed better performance for review and sports datasets. In addition, PCP seems to improve the performance as the number of constraints increase. GBSSC+PCP and PCP were almost the same with respect to NMI, but the former was faster with more than one order. Since the overall tendency of running time was the same as in 20NG, we omit the figures due to space shortage. 7
The dimension l of the subspace is equal to the number of clusters k. Note that we did not conduct any tuning for the value of l in these experiments. [15] reports that SCREEN could be improved by tuning the number of dimensions.
[Figure 8 appears here: NMI plotted against the number of constraints (20–100) for the hitech, reviews, sports, la1, la2, la12, k1b, ohscal and fbis datasets, with GBSSC, GBSSC+PCP, SCREEN and SDP in the legend.]
Fig. 8. Results on TREC datasets
3.4 Discussions
The reported results show that GBSSC is effective in terms of accuracy (NMI). GBSSC outperformed SCREEN on all the datasets. Although it did not outperform PCP on some TREC datasets with respect to NMI, it was faster by more than two orders of magnitude. The use of PCA as pre-processing enables the speed-up of SCREEN at the cost of accuracy (NMI). On the other hand, PCP showed better performance on some datasets with respect to accuracy (NMI), at the cost of running time. Besides, since SCREEN originally conducts linear dimensionality reduction based on constraints, using another linear dimensionality reduction (such as PCA) as pre-processing might obscure its effect. From these results, GBSSC can be said to be effective in terms of the balance between the accuracy of cluster assignment and running time. In particular, it could leverage a small amount of pairwise constraints to increase the performance. We believe that this is a good property in the semi-supervised learning setting.
4 Related Work
Various approaches have been conducted on semi-supervised clustering. Among them are: constraint-based, distance-based, and hybrid approaches [15]. The constraint-based approach tries to guide the clustering process with the specified pairwise instance constraints [17]. The distance-based approach utilizes metric learning techniques to acquire the distance measure during the clustering process based on the specified pairwise instance constraints [18,11]. The hybrid approach combines these two approaches under a probabilistic framework [1]. As for the semi-supervised clustering problem in Definition 1, [17] proposed a clustering algorithm called COP-kmeans based on the famous kmeans algorithm. When assigning each data item to the cluster with minimum distance as in kmeans, COP-kmeans checks the constraint satisfaction and assigns each data item only to the admissible cluster (which does not violate the constraints). SCREEN [15] first converts the data representation based on must-link constraints and removes the constraints. This process corresponds to contraction in our approach, but the weight definition is different. After that, based on cannot-link constraints, it finds out the linear mapping (linear projection) to a subspace where the variance among the data is maximized. Finally, clustering of the mapped data is conducted on the subspace. PCP [11] deals with the semi-supervised clustering problem by finding a mapping onto a space where the specified constraints are reflected. Using the specified constraints, it conducts metric learning based on the semi-definite programming and learn the kernel matrix on the mapped space. Although the explicit representation of the mapping or the data representation on the mapped space is not learned, kernel k-means clustering [8] is conducted over the learned metric. Various graph-theoretic clustering approaches have been proposed to find subsets of vertices in a graph based on the edges among the vertices. Several methods utilizes graph coloring techniques [10,12]. Other methods are based on the flow or cut in graph, such as spectral clustering [16]. Graph-based spectral approach is also utilized in information-theoretic clustering [19].
5 Concluding Remarks
This paper reported the performance evaluation of a graph-based approach for semi-supervised clustering (GBSSC) over various document data in 20 Newsgroup and TREC benchmark datasets. In GBSSC, the graph structure for the entire data is modified by contraction in graph theory [7] and graph Laplacian in spectral graph theory [4,16] to reflect the pairwise constraints. We analyzed its performance over various document datasets with respect to the type of constraints as well as the number of constraints. We also compared it with other state of the art methods in terms of accuracy of cluster assignment and running time. The results indicate that it is effective in terms of the balance between the accuracy of cluster assignment and running time. Especially, it could leverage small amount of pairwise constraints to increase the performance. We plan to
evaluate our approach with other real world datasets such as image datasets, and to improve our approach based on the obtained results.
Acknowledgments This work is partially supported by the grant-in-aid for scientific research (No. 20500123) funded by MEXT, Japan. The author is grateful to Mr. Okatani and Mr. Ogino for their help in implementation.
References 1. Basu, S., Bilenko, M., Mooney, R.J.: A probabilistic framework for semi-supervised clustering. In: KDD 2004, pp. 59–68 (2004) 2. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with to-training. In: Proceedings of 11th Computational Learning Theory, pp. 92–100 (1998) 3. Chapelle, O., Sch¨ olkopf, B., Zien, A. (eds.): Semi-Supervised Learning. MIT Press, Cambridge (2006) 4. Chung, F.: Spectral Graph Theory. American Mathematical Society, Providence (1997) 5. Cover, T., Thomas, J.: Elements of Information Theory. Wiley, Chichester (2006) 6. Dhillon, J., Mallela, S., Modha, D.: Information-theoretic co-clustering. In: Proc. of KDD 2003, pp. 89–98 (2003) 7. Diestel, R.: Graph Theory. Springer, Heidelberg (2006) 8. Girolami, M.: Mercer kernel-based clustering in feature space. IEEE Transactions on Neural Networks 13(3), 780–784 (2002) 9. Goldman, S., Zhou, Y.: Enhancing supervised learning with unlabeled data. In: Proceedings of ICML 2000, pp. 327–334 (2000) 10. Gu¨enoche, A., Hansen, P., Jaumard, B.: Efficient algorithms for divisive hierarchical clustering with the diameter criterion. J. of Classification 8, 5–30 (1991) 11. Li, Z., Liu, J., Tang, X.: Pairwise constraint propagation by semidefinite programming for semi-supervised classification. In: ICML 2008, pp. 576–583 (2008) 12. Ogino, H., Yoshida, T.: Toward improving re-coloring based clustering with graph b-coloring. In: Proceedings of PRICAI 2010 (2010) (accepted) 13. Strehl, A., Ghosh, J.: Cluster Ensembles -A Knowledge Reuse Framework for Combining Multiple Partitions. J. Machine Learning Research 3(3), 583–617 (2002) 14. Sugato Basu, I.D., Wagstaff, K. (eds.): Constrained Clustering: Advances in Algorithms, Theory, and Applications. Chapman & Hall/CRC Press (2008) 15. Tang, W., Xiong, H., Zhong, S., Wu, J.: Enhancing semi-supervised clustering: A feature projection perspective. In: Proc. of KDD 2007, pp. 707–716 (2007) 16. von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17(4), 395–416 (2007) 17. Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S.: Constrained k-means clustering with background knowledge. In: In ICML 2001, pp. 577–584 (2001) 18. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance metric learning, with application to clustering with side-information. In: NIPS, vol. 15, pp. 505–512 (2003) 19. Yoshida, T.: A graph model for clustering based on mutual information. In: Proceedings of PRICAI 2010 (2010) (accepted) 20. Yoshida, T., Okatani, K.: A Graph-based projection approach for Semi-Supervised Clustering. In: Proceedings of PKAW 2010 (2010) (accepted)
Analysis of Research Keys as Temporal Patterns of Technical Term Usages in Bibliographical Data

Hidenao Abe and Shusaku Tsumoto

Department of Medical Informatics, Shimane University, School of Medicine, 89-1 Enya-cho, Izumo, Shimane 693-8501, Japan
[email protected], [email protected]
Abstract. In order to detect research keys in academic research, we propose a method based on temporal patterns of technical terms, using several data-driven indices and their temporal clusters. In the text mining framework, data-driven indices are used as importance indices of words and phrases. Although the values of these indices are influenced by the usage of terms, conventional emergent term detection methods do not treat these indices explicitly. Our method consists of an automatic term extraction method for the given documents, three importance indices from text mining studies, and temporal patterns obtained by temporal clustering. We then assign an abstracted sense to the temporal patterns of the terms based on the linear trends of their centroids. In the empirical study, an importance index is applied to the titles of four annual conferences in the data mining field as sets of documents. After extracting the temporal patterns of the automatically extracted terms, we compare the linear trends of the technical terms among the titles of one conference.
1 Introduction
In recent years, the accumulation of document data has become common, owing to the development of information systems in every field such as business, academia, and medicine, and the amount of stored data increases year by year. Document data includes qualitative information that is valuable not only to domain experts but also to novice users in particular domains. However, detecting the important words and/or phrases that are related to attractive topics in each field is a skillful task; hence, supporting such detection has attracted attention in the data mining and knowledge discovery fields. As one solution, emergent term detection (ETD) methods have been developed [1,2]. However, because earlier methods relied on the frequency of words, detection was difficult as long as the target word did not actually appear. These methods use a particular importance index to measure the status of the words. Although the indices are calculated from the appearances of the words in each temporal set of documents, and their values change according to their
usages, most conventional methods do not consider the usages of the terms and importance indices separately. This causes difficulties in text mining applications, such as limitations on the extensionality of time direction, time consuming postprocessing, and generality expansions. After considering these problems, we focus on temporal behaviors of importance indices of terms related to research topics and their temporal patterns. In this paper, we propose an integrated framework for detecting temporal patterns of technical terms based on data-driven importance indices by combining automatic term extraction methods, importance indices of the terms, and trend analysis methods in Section 2. After implementing this framework, we performed an experiment to extract temporal patterns of technical terms. In this experiment, by considering the sets of terms extracted from the titles of a data mining relating conference as the example, the temporal patterns based on a data-driven importance index are presented in Section 3. With referring to the result, we discuss about the characteristic terms of the conferences. Finally, in Section 4, we summarize this paper.
2 An Integrated Framework for Detecting Temporal Patterns of Technical Terms Based on Importance Indices
In this section, we describe a framework for detecting various temporal trends of technical terms as temporal patterns of each importance index consisting of the following three components: 1. Technical term extraction in a corpus 2. Importance indices calculation 3. Temporal pattern extraction There are some conventional methods of extracting technical terms in a corpus on the basis of each particular importance index [2]. Although these methods calculate each index in order to extract technical terms, information about the importance of each term is lost by cutting off the information with a threshold value. We suggest separating term determination and temporal trend detection based on importance indices. By separating these phases, we can calculate different types of importance indices in order to obtain a dataset consisting of the values of these indices for each term. Subsequently, we can apply many types of temporal analysis methods to the dataset based on statistical analysis, clustering, and machine learning algorithms. First, the system determines terms in a given corpus, which consist of document sets on each period is published temporally. In our implementation, we have an assumption that an automatic term extraction method in natural language processing is used in this step. There are two reasons why we introduce term extraction methods before calculating importance indices. One is that the cost of building a dictionary for each particular domain is very expensive task.
[Figure 1 appears here: a table whose rows are extracted terms (e.g., output feedback, image sequences, feature selection, data mining) and whose columns Jacc_1996–Jacc_2005 hold the yearly values of an importance index.]
Fig. 1. Example of a dataset consisting of an importance index
The other is that new concepts need to be detected in a given temporal corpus. Especially, a new concept is often described in the document for which the character is needed at the right time in using the combination of existing words. After determining terms in the given corpus, the system counts the appearances of the terms or/and the number of documents including the terms on each document set. Then, it calculates importance indices of the terms for the documents of each period. In the proposed method, we suggest treating these indices explicitly as a temporal dataset. The features of this dataset consist of the values of prepared indices for each period. Figure 1 shows an example of the dataset consisting of an importance index for each year. Finally, the framework provides the choice of some adequate trend extraction method to the dataset. In order to extract useful temporal patterns, there are so many conventional methods as surveyed in the literatures [3,4]. By applying an adequate time-series analysis method, users can find out valuable patterns by processing the values in rows in Figure 1.
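A minimal sketch of how such a dataset can be assembled (our own illustration; the index function here is a simple document-frequency stand-in for indices such as the Jaccard coefficient or tf-idf shown in Fig. 1):

```python
def build_index_dataset(terms, docs_by_year, index_fn):
    """One row per term and one column per period, holding an importance
    index, as in the dataset of Fig. 1."""
    return {term: {year: index_fn(term, docs)
                   for year, docs in sorted(docs_by_year.items())}
            for term in terms}

def document_frequency_rate(term, docs):
    """A simple stand-in index: the fraction of documents containing the term."""
    return sum(1 for d in docs if term in d.lower()) / max(len(docs), 1)
```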
3 Experiment: Extracting Temporal Patterns of Technical Terms by Using Temporal Clustering
In this experiment, we describe the resulting temporal patterns obtained with the implementation of the method described in Section 2. As the input temporal documents, we used the annual sets of the titles of the following four academic conferences: KDD, PKDD, PAKDD, and ICDM. We determine technical terms by using the term extraction method [6] for each entire set of documents. The method uses the following FLR score to extract terms in the documents:

FLR(CN) = f(CN) × ( Π_{i=1}^{L} (FL(N_i) + 1)(FR(N_i) + 1) )^{1/(2L)}
where f(CN) is the frequency of a candidate compound noun CN, and FL(N_i) and FR(N_i) indicate the frequencies of different words on the right and the left of each noun N_i in bi-grams including CN. In the experiments, we selected technical terms with FLR(t) > 1.0.
Subsequently, the values of tf-idf are calculated for each term in the annual documents. As for importance indices of words and phrases in a corpus, there are some well-known indices. Term frequency times inverse document frequency (tf-idf) is one of the popular indices used for measuring the importance of terms. tf-idf for each term t_i can be defined as follows:

TFIDF(t_i, D_period) = TF(t_i) × log ( |D_period| / DF(t_i) )

1 These titles are part of the collection by DBLP [5].
2 The implementation of this term extraction method is distributed at http://gensen.dl.itc.u-tokyo.ac.jp/termextract.html (in Japanese).
where TF(t_i) is the frequency of each term t_i in the corpus of period D_period, and |D_period| is the number of documents included in that period. To the datasets consisting of the temporal values of the importance indices, we apply k-means clustering to extract temporal patterns. Then we assign meanings to the clusters based on their linear trends, calculated by linear regression over the timeline. In order to determine the meaning of the temporal patterns, we apply the linear regression analysis technique to the centroids of the clusters. The degree (slope) of each cluster c is calculated as follows:

Deg(c) = ( Σ_{i=1}^{M} (y_i − ȳ)(x_i − x̄) ) / ( Σ_{i=1}^{M} (x_i − x̄)² )
where x̄ is the average of the M time points and ȳ is the average of the centroid values of each cluster. Simultaneously, we calculate the intercept Int(c) of each cluster c as follows:

Int(c) = ȳ − Deg(c) · x̄

The degree and the intercept are also applied to determine the linear trend of each term, by using the values of each index instead of the values of the centroid as y_i.
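A small sketch of these quantities (our own code; the index is the tf-idf defined above, and the time points x_i are taken as 0, 1, ..., M−1):

```python
import math

def tfidf(tf, df, n_docs):
    """TFIDF(t_i, D_period) = TF(t_i) * log(|D_period| / DF(t_i))."""
    return 0.0 if df == 0 else tf * math.log(n_docs / df)

def linear_trend(values):
    """Least-squares degree (slope) and intercept of a yearly sequence."""
    m = len(values)
    xs = range(m)
    x_bar = sum(xs) / m
    y_bar = sum(values) / m
    deg = (sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, values))
           / sum((x - x_bar) ** 2 for x in xs))
    return deg, y_bar - deg * x_bar   # Deg(c), Int(c)
```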
3.1 Extracting Technical Terms
We use the titles of the four data mining related conferences as temporal sets of documents. The description of the sets of the documents is shown in Table 1. As for the sets of documents, we assume each title of the articles to be one document. Note that we do not use any stemming technique because we want to consider the detailed differences in the terms. By using the term extraction method with simple stop word detection for English, we extract technical terms as shown in Table 2. After merging all of titles of each conference into one set of the documents, these terms were extracted for each set of the titles. 3.2
Extracting Temporal Patterns by Using k-Means Clustering
In order to extract temporal patterns of each importance index, we used k-means clustering. We set up the numbers of one percent of the terms as the maximum
Table 1. Description of the numbers of the titles
1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 TOTAL
KDD PKDD PAKDD # of titles # of words # of titles # of words # of titles # of 40 349 56 466 74 615 65 535 43 350 68 572 56 484 51 93 727 82 686 72 94 826 86 730 52 110 942 45 388 63 140 1,190 43 349 62 108 842 44 340 60 133 1,084 64 504 83 113 868 76 626 101 139 1,068 67 497 128 131 1,065 67 537 196 134 1,126 110 832 136 1,498 12,275 783 6,323 1,004
ICDM words # of titles # of words
412 628 423 528 515 520 698 882 1,159 1,863 1,224 8,852
109 121 127 105 150 317 213 264 1,406
908 1,036 1,073 840 1,161 2,793 1,779 2,225 11,815
Table 2. Description of the numbers of the extracted terms

                       KDD     PKDD    PAKDD   ICDM
# of extracted terms   3,232   1,653   2,203   3,033

Table 3. The sum of squared errors of the clustering for the technical terms in the titles of the four conferences

                KDD     PKDD    PAKDD   ICDM
# of Clusters   15      9       11      9
SSE             46.71   58.76   35.13   21.05
number of clusters k for each dataset. The system then obtained the clusters that minimize the sum of squared errors (SSE) within clusters. Iterating at most 500 times, the system obtains the clusters using the Euclidean distance between instances consisting of the values of the same index. Table 3 shows the SSE of the k-means clustering.
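A rough sketch of this clustering step (our own code with scikit-learn; the exact normalization and the choice of k are assumptions, since the text only states that values are normalized per year and that k is at most one percent of the terms):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_trajectories(index_matrix, n_clusters, seed=0):
    """Cluster the yearly index values of terms (one row per term) with
    k-means under the Euclidean distance and report the within-cluster SSE."""
    X = np.asarray(index_matrix, dtype=float)
    X = X / np.maximum(X.max(axis=0), 1e-12)   # normalize the values per year
    km = KMeans(n_clusters=n_clusters, n_init=10, max_iter=500,
                random_state=seed).fit(X)
    return km.labels_, km.inertia_             # cluster assignment and SSE
```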
3.3 Details of a Temporal Pattern of the Technical Terms
By focusing on the result of the titles of KDD with tf-idf, in this Section, we show the details of the temporal pattern. We obtained the 15 clusters with this setting for extracting temporal patterns, the SSE is shown in Table 3. As shown in Table 4, there are several kind of clusters based on the averaged linear trends. The centroid terms mean the terms that are the nearest location to the centroids. Then, by using the averaged degree and the averaged intercept of each term, we attempt to determine the following three trends: – Popular • the averaged degree is positive, and the intercept is also positive. 3
The system also normalized the values for each year.
Table 4. The centroid information about the temporal patterns of tf-idf of the terms on the titles of KDD from 1994 to 2008

Cluster No.  # of Instances  Term                         Avg. Deg  Avg. Int  Meaning
1            1590            sequence using data mining    0.01      0.04     Popular
2            7               data mining                   0.76     15.35     Popular
3            73              database mining              -0.09      1.27     Subsiding
4            172             web usage mining              0.02      0.26     Popular
5            142             web data                      0.09     -0.27     Emergent
6            195             relational data               0.13     -0.44     Emergent
7            112             web mining                    0.00      0.45     Popular
8            192             graph mining                  0.14     -0.56     Emergent
9            86              bayesian network             -0.07      1.00     Subsiding
10           141             data streams                  0.05      0.09     Popular
11           26              knowledge discovery           0.52      4.49     Popular
12           94              mining knowledge             -0.05      0.90     Subsiding
13           117             high-dimensional data        -0.03      0.80     Subsiding
14           102             distributed data mining      -0.02      0.54     Subsiding
15           48              data sets                     0.35      1.57     Popular
Fig. 2. The temporal patterns of tf-idf of terms from the titles of KDD
– Emergent • the averaged degree is positive, and the intercept is negative.
– Subsiding • the averaged degree is negative, and the intercept is positive.
The averaged values of tf-idf for the temporal patterns are visualized in Figure 2. According to the meanings based on the linear trend, patterns #5, #6 and #8 are emergent patterns. Since the numbers of instances included in each cluster are not uniform, patterns #2 and #11 show high averaged tf-idf values. Figure 3 shows the top ten terms included in pattern #6, the biggest cluster with an emergent linear trend. Their tf-idf values have increased in the last two or three years: although these terms were not used in the titles in the '90s, researchers have begun to use them to express their key topics in recent years.
[Figure 3 appears here: the yearly tf-idf values (1994–2008) of the top ten terms in pattern #6: structural, summarization, community, ranking, topics, preserving, mixture model, segmentation, frequent itemsets, and co-clustering.]
Fig. 3. Top ten emergent terms included in the pattern #6
The terms 'mixture model' and 'co-clustering' are closest to the centroid of the pattern, and 'summarization' and 'community' were still used frequently in 2008. There remains room to improve the temporal pattern extraction method. Although the four conferences share some emergent and subsiding terms based on the temporal patterns, characteristic terms can also be identified. The centroid terms of the emergent patterns express the research topics that have attracted the attention of researchers. The emergent terms in KDD are related to web data and graphs. As for PKDD, the phrase 'feature selection' is determined as emergent only for this conference. Mining techniques related to items and text are also identified in PAKDD and ICDM. These terms indicate some characteristics of these conferences, reflecting the people who have contributed to each conference.
4 Conclusion
In this paper, we proposed a framework to detect temporal patterns of the usages of technical terms appeared as the temporal behaviors of the importance indices. We implemented the framework with the automatic term extraction, the tf-idf based importance index, and temporal pattern extraction by using k-means clustering. The empirical results show that the temporal patterns of the importance index can detect the trends of each term, according to the linear trends of the index for the timeline. Regarding to the results, the temporal patterns indicate that we
can detect emergent terms and similar terms with the emergent terms included in each temporal pattern. In the future, we will apply other term extraction methods, importance indices, and trend detection method. As for importance indices, we are planning to apply evaluation metrics of information retrieval studies, probability of occurrence of the terms, and statistics values of the terms. To extract the temporal patterns, we will introduce temporal pattern recognition methods [7], which can consider time differences between sequences with the same meaning. Then, we will apply this framework to other documents from various domains.
References 1. Lent, B., Agrawal, R., Srikant, R.: Discovering trends in text databases, pp. 227–230. AAAI Press, Menlo Park (1997) 2. Kontostathis, A., Galitsky, L., Pottenger, W.M., Roy, S., Phelps, D.J.: A survey of emerging trend detection in textual data mining. A Comprehensive Survey of Text Mining (2003) 3. Keogh, E., Chu, S., Hart, D., Pazzani, M.: Segmenting time series: A survey and novel approach. In: Data mining in Time Series Databases, pp. 1–22. World Scientific, Singapore (2003) 4. Liao, T.W.: Clustering of time series data: a survey. Pattern Recognition 38, 1857–1874 (2005) 5. The dblp computer science bibliography, http://www.informatik.uni-trier.de/~ ley/db/ 6. Nakagawa, H.: Automatic term recognition based on statistics of compound nouns. Terminology 6(2), 195–210 (2000) 7. Ohsaki, M., Abe, H., Yamaguchi, T.: Numerical time-series pattern extraction based on irregular piecewise aggregate approximation and gradient specification. New Generation Comput. 25(3), 213–222 (2007)
Natural Language Query Processing for Life Science Knowledge
Position Paper

Jin-Dong Kim1,⋆, Yasunori Yamamoto1, Atsuko Yamaguchi1, Mitsuteru Nakao1, Kenta Oouchida1, Hong-Woo Chun2, and Toshihisa Takagi1

1 Database Center for Life Science (DBCLS), Tokyo, Japan
{jdkim,yy,atsuko,mn,oouchida,takagi}@dbcls.rois.ac.jp
2 Korea Institute of Science and Technology Information (KISTI), Daejeon, Korea
[email protected]
Abstract. As life science progresses, human knowledge about life science keeps increasing and diversifying. To effectively access this diverse and complex knowledge base, people need to cope with complex representations of the desired knowledge pieces. As a potential solution to this problem, we propose a natural language query processing system which can be used to build a human-friendly question-answering system. In this position paper, we present the background, an initial design, and a preliminary investigation of natural language query processing.
1 Introduction
Recent rapid progress of life science resulted in a greatly increased amount of life science knowledge, e.g. genomics, proteomics, pathology, therapeutics, diagnostics, etc. The increase of life science knowledge is expected to bring about an improved quality of everyday lives of the public. The knowledge is however scattered in pieces in diverse forms over a large number of databases (DBs), e.g. PubMed, Drugs.com, Therapy database, etc. In this situation, there are at least two non-trivial problems that a user has to cope with to access specific pieces of knowledge: (1) to find the DBs that are expected to contain the desired knowledge pieces, and (2) to learn the protocol, e.g. query languages, to access the knowledge pieces in the individual DBs. The problems get worse as life science progresses: as the number of DBs gets increased, it becomes more difficult to find relevant DBs, and as the knowledge becomes diversified, the representation gets complicated, requiring more complex query languages. In this work, we particularly focus on the second problem. Note that, for a long time, the most popular form of query expression has been sets of keywords with Boolean operators, which is not difficult to write. Recently, as more and more diverse knowledge pieces are discovered and accumulated in DBs, and the need for their integration grows, complex query languages are emerging to cope with the diversity. For example, SPARQL1 is becoming an important query 1
Corresponding author. http://www.w3.org/TR/rdf-sparql-query/
language as more and more knowledge pieces are published as RDF statements, e.g. BioRDF (http://esw.w3.org/HCLSIG_BioRDF_Subgroup). SPARQL queries are, however, not easy to write, due to their complex vocabulary, syntax and semantics. We consider natural language processing (NLP) technology to be a solution to this problem. Human users feel comfortable with natural language without particular education, yet its expressive power is high enough to represent knowledge with complex structure. If natural language queries (NLQ) can be seamlessly translated into SPARQL queries, human users do not need to learn a complex query language, while still having the same level of access to their desired knowledge as they would with SPARQL. In this position paper, we present the background, our initial plan, and a preliminary investigation of developing an NLQ processing system. The paper is organized as follows: Sec. 2 presents our analysis of previous similar efforts; Sec. 3 introduces the LifeQA project, into which the NLQ processor will be plugged as a core component; Sec. 4 describes our design of NLQ processing; Sec. 5 discusses some issues revealed during our initial investigation; and Sec. 6 concludes the paper.
2 Previous Works
An NL interface has long been expected to be an ideal interface for information systems, and as such there have been several attempts to develop NL interfaces, e.g. PowerSet (http://www.powerset.com/) and BioQA (http://cbioc.eas.asu.edu/bioQA/v2/). However, most earlier attempts were not very successful in gaining a large population of users. We attribute this to (1) the unclear benefit of an NL interface, and (2) the insufficient performance of NLP modules. Most earlier information systems with an NL interface did not feature fine-grained knowledge representation with complex structure, and thus could not show a clear benefit of using NLQs over simple Boolean queries. Since the LifeQA project is based on rich knowledge bases, life science DBs, and features SW-based knowledge representation, we expect the users to gain a clear benefit from a query language with high expressive power. As for the performance of NLP modules, while it is reported that state-of-the-art NLP technology shows reasonable performance for IR or IE applications [8], NLP technology has long been developed mostly for declarative sentences. NLQs, on the other hand, are usually interrogative or imperative sentences, which causes the insufficient-performance problem. This issue is discussed in more detail in Sec. 5.
3 LifeQA Project
The LifeQA project was established to address the two problems introduced in Sec. 1. The final product of the project, the LifeQA system/service, is planned to be a hub for life science DBs that can be accessed via a natural language (NL) interface.
Fig. 1. The 3-layer knowledge representation in the LifeQA system (NL, DL, and DB layers)
With the service, a user will be able to execute queries written or spoken in natural language, e.g. What toxicities are associated with Aspirin? The natural language queries (NLQ) will then be translated into various queries for the individual DBs connected to the hub. The results from the databases will be aggregated and translated back into natural language descriptions (NLD) in text or voice. The backbone of the LifeQA system is the translation between knowledge representations in the NL, DL, and DB layers, as illustrated in Fig. 1. The knowledge representation in the DL layer, which will rely on Semantic Web (SW) technology, is important for an effective integration of diverse knowledge pieces from various DBs. The performance of the translation between the NL and DL layers, which will rely on NLP technology, is important to ensure the effectiveness of the NL interface. The population of the DB layer is important to secure a rich source of life science knowledge. If successful, the LifeQA service is expected to improve the logistics of delivering life science knowledge to potential beneficiaries. Figure 2 illustrates the overall workflow of the planned LifeQA system. We consider development of the NLQ processor, which is the theme of this paper, to be the most challenging task of the project.
Fig. 2. The flow of information in the LifeQA system: speech recognizer/synthesizer, NLQ processor, NLD generator, answer aggregator, and query converters for the individual DBs
Table 1. Sample queries prepared for the TREC 2007 Genomics Track

T1  What antibodies have been used to detect protein TLR4?
T2  What biological substances have been used to measure toxicity in response to cytarabine?
T3  What cell or tissue types express members of the mammalian TIM gene family?
T4  What diseases are associated with lysosomal abnormalities in the nervous system?
T5  What drugs have been tested in mouse models of Alzheimer's disease?
T6  What centrosomal genes are implicated in diseases of brain development?
T7  What molecular functions does helicase protein NS3 play in HCV?
T8  What mutations in apolipoprotein genes are associated with disease?
T9  Which pathways are possibly involved in the disease ADPKD?
T10 What proteins does epsin1 interact with during endocytosis?
T11 What Streptococcus pneumoniae strains are resistant to penicillin and erythromycin?
T12 What signs or symptoms of anxiety disorder are related to lipid levels?
T13 What toxicities are associated with cytarabine?
T14 What tumor types are associated with Rb1 mutations?

4 Natural Language Query Processing
As a preliminary step toward developing an NLQ processing system, especially for life science queries, we analyzed the fourteen sample queries developed for the TREC 2007 Genomics Track [3], shown in Table 1. Based on the analysis, we propose a pipeline system involving the following five steps:

1. NER step: finds entity mentions in the input query.
2. Parsing step: analyzes the syntactic/semantic structure of the query.
3. Targeting step: determines what is to be retrieved as the result.
4. Conditioning step: details the conditions on the target to be retrieved.
5. Encoding step: generates a corresponding SPARQL query.
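As a rough illustration of how these five steps could be chained, the following Python sketch wires hypothetical components together in this order; the component interfaces (ner, parser, targeter, conditioner, encoder) are assumptions made for illustration only, not components described in this paper.

# Hypothetical sketch of the five-step NLQ-to-SPARQL pipeline described above.
from dataclasses import dataclass, field

@dataclass
class Query:
    text: str
    entities: list = field(default_factory=list)    # NER step output
    parse: object = None                             # Parsing step output
    target: str = None                               # Targeting step output
    conditions: list = field(default_factory=list)   # Conditioning step output
    sparql: str = None                               # Encoding step output

def process_nlq(text, ner, parser, targeter, conditioner, encoder):
    q = Query(text)
    q.entities = ner(q.text)                        # 1. find entity mentions
    q.parse = parser(q.text)                        # 2. syntactic/semantic structure
    q.target = targeter(q.parse)                    # 3. what to retrieve
    q.conditions = conditioner(q.parse, q.target)   # 4. predications on the target
    q.sparql = encoder(q.target, q.conditions)      # 5. SPARQL generation
    return q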
The NER and Parsing steps, which find linguistic structures of NL texts, are now regarded as standard components of NLP systems. For example, the words antibodies and TLR4 in the sample query T1 refer to the important entities that need to be recognized to answer the query. In T4, for another example, the words diseases, lysosomal, abnormality, and nervous system are important entity references. Note that although lysosomal is in adjective form, it refers to lysosome and thus needs to be recognized as a biological entity reference. NER taggers [1,2,5,11] are responsible for this task. For example, GeniaTagger [11] recognizes antibodies and TLR4 as proteins, and MetaMap [1] recognizes diseases, lysosomal, abnormalities, and nervous system as Disease or Syndrome, Body Location or Region, Congenital Abnormality, and Body System, respectively. On the other hand, syntactic/semantic parsers [6,7,9,10] compute the syntactic/semantic structure of the queries, which helps their interpretation. For example, Figs. 3 and 4 show the syntactic parse of T1 and the semantic parse of T4, respectively, as analyzed by Enju [7]. The Targeting and Conditioning steps are required to determine what the query wants to get in the response. Typically, in interrogative sentences, interrogative pronouns or wh-words indicate the target entities. For example, in T1 and T4, the interrogative pronoun what governs the words antibodies and diseases, respectively, indicating them as the target entities.
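For instance, a minimal targeting heuristic over a dependency-style parse might look as follows; the triple representation and relation labels are illustrative assumptions, not the output format of any particular parser used in this work.

# Illustrative targeting heuristic: pick the noun that the wh-word governs/modifies.
def find_target(dependencies, wh_words=("what", "which")):
    """dependencies: list of (head, relation, dependent) triples (assumed format)."""
    for head, relation, dependent in dependencies:
        # e.g. in T1, "what" modifies "antibodies", so "antibodies" is the target
        if dependent.lower() in wh_words:
            return head
        if head.lower() in wh_words:
            return dependent
    return None

# T1: "What antibodies have been used to detect protein TLR4?"
deps_t1 = [("antibodies", "det", "What"),
           ("used", "nsubjpass", "antibodies"),
           ("detect", "dobj", "TLR4")]
print(find_target(deps_t1))  # -> "antibodies"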
Fig. 3. Syntactic parse of the sample query T1
Fig. 4. Semantic parse of the sample query T4
Fig. 5. Shortest path between the entity references in T1
Fig. 6. Shortest path between the entity references in T4
Once the target entity is determined, the conditions of retrieval need to be figured out. For example, what the query T1 wants to get is not all antibodies, but those that detect TLR4. The conditioning step can be cast as the task of finding all the predications concerning the target entities, which can be classified as an information extraction (IE) task. In the BioNLP field, the problem of extracting complex predications (events) on bio-entities was recently addressed by a community-wide collaboration [4]; output from that community should be useful in developing the conditioning module. Figures 5 and 6 represent the shortest paths between the critical entities in the semantic structures of T1 and T4, demonstrating the usefulness of the parsing step for finding the conditions on the target entities as expressed in the queries. Figure 5 illustrates that T1 wants to find the antibodies that detect TLR4. This can be encoded in SPARQL as follows, which is the final step of NLQ processing:

PREFIX bio: <...>
SELECT ?target_entity_name
WHERE {
  ?x bio:name ?target_entity_name ;
     rdf:type bio:Antibody ;
     bio:detect ?y .
  ?y bio:name "TLR4" ;
     rdf:type bio:Protein .
}
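As a hedged illustration of how such a generated query could be executed, the following sketch runs it over a local RDF graph with the rdflib library; the bio: namespace IRI and the knowledge_base.ttl file are placeholder assumptions, since the paper does not specify them.

# Executing a generated SPARQL query over a local RDF dump of the DB layer (sketch).
from rdflib import Graph

query = """
PREFIX bio: <http://example.org/bio#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?target_entity_name
WHERE {
  ?x bio:name ?target_entity_name ;
     rdf:type bio:Antibody ;
     bio:detect ?y .
  ?y bio:name "TLR4" ;
     rdf:type bio:Protein .
}
"""

g = Graph()
g.parse("knowledge_base.ttl", format="turtle")  # hypothetical local knowledge dump
for row in g.query(query):
    print(row.target_entity_name)               # antibodies that detect TLR4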
5 Discussions
Our preliminary analysis identified a couple of issues that need to be addressed to obtain reasonable performance of query processing. First, the performance of NLP modules needs to be improved, especially for queries. We observed that even state-of-the-art parsers [7,10] failed to produce correct parses for four out of the fourteen sample queries: T2, T7, T10, and T12. With T2, the prepositional phrase (PP) in response to cytarabine is incorrectly attached to the verb measure. With T12, the scope of the coordination, sign or symptom, is misjudged. With T7 and T10, the interrogative auxiliary verb does is misjudged as a main verb. The PP-attachment and coordination problems are known to be difficult for parsers to interpret correctly. However, the confusion with regard to interrogative auxiliary verbs is evidence that the taggers are not optimized for interrogative sentences. We are planning to address this problem by developing a treebank of queries for re-training parsers; for PP-attachment and coordination, we also plan to develop lexical resources to improve their handling. Second, semantic standardization of the entities, e.g. Antibody and Protein, and relations, e.g. detect, is necessary for effective matching between queries and the knowledge pieces to be retrieved. In SW technology, such standardization is implemented around ontologies. In developing new ontologies, it is important to secure connections to existing ones. For example, the GENIA ontology was developed for the purpose of text mining. Having been used as the reference ontology of the BioNLP'09 shared task [4], it is widely recognized as a standard ontology of bio-text mining. This means that there are many BioNLP tools that have been or will be developed based on the GENIA ontology; thus, if the ontology for the LifeQA system is developed to be compatible with the GENIA ontology, many resources from the BioNLP community will become easily available for the project. While the GENIA ontology is a domain-specific ontology developed in a bottom-up style, Basic Formal Ontology (BFO), the backbone of the OBO Foundry ontologies, is one of the de facto standard top-level ontologies. The OBO Foundry encourages developers of bio-ontologies to ensure compatibility with BFO so that different ontologies developed for different purposes remain compatible with each other. In developing the ontology for the LifeQA system, we are planning to keep referencing these two ontologies. The following are the entity types dealt with in the QR task of the TREC 2007 Genomics Track [3], which will constitute the minimal set of entries to be organized in the ontology for the QA system:

Continuants: biological substances, antibodies, cell or tissue types, drugs, genes, proteins, strains, tumor types
Occurrents: diseases, signs or symptoms, toxicities, molecular functions, mutations, pathways
6 Conclusions
Motivated to develop an NLQ processing system, we first reviewed previous similar works to avoid repeating their mistakes, and analyzed the sample queries from the TREC 2007 Genomics Track. As a result, we identified two problems in the previous approaches: insufficient performance of NLP modules and the lack of a clear benefit from NLQs. We propose to address these two problems by developing a corpus of NLQs annotated with syntactic/semantic structures, and by incorporating SW technology to deal with fine-grained knowledge representation, which we believe constitutes the uniqueness of our approach to developing NL interfaces.
References
1. Aronson, A.R., Lang, F.M.: An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association 17(3), 229–236 (2010)
2. Carpenter, B.: LingPipe for 99.99% recall of gene mentions. In: Proceedings of the Second BioCreative Challenge Evaluation Workshop, pp. 307–309 (2007)
3. Hersh, W., Voorhees, E.: TREC genomics special issue overview. Information Retrieval 12(1), 1–15 (2009)
4. Kim, J.D., Ohta, T., Pyysalo, S., Kano, Y., Tsujii, J.: Overview of BioNLP'09 Shared Task on Event Extraction. In: Proceedings of the Natural Language Processing in Biomedicine (BioNLP) NAACL 2009 Workshop, pp. 1–9 (2009)
5. Leaman, R., Gonzalez, G.: BANNER: an executable survey of advances in biomedical named entity recognition. In: Pacific Symposium on Biocomputing, pp. 652–663 (2008)
6. McClosky, D., Charniak, E.: Self-training for biomedical parsing. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2008), pp. 101–104 (2008)
7. Miyao, Y., Tsujii, J.: Feature forest models for probabilistic HPSG parsing. Computational Linguistics 34(1), 35–80 (2008)
8. Ohta, T., Tsuruoka, Y., Takeuchi, J., Kim, J.D., Miyao, Y., Yakushiji, A., Yoshida, K., Tateisi, Y., Ninomiya, T., Masuda, K., Hara, T., Tsujii, J.: An intelligent search engine and GUI-based efficient MEDLINE search tool based on deep syntactic parsing. In: Proceedings of the COLING/ACL Interactive Presentation Sessions, pp. 17–20. Association for Computational Linguistics (2006)
9. Rimell, L., Clark, S.: Porting a lexicalized-grammar parser to the biomedical domain. Journal of Biomedical Informatics 42(5), 852–865 (2008)
10. Sagae, K., Tsujii, J.: Dependency parsing and domain adaptation with LR models and parser ensembles. In: Proceedings of the CoNLL 2007 Shared Task (2007)
11. Tsuruoka, Y., Tateishi, Y., Kim, J.D., Ohta, T., McNaught, J., Ananiadou, S., Tsujii, J.: Developing a robust part-of-speech tagger for biomedical text. In: Bozanis, P., Houstis, E.N. (eds.) PCI 2005. LNCS, vol. 3746, pp. 382–392. Springer, Heidelberg (2005)
A Semantic Web Services Discovery Algorithm Based on QoS Ontology

Baocai Yin, Huirong Yang, Pengbin Fu, and Xiaobo Chen

Beijing Municipal Key Laboratory of Multimedia and Intelligent Software Technology, College of Computer Science, Beijing University of Technology, 100124, Beijing, China
{ybc,yanghuirong}@bjut.edu.cn
Abstract. Discovery of web services has gained great research attention due to the large number of available web services. However, many functionally equivalent web services are returned by semantic web services registries, so it is necessary to rank web services that have similar functionality. Quality of Service (QoS) is often used to filter semantic web services discovery results based on user constraints expressed as QoS descriptions. But different service providers and requesters may use different QoS concepts and measurement methods for describing service quality, which leads to issues of semantic interoperability of QoS. In this paper, we first analyze the classes, attributes and relationships of the QoS vocabulary; then we design and build a general and flexible QoS ontology to support web service non-functional requirements (NFRs). Finally, we propose a semantic web services discovery algorithm based on the QoS ontology. The algorithm supports the automatic discovery of web services, and it can improve the efficiency with which users find the best services.
1 Introduction

Web services discovery is a key step in building distributed applications. But current web service standards lack descriptions of a service's non-functional attributes, i.e., its QoS. QoS is defined in ISO-9126 as: "The totality of features and characteristics of a product or service that bear on its ability to satisfy stated or implied needs". QoS describes the capabilities of a product or service to meet the requirements of consumers. Without QoS information, it is difficult to rank similar services by functionality alone. Some researchers add QoS information into the service discovery process [1,4,5]. The QoS information, such as the reliability and accessibility of a web service, is used to enhance the availability and efficiency of web service selection. But different providers and requesters may have quite different QoS descriptions; they may use different concepts, scales and measurements. Hence, an ontology is needed in order to solve the interoperability of QoS. In this paper, we design and build a general and flexible QoS ontology to express the QoS information of web services. The QoS ontology can offer a unified QoS description for both providers and requesters. We use Semantic Markup for Web Services (OWL-S) as the description language to represent the QoS ontology. Besides, we propose a QoS ontology based web service discovery algorithm to support the automatic
discovery and composition of web services. The algorithm supports QoS semantic matching and evaluation, and it considers not only functional matching but also QoS matching between web service providers and requesters. It can thus satisfy the requester's functional requirements and NFRs.
2 QoS Ontology

2.1 QoS Ontology

An ontology is a common understanding and description of domain knowledge; it can solve semantic description and ambiguity issues. QoS refers to the non-functional attributes of a service, describing all aspects of the service's performance. The QoS ontology describes web service quality information that is independent of the domain. Service providers release a web service with QoS data according to the description of a QoS ontology, and the users of the service request a service with the relevant QoS data within the same QoS ontology; thus the automatic discovery and composition of web services becomes possible [6]. Several QoS ontologies have been developed in the past few years [2,5,6,7,8,9], but currently there is no general and standard QoS ontology. Reference [5] introduces many classes of a QoS ontology, but it does not mention the attributes and relationships between them. The onQoS ontology in [6] is composed of three extensible layers: upper, middle and lower. The onQoS upper ontology describes the QoS ontological language; it provides "the words" needed to provide the most appropriate information for formulating and answering QoS queries. The onQoS middle ontology defines QoS parameters, metrics and scales; it introduces the QoS parameters in detail, including subclasses such as availability, capacity, cost and performance, and gives a well-defined specification for metrics, value types, etc. But it does not mention properties such as level and tendency. The QoS ontology language of [7] focuses on the formulation of a QoS ontology; it provides a quite complete list of common QoS properties but fails to present quantifiable measurements. In [8], the DAML-QoS ontology consists of three layers: the QoS profile layer, used for matching; the QoS property definition layer, used for describing constraints on QoS properties; and a metrics layer for measuring QoS factors. Although it is well defined to support advertisement, selection and semantic matching, a QoS ontology vocabulary is absent. In [9], QoS is divided into objective and subjective aspects; most QoS characteristics and concepts are introduced, but QoS metrics are neglected. In [2], many important attributes are defined, but QoS metrics and weights are not included. Reference [4] defines a set of core QoS properties that are common to various web service applications and introduces most NFRs of web services. In our design and implementation, we mainly consider the clarity and rationality of the ontology requirements introduced in [4], as well as the quality characteristics defined in [2,4,5,6,7,8,9] and ISO-9126, and then propose our QoS ontology. The QoS ontology model is built based on the requirements of the QoS stack depicted in Fig. 1; it meets the most important NFRs that we gathered, which are depicted by the QoS stack in Fig. 1. We choose the NFRs from the following aspects: expressiveness, robustness, flexibility, scalability, performance, completeness, friendliness, and interoperability.
Fig. 1. The QoS Stack
The left part of Fig. 1 describes the most general requirements on qualities, while the right part describes the concrete requirements; the left part is required for each QoS property, whereas the right part is optional. Our QoS ontology is composed of three extensible layers: top, middle and lower. The top layer covers the most common NFRs, the middle layer covers the concrete NFRs, and the lower layer meets the NFRs of a specific domain. The most important NFRs and characteristics consist of two parts. One part is the general NFRs and characteristics: QoS Property, QoS Metric, QoS Relationship, QoS Role, QoS Weight, etc. The other part is the concrete NFRs and characteristics of a QoS ontology: QoS Accessibility, QoS Reliability, QoS Capacity, QoS Scalability, QoS Economics, QoS Availability, etc. They will be explained in the following sections as concepts of the QoS ontology.

2.2 Top QoS Ontology

The top QoS ontology defines the basic attributes of all qualities, the most generic concepts associated with them, and the way they are measured. These are the basic requirements, mainly including the following aspects:
1) QoSProperty [4,5,6,7,9]: Represents a measurable NFR of a service within a given domain. Software quality characteristics are defined in ISO-9126 as a set of attributes of a software product by which its quality is described and evaluated; a software quality characteristic may be refined into multiple levels of sub-characteristics. We map this definition to QoS, so a QoSProperty is a set of attributes of web service quality characteristics. In some papers, it is also called QoSParameter or Quality [9].
2) QoSMetric [4,6,7,9]: Defines the way each QoS property is assigned a value. In [4], the metric is defined as a concrete requirement, but we consider the metric to be required rather than optional: when the metrics are not the same (for example, simple vs. complex), the comparison between two QoS parameters is useless.
3) QoSRelationship [4,5,7,9]: Describes how qualities are correlated. It is related to the QoSProperty class via an optional hasRelationship object property. The relationship may be Proportional or InverselyProportional. Qualities are potentially related in terms of direction (opposite, parallel, independent, or unknown) and strength (such as weak, mild, strong, or none).
4) QoSTendency [4,5,7]: In [4] and [7], tendency is named "impact". It represents the way the QoS parameter value contributes to the service quality perceived by the user. The QoSTendency property enables the system to estimate the degree of user satisfaction with regard to a given QoS parameter value. A QoS property can have the tendency directions negative, positive, close, exact, or none.
5) QoSMandatory [4]: Allows a requester to specify which QoS properties are strongly required while others may be optional [4].
6) QoSAggregated [5,7]: A quality composed from other qualities. The price/performance ratio, for instance, aggregates price and performance.
7) QoSWeight [4,5,6]: A float in the range [0,1] specifying the user's request preferences. A QoSWeight of 0 is equivalent to an unconstrained QoS attribute.
8) QoSLevel [4]: Specifies different quality levels of a service so that a requester can choose the most appropriate quality level for its demands.
9) QoSRole [4]: Providers, requesters and third-party users.
10) QoSDynamism [4,7]: Represents the nature of the property's datatype, whether it is static or dynamic: it can be specified once (static) or require periodic updating of its measurable value (dynamic). In [7], it is called "Nature".
11) QoSPriority [4]: Allows a requester to specify which properties are strongly required while others may be optional.

2.3 Middle QoS Ontology

The middle QoS ontology describes the concrete level of requirements and defines the vocabulary of the middle QoS ontology, such as properties, metrics and relationships [2,4,5,6,7,8,9]. It is a set of domain-independent quality concepts and is widely used by various web applications; it can be completed by a domain-specific lower ontology. The QoSProperty middle ontology depicted in Fig. 2 links the top ontology with the lower ontology, and it is a common superclass for the following concepts:
1) Availability [2,6,7,9]: The probability that a web service can be accessed successfully. It is calculated based on customer feedback.
2) Accessibility [7]: Expressed as a yardstick used to indicate the probability that an instance of the service succeeds at some point.
3) Integrity [2,5,6,7,9]: A measure of the service's ability to prevent unauthorized access and preserve its data integrity.
4) Performance [2,5,6,9]: A measure of the limit to which the service can be provided. It may be based on throughput and response time; the subclasses of Performance are Throughput [2,8] and Response Time [2,8].
5) Reliability [2,5,6,7,8,9] [ISO 9126]: Indicates the ability of the service to correctly carry out its functionality; a comprehensive evaluation of the quality of the service.
6) Security [2,5,6,9] [ISO 9126]: Involves validating the parties, encrypting messages, and providing access control, confidentiality and non-repudiation.
7) Stability [2,5,9] [ISO 9126]: The rate of change of the service's attributes, such as its service interface and method signatures.
8) Economics [5,8,9]: What the service requester must pay in order to obtain the service in time and space (memory); the main feature is cost. In [2,6,7], economics equals cost.
9) Robustness [2,5,6,9]: The fault tolerance when non-normal data is input or the service is invoked incorrectly.
Fig. 2. Core QoSProperty Middle Ontology
Fig. 3. QoS Ontology
10) Capacity [2,6,7,9]: The number of concurrent requests a service allows. When the service works beyond its capacity, its reliability and availability are adversely affected.
11) Trust: Mainly depends on users' experiences of using the service and evaluates the credibility of user reports.
12) Scalability [2,5,6,7,9]: Defines whether the service capacity can increase with the consumer's requirements.
13) Interoperability [2,5,9] [ISO 9126]: Defines a service that is compatible with the standards; this matters for consumer programs or agents operating with services.
14) Network_related: Defines the attributes related to the network.

Each QoS property has one or more metrics representing the way to measure its value. The QoS metric concepts provide the necessary information on how to measure, compare and assess QoS values. The QoSMetric middle ontology includes such concepts as Value, Unit (Time, Currency), and ValueType (List, Set, Range, Vector; String, Numeric, Nominal, Boolean). The QoS metric attributes are described in detail, as are the QoS relationships. The QoS lower ontology meets the NFRs of a specific domain; it is used for a particular domain and is not discussed in this paper. In [6], a QoS network ontology was introduced as an example of a QoS lower ontology. QoS ontology development is a relatively large systematic project, including domain knowledge acquisition and analysis, conceptual design, domain ontology constraints, and a series of iterative testing proposals and processes. After careful comparison and selection, we chose Protégé to develop our QoS ontology. It is open, easy to extend, and supports the standards well, including OWL; its interface is simple and friendly. Fig. 3 is a screenshot of the QoS ontology in Protégé.
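As a rough illustration only (not the ontology actually built in Protégé), a few of these concepts could be expressed programmatically as OWL triples; the namespace IRI and the hasMetric property shown here are assumptions for the sketch.

# Sketch: expressing a handful of QoS ontology concepts as OWL triples with rdflib.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

QOS = Namespace("http://example.org/qos#")  # placeholder IRI, not from the paper
g = Graph()
g.bind("qos", QOS)

# Top-ontology concepts
for cls in ("QoSProperty", "QoSMetric", "QoSRelationship"):
    g.add((QOS[cls], RDF.type, OWL.Class))

# A middle-ontology property as a subclass of QoSProperty
g.add((QOS.Reliability, RDF.type, OWL.Class))
g.add((QOS.Reliability, RDFS.subClassOf, QOS.QoSProperty))

# hasMetric links a QoSProperty to its QoSMetric (assumed property name)
g.add((QOS.hasMetric, RDF.type, OWL.ObjectProperty))
g.add((QOS.hasMetric, RDFS.domain, QOS.QoSProperty))
g.add((QOS.hasMetric, RDFS.range, QOS.QoSMetric))

print(g.serialize(format="turtle"))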
3 A Semantic Web Services Discovery Algorithm Based on QoS Ontology

3.1 The System Model

First of all, we extend the UDDI to support publishing and querying web services with QoS. Web service providers publish semantic web services with QoS to the UDDI registry through the OWL-S/UDDI Translator, which implements the mapping between the OWL-S profile and UDDI elements.
Fig. 4. The Web Services Discovery Framework Based on QoS Ontology
The web service requester sends a service request with QoS to the UDDI registry; before that, the request information is validated and formatted by the query information processor. When receiving a request, the service discovery engine selects the advertisements from the UDDI that are relevant to the current request. In this process, the Reasoner is called to compute the matching level; it uses the domain ontology DB as data and uses the matching algorithm to compute the matching level. Thus, we obtain some candidate services that satisfy the requester's functional needs. The candidate web services are sent to the QoS Processor. When it begins to calculate the QoS matching degree, the QoS Processor first sends both the candidates' QoS and the requester's QoS to the Reasoner, and then it calls the QoS filter to extract the QoS data from the QoS ontology. The Reasoner calculates the matching degree of the requester's QoS with the candidates' QoS; in this process, the Reasoner calls the Semantic Matcher to compute the matching level and uses the QoS ontology DB as data. Thus, the services with the most suitable QoS can be selected from the candidate services, and the UDDI registry returns them to the web services requester. The full process is described in Fig. 4.

3.2 Semantic Matching of QoS Parameters

In this paper, we propose a new method to obtain the QoS matching degree between a service requester and a service advertisement. We evaluate the QoS parameters from both the semantic and the numerical aspect. In our algorithm, the first step is to calculate the semantic matching degree of the QoS parameters (introduced in Section 3.3); thus we can identify whether the QoS parameters have the same semantic meaning. The second step is to calculate the numerical matching degree of the requester and the advertisement. The first step actually uses our existing matching algorithm in [3] to calculate the semantic similarity of QoS parameters. In this process, our QoS ontology is used, which guarantees QoS semantic consistency: the I/O parameter concepts are substituted by QoS concepts, and the I/O domain ontology taxonomy tree is substituted by the QoS ontology taxonomy tree. If there are many QoS parameters, we can use the QoSWeight attribute of our QoS ontology in the calculation for each parameter value. For example, a QoS description model of a service requester might be:

QoSR = ((cost, 90, 0.2), (responseTime, 0.1, 0.2), (reliability, 0.9, 0.2), (availability, 0.8, 0.2), (trust, 0.9, 0.2))
The QoSWeight vector in this model is (0.2, 0.2, 0.2, 0.2, 0.2). We can add other QoS information into this model to obtain more accurate matching results.
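The following is a minimal sketch of how such a weighted request model could be represented and aggregated; the per-parameter matching values and the normalized weighted-sum rule are illustrative assumptions, not the exact computation of the paper.

# Illustrative representation of the requester's QoS model QoSR:
# (parameter, requested value, weight) triples.
qos_request = [
    ("cost", 90, 0.2),
    ("responseTime", 0.1, 0.2),
    ("reliability", 0.9, 0.2),
    ("availability", 0.8, 0.2),
    ("trust", 0.9, 0.2),
]

def weighted_qos_match(per_parameter_match, request):
    """Combine per-parameter matching degrees (each in [0, 1]) using QoSWeight.

    per_parameter_match maps a parameter name to its matching degree; a weight of 0
    leaves that parameter unconstrained, as described for QoSWeight above.
    """
    total_weight = sum(w for _, _, w in request) or 1.0
    return sum(w * per_parameter_match.get(name, 0.0)
               for name, _, w in request) / total_weight

# Example: every parameter matches perfectly except cost.
print(weighted_qos_match({"cost": 0.5, "responseTime": 1, "reliability": 1,
                          "availability": 1, "trust": 1}, qos_request))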
As for the numerical matching, because different quality parameters have different data types and value ranges (such as cost and response time), the values of the quality parameters need to be quantified and standardized in order to perform numerical matching. The quantification and standardization of QoS parameters is not discussed here.

3.3 QoS Ontology Based Web Service Discovery Algorithm

Researchers have paid more attention to the functional matching of web services in recent years. In Paolucci's model [10], for example, semantic similarity is divided into Exact, Plug-in, Subsumes and Fail. It mainly considers the semantic matching of web services' input/output parameters and considers little about QoS matching, especially the semantic matching of QoS parameters. In this paper, we compare the matching degree of a service requester with a service advertisement from both the functional and the non-functional aspect. The overall matching degree of a service requester with a service provider is calculated from two parts: one is the semantic matching of I/O, i.e., the functional matching in [3]; the other is non-functional matching, i.e., the QoS matching introduced in Section 3.2. In this process, an ontology is used in order to resolve semantic ambiguity. Hence, our service matching algorithm is defined in formula (1):

SD(W_1, W_2) = \sigma \, SD_{I/O}(W_1, W_2) + \theta \, SD_{QoS}(W_1, W_2)    (1)
Among them,

SD_{I/O}(W_1, W_2) = \max \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{SD(S_{1i}, S_{2j})}{\min(n, m)}    (2)

SD_{QoS}(W_1, W_2) = \frac{1}{2} \left( \max \sum_{i=1}^{k} \sum_{j=1}^{l} \frac{SD(S_{1i}, S_{2j})}{\min(k, l)} + SD_{numerical} \right)    (3)

In formula (2), SD(S_1, S_2) is defined as

SD(S_1, S_2) =
\begin{cases}
e^{-\alpha \frac{1}{p+n}} & (p = 0) \\
e^{-\alpha (p+n) - d} \cdot \dfrac{e^{\beta \cdot lw} - e^{-\beta \cdot lw}}{e^{\beta \cdot lw} + e^{-\beta \cdot lw}} & (S_1, S_2 \text{ are not siblings}) \\
\dfrac{e^{\beta \cdot lw} - e^{-\beta \cdot lw}}{e^{\beta \cdot lw} + e^{-\beta \cdot lw}} & (S_1, S_2 \text{ are siblings})
\end{cases}    (4)
In formula (1), SD(W_1, W_2) is the final matching degree between web service W_1 and web service W_2; SD_{I/O} is the I/O semantic matching degree and SD_{QoS} is the QoS semantic matching degree. SD_{QoS} and SD_{I/O} have similar calculating processes, but we calculate SD_{QoS} and SD_{I/O} separately, because SD_{QoS} is the sum of semantic matching and numerical matching. When calculating SD_{I/O}, we actually use formula (4), which we defined in [3]; we use an exponential-decay function to define it. In formula (4), p, d, lw, and n are a group of numbers defined on the taxonomy tree; they are explained in detail in [3] and are not discussed here. The parameters S_1 and S_2 in formula (4) are matched pair by pair, calculating the similarity between each pair of I/O parameters, in order to obtain the maximum similarity over matching combinations. For S_1 and S_2, the numbers of I/O parameters are not necessarily equal; unmatched parameters do not participate in the final computation. SD_{numerical} is the
numerical matching degree of W_1 and W_2, which is a float value in [0, 1]. Many papers have introduced methods for calculating SD_{numerical}, and we do not discuss them here. Finally, a float in [0, 1] is obtained; it is a precise numerical matching degree of the two web services' semantic similarity.
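To make the computation concrete, the following is a minimal sketch of formulas (1)-(3) under the assumption that the pairwise concept similarity SD(S_1, S_2) of formula (4) and the numerical degree SD_{numerical} are supplied by separate components; all function and parameter names, and the brute-force search over matchings, are assumptions for illustration, not the paper's implementation.

# Sketch of formulas (1)-(3): overall matching degree from pairwise similarities.
from itertools import permutations

def max_pairwise_sum(sim, params1, params2):
    """max over one-to-one matchings of sum of sim(s1, s2) / min(n, m).

    Exhaustive search; assumes sim is symmetric and parameter lists are small.
    """
    n, m = len(params1), len(params2)
    if n == 0 or m == 0:
        return 0.0
    small, large = (params1, params2) if n <= m else (params2, params1)
    best = 0.0
    for perm in permutations(large, len(small)):
        best = max(best, sum(sim(a, b) for a, b in zip(small, perm)))
    return best / min(n, m)

def overall_matching(sim_io, io1, io2, sim_qos, qos1, qos2, sd_numerical,
                     sigma=0.5, theta=0.5):
    sd_io = max_pairwise_sum(sim_io, io1, io2)                             # formula (2)
    sd_qos = 0.5 * (max_pairwise_sum(sim_qos, qos1, qos2) + sd_numerical)  # formula (3)
    return sigma * sd_io + theta * sd_qos                                  # formula (1)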
4 Conclusion and Future Work

In this paper, we design and define a QoS ontology in detail. First, we introduce the NFRs of web services in detail and specify the most important QoS concepts. Then we use Protégé as the ontology editing tool to develop a web service QoS ontology model; this model can be extended easily. We then add the QoS ontology into our web services discovery model and propose our QoS ontology based web services discovery algorithm. The QoS ontology is used to provide a standard, formal description of QoS properties to improve the effectiveness and accuracy of web service discovery results. The algorithm can guarantee that the best-matched service is returned to satisfy both the user's functional requirements and NFRs. We also extend the UDDI to support publishing and querying web services with the QoS ontology; the extended UDDI registry allows for semantic descriptions of QoS. In the near future, we are going to publish, discover and compose web services based on the existing system model, aiming at the auto-discovery and auto-composition of web services.
References
1. Web Services Architecture. W3C Working Group Note (2004), http://www.w3.org
2. Garcia, D.Z.G., Toledo, D., Felgar, M.B.: Semantic-enriched QoS policies for web service interactions. In: Proceedings of the 12th Brazilian Symposium on Multimedia and the Web (WebMedia), ACM International Conference Proceeding Series, vol. 192, pp. 35–44 (2006)
3. Yang, H.R., Liu, S.S., Fu, P.B., Qi, H.H., Gu, L.H.: A semantic distance measure for matching web services. In: Proceedings of the 2009 International Conference on Computational Intelligence and Software Engineering (CiSE) (2009)
4. Tran, V.X.: WS-QoSOnto: A QoS ontology for web services. In: IEEE International Symposium on Service-Oriented System Engineering, pp. 233–238 (2008)
5. Yao, S.J., Chen, C.X., Dang, L.M., Liu, W.: Design of QoS ontology about dynamic web service selection. Computer Engineering and Design 29(6), 1500–1548 (2008) (in Chinese)
6. Giallonardo, E., Zimeo, E.: More semantics in QoS matching. In: Proc. of the IEEE Intl. Conf. on Service Oriented Computing and Applications, pp. 163–171. IEEE Computer Society, Los Alamitos (2007)
7. Papaioannou, I.V., Tsesmetzis, D.T., Roussaki, I.G., Anagnostou, M.E.: A QoS ontology language for web services. In: Proceedings of the 20th International Conference on Advanced Information Networking and Applications, pp. 101–106 (2006)
8. Zhou, C., Chia, L., Lee, B.: DAML-QoS ontology for web services. In: International Conference on Web Services 2004 (ICWS 2004), San Diego, California, USA (2004)
9. Maximilien, E.M., Singh, M.P.: A framework and ontology for dynamic web services selection. IEEE Internet Computing 8(5), 84–93 (2004)
10. Paolucci, M., Kawamura, T., Payne, T.R., et al.: Semantic matching of web services capabilities. In: The 1st International Semantic Web Conference, Sardinia, Italy, pp. 333–347 (2002)
Implementation of an Intelligent Product Recommender System in an e-Store

Seyed Ali Bahrainian1, Seyed Mohammad Bahrainian2, Meytham Salarinasab3, and Andreas Dengel1,4

1 Computer Science Dept., University of Kaiserslautern, Germany
2 Shahid Beheshti University, Tehran, Iran
3 Moje Fannavari Houshmand, Iran
4 Knowledge Management Dept., DFKI, Kaiserslautern, Germany
{ali.bahrainian,mohammad.bahrainian,msalari14}@gmail.com, [email protected]
Abstract. With the emergence of new technologies and modern methods of marketing and the increasing intensity of competition among firms and companies for attracting new customers and making them loyal, a novel automatic solution is needed more than ever. The combination of Electronic Customer Relationship Management (E-CRM) and Artificial Intelligence (AI) has appeared as a solution in recent years. Recommending appropriate products to customers according to their needs is one of the methods of CRM. This paper introduces a system named VALA. It is a product recommender system using adjustable customer profiles and a dynamic grouping process which recommends products to each customer dynamically, as his/her preferences change. In other words the User Interface (UI) alters automatically as the customer profile changes. This recommender system combines collaborative filtering and non-collaborative filtering methods in order to come up with useful and unique suggestions for each customer.
1 Introduction

It has been a long time since companies began selling their products using web technology and online marketing. The number of companies that use such marketing strategies is increasing rapidly, and likewise the number of customers who search for and buy their desired products through the Internet is growing enormously. The result of this phenomenon is that even large groups of human consultants are no longer able to serve all the customers of a company. Thus, companies are looking for a solution to this problem: human workforces need to be replaced by entities that are more flexible, economical and robust. Intelligent software agents can be a beneficial replacement for human workforces; they can serve customers as artificial consultants. Since the numbers of both customers and products are growing so rapidly, companies want their marketing and customer satisfaction services to be artificially intelligent. In today's e-stores with millions of products, customers can barely find the product they are looking for; they must spend a great amount of time searching for a product.
Nevertheless, their search may not be successful. Intelligent product recommender systems are the solution to this problem. Reference [7] states that recommender systems are an important part of many websites and play a central role in the e-commerce effort toward personalization and customization. This paper presents an intelligent product recommender system designed to recommend products to customers based on their preferences and demands. The organization of this paper is as follows: in Sections 1.1 and 1.2, basic definitions, different types of recommender systems and other research in this area are discussed; in Section 1.3, the need for such a system is described; in Section 2, the components of VALA are explained, and in Section 2.3 the execution process of VALA is described; in Section 3, the results of testing the system are presented; and finally, in Section 4, the final words regarding VALA are stated.

1.1 E-CRM and Product Recommender Systems

E-CRM is a strategy that companies use to serve their customers and maximize the benefits of both the customers and themselves. Reference [1] defines CRM in this way: "having the technology to provide an integrated view of all customer interactions and changing the corporate culture to leverage this information to maximize the benefits to the customer and the company." To change the corporate culture and, as a result, maximize the benefits of both the customer and the company, some sort of product recommender system is needed. Such a system must recommend products to customers according to their customer profiles; this improves the customers' shopping experience and increases sales. Many recommender systems ask the customer to rank or rate products in order to be able to produce a list of recommended products for him/her. These recommender systems, whose customer profiles are called static [3], can only be updated using feedback from the customer, such as rankings of different products. Other recommender systems, like [4] and [5], with a dynamic or non-invasive profile, get updated automatically with the assistance of intelligent software agents: the user profile is created without asking the customer about his/her preferences. In VALA we have adopted the latter approach; we recommend products to customers without directly asking them to rank products.

1.2 Other Research on Product Recommender Systems

There has been a great deal of research on recommender systems. The first group of these systems uses collaborative filtering methods for recommending products to customers [6,8]. This means that they incorporate information from groups of customers to recommend products to each customer by determining which group the customer belongs to. In other words, these systems use preference information not only from the customer being served but also from other customers. The service "customers who bought X also bought Y" is an example of this method, which is now broadly used in online stores. Another group of these systems, on the other hand, uses preference information only from the customer being served. These systems, referred to as non-collaborative recommender systems, recommend products to a customer based only on
his/her customer profile. The system described in reference [9] is an example of these systems. The third group of these systems adopts collaborative filtering together with a neural network for putting customers into clusters dynamically. As in [4], most of these systems use a fuzzy neural network to obtain a more accurate understanding of customers' behaviours and to better group them into clusters. Reference [9] states that it is clear that future recommender systems will incorporate both collaborative and non-collaborative perspectives. In this research, what we have strived for is to design an intelligent recommender system with precise and robust performance. For this reason we have incorporated both collaborative filtering and a non-collaborative method (using a genetic algorithm), and combined them to come up with a vigorous recommender system. To the best of our knowledge, such a strategy for designing a recommender system has not been reported.

1.3 Our Goal and the Need for This System

Current recommender systems are not adequately intelligent to give suitable suggestions to customers. Because of the type of filtering they adopt, most of them have not been successful in giving unique recommendations to each customer according to his/her preferences, and the ones that have been successful directly ask the customer to rank many products in order to recognize the customer's taste. The aim of this research is to design a recommender system, from the perspective of E-CRM and AI, that gives each customer unique and appropriate suggestions without directly asking him/her about his/her preferences, and in this context serves customers better than before.
2 Description of the VALA System

To give a brief description of VALA, the system consists of customer profiles and three intelligent agents. The first agent is the clustering agent, which is responsible for the collaborative part of our recommender system. The second agent is the profile agent: each customer profile is created, modified and managed by a profile agent, so each customer has a specific profile and profile agent. This agent is equipped with a genetic algorithm, which is responsible for the non-collaborative part of our recommender system. The third agent is called the Interface agent, which is responsible for modifying the User Interface (UI) and presenting the recommended products to each customer. Fig. 1 presents the different parts of the VALA recommender system and also shows the relations between them. In the following sections, all aspects of each part are described.

Fig. 1. VALA System

2.1 Intelligent Agents

Profile Agent: As mentioned before, for each customer signing up to the system there is one profile agent. This agent observes the customer's behavior and creates and modifies the customer's profile. It keeps a record of the customer's preferred products
(those that were visited or bought by the customer) in his/her profile and creates and sends a list of these products to the clustering agent. It also ranks the products suggested by the clustering agent and, after ranking these products, applies a genetic algorithm to those products that, according to the customer profile, are appropriate for the customer. When it comes to shopping, every customer has his/her own criteria and taste for buying products. Most customers prefer specific brands from specific countries; for instance, a customer may be interested in American software programs, Japanese laptops and American books, and among American books he/she may be interested in books from Wiley. What we have strived for is to apply the customer's criteria to the list of recommended products using a genetic algorithm. The main goal of the genetic algorithm is to select those products that match the customer's criteria and recommend them to the customer. After applying the genetic algorithm to the ranked products, a list of suggested products is sent to the Interface agent. While observing the customer's behavior, the Profile agent recognizes the customer's criteria. A detailed description of the genetic algorithm is given in Section 2.4.

Clustering Agent: This agent is responsible for grouping the customers and putting them into clusters based on their preferences. As mentioned before, this agent is responsible for collaborative suggestions in the VALA recommender system. Customers with the same interests are grouped in the same cluster. Each profile agent sends information about its customer profile (the preferences of the customer) to the clustering agent. The clustering agent receives all the data, puts the customers into clusters, and ranks and rates the existing products of each cluster. It then creates a base vector of recommended products and sends it to all profile agents. It must be emphasized that the products are rated from highest recommended to lowest; therefore, customer profile agents with the same preferences receive the same list of recommended products from the clustering agent.

Interface Agent: This agent receives the list of recommended products from the profile agent and presents them to the customer via the UI. Since this agent only displays 20 products at a time, and there may be more than 20 recommended products or the customer's preferences may change in a short time, there must be a dynamic mechanism for displaying the products. The recommended products are sent to the Interface agent in the form of messages. As in [2], we use message fading for displaying the
most relevant products. This means that the messages (which include the recommended products) that the profile agent sends to the Interface agent are time-stamped; they slowly become less relevant and get replaced by new messages.

2.2 The Customer Profile Format

In this part, the format of the customer profile is described. In order to make the profile smaller and more practical for our research, we have assumed that our e-store only offers three different product categories: books, CDs and laptop computers. Each category has three sub-categories. A part of a typical customer profile is shown in Fig. 2. For the sake of simplicity in presentation, we have narrowed down the number of products visited by the customer. Each category has a number between 0 and 1 in front of it; the sum of all three numbers is 1. This number represents the customer's interest in each category in comparison to the others. It is computed by dividing the number of products in a category by the total number of products in all three categories. The same goes for each sub-category: the sum of the numbers in front of all three sub-categories must be 1. Thus, we have used a fuzzy approach for keeping a record of the customer's preferences. In the example customer profile, the customer has visited three books: two of them are from the science sub-category and one is from the story sub-category. The book from the story sub-category is from the UK and from Oxford publication, so the number in front of both of them is 1.
Fig. 2. The Customer Profile Format
The second part of the profile, which is not depicted in the figure, keeps the index numbers of all visited products. Each product in the system has a unique index number that specifies that product; this index number is used for sending and receiving the specifications of products between the three agents. With this profile format, recognizing the customer's preferences and recommending relevant products is much more precise than with profiles that only keep the name of the product, as in [4]. In the third column of the profile, in front of each sub-category, the countries the products are from and their brands are stated. The number in front of each country and each brand indicates the number of products that the customer has visited from that specific country and that specific brand. Later, we use these numbers to create our fitness function for the genetic algorithm. The genetic algorithm uses both the fuzzy numbers and the exact numbers as thresholds for determining whether a certain group of products meets the customer's criteria and preferences. The ratio of the products recommended to the customer is exactly the same as the numbers in front of each category. If a customer's preferences change so fast that the recommended products are no longer suitable for him/her, there are buttons on the UI for each category that the customer can use. In this case, the profile agent first changes the ratio of recommended products and sends a new recommended list with all products from the customer's desired category. Then, if the profile agent notices that there are only a few products in the customer's profile under his/her desired category, meaning that the information in the profile is not adequate for recommending products, it suggests the top products visited by other customers under that sub-category.
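The following is a minimal sketch of such a profile in memory, with the fuzzy category ratios and the per-country/per-brand counters described above; the field names are illustrative assumptions, not VALA's actual Java classes.

# Illustrative in-memory form of the customer profile described above.
from collections import Counter, defaultdict

class CustomerProfile:
    def __init__(self):
        self.visits = defaultdict(Counter)   # category -> Counter of sub-categories
        self.countries = Counter()           # per-country visit counts
        self.brands = Counter()              # per-brand visit counts
        self.visited_index_numbers = []      # second part of the profile

    def record_visit(self, index_number, category, subcategory, country, brand):
        self.visits[category][subcategory] += 1
        self.countries[country] += 1
        self.brands[brand] += 1
        self.visited_index_numbers.append(index_number)

    def category_ratios(self):
        """Fuzzy interest per category: visits in that category / all visits (sums to 1)."""
        total = sum(sum(c.values()) for c in self.visits.values()) or 1
        return {cat: sum(c.values()) / total for cat, c in self.visits.items()}

# The example from the text: three books, two Science and one Story (UK, Oxford).
p = CustomerProfile()
p.record_visit(101, "Books", "Science", "USA", "Wiley")
p.record_visit(102, "Books", "Science", "USA", "Wiley")
p.record_visit(103, "Books", "Story", "UK", "Oxford")
print(p.category_ratios())  # {'Books': 1.0}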
2.3 VALA's Performance

In this part, the performance of VALA and the relations among the different components of the system in the execution process are described.

The customer signs up to the system, and a profile agent is assigned to this customer. This agent creates the customer's profile. The customer starts browsing or searching for products while the profile agent observes the customer's behavior. If the customer visits the details of a product, the profile agent adds the information of that product to the profile. It also keeps the index number of that product for passing it to the clustering agent. A higher priority is given to those products that have been visited more than once. After modifying the customers' profiles, the profile agents send the index numbers of the most visited products to the clustering agent. The clustering agent receives those index numbers and puts them into clusters. This part of the system, which is the collaborative part, creates a base vector of the products most visited by all customers. This base vector is then sent to each profile agent. Each profile agent receives the base vector and compares the products included in the base vector with those of the customer profile. It then generates a profile vector for that customer. The profile vector contains values that lie in the [0, 1] interval: if a product is in a sub-category whose products the customer has never visited, the corresponding element of the vector is zero; otherwise, its corresponding value in the profile vector is non-zero.
Thus, the list of most visited products sent by the clustering agent is filtered according to the customer's profile, as sketched below. Then the customer's shopping criteria and the profile vector are given to the genetic algorithm as inputs, and the genetic algorithm generates the list of products best suited to the customer as output. After the list of suitable products has been generated, the profile agent sends it to the Interface agent, and the Interface agent displays the recommended products on the UI.
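A sketch of the profile-vector step referred to above: the base vector of most-visited products is filtered against the customer's sub-category interests; the concrete scoring rule (using the fuzzy sub-category ratio as the non-zero value) is an assumption for illustration.

# Illustrative profile-vector generation: zero for products in sub-categories the
# customer never visited, otherwise the customer's fuzzy interest in that sub-category.
def profile_vector(base_vector, subcategory_ratios, product_subcategory):
    """base_vector: product index numbers from the clustering agent.
    subcategory_ratios: sub-category -> interest in [0, 1] from the profile.
    product_subcategory: product index number -> sub-category."""
    return [subcategory_ratios.get(product_subcategory.get(idx), 0.0)
            for idx in base_vector]

base = [101, 205, 309]
ratios = {"Science": 0.67, "Story": 0.33}              # from the profile example above
subcats = {101: "Science", 205: "Story", 309: "Rock"}  # 309 is in an unvisited sub-category
print(profile_vector(base, ratios, subcats))           # [0.67, 0.33, 0.0]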
2.4 The Genetic Algorithm

In VALA, a GA is used to find, from the list of products most visited by all customers, those that best meet a customer's shopping criteria. Those criteria play the major role in the fitness function of the GA. The fitness function is a mixture of the number that represents the customer's interest in each category, the number that represents the customer's interest in each sub-category, the number of products the customer has visited from a certain brand, and the number of products the customer has visited from a certain country. The set of highly rated products in the profile vector is our population, and the products that are finally recommended to the customer are chosen from this list. The mutation technique is applied by randomly changing the index numbers of a few products to the index numbers of other products that are included in the profile vector but have a lower rank. This is to verify whether the customer is interested in products that have received a lower rank from the profile agent; if so, the GA corrects itself repeatedly. For example, if a customer visits a product that has a lower rank but is recommended because of mutation, the profile agent modifies the profile accordingly and learns from the customer's behavior. The GA in VALA keeps a record of its status in memory for future use. The structure of the GA in VALA is multithreaded: a thread of the GA is invoked by the profile agent on each demand. The GA is used to give the best combination of highly rated products according to the customer's criteria. The dynamic and learning nature of the GA makes it possible for the recommender system to continually correct itself and recommend proper products to customers according to their preferences. The profile agent observes the customer to get feedback from him/her to modify the profile and improve the suggestions.
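A hedged sketch of the fitness and mutation ideas described above: a candidate recommendation list is scored by how well its products match the customer's category, sub-category, brand, and country counts; the equal weighting and the mutation rule shown here are assumptions, not VALA's exact implementation.

# Illustrative GA fitness and mutation for a candidate list of products.
import random

def fitness(candidate, profile):
    """Score a candidate product list against the customer's recorded preferences.

    Each product is a dict with category, subcategory, brand, country;
    profile holds the fuzzy ratios and the brand/country visit counts.
    """
    score = 0.0
    for product in candidate:
        score += profile["category_ratio"].get(product["category"], 0.0)
        score += profile["subcategory_ratio"].get(product["subcategory"], 0.0)
        score += profile["brand_counts"].get(product["brand"], 0)
        score += profile["country_counts"].get(product["country"], 0)
    return score

def mutate(candidate, lower_ranked_pool, k=2):
    """Randomly swap k products for lower-ranked ones from the profile vector."""
    mutant = list(candidate)
    for _ in range(min(k, len(mutant), len(lower_ranked_pool))):
        mutant[random.randrange(len(mutant))] = random.choice(lower_ranked_pool)
    return mutant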
3 Test and Evaluation To test the system, an e-store was simulated. We used the Java programming language, which is well suited to multi-agent Web-based systems. The system was tested using 5 simulated customers. We first tested the VALA recommender system to determine the relevancy rate of the products recommended to the 5 customers. For this purpose we added 100 other customers with random shopping behavior, to make the clustering agent capable of generating the base vector. It must be emphasized that in Fig. 2 we divided each sub-category into smaller sub-categories to give the recommender system a higher chance of recommending suitable and relevant
products to customers. For instance, the science category was divided into computer, physics, and chemistry. The tested effect of this was a decrease in the average number of suggestions per customer; however, the relevancy rate of the suggestions increased. Depending on how many groups and clusters of customers are made by the clustering agent, the relevancy rate of the suggestions sent to the profile agent changes. To measure the relevancy rate, we used the number in [0, 1] that the profile agent assigns to each product while rating and ranking. The relevancy rate for different numbers of customer groups is shown in the table below:

Test   Num-groups   Num-suggestions   Relevancy rate
T01    7            46                57.4 %
T02    8            35                72.2 %
T03    13           23                86.1 %
This indicates that as the number of groups increases, the relevancy rate improves. Thus, the more iterations the clustering process performs, the better the relevancy rate. The second test, which is related to the GA, concerns the customers' interest in the products suggested by VALA. We tested the relevancy of the products recommended to the 5 customers. Each customer received 20 suggestions in two sequential tests (by sequential we mean that in the second test the system had learned from the customers' reactions). The number of relevant suggestions per customer was as follows:

                   Customer 01   Customer 02   Customer 03   Customer 04   Customer 05
Relevancy-Test 1   18            17            16            19            16
Relevancy-Test 2   18            20            17            18            18
The relevancy rate increased on average, which means that the more the customer uses the system, the better the suggestions become. The decrease in relevancy for Customer 04 can be explained by the fact that in the GA the index numbers of 2 products are changed randomly as mutation. Thus, after using the system several times, since the profile agent learns the preferences of the customer, the system should be able to suggest at least 18 relevant products out of 20. In conclusion, using a genetic algorithm to filter the products recommended to the customer appears to improve the overall results of a recommender system. If further criteria were added to the fitness function of the genetic algorithm, the system's performance would be even better and more robust.
4 Final Words In this paper, we designed a recommender system that combines collaborative filtering and a non-collaborative method for recommending suitable and unique products to each customer.
There is one more issue that must be mentioned about the VALA system: the genetic algorithm is executed each time the customer's profile is updated.
Recommendation of Little Known Good Travel Destinations Using Word-of-Mouth Information on the Web Kouzou Ohara, Yu Fujimoto, and Tomofumi Shiina Department of Integrated Information Technology, Aoyama Gakuin University 5-10-1 Fuchinobe, Chuoku, Sagamihara, Kanagawa 252-5258, Japan
Abstract. In this paper, we propose a method to recommend to a tourist (user) a travel destination that is little known to many people but of interest to the user. To this end, we use two recommendation techniques, i.e., collaborative filtering and content-based filtering. We use the collaborative filtering method to predict the user's preference and select a destination that is well known and of interest to the user. Then, with this destination as a clue, we make a final recommendation by finding a destination that is similar to the clue but not itself well known, by means of the content-based filtering method. To characterize travel destinations, we focus on the many pieces of word-of-mouth information about them on the Internet, and use tf-idf values of keywords appearing in them to construct feature vectors for destinations. We conduct a user study and show that the proposed method is promising.
1 Introduction Nowadays, due to the rapid growth of the Internet, we can access vast amounts of information on the Web, which makes it common to first visit travel information websites when making a travel plan. However, it is sometimes difficult to find the information we really need. To solve this problem, many travel recommendation systems have been proposed so far [1,2,3,4,5]. These systems adopt one or more commonly used recommendation techniques, i.e., collaborative filtering, content-based filtering, and knowledge-based filtering. Collaborative filtering methods make recommendations based on the preferences of similar users [6], while content-based filtering methods are based on the similarity between items to recommend. Knowledge-based filtering methods, in turn, use knowledge about users and items to make recommendations. These systems and commercial travel service sites offer various kinds of travel information, including destinations, accommodations, transportation, activities at the destination, and so on. Among them, destinations could be the first key issue in making a travel plan, and indeed there are many criteria for evaluating them. In this paper, we focus on a destination that is of interest to the tourist but not well known to many people. Such a destination is expected to be really attractive for tourists. We refer to it as a little known good travel destination. A. An et al. (Eds.): AMT 2010, LNCS 6335, pp. 183–190, 2010. © Springer-Verlag Berlin Heidelberg 2010
In this paper, we aim at recommending such little known good travel destinations to a tourist (user). To this end, we use two recommendation techniques, i.e., collaborative filtering and content-based filtering, as well as word-of-mouth information about destinations on the Web. Note that if a destination is little known to many people, only a little information about it is available even on the Internet, and it is known that collaborative filtering methods cannot recommend such a destination. Thus, in our approach, we use collaborative filtering to predict users' preferences based on a limited number of explicit feedbacks from users for well known destinations. Through this prediction, we determine a destination that is well known and of interest to the target user. Then, by means of the content-based filtering technique, we choose from among destinations that have not been evaluated by many people the final candidate that is similar to the destination determined by the collaborative filtering. To characterize candidate destinations, we use tf-idf values of keywords that appear in the word-of-mouth information about them on the Web. Indeed, there are a large number of review articles about destinations on the Internet, which can be used as good sources of word-of-mouth information. Furthermore, we conduct a user study using actual review articles on the Web and show that the proposed method is promising.
2 Recommendation of Little Known Good Travel Destinations Figure 1 illustrates the outline of the proposed method. In this framework, the flow of the recommendation of little known good travel destinations is roughly divided into two phases. In the first phase, the system predicts the user's preference and, by means of the collaborative filtering technique, selects from among well known travel destinations a candidate destination that could be interesting for the user. After that, in the second phase, it determines by means of the content-based filtering technique the final destination to recommend, which could be preferred by the user but is little known to many people. The following subsections describe each phase in detail. 2.1 Prediction of User Preference Using Collaborative Filtering In the proposed method, we assume that the system asks the user to rate some well known travel destinations when he/she uses it for the first time, in order to acquire his/her preference. Note that the system cannot always successfully acquire this preference through explicit user feedback. The questionnaire should consist of destinations that are as well known as possible so that every user can rate them properly. However, in fact, it is difficult to make such a perfect questionnaire. Eventually, the destinations in the questionnaire may not include the best one for a user, or a user may not know some of them well. To overcome this issue and predict the user's preference more precisely, we use collaborative filtering in the first phase of the proposed method. In other words, we can complement a user's preference with similar users' preferences by using the collaborative filtering technique, because it is expected that users can rate most of the well known destinations in the questionnaire even though the explicit feedback is incomplete.
Fig. 1. Outline of the proposed method
On the other hand, it is desirable that the number of destinations in the questionnaire not be too large, because answering a large number of questions would be tedious for the user and degrade the quality of the feedback. Thus, to avoid sparse feedback and capture the preferences of various kinds of users as precisely as possible, we have to choose a limited number of diverse destinations that would be well known to many people when making the questionnaire. In the same way as most existing collaborative filtering approaches, we use the Pearson correlation coefficient as the similarity measure between users. Let X and Y be sets of users and destinations, respectively, i.e., X = {x_1, ..., x_n} and Y = {y_1, ..., y_m}. In addition, let S be a matrix whose (i, j) element S_ij is the preference rating by user i for destination j. Then, the similarity between users a and b is defined by the following Pearson correlation coefficient ρ_ab:

\[
\rho_{ab} = \frac{\sum_{k \in Y_{ab}} (S_{ak} - \bar{S}_a)(S_{bk} - \bar{S}_b)}{\sqrt{\sum_{k \in Y_{ab}} (S_{ak} - \bar{S}_a)^2}\;\sqrt{\sum_{k \in Y_{ab}} (S_{bk} - \bar{S}_b)^2}} \qquad (1)
\]
where Y_ab is the set of destinations that are evaluated by both a and b, and S̄_i (i = a, b) is the average of the ratings by user i for destinations included in Y_i, the set of destinations evaluated by user i. Since we cannot compute ρ_ab when Y_ab is empty, ρ_ab is defined as 0 in that case. ρ_ab ranges from −1 to 1, and takes 1 if the preference ratings by users a and b completely agree, while it takes −1 in the opposite case. Once we compute the correlation coefficients between all possible pairs of users, we can predict the preference rating by an arbitrary user for a destination that he/she has not yet evaluated.
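As a rough illustration, the user-user similarity of Eq. (1) could be computed as in the Python sketch below; the rating matrix is represented as a hypothetical dictionary of per-user ratings, not the authors' implementation.

```python
import math

def pearson_similarity(ratings_a, ratings_b):
    """Pearson correlation (Eq. 1) over destinations rated by both users.

    ratings_a, ratings_b: dicts mapping destination id -> rating.
    Returns 0 when the users share no rated destinations, as in the paper.
    """
    common = set(ratings_a) & set(ratings_b)           # Y_ab
    if not common:
        return 0.0
    mean_a = sum(ratings_a.values()) / len(ratings_a)  # average over Y_a
    mean_b = sum(ratings_b.values()) / len(ratings_b)  # average over Y_b
    num = sum((ratings_a[k] - mean_a) * (ratings_b[k] - mean_b) for k in common)
    den_a = math.sqrt(sum((ratings_a[k] - mean_a) ** 2 for k in common))
    den_b = math.sqrt(sum((ratings_b[k] - mean_b) ** 2 for k in common))
    return 0.0 if den_a == 0 or den_b == 0 else num / (den_a * den_b)

# Toy example
print(pearson_similarity({"d1": 5, "d2": 3, "d3": 4}, {"d1": 4, "d2": 2, "d4": 5}))
```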
Suppose that user a has not rated destination y_j, which means that the value of S_aj is undefined. Then, the predicted value of S_aj, referred to as Ŝ_aj, is given as follows:

\[
\hat{S}_{aj} = \bar{S}_a + \frac{\sum_{i \in X_j} \rho_{ai}\,(S_{ij} - \bar{S}_i)}{\sum_{i \in X_j} |\rho_{ai}|} \qquad (2)
\]
where X_j is the set of users who have already rated destination y_j. Namely, Eq. (2) computes Ŝ_aj by modifying the average of ratings by user a, S̄_a, by the second term, whose numerator is the weighted sum of the differences between each user i's rating of destination j (S_ij) and his/her average rating (S̄_i), while its denominator is a scaling term that prevents the numerator from being overestimated when the number of users who have rated destination j is large. Intuitively, if a user is similar to user a and he/she gives a higher (lower) rating to destination j than his/her average rating, a positive (negative) value is added to the average of ratings by user a. Note that the sign of the added value is inverted if that user's preference disagrees with the preference of user a. Eventually, we choose the destination with the largest predicted preference rating as the output of the first phase. 2.2 Selection of Little Known Good Destination Using Content-Based Filtering The destination chosen in the first phase is expected to be preferred by the user. However, it is one of the well known destinations. Since our aim is to recommend little known good travel destinations, in the second phase we have to choose such a destination from among those that are little known to many people, with the output of the first phase as a clue. In this paper, we adopt the number of reviews of a destination on a certain website as an objective criterion for measuring how well known the destination is. Namely, we consider a destination "little known" if the number of its reviews is less than a predefined threshold, which is a parameter of the system to be specified by the user. Note that collaborative filtering might not recommend such little known candidates because the number of preference ratings available for predicting the target user's ratings of them is considerably limited. Thus, in the second phase, we use content-based filtering instead of collaborative filtering. To use content-based filtering, we have to characterize destinations so that the similarity between them can be computed. To this end, we use the word-of-mouth information about destinations on the Internet. More precisely, we extract characteristic words from the word-of-mouth information about each destination on the Web, and use their tf-idf scores to form feature vectors that represent the destinations. The tf-idf score w_ij of word c_j for document d_i is given as follows:
\[
w_{ij} = TF_{ij} \times IDF_j \qquad (3)
\]

where TF_ij (the term frequency of word c_j in document d_i) and IDF_j (the inverse document frequency of c_j in the set of documents D) are calculated as follows:

\[
TF_{ij} = \frac{n_{ij}}{\sum_{k} n_{ik}} \qquad (4)
\]

\[
IDF_j = \log \frac{|D|}{|\{\, d \in D : c_j \in d \,\}|} \qquad (5)
\]
Here, n_ij denotes the frequency of word c_j in document d_i. As characteristic words, or keywords, in a document, we used nouns, adjectives, and adjective verbs. With the tf-idf values of n keywords, we denote a travel destination y_i by an n-dimensional vector f_i = (w_i1, ..., w_in). Then, the similarity between two destinations y_i and y_j is defined by the cosine similarity δ:

\[
\delta(f_i, f_j) = \frac{f_i \cdot f_j}{\|f_i\|\,\|f_j\|} \qquad (6)
\]
where f_i · f_j is the dot product of the two vectors and ||f|| is the Euclidean norm of the n-dimensional vector f = (w_1, ..., w_n), i.e., \(\sqrt{w_1^2 + \cdots + w_n^2}\). Since, in our context, the tf-idf scores composing the feature vectors cannot be negative, the cosine similarity ranges from 0 to 1. Obviously, the closer to 1 the similarity between the feature vectors of two destinations, the more similar they are. As a result, given the feature vector f_c of the destination determined in the first phase, we choose in the second phase the destination from among the little known ones whose feature vector f* satisfies the following condition:

\[
f^{*} = \arg\max_{f} \delta(f_c, f) \qquad (7)
\]
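A compact Python sketch of this second phase is given below; the toy tokenization and review texts are placeholders, since the paper extracts Japanese nouns, adjectives, and adjective verbs from real reviews, which this example does not attempt.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    """Build tf-idf feature vectors (Eqs. 3-5) from aggregated review documents."""
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = sum(counts.values()) or 1
        vectors.append([
            (counts[w] / total) * math.log(len(docs) / df[w]) if df[w] else 0.0
            for w in vocab
        ])
    return vectors

def cosine(u, v):
    """Cosine similarity (Eq. 6)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend_little_known(clue_vec, little_known):
    """Pick the little known destination most similar to the clue (Eq. 7)."""
    return max(little_known, key=lambda item: cosine(clue_vec, item[1]))[0]

# Toy usage: token lists per destination; destinations 2 and 3 are "little known"
docs = [["onsen", "quiet", "view"], ["onsen", "crowd"], ["quiet", "view", "forest"]]
vocab = sorted({w for d in docs for w in d})
vecs = tfidf_vectors(docs, vocab)
print(recommend_little_known(vecs[0], [("dest2", vecs[1]), ("dest3", vecs[2])]))
```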
3 User Study To evaluate the proposed method, we conducted a user study. The purpose of this evaluation was to confirm whether the proposed method works well and outperforms collaborative filtering in terms of recommending travel destinations that are of interest to the user but not well known to many people. To this end, we predicted a user's preference and ranked travel destinations having only a few reviews, i.e., those regarded as little known destinations, both by the proposed method and by the collaborative filtering method, and then compared their rankings with the true ranking by means of the Kendall rank correlation coefficient [7]. 3.1 Dataset We used 10 travel destinations randomly sampled from the Japanese travel information website "Biglobe travel"1. For each destination, we 1) obtained 30 reviews (word-of-mouth information) by anonymous users on the same website and aggregated them into one document per destination, 2) extracted nouns, adjectives, and adjective verbs from them as keywords, and 3) computed their tf-idf scores to construct the feature vector of the destination. Then, we asked 8 users to rank the 10 travel destinations on a scale of 1 to 5 according to their preference, where 5 means the destination is the most interesting for the user. We then artificially generated a set of well known travel destinations and a set of little known ones by dividing the 10 destinations into two equal halves, A and B, and randomly masking some preference ratings of the destinations in them. We regarded group A as a set of well known destinations and randomly masked a few of the preference ratings of destinations in A, while we randomly masked more ratings for group B because we regarded it as a set of little known ones.
1 http://travel.biglobe.ne.jp
3.2 Method of User Study We randomly chose a user and four destinations belonging to group B, and masked the user's preference ratings for them. Then, we applied the proposed method to predict the user's preference for the chosen destinations and ranked them as follows: 1. Apply collaborative filtering to group A to predict the user's ratings for destinations whose true rating was masked, and determine the best destination y in A (the first phase); 2. Rank the four chosen destinations in group B according to the cosine similarity δ between each of them and y, so that the destination with a higher similarity is ranked higher (the second phase). We refer to the ranking resulting from this procedure as R_CF+CBF. On the other hand, we applied collaborative filtering to the whole dataset, i.e., A ∪ B, and ranked the four chosen destinations in B according to the predicted preference ratings. Note that this collaborative filtering method computes the similarities between all possible pairs of users using all available ratings for the 10 destinations according to Eq. (1) and predicts the ratings of the four chosen destinations in B according to Eq. (2), while the proposed method uses only ratings of the destinations belonging to group A in its first phase. We refer to the ranking resulting from this collaborative filtering method as R_CF. To compare the two rankings mentioned above, we used the Kendall rank correlation coefficient. Let R(y) be the rank of destination y in ranking R. Given two rankings R_1 and R_2 of m destinations y_1, ..., y_m, the Kendall rank correlation coefficient is defined as follows:
\[
\Gamma(R_1, R_2) = \frac{2p}{m(m-1)/2} - 1 \qquad (8)
\]

where p is the number of pairs of destinations y_i and y_j whose ranks agree, i.e., both R_1(y_i) > R_1(y_j) and R_2(y_i) > R_2(y_j), or both R_1(y_i) < R_1(y_j) and R_2(y_i) < R_2(y_j). Thus, Γ = 1 if the two rankings completely agree, while Γ = −1 if one ranking is the reverse of the other; Γ = 0 means the two rankings are independent.
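For reference, the following sketch computes Γ as defined in Eq. (8) for two rankings given as rank lists; it is a straightforward reading of the formula rather than the authors' code.

```python
from itertools import combinations

def kendall_gamma(rank1, rank2):
    """Kendall rank correlation (Eq. 8); rank1[i] and rank2[i] are the ranks of item i."""
    m = len(rank1)
    pairs = list(combinations(range(m), 2))
    # p = number of concordant pairs (ranks ordered the same way in both rankings)
    p = sum(1 for i, j in pairs
            if (rank1[i] - rank1[j]) * (rank2[i] - rank2[j]) > 0)
    return 2 * p / (m * (m - 1) / 2) - 1

print(kendall_gamma([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0: identical rankings
print(kendall_gamma([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0: reversed rankings
```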
We conducted this experiment several times, varying the target user, the four test destinations, and the partition of the 10 destinations; computed Γ(R_CF+CBF, R_T) and Γ(R_CF, R_T), where R_T is the true ranking of the four test destinations; and compared their averages and standard deviations. 3.3 Results We evaluated the proposed method under two different situations: in one, a small number of ratings of destinations in group B were masked, while in the other a large number were masked. Figure 2 shows the results in each case, giving the average and standard deviation of Γ. From Fig. 2a, we can observe that the result obtained using only the collaborative filtering method (Γ(R_CF, R_T)) is better than that obtained by the proposed method (Γ(R_CF+CBF, R_T)).
Fig. 2. Comparison of the average and standard deviation of the resulting Kendall rank correlation coefficients Γ(R_CF, R_T) and Γ(R_CF+CBF, R_T): (a) when the number of masked ratings in group B is small; (b) when the number of masked ratings in group B is large
The average and standard deviation of Γ(R_CF, R_T) were 0.67 and 0.21, respectively, versus 0.33 and 0.37 for Γ(R_CF+CBF, R_T). These results show that it is sufficient to use only the collaborative filtering method if there are only a few destinations whose preference ratings are undefined, which is an ideal situation for collaborative filtering. The reason why the average value of Γ(R_CF+CBF, R_T) is worse than that of Γ(R_CF, R_T) can be attributed to the difference in the number of destinations used for the collaborative filtering: when computing R_CF, the ratings of all 10 destinations are taken into account, whereas R_CF+CBF is determined based on only half of them. On the other hand, Fig. 2b shows that the proposed method (Γ(R_CF+CBF, R_T)) outperforms the collaborative filtering method (Γ(R_CF, R_T)). The average and standard deviation of Γ(R_CF+CBF, R_T) were 0.28 and 0.25, respectively, versus 0.22 and 0.50 for Γ(R_CF, R_T). In this case, we masked more preference ratings of destinations in group B than in the case of Fig. 2a. In such a situation, it is easy to see that the collaborative filtering method does not work well, because the number of ratings available for predicting the target user's preference for destinations in group B is considerably restricted, which leads to an inaccurate prediction. In contrast, the content-based filtering method used in the proposed framework does not rely on the preference ratings of the other destinations. Indeed, the performance of the proposed method depends on the performance of the collaborative filtering for group A in the first phase. Note that since group A simulates a set of destinations well known to the public in this experiment, the number of available ratings in A is relatively larger than in group B. Thus, it is natural that the collaborative filtering for group A achieves a certain degree of accuracy.
4 Conclusion In this paper, we proposed a method to recommend to a tourist a little known but good travel destination by means of two recommendation techniques, i.e., collaborative and content-based filtering. Applying the collaborative filtering method to the set of destinations rated by many users allows us to complement the target user's preference with similar users' preferences and to choose a well known destination that would be of interest to the user. With the destination chosen by the collaborative filtering method as a clue, the content-based filtering method can make a final recommendation by finding a destination that is little known but of interest to the user. Through the user study, we confirmed that the proposed method works well and outperforms the use of collaborative filtering alone when recommending an appropriate destination from among those that are little known, i.e., rated by only a few users. The scale of the user study reported in this paper is limited, so we have to evaluate our method through a larger-scale user study; this is part of our future work.
References
1. Ricci, F., Werthner, H.: Case base querying for travel planning recommendation. Information Technology & Tourism 4, 215–226 (2002)
2. Ricci, F.: Travel recommender systems. IEEE Intelligent Systems, 55–57 (November/December 2002)
3. Delgado, J., Davidson, R.: Knowledge bases and user profiling in travel and hospitality recommender systems. In: Proceedings of the IFITT's Global Forum for Travel and Tourism Technology and eBusiness, focusing on Multi-Channel-Strategies, ENTER 2002 (2002)
4. Berka, T., Plößnig, M.: Designing recommender systems for tourism. In: Proceedings of the 11th International Conference on Information Technology in Travel (2004)
5. Hinze, A., Junmanee, S.: Advanced recommendation models for mobile tourist information. In: Proceedings of OTM Confederated International Conferences, CoopIS, DOA, GADA, and ODBASE 2006, pp. 643–660 (2006)
6. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: An open architecture for collaborative filtering of netnews. In: Proceedings of the ACM 1994 Conference on Computer Supported Cooperative Work, pp. 175–186 (1994)
7. Abdi, H.: Kendall rank correlation. Encyclopedia of Measurement and Statistics (2006)
The Influence of Ubiquity on Screen-Based Interfaces
Sheila Petty1 and Luigi Benedicenti2
1 Faculty of Fine Arts, University of Regina
2 Faculty of Engineering and Applied Science, University of Regina
Regina SK S4S 0A2
{sheila.petty,luigi.benedicenti}@uregina.ca
Abstract. This paper brings together the disciplines of media studies and software systems engineering and focuses on the challenge of finding methodologies to measure and test certain effects of ubiquitous computing. We draw on two examples: an interactive screen-based art installation and an interface for ubiquitous videogaming. We find that the second example demonstrates that the identification of a common framework for ubiquity enables one to adopt a set of subjective measurements that can then be aggregated into a qualitative appraisal of some aspects of screen ubiquity. Ultimately, however, we postulate that we need to merge the measurement system built on the engineering development principles illustrated in our second example with a humanistic and artistic approach, albeit analytical.
1 Introduction This paper is an interdisciplinary collaboration by two researchers in the New Media Studio Laboratory at the University of Regina. Our paper builds on our work on screen-based interfaces in our respective disciplines of media studies and software systems engineering. We start with the general premise that screens shape our world and identities in such ubiquitous ways that their very presence and influence often go unproven, or at the very least, unchallenged. According to Kate Mondloch, “From movie screens to television sets, from video walls to PDAs, screens literally and figuratively stand between us, separating bodies and filtering communication between subjects….present-day viewers are, quite literally, “screen subjects” [1]. She further contends that the way in which we view or consume artworks made with screen interfaces has been underexplored as a system or method [1]. The challenge of creating coherent frameworks or methodologies to describe how screen media create meaning has occupied a significant place in debates among new media scholars, game and interface designers. Until very recently, primacy has been placed on what happens behind the screen with a focus on the technology and software used by computer programmers and designers. It may be time to redress the balance by bringing focus to bear on the screen itself and examine how images/sensations evoked on the computer screen create meaning with the user. This has led us to theorize the following: can we understand and measure the cultural and aesthetic experiences of screen users? Is it possible to develop a methodology to map the nexus between technology, aesthetics and impact/experience? A. An et al. (Eds.): AMT 2010, LNCS 6335, pp. 191–199, 2010. © Springer-Verlag Berlin Heidelberg 2010
As early as the 1980s, Chris Crawford, in The Art of Computer Game Design, advocated that “real art through computer games is achievable, but it will never be achieved so long as we have no path to understanding. We need to establish our principles of aesthetics, a framework for criticism, and a model for development” [2]. In his essay on whether computer games will ever be a legitimate art form, Ernest W. Adams disagrees with the need for a model of development as he feels art should be intuitively produced, but he agrees with the notion of the necessity for a methodology of analysis [3]. Other theoretical positions have evolved focusing on either the technological construction of new media or their social impact. For example, in the quest to quantify effective human interface design, Brenda Laurel turns to theatre and Aristotle’s Poetics by creating categories of action, character, thought, language, melody (sound) and enactment [4]. Pamela Jennings, in turn, has critiqued turning to a past rooted in linear narrative by demanding that the “quest for other concepts of time and narrative structure…take us outside of the neatly packaged Aristotelian methods of linear information processing. The mono-orgasmic narrative structure must unfold into an endless time frame where multiple climaxes can occur” [5]. Cubitt takes a similar view, arguing, “the possibilities for a contrapuntal organisation of image, sound and text [should be] explored, in pursuit of a mode of consciousness which is not anchored in the old hierarchies” [6]. Lunenfeld takes a much more radical stance by suggesting “once we distinguish a technoculture from its future/present from that which preceded it, we need to move beyond the usual tools of contemporary critical theory.” However, his assertion of the need for a “hyperaesthetic that encourages a hybrid temporality, a real-time approach that cycles through the past, present and future to think with and through the technocultures” [7] offers its own set of problematics: computer-based forms are neither a-historical, nor do they represent a leap in technology so distinct that they are unlinked to preceding forms. To a certain degree, each of these scholars touches directly or indirectly on the notion that there is something unique about computer-based forms that defies inscription in previous modes of analysis, and all seem to be grasping for a common language of description for the pervasive nature of ubiquitous information processing “in the human environment” [8]. Challenges and Prospects: The primary challenge, therefore, for our research team is to find ways to measure and test certain effects of ubiquitous computing, sometimes referred to as “everyware computing,” “invisible but everywhere computing” or “pervasive computing” [8]. The point of view chosen for this study draws on visual arts literature as it relates to the entirety of this paper; however, the works cited in the following examples frame this investigation also from the software engineering context. More specifically, we are concerned with the question of what tools are necessary to measure the cultural and aesthetic aspects of a “screen text”. What criteria should prevail? Can they be flexible enough to allow for “culture”? Whose culture? These questions cannot be answered thoroughly within this contribution, but it is our intention to provide a multifaceted analysis. To plumb the depths of these issues, we have chosen two screen-based objects/texts/interfaces for analysis.
2 Example 1: An Interactive Screen-Based Art Installation Media artist Santichart Kusakulsomsak’s 2006 interactive screen-based installation, Whispering Pond, explores notions of culture, interactivity and virtuality as agents for communication in a space constructed by digital technology. As creator/designer/author he infused the work with encoded meanings tied to his experience of being a Thai national living in a Canadian context. According to Kusakulsomsak, these were “often almost subliminal, under the surface of the work, and frequently reflect the Buddhist aspect of my identity in content or aesthetic design” [9]. The environment, or “gloop,” of Whispering Pond was constructed from three main components including a darkened room where the project was installed, a projection system on a platform located at the center of the room, and an input unit. The projected images suggest a skyline in the background and bunches of weeds moving horizontally in the foreground of a pond, almost concealing the pond to afford interactors a sense of isolation and privacy so that they become part of the pond’s hidden world. The third component is located right next to the side of the projection platform and is separated by a semi-transparent floor-to-ceiling screen. This unit is on a knee-level height stand with a laptop computer placed on top. The laptop’s screen is blank with a button for interactors to type in and submit their thoughts. As the user hits the submit button, the software randomly selects a series of words from the user’s input/story and displays them on the horizontal plane of the projection platform so that they appear to be floating in a circle on the pond. They remain for a few days before fading away so that they interact with words of other users. According to Kusakulsomsak, “as the user’s private words join other “ripples” of thought in the pool, they create another overlap between private and public space. As users observe their words and those of others flowing through the pond, they sense a shared social space as they seek to read meaning into the overlapping private words” [9]. Kusakulsomsak has revealed that some users found the undirected nature of the project’s physical arrangement somewhat confusing, leaving them unclear in terms of what to do. This could be a result of the placement of the laptop used for user input behind a screen, leading some to believe it was used for running the digital components of the installation (2007). However, as one of his main goals was to allow interactors to explore and experience the installation without direction, a certain amount of confusion among some interactors was unavoidable. The installation is meant to be experienced as an activated space because it presents “temporal and spatialized encounters between viewing subjects and technological objects, between bodies and screens. A potentially new mode of screen-reliant spectatorship emerges in the process” as the interactors’ experiences become an integral part of the work itself [1]. The challenge for analysis, however, is that users would not necessarily know the artist’s goal or intent. This type of screen-based interface is primarily artist-driven and intuitive, and results and successes are anecdotal and not immediately measurable, but dependent upon viewers’ points of view or worldviews. Can we move, therefore, beyond the seemingly ephemeral and immeasurable to a more empirical study of ubiquity, including how ubiquity affects screen-based interfaces?
To test this question in a preliminary fashion, we have chosen to focus on
another screen-based interface example that might lead to more formalized methods of testing or analysis.
3 Example 2: An Interface for Ubiquitous Videogaming Much current electronic entertainment is traditionally associated with specific locations in which it is consumed, such as homes or college dorm rooms. If exposure to this activity is limited, it is comparable to other sedentary activities like watching television, and in moderation it is accepted by most healthy living guidelines if compensated by other types of physical activity (see for example [11]). However, as electronic games become increasingly more pervasive, a ubiquitous gaming model incentivizing physical activity should be sought to ensure that location dependence does not affect a person’s ability to function in society [12, 13, 14]. An extreme example of this lack of functionality, one that appears to be increasingly more common in electronic entertainment, is described in [15]. Given the ubiquity premise, does the location of the screen interface need to be fixed, and is there a way to engage in location-based multi-user games that allow, or sometimes even require, users to move in the physical world in order to achieve part of the game’s objectives? To answer these questions, we have developed a framework for ubiquitous Massively Multiplayer Online Role-Playing Games (MMORPGs) and used it to develop an example of such a game. The framework allows for ubiquity and location tracking of users, and supports events like the interruption of the online connection to allow the continuation of the game with little or no perception of such connection loss. This framework is depicted in Figure 1, and its design is described more extensively in [16] and [17]. Notably, the framework can also be used in different types of games (e.g., serious games), as described in [16]. The architecture makes extensive use of agents, a platform we chose for the simplicity of its multitasking and communication model. Agents in our chosen platform are free to move among computers and are able to send and receive messages in real time. This is possible in part because they are written in Java and thus intrinsically portable. More information on agents can be found in [18]. As is common for MMORPGs, the framework is based on a client-server model. However, the client model allows for a partial caching of the game world and some degree of artificial intelligence in learning other players’ moves to supply a set of believable non-player characters when the client is experiencing a connection loss with the server. This mechanism is unique to this platform. The screen still acts as the primary interface between the player and the game world (Figure 2). However, the interface is complemented by several attributes that allow the screen to act as a window into the game world that has a tangible physicality and thus can be manipulated, rather than being placed at a distance from the player. One of these attributes is the connection with a GPS sensor, which allows the client to determine its location in the physical world, and thus synchronize the player’s location in the game world with the client’s physical location. GPS is a satellite-based global positioning system generally available in most countries through the use of low-cost receivers.
Fig. 1. Ubiquitous MMORPG framework
In most cases (cell phones, tablet PCs) such a sensor is integrated into the hardware. In the game we created, we designed some of the objectives around locating a physical object that was mapped in the virtual world. Thus, the player could only complete the challenge by moving to a specific location, specifically on campus at the University of Regina. Another important attribute is the use of a microphone to listen to specific ambient sounds, which trigger some events in the game. This is a more abstract sort of correspondence between the real world and the game world, but it is just as effective in motivating the player to move to a specific location and unlock additional bonus capabilities in the game. An onboard camera, if present, can also be used to match specific target objects in the real world to achievements in the game world. In our game, we used this capability to bring game objects into the real world, and use the screen as an augmented reality screen. The player was asked to capture the animals that had escaped the game world and were hiding in the real world. This was accomplished by superimposing and anchoring game icons on the video stream from the camera, giving the impression that the game objects were in fact parts of the stream itself. Finally, we used accelerometers in the game device to afford us a finer level of control in the game world. We did not use this feature extensively, but it is now the subject of considerable work, especially in new portable devices (e.g., iPhone, Android phones, and others).
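As an illustration of the location synchronization described above, a client might map GPS readings to game-world coordinates roughly as in the Python sketch below; the equirectangular approximation and the anchor and scale constants are assumptions for illustration, not details of the framework, which was implemented with Java agents.

```python
import math

# Hypothetical anchor: a reference GPS fix tied to a known point in the game world.
ANCHOR_LAT, ANCHOR_LON = 50.4150, -104.5900   # placeholder campus coordinates
METERS_PER_GAME_UNIT = 2.0
EARTH_RADIUS_M = 6371000.0

def gps_to_game(lat, lon):
    """Equirectangular projection of a GPS fix into 2-D game-world coordinates."""
    d_lat = math.radians(lat - ANCHOR_LAT)
    d_lon = math.radians(lon - ANCHOR_LON)
    north = EARTH_RADIUS_M * d_lat
    east = EARTH_RADIUS_M * d_lon * math.cos(math.radians(ANCHOR_LAT))
    return east / METERS_PER_GAME_UNIT, north / METERS_PER_GAME_UNIT

def near_objective(lat, lon, objective_xy, radius_units=10.0):
    """True when the player's physical position reaches a mapped game objective."""
    x, y = gps_to_game(lat, lon)
    return math.hypot(x - objective_xy[0], y - objective_xy[1]) <= radius_units
```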
Fig. 2. Screen interface to the game world
Fig. 3. Server block diagram
Fig. 4. Client block diagram
The high level design diagrams of the server and the client are shown in Figures 3 and 4, respectively. A detailed description of each component of the system can be found in [16]. Each block in the system is an agent, compliant with a standard messaging protocol fully detailed in [16] and [18]. For the purpose of this example, it is perhaps sufficient to say that the components are aware of each other, can freely migrate through the network, and can interact regardless of their physical location by means of a persistent communication channel based on a set of internal unique agent identifiers that are independent of the machine from which the agents originated. It is also worth mentioning that NPC stands for non-player character, a standard acronym in multi-user role-playing games [16]. We based the game on Thai mythology, which would constitute a cultural enticement for North American players. To ensure a higher level of interest, we prepared some promotional material. We then approached the student population at the University of Regina to determine whether the adoption of a mobile screen would still be acceptable. Extensive details, including a list of all questions and tasks performed by the students, are given in [16]. A summary of the conclusions is as follows: players of this game ranked it at a “good” level (76.85%) and were interested in continuing the experience. Most players did not perceive the transition from online to offline, which reinforces the game’s believability even when the other players need to be simulated. The limited amount of interaction helped this score considerably, as online chat was not possible by any other means than typing. The more interesting result, however, is
that all players liked the game and its various means of interaction and did not perceive a difference from a regular MMORPG, thus showing, albeit with a very limited empirical sample, that the location of the screen interface does not need to be fixed, and that the complement of additional sensory inputs we used to link the screen to physical locations is perceived as a normal occurrence and does not impede the enjoyment of the game. Compared to the first example, this second example shows that the identification of a common framework for ubiquity enables one to adopt a set of subjective measurements that can then be aggregated into a qualitative appraisal of some aspects of screen ubiquity.
4 Discussion and Future Work The comparison of the examples presented in the previous section yields a study in contrast: wherever there is an artistic endeavor, subjectivity is extremely important as the fruition of the experience becomes intrinsically individual and thus subjective. This poses limits to the range of tools and evaluation that can be applied to original work as depicted in our first example. The creation of a large series of artifacts for the sole purpose of evaluation is not feasible. On the other hand, many artists are interested in feedback and welcome the opportunity to discover how their work influenced the human environment exposed to it, and in turn attracted attention to the work (case in point: viral videos). Wherever more rigorous studies can be attempted, the corresponding artistic content risks losing focus, and the less anecdotal an experience becomes, the more difficult it appears to become to characterize it in an objective way. In the relatively limited context of linking design choices to the enjoyment of a ubiquitous electronic entertainment experience, the subjective measurement system described in the second example appears to be reliable [16 and 17]. But even if this system is derived from the social sciences, it still fails to capture some of the essence of the experience while lacking the objectivity that characterizes a scientific study. This, therefore, reflects on the apparent inability to choose a set of tools derived exclusively from the Information Technology point of view or the Fine Arts point of view. We propose instead that the purely scientific approach to experimental measurement cannot account in full for the unknowns of human interaction and media fruition. The measurement system built on the engineering development principles illustrated in our second example can be useful to characterize part of the experience, but we need to merge it with a humanistic approach, albeit analytical, to include more elusive aspects of the human experience that can be evoked only through the artistic endeavor, as shown in our first example. Perhaps we need to adopt a framework that encompasses a fuller set of attributes, as postulated by Daniel Pink in [19]. In conclusion, therefore, our attempt to investigate the research questions posed at the beginning of this paper has confirmed a depth that this paper can only begin to probe. Our future work will concentrate on this possibility and we hope that it will contribute to significant advances in software systems and computer engineering as well as the humanities and fine arts.
References
1. Mondloch, K.: Screens: Viewing Media Installation Art. University of Minnesota Press, Minneapolis (2010)
2. Crawford, C.: The Art of Computer Game Design. McGraw-Hill/Osborne Media, Berkeley (1984)
3. Adams, E.W.: Will Computer Games Ever Be a Legitimate Art Form? In: Mitchell, G., Clarke, A. (eds.) Videogames and Art. Intellect Books, Bristol (2007)
4. Laurel, B.: Computers as Theatre. Addison-Wesley, New York (1991)
5. Jennings, P.: Narrative Structures for New Media: Towards a New Definition. Leonardo 29(5), 345–350 (1996)
6. Cubitt, S.: The Failure and Success of Multimedia. Paper presented at the Consciousness Reframed II Conference at the University College of Wales, Newport (August 20, 1998)
7. Lunenfeld, P.: Snap to Grid: A User’s Guide to Digital Arts, Media, and Cultures. The MIT Press, Cambridge (2000)
8. Greenfield, A.: Everyware: The Dawning Age of Ubiquitous Computing. New Riders, Berkeley (2006)
9. Kusakulsomsak, S.: Whispering Pond. In: Master of Fine Arts, Faculty of Graduate Studies and Research. University of Regina, Regina (2007)
10. Hansmann, U., Merck, L., Nicklous, M.S., Stober, T.: Pervasive Computing: The Mobile World. Springer, New York (2003)
11. Raley, R.: Tactical Media. University of Minnesota Press, Minneapolis (2009)
12. Shields, R.: The Virtual. Routledge, London (2003)
13. Canada’s Physical Activity Guide for Healthy Living. Public Health Agency of Canada (2010), http://www.phac-aspc.gc.ca/hp-ps/hl-mvs/pag-gap/pdf/guideeng.pdf (last accessed April 5, 2010)
14. Salguero, R.A.T., Morán, R.M.B.: Measuring problem video game playing in adolescents. Addiction 97, 1601–1606 (2002)
15. Griffiths, M.D., Davies, M.N.O., Chappell, D.: Online computer gaming: a comparison of adolescent and adult gamers. Journal of Adolescence 27, 87–96 (2004)
16. Setzer, V.W., Duckett, G.E.: The Risk to Children Using Electronic Games (2006)
17. S Korean dies after games session (August 10, 2005), http://news.bbc.co.uk/1/hi/technology/4137782.stm (Accessed February 22, 2006)
18. Feungchan, W.: An Agent-Based Novel Interactive Framework for Ubiquitous Electronic Entertainment. PhD Thesis, University of Regina (2009)
19. Benedicenti, L., Feungchan, W.: Designing fun games: an empirical study. Presented at GDC Canada, Vancouver (2009)
20. Martens, R., Hu, W., Liu, A., Mahovsky, J., Saenchai, K., Schauenberg, T., Zhou, M., Paranjape, R., Benedicenti, L.: TEEMA TRLabs Execution Environment for Mobile Agents (December 12, 2001), http://agent.reg.trlabs.ca/java.php (Accessed December 19, 2005)
21. Pink, D.: A Whole New Mind: Why Right-Brainers Will Rule the Future. Riverhead Trade (March 7, 2006) (Rep. Upd. edition)
Perception of Parameter Variations in Linear Fractal Images Daryl H. Hepting and Leila Latifi Computer Science Department, University of Regina, Canada [email protected],[email protected]
Abstract. Parametric images, defined by a small number of parameters, may help to democratize access to image creation because simple parameter manipulations can yield interesting variations. For example, many people appreciate the aesthetics of fractal images, but few are inclined to engage in the mathematics needed to create them. A perception-driven interface for fractal image creation could find a wide audience as people could use it as an outlet for their own creative expression. This paper discusses some first steps along that path, with a study and analysis of how participants perceived changes between smoothly varying images. Further steps towards a perception-driven interface are then laid out.
1 Introduction
The study of human visual perception has been addressed by many researchers who have, for example, studied humans’ reactions when they see different images. When these images are computer-generated fractals, it is possible to gain additional information because the parameters used to generate each image are known. Therefore, they seem to be ideally suited as stimuli for a perceptual study. Linear fractal images are generated by iteratively applying a set of linear transformations (each comprising rotations, translations and scalings) to an initial point. These images have the very desirable property of “database amplification” [1], meaning that the images can be encoded by very simple and compact equations. Furthermore, while images close together in parameter space are similar, over the range of parameters there can be stunning differences between generated images. Two samples of images used in the perceptual study are shown in Figure 1. Although fractals have been studied in many ways, this study of how they are perceived, and the connection between that perception and the input parameters used to generate the images, is new. Can an individual’s exploration of this image parameter space (here, a two-dimensional space defined by 2 rotation angle parameters, θ1 and θ2) be augmented with support to more easily find and understand surprising images? In his book, Digital Harmony, Whitney [2] talked about exploring for months within the parameter space he chose for “Arabesque” and not finishing. Can technology help the artist in this case? A. An et al. (Eds.): AMT 2010, LNCS 6335, pp. 200–211, 2010. © Springer-Verlag Berlin Heidelberg 2010
Fig. 1. Two examples of the linear fractals used in the study. On the left, the values for the rotation angle parameters, θ1 and θ2 , are 0.0 and 18.0 degrees. On the right, these values are 28.0 and 31.0 degrees.
If we can understand how people perceive these images in relation to one another, we have the possibility of creating a much more intuitive interface for image creation [3]. In turn, this can empower many more people to express their creativity, which could lead to more user-generated content on video sharing sites like YouTube (http://www.youtube.com). The rest of this paper is organized as follows. Section 2 provides a review of relevant literature. Section 3 introduces the approach that was developed to generate a manageable stimulus set. Section 4 presents the experiment’s results and the method used to evaluate and analyze them, to develop a baseline for the perceptual study. Section 5 presents a discussion of the results along with directions for future work.
2 Background
Fractals are simple and recursive geometric objects that can be divided into many parts, and in which each part is a smaller, possibly rotated, copy of the whole. They can be magnified an unlimited number of times and their structure will be the same in every magnification. Benoit Mandelbrot [4] coined the term fractal – “to break” or “to fragment” – which comes from the Latin verb “frangere”, although he was not the first to study them. Julia sets [4], defined by a complex quadratic equation, are some of the most famous fractals. The Julia set corresponding to c = −1 is shown in Figure 2. Considering the real and imaginary parts of the complex number, two parameters may be manipulated. Linear fractals (see Figure 1) are conceptually similar to quadratic fractals, but permit a wider range of parameter manipulations. They are defined by a set of contractive linear transformations. Each of these linear transformations can be described in terms of rotation, scaling, and translation. Scaling and Translation each have two parameters, one for x and one for y, while Rotation just
Fig. 2. An example of a Julia set, for c = −1
has one parameter, θ. In total, five parameters may be manipulated for each transformation defining a linear fractal. Intricate shapes arise when two or more transformations are used. To better understand how people perceive fractal images, we examine both computational and perceptual perspectives, to see how they can inform each other.
2.1 Computed Metrics
In order to provide a baseline for human judgements, we computed various metrics from the images. Dimension. Fractals are generally quantified by a non-integer dimension, which indicates how completely the fractal fills the space. The examples from Figure 1 are not lines but neither do they fill the plane. To approximate fractal dimension, we use a box-counting procedure [5]. As the size of box goes to 0, we count how many boxes are required to cover the fractal, and determine the dimension according to the following equation (where s is the box size): D = lim
s→0
log(N (s)) . log(1/s)
(1)
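In practice, a box-counting estimate of D along the lines of Eq. (1) can be sketched as follows; representing the fractal as a set of (x, y) points and fitting a regression line over several box sizes are implementation assumptions, since the paper does not give these details.

```python
import math

def box_count(points, s):
    """Number of boxes of side s needed to cover the point set."""
    return len({(math.floor(x / s), math.floor(y / s)) for x, y in points})

def box_counting_dimension(points, sizes=(0.1, 0.05, 0.025, 0.0125)):
    """Estimate D as the slope of log N(s) against log(1/s) (Eq. 1)."""
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(points, s)) for s in sizes]
    n = len(sizes)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    return slope
```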
Lacunarity. Two fractal patterns could have the same dimension but look different. Whereas fractal dimension is a measure of how much space is filled by a fractal, lacunarity is a complementary measure of how it fills the space [6]. The gliding box algorithm [7] is a very popular method to calculate lacunarity. According to this algorithm a box of size r slides over an image, and the number of pixels inside that box is counted to determine the mass, M . n(M, r) is the
number of a gliding-boxes with radius r and mass M . Q(M, r) is the probability distribution obtained when n(M, r) is divided by the total number of boxes of size r. Lacunarity at scale r, denoted L(r), is defined as the mean-square deviation of the variation of mass distribution probability Q(M, r) divided by its square mean. M 2 Q(M, r) L(r) = M (2) 2 [ M M Q(M, r)] Mean Squared Error. Mean Squared Error (MSE) is a pixel-based method for comparing two images, one of the most important problems in image processing. The definition of the MSE metric is given in Equation 3 where the images to be compared, I and K, are of size m × n. This method is popular because it is simple, though it is not found to be a good match for human perception. It is used here to quantify the difference between successive images. M SE =
m−1 n−1 1 2 ||I(i, j) − K(i, j)|| mn i=0 j=0
(3)
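Both metrics are straightforward to compute. The following Python sketch is illustrative only (it is not the MATLAB routine referenced in Sect. 4) and assumes grayscale arrays for the MSE and a binary image with a fixed box size r for the lacunarity:

import numpy as np

def mse(I, K):
    # Equation (3): mean squared pixel difference between images I and K.
    return np.mean((I.astype(float) - K.astype(float)) ** 2)

def lacunarity(img, r):
    # Equation (2): gliding-box lacunarity of a binary image at scale r.
    h, w = img.shape
    masses = []
    for i in range(h - r + 1):
        for j in range(w - r + 1):
            masses.append(img[i:i + r, j:j + r].sum())   # mass M of one gliding box
    m = np.array(masses, dtype=float)
    # Ratio of the second moment of the mass distribution to its squared first moment.
    return np.mean(m ** 2) / np.mean(m) ** 2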
Principal Component Analysis. Principal component analysis (PCA) is a standard method in exploratory data analysis for reducing the dimensionality of data by identifying patterns, highlighting similarities and differences, and extracting the most meaningful variables [8]. To assess image similarity, we provide a training set of images from which eigenvectors are computed. Standard practice indicates that we need only choose enough vectors (say N) to account for at least 80% of the variance. We then project all images onto these chosen vectors to represent each image as a point within this N-dimensional space, from which it is easy to assess similarity based on proximity.
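A compact sketch of this projection step (illustrative only; the study itself used a MATLAB implementation, see Sect. 4) chooses enough eigenvectors of the training set to reach the variance target and then maps every image into the reduced space:

import numpy as np

def pca_projection(train, images, var_target=0.80):
    # train, images: arrays with one flattened image per row.
    mean = train.mean(axis=0)
    U, S, Vt = np.linalg.svd(train - mean, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    N = int(np.searchsorted(np.cumsum(explained), var_target)) + 1   # enough vectors for ~80% variance
    basis = Vt[:N]                                                   # chosen eigenvectors
    return (images - mean) @ basis.T                                 # N-dimensional coordinates

Proximity (e.g. Euclidean distance) between the resulting coordinate vectors then stands in for image similarity.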
2.2 Perceptual Evaluation
Both the quantity and quality of stimuli are important factors to consider when designing a study. To make a thorough comparison of a set of stimuli, the participant should make judgments about all possible pairs of stimuli, which is simply not feasible in all but simple situations [9]. For this reason, much effort has been placed on developing games as a way to encourage the crowdsourcing of comparisons [10]. Our study employed the method of limits, which involves the presentation of ascending and descending series of stimuli varying along a single dimension. The difference threshold is the smallest amount of change in intensity that yields a “just noticeable difference”. In this study, the difference threshold is calculated as the change in a stimulus that is required to elicit a response of “different” from the participant. By doing this in both an ascending and a descending series, two estimates of the difference threshold are obtained.
3 Experiment Design
In order to assess viewers’ perceptions of fractal imagery, it is first important to select the set of stimuli. The fractals used in this study consist of two transformations, each of which has 1 rotation angle parameter, 2 scale parameters, and 2 translation parameters. If 10 values were allowed for each parameter, 10^10 = 10,000,000,000 images would be generated. It is important to select a subset which will not overwhelm any participant, yet is representative of what is possible from the parameter space. Barnsley [11] proved that small changes in the input parameters result in small changes in the image, so it is worthwhile to choose samples from the whole parameter space. Therefore the scaling and translation parameters were fixed and the two rotation angles were sampled every 3.6 degrees in order to create a grid of 100 × 100 = 10,000 images.
3.1 Method
How should the study be structured in order to gain as much useful information as possible, and as accurately as possible? The 10,000 images in the stimulus set were divided into incrementally-varying sequences of 100 images. Each participant was shown eight sequences, in both the forward and backward directions (as dictated by the method of limits), to see whether he or she could detect changes related to the rotation angle parameter values.
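To make the sampling and the sequence construction concrete, a small illustrative sketch is given below. This is hypothetical code for the indexing only; it does not generate the images, and the wrap-around indexing used for the diagonals is an assumption of the sketch:

import numpy as np

angles = np.arange(100) * 3.6     # rotation angles sampled every 3.6 degrees
grid = [[(t1, t2) for t2 in angles] for t1 in angles]   # 100 x 100 = 10,000 parameter pairs

def column(j):
    return [grid[i][j] for i in range(100)]     # a column sequence of 100 images

def row(i):
    return [grid[i][j] for j in range(100)]     # a row sequence of 100 images

def diagonal(k):
    return [grid[i][(i + k) % 100] for i in range(100)]   # a diagonal sequence (wrap-around assumed)

# Each participant saw eight such sequences, each in forward and reverse order (16 in total).
forward = column(17)
backward = list(reversed(forward))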
Fig. 3. Screenshot of the web application implemented for the study. This interface shows the sequences of images to the participants, 2 images at a time. New images appear on the right and move to the left.
Fig. 4. Method of Limits Directions for Columns and Diagonals
A web-based program was implemented to run the study (see Figure 3 for a screenshot). The application randomly presented each participant with these sequences (4 from rows/columns and 4 from diagonals of the grid) in forward and reverse directions, for a total of 16 sequences. Each sequence consisted of 100 images, which were shown two images at a time. Each image, aside from the first and last, appeared on the screen in the right position first, then the left. The participants’ task was to click the “changed” button whenever they thought the two images on the screen were noticeably different. We ran 2 conditions of the study, each with 25 participants drawn from the undergraduate and graduate student populations at the University of Regina. In the first condition, image sequences were drawn from the columns and diagonals of the image grid (see Figure 4, left) – 4 columns and 4 diagonals were randomly selected for each participant. In the second condition, image sequences were drawn from the rows and opposite diagonals of the image grid (see Figure 4, right) – 4 rows and 4 opposite diagonals were randomly selected for each participant. Each sequence was viewed in both forward and backward directions, so every participant saw 16 sequences of 100 images, for a total of 1600 images. In each condition, each image was rated as part of a column or a row and as part of a diagonal or an opposite diagonal, in forward and backward directions. Each session took between 30 and 40 minutes. Each sequence was assigned to only one participant; since every image belongs to two sequences, almost all images were seen by two participants, the exception being images whose two sequences happened to be assigned to the same participant.
4 Analysis
The results of the computational analysis are shown, in part, in Figure 7. The values for each metric were rescaled so that the minimum was 0 and the maximum was 1. Four metrics were computed:
– Fractal Dimension was determined using the fractal dimension calculator written by Paul Bourke (http://local.wasp.uwa.edu.au/~pbourke/fractals/fracdim/).
– Lacunarity was computed with a MATLAB (http://www.mathworks.com/products/matlab/) routine (http://www.mathworks.com/matlabcentral/fileexchange/25261-lacunarity-of-a-binary-image).
Fig. 5. Clustering of the PCA values into 6 groups
– Mean Squared Error (MSE) values were computed for the images in the same way that participants viewed them: forward and backward. Therefore, for all but the first and last image in a sequence, there were two MSE values, which were then averaged. The first and last image in each sequence were not considered because participants compared them with an image from the next or previous sequence, which might be totally different.
– Principal Component Analysis (PCA) of the stimulus images used the eigenface approach of Turk and Pentland [12], as laid out in a MATLAB procedure by Dailey (http://www.cs.ait.ac.th/~mdailey/matlab/). The diagonals of the image matrix (θ1 = θ2, and θ2 = 356.4 − θ1), 199 images in total, were used for training. The 40 most important eigenvectors of the covariance matrix for the set of training images formed the basis of a 40-dimensional coordinate system for the images. Values for the image sequences (shown as part of Figure 7) were obtained from the magnitude of the vector defined between each image’s 40-dimensional coordinates and the origin of that space.

Fig. 6. Fractal images from the centers of each partition illustrated in Figure 5

Once the projection was done in MATLAB, the coordinates were partitioned using the Partitioning Around Medoids (pam) procedure in R (http://www.r-project.org/). Figure 5 shows the images clustered into 6 groups, based on their 40-dimensional coordinates; similar images have the same colour.
Ideally, participants would click at or near the same place when viewing a sequence in the forward and backward directions. This did not happen, and the clicks often did not even match between the two viewing directions. Therefore, we treated the clicks as distinguishing between piles of similar images. We had two ratings for each sequence, and we derived the pairwise distance for all images in the sequence: if two images were in the same pile, their distance was 0; if two images were in different piles, their distance was 1. Each pair of images could therefore have a distance of 0, 1, or 2 when the two ratings were combined. We used multidimensional scaling (the isoMDS routine in R) to process this data and to find the boundaries between regions of similar images, where clicks were likely to occur, by plotting the distance between neighbouring images. The higher the peak, the more likely that a click had occurred there (see Figure 7).
To enable the comparison, the values of the numerical metrics were computed over the same sequences that the participants saw, as shown in Figure 7. The four vertical lines in Figure 7 show the correspondence between the extrema of the MSE, PCA, fractal dimension, and lacunarity values, from left to right, and a participant’s clicks. Comparing the metrics’ values with the participants’ clicks shows that most of the clicks correspond to extrema in the image comparisons, which indicates that participants may be sensitive to them.
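The pile-based distance computation and the scaling step can be sketched as follows. This is hypothetical Python code (the study itself used R’s isoMDS); each rating assigns every image in a sequence to a pile, and the two ratings are combined by summing their 0/1 disagreement indicators:

import numpy as np
from sklearn.manifold import MDS

def pile_distances(rating_a, rating_b):
    # rating_a, rating_b: pile label per image, e.g. from the forward and backward viewings.
    a, b = np.array(rating_a), np.array(rating_b)
    da = (a[:, None] != a[None, :]).astype(int)    # 0 if same pile, 1 otherwise
    db = (b[:, None] != b[None, :]).astype(int)
    return da + db                                 # combined distance: 0, 1 or 2

def neighbour_profile(dist):
    # Non-metric MDS (in the spirit of isoMDS), then the distance between neighbouring
    # images; peaks in this profile mark likely click locations.
    coords = MDS(n_components=1, metric=False, dissimilarity="precomputed").fit_transform(dist)
    return np.linalg.norm(np.diff(coords, axis=0), axis=1)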
Fig. 7. A comparison of perceptual and computational data, 1st line: MSE values; 2nd line: PCA Euclidean values; 3rd line: PCA Magnitude values; 4th line: fractal dimension values; 5th line: fractal lacunarity values; 6th line: click zone locations, computed from the following 2 lines; 7th line: clicks in forward direction; 8th line: clicks in backward direction. Vertical lines placed at maxima for fractal dimension, PCA, MSE and lacunarity, from left to right.
Fig. 8. Clustering the participants into two groups
To assess the correspondence between participant judgments of similarity (clicks) and the computed metrics, we examined a five-image window on either side of each extremum value. If there were one or more clicks within that window, one click was identified as corresponding to that extremum. Clicks not in the neighbourhood of any extremum were deemed “non-corresponding.” Clicks close to the sequence minima (0) and maxima (1) were counted separately, and the remaining values were divided into four intervals: (0, 0.25], (0.25, 0.5], (0.5, 0.75], and (0.75, 1). With the intention of classifying the participants and understanding their behaviour, the participants’ corresponding-click values, along with their total clicks, were clustered using the Partitioning Around Medoids (pam) method in R. A plot of the clusters with k = 2 is presented in Figure 8. According to their average performance, one group seemed most attuned to the fractal dimension and was more accurate in matching its clicks to extrema. The other group seemed most attuned to the PCA values, was less accurate with its clicks, and clicked more frequently. This is a very promising start in understanding individual behaviours.
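The window-based matching of clicks to metric extrema can be sketched as follows; this is illustrative code, and treating “extrema” as the local maxima and minima of the rescaled metric along a sequence is an assumption of the sketch:

import numpy as np

def count_corresponding_clicks(metric, clicks, window=5):
    # metric: rescaled metric value per image position; clicks: clicked positions.
    v = np.asarray(metric, dtype=float)
    interior = np.arange(1, len(v) - 1)
    turning = (v[interior] - v[interior - 1]) * (v[interior + 1] - v[interior]) < 0
    extrema = interior[turning]                    # local maxima and minima
    corresponding = sum(1 for e in extrema
                        if any(abs(c - e) <= window for c in clicks))
    non_corresponding = sum(1 for c in clicks
                            if all(abs(c - e) > window for e in extrema))
    return corresponding, non_corresponding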
5 Conclusion and Future Work
This work has attempted to gain an understanding of how people perceive fractal imagery, and to use that understanding as a first step towards perceptually-based interaction with fractals. The study of linear fractals is apt because the images are straightforward to manipulate and the imagery is beautiful and thought-provoking. This paper has presented new methods for managing and presenting stimuli and for gathering and analyzing similarity ratings of incrementally-varying fractal images. The adaptation of the method of limits was successful, and it has opened many questions for further study. The study has indicated that fractal dimension and principal component analysis values may be sufficiently related to participants’ judgments of fractal similarity that those computed metrics could be used as approximations of perceived similarity. Some trends seemed to emerge when looking at participant performance on the task, and this seems to be a fruitful area for further investigation. The techniques used in this study may help to answer larger questions about the relation between image similarity metrics and perceptual judgements. The method of data collection could be improved. Many people did complete the task, and they were able to identify many of the extrema indicated by the computational metrics. However, the sessions were too long for some people, and so the quality of the data collected suffered. The design of the experiment emphasized breadth over depth: there was one rating of every image in each direction instead of many ratings of some images. It might be easier to focus on specific areas of the parameter space, which this study has helped to locate. Furthermore, the operation of the study could be changed. Instead of having the series pass by, a more effective approach might be to keep the last selected image on the display until a new one is clicked. The method of limits could also be applied using different criteria, such as PCA magnitude, fractal dimension, or lacunarity. Further analysis can be done to better understand the relationship between the manipulation of parameters and the features that elicit clicks. There are some indications that users applied different strategies to the task they were given, and it would be very interesting to study the implications of that possibility for future interface designs. It may be that some of the participants are better informants than others.
References

1. Smith, A.R.: Plants, fractals, and formal languages. ACM SIGGRAPH Computer Graphics 18, 1–10 (1984)
2. Whitney, J.: Digital harmony: On the complementarity of music and visual art. McGraw-Hill, New York (1980)
3. Hepting, D.H., Latifi, L., Oriet, C.: In search of a perceptual basis for interacting with parametric images. In: Proc. of the Seventh ACM Conference on Creativity and Cognition, pp. 377–378 (2009) 4. Mandelbrot, B.B.: The Fractal Geometry of Nature. W. H. Freeman, New York (1983) 5. Tel, T., Fulop, A., Vicsek, T.: Determination of fractal dimension for geometrical multifractals. Journal of Physics A 159, 155–166 (1989) 6. Plotnick, R., et al.: Lacunarity analysis: a general technique for the analysis of spatial patterns. Physical Review A 44, 5461–5468 (1996) 7. Allain, C., Cloitre, M.: Characterizing lacunarity of random and deterministic fractal sets. Physical Review A 44, 3552–3558 (1991) 8. Sirovich, L., Kirby, M.: Low dimensional procedure for the characterization of human faces. Journal of the Optical Society of America 4, 519–524 (1987) 9. Bonebright, T.L., et al.: Data collection and analysis techniques for evaluating the perceptual qualities of auditory stimuli. ACM Transactions on Applied Perception 2(4), 505–516 (2005) 10. Law, E., von Ahn, L.: Input-agreement: A new mechanism for collecting data using human computation games. In: ACM Conference on Human Factors in Computing Systems,CHI (2009) 11. Barnsley, M.F., Devaney, R., Mandelbrot, B.B., Peitgen, H.O., Saupe, D., Voss, R.F.: The science of fractal images (1988) 12. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991)
Music Information Retrieval with Temporal Features and Timbre Angelina A. Tzacheva and Keith J. Bell University of South Carolina Upstate, Department of Informatics 800 University Way, Spartanburg, SC 29303, USA {atzacheva,bellkj}@uscupstate.edu
Abstract. At a time when the quantity of music media surrounding us is rapidly increasing, and access to recordings as well as the amount of music files available on the Internet is constantly growing, the problem of building music recommendation systems is of great importance. In this work, we perform a study on the automatic classification of musical instruments. We use monophonic sounds, which have been classified successfully in the past, with the main focus on pitch. We propose new temporal features and incorporate timbre descriptors. The advantages of this approach are the preservation of temporal information and high classification accuracy.
1 Introduction

Music has accompanied man for ages in various situations. Today, we hear music media in advertisements, in films, at parties, at the philharmonic, etc. One of the most important functions of music is its effect on humans. Certain pieces of music have a relaxing effect, while others stimulate us to act, and some cause a change in or emphasize our mood. Music is not only a great number of sounds arranged by a composer, it is also the emotion contained within these sounds (Grekow and Ras, 2009). The steep rise in music downloading over CD sales has created a major shift in the music industry away from physical media formats and towards Web-based (online) products and services. Music is one of the most popular types of online information, and there are now hundreds of music streaming and download services operating on the World-Wide Web. Some of the music collections available are approaching the scale of ten million tracks, and this has posed a major challenge for searching, retrieving, and organizing music content. Research efforts in music information retrieval have involved experts from music perception, cognition, musicology, engineering, and computer science engaged in truly interdisciplinary activity that has resulted in many proposed algorithmic and methodological solutions to music search using content-based methods (Casey et al., 2008). This work contributes to solving the important problem of building music recommendation systems. Automatic recognition or classification of music sounds helps users find favorite music objects, or be recommended objects of their liking, within large online music repositories. We focus on musical instrument recognition, which is a challenging problem in the domain.
Melody matching based on pitch detection technology has drawn much attention, and many music information retrieval systems have been developed to fulfill this task. Numerous approaches to acoustic feature extraction have already been proposed. This has stimulated research on instrument classification and the development of new features for content-based automatic music information retrieval. The original audio signals are a large volume of unstructured sequential values, which are not suitable for traditional data mining algorithms, while higher-level representations based on acoustical features are sometimes not sufficient for instrument recognition. We propose new dynamic features, which preserve temporal information, for increased classification accuracy. The rest of the paper is organized as follows: section 2 reviews related work, section 3 discusses timbre, section 4 describes features, section 5 presents the proposed temporal features, section 6 shows the experiment results, and finally section 7 concludes.
2 Related Work

Martin and Kim (1998) applied the K-NN (k-nearest neighbor) algorithm in a hierarchical classification system with 31 features extracted from cochleagrams. With a database of 1023 sounds, they achieved 87% successful classification at the family level and 61% at the instrument level when no hierarchy was used. Using the hierarchical procedure increased the accuracy at the instrument level to 79%, but degraded the performance at the family level (79%). Without the hierarchical procedure, the performance figures were lower than the ones they obtained with a Bayesian classifier. The fact that the best accuracy figures are around 80%, and that Martin and Kim settled at similar figures, shows the limitations of the K-NN algorithm (provided that the feature selection has been optimized with genetic or other kinds of techniques). Therefore, more powerful techniques should be explored.

Bayes Decision Rules and Naive Bayes classifiers are simple probabilistic classifiers, in which the probabilities for the classes and the conditional probabilities for a given feature and a given class are estimated based on their frequencies over the training data. They are based on probability models that incorporate strong independence assumptions, which may or may not hold in reality; hence they are “naive”. The resulting rule is formed by counting the frequency of various data instances, and can then be used to classify each new instance. Brown (1999) applied this technique to 18 Mel-Cepstral coefficients using a K-means clustering algorithm and a set of Gaussian mixture models. Each model was used to estimate the probabilities that a coefficient belongs to a cluster. The probabilities of all coefficients were then multiplied together and used to perform a likelihood ratio test. The system classified 27 short sounds of oboe and 31 short sounds of saxophone with an accuracy rate of 85% for oboe and 92% for saxophone.

Neural networks process information with a large number of highly interconnected processing neurons working in parallel to solve a specific problem, and they learn by example. Cosi (1998) developed a timbre classification system based on auditory processing and Kohonen self-organizing neural networks. Data were preprocessed by peripheral transformations to extract perceptual features, then fed to the network to build the map, and finally compared in clusters with human subjects’ similarity judgments. In the system, nodes were used to represent clusters of the input spaces. The map was used to generalize similarity criteria even to vectors not utilized during the training phase. All 12 instruments in the test could be quite well distinguished by the map.

A Binary Tree is a data structure in which each node has one parent and not more than 2 children. It has been pervasively used in classification and pattern recognition research. Binary trees are constructed top-down with the most informative attributes as roots, to minimize entropy. Jensen and Arnspang (1999) proposed an adapted binary tree with real-valued attributes for instrument classification regardless of the pitch of the instrument in the sample.

Typically, a digital music recording, in the form of a binary file, contains a header and a body. The header stores file information such as length, number of channels, sampling rate, etc. Unless it is manually labeled, a digital audio recording has no description of timbre or other perceptual properties. Also, it is a highly nontrivial task to label those perceptual properties for every piece of music based on its data content. In the music information retrieval area, a lot of research has been conducted on melody matching based on pitch identification, which usually involves detecting the fundamental frequency. Most content-based Music Information Retrieval (MIR) systems are query-by-whistling/humming systems for melody retrieval. So far, few systems exist for timbre information retrieval in the literature or on the market, which indicates that it is a nontrivial and currently unsolved task (Jiang et al., 2009).
3 Timbre

In acoustics and phonetics, timbre is defined as the characteristic quality of a sound, independent of pitch and loudness, from which its source or manner of production can be inferred; it depends on the relative strengths of the sound’s component frequencies. In music, it is the characteristic quality of sound produced by a particular instrument or voice, i.e., the tone color. ANSI defines timbre as the attribute of auditory sensation in terms of which a listener can judge that two sounds are different, though having the same loudness and pitch. It distinguishes different musical instruments playing the same note with identical pitch and loudness. So it is the most important and relevant facet of music information. People discern timbre from speech and music in everyday life.

Musical instruments usually produce sound waves with frequencies that are integer (whole-number) multiples of each other. These frequencies are called harmonics, or harmonic partials. The lowest frequency is the fundamental frequency f0, which is closely related to pitch. The second and higher frequencies are called overtones. Along with the fundamental frequency, these harmonic partials distinguish the timbre, which is also called tone color. The human aural distinction between musical instruments is based on differences in timbre.

3.1 Challenges in Timbre Estimation

The body of a digital audio recording contains an enormous number of integers in a time-ordered sequence. For example, at a sampling rate of 44,100 Hz, a digital recording has 44,100 integers per second. This means that in a one-minute long digital recording, the total number of integers in the time-ordered sequence will be 2,646,000, which makes it a very large data item. The size of the data, in addition to the fact that it is not in a well-structured form with semantic meaning, makes this type of data unsuitable for most traditional data mining algorithms. Timbre is a rather subjective quality, and as such is not of much use for automatic sound timbre classification. To compensate, musical sounds must be very carefully parameterized to allow automatic timbre recognition.
4 Feature Descriptions and Instruments

Based on the latest research in the area, MPEG published a standard group of features for digital audio content data. They are either in the frequency domain or in the time domain. For the features in the frequency domain, an STFT (Short Time Fourier Transform) with a Hamming window is applied to the sample data, and a set of instantaneous values is generated from each frame. We use the following timbre-related features from MPEG-7:

Spectrum Centroid describes the center of gravity of a log-frequency power spectrum. It economically indicates the predominant frequency range. We use the Log Power Spectrum Centroid and the Harmonic Spectrum Centroid.

Spectrum Spread is the root mean square value of the deviation of the log-frequency power spectrum with respect to the center of gravity in a frame. Like the Spectrum Centroid, it is an economical way to describe the shape of the power spectrum. We use the Log Power Spectrum Spread and the Harmonic Spectrum Spread.

Harmonic Peaks is a sequence of local peaks of harmonics in each frame. We use the Top 5 Harmonic Peaks – Frequency and the Top 5 Harmonic Peaks – Amplitude.

In addition, we use the Fundamental Frequency as a feature in this study.
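As a simplified illustration of the frame-level computation (this is not the exact MPEG-7 definition, which operates on a log-frequency scale, and the function below is hypothetical), a power-spectrum centroid and spread for one frame can be obtained as follows:

import numpy as np

def frame_centroid_and_spread(frame, sample_rate):
    # One analysis frame -> Hamming window -> power spectrum -> centroid and spread.
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    centroid = np.sum(freqs * power) / np.sum(power)    # center of gravity of the spectrum
    spread = np.sqrt(np.sum(((freqs - centroid) ** 2) * power) / np.sum(power))
    return centroid, spread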
5 Design of New Temporal Features

Describing the whole sound produced by a given instrument by a single value of a parameter that changes in time – for example, by the average of the values taken at certain time points – may omit a large amount of relevant information encoded within the sound. For this reason, we design features which characterize the changes of sound properties over time.

5.1 Frame Pre-processing

The instrument sound recordings are divided into frames. We pre-process the frames in such a way that each frame overlaps the previous frame by 2/3, as shown in Figure 1. In other words, if frame1 is abc, then frame2 is bcd, frame3 is cde, and so on. This preserves the temporal information contained in the sequential frames.

5.2 New Temporal Features

After the frames have been pre-processed, we extract the timbre-related features described in section 4 for each frame. We build a database from this information, shown
Fig. 1. Overlapping frames
in Table 1. x1, x2, x3, ..., xn are the tuples (or objects, i.e., the overlapping frames). Attribute a is the first feature extracted from them (log power spectrum centroid). We have a total of 7 attributes, 2 of which are in vector form. Next, we calculate 6 new features based on the attribute a values for the first 3 frames, t1, t2, and t3. The new features are defined as follows:

d1 = t2 − t1
d2 = t3 − t2
d3 = t3 − t1
tg(α) = (t2 − t1)/1
tg(β) = (t3 − t2)/1
tg(γ) = (t3 − t1)/2

This process is performed by our Temporal Cross Tabulator. y1, y2, y3, ..., yn are the new objects created by the cross tabulation, which we store in a new database (Table 2). So, our first new object y1 in Table 2 is created from the first 3 objects x1, x2, x3 in Table 1. Our next new object y2 in Table 2 is created from x2, x3, x4 in Table 1. New object y3 in Table 2 is created from x3, x4, x5 in Table 1. Since classifiers do not distinguish the order of the frames, they are not aware that frame t1 is closer to frame t2 than it is to frame t3. With the new features α, β, and γ, we allow that distinction to be made: tg(α) = (t2 − t1)/1 takes into consideration that the distance between t2 and t1 is 1, while tg(γ) = (t3 − t1)/2 because the distance between t3 and t1 is 2.
Fig. 2. New Temporal Features
This temporal cross-tabulation increases the current number of attributes 6 times. In other words, for every attribute (or feature) in Table 1, we have d1, d2, d3, α, β, and γ in Table 2. Thus, the 15 current attributes (log power spectrum centroid, harmonic spectrum centroid, log power spectrum spread, harmonic spectrum spread, fundamental frequency, top 5 harmonic peak amplitudes – each peak as a separate attribute, and top 5 harmonic peak frequencies – each peak as a separate attribute) multiplied by 6 give 90. The complete Table 2 has 90 attributes, which comprise our new dataset.
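A minimal sketch of the framing and the temporal cross-tabulation is given below. This is hypothetical code, not the authors’ Temporal Cross Tabulator; the per-frame feature extraction of Section 4 is assumed to be available and to return the 15 attribute values for one frame:

import numpy as np

def overlapping_frames(signal, frame_len):
    # Each frame overlaps the previous one by 2/3, i.e. the hop is one third of the frame.
    hop = frame_len // 3
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def temporal_cross_tabulate(feature_rows):
    # feature_rows: one row of the 15 attribute values per frame (Table 1).
    # Every window of three consecutive frames (t1, t2, t3) yields, per attribute,
    # d1, d2, d3, tg(alpha), tg(beta) and tg(gamma), i.e. 6 new attributes (Table 2).
    X = np.asarray(feature_rows, dtype=float)
    rows = []
    for k in range(len(X) - 2):
        t1, t2, t3 = X[k], X[k + 1], X[k + 2]
        rows.append(np.concatenate([t2 - t1, t3 - t2, t3 - t1,   # d1, d2, d3
                                    (t2 - t1) / 1.0,             # tg(alpha)
                                    (t3 - t2) / 1.0,             # tg(beta)
                                    (t3 - t1) / 2.0]))           # tg(gamma)
    return np.array(rows)    # 15 attributes in, 90 attributes out per new object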
6 Experiment

We have chosen 6 instruments for our experiments: viola, cello, flute, english horn, piano, and clarinet. All recordings originate from the MUMS CDs (Opolko and Wapnick, 1987), which are used worldwide in similar tasks. We split each recording into overlapping frames and extract the new temporal features as described in section 5. That produces a dataset with 1225 tuples and 90 attributes. We import the dataset into the WEKA (Hall et al., 2009) data mining software for classification. We train two classifiers: a Bayesian Neural Network and a J45 Decision Tree. We test using the bootstrap method. The Bayesian Neural Network has an accuracy of 81.14% and the J45 Decision Tree has an
Fig. 3. Results Summary
Fig. 4. Results - Detailed Accuracy by Class
accuracy of 96.73%. The summary results of the classification are shown in Figure 3 and the detailed results in Figure 4.
7 Conclusions and Directions for the Future

We have produced a music information retrieval system which automatically classifies musical instruments. We use timbre-related features and propose new temporal features. The advantages of this approach are the preservation of temporal information and high classification accuracy. This work contributes to solving the important problem of building music recommendation systems. Automatic recognition or classification of music sounds helps users find favorite music objects within large online music repositories. It can also be applied to recommend musical media objects of the user’s liking. Directions for the future include the automatic detection of emotions (Grekow and Ras, 2009) contained in music files.
References

1. Brown, J.C.: Musical instrument identification using pattern recognition with cepstral coefficients as features. Journal of the Acoustical Society of America 105(3), 1933–1941 (1999)
2. Casey, M.A., Veltkamp, R., Goto, M., Leman, M., Rhodes, C., Slaney, M.: Content-Based Music Information Retrieval: Current Directions and Future Challenges. Proceedings of the IEEE 96(4), 668–696 (2008)
3. Cosi, P.: Auditory modeling and neural networks. In: Course on Speech Processing, Recognition, and Artificial Neural Networks. Springer, Heidelberg (1998)
4. Grekow, J., Ras, Z.W.: Detecting Emotion in Classical Music from MIDI Files. In: Rauch, J., Raś, Z.W., Berka, P., Elomaa, T. (eds.) Foundations of Intelligent Systems, ISMIS 2009. LNCS (LNAI), vol. 5722, pp. 261–270. Springer, Heidelberg (2009)
5. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA Data Mining Software: An Update. SIGKDD Explorations 11(1) (2009)
6. Jensen, K., Arnspang, J.: Binary decision tree classification of musical sounds. In: Proceedings of the International Computer Music Conference, Beijing, China (1999)
7. Jiang, W., Cohen, A., Ras, Z.W.: Polyphonic music information retrieval based on multi-label cascade classification system. In: Ras, Z.W., Ribarsky, W. (eds.) Advances in Information and Intelligent Systems. SCI, vol. 251, pp. 117–137. Springer, Heidelberg (2009)
8. Martin, K.D., Kim, Y.E.: Musical instrument identification: A pattern recognition approach. In: Proceedings of the Meeting of the Acoustical Society of America, Norfolk, VA (1998)
9. Opolko, F., Wapnick, J.: MUMS – McGill University Master Samples. CDs (1987)
Towards Microeconomic Resources Allocation in Overlay Networks Morteza Analoui and Mohammad Hossein Rezvani Department of Computer Engineering, Iran University of Science and Technology (IUST) 16846-13114, Hengam Street, Resalat Square, Narmak, Tehran, Iran {analoui,rezvani}@iust.ac.ir
Abstract. The inherent selfishness of end-users is the main challenge in designing mechanisms for overlay multicast networks. The goal is to design mechanisms that exploit the selfishness of the end-users in such a way that the aggregate utility of the network is still maximized. We have designed a competitive economic mechanism in which a number of independent services are provided to the end-users by a number of origin servers. Each offered service can be thought of as a commodity, and the origin servers and the users who relay the service to their downstream nodes can thus be thought of as the producers of the economy. Also, the end-users can be viewed as the consumers of the economy. The proposed mechanism regulates the price of each service in such a way that general equilibrium holds, so all allocations are Pareto optimal in the sense that the social welfare of the users is maximized.
1 Introduction

Overlay multicasting can approximate many of the design goals of IP multicasting while being easier to deploy [1, 2]. It has been a well-known design philosophy to use multi-rate transmission, where the receivers of the same multicast group can receive the services at different rates [3]. A crucial problem with overlay networks is that the users are inherently selfish. The reason lies in the fact that they belong to different administrative domains. In this context, the most important question is the following: “How should we exploit the inherent selfishness of the end-user nodes, so that the aggregate outcome of the activity of individual nodes behaving toward their own self-interests still leads to the network’s overall utility maximization?” We believe that this problem can be examined using microeconomic theory. Since each consumer in an economy is in fact a selfish utility maximizer, the behavior of each end-user node in the overlay network can be mapped to that of a consumer. From the microeconomic point of view, the service provided by each multicast server plays the role of a “commodity” in an economy. Hence, we can model the overlay network as an “overlay economy” in which several commodities (multicast services) are provided. The overlay economy has two types of producers for each commodity: one is the origin server that provides the service (commodity) for the first time, and the other is the relaying nodes, which relay the service to their downstream nodes. Also, by means of the “equilibrium” concept from microeconomic theory, we can tune the allocated rates of the overlay nodes in such a way that welfare is maximized in the overlay economy.

The remainder of this paper is organized as follows: Section 2 discusses related work. Section 3 introduces the formalism of the network model that we use throughout the paper. Section 4 proposes the economic mechanism for managing the interactions of the overlay network. Section 5 discusses the performance evaluation of the proposed mechanism in the overlay network environment and shows the experimental results. Finally, we conclude in Section 6.
2 Related Work

There already exists a significant body of research toward self-organization in overlay networks. Some of the projects which have been carried out in the area of overlay multicast include ZIGZAG [4], Narada [5], NICE [6], and SCRIBE [7]. To address the selfishness of the users in the overlay network, some works have applied game theory. In the game-theoretic methods, the overlay nodes are considered as game players with conflicting interests regarding the shared resources [8, 9, 10, 11]. The relation between the users' selfish behavior and network routing has been investigated in [8]. In Wu et al. [9], an auction-based game-theoretic approach is proposed in which the downstream peers submit their bids for the bandwidth at the upstream peers. They show that the outcome of this game is an optimal topology for the overlay network and that it minimizes the streaming costs. Also, the authors of [10] have proposed an auction-based model to improve the performance of BitTorrent. They have investigated the use of a proportional-share mechanism as a replacement for BitTorrent's current mechanism and have shown that it achieves improvements in terms of fairness and robustness without any significant modifications to the BitTorrent protocol. A few other works adjust the selfish behavior of the users through distributed pricing. Cui et al. [3] are the first to apply the price-based resource sharing method to the overlay network. The significant advantage of their work over previous works is that it relies entirely on the coordination of the users. This feature makes the protocol deployable on existing network infrastructures and accords fully with the primary design objective of overlay networks, which is to avoid any change to the existing infrastructure by leaving the necessary tasks to the users. Another price-based work in the area of overlay networks is [12]. They propose an intelligent model based on optimal control theory. They also devise mathematical models that mimic the selfishness of the overlay users, represented by some form of utility function. They show that, even when the selfish users all seek to maximize their own utilities, the aggregate network performance still becomes near-optimal.
3 The Model of the Overlay Network

We consider an overlay network consisting of V end hosts, denoted as V = {1, 2, ..., V}. Let us suppose that the overlay network offers N media services, denoted as N = {1, 2, ..., N}. So, there are N origin servers among the V hosts (N < V), each serving a distinct type of media service. We denote by S = {s_1, s_2, ..., s_N} the set containing the N servers. Suppose the network is shared by a set of N multicast groups. Each multicast group (multicast session) consists of a media server, a set of receivers, and the set of links which the multicast group uses. Let us suppose that the underlying physical network consists of L physical links, denoted as L = {1, 2, ..., L}. The capacity, that is, the bandwidth, of each physical link l ∈ L is denoted c_l. All the nodes, except the origin servers and leaf nodes, forward the multicast stream via unicast in a peer-to-peer fashion. Fig. 1 shows an overlay network consisting of two multicast groups. In this example, S = {s_1, s_2}, in which s_1 (node 0) serves one group and s_2 (node 3) serves the other; the solid lines indicate one group and the dashed lines indicate the second group. The physical network consists of eight links (L = 8) and two routers.
Fig. 1. Overlay network consisting of two multicast groups
Each multicast session n ∈ N consists of a set of unicast end-to-end flows, denoted as

F^n = { f^n_ij | ∃ i, j ∈ V : M^n_ij = 1 }    (1)

where M^n denotes the adjacency matrix of the multicast group n. Each flow f^n_ij of the multicast group n passes through a subset of the physical links, denoted as

L(f^n_ij) ⊆ L    (2)

For each link l, we have

F^n(l) = { f^n_ij ∈ F^n | l ∈ L(f^n_ij) }    (3)

where F^n(l) is the set of the flows belonging to the multicast group n and passing through the link l. Each flow f^n_ij ∈ F^n in the multicast group n has a rate x^n_ij. We denote the set of all downstream nodes of each overlay node i in the multicast group n by Chd(i, n). Also, the set Buy(i) specifies all the multicast groups in the overlay network from which the node i receives (buys) services. Similarly, Sell(i)
specifies all the multicast groups in the overlay network for which the node i provides (sells) the services.
4 Competitive Overlay Market System

This section presents the problems of the consumer and the producer in the economy which we refer to as the “competitive overlay economy.” The utility function of each user i can be expressed as a continuous, strongly increasing, and strictly quasi-concave function of the allocated bandwidth of each service, as follows:

u^eco_i(t) = Σ_{n ∈ Buy(i)} β_n ln(1 + x^n_pi(t) / B^n)    (4)

where Σ_{n ∈ N} β_n = 1. As mentioned in Section 3, x^n_pi(t) is the rate at which the consumer i receives the service n from its parent node p in time slot t, and B^n denotes the maximum allowed bandwidth of the service n. Each consumer i, on joining the overlay economy, is endowed with an endowment vector e_i = (e^1_i, ..., e^N_i). Each element e^k_i of this vector indicates the amount of the service k which is endowed to the user i. This endowment is actually the amount of credit that enables the consumer i to enter the economy and take part in future interactions. One can imagine that each node in the overlay economy is provided with a budget, but in the form of commodities, i.e., in the form of amounts of the services. We now let P ≡ (p_1, ..., p_N) be the vector of market prices. As a matter of notation, if every price in this vector is strictly larger than zero, we write P >> 0. The amount P·e_i can then be thought of as the initial “budget” of the consumer i. The utility maximization problem of each consumer i in the competitive overlay economy in each time slot t is formally stated as follows:
eco u i (t ) n { x pi (t ,.) | n ∈ Buy (i )}
(5)
⎛ ⎞ ⎜ p (t ) x n (t ,.) + ⎟≤. q ( t − 1 ) ∑ ∑ n pi l ⎜ ⎟ n) n∈Buy ( i ) l∈L ( f pi ⎝ ⎠ s.t. N
∑p k =1
∑x
n∈Buy ( i ) n
n
(t − 1).ei + k
k
n pi (t ,.) n
b ≤ x pi (t ,.) ≤ B ,
(6)
t −1
Π ( P ( τ )) ∑ τ =0
i
≤ CDi ∀n ∈ Buy (i )
(7) (8)
224
M. Analoui and M.H. Rezvani
Condition (6) implies the "budget constraint" of the node i who acts as a consumer in the overlay economy. The sum on left-hand side of (6) is just the expenditure of the node i in which the first part includes the prices of the different demanded services and the other part (the inner sigma) is the summation of the prices of all the n
links that f pi goes through which are recently updated in the previous time slot, or in n
other words, the most recent underlying physical network prices that f pi has to pay. The right-hand side of (6) simply is the initial budget of node i plus its share in the profit earned by selling the services to its downstream nodes in the previous time t −1
slots, denoted as
Π ( P ( τ )) . We defer the investigation of the profit function to ∑ τ i
=0
(16). Note that due to the dual role of the intermediate nodes, one can imagine that they are acting as both consumers and producers in the economy. Hence, the budget of each consumer can arise from two sources: from an endowment of the services already owned, and from shares in the profit which it earns as a producer. Also, note that the right-hand side of the constraint (6) as well as some part of the left-hand side are a-priori known because they are expressed based on the prices of the previous time slot, namely t − 1 . Constraint (7) implies that the summation of the rates of all services demanded by the consumer node i should be equal or less than its downlink capacity, denoted as CDi . Constraint (8) states that the receiving rate of each service should be in the interval between its minimum and its maximum allowed bandwidth. Before we proceed, let us define m i ( P (t )) , the budget of the consumer i , as following: t −1
m i ( P (t )) = P (t ).e i +
Π ( P ( τ )) ∑ τ i
(9)
=0
In words: "the budget of the user i in the beginning of the time slot t is the sum of its initial endowed budget, namely P (t ).e i , and the profit earned during the past time t −1
slots, namely
Π ( P ( τ )) ." Note that since the feasible region of the constraints (6) ∑ τ i
=0
to (8) is compact, by non-linear optimization theory, there exists a maximizing value n
of argument { x pi (t ,.) | n ∈ Buy (i )} for the above optimization problem, which can be solved by Lagrangian method [13]. We denote the solution to the consumer i 's problem in the time slot t as following : n
x pi (t , P (t ), m i ( P (t )),
∑ q l (t − 1), CDi , b n , B n ),
n) l∈L ( f pi
∀n ∈ Buy (i )
(10)
n
The allocation set { x pi (t , P ( t ), m i ( P (t )),...) | n ∈ Buy (i )} is in fact the "Walrasian Equilibrium Allocation" (WEA) of the consumer i in the overlay economy. In microeconomics terminology, the WEA, is the consumer’s demanded bundle, which mainly depends on the market prices and the consumer’s budget.
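For illustration, the consumer problem (5)–(8) for a single node can be solved numerically with a generic constrained optimizer. The sketch below is a toy instance, not part of the proposed mechanism; all numerical values, and the treatment of the per-service link charges as a pre-aggregated vector q_links, are assumptions of the sketch:

import numpy as np
from scipy.optimize import minimize

def solve_consumer(beta, B, b, p, q_links, budget, CD):
    # beta, B, b, p, q_links: one entry per service the node buys; q_links holds the
    # aggregated underlying link prices along each service's flow (inner sum in (6)).
    # budget: endowment income plus accumulated profit; CD: downlink capacity.
    neg_utility = lambda x: -np.sum(beta * np.log1p(x / B))                 # minus Eq. (4)
    constraints = [
        {"type": "ineq", "fun": lambda x: budget - (np.sum(p * x) + np.sum(q_links))},  # (6)
        {"type": "ineq", "fun": lambda x: CD - np.sum(x)},                              # (7)
    ]
    bounds = list(zip(b, B))                                                # (8)
    res = minimize(neg_utility, x0=np.array(b, dtype=float),
                   bounds=bounds, constraints=constraints, method="SLSQP")
    return res.x    # the node's Walrasian demand at these prices

# Toy instance with two services (all numbers are made up):
x = solve_consumer(beta=np.array([0.6, 0.4]), B=np.array([4000.0, 4000.0]),
                   b=np.array([500.0, 500.0]), p=np.array([1.0, 2.0]),
                   q_links=np.array([0.3, 0.5]), budget=9000.0, CD=6000.0)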
Now we turn to the optimization problem of the producers in our model. The competitive overlay economy includes two sources of producers: the origin servers (i.e., j ∈ S), which originally offer the service to the network, and the users who relay the services to their downstream nodes (i.e., j ∈ V − S). From now on, we refer to both of these types of producers as the set J. We now let y_j ∈ R^N be a “production plan” for a given producer j. If, for example, there are three services in the overlay economy and y_j = (7, 0, 13), then the production plan implies that the producer j produces 7 units of the service number 1 and 13 units of the service number 3, but does not produce the service number 2 at all. To summarize the technological possibilities in production, we suppose each producer j possesses a “production possibility set” Y_j, j ∈ J, which is closed, bounded, and convex. Interested readers can refer to pages 206-207 of [14] for an in-depth discussion of these assumptions. Each producer j seeks to solve the following problem:

max_{y_j(t) ∈ Y_j}  Σ_{n ∈ Sell(j)} p_n(t,.) y^n_j(t)    (11)

s.t.

Σ_{n ∈ Sell(j)} y^n_j(t,.) ≤ CU_j    (12)

y^n_j(t,.) ≤ x^n_pj(t,.),  ∀ n ∈ Sell(j)    (13)

y^n_j(t,.) = Σ_{k ∈ Chd(j, n)} x^n_jk(t,.),  ∀ n ∈ Sell(j)    (14)
Since, for each producer of this class, all the required input services are already present at its corresponding consumer node, we consider no cost of production for it in (11). Constraint (12) is the uploading capacity constraint of the producer j: it states that the sum of the production of all services produced by the producer j should not be greater than j's uploading capacity. Constraint (13) implies that the rate of each service supplied by the producer j cannot exceed the rate at which it is provided to j. In fact, the intermediate node j uploads to its downstream nodes those services which it has downloaded from its upstream nodes. Constraint (14) states that y^n_j should be equal to the sum of the rates of the service type n that is supplied by the producer j to be used by its downstream nodes. Altogether, we can say that the solution to (11) is in fact the production plan of the producer j. We can represent the production plan of the intermediate user j as

y_j(t, P, CU_j, {x^n_pj(t,.) | n ∈ Sell(j)}, {x^n_jk(t,.) | k ∈ Chd(j, n) & n ∈ Sell(j)})    (15)
Each element of this production plan represents the amount of the service type n which the user node j sells to its downstream nodes in the overlay economy. If the node j does not provide a service type, the corresponding element in the production plan y_j will be equal to zero. The producer j's profit function in time slot t is defined as

Π_j(P(t)) ≡ max_{y_j ∈ Y_j} P(t)·y_j(t)    (16)

Because the flow rates in the overlay network are continuous and the constraint set is closed and bounded, a maximum of the producer's profit exists [13]. It is common in the microeconomics literature to describe each market by a function named “excess demand.” The whole system may then be described compactly by a single N-dimensional “excess demand vector”, each of whose elements is the excess demand function for one of the N markets. The aggregate excess demand for the service k ∈ N in each time slot t has the following form:

z_k(P(t)) ≡ Σ_{i ∈ I} x^k_i(t, P(t), m_i(P(t)), .) − Σ_{j ∈ J} y^k_j(P(t), .) − Σ_{i ∈ I} e^k_i    (17)

and the aggregate excess demand vector is

Z(P(t)) ≡ (z_1(P(t)), ..., z_N(P(t)))    (18)
When z_k(P(t)) > 0, the aggregate demand for commodity k exceeds its aggregate endowment, and so there is excess demand for commodity k. When z_k(P(t)) < 0, there is excess supply of commodity k. A Walrasian equilibrium price vector P*(t) >> 0 clears all markets; that is, Z(P*(t)) = 0. Since the utility functions and production possibility sets satisfy the conditions of the Existence Theorem, the existence of a Walrasian equilibrium price vector is guaranteed. Interested readers can refer to pages 210-211 of [14] for the proof of this theorem.
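The market-clearing computation that the mechanism relies on can be sketched generically: given a routine that returns the excess demand vector Z(P) of Eqs. (17)–(18), a standard root finder, or a simple tatonnement-style price update, searches for P* with Z(P*) = 0. The excess-demand function below is a toy stand-in, not the overlay economy itself:

import numpy as np
from scipy.optimize import fsolve

def excess_demand(P):
    # Toy stand-in for Z(P); in the real mechanism this would aggregate the
    # consumers' demands (10) and the producers' supplies (15) at the prices P.
    P = np.asarray(P, dtype=float)
    demand = np.array([10.0, 8.0, 6.0]) / P      # demand falls as prices rise
    supply = np.array([2.0, 2.0, 2.0]) * P       # supply rises with prices
    return demand - supply

P_star = fsolve(excess_demand, x0=np.ones(3))    # equilibrium prices, Z(P*) = 0

# Alternatively, a tatonnement update raises the prices of over-demanded services
# and lowers the prices of over-supplied ones until the market clears:
P = np.ones(3)
for _ in range(200):
    P = np.maximum(P + 0.05 * excess_demand(P), 1e-6)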
5 Experimental Analysis

In order to control the competitive overlay economy, we consider a dedicated server which is responsible for calculating the equilibrium price and informing all nodes about it. We name this server the Overlay Market Control Server (OMCS). In fact, the OMCS plays the role of the “rendezvous point” in protocols such as [6, 7, 15]. In such protocols, upon joining the overlay network, a new node sends a request to the rendezvous point and then acquires from it the IP address of its assigned parent in the multicast tree. Upon the joining or leaving of nodes, the OMCS solves the N equations of Eq. (18) with N unknown equilibrium prices to find the new equilibrium price vector P*(t), and broadcasts it to all nodes i ∈ V by proper messages. Due to space limitations, we omit the messaging details here.
The minimum and maximum allowed bandwidths of a service are defined to be 500 Kbps and 4 Mbps, respectively. Each user i, on joining the network, is endowed with an initial income e_i, which has a Poisson distribution with mean 400 Kbps. The time between two consecutive “joining” events is uniformly distributed in the interval [0, 200 msec]. Similarly, the inter-arrival time of the “leaving” events has a uniform distribution in the interval [0, 2 sec]. We have used the BRITE topology generator [16] to create the backbone of the underlying physical network, which consists of 512 routers and 1024 physical links. The bandwidth of each physical link has a heavy-tailed distribution in the interval [10 Mbps, 100 Mbps], and the propagation delay has a uniform distribution in the interval [1 ms, 2 ms]. Each overlay node is connected to a backbone router through an access link whose capacity is exponentially distributed with an average of 15 Mbps. We have assumed that the downlink capacity of each access link is twice its uplink capacity. Also, the maximum tolerable loss rate and the maximum tolerable delay of each overlay flow are 5% and 1 second, respectively. The number of services in the competitive overlay economy is 5, so there are 5 origin servers in the economy. In order to evaluate the resilience of the proposed algorithm against network dynamics, we ran our simulations with dynamic networks in which the nodes join and leave according to the given distributions. Fig. 2 shows the average throughput of each node resulting from the microeconomics-based resource allocation mechanism, compared to the case in which no price-based mechanism is used. In the non-priced mechanism, the joining and leaving algorithms are designed similarly to those of the price-based case, but they do not involve the calculation of the equilibrium price at all. For the sake of completeness, a comparison with the average upper-bound throughput is provided in Fig. 2 as well. By “average upper-bound throughput”, we mean the average uploading capacity of the non-leaf nodes in all multicast trees. Clearly,
Fig. 2. Average throughput per node (average social welfare)
Fig. 3. Percentage of improvements by the microeconomic-based mechanism
the aggregate receiving rate of the overlay nodes in an overlay tree cannot exceed the sum of the upload capacities of the non-leaf nodes. So, we can gain further insight into the microeconomics-based mechanism by evaluating it against the average upper-bound throughput as a best-case metric. Clearly, the metrics related to the upper-bound throughput vary with the topological form of the multicast trees at any given moment. By the “First Welfare Theorem”, every WEA is guaranteed to be Pareto optimal in the sense that the average social welfare of the economy is maximized; interested readers can refer to pages 218-219 of [14] for the proof of this theorem. It is evident from Fig. 2 that the resulting average social welfare of the proposed mechanism is better than in the case where no price-based mechanism is used.

Fig. 3 illustrates the percentage of nodes whose utility has improved under the microeconomics-based mechanism during the network dynamics. To this end, for each population we logged the acquired utility of each user under both the microeconomics-based and the non-priced mechanisms and compared the two values. Next, we normalized the number of improved users by the total number of users and stated the result as a percentage of improvement. It is clear from the figure that using the proposed mechanism enhances the perceived quality of the services in the multicast groups.

Fig. 4 shows the price of each service during the network dynamics. In order to allocate the demanded services to new users, the proposed algorithms first seek the upper levels of each multicast tree, so the resulting multicast trees typically have near-balanced shapes. For the sake of illustration, let us consider the case in which each multicast tree has the structure of a balanced tree. Let K denote the maximum number of allowed children for each user node, and V^n_leaf denote the number of leaf nodes in the n-th multicast tree. Then, we have [17]

V^n_leaf = (V(K − 1) + 1) / K    (19)
Since we have assumed K = 4 in the experimental setup, it follows from Eq. (19) that V^n_leaf = (3V + 1)/4. In other words, when each user is allowed to have four children, approximately 75 percent of the users will be leaf nodes. Clearly, the leaf users only consume the services and do not supply any services at all. Therefore, since the total number of consumers in the overlay economy is potentially greater than the total number of producers, we expect the prices to increase over time.
Fig. 4. The price of each service during network dynamics
From the Fig. 4, we can see that the prices of the last rounds increase much slower than those of the early rounds. The economic intuition underlying it is clear. As the time passes, the nodes who are located in the upper levels of the tree earn more profit by selling the services to their children. The earned profit can be added to the current budget of each node and improve its ability for buying the services from its parents. So, the upper level nodes will reach to a state in which they will have enough budgets to buy all of their demanded services from their parents. In other words, as the time passes the upper level nodes no longer have budget constraint and can demand more service compared to the early rounds. This in turn allows the parents to share their uploading capacity more than before and increase their supply. With respect to (17), when the gap between the supply and the demand decreases, Z ( P ) increases with smaller slope than before. This acts as a barrier against increasing the prices and causes the price vector not to rise rapidly in the last rounds. Another factor that significantly affects the price vector is the amount of the initial endowments in the society. The term
$\sum_{i \in I} e_i^k$ on the right-hand side of Eq. (17) represents the aggregate amount of the initial endowments for a given service $k$. This value increases in proportion to the ever-increasing number of users, leading to a slow increase of $Z(P)$,
and therefore of the price vector. Another reason the price vector rises slowly in the final rounds lies in the nature of a perfectly competitive market: as the number of consumers and producers becomes sufficiently large, no single participant alone has the power to significantly affect the market price. As is evident from Fig. 4, the prices of services 1 and 2 rise with a larger slope than those of services 3, 4, and 5. The reason is that, in the experimental setup, we assigned the greatest relative importance to services 1 and 2 ( β 1 = 0.45, β 2 = 0.40). So, as shown in Fig. 5, the consumers try to buy services 1 and 2 in higher quantities than the other services. This results in more excess demand and a larger displacement of the prices for these two services.
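The price dynamics described above follow the usual tâtonnement intuition: the price of a service moves in the direction of its excess demand, and the adjustment flattens as supply catches up. The sketch below only illustrates that idea under assumed inputs; it is not the paper's Eq. (17), and the step size and service numbering are our own choices.

```python
def update_prices(prices, demand, supply, step=0.05):
    """One tatonnement step: raise the price of a service whose aggregate
    demand exceeds its supply, lower it otherwise. Illustrative only."""
    new_prices = {}
    for k, p in prices.items():
        excess = demand[k] - supply[k]          # aggregate excess demand Z_k(P)
        new_prices[k] = max(1e-6, p + step * excess)
    return new_prices

# Example: service 1 is over-demanded, so its price rises the fastest.
prices = {1: 1.0, 2: 1.0}
prices = update_prices(prices, demand={1: 12.0, 2: 8.0}, supply={1: 9.0, 2: 7.5})
print(prices)
```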
Fig. 5. The rate of each service during network dynamics
6 Conclusions

In this paper we have viewed the overlay multicast network as a competitive exchange economy. The proposed algorithms manage the provisioning of multiple services and allocate bandwidth to the users in a multi-rate fashion, with the goal of maximizing the aggregate utility of the system. To this end, the algorithms regulate the price of each service so that the demand for each service equals its supply. Our experiments have demonstrated the efficiency of the system and have shown that it performs near-optimally and allocates the resources in such a way that Pareto optimality is achieved.
References
1. Zhu, Y., Li, B.K., Pu, Q.: Dynamic Multicast in Overlay Networks with Linear Capacity Constraints. IEEE Transactions on Parallel and Distributed Systems (2009) (to appear)
2. Zhu, Y., Li, B.: Overlay Networks with Linear Capacity Constraints. IEEE Transactions on Parallel and Distributed Systems 19(2), 159–173 (2008)
3. Cui, Y., Xue, Y., Nahrstedt, K.: Optimal Resource Allocation in Overlay Multicast. IEEE Transactions on Parallel and Distributed Systems 17(8), 808–823 (2006)
4. Tran, D.A., Hua, K.A., Do, T.: ZIGZAG: An Efficient Peer-To-Peer Scheme for Media Streaming. In: Proc. of IEEE INFOCOM 2003, San Francisco, CA, USA (2003)
5. Chu, Y.H., Rao, S., Seshan, G.S., Zhang, H.: A case for end system multicast. IEEE J. on Selected Areas in Communications 20(8) (2002)
6. Banerjee, S., Bhattacharjee, B., Kommareddy, C.: Scalable Application Layer Multicast. In: Proc. of ACM SIGCOMM 2002, Pittsburgh, PA, USA (2002)
7. Castro, M., Druschel, P., Kermarrec, A.-M., Rowstron, A.: SCRIBE: a large-scale and decentralized application-level multicast infrastructure. IEEE J. on Selected Areas in Communications 20(8) (2002)
8. Roughgarden, T., Tardos, É.: How bad is selfish routing? J. ACM 49(2), 236–259 (2002)
9. Wu, C., Li, B.: Strategies of Conflict in Coexisting Streaming Overlays. In: INFOCOM 2007, pp. 481–489 (2007)
10. Levin, D., LaCurts, K., Spring, N., Bhattacharjee, B.: Bittorrent is an auction: analyzing and improving bittorrent's incentives. In: SIGCOMM, pp. 243–254 (2008)
11. Wu, C., Li, B., Li, Z.: Dynamic Bandwidth Auctions in Multioverlay P2P Streaming with Network Coding. IEEE Trans. Parallel Distrib. Syst. 19(6), 806–820 (2008)
12. Wang, W., Li, B.: Market-Based Self-Optimization for Autonomic Service Overlay Networks. IEEE J. on Selected Areas in Communications 23(12), 2320–2332 (2005)
13. Bertsekas, D.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)
14. Jehle, G.A., Reny, P.J.: Advanced Microeconomic Theory. Addison-Wesley, Reading (2001)
15. Pendarakis, D., Shi, S.Y., Verma, D., Waldvogel, M.: ALMI: an application layer multicast. In: 3rd USENIX Symp. on Internet Technologies and Systems (2001)
16. Medina, A., Lakhina, A., Matta, I., Byers, J.: BRITE: An Approach to Universal Topology Generation. In: Proc. IEEE Int'l Symp. Modeling, Analysis and Simulation of Computer and Telecomm. Systems, MASCOTS (2001)
17. Horowitz, E., Sahni, S., Mehta, D.: Fundamentals of Data Structures in C++. W.H. Freeman Press, New York (1995)
Investigating Perceptions of a Location-Based Annotation System*

Huynh Nhu Hop Quach1, Khasfariyati Razikin1, Dion Hoe-Lian Goh1, Thi Nhu Quynh Kim1, Tan Phat Pham1, Yin-Leng Theng1, Ee-Peng Lim2, Chew Hung Chang3, Kalyani Chatterjea3, and Aixin Sun4
1 Wee Kim Wee School of Communication & Information, Nanyang Technological University {hnhquach,khasfariyati,ashlgoh,ktnq,tppham,tyltheng}@ntu.edu.sg
2 School of Information Systems, Singapore Management University [email protected]
3 National Institute of Education, Nanyang Technological University {chewhung.chang,kalyani.c}@nie.edu.sg
4 School of Computer Engineering, Nanyang Technological University [email protected]
Abstract. We introduce MobiTOP, a Web-based system for organizing and retrieving hierarchical location-based annotations. Each annotation contains multimedia content (such as text, images, video) associated with a location, and users are able to annotate existing annotations to an arbitrary depth, in effect creating a hierarchy. An evaluation was conducted on a group of potential users to ascertain their perceptions of the usability of the application. The results were generally positive and the majority of the participants saw MobiTOP as a useful platform to share location-based information. We conclude with implications of our work and opportunities for future research.
1 Introduction

In recent years, various location-based annotation systems [2, 5, 7, 8] have popularized the use of maps for people to create and share geospatial content. Put differently, a location-based annotation system allows users to create and share multimedia content that is typically associated with latitude-longitude coordinates using a map-based visualization. As an information sharing platform, location-based annotation systems can support users' information discovery needs through searching and browsing features [20]. Also, in the spirit of social computing, such systems can allow users to create annotations as well as annotate existing content [1]. Threads of discussion or topics that are organized hierarchically are then induced from the collaborative effort. Despite the growing amount of research in this area, to the best of our knowledge, there are few studies that investigate the usability of these applications. We argue that this is critical to understanding how users perceive these applications and their constituent features.
* This work is partly funded by A*STAR grant 062 130 0057.
This will help in the design and implementation of location-based annotation systems. In this paper, we investigate the usability of MobiTOP (Mobile Tagging of Objects and People). As its name suggests, the application supports location-based tagging or annotating. MobiTOP offers a Web-based platform where users are able to freely create, contribute, and comment on location-based content. The application also enables users to explore, search and browse annotations using a variety of techniques. In previous work, we conducted a small-scale pilot evaluation of MobiTOP [9]. While useful in guiding the development of further iterations of the system, the results were not generalizable due to the small number of participants involved. Here, we complement the previous study by involving a larger number of participants. The remainder of this paper is organized as follows. Section 2 provides an overview of related research, while Section 3 introduces MobiTOP, the location-based annotation system that we have implemented. Section 4 presents an evaluation of the system. The paper closes with Section 5, which discusses the implications of our work and opportunities for future research.
2 Related Work

Here, we review literature related to location-based annotation systems. One such system is World Explorer [4], which allows users to explore and browse large-scale georeferenced photo collections. Using spatial, textual and photographic data mined from Flickr, the system visualizes the most representative tags of a geographical area. This visualization improves users' exploring and browsing experiences. However, World Explorer does not provide a search function that allows users to look for specific tags. Moreover, users of World Explorer are unable to share or discuss their content directly on the system. GeoAnnotator [3], on the other hand, facilitates location-based discussion threads by connecting annotations to geographic references and other annotations. However, users are limited to sharing only textual content, and this functionality is not extended to other types of content such as multimedia. Urban Tapestries [6] is another system that allows users to share location-based multimedia content. Moreover, this system also allows users to follow a discussion thread as hierarchical content. However, no usability study has been done on the system's map interface or its annotation visualization. There are few usability studies related to location-based annotation systems. Komarkova et al. [19] proposed a set of 138 heuristics for the usability evaluation of location-based applications. In that study, 14 GeoWeb applications were used to test the framework, and their usability was evaluated and criticized by a group of expert users. Although major online Web-mapping systems such as Google Maps and Microsoft Live Search have significantly improved their usability, there have so far been few usability evaluations involving the end-users of such systems. Studies [19, 21, 22, 23] have found that evaluating the usability of applications directly with end-users is more promising.
3 Introducing the Web-Based MobiTOP System

MobiTOP was introduced in our previous work [9], which described in detail the architecture of the whole system as well as the concept of multimedia hierarchical
annotation. The latest version of the Web-based MobiTOP provides additional functions for the identification, organization, searching and visualization of location-based content. Moreover, by using the Google Maps™ API for the MobiTOP user interface, these functions have been organized consistently in the Web application. In this section, we describe the MobiTOP Web user interface and explore its functionality. The MobiTOP Web client offers an AJAX-based user interface to facilitate widespread use without the need to install additional software. We have adopted a map-based visualization to access the location-based annotations (Figure 1). An important component of MobiTOP is its support for hierarchical multimedia annotations, which allows users to annotate existing annotations, essentially creating a thread of discussion. Here, annotations consist of locations, images and other multimedia, as well as textual details augmented by tags, titles and descriptions. The content of an annotation is displayed across two columns in MobiTOP (Figure 1). One column displays the hierarchical view of the selected annotation while the other column displays the annotation's content. The content itself is divided among various tabs and consists of the annotation's details, tag cloud, and media attachments. MobiTOP's functionality may be divided into seven main components:
• Registration: Before a user can start to contribute and browse the annotations in MobiTOP, an account needs to be registered. A registered user is able to view the main interface of MobiTOP (Figure 1) after logging in.
• Map navigation: MobiTOP provides standard features for map navigation such as zooming and panning. Users are also able to reposition the map to a specific area by entering an address in the search bar.
• Browsing annotations: Users are able to access annotations in various ways. One of these is the View menu at the top left corner of the screen (Figure 1). This menu encapsulates the different annotation access features, such as viewing all the annotations in the system, the user's contributed annotations, recently contributed annotations, and the tag cloud generated from all
Fig. 1. User interface of the MobiTOP Web client
annotations. These functions enable users to make serendipitous information discoveries. Another way for the user to browse the annotations is by navigating the tree view that is displayed in the individual annotation's details.
• Searching annotations: Users are able to search for desired annotations by entering relevant keywords in the search bar. However, retrieved annotations could clutter the map and impede the searching process if too many results are returned [14, 17]. We overcome this problem by clustering the results. Here, the annotations in the search results are grouped based on their locations (Figure 2). The clustering algorithm is an adaptation of DBScan [12] that groups the annotations by density. The novelty of this approach is that the clustering results vary between zoom levels depending on the distance between the annotations. The number on each marker on the map in Figure 2 shows the number of annotations in the cluster. In addition, a tag cloud of each cluster is shown to the user. Users are thus able to explore individual annotations in each cluster by clicking on the tag cloud (Figure 3). Further, users are able to search without clustering the resulting annotations.
• Filtering annotations: In addition to clustering, filtering the annotations to narrow search results is also supported. Here, options are available to narrow the results by distance, user rating and time (Figure 1).
• Creating annotations: When creating a new annotation, a user enters a title, tags and a description, and attaches relevant multimedia files (Figure 4). We attempt to alleviate the problem of noisy tags [18], as well as to save users' time and effort in keying in tags [10], by providing tag recommendations (Figure 4). The tags are recommended based on the location of the annotation [11], its parent's tags, and the owner's contributed tags thus far. Given the current location of the user, the algorithm first aggregates the tags that have been used in the surrounding location. Each of these tags is given a score based on the frequency of its usage. We further distinguish between the number of times a tag has been used by the current user and by other annotation creators, in order to favour the preferences of the current user. The tag's score is also determined by how recently the tag was used, again distinguishing between the current user and other owners. Finally, the top ten tags with the highest scores are recommended to the user (a possible scoring loop is sketched after this list).
Fig. 2. The clustered search results list displayed in the left panel and on the map
Fig. 3. Interface of the clustered search results showing the annotations of a cluster
Fig. 4. Creating an annotation and list of recommended tags
Fig. 5. Editing an annotation's textual content as well as its attachments
• Editing/deleting annotations: Users are only able to edit or delete the annotations that they have created. The edit form (Figure 5) provides functions similar to those for creating annotations. Users are able to edit the textual content and to add or delete multimedia files.
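The tag-recommendation step of the Creating annotations component (referenced above) can be pictured as a simple scoring loop over nearby annotations. The sketch below is our illustration of that description; the field names, weights, and exponential recency decay are assumptions rather than MobiTOP's actual code.

```python
from collections import defaultdict
from datetime import datetime

def recommend_tags(nearby_annotations, current_user, now=None, top_n=10,
                   w_own=2.0, w_other=1.0, recency_half_life_days=30.0):
    """Score tags used around the new annotation's location and return the
    top_n. The paper only states that the current user's own and more recent
    tag usage count for more; the exact weighting here is assumed."""
    now = now or datetime.utcnow()
    scores = defaultdict(float)
    for ann in nearby_annotations:                      # annotations in the surrounding area
        owner_weight = w_own if ann["owner"] == current_user else w_other
        age_days = (now - ann["created"]).total_seconds() / 86400.0
        recency = 0.5 ** (age_days / recency_half_life_days)
        for tag in ann["tags"]:
            scores[tag] += owner_weight * (1.0 + recency)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [tag for tag, _ in ranked[:top_n]]
```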
4 Usability Evaluation

A study of the MobiTOP Web user interface was conducted to determine its usability. A total of 106 participants took part in the evaluation. There were 57 male and 49 female participants; they were students and working adults. Their ages ranged from 18 to 37, with an average age of 23. Further, 38% of the participants had a computer science background, while the rest had other academic backgrounds. Participants were familiar with the use of social computing applications such as blogs, wikis, photo/video sharing, social tagging and social networking. Here, 65% of the participants reported viewing such content at least once a week, while 55% reported contributing such content at least once a month.
4.1 Methodology

During the one-hour evaluation session, participants were first briefed on the concept of annotations and were introduced to the seven components of Web-based MobiTOP as described in Section 3. After that, a short demonstration showed the participants how to perform some basic tasks directly in the Web application. Right after the introduction and demonstration, a travel-planning scenario together with fifteen tasks was assigned to each participant in order to evaluate the user interface. The tasks focused on using the seven components of MobiTOP described in Section 3 to plan and share trips through the system. Research assistants were on hand to clarify doubts that the participants had while doing the tasks. After completing their tasks, participants were required to complete a questionnaire with demographic questions and questions related to their perceptions of the usability of the MobiTOP system. Each question was given in the form of an affirmative statement followed by a scale of 1 (Strongly Disagree) to 5 (Strongly Agree). The usability section of the questionnaire was further divided into two parts. The first sought to determine MobiTOP's overall usability via four indicators [13]:
• Learnability: measures how easily users learn to navigate the system and complete a task.
• Efficiency: determines the users' ability to complete a task within a reasonable amount of time.
• Error Handling: verifies the users' understanding of the errors encountered and their ability to recover from them.
• Satisfaction: validates the users' sense of satisfaction after completing the tasks and their intention to adopt the system.
The second part of the questionnaire focused on the usability of each component, and questions were asked about the ease of use of the respective features. Participants were also encouraged to elaborate on their evaluation by answering three subjective questions about which components they liked or disliked, as well as suggestions on useful features that could be included in future versions of MobiTOP.

4.2 Results

Table 1 shows the mean and standard deviation (SD) of MobiTOP's overall usability with respect to the four indicators. The results indicate that MobiTOP was perceived to be relatively usable in general. For instance, during the study most participants were observed to be able to complete the tasks after the short briefing, suggesting the learnability of the system ("It is easy to learn using the application" – Participant 2). In addition, the efficiency indicator suggests that participants took a reasonable amount of time to complete their tasks; it was observed that all of the participants were able to complete the tasks within the specified amount of time. Further, participants generally understood the meaning of the error messages they encountered and were able to recover from the errors without seeking help from the research assistants. Finally, they appeared to have enjoyed creating and sharing annotations with others, and most of them felt satisfied after completing the tasks ("I can get a lot of information if users upload their experiences to MobiTOP… It is an interesting system" – Participant 37).
Table 1. Overall usability results (1 = strongly disagree; 5 = strongly agree)

Usability Indicators    Mean    S.D.
Learnability            4.08    0.36
Efficiency              3.72    0.54
Error Handling          3.60    0.59
Satisfaction            3.84    0.51
In addition to overall usability, Table 2 shows the mean and standard deviation of the usability of each of MobiTOP's seven major features. The results indicate that participants found the individual components to be usable as well:
• User Registration. Overall, all participants knew how to register for a MobiTOP account without any trouble. They found the registration form intuitive, possibly because of their familiarity with other Web applications' registration components. In addition, all participants knew how to handle registration errors.
• Annotation Navigation. All the features of the View menu were appreciated by the participants. For instance, Participant 4 commented that "viewing my annotations helps me to conveniently keep track of all my uploaded annotation". Participant 29 found viewing the annotations of a particular user useful as it "allows me to search for their friends' activities easily". Similarly, Participant 7 found the tag cloud function convenient: "I don't have to think about the word I want to search, it's great". Finally, Participant 45 liked viewing recent annotations as it "allows me to quickly update the information in the system". On the other hand, the "View All Annotations" feature received less positive responses than the others. One likely reason is that most users could not easily locate their preferred annotations among the large number of annotations in the system. In summary, the participants were able to learn how to use the features in the View menu without any difficulty, and most of them found that this component greatly helped the way they accessed the annotations.
• Map Navigation. Most of the participants felt comfortable browsing the map. Additionally, they felt that the map-based interface was quite intuitive. The reason could be their familiarity with Google Maps, as 70% of the participants used Web-based mapping applications at least once a week. The way annotations are represented on the map was also well received ("It's quite easy to navigate the map and explore annotations through pop-up windows" – Participant 98).
• Creating Annotations. The participants found that annotations were easy to create because of the simplicity and responsiveness of the interface. As Participant 46 remarked: "The speed of uploading the annotations amazes me. It doesn't take more than a minute to add a new annotation. It's simple to attach a picture too". Although the concept of hierarchical annotations was new to most of the participants, they were able to create sub-annotations easily. Perhaps the tree view visualization provided them with the proper mental model to understand the concept of hierarchical annotations. The participants also recognized the advantages of organizing the annotations hierarchically. This sentiment was echoed by Participant 6, who felt that creating sub-annotations was easy as "we
usually share our experiences in the some similar topics with others. The tree structure let us to conveniently organize the information".
• Editing/Deleting Annotations. Participants found editing an existing annotation easy and useful, as they were able to provide updated information for their contributions. They also found deleting their annotations easy ("The … delete annotations (was) made simple and easy" – Participant 63).
• Searching Annotations. Participants found the search-without-clustering feature easy to use, as the results were ordered by relevance and organized across pages. Participant 44 found that the "searching feature is easy to use and it helps me to find the information I need". On the other hand, presenting the search results in clusters was a new concept to some users. However, most participants managed to complete the tasks related to this function. A common sentiment shared by the participants was that clustering helped to reduce information overload. Participant 10 sums this up nicely: "Clusters are neatly organized and the tag cloud of each cluster helps in the searching process". However, there were comments on the unresponsiveness of searching with clustering, owing to the processing time required by the clustering algorithm.
• Filtering Annotations. Most of the participants agreed that being able to filter the annotations by different attributes was helpful in discovering information while at the same time reducing information overload. Participant 106 commented that "it is a handy tool to narrow down the information from the large results list".

Table 2. Components' usability results (1 = strongly disagree; 5 = strongly agree)

Component                      Mean    S.D.
Registration                   4.22    0.51
Annotation Navigation          4.19    0.48
Map Navigation                 3.99    0.61
Creating Annotation            4.11    0.52
Editing/Deleting Annotation    4.13    0.63
Searching Annotation           3.82    0.56
Filtering Annotation           4.17    0.55
5 Discussion and Conclusion

In this paper, a usability evaluation was conducted with the goal of ascertaining the usability of MobiTOP, a location-based annotation system. The overall usability was found to be above average by the 106 participants, despite the fact that new concepts, such as hierarchical annotations and clustering of search results, were used to support information access. Moreover, observations during the evaluation showed that participants needed very little training to be able to use the system. Arising from our results, the following are some implications for the design of location-based annotation systems:
• Using familiar visualizations to represent new concepts helps users to orient themselves more quickly in the system. We have adopted the tree view to represent the hierarchical aspect of annotations and the map-based visualization to represent the annotations. As these visualizations provide the relevant mental
model for users to map the respective concept with the visualization, users are more likely to find the application easy to use, as demonstrated by our results.
• As with any information system, searching is an essential component. Additionally, for a location-based system, searching is often tied to a specific location and is visualized on a map. However, presenting results as individual annotations on the map may overwhelm the user, especially when a large number of annotations is returned. As such, clustering results on the map should be considered to alleviate the information overload.
• Provide filtering functions that are based on the different attributes of the data model. As our annotations are contributed by users, a mechanism that distinguishes the more useful annotations from the less useful ones benefits users. In terms of geo-spatial attributes, narrowing the search radius focuses users on the relevant area of interest. Finally, being able to filter the annotations by time attributes narrows the annotations to the relevant time period.
• Finally, eliminate the need for the user to manually input data by providing recommendations. For instance, in MobiTOP, relevant tags are suggested to the user while creating an annotation; the users of course have the freedom to make their own selections. This reduces the mental effort needed to create annotations, thus improving users' perceptions of the usability of the application.
There are limitations in our study that could be addressed in future work. First, our clustering algorithm is limited to geo-spatial locations. Perhaps clustering the annotations semantically [15] in addition to location would further help users obtain relevant content. Next, our evaluation was cross-sectional in nature and confined to the use of MobiTOP in a single session. Further work could track the usability and usefulness of the system over a longer period of time. Finally, memorability was not considered as a usability indicator. It would be interesting for future work to investigate this aspect of MobiTOP [16].
References
[1] Kim, T.N.Q., Razikin, K., Goh, D.H.-L., Theng, Y.L., Nguyen, Q.M., Lim, E.P., Sun, A., Chang, C.H., Chatterjea, K.: Exploring hierarchically organized georeferenced multimedia annotations in the MobiTOP system. In: Proceedings of the 6th International Conference on Information Technology: New Generations, pp. 1355–1360 (2009)
[2] Leclerc, Y.G., Reddy, M., Iverson, L., Eriksen, M.: The GeoWeb—A New Paradigm for Finding Data on the Web. In: Proceedings of the International Cartographic Conference, Beijing, China (2001)
[3] Yu, B., Cai, G.: Facilitate Participatory Decision-Making in Local Communities through Map-Based Online Discussion. In: Proceedings of the Fourth International Conference on Communities and Technologies, pp. 215–224 (2009)
[4] Ahern, S., Naaman, M., Nair, R., Yang, J.: World Explorer: Visualizing Aggregate Data from Unstructured Text in Geo-Referenced Collections. In: Proceedings of the JCDL Conference, pp. 1–10 (2007)
[5] Girardin, F., Blat, J., Nova, N.: Tracing the Visitor's Eye: Using Explicitly Disclosed Location Information for Urban Analysis. IEEE Pervasive Computing 6(3), 55 (2007)
[6] Lane, G.: Urban Tapestries: Wireless Networking, Public Authoring and Social Knowledge. Personal Ubiquitous Computing 7(3-4), 169–175 (2003)
[7] Doyle, S., Dodge, M., Smith, A.: The potential of Web-based mapping and virtual reality technologies for modeling urban environments. Computer, Environment and Urban System 22, 137–155 (1998)
[8] Friedl, M.A., McGwire, K.C., Star, J.L.: MAPWD: An interactive mapping tool for accessing geo-referenced data set. Computers and Geosciences 15, 1203–1219 (1989)
[9] Razikin, K., Goh, D.H.-L., Theng, Y.L., Nguyen, Q.M., Kim, T.N.Q., Lim, E.-P., Chang, C.H., Chatterjea, K., Sun, A.: Sharing mobile multimedia annotations to support inquiry-based learning using MobiTOP. In: Liu, J., Wu, J., Yao, Y., Nishida, T. (eds.) AMT 2009. LNCS, vol. 5820, pp. 171–182. Springer, Heidelberg (2009)
[10] Naaman, M., Nair, R.: ZoneTag's Collaborative Tag Suggestions: What is This Person Doing in My Phone. IEEE Multimedia 15(3), 34–40 (2009)
[11] Moxley, E., Kleban, J., Manjunath, B.S.: SpiritTagger: A Geo-Aware Tag Suggestion Tool Mined from Flickr. In: Proceeding of the 1st ACM International Conference on Multimedia Information Retrieval, pp. 24–30 (2008)
[12] Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In: Proceedings of the KDD 1996, pp. 226–231. AAAI, Menlo Park (1996)
[13] Nielson, J.: Usability Engineering. Morgan Kaufmann, San Diego (1993)
[14] Jaffe, A., Naaman, M., Tassa, T., Davis, M.: Generating Summaries and Visualization for Large Collections of Geo-Referenced Photographs. In: Proceedings of the Multimedia Information Retrieval, pp. 89–98 (2006)
[15] Cutting, D.R., Karger, D.R., Pederson, J.O., Tukey, J.W.: Scatter/Gather: a cluster-based approach to browsing large document collections. In: Proceedings of the 15th Annual International ACM SIGIR Conference, pp. 318–329 (1992)
[16] Hornbæk, K.: Current practice in measuring usability: Challenges to usability studies and research. International Journal of Human-Computer Studies 64, 79–102 (2006)
[17] Nguyen, Q.M., Kim, T.N.Q., Goh, D.H.-L., Theng, Y.L., Lim, E.P., Sun, A., Chang, C.H., Chatterjea, K.: TagNSearch: Searching and Navigating Geo-referenced Collections of Photographs. In: Christensen-Dalsgaard, B., Castelli, D., Ammitzbøll Jurik, B., Lippincott, J. (eds.) ECDL 2008. LNCS, vol. 5173, pp. 50–61. Springer, Heidelberg (2008)
[18] Goh, D.H.-L., Lee, C.S., Chua, A.Y.K., Razikin, K.: Resource Discovery through Social Tagging: A Classification and Content Analytic Approach. Online Information Review 33(3), 568–583 (2009)
[19] Komarkova, J., Novak, M., Bilkova, R., Visek, O., Valenta, Z.: Heuristic Evaluation of Usability of GeoWeb Sites. In: Abramowicz, W. (ed.) BIS 2007. LNCS, vol. 4439, pp. 411–423. Springer, Heidelberg (2007)
[20] Lim, E.-P., Liu, Z., Yin, M., Goh, D.H.-L., Theng, Y.-L., Ng, W.K.: On organizing and accessing geospatial and georeferenced Web resources using the G-Portal system. Information Processing & Management 41(5), 1277–1297 (2005)
[21] Nielsen, J.: Useit.com: Technology Transfer of Heuristic Evaluation and Usability Inspection (2005), http://www.useit.com/papers/heuristic/learning_inspection.html [cit 2010-01-22]
[22] Haklay, M., Tobon, C.: Usability evaluation and PPGIS: toward a user-centered design approach. International Journal of Geographical Information Science 17, 577–592 (2003)
[23] Haklay, M., Zafiri, A.: Usability Engineering for GIS: Learning from a screenshot. The Cartographic Journal 45(2), 87–97 (2008)
Apollon13: A Training System for Emergency Situations in a Piano Performance

Yuki Yokoyama and Kazushi Nishimoto

Japan Advanced Institute of Science and Technology
1-1, Asahidai, Nomi, Ishikawa, 923-1292, Japan
{y_yoko,knishi}@jaist.ac.jp
Abstract. During a piano performance, there is always the possibility that the musician will cease playing on account of an unexpected mistake. In a concert, such a situation amounts to an emergency state in the piano performance. Therefore, we propose a system named “Apollon13” that simulates emergency states by replacing particular notes with different ones, in the manner of mistouches, by referring to the performer’s degree of proficiency as determined by a performance estimation algorithm. From the results of user studies, we confirmed that Apollon13 is basically effective as a training system for handling emergency states. However, the estimation algorithm could not precisely identify the note-replacement points where the subjects become upset. Accordingly, we evaluated the estimation algorithm by comparing it with the player’s subjective assessment based on the data of an experiment. As a result, we found a clear relationship between the subjective assessment and the points, obtained by experiment, at which players become upset. This result suggests that an algorithm could gain the ability to detect the “upset points” by approximating a human’s subjective assessment. Keywords: emergency training, performance estimation, piano performance, note-replacement.
1 Introduction

This paper proposes a novel piano-performance training system named "Apollon13." The system aims to foster the ability to avoid performance cessation caused by unexpected mistakes such as mis-touches. Performance cessation, where the performer "freezes up," is a fatal situation in a piano concert. Therefore, the performer must avoid such a situation by any means, and the performance must go on regardless of whether mistakes occur. However, no countermeasures to this situation are taught in conventional piano lessons, and there is no active training methodology for avoiding performance cessation. A piano lesson usually consists of several steps. The first step is basic training, in which an educand learns how to read scores and trains fingering using etudes (e.g., HANON). The second step is building a repertoire. This step is further divided into two sub-steps: the first is partial exercise and the second is full
exercise. In this step, the educand learns musical performance and musical expression. Although the educand can build a repertoire through these steps, he or she is not yet able to train for a piano concert. Generally, the way to train toward a piano concert is simply to repeat the full performance again and again after memorizing the score and fingering. However, this way of training cannot develop the educand's ability to cope with an unexpected accident: the only way to accomplish this has been to actually perform in concerts. Obviously, it is impossible for typical educands to use concerts for training. Various piano-performance training systems have been developed [1][2][3]. However, these systems have only supported users in becoming able to perform a musical piece accurately in accordance with its score. The problem of performance cessation during a concert has been completely outside the scope of such systems. Consequently, there has been no active way to train performers to avoid performance cessation. In the aerospace field, astronauts and pilots spend much time in training. Of course, they learn how to control aircraft and spaceships under normal conditions. However, to accomplish a mission safely, it is much more important to know how to deal with abnormal, emergency situations quickly and effectively. For this purpose, training for emergency situations is conducted using simulators. We introduce this situational training concept into piano-performance training. Apollon13 simulates unexpected mistakes as emergency situations. By using Apollon13 in the final stage of exercises before a concert, the educand is expected to acquire the ability to avoid the worst outcome, i.e., performance cessation. There has been no training method or training system against performance cessation up to now; therefore, we believe that our attempt has a high degree of novelty and utility.
2 How to Simulate Emergency Situations

How to simulate emergency states was an important consideration in designing Apollon13. While there are many causes of emergency states, we focus on mis-touches during performance. A mis-touch results in an unexpected sound, which makes the player upset and, in the worst case, leads to performance cessation. To induce a similar situation, Apollon13 replaces a few of the performed notes with different notes. By trying to keep playing even when the output notes differ from his/her intended notes, the player can learn how to recover from mis-touches without falling into performance cessation. It is important to note that the note-replacement function should be used only in the final stage, where the player is repeating the full exercise; in contrast, conventional piano-lesson support systems are used in the initial stage. Musicians use various kinds of feedback when playing musical instruments. In particular, they are alert to auditory feedback. The proposed system's note replacement intentionally disrupts this auditory sense. In the initial stage of a piano lesson, however, auditory feedback is a fundamental element. Therefore, the note-replacement function must not be used in the initial stage of a piano lesson.
Previous literature [4] demonstrated that note replacement has the effect of disorienting a piano performance. However, although a keyboard with a note-replacement function is used in that research, its objective is to formulate a kind of stuttering model. Therefore, the manner of note replacement in the earlier work is artificial, since such mis-touches never happen in real piano performances. To adopt note replacement in piano practice, a note-replacement method that simulates realistic mis-touches is required. To simulate such realistic mis-touches, two factors should be considered: which performed note should be replaced, and by which note. In Section 3, we describe the employed simulation method.
3 System Setup

3.1 Overview

Apollon13 is a MIDI (Musical Instrument Digital Interface) based system that consists of a MIDI keyboard, a personal computer, and a MIDI sound module. Apollon13 has two operation modes: a practice-monitoring mode and a rehearsal mode (Table 1). In the practice-monitoring mode, the system tracks and records the user's full piano performances. In this mode, the user repeats the full performance of a musical piece many times. A score-tracking function (described in Section 3.2) compares each performance with the score and records how accurately it is performed. When the practice-monitoring mode is finished, the system decides which notes should be replaced. Too many replacements would become an excessive burden for the user. Therefore, the system finds only a few notes where the user would surely become upset by note replacement, based on the performance-estimation results derived from the recorded tracking data (described in Section 3.3). We call such a selected note a "replacing-point" hereafter. In the rehearsal mode, the system tracks the user's performance again. When the user performs the replacing-point, the system replaces this note with another note neighboring the correct note, because an actual mis-touch in a piano performance follows such a pattern.

Table 1. Operation modes of the proposed system
Practice-monitoring mode:
  System – score tracking; recording the performance
  User – repeat of full performances
→ (decision of the note-replacement parts)
Rehearsal mode:
  System – score tracking; note replacement
  User – continue performing even if mis-touches are simulated
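A minimal sketch of the rehearsal-mode behaviour is given below; it is our illustration rather than the authors' implementation, and it simply shifts the pitch at a replacing-point by one or two semitones to mimic a neighbouring-key mis-touch.

```python
import random

def rehearsal_filter(score_position, note_number, velocity, replacing_points):
    """Pass most note-on events through unchanged; at a replacing-point,
    substitute a neighbouring pitch so the output sounds like a mis-touch."""
    if score_position in replacing_points:
        note_number += random.choice([-2, -1, 1, 2])   # neighbouring key
    return note_number, velocity

# Example: the event at score position 42 is a replacing-point (made-up data).
print(rehearsal_filter(42, 60, 90, replacing_points={42, 77}))
```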
3.2 Score Tracking

A score-tracking technology is necessary to obtain performance data for the performance estimation. Apollon13 utilizes the score-tracking function of "Family Ensemble" (FE)
[5]. FE is a piano-duo support system for a novice child and his/her parent, who is an amateur pianist. Since FE's score-tracking function is robust, it is applicable to tracking performances that include mistakes. We modified FE's score-tracking function in two respects. First, FE's original score-tracking function tracks only the highest note at each place. We made it polyphony-compatible by simply summing all the note numbers of the notes in a chord and regarding the sum as the note number of the chord. Second, FE outputs three kinds of tracking data: the performed position in the score, whether the performed note is correct or incorrect, and the timestamp of each performed note. We further added velocity data, representing the loudness of each note, for the performance estimation.

3.3 Performance Estimation

The aim of the performance estimation is to find where the user would surely become upset by note replacement. The performance-estimation algorithm classifies each score-event (i.e., the note-on event(s) at the same instant in FE's score data) into four categories. The criterion of the estimation is performance stability. If the performance of a score-event is highly stable throughout all performances in the practice-monitoring mode, the score-event is estimated as "skillful." If the performance of a score-event is not very stable, it is estimated as "poor." If the performance of a score-event becomes stable, it is estimated as "improved." The other score-events are estimated as "other."

3.3.1 Factors Used for Performance Estimation

Previous related studies [2][6] used three factors for performance estimation, i.e., IOI (inter-onset interval), duration, and velocity. In contrast, we use three factors obtained from the score-tracking function: IOI (calculated from the received timestamps of the performed score-events), the velocity of each score-event, and data on whether each performed score-event is correct or erroneous (CE-data, hereafter). Mukai et al. used the deviation of IOI for estimating the performance: if the deviation value of the same fingering pattern is large, this pattern is estimated as poorly stable [7]. We also use the deviation of IOI as well as that of velocity. However, we calculate the deviations of each score-event over all of the full performances, whereas Mukai et al. calculated those of the same fingering patterns. Fluctuation in the overall tempo of each performance influences the deviation of tempo at each score-event. To cancel this effect, we calculate a normalized local tempo at each score-event. First, the average tempo of each entire performance is calculated. Then, the normalized local tempo is calculated by dividing the local tempo at each score-event by the average tempo of the performance. Here, the note value of each score-event is necessary to calculate the normalized local tempo; therefore, we added the note value data to the score data of FE.
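The normalization of local tempo described above can be summarized in a few lines. The following Python sketch is our illustration (not the authors' code); it assumes onset timestamps in seconds and note values expressed in beats.

```python
def normalized_local_tempi(onset_times, note_values_beats):
    """Compute a local tempo (beats per second) at each score-event from the
    inter-onset interval and the notated note value, then divide by the
    performance's average tempo so that different takes are comparable."""
    iois = [t2 - t1 for t1, t2 in zip(onset_times, onset_times[1:])]
    local = [beats / ioi for beats, ioi in zip(note_values_beats, iois)]
    avg = sum(local) / len(local)              # average tempo of this performance
    return [tempo / avg for tempo in local]

# Two takes of the same passage at different overall speeds yield
# nearly identical normalized profiles (made-up timings).
print(normalized_local_tempi([0.0, 0.5, 1.0, 1.6], [1, 1, 1]))
print(normalized_local_tempi([0.0, 0.6, 1.2, 1.92], [1, 1, 1]))
```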
3.3.2 Classification of Each Score-Event

The performance estimation requires at least three sessions, each of which should include at least ten full performances. The algorithm classifies each score-event based on the deviations (i.e., stability) as the sessions progress, as follows:

1. Calculating the "coarse score":
A) Calculating the "tempo score": First, the deviation of the normalized local tempo at each score-event over all performances in all practice sessions is calculated. Then, all of the score-events are sorted by their deviation values. Finally, the 30% of score-events with the smallest deviation values score 2 points, the 30% with the largest deviation values score 0 points, and the remaining score-events with moderate deviation values score 1 point.
B) Calculating the "velocity score": First, the deviation of velocity at each score-event over all performances in all practice sessions is calculated. Then, all score-events are sorted by their deviation values. Finally, the 30% of score-events with the smallest deviation values score 2 points, the 30% with the largest deviation values score 0 points, and the remaining score-events with moderate deviation values score 1 point.
C) Calculating the "accuracy score": First, the accuracy rate of each score-event for each practice session is calculated based on the CE-data. Then, the transition of the accuracy rate for each score-event through all practice sessions is obtained from the regression line of the accuracy rates. Finally, the one-third of score-events with the highest gradient values of the regression lines score 2 points, the one-third with the lowest gradient values score 0 points, and the remaining one-third with moderate gradient values score 1 point.
D) The coarse score is calculated by the following equation:
Coarse score = tempo score * 5 + velocity score * 3 + accuracy score * 2        (1)
2. Calculating the "adjustment score":
A) Calculating the "fine tempo score": First, the deviation of the normalized local tempo at each score-event over the performances in each practice session is calculated. Then, the transition of this deviation for each score-event through all practice sessions is obtained from the regression line of the tempo deviations. Finally, the one-third of score-events with the lowest gradient values of the regression lines score 1 point, the one-third with the highest gradient values score -1 point, and the remaining one-third with moderate gradient values score 0 points.
B) Calculating the "fine velocity score": First, the deviation of velocity at each score-event over the performances in each practice session is calculated. Then, the transition of this deviation for each score-event through all practice sessions is obtained from the regression line of the velocity deviations. Finally, the one-third of score-events with the lowest gradient values of the regression lines score 1 point, the one-third with the highest gradient values score -1 point, and the remaining one-third with moderate gradient values score 0 points.
C) The adjustment score is calculated by the following equation:
Adjustment score = fine tempo score + fine velocity score        (2)
3. Classifying each note into one of four categories (skillful, improved, poor, and other) based on the coarse score and the adjustment score. Table 2 shows the classification rules.

Table 2. Classifying rule of the performance estimation value
Skillful part    coarse score >= 15
Improved part    coarse score < 15 and adjustment score > 0, or coarse score < 5 and adjustment score = 2
Poor part        coarse score < 5, or coarse score < 15 and adjustment score < 0
Other part       coarse score < 15 and adjustment score = 0
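Read literally, Eqs. (1)–(2) and Table 2 reduce to a few lines of code. The sketch below is our reading of the rules; because the "improved" and "poor" conditions overlap for small coarse scores, the order in which the categories are checked is our assumption.

```python
def coarse_score(tempo_score, velocity_score, accuracy_score):
    # Eq. (1)
    return tempo_score * 5 + velocity_score * 3 + accuracy_score * 2

def adjustment_score(fine_tempo_score, fine_velocity_score):
    # Eq. (2)
    return fine_tempo_score + fine_velocity_score

def classify(coarse, adjustment):
    """One reading of Table 2, checked in order; the precedence between the
    overlapping 'improved' and 'poor' conditions is assumed, not stated."""
    if coarse >= 15:
        return "skillful"
    if (coarse < 15 and adjustment > 0) or (coarse < 5 and adjustment == 2):
        return "improved"
    if coarse < 5 or adjustment < 0:
        return "poor"
    return "other"          # coarse < 15 and adjustment == 0

print(classify(coarse_score(2, 2, 1), adjustment_score(0, 0)))   # skillful
print(classify(coarse_score(1, 1, 1), adjustment_score(1, 0)))   # improved
```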
4 Experiments

If, after continuously using Apollon13, users eventually lose the tendency to become upset when certain notes are suddenly replaced by incorrect ones, we can say that it is an effective training system for emergency situations in piano performance. In this experiment, we investigate the effects of training with Apollon13 by analyzing the users' subjective assessments and their performance data.

4.1 Experimental Settings and Procedures

We conducted experiments with three subjects, 23–24-year-old males with 18–20 years of experience playing the piano. We prepared a compulsory musical piece, "Merry Christmas Mr. Lawrence" composed by Ryuichi Sakamoto, which has only a two-page score and takes about two minutes to perform. We selected this piece because it is neither too difficult nor too easy to perform with one hand. The subjects received the score one week before the experiments. We asked them to practice freely for one week in order to finish the partial-exercise stage, and we confirmed that the subjects could play through the piece before the experiments. Table 3 shows the equipment used in the experiments. The experimental period was five days, with two sessions held each day: ten sessions in total for each subject. A session took about thirty minutes, and in each session a subject was required to perform the compulsory piece ten times or more. The interval between sessions was at least five hours. The first five sessions were assigned as "practice sessions," in which Apollon13 worked in practice-monitoring mode. The remaining five sessions were assigned as "rehearsal sessions," in which Apollon13 worked in rehearsal mode. In one rehearsal session, we enabled Apollon13's note-replacement function in about five randomly selected performances.
Table 3. Equipment used in the experiments

MIDI keyboard        YAMAHA grand piano C5L + silent ensemble professional model
MIDI sound source    YAMAHA MU128
MIDI I/O             Midiman MIDISPORT 2×2
PC                   Notebook type, CPU: Core2Duo T7250 2.00 GHz, memory: 1.0 GB
The number of replacing-points in one performance was four, and these were selected according to the results of the performance estimation obtained in the practice sessions. At present, although the performance-estimation algorithm works, it cannot decide in which category the users will definitely become upset. Therefore, in this experiment the system chooses one replacing-point for each category classified by the performance-estimation algorithm (four points in total), in order to collect data for validating the algorithm. At the end of each practice session, we asked the subjects to indicate, note by note, where they could perform skillfully and where they could not. At the end of each rehearsal session, we asked the subjects where they became upset.

4.2 Results

Figure 1 shows the transition of the ratio of the number of replacing-points where each subject became upset to the number of all replacing-points in each session. The horizontal axis indicates the rehearsal sessions and the vertical axis indicates the ratio. As the sessions progressed, the subjects gradually came to avoid getting upset by the note replacement. To investigate the effect of note replacement in detail, we analyzed fluctuations in the performances before and after the replacing-points. For this analysis, we first prepared target performances (with note replacement) and baseline performances (without note replacement). We employed the performances in the 4th and 5th practice sessions as the baseline performances: the average IOI and velocity of each score-event of the baseline performances were calculated as the baseline data. We then prepared two sets of target performances: the "R1-3" target performances consist of the performances in which the note-replacing function was activated in the 1st to 3rd rehearsal sessions, and the "R3-5" target performances consist of those in the 3rd to 5th rehearsal sessions. We also calculated the average IOI and velocity of each score-event of R1-3 and R3-5. Finally, the difference values of the seven score-events before and after each replacing-point were calculated (namely, difference values at 15 points in total, including the replacing-point, were obtained). Figure 2 shows an example of the obtained average difference values between subject A's R1-3 performances and his baseline performances at a certain replacing-point. The horizontal axis indicates the score-events; the 8th event corresponds to the replacing-point. The vertical axis indicates the average difference. If the performance becomes disordered by the replaced note at the 8th event, the graph becomes undulant after that point. Therefore, we compared the deviation of the performance data before and after the 8th score-event using an F-test.
Fig. 1. Transition of percentage of upset points
Fig. 2. Difference in performance around a replacing-point
As a result, we found a significant difference in IOI for subject A's R1-3 (p < 0.05) and a marginal difference in IOI for his R3-5 (p < 0.1). We could not find differences in IOI for subjects B and C, or in velocity for any subject.
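The before/after comparison described above amounts to an F-test on the variances of the per-event differences. The sketch below illustrates one way to carry it out; the split at the replaced event, the normality assumption, and the example numbers are ours, not data from the experiment.

```python
import numpy as np
from scipy import stats

def variance_f_test(before, after):
    """Two-sided F-test for equal variances of the IOI (or velocity)
    differences before and after the replacing-point."""
    before, after = np.asarray(before, float), np.asarray(after, float)
    f = np.var(after, ddof=1) / np.var(before, ddof=1)
    dfn, dfd = after.size - 1, before.size - 1
    p = 2 * min(stats.f.cdf(f, dfn, dfd), stats.f.sf(f, dfn, dfd))
    return f, p

# diffs: 15 per-event differences (target minus baseline); event 8 is replaced.
# Illustrative numbers only.
diffs = [0.01, -0.02, 0.00, 0.01, -0.01, 0.02, 0.00,
         0.05, -0.08, 0.07, -0.06, 0.09, -0.07, 0.05, -0.06]
print(variance_f_test(diffs[:7], diffs[7:]))
```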
5 Validation of Performance-Estimation Algorithm

From the results of the experiment described in Section 4.2, continual use of the system made the subjects imperturbable despite the note replacement. However, in the experiments we selected the replacing-points from all four categories, i.e., skillful, improved, poor, and other, as estimated by the algorithm described in Section 3.3.2. In order to simulate emergency situations by note replacement more accurately and dependably, it is necessary to find which category is the most effective, as well as to validate the proposed algorithm's classification performance.

5.1 Algorithm-Estimated Categories

Figure 3 shows the results of the classification by the estimation algorithm. We designed this algorithm to classify all of the score-events into the four categories as evenly as possible. As in Section 4.2, Table 4 shows the comparison results of the deviation of the performance data in IOI and velocity before and after the replacing-points by F-test for each subject and each category. In Table 4, "**" indicates a significant difference (p < 0.05) and "*" indicates a marginal difference (p < 0.1).
Fig. 3. Results of estimation based on the algorithm

Table 4. Comparison results of the deviation of the performance data in IOI and velocity before and after replacing-points by F-test for each subject and each category
Category Skillful Improved Poor Other
Session R1-3 R3-5 R1-3 R3-5 R1-3 R3-5 R1-3 R3-5
Subject A IOI Vel **
Subject B IOI Vel
Subject C IOI Vel **
*
* * *
** * *
**
** **
The results in Table 4 show that the replacing-points at which the subjects become upset are distributed across multiple categories. Thus, it is difficult to isolate one category that can effectively make the subjects upset based on the algorithm-estimated categories.

5.2 Subject-Estimation-Based Categories

At the end of each practice session, we asked each subject to categorize his own performance into one of two categories (skillful or poor). Based on these estimation results, we translated them into four categories corresponding to those of the algorithm estimation, as follows. First, we gave estimation scores of 1 and -1 to skillful score-events and poor score-events, respectively, and a score of 0 to score-events that the subjects did not classify into either category; this was done for all of the estimation results of all practice sessions. Second, we calculated regression lines of the estimation scores for each score-event, where the x-axis corresponds to the number of the practice session and the y-axis corresponds to the estimation value.
Fig. 4. Results of subject-estimation-based classification into four categories

Table 5. Comparison results of the deviation of the performance data in IOI and velocity before and after replacing-points by F-test for each subject and each category obtained by the subject-estimation-based classification
Category Skillful Improved Poor Other
Session R1-3 R3-5 R1-3 R3-5 R1-3 R3-5 R1-3 R3-5
Subject A IOI Vel
Subject B IOI Vel
** **
**
Subject C IOI Vel * **
** * ** **
** *
gradient is positive: we classified these score-events into the “improved” category. Finally, the score-events that were not classified into the improved category were classified into “skillful” if their estimation score of the final (5th) practice session was 1, “poor” if it was -1, and “other” if it was 0. Figure 4 shows the results of subject-estimation-based classification into four categories. Each subject has different estimation criteria. The concordance rate between algorithm classification and subject-estimation-based classification is about 25% for each subject. Similar to 5.1, Table 5 shows the comparison results of the deviation of the performance data in IOI and velocity before and after the 8th score-event by F-test for each subject and each category obtained by the subject-estimation-based classification. In Table 5, “**” indicates a significant difference (p<0.05) and “*” indicates a marginal difference (p<0.1). The results in Table 5 show that the replacing-points at which the subjects become upset are concentrated in specific categories, although the specific categories depend on the subjects. For subject A, the disorder of IOI gathers in the “Other” category and the disorder of velocity gathers in the “Improved” category. For subject B, IOI disorder gathers in the “Poor” category and velocity disorder gathers in the “Improved” category. For subject C, IOI disorder gathers in the “Improved” and “Other” categories, and velocity disorder gathers in the “Other” category.
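The subject-estimation-based classification can be restated as a small procedure: fit a regression line to the per-session estimation scores (1 = skillful, -1 = poor, 0 = unclassified) of each score-event, label events with a positive slope as "improved", and label the rest by their score in the final practice session. The sketch below is our own illustration of that rule; the function name and the use of NumPy's least-squares fit are assumptions, not part of the original system.

```python
import numpy as np

def classify_score_event(session_scores):
    """session_scores: estimation scores (1, -1 or 0) over the practice sessions,
    in session order; the last entry is the final (5th) practice session."""
    sessions = np.arange(1, len(session_scores) + 1)
    slope, _intercept = np.polyfit(sessions, session_scores, 1)  # regression line
    if slope > 0:
        return "improved"
    final = session_scores[-1]
    if final == 1:
        return "skillful"
    if final == -1:
        return "poor"
    return "other"

# e.g. classify_score_event([0, -1, 0, 1, 1]) -> "improved"
```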
Fig. 5. Classification results of subjectively upset replacing points
5.3 Relations between Subjectively Selected Upset-Points and Categories We classified the replacing-points where the subjects became upset (this data was obtained from the interview results after each rehearsal session) into algorithmestimated categories and subject-estimation-based categories. Figure 5 shows the results. These results show that the replacing-points where the subjects became upset also gather into specific subject-estimation-based categories rather than into specific algorithm-estimated categories. Based on the subject-estimation-based categories, 76% of the upset-points gather in the “Improved” categories for subject A, 60% gather in the “Poor” category for subject B, and 58% gather in the “Skillful” category for subject C. However, based on the algorithm-based categories, the upset-points are distributed among all four categories for all subjects. Consequently, we can find fairly strong evidence of a relationship between the subject-estimation-based categories and the upset-points, although this relationship differs depending on the subject.
6 Discussion From the results shown in Fig. 1, although all of the subjects became upset by many replacing-points in the early rehearsal sessions, they gradually gained imperturbability through training with Apollon13. Furthermore, from the comparison results between R1-3 and R3-5 for subject A, the objective degree of performance disorder in IOI decreased. Consequently, these results show that Apollon13 has a certain amount of efficacy. We further investigated the relationships between subjective/objective upset-points and algorithm-estimated categories/subject-estimation-based categories (5.1–5.3). We found that both objective and subjective upset-points gather in a specific subjectestimation-based category, depending on the subject, while both objective and subjective upset-points were distributed into many algorithm-estimated categories. This result suggests that if we could classify the score-events into four categories similar to the subject-estimation-based categories, we would be able to more reliably make the users upset by a note-replacing technique. Accordingly, a more practical simulator of emergency situations in piano performance can be achieved.
7 Conclusion In this paper, we proposed Apollon13, which is a training system for emergency situations in a piano performance. Apollon13 simulates “mis-touches” by using a note-replacing function. The user can gain the ability to overcome unexpected accidents by continuing the performance even when an unexpected note replacement happens while using Apollon13. Therefore, he/she can train to avoid the worst situation of a piano performance in a concert, i.e., performance cessation. From the results of experiments using three subjects, we confirmed that Apollon13 has a certain amount of efficacy as a training simulator of emergency situations in piano performance. However, the algorithm for classifying the score-events into four categories, i.e. skillful, improved, poor, and other, is still inadequate. As a result, the present system cannot dependably make the user upset by note replacement. To more reliably simulate emergency situations, it is necessary to develop a classification algorithm that can classify the score-events in the manner of subjective human judgment. Accordingly, we intend to tackle this problem in the future.
References 1. Dannenberg, R.B., Sanchez, M., Joseph, A., Joseph, R., Saul, R., Capell, P.: Results from the Piano Tutor Project. In: Proceedings of the Fourth Biennial Arts and Technology Symposium, pp. 143–150 (1993) 2. Shirmohammadi, S., Khanafar, A., Comeau, G.: MIDIATOR: A Tool for Analyzing Students’ Piano Performance. In: Revue de recherche en education musicale, vol. 24 (2006) 3. Smoliar, S.W., Waterworth, J.A., Kellock, P.R.: pianoFORTE: A System for Piano Education Beyond Notation Literacy. In: Proceedings of the Third ACM International Conference on Multimedia, pp. 457–465 (1995) 4. Takahashi, N.: Effect of auditory feedback disturbance in keyboard performance -Evaluation of possibility as a model system for stuttering. IEICE Technical Report 106(501), 87–92 (2007) (in Japanese) 5. Oshima, C., Nishimoto, K., Hagita, N.: A Piano Duo Support System for Parents to Lead Children to Practice Musical Performances. ACM Transactions on Multimedia Computing, Communications and Applications (ACM TOMCCAP) (Article 9) 3(2) (2007) 6. Akinaga, S., Miura, M., Emura, N., Yanagida, M.: Toward realizing automatic evaluation of playing scales on the piano. In: Proc. of ICMPC 9, pp. 1843–1847 (2006) 7. Mukai, M., Emura, N., Miura, M., Yanagida, M.: Generation of suitable phrases for basic training to overcome weak points in playing the piano. MUS-07-018 (2007)
Exploring Social Annotation Tags to Enhance Information Retrieval Performance Zheng Ye1,2 , Xiangji Jimmy Huang1 , Song Jin2 , and Hongfei Lin2 1
School of Information Technology York University, Toronto, Ontario, M3J 1P3, Canada 2 Department of Computer Science and Engineering, Dalian University of Technology Dalian, Liaoning, 116023, China {yezheng,jhuang}@yorku.ca, [email protected], [email protected]
Abstract. Pseudo relevance feedback (PRF) via query expansion has proven to be effective in many information retrieval tasks. Most existing approaches are based on the assumption that the most informative terms in top-ranked documents from the first-pass retrieval can be viewed as the context of the query, and thus can be used to specify the information need. However, there may be irrelevant documents used in PRF (especially for hard topics), which can bring noise into the feedback process. The recent development of Web 2.0 technologies on Internet has provided an opportunity to enhance PRF as more and more high-quality resources can be freely obtained. In this paper, we propose a generative model to select high-quality feedback terms from social annotation tags. The main advantages of our proposed feedback model are as follows. First, our model explicitly explains how each feedback term is generated. Second, our model can take advantage of the human-annotated semantic relationship among terms. Experimental results on three TREC test datasets show that social annotation tags can be used as a good external resource for PRF. It is as good as the top-ranked documents from first-pass retrieval with optimal parameter setting on the WSJ dataset. When we combine the top-ranked documents and the social annotation tags, the retrieval performance can be further improved.
1 Introduction
As reported in [1], search engine users tend to issue short queries, mostly consisting of only 2-3 keywords, which are not enough to express the exact information need, and lead to ambiguous queries. For example, a query “Java” may refer to either coffee or the Java programming language. To address the above problems, various approaches have been proposed in IR literature. Relevance feedback (RF) can be considered as a way to provide context information to an IR system. However, RF requires users to provide feedback information, such as users’ interests, examples of relevance documents for a specified query. Since users are always reluctant to spend extra effort on providing feedback information, the advantage of RF is not obvious. In this sense, implicit feedback and PRF appear to be more attractive. A major advantage of implicit A. An et al. (Eds.): AMT 2010, LNCS 6335, pp. 255–266, 2010. c Springer-Verlag Berlin Heidelberg 2010
feedback and PRF is that we can improve the retrieval performance without requiring any users effort. For pseudo relevance feedback (PRF), it is assumed that the top k documents from the first-pass retrieval are relevant. Then, the original query can be expanded by new informative words. PRF has proven to be an effective technique for improving IR performance [2,3,4,5,6,7]. However, the general assumption behind PRF does not always hold, especially for difficult topics, PRF could cause the query drift [8]. In this case, the IR performance can get hurt. Steve’s [9] study showed that only half of the topics can be improved by PRF in general because of the bad quality of feedback documents. The recent development of Web 2.0 technologies on Internet has provided an opportunity to enhance PRF as more and more high-quality resources can be freely obtained. In recent years, social annotations have become a popular way to allow users to contribute descriptive metadata for Web information, such as Web pages and photos. In social annotations, users provide different keywords describing a Web page from various aspects. These features may be used to boost IR performance. A series of studies have been done on exploring the social annotations for folksonomy [10], recommendation [11], semantic Web [12], Web search [13,14] etc. Positive impact has been found in these studies. However, it is not clear how social annotations can be of help as a resource for query expansion. As described in [12], social tags, which link to each other through Web resource, often share similar topics. Thus, a set of tags attached with the same resource (e.g. web page, photo, production) could help better understand it. This unique feature makes it valuable for deriving semantic relationship among terms. However, to date, its potential for PRF has been largely unexplored. In this paper, we propose a novel query expansion method, in which social annotation tags are used as an external textual resource of feedback terms. In particular, we propose a generative model to select feedback terms, in which term-dependency methods are used to model the human-annotated semantic relationship in social annotation tags. Our model not only explicitly explains how the feedback terms are generated, but also produces promising performance. Extensive experiments have been conduct to evaluate the quality of social annotation tags and the performance of our proposed model. The remainder of our paper is organized as follows. Section 2 briefly introduces the social annotation tags used in the experiments. Section 3 introduces our proposed generative model for PRF based on the social annotation tags. In Section 4, we describe the experimental settings in this paper. In Section 5, we report the experimental results and list some discussions about our work. Finally, we conclude the paper and discuss future work in Section 6.
2 Social Annotation
In this section, we first briefly introduce social annotation systems and their advantages, in order to let readers better understand our proposed model. Then, we preliminarily evaluate the quality of the used social annotation tags for PRF.
2.1 Social Annotation Dataset
In social annotation services like Delicious1 , they allow users to annotate and categorize Web resources with keywords, always called tags. These tags are freely chose by users without a pre-defined taxonomy or ontology. A single Web page could be annotated by several tags from many disparate users. For example, the tags, such as “conference”, “acm”, “research”, and “sigir”, are used by many users to annotate the Web page of SIGIR 2010’s homepage. As we can see, the tags can be good keywords for describing this Web page from various aspects. In addition, different tags describing the same Web resource are semantically associated to some extent [12]. An annotation for a Web page typically consists of at least three parts: the URL of the resource (e.g. a Web page), one or more tags, and the user who makes the annotation. Thus, we abstract it as a triple as follows: useri , tagj , urlk
(1)
which means that user i has annotated URL k with tag j. In this paper, we focus on which Web page is annotated with which tags, and do not care much about who annotated the Web page. In our experiments, we collected a sample of social annotation data by crawling the Delicious website in March 2009. The dataset consists of 7,063,028 tag annotations on 2,137,776 different URLs, involving 280,672 distinct tags. We mainly utilize Web pages, tags, and the relationship between them. Thus, the social annotation sample is organized as one article per annotation after filtering out the user information. Based on our analysis of the social annotation structure, we divide the social annotation articles into four fields as shown in Table 1.

Table 1. Fields of Social Annotation Article

Field       Description
URL         Unique identifier for the web page
Frequency   Annotation frequency of the web page
Title       Summary of the web page
Tags        Tags with which the web page is annotated
2.2 Evaluation of Social Annotation Collection
In order to evaluate the potential usefulness of social annotation tags for PRF, we consider all the terms extracted from the social annotation tags using a term co-occurrence method. The term co-occurrence is usually used to measure how often terms appear together in the text window. Intuitively, the higher the co-occurrence of a candidate term with the query terms, the more likely it will be selected as a good expansion term.

1 http://del.icio.us
In our experiment, the text window is defined as one annotation field (e.g. tags, title). For a query term $q_j$ and a candidate term $t_i$ in the social annotation sample $S$, the co-occurrence value is defined as follows:

$$cooc(t_i, q_j) = \frac{\sum_{f \in S} \log(tf(t_i|f) + 1.0) \times \log(tf(q_j|f) + 1.0)}{\log(N)} \qquad (2)$$

where $N$ is the number of articles in the social annotation sample $S$, and $tf(\cdot|f)$ is the frequency of a term in field $f$. Based on this equation, the terms with the highest $cooc(t_i, q_j)$ are selected as the candidate expansion terms for the query. The final expansion terms can be selected as follows:

$$coo_{single}(t_i, Q) = \sum_{q_j \in Q} idf(q_j)\, idf(t_i)\, \log(cooc(t_i, q_j) + 1.0) \qquad (3)$$
where $idf$ is computed as $\log(N/df)$, and $df$ is the number of articles which contain term $t$. Inspired by the work of Cao et al. [15], we test each of these terms to see its impact on the retrieval effectiveness. In order to make the test simpler, we make the following simplifications: 1) each expansion term is assumed to act on the query independently from other expansion terms; 2) each expansion term is added into the query with equal weight $\lambda$ (it is set at 0.01 or -0.01). Based on these simplifications, we measure the performance change due to the expansion term $e$ by the ratio:

$$tchg(e) = \frac{MAP(Q \cup e) - MAP(Q)}{MAP(Q)} \qquad (4)$$
where $MAP(Q)$ and $MAP(Q \cup e)$ are respectively the MAP of the original query and of the expanded query (expanded with $e$). In our experiment, we empirically set the threshold to 0.005: a good (or bad) expansion term, which improves (or hurts) the effectiveness, should produce a performance change such that $|tchg(e)| > 0.005$. Now, we examine whether the candidate expansion terms from the social annotation sample are good terms. Our verification is made on three TREC datasets: AP, WSJ and Robust 2004. For each query, we evaluate the top 100 expansion terms with the largest probabilities from pseudo-relevant documents and with the largest co-occurrence values from the social annotation sample, respectively. Table 2 shows the proportion of good, bad and neutral expansion terms for all the queries. From Table 2, we can see that, on one hand, the proportion of good terms extracted from the social annotation sample is higher than that extracted from pseudo-relevant documents. It indicates that social annotations have the potential to be a good resource for PRF. On the other hand, the proportions of bad terms are also higher than those extracted from pseudo-feedback documents, which leaves us the challenge of selecting the good terms for PRF. The research of Cao et al. [15] shows that the retrieval effectiveness can be much improved if more good expansion terms are added into the original queries. The challenge now is to develop
an effective approach to correctly select the good terms for PRF. We detail our proposed term ranking approach in the following section.

Table 2. Proportion of good terms, neutral terms and bad terms in top 10 feedback documents and Delicious tags, respectively

Collection   Good Terms         Neutral Terms      Bad Terms
AP           (17.55%, 19.05%)   (64.01%, 59.98%)   (18.44%, 20.97%)
WSJ          (15.83%, 18.33%)   (66.93%, 59.21%)   (17.24%, 22.46%)
Robust2004   (16.57%, 18.47%)   (66.05%, 60.05%)   (17.38%, 21.48%)
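A minimal sketch of the co-occurrence scoring of Equations (2) and (3) is given below. Each "article" is represented here simply as a bag of terms taken from one annotation field (e.g. the tags of one URL); the data structures and function names are our own, not those of the actual system.

```python
import math
from collections import Counter

def cooc(term, query_term, articles):
    """Eq. (2): field-level co-occurrence of a candidate term with one query term."""
    total = 0.0
    for field_terms in articles:               # one annotation field per article
        tf = Counter(field_terms)
        total += math.log(tf[term] + 1.0) * math.log(tf[query_term] + 1.0)
    return total / math.log(len(articles))

def idf(term, articles):
    df = sum(1 for field_terms in articles if term in field_terms)
    return math.log(len(articles) / df) if df else 0.0

def coo_single(term, query_terms, articles):
    """Eq. (3): aggregate score of a candidate expansion term for the whole query."""
    return sum(idf(q, articles) * idf(term, articles) *
               math.log(cooc(term, q, articles) + 1.0)
               for q in query_terms)
```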
3 Our Proposed Approach
In this section, we propose a generative model to predict the importance of each candidate expansion term. Since our model is evaluated under the language modeling framework, we first give a general introduction to feedback approaches in language models.

3.1 Feedback under Language Modeling Framework
In KL-divergence retrieval model [16], the query and document are represented as language models. The divergence score between the query language model θQ and the document language model θD is used to rank the documents. In particular, given the distributions θQ and θD , the Kullback-Leibler divergence (or relative entropy) between θQ and θD , denoted as D(θQ |θD ), is defined as D(θQ |θD ) =
$$\sum_{w} P(w|\theta_Q) \log \frac{P(w|\theta_Q)}{P(w|\theta_D)} = -\sum_{w} P(w|\theta_Q) \log P(w|\theta_D) + cons(q) \qquad (5)$$
where $cons(q)$ is a document-independent constant that can be dropped, since it does not affect the ranking of documents. The essential issue in the KL-divergence retrieval model is to estimate $\theta_Q$ and $\theta_D$. The model-based feedback approach [6] proposed to re-estimate the query language model according to the top documents from the first-pass retrieval. In general, we can derive a feedback language model $\theta_F$ to smooth the original query language model $\theta_Q$ as in [6]. The updated query language model is as follows:

$$\theta_Q' = (1 - \alpha) \cdot \theta_Q + \alpha \cdot \theta_F \qquad (6)$$
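The sketch below illustrates the ranking and feedback interpolation of Equations (5) and (6): documents are scored by the cross-entropy part of the KL divergence (the constant is dropped), and the query model is smoothed with a feedback model θF using the interpolation weight α. Language models are represented as plain dictionaries of probabilities; this is our own minimal illustration, not the authors' implementation.

```python
import math

def kl_score(query_model, doc_model):
    """Negative cross entropy: rank-equivalent to -D(theta_Q || theta_D), Eq. (5)."""
    return sum(p_q * math.log(doc_model.get(w, 1e-12))
               for w, p_q in query_model.items())

def interpolate(query_model, feedback_model, alpha):
    """Updated query model, Eq. (6): (1 - alpha) * theta_Q + alpha * theta_F."""
    words = set(query_model) | set(feedback_model)
    return {w: (1 - alpha) * query_model.get(w, 0.0) +
               alpha * feedback_model.get(w, 0.0)
            for w in words}
```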
3.2 A Generative Feedback Model
Following the discussion above, we need to derive a feedback model θF from the social annotation tags to update the query model. We propose a generative
model to model the generation process of feedback terms. Figure 1 depicts the generation process of a candidate term t. In particular, our model assumes the following generative process for each feedback term t in the social annotation sample S:

– p(di|S): given a query Q, we first generate a document di from the social annotation tag sample S; we view each social annotation article defined in Section 2.1 as a document.
– p(ej|di): then, the query-related evidence ej is generated from document di.
– p(t|ej): given the evidence ej, the feedback term t is generated based on the semantic relationship between ej and t.

With this generative process, the feedback model θF can be estimated as follows:
$$p(t|\theta_F) = \sum_{i=1}^{n} \sum_{j=1}^{k} p(d_i|S) \cdot p(e_j|d_i) \cdot p(t|e_j) \qquad (7)$$
where n is the number of documents in S chosen for generating the feedback terms, and k is the number of query-related evidences. Now, the main challenge is how to estimate the corresponding generation probabilities, which are detailed as follows.

Generating feedback documents p(di|S). When given a query Q and the social annotation sample S, the first step is to determine which documents will be chosen for feedback purposes. We simply retrieve the query from the social annotation sample S, and the top-ranked documents are used. The generation probabilities p(di|S) are proportional to the ranking scores of the corresponding documents.

Generating query-related evidence p(ej|di). Given a query Q, we break it down into a set of single query terms, e = qi, and query term pairs, e = (qi, qj). We view each of them as a piece of query-related evidence for generating feedback terms, since a single term or a term pair reflects one aspect of the information need represented by the query. A good feedback term should reflect the information need from every aspect. To estimate p(ej|di), we first build a language model for each document. We use a Dirichlet prior [17] (with a hyperparameter of μ = 1000) for smoothing the document language model:
$$p(e|d_i) = \frac{c(e; d_i) + \mu \, p(e|S)}{|d_i| + \mu} \qquad (8)$$
where c(e; di) is the frequency of evidence e in document di, p(e|S) is the probability of evidence e in S, and |di| is the length of document di.

Generating feedback terms p(t|ej). The probability p(t|ej) represents how likely term t is to be seen when the evidence ej is seen. We consider the co-occurrence between evidence ej and feedback term t, which is usually used to measure how often terms appear together in the text window. For single-term evidence, we still use Equation 2 detailed in Section 2.1 to estimate p(t|ej). For term-pair evidence ej = (qi, qj), we estimate p(t|ej) as follows:

$$p(t|e_j) \propto cooc(t, e_j) = cooc(t, q_i, q_j) = idf(q_i)\, idf(q_j)\, idf(t)\, \log(cooc(t, q_i) + 1.0)\, \log(cooc(t, q_j) + 1.0) \qquad (9)$$

Fig. 1. The Generation Process of Candidate Term t
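Putting Equations (7)-(9) together, the feedback model can be estimated with three nested probabilities. The following sketch assumes that p(d|S) has already been obtained from normalized first-pass retrieval scores, that cooc() and idf() are the functions of Section 2.2, and that the final normalization into a distribution is our own choice (the paper does not spell out that step); everything else, including the names, is an illustration rather than the authors' code.

```python
import math
from itertools import combinations

def feedback_model(query_terms, fb_docs, candidates, articles, cooc, idf, p_e_given_d):
    """Estimate p(t | theta_F) as in Eq. (7).

    fb_docs:      list of (doc, p_d) pairs, p_d ~ normalized first-pass score, i.e. p(d_i|S)
    p_e_given_d:  callable implementing the Dirichlet-smoothed p(e_j | d_i) of Eq. (8)
    cooc, idf:    the co-occurrence and idf functions of Section 2.2
    """
    # query-related evidences: single terms and term pairs
    evidences = [(q,) for q in query_terms] + list(combinations(query_terms, 2))
    scores = {}
    for t in candidates:
        s = 0.0
        for doc, p_d in fb_docs:
            for e in evidences:
                if len(e) == 1:                       # single-term evidence, Eq. (2)
                    p_t_e = cooc(t, e[0], articles)
                else:                                 # term-pair evidence, Eq. (9)
                    qi, qj = e
                    p_t_e = (idf(qi, articles) * idf(qj, articles) * idf(t, articles)
                             * math.log(cooc(t, qi, articles) + 1.0)
                             * math.log(cooc(t, qj, articles) + 1.0))
                s += p_d * p_e_given_d(e, doc) * p_t_e
        scores[t] = s
    total = sum(scores.values()) or 1.0
    return {t: v / total for t, v in scores.items()}  # normalized p(t | theta_F)
```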
4 Experimental Setting
4.1 Test Datasets
We used three standard TREC datasets in our experiments: AP88-90 (Associated Press), WSJ87-90 (Wall St. Journal), and Robust2004 (the dataset of the TREC Robust Track started in 2003). Table 3 shows the details of these datasets. In all the experiments, we only use the title field of the TREC queries for retrieval, which is consistent with the real application since feedback is expected to be the most useful for short queries [6]. In the process of indexing and querying, each term is stemmed using Porter's English stemmer, and stopwords from InQuery's standard stoplist of 418 stopwords are removed. In each run, the MAP (Mean Average Precision) measurement for the top 1000 documents is used as the evaluation metric. The MAP measurement reflects the overall accuracy; detailed descriptions of MAP can be found in [18].

Table 3. Statistics of Evaluation Collections

Collection    #Docs     Description                                          Topics
AP            158,240   Associated Press (1988, 1989, 1990), Disks 2 and 3   51-200
WSJ           74,520    Wall Street Journal (1990, 1991, 1992), Disk 2       51-200
Robust 2004   528,155   Disks 4 and 5 (no CR)                                301-450

4.2 Baseline Models
For the basic retrieval model, we use the KL-divergence retrieval model [16]. In particular, we use a Dirichlet prior (with a hyperparameter of μ = 1000) for smoothing the document language model, which can generally achieve good performance [17]. We fix the document language model in all the experiments in order to make a fair comparison. To evaluate the feedback performance of our proposed model, we compare it with the relevance model [7], specifically the first estimation method called RM1 [19], which is a representative state-of-the-art approach for estimating query language models with PRF [19]. Relevance models do not explicitly model the relevant or pseudo-relevant documents. Instead, they model a more generalized notion of relevance R. The formula of RM1 is:

$$p(w|R) \propto \sum_{\theta_D} p(w|\theta_D)\, p(\theta_D)\, P(Q|\theta_D) \qquad (10)$$
The relevance model p(w|R) is often interpolated with the original query model θQ as in Eq. 6 to improve performance. This interpolated version of the relevance model is called RM3. In [19], Lv et al. systematically compare five state-of-the-art approaches for estimating query language models in ad hoc retrieval, among which RM3 not only yields effective retrieval performance in both precision and recall metrics, but also performs steadily.
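As a reference point, the RM1 estimate of Equation (10) can be sketched as follows, with p(θD) taken to be uniform over the feedback documents and P(Q|θD) the query likelihood from the first pass; both of these are our own assumptions for the illustration, and the interpolation with θQ that yields RM3 reuses Equation (6). This is not the exact implementation used in [19].

```python
def rm1(feedback_docs, vocabulary):
    """feedback_docs: list of (doc_model, query_likelihood) pairs from the first pass;
    doc_model maps word -> p(w | theta_D)."""
    prior = 1.0 / len(feedback_docs)                  # uniform p(theta_D), an assumption
    relevance = {}
    for w in vocabulary:
        relevance[w] = sum(prior * doc_model.get(w, 0.0) * q_lik
                           for doc_model, q_lik in feedback_docs)
    total = sum(relevance.values()) or 1.0
    return {w: p / total for w, p in relevance.items()}   # p(w | R), Eq. (10)
```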
5 Experiments
In our experiments, we compare our proposed model with the RMD feedback model, which conducts feedback based on the top-ranked documents from the first-pass retrieval. The main goal of this set of experiments is to explore the potential of the social annotation for PRF. In order to find the optimal parameter setting, we use the training method in [20] for both the baseline and our proposed approach, which is popular in IR domain for building strong baseline. In particular, we sweep over values, in both baseline and our proposed model, for the number of top documents (|D| ∈ {3, 5, 10, 20, 30}), the number of expansion terms (k ∈ {10, 20, 30, 40, 50, 70}), and the interpolation parameter (α ∈ {0.0, 0.1, . . . , 1.0}). We use 2-fold cross-validation, in which the TREC queries are partitioned by the parity of queries number on each collection. Then, the parameters learned on the training set are applied to the test set for evaluation purpose as in [21]. In addition, we combine the two sources to conduct feedback.
Table 4. Performance comparison on the AP, WSJ and Robust2004 datasets. A star and a "+" indicate a statistically significant difference over the baseline of the basic language model without feedback (KL) and over the RM model on top-ranked documents, respectively, according to the Wilcoxon matched-pairs signed-ranks test at the 0.05 level.

Method                               AP         WSJ        Robust2004
KL                                   0.1378     0.2481     0.2248
RM3 on Social Tags                   0.1356     0.2314*    0.2282
RM3                                  0.1648*    0.2591*    0.2509*
our model                            0.1458*    0.2581*    0.2363*+
our model on the combined resource   0.1730*+   0.2785*+   0.2605*+
5.1 Evaluation of the Proposed Feedback Model
Table 4 shows the results of our proposed model, compared with the RM3 feedback model and with directly applying RM3 on social tags. First, we can see that when we directly apply RM3 on social tags, we cannot get much performance gain. Even worse, the retrieval performance decreases significantly on the WSJ dataset. On the contrary, our proposed model can significantly improve the retrieval performance on all three test datasets. The results indicate that social annotation has potential for PRF, but a dedicated and effective model is needed. We conjecture that the performance gain is mainly brought by incorporating the human-annotated semantic relationship among terms in our proposed model. Second, comparing our proposed model with RM3 on the top-ranked documents, comparable performance has been observed on the AP and WSJ datasets. This suggests that the social annotation tags are a high-quality resource for PRF. In addition, although a performance gain has been observed on the Robust2004 dataset, it is not as impressive as that of the RM3 model. One possible reason is that there are more useful feedback terms and term combinations in the top-ranked documents than in the social annotations.
5.2 Evaluation of the Combined Resource
Since the retrieval performance can be boosted substantially by PRF using both resources, i.e. the top-ranked documents and the social annotation tags, it motivates us to combine the two resources for PRF in order to get even better retrieval performance. In our experiments, we use a simple linear combination method to rank the candidate terms as follows:

$$w(t) = (1 - \lambda) \cdot w_{our}(t) + \lambda \cdot w_{RMD}(t) \qquad (11)$$
Note that the weights obtained from both feedback approach are normalized to an interval of [0, 1]. Then, we use the top ranked terms obtained to update the query language model in the same way as in Eq. (6). From Table 4, we can see that when the combined resource is used for selecting feedback terms, the retrieval performance can be significantly improved over all
baselines on a single feedback resource. We suggest using the combined resource to conduct PRF.

Fig. 2. Performance of PRF using combined resources over different values of interpolation parameter λ under setting: 10 top documents, 70 feedback terms and α = 0.7

Figure 2 shows the results of the combined approach over different values of the interpolation parameter λ. As we can see from this figure, in general, the performance can be improved markedly on all three datasets. When the value of λ is around 0.5, the retrieval performance peaks on the WSJ and Robust2004 datasets. This is because the top-ranked terms in both resources can be selected when λ is around 0.5. If the best performance is achieved when λ is larger than 0.5, it indicates that there are more useful terms in the top-ranked documents. Otherwise, the social annotations will have more useful terms.
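The combination of the two feedback resources in Equation (11) amounts to normalizing each term-weight list into [0, 1] and then interpolating linearly. The sketch below is our own reading of that step; the min-max form of the normalization is an assumption, since the text only states that the weights are mapped to [0, 1].

```python
def normalize(weights):
    lo, hi = min(weights.values()), max(weights.values())
    span = (hi - lo) or 1.0
    return {t: (w - lo) / span for t, w in weights.items()}   # map weights into [0, 1]

def combine(w_our, w_rmd, lam):
    """Eq. (11): w(t) = (1 - lambda) * w_our(t) + lambda * w_RMD(t)."""
    w_our, w_rmd = normalize(w_our), normalize(w_rmd)
    terms = set(w_our) | set(w_rmd)
    combined = {t: (1 - lam) * w_our.get(t, 0.0) + lam * w_rmd.get(t, 0.0)
                for t in terms}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```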
6 Conclusions and Future Work
In this paper, we have explored the potential of social annotation tags as a new resource for PRF via query expansion. In particular, we have proposed a generative model to select the feedback terms, which is capable of taking advantage of the human-annotated semantic relationship among terms in social annotation tags. Our experimental results have shown that the expansion terms extracted from social annotation tags are even as good as that from the top ranked documents on the WSJ dataset. Furthermore, when the top ranked documents and the social annotation tags are combined for selecting expansion terms, the retrieval performance can be further improved. We have also shown that our feedback model works satisfactorily on the different TREC datasets. We believe that as more reliable social annotation data is available on the Web, the performance of our proposed model can be further improved. In future work, we would like to work in the following directions. First, we plan to investigate more sophisticated approach to combine different textual resources
for PRF. Second, for the term selecting model, we will explore machine learning techniques to identify useful feedback terms for PRF.
Acknowledgments We thank the reviewers’ valuable and constructive comments. This research is jointly supported by NSERC of Canada, the Early Researcher/Premier’s Research Excellence Award, Natural Science Foundation of China (No. 60973068, No. 60673039 ), Doctoral Fund of Ministry of Education of China (No.20090041110002) and the National High Tech Research and Development Plan of China (No.2006AA01Z151).
References 1. Beitzel, S.M., Jensen, E.C., Frieder, O., Lewis, D.D., Chowdhury, A., Kolcz, A.: Improving automatic query classification via semi-supervised learning. In: ICDM, pp. 42–49 (2005) 2. Carpineto, C., de Mori, R., Bigi, B.: An information-theoretic approach to automatic query expansion. ACM Transactions on Information Systems (TOIS) 19(1), 1–27 (2001) 3. Xu, J., Croft, W.B.: Improving the effectiveness of information retrieval with local context analysis. ACM Trans. Inf. Syst. 18(1), 79–112 (2000) 4. Rocchio, J.J.: Relevance feedback in information retrieval. In: Salton, G. (ed.) The SMART Retrieval System: Experiments in Automatic Document, pp. 313–323 (1971) 5. Robertson, S., Walker, S., Beaulieu, M., Gatford, M., Payne, A.: Okapi at trec-4. In: Forth Text REtrieval Conference (TREC-4) 6. Zhai, C., Lafferty, J.: Model-based feedback in the language modeling approach to information retrieval. In: CIKM 2001: Proceedings of the Tenth International Conference on Information and Knowledge Management, pp. 403–410. ACM, New York (2001) 7. Lavrenko, V., Croft, W.B.: Relevance based language models. In: SIGIR 2001: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 120–127. ACM, New York (2001) 8. Mitra, M., Singhal, A., Buckley, C.: Improving automatic query expansion. In: SIGIR 1998: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 206–214. ACM, New York (1998) 9. Cronen-Townsend, S., Zhou, Y., Croft, W.B.: A framework for selective query expansion. In: CIKM 2004: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 236–237. ACM, New York (2004) 10. Mathes, A.: Folksonomies - cooperative classification and communication through shared metadata. In: KDD 2008: Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2004) 11. Song, Y., Zhuang, Z., Li, H., Zhao, Q., Li, J., Lee, W.C., Giles, C.L.: Real-time automatic tag recommendation. In: SIGIR 2008: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 515–522. ACM, New York (2008)
12. Wu, X., Zhang, L., Yu, Y.: Exploring social annotations for the semantic web. In: WWW 2006: Proceedings of the 15th International Conference on World Wide Web, pp. 417–426. ACM, New York (2006) 13. Hotho, A., Jschke, R., Schmitz, C., Stumme, G.: Information retrieval in folksonomies: Search and ranking. In: Sure, Y., Domingue, J. (eds.) ESWC 2006. LNCS, vol. 4011, pp. 411–426. Springer, Heidelberg (2006) 14. Bao, S., Xue, G., Wu, X., Yu, Y., Fei, B., Su, Z.: Optimizing web search using social annotations. In: WWW 2007: Proceedings of the 16th International Conference on World Wide Web, pp. 501–510. ACM, New York (2007) 15. Cao, G., Nie, J.Y., Gao, J., Robertson, S.: Selecting good expansion terms for pseudo-relevance feedback. In: SIGIR 2008: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 243–250. ACM, New York (2008) 16. Lafferty, J., Zhai, C.: Document language models, query models, and risk minimization for information retrieval. In: SIGIR 2001: Proceedings of The 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 111–119. ACM, New York (2001) 17. Zhai, C., Lafferty, J.: A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. 22(2), 179–214 (2004) 18. Voorhees, E.M., Harman, D.: Overview of the sixth text retrieval conference. Information Processing and Management: an International Journal 36, 3–35 (2000) 19. Lv, Y., Zhai, C.: A comparative study of methods for estimating query language models with pseudo feedback. In: CIKM 2009: Proceeding of the 18th ACM Conference on Information and Knowledge Management, pp. 1895–1898. ACM, New York (2009) 20. Diaz, F., Metzler, D.: Improving the estimation of relevance models using large external corpora. In: SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 154–161. ACM, New York (2006) 21. Metzler, D., Novak, J., Cui, H., Reddy, S.: Building enriched document representations using aggregated anchor text. In: SIGIR 2009: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 219–226. ACM, New York (2009)
A Hybrid Chinese Information Retrieval Model Zhihan Li, Yue Xu, and Shlomo Geva Discipline of Computer Science Faculty of Science and Technology Queensland University of Technology Brisbane, Australia zhihanlee@gmail, [email protected], [email protected]
Abstract. A distinctive feature of Chinese text is that a Chinese document is a sequence of Chinese characters with no space or boundary between Chinese words. This feature makes Chinese information retrieval more difficult, since a retrieved document which contains the query term as a sequence of Chinese characters may not be really relevant to the query: the query term (as a sequence of Chinese characters) may not be a valid Chinese word in that document. On the other hand, a document that is actually relevant may not be retrieved because it does not contain the query sequence but contains other relevant words. In this research, we propose a hybrid Chinese information retrieval model by incorporating word-based techniques with the traditional character-based techniques. The aim of this approach is to investigate the influence of Chinese segmentation on the performance of Chinese information retrieval. Two ranking methods are proposed to rank retrieved documents based on the relevancy to the query, calculated by combining character-based ranking and word-based ranking. Our experimental results show that Chinese segmentation can improve the performance of Chinese information retrieval, but the improvement is not significant if it incorporates only Chinese segmentation with the traditional character-based approach. Keywords: Chinese Segmentation, Information Retrieval, Chinese characters.
1 Introduction

The rapid growth in the number of Chinese Internet users indicates that building Chinese information retrieval systems is in great demand. A major difference between Chinese (Asian language) information retrieval (IR) and IR in European languages lies in the absence of word boundaries in sentences. Words have been the basic units of indexing in IR. As Chinese sentences are written as continuous character strings, pre-processing is necessary to segment sentences into shorter units that may be used as indices. Hence, a segmentation process for the corpus is necessary before indexing and ranking. Chinese word segmentation is a difficult, important and widely studied sequence modeling problem. Word segmentation is therefore a key precursor for language processing tasks in these languages. For Chinese, there has been significant research on finding word boundaries in un-segmented sequences [3], [5], [6], [8].
For Chinese information retrieval, the query is usually a set of Chinese words rather than a sequence of Chinese characters. For character-based Chinese information retrieval, since the texts are not segmented, the retrieved documents which contain the character sequence of the query may not be relevant to the query as they may not contain the words in the query. Therefore, the quality of character-based Chinese information retrieval is not satisfactory. The impact of Chinese word segmentation on the performance of Chinese information retrieval has been investigated in previous research. [4] has conducted a series of experiments which conclude that word segmentation has a certain positive impact on the performance of Chinese information retrieval. However, [4] suggests that for Chinese IR, the relationship between word segmentation and retrieval performance is in fact non-monotonic. In this investigation, [4] used a wide range of segmentation algorithms with accuracies from 44% to 95%. The experimental results showed that retrieval performance increases with the increase of segmentation accuracy in the first part from the lowest segmentation accuracy of 44%. However, after a point around 77%, the retrieval performance decreases from plateaus with the increase of segmentation accuracy. Both Chinese characters and words can be used as the indexing units for Chinese IR. Both of these have advantages and disadvantages. In general, character indexing based IR may achieve better recall since it can retrieve most of the relevant documents as long as they contain the query terms (the query terms are sequences of Chinese characters in the documents, not segmented words, since the documents are not segmented). However, the retrieval precision is not necessarily good. This is because many irrelevant documents are ranked high due to high query term frequency, since they have many instances of the query term sequences, many of which are actually not valid words. On the other hand, the word indexing based IR can make a better ranking and therefore achieve a little better performance than that of character indexing based IR, but the improvement is limited since some relevant documents may not contain the query terms as segmented words and thus will not be retrieved. In this paper, we propose to combine the two approaches in order to achieve better retrieval performance. In our approach, we create two indexing tables, one a Chinese character indexing table and the other a segmented word indexing table. Three methods are proposed to make use of the two indexing tables, with the hope of improving the accuracy of ranking. We first briefly describe the segmentation system that we used for Chinese word segmentation, then in Section 3, we introduce our hybrid approach of Chinese IR based on both segmentation words and Chinese characters. After that, in Section 4, we represent the experimental results. Section 5 concludes the paper.
2 Segmentation Procedure As Chinese word segmentation is not a research focus of this thesis, we have used the segmentation system developed by the Institute of Information science, Academia Sinica in Tai-Wan (http://ckipsvr.iis.sinica.edu.tw). In this system, the processes of segmentation could be roughly divided into two steps; one is resolving the ambiguous matches, the other is identifying unknown words. These processes adopt a variation of
the longest matching algorithm, with several heuristic rules to resolve the ambiguities, and achieve the high success rate of 99.77% reported in [7]. After a disambiguation process, for the needs of the unknown word extraction, a POS bi-gram model is applied to tag the POS of words. In the unknown-word extraction process, a bottom-up merging algorithm, which utilizes hybrid statistical and linguistic information, is adopted. In practice, the client sends a request, a piece of a Chinese document, to the system and the system then responds and returns the segmented document in XML style. For example, an original Chinese document (Figure 1-(a)) and the corresponding segmented document (Figure 1-(b)) are given below:
(a) Original Chinese Document
(b) Segmented Chinese Document

Fig. 1. Example of Segmented Document
After segmentation, words are separated with white blanks, as shown in Figure 1-(b). Additionally, a POS tag is provided immediately following each word in the bracket.
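Since the segmenter returns words separated by blanks, each followed by its POS tag in brackets, the segmented text can be turned back into (word, POS) pairs with a few lines of parsing. The sketch below assumes exactly the plain "word(TAG)" format described above; it is our own illustration and not part of the CKIP service.

```python
import re

# one word followed by its POS tag in round brackets, e.g. "word(Na)"
TOKEN = re.compile(r"(\S+?)\(([^)]+)\)")

def parse_segmented(text):
    """Return a list of (word, pos) pairs from a blank-separated segmented string."""
    return [(word, pos) for word, pos in TOKEN.findall(text)]

def words_only(text):
    """Drop the POS tags and keep the segmented words for indexing."""
    return [word for word, _pos in parse_segmented(text)]
```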
3 Retrieval and Ranking Approaches

For the retrieval process, we used the character-based Boolean model. The approach considers each query term as a sequence of Chinese characters, regardless of whether it is a word or not. If a document contains such a sequence, it will be retrieved as a
candidate relevant document and will be ranked in the ranking procedure. Using the character-based retrieval model guarantees that none of the documents containing query terms will be missed in the first retrieval procedure. In the following subsections, we first describe the architecture of our hybrid Chinese IR model, and then discuss the three ranking methods in detail. In Figure 2, the architecture of the hybrid Chinese IR model is depicted. In this model, we first construct the segmented corpus by segmenting all the documents in the original corpus. Then two indexing tables are built up from the original corpus and the segmented corpus; one is the character indexing table containing all characters appearing in the original corpus, the other is the Chinese word indexing table containing all segmented words appearing in the segmented corpus.
Fig. 2. Architecture of the Hybrid Chinese IR Model
We have conducted an experiment to evaluate the retrieval performance of using either the character indexing table only or the segmented word indexing table only. The result, as given in Table 1 below, shows that the recall and precision of the pure character-based retrieval model are better than those of the pure word-based model. This is because, in general, the character indexing based model can retrieve most of the relevant documents as long as they contain the query terms, while for the word-based retrieval model, in which the corpus contains segmented documents, the recall of retrieval is worse since the query terms very often are not segmented words. In our previous research, we have developed a Chinese information retrieval system that uses the traditional Boolean model to perform retrieval only, based on the character indexing table. As indicated in the architecture, in this research the retrieval step is performed based only on the character indexing table to generate candidate relevant documents. The previously developed character-based retrieval model is used to perform the retrieval task. Even though the character based retrieval model can retrieve most of the relevant documents, it doesn't necessarily mean that the most relevant documents will be delivered to the user. The relevancy ranking plays a very important role in determining the most relevant documents. In this research, we propose two ranking methods based on both character indexing and word indexing; these are described in the following sections.
Table 1. Performance of Pure Character and Pure Word Models

         Pure Character              Pure Word
Top N    Precision (%)  Recall (%)   Precision (%)  Recall (%)
10       39.7           19.6         23.9           13.7
15       35.2           26.1         20.1           15.8
20       32.7           30.9         18.0           18.9
30       27.5           36.7         15.5           21.2
100      14.1           53.8         8.7            32.6
Ave      29.9           33.4         17.2           20.4
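The two indexing tables of the hybrid model can be sketched as two ordinary inverted indexes: one keyed by single characters and built from the original documents, and one keyed by segmented words and built from the segmented corpus. The code below is a simplified illustration using in-memory dictionaries rather than the database tables used by the actual system.

```python
from collections import defaultdict, Counter

def build_character_index(docs):
    """docs: {doc_id: original (unsegmented) text}. Maps char -> {doc_id: tf}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for char, tf in Counter(text).items():
            index[char][doc_id] = tf
    return index

def build_word_index(segmented_docs):
    """segmented_docs: {doc_id: list of segmented words}. Maps word -> {doc_id: tf}."""
    index = defaultdict(dict)
    for doc_id, words in segmented_docs.items():
        for word, tf in Counter(words).items():
            index[word][doc_id] = tf
    return index
```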
3.1 Ranking Methods

(1) Word-enhanced Ranking Method. In our previous research [1], [2], we used a tf-idf based algorithm to calculate the ranking value of a retrieved document to determine the top N relevant documents. The document rank score for a query of m terms is calculated with the following equation:

$$R(d, Q) = n^{5} \times \sum_{i=1}^{m} tf_i \cdot idf_i \qquad (3.1)$$

Here, m is the number of query terms, n is the number of distinct query terms that appear in the document as character sequences, which are not necessarily segmented words (in this paper we refer to the query terms appearing in the document as query character sequences, to differentiate them from the segmented words), tfi is the frequency of the ith term in the document and idfi is the inverse document frequency of the ith term in the collection. The equation ensures two things. Firstly, the factor n^5 strongly rewards the documents that contain more query terms: the more distinct query terms are matched in a document, the higher the rank of the document. For example, a document that contains four distinct query terms will almost always have a higher rank than a document that contains three distinct query terms, regardless of the frequency of the query terms in the document. Secondly, when documents contain a similar number of distinct terms, the score of a document will be determined by the sum of the tf-idf values of the query terms, as in traditional information retrieval. According to our experiments, an exponent of 5 is the best value for the NTCIR5 Chinese collection that we used in our experiments. The word-enhanced ranking method proposed here is an extension of the traditional tf-idf based ranking method mentioned above. The equation to calculate the document rank score is given in Equation (3.2), in which not only the frequency of the query character sequences but also the frequency of the segmented words is taken into consideration.
In Equation (3.2), m is the number of query terms, nc is the number of query terms that appear in the document as character sequences but not as segmented words, nw is the number of query terms that appear in the document as segmented words, tfic is the frequency of the ith query term appearing as a character sequence in the document and idfic is the inverse document frequency of the ith query term appearing as a character sequence in the collection. Similarly, tfiw and idfiw are the frequency of the ith query term appearing as a segmented word in the document and the inverse document frequency of the ith query term in the collection, respectively. In Equation (3.2), the ranking score is derived from two parts: the frequency of the query terms as character sequences (the first part of the equation) and the frequency of the query terms as segmented words (the second part of the equation). The first part is actually Equation (3.1). The second part is an additional contribution to the ranking score from the query terms which are also segmented words. The intention of adding the second part to Equation (3.1) is to enhance the impact of the segmented words on the ranking. If there are no query terms that appear in the document as segmented words, nw = 0, the second part becomes 0 and Equation (3.2) becomes Equation (3.1). The higher the nw, the more the ranking score is increased by the second part. The idea behind this strategy is to increase the rank of the documents which contain query terms that are segmented words. This method emphasizes the importance of the segmented query terms in calculating the document rank score. If a document contains the query terms as segmented words, the document will get a higher ranking score than that obtained by using Equation (3.1).

(2) Average based Ranking Method. This method is also an extension of the tf-idf based ranking method. It simply calculates the average of the frequencies of the query terms as character sequences and as segmented words. The equation to calculate the document rank score is given in Equation (3.3).
Equation (3.3) also consists of two parts: the frequency of query terms as character sequences and the frequency of query terms as segmented words. Similar to the word-enhanced method, if there are no query terms that appear in the document as segmented words, nw = 0, the second part becomes 0 and only the first part is used to derive the ranking score. When the number of query terms that are segmented words increases, the contribution of the first part decreases, while the contribution of the second part increases. The higher the nw, the more the contribution from the second part increases. Different from the word-enhanced method, when the number of query terms appearing as segmented words increases, this method not only increases the contribution of the segmented words but also decreases the contribution of the query terms that appear as character sequences only. The idea is that if a document contains fewer query terms that are segmented words, the topic of this document may be irrelevant to the query terms; hence, the contribution from these query terms should be reduced.
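Because the ranking formulas (3.1)-(3.3) do not survive in the extracted text, the sketch below only illustrates the scheme as it is described in words: a character-sequence part of the form n^5 Σ tf·idf, an analogous segmented-word part, added in the word-enhanced method and averaged (with the character-only matches) in the average-based method. The exact form of the authors' equations may differ; treat this as an assumption-laden reconstruction, not their formula.

```python
def tfidf_part(matches):
    """n^5 * sum(tf * idf) over one group of matched query terms (the shape of Eq. (3.1));
    matches is a list of (tf, idf) pairs for that group."""
    n = len(matches)
    return (n ** 5) * sum(tf * idf for tf, idf in matches)

def word_enhanced_score(char_matches, word_matches):
    """char_matches: (tf, idf) for every query term matched as a character sequence;
    word_matches:  (tf, idf) for every query term matched as a segmented word."""
    # the first part is Eq. (3.1); the second part only ever adds to the score
    return tfidf_part(char_matches) + tfidf_part(word_matches)

def average_score(char_only_matches, word_matches):
    """char_only_matches: query terms matched as character sequences but not as words."""
    # more word matches -> fewer character-only matches, so the first part shrinks
    return (tfidf_part(char_only_matches) + tfidf_part(word_matches)) / 2.0
```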
4 Experimental Results and Evaluation

We have conducted an experiment to evaluate the performance of the proposed three ranking methods. The experiment was conducted on a DELL PC with a Pentium 4 processor, 1GB of physical memory and 80GB of hard disk space. The system is implemented in C# using MS Visual Studio 2005, MS SQL Server 2005 and the Windows XP Professional Operating System.

4.1 Data Collection

The Chinese corpus obtained from NTCIR5 (http://research.nii.ac.jp/ntcir/) is used as the testing data: it contains 434,882 documents of news articles in traditional Chinese. Even though the experiments are conducted in traditional Chinese, the techniques proposed in this work are applicable to simplified Chinese. The detailed information of the test set is shown in Table 2 below.

Table 2. Statistics of Chinese Corpus

Document collection         Year 2000   Year 2001   No. of articles
United Express (ude)        40445       51851       92296
Ming Hseng News (mhn)       84437       85302       169739
Economic Daily News (edn)   79380       93467       172847
Total                       204262      230620      434882
The document itself is in XML format with the following tags:

• The tag for each document
• Document identifier
• Language code: CH, EN, JA, KR
• Title of this news article
• Issue date
• Text of news article
• Paragraph marker
Queries used in the experiment are from the NTCIR5 CLIR task. There are a total of 50 queries created by researchers from Taiwan, Japan and Korea. NTCIR5 provided both English queries and the corresponding Chinese queries. The Chinese queries are used in this research. 4.2 Retrieval Model and Evaluation Measures We use the traditional Boolean model as our retrieval model to retrieve potential candidate relevant documents. If a document contains one or more query terms, no matter whether as character sequences or segmented words, the document will be retrieved as a candidate relevant document. For all the ranking methods tested in this experiment, the same retrieval model is used to ensure a fair comparison. For indexing units, all segmented words and characters in the whole collection of Chinese documents are
extracted as units. We create two indexing tables, for characters and segmented words, respectively. In our experiment, the traditional precision and recall evaluation metrics are used to measure the effectiveness of the proposed ranking methods. The evaluation is done for various numbers of retrieved documents, ranging from the top 10, 15, 20, 30 to 100 documents.

4.3 Performance Comparison

The baseline model used in this experiment to compare with the proposed methods is the traditional character-based ranking model described in Equation (3.1). The experiment results are given in the following tables (3 and 4) and figures (3 to 6). From the evaluation results of precision and recall, we find that the performances of the three models are very close; only the average model achieves slightly better results. On average the precision is improved by 0.2% and the recall is improved by only 0.1%.

Table 3. Precision Comparison

TOP N   Character (%)   Word-Enhanced (%)   Average (%)
10      39.7            39.3                39.9
15      35.2            35.5                35.5
20      32.7            32.5                32.9
30      27.5            27.8                27.5
100     14.1            14.5                14.6
Ave     29.9            29.9                30.1

Table 4. Recall Comparison

TOP N   Character (%)   Word-Enhanced (%)   Average (%)
10      19.6            19.7                20.1
15      26.1            25.9                26.5
20      30.9            30.4                31.3
30      36.7            36.6                35.7
100     53.8            53.9                53.9
Ave     33.4            33.3                33.5

Fig. 3. Precision between Ranking Models

The experimental results show that Chinese word segmentation can improve the
performance of Chinese information retrieval, but the improvement is not significant. This also confirms that a high accuracy of word segmentation, reported as 95% in [8] and used in our experiment, may not increase retrieval performance, and that retrieval performance may eventually decrease once the accuracy of segmentation increases beyond a certain point [4].

Fig. 4. Curve of Precision

Fig. 5. Recall between Ranking Models

Fig. 6. Curve of Recall
5 Conclusion

In this paper, we propose two methods for ranking retrieved documents by a hybrid of the character-based relevancy measure and the segmented-word-based relevancy measure, in order to improve the performance of Chinese information retrieval. From the experimental results, we find that these approaches achieve a slight improvement over the traditional character-based approach, which indicates that the influence of taking segmented words into consideration in ranking retrieved documents is limited.
References 1. Lu, C., Xu, Y., Geva, S.: Translation disambiguation in web-based translation extraction for English-Chinese CLIR. In: Proceedings of the 2007 ACM Symposium on Applied Computing, pp. 819–823 (2007) 2. Geva, S.: GPX - Gardens Point XML IR at INEX 2005, INEX 2005. In: Fuhr, N., Lalmas, M., Malik, S., Kazai, G. (eds.) INEX 2005. LNCS, vol. 3977, pp. 240–253. Springer, Heidelberg (2006) 3. Gao, J., Li, M.: Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach. Computational Linguistics 31(4), 531–574 (2005) 4. Peng, F., Huang, X., Schuurmans, D., Cercone, N.: Investigating the relationship between word segmentation performance and retrieval performance in Chinese IR. In: Proceedings of the 19th International Conference on Computational Linguistics, pp. 1–7 (2002) 5. Xue, N.: Chinese Word Segmentation as Character Tagging. Computational Linguistics and Chinese Language Processing 8(1), 29–48 (2003) 6. Sproat, R., Shih, C.: Corpus-Based Methods in Chinese Morphology and Phonology. AT&T Labs — Research (2002) 7. Ma, W.-Y., Chen, K.-J.: Introduction to CKIP Chinese Word Segmentation System for the First International Chinese Word Segmentation Bakeoff. In: Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, vol. 17, pp. 168–171 (2003) 8. Wang, X., Liu, W., Qin, Y.: A Search-based Chinese Word Segmentation Method. In: Proceedings of the 16th International Conference on World Wide Web, pp. 1129–1130 (2006)
Term Frequency Quantization for Compressing an Inverted Index Lei Zheng and Ingemar J. Cox Department of Computer Science University College London London, WC1E 6BT, United Kingdom [email protected], [email protected]
Abstract. In this paper, we investigate the lossy compression of term frequencies in an inverted index based on quantization. Firstly, we examine the number of bits to code term frequencies with no or little degradation of retrieval performance. Both term-independent and term-specific quantizers are investigated. Next, an iterative technique is described for learning quantization step sizes. Experiments based on standard TREC test sets demonstrate that nearly no degradation of retrieval performance can be achieved by allocating only 2 or 3 bits for the quantized version of term frequencies. This is comparable to lossless coding techniques such as unary, γ and δ-codes. However, if lossless coding is applied to the quantized term frequency values, then around 26% (or 12%) savings can be achieved over lossless coding alone, with less than 2.5% (or no measurable) degradation in retrieval performance.
1 Introduction

An inverted index is a fundamental data structure of information retrieval (IR). Conceptually, it can be considered as a table whose rows and columns are the terms in the lexicon and the documents in the collection, respectively [1,2]. Each row of the index consists of a list, where each entry is known as a posting. A posting indicates that the corresponding document contains the term. Within a posting, doc ID and term frequency are two basic elements, where doc ID is the identifier of the document containing the term and term frequency indicates the number of occurrences of the term in the document. Indexing the Web creates a very large inverted index, and there is therefore considerable interest in index compression. Index compression can be broadly classified into two categories, i.e. lossless and lossy compression. Lossless compression seeks to reduce the size of the inverted index while guaranteeing identical retrieval performance. In contrast, lossy compression allows a degradation in performance, but normally this loss in performance can be offset by a significantly greater reduction of the index size. The prior work on index compression is discussed in detail in Section 2. In this paper, we describe a lossy compression method based on quantizing the term frequency within each posting. Our method of term frequency quantization is discussed in Section 3, where both term-independent and term-specific
quantizers are investigated. In addition, an iterative technique is described for learning quantization step sizes. We then conduct a set of comparative experiments using standard TREC test sets in Section 4. Section 5 discusses remaining issues, particularly decoding complexity. Finally, in Section 6, we summarize our results and suggest possible directions for future work.
2 Related Work

In this section, we review related work in both lossless and lossy compression.

2.1 Lossless Compression
Three widely used coding techniques for losslessly compressing term frequencies within the postings of an inverted index are unary codes, γ-codes and δ-codes.

In a unary code [1], an integer n is coded as (n − 1) "1"s followed by one "0". It is obvious that the length of the unary code for an integer n is n bits.

In a γ-code [1], an integer n is coded as follows:
1. Separate the integer n into two parts: the highest power of 2 (denoted by $2^{i-1}$) and the remaining integer j, i.e. $n = 2^{i-1} + j$.
2. Encode i using a unary code, the length of which is i bits.
3. Encode j using an (i − 1)-bit binary code.
4. Concatenate both the unary and binary codes to produce the γ-code of the integer n.

The length of a γ-code is the sum of the lengths of its unary code and binary code segments. Mathematically, the length of a γ-code for an integer n is

$L_{\gamma}(n) = 2i - 1 = 2\lfloor \log_2 n \rfloor + 1$    (1)
Finally, the δ-code [1] is related to the γ-code. The difference is Step 2. In the γ-code, i is encoded by a unary code, the length of which is i bits. However, in the δ-code, i is encoded by a γ-code, the length of which is $2\lfloor \log_2 i \rfloor + 1$ bits. Correspondingly, the length of a δ-code for an integer n is

$L_{\delta}(n) = 2\lfloor \log_2 i \rfloor + i = 2\lfloor \log_2(\lfloor \log_2 n \rfloor + 1) \rfloor + \lfloor \log_2 n \rfloor + 1$    (2)
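As a minimal, self-contained sketch of the three codes just described (not the authors' implementation), the Python functions below encode a positive integer as a string of bits; the resulting code lengths agree with Equations (1) and (2).

```python
def unary(n: int) -> str:
    """Unary code for n >= 1: (n - 1) '1's followed by a single '0'; length n bits."""
    return "1" * (n - 1) + "0"

def gamma(n: int) -> str:
    """Gamma code: unary code of i, then the remainder j in (i - 1) binary bits,
    where n = 2^(i-1) + j. Length is 2*floor(log2 n) + 1, cf. Equation (1)."""
    i = n.bit_length()                  # i - 1 is the exponent of the highest power of 2 in n
    j = n - (1 << (i - 1))
    binary = format(j, "b").zfill(i - 1) if i > 1 else ""
    return unary(i) + binary

def delta(n: int) -> str:
    """Delta code: as the gamma code, except that i is itself gamma-coded (Equation (2))."""
    i = n.bit_length()
    j = n - (1 << (i - 1))
    binary = format(j, "b").zfill(i - 1) if i > 1 else ""
    return gamma(i) + binary

# For example, gamma(5) == "11001" (5 bits) and delta(5) == "10101" (5 bits).
```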
2.2 Lossy Compression
Carmel et al. [3] proposed a posting pruning strategy, in which less important postings were removed from the inverted index. Blanco and Barreiro [4] extended the work by considering a term pruning strategy, which eliminated rows from the index table. It decided which of the terms in the lexicon should be deleted from the index. More recently, Zheng and Cox [5,6] suggested a document pruning strategy. Their approach decided whether all postings that point to a specific document should be deleted from the index. This is equivalent to eliminating entire columns from the index table.
The quantization of term frequencies has been considered in the context of retrieval models. Recent work on binned retrieval models [7] partitioned terms into a set of bins. Anh and Moffat [7] proposed a number of partitioning strategies based on the value of term frequencies. Each bin contained a set of terms that were treated equally in the subsequent ranking score computation. A systematic review of binned retrieval models was provided in [8]. However, quantization of term frequencies for the purpose of index compression has not been previously considered.
3 Term Frequency Quantization

In this section, we describe our method of term frequency quantization. Both term-independent (global) and term-specific (local) quantizers are investigated. In addition, an iterative technique is described for optimizing quantization step sizes.

3.1 Term-Independent Quantizers
(1) Uniform quantizer Q1

Uniform quantization [9] is a point-wise operation, which can be described with the following equation:

$y = Q_{\Delta}(x) = \mathrm{sgn}(x) \cdot \Delta \cdot \left\lfloor \frac{|x|}{\Delta} + 0.5 \right\rfloor$    (3)

where x is the input signal, y is the quantized version of x, and sgn(·) is the sign function. The parameter Δ is referred to as the quantization step size. It is set according to (i) the peak magnitude of the input signal and (ii) the number of bits allocated for coding.¹ The quantization step size, Δ, has a direct impact on the compression rate and the distortion of the recovered signal.

(2) Non-uniform quantizer Q2

Empirical studies show that the values of term frequencies are often small, and thus it does not seem sensible to assign the quantization step size uniformly. Intuitively, we should quantize term frequencies that have commonly occurring small values with small quantization step sizes, while quantizing term frequencies with unusually high values with large quantization step sizes. In this way, we can reduce the average distortion, as defined by Equation (4) below. Such a strategy is known as non-uniform quantization. For simplicity, our non-uniform quantizer Q2 is defined as follows. If we use n bits to code the term frequencies, all the term frequency values between 1 and $2^n$ are quantized to their true value, whereas all term frequencies greater than $2^n$ are quantized to $2^n$.

¹ For the sake of simplicity, the largest term frequency for a collection is assumed to be a large fixed number, e.g. 2048. Under such an assumption, the quantization step size of a uniform quantizer is solely determined by the allocated number of bits.
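A minimal sketch of the two quantizers under the same simplifying assumption as footnote 1 (a fixed peak term frequency of 2048); the function names and the exact step-size choice for Q1 are ours, not the paper's.

```python
import math

PEAK_TF = 2048  # assumed fixed peak term frequency (see footnote 1)

def uniform_q1(tf: int, n_bits: int) -> int:
    """Uniform quantizer Q1: the step size is set by the peak value and the bit budget,
    and positive inputs follow Equation (3) with sgn(x) = 1."""
    step = PEAK_TF / (2 ** n_bits)
    return int(step * math.floor(tf / step + 0.5))

def nonuniform_q2(tf: int, n_bits: int) -> int:
    """Non-uniform quantizer Q2: values 1..2^n are kept exact; larger values saturate at 2^n."""
    return min(tf, 2 ** n_bits)
```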
3.2 Iterative Technique

Representing the signal x by its quantized value, y = Q_Δ(x), causes a distortion or error, d(x, y), which can be measured in a variety of ways. The most common is the mean square error, i.e.

$d(x, y) = \sum_{i=1}^{n_p} (x_i - y_i)^2$    (4)

where $n_p$ is the number of postings in the index. We now describe an iterative technique, based on the k-means clustering algorithm [10], to optimize the quantization step sizes of a quantizer, i.e. to minimize the mean square error between the quantized term frequency values and their original values.

1. Initialize the counter, cnt = 0.
2. Initialize the quantizer according to the peak magnitude of the input signal and the number of bits allocated for coding. By allocating n bits for coding, we can have at most $2^n$ values of quantized term frequencies.
3. Partition all the original term frequencies in the index table into the $2^n$ quantization bins. Each term frequency is assigned to the closest quantization bin based on the difference between the original and the quantized values.
4. Compute the centroid of each of the $2^n$ bins. If a centroid is not an integer, round it to the nearest integer, since all term frequencies should be integers. The centroids of the bins become the new quantization levels.
5. Increment cnt.
6. If (i) cnt < 1000 AND (ii) the current set of centroids ≠ the previous set of centroids, then go to Step 3.
7. The centroids of the quantization bins are the learned quantization levels for term frequencies.

In practice, the iterative technique does not always converge, and may oscillate between two or more sets of values. The counter, cnt, guarantees that the loop terminates after no more than 1000 iterations.
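Since the procedure above is one-dimensional k-means over the term frequency values, it can be sketched compactly as follows; the initialization with the Q2 levels 1..2^n and the cap of 1000 iterations follow the description, while the remaining plumbing is an assumption.

```python
def learn_levels(term_frequencies, n_bits, max_iter=1000):
    """Learn 2^n integer quantization levels that reduce the squared error of Equation (4).
    term_frequencies: list of all term frequency values in the index."""
    levels = list(range(1, 2 ** n_bits + 1))        # Step 2: start from the Q2 levels
    for _ in range(max_iter):                       # Steps 5-6: bounded number of passes
        bins = {level: [] for level in levels}
        for tf in term_frequencies:                 # Step 3: assign each value to its closest level
            closest = min(levels, key=lambda level: abs(tf - level))
            bins[closest].append(tf)
        new_levels = sorted(                        # Step 4: integer centroids become new levels
            int(round(sum(values) / len(values))) if values else level
            for level, values in bins.items()
        )
        if new_levels == levels:                    # Step 6: stop once the centroids are unchanged
            break
        levels = new_levels
    return levels                                   # Step 7: the learned quantization levels
```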
3.3 Term-Specific Quantizers
In Section 3.1, we discussed term-independent quantizers, that is, for all the terms in the lexicon, the quantizer being used is the same. However, a more reasonable strategy is that the quantizers are adaptive to different terms (i.e. term-specific quantizers). In general, term-specific quantizers can be either uniform or non-uniform. Similarly, the iterative technique described in Section 3.2 can also be applied. Clearly, a set of term-specific quantizers is likely to outperform a single term-independent quantizer. However, this performance improvement comes at the cost of additional storage for multiple quantization tables.²

² For the sake of simplicity, we do not consider the compression of quantization tables here. However, in practice, quantization tables can also be compressed, which will correspondingly reduce the required additional storage.
Suppose an inverted index consists of $n_t$ terms and $n_p$ postings. An n-bit term-independent quantizer requires a single table of $2^n$ entries that provide the corresponding quantization levels. If each entry is a 1-byte (i.e. 8-bit) integer, then the total overhead is $8 \times 2^n$ bits, and the overhead per posting is $8 \times 2^n / n_p$ bits. For a practical inverted index, the number of postings, $n_p$, is normally huge, e.g. more than 1 million. Consequently, for a 2- or 3-bit quantizer, the overhead per posting is negligible. However, when term-specific quantizers are used, the overhead is replicated $n_t$ times, since a quantization table is needed for each of the terms. As a result, the overhead of term-specific quantizers is $8 \times 2^n \times n_t / n_p$ bits per posting. Note that for a practical inverted index, the number of postings, $n_p$, is much larger than the number of terms, $n_t$. Therefore, although this overhead is higher than that of a term-independent quantizer, it is still small for 2- or 3-bit term-specific quantizers. As an example, based on our test on the Financial Times collection, the overhead of a 3-bit term-independent quantizer is $9.65 \times 10^{-8}$ bits per posting, and that of 3-bit term-specific quantizers is 0.023 bits per posting.
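The bookkeeping can be checked with a couple of lines of arithmetic; the index sizes below are made-up placeholders rather than the statistics of the test collections.

```python
def table_overhead_per_posting(n_bits, n_terms, n_postings, term_specific):
    """Bits of quantization-table overhead per posting, assuming 1-byte table entries."""
    table_bits = 8 * (2 ** n_bits)                  # one table holds 2^n one-byte entries
    num_tables = n_terms if term_specific else 1    # one table per term, or one global table
    return table_bits * num_tables / n_postings

# Hypothetical index with 500,000 terms and 50 million postings:
print(table_overhead_per_posting(3, 500_000, 50_000_000, term_specific=False))
print(table_overhead_per_posting(3, 500_000, 50_000_000, term_specific=True))
```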
4 Experimental Results

We use standard TREC test sets in our experiments. For comparison purposes, two document collections are tested: the Financial Times and the Los Angeles Times collections. For both collections, the TREC 6, 7 and 8 ad hoc topics (topics 301-450) are used to evaluate the retrieval performance. The "title" part and the "description" part of the TREC topics are used as evaluation queries. All our experiments are conducted using the LEMUR toolkit [11]. Documents are stemmed using the Krovetz stemmer [12]. Okapi BM25 [13] is used as the score function of the retrieval system. In order to examine the effectiveness of our compressed index, we measure two precision-based metrics, namely precision at 10 (P@10) and mean average precision (MAP).

4.1 Experiment 1: Term-Independent Quantization
We first compare the performance of our two term-independent quantizers, i.e. the uniform quantizer Q1 and the non-uniform quantizer Q2, as described in Section 3.1. Figure 1 shows the experimental results, which clearly indicate that the non-uniform quantizer Q2 is superior. As noted earlier, term frequencies usually have low values. Therefore, quantizing low-value term frequencies with smaller quantization step sizes generates better results.

4.2 Experiment 2: Learning Quantization Step Sizes

Here we investigate the effectiveness of the iterative technique described in Section 3.2. Since the performance of the non-uniform quantizer Q2 is superior for both metrics, Q2 is used as the initialization of the iterative technique.
Fig. 1. Comparison of two term-independent quantizers: the uniform quantizer Q1 and the non-uniform quantizer Q2 (P@10 and MAP versus the number of bits, for the Financial Times and the Los Angeles Times collections)
Table 1. The quantization table learned by the iterative technique for both collections

                   Financial Times Collection    Los Angeles Times Collection
1-bit quantizer    1, 3                          1, 3
2-bits quantizer   1, 2, 4, 14                   1, 2, 4, 14
3-bits quantizer   1, 2, 3, 4, 5, 7, 17, 87      1, 2, 3, 4, 5, 9, 25, 269
Fig. 2. Comparison of two term-independent quantizers: the non-uniform quantizer Q2 and its iterative version Q2i (P@10 and MAP versus the number of bits, for the Financial Times and the Los Angeles Times collections)
Table 1 shows the quantization table of 1-bit to 3-bits quantizers learned by the iterative technique (Q2i) for both collections. Note that the majority of quantization levels represent small values. Figure 2 summarizes the experimental results, showing that the iterative technique is effective for optimizing quantization step sizes. For example, if we use 3-bits to code term frequencies, based on the Financial Times collection, there is no precision degradation of P@10 and MAP by applying the iterative technique, compared with a 2.0% degradation of P@10 and a 2.3% degradation of MAP without iteration.
Fig. 3. Comparison of the term-independent quantizer Q2i and term-specific quantizers Qs (P@10 and MAP versus the number of bits, for the Financial Times and the Los Angeles Times collections)
4.3 Experiment 3: Term-Specific Quantization

In the previous experiments, the same quantizer was applied to all terms. Here, we compare the performance of the term-independent quantizer Q2i with term-specific quantizers Qs. The term-specific quantizers, Qs, are also based on the iterative technique. Figure 3 shows that the performance of the term-specific quantizers Qs is superior to that of the term-independent quantizer Q2i, as expected.

4.4 Experiment 4: Combining Lossy and Lossless Term Frequency Coding Techniques
It is possible to losslessly compress term frequencies, and given their distribution, considerable savings can be made. In fact, lossless coding typically requires as few as around 2 bits per posting. Of course, it is also possible to apply lossless encoding after lossy quantization. This is enumerated in Table 2, where we apply the global quantizer, Q2i, followed by one of the three forms of lossless compression previously described in Section 2.1. Among the three lossless coding techniques, the γ-code has the best performance for both collections, requiring 1.8951 bits per posting for the Financial Times collection and 1.9904 bits per posting for the Los Angeles Times collection. Of the three combinations of term frequency quantization plus lossless coding, the unary code is superior. Specifically, for a 3-bit term frequency quantizer plus unary code, 1.6833 bits per posting are needed for the Financial Times collection and 1.7418 bits per posting for the Los Angeles Times collection. This represents an 11.2% and 12.5% improvement over the best lossless-only coding, γ-coding. Note that for 3-bit quantization using Q2i, there is no measurable degradation in retrieval performance. Further gains in compression can be achieved by using a 2-bit quantizer followed by unary coding. This results in a 24.9% (Financial Times collection) and 26.3% (Los Angeles Times collection) saving over γ-coding alone. The degradations are 2.4% in P@10 and 2.1% in MAP based on the Financial Times collection, and 1.6% in P@10 and -0.8% in MAP based on the Los Angeles Times collection. A further advantage of term frequency quantization followed by unary coding is that the decoding complexity is less than that of the γ-code alone, as discussed in Section 5.

Table 2. Required number of bits per posting for various coding methods

Financial Times Collection
Coding Methods        bits/posting   Improve
Unary Code            2.0194         -
4-bit Q2i + Unary     1.8494         8.42%
3-bit Q2i + Unary     1.6833         16.64%
2-bit Q2i + Unary     1.4240         29.48%
γ-code                1.8951         -
4-bit Q2i + γ-code    1.8682         1.42%
3-bit Q2i + γ-code    1.8043         4.79%
2-bit Q2i + γ-code    1.6497         12.95%
δ-code                2.1175         -
4-bit Q2i + δ-code    2.1040         0.64%
3-bit Q2i + δ-code    2.0082         5.16%
2-bit Q2i + δ-code    1.9293         8.89%

Los Angeles Times Collection
Coding Methods        bits/posting   Improve
Unary Code            2.1953         -
4-bit Q2i + Unary     1.9597         10.73%
3-bit Q2i + Unary     1.7418         20.66%
2-bit Q2i + Unary     1.4660         33.22%
γ-code                1.9904         -
4-bit Q2i + γ-code    1.9546         1.80%
3-bit Q2i + γ-code    1.8765         5.72%
2-bit Q2i + γ-code    1.7079         14.19%
δ-code                2.2258         -
4-bit Q2i + δ-code    2.2077         0.81%
3-bit Q2i + δ-code    2.0907         6.07%
2-bit Q2i + δ-code    2.0063         9.86%
5 Decoding Complexity

Decoding complexity is an important factor, as it is directly related to the response time of an information retrieval system.³ To analyze decoding complexity, we use pseudo-code to enumerate the decoding processes for both term frequency quantization (in Algorithm 1) and the lossless coding techniques, including unary coding (in Algorithm 2), γ-coding (in Algorithm 3) and δ-coding (in Algorithm 4).

1: read the current m bits as an integer j;
2: n ← j · Δ; /* Δ is the quantization step size */
3: return n;

Algorithm 1. Decoding a quantized term frequency

³ Similar results are obtained when comparing encoding complexities. Although encoding complexity is useful for evaluating the time cost of indexing the collection, it is not directly related to the response time of an information retrieval system. Therefore, we are more interested in decoding complexity here.
1: i ← 1;
2: read the current bit b(i);
3: while (b(i) = 1) do
4:   i ← i + 1;
5:   read the current bit b(i);
6: end
7: n ← i;
8: return n;

Algorithm 2. Decoding a unary coded term frequency
1: call Algorithm 2 to decode the unary code part i;
2: tmp1 ← i − 1;
3: tmp2 ← 2^tmp1;
4: read next tmp1 bits as an integer j; /* decode the binary code part */
5: n ← tmp2 + j; /* sum the unary code part and the binary code part */
6: return n;

Algorithm 3. Decoding a γ-coded term frequency
1: call Algorithm 3 to decode the γ-code part i;
2: tmp1 ← i − 1;
3: tmp2 ← 2^tmp1;
4: read next tmp1 bits as an integer j; /* decode the binary code part */
5: n ← tmp2 + j; /* sum the γ-code part and the binary code part */
6: return n;

Algorithm 4. Decoding a δ-coded term frequency

Term frequency quantization is a fixed-length encoding. Dequantizing an m-bit quantized term frequency involves 2 operations, i.e. reading from memory and multiplying by the quantization step size. In comparison, lossless coding techniques are variable-length encodings. Decoding an m-bit unary coded term frequency involves (3m + 1) operations, as shown in Algorithm 2. As discussed in Section 2.1, a γ-code is a concatenation of a unary code and a binary code. Correspondingly, the decoding complexity is also the sum of these two parts. Therefore, besides the complexity required by the unary code part, another 4 operations are needed to complete the decoding. Similarly, a δ-code is a concatenation of a γ-code and a binary code, so its decoding complexity is the sum of these two parts: in addition to the complexity required by the γ-code part, another 4 operations are needed to complete the decoding. Based on the above discussion, it is clear that term frequency quantization has a lower decoding complexity than the three widely used lossless coding techniques.
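A direct Python transcription of Algorithms 1-4 is given below, reading from a simple bit-string cursor (the cursor abstraction is ours); it makes the operation counts above easy to verify experimentally.

```python
class BitReader:
    """Minimal cursor over a string of '0'/'1' characters."""
    def __init__(self, bits: str):
        self.bits, self.pos = bits, 0
    def read(self, k: int = 1) -> str:
        chunk = self.bits[self.pos:self.pos + k]
        self.pos += k
        return chunk

def decode_quantized(reader: BitReader, m: int, step: float) -> float:
    """Algorithm 1: read a fixed-length m-bit integer and scale it by the step size Δ."""
    return int(reader.read(m), 2) * step

def decode_unary(reader: BitReader) -> int:
    """Algorithm 2: count the leading '1's up to and including the terminating '0'."""
    n = 1
    while reader.read() == "1":
        n += 1
    return n

def decode_gamma(reader: BitReader) -> int:
    """Algorithm 3: the unary part gives i, then an (i-1)-bit binary remainder j."""
    i = decode_unary(reader)
    j = int(reader.read(i - 1), 2) if i > 1 else 0
    return (1 << (i - 1)) + j

def decode_delta(reader: BitReader) -> int:
    """Algorithm 4: as Algorithm 3, but i is itself gamma-coded."""
    i = decode_gamma(reader)
    j = int(reader.read(i - 1), 2) if i > 1 else 0
    return (1 << (i - 1)) + j
```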
6 Conclusion and Future Work

Each posting in an inverted index contains a doc ID and an associated term frequency, where the term frequency is an integer. In this paper, we examined how the number of quantization levels used to code the term frequencies affects retrieval performance, as measured by P@10 and MAP. Both global quantizers and term-specific quantizers were investigated for compressing term frequencies. In addition, an iterative technique was described to optimize the quantization step sizes. Experiments based on standard TREC test sets demonstrated that allocating only 2 or 3 bits for quantized term frequencies results in nearly no degradation of retrieval performance. Further savings can be achieved by combining term frequency quantization with existing lossless coding techniques. Experimental results suggest that a 2-bit (or 3-bit) quantizer plus unary code leads to about a 26% (or 12%) saving in the storage required for term frequencies, with less than 2.5% (or negligible) degradation of P@10 and MAP. Unfortunately, these very significant improvements in the compression of term frequencies do not lead to equivalent savings for the inverted index as a whole. This is because a posting consists of more than just the term frequency. At the very least, each posting also includes a corresponding doc ID. Prior work [1] suggests that this field typically requires about 4 bits, based on tests with TREC data sets. Therefore, for a simple inverted index, the overall saving may be closer to 9% (or 4%), with less than 2.5% (or negligible) degradation of P@10 and MAP. Regarding future work, we note that term frequency quantization can be applied together with other posting compression techniques that compress the doc ID [14] and/or positional information [15]. In addition, these posting compression techniques could also be applied in conjunction with other lossy index compression techniques, such as [3,4,5]. Further work is needed to investigate the interplay between the various techniques.
References
1. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, Inc., San Francisco (1999)
2. Manning, C.D., Raghavan, P., Schutze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
3. Carmel, D., Cohen, D., Fagin, R., et al.: Static index pruning for information retrieval systems. In: Proceedings of the 24th SIGIR, pp. 43–50 (2001)
4. Blanco, R., Barreiro, A.: Static pruning of terms in inverted files. In: Amati, G., Carpineto, C., Romano, G. (eds.) ECIR 2007. LNCS, vol. 4425, pp. 64–75. Springer, Heidelberg (2007)
5. Zheng, L., Cox, I.J.: Document-oriented pruning of the inverted index in information retrieval systems. In: Proceedings of the 2009 International Conference on AINAW, pp. 697–702 (2009)
6. Zheng, L., Cox, I.J.: Entropy-based static index pruning. In: Boughanem, M., et al. (eds.) ECIR 2009. LNCS, vol. 5478, pp. 713–718. Springer, Heidelberg (2009)
7. Anh, V.N., Moffat, A.: Simplified similarity scoring using term ranks. In: Proceedings of the 28th SIGIR, pp. 226–233 (2005)
8. Metzler, D., Strohman, T., Croft, W.B.: A statistical view of binned retrieval models. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 175–186. Springer, Heidelberg (2008)
9. Linde, Y., Buzo, A., Gray, R.M.: An algorithm for vector quantizer design. IEEE Transactions on Communications 28(1), 84–95 (1980)
10. MacKay, D.: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge (2003)
11. Ogilvie, P., Callan, J.: Experiments using the LEMUR toolkit. In: Proceedings of the 10th Text Retrieval Conference (2001)
12. Krovetz, R.: Viewing morphology as an inference process. In: Proceedings of the 16th SIGIR, pp. 191–202 (1993)
13. Sparck-Jones, K., Walker, S., Robertson, S.E.: A probabilistic model of information retrieval. Information Processing and Management 36(6), 779–808 (2000)
14. Yan, H., Ding, S., Suel, T.: Inverted index compression and query processing with optimized document ordering. In: Proceedings of the 18th WWW, pp. 401–410 (2009)
15. Yan, H., Ding, S., Suel, T.: Compressing term positions in web indexes. In: Proceedings of the 32nd SIGIR, pp. 147–154 (2009)
Chinese Question Retrieval System Using Dependency Information

Jing Qiu¹, Le-Jian Liao², and Jun-Kang Hao³

¹ Department of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China [email protected]
² Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing 100081, China [email protected]
³ Division of Student Affairs, Hebei University of Science and Technology, Shijiazhuang 050018, China [email protected]
Abstract. Discussion boards and online forums provide a platform for users to share ideas, discuss issues and communicate with each other. Nowadays most discussion boards are used as problem-solving platforms, which can be seen as question-answering knowledge bases. Therefore, if the system can automatically locate similar questions that have appeared previously, the same answers can be returned to users, eliminating the time they would otherwise spend waiting. In this paper, a novel question retrieval method is proposed. A keyword-based inverted index is chosen as the base technology, and dependency information, a language model, and simple semantic information are used to extend it. Experimental results show the effectiveness and benefits of our approach.
1 Introduction

Discussion boards and online forums are widely used in various areas and allow users to share ideas, discuss issues, and communicate with each other easily. As a result, whatever their motivation for joining a discussion board, users tend to use it as a problem-solving platform. Cong et al. [1] found that 90% of the 40 discussion boards they investigated contain question-answering content. Therefore, discussion boards can be seen as question-answering knowledge bases, which contain much valuable information. For example, when users ask questions similar to questions that have already appeared in the discussion board, the same answers can be returned to them as potential solutions and suggestions. This saves users the time spent waiting for answers and improves efficiency. Golden soft is a high-tech enterprise engaged in software development and consultation in the architecture field. This enterprise built an online platform for users to discuss and communicate with each other. The old discussion platform of Golden soft was developed on top of a database and uses a fuzzy query
method to find similar questions. Its drawbacks are: the fuzzy query method can only find sentences that contain the keywords, without combining semantic information, and the query results all have the same weight (they are not ranked in order); moreover, fuzzy queries make the database index useless, which slows down the query speed. With the discussion platform of Golden soft as the background, this paper focuses on retrieving questions from discussion boards automatically and effectively in the architecture field. A novel retrieval method is proposed to search for similar questions in the question database and return their answers to users. Keyword-based inverted index technology is chosen as the base technology, which we extend with dependency information, a language model, and simple semantic information. Experimental results show the effectiveness and benefits of our approach. The rest of this paper is organized as follows: we discuss related work in Section 2. Section 3 briefly introduces dependency grammar. Section 4 describes three question retrieval methods. Experimental results are reported in Section 5. Section 6 concludes the paper.
2 Related Work

Cong et al. [1] developed a question detection system based on a classification method. Results can be automatically extracted from both questions and non-questions in forums by using sequential pattern features. Ding et al. [2] used CRFs (Conditional Random Fields) to detect the contexts and answers of questions. Hong et al. [3] showed that non-content features play a key role in improving overall performance, and they carefully compared how different types of features contribute to the final result. Feng et al. [4] proposed a system that automatically answers students' queries by matching reply posts from an annotated corpus of archived threaded discussions with the students' queries. Jeon et al. [5][6] and Cao et al. [7] detected questions in question-answering services that are semantically similar to the user's query. Traditional question answering tasks in the TREC style have been well studied [8]. They contain relatively limited types of questions, which makes it possible to identify the answer type, and that work mainly focused on constructing short answers.
3 Dependency Grammar

Dependency Grammar (DG) is a grammatical theory proposed by Tesnière [9]. Gaifman [10] later studied its formal properties, and his results were brought to the attention of the linguistic community by Hayes [11]. Dependency grammar is concerned directly with individual words through the dependency relationship, which is an asymmetric binary relationship involving two words: one is called the head, the other is called the modifier. Dependency grammar represents each sentence as a
set of dependency relationships through a tree structure. A word in a sentence may have more than one modifier, and it may modify at most one word. The root of the dependency tree is called the head of the sentence, and it does not modify any word.

4 Question Retrieval Methods

4.1 Keywords Based Question Retrieval Model

When using the keyword-based information retrieval method, a document is represented as a set of keywords, and the query can be expressed as a boolean combination of keywords. The system returns the documents that satisfy the boolean query as the result, sorted according to their weights. The general framework of the keyword-based question retrieval model is: first, decompose the query into a boolean combination of keywords; second, search the index files and find the set of questions that satisfy the boolean query; finally, calculate the weights of the retrieved question documents and return them in order. The inverted index approach is used to build the index files, and the TF/IDF algorithm is used to calculate the weight of each question document. Synonyms are used for query expansion, which can improve the recall of the system.
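A minimal sketch of this framework (boolean lookup over an inverted index followed by TF/IDF weighting); the tokenization and the exact weighting formula are assumptions of ours, since the paper does not spell them out.

```python
import math
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: list of keywords}. Returns keyword -> {doc_id: term frequency}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, words in docs.items():
        for word in words:
            index[word][doc_id] += 1
    return index

def search(index, query_words, total_docs):
    """AND query over the keywords, ranked by a simple TF-IDF score."""
    candidates = None
    for word in query_words:
        postings = set(index.get(word, {}))
        candidates = postings if candidates is None else candidates & postings
    scores = {}
    for doc_id in candidates or set():
        scores[doc_id] = sum(
            index[word][doc_id] * math.log(total_docs / (1 + len(index[word])))
            for word in query_words
        )
    return sorted(scores, key=scores.get, reverse=True)
```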
4.2 Dependency Relation Based Question Retrieval Model
The inverted index approach was proposed to address the practical need of retrieving records by the value of a property. The keyword-based question retrieval method is highly efficient, but it assumes that the words in a document are independent of each other and carries little syntactic and semantic information. Therefore, we propose a novel inverted index approach that incorporates dependency information, which we call the dependency relation based inverted index. The dependency relation based inverted index has the same structure as the keyword-based inverted index, but uses dependency relation pairs instead of keywords. In other words, it represents a sentence or a document as a set of dependency relation pairs rather than keywords. Its advantage when retrieving documents is that it adds syntactic and semantic information to the retrieval process while retaining high retrieval efficiency. Here, it is necessary to explore the difference between an "AND" query on the keyword-based inverted index and the dependency relation based inverted index, since both methods use two words simultaneously to locate the target records. An "AND" query on the keyword-based inverted index returns the target documents when they contain both keywords, without considering whether there is any particular relation between the words. In contrast, a dependency relation pair in the dependency relation based inverted index is generated by syntactic parsing and cannot be formed from two arbitrary words. So the dependency relation based inverted index has some properties of the "AND" query, with syntactic knowledge built in.
Corresponding to the dependency relation based inverted index method, an improved TF/IDF algorithm is used to calculate the weight of a question document. Here, TF indicates the frequency of a dependency relation pair in a document, and IDF indicates the inverse document frequency, whose value becomes larger when fewer documents contain the given dependency relation pair.
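The dependency relation based index can reuse the same machinery, with (head, modifier) pairs produced by a dependency parser taking the place of single keywords; the parse function below is a placeholder, not a real parser API.

```python
def dependency_pairs(sentence, parse):
    """parse(sentence) is assumed to return a list of (head, modifier) word pairs."""
    return [f"{head}->{modifier}" for head, modifier in parse(sentence)]

def build_pair_index(questions, parse):
    """questions: {question_id: sentence}. The 'terms' of the improved TF/IDF are now
    dependency relation pairs instead of single keywords."""
    docs = {qid: dependency_pairs(sentence, parse) for qid, sentence in questions.items()}
    return build_index(docs)   # same inverted-index builder as in the keyword sketch above
```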
4.3 Language Model Based Question Retrieval Model

Language modeling techniques for information retrieval have received increasing attention since they were introduced [12][13][14][15]. However, the language models used in most previous work are unigram models [12][15]. The unigram language model makes the strong and unrealistic assumption that each word occurs independently. Work has been done to explore bigram and trigram models [14], but it is well known that bigram and trigram models have limitations in handling long-distance dependencies. A dependency structure language model was then proposed to overcome the limitations of unigram and bigram models [13]. Because the question retrieval task is to decide whether a question document is similar to the query, such a task can be modeled by a probabilistic hypothesis test using the likelihood ratio. The calculation details of the likelihood ratio for the dependency structure language model can be found in [13]. We use this dependency structure language model to retrieve the question documents that are similar to the query and sort them according to their likelihood ratio values.
5 Experiments and Results

The following experiments were performed on the question database of Golden soft.

5.1 Question Retrieval Systems

Keyword-based question retrieval is implemented in the framework of Lucene, which is a high-performance and full-featured text search engine library. A Chinese word breaker is incorporated to handle Chinese sentences. Fig. 1 shows the test page of the keyword-based question retrieval system. The system can return satisfactory results to some extent. Dependency relation based question retrieval follows the same idea as keyword-based question retrieval, but adds dependency information. It retrieves quickly, since it is based on inverted index technology, and it also achieves high precision, since it incorporates more syntactic and semantic knowledge. Fig. 2 shows the test results for the same test query sentence. Dependency structure language model based question retrieval compensates for the weaknesses of the unigram and bigram language models. It can naturally handle long-distance dependencies through the linguistic syntactic structure inside the statistical language model. Different from the previous two systems, the language model based system uses statistical and syntactic ideas instead of an inverted index, and its retrieval precision is much better. Fig. 3 shows the test results.
Fig. 1. Keywords based question retrieval system
Fig. 2. Dependency relation based question retrieval system
Fig. 3. Dependency structure language model based question retrieval system

Table 1. Performance comparison

                                                    Precision   Recall
Keywords based system                               1.3%        94.4%
Dependency relation based system                    31.6%       87.5%
Dependency structure language model based system    78.6%       92.3%

5.2 Performance Comparison

Table 1 shows the performance of the three methods when retrieving 20 different questions; for each question we annotated the relevant records manually in advance. We can see that the precision of the dependency relation based system (31.6%) is significantly higher than that of the keywords based system (1.3%), which shows that syntactic information can improve the accuracy of the retrieval system. However, the recall of the dependency relation based system (87.5%) is lower than that of the keywords based system (94.4%); this is because of the limitations of the dependency parser and the complexity of natural language. From the table we can observe that both the precision and the recall of the dependency structure language model based system reach a high level, much better than the previous two systems. The disadvantage of this approach, however, is that its computational cost is too high for practical application.
5.3 System Improvement

The keyword-based inverted index question retrieval system is highly efficient, but it contains no syntactic or semantic information and therefore often returns irrelevant results. The dependency relation based inverted index question retrieval system builds the inverted index using dependency relation pairs, which adds syntactic information to some extent, but not enough. An improved method that adds more syntactic and semantic information to the inverted index based question retrieval system is to sort the results returned by the keyword-based system according to their likelihood ratio values under the dependency structure language model. This method can effectively enhance the precision of the system; however, the keyword-based system returns a large number of results, and it is time consuming to apply the dependency structure language model to each of them. We examined the running time of the dependency structure language model when the number of returned question documents is 20: it needs 1.8/3.8/7 seconds when the average length of the question sentences is 10/20/30, respectively. According to these test results, we set a threshold equal to 20. When fewer than 20 question documents are returned, the dependency structure language model is used to sort them; otherwise, the dependency relation based inverted index is used to adjust their order (records that are also hit in the dependency relation based inverted index are output with priority).
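A sketch of the combined strategy just described, using the 20-result threshold from the timing test; the two reranking functions are placeholders for the dependency structure language model score and the dependency-pair index lookup.

```python
THRESHOLD = 20  # chosen from the running-time measurements above

def improved_retrieve(query, keyword_results, lm_likelihood, hit_by_pair_index):
    """keyword_results: documents returned by the keyword-based system, in their original order.
    lm_likelihood(query, doc): likelihood ratio under the dependency structure language model.
    hit_by_pair_index(query, doc): True if doc is also returned by the dependency-pair index."""
    if len(keyword_results) <= THRESHOLD:
        # Small result set: afford the expensive language-model reranking.
        return sorted(keyword_results, key=lambda doc: lm_likelihood(query, doc), reverse=True)
    # Large result set: cheaply promote documents confirmed by the dependency-pair index,
    # keeping the original order within each group (sorted() is stable).
    return sorted(keyword_results, key=lambda doc: hit_by_pair_index(query, doc), reverse=True)
```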
6 Conclusion

The keyword-based inverted index question retrieval method has a fast search speed, but it lacks syntactic and semantic information and therefore often returns irrelevant results. We propose a novel dependency relation based inverted index method for question retrieval, which uses dependency relation pairs instead of keywords. An improved TF/IDF algorithm is also proposed to calculate the weight of a question document. The experimental results show that the performance of the dependency relation based system is much better than that of the keywords based system. We also propose an improved system model that combines the language model method with the dependency relation based inverted index method by using a threshold, which ensures both the efficiency and the accuracy of the system.

Acknowledgments. This paper is supported by the Natural Science Foundation of China (60873237) and the Natural Science Foundation of Beijing (4092037). We thank the Harbin Institute of Technology Information Retrieval Laboratory for providing us with the LTP modules. We also thank Golden soft for providing us with the questions and answers database in the architecture field.
References
1. Cong, G., Wang, C.Y., Song, Y.I., Sun, Y.: Finding question-answer pairs from online forums. In: 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 467–474. ACM, New York (2008)
2. Ding, S., Cong, G., Lin, C., Zhu, X.: Using conditional random fields to extract contexts and answers of questions from online forums. In: 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL: HLT), Columbus, pp. 710–718 (2008)
3. Hong, L.J., Davison, B.D.: A classification-based approach to question answering in discussion boards. In: 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 171–178. ACM, Boston (2009)
4. Feng, D., Shaw, E., Kim, J., Hovy, E.: An intelligent discussion-bot for answering student queries in threaded discussions. In: 11th International Conference on Intelligent User Interfaces, pp. 171–177. ACM, New York (2006)
5. Jeon, J., Croft, W.B., Lee, J.H.: Finding semantically similar questions based on their answers. In: 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 617–618. ACM, New York (2005)
6. Jeon, J., Croft, W.B., Lee, J.H.: Finding similar questions in large question and answer archives. In: 14th ACM International Conference on Information and Knowledge Management, pp. 84–90. ACM, New York (2005)
7. Cao, Y., Duan, H., Lin, C.Y., Hon, H.W.: Recommending questions using the MDL-based tree cut model. In: 17th International Conference on World Wide Web, pp. 81–90. ACM, New York (2008)
8. Voorhees, E.M.: The TREC question answering track. Nat. Lang. Eng. 7(4), 361–378 (2001)
9. Tesnière, L.: Éléments de Syntaxe Structurale. Klincksieck, Paris (1959)
10. Gaifman, H.: Dependency Systems and Phrase-Structure Systems. Information and Control 8, 304–337 (1965)
11. Hayes, D.G.: Dependency Theory: A Formalism and Some Observations. Language 40, 511–525 (1964)
12. Jin, H., Schwartz, R., Sista, S., Walls, F.: Topic tracking for radio, TV broadcast, and newswire. In: DARPA Broadcast News Workshop, Herndon, Virginia, USA, pp. 199–204 (1999)
13. Lee, C., Lee, G.G., Jang, M.: Dependency structure language model for topic detection and tracking. Information Processing and Management 43(5), 1249–1259 (2007)
14. Miller, D.R.H., Leek, T., Schwartz, R.M.: A hidden Markov model information retrieval system. In: 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 214–221. ACM, California (1999)
15. Spitters, M., Kraaij, W.: A language modeling approach to tracking news events. In: TDT Workshop 2000, pp. 101–106, Gaithersburg (2000)
A Novel Automatic Lip Reading Method Based on Polynomial Fitting Meng Li and Yiu-ming Cheung Department of Computer Science, Hong Kong Baptist University {mli,ymc}@comp.hkbu.edu.hk
Abstract. This paper addresses the problem of isolated number recognition using visual information only. We utilize an intensity transformation and spatial filtering to estimate the minimum enclosing rectangle of the mouth in each frame. For each utterance, we obtain two vectors composed of the width and the height of the mouth, respectively. We then present a method to recognize the speech based on polynomial fitting. Firstly, both the width and height vectors are normalized and resampled to a constant length via interpolation. Secondly, the least squares method is utilized to produce two 3rd-order polynomials that represent the main trends of the two vectors, respectively, and reduce the noise caused by estimation error. Lastly, the positions of three crucial points (i.e. the maximum, the minimum, and the right boundary point) of each 3rd-order polynomial curve form a feature vector. For each utterance, we calculate the average of all vectors of the training data to make a template, and utilize the Euclidean distance between the template and the testing data to perform the classification. Experiments show the promising results of the proposed approach in comparison with the existing methods.
1 Introduction

Lip reading is to understand speech by visually interpreting the lip movement of speakers [1]. It has received considerable attention from the community because of its wide applications, such as visual-speech recognition, human identification, and so forth [2], [3], [4]. Recently, beyond the traditional applications, lip reading has also been employed in human-agent interaction systems to implement emotional state monitoring and non-verbal human computer interfaces [5], [6]. Thus far, two kinds of approaches have been widely used in lip reading systems, namely image-based and model-based ones. In the image-based approach, the pixels of the lip region are transformed by PCA, DWT or DCT to form a feature vector [7], [8], [9]. Under ideal conditions, the accuracy of the image-based approach is quite high, but its performance in real environments degrades seriously. One main reason is that the approach is sensitive to illumination, mouth rotation, and other conditions. From a practical viewpoint, the image-based approach is therefore not an appropriate choice for an automatic lip reading system. By contrast, in the model-based approach, e.g. Snake, ASM and AAM [10], [11], [12], [13], the shape and position of the lip contours, tongue, teeth or some other features, such as the width and height of the mouth, are modeled and controlled by a set of parameters. In general, the model-based approach is invariant to the effects of scaling, rotation, translation, and illumination in comparison with the image-based approach.

In this paper, we propose a lip reading approach based on the width and height of the mouth. We utilize an intensity transformation and spatial filtering on the Region of Interest (ROI) in the gray-scale space such that the minimum enclosing rectangle of the lips can be localized automatically. Then, given a video clip of an utterance, we obtain two vectors composed of the width and height of the mouth in each frame, respectively. Based on the least squares method, two 3rd-order polynomials are built to fit the width and height vectors. The positions of three crucial points (i.e. the maximum, the minimum, and the right boundary point) of each 3rd-order polynomial curve form a feature vector. For each utterance, we calculate the average of all vectors of the training data to make a template, and utilize the Euclidean distance between the template and the testing data to perform the classification. Experimental results have shown the promising performance of the proposed approach in comparison with the existing ones.
2 Lip Localization and Feature Extraction

The images captured by the camera are comprised of RGB values. We heuristically project these RGB values into the gray-level space based on the following equation:

$I = 0.299R + 0.587G + 0.114B.$    (1)

In order to enhance the contrast between the lip and the surrounding skin region, we equalize the histogram of the image. Then, we accumulate the gray-level values in each row of the image. The slopes of the resulting curve contain information about the boundaries between the lips and the surrounding skin region. The minimum value on the curve is retained as the row position of the mouth corner points or a nearby position. This row is named the horizontal midline of the mouth, as shown in Figure 1. The curve of gray-level values along the horizontal midline is saved in a vector G. We build a sub-vector Gs from the segment of G between the first maximum from the left and the first maximum from the right. To make the curve smooth, we let

$C = C_l + C_r$    (2)

with

$C_l^{(i)} = \begin{cases} G_s^{(i)} & (C_l^{(i-1)} > G_s^{(i)}) \\ C_l^{(i-1)} & (C_l^{(i-1)} \le G_s^{(i)}) \end{cases} \quad i = 1, 2, \dots, n$    (3)

$C_r^{(i)} = \begin{cases} G_s^{(i)} & (C_r^{(i+1)} > G_s^{(i)}) \\ C_r^{(i+1)} & (C_r^{(i+1)} \le G_s^{(i)}) \end{cases} \quad i = n-1, n-2, \dots, 1$    (4)

where $C_l$ and $C_r$ are auxiliary vectors, $C_l^{(i)}$ is the ith element of $C_l$, and n is the dimension of the vector G. Initially, we let

$C_l^{(1)} = G^{(1)}, \qquad C_r^{(n)} = G^{(n)}.$    (5)
Fig. 1. Accumulation curve of gray level value for each row. The vertical crossing line represents the relation between the horizontal midline of mouth and the minimum value of the accumulation curve.
Also, let the minimum of the left-most and right-most values in C be the threshold value. Subsequently, those elements $C^{(i)}$ whose values are less than the threshold build a new vector C′. Accordingly, the average of the elements in C′ can be obtained by

$c_{avg} = \frac{\sum_{i=1}^{m} C'^{(i)}}{m},$    (6)

where m is the number of elements in C′. Then, we adjust the contrast of the image using

$I_{out} = \begin{cases} 255 & (1.5c_{avg} < I_{in} < 1) \\ \frac{500}{c_{avg}} I_{in} - 500 & (0 < I_{in} \le 1.5c_{avg}), \end{cases}$    (7)

where $I_{in}$ is the input gray-level value and $I_{out}$ is the output. For the adjusted image, an 11 × 1 searching block is moved along the midline. The positions of the left-most and right-most not-all-white blocks are marked as the columns of the mouth corner candidates. The procedure is performed iteratively until the positions of the mouth corner candidates no longer change or the image becomes a binary one. Figure 2 illustrates the result of the contrast adjustment and the corresponding mouth corner estimate.

Fig. 2. (a) Image of the contrast adjustment result, and (b) the estimated mouth corners

Furthermore, a mean filter is performed on the initial image using the following 3 × 3 mask:

$M = \frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}.$    (8)

Subsequently, we let

$I^{(i+1)} = I^{(i)} * M,$    (9)

where $I^{(i)}$ is the result of the ith filtering. The number of times the filter is applied is determined by

$\delta_i = \mathrm{dist}(I^{(i+1)}, I^{(i)}),$    (10)

where $\delta_i$ is the Euclidean distance between $I^{(i)}$ and $I^{(i+1)}$. The procedure is stopped once $\delta_i$ is less than a given threshold. As a result, the final $I^{(i+1)}$ is marked as $I_f$.

Fig. 3. The subtracted image between the source gray-scale image $I^{(0)}$ and the filtered image $I_f$

Since the positions of the left and right mouth corners have been estimated in Section 2.1, we can use them to calculate the center of the mouth easily. For each $I^{(i)}$, a gray-value vector $G_{mu}^{(i)}$ is built from the segment from the center point to the top of the image along the normal direction. Subsequently, the vector $\Delta G_{acc}$ is calculated by

$\Delta G_{acc} = \sum_{i=1}^{n} \left| G_{mu}^{(0)} - G_{mu}^{(i)} \right|.$    (11)

The point corresponding to the maximum extreme value (excluding the boundary values) is retained as the row position of the upper bound of the mouth.
Fig. 4. The estimate of mouth upper and lower bound
Then, the subtracted image between $I^{(0)}$ and $I_f$ can be calculated. For ease of observation, an image inversion transformation is applied, as shown in Figure 3. We obtain the gray-level values along the normal direction passing through the middle point of the mouth down to the bottom of the image. The point at which the minimum extreme value occurs (excluding the boundary values) is retained as the row position of the lower bound of the mouth. The estimates of the mouth upper and lower bounds are shown in Figure 4.
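A hedged sketch of the localization preprocessing in Equations (1) and (8)-(10) using NumPy and SciPy; the convergence threshold and iteration cap are assumptions, since the paper does not state them.

```python
import numpy as np
from scipy.ndimage import convolve

def to_gray(rgb):
    """Equation (1): project an RGB image (H x W x 3) to gray level."""
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def iterative_mean_filter(gray, threshold=1.0, max_iter=100):
    """Repeatedly apply the 3x3 mean mask of Equation (8) until successive results differ
    by less than a threshold in Euclidean distance (Equations (9)-(10)); returns I_f."""
    mask = np.full((3, 3), 1.0 / 9.0)
    current = gray.astype(float)
    for _ in range(max_iter):
        filtered = convolve(current, mask, mode="nearest")
        if np.linalg.norm(filtered - current) < threshold:   # delta_i of Equation (10)
            return filtered
        current = filtered
    return current
```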
3 The Proposed Approach to Lip Reading Recognition

For one utterance, e.g. an isolated digit or a word, we capture the video clip of the speaker's lip motion and split it into a frame sequence. Then, in each frame, we utilize the method described in Section 2 to get two vectors composed of the width and the height of the mouth, respectively. Since the duration of each utterance is not constant, an interpolation method is utilized to make the lengths of the two vectors the same. In this paper, the length is set at 100. These two vectors are marked as $F_w$ and $F_h$. An example is shown in Figure 5. Since the range of lip motion differs from person to person, we use the ratio between the displacement and the original position of the lip to represent the trend of the motion. The normalization method is shown below:

$F_w^{norm} = K \frac{F_w - F_{w1}}{F_{w1}}$    (12)

$F_h^{norm} = K \frac{F_h - F_{h1}}{F_{h1}},$    (13)
where $F_{w1}$ is the first element of $F_w$, and K is a gain coefficient that makes the motion trend more significant. In this paper, the value of K is set at 30 by a rule of thumb. Then, we utilize the least squares method to find the two polynomials, whose parametric form is

$P = \sum_{k=0}^{n} a_k x^k,$    (14)

to fit $F_w^{norm}$ and $F_h^{norm}$. To find the coefficients $a_k$, we should minimize

$I = \sum_{i=0}^{m} \left( \sum_{k=0}^{n} a_k x_i^k - y_i \right)^2.$    (15)

Fig. 5. (a) Some frames of the utterance of "5" in Chinese Mandarin, (b) the corresponding width vector, and (c) the corresponding height vector. Although there is some noise caused by the estimation error, it can be seen that the main trend of the width is original-narrow-original, and that of the height is original-high-original.
∂I =2 ( ak xki − yi )xki = 0, ∂ai i=0
(16)
k=0
where yi is Fwnorm or Fhnorm , m is the maximum index of vector that equals to 99, and i i xi equals to 0.1i. Moreover, n is set at 3 upon the characteristic of human speech. Thus, we can get the solution Aw = [aw0 , aw1 , aw2 , aw3 ]T and Ah = [ah0 , ah1 , ah2 , ah3 ]T . An example of polynomial fitting result is shown in Figure 6. The polynomial shapes for the same utterance are constrained to have similar expression. Thus, we can get the global maximum and minimum in the two polynomials, marked as (xwmin , ywmin ), (xhmin , yhmin ), (xwmax , ywmax ) and (xhmax , ywmax ), to build the two feature vectors Fw and Fh with Fw = [xwmin , ywmin , xwmax , ywmax , ywbound ]T
(17)
302
M. Li and Y.-m. Cheung
Fh = [xhmin , yhmin , xhmax , yhmax , yhbound ]T ,
(18)
where ybound is the most right value of polynomial when x ∈ [0, 9.9] (e.g. x = x99 ). ∈ Moreover, for the feature vectors, both xmin and ymin are set at zero if xmin [0, 9.9]. Similarly, we can determine the value of xmax and ymax . For example, as shown in Figure 6, the feature vectors are Fw = [4.1558, −2.3476, 0, 0, 0.9460]T and Fh = [0, 0, 4.1266, 12.9164, 1.2969]T .
(a)
(b) Fig. 6. The polynomial curve fitting to: (a) the width vector, and (b) the height vector, as shown in Figure 5
For each utterance, we calculate the average of all vectors of training data to make a template Z = (Zw , Zh ) via an adaptive way, i.e. given a new training data F = (Fw , Fh ), let i = Zw
i−1 Zw + Fw 2
(19)
Zhi =
Zhi−1 + Fh . 2
(20)
i After finishing the training phase, we let Z = (Zw , Zhi ). When testing, we calculate the distance between each template and testing data via
d = ||Fw − Zw || · ||Fh − Zh ||.
(21)
That is, the testing data is classified into the category which corresponds into the minimum d.
A Novel Automatic Lip Reading Method Based on Polynomial Fitting
303
4 Experimental Result This section will demonstrate the performance of the proposed approach. The experimental environment is shown in Figure 7. The illumination source was a 18w fluorescent lamp, which was placed in front of a speaker. The resolution of camera was 320 × 240, and the FPS (i.e. Frames Per Second) was 30.
Fig. 7. The illustration of experiment environment
Our task was to recognize 10 isolated digits (0 to 9) in Chinese Mandarin, whose pronunciations are shown in Table 1.

Table 1. The pronunciations of digits 0 to 9 in Chinese Mandarin
There were 5 speakers (4 males and 1 female) taking part in the experiment. For each digit, the speakers were asked to repeat it ten times to train the system and fifty times to test it. Figure 8 shows the performance of the proposed approach with different numbers of training samples. Moreover, Table 2 shows the recognition accuracy for each digit. Furthermore, we compared the proposed approach with four existing approaches: HMM, RDA, Spline, and ST Coding, as investigated in [14] and [15]. Table 3 shows the comparison results. It can be seen that the proposed approach outperforms all the existing methods we have tried so far.
Table 2. The recognition accuracy for each digit

Digit   Accuracy   Digit   Accuracy
0       0.972      5       0.912
1       0.952      6       0.964
2       0.976      7       0.744
3       0.964      8       0.952
4       0.788      9       0.932
Fig. 8. The testing results over different numbers of training samples

Table 3. The lip reading recognition results obtained by the proposed approach in comparison with the existing methods

Method         Accuracy
HMM            81.27%
RDA            77.41%
Spline         91.49%
ST Coding      77.20%
Our Approach   91.56%
5 Conclusion

In this paper, we have proposed a new approach to automatic lip reading recognition based upon polynomial fitting. The feature vectors of the proposed approach have low dimensionality, and the approach needs only a small amount of training data. Experiments have shown promising results in comparison with the existing methods.
Acknowledgment

The work described in this paper was supported by the Faculty Research Grant of Hong Kong Baptist University under project codes FRG/07-08/II-88, FRG2/08-09/122 and FRG2/09-10/098, and by the Research Grant Council of Hong Kong SAR under Project HKBU 210309.
References

1. Bulwer, J.: Philocopus, or the Deaf and Dumbe Mans Friend. Humphrey and Moseley (1648)
2. Potamianos, G., Neti, C., Luettin, J., Matthews, I.: Audio-visual automatic speech recognition: An overview. In: Bailly, G., Vatikiotis-Bateson, E., Perrier, P. (eds.) Issues in Visual and Audio-Visual Speech Processing. MIT Press, Cambridge (2004)
3. Luettin, J., Thacker, N.A., Beet, S.W.: Speaker identification by lipreading. In: Proc. IEEE International Conference on Spoken Language Processing, Philadelphia, USA, pp. 62–65 (1996)
4. Chen, T., Rao, R.R.: Audio-visual integration in multimodal communication. Proceedings of the IEEE 86(5), 837–851 (1998)
5. Wei, X., Yin, L., Zhu, Z., Ji, Q.: Avatar-mediated face tracking and lip reading for human computer interaction. In: Proceedings of the 12th Annual ACM International Conference on Multimedia, New York, USA, pp. 500–503 (2004)
6. Granstrom, B., House, D.: Effective interaction with talking animated agents in dialogue systems. In: Advances in Natural Multimodal Dialogue Systems. Springer, Netherlands (2005)
7. Bregler, C., Konig, Y.: "Eigenlips" for robust speech recognition. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Adelaide, Australia, pp. 669–672 (1994)
8. Potamianos, G., Graf, H.P.: Discriminative training of HMM stream exponents for audio-visual speech recognition. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, WA, pp. 3733–3736 (1998)
9. Potamianos, G., Luettin, J., Neti, C.: Hierarchical discriminant features for audio-visual LVCSR. In: Proc. IEEE International Conference on Acoustics, Speech, Signal Processing, Salt Lake City, Utah, USA, pp. 165–168 (2001)
10. Neti, C., Iyengar, G., Potamianos, G., Senior, A., Maison, B.: Perceptual interfaces for information interaction: Joint processing of audio and visual information for human-computer interaction. In: Proc. International Conference on Spoken Language Processing, Beijing, China
11. Luettin, J., Thacker, N.A.: Speechreading using probabilistic models. Computer Vision and Image Understanding 65(2), 163–178 (1997)
12. Dupont, S., Luettin, J.: Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia 2(3), 141–151 (2000)
13. Werda, S., Mahdi, W., Hamadou, A.B.: Colour and geometric based model for lip localisation: Application for lip-reading system. In: Proc. IEEE International Conference on Image Analysis and Processing, Modena, Italy, pp. 9–14 (2007)
14. Wang, S.L., Lau, W.H., Liew, A.W.C., Leung, S.H.: Automatic lipreading with limited training data. In: Proc. IEEE International Conference on Pattern Recognition, pp. 881–884 (2006)
15. Baig, A.R., Séguier, R., Vaucher, G.: Image sequence analysis using a spatio-temporal coding for automatic lip-reading. In: Proc. IEEE International Conference on Image Analysis and Processing, Venice, Italy, pp. 544–549 (1999)
An Approach for the Design of Self-conscious Agent for Robotics

Antonio Chella1, Massimo Cossentino2, Valeria Seidita1, and Calogera Tona1

1 Dipartimento di Ingegneria Informatica, Università degli Studi di Palermo, Palermo, Italy
{chella,seidita,tona}@dinfo.unipa.it
2 Istituto di Calcolo e Reti ad Alte Prestazioni, Consiglio Nazionale delle Ricerche, Palermo, Italy
[email protected]
Abstract. Developing complex robotic systems endowed with self-conscious abilities and subjective experience is a hard requirement to face at design time. This paper deals with the development of robotic systems that do not own any a-priori knowledge of the environment they live in, and proposes an agent-oriented design process for modelling and implementing such systems by implementing the perception loop that occurs between environment, body and brain during subjective experience. A case study dealing with a RoboCup setup is presented in order to describe the design process activities and to illustrate the techniques that make the robot able to autonomously decide when an unknown situation occurs and to learn from experience.
1 Introduction
The design and development of complex robotic systems require a great effort in terms of techniques, models and methods in order to capture the specific features the system to be developed has to implement, and to relate and realize them on a specific robotic platform. In the past, several software engineering techniques have been proposed for developing complex robotic systems, also using the well-known agent paradigm [2]; with this paradigm the robotic system is considered as a collection of agents, each responsible for a functionality or committed to reaching a goal. The authors have carried out several experiments in the creation of ad-hoc design processes and methods for different kinds of applications; in this paper they present an agent-oriented design process for the design and development of self-conscious robotic systems able to autonomously act in an unknown and unstructured environment, in a human-like fashion. The presented work is mainly based on the assumption that self-conscious ability can be implemented by a continuous interaction between brain, body and environment; hence the subjective experience, realized by means of what is called the perception loop, allows the robotic system to anticipate the results of the mission it is performing and to realize whether it was successful by means of a continuous
comparison with the perceived environment. A perception loop is continuously and instinctively performed by a human being: when starting a mission, he imagines his state (himself) at the end of the mission; when the mission is over, if what he imagined differs from what he perceives, he realizes that the mission failed, otherwise it was successful; in the first case he has to adopt some "corrective" actions. In this paper we focus on the implementation of the perception loop in a robotic system using an agent-oriented design process, and on how a non-preprogrammed system can be made able to take decisions and learn each time it encounters an unknown situation for which it was not designed. The work embraces two different levels of abstraction: one concerning the creation, and then the use, of a design process for developing self-conscious systems, and a higher one concerning the identification and definition of the whole process for the development of conscious systems, from the definition of the problem domain to the execution of the loop and the management of the robot parameter tuning phase. The proposed design process (PASSIC) has been created by greatly exploiting the experience gained in the past with PASSI (Process for Agent Society Specification and Implementation) [9], which has been extended and integrated with a set of design process portions for developing and implementing the reflective part of the robotic system. We mainly reused features from PASSIG [16], which offers the possibility of performing a goal-oriented analysis of the system in the same way as proposed in [3][18]. The PASSIC phases are shown throughout the paper through an experiment on a RoboCup setup using the NAO platform. The rest of the paper is organized as follows: in Section 2 the background and motivation of the proposed work are illustrated, in Section 3 an overview of the PASSIC design process and of the complete process for the development of a self-conscious robotic system is given, Section 4 deals with the RoboCup experiment, and finally in Section 5 some conclusions are drawn.
2 Background and Motivations
The robot perception loop described in [5][8] (see Figure 1.a)) is composed of three parts: the perception system, the sensor and the comparative component; through the proprioceptive sensors the perception system receives a set of data regarding the robot, such as its position, speed and other information. These data are used by the perception system to generate the anticipation of the scenes and are mapped onto the effective scene the robot perceives, thus generating the robot's prediction about the relevant events around it. As can be seen from Figure 1.a), a loop exists between perception and anticipation; each time some part of a perceived scene, in what is called the current situation, matches the anticipated one, the robot gains experience about what is happening around it. According to [15], the perception loop realizes a loop among "brain, body and environment" that is the basis
Fig. 1. The Loop Brain-Body-Environment vs the NAO Mission
for the externalist point of view of subjective experience; subjective experience supposes a processual unity between the activity in the brain and what is perceived from the external world. We start from these considerations for developing and implementing our robotic systems. Our experiments aim at verifying the usability of the perception loop within a software design process created ad hoc to support it. Our aim is to create a software system able to control a robot by means of perception loops. The robot we consider does not own any a-priori knowledge of the possible obstacles it can encounter. It is able to self-localize, to recognize a situation preventing it from reaching its goal (in this case, to detect and identify an object as an obstacle) and to decide which is the best action to be performed in order to solve the problem. The robot is not equipped with "pre-compiled" pre-planning abilities: we want to study the situation in which the robot does not know what to do in a given case, and so it queries its memory in search of a ready solution or tries to find a novel solution exploiting its knowledge about itself, its capabilities and the surrounding world. At the beginning of our experiment, the robot does not own any sophisticated behaviour: it can walk and move its arms and legs, and this is what it does, sometimes in a scrambled fashion, whenever it wants to reach a goal. For instance, let us suppose the robot wants to move on the floor and pick up an object far in front of it. Then, it randomly tries to execute some behaviours from the set of "primitives" it owns until it reaches its objective. So, if the robot encounters a small obstacle, it could move to its right or left in order to go around it, or it could move its arms (randomly) in order to displace the obstacle. Each time a set of actions proves to be successful for solving a problem, the robot is able to learn it, and when necessary it will apply this successful strategy again. In particular, we experimented with a humanoid robot endowed with a set of primitive behaviours. The learning phase consisted in letting the humanoid robot try some sets of behaviours, analyze the results and store them. Each time the robot, while pursuing a goal, does not know how to solve a problem, it may query its database of previous solutions and retrieve past cases experienced as successful. When this strategy fails, the robot activates random behaviours.
The design process and the model we fixed for designing such systems are general and can be applied to different kinds of robots. Indeed, they aim at designing concepts such as goals, behaviours and actions. We employed NAO, a humanoid robot developed by Aldebaran Robotics1. It is equipped with a set of basic behaviours that can be linked in a time-based or event-based fashion in order to create complex behaviours. In order to implement the anticipation step of the NAO perception loop, we experimented with the 3D robot simulator Webots from Cyberbotics2. By means of Webots, we may simulate NAO's movements and behaviours as well as its surrounding environment. Figure 1 shows the design and implementation of the perception loop in NAO. We exploited the fact that we use NAO and, at the same time, the NAO simulator, so the perception loop among brain, body and environment [15] corresponds to the loop among NAO (the real robot), the virtual NAO (the robot simulator) and NAO's world. Both NAO and the virtual NAO use the Behavioural Specification resulting from the design process phases: the former for executing a sequence of behaviours, and the latter for producing the corresponding simulated behaviours. More in detail, the Behavioural Specification is the work product resulting from the analysis phase. Here, the robot behaviours to be put into practice to reach its goal are fully specified; each behaviour is composed of a set of simple robot actions. Each time NAO encounters a stop condition - for instance an unexpected object in the path - while pursuing its goal, the anticipated scene is compared with the real NAO parameters, and the resulting log, in the form of proprioceptive and sensorial values, is used for the tuning and learning phase. It is worth noting that the use we made of the simulation is quite different from the common one: in fact, it is not used for investigating and anticipating the robot behaviour in a specific working condition, but instead for producing, starting from the designed behaviours, the expected results of a mission. The simulator and NAO work separately; only when a stop condition is identified are the simulation results compared with the real NAO parameters.
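The comparison step just described can be summarized with a short sketch; every name below (simulate_in_webots, execute_on_nao, read_proprioception, distance) is a hypothetical placeholder and does not belong to the real NAO or Webots APIs.

```python
# Hypothetical sketch of the perception-loop comparison; the callables passed in
# stand for the components described in the text, not for any real robot API.
def perception_loop(behaviour_spec, simulate_in_webots, execute_on_nao,
                    read_proprioception, distance, threshold):
    anticipated = simulate_in_webots(behaviour_spec)   # expected end-of-mission state
    execute_on_nao(behaviour_spec)                     # runs until the goal or a stop condition
    perceived = read_proprioception()                  # real proprioceptive/sensorial values
    if distance(anticipated, perceived) > threshold:
        # Mismatch: the mission failed; the log feeds the tuning and learning phase
        return {"success": False, "log": (anticipated, perceived)}
    return {"success": True, "log": None}
```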
3 The Self-conscious Robotic System Design Process
In [7] the creation of the PASSIC design process and of a model for the perception loop has been presented. The process for creating the new design process follows the Situational Method Engineering paradigm [4][14] and extends PASSI2 [10] and PASSIG [16], developed by the authors in recent years for creating ad-hoc design processes [17]. One of the main points is the definition of a metamodel for the perception loop (see [7] for further details), where the elements of the perception loop have been identified and mapped onto a robotic system. The creation of PASSIC allowed us to formalize the design of the aforementioned kind of robotic applications; it has then been included in what we think
1 http://www.aldebaran-robotics.com
2 http://www.cyberbotics.com
it would be the whole self-Conscious System Development Process (CSDP). The complete development process used for developing self-conscious systems is composed of three different phases: Problem, Design and Configuration, and Execution. The Problem phase is composed of all the activities devoted to focusing on the problem domain, hence eliciting the system requirements and identifying the missions the robot performs to reach its goals. During these activities the designer considers a database where the set of abilities the robotic system possesses is stored (Cases and Configurations). A Case is composed of the goal description, the set of tasks performed in order to reach it (a plan), some pre-conditions, and the list of parameters needed for successfully applying the plan. With reference to NAO, tasks correspond to the actions it is able to perform. A Configuration is a specific set of parameter values that has proved successful in instantiating one specific Case; it also includes the total number of uses and the number of positive outcomes this Configuration produced in pursuing that Case. During the Design and Configuration phase, the designer defines a software solution in order to accomplish the required goals; these activities are carried out with the aid of the PASSIC design process, during which two fundamental deliverables are produced: i) the design of the robotic system to be built using a society of agents, each of which is committed to realizing a particular goal of the system; ii) the mission configuration, including all the elements of the database of Cases and Configurations related to the particular robotic platform chosen. In the next section an example reporting a part of a case study will be given. In the Execution phase the running system is considered; here the robotic system has to execute a mission, i.e. a goal to satisfy by following a plan, hence a specific sequence of tasks to be executed. For each goal the related set of tasks is decided
Fig. 2. The PASSIC Design Process
at design time; the specifications are sent at the same time to the part of the system devoted to generating the anticipation and to the part (the robot itself) that actually executes the mission, as shown in Figure 1. Once both have terminated, the results are compared: if they match, then the goal has been reached and the configuration (task, parameters, ...) can be saved for future reuse - the system has learnt. On the contrary, if the results do not match, the robotic system has to select another mission configuration without human intervention. One of the most important parts of the CSDP is the PASSIC design process, which has been created ad hoc to meet the needs of perception-loop-based self-conscious system design. PASSIC is composed of three phases and is based on an iterative/incremental life cycle (see Figure 2). The System Requirements phase deals with a goal-oriented analysis of the system, and one of the main models it produces is the Agent Diagram, where the society of agents is identified together with the roles and the tasks of each agent. Besides, a description of the agent structure - in the Agent Structure Exploration - in terms of the tasks required for accomplishing the agents' functionalities is given. In the Agent Society phase an agent-based solution is introduced; in this phase the ontological description of the domain categories and of the actions having effect on their states allows the agents' communications to be established. Then the agents are described in terms of roles, services and resource dependencies. Once the agent society has been designed, the autonomous part of the system devoted to implementing the perception loop is designed; during the Perception Plan and Design, starting from the knowledge about the environment, the agent society architecture and the requirements, the anticipation is produced. Then the Perception Test Execution aims at designing the criteria for evaluating the results of the running loops, and in the Configuration Management the rules for enhancing the Case and Configuration database, hence the rules for tuning the system parameters, are designed. Finally, in the Implementation phase the model of the solution architecture in terms of classes, methods, deployment configuration, code and testing directives is realized. In this phase, the agent society defined in the previous models and phases is seen as a specification for the implementation of a set of agents that should now be designed at the implementation level of detail, then coded, deployed and finally tested.
4 Applying PASSIC - A RoboCup Based Case Study
In this section the self-Conscious System Development Process (CSDP), including some PASSIC activities, will be shown through a case study in order to make some fundamental aspects of the proposed work clear. We will focus on the possibility of designing the mission configuration phase that allows the robot to effectively implement the perception loop, and therefore to provide the robot with the capability of: i) becoming conscious of what is happening around
it, ii) efficiently selecting a set of behaviours that lets it reach its goal, even if unexpected situations occur, and finally iii) learning from experience. NAO is the official platform used in the standard league of RoboCup3. The experiment concerned the development of a multi-agent system for managing two robots (two NAOs) engaged in a soccer match, where one NAO (from now on called NAO F) serves as a forward with the main scope of scoring a goal, and the second one as a defender (called NAO D) with the main scope of preventing NAO F from scoring. As already said, the NAO platform is endowed with a pre-determined set of actions that proved to be very useful for RoboCup setups. The capabilities NAO offers were analyzed during the Problem Domain phase and allowed us to identify the right tasks to be used for realizing the goals identified during the PASSIC design phases. The developed system includes goals such as identifying the ball, identifying the goal and the boundaries of the game field, interfacing with the simulator, managing the results of the comparison, and managing the communications with the database. Some of the previous goals are clearly NAOs' goals, but it is to be noted that they are committed to one or more agents of the society. These goals are designed using PASSIC, and the design activities result in a set of behavioural specifications useful for the NAOs and the simulator (as shown in Figure 1.b)). Referring to Figure 1.a) and to what we said in Section 2, the term "scene" includes a whole set of parameters covering the surrounding world and the robot state, identified through the set of specific parameters related to the robotic platform used - NAO in our case. As can be seen from PASSIC, designing the mission configuration exploits the analysis of the system's goals. Figure 3 shows a portion of the goal diagram resulting from the Domain Analysis activity, whose aim is to identify each actor's tasks and to apply means-end analysis - a task (means) can be used to achieve a goal (end). The NAO F actor has been identified with some of its goals and tasks. The Goal Diagram results from the portion of Tropos [3] we used when we created PASSIG; Tropos principally adopts a requirements-driven software development approach, exploiting goal analysis in order to identify actor dependencies. The main concepts of Tropos, and hence of the PASSIC analysis activities, are: the Actor, which models an entity having strategic goals and represents a physical, a social or a software agent; the Goal, which is the strategic interest of an actor, satisfied through the Task, i.e. a particular course of action that can be executed. Another important element is the Resource, an entity without intentionality that can be physical or informational. In Figure 3.a) the goal MakeGoal is considered; it is decomposed into two goals, GoTowardsGoal and AvoidNAO D, each of which is related to one or more tasks through the means-end analysis and uses one or more resources; for instance, bumping the ball implies the use of the ball as a resource. With reference to the proposed case, our main aim is to have a robot exploiting all of what we call its innate capabilities in a human-like fashion. For the sake of
3 www.robocup.org
Fig. 3. The Goal Diagram Portion of the Robotic System
brevity, in this section we focus on a sub-case, the one concerning the following scenario: NAO F is in front of the ball and its objective is to kick the ball into the goal while autonomously managing all the possible unknown or unexpected situations. In the following, all the design results of this sub-system will be shown, and it will be illustrated how the robot acts as a result of design-time activities and how it acts as a result of the self-conscious abilities it has been provided with. In order to express the pre- and post-conditions in the configuration, we assume for now a high level of abstraction and consider that the robot interacting with its environment, whatever its goal, has to see the world and the objects in the world, and to touch them if necessary; hence it has to perceive through its sensors before it can act. During the System Requirements phase the set of actors involved in the system, together with all their related goals, is identified and analyzed; the result is a set of work products including goal diagrams and agent diagrams, as proposed in PASSIC4 and in [16][3]. In Figure 3.b) a portion of the agent diagram for the TakeBall goal is provided; it relates to the case in which the robot has to reach the ball in order to kick it towards the goal. This artefact results from the Identify Architecture activity, where the system-to-be is decomposed into sub-actors and the agents of the system are identified. It can be seen that the main actor involved in this part of the system is NAO F, and the agent identified as responsible for this goal is the MoveManager agent. The TakeBall goal has been decomposed, through the means-end analysis, into two tasks: Walk and Turn; in this case the two tasks are present in the database of cases analyzed and identified during the problem domain phase of the CSDP, and the resources used are those provided by the NAO platform.
4 More details can be found in http://www.pa.icar.cnr.it/passi/PassiExtension/extensionIndex.html
Fig. 4. The Portion of Case and Configuration Related to the TakeBall Goal
Starting from the agent diagram and from all the identified relationships between each goal and the set of tasks to be used to reach it, the designer can perform the Configuration Management phase, during which the database of Configurations and Cases is enhanced with all the behaviours established at design time. Figure 4 shows the portion of the Case and Configuration database related to the presented experiment. During the analysis phase we considered only two possible cases; as previously said, the robot interacts with the environment through its sensors and actuators, so we established two kinds of pre-conditions: vision and sensorial perception. If the robot's goal is to take the ball, it has to walk towards the ball only if two conditions occur: it sees the ball but has not yet touched it, which means that the ball is in front of it (or, more generally, in the robot's vision field). If, instead, the ball is not in the vision field, the robot has to turn around in order to identify its position, and then it can walk towards it. Therefore the pre-conditions for this case are {vision=Y AND sens perception=N} or {vision=N AND sens perception=N}. In order to describe the experiment in a more complete fashion, let us suppose that the database also contains two other cases, one related to the Turn-Left task and the other to the Turn-Right task, with the related pre-conditions. For the Execution phase we used three different setups with the same goal: a) NAO F is in front of the ball, b) NAO F is on the left of the ball, and c) NAO D has moved between NAO F and the ball. Case a): NAO F sees the ball ({vision=Y AND sens perception=N}) - the first Case is selected and NAO F walks towards the ball for a maximum of 4 steps (this parameter is imposed at design time in order to stop the mission and to start the perception loop comparison). Each time the mission stops, the comparison between the perceived scene and the anticipated one is performed; in this case, it is obvious that if the ball has not been reached, NAO keeps going as it was designed to do.
Case b): NAO F does not see the ball ({vision=N AND sens perception=N}) - the second Case is selected and NAO F turns by 5 degrees (this parameter too is fixed at design time in order to stop the mission); when, after a certain number of turns, it sees the ball, it is in case a) and it selects the first Case, moving towards the ball. Case c): while NAO F is performing the actions for case a), NAO D has moved into its trajectory; the stop-mission condition is detected and the comparison reveals a mismatch between the expected and the real situation. This is an unexpected situation for which NAO F was not designed; our aim is to provide means for autonomously deciding what to do in these cases without human intervention. NAO F has to retrieve from its knowledge base, the Case and Configuration database - which in some sense emulates human behaviour based on classic case-based reasoning [12][1] - the most useful Case and the related task, based on the goal it has to satisfy. Case c) can have two possible outcomes: either NAO F finds in the database a Case to be used, or it does not; in the second case, following our rationale for providing the robotic system with self-conscious abilities, NAO F should randomly select a Case among those with the closest goal and perform the related task. Obviously, considering a largely populated database of Cases and a long list of tasks (representing the innate abilities of the robot in general, and of NAO F in this particular experiment), it is not efficient to let NAO F try every kind of task. If the goal is the one stated above, it is surely unfruitful to move the arms or to perform some flexions, so we needed a way to create a taxonomy of the goals. In order to solve this problem we reused our previous experience in providing a robot [6] with the ability of planning a path in a dynamic indoor environment by using Cyc and common sense reasoning [13][11]. This part of the system is designed in the Perception Test Execution phase and, for space reasons, is out of the scope of this paper. It is only worth noting that by using common sense reasoning we are able to create at design time a kind of goal taxonomy that lets the robot select, when necessary, a Case from the set of Cases related to the one it had before the unknown situation, avoiding the risk of moving the arms when it has to move its legs. The behavioural specifications generated by the designed behaviours in Webots are the same as those given to the real NAO, the environment is exactly the same in both cases, and at the end of its mission NAO F does not reveal any differences in its parameters with respect to those of the virtual NAO. When, in the real environment, NAO D is between NAO F and the goal, the expected situation is different from the real one and NAO F starts to retrieve the most useful Case from the database. In the presented experiment NAO D lay still while NAO F was performing its mission; even if this part of the experiment is heavily condensed in the paper, it is worth noting that it allowed us to verify the PASSIC design process activities and the way of saving the winning configuration, enabling the robot to learn.
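A minimal sketch of this Case retrieval logic is given below, under the assumption of a simple Case record with a goal label, pre-conditions and a plan; none of the names come from the actual PASSIC implementation, and the goal taxonomy is reduced to a plain dictionary for illustration.

```python
import random

# Hypothetical Case record and retrieval policy; field names are illustrative only.
class Case:
    def __init__(self, goal, preconditions, plan):
        self.goal, self.preconditions, self.plan = goal, preconditions, plan

def select_case(cases, current_goal, percepts, related_goals):
    """Pick a Case whose pre-conditions match the percepts; otherwise fall back to a
    random Case whose goal is related to the current one (via the goal taxonomy)."""
    matching = [c for c in cases
                if c.goal == current_goal and all(percepts.get(k) == v
                                                  for k, v in c.preconditions.items())]
    if matching:
        return matching[0]
    fallback = [c for c in cases if c.goal in related_goals.get(current_goal, set())]
    return random.choice(fallback) if fallback else None

# Example: the TakeBall cases described in the text (pre-condition values transcribed)
cases = [Case("TakeBall", {"vision": "Y", "sens_perception": "N"}, ["Walk"]),
         Case("TakeBall", {"vision": "N", "sens_perception": "N"}, ["Turn", "Walk"])]
chosen = select_case(cases, "TakeBall", {"vision": "Y", "sens_perception": "N"}, {})
```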
5 Conclusion
The authors have developed in the past some agent-oriented design processes enabling the design of systems working in different application contexts, mainly exploiting the fact that agent-oriented processes can be used as a design paradigm. In this paper an ad-hoc design process (PASSIC) is exploited in order to develop and implement a robotic system able to manage a soccer match in an autonomous fashion. The meaning of autonomy in this case is the capability of the robot to perceive its objectives and execute its tasks without human intervention when it encounters situations not established at design time. The perception loop shows how a robotic system endowed only with its innate capabilities is able to manage and interact with an unstructured environment. PASSIC has been created based on the experience gained in recent years in the construction of ad-hoc design processes and on the use of agent-oriented methodologies for developing and implementing robotic systems. PASSIC allows the design and implementation of the perception loop, thus making the system able to move in its environment and to decide by continuously comparing the differences between the expected and the real situation. In this context the agent paradigm proved to be very useful, in the sense that using PASSIC we were able to identify a society of agents in which a set of agents was devoted to managing the perception loop; moreover, agents' peculiarities such as autonomy, proactivity and situatedness perfectly fitted our case. The proposed experiment allowed us to test the usability of the whole development process, and in particular of PASSIC in the part regarding parameter and configuration tuning; many problems to be taken into account in the proposed setup, such as real-time contingencies and interaction among a society of robots, will be treated in the future, and we are now deepening and refining what is related to the taxonomy of the goals in order to improve and speed up the selection of Cases.
Acknowledgment

This research has been partially supported by the EU project FP7-Humanobs and by the FRASI project managed by MIUR (D.M. n593).
References

1. Aamodt, A., Plaza, E.: Case-based reasoning. In: Proc. MLnet Summer School on Machine Learning and Knowledge Acquisition, pp. 1–58 (1994)
2. Alami, R., Chatila, R., Fleury, S., Ghallab, M., Ingrand, F.: An Architecture for Autonomy. The International Journal of Robotics Research 17(4), 315 (1998)
3. Bresciani, P., Giorgini, P., Giunchiglia, F., Mylopoulos, J., Perini, A.: Tropos: An agent-oriented software development methodology. Autonomous Agents and Multi-Agent Systems 3(8), 203–236 (2004)
4. Brinkkemper, S., Lyytinen, K., Welke, R.: Method engineering: Principles of method construction and tool support. International Federation for Information Processing 65, 336 (1996)
5. Chella, A.: Towards robot conscious perception. In: Chella, A., Manzotti, R. (eds.) Artificial Consciousness. Imprint Academic, Exeter (2007)
6. Chella, A., Cossentino, M., Fisco, M., Liotta, M., Rossi, A., Sajeva, G.: Simulation based planning and mobile devices in cultural heritage robotics. In: Proc. of the Robotics Workshop of the Ninth Conference of the Associazione Italiana per l'Intelligenza Artificiale (September 2004)
7. Chella, A., Cossentino, M., Seidita, V.: Towards a Methodology for Designing Artificial Conscious Robotic System. In: Samsonovich, A. (ed.) Proc. of AAAI Fall Symposium on Biologically Inspired Cognitive Architectures BICA 2009. AAAI Press, Menlo Park (2009)
8. Chella, A., Macaluso, I.: The perception loop in Cicerobot, a museum guide robot. Neurocomputing 72, 760–766 (2009)
9. Cossentino, M.: From requirements to code with the PASSI methodology. In: Agent Oriented Methodologies, ch. IV, pp. 79–106
10. Cossentino, M., Seidita, V.: PASSI2 - going towards maturity of the PASSI process. Technical Report ICAR-CNR (09-02) (2009)
11. Cycorp Inc., Austin, TX: Cyc home page, http://www.cyc.com
12. Kolodner, J.: Case-Based Reasoning. Morgan Kaufmann Publishers, Inc., San Mateo (1993)
13. Lenat, D.B.: Cyc: a large-scale investment in knowledge infrastructure. Commun. ACM 38(11), 33–38 (1995)
14. Ralyté, J.: Towards situational methods for information systems development: engineering reusable method chunks. In: Procs. 13th Int. Conf. on Information Systems Development. Advances in Theory, Practice and Education, pp. 271–282 (2004)
15. Rockwell, W.: Neither Brain nor Ghost. MIT Press, Cambridge (2005)
16. Seidita, V., Cossentino, M., Gaglio, S.: Adapting PASSI to support a goal oriented approach: a situational method engineering experiment (2007)
17. Seidita, V., Cossentino, M., Hilaire, V., Gaud, N., Galland, S., Koukam, A., Gaglio, S.: The metamodel: a starting point for design processes construction. International Journal of Software Engineering and Knowledge Engineering (2009) (in press)
18. Yu, E.: Towards modeling and reasoning support for early-phase requirements engineering. In: Proc. RE 1997 - 3rd Int. Symp. on Requirements Engineering, Annapolis, pp. 226–235 (1997)
K-Means Clustering as a Speciation Mechanism within an Individual-Based Evolving Predator-Prey Ecosystem Simulation

Adam Aspinall* and Robin Gras
{aspina1,rgras}@uwindsor.ca
Abstract. We present a new method for modeling speciation within a previously created individual-based evolving predator-prey ecosystem simulation. As an alternative to the classical speciation mechanism originally implemented, k-means clustering provides a more realistic method for modeling speciation that, among other things, allows for the recreation of the species tree of life. This discussion introduces the k-means speciation, presents the improvements it provides, and compares the new mechanism with the traditional method of speciation.
1 Introduction

While the use of individual-based models continues to rise, to our knowledge there has been very little detailed study on the simulation of various speciation methods within an evolving individual-based ecosystem. Among the few such simulations, J. H. Holland [6] presented Echo, a platform for modeling complex adaptive agents that are able to collect resources and move to neighboring sites. However, both the organisms and the speciation methods in Holland's platform are quite simple, and Hraber et al. [7] have shown that Echo did not match "exactly with quantitative predictions" when they compared the output data on species diversity with real data observed in nature. Another artificial life system is Avida [1] which, within a 2D geometry, models cells, the interactions between them, the breeding of cells, and their ability to adapt. In Avida, a genome (which the authors refer to as a "string") is represented as an entirely separate piece of code running on its own virtual computer. During self-replication (or reproduction), a string may be subject to mutations either during or immediately after the copy method is performed, resulting in a new string to be placed in a nearby cell. This, the authors note, "is the driving force of evolutionary change and diversity" and, in fact, is similar to the evolutionary mechanism we implemented in our evolving predator-prey ecosystem. However, there is no explicit speciation mechanism implemented in this simulation. In this article, we re-introduce our evolving predator-prey ecosystem simulation with a focus on the definition of our behavioral model and how it is used to cluster individuals into species. Subsequent to this re-introduction, we present our new
* Corresponding author.
method for speciation and include a discussion about its complexity, its benefits, and its results.
2 Evolving Predator-Prey Simulation

Gras et al. [5] developed and continue to study an individual-based evolving predator-prey ecosystem simulation that includes a behavior model using a fuzzy cognitive map (FCM) [8]. In our simulation, complex adaptive agents (or, simply, individuals) are either prey or predators, and they live in a world implemented as a 1000 × 1000 matrix of cells. In addition to prey and predators, every cell in our world may contain some amount of grass and meat, which can be eaten by the prey and predators, respectively. Individuals in our simulation make decisions based on their behavioral model, which is represented by a fuzzy cognitive map (FCM) [8]. In addition to being a mechanism for decision making, the FCM is the basis for our evolutionary platform, and it is also the object we use to cluster individuals into species. In our simulation, an FCM is a directed graph that contains a set of nodes, C, where each node, Ci, is a concept, and a set of edges, L, where each edge, Lij, represents the influence of concept Ci on concept Cj. Every edge in L has a weight, w, such that a positive value corresponds to an excitation caused by one concept onto another, and a negative value corresponds to an inhibition. An edge, Lij, may exist with weight 0, which represents the lack of influence of concept Ci on Cj. Moreover, a value ai is associated with every concept Ci. Thus, our FCM allows for the representation of concepts that may be updated by an individual's perception of the world around it, such as the distance to nearby friends, foes, and food, and allows for the computation of a decision of action for the agent depending on its perceptions and its internal states. The matrix of all the weights, Lij, which describes unambiguously the behavioral model of an agent, is considered in our simulation to be the agent's genome. We define a distance function, D(F1, F2), which computes and returns the numerical distance between two FCMs, F1 and F2 – a sum of the distances between the weights of matching edges in L1 and L2, the edge matrices for F1 and F2, respectively. We use this computation of distance between two FCMs as a method of clustering individuals into similar groups, which represent species. Our simulation represents a species as a set of individuals sharing similar genomes. Indeed, every member of a species has a fuzzy cognitive map that is within a threshold distance of the species' FCM – an average of the FCMs of its members. Our simulation iterates within an infinite loop such that every execution of the loop represents a single time step in which every agent computes its internal concepts, chooses an action, and carries out its chosen action. We consider a time step to represent a long period of time, and it is the accumulation of many small actions throughout these time steps that demonstrates an evolutionary process.
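The paper specifies D(F1, F2) only as a sum of the distances between matching edge weights; the sketch below assumes the per-edge distance is an absolute difference, which is our assumption rather than a stated choice of the authors.

```python
import numpy as np

def fcm_distance(L1, L2):
    """D(F1, F2): sum over matching edges of the distance between their weights.
    The per-edge distance is assumed here to be the absolute difference."""
    return float(np.sum(np.abs(np.asarray(L1, dtype=float) - np.asarray(L2, dtype=float))))
```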
3 Classical Speciation

In the original implementation, our simulation used a basic mechanism for speciation, called our classical speciation model. For every newborn prey or predator, the
numerical distance (a representation of a genetic distance) between the newborn's FCM and the FCMs of every existing species was calculated. If the distance to the closest existing species was below a predefined threshold in the simulation's parameter file, the newborn was assigned to that existing species. Otherwise, a new species, S, was created with the newborn as its only member. In subsequent time steps, existing individuals may switch to species S if the genetic distance between the individual and S is smaller than the genetic distance between the individual and its current species. This implementation has several limitations. Our classical speciation created a rigid rule that every new species had an initial size (number of members) of 1. It does not model the basic principle of speciation – that every new species is the result of a splitting of an existing species. Indeed, "most biologists agree that discrete clusters [of organisms] exist" [2] and that these clusters form discrete, or near-discrete, species. This phenomenon is observed in our 2-means speciation method. In addition to this limitation, we were not able to recreate a well-structured tree of life for species. Because our simulation allows for the rare occurrence of interbreeding, a newborn, N, may have parents from two different species, S1 and S2. If the newborn individual forms a new species, S3, then species S3 will have two parents. With this design, we were not able to extract from our data a tree of life that could represent any kind of species splitting. Our classical method for speciation was also one of the most computationally expensive parts of the simulation. For all individuals, the distance, D, between their FCM and the FCMs of every existing species was calculated. Moreover, the new map of every species was recalculated. Suppose there are $N_{t_1,1}$ prey and $S_{t_1,1}$ prey species during time step $t_1$. Then, in our classical speciation mechanism, the complexity of determining the existence of new prey species is $O(N_{t_1,1} S_{t_1,1})$. This is repeated for checking the emergence of new predator species, resulting in a combined complexity of $O(N_{t_1,1} S_{t_1,1} + N_{t_1,2} S_{t_1,2})$, where $N_{t_1,2}$ is the number of predators during time step $t_1$ and $S_{t_1,2}$ is the number of existing predator species during time step $t_1$.
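A minimal sketch of this assignment rule follows, assuming an illustrative Species record and the fcm_distance helper sketched above; neither belongs to the actual simulation code.

```python
from dataclasses import dataclass, field

@dataclass
class Species:            # illustrative record, not the simulation's data structure
    fcm: object           # average edge-weight matrix of the members
    members: list = field(default_factory=list)

def classical_speciation(newborn, species_list, threshold):
    """Classical rule: join the genetically closest species if within the threshold,
    otherwise found a new species containing only the newborn."""
    closest = min(species_list, key=lambda s: fcm_distance(newborn.fcm, s.fcm),
                  default=None)
    if closest is not None and fcm_distance(newborn.fcm, closest.fcm) < threshold:
        closest.members.append(newborn)
    else:
        species_list.append(Species(fcm=newborn.fcm, members=[newborn]))
```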
4 K-Means Clustering for Speciation

As an alternative mechanism for clustering individuals into similar groups, we implemented a k-means clustering technique designed to allow for (1) the splitting of an existing species S into S1 and S2, and (2) the clustering of the individuals that initially belonged to S into either S1 or S2 (thus, more specifically, a 2-means clustering algorithm). In this implementation, every newborn individual, I3, which is created as the result of reproduction between individuals I1 and I2, is added to the closer of S1 or S2, the species of I1 and I2, respectively. Speciation then occurs later in the time step. Our speciation method begins by finding the individual in a species S with the greatest distance from the species' FCM. If this distance is greater than a predefined threshold for speciation, 2-means clustering is performed. Otherwise, species S remains unchanged. If clustering is to be performed, two new species are created – one centered around a random individual in S, denoted Ir, and another centered around the individual in S that is farthest from Ir, denoted If. Subsequently, all remaining
individuals in S are added to one of the two new species – whichever species the individual is more genetically similar to. After recalculating the new FCMs for the two new species, the process of clustering is repeated for convergence. After the 2-means clustering is completed, two new species exist, S1 and S2, whose members are a subset of the original members of S. The closer of S1 or S2 to the original species S inherits the properties of species S, such as the species ID and the ID of its parent species. Thus, one of the new species will continue to represent the original species while the other will represent a split off of the original species. Consider 2-means speciation for prey only1. The first part of speciation is to determine whether or not clustering should take place. For each prey species, S, and for every individual, I, within S, the distance D(I, S) is calculated. Clearly, this iterates over the number of prey species and the number of prey in each species – which together make up the total number of prey in the entire world. Thus, this part of our 2-means prey speciation takes $O(N_{t_0,1})$ time, where $N_{t_0,1}$ is the total number of existing prey at time $t_0$. Selecting two individuals – one randomly, and the other as the individual that is genetically furthest from the randomly chosen individual – and creating two new species centered around these two individuals takes O(k) time. All remaining individuals in the current species, S, are then grouped into one of the two new species, S1 or S2. This grouping takes O(|S| - 2), or simply, O(|S|). If we suppose that there are P1 prey in prey species S1 and that the size of the matrix L in each prey is n1 × m1, then the recalculation of the FCM for prey species S1 has complexity O(P1n1m1). This, combined with the recalculation of the map for prey species S2, gives a total complexity of O(P1n1m1 + P2n2m2), where P2 is the number of prey in prey species S2 and n2 × m2 is the size of the matrix L for each of the prey in prey species S2. However, because the size of L is constant throughout the simulation, this complexity can be reduced to O(P1 + P2) or, more simply, O(N1). The overall complexity is then:
$$O(N_{t_1,1} + N_{t_1,2}) + \sum_{i=1}^{S_{t_0,1} + S_{t_0,2}} P_i$$
The above expression is smaller than $O(N_{t_1,1} S_{t_1,1} + N_{t_1,2} S_{t_1,2})$. As the sum is smaller than $O(N_{t_0,1} + N_{t_0,2})$, simply because the process of speciation is only applied to a subset of the existing species, the total complexity is $O(N_{t_1,1} + N_{t_1,2})$.
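The split itself can be sketched as below, reusing the Species record and fcm_distance helper from the earlier sketches; the handling of species IDs (the closer cluster inherits the identity of S) is omitted, and all names are illustrative.

```python
import random
import numpy as np

def average_fcm(members):
    """Species FCM: element-wise average of the members' edge-weight matrices."""
    return sum(np.asarray(m.fcm, dtype=float) for m in members) / len(members)

def split_species(species, threshold, n_iter=10):
    """2-means split, triggered when some member is farther than the speciation
    threshold from the species FCM; returns (S1, S2) or None if no split occurs."""
    centre = average_fcm(species.members)
    if max(fcm_distance(m.fcm, centre) for m in species.members) <= threshold:
        return None
    seed_a = random.choice(species.members)
    seed_b = max(species.members, key=lambda m: fcm_distance(m.fcm, seed_a.fcm))
    c1, c2 = np.asarray(seed_a.fcm, float), np.asarray(seed_b.fcm, float)
    for _ in range(n_iter):                  # repeated assignment for convergence
        g1 = [m for m in species.members
              if fcm_distance(m.fcm, c1) <= fcm_distance(m.fcm, c2)]
        g2 = [m for m in species.members if m not in g1]
        if not g1 or not g2:                 # degenerate split: keep S unchanged
            return None
        c1, c2 = average_fcm(g1), average_fcm(g2)
    return Species(fcm=c1, members=g1), Species(fcm=c2, members=g2)
```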
5 Comparing the Number of Species

A recent execution of the simulation using the 2-means speciation method produced approximately 20,000 generations (time steps) in eighteen days (compared to 7,112 time steps for the classical speciation after two and a half months). Table 1 displays the average number of prey and predator species during two executions of the simulation. Although the simulation is a large, complex, and evolving system, and although many of the data series show oscillations with high amplitude, there is a strong correlation between many of the dependent properties.
1 The concepts are easily applied to predator speciation.
Table 1. Average number of species for executions using both speciation methods

Speciation method      Average number of prey species   Average number of predator species
Classical speciation   18                               34
2-mean speciation      24                               12
When speaking about the size of a species (the quantity of members), it is more useful to use relative sizes by comparing the average size of a species per time step, T, to the total quantity of individuals during T. The average prey species size during classical speciation was 10.83% of the total population size of the prey. This value is nearly double 5.72%, which is the average prey species size during 2-means speciation. This degree of difference demonstrates that the 2-means speciation mechanism produced, on average, many more prey species of smaller size relative to the quantity of prey individuals. According to [3], which discusses species abundance, we would expect to observe this exact phenomenon. Indeed, it is "widely observed by ecologists that species are far from being equally abundant" [4]. Instead, more species are represented with fewer individuals. It is interesting, as well, to recognize that although 2-means speciation produced many more prey species (of smaller sizes), very little difference was seen for predator speciation. During our classical speciation, the average predator species size was 8.43%, while during 2-means speciation the average predator species size was 11.26%. This suggests that 2-means speciation produced many fewer predator species but with slightly larger quantities than observed with our classical speciation mechanism. Classical speciation produced a maximum of 42 living prey species (at time step 996) and 83 living predator species (at time step 1309). 2-means speciation produced a maximum of 48 living prey species (at time step 659), which is quite similar to that of our classical speciation. However, 2-means speciation produced a maximum of just 22 living predator species (at time step 17,199), which is much smaller than the number of predator species created by our classical speciation method. These measurements are summarized in Table 2.
Table 2. Comparing our classical speciation vs. our 2-mean speciation

                                              Classical speciation   2-mean speciation
Avg. prey species size                        18                     34
Avg. predator species size                    24                     12
First split of prey species (time step)       407                    296
First split of predator species (time step)   406                    535
Max. number of living prey species            42                     48
Max. number of living predator species        83                     22
Fig. 1. Prey (grey) and prey species (black) data series from 2-means speciation
Fig. 2. Cross-correlation between the number of prey and the number of prey species for -250 ≤ d ≤ 250
Figure 2 demonstrates the dependency between the prey and prey species data series. This dependency has been widely discussed and is the basic principle of Fisher's log series – a species abundance distribution model discussed in [3], which proposed a method for calculating the dependent relationship between the size of a community and the total number of species within the community. Figure 2 shows the cross-correlation function between the two data series in Figure 1, which demonstrates a dependency between the number of prey and the number of prey species. In fact, the strong positive correlation between the number of prey and the number of prey species is at a maximum at a lag of approximately 50 time steps. This suggests that, as the quantity of prey individuals increases, so does the quantity of prey species 50 time steps later. Figure 3 shows the total number of prey species for both speciation mechanisms. It is clear that our 2-means speciation produced very similar results to our classical speciation method. What is more interesting, however, is the result shown in Figure 4. Here, it is clear that our 2-means speciation method produces more appropriate quantities of predator species.
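The lagged dependency shown in Figure 2 can be reproduced with a standard normalized cross-correlation; the snippet below assumes the two per-time-step counts are available as plain arrays, which is our assumption about how the logged data would be consumed rather than part of the simulation.

```python
import numpy as np

def cross_correlation(x, y, max_lag=250):
    """Normalized cross-correlation of two series for lags d in [-max_lag, max_lag]."""
    x = (np.asarray(x, float) - np.mean(x)) / np.std(x)
    y = (np.asarray(y, float) - np.mean(y)) / np.std(y)
    n = len(x)
    lags = list(range(-max_lag, max_lag + 1))
    corr = []
    for d in lags:
        if d >= 0:
            corr.append(np.mean(x[:n - d] * y[d:]))   # y shifted later by d steps
        else:
            corr.append(np.mean(x[-d:] * y[:n + d]))
    return lags, corr

# e.g. lags, corr = cross_correlation(num_prey, num_prey_species)
#      peak_lag = lags[int(np.argmax(corr))]   # expected near +50 per Figure 2
```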
Fig. 3. Number of prey species using classical speciation (black) and 2-mean speciation (grey)
Fig. 4. Number of predator species using classical speciation (black) and 2-mean speciation (grey)
6 Species' Sizes

We refer to size here in two ways: (1) the size of a species S is the number of individuals in S (the same meaning of size used in Table 2), and (2) the spatial size of a species S is the average of the pairwise physical distances in the world between the individuals in S. Our definition of a species' spatial size allows us to comparatively measure the amount of space in the world that a species occupies. The average spatial sizes of prey species for the first 2,000 time steps of both speciation mechanisms are shown in Figure 5. The spatial sizes of prey species are very similar for both speciation methods, despite there being a decrease in the average number of individuals per prey species during 2-means speciation. This suggests that our 2-means speciation method produced, on average, more prey species with fewer members that were as closely grouped as the species produced by our classical speciation.
Fig. 5. Average spatial size of prey species using classical speciation (black) and 2-mean speciation (grey)
Fig. 6. Spatial sizes of prey species 26, 81, and 105
In an execution of our simulation using the 2-mean speciation method, prey species 26 split into species 26 and species 81 (at time step 1036). During time step 1586, prey species 81 split into species 81 and 105. This splitting can be seen in Figure 6, which displays the spatial sizes of prey species 26, 81, and 105. It can be seen that when a species splits, the splitting species experiences a reduction in its spatial size. We may deduce from Figure 6 that there is a correlation between the spatial distances and the genetic distances between individuals in a species. This gives us another criterion with which we may compare our speciation methods.
7 Physical vs. Genetic Distance

Introduced in 1943 by Sewall Wright, "isolation by distance" is a biological theory that suggests a positive correlation between physical distances and genetic differences. Subsequent authors, including Kimura and Weiss (in 1964), Nagylaki (in 1976), and the authors of [9], have continued to study this phenomenon, the last of
whom demonstrated that, on samples of genes from two populations, it is possible to identify isolation by distance. For every pair of individuals in a species, (I1, I2), measuring the physical distance and the genetic distance between I1 and I2 provides some evidence of isolation by distance. As depicted in Figures 7 and 8, for some prey species it can be seen that as the physical distance between two individuals, I1 and I2, increases (the x-axis), so does the genetic distance between I1 and I2 (the y-axis).
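The scatter plots in Figures 7-9 pair, for every two members of a species, their physical distance in the world with their genetic distance. A sketch of that pairing follows; it assumes each individual carries a grid position (x, y) and its FCM edge-weight matrix (attribute names are ours), and it reuses the fcm_distance helper sketched in Section 2.

```python
import numpy as np
from itertools import combinations

def distance_pairs(members):
    """Physical vs. genetic distance for every pair of individuals in one species."""
    physical, genetic = [], []
    for a, b in combinations(members, 2):
        physical.append(np.hypot(a.x - b.x, a.y - b.y))   # position in the 1000 x 1000 grid
        genetic.append(fcm_distance(a.fcm, b.fcm))
    return np.array(physical), np.array(genetic)

# A positive Pearson correlation is evidence of isolation by distance:
# phys, gen = distance_pairs(species.members)
# r = np.corrcoef(phys, gen)[0, 1]
```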
Fig. 7. Physical distance vs. genetic distance of individuals in prey species 141 during time step 3075
Fig. 8. Physical distance vs. genetic distance of individuals in prey species 92 during time step 3075
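The pairwise measurement described above can be sketched as follows; the genetic_distance callable stands in for the genetic (FCM) distance used by the simulation and, like the input arrays, is an assumption of this illustration.

# Sketch: for every pair of individuals, record physical and genetic distance,
# then report their Pearson correlation (a positive value is evidence of
# isolation by distance, as in Figs. 7-9).
import numpy as np

def isolation_by_distance(positions, genomes, genetic_distance):
    phys, gen = [], []
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            phys.append(np.linalg.norm(np.asarray(positions[i], dtype=float)
                                       - np.asarray(positions[j], dtype=float)))
            gen.append(genetic_distance(genomes[i], genomes[j]))
    return float(np.corrcoef(phys, gen)[0, 1]), phys, gen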
For prey species 92, there is evidence of genetic differences between the two clusters of individuals in the species that are physically isolated from each other (figure 8). Particularly interesting, prey species 11 demonstrates some genetic isolation among the individuals of the species (figure 9). Visualizing the physical location of individuals within the world helps us identify a relationship between the physical locations of individuals and their genetic similarity. Figure 10 depicts the physical locations of individuals in prey species 286 and 425, before and after splitting. It can be seen that the individuals of the new cluster of genetically similar individuals, which forms the new prey species, are also physically located near each other.
Fig. 9. Physical distance vs. genetic distance of individuals in prey species 11 during time step 3075
Fig. 10. (Left) Prey species 286 during time step 4546, (Right) Prey species 286 (black) and prey species 425 (grey) during time step 4547
8 Intra- and Inter-cluster Distances
Calculating intra- and inter-cluster distances is one method of illustrating and measuring cluster “compactness.” Recall, however, that because our classical speciation method is not designed to allow for species splitting, this measurement of cluster compactness before and after species splitting can only be used with data from our 2-means speciation method. As we would expect, immediately before the splitting of a species S (such as species 286 in figure 10), there is a high value for the intra-cluster distance. This reflects the fact that there is at least one pair of individuals I1 and I2 within S such that D(I1.FCM, I2.FCM) is beyond our speciation threshold. For prey species 286, the largest genetic distance from any individual to the center of the species is 2.91041. Moreover, there exists a pair of individuals in species 286 whose genetic distance is 6.12, the greatest over all pairs of individuals in the species. After prey species 286 split, the largest genetic distance from any individual to the center of the species was reduced to 1.8744, and the greatest distance between any pair of individuals was reduced to 4.11. The new species, species number 425, is even more compact. The greatest distance from an individual
in species 425 to the center of species 425 is just 1.3886, and the largest distance between any pair of individuals in species 425 is 3.2. These results, which suggest that after a species S_t splits into S1_{t+1} and S2_{t+1} the two new species are more compact than the parent species, are illustrated in figure 11.
Fig. 11. Measuring the compactness of prey species 286 during time step 4546 (left), 286 during time step 4547 (middle), and 425 during time step 4547 (right)
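The two compactness measures used in this section (the largest genetic distance from any individual to the species center, and the largest pairwise genetic distance) can be computed as in the sketch below, which assumes each genome is a numeric vector; it is an illustration, not the simulation's code.

# Sketch: compactness of a species in genetic space.
import numpy as np
from scipy.spatial.distance import pdist

def compactness(genomes):
    genomes = np.asarray(genomes, dtype=float)
    center = genomes.mean(axis=0)
    max_to_center = float(np.linalg.norm(genomes - center, axis=1).max())
    max_pairwise = float(pdist(genomes).max()) if len(genomes) > 1 else 0.0
    return max_to_center, max_pairwise

# Comparing compactness(...) for species 286 before and after the split, and
# for species 425, reproduces the kind of comparison illustrated in Fig. 11.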
9 Conclusion
We presented a new speciation mechanism implemented in our existing individual-based evolving predator-prey ecosystem simulation. Not only does our 2-means speciation method have a lower complexity than our classical speciation method, it also more accurately models the species-splitting phenomena observed in nature. Under our classical speciation method, every newly emerged species S contained just a single individual. In contrast, our 2-means speciation allows for the splitting of an existing species S into S1 and S2. This capability allows us to reconstruct the species tree of life and to demonstrate what effect the emergence of a new species has on the spatial size of the original species. Additionally, we are not limited to studying species in terms of their sizes. The improvements we have made to our simulation and to our speciation mechanism provide us with the ability to track back to the origin of every species. We have access to the complete state of a species during the time step when it split, including its average energy level and the levels of every internal concept, such as its level of fear, annoyance, and satisfaction. This detail about every species during every time step allows us to study the dependency between, for example, the average fitness level of a species S during time step t1 and whether species S split during time step t1. This dependency, and others like it, will allow us to predict the occurrence of new species. We are beginning to study the dependencies between the states of a species by building a training set from our results and applying various machine learning techniques to predict future speciation events.
Acknowledgements. This work is supported by the NSERC grant ORGPIN 341854, the CRC grant 950-2-3617 and the CFI grant 203617, and is made possible by the facilities of the Shared Hierarchical Academic Research Computing Network (SHARCNET: www.sharcnet.ca).
References
1. Adami, C., Brown, C.T.: Evolutionary Learning in the 2D Artificial Life System ‘Avida’. Artificial Life, 377–381 (1994)
2. Coyne, J.A., Orr, H.A.: Speciation. Sinauer Associates Inc., Sunderland (2004)
3. Devaurs, D., Gras, R.: Species abundance patterns in an ecosystem simulation studied through Fisher’s logseries. Simulation Modelling Practice and Theory, 100–123 (2010)
4. Fisher, R.A., Corbet, A.S., Williams, C.B.: The relation between the number of species and the number of individuals in a random sample of an animal population. The J. of Animal Ecology, 42–58 (1943)
5. Gras, R., Devaurs, D., Wozniak, A., Aspinall, A.: An Individual-Based Evolving Predator-Prey Ecosystem Simulation Using a Fuzzy Cognitive Map as the Behavior Model. Artificial Life, 423–463 (2010)
6. Holland, J.H.: Hidden order: How adaptation builds complexity. Addison-Wesley, Reading (1995)
7. Hraber, P.T., Jones, T., Forrest, S.: The ecology of echo. Artificial Life, 165–190 (1995)
8. Kosko, B.: Fuzzy cognitive maps. International J. of Man-Machine Studies, 65–75 (1986)
9. Slatkin, M.: Isolation by Distance in Equilibrium and Non-Equilibrium Population. Evolution 54, 1606–1613 (2007)
Improving Reinforcement Learning Agents Using Genetic Algorithms Akram Beigi, Hamid Parvin, Nasser Mozayani, and Behrouz Minaei Computer Engineering School, Iran University of Science and Technology (IUST), Tehran, Iran {Beigi,Parvin,Mozayani,B_Minaei}@iust.ac.ir
Abstract. In this paper a new reinforcement learning algorithm is proposed. Q-learning is a useful algorithm for agent learning in nondeterministic environments, but it is time-consuming. The presented work applies an evolutionary algorithm to improve the reinforcement learning process.
1 Introduction
Reinforcement learning is a computational approach to building agents that learn their behaviors by interacting with an environment (Sutton and Barto, 1998). A reinforcement learning agent senses and acts in its environment in order to learn to choose optimal actions that achieve its goal. It has to discover, by trial-and-error search, how to act in a given environment. For example, a robot may have sensors to perceive the environment state, and actions, such as moving in different directions, to change its state. For each action the agent receives feedback (also referred to as a reward or reinforcement) that distinguishes what is good from what is bad. The agent’s task is to learn a policy, or control strategy, for choosing the actions that best achieve its goal in the long run. For this purpose the agent stores a cumulative reward for each state or state-action pair. The ultimate objective of a learning agent is to maximize the cumulative reward it receives in the long run, from the current state through all subsequent states up to the goal state. Reinforcement learning systems have four main elements (Cuayáhuitl, 2009): a policy, a reward function, a value function, and, optionally, a model of the environment. A policy defines the behavior of the learning agent; it consists of a mapping from states to actions. A reward function specifies how good the chosen actions are; it maps each perceived state-action pair to a single numerical reward. A value function specifies what is good in the long run: the value of a given state is the total reward accumulated in the future, starting from that state. Approaches based on value functions attempt to find a policy that maximizes the return by maintaining a set of estimates of expected returns for one policy. The use of value functions distinguishes reinforcement learning methods from evolutionary methods, which search directly in policy space guided by scalar evaluations of entire policies. The model of the environment is something that mimics the environment’s behavior; a simulated model of the environment may predict the next environment state from the current state and action. The environment is
Q-Learning Algorithm:
  Initialize Q(s,a) arbitrarily
  Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
      Choose a from s using policy derived from Q (e.g., ε-greedy)
      Take action a, observe r, s'
      Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)]
      s ← s'
    Until s is terminal

Fig. 1. Q-Learning Algorithm
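For reference, a minimal tabular implementation of the algorithm in Fig. 1 could look as follows; the environment interface (reset/step) and the parameter values are assumptions of this sketch, not part of the original description.

# Minimal tabular Q-learning sketch. The environment is assumed to expose
# reset() -> s and step(a) -> (s', r, done), with hashable states and integer
# actions 0..n_actions-1.
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < eps:                       # ε-greedy choice
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s2, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q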
usually represented as a Markov Decision Process (MDP) or as a Partially Observable MDP (POMDP) (Vidal 2007; Sutton and Barto, 1998; Shoham et al. 2009). One of the most important reinforcement learning algorithms is Q-learning. In this case, the learned action-value function Q directly approximates Q*, the optimal action-value function, independent of the policy being followed. This simplifies the analysis of the algorithm and enabled early convergence proofs. The Q-learning algorithm is shown in procedural form in Figure 1.
1.1 Markov Decision Process
We have assumed that the agent senses the state of the world and then takes an action which leads it to a new state. We also make the further simplifying assumption that the choice of the new state depends only on the agent’s current state and the agent’s action. This idea is formally captured by a Markov decision process (MDP). An MDP is defined as a 4-tuple <S, A, T, R> characterized as follows: S is the set of states in the environment, A is the set of actions available in the environment, T is the state transition function for state s and action a, and R is the reward function. The optimal solution of an MDP is to take the best action available in each state, i.e., the action that collects as much reward as possible over time.
1.2 Multi-task Learning
Multi-task learning (Wilson et al., 2007) is an approach to machine learning that learns a problem together with other related problems at the same time, using a shared representation. This often leads to a better model for the main task, because it allows the learner to use the commonality among the tasks. Tanaka et al. (2003) proposed multitask reinforcement learning problems defined over a number of instances of Markov Decision Processes sampled from the same probability distributions. The MDP instances are assumed to be sequentially presented to learning agents so that the learning agents
can effectively utilize knowledge acquired from past MDP instances. If the problem space of such instances is huge, evolutionary computation can be more effective than reinforcement learning algorithms, and memory-based evolutionary computation (Handa, 2007) is used for storing past optimal solutions. Dynamic or uncertain environments are crucial issues for evolutionary computation, since evolutionary methods are expected to be an effective approach in such environments. The specific family of reinforcement learning techniques we look at is derived from the Q-learning algorithm for learning in unknown MDPs. Because reinforcement learning is slow, in this work we propose a modified Q-learning algorithm that applies a memory-based evolutionary computation technique to improve learning in multi-task agents.
2 Memory-Based Evolutionary Computation
Handa has used memory-based evolutionary computation for storing past optimal solutions. In that work, each individual in the population denotes a policy for a given task. At an environmental change, the best individual in the current population is inserted into the archive; individuals in the archive are then randomly selected and moved into the population. An overview of memory-based evolutionary programming is depicted in Figure 2.
Fig. 2. An overview of Memory-based Evolutionary Programming
A large number of studies concerning dynamic/uncertain environments have used evolutionary computation algorithms (Goh and Tan, 2009). The aim in these problems is to reach the goal as quickly as possible, and a significant advantage is that the robots can draw on their previous experiences. In this paper a memory-based evolutionary computation approach to multitask reinforcement learning problems is examined.
3 GA-Based RL (GARL) Algorithm
GARL (Jiang, 2007) is an approach for searching for the optimal control policy of an RL problem using GAs. When GAs are used for RL problems, the potential solutions are the policies, expressed as chromosomes, which can be modified by genetic operations such as mutation and crossover. GAs can directly learn decision policies without studying the model and state space of the environment in advance. The only feedback for GAs is the fitness values of the different candidate policies. In many cases, the fitness function can be expressed as the sum of the instant rewards that are used to update the Q-values in value-function-based RL algorithms.
GA-based Reinforcement Learning (GARL):
  Initialize: Construct a population pool randomly with a given policy form.
              The size of the population is Nc (an even number).
  Repeat until the terminal conditions are satisfied:
    (1) Couple: Couple the Nc chromosomes randomly into Nc/2 groups.
        The two chromosomes in each group are called parents.
    (2) For each group:
        (a) Form a family: Produce children from the parents using crossover()
            and mutation(). The number of children is Nchildren.
            Parents and children form a family.
        (b) Evaluate: Calculate the fitness value of each member of the family
            by evaluation().
        (c) Select: Select the two chromosomes in the family with the highest
            fitness values.
        (d) Replace: Replace the old parents of the family in the population
            pool with the two chromosomes selected.
        End for each group.
    (3) Update all the chromosomes in the population pool.
    (4) Start a new generation with the evolutive chromosomes.
  End.

Fig. 3. GA-based Reinforcement Learning (GARL) Algorithm
4 Evolutionary Q-Learning
By applying GARL to reinforcement learning agents in nondeterministic environments, we propose a Q-learning method called Evolutionary Q-learning. The algorithm is presented in Figure 4. The proposed algorithm is examined in a nondeterministic maze.
Evolutionary Q-Learning (EQL):
  Initialize Q(s,a) to zero
  Repeat (for each generation):
    Repeat (for each episode):
      Initialize s
      Repeat (for each step of episode):
        Choose a from s using policy derived from Q (e.g., ε-greedy)
        Take action a, observe r, s'
        s ← s'
      Until s is terminal
      Add the visited path as a chromosome to the population
    Until the population is complete
    Do Crossover() with rate CRate
    Evaluate the created children
    Do tournament Selection()
    Select the best individual and update the Q-table with its state-action pairs:
      Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)]
    Copy the best individual into the next population
  Until convergence

Fig. 4. Evolutionary Q-Learning (EQL) Algorithm
5 Problem Definition and Modeling
Assume that a number of robots are working in a mine and their task is to search for gold starting from an initial point. The mine has a set of corridors through which the robots can pass. On specific paths there are barriers which do not let the robots continue. Now suppose that, because of decayed corridors, there may be pits in some places. If a robot enters such a pit, there is a nonzero probability that its moves fail to take it out of the pit, and if it fails to exit, it must try again. The aim of the robots is to find the gold state as soon as possible. Note that the robots can use their past experiences.
5.1 Problem Space
In this work, an extension of Sutton's maze problem is used. As depicted in Figure 5, Sutton's maze consists of 6 x 9 cells. The problem is composed of 46 states, excluding 1 goal state and 7 collision cells (gray cells). Agents can take four kinds of actions: Up, Down, Left, and Right. The original Sutton's maze problem is deterministic (Sutton and Barto, 1998).
Fig. 5. Modified Sutton's maze
In the nondeterministic version of this problem there are several probabilistic cells (cells marked with the letter H). In these cells, with a certain probability, agents cannot move in accordance with their chosen actions; that is, they stay in the same cell with a probability above zero, sampled from a Normal distribution with mean 0 and variance 1. Agents receive a reward of +1 when they reach the goal state. The sizes of the population and the archive are set to 100.
5.2 Actions
Each action of an agent can be described by an MDP, as delineated in figure 6.
Fig. 6. Two sample actions are presented by MDP models
The right part shows the deterministic case, in which the agent moves to the next state with probability 1 for any chosen action. The left part shows that, for some actions, the agent may fail to move to the next state and remain in its current position. These MDPs are presented to the learning algorithm sequentially, and the presentation time of each problem instance is long enough for learning. The problem is to maximize the total reward acquired over the agent's lifespan. Agents return to the start state after reaching the goal state.
5.3 Learning Algorithm
Considering the type of problem described above, and the fact that a reward is given only when the agent reaches the goal, the original Q-learning algorithm may slow down the learning process and decelerate convergence. In addition, in such problems the shortest path may not be the best path, because it may contain pit cells that force the agent to make many moves before finding the goal state. The optimal path therefore has few pit cells and a short length.
So applying an evolutionary version of Q-learning is more useful. The items applied in this algorithm are as follows:
• Chromosome structure: Each chromosome is an array of the states and actions that the agent visited from the start state to the goal state.
• Value: The value of each chromosome is the average path length.
• Fitness: The fitness is proportional to the inverse of the value.
• Crossover operation: A two-point crossover is used, shown in Figure 7 (a sketch of this operation follows after this list).
Crossover Algorithm:
1. Select two individuals randomly.
2. Select two states in the first individual randomly.
3. Seek the first state, searching from the start state, in the second individual.
4. Seek the second state, searching from the last state, in the second individual.
5. Swap these parts between the individuals.
6. Evaluate these children.
7. Keep the best.

Fig. 7. Crossover Algorithm
• Selection operation: Truncation selection is applied.
• Q-table updating: After running the generations, the best chromosome is selected and its constituent states and actions are used to update the Q-table.
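The crossover operation referred to in the list can be sketched as follows, under the assumption that a chromosome is a list of visited (state, action) pairs from the start state to the goal state; the exact segment handling is one possible reading of the steps in Figure 7, not the authors' implementation.

# Sketch of the two-point path crossover of Fig. 7 (steps 6-7, evaluating the
# children and keeping the best, are left to the surrounding GA loop).
import random

def path_crossover(parent1, parent2):
    states2 = [s for s, _ in parent2]
    # steps 1-2: two cut states chosen at random in the first parent
    i1, i2 = sorted(random.sample(range(len(parent1)), 2))
    s_a, s_b = parent1[i1][0], parent1[i2][0]
    if s_a not in states2 or s_b not in states2:
        return parent1, parent2            # no compatible cut points
    # step 3: first cut state, searched from the start of the second parent
    j1 = states2.index(s_a)
    # step 4: second cut state, searched from the end of the second parent
    j2 = len(states2) - 1 - states2[::-1].index(s_b)
    if j1 > j2:
        return parent1, parent2
    # step 5: swap the enclosed segments between the two parents
    child1 = parent1[:i1] + parent2[j1:j2 + 1] + parent1[i2 + 1:]
    child2 = parent2[:j1] + parent1[i1:i2 + 1] + parent2[j2 + 1:]
    return child1, child2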
6 Experimental Result
A single experiment is composed of 100 tasks, where each task consists of 100 generations. In this work 50 experiments have been done; hence, 500,000 generations have been executed over the 50 experiments. Table 1 summarizes the experimental results:

Table 1. Experimental Result (average path length)

                 Best average of      Worst average of     Total average of
                 path length in       path length in       path length
                 population           population
  Original QL    54.26                1600                 449.1149
  Proposed EQL   28.34                66.79                42.4738
  Improvement    47.7%                95.8%                90.5%
7 Conclusion
Reinforcement learning is a good approach to building agents that learn their behaviors by interacting with an environment. Q-learning is one of the most important reinforcement learning methods; it learns an action-value function that directly approximates the optimal action-value function, independent of the policy being followed. Because Q-learning is slow, we applied genetic algorithms and memory-based evolutionary computation to improve its efficiency in nondeterministic environments. The proposed evolutionary Q-learning achieves an improvement of about 90% over the original Q-learning algorithm in the average case.
References
1. Cuayáhuitl, H.: Hierarchical Reinforcement Learning for Spoken Dialogue Systems. PhD thesis, University of Edinburgh (2009)
2. Goh, K., Tan, K.: Evolutionary Multi-objective Optimization in Uncertain Environments. Springer, Heidelberg (2009)
3. Handa, H.: Evolutionary Computation on Multitask Reinforcement Learning Problems. In: 2007 IEEE International Conference on Networking, Sensing and Control, pp. 685–688 (2007)
4. Jiang, J.: A Framework for Aggregation of Multiple Reinforcement Learning Algorithms. PhD thesis, University of Waterloo (2007)
5. Jin, Y., Branke, J.: Evolutionary optimization in uncertain environments - a survey. IEEE Transactions on Evolutionary Computation 9(3) (2005)
6. Shoham, Y., Leyton-Brown, K.: Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, Cambridge (2009)
7. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
8. Tanaka, F., Yamaura, M.: Multi Task Reinforcement Learning on the Distribution of MDPs. In: Proceedings of the 2003 IEEE International Symposium on Computational Intelligence in Robotics and Automation, vol. 3, pp. 1108–1113. IEEE Press, Los Alamitos (September 2003)
9. Vidal, J.M.: Fundamentals of Multi Agent Systems (2009), http://multiagent.com/2008/12/fundamentals-of-multiagentsystems.html
10. Wilson, A., Fern, A., Ray, S., Tadepalli, P.: Multi-task reinforcement learning: A hierarchical Bayesian approach. In: International Conference on Machine Learning (ICML), Corvallis, OR, USA, pp. 1015–1022 (2007)
Robust and Efficient Change Detection Algorithm Fei Yu, Michael Chukwu, and Q.M. Jonathan Wu Department of Electrical and Computer Engineering University of Windsor, Windsor, ON N9B 3P4 {yu11g,chukwu,jwu}@uwindsor.ca
Abstract. Change detection in temporally related image sequences is a primary tool for extracting and detecting activities in a background scene, with a wide range of applications from security and surveillance to fault detection and power savings. The prevalent methods for change detection are derived from difference extraction, where differences in the gray-level values of the pixels between two or more images are used to estimate and predict these changes. However, this approach and its derived modifications rely largely on the application of value thresholds to give significance to the differences, in order to compensate for the vulnerability of these methods to illumination variability and noise. A frequency-domain approach to change detection is proposed that eliminates the need for thresholds and provides comparatively superior performance to existing algorithms.
1 Introduction
Detection of changes in a scene, in order to trigger an action, track an event, or mark the beginning of a timeline, is a very important tool in computer vision and digital image and video processing. It provides a platform for the selective prioritization of camera feeds for human monitoring and control in security and surveillance applications, and it is used in applications that track and detect activities, errors, abnormal trends, and outliers across several application use cases. Several methods exist for the extraction of image scene changes. One of the most prominent, earliest, and most basic methods for change detection is the signed difference of the two input images, in which the difference of two temporally related images is used for the analysis of the changes. The major pitfall of this approach is its strong vulnerability to variation in the lighting of the scene, as the pixel values change considerably under such variation; in addition, noise in the captured image reduces the reliability of the method, since it is incapable of detecting, compensating for, or eliminating its presence. A required supplement to this method is the application of a defined threshold to the resulting difference, in order to eliminate noise and match the sensitivity requirement of the specific application use case. A related approach to the signed difference is change vector analysis [1], which uses the modulus difference of the feature vectors (representing the magnitude and direction of change) of the image sequences to detect change. The method also
adopts thresholds for its various use cases. Image ratioing [2], [3] is another technique that uses intensity ratios rather than differences to detect changes. Different applications of principal component analysis (PCA) [4], [5] have also been proposed to extract changes across images, with several methods for selecting the principal components that represent the changed parts of the image. Multiple variants and hybrids of these techniques have been proposed in the literature for generic and specific application use cases. However, most of the existing techniques rely heavily on thresholds for performance tuning, and the task of estimating a global threshold for every possible application is arduous if not elusive; hence thresholds are developed for specific sets of image scenes with known characteristics and application requirements. Moreover, the methods [6] of evaluating the appropriate threshold for each use case are mostly experimental, devoid of objective validation and mathematical rigor. This can result in sub-optimal implementations with no guarantee of performance, as the spectral dynamics of the input images may not be reliably and deterministically predicted for each use case. Furthermore, thresholds are a largely inadequate solution to the problem of illumination variation and noise; hence the development of additional pre-processing techniques such as intensity normalization [7], background modeling [8], [9], illumination modeling [10] and image averaging. The proposed method models the problem as the extraction of differences across fine-grained spectral structures of an image sequence; the correlation of the spectral content within a defined spatial boundary is used as the evaluation factor for change classification. The proposition has been iteratively refined and optimized to produce the simple and efficient implementation presented here, which embodies the complete functionality of the method. The technique eliminates the need for thresholds and provides a platform for change detection across wide application areas, with improved performance in comparison to existing generic or application-specific methods. The paper is organized as follows: the first section is this introduction, the second contains details of the proposed method, the experimental analysis is presented in the third, and the conclusion and insights for future work in the fourth.
2 Change Detection Algorithm
The problem of change detection is modeled as the determination of the change mask across an image sequence I1…N, where N = 2. The proposed change detection algorithm presupposes that the images are aligned in the same coordinate system. In an ideal scenario the difference in the image pair could be extracted from the simple signed difference, but noise and illumination distort this picture; hence mechanisms for compensating for these two factors are implemented.
2.1 Noise
To eliminate the impact of noise, we use the spectral content generated from the redundant discrete wavelet transform of the image pair, which provides space-frequency information. The choice of wavelet is predicated upon the core requirement of high spatial resolution; this yields wavelet coefficients influenced by a large number of neighboring pixels, which reduces the effects of noise to minimal deviations in coefficient values across the image pair.
W_s f(x) = f * \left( s\,\frac{d\theta_s}{dx} \right)(x) = s\,\frac{d}{dx}\,(f * \theta_s)(x)    (1)

where \theta_s(x) = \frac{1}{s}\,\theta\!\left(\frac{x}{s}\right) is a smoothing function that integrates to 1 and converges to zero at infinity, and W_s f(x) is a single-step decomposition using the redundant discrete wavelet transform (see Figure 1).
2.2 Illumination Variance
The output of the first step for the two input frames is temporally filtered using the Haar integer reversible high-pass wavelet filter, as shown in Figure 2; by doing this, the illumination variance is mostly removed.
Fig. 1. Redundant discrete wavelet transform
Fig. 2. Temporal filtering of the wavelet sub bands, derived from the two input frames
2.3 Noise – Illumination Compensation
Threshold classification is applied to the temporally filtered coefficients in order to eliminate the near-zero values; a hard magnitude threshold is applied to the output.
3 Implementation
The inverse discrete wavelet transform of the output of step 3 above and the output of an inverse discrete wavelet transform with the filtered approximation coefficients replaced by zero values are summed to produce the motion profile of the two frames. All non-zero values in the motion profile are areas of relative motion across the frames, as shown in Figure 3.
Fig. 3. Generation of the changes across two input frames
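A simplified reading of the pipeline in Sections 2 and 3 can be sketched as follows; it assumes the PyWavelets package (pywt), two grayscale frames of equal, even dimensions, a single decomposition level and an illustrative threshold, so it should be read as a sketch of the idea rather than the authors' implementation.

# Sketch: redundant (stationary) wavelet transform of both frames, temporal
# high-pass across the two frames (difference of corresponding sub-bands),
# suppression of near-zero coefficients, and inverse transform with the
# approximation band zeroed to obtain the motion profile.
import numpy as np
import pywt

def motion_profile(frame1, frame2, wavelet="db2", eps=1e-3):
    cA1, (cH1, cV1, cD1) = pywt.swt2(frame1.astype(float), wavelet, level=1)[0]
    cA2, (cH2, cV2, cD2) = pywt.swt2(frame2.astype(float), wavelet, level=1)[0]
    bands = [(b1 - b2) / 2.0 for b1, b2 in
             [(cA1, cA2), (cH1, cH2), (cV1, cV2), (cD1, cD2)]]
    bands = [np.where(np.abs(b) < eps * (np.abs(b).max() or 1.0), 0.0, b)
             for b in bands]
    dA, dH, dV, dD = bands
    profile = pywt.iswt2([(np.zeros_like(dA), (dH, dV, dD))], wavelet)
    return np.abs(profile)   # non-zero pixels mark areas of relative change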
In Figure 3, the motion profile is used for deterministic search and matching across the input frames to estimate the displacement vectors. The estimation of differences and displacement vectors across the input frames is restricted to the areas of change, as defined by the motion profile. The objective criteria and the shape and size of the matching units (regions or blocks) are user defined, and any matching method can be implemented. An example is shown in Figure 4 below:
Fig. 4. An example of estimation of areas of changes across two input frames
4 Experimental Analysis, Tests and Results
The experimental evaluation of the change detection algorithm was performed in four categories of tests, presented in the following subsections.
4.1 Simple Image Scene
Fig. 5. Output estimation of areas of changes across two simple input frames. (Top: sequences 1, Bottom: sequences 2).
4.2 Complex Image Scene
Fig. 6. Output estimation of areas of changes across two complex input frames. (Top: sequences 3, Bottom: sequences 4).
4.3 Variable Lighting (Alpha Value)
Fig. 7. Output estimation of areas of changes across two input frames with variable lighting
4.4 Noise (3 % Gaussian noise)
Fig. 8. Output estimation of areas of changes across two input frames with 3% Gaussian noise in one of them
5 Conclusion
The proposed change detection method provides a robust and efficient approach to the problem of image change extraction that eliminates the need for experimentally determined value thresholds. The implementation results in the generation of an image of the differences between two input frames with sub-pixel accuracy.
This approach results in a technique that transcends specific application boundaries and the limitations of variations caused by noise and illumination. Future work on the extension of the method to motion estimation and temporal super-resolution is currently being evaluated.
References
1. Malila, W.A.: Change vector analysis: an approach for detecting forest changes with Landsat. In: Proc. of the 6th Annual Symposium on Machine Processing of Remotely Sensed Data, pp. 326–335 (1980)
2. Singh, A.: Digital change detection techniques using remotely-sensed data. International Journal of Remote Sensing 10(6), 989–1003 (1989)
3. Oppenheim, A.V., Schafer, R.W., Stockham Jr., T.G.: Nonlinear filtering of multiplied and convolved signals. Proc. IEEE 56, 1264–1291 (1968)
4. Niemeyer, I., Canty, M., Klaus, D.: Unsupervised change detection techniques using multispectral satellite images. In: Proc. IEEE Int. Geoscience and Remote Sensing Symp., pp. 327–329 (July 1999)
5. Gong, P.: Change detection using principal components analysis and fuzzy set theory. Canadian Journal Remote Sens. 19, 22–29 (1993)
6. Rosin, P.L.: Thresholding for Change Detection. In: Proceedings of the Sixth International Conference on Computer Vision, ICCV, Washington, DC, USA, pp. 274–279 (1998)
7. Toth, D., Aach, T., Metzler, V.: Illumination-Invariant Change Detection. In: 4th IEEE Southwest Symposium on Image Analysis and Interpretation, Austin, TX, USA, April 2-4, pp. 3–7 (2000)
8. Cavallaro, A., Ebrahimi, T.: Video object extraction based on adaptive background and statistical change detection. In: Proc. SPIE Visual Communications and Image Processing, pp. 465–475 (January 2001)
9. Huwer, S., Niemann, H.: Adaptive change detection for real-time surveillance applications. In: Proc. Visual Surveillance, pp. 37–45 (2000)
10. Bromiley, P., Thacker, N., Courtney, P.: Non-parametric image subtraction using grey level scattergrams. Image Vis. Comput. 20(9-10), 609–617 (2002)
Building Users’ Profiles from Clustering Resources in Collaborative Tagging Systems Maya Rupert1 and Salima Hassas2 1 Thompson Rivers University, McGill Rd, Kamloops, BC, Canada 2 LIESP, University Lyon 1, 43 boulevard du 11 novembre 1918, 69622 Villeurbanne cedex, France [email protected], [email protected]
Abstract. In the last few years, the evolution of the Web and its applications has undergone a mutation towards technologies that include the social dimension as a first-class entity, in which the users, their interactions and the emerging social networks are at the center of this evolution. Consider the case of collaborative tagging systems: these systems are an example of complex, self-organized and socially aware systems. The multi-agent systems paradigm, coordinated by self-organization mechanisms, has been used effectively for the design and modeling of complex systems. In this paper, we propose a model for the design and development of a new collaborative tagging system, MySURF (My Similar Users, Resources, Folksonomies), using a multi-agent system approach governed by the co-evolution of the social and spatial organizations of the agents. We show how the proposed system offers several new features that can improve current collaborative tagging systems, including clustering of resources and building users’ profiles. Keywords: Collaborative Tagging Systems, Users’ Profiling, Organizational Multi-Agent Systems.
1 Introduction
The web continues to grow and evolve at a very fast rate, becoming one of the primary sources of information. Over the last ten years, we have witnessed a revolution in the content, usage and structure of the web and its various applications. This evolution is largely driven by the needs of users. The transition of the web from a static source of information (Web 1.0) to a support for activities (virtual communities), and then to a collaborative and social system (Web 2.0), takes the users' activities into account as a principal component of this evolution. The web has gone beyond being just a source of information to becoming a mechanism for dynamic content, supporting social interaction between users. We consider the evolution of the web from a complex systems perspective [1]: this evolution is viewed as the result of the aggregate contributions of hundreds of millions of users, and from these local interactions between users, a global macro behavior has emerged. The concept of "social
machines" was introduced within the vision of "Web Science" [2]. The design of these machines is a collective work that depends on the technology that will enable communities of users to build, share and adapt to these social machines. However, these newly introduced concepts face the following challenge [2]: "What underlying architectural principles are needed to guide the design and efficient engineering of new Web infrastructure components for this social software?" In this paper, we try to answer two questions: How do we integrate the social dimension within the complex systems of the web? Can this integration enhance current collaborative tagging systems?
2 Collaborative Tagging Systems
A new generation of applications has emerged with "Web 2.0", such as wikis, blogs, podcasts and systems for sharing resources among different users. This revolutionary form allows web users to contribute extensively to the content of the web. We are interested in a particular class of resource-sharing systems: collaborative tagging systems and folksonomies, in which users add a tag or keyword to a resource on the Internet. By studying and analyzing existing collaborative tagging systems, we noted the lack of adaptability and customization in these systems. At first glance, these systems provide information and tagged resources added by the users themselves. But by studying these systems in greater depth, and taking into account the power of the social and spatial aspects in these systems, the knowledge that can be extracted goes well beyond a list of resources corresponding to a particular tag. Current systems have many limitations at the information-search level, and the integration of the social dimension ensures their evolution as complex systems. In this work, we took advantage of the co-evolution of the social and spatial organizations in a complex system to develop a collaborative tagging system allowing the emergence of new features. This new system, MySURF (My Similar Users, Resources, and Folksonomies), is based on the retroactive effect of the social on the spatial and vice versa. The objective of this research is to complement existing tagging systems by adding new features that can enhance them, taking into account the complex characteristics of these systems and the strong links that exist between the spatial and social organizations, which, to our knowledge, have not been well explored so far in the design of these systems. Tagging systems are increasingly used for users' profiling, in recommender systems [3, 4] and for personalization [5]. With the absence of hierarchical classification in collaborative tagging systems, problems of vocabulary and semantics become more persistent. There is no hierarchical structure, and the classification of information in these systems suffers from inconsistency in the use of words (what word or tag should be used to best describe a resource) [6]. Users do not use tags consistently; for example, they can use a tag today for a particular resource and use a different tag in the future for the same resource, as their vocabularies and semantics change and evolve over time. When searching for resources by tags, the user must agree with the provider of the tag on the semantics of the resource.
3 An Organizational Multi-Agent System Approach for the Design of a Tagging System
Multi-agent systems (MAS) coordinated by self-organization and emergence mechanisms have been used [7] for the development and design of complex systems, in which the role of the environment has increasingly been taken into consideration as a first-class entity in building MAS. In order to engineer systems capable of "adequate" adaptation to their environment, we propose a coupling between the system and its environment:
- A structural coupling represented by the spatial organization of the MAS.
- A behavioral coupling represented by the social organization of the MAS.
- The co-evolution of both organizations through the MAS dynamics.
This model of representation allows us to develop systems for the web in which the coupling of structure and usage evolution (co-evolution) is made explicit, allowing for the emergence of new practices. The social organization of the system is the social structure in which agents can act and interact with each other. Roles in the system define the behaviours that agents exhibit as part of each role. The agents' perceptions depend on the agents' positions in the environment, and their actions depend on the roles they can play. Agents' indirect communication and coordination are achieved through the stigmergy mechanism and, more particularly, through the diffusion, propagation, and evaporation of a specific digital pheromone. This digital pheromone is viewed as a spatial structure for coding control and meta-control information. The physical environment is represented by a network or graph. Agents are situated in the nodes of the graph, called places. These places form the organizational positions that agents can occupy at the physical level of the environment. The perceptions/actions of these agents are situated in the physical environment. A set of places forms a region, and regions form the organizational units of the spatial organization. As the network topology is highly dynamic, the regions are also dynamic and keep changing over time.
- a qualitative effect: the influence on the choice of the action to be taken by an agent;
- a quantitative effect: the influence on the parameters of the action (such as its position, strength, frequency, latency, duration, etc.), while the nature of the action remains unchanged;
- a qualitative and/or quantitative indirect effect: the influence on the result of the action. This influence indirectly affects the way the action is taken and its result, as a consequence of the changes made to the environment.
The spatial organization, in the form of sub-communities of resources, is affected by this change, leading to a restructuring of the environment. This restructuring will have an effect on the spatial position of the agent in the physical environment materialized by the network of resources. This position influences the choice of the action to be taken by the agent. Consequently, the agent could choose to play the role of Resource Tagger, Resource Searcher or Knowledge Expert. It could also choose to be, for instance, the creator of a new community of users. An agent's behaviour and the role it will play are greatly affected by its position in the spatial organization. Its position is also affected by its actions and the roles it plays in the social organization, and by the different activities in the environment (pheromone presence, etc.). The coupling between the social organization and the spatial organization is retroactive and is expressed in the graph topology.
4 Clustering Resources and Extracting Users' Profiles
4.1 Clustering Resources
We used spectral clustering for grouping sub-communities of resources that share similar content. We adopted the algorithm used in [9] for the emergence of sub-communities of resources, calculating the weight between two resources R1 and R2 as follows:

w_{R_1,R_2} = \frac{\sum_{t \in T_1 \cap T_2} \min(f_t^1, f_t^2)/f_t}{\sum_{t \in T_1 \cap T_2} \max(f_t^1, f_t^2)/f_t + \sum_{t \in T_1 - T_2} f_t^1/f_t + \sum_{t \in T_2 - T_1} f_t^2/f_t}

The numerator is the sum of the minima of the normalized frequencies of the tags used in both resources (the intersection of sets T1 and T2). T1 (respectively T2) is the set of tags associated with R1 (respectively R2), f_t^1 (respectively f_t^2) is the frequency of occurrence of tag t in T1 (respectively T2), and f_t is the global frequency of tag t, i.e., the total number of times that tag t was used over all the resources. The resulting similarity matrix W can be considered as the adjacency matrix of a complex weighted network, in which the arcs of the graph carry a weight proportional to the intensity of the connections between the network elements.
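As an illustration, the weight can be computed directly from per-resource tag frequency dictionaries; the sketch below follows the reconstruction of the formula given above (in particular, the reading that the last two sums run over the tags unique to each resource) and uses assumed input structures.

# Sketch: weight between two resources from their tag frequencies.
# freqs1[t], freqs2[t] are f_t^1, f_t^2; global_freq[t] is f_t.
def resource_weight(freqs1, freqs2, global_freq):
    shared = set(freqs1) & set(freqs2)
    only1 = set(freqs1) - shared
    only2 = set(freqs2) - shared
    num = sum(min(freqs1[t], freqs2[t]) / global_freq[t] for t in shared)
    den = (sum(max(freqs1[t], freqs2[t]) / global_freq[t] for t in shared)
           + sum(freqs1[t] / global_freq[t] for t in only1)
           + sum(freqs2[t] / global_freq[t] for t in only2))
    return num / den if den else 0.0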
In order to visualize the different sub-communities of resources that emerge from the set of all resources related to a particular topic, some transformations of the rows and columns of this matrix were needed. Consequently, the matrix w_{R_1,R_2} was transformed into a matrix Q as follows:

Q = S - W, \quad \text{where } W_{ij} = (1 - \delta_{ij})\, w_{R_i,R_j} \text{ and } S_{ij} = \delta_{ij} \sum_k W_{ik}

S is a diagonal matrix in which every element equals the sum of the corresponding row of W. We study the spectral properties of the matrix Q and determine the number of emergent, semantically distinct sub-communities of resources from the number of smallest distinct non-trivial eigenvalues. The Laplacian matrix L of the graph G (also called the Kirchhoff matrix) is defined as the difference between the degree matrix D and the adjacency matrix W, L = D - W. Let us consider the first smallest eigenvalues of the Laplacian matrix L. The number of these well-separated eigenvalues can indicate the number of possible emerging communities, and a study of the first eigenvectors that correspond to these eigenvalues reveals the structure of these communities. To partition the graph by the eigenvectors of the matrix Q, the number of clusters was detected from the eigenvectors of the Laplacian matrix by studying the correlation between two nodes, since two resources that belong to the same community will be strongly correlated [10]. We calculated the correlation matrix C_{ij} between two nodes i and j based on the following formula:
c_{ij} = \frac{\langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle}{\sqrt{\left( \langle x_i^2 \rangle - \langle x_i \rangle^2 \right)\left( \langle x_j^2 \rangle - \langle x_j \rangle^2 \right)}}

where x_i and x_j are the components of the first few nontrivial eigenvectors associated with nodes i and j, and the angle brackets denote the average over these components. The correlation coefficient c_{ij} measures the proximity between two nodes i and j. Based on the analysis of this matrix, clusters of resources are retrieved. When searching for resources that are assigned a specific tag, our system clusters these resources into sub-communities. For example, if the user searches for resources tagged with 'programming', the system will arrange the resources into sub-communities such as 'web programming', 'java programming', 'ajax programming', etc., while existing systems simply display all resources associated with the tag 'programming'. Our system thus significantly improves the results of a search over tagged resources. Once the system applies the clustering algorithm described above to a particular set of resources, these resources are assigned system tags composed of a topic-subtopic combination (e.g., 'web' and 'programming', or 'java' and 'programming'). The system tags are auto-suggested to new users who are about to tag a resource that has already been assigned system tags. This application of a suggested tag is viewed as a pheromone used to reinforce the traces left by the agents when coming across a particular resource.
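Putting the pieces of this subsection together, the spectral step can be sketched as follows: build W and S from the pairwise weights, form Q = S - W, take the first few nontrivial eigenvectors and compute the node-node correlation matrix C. The sketch assumes a dense NumPy weight matrix, the number of retained eigenvectors is an illustrative choice, and it is not the MySURF implementation.

# Sketch: spectral analysis of the resource similarity graph.
import numpy as np

def community_correlations(W, n_vectors=3):
    W = np.asarray(W, dtype=float)
    np.fill_diagonal(W, 0.0)                 # W_ij = (1 - delta_ij) w_ij
    S = np.diag(W.sum(axis=1))               # S_ii = sum_j W_ij
    Q = S - W                                # Laplacian-style matrix
    eigvals, eigvecs = np.linalg.eigh(Q)     # ascending eigenvalues
    X = eigvecs[:, 1:1 + n_vectors]          # skip the trivial (constant) one
    xc = X - X.mean(axis=1, keepdims=True)   # center each node's profile
    cov = xc @ xc.T / X.shape[1]
    std = np.sqrt(np.clip(np.diag(cov), 1e-12, None))
    C = cov / np.outer(std, std)             # C_ij as in the formula above
    return eigvals, C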
4.2 Virtual Groups of Users
A virtual group is a set of users sharing the same interest in a particular topic. Users are grouped into virtual groups based on their tagging history, and newly added resources are categorized into virtual groups based on users' early tagging behaviour. This provides an advantage to users, as they become aware of the social network earlier and can choose to interact with the network and add more tags and resources to share within the virtual groups. For each sub-community of resources, the corresponding users who have already tagged these resources are grouped into 'virtual groups' based on the topic of the sub-community's resources. For example, all users interested in resources related to 'Web Design' belong to the same virtual group 'Web Design'.
4.3 Personomies for Each User
A personomy is defined as the set of resources and tags that are associated with a particular user. A folksonomy is formally defined as follows [11]:
Definition 1: A folksonomy F is a tuple F = (U, T, D, A) in which U is the set of users, T is the set of tags, D is the set of web documents, and A ⊆ U × T × D is the set of annotations.
Definition 2: The personomy Pu of a user u is the restriction of the folksonomy F to u: Pu = (Tu, Du, Au), in which Au is the set of the user's annotations, Au = {(t, d) | (u, t, d) ∈ A}; Tu is the set of the user's tags, Tu = {t | (t, d) ∈ Au}; and Du is the set of the documents tagged by the user, Du = {d | (t, d) ∈ Au}.
For each virtual group of users, we suggest retrieving the personomies of all users who belong to this group in order to determine whether there are strong similarities between some members of the group. For each user u of a virtual group, we have the data of its personomy Pu = (Tu, Du, Au).
4.4 Building Users' Profiles and Analysis of the Similarity Degree between Two Users
The purpose of this analysis is to determine whether two specific users are strongly or slightly similar.
Definition: Two users are defined as strongly similar if they tag, in a similar way, many resources that are semantically dissimilar from one another. For example, if two users tag in the same way (use similar tags) resources that are related to programming, it will be interesting to analyze how these users have tagged resources related to cooking or the Vancouver Olympic games, for instance. If two users who belong to the same virtual group are strongly similar, the resources added by one of them will be suggested to the other user, and vice versa. In this case, the results of a search by tag are customized based on the interests of each user. In each virtual group of users, we cluster users with a high similarity degree, whereas other users in the same group have a slight similarity degree; this creates several clusters in the same group. It is important that every user becomes aware of other strongly similar users and consequently has direct access to the resources added by these users to their libraries.
In the sub-communities of resources this situation is reflected in the information that will be displayed to the user: the sub-communities of resources will in turn be customized and refined according to the user's interests. Given this direct and rapid access to the resources that have been tagged by users with a high degree of similarity to a particular user, that user will tend to tag the same resources. This creates a build-up of such resources in the community, quite similar to pheromone reinforcement. For example, suppose we have two users, User1 and User2, both interested in "web design" and therefore tagging the same resources with the "web design" tag. Suppose they are also both interested in resources related to the same city, "Vancouver". In our system, User1 and User2 will be highly similar. The customized recommendation for User1 will display the resources tagged by highly similar users (User2 in this case). When User2 tags a new resource, say a "web design" company in "Vancouver", this resource is recommended only to User1, based on the high similarity score between these two users. Such resources are of special significance to User1 because they come from a user who shares the same interests, so it is very probable that User1 will consider tagging the resources that the system recommends. These tags and resources will therefore be reinforced in the system.
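Definitions 1 and 2 of Section 4.3 translate directly into set operations over the annotation triples, as in the sketch below; the overlap score at the end is only one hypothetical way of quantifying user similarity, since the paper does not fix a formula for it.

# Sketch: extracting the personomy P_u = (T_u, D_u, A_u) of a user from the
# global annotation set A, represented here as a set of (user, tag, document)
# triples, plus a hypothetical Jaccard-style similarity between two users.
def personomy(annotations, user):
    A_u = {(t, d) for (u, t, d) in annotations if u == user}
    T_u = {t for (t, _) in A_u}
    D_u = {d for (_, d) in A_u}
    return T_u, D_u, A_u

def annotation_overlap(annotations, user1, user2):
    _, _, A1 = personomy(annotations, user1)
    _, _, A2 = personomy(annotations, user2)
    union = A1 | A2
    return len(A1 & A2) / len(union) if union else 0.0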
5 Conclusion
In this paper, we introduce more sociability into collaborative tagging systems. Our work aims to contribute to the effort of the emerging web science, where one objective is to propose new principles and underlying architectures for future social machines on the social web. Our first contribution was to make explicit the linkage between the social activity of tagging and its effect on emerging clusters of resources and virtual groups of users. This was modeled by a multi-agent system, with a social organization (practices and activities) and a spatial organization (effects on resources), coupled in a retroactive way. This coupling allows for the co-evolution of clusters of resources (tagged with similar tags) and virtual groups of similar users. The emergence of clusters of resources was obtained by the adaptation of a spectral clustering algorithm, which was our second contribution. This clustering allowed for the definition of a hierarchical organization of tags, using system tags suggested after the emergence of a cluster of resources and an associated virtual group of users (taggers). Finally, these virtual groups of users and their associated clusters of tagged resources were used to propose a resource recommender system, based on the cross-fertilization of the tagging activity of the members of the virtual groups that emerged from that activity. In future work, we intend to consider social network properties and their effects in the context of collaborative/social tagging systems.
References
1. Rupert, M., Rattrout, A., Hassas, S.: The web from a Complex Adaptive Systems Perspective. Journal of Computer and Systems Sciences 74, 133–145 (2008)
2. Hendler, J., Shadbolt, N., Hall, W., Berners-Lee, T., Weitzner, D.: Web science: an interdisciplinary approach to understanding the web. Communications of the ACM 51, 60–69 (2008)
3. Yeung, C.-M., Noll, M., Gibbins, N., Meinel, C., Shadbolt, N.: On Measuring Expertise in Collaborative Tagging Systems. Presented at WebSci 2009: Society On-Line, Athens, Greece (2009)
4. Milicevic, A.K., Nanopoulos, A., Ivanovic, M.: Social tagging in recommender systems: a survey of the state-of-the-art and possible extensions. Artif. Intell. Rev. 33, 187–209 (2010)
5. Wang, J., Clements, M., Yang, J., de Vries, A.P., Reinders, M.J.T.: Personalization of tagging systems. Inf. Process. Manage. 46, 58–70 (2010)
6. Choy, S.-O., Lui, A.: Web Information Retrieval in Collaborative Tagging Systems. Presented at the IEEE/WIC/ACM International Conference on Web Intelligence (2006)
7. Serugendo, G.D.M., Gleizes, M.-P., Karageorgos, A.: Self-organization in multi-agent systems. Knowl. Eng. Rev. 20, 165–189 (2005)
8. Holland, O.E., Melhuish, C.: Stigmergy, self-organization, and sorting in collective robotics. Artificial Life 5, 173–202 (1999)
9. Cattuto, C., Baldassarri, A., Servedio, V.D.P., Loreto, V.: Emergent Community Structure in Social Tagging Systems. Presented at the European Conference on Complex Systems, Dresden (2007)
10. Capocci, A., Servedio, V.D.P., Caldarelli, G., Colaiori, F.: Detecting communities in large networks (2004), http://www.citebase.org/abstract?id=oai:arXiv.org:cond-mat/0402499
11. Hotho, A., Jaschke, R., Schmitz, C., Stumme, G.: Information Retrieval in Folksonomies: Search and Ranking. In: Sure, Y., Domingue, J. (eds.) ESWC 2006. LNCS, vol. 4011, pp. 411–426. Springer, Heidelberg (2006)
Some Optimizations in Maximal Clique Based Distributed Coalition Formation for Collaborative Multi-Agent Systems Predrag T. Tošić and Naveen K.R. Ginne Department of Computer Science, University of Houston, Houston, Texas, USA [email protected], [email protected]
Abstract. We study scalable and efficient coordination and negotiation protocols for collaborative multi-agent systems (MAS). The coordination problem we address is multi-agent coalition formation in a fully decentralized, resourcebounded setting. More specifically, we design, analyze, optimize and experiment with the Maximal Clique based Distributed Coalition Formation (MCDCF) algorithm. We briefly describe several recent improvements and optimizations to the original MCDCF protocol and summarize our simulation results and their interpretations. We argue that our algorithm is a rare example in the MAS research literature of an efficient and highly scalable negotiation protocol applicable to several dozens or even hundreds (as opposed to usually only a handful) of collaborating autonomous agents. Keywords: distributed AI, multi-agent systems, multi-agent coordination, distributed consensus, negotiation protocols, coalition formation.
1 Introduction
Distributed coalition formation in collaborative multi-agent domains is an important coordination, negotiation and cooperation problem that has received a considerable attention within the Multi-Agent Systems (MAS) research community (e.g., [2, 6, 8, 9, 10, 12, 16, 19]). In many collaborative multi-agent system (MAS) applications, autonomous agents need to dynamically form groups or coalitions in an efficient, reliable, scalable, fault-tolerant and partly or entirely distributed manner. In medium and large ensembles, the agents’ capability to effectively autonomously coordinate and, in particular, self-organize into groups or coalitions, is often of utmost importance. Scalable, resource-aware and partially or fully distributed algorithms for coalition formation in such MAS settings are of a considerable interest among both the distributed AI researchers and the MAS application practitioners [17]. There are many important collaborative MAS applications where autonomous agents need to form groups, teams or coalitions. The agents may need to form teams or coalitions in order to share resources, jointly complete tasks that exceed the abilities of individual agents, and/or improve some system-wide performance metric such as the speed of task completion [7, 12]. One well-studied general problem domain is a collaborative MAS environment populated with distinct tasks, where each task
requires a tuple of resources on the agents' part, so that the agents are able to jointly complete that task [10, 11, 12, 17]. In this distributed task allocation context, agents need to form coalitions that have sufficient cumulative resources or capabilities across the coalition members in order to complete the assigned task or tasks. We study the problem of distributed coalition formation in the following problem setting. We assume a collaborative multi-agent, multi-task dynamic environment. The agents are assumed to be collaborative, and hence are not selfish [8, 17, 20]. The agents have certain capabilities that may enable them to service the tasks. Similarly, the tasks have certain resource or capability requirements, so that no agent or coalition of agents whose (joint) capabilities do not meet a particular task’s resource requirements can serve that task [10-12]. Tasks are assumed mutually independent. Each task is of a certain value to an agent. Agents are assumed capable of communicating, negotiating and making agreements with each other [8, 10]. Communication is accomplished via exchanging messages. This communication is not free: an agent has to spend time and effort in order to send and receive messages [10, 17]. An agent can only work on one task at a time, whether on its own or as a part of a coalition that jointly attempts to complete that task. Thus far, the described model is identical to that found in [10]. An important difference we introduce with respect to the model in [10] is that we assume that an agent’s resources are not transferable to other agents [16, 17]. Thus, the only way for an agent Ai to use the internal resources of agent Aj for the purpose of servicing some task is that Ai and Aj join the same coalition, and then jointly attack that task. Our distributed coalition formation algorithm is based on the idea that, in peer-topeer MAS, an agent would prefer to form a coalition with those agents that it can communicate with directly, and, moreover, where every member of such a potential coalition can communicate with any other member directly. That is, the preferable coalitions are (maximal) cliques. Finding a maximal clique in an arbitrary graph is in general NP-hard in the centralized setting [3, 4]. This implies the computational hardness that, in general, each agent (that is, graph node) faces when trying to determine the maximal clique(s) it belongs to. When the degree of a node is sufficiently small, then finding all maximal cliques this node belongs to is feasible. If the system designer cannot guarantee that all the nodes in a given interconnection topology are of a small degree, then (s)he needs to impose additional constraints in order to ensure that the agents are not attempting to solve an infeasible problem [16, 17]. In the sequel, we will assume sufficiently sparse network topologies; while restrictive, this assumption holds in many robotic, unmanned vehicles and other collaborative MAS applications [17, 20]. The organization and contributions of this paper are as follows. First, we summarize an actual design and implementation of the original MCDCF algorithm. Second, we motivate and justify some potentially very useful optimizations to the original MCDCF protocol. 
Third, we summarize and discuss our findings based on recent and ongoing experimentation with different variants of the MCDCF algorithm, and argue that optimized MCDCF is a highly scalable negotiation protocol, rare among its peers in that it is applicable to dozens or even hundreds of coordinating agents, as opposed to only a few agents (like most negotiation protocols found in the existing literature). Lastly, we outline some directions for future work.
2 Summary of the MCDCF Algorithm MCDCF is a negotiation protocol among distributed collaborative agents whose coordination is accomplished via communication, that is, exchange of messages. Algorithmically, MCDCF is a distributed graph algorithm [15-17]. The underlying undirected or directed graph captures the communication (ad hoc) network topology of the agents. Each agent is a node in the graph. The necessary requirement for an edge between two nodes to exist is that the two nodes be able to directly communicate with one another. For undirected graphs, an unordered pair of nodes {A, B} is an edge in the graph if and only if A can communicate messages to B, or B can communicate messages to A, or both. For directed graphs, an ordered pair (A, B) is a directed edge if agent A can hear agent B (but agent B may or may not be able to hear agent A). The basic idea behind MCDCF is to efficiently and in a fully decentralized manner partition this graph into (preferably, maximal) cliques of nodes. In a given application, these maximal cliques would usually also need to satisfy some additional criteria in order to form temporary coalitions of desired quality. The coalitions are then maintained until they are no longer useful to or preferred by the agents forming them. Those coalitions should be transformed or dissolved when the interconnection topology of the underlying graph considerably changes, either due to the agents’ mobility, or because some of the old links have died out whereas some new, different links have formed, etc. Another possible reason to abandon the existing coalition structure is when the coalitions have accomplished the tasks they were formed to address. Thus, in an actual MAS application, an appropriate variant of MCDCF may need to be invoked a number of times as a coordination subroutine [16, 17]. The MCDCF protocol is explained and analyzed in detail in [17]; hence, we only summarize the most important aspects of its workings. In each negotiation round, each agent picks one of its candidate coalitions and sends this coalition proposal to those neighboring agents that would be the members of the chosen coalition proposal. An agent also receives similar coalition proposals from some subset of its neighbors. Hence, each agent needs to compare its selected coalition proposal with the proposed coalitions received from other agents. If those are all in agreement, a coalition is formed and each coalition member notifies its other neighbors (those not in the newly formed coalition, if any) that it has joined coalition and hence is no longer available or interested in future coalition proposals. As a simple example, assume agent A sends a proposal to agents B and C that {A, B, C} coalition be formed. If B and C also, independently of each other (and of A) proposed the same coalition, an agreement has been reached and the new coalition is formed. In contrast, if, say, B also proposed {A, B, C} but C proposed {A, C, D}, then there is no agreement and the protocol moves to the next round, where one or more of the agents needs to change their coalition proposal, so that the negotiation process moves forward until eventually everyone joins some coalition based on a distributed consensus as outlined above. When an agent’s coalition proposal does not get accepted by all of the neighboring agents to whom this proposal was made, what is that agent to do? 
It can either (a) stick to its current proposal, hoping that some of the other agents would change their mind in the subsequent rounds, or (b) change its coalition choice, and, in the next round, propose a different coalition from what it proposed before. The MCDCF
protocol therefore has to carefully address when exactly an agent is to “sit tight” and stick to its choice, and when it is to adopt a new candidate coalition and make a new proposal to its neighbors. In particular, MCDCF employs several tie-breaking mechanisms in order to ensure that, at the end of a round in which no new coalition has been formed, at least one agent has to change its coalition proposal. That requirement can be shown to be sufficient to ensure that the distributed consensus-reaching process moves forward and, under the assumptions discussed in detail in [17], to ensure eventual convergence (in finitely many rounds) of the algorithm. The MCDCF protocol makes the following fundamental assumptions:
– Each agent has a unique global identifier, ’UID’, and the agent knows its UID.
– There is a total ordering, <, on the set of UIDs, and each agent knows <.
– Each agent has (or can efficiently obtain) reliable knowledge of which other agents are its neighbors.
– The resulting coalition structure is required to be a partition of the set of agents into some number of mutually disjoint subsets (coalitions), where each such coalition is nonempty.
– A coalition can service only one task at a time, and the coalition value is a function of the single most valuable task that this coalition can complete.
– Communication bandwidth availability is assumed sufficient.
– Each agent has sufficient local memory (including the message buffers) for storing all the information received from other agents.
– Communication is either reliable during the coalition formation, or else it fails in a nice, non-Byzantine manner: if an agent Ai sends a message to another agent Aj, then either agent Aj gets exactly the same message that Ai has sent, or else the communication has failed, so that Aj does not receive anything from Ai at all.
– The veracity assumption holds, i.e., an agent can trust the information received from its neighbors.
We note that an agent need not a priori know the UIDs of other agents, or, indeed, how many other agents are present in the system at any time. Agents form coalitions in a fully distributed manner as follows. Each agent (i) first learns who its neighbors are, then (ii) determines the appropriate candidate coalitions, then (iii) evaluates the utility value of each such candidate coalition, measured in terms of the joint resources of all the potential coalition members, then (iv) chooses the most desirable candidate coalition, and, finally, (v) sends this choice to all its neighbors. This basic procedure is then repeated, together with all agents updating their knowledge of (a) the preferred coalitions of their neighbors, and (b) what coalitions have already been formed. The constraint on the candidate coalitions that an agent would consider is that all coalition members have to be that agent’s 1-hop neighbors in the underlying communication topology.
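As an illustration of steps (i)–(v) and of the per-round agreement condition, the following Python sketch (ours; the authors' implementation is in Java and is not reproduced here) enumerates an agent's candidate coalitions as cliques within its 1-hop neighborhood and checks whether a proposal is unanimously agreed upon. Ranking candidates by size stands in for the utility evaluation of step (iii); the toy topology and all names are assumptions.

from itertools import combinations

def candidate_coalitions(agent, adj):
    # Candidate coalitions of `agent`: subsets of its 1-hop neighborhood that,
    # together with the agent, form a clique. Feasible only for low node
    # degrees, as assumed in the paper.
    neigh = sorted(adj[agent])
    cands = []
    for k in range(len(neigh), 0, -1):
        for subset in combinations(neigh, k):
            nodes = set(subset) | {agent}
            # clique test: every pair of members must be connected
            if all(b in adj[a] for a in nodes for b in nodes if a != b):
                cands.append(frozenset(nodes))
    cands.append(frozenset({agent}))      # trivial coalition is always available
    return cands                          # ordered largest-first

def round_of_negotiation(proposals):
    # One MCDCF-style round: a coalition forms when every proposed member
    # independently proposed exactly the same coalition.
    formed = set()
    for agent, coalition in proposals.items():
        if all(proposals.get(member) == coalition for member in coalition):
            formed.add(coalition)
    return formed

# toy topology: triangle {1,2,3} plus a pendant node 4 attached to 3
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
best = {a: candidate_coalitions(a, adj)[0] for a in adj}   # each agent's top choice
print(round_of_negotiation(best))

On this toy graph, agents 1–3 agree on the coalition {1, 2, 3} in the first round, while agent 4's proposal {3, 4} is rejected and must be revised in later rounds.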
3 Recent Improvements to the MCDCF Algorithm We next summarize the main design changes and improvements to the core MCDCF algorithm as outlined in the previous section. These modifications to the original MCDCF protocol as found in [15, 16] have been designed, implemented in our Java
MCDCF simulation, tested over a broad range of random graphs of small and medium sizes (from a few nodes to a few dozen nodes), and the simulation results have been carefully analyzed. While the algorithmic improvements and scaling-up MCDCF are still work in progress, we share the conclusions that can be drawn based on the optimizations and extensive experimentation we have done to date. We first outline three design changes to the original MCDCF that we have introduced in [18]. One of them addresses how an agent that is modifying its coalition choices due to a lack of consensus in the prior rounds moves down the lattice of subsets of its neighborhood list; this modification ensures moving down that lattice faster than in the original MCDCF, but does not address the criteria that determine whether and when is an agent to be forced to change its coalition choice. The other two modifications address the latter problem; in particular, they both revolve around the idea of lazy, as opposed to eager, approach to when an agent ought to change its coalition of choice. (i) When an agent, based on the tie-breaking mechanisms discussed in detail in [15-17], has been determined to have to change its current coalition proposal, it does so by picking a new subset of its extended neighborhood. This new coalition proposal can either be a proper subset of the prior proposal, or else it can be another subset of the extended neighborhood, that is not comparable (w.r.t. the subset relation) to the prior proposal. If the current coalition proposal for agent A has k+1 elements (meaning, A itself and k of its 1-hop neighbors), then there may be O(2k) candidate coalitions still to consider. In the worst case scenario, agent A ends up with the trivial coalition {A} at which point agent it has no choice but to form this trivial coalition. However, only one coalition proposal can be made per round, meaning that agent A may take, in the worst case, about 2k rounds to eventually converge to the trivial coalition. For the scalability reasons, we have implemented and tested a variant of the basic MCDCF algorithm (as found in [15-17]) where each subsequent coalition proposal has to be a proper subset of the prior ones, thereby ensuring that, if none of the intermediate coalition proposals gets accepted, such agent will converge to the singleton coalition after no more than O(k) rounds, as opposed to O(2k). (ii) Our next optimization addresses the criteria that determine whether an agent has to change its coalition choice, as opposed to sticking to its present choice in the future round(s). In the original, that is, Eager MCDCF, we use a special purpose flag, ChoiceFlag, and a total ordering on agents’ UIDs, to determine a unique agent that has to change its coalition choice after a round in which this agent and its neighbors failed to agree to the same coalition. By the very nature of which candidate coalitions are available to each agent in the future rounds, such changes are generally undesirable (as the new coalition is likely less desirable than the present choice). The Lazy1 mechanism avoids having an agent make such an undesirable change, whenever there have been changes to any neighboring agents’ current coalitions due to, presumably, some of those neighbors’ neighbors having joined their coalitions during the previous round. 
Therefore, the indirect effect of one’s neighbors’ neighbors having joined coalitions is taken advantage of, in the sense that an agent now need not change its coalition of choice, even though it would have been its “turn” based on the tie-breaking criteria. In the original version of MCDCF, these tie-breaking criteria are entirely based on the ChoiceFlag and the ordering of UIDs; see [17] for details.
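The subset-descent idea of optimization (i) can be sketched as follows (an illustration under assumed data structures, not the actual MCDCF code; the rule for picking which member to drop is hypothetical). The point is only that forcing every new proposal to be a proper subset of the previous one lets an agent fall back to the trivial coalition in at most k changes rather than exploring up to 2^k incomparable subsets.

def shrink_proposal(agent, current, reject_hint=None):
    # Every new proposal is a proper subset of the previous one, so an agent
    # that starts with k neighbours in its proposal reaches the trivial
    # coalition {agent} after at most k unsuccessful changes (O(k) rounds).
    others = sorted(current - {agent})
    if not others:
        return frozenset({agent})                       # trivial coalition
    victim = reject_hint if reject_hint in others else others[-1]
    return current - {victim}                           # proper subset of `current`

p = frozenset({"A", "B", "C", "D"})
while len(p) > 1:
    p = shrink_proposal("A", p)
    print(sorted(p))        # ['A','B','C'] -> ['A','B'] -> ['A']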
(iii) We observe that Lazy1 mechanism only applies when some coalitions have been formed during the previous round. In contrast, our next optimization, dubbed Lazy2, is applicable even when no new coalitions have been formed by anyone in the previous round. Lazy2 is based on the monotonicity property of MCDCF: if an agent, A, receives a coalition proposal from one of its neighbors that is a proper subset of A’s own current proposal, then the coalition proposed by A is unreachable. Hence, A can safely drop its current proposal, and reduce its set of candidate coalitions according to the subset information obtained from the coalition proposal of its neighbors. Lazy2 optimization ensures, first, that every agent updates its current proposal and set of future candidate coalitions so that only reachable coalitions are considered, and then, second, that, if one or more agents have made changes of their coalition proposals based on the above monotonicity property, then none of those agents’ neighbors should have to change their current coalition choices based on the tie-breaking mechanisms, as such changes are unnecessary from the convergence stand-point, and are undesirable from the optimality stand-point. To ensure that the lazy evaluation approach is applied when appropriate on one hand, yet that each region of the underlying network sees some changes during every round on the other, we have also introduced two additional mechanisms whose purpose is to ensure the progress of the protocol in each region of the network. These mechanisms locally monitor (i) which of the neighboring agents have recently changed their coalition proposals and (ii) whether any of the neighboring agents have recently joined a coalition. With this information on recent local changes available, each agent can better determine whether “laziness” is acceptable or not, and whether it should change its coalition proposal when no alternatives are as good as the present proposal. In essence, the two additional flags keep track of neighborhood changes to ensure that no region of the network wastes rounds without making progress; this lack of progress in some cases has been observed to be due to agents in a particular region not changing their coalition proposals, because no agent feels obliged to make any changes since it’s aware of changes and/or new coalitions happening in other regions of the network. Due to space constraints, we leave the details out but we emphasize that these new mechanisms have made a huge difference insofar as the speed of convergence for the underlying graphs with relatively large average node degrees (in our context, average node degree of 5 to 7 is relatively large). We observe that previously, for networks of ~50 nodes, we had situations where only one agent/node would change its proposal per round, leading, at times, to rather slow convergence.
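The monotonicity-based pruning behind Lazy2 can be sketched as follows (again our illustration with assumed data structures, not the actual implementation): any candidate coalition that strictly contains a coalition currently proposed by a neighbor can no longer be agreed upon, so it is discarded and the agent falls back to the best surviving candidate.

def lazy2_prune(my_proposal, candidates, neighbour_proposals):
    # If some neighbour proposes a coalition that is a proper subset of my
    # current proposal, my proposal is unreachable (monotonicity), so drop it
    # and keep only candidates still consistent with that smaller proposal.
    for other in neighbour_proposals.values():
        if other < my_proposal:                          # proper subset => unreachable
            candidates = [c for c in candidates if not (other < c)]
            my_proposal = max(candidates, key=len)       # fall back to best survivor
    return my_proposal, candidates

cands = [frozenset("ABC"), frozenset("AB"), frozenset("AC"), frozenset("A")]
prop, cands = lazy2_prune(frozenset("ABC"), cands,
                          {"B": frozenset("AB"), "C": frozenset("AC")})
print(sorted(prop), [sorted(c) for c in cands])   # ['A', 'B'] and the pruned candidate list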
4 Brief Analysis, Discussion and Conclusions We next briefly summarize our simulation results and their interpretation along two performance axes: (1) the number of rounds until convergence and (2) the quality of resulting coalition structures (with respect to “the bigger, the better” criterion mentioned earlier and discussed at length in [17]). We have validated the baseline MCDCF and its variants for the sparse graphs with average node degrees in the range from 2 to 7. Our discussion is based on the observed behavior for several variants of MCDCF run on various sparse graphs with sizes of up to 50 nodes. We found that, in general, all variants of improved MCDCF that include at least one of the optimizations discussed in Section 3 outperform the baseline implementation of the original
MCDCF. This general conclusion holds across the scenarios we have tested so far. The main improvement is with respect to the speed of convergence. Moreover, a considerable reduction in the number of rounds is generally not at the expense of the quality of the resulting coalition structures. Given that reducing the number of rounds until convergence has been the main objective of our optimizations, one conclusion from our experiments is that the first optimization in the previous section is the biggest difference-maker for most tested small and medium sized, sparse-on-average graphs. In most of our examples, Lazy1 vs. Lazy1 + Lazy2 have performed very similarly to each other. However, for some underlying graphs Lazy2 makes a difference. One issue is, whether instances constructed with the particular purpose in mind (e.g., the presumed benefits of Lazy2) correspond to a kind of network topologies that can be expected to be encountered in various collaborative MAS practical applications where MCDCF is applicable. Such applications include ensembles of micro-UAVs or smart sensors “randomly” scattered across a sizable terrain. Our prior work [16, 17] contains a considerable discussion on applicability of MCDCF to various practical MAS, as well as the pointers to the literature on distributed coalition formation algorithms used in the real-world MAS applications. Insofar as quality of the resulting coalitions, while the relatively limited classes of underlying topologies we have simulated so far do not give us a sufficient ground for very general conclusions, it certainly appears that savings in terms of the speed of convergence are generally not at the expense of the obtained coalition structures’ quality. In particular, all three of Eager with optimization (i), Lazy1 (with optimizations (i) and (ii)) and Lazy1 + Lazy2 (with all three optimizations) usually take no more rounds than Eager with no optimizations, often take strictly fewer rounds, and have never been observed to result in a worse final coalition structure than what is obtained by the default Eager MCDCF. We note that the overall quality of the coalition structure need not imply that there could not be individual agents that are worse off in certain situations when the optimizations are turned on vs. when the optimizations are off. Rather, the statement “speed-up of convergence without sacrificing the coalition structure quality” should be interpreted in a Pareto optimality sense: no agent is ever better off in the unoptimized case (for the same underlying graph) without some other agent being worse off. Our ongoing and near future work includes validating the proposed optimizations on broader varieties of underlying graphs, and scaling the MCDCF protocol from about 50 to several hundreds of agents. We already have analytical evidence that MCDCF is indeed scalable to hundreds of nodes as long as the node degrees are kept low [17], but we’d like to validate these theoretical findings via simulations. Scaling up may also turn out to necessitate further improvements and additional optimizations in the algorithm. Our plans for the near future also include adding learning abilities to the agents, so that they can utilize their past experience to define, maintain and use preferences over potential coalition partners in order to make better choices of coalition proposals and, hopefully, ultimately be able to agree on most desirable coalitions considerably more efficiently than when no learning is involved. Acknowledgments. 
The authors would like to thank Gul Agha and Ricardo Vilalta for their guidance and support, as well as the Department of Computer Science and Texas Learning & Computation Center (TLC2) at University of Houston.
References 1. Abdallah, S., Lesser, V.: Organization-Based Cooperative Coalition Formation. In: Proc. IEEE/WIC/ACM Int’l Conf. on Intelligent Agent Technology IAT (2004) 2. Avouris, N.M., Gasser, L. (eds.): Distributed Artificial Intelligence: Theory and Praxis. Euro Courses Comp. & Info. Sci, vol. 5. Kluwer Academic Publ., Dordrecht (1992) 3. Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. MIT Press, Cambridge (1990) 4. Garey, M.R., Johnson, D.S.: Computers and Intractability: a Guide to the Theory of NPCompleteness. W.H. Freedman & Co., New York (1979) 5. Jang, M., Reddy, S., Tosic, P., Chen, L., Agha, G.: An Actor-based Simulation for Studying UAV Coordination. In: 16th Euro. Simulation Symp. (ESS 2003), Delft, Netherlands (2003) 6. Li, X., Soh, L.K.: Investigating reinforcement learning in multiagent coalition formation, TR WS-04-06. In: AAAI Workshop on Forming and Maintaining Coalitions and Teams in Adaptive MAS (2004) 7. de Oliveira, D.: Towards Joint Learning in Multiagent Systems Through Opportunistic Coordination, PhD Thesis, Univ. Federal Do Rio Grande Do Sul, Brazil (2007) 8. Sandholm, T.W., Lesser, V.R.: Coalitions among computationally bounded agents. In: Artificial Intelligence, vol. 94 (1997) 9. Sandholm, T.W., Larson, K., Andersson, M., Shehory, O., Tohme, F.: Coalition structure generation with worst case guarantees. AI Journal 111(1-2) (1999) 10. Shehory, O., Kraus, S.: Task allocation via coalition formation among autonomous agents. In: Proceedings of IJCAI 1995, Montréal, Canada (1995) 11. Shehory, O., Sycara, K., Jha, S.: Multi-agent coordination through coalition formation. In: Rao, A., Singh, M.P., Wooldridge, M.J. (eds.) ATAL 1997. LNCS (LNAI), vol. 1365, pp. 143–154. Springer, Heidelberg (1998) 12. Shehory, O., Kraus, S.: Methods for task allocation via agent coalition formation. AI Journal 101 (1998) 13. Soh, L.K., Li, X.: An integrated multilevel learning approach to multiagent coalition formation. In: Proc. Int’l Joint Conf. on Artificial Intelligence IJCAI 2003 (2003) 14. Sun, R.: Meta-Learning Processes in Multi-Agent Systems. In: Zhong, N., Liu, J. (eds.) Intelligent Agent Technology: Research and Development. World Scientific, Hong Kong (2001) 15. Tosic, P., Agha, G.: Maximal Clique Based Distributed Group Formation Algorithm for Autonomous Agent Coalitions. In: Proc. Workshop on Coalitions & Teams, AAMAS 2004, New York City, New York (2004) 16. Tosic, P., Agha, G.: Maximal Clique Based Distributed Coalition Formation for Task Allocation in Large-Scale Multi-Agent Systems. In: Ishida, T., Gasser, L., Nakashima, H. (eds.) MMAS 2005. LNCS (LNAI), vol. 3446, pp. 104–120. Springer, Heidelberg (2005) 17. Tosic, P.: Distributed Coalition Formation for Collaborative Multi-Agent Systems, MS thesis, Univ. of Illinois at Urbana-Champaign (UIUC), Illinois, USA (2006) 18. Tosic, P., Ginne, N.: MCDCF:A Fully Distributed Algorithm for Coalition Formation in Collaborative Multi-Agent Systems. In: Proc. World-Comp Int’l Conf. on Artificial Intelligence (ICAI 2010), Las Vegas, Nevada, USA (to appear, 2010) 19. Vig, L., Adams, J.A.: Issues in multi-robot coalition formation. In: Proc. Multi-Robot Systems: From Swarms to Intelligent Automata, vol. 3 (2005) 20. Weiss, G. (ed.): Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence. MIT Press, Cambridge (1999) 21. Wooldridge, M.: An Introduction to Multi-Agent Systems, 2nd edition. Wiley, Chichester (2009)
Enhanced Intra Coding of H.264/AVC Advanced Video Coding Standard with Adaptive Number of Modes Mohammed Golam Sarwer and Q.M. Jonathan Wu Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON, N9B 3P4, Canada {sarwer,jwu}@uwindsor.ca
Abstract. In H.264/AVC intra coding, the DC mode is used to predict regions with no unified direction, and the predicted values of all pixels are the same. Therefore, smoothly varying regions are not well de-correlated. In order to address this issue, this paper proposes an improved DC prediction mode based on the distance between the predicted and reference pixels. On the other hand, using the nine prediction modes in intra 4x4 and 8x8 block units can reduce the spatial redundancies, but it needs a lot of overhead bits. In order to reduce the number of overhead bits and the computational cost of the encoder, this paper adaptively selects the number of prediction modes for each 4x4 or 8x8 block. Experimental results confirm that the proposed methods save 14.8% bit rate and improve the video quality by 0.44 dB on average. The proposed method saves about 37.8% of the computation of the H.264/AVC intra coding method. Keywords: H.264/AVC, intra coding, DC prediction, mode, variance.
1 Introduction
H.264/AVC is the newest international video coding standard developed by the Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG [1]. The rate-distortion (RD) performance of H.264/AVC is superior to any other conventional video coding standard. H.264/AVC offers a rich set of prediction patterns for intra prediction, i.e. nine prediction modes for 4x4 luminance (luma) blocks, nine prediction modes for 8x8 luma blocks and four prediction modes for 16 x 16 luma blocks. However, the RD performance of the intra frame coding is still lower than that of inter frame coding. Thus the development of efficient intra coding is important not only for the overall bit rate reduction but also for the efficient streaming. A block based extra/inter-polation method by changing the sub-block coding order in a Macroblock (MB) is proposed in [2]. Since three additional directional predictions are utilized, the computational complexity of intra prediction increases drastically. In addition to that, the number of overhead bits to represent the prediction mode also increases. A simplified Bi-directional intra prediction (BIP) [3] method combines two existing prediction modes to form bi-directional prediction modes. In order to improve the performance of DC prediction, a distance based weighted prediction method (DWP) based on the distance of predicted and reference pixels is introduced
in [4]. A fast algorithm for DWP is developed in [5]. However, simulations show that the RD performance improvement of these methods is marginal. A pixel-based differential intra coding method is developed in [15], but this method is only suitable for lossless intra coding. H.264/AVC uses a rate-distortion optimization (RDO) technique to select the best coding mode out of the nine prediction modes in terms of maximizing coding quality and minimizing bit rate. Many algorithms have been developed to reduce the computation of the H.264/AVC encoder, but all of these methods sacrifice video quality [9-14]. Using the nine prediction modes in intra 4x4 and 8x8 block units can reduce the spatial redundancies, but it may need a lot of overhead bits to represent the prediction mode of each 4x4 or 8x8 block. A new intra coding method based on adaptive single-multiple prediction (SMP) was proposed in order to reduce not only the overhead mode bits but also the computational cost [6]. If the variance of the neighboring pixels of the upper and left blocks is less than a threshold, only DC prediction is used and no prediction mode bits are needed; otherwise nine prediction modes are computed. However, the reference pixels in the up-right blocks are not considered during this early termination. If the variance of the reference pixels of the upper and left blocks is very low while the variance of the up-right blocks is higher, all of the prediction modes except the diagonal-down-left and vertical-left modes are similar; in this case, the DC prediction mode alone is not enough to maintain good PSNR and compression ratio. Therefore, there is still some room to improve the performance of the intra encoder. The goal of this paper is the development of an efficient DC prediction method suitable for smooth image regions and the reduction of the overhead bits needed to represent intra prediction modes. The remainder of this paper is organized as follows. Section 2 reviews the intra prediction method of H.264/AVC. In Section 3, we describe the proposed methods. The experimental results are presented in Section 4. Finally, Section 5 concludes the paper.
2 Intra Prediction of H.264/AVC For the luma samples, intra prediction may be formed for each 4x4 block or for each 8x8 block or for a 16x16 macroblock. There are a total of 9 optional prediction modes for each 4x4 and 8x8 luma block; 4 optional modes for a 16x16 luma block. Similarly for chroma 8x8 block, another 4 prediction directions are used. The prediction of a 4x4 block is computed based on the reconstructed samples labeled P0-P12 as shown in Fig. 1 (a). The grey pixels (P0-P12) are reconstructed previously and considered as reference pixels of the current block. For correctness, 13 reference pixels of a 4x4 block are denoted by P0 to P12 and pixels to be predicted are denoted by P13 to P28. Mode 2 is called DC prediction in which all pixels (labeled P13 to P28) are predicted by (P1+P2+P3+P4+P5+P6+P7+P8+4)/8 and mode 0 specifies the vertical prediction mode in which pixels (labeled P13, P14, P15 and P16) are predicted from P5, and the pixels (labeled P17, P22, P21, and P20) are predicted from P6, and so on. The remaining modes are defined similarly according to the different directions as shown in Fig. 1 (b). The reconstructed reference pixels and pixels to be predicted of a 8x8 block are shown in Fig. 1 (c). The directional pattern of a 8x8 block is exactly same as that of a 4x4 block which is shown in Fig. 1 (b). In addition to 4x4 and 8x8 prediction, 4
additional prediction modes (vertical, horizontal, DC and Plane) are also supported for a 16x16 block. The best mode is the one that has the minimum RD cost and this cost is expressed as,
$J_{RD} = \mathrm{SSD} + \lambda \cdot R$   (1)
where SSD is the sum of squared differences between the original block and the reconstructed block, R is the number of bits actually needed to encode the block, and λ is an exponential function of the quantization parameter (QP).
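A minimal sketch of this RDO mode decision is shown below (ours, not the JM reference code); trial_encode is a placeholder for a full predict/transform/quantize/entropy-code/reconstruct pass, and the numbers in the toy run are invented purely to exercise the cost comparison.

def best_intra_mode(block, candidate_modes, trial_encode, lam):
    # Rate-distortion mode decision of Eq. (1): pick the mode minimising
    # J_RD = SSD + lambda * R.  `trial_encode(block, mode)` must return the
    # SSD between original and reconstructed block and the true bit count.
    best = None
    for mode in candidate_modes:
        ssd, bits = trial_encode(block, mode)
        cost = ssd + lam * bits          # J_RD = SSD + lambda * R
        if best is None or cost < best[1]:
            best = (mode, cost)
    return best

# toy usage with a fake trial encoder (illustrative numbers only)
fake = {0: (120, 30), 1: (200, 12), 2: (150, 20)}
mode, cost = best_intra_mode(None, [0, 1, 2], lambda b, m: fake[m], lam=5.0)
print(mode, cost)   # -> 2 250.0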
Fig. 1. (a) Prediction samples of a 4x4 block (b) direction of prediction modes of 4x4 and 8x8 blocks (c) prediction samples of a 8x8 block
3 Proposed Enhanced Intra Coding
3.1 Improved DC Prediction for 4x4 Block
In H.264/AVC, the DC mode is used to predict regions with no unified direction, and the predicted values of all pixels are the same. However, the correlation that exists between the predicted pixels and the reference pixels is not fully exploited by the DC prediction mode. Thus, the prediction signal generated by DC prediction is not well matched to the original signal, and a large number of bits is required for encoding the difference between the predicted and original signals. In order to improve the performance of DC prediction for a 4x4 block, this section proposes a distance based
prediction scheme based on the distance between the reference and current pixels. It is well known that a Gaussian-like distribution can approximate local intensity variations in smooth image regions. Therefore, the correlation between neighboring pixels is attenuated as the distance increases and becomes negligible when pixels are far apart. Consequently, prediction accuracy is degraded if all of the pixels of a 4x4 block are predicted from the same reference pixels. Let us consider a 4x4 block as shown in Fig. 1(a). In the original DC prediction of H.264/AVC, all of the pixels P13 to P28 of the 4x4 block are predicted from the neighboring reconstructed pixels P1 to P8. In the proposed IDCP method, pixels P13 to P19 are predicted from the reconstructed pixels P1 to P8, P20 to P24 are predicted from the previously predicted pixels P13 to P19, P25 to P27 are calculated from pixels P20 to P24, and the reference pixels of P28 are P25 to P27. The predicted pixels Pr are estimated by the following equation:

$P_r = \left\lfloor \Big( \sum_{i=m}^{n} W_i P_i \Big) \Big/ \Big( \sum_{i=m}^{n} W_i \Big) \right\rfloor$, where r = 13 to 28.   (2)
Here Pr is the pixel to be predicted, Pi is a reference pixel, and $\lfloor \cdot \rfloor$ is the truncation (floor) operation that generates an integer value. Table 1 shows the values of m and n for the different predicted pixels r, and Wi is the weight of the i-th reference pixel. If the distance between the predicted pixel Pr and the reference pixel Pi is smaller, the correlation between Pr and Pi is higher, and Wi is also higher. Therefore, the weighting factor Wi is inversely proportional to the distance and is defined as follows.
$W_i = \lfloor 8 / D \rfloor$   (3)
where D is the distance between the reference pixel and the pixel to be predicted, defined as $D = |B_{r,x} - B_{i,x}| + |B_{r,y} - B_{i,y}|$. $B_{r,x}$ and $B_{r,y}$ are the x and y positions of the predicted pixel Pr, and similarly $B_{i,x}$ and $B_{i,y}$ are the x and y positions of the reference pixel Pi. In our method, the predicted values of the neighboring pixels obtained in the previous step are used to predict the current pixels. Since some of the predicted values are based on other predicted values, the order of calculation of the predicted values is P13 to P28, following the four steps shown in Table 1.

Table 1. Values of m and n in (2) for the different predicted pixels

Step     Predicted pixels r   Reference pixels i   m    n
Step 1   r = 13 to 19         i = 1 to 8           1    8
Step 2   r = 20 to 24         i = 13 to 19         13   19
Step 3   r = 25 to 27         i = 20 to 24         20   24
Step 4   r = 28               i = 25 to 27         25   27
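The distance-based weighting of Eqs. (2)-(3) can be sketched as follows (our illustration, not the reference implementation); the coordinates and pixel values are invented, and a real encoder would use the fixed reference sets of Table 1 together with the shift-based forms of Eqs. (5)-(7) instead of explicit division.

def idcp_predict(target_xy, refs):
    # Distance-based DC prediction: each reference pixel is weighted by
    # W_i = floor(8 / D), where D is the Manhattan distance to the target
    # position, and the prediction is the floored weighted mean.
    # `refs` is a list of ((x, y), value) pairs; coordinates are illustrative.
    num = den = 0
    tx, ty = target_xy
    for (x, y), val in refs:
        d = abs(tx - x) + abs(ty - y)      # Manhattan distance D
        w = 8 // d                         # W_i = floor(8 / D), Eq. (3)
        num += w * val
        den += w
    return num // den                      # floored weighted mean, Eq. (2)

# predicting a pixel from five already-available neighbours (cf. Eq. (4))
refs = [((1, 0), 100), ((2, 0), 104), ((3, 0), 108), ((1, 1), 102), ((1, 2), 101)]
print(idcp_predict((2, 1), refs))          # -> 103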
Let us assume that we would like to predict pixel P26 as shown in Fig. 1(a). According to the third row of Table 1, in order to predict P26, the reference pixels are P20, P21, P22, P23, and P24. By substituting the values of r, m and n in (2), we get
$P_{26} = \left\lfloor \dfrac{W_{20} P_{20} + W_{21} P_{21} + W_{22} P_{22} + W_{23} P_{23} + W_{24} P_{24}}{W_{20} + W_{21} + W_{22} + W_{23} + W_{24}} \right\rfloor$   (4)

Consider that the position of the predicted pixel P26 is $(B_{26,x}, B_{26,y})$; therefore, as shown in Fig. 1(a), the positions $(B_{i,x}, B_{i,y})$ of the reference pixels, i = 20 to 24, are $(B_{26,x}+1, B_{26,y}-1)$, $(B_{26,x}, B_{26,y}-1)$, $(B_{26,x}-1, B_{26,y}-1)$, $(B_{26,x}-1, B_{26,y})$ and $(B_{26,x}-1, B_{26,y}+1)$, respectively. By substituting these values in (3), we can calculate the weighting factors W20 to W24. It is shown that in order to calculate (3), a computationally expensive division operation is necessary. In order to lessen computations, Wi is simplified as follows.
$W_i = 2^{K_i}$, where $K_i = \begin{cases} 3 & \text{if } D = 1 \\ 2 & \text{if } D = 2 \\ 1 & \text{if } D = 3 \text{ or } 4 \\ 0 & \text{otherwise} \end{cases}$   (5)
By substituting (5) into (2) and replacing multiplication operation by shifting,
$P_r = \left\lfloor \Big( \sum_{i=m}^{n} P_i << K_i \Big) \Big/ \sum_{i=m}^{n} W_i \right\rfloor$, where r = 13 to 28   (6)
The division and truncation operations of (6) are inconvenient for hardware implementation. We have seen that, for a particular value of r, the denominator of (6) is constant throughout the encoding and decoding process and independent of the intensity of the pixels. In order to avoid the computationally expensive division operator, (6) is approximated as follows:

$P_r = \Big[ \Big( \sum_{i=m}^{n} P_i << K_i \Big) + C_r \times P_{r,\mathrm{left}} \Big] >> 5$   (7)

where $C_r = 2^5 - \sum_{i=m}^{n} W_i$ and the values of $C_r$ are given in Table 2.
$P_{r,\mathrm{left}}$ is the immediate left pixel of Pr; for example, $P_{r,\mathrm{left}} = P_{15}$ for r = 22. Obviously, (7) is faster than (2), and the hardware-inconvenient division and truncation operations are avoided. According to (7), each 4x4 block needs 125 shift, 109 addition and 16 multiplication operations for the improved DC prediction.
3.2 Adaptive Number of Modes for 4x4 and 8x8 Blocks
Although the H.264/AVC intra coding method provides a good compression ratio, its computational complexity is increased drastically due to the use of nine prediction modes. Using the nine prediction modes in intra 4x4 and 8x8 block units for a 16x16 MB can reduce the spatial redundancies, but it may need a lot of overhead bits to represent the prediction mode of each 4x4 or 8x8 block. In order to reduce the number of overhead bits and the computational cost, this section proposes the algorithm given below.

Table 2. Value of Cr with different predicted pixels

r                 Cr
13, 19, 20, 24    11
14, 18            8
15, 17            5
16, 22            0
21, 23            6
25, 27, 28        12
26                4

3.2.1 Case 1
As shown in Fig. 2(a), if all reference pixels are the same, the predicted values of the nine directional predictions are identical, so it is not necessary to calculate all the prediction modes. In this case, only the DC mode is used for encoder and decoder prediction, and the prediction mode bit can be skipped. In order to classify a 4x4 block into this category, the variance σ1 and threshold T1 are calculated.
If the variance σ1 of all of the neighboring pixels is less than the threshold T1, only the DC prediction mode is used and no prediction mode bits are needed. The variance σ1 and mean μ1 are defined as

$\sigma_1 = \sum_{i=1}^{12} |P_i - \mu_1|$ and $\mu_1 = \Big\lfloor \Big( \sum_{i=1}^{12} P_i \Big) / 12 \Big\rfloor$   (8)
where Pi is the i-th pixel of Fig. 1(a) and μ1 is the mean value of the block boundary pixels. In order to avoid the computationally expensive division and truncation operations, μ1 is replaced by the weighted mean value μ1′ as

$\mu_1' = \Big( \sum_{i=1}^{4} P_i + \Big( \sum_{i=5}^{8} P_i \Big) << 1 + \sum_{i=9}^{12} P_i \Big) >> 4$ and $\sigma_1 = \sum_{i=1}^{12} |P_i - \mu_1'|$   (9)
Here << and >> are the shift left and shift right operator, respectively. In order to set the threshold T1, we have done several experiments for four different types of video sequences (Mother & Daughter, Foreman, Bus and Stefan) with CIF format at different QP values. Mother & Daughter represents simple and low motion video sequence. Foreman and Bus contain medium detail and represent medium motion video sequences. Stefan represents high detail and complex motion video sequence. By changing the threshold, we observed the RD performance and found that threshold T1 is independent on the type of video sequence but depends on the QP values. By using the polynomial fitting technique, the threshold value T1 is approximated as follows:
$T_1 = \begin{cases} QP + 12 & \text{if } QP \le 24 \\ 5QP - 90 & \text{otherwise} \end{cases}$   (10)

Fig. 2. (a) Case 1: All of the reference pixels have similar value (b) Case 2: The reference pixels of up and up-right block have similar value
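A minimal sketch of the Case 1 test is given below (ours; it assumes the twelve boundary pixels P1-P12 of Fig. 1(a) are supplied as a 0-indexed list in that order, and it uses the shift-based mean of Eq. (9) and the threshold of Eq. (10)).

def case1_dc_only(ref_pixels, qp):
    # Case 1 test of the ANM method for a 4x4 block: if the absolute-deviation
    # variance of the 12 boundary reference pixels is below T1(QP), only the
    # DC mode is evaluated and no mode bits are sent.
    p = ref_pixels                                                  # P1..P12
    mu1 = (sum(p[0:4]) + (sum(p[4:8]) << 1) + sum(p[8:12])) >> 4    # Eq. (9)
    sigma1 = sum(abs(x - mu1) for x in p)                           # Eq. (9)
    t1 = qp + 12 if qp <= 24 else 5 * qp - 90                       # Eq. (10)
    return sigma1 < t1

# a nearly flat neighbourhood at QP=28: the DC-only path is taken
print(case1_dc_only([100, 101, 100, 99, 100, 100, 101, 100, 99, 100, 100, 101], 28))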
3.2.2 Case 2
As shown in Fig. 2(b), if all of the reference pixels of the up and up-right blocks are the same, the vertical, diagonal-down-left, vertical-left, vertical-right and horizontal-down modes produce the same prediction values. Therefore, in the proposed method we have chosen only one mode from this set, namely the vertical prediction mode. If the variance σ2 of the neighboring pixels of the up and up-right blocks is less than the
threshold T2, four prediction modes (vertical, horizontal, diagonal-down-right and horizontal-up) are used and one of them is selected through the RDO process. Instead of the 3 bits used by the original encoder, each of the four prediction modes is represented by 2 bits. In this case, the binary representations of the prediction modes are recorded as shown in Table 3, and hence a significant number of mode bits are saved. Threshold T2 is selected in the same way as T1. T2 also depends on QP, and better results were obtained at

$T_2 = \lfloor 2T_1/3 \rfloor \approx (T_1 >> 1) + (T_1 >> 3)$   (11)

The variance σ2 and mean μ2 are defined as

$\sigma_2 = \sum_{i=5}^{12} |P_i - \mu_2|$ and $\mu_2 = \Big( \sum_{i=5}^{12} P_i \Big) >> 3$   (12)

where μ2 is the mean value of the block boundary pixels of the top and top-right blocks.

Table 3. Binary representation of modes of case 2

Mode                  Binary representation
Vertical              00
Horizontal            01
Diagonal-down-right   10
Horizontal-up         11
The flow diagram of the proposed ANM method is presented in Fig. 3. The variance σ1 and threshold T1 are calculated at the start of the mode decision process and if
the variance is less than the threshold (σ1 < T1), only the DC prediction mode is used. In this case the computationally expensive RDO process is skipped and a lot of computation is saved. In addition, no bit is necessary to represent the intra prediction mode because only one mode is used. On the other hand, if σ1 < T1 is not satisfied, the encoder calculates the variance σ2 and threshold T2. If σ2 < T2, the vertical, horizontal, diagonal-down-right and horizontal-up modes are used as candidate modes in the RDO process. A substantial saving in computation is achieved by using 4 prediction modes instead of the 9 modes of the original RDO process. In order to represent the best mode, 2 bits are sent to the decoder, and Table 3 shows the four prediction modes with their corresponding binary representations. As shown in Table 3, if the diagonal-down-right mode is selected as the best mode, the encoder sends “10” to the decoder. In this category, only 2 bits are used to represent the intra prediction mode, whereas 3 bits are used in the original encoder. Consequently, a large number of intra prediction mode bits are saved. If σ2 < T2 is not satisfied, the nine prediction modes are used as candidate modes and one of them is selected through the RDO process, as in H.264/AVC. The new prediction mode numbers are recorded and compared against H.264/AVC in Table 4.
Fig. 3. Flow diagram of proposed ANM method
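The overall decision flow of Fig. 3 reduces to a small amount of control logic, sketched below (ours, with illustrative mode labels); it returns the candidate set handed to RDO and the number of bits spent signalling the chosen mode.

ALL_MODES = list(range(9))                 # nine 4x4 intra modes
CASE2_MODES = ["vertical", "horizontal", "diagonal-down-right", "horizontal-up"]

def anm_mode_set(sigma1, t1, sigma2, t2):
    # Mode-set selection of the ANM flow (Fig. 3): candidate modes for RDO
    # plus the number of mode-signalling bits (0, 2 or 3).
    if sigma1 < t1:                        # Case 1: flat neighbourhood
        return ["DC"], 0                   # single mode, no mode bits
    if sigma2 < t2:                        # Case 2: flat up / up-right rows
        return CASE2_MODES, 2              # four modes, 2-bit index (Table 3)
    return ALL_MODES, 3                    # default: nine modes as in H.264/AVC

print(anm_mode_set(sigma1=5, t1=50, sigma2=80, t2=25))   # -> (['DC'], 0)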
Since 8x8 intra prediction also uses 9 prediction modes, the proposed ANM method is also applied to the 8x8 intra prediction mode. Assume Pi is the i-th reconstructed pixel of Fig. 1(c). The variances and thresholds of an 8x8 block are defined as
$\mu_1'^{8\times8} = \Big( \sum_{i=1}^{8} P_i + \Big( \sum_{i=9}^{16} P_i \Big) << 1 + \sum_{i=17}^{24} P_i \Big) >> 5$ and $\sigma_1^{8\times8} = \sum_{i=1}^{24} |P_i - \mu_1'^{8\times8}|$   (13)

$\mu_2^{8\times8} = \Big( \sum_{i=9}^{24} P_i \Big) >> 4$ and $\sigma_2^{8\times8} = \sum_{i=9}^{24} |P_i - \mu_2^{8\times8}|$   (14)

$T_1^{8\times8} = \begin{cases} 2QP + 24 & \text{if } QP \le 24 \\ 10QP - 180 & \text{otherwise} \end{cases}$ and $T_2^{8\times8} = (T_1^{8\times8} >> 1) + (T_1^{8\times8} >> 3)$   (15)
Table 4. Prediction modes recording of the proposed method

Mode                  Mode number (H.264/AVC)   Mode number (Proposed)
Diagonal-down-left    3                         0
Vertical-right        5                         1
Horizontal-down       6                         2
Vertical-left         7                         3
Vertical              0                         4
Horizontal            1                         5
DC                    2                         6
Diagonal-down-right   4                         7
Horizontal-up         8                         8
4 Simulation Results To evaluate the performance of the proposed method, JM 12.4 [7] reference software is used in simulation. Simulation conditions are (a) QPs are 28, 36, 40, 44 (b) entropy coding: CABAC (c) RDO on (d) frame rate: 30 fps and (e) number of frames: 100. The comparison results are produced and tabulated based on the average difference in the total encoding ( ΔT1 %) and decoding ( ΔT2 %) time, the average PSNR differences ( ΔP ), and the average bit rate difference ( ΔR% ). PSNR and bit rate differences are calculated according to the numerical averages between RD curves [8]. The encoding ( ΔT1 %) and decoding ( ΔT2 %) complexity is measured as follows
$\Delta T_1 = \dfrac{T_{penc} - T_{oenc}}{T_{oenc}} \times 100\%$   (16)

$\Delta T_2 = \dfrac{T_{pdec} - T_{odec}}{T_{odec}} \times 100\%$   (17)
where, Toenc and Todec are the total encoding and decoding time of the JM 12.4 encoder, respectively. Tpenc and Tpdec are the total encoding and decoding time of the proposed method, respectively.
Table 5(a). RD performances of proposed methods (only 4x4 modes, All I frames)

Sequence                        | DWP [4]       | SMP [6]       | Proposed IDCP only | Prop. ANM only | ANM + IDCP
                                | ΔP     ΔR %   | ΔP     ΔR %   | ΔP     ΔR %        | ΔP     ΔR %    | ΔP     ΔR %
Grand Mother (QCIF)             | 0.04   -1.4   | 0.37   -15.4  | 0.11   -4.6        | 0.41   -16.4   | 0.47   -18.0
Salesman (QCIF)                 | 0.02   -0.2   | 0.32   -12.9  | 0.12   -6.6        | 0.39   -13.5   | 0.42   -18.9
Stefan (QCIF)                   | 0.01   -0.2   | 0.10   -2.7   | 0.09   -1.3        | 0.19   -5.8    | 0.21   -5.6
Carphone (QCIF)                 | 0.04   -1.0   | 0.66   -18.4  | 0.07   -2.1        | 0.79   -22.3   | 0.84   -20.7
Silent (CIF)                    | 0.02   -1.0   | 0.35   -15.4  | 0.07   -2.4        | 0.40   -17.3   | 0.45   -19.1
Hall (CIF)                      | 0.02   -0.5   | 0.32   -8.6   | 0.10   -2.6        | 0.37   -9.8    | 0.46   -11.2
Mobile Calendar (HD-1280x720)   | 0.03   -2.4   | 0.19   -6.8   | 0.06   -2.7        | 0.25   -9.3    | 0.26   -10.1
Average                         | 0.03   -0.96  | 0.33   -11.5  | 0.09   -3.19       | 0.40   -13.5   | 0.44   -14.8
4.1 Experiments with 4x4 Intra Modes Only
In this experiment all frames are intra coded and only the 4x4 mode is enabled. The performance comparisons are presented in Table 5. In these tables, a positive value indicates an increment and a negative value represents a decrement. As shown in Table 5(a), the proposed IDCP improves PSNR by 0.09 dB and reduces the bit rate by 3.19%, whereas DWP improves PSNR by only 0.03 dB and reduces the bit rate by only 0.96% relative to the original encoder. In terms of computation, the proposed IDCP increases the encoding and decoding time by 4.01% and 2.51%, respectively. In the case of the SMP method, the average PSNR improvement is about 0.33 dB and the average bit rate reduction is about 11.5%, whereas with our proposed ANM method, the average PSNR improvement is about 0.40 dB and the average bit rate reduction is about 13.5%. The proposed ANM-only method also reduces the computation of the original encoder by 41.5%.
Table 5(b). Complexity comparisons of proposed methods (only 4x4 modes, All I frames)

Sequence                        | DWP [4]        | SMP [6]        | Proposed IDCP only | Prop. ANM only  | ANM + IDCP
                                | ΔT1%   ΔT2%    | ΔT1%   ΔT2%    | ΔT1%   ΔT2%        | ΔT1%   ΔT2%     | ΔT1%   ΔT2%
Grand Mother (QCIF)             | 2.22   2.25    | -39.7  2.09    | 4.35   4.39        | -52.1  1.99     | -44.9  4.7
Salesman (QCIF)                 | 1.17   0.04    | -31.2  1.19    | 3.70   1.12        | -37.9  0.91     | -34.7  2.3
Stefan (QCIF)                   | 1.01   0.39    | -17.9  0.39    | 3.46   0.54        | -25.4  0.33     | -21.0  1.1
Carphone (QCIF)                 | 1.29   1.18    | -33.8  0.42    | 4.27   2.65        | -46.0  0.39     | -43.5  3.2
Silent (CIF)                    | 1.34   0.99    | -35.8  1.74    | 4.77   2.81        | -45.8  1.53     | -41.6  3.1
Hall (CIF)                      | 1.35   0.35    | -38.8  1.45    | 3.16   1.71        | -48.5  1.18     | -46.1  3.2
Mobile Calendar (HD-1280x720)   | 2.62   1.28    | -27.6  1.55    | 4.39   1.84        | -34.7  1.59     | -32.9  2.8
Average                         | 1.57   0.93    | -32.2  1.27    | 4.01   2.51        | -41.5  1.13     | -37.8  2.91
Although this method introduces some extra computation on the decoder side, the simulation results of Table 5(b) confirm that the computational overhead of the decoder is very low (about 1.13%). It is shown that if we combine both of our methods, about 14.8% bit rate reduction is achieved along with a 0.44 dB improvement in PSNR, at the expense of a 2.91% increase in decoding time. The proposed method reduces the computation of the encoder by 37.8%.
4.2 Experiments with All Intra Modes
In this experiment all frames are encoded by intra coding and all intra modes (4x4, 8x8, and 16x16) are enabled. The results are tabulated in Table 6. Here the proposed IDCP method is applied to 4x4 blocks, and the ANM method is implemented for 4x4 and 8x8 blocks. Since only a small number of MBs are encoded with 16x16 modes, the proposed methods are not implemented for the 16x16 mode, due to computational difficulties. We can see that the average gain is 0.37 dB in PSNR and a 12.2% bit rate saving, with a maximum for the sequence Carphone of 0.79 dB and 17.7%. The proposed method reduces the computation of the original encoder by 31.2%. The computation increment on the decoder side is very low, 2.3% on average.

Table 6. Experimental results of proposed methods (All I frames, all Intra modes)

Sequence                        ΔP in dB   ΔR %    ΔT1 %   ΔT2 %
Grand Mother (QCIF)             0.33       -14.2   -36.8   4.0
Salesman (QCIF)                 0.40       -13.1   -29.0   1.4
Stefan (QCIF)                   0.18       -5.8    -17.9   1.1
Carphone (QCIF)                 0.79       -17.7   -33.8   3.8
Silent (CIF)                    0.36       -12.3   -37.1   3.1
Hall (CIF)                      0.31       -9.1    -34.7   1.3
Mobile Calendar (HD-1280x720)   0.20       -13.1   -29.2   2.0
Average                         0.37       -12.2   -31.2   2.3
5 Conclusions
In this paper, we propose two methods to improve the RD performance of the H.264/AVC intra encoder. First, a distance-based improved DC prediction is utilized for better representation of the smooth regions of sequences. Then, a bit rate reduction scheme for representing the intra prediction mode is described. The proposed methods not only improve the RD performance but also reduce the computational complexity of the H.264/AVC intra coder.
References 1. ISO/IEC 14496-10, Information Technology-Coding of audio-visual objects-Part:10: Advanced Video Coding. ISO/IEC JTC1/SC29/WG11 (2004) 2. Shiodera, T., Tanizawa, A., Chujoh, T.: Block based extra/inter-polating prediction for intra coding. In: Proc. IEEE ICIP 2007, pp. VI-445–VI-448 (2007)
3. Ye, Y., Karczewicz, M.: Improved H.264 Intra coding based on bi-directional intra prediction, directional transform, and adaptive coefficient scanning. In: Proc. IEEE ICIP 2008, pp. 2116–2119 (2008) 4. Yu, S., Gao, Y., Chen, J., Zhou, J.: Distance based weighted prediction for H.264 Intra Coding. In: Proc. ICALIP 2008, pp. 1477–1480 (2008) 5. Wang, L., Po, L.M., Uddin, Y.M.S., Wong, K.M., Li, S.: A novel weighted cross prediction for H.264 intra coding. In: Proc. IEEE ICME 2009, pp. 165–168 (2009) 6. Kim, D.Y., Han, K.H., Lee, Y.L.: Adaptive Single-Multiple Prediction for H.264/AVC Intra Coding. IEEE Trans. on Circuit and System for Video Tech. 20(4), 610–615 (2010) 7. JM reference software 12.4, http://iphome.hhi.de/suehring/tml/download/old_jm/jm12.4.zip 8. Bjontegaard, G.: Calculation of average PSNR differences between RD-curves. Presented at the 13th VCEG-M33 Meeting, Austin, TX (April 2001) 9. Sarwer, M.G., Po, L.M., Wu, J.: Fast Sum of Absolute Transformed Difference based 4x4 Intra Mode Decision of H.264/AVC Video Coding Standard. Journal of Signal Processing: Image Commun. 23(8), 571–580 (2008) 10. Sarwer, M.G., Po, L.M., Wu, J.: Complexity Reduced Mode Selection of H.264/AVC Intra Coding. In: Proceeding on International Conference on Audio, Language and Image Processing (ICALIP 2008), China, pp. 1492–1496 (2008) 11. Sarwer, M.G., Wu, Q.M.J.: Adaptive Variable Block-Size Early motion estimation termination algorithm for H.264/AVC Video Coding Standard. IEEE Trans. Circuit and System for Video Technol. 19(8), 1196–1201 (2009) 12. La, B., Jeong, J., Choe, Y.: Most probable mode-based fast 4 × 4 intra-prediction in H.264/AVC. In: International Conference on Signal Processing and Communication Systems, ICSPCS 2008, pp. 1–4 (2008) 13. Elyousfi, A., Tamtaoui, A., Bouyakhf, E.: A New Fast Intra Prediction Mode Decision Algorithm for H.264/AVC Encoders. Journal of World Academy of Science, Engineering and Technology 27, 1–7 (2007) 14. La, B., Jeong, J., Choe, Y.: Fast 4×4 intra-prediction based on the most probable mode in H.264/AVC. IEICE Electron. Express 5(19), 782–788 (2008) 15. Krishnan, N., Selva Kumar, R.K., Vijayalakshmi, P., Arulmozhi, K.: Adaptive Single Pixel Based Lossless Intra Coding for H.264 / MPEG-4 AVC. In: International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007), vol. 3, pp. 63– 67 (2007)
Extracting Protein Sub-cellular Localizations from Literature Hong-Woo Chun1 , Jin-Dong Kim2 , Yun-Soo Choi1 , and Won-Kyung Sung1 1 2
Korea Institute of Science and Technology Information, 335 Gwahangno, Yuseong-gu, Daejeon, 305-806, Republic of Korea Database Center for Life Science, Research Organization of Information and System, Japan {hw.chun,armian,wksung}@kisti.re.kr, [email protected]
Abstract. Protein Sub-cellular Localization (PSL) prediction is an important task for predicting protein functions. Because the sequence-based approach used in most previous work has focused on predicting locations for given proteins, it fails to provide useful information for cases in which a single protein is localized, depending on its current state, in several different sub-cellular locations. While this is difficult for the sequence-based approach, it can be tackled by a text-based approach. The proposed approach extracts PSL from the literature using Natural Language Processing techniques. We conducted experiments to see how our system performs in the identification of evidence sentences and which linguistic features from sentences significantly contribute to the task. This article presents a novel text-based approach to extracting PSL relations together with their evidence sentences. Evidence sentences provide indispensable pieces of information that the sequence-based approach cannot supply.
1 Introduction
Analysis of where a protein resides in a cell is an important component of genome, transcriptome, and proteome annotation, and much research has tackled the prediction of the place where a protein is located. Protein Sub-cellular Localization (PSL) prediction is the name of this task. While most available programs apply Machine Learning (ML) techniques using features from the amino acid sequences, Text Mining (TM) and Natural Language Processing (NLP) can also contribute to the task. First, the number of pairs of proteins and their localizations reported in papers increases very rapidly. Since experimentally recognizing the PSL is a time-consuming and expensive task, it would be of great use if one could identify sentences describing localizations in the published literature which are not yet known in databases. Second, the sequence-based approach can only predict localization, and biologists would like to have further confirmation from relevant publications; bibliographic links will be indispensable. Third, since we have constructed a corpus with respect to the PSL, it may provide useful features for sequence-based PSL prediction programs.
There has been considerable research on PSL prediction. Most approaches have employed ML techniques to combine various features. For example, WoLF PSORT [1] employed k-Nearest Neighbors (kNN)-based ML to deal with features from amino acid sequences. As attempts at a TM approach to the task, Stapley et al. (2002) and Brady and Shatkay (2008) used text-based features to train a Support Vector Machine (SVM) [2] [3]. Although TM techniques have contributed to PSL prediction, in most previous work their role was to provide useful features to the sequence-based approach. In addition, they have not provided further confirmation from the literature, which may be the most important role of TM techniques. Such outputs are of intrinsically limited value, because a protein can be located in more than one sub-cellular location. It would be more useful if the output contained not only pairs of proteins and sub-cellular locations but also additional information. To analyze such additional conditions, we have to use NLP techniques. Our aim is to develop an extractor that predicts sub-cellular localizations for proteins using state-of-the-art NLP techniques. The proposed approach provides not only protein sub-cellular localizations but also the evidential sentences that may contain this additional information. From the relation extraction point of view, we regard the task as relation extraction between proteins and sub-cellular locations.
2 Methodology
This section describes a novel relation extraction approach using NLP techniques. Before explaining the relation extraction method, we describe the construction of gold standard data for PSL and explain the features used in the ML technique.

2.1 Construction of Gold Standard
A gold standard corpus has been constructed using the GENIA corpus [4] as initial data, because the term annotation of the GENIA corpus has already been done by human experts. The term annotation of the GENIA corpus contains annotations of Protein and Cellular component, the latter being a candidate sub-cellular location. Table 1 presents a statistical analysis of GENIA corpus version 3.02 from the viewpoint of PSL. Three points should be mentioned about the annotation task. The first is that only sentence-level co-occurrences have been processed in this approach; thus, a co-occurrence denotes a sentence that contains at least one pair of a protein name and a sub-cellular location name. The second is that the term sub-cellular location is used instead of the GENIA corpus term cellular component, although these cellular components contain not only sub-cellular locations but also other components. The third is that the proteins in the annotation task are the terms annotated as Protein. The Protein category in the GENIA ontology contains the following sub-categories: Protein complex, Protein domain or region, Protein family or group, Protein
Table 1. Statistical analysis of GENIA corpus

GENIA corpus            Frequency
Abstracts               2,000
Sentences               18,554
Proteins                34,265 (unique: 8,369)
Sub-cellular locations  743 (unique: 225)
Co-occurrences          863
Unique pairs            557
molecule, Protein substructure. In addition, we only considered surface forms annotated as cellular component. We have defined two types of annotation for PSL. One is the classification of sub-cellular location names into fine-grained sub-cellular locations, and the other is the categorization of relations between proteins and sub-cellular locations into three pre-defined categories. For the classification of sub-cellular locations, the following 11 fine-grained sub-cellular locations were selected by biologists based on the Gene Ontology: Cytoplasm, Cytoskeleton, Endoplasmic reticulum, Extracellular region, Golgi apparatus, Granule, Lysosome, Mitochondria, Nucleus, Peroxisome, Plasma membrane. Among these, "Extracellular region" is strictly speaking not a sub-cellular location; because it is nevertheless an important location we should deal with, it was treated as one of the sub-cellular locations in this approach. Table 2 presents a statistical analysis of the classification of sub-cellular locations. Three human experts annotated the classification of sub-cellular locations with a Fleiss' kappa score of 1.0; Fleiss' kappa is a statistical measure for assessing the reliability of inter-annotator agreement among a fixed number of annotators, and a kappa score of 1.0 means that all annotators produced exactly the same annotations. To calculate the inter-annotator agreement of the three annotators, all of them annotated the same 31 co-occurrences from 54 MEDLINE abstracts. The kappa score, κ, is defined as

κ = (P_o − P_c) / (1 − P_c)    (1)
where P_o is the proportion of observed agreements and P_c is the proportion of agreements expected by chance [5]. For the categorization of relations between proteins and sub-cellular locations, we defined the following three categories. The notation in the examples follows that used above.
– Positive assertion: A true relation between a protein name and a sub-cellular location name in a co-occurrence, indicating the positive existence of the protein in a certain sub-cellular location.
e.g.) We conclude that [p56]P exists in [cytosol]L in a higher order complex
containing hsp70 and hsp90, both of which in turn have been found to be associated with untransformed steroid receptors. (PMID: 2378870)
– Negative assertion: A true relation between a protein name and a sub-cellular location name in a co-occurrence, indicating the negative existence of the protein in a certain sub-cellular location.
e.g.) There is no detectable [NF-AT]P protein expression in the [nuclei]L of resting eosinophils. (PMID: 10384094)
– Neutral: Neither a positive nor a negative relation is indicated, although a protein name and a sub-cellular location name occur together in a sentence (a co-occurrence). In the following example, the relation between the protein [myosin light chain kinase]P and the location [stress fiber]L is neutral.
e.g.) Monocyte adhesion and receptor cross-linking induced [stress fiber]L assembly, and inhibitors of [myosin light chain kinase]P prevented this response but did not affect receptor clustering. (PMID: 10366600)

Table 2. Statistical analysis of newly annotated corpus: classification of cellular components

Gold standard                 Frequency
Total sub-cellular locations  743
Cytoplasm                     184
Cytoskeleton                  12
Endoplasmic reticulum         14
Extracellular region          1
Golgi apparatus               1
Granule                       77
Lysosome                      3
Mitochondria                  6
Nucleus                       346
Peroxisome                    5
Plasma membrane               94
Table 3. Statistical analysis of newly annotated corpus: categorization of relations

Sub-cellular locations        # Relevant Co-occurrences    # Irrelevant Co-occurrences
                              (Positive + Negative)        (Neutral)
Total sub-cellular locations  301 (286 + 15)               562
Nucleus                       173 (159 + 14)               233
Cytoplasm                     94 (94 + 0)                  189
Plasma membrane               23 (23 + 0)                  77
Granule                       9 (8 + 1)                    47
Lysosome                      1 (1 + 0)                    5
Cytoskeleton                  1 (1 + 0)                    11
Table 3 gives the numbers of relevant and irrelevant relations between proteins and sub-cellular locations in the relation-categorized corpus. The relevant relations comprise positive and negative relations, and the irrelevant relations comprise neutral relations. Three human experts annotated the categorization of relations with a Fleiss' kappa score of 0.688, which can be regarded as substantial agreement by the criterion of Landis and Koch [6]. To calculate the inter-annotator agreement of the three annotators, all of them annotated the same 203 co-occurrences from 54 MEDLINE abstracts.

2.2 Extraction of Protein Sub-cellular Localizations Using Natural Language Processing Techniques
A Maximum Entropy (ME)-based ML technique is used to build the relation extraction system. The ME technique can combine various types of features in a maximally unbiased way, and it is a general technique for estimating probability distributions from data. It has therefore been widely used to incorporate various features in classification problems in NLP [7]. The ME-based relation extraction method uses various features derived from context, including co-occurrence information and syntactic analysis. The features considered in the proposed method are as follows:
– Frequency of protein and sub-cellular location pairs (features 1–2):
(1) In the GENIA corpus (2,000 abstracts).
(2) In MEDLINE 2007–8: we checked the frequency of protein and sub-cellular location pairs in MEDLINE abstracts published between 2007 (19,051,558 abstracts) and 2008 (up to May 16, 2008: 18,623,706 abstracts). Among the 557 unique pairs of a protein and a sub-cellular location in the GENIA corpus, 122 pairs do not occur in MEDLINE 2007–8, possibly because the GENIA corpus contains rather old papers (from 1990 to 1999).
(3) Protein and sub-cellular location names annotated by human experts.
– Adjacent words of protein and sub-cellular location names (features 4–5):
(4) Adjacent one word.
(5) Adjacent one and two words.
(6) Bag of words: all contextual terms in a co-occurrence.
(7) Order of candidate names: ProLocRE determines whether or not a protein name appears before a sub-cellular location name in a co-occurrence.
(8) Distance between protein and sub-cellular location names: the number of words between the protein and sub-cellular location names.
– Features from the syntactic analysis (features 9–11):
(9) Syntactic category of protein and sub-cellular location names: ProLocRE parses all co-occurrences using the deep syntactic parser ENJU ver. 2.1 [8]. The ENJU parser provides part-of-speech tags for all words and syntactic categories for phrase structures.
Fig. 1. An example for explaining the features from the full parsing results, using the sentence "We conclude that p56 exists in cytosol in a higher order complex containing hsp70 and hsp90, both of which in turn have been found to be associated with untransformed steroid receptors." (Protein: p56, Sub-cellular location: cytosol)
Syntactic categories of the protein and sub-cellular location names were used as features. In Figure 1, p56 is a protein name and cytosol is a sub-cellular location name, and their syntactic categories are both Noun phrase (N).
(10) Predicates of protein and sub-cellular location names: the ENJU parser can analyze the deep syntactic and semantic structure of an English sentence and provide predicate-argument relations among words. The predicates of the protein and sub-cellular location names were used as features. In Figure 1, "exists" and "in" are the predicates of the protein (p56) and sub-cellular location (cytosol) names, respectively.
(11) Part-of-speech of predicates: in Figure 1, "VBZ" and "IN" are the part-of-speech tags of the predicates exists and in, respectively. VBZ denotes a verb in the third person singular present, and IN denotes a preposition.
A minimal sketch of how such a feature set might be assembled for one co-occurrence is given below.
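As an illustration of how the surface features (1)–(8) could be assembled for a single protein/location co-occurrence and fed to a maximum-entropy model, the following Python sketch uses a feature dictionary together with multinomial logistic regression, the usual realisation of an ME classifier. This is not the authors' implementation: all names (cooccurrence_features, pair_freq_genia, pair_freq_medline, X_dicts, y) are hypothetical, and the syntactic features (9)–(11) are only indicated by a comment because they require the ENJU parse.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def cooccurrence_features(tokens, prot_idx, loc_idx,
                          pair_freq_genia, pair_freq_medline):
    """Feature dict for one protein/location pair in a tokenized sentence."""
    prot, loc = tokens[prot_idx], tokens[loc_idx]
    feats = {
        "freq_genia": pair_freq_genia.get((prot, loc), 0),       # feature 1
        "freq_medline": pair_freq_medline.get((prot, loc), 0),   # feature 2
        "protein=" + prot: 1, "location=" + loc: 1,               # feature 3
        "order_protein_first": int(prot_idx < loc_idx),           # feature 7
        "distance": abs(prot_idx - loc_idx) - 1,                  # feature 8
    }
    # adjacent one and two words (features 4-5)
    for idx, tag in ((prot_idx, "p"), (loc_idx, "l")):
        for off in (-2, -1, 1, 2):
            if 0 <= idx + off < len(tokens):
                feats["adj_%s_%+d=%s" % (tag, off, tokens[idx + off])] = 1
    # bag of words (feature 6)
    for tok in tokens:
        feats["bow=" + tok] = 1
    # features 9-11 (syntactic category, predicates, POS of predicates)
    # would be added here from the ENJU parse of the sentence
    return feats

# Maximum-entropy classification over such feature dicts:
# X = DictVectorizer().fit_transform(X_dicts)     # X_dicts: list of feature dicts
# clf = LogisticRegression(max_iter=1000).fit(X, y)   # y: 1 = relevant, 0 = neutral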
3 Experimental Results
In the experiments, we evaluated how well the proposed approach extracts relevant relations. To evaluate the system, we performed 10-fold cross validation and measured the precision, recall, and F-score of the system in all experiments. All co-occurrences in the GENIA corpus are used; they contain 301 relevant relations and 562 irrelevant relations (Table 3). We conducted two sets of experiments: one set extracts relevant relations between proteins and sub-cellular locations without classification of sub-cellular locations, and the other set extracts relevant relations with classification of sub-cellular locations. Both sets of experiments share the same baseline assumption: a pair of a protein and a sub-cellular location indicates a relevant relation whenever the two occur together in a co-occurrence.

3.1 Extraction of Relevant Relations between Proteins and Sub-cellular Locations without Classification of Sub-cellular Locations
We conducted experiments to extract relevant relations between proteins and sub-cellular locations without using the classification information of sub-cellular
Table 4. Performance of protein sub-cellular localization prediction for all cellular components (# of relations: 301)

Experiment                                                 Precision    Recall       F-score
Baseline (co-occurrence only)                              0.349        1.000        0.517
Effectiveness of each feature (all features, and
 leaving out one feature at a time; 12 combinations)       0.808–0.837  0.845–0.877  0.832–0.857
Best experiment (all features except feature 4)            0.837        0.877        0.857
Without the features from syntactic analysis               0.826        0.868        0.846

Note) Features 1–11: 1. Frequency of co-occurrence in GENIA corpus; 2. Frequency of co-occurrence in MEDLINE abstracts 2007–8; 3. Protein and cellular component names; 4. Adjacent one word of names; 5. Adjacent two words of names; 6. Bag of words; 7. Order; 8. Distance; 9. Syntactic category of protein and sub-cellular location names; 10. Predicates of protein and sub-cellular location names; 11. POS of predicates.
locations. Table 4 summarizes the experimental results, which comprise four types of experiments. The first type is the baseline experiment. The second type assesses the effectiveness of individual features for the text-based approach; it is an ablation study, i.e., a set of experiments in which individual features are left out in order to test their contribution, with all features used in the first experiment of this type. Through these experiments, we found that the features Order and Distance are effective for extracting the relevant relations. The third type is the best-performing experiment, which used all features except one, the adjacent one word of the protein and sub-cellular location names (feature 4). To show the effectiveness of the syntactic features, the fourth type of experiment was conducted without the features from the syntactic analysis.
Table 5. Performance of protein sub-cellular localization prediction for each fine-grained sub-cellular location. Performance: F-score (Precision, Recall).

Sub-cellular location  # Relevant relations  Baseline             w/o Syntactic analysis  Best combination
Nucleus                173                   0.334 (0.200, 1.0)   0.767 (0.742, 0.794)    0.776 (0.748, 0.807)
Cytoplasm              94                    0.196 (0.109, 1.0)   0.842 (0.812, 0.874)    0.850 (0.821, 0.880)
Plasma membrane        23                    0.052 (0.027, 1.0)   0.873 (0.821, 0.932)    0.911 (0.857, 0.973)
Granule                9                     0.017 (0.009, 1.0)   0.915 (0.878, 0.956)    0.926 (0.880, 0.978)

3.2 Extraction of Relevant Relations between Proteins and Sub-cellular Locations with Classification of Sub-cellular Locations
We conducted experiments to extract relevant relations between proteins and sub-cellular locations using the classification information of sub-cellular locations. This set of experiments therefore focuses on more fine-grained relations than the first set. Three types of experiments are compared for four fine-grained sub-cellular locations in Table 5: the baseline, the experiment without the features from the syntactic analysis, and the experiment using the best feature combination from Table 4. In the baseline experiments, we assumed that all pairs in the 863 co-occurrences indicate relevant relations for the corresponding fine-grained sub-cellular locations. The other two types of experiments used the same feature combinations as the corresponding experiments in Table 4. The four fine-grained sub-cellular locations were selected because they have more than one relevant relation. These experiments showed that the features from the syntactic analysis contributed to improved performance for all fine-grained sub-cellular locations, and the frequency-based features also played an important role in improving performance for three of the four fine-grained sub-cellular locations.
4 Discussion and Conclusion

4.1 Discussion
There are some issues that should be discussed for the proposed approach; some of them will be dealt with in future work. (1) Table 6 gives examples of predicted protein names for the corresponding fine-grained sub-cellular locations. We found that a protein may be located in more than one fine-grained sub-cellular location. To categorize proteins into fine-grained sub-cellular locations, the proposed approach provides the evidential sentences for the relations. The following two sentences are such evidential sentences: the first describes the pair of NF-kappaB and nucleus, and the second describes the pair of NF-kappaB and cytoplasm.
Table 6. Examples of predicted proteins for the corresponding intracellular compartments

Sub-cellular locations  Protein names
Nucleus                 NF-AT, NF-kappaB, Radiolabeled VDR, ...
Cytoplasm               FKBP51, NF-kappaB, hGR, protein kinase C, ...
Plasma membrane         monoclonal antibody, CD4, protein kinase C, ...
– IkappaB further reduces the translocation of [NF-kappaB]P into the [nucleus]L thus preventing the expression of proinflammatory genes. (PMID: 10487715)
– Associated with its inhibitor, I kappaB, [NF-kappaB]P resides as an inactive form in the [cytoplasm]L. (PMID: 9032271)
Although the proposed approach provides the evidential sentences, additional information such as the corresponding biological progress and conditions still needs to be analyzed. (2) Initially, 11 fine-grained sub-cellular locations were selected in order to cover all sub-cellular locations, but the GENIA corpus contains relevant relations for only six of them (see Table 3), and two of these six have only one relevant relation each. We expect that the remaining fine-grained sub-cellular locations can be dealt with by building corpora relevant to them. (3) Related to the second issue, the MEDLINE abstracts in the GENIA corpus concern only human. We expect that the annotated corpus will become more valuable if it includes MEDLINE abstracts related not only to humans but also to other animals or plants. (4) The proposed approach has considered only sentence-level co-occurrences. We expect that extending the context to a paragraph or a section might provide much more information than sentence-level co-occurrences alone.

4.2 Conclusion
This article describes a novel text-based relation extraction approach for PSL. The proposed approach makes three contributions. The first is the improvement of extraction performance for PSL by using various state-of-the-art NLP techniques; features from the syntactic analysis also played an important role in extracting PSLs. The second contribution is that the proposed method extracts not only relations but also their evidential sentences. The evidential sentences are very important for supporting the relation extraction method and for categorizing proteins into fine-grained sub-cellular locations; moreover, they would be good starting material for analyzing additional information about the relations. The
third contribution is the construction of a gold standard that contains the classification of sub-cellular locations into 11 fine-grained sub-cellular locations and the categorization of relations between proteins and sub-cellular locations into three categories.
References
1. Horton, P., Park, K.J., Obayashi, T., Nakai, K.: Protein Subcellular Localization Prediction with WoLF PSORT. In: Asia Pacific Bioinformatics Conference (APBC), pp. 39–48 (2006)
2. Stapley, B.J., Kelley, L., Sternberg, M.: Predicting the subcellular location of proteins from text using support vector machines. In: Pacific Symposium on Biocomputing, PSB (2002)
3. Brady, S., Shatkay, H.: EPILOC: A (Working) Text-Based System for Predicting Protein Subcellular Location. In: Pacific Symposium on Biocomputing, PSB (2008)
4. Kim, J.D., Ohta, T., Tsujii, J.: Corpus annotation for mining biomedical events from literature. BMC Bioinformatics 9(10) (2008)
5. Sim, J., Wright, C.C.: The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements. Physical Therapy 85(3), 206–282 (2005)
6. Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977)
7. Berger, A.L., Della Pietra, S.A., Della Pietra, V.J.: A maximum entropy approach to natural language processing. Computational Linguistics 22(1), 39–71 (1996)
8. Tsujii Laboratory: ENJU Deep Syntactic Full Parser ver. 2.1, http://www-tsujii.is.s.u-tokyo.ac.jp/enju/index.html/
9. Tsujii Laboratory: GENIA Project, http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/
Enhancing Content-Based Image Retrieval Using Machine Learning Techniques

Qinmin Vivian Hu1, Zheng Ye1,2, and Xiangji Jimmy Huang1

1 Information Retrieval and Knowledge Management Research Lab, York University, Toronto, Canada
2 Information Retrieval Lab, Dalian University of Technology, Dalian, China
[email protected], {jhuang,yezheng}@yorku.ca
Abstract. In this paper, we propose a term selection model that helps select terms from the documents describing images in order to improve content-based image retrieval performance. First, we introduce a general feature selection model. Second, we present a simple and effective way to collect training document collections, followed by selecting and ranking terms using the Kullback-Leibler Divergence. We then learn from the selected terms with a classification method and apply it to the content-based image retrieval result. Finally, we set up a series of experiments which confirm that the model is promising. Furthermore, we suggest optimal values for the term number maxK and the tuning combination parameter α.
1 Introduction
For centuries, a wealth of technology has been developed for the efficient retrieval of text in different languages. When it comes to pictures, however, machines do not handle images as well as text. One reason for this distinction is that text is a human creation, while typical images are a mere replica of what human beings have seen, and concrete descriptions of them are relatively elusive. Naturally, the interpretation of what we see is hard to characterize, and even harder to teach to a machine. Yet, over the past decade, ambitious attempts have been made to make machines learn to process, index and search images, with great progress [4]. Image retrieval is the task of retrieving images "similar" to the query provided by a user. In general, query terms are, for example, a keyword, an image file or link, or a click on some image. The similarity used as the retrieval criterion could be based on meta tags, the colour distribution in images, region or shape attributes, and so on. There are therefore two classical approaches to image retrieval: image meta search and content-based image retrieval (CBIR). Image meta search retrieves images based on associated metadata such as captions, keywords, or text describing an image. CBIR is a technology that in principle helps retrieve images based on their visual content. This characterization of CBIR places it at a unique juncture within the scientific community. While continually developing new techniques for image retrieval, researchers in the field have leveraged mature methodologies developed
in related fields including databases, information retrieval, signal and image processing, graphics, vision, document imaging, design research, human-computer interaction, machine learning, statistical modelling, data mining, artificial intelligence, social sciences, and so on [16]. However, CBIR still has shortcomings as a real-world technology. One problem, for example, is the reliance on visual similarity in the face of the semantic gap between low-level content and higher-level concepts [14]. ImageCLEF is a continuing image retrieval track, run as part of the Cross Language Evaluation Forum (CLEF) campaign. This track evaluates the retrieval of images described by text captions based on queries in a given language. The goal of the retrieval track is to explore and study the relationship between images and their captions during the retrieval process. The track is likely to appeal to members of more than one research community, including image retrieval, cross-language retrieval and user interaction [11]. In this paper, we focus on the automatic retrieval task of medical image retrieval in English. We propose a combined term selection model to improve the performance of CBIR based on text retrieval technology. First, we briefly introduce a general feature selection method. Then, to overcome the shortcomings of general feature selection, we propose a simple way to collect training documents effectively. We employ the Kullback-Leibler Divergence to assign weights to terms; it measures how a term's distribution in the feedback documents diverges from its distribution in the whole collection. Furthermore, we apply a classification method to the CBIR results. Last but not least, we conduct a series of experiments showing that our model is promising. The remainder of this paper is organized as follows. First, we propose a term selection model in Section 2. Then, in Section 3, we set up our experimental environment. After that, we present an empirical study of the experimental results and discuss and analyze the influence of our work in Section 4. Furthermore, we describe related work in Section 5. Finally, we briefly draw the contributions and conclusions of this paper in Section 6.
2 Term Selection Model
In this section, we propose a term selection model for selecting and ranking terms from the learning documents to improve CBIR performance. First, we introduce a general feature selection method in Section 2.1. Then we describe how to collect the training documents in Section 2.2. After that, we present a method for selecting and ranking features in Section 2.3, followed by a classification method in Section 2.4. Finally, we describe how the weights from the baseline and from the classification method are combined in Section 2.5. In addition, we present the pseudo code of the algorithm in Section 2.6.

2.1 Feature Selection
In statistics and machine learning, feature selection, also known as subset selection, is a process where a subset of features from the data are selected for
building learning models. The best subset contains the smallest number of feature dimensions that contribute most to accuracy; the remaining feature dimensions are discarded as unimportant. Feature selection therefore helps improve the performance of learning models by: 1) alleviating the effect of the curse of dimensionality; 2) enhancing generalization capability; 3) speeding up the learning process; and 4) improving model interpretability [6]. Feature selection algorithms typically fall into two categories: feature ranking and subset selection. Feature ranking ranks the features by a metric and eliminates all features that do not achieve an adequate score. Subset selection searches the set of possible features for the optimal subset. However, simple feature selection algorithms are ad hoc, even though more methodical approaches also exist. From a theoretical perspective, it can be shown that optimal feature selection for supervised learning problems requires an exhaustive search of all possible subsets of features of the chosen cardinality. If large numbers of features are available, this is impractical. For practical supervised learning algorithms, the search is for a satisfactory set of features rather than an optimal set. We therefore propose a term selection model that provides narrowed feature candidates for feature ranking and best subset selection. First, we extract features from the text retrieval results as terms for re-ranking the CBIR results. Second, we apply the Kullback-Leibler Divergence (KLD) to weight the terms; it measures a term's distribution in the training documents against its distribution in the whole collection. Third, we employ a classification method to learn from the selected terms and test it on the IR and CBIR results, respectively.

2.2 Collecting Training Documents
It is impractical for feature selection to perform an exhaustive search of all possible features. To keep the effort manageable, we propose a way of collecting the training documents for feature selection. First, we set up our experiments with the IR system introduced in Section 3. Second, from the IR results, we take the top k documents of each IR result as a set D; the underlying assumption is that the top k documents retrieved by the system are the most relevant ones. Third, we select and rank terms from the training set D as features, as described in Section 2.3.

2.3 Term Selection and Ranking
For a given query, we have the corresponding set D defined in Section 2.2. We also define the whole data collection as a set C. We then rank the unique terms in D in decreasing order of their KLD weights. KLD is a popular choice for expansion term weighting and has been applied in many state-of-the-art methods [1]. KLD measures how a term's distribution in the training document set D diverges from its distribution in the whole collection C. The higher the KLD is, the more informative
the term is. For a unique term in D, the KLD weight is given by Equation 1. The top maxK terms are selected and passed to the classification method of Section 2.4.

KLD(t) = P(t|D) · log2 ( P(t|D) / P(t|C) )    (1)
where P(t|D) = c(t, D)/c(D) is the generation probability of term t from D; c(t, D) is the frequency of t in D, and c(D) is the count of words in D. P(t|C) = c(t, C)/c(C) is the document pool model; c(t, C) is the frequency of t in C, and c(C) is the count of words in C. A minimal sketch of this weighting is given below.
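The following sketch computes Equation 1, assuming the training set D and the collection C are available as lists of tokenized documents; the function name kld_weights and its arguments are illustrative and not part of the paper's system.

import math
from collections import Counter

def kld_weights(train_docs, collection_docs, max_k=10):
    """Rank the unique terms of D by KLD(t) = P(t|D) * log2(P(t|D) / P(t|C))."""
    d_counts = Counter(t for doc in train_docs for t in doc)       # c(t, D)
    c_counts = Counter(t for doc in collection_docs for t in doc)  # c(t, C)
    d_len, c_len = sum(d_counts.values()), sum(c_counts.values())  # c(D), c(C)
    weights = {}
    for t, tf_d in d_counts.items():
        p_d = tf_d / d_len
        p_c = c_counts.get(t, 0) / c_len
        if p_c > 0:                      # terms unseen in C are skipped
            weights[t] = p_d * math.log2(p_d / p_c)
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:max_k]

# Example usage (hypothetical variables):
# top_terms = kld_weights(D_tokenized, C_tokenized, max_k=10)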
2.4 Classification Method
The support vector machine (SVM) [8] has been widely used for text classification in recent years. Its underlying principle is structural risk minimization: its objective is to determine a classifier or regression function which minimizes the empirical risk (the training set error) and the confidence interval (which corresponds to the generalization or test set error). Given a set of training data, an SVM determines a hyperplane in the space of possible inputs that attempts to separate the positive from the negative examples by maximizing the margin between the nearest positive and negative examples. There are several ways to train SVMs; one particularly simple and fast method is Sequential Minimal Optimization [12], which is adopted in our study. In addition, we empirically apply the non-linear homogeneous polynomial kernel function of degree dg:

k(x_i, x_j) = (x_i · x_j)^dg    (2)
where x_i and x_j are real vectors in a p-dimensional space, and p is the number of features. The exponent dg defaults to 1 in our study. A sketch of such a classifier is shown below.
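One way to realise such a classifier is with scikit-learn's SVC, whose LIBSVM backend uses an SMO-style solver; setting coef0 = 0 keeps the polynomial kernel homogeneous as in Equation 2. The variable names are illustrative, and using the probability output later as the classifier weight is an assumption of this sketch rather than the paper's exact choice.

from sklearn.svm import SVC

def train_svm(X_train, y_train, dg=1):
    """SVM with the homogeneous polynomial kernel k(x_i, x_j) = (x_i . x_j)^dg."""
    clf = SVC(kernel="poly", degree=dg, gamma=1.0, coef0=0.0, probability=True)
    clf.fit(X_train, y_train)
    return clf

# weight_classifier = clf.predict_proba(X_test)[:, 1]  # probability of the positive class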
2.5 Weight Combination
Two weights are available: one given by the original system and one by the classification method. In this paper, we simply combine these two weights by Equation 3, where α, the tuning combination parameter, is discussed further in Section 4.

Weight_re-ranking = α · Weight_original + (1 − α) · Weight_classifier    (3)
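Equation 3 amounts to a one-line combination followed by a sort, as sketched below; the sketch assumes that both weight lists are already normalised to comparable ranges, which the paper does not spell out.

def rerank(doc_ids, weights_original, weights_classifier, alpha=0.5):
    """Combine the two weights by Equation 3 and sort documents by the result."""
    combined = {d: alpha * w_o + (1.0 - alpha) * w_c
                for d, w_o, w_c in zip(doc_ids, weights_original, weights_classifier)}
    # higher combined weight = higher rank
    return sorted(combined, key=combined.get, reverse=True)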
2.6 Algorithm
The pseudo code of our proposed term selection method is presented in this section. Six steps are described, corresponding to Sections 2.1, 2.2, 2.3, 2.4 and 2.5.
0. Input
   IR results from the text mining systems; the initial retrieval result from the CBIR system.
1. Output
   Re-ranked lists for the text mining systems; a re-ranked list for the CBIR system.
2. Initialization
   k, where the top k documents of each IR result of the text mining systems are selected and put into the pool D;
   maxK, the number of features selected according to their KLD weights as training data.
3. Training Data Collection
   D, containing k documents; C, the whole data collection.
4. Term Selection and Ranking
   Extract terms in D with stemming;
   Use Equation 1 to compute the KLD weight of each term;
   Sort the term weights in decreasing order.
5. Classification and Re-ranking
   CF, the classifier introduced in Section 2.4;
   Learn the classifier CF from the maxK terms and use the resulting information to represent the documents in D as training data;
   For the CBIR results {
      Apply CF to the CBIR result as the testing data;
      Combine the weights from CF and the CBIR system by Equation 3;
      Re-rank the CBIR results by sorting the weights. }
   For the IR results {
      Apply CF to the IR results as the testing data;
      Combine the weights from CF and the IR system by Equation 3;
      Re-rank the IR results by sorting the weights. }
Fig. 1. An Algorithm for the Combined Term Selection Method
3 Experimental Setup
In this section, we first describe our IR system in Section 3.1. Then we introduce the content-based image retrieval system in Section 3.2. Third, we describe the CLEF 2009 medical image data set and queries in Section 3.3.

3.1 Text Retrieval System
We used Okapi BSS (Basic Search System) as our main search system. Okapi is an IR system based on the probability model of Robertson and Sparck Jones [2, 7, 13]. The retrieved documents are ranked in order of their probabilities of relevance to the query. Each search term is assigned a weight based on its within-document term frequency and query term frequency. The weighting function used is BM25:

w = log [ ((r + 0.5)/(R − r + 0.5)) / ((n − r + 0.5)/(N − n − R + r + 0.5)) ] · ((k1 + 1) · tf)/(K + tf) · ((k3 + 1) · qtf)/(k3 + qtf)  ⊕  k2 · nq · (avdl − dl)/(avdl + dl)    (4)
where N is the number of indexed documents in the collection, n is the number of documents containing a specific term, R is the number of documents known to be relevant to a specific topic, r is the number of relevant documents containing the term, tf is the within-document term frequency, qtf is the within-query term frequency, dl is the length of the document, avdl is the average document length, nq is the number of query terms, the k_i are tuning constants, K equals k1 · ((1 − b) + b · dl/avdl), and ⊕ indicates that the following component is added only once per document, rather than for each term. A direct transcription of this weighting function is sketched below.
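The following is a direct, hedged transcription of Equation 4; the default parameter values (k1 = 1.2, b = 0.75, k2 = 0, k3 = 7) are common Okapi settings and are not taken from the paper, and with r = R = 0 the log term reduces to the familiar idf-like component.

import math

def bm25_weight(tf, qtf, n, N, dl, avdl, r=0, R=0, nq=1,
                k1=1.2, k2=0.0, k3=7.0, b=0.75):
    """Okapi BM25 term weight following Equation 4.
    Returns the per-term weight and the per-document (k2) component."""
    K = k1 * ((1.0 - b) + b * dl / avdl)
    rsj = math.log(((r + 0.5) / (R - r + 0.5)) /
                   ((n - r + 0.5) / (N - n - R + r + 0.5)))
    w = ((k1 + 1.0) * tf / (K + tf)) * rsj * ((k3 + 1.0) * qtf / (k3 + qtf))
    # the k2 component is added once per document, not once per term
    doc_component = k2 * nq * (avdl - dl) / (avdl + dl)
    return w, doc_component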
3.2 Content-Based Image Retrieval System
Content-based Image Retrieval (CBIR) systems enable users to search a large image database by issuing an image sample, whose actual visual contents are analyzed. The contents of an image are its features: colors, shapes, textures, or any other information that can be derived from the image itself. This kind of technology sounds interesting and promising. The key issue in CBIR is to extract representative features to describe an image; however, this is a very difficult research topic. In this paper, we explore three representative features for medical image retrieval, listed below (a simple histogram sketch follows the list).
1. Color and Edge Directivity Descriptor (CEDD): a low-level feature which incorporates color and texture information in a histogram [5].
2. Tamura Histogram Descriptor: features coarseness, contrast, directionality, line-likeness, regularity, and roughness. The relative brightness of pairs of pixels is computed such that the degree of contrast, regularity, coarseness and directionality can be estimated [15].
3. Color Histogram Descriptor: retrieving images based on color similarity is achieved by computing a color histogram for each image that identifies the proportion of pixels within the image holding specific values (which humans express as colors). Current research attempts to segment color proportions by region and by spatial relationship among several color regions. Examining images based on the colors they contain is one of the most widely used techniques because it does not depend on image size or orientation.
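As a concrete example of the simplest of these descriptors, the sketch below computes a global RGB colour histogram with NumPy; the bin count and the histogram-intersection similarity are illustrative choices, not the exact configuration used in the paper.

import numpy as np

def color_histogram(image, bins_per_channel=8):
    """Global colour-histogram descriptor: the proportion of pixels in each
    quantised RGB bin. `image` is an H x W x 3 uint8 array; the result does
    not depend on image size or orientation."""
    quantised = (image.astype(np.uint32) * bins_per_channel) // 256  # 0..bins-1
    codes = (quantised[..., 0] * bins_per_channel + quantised[..., 1]) \
            * bins_per_channel + quantised[..., 2]
    hist = np.bincount(codes.reshape(-1).astype(np.int64),
                       minlength=bins_per_channel ** 3)
    return hist / hist.sum()

# Similarity between two images can then be taken as, e.g., histogram intersection:
# sim = np.minimum(color_histogram(img1), color_histogram(img2)).sum()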
3.3 Data Set and Queries
In this paper, we use the dataset of the CLEF 2009 medical image retrieval track. It contains scientific articles from two radiology journals, Radiology and Radiographics. The database comprises a total of 74,902 images. This collection constitutes an important body of medical knowledge from the peer-reviewed scientific literature, including high-quality images with annotations. Images are associated with journal articles and can be part of a figure. Figure captions were made available to participants, as well as the part concerning a particular subfigure if available. For each image, the caption and access to the full-text article through the Medline PMID (PubMed Identifier) were provided. An article's PMID could be used to obtain the officially assigned MeSH (Medical Subject
Table 1. Performance of Term Selection Model on Content-based Image Retrieval (MAP; relative improvement over the baseline in parentheses). Baseline MAP = 0.0045.

α     maxK = 10          maxK = 20          maxK = 30          maxK = 40
0.1   0.0056 (24.44%)    0.0053 (17.78%)    0.0046 (2.22%)     0.0045 (0.00%)
0.2   0.0080 (77.78%)    0.0067 (48.89%)    0.0049 (8.89%)     0.0047 (4.44%)
0.3   0.0115 (155.56%)   0.0083 (84.44%)    0.0050 (11.11%)    0.0049 (8.89%)
0.4   0.0164 (264.44%)   0.0095 (111.11%)   0.0054 (20.00%)    0.0053 (17.78%)
0.5   0.0203 (351.11%)   0.0106 (135.56%)   0.0057 (26.67%)    0.0058 (28.89%)
0.6   0.0218 (384.44%)   0.0126 (180.00%)   0.0069 (53.33%)    0.0068 (51.11%)
0.7   0.0251 (457.78%)   0.0133 (195.56%)   0.0078 (73.33%)    0.0086 (91.11%)
0.8   0.0251 (457.78%)   0.0130 (188.89%)   0.0088 (95.56%)    0.0096 (113.33%)
0.9   0.0223 (395.56%)   0.0121 (168.89%)   0.0092 (104.44%)   0.0100 (122.22%)
1.0   0.0222 (393.33%)   0.0120 (166.67%)   0.0090 (100.00%)   0.0094 (108.89%)
Headings) terms. The collection was entirely in English. However, the queries are supplied in German, French, and English [11]. We focus on the queries in English. The image-based queries have been created using methods where realistic search queries have been identified by surveying actual user needs. Each query contains 2 to 4 sample images. More information can be found in [11].
4 An Empirical Study
In this section, we first present our experimental results in Section 4.1. Then we analyze and discuss the influence of our proposed model on the text retrieval results and the CBIR results, respectively.

4.1 Experimental Results
The experimental results of the proposed model on the CBIR result are presented in Table 1, where the term number maxK and the tuning combination parameter α are set to different values. The original baseline is displayed first, and the remainder shows the performance under the various settings; the values in parentheses are the relative rates of improvement over the baseline. In addition, before tuning the parameters maxK and α, we set k = 10, following a conclusion of our earlier paper [19]. All results presented in this paper are automatic, i.e., no manual query modification or iterative selection of results is allowed. To show the robustness of our model, we also conducted experiments applying the proposed model to the text retrieval result. The re-ranking results are presented in Table 2; the tuning parameters maxK and α are the same as in Table 1, and the text retrieval baseline is displayed as well. The values
Table 2. Performance of Term Selection Model on Text Retrieval (MAP; relative improvement over the baseline in parentheses). Baseline MAP = 0.3520.

α     maxK = 10         maxK = 20         maxK = 30          maxK = 40
0.1   0.3561 (1.16%)    0.3547 (0.77%)    0.3546 (0.74%)     0.3534 (0.40%)
0.2   0.3612 (2.61%)    0.3578 (1.65%)    0.3560 (1.14%)     0.3540 (0.57%)
0.3   0.3677 (4.46%)    0.3604 (2.39%)    0.3591 (2.02%)     0.3568 (1.36%)
0.4   0.3680 (4.55%)    0.3639 (3.38%)    0.3581 (1.73%)     0.3540 (0.57%)
0.5   0.3692 (4.89%)    0.3634 (3.24%)    0.3564 (1.25%)     0.3525 (0.14%)
0.6   0.3657 (3.89%)    0.3612 (2.61%)    0.3560 (1.14%)     0.3472 (-1.36%)
0.7   0.3622 (2.90%)    0.3542 (0.63%)    0.3485 (-0.99%)    0.3376 (-4.09%)
0.8   0.3558 (1.08%)    0.3455 (-1.85%)   0.3386 (-3.81%)    0.3269 (-7.13%)
0.9   0.3462 (-1.65%)   0.3345 (-4.97%)   0.3252 (-7.61%)    0.3132 (-11.02%)
1.0   0.3320 (-5.68%)   0.3220 (-8.52%)   0.3093 (-12.13%)   0.3003 (-14.69%)
in the parentheses are the relative rates of improvement over the text retrieval baseline. Again, k = 10.

4.2 Influence of Term Selection Model on Content-Based Image Retrieval
To illustrate the results in Table 1 graphically, we re-plot these data in Figures 2 and 3. The x-axis represents the tuning combination parameter α, which varies from 0.1 to 1.0; the y-axis shows the MAP performance. maxK is set to {10, 20, 30, 40} respectively.
Fig. 2. Term Selection Model on CBIR Result (MAP vs. the tuning combination parameter α, for maxK = 10, 20, 30, 40 and the baseline)

Fig. 3. Improvements of Term Selection Model on CBIR Result (relative improvement vs. α, for maxK = 10, 20, 30, 40)
We can see that all re-ranking results of the proposed term selection model outperform the original CBIR baseline. It is notable that no matter how many terms are selected as training features, and no matter how the weights given by the CBIR baseline and the classifier are combined, the re-ranking results achieve large improvements. In particular, we believe the proposed model can make further progress if a better baseline can be obtained, since the current baseline is as low as 0.0045.
4.3 Influence of Term Selection Model on Text Retrieval
The proposed model has shown its effectiveness and promise on content-based image retrieval in Section 4.2. To further demonstrate the robustness of the model, we conduct the same experiments on the text retrieval results, under four maxK values and ten values of the tuning combination parameter α. We re-plot Table 2 in Figures 4 and 5. Interestingly, improvements are not obtained for all maxK and α; however, if α is fixed within a suitable interval, the proposed model brings significant improvements. We can therefore say that the proposed term selection model works on the IR result, but its effect depends on how the classification weight and the baseline weight are combined. The improvements in Figure 5 also show that the value of maxK clearly affects the performance. In summary, the proposed model can boost text retrieval performance through tuning of the parameters.
4.4 Influence of Term Number maxK
In the experiments, we try four different maxK values to show the influence of maxK on the re-ranking results. In Figures 2 and 4, the re-ranking CBIR and text retrieval results show that when maxK equals 10, the re-ranking performance is best, no matter how α is tuned. We can also observe a trend that the improvements become smaller as maxK becomes bigger. This suggests how to select terms as features at the feature selection and ranking stage.
Fig. 4. Term Selection Model on Text Retrieval Result (MAP vs. the tuning combination parameter α, for maxK = 10, 20, 30, 40 and the baseline)

Fig. 5. Improvements of Term Selection Model on Text Retrieval Result (relative improvement vs. α, for maxK = 10, 20, 30, 40)
4.5 Influence of Tuning Parameter α
For the re-ranking CBIR results in Table 1 and Figure 2, we find that the re-ranking results outperform the baseline for all maxK and α; in particular, the re-ranking performance reaches its best at α = 0.7. For the re-ranking text retrieval results in Table 2 and Figure 5, however, the re-ranking results are not always better than the baseline. Based on these observations, the interval [0.1, 0.5] can be recommended for α. Therefore, for both CBIR and text retrieval, we suggest tuning α around 0.5.
5 Related Work
Much previous work has been done on feature selection. As early as 1992, Kira and Rendell [10] described a statistical feature selection algorithm called
RELIEF that uses instance based learning to assign a relevance weight to each feature. Later in 1994, John, Kohavi and Pfleger [9] addressed the problem of irrelevant features and the subset selection problem. They presented definitions for irrelevance and for two degrees of relevance (weak and strong). They also state that features selected should depend not only on the features and the target concept, but also on the induction algorithm. Further, they claim that the filter model approach to subset selection should be replaced with the wrapper model. In a comparative study of feature selection methods in statistical learning of text categorization, Yang and Pedersen [18] evaluated document frequency (DF), information gain (IG), mutual information (MI), a χ2 -test (CHI) and term strength (TS); and found IG and CHI to be the most effective. Blum and Langley [3] focussed on two key issues: the problem of selecting relevant features and the problem of selecting relevant examples. Xing, Jordan and Karp [17] successfully applied feature selection methods to a classification problem in molecular biology involving only 72 data points in a 7130 dimensional space. They also investigated regularization methods as an alternative to feature selection, and showed that feature selection methods were preferable in the problem they tackled. Guyon and Elisseeff [6] gave an introduction to variable and feature selection. They recommend using a linear predictor of your choice and select variables in two alternate ways: (1) with a variable ranking method using a correlation coefficient or mutual information; (2) with a nested subset selection method performing forward or backward selection or with multiplicative updates.
6 Conclusions
In this study, our contributions are four-fold. First, we propose a term selection model to improve content-based image retrieval performance. Second, we introduce our term selection and ranking method to collect the training documents, and then employ a classification method to classify and re-rank the baselines. Third, we evaluate the proposed model on the CLEF 2009 medical image data; the experimental results confirm that the model works very well on the CBIR system. Furthermore, we conduct the same experiments on the text retrieval result, which shows the robustness of the proposed model since it also works well on the text retrieval baseline. Fourth, for the term number maxK and the tuning combination parameter α, our experimental results suggest setting maxK to 10 and varying α around 0.5.
References
[1] Amati, G.: Probabilistic models for information retrieval based on divergence from randomness. PhD thesis, Department of Computing Science, University of Glasgow (2003)
[2] Beaulieu, M., Gatford, M., Huang, X., Robertson, S., Walker, S., Williams, P.: Okapi at TREC-5. In: Proceedings of TREC-5, pp. 143–166. NIST Special Publication (1997)
[3] Blum, A.L., Langley, P.: Selection of relevant features and examples in machine learning. Artif. Intell. 97(1-2), 245–271 (1997)
[4] Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv. 40(2), 1–60 (2008)
[5] Chatzichristofis, S.A., Boutalis, Y.S.: CEDD: Color and edge directivity descriptor. A compact descriptor for image indexing and retrieval. In: Gasteratos, A., Vincze, M., Tsotsos, J.K. (eds.) ICVS 2008. LNCS, vol. 5008, pp. 312–322. Springer, Heidelberg (2008)
[6] Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
[7] Huang, X., Peng, F., Schuurmans, D., Cercone, N., Robertson, S.: Applying machine learning to text segmentation for information retrieval. Information Retrieval Journal 6(4), 333–362 (2003)
[8] Joachims, T.: Transductive inference for text classification using support vector machines. In: ICML 1999: Proceedings of the Sixteenth International Conference on Machine Learning, pp. 200–209. Morgan Kaufmann Publishers Inc., San Francisco (1999)
[9] John, G.H., Kohavi, R., Pfleger, K.: Irrelevant features and the subset selection problem. In: ICML, pp. 121–129 (1994)
[10] Kira, K., Rendell, L.A.: A practical approach to feature selection. In: ML 1992: Proceedings of the Ninth International Workshop on Machine Learning, pp. 249–256. Morgan Kaufmann Publishers Inc., San Francisco (1992)
[11] Muller, H., Kalpathy-Cramer, J., Eggel, I., Bedrick, S., Radhouani, S., Bakke, B., Kahn Jr., C., Hersh, W.: Overview of the CLEF 2009 medical image retrieval track. In: CLEF Working Notes 2009, Corfu, Greece (2009)
[12] Platt, J.C.: Fast training of support vector machines using sequential minimal optimization, pp. 185–208 (1999)
[13] Robertson, S.E., Walker, S.: Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, July 3-6, pp. 232–241. ACM/Springer (1994)
[14] Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell. 22(12), 1349–1380 (2000)
[15] Tamura, H., Mori, S., Yamawaki, T.: Textural features corresponding to visual perception. IEEE Transactions on Systems, Man and Cybernetics 8, 460–473 (1978)
[16] Wang, J.Z., Boujemaa, N., Del Bimbo, A., Geman, D., Hauptmann, A.G., Tesić, J.: Diversity in multimedia information retrieval research. In: MIR 2006: Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, pp. 5–12. ACM, New York (2006)
[17] Xing, E.P., Jordan, M.I., Karp, R.M.: Feature selection for high-dimensional genomic microarray data. In: ICML 2001: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 601–608. Morgan Kaufmann Publishers Inc., San Francisco (2001)
[18] Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: ICML 1997: Proceedings of the Fourteenth International Conference on Machine Learning, pp. 412–420. Morgan Kaufmann Publishers Inc., San Francisco (1997)
[19] Ye, Z., Huang, X., Lin, H.: Towards a better performance for medical image retrieval using an integrated approach. In: Proceedings of the 10th Workshop of the Cross-Language Evaluation Forum (CLEF 2009), Corfu, Greece, September 30 – October 2 (2009)
Modeling User Knowledge from Queries: Introducing a Metric for Knowledge

Frans van der Sluis and Egon L. van den Broek

Department of Human Media Interaction, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
[email protected], [email protected]
Abstract. The user's knowledge plays a pivotal role in the usability and experience of any information system. Based on a semantic network and query logs, this paper introduces a metric for a user's knowledge of a topic. The core of this metric is the finding that people often return to several sets of closely related, well-known topics, leading to certain concentrated, highly activated areas in the semantic network. Tests were performed determining the knowledgeableness of 32,866 users on a total of 8 topics, using a data set of more than 6 million queries. The tests indicate the feasibility and robustness of such a user-centered indicator.
1 Introduction
Lev Vygotsky (1896 - 1934) and Jean Piaget (1896 - 1980) were the first to indicate the importance of a user's knowledge base for interpreting information. New knowledge should not be redundant but should have sufficient reference to what the user already knows. Moreover, sufficient overlap fosters the understandability of information (Van der Sluis and Van den Broek, in press), in particular deep understanding (Kintsch, 1994); it is an important determinant of interest (Schraw and Lehman, 2001); and, through both understandability and interest, the knowledge base has an indirect effect on motivation (Reeve, 1989). In conclusion, it is of pivotal importance to have a model of the user's knowledge that allows measuring the distance between a topic and the user's Knowledge Model (KM). Several types of models have been used for user (knowledge) modeling, e.g., weighted keywords, semantic networks, weighted concepts, and association rules. A core requirement is the connectedness of concepts, which makes a semantic network particularly useful for knowledge modeling. Query logs offer an unobtrusive, large source, storing basic information about the search history (e.g., queries and some click-through data) for a limited amount of time (e.g., a cookie will often expire within 6 months) (Gauch et al., 2007). Basing a KM on a query log assumes that what one has searched for, one has knowledge about. We give three arguments to support this assumption: 1) The user has to know a word before being able to posit it to a search system. This is well known as the vocabulary problem, where users have difficulty finding the right word to represent their exact information need;
i.e., they are confined to the words they know (Furnas et al., 1987). 2) A query log represents the history of searches. As a user performs a query to learn new information, a query log primarily indicates what has recently been learned. 3) Users often return to very specific domains when performing a new search (Wedig and Madani, 2006). These domains are indicative of what is familiar to the user, i.e., domains about which the user has at least above-average knowledge. Reliability is a salient problem for user profiles based on implicit sources. Logs can contain noise, e.g., due to sharing a browser with a different user. Moreover, polysemy is a structural problem for information retrieval: polysemy refers to the need for context in order to know the intended meaning of a word. Any knowledge metric based on implicit sources has to deal with this uncertainty. The remainder of this paper looks at the feasibility of measuring the amount of knowledge a user has about a certain topic. A semantic KM is introduced in Section 2, using a query log as data source. The idea that users posit many queries on closely related topics that are very well known to them forms the basis for the metric of knowledge presented in Section 3. Some examples of the use of the metric are shown in Section 4. Finally, the results are discussed, reflecting on the feasibility of a web system adaptive to the knowledge base of its users.
2 Knowledge Model
The KM was built on WordNet version 3.0. WordNet is a collection of 117,659 related synonym sets (synsets), each consisting of words with the same meaning (Miller, 1995). As input data, an AOL query log was used (cf. Pass et al., 2006). This log consists of 21,011,340 raw queries, gathered from 657,426 unique user IDs. As, in general, a large proportion of all queries is caused by a small percentage of identifiable users (Wedig and Madani, 2006), only a subset of the most active 5% of users was used. One outlier was removed, having 106,385 queries compared to 7,420 for the next most active user. The final subset consisted of 32,866 users and was responsible for 29.23% of all unique queries. The procedure used to analyze the query logs is an iterative one, analyzing each unique query present in the (subset of the) log. For each query q, the total set of synsets touched upon, S_q, was computed as follows:

S_q(q) = ∪_{w∈q} S(w),   S(w) = {s ∈ W | w ∈ s}.    (1)
Here, S(w) gives the set of all possible synsets s of the lexical dictionary W (i.e., WordNet) related to the word w of query q. All possible lemmata for each word w were used to retrieve the synsets s. During the analysis of the query log, noise reduction was disregarded; no effort was spent to reduce the effects of polysemy. This resulted in substantial noise within the data set. On average, for the subset of the data used, a query loaded upon 13.92 synsets. The results of the analyses were stored as a large collection of (user, query, synset, part-of-speech) tuples. This collection allowed for the retrieval of the number of times a user posited a query on a specific synset, Q(s).
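To make this procedure concrete, the following minimal sketch shows how Equation 1 and the counts Q(s) could be computed. It assumes NLTK's WordNet interface and plain Python rather than the implementation actually used by the authors, and the example queries are purely illustrative.

from collections import Counter
from nltk.corpus import wordnet as wn

def synsets_of_word(word):
    # S(w): all synsets of which (a lemma of) word w is a member
    return set(wn.synsets(word))

def synsets_of_query(query):
    # S_q(q): union of S(w) over the words w in query q
    synsets = set()
    for word in query.lower().split():
        synsets |= synsets_of_word(word)
    return synsets

def query_counts(queries):
    # Q(s): number of queries of one user that touched upon synset s
    counts = Counter()
    for query in queries:
        for s in synsets_of_query(query):
            counts[s.name()] += 1
    return counts

log = ["used car dealer", "car insurance rates", "lottery numbers"]  # illustrative only
Q = query_counts(log)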
3 Similarity Metric
The distance between a certain topic and the KM indicates the amount of knowledge a user has on that topic. We exploit this by measuring the semantic distance k(t) (see Equation 6) between a synset t, representing the topic, and the KM. This is done by looking at the number of activated synsets A near the topic synset t, which is inspired by the finding that people stick to certain domains for their searches (Wedig and Madani, 2006). This leads to a large number of (unique) queries in a specific domain and, accordingly, to very few queries in unfamiliar domains. By looking at areas of activation, the method addresses one of the core problems of user profiling: polysemy. It is expected that polysemy will lead to the random, isolated activation of synsets and not to the activation of an area of synsets. The number of activated synsets A_n(t) is defined as:

A_n(t) = \#\{ s \in S_n(t) \mid Q(s) > 0 \},   (2)
where Q(s) is the previously defined number of queries per synset s (see Section 2) and S_n is the set of synsets related to topic t within n steps (Gervasi and Ambriola, 2003):

S_0(w) = \{ s \in W \mid w \in s \},   (3)
S_n(w) = S_{n-1}(w) \cup \{ s \in W \mid r(s, s') \wedge s' \in S_{n-1}(w) \}.
Here, w is the word of interest (i.e., the topic t), s represents all the synsets in the lexical dictionary W (e.g., WordNet) of which the word w is part, and r(s, s') is a boolean function that indicates whether or not there is any relationship between synsets s and s'. Please note that S_0 is the same as S from Equation 1 and that S_n(w) has a memory: it includes the synsets up to n. Following Gervasi and Ambriola (2003), hyponymy relationships are excluded from function r(s, s'). The absolute count A is not directly useful for two reasons. First, the growth in synsets for higher values of n is approximately exponential (cf. Figure 1). This can lead to very high values of A_n(t) for higher n. Second, a normalized value is preferable, allowing estimation of the limits of the function k_N(t).
Fig. 1. Average related synsets (S_n) per number of steps (n)

Fig. 2. Activation limit function a over the number of activated synsets (A)

Fig. 3. Distance (decay) function d over the number of steps (n)
Consequently, there is a need for a strict limiting function for the contribution of A_n(t) to k_N(t). Let c_1 be a constant and A_n be defined as in Equation 2; then, the weighted activation a at n steps from topic t is given by:

a_n(t) = 1 - (c_1)^{A_n(t)}.   (4)
The constant c_1 is indicative of how many activated synsets are needed. Since lower values of A_n(t) are very informative (as opposed to higher values), we chose a value of 1/2. Figure 2 illustrates this idea. It is unlikely that a user has as much knowledge on synsets n = 3 steps away as on synsets n = 1 step away. Therefore, a decay function is needed, giving a high penalty to activation far away. We use the following exponential decay function to achieve this:

d(n) = (c_2)^n,   (5)
where n is the number of steps of the connection and c_2 is a constant determining the distance penalty. The output of this function is illustrated in Figure 3 for c_2 = 1/2 over n = 10 steps. The idea shows clearly in this figure: activation in the KM close to the input synsets is given a higher weight, to such an extent that twice as much weighted activation is needed at n = 1, four times as much activation at n = 2, etc. Combining Equations 4 and 5, the knowledge a user has on a topic synset t can be estimated by the function k(t):

k_N(t) = \sum_{n=0}^{N} d(n) a_n(t).   (6)
The result is a weighted count of how many related synsets a user has previously touched upon, able to indicate how close a topic is to the user’s knowledge. The maximum of the function is 2.00; the minimum is 0. A short review of the behavior of k_N(t) helps in interpreting its range. Consider some hypothetical users and their unweighted activation values A_n(t), as shown in Table 1. The values of k_N(t) are shown in the last column. An interpretation of the values of A_n(t) is that the first user has relatively little knowledge about the topic, the second and third have reasonable knowledge, and the fourth has a lot of knowledge. Therefore, as a rule of thumb, we use the value of 0.85 as a threshold.

Table 1. Exemplar users

User  A0  A1  A2  A3  A4  A5  k
1     0   1   1   1   1   1   0.48
2     1   1   1   1   1   1   0.98
3     0   3   3   3   3   3   0.85
4     1   3   4   4   4   5   1.38
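The following sketch, assuming the activation counts A_0, ..., A_5 are already available for a user-topic pair, implements Equations 4-6 with c_1 = c_2 = 1/2 and reproduces the k values of Table 1.

def knowledge_metric(A, c1=0.5, c2=0.5):
    # k_N(t) = sum_{n=0}^{N} d(n) * a_n(t), with a_n = 1 - c1**A_n and d(n) = c2**n
    return sum((c2 ** n) * (1.0 - c1 ** A_n) for n, A_n in enumerate(A))

# The four exemplar users of Table 1 (columns A0..A5)
exemplar_users = {1: [0, 1, 1, 1, 1, 1],
                  2: [1, 1, 1, 1, 1, 1],
                  3: [0, 3, 3, 3, 3, 3],
                  4: [1, 3, 4, 4, 4, 5]}

for user, A in exemplar_users.items():
    print(user, round(knowledge_metric(A), 2))   # prints 0.48, 0.98, 0.85, 1.38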
4 Proof of Concept
As a proof of concept, the KM and metric were tested on eight topics. The test method determines, for all users, their knowledge about each topic. This was done through the following steps. Each topic is represented by a word; subsequently, for every word, the corresponding synsets were collected. Next, for each user, k_N(t) was computed. A value of N = 5 was used. The distribution of users over k_N(t) is illustrated by a 10-bin histogram, counting the number of users per bin of k_N(t), shown in Figure 4. Furthermore, the total number of “knowledgeable users” (K) for each topic was calculated by counting the number of users with k_N(t) > 0.85; i.e., the previously defined threshold. The values of K for each topic are shown in Table 2. Three of the eight topics were derived from a list of topics that Wedig and Madani (2006) identified as being sticky topics: those topics that a group of users often return to. The sticky topics are: lottery, travel, and money. Figure 4(a-c) illustrates that for all three topics there is a particular user group active in the topic and a greater set of users inactive, i.e., with k_N(t) < 0.85. From these distributions of users, the knowledge metric can identify knowledgeable users. Furthermore, Figure 4d shows a very uncommon word: malonylurea (Gervasi and Ambriola, 2003). This word refers to a chemical compound often used in barbiturate drugs. Being a word previously unknown to the authors of this article, the expectation was that this word would not have many knowledgeable users. This expectation was correct, as K = 0. The topic vacation is, besides a very common word, also a somewhat peculiar word concerning its semantic relatedness.
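A sketch of this per-topic evaluation is given below; it reuses the knowledge_metric function from the previous sketch, assumes the per-user activation counts for the topic have already been extracted as described in Sections 2 and 3, and uses NumPy only for the histogram.

import numpy as np

def evaluate_topic(user_activations, threshold=0.85, bins=10):
    # user_activations: {user_id: [A_0, ..., A_5]} for the topic under study
    ks = np.array([knowledge_metric(A) for A in user_activations.values()])
    K = int((ks > threshold).sum())                       # number of "knowledgeable users"
    histogram, _ = np.histogram(ks, bins=bins, range=(0.0, 2.0))  # 10-bin histogram of k
    return K, histogram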
Fig. 4. Distribution of users over topic relatedness (number of users per value of the knowledge metric k): (a) Lottery, (b) Travel, (c) Money, (d) Malonylurea, (e) Vacation, (f) University, (g) School, (h) University and School
Table 2. Number of knowledgeable users per topic

Topic                   K
Health                  5,411
Lottery                 3,624
Travel                  3,960
Money                   3,082
Malonylurea             0
Vacation                4,952
University              7,551
School                  14,077
University and School   4,701
The word corresponds to two synsets that, after following its relationships for n = 3 steps, still lead to only S_3(t) = 8 synsets. Compared to the average, this is a very low number of synsets; see also Figure 1. However, the metric can still identify (at least part of) the knowledgeable users: K = 4,952; see Figure 4e. Finally, two somewhat related words were analyzed: university (Figure 4f) and school (Figure 4g). Both are very popular topics, respectively having K = 7,551 and K = 14,077 users. Moreover, the topics were combined by taking only those users regarded as knowledgeable users (K) on both topics and averaging their values of k_N(t). This still gave a large number of users: K = 4,701; see also Figure 4h. Hence, a combination of topics also seems possible.
5 Discussion
Founded on the notion that people perform most of their searches on a few salient topics, this paper introduced a metric of knowledge. Queries of users were monitored and the synsets used were identified. This revealed that the areas of the semantic network on which a user had posed queries before were most activated. This provided a metric of users’ knowledge on a topic. Using the metric introduced in Section 3, Section 4 showed that it is indeed feasible to give an indication of a user’s knowledge. Moreover, we showed that this indication can be founded on i) relatively limited data, ii) only a restricted KM, and iii) a noisy data set, as no effort was put into noise reduction. So, if based on a measure of the spread of activation, the possibility of “measuring knowledge” seems to be quite robust and can, thus, be applied to make information systems adaptive. Several improvements can be made to the source, model, and metric. To start with the source, the use of queries provides limited information. This is partly due to the inherent ambiguity of the often short, keyword-based queries. However, users also often only search for items that are of interest to them at a certain moment. Hence, using only this source gives a skewed KM. A more elaborate approach could, for example, be the use of the content of a user’s home directory. The metric itself has room for improvement as well. For example, the constants c_1 and c_2 can be further optimized. Moreover, the implementation of semantic relatedness, as introduced in Section 3, can be improved. Currently, it is a simple distance measure: the n relations a synset is away from a topic. Other measures have been proposed as well; e.g., Hirst-St-Onge, Resnik, Leacock-Chodorow, and Jiang-Conrath (Budanitsky and Hirst, 2006).
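Several of these measures have readily available implementations; the snippet below, which assumes NLTK's WordNet module and its information-content files rather than anything used in this paper, merely illustrates how such alternative relatedness scores could be queried.

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')          # information-content statistics
car, vehicle = wn.synset('car.n.01'), wn.synset('vehicle.n.01')

print(car.lch_similarity(vehicle))                # Leacock-Chodorow
print(car.res_similarity(vehicle, brown_ic))      # Resnik
print(car.jcn_similarity(vehicle, brown_ic))      # Jiang-Conrath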
However, not all such measures will be suitable for our purpose: a measure of distinctness between two items of knowledge (or information) is needed, which is consistent with how the user will experience it (Van der Sluis et al., in press). A similar notion holds for the threshold of 0.85, used to indicate when a user is knowledgeable: user tests are needed to compare the metric to how users perceive their own knowledgeableness. The last argument of perceived relatedness and knowledgeableness is intrinsically interwoven with the model of knowledge used. For the purpose of modeling knowledge, a semantic model such as WordNet is distinct from a true reflection of knowledge. Besides the absence of named entities in WordNet, one of the most salient problems is the small number of relations needed to reach a large part of all synsets. This causes the effect seen in all examples shown in Figure 4, where almost all users tend to obtain a low relationship with the topic of interest. This is even the case for very uncommon words such as malonylurea. Hence, a more elaborate model can hold much more power in distinguishing between parts of the model, alleviating the need for a strict decay function; see also Figure 3. Every KM that is based on implicit sources will be particularly successful in identifying true positives, i.e., topics on which a user has knowledge. In contrast, the identification of false negatives forms a more substantial challenge: no single data source will cover all the knowledge a user has. Every KM will also, to a lesser extent, suffer from false positives. This can occur when people simply forget about a topic or when they share their account with a different user. However, this is less of a problem when looking at the spread of activation, as this spread indicates that a user has often returned to that topic. Moreover, when using logs that are not up-to-date, this problem should be less prominent. LaBerge and Samuels (1974) noted about comprehension: “The complexity of the comprehension operation appears to be as enormous as that of thinking in general” (p. 320). Notwithstanding, by looking at the spread of activation extracted from a user’s query history, it is feasible to infer part of the comprehensibility: the relatedness of a (new) topic to a model of the user’s knowledge. This metric will pave the way to adaptive web technology, allowing systems to directly aim for a user’s understandability, interest, and experience.
Acknowledgements

We would like to thank Claudia Hauff, Betsy van Dijk, Anton Nijholt, and Franciska de Jong for their helpful comments on this research. This work was part of the PuppyIR project, which is supported by a grant of the 7th Framework ICT Programme (FP7-ICT-2007-3) of the European Union.
References

Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of semantic distance. Computational Linguistics 32(1), 13–47 (2006)
Furnas, G.W., Landauer, T.K., Gomez, L.M., Dumais, S.T.: The vocabulary problem in human-system communication. Commun. ACM 30(11), 964–971 (1987)
Gauch, S., Speretta, M., Chandramouli, A., Micarelli, A.: User profiles for personalized information access. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) The Adaptive Web. LNCS, vol. 4321, pp. 54–89. Springer, Heidelberg (2007)
Gervasi, V., Ambriola, V.: Quantitative assessment of textual complexity. In: Merlini Barbaresi, L. (ed.) Complexity in Language and Text, pp. 197–228. Plus Pisa University Press, Pisa, Italy (2003)
Kintsch, W.: Text comprehension, memory, and learning. American Psychologist 49(4), 294–303 (1994)
LaBerge, D., Samuels, S.J.: Toward a theory of automatic information processing in reading. Cognitive Psychology 6(2), 293–323 (1974)
Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
Pass, G., Chowdhury, A., Torgeson, C.: A picture of search. In: Proc. 1st Intl. Conf. on Scalable Information Systems. ACM Press, New York (2006)
Reeve, J.: The interest-enjoyment distinction in intrinsic motivation. Motivation and Emotion 13(2), 83–103 (1989)
Schraw, G., Lehman, S.: Situational interest: A review of the literature and directions for future research. Educational Psychology Review 13(30), 23–52 (2001)
Van der Sluis, F., Van den Broek, E.L.: Applying Ockham’s razor to search results: Using complexity measures in information retrieval. In: Information Interaction in Context (IIiX) Symposium. ACM, New York (in press)
Van der Sluis, F., Van den Broek, E.L., Van Dijk, E.M.A.G.: Information Retrieval eXperience (IRX): Towards a human-centered personalized model of relevance. In: Third International Workshop on Web Information Retrieval Support Systems, Toronto, Canada, August 31 (2010)
Wedig, S., Madani, O.: A large-scale analysis of query logs for assessing personalization opportunities. In: KDD 2006: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 742–747. ACM, New York (2006)
Computer-Assisted Interviewing with Active Questionnaires

Seon-Ah Jang, Jae-Gun Yang, and Jae-Hak J. Bae∗

School of Computer Engineering & Information Technology, University of Ulsan, Ulsan, Republic of Korea
[email protected], {jgyang,jhjbae}@ulsan.ac.kr
Abstract. Computer-assisted interviewing systems have various benefits, as compared to paper-and-pencil surveys. The engine for processing questionnaires, however, should be reprogrammed when the questionnaire is changed since its processing logic is hard-coded in the system. As such, the engine for processing questionnaires depends on the questionnaires. This study makes the engine for processing questionnaires independent of questionnaires using an active document model. In this model, machines can process documents with rules specified in the documents. The active questionnaire, which is an active document, is composed of questions, control logic, and a knowledgebase. Among these, control logic expresses a method of processing questions in an executable XML. In this paper, we propose a framework for processing active questionnaires and describe its implementation. Keywords: Questionnaire, Active Documents, XML, ERML, Logic Programming, Computer-Assisted Interviewing.
1 Introduction

Traditionally, companies and government offices have mainly used paper forms to exchange and manage necessary information. In addition, each organization adopts a certain document management system based on the technology of Electronic Data Interchange (EDI) in order to reduce costs and improve productivity. As a result, user interfaces are gradually being replaced with electronic form documents, and many application programs have been developed based on these documents. These form documents, however, do not include processing logic, but rather only define the appearance and content of documents, as is often the case with paper documents. Recently, advanced electronic form documents have come to include information related to the business process that expresses the flow of document processing, as well as the user interface and business data. This kind of form document is called an active document [1]. A questionnaire is no exception. Computer-assisted interviewing (CAI) is widely used to overcome the shortcomings [2] of paper-and-pencil surveys. CAI systems have provided various benefits in terms of time, costs, return rates, and reliability of
∗ Corresponding author.
responses, by providing an environment in which respondents can conveniently answer the questions. Despite these advantages, they still suffer from system maintenance problems. Electronic questionnaires also consist of the structure and content of questions, just like their paper counterparts. The methods for processing questions are generally hard-coded in the CAI system. When preparing a new questionnaire, the system developer must modify the engine for processing questionnaires to reflect the new control flow in the questionnaire. In this paper, we propose a CAI system that adopts active documents in order to solve the maintenance problems, and implement the system as the Web Interview System with Active Documents (WINAD) in order to examine its usefulness.
2 Previous Work

SSS (Simple Survey System) [2] also places processing logic in questionnaires. However, SSS only includes three predefined functions (routing logic, fills, and computed variables) in the XML schemas of its questionnaires, whereas WINAD, being based on active documents, can handle any control logic that can be described in Prolog.

Table 1. Approaches to Active Documents
AHDM (Active Hypertext Document Model) [3]
– Concept of active document: combines structured and hyperlinked information, like any other hypertext document, but also contains associated procedural information in the form of scripts or program fragments
– Technology used: XML, link, XPointer, CSS, DOM
– Key idea: realizes applications through hypertext documents with embedded application logic
– Language for behavior representation: script language (Tcl/OTcl)
– Application areas: computer supported workflow systems

Displet [4]
– Concept of active document: a document that can provide some autonomous active behavior; it may be displayed, printed, searched, perform computations, produce additional documentation, perform animations
– Technology used: XML, XSLT, Java
– Key idea: associating XML elements to Java classes that are able to perform behaviors
– Language for behavior representation: Java language
– Application areas: multi-document agent applications

Jima [5]
– Concept of active document: a content-aware, autonomous, proactive, adaptive, and context-aware document
– Technology used: Jini
– Key idea: implements active documents through mobile agent technologies and a context information infrastructure
– Language for behavior representation: Java language
– Application areas: ubiquitous computing and communication environment

Total Shadow and ActiveForm [1]
– Concept of active document: includes in itself data, business rules, and data integrity constraints that are implied in documents as declarative knowledge to support the automation of document processing
– Technology used: XML, XSLT, JSP, Prolog
– Key idea: form documents imply knowledge of themselves, namely, the methods of processing documents that reflect the intention of their designer
– Language for behavior representation: ERML
– Application areas: intelligent web applications, intelligent form document processing systems
Several studies have employed the concept of active documents, such as in electronic publishing, workflow systems, and mailing systems. Table 1 summarizes various approaches to including control logics in documents. The control logics specify how to process documents and how to exhibit their active behaviors. They are
compared in terms of five criteria: the concept of active document, technology used, key idea, language for behavior representation, and application areas. Accordingly, we can see the limitations discussed below. First, procedural languages have generally been used to express document behaviors. Document behavior is event-driven and can be described more naturally by rules than by procedures. Second, different languages are used together in implementing the components of active documents. The use of multiple languages makes it difficult to maintain the compatibility of active document processing systems in a heterogeneous environment. Finally, it takes a lot of time to represent and modify the active behaviors of documents in Java or a script language. They are not flexible enough to be used in the dynamic business environment of today. To cope with these limitations, we write active documents in Prolog and XML.
3 Active Questionnaire Model

3.1 Active Questionnaires

We can separate the logic for processing questions from the engine for processing questionnaires and then include this logic in the questionnaires. In this case, it becomes possible to change the questionnaire without modifying the engine. This kind of questionnaire is referred to as an active questionnaire, and its constituents are shown in Fig. 1. An active questionnaire is composed of questions, rules, a knowledgebase, and a query. Among these, the questions are expressed in XML, and the rules specify how to process questions in an executable XML, i.e., Executable Rule Markup Language (ERML). The knowledgebase and the query are also expressed in ERML [1, 6].
Fig. 1. The Active Questionnaire Model
3.2 Types of Questions

3.2.1 Control Logics for Questions
There are two main categories of questions: closed-ended and open-ended [7]. A closed-ended question allows the respondent to select an answer from a given number of options, whereas an open-ended question asks the respondent to give his or her own answer. To achieve the best response rates, questions should flow logically from one to the next. We have devised control logics for processing questions according to question type, as shown in Table 2 [8, 9].

3.2.2 Representation of Active Questionnaires
Fig. 3 shows a part of a paper questionnaire, which includes two control logics for questions: Fill and Check response. This paper questionnaire is represented as an active questionnaire in Fig. 2.
Table 2. Control Logics for Questions

Control Logic          Explanation
Check    duplicate     To check redundancy in responses
         response      To get mandatory responses
         limit         To restrict the number of responses
Compute  add           To do addition needed in answering the current question
         subtract      To do subtraction needed in answering the current question
         multiply      To do multiplication needed in answering the current question
         divide        To do division needed in answering the current question
Fill                   To fill the current question with phrases taken from or based on previous responses
Arrange  question      To change the order of questions
         instance      To change the order of answers
Skip                   To control the sequence of questions for each respondent
Fig. 2. An Active Questionnaire
Questions:
3. Do you have a car? (You have to answer this question.) ① Yes ② No
4. (If you select ① in question 3) What is the brand of your car? ( )
5. Are you satisfied with the car "none" that you have? ① Satisfied ② Not satisfied
Knowledgebase: answer('3', none), answer('4', none), answer('5', none)
Rules:
checkResponse('3') :- answer('3', Value), Value == none
skip('4') :- answer('3', Value), Value == 2
skip('5') :- answer('3', Value), Value == 2
fill('5', Value) :- answer('4', Value)

Fig. 3. A Paper Questionnaire
3. Do you have a car? (You have to answer this question.) ① Yes ② No
4. (If you select ① in question 3) What is the brand of your car? ( )
5. Are you satisfied with the car "none" that you have? ① Satisfied ② Not satisfied
Naturally, answer, fill, and checkResponse should have been translated into ERML in the form of XML, but they remain as Prolog clauses here for the convenience of explanation. In the case of this questionnaire, the knowledgebase element consists of user responses, and the rule element consists of control logics for processing questions appropriate to the type of each question. There is a
Prolog rule that directs the questionnaire system to fill the "none" area of question 5 with the answer to question 4. Fig. 4 shows how this rule is expressed in ERML. It is eventually consulted into an inference engine of the questionnaire system.
fill('5', Value) :- answer('4', Value)

Fig. 4. Control Logic Fill in Prolog (left) and in ERML (right)
4 The Framework of Processing Active Questionnaires

4.1 The Structure of the WINAD System

In this section we describe a CAI system that adopts active questionnaires. The system is WINAD, and it has three components, namely, a questionnaire client, a web server, and a questionnaire designer, as shown in Fig. 5. The Questionnaire Client is composed of the User Interface and the Questionnaire Clerk. The User Interface obtains questions that respondents request, translates them into HTML, and then displays them on a web browser (④-1, ④-2). It receives the consequences of the control logic for the current question through the Questionnaire Clerk (Send Inference Result), and reflects them in the questionnaire displayed on the web browser. In addition to this, it delivers question objects to the Questionnaire Clerk (Send Question Object). The Questionnaire Clerk hands question numbers and their corresponding answers to the Question Processor in the Web Server (⑨-1). It also receives question numbers and consequences of control logics for the questions from the Question Processor in the Web Server (⑨-2). The Web Server is composed of the Question Processor, Control Logic Translator, and Web Container. The Question Processor receives question numbers and answers from the Questionnaire Clerk (⑨-1). It checks whether or not there are control logics corresponding to the question numbers. If a question has a control logic, it infers the consequence, which is a command for processing questions, and transmits the command to the Questionnaire Clerk. In an active questionnaire, control logics for processing questions are specified by rules and a knowledgebase in XML. The Control Logic Translator changes the representation language of control logics from Prolog to XML and conversely. It is composed of two modules. The XML2Prolog module changes the representation of control logics into Prolog in order to be run in the Question
Processor (⑦-2, ⑧). The Control Logic Translator receives the rules and knowledgebase from the Control Logic Extractor in the Web Container. Finally, the third component, the Web Container, consists of the Control Logic Extractor and the Content Manager. The Control Logic Extractor separates the questions, rules, and knowledgebase from an active questionnaire. The questions are delivered to the Content Manager (⑦-1), and the rules and knowledgebase are delivered to the XML2Prolog module (⑦-2). The Content Manager gives requested questions to the Questionnaire Client (④-2), or processes the knowledgebase of a questionnaire that a respondent finishes (⑩, ⑪). The Questionnaire Designer creates an active questionnaire with an XML editor. The active questionnaire is stored and distributed as an XML document, which consists of three elements: the question, rule, and knowledgebase (①, ②).
Fig. 5. System Architecture of WINAD
4.2 Interviewing with Active Questionnaires

Fig. 6 shows a questionnaire screen that is displayed by the Questionnaire Client of the WINAD system. The active questionnaire includes three control logics for questions: the Skip, Fill, and Check response types. Fig. 7 shows the screen where the control logic Skip is performed according to the answer “No” to question 3 and where questions 4, 5, and 6 are hidden. After the answer “Yes” to question 7, the three questions 8, 9, and 10 remain on the screen. In order to implement the Skip function, which ignores useless questions, we do not need additional program code in the Content Manager. This function is specified in the control logic of the questions and takes effect on the display when the logic is performed by the Question Processor. Fig. 8 is an example that shows how the control logic Fill is performed, in which previous answers are filled into current questions. Namely, it is the case in which the response to question 1 is filled into the “none” area of question 2. Fig. 9 shows the screen where the control logic Check response is performed to confirm whether or not there has been any response to question 1.
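To illustrate the effect of these three control logics, the sketch below re-implements them for the car questionnaire of Figs. 2 and 3 in plain Python. WINAD itself evaluates the ERML rules with a Prolog inference engine; this sketch merely mimics their behaviour, and the function and variable names are hypothetical.

def check_response(answers, q):
    # Check response: question q must have been answered
    return answers.get(q) is not None

def skip(answers, q, trigger_q, trigger_value):
    # Skip: hide question q when trigger_q was answered with trigger_value
    return answers.get(trigger_q) == trigger_value

def fill(answers, source_q, default="none"):
    # Fill: reuse the answer of source_q inside the text of a later question
    return answers.get(source_q, default)

answers = {'3': 1, '4': 'ACME Motors'}            # 1 = "Yes" to "Do you have a car?"
assert check_response(answers, '3')               # question 3 was answered
assert not skip(answers, '4', '3', 2)             # answer 2 ("No") would hide question 4
print('Are you satisfied with the car "%s" that you have?' % fill(answers, '4'))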
Fig. 6. Questionnaire Screen
Fig. 7. Execution of Control Logic Skip
Fig. 8. Execution of Control Logic Fill
Fig. 9. Execution of Control Logic Check response
5 Conclusions and Further Research

Computer-assisted interviewing systems are devised to overcome the shortcomings of paper-and-pencil surveys and to provide a convenient environment for respondents to answer questions. However, when a new questionnaire is prepared, the engine for processing questions must be modified or re-implemented because the procedures for processing questions are generally hard-coded in the engines. This means the engine for processing questionnaires depends on the questionnaires. To maintain the engine efficiently, we make it independent of questionnaires using the active document model. In this model, machines can process documents with rules specified in the documents.
A questionnaire that follows the active document model is referred to as an active questionnaire. It consists of questions, rules, a knowledgebase, and a query. Among these, the questions are expressed in XML, and the rules are control logics that specify how to process questions in ERML. The knowledgebase and the query are also expressed in ERML. In order to examine the usefulness of active questionnaires, we have designed and implemented a web interview system. We have demonstrated that the engine for processing questionnaires can be made independent of questionnaires in this system. This independence implies that there is no need to modify the engine for each questionnaire. We now plan to improve the WINAD system in three ways: to develop various control logics for processing questions, to enhance the convenience and efficiency of user interactions, and, finally, to implement intelligent interviewing procedures.

Acknowledgments. This work was supported by the 2009 Research Fund of the University of Ulsan.
References

1. Nam, C.-K., Jang, G.-S., Bae, J.-H.J.: An XML-based Active Document for Intelligent Web Applications. Expert Systems with Applications 25(2), 165–176 (2003)
2. Bethke, A.D.: Representing Procedural Logic in XML. Journal of Software 3(2), 33–40 (2008)
3. Köppen, E., Neumann, G.: Active Hypertext for Distributed Web Applications. In: 8th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise (1999)
4. Bompani, L., Ciancarini, P., Vitali, F.: Active Documents in XML. ACM SIGWEB Newsletter 8(1), 27–32 (1999)
5. Werle, P., Jansson, C.G.: Active Documents Supporting Teamwork in a Ubiquitous Computing Environment. In: PCC Workshop 2001 & NRS 01, Nynashamn, Sweden (2001)
6. Jang, S.-A., Yang, J.-G., Bae, J.-H.J., Nam, C.-K.: A Framework for Processing Questionnaires in Active Documents. In: 2009 International Forum on Strategic Technologies (IFOST 2009), pp. 206–208 (2009)
7. Brace, I.: Questionnaire Design: How to Plan, Structure and Write Survey Material for Effective Market Research. Kogan Page Publishers (2008)
8. Jang, S.-A., Yang, J.-G., Bae, J.-H.J.: Design of Questionnaire Logic in Active Documents. In: 32nd KIPS Fall Conference 2009, vol. 16(2), pp. 945–946 (2009) (in Korean)
9. Jang, S.-A., Yang, J.-G., Bae, J.-H.J.: Flow Control of Survey in Active Documents. In: KIISE Korea Computer Congress 2009, vol. 36(1D), pp. 283–288 (2009) (in Korean)
Assessing End-User Programming for a Graphics Development Environment

Lizao Fang and Daryl H. Hepting

Computer Science Department, University of Regina, Canada
[email protected], [email protected]
Abstract. Quartz Composer is a graphics development environment that uses a visual programming paradigm to enable its users to create a wide variety of animations. Although it is very powerful with a rich set of programming capabilities for its users, there remain barriers to its full use, especially by end-users. This paper presents a prototype end-user programming system that is designed to remove the barriers present in the native Quartz Composer environment. The system, called QEUP, is based on earlier work with cogito. It provides direct access to samples of Quartz Composer output without requiring any of the manual programming involved in Quartz Composer. In order to assess the impacts of QEUP, a user study was conducted with 15 participants. Preliminary results indicate that there may be benefit to using QEUP when first learning Quartz Composer, or when learning new capabilities within it.
1 Introduction
The power of the visual programming paradigm, which began with ConMan [1], is evident in the Quartz Composer graphics development environment (GDE) now available on Apple computer systems. This direct-manipulation style of programming makes clear the relationships between different parts, called patches, and it affords alteration of those relationships by changing the connections between patches. See Figure 1 for a view of the programming interface. Other modifications are also possible through other features of the interface. Yet, if the user would like to explore the different variations possible from the simple program comprising the 5 patches in Figure 1, he or she must be responsible for all of the changes. This necessity leads to two questions that are of interest here:

1. given an unsatisfactory output, which changes can be made to improve it?
2. after evaluating several alternatives, how can one easily return to, or reprogram, the best output amongst them?

Following from the classification developed by Kochhar et al. [2], graphics development environments may be classified as either manual, automated, or augmented.
Fig. 1. Quartz Composer, with the sample used throughout this paper: (a) Editor, (b) Patch Inspector, (c) Patch Creator, (d) Viewer
Manual systems require complete involvement of a human to construct a graphics application; automated systems require no involvement of a human; and augmented systems support some notion of the development process as a collaborative effort between human and computer. Quartz Composer, with its visual programming paradigm, is a manual environment. This paper describes the development and testing of an augmented system, called QEUP (for Quartz Composer End-user Programming), which is based on cogito [3]. As well, QEUP is an end-user programming system, which can be used by end-users with little or no programming experience. According to the spectrum of software-related activities proposed by Fischer and Ye [4], the entry level of visual programming is higher than that of end-user programming. Can additional support for end-user programming remove the barriers that remain with Quartz Composer, highlighted by the two questions posed earlier? The rest of this paper is organized as follows. Section 2 presents some background work. Section 3 describes the software that is being studied. Section 4 describes the design of the user study. Section 5 describes the results of the user study. Section 6 presents some conclusions and opportunities for future work.
2 Background
Although visual programming systems do enable end-user programming to some extent, the differences between these two areas are highlighted here.

2.1 Visual Programming
Visual programming employs pictorial components to specify programs. In many visual programming environments (VPEs), these pictorial components are connected according to the required data flow. In some other systems, the pictorial components are constructed based on flowcharts. Many VPEs allow users to drag and drop these components. The most exciting feature of visual programming is not the colourful graphical components, but its ability to support development, testing, and modification in an interactive way [5]. The barriers to visual programming are lower than those of conventional programming [6], and anyone with the necessary training should be able to create simple programs in a VPE. Visual programming provides an interactive way for users to perform programming. Some systems support real-time, or approximately real-time, computing. Furthermore, some VPEs, such as FPL (First Programming Language) [7] and Quartz Composer, are well-suited for end-users because the systems eliminate syntactic errors [6]. Some other examples of VPEs include: ConMan [1], LabVIEW [8], Alice [9], and Scratch [10]. However, visual programming has not become widespread. Some researchers [6,11,12] point out reasons, which are summarized as follows:

– visual programming is not radical enough to describe dynamic processes
– pictorial components in visual programming increase abstraction, because they are symbolic objects
– pictorial components waste precious screen real estate
– visual programming inhibits details
– visual programming does not scale well
– visual programming has no place for comments
– visual programming is hard to integrate with other programming languages, such as text.

2.2 End-User Programming
End-user programming (EUP) is defined as the activities by which end-users, with no or minimal programming knowledge, create functions or programs. EUP is also called end-user modifiability [13,14], end-user computing [15], and end-user development [16]. The proposed scope of EUP varies amongst different researchers. Ye and Fischer [4] defined EUP as activities performed by pure end-users, who only use software rather than develop software. However, Blackwell [17] categorized EUP into five categories: activities based on a scripting language; activities performed in visual programming environments; activities performed in graphical rewrite
systems; activities relying on spreadsheet systems; and activities with example-based programming. The scope of EUP brought forward by Myers, Ko and Burnett [18] covers almost all software-related activities. End-users of EUP systems are those people having limited programming knowledge. The entry level of EUP systems is supposedly relatively low. We propose the following requirements for EUP systems – in the context of GDEs – to answer the question of which software systems support EUP.

1. Support for creative activities: designers cannot envision the results created by end-users. The system enables end-users to create their own programs.
2. Ordinary end-users are able to benefit: the only requirement could be that users have experience of using computers. Some domain-oriented systems, such as MatLab [19], require solid domain knowledge that makes them inaccessible outside of the target domain. In contrast, Microsoft Excel is widely accepted by users from various domains.
3. Easy to learn and easy to use: many end-users may not have the patience to learn a complex system. They do not have much time, nor do they want to spend a lot of time learning software applications. After a short training period, end-users are able to recognize and comprehend a large percentage of the functions provided by the EUP system.
4. Fault-tolerant and interactive: the system should render results without crashing, even though the results could be unreasonable. Maintaining responsiveness will help to engage the end-user.
3 Quartz Composer, cogito and QEUP
Three different software packages are involved in this paper: Quartz Composer, cogito, and QEUP. Each of them is described in more detail here. For each, a usage scenario describes how a sample, which presents the movement of a cartoon sun, can be refined in that system. The actor in each scenario, named Jimmy, is an end-user with little programming experience. In addition, the implementation of QEUP is described.

3.1 Quartz Composer
Quartz Composer is a visual programming environment as well as a graphics development environment. There are four main windows in Quartz Composer (See Figure 1): the Editor (Figure 1(a)) window for constructing the program; the Patch Inspector (Figure 1(b)) window for setting specific parameter values; the Patch Creator (Figure 1(c)) window for adding new patches to the program; the Viewer (Figure 1(d)) window for displaying the output in real-time. To perform programming, end-users drag and drop patches then connect them by dragging lines from source ports to destination ports. The values produced from the source patch are passed to the destination patch through their ports.
Usage Scenario
Step 1: Jimmy opens the example file, and the Editor and Viewer windows pop up. In the Editor window (Figure 1(a)), he sees 5 boxes connected by lines.
Step 2: The cartoon sun is moving in the horizontal direction, and Jimmy wants to add a movement effect in the vertical direction. He selects the LFO patch and clicks the “Patch Inspector” to access the Patch Inspector window. Jimmy changes the value of the Amplitude from 0 to 0.2.
Step 3: Jimmy removes the horizontal movement effect by disconnecting the line between the Random patch (top) and the Billboard patch.
Step 4: To add new patches to the program, Jimmy drags new patches from the Patch Creator window to the Editor window.
Step 5: Jimmy refines the parameter values for the new patches, as described above, to produce a satisfying animation.

3.2 cogito
Hepting [3] developed cogito, which was designed to address drawbacks of traditional visualization tools. In this case, cogito presents the end-user with a variety of outputs, intended to show the breadth of alternatives available by exploring the parameter values of the various patches. Users can iteratively select promising alternatives to establish parameter values for exploration, refine them in the New Space (Figure 2(b)) dialogue box, and generate new alternatives for consideration on the Viewer (Figure 2(a)) window.

Usage Scenario
Step 1: Jimmy opens the example file, which is then shown in the first cell of the Viewer window (Figure 2(a)).
Fig. 2. cogito’s user interfaces: (a) Viewer, (b) New Space
Step 2: Jimmy selects the animation in the first cell and the background of the cell turns green. He clicks the “New” button on the bottom, then accesses the New Space dialogue box, which shows what he has selected so far. Jimmy clicks to add values in the “Billboard 1 | inputX” and “Billboard 1 | inputY”.
Step 3: He clicks the “OK” button and eight new animations are displayed in the Viewer window (Figure 2(a)). Two of them are interesting, so Jimmy selects them and the backgrounds of their cells turn green as well.
Step 4: Jimmy navigates between screens by clicking the arrow buttons and continues selecting other appealing outputs.
Step 5: Jimmy clicks the “New” button again to refine his selections. He continues to explore until he decides on his favourite animations.

3.3 QEUP
QEUP is an example-based end-user programming system. Users start with an example, following a bottom-up, recursive learning method. Specific examples aid visualization [20] and help users create concrete mental models. Unlike Quartz Composer, which is a manual system, QEUP is an augmented system. In Quartz Composer, users have to track the value of each parameter in order to identify its effect on the final result. However, in QEUP, users are able to set multiple values for each parameter, and the system processes and generates many outputs each time. Users’ attention is shifted from setting values to making decisions and selecting outputs from diverse alternatives.

Usage Scenario
Step 1: Jimmy selects the example file. The Editor (Figure 3(a)) window is displayed and the example animation is shown (Figure 3(a), circle 4). Patch names are listed in the Patches control list (Figure 3(a), circle 1), along with their parameters and values. Jimmy realizes that new patches must be defined to the system from the Description Configuration dialogue box (Figure 3(b)) before he can explore their parameter values.
Step 2: Jimmy begins to edit patch descriptions. These descriptions include enforceable constraints on value ranges and types. Jimmy realizes that all these activities are helping him to learn about the example, and he becomes more confident that he could tackle more complicated configurations quite easily. After finishing the description of a patch, Jimmy sees that his description is displayed in the Editor window (Figure 3(a), circle 2).
Step 3: In the Editor window, he navigates to a parameter, and clicks the “Add” button under the Values control list. In the Add Values (Figure 3(d)) dialogue box that pops up, he notices that the values he added to his description appear as defaults under the Value List. He sees that he can also input new values manually, or use other patches to generate values. He doesn’t change anything at this time.
Step 4: Jimmy clicks the “Build” button in order for the system to produce the outputs, based on his choices. In the Viewer (Figure 3(c)) window, he reviews the outputs and selects those which he finds interesting.
Fig. 3. QEUP’s user interfaces: (a) Editor, (b) Description Configuration, (c) Viewer, (d) Add Values
Step 5: By clicking the “Explore” button, Jimmy transfers his selections’ data back to the Editor window, where he continues to refine his choices. He is confident about making changes here, because he knows that he is able to load a history of his previous work (Figure 3(a), circle 3).

The implementation of the QEUP functionality is divided into four phases. XML is a cornerstone of the system because most documents involved are in XML format.

Phase 1: After the example file (.qtz) is successfully loaded, the system converts the document from the binary property list format to XML. The XML document is processed and related information, such as patch names, parameter names, and values, is picked up from the document. As well, the Description document (eup_description.xml) is processed. New patches that do not have descriptions are identified by the system.
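As an illustration of this conversion step only: Quartz Composer compositions are stored as property lists, so the binary-to-XML conversion could look like the sketch below. Python's plistlib is assumed here purely for illustration, since the paper does not state which tools QEUP uses for this step, and the file names are hypothetical.

import plistlib

def qtz_to_xml(qtz_path, xml_path):
    # Read the (binary or XML) property list of the .qtz composition ...
    with open(qtz_path, 'rb') as f:
        composition = plistlib.load(f)
    # ... and write it back out as XML for further processing
    with open(xml_path, 'wb') as f:
        plistlib.dump(composition, f, fmt=plistlib.FMT_XML)
    return composition                              # patch names, parameters, values

composition = qtz_to_xml('example.qtz', 'example.xml')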
Phase 2: When users write the descriptions for the new patches, the description information is saved into eup_description.xml, which will be processed again if the users edit the descriptions.

Phase 3: The system begins to produce outputs when the Build button in the Editor window is clicked. During the process of generating outputs, the Template document (eup_standarizedSample.xml) is first produced based on the XML format of the example document. The Template document inherits the XML structure of the example document, but removes those data which vary across outputs. This Template document is created only once for each example file. A cog document is used to save the parameters and their values. The system processes and generates an XML schema document that includes any restrictions defined by users. Then, the XML schema document is employed to validate the data in the cog document. If the validation succeeds, the data in the cog document construct a set of combinations, each of which will be used to produce an output. Based on the Template document, the first eight combinations are processed. The system produces eight outputs and puts them in the Viewer window.

Phase 4: Other outputs will be dynamically produced when users navigate to the next screen. Users might select several satisfying outputs and transfer them to the Editor window for further exploration. During this process, the system maps the selected outputs to the corresponding combinations mentioned in Phase 3. Then, the values of the parameters are appended to the Values list in the Editor window.
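The combination step of Phase 3 can be pictured as a Cartesian product over the selected value lists; the sketch below is only illustrative (the parameter names are borrowed from the usage scenarios, and the Template/cog document machinery is not reproduced).

from itertools import islice, product

def combinations(value_lists):
    # value_lists: {parameter: [candidate values]} -> one assignment dict per combination
    names = list(value_lists)
    for values in product(*(value_lists[name] for name in names)):
        yield dict(zip(names, values))

value_lists = {'Billboard 1 | inputX': [-0.5, 0.0, 0.5],
               'Billboard 1 | inputY': [-0.2, 0.2],
               'LFO | Amplitude': [0.0, 0.2]}

first_screen = list(islice(combinations(value_lists), 8))  # first eight outputs shown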
4 User Study
A user study was conducted in order to assess the impacts of QEUP for Quartz Composer. The between-subjects study was designed to look at the users’ experience with Quartz Composer in three different cases.

4.1 Participants
15 participants took part in the study, with ages ranging from 18 to 32. All of them were students at the University of Regina. Their areas of study were diverse, but most had taken at least one Computer Science course. Regarding their level of programming knowledge, 4 reported a low level, 5 reported a medium level, and 6 reported a high level. 14 participants reported no experience with Quartz Composer, and 1 participant reported low experience with Quartz Composer. They had at most medium experience with visual programming. The participants were randomly assigned to 3 groups. Participants in the first group used Quartz Composer directly (QC Group). Participants in the second group used cogito followed by Quartz Composer (cogito Group). Participants in the third group used QEUP followed by Quartz Composer (QEUP Group).
4.2 Materials and Task Design
Each participant also encountered the following documents: a pre-task questionnaire, which covered aspects of participants’ background; a tutorial manual, which provided a standard introduction to the software systems being used; and a post-task questionnaire, which captured aspects of their experience with the software system(s) used by the participant. During each part of the study, participants began with a very simple example (shown in Figure 1). It only had 5 patches: an Image Importer (to load the image), a Billboard (to display the image), an LFO (low-frequency oscillator) and 2 Random patches, used to move the Billboard around the screen. Participants were requested to work on the input example, refine the example, and produce appealing outputs.

4.3 Procedure
Participants completed a consent form and the pre-task questionnaire. The QC Group received some training and then used Quartz Composer, for a period of at most 15 minutes. The cogito Group received some training and then used cogito followed by Quartz Composer, each for a period of at most 15 minutes. The QEUP Group received some training and then used QEUP, followed by Quartz Composer, each for a period of at most 15 minutes. Participants were asked to talk about what they were doing as they navigated the software applications, using a think-aloud protocol. Each participant also completed the post-task questionnaire. All operations using the software were recorded from the computer screen, as well as audio from the participants’ interactions (the participants themselves were not video recorded).
5 Results and Analysis
We analyzed all participants’ performance on Quartz Composer. All data in the following tables in this section were collected from performance on Quartz Composer. The participants’ performance is analyzed from three aspects: time to complete the task, attempts to set values, and the primary operation performed by participants. The final outputs in Quartz Composer are determined by the parameters’ values and the connection relationships amongst patches.

Time to complete the task. In the study, the time spent on Quartz Composer was limited to 15 minutes, but participants were able to stop before that time. Table 1 shows the time spent on Quartz Composer by all participants from all three groups. The total and average time spent is shortest for the QEUP Group. All participants were beginners with respect to Quartz Composer, but participants using QEUP seemed able to produce satisfying outputs within Quartz Composer more efficiently than other participants.
Table 1. Time to complete task (min)

Group    Participant Times                          Total    Avg.
QC       10.00   13.28   15.00   15.00   15.00      68.28    13.66
cogito   14.25   15.00   15.00   15.00   15.00      74.25    14.85
QEUP      9.22   10.25   11.00   11.75   15.00      57.22    11.44
Table 2. Setting values on sample

Group    User Performance             (per participant)               Avg.
QC       Set-value operations         28     30     40     14     30
         Set-value kept               20     20     22      7     10
         Ratio                        1.40   1.50   1.82   2.00   3.00   1.94
cogito   Set-value operations          0     17     26     29     61
         Set-value kept                0     13     17     13     27
         Ratio                        1.00   1.31   1.53   2.23   2.26   1.67
QEUP     Set-value operations          0     21      9     18     16
         Set-value kept                0     17      7     12     10
         Ratio                        1.00   1.24   1.29   1.50   1.60   1.32
Users communicate with computers through two channels: explicit and implicit [21]. The explicit channel is based on the user interface, and the implicit channel relies on the knowledge that the users and the computers both have. Because the participants in this study have no, or very limited, experience with Quartz Composer, their communication with the computer through the explicit channel is not very different. Therefore, the improvement of the communication through the implicit channel may be the factor that results in less time spent in the QEUP Group. The QEUP Group might have gained the necessary knowledge after having some experience with the QEUP system.

Attempts to set values. All participants performed programming based on the example. In order to set a suitable value for a parameter, participants might try several times. We calculated the ratio of set-value attempts made to set-value attempts kept, which is meant to represent how many trial values were needed in order to successfully customize a parameter. We wanted to evaluate the impacts of exposure to cogito and QEUP on Quartz Composer performance. The data in this table relate to the example only; operations on new patches were not considered. Table 2 shows data from the three groups. The minimum ratio is 1, and so it is reasonable that this is the ratio for participants who did not set any values.
The QEUP Group has the smallest ratio, which may indicate that use of the QEUP system helps participants to make better decisions in Quartz Composer. The significance of using cogito on this aspect is less obvious. As well, Table 2 supports the assertion that the QEUP Group’s knowledge of Quartz Composer is improved by using QEUP first.

Primary operation: setting values or connection/disconnection. Table 3 provides the number of set-value and connection/disconnection operations attempted, from which a ratio is calculated. If the ratio is greater than 1, setting values is said to be the primary operation. If it is less than 1, connecting/disconnecting patches is said to be the primary operation. The QC Group has no participants whose primary operation was connecting/disconnecting patches. However, there is 1 in the cogito Group and 3 in the QEUP Group. In the QC Group, we observed that participants expended much effort on tracking parameters’ values. However, participants in the cogito and QEUP Groups tended to be less concerned about tracking parameters’ values. The Patch Inspector window (Figure 1(b)) for setting values is a second-level window in Quartz Composer. It is accessible by clicking the Patch Inspector button in the Editor window. Connection/disconnection is an operation in the Editor window, which is the first-level window. The operations of setting values are on a lower level, though they are important. Using the QEUP and cogito systems might inspire participants to move from the lower level to the higher level. In addition, the impacts of using QEUP on this aspect are more noticeable.

Table 3. Setting values and connection/disconnection
Group    User Performance
QC       Set-value operations          30    28    45    14    30
         (Dis)connection operations    29    26    30     5     6
         Ratio                         1.03  1.08  1.50  2.80  5.00
cogito   Set-value operations          16    17    33    29    61
         (Dis)connection operations    52    17    21     9    14
         Ratio                         0.31  1.00  1.57  3.22  4.36
QEUP     Set-value operations           0     9    21    18    18
         (Dis)connection operations    29    13    26     6     2
         Ratio                         0.00  0.69  0.81  3.00  9.00
Conclusion and Future Work
From analysis of the user study, it appears that QEUP may be a useful complement to Quartz Composer, which may help end-users to acquire knowledge
and to form mental models. Those participants starting with QEUP, followed by Quartz Composer, seemed better able to cope with the barriers that emerge in Quartz Composer. It may be that experience with QEUP can provide a headstart in learning the Quartz Composer graphical development environment. Furthermore, it may provide a similar benefit when an end-user wishes to better understand new features and patches within Quartz Composer. A longer-term study would be needed to better assess both potential outcomes hinted at from these results. As well, a comprehensive study would be needed to evaluate whether the QEUP system could be an alternative to replace Quartz Composer. Acknowledgements. The authors wish to acknowledge support from the Natural Sciences and Engineering Research Council of Canada, the Canada Foundation for Innovation, the Saskatchewan Innovation and Science Fund, and the University of Regina.
Visual Image Browsing and Exploration (Vibe): User Evaluations of Image Search Tasks Grant Strong, Orland Hoeber, and Minglun Gong Department of Computer Science, Memorial University St. John’s, NL, Canada A1B 3X5 {strong,hoeber,gong}@cs.mun.ca
Abstract. One of the fundamental challenges in designing an image retrieval system is choosing a method by which the images that match a given query are presented to the searcher. Traditional approaches have used a grid layout that requires a sequential evaluation of the images. Recent advances in image processing and computing power have made similarity-based organization of images feasible. In this paper, we present an approach that places visually similar images near one another, and supports dynamic zooming and panning within the image search results. A user study was conducted on two alternate implementations of our prototype system, the findings from which illustrate the benefit that an interactive similarity-based image organization approach has over the traditional method for displaying image search results.
1 Introduction
Image search tasks can be divided into two fundamentally different categories: discovery and rediscovery. Within a rediscovery task, the searcher knows precisely what image they are looking for and seeks to either find it in the search results collection, or decide that it is not present. In contrast, when a searcher is performing a discovery task, the mental model of the image for which they are searching is often vague and incomplete. Within the search results collection, there may be many images that match the desired image to various degrees. The primary activities for the searcher in such discovery tasks are browsing and exploration. In this paper, we evaluate how visual image browsing and exploration, as implemented in Vibe, can assist searchers in performing discovery tasks within the domain of image search. The fundamental premise is that a visual approach to image organization and representation that takes advantage of the similarities between images can enhance a searcher's ability to browse and explore collections of images. Vibe is an example of a web information retrieval support system (WIRSS) [5]; its purpose is to enhance the human decision-making abilities within the context of image retrieval. The primary method of image retrieval used on the Web is based on keyword search. Search engines merely adapt their document retrieval algorithms to the context of images and present the results
in a scrollable list ranked on query relevance. While list interfaces are easy to use, there is limited ability to manipulate and explore search results. To facilitate an exploration of a collection of image search results, Vibe arranges the images by content similarity on a two-dimensional virtual desktop [9,10]. The user can dynamically browse the image space using pan and zoom operations. As the user navigates, an image collage is dynamically generated from selected images. At the broadest zoom level, the images in the collage are those that best represent the others in their respective neighbourhoods, providing a high-level overview of the image collection. As the searcher zooms in toward an image of interest, more images that are visually similar to the area of focus are dynamically loaded. The benefit of this interaction method is that the user has the ability to see as little or as much detail as they wish; a single unified interface provides both a high-level overview and details of a subset of the image collection. Two different methods for organizing the collection of images in Vibe are discussed and evaluated in this paper. The original design of Vibe displays images in irregular patterns [9], following a messy-desk metaphor. In a preliminary evaluation of the interface, we found that once searchers zoomed into a particular area of interest in the image space, they sometimes experienced difficulties scanning the irregularly placed images within the display. A potential solution to this difficulty is to align the images in the messy-desk arrangement into a more structured neat-desk layout in order to enhance the ability of searchers to linearly scan the images. This method maintains the similarity-based organization of the images, but relaxes the use of distance between pairs of images to represent a measure of their similarity. Where user productivity and enjoyment are concerned, we feel that the characteristics of Vibe have merit. The results of a user evaluation conducted in a controlled laboratory setting are reported in this paper. The evaluation compares three image search interfaces: messy-desk Vibe, neat-desk Vibe, and a scrollable grid layout similar to that found in Web image search engines. The remainder of this paper is organized as follows. Section 2 provides an overview of image retrieval and organization. Section 3 outlines the specific features of Vibe and the techniques used to construct the similarity-based image organization. Section 4 describes the user evaluation methods, followed by the results of the study in Section 5. The paper concludes with a summary of the research contributions and an overview of future work in Section 6.
2 Related Work
Techniques for finding specific images in a large image database have been studied for decades [2]. Most current Web-based image search engines rely on some form of metadata, such as captions, keywords, or descriptions; the matching of queries to images is performed using this metadata. Manual image annotation is tedious and time consuming, whereas the results of automatic annotation are still unreliable. Hence, methods for performing retrieval using image content directly,
referred to as Content-based Image Retrieval (CBIR) [7,2], have been extensively studied. While CBIR approaches normally assume that users have clear search goals, Similarity-based Image Browsing (SBIB) approaches cater to users who wish to explore a collection of images, but do not have a clearly defined information need [4]. The challenge of SBIB is to arrange images based on visual similarities in such a way as to support the browsing and exploration experience. This paper investigates whether SBIB techniques, as implemented in Vibe, can improve users' image searching experience and performance. Several SBIB techniques have been proposed. Torres et al. [11] prescribe ways to enhance CBIR results by browsing them in spiral or concentric ring representations. The position and size of the images vary with their measure of similarity to the query. In Chen et al.'s approach [1], contents of image databases are modelled in pathfinder networks. The result is a branched clustering constructed with reference to the histogram or texture similarity between images. Snavely et al. [8] provide an interesting way to arrange and browse large sets of photos of the same scene by exploiting the common underlying 3D geometry in the scene. The image browsing technique evaluated in this paper is derived from Strong and Gong's previous work [9,10]. We adopt their idea of organizing images in 2D space by training a neural network. Two alternative approaches to laying out the images are provided and studied.
3 Vibe
The Vibe technique can arrange images in two alternative ways, which are referred to as messy-desk and neat-desk layouts, respectively. Both layouts place images on a 2D virtual desktop so that visually similar images are close to each other. The difference is that images can be positioned at arbitrary locations in the messy-desk layout, but have to be aligned to a grid in the neat-desk layout. Vibe also supports dynamic pan and zoom operations within the image search results space, allowing the searcher to easily browse and explore the images. The rest of this section discusses the methods for generating these two layouts, and the techniques for supporting interactive exploration and browsing.
3.1 Feature Vector Generation
In order to organize images based on similarity, we need to define a way of measuring the similarity between any two images. Here the similarity is computed using the Euclidean distance between two feature vectors, which are extracted from images to represent the salient information. In this paper, the color-gradient correlation is used since it is easy to calculate and offers good organizational performance [10]. To compute the color-gradient correlation for an input image $I$, we first compute the gradient magnitude $l_p$ and gradient orientation $\theta_p$ for each pixel $p$. We then divide the colour and gradient orientation spaces into $N_c$ and $N_\theta$ bins,
respectively. Assuming that functions $C(p)$ and $\Theta(p)$ give us the colour and gradient orientation bin indices for pixel $p$, the sum of gradient magnitudes for all pixels belonging to the $k$-th colour and gradient orientation bin can be computed using:

$$m_k = \sum_{p \in I \,\wedge\, C(p) \times N_\theta + \Theta(p) = k} l_p \qquad (1)$$
where $N = N_c \times N_\theta$ is the total number of bins. In practice, we set $N_c = 8$ and $N_\theta = 8$, resulting in a 64-dimensional feature vector $F(I)$, and then normalize the final vector.
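As an illustration only, the following NumPy sketch shows one way such a colour-gradient correlation vector could be computed. The intensity-based gradient and the 1-bit-per-RGB-channel colour quantization are assumptions made here; the paper does not specify these details, so this is not the authors' implementation.

```python
import numpy as np

def color_gradient_feature(image, n_color=8, n_theta=8):
    """image: H x W x 3 float array in [0, 1]; returns a normalized 64-D vector."""
    gray = image.mean(axis=2)                    # simple intensity channel (assumption)
    gy, gx = np.gradient(gray)                   # per-pixel gradients
    magnitude = np.hypot(gx, gy)                 # l_p
    orientation = np.arctan2(gy, gx)             # theta_p in [-pi, pi]

    # Colour bin C(p): quantize each RGB channel to 2 levels -> 8 bins (assumption).
    bits = (image >= 0.5).astype(int)
    color_bin = bits[..., 0] * 4 + bits[..., 1] * 2 + bits[..., 2]

    # Orientation bin Theta(p): split [-pi, pi] into n_theta equal sectors.
    theta_bin = np.clip(((orientation + np.pi) / (2 * np.pi) * n_theta).astype(int),
                        0, n_theta - 1)

    # Accumulate gradient magnitudes per combined bin, as in Eq. (1), then normalize.
    k = color_bin * n_theta + theta_bin
    feature = np.bincount(k.ravel(), weights=magnitude.ravel(),
                          minlength=n_color * n_theta)
    norm = np.linalg.norm(feature)
    return feature / norm if norm > 0 else feature
```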
3.2 Messy-Desk Layout
Given a collection of $T$ images, the messy-desk layout tries to position them on a 2D virtual desktop, so that visually similar images are placed together. This layout is generated by training a Self-Organizing Map (SOM), a process similar to the one discussed in [9]. A SOM is a type of artificial neural network that is trained through unsupervised learning. It is used here to map N-dimensional vectors to 2D coordinate space. SOMs consist of $M \times M$ units, where each unit $x$ has its own N-dimensional weight vector $W(x)$. For dimension reduction we ensure that the number of units $M \times M$ is at least the number of images $T$, making it possible to map distinct vectors to unique locations in the SOM. The SOM training process requires multiple iterations. During each iteration all images in the collection are shown to the SOM in a random order. When a particular image $I$ is shown, the goal is to find the best match unit $B$ and then update the weight vectors in $B$'s neighbourhood proportionally based on the distance between $B$ and each neighbouring unit in the SOM. After the SOM converges, the coordinates of the best match unit $B(I)$ for each image $I$ give us the mapping in 2D. The SOM's topology preserving property ensures that images that have similar vectors are mapped to locations that are closer to each other, and vice versa.
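A compact sketch of such a SOM training loop is given below. The Gaussian neighbourhood function, the linearly decaying learning rate and radius, and the default parameter values are assumptions made here for illustration; they are not the authors' settings.

```python
import numpy as np

def train_som(features, grid_size=32, iterations=20, lr0=0.5, seed=0):
    """features: (T, N) array of image feature vectors; returns 2D grid coordinates per image."""
    rng = np.random.default_rng(seed)
    weights = rng.random((grid_size, grid_size, features.shape[1]))
    coords = np.dstack(np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij"))
    sigma0 = grid_size / 2.0

    for it in range(iterations):
        lr = lr0 * (1.0 - it / iterations)            # decaying learning rate (assumption)
        sigma = max(sigma0 * (1.0 - it / iterations), 1.0)
        for f in rng.permutation(features):           # images shown in random order
            # Best match unit B: the unit whose weight vector is closest to f.
            dist = np.linalg.norm(weights - f, axis=2)
            b = np.array(np.unravel_index(np.argmin(dist), dist.shape))
            # Update B's neighbourhood, with influence decreasing with grid distance from B.
            grid_dist2 = ((coords - b) ** 2).sum(axis=2)
            influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            weights += lr * influence[..., None] * (f - weights)

    # After training, each image is mapped to the coordinates of its best match unit.
    return np.array([np.unravel_index(np.argmin(np.linalg.norm(weights - f, axis=2)),
                                      (grid_size, grid_size)) for f in features])
```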
3.3 Neat-Desk Layout
The messy-desk layout groups visually similar images together, allowing users to quickly narrow down the search results to a small area of the virtual desktop. However, preliminary evaluations found that users sometimes have difficulty locating the exact image they want because the irregular image layout makes it hard to remember which images have already been inspected. To address this problem, we propose another way to organize images, referred to as the neat-desk layout. The neat-desk layout constrains image positions to be aligned to a grid. Since a trained SOM cannot guarantee one image per unit, we cannot simply use a SOM that has the same number of units as the grid we want to align the images to. Instead, we generate the neat-desk layout from the messy-desk layout. As shown in Figure 1, given the collection of images and their 2D locations in the
Fig. 1. Converting from a messy-desk to the neat-desk layout using a k-d tree
messy-desk layout, the k-d tree algorithm is used to arrange the images into a neat-desk layout. The algorithm starts by finding the median value among the horizontal coordinates of all images, and uses this to split the collection into left and right halves. It then computes the median value among the vertical coordinates of images in each half, so that each half is further split into top and bottom quarters. The above two steps are repeated until each node contains at most one image. At the end, all images are contained in the leaves of a balanced binary tree. Based on the position of each leaf, we can assign a unique location to its associated image in the neat-desk layout. In the messy-desk approach, two images that are very similar to one another will be placed in close proximity. The resulting gaps and irregular placement of images provide a good representation of the visual clustering, but make sequential evaluation of images difficult. The neat-desk layout produces a more regular layout, at the expense of losing the visual encoding of the degree of similarity.
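A sketch of this median-split assignment is given below, assuming strictly alternating horizontal and vertical splits as described above; choosing power-of-two grid dimensions is an implementation convenience of this sketch, not necessarily the authors' choice.

```python
import math

def neat_desk_layout(items):
    """items: list of (image_id, x, y) taken from the messy-desk layout.
    Returns {image_id: (row, col)} positions on a neat-desk grid."""
    if not items:
        return {}
    depth = math.ceil(math.log2(len(items))) if len(items) > 1 else 0
    n_cols = 2 ** math.ceil(depth / 2)        # the horizontal coordinate is split first
    n_rows = 2 ** (depth // 2)
    positions = {}

    def split(subset, row0, rows, col0, cols, split_on_x):
        if len(subset) == 1:
            positions[subset[0][0]] = (row0, col0)   # one image per disjoint cell range
            return
        axis = 1 if split_on_x else 2                # x or y coordinate of the tuples
        subset = sorted(subset, key=lambda it: it[axis])
        mid = len(subset) // 2                       # median split
        if split_on_x:                               # left / right halves
            split(subset[:mid], row0, rows, col0, cols // 2, False)
            split(subset[mid:], row0, rows, col0 + cols // 2, cols - cols // 2, False)
        else:                                        # top / bottom halves
            split(subset[:mid], row0, rows // 2, col0, cols, True)
            split(subset[mid:], row0 + rows // 2, rows - rows // 2, col0, cols, True)

    split(items, 0, n_rows, 0, n_cols, True)
    return positions
```

For example, six images spread over the virtual desktop would be assigned distinct cells on a 2 × 4 grid, preserving their left/right and top/bottom relationships.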
3.4 Determining Display Priority
While the above layouts handle the positioning of the images in a collection, it is impractical to display all images at those positions when the collection is large. To facilitate the selection of images to display at run time, we pre-assign priorities to all images. The priority assignment is based on the criterion that the more representative images should have higher priorities to allow them to be selected first. For the messy-desk layout, the images' priorities are determined using a multiresolution SOM [9]. The bottom SOM, the one with the highest resolution, is obtained using the SOM training procedure described in Section 3.2. The upper level SOMs are generated from the lower level ones directly without training. This is done by assigning each unit in an upper level SOM the average weight vector of its children in the lower level SOM. The average weight vector is then used to find the best matching image for each unit in the upper level SOMs. The upper level images represent their neighbourhoods below and are given a higher priority for display. The same principle is applied for the neat-desk layout. The bottom level grid holds all images, each in its assigned location.
Fig. 2. The layout of images using messy-desk Vibe (top row) and neat-desk Vibe (bottom row) for the same collection of images at three different levels of zoom. Note the visual similarity of images that are near to one another.
An upper level grid contains a quarter of the grid points, with each point $p$ linking to four child locations in the lower level grid. To select a single image for the grid point $p$, we first compute the average vector using images mapped to $p$'s four child locations, and then pick the image that has the vector closest to the average.
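The neat-desk priority pyramid could be built along the lines of the following sketch; the grid-of-image-ids representation and the handling of missing cells are assumptions made here for illustration, not the actual implementation.

```python
import numpy as np

def build_priority_pyramid(grid, vectors):
    """grid: 2D list where grid[r][c] is an image id (or None for an empty cell);
    vectors: dict mapping image id -> feature vector.
    Returns the list of levels, from the full grid up to a single representative."""
    levels = [grid]
    while len(grid) > 1 or len(grid[0]) > 1:
        rows, cols = max(1, len(grid) // 2), max(1, len(grid[0]) // 2)
        upper = [[None] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                # Gather the images mapped to the four child locations below this point.
                children = [grid[rr][cc]
                            for rr in (2 * r, 2 * r + 1) if rr < len(grid)
                            for cc in (2 * c, 2 * c + 1) if cc < len(grid[0])]
                children = [i for i in children if i is not None]
                if children:
                    mean = np.mean([vectors[i] for i in children], axis=0)
                    # Keep the image whose vector is closest to the average child vector.
                    upper[r][c] = min(children,
                                      key=lambda i: np.linalg.norm(vectors[i] - mean))
        levels.append(upper)
        grid = upper
    return levels
```

Images appearing in the upper levels of this pyramid are the ones given higher display priority.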
3.5 Browsing Interface
Given the images and their mapped locations in either messy-desk or neat-desk layouts, the browsing interface selectively displays images at their mapped locations based on the users' pan and zoom interactions with the interface [9]. The number of images shown depends on the display resolution, the zoom level, and the user specified image display size. If the system is unable to fit all of the available images inside the viewing area, the ones with higher display priorities are shown. Figure 2 shows the three different levels of zoom for both the messy-desk and neat-desk layout methods. Panning is implemented using a mouse drag operation, which translates the current viewing area. Zooming adjusts the size of the viewing area and is achieved using the normal mouse wheel operations. Zooming out enlarges the viewing area and allows users to inspect the overall image layout on the virtual desktop, whereas zooming in reduces the viewing area, making it possible to show the images in a local region in greater detail. It is worth noting that the zooming operation only changes the image display size when room is available (i.e., the view is at the lowest level and there are no “deeper” images); otherwise it
provides a filtering operation that pulls and pushes images into and out of the view area. The browsing interface also provides two ways for adjusting the display size of the images. First, the users can use the combination of the control key and mouse wheel to change the size of all displayed images, which also affects the total number of images that can be shown within the limits of the current view. Secondly, users are able to selectively enlarge an image of interest with a double-click.
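A minimal sketch of this view-dependent selection is given below; the data structures and the integer priority levels are assumptions for illustration, not the actual Vibe implementation.

```python
def images_to_display(positions, priority, viewport, max_images):
    """positions: {image_id: (x, y)} on the virtual desktop;
    priority: {image_id: pyramid level at which the image first appears (0 = top)};
    viewport: (x0, y0, x1, y1) rectangle determined by the current pan and zoom."""
    x0, y0, x1, y1 = viewport
    visible = [i for i, (x, y) in positions.items()
               if x0 <= x <= x1 and y0 <= y <= y1]
    visible.sort(key=lambda i: priority[i])      # most representative images first
    return visible[:max_images]                  # only as many as fit in the view
```

Zooming in shrinks the viewport rectangle, so lower-priority (more detailed) images are pulled into the returned selection, which matches the filtering behaviour described above.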
4 Evaluation
In order to explore the differences between the traditional grid layout of image search results and the interactive content-based approach implemented in Vibe, a user evaluation was conducted in a controlled laboratory setting. In this study, the messy-desk Vibe (Vibe-m) and the neat-desk Vibe (Vibe-n) are compared to a grid layout (Grid). In order to reduce the interaction differences between the systems being studied, the Grid was implemented as a single scrollable grid (rather than the more common multi-page approach).
4.1 Methods
Although a number of options are available for studying search interfaces [6], we conducted a user evaluation in a laboratory setting in order to obtain empirical evidence regarding the value of the similarity-based approach for image search results representation. The controlled environment of the study allowed us to manage and manipulate the factors that we believed would have an impact on a participant's performance and subjective reactions. At the same time, we were also able to ensure that the search tasks each participant performed were the same. The study used a 3 × 3 (interface × search task) between-subjects design. Each participant used each interface only once, and conducted each search task only once. To further alleviate potential learning effects, a Graeco-Latin square was used to vary the order of exposure to the interface and the order of the task assignment. Prior to performing any of the tasks, participants were given a brief introduction to the features of each of the three interfaces. A set of three situated search tasks was provided to the participants, for which they used either Vibe-m, Vibe-n, or the Grid. For each task, participants were given a scenario in which they were asked to find five images that were relevant to the described information need (see Table 1). The tasks were chosen to be somewhat ambiguous, requiring the participants to explore the search results in some detail. The images used for all three datasets were obtained from Google Image Search by searching with the corresponding keywords. In addition, the order of images displayed in the Grid follows the order returned by Google search. For each task, measurements of time to task completion, accuracy, and subjective measures were made.
Table 1. Tasks assigned to participants in the user evaluation

Query            Information Need
“Eiffel Tower”   Find five images of sketches of the Eiffel Tower.
“Notre Dame”     Find five images of the stained glass windows of the Notre Dame Cathedral.
“Washington”     Find five images of Denzel Washington.
Pre-study questionnaires were administered to determine prior experience with image search, educational background, and computer use characteristics. In-task questionnaires measured perceptions of quality of the search results and ease of completing the task. Post-study questionnaires followed the guidelines of the Technology Acceptance Model (TAM) [3], measuring perceived usefulness and ease-of-use, along with an indication of preference for an image search interface.
4.2 Participant Demographics
Twelve individuals were recruited from the student population within our department to participate in this study. They reported using a wide range of systems for the purposes of searching for images. These included the top search engines (e.g., Google, Bing, and Yahoo), other online services (e.g., Flickr, Picasa, and Facebook), and desktop software (e.g., iPhoto, Windows Photo Gallery, and file browsers). As a result, we can conclude that all of the participants in the study were very familiar with the traditional grid-based approach to image layout.
5 Results
5.1 Time to Task Completion
The average times required to complete the three tasks with the three interfaces are illustrated in Figure 3. Clearly, these results are somewhat varied. For the “Eiffel Tower” and “Notre Dame” tasks, participants performed better using
Fig. 3. Average time to task completion measurements from the user evaluation
both versions of Vibe than the Grid. However, which version of Vibe performed better was different between the two tasks. For the “Washington” task, participants performed better using the Grid than either version of Vibe. ANOVA tests were performed on these measurements to determine whether their differences were statistically significant. Among these results, only three were significant. For the “Notre Dame” task, the time taken to complete the task using Vibe-m was faster than both the Grid (F(1, 7) = 12.4, p < 0.05) and Vibe-n (F(1, 7) = 8.15, p < 0.05). For the “Washington” task, the time to completion using the Grid was faster than Vibe-m (F(1, 7) = 6.49, p < 0.05). For the rest of the pair-wise comparisons, the differences were not statistically significant. For most combinations of tasks and interfaces, there was a high degree of variance in the time to task completion measurement, indicating that the ability to complete the tasks is more a function of the skill and interest of the participant than the interface used to browse, explore, and evaluate the image search results. One aspect of particular note is the situation where the Grid allowed the participants to complete the “Washington” task faster than with either version of Vibe. Within Vibe, the system was effective in grouping images with similar global features, but not very effective in putting together images with similar local features. Since the images that contain people are strongly influenced by the background, these images are not necessarily placed together in Vibe. While participants were able to navigate to a location of interest easily, if they were unable to find enough relevant images in that location (e.g., images of Denzel Washington), they were hesitant to zoom out and continue exploring. As a result, it took them longer to find the images than sequentially searching the image space. Nevertheless, this suggests that the users were able to use the spatial layout information presented in the Vibe interface effectively. As the methods for grouping images based on local features improve, issues such as this will be eliminated.
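As an aside, pairwise comparisons of this kind can be reproduced with standard statistics libraries. The short sketch below runs SciPy's one-way ANOVA on two groups of completion times; the numbers are placeholders for illustration, not the study's data.

```python
from scipy import stats

# Hypothetical time-to-completion values (seconds) for one task, one list per interface.
grid_times = [310, 295, 420, 150, 260]
vibe_m_times = [180, 220, 240, 130, 200]

# One-way ANOVA on two groups, analogous to the pairwise F-tests reported above.
f_stat, p_value = stats.f_oneway(grid_times, vibe_m_times)
print(f"F(1, {len(grid_times) + len(vibe_m_times) - 2}) = {f_stat:.2f}, p = {p_value:.3f}")
```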
5.2 Accuracy
After the participants completed the tasks, the five selected images were carefully inspected to verify their relevance to the information need. ANOVA tests across all three tasks indicate that there are no statistically significant differences in the accuracy when using the different interfaces (“Eiffel Tower”: F (2, 11) = 1.29, p = 0.32; “Notre Dame”: F (2, 11) = 1.00, p = 0.41; “Washington”: F (2, 11) = 0.346, p = 0.72). The average number of errors ranged from zero to 0.75. This result indicates that the exploratory nature of Vibe neither helped nor hindered the participants in deciding the relevance of individual images to the search task.
5.3 Subjective Reactions
After each task was completed, participants were asked to indicate their degree of agreement with statements related to the quality of the search results and the ease with which they were able to complete the task (using a five-point Likert
Fig. 4. Average response to statements related to the search tasks: (a) quality of search results; (b) ease of the search task
scale where high values indicated agreement). The average responses to these questions are reported in Figure 4. For the “Eiffel Tower” and “Notre Dame” tasks, one can readily see that participants perceived the search results to be of higher quality and the tasks to be easier to perform when using either version of Vibe compared to the Grid. For the “Washington” task, it appears that since there was some difficulty with Vibe being able to organize the local features of people in the images properly, the participants provided similar responses for all three interfaces. The statistical significance of these results was evaluated using pair-wise Wilcoxon-Mann-Whitney tests. Significance was found only for certain comparisons in the “Notre Dame” query. For the quality of search results measure, only the Grid vs. Vibe-n (Z = −2.055, p < 0.05) comparison was statistically significant. For the ease of search task measure, only the Grid vs. Vibe-m (Z = −2.494, p < 0.05) and Grid vs. Vibe-n (Z = −2.494, p < 0.05) comparisons were statistically significant. Since the data from these in-task questionnaires was rather sparse, questions related to the overall perception of the usefulness and ease of use of the interface were collected in the post-study questionnaire, using the TAM instrument. Since this data was not collected in the context of a particular task, aggregate results of all participants and all TAM statements are shown in Figure 5. Wilcoxon-Mann-Whitney tests were performed on the responses using a pair-wise grouping of the interfaces.
Fig. 5. Average response to statements regarding the usefulness and ease of use of the interface
Table 2. Statistical analysis (Wilcoxon-Mann-Whitney tests) of the responses to the TAM questions

               Grid vs. Vibe-m           Grid vs. Vibe-n           Vibe-m vs. Vibe-n
Usefulness     Z = −7.578, p < 0.001     Z = −7.966, p < 0.001     Z = −0.967, p = 0.334
Ease of Use    Z = −2.775, p < 0.05      Z = −2.206, p < 0.05      Z = −0.785, p = 0.432
The results from this statistical measure are reported in Table 2, showing that participants found either version of Vibe more useful and easy to use than the Grid. The differences between Vibe-m and Vibe-n were not found to be statistically significant.
5.4 Preference
At the end of the study, participants were asked to indicate their preference for an image search interface. Four participants indicated a preference for Vibe-m (33%), six for Vibe-n (50%), and two for the Grid (17%). This clearly indicates a high degree of preference for the dynamic layout and interactive features of Vibe. A Wilcoxon signed rank sum test found statistical significance (Z = −2.309, p < 0.05) in the preference of Vibe over the Grid. The preference between the messy-desk and neat-desk layouts was not statistically significant (Z = −0.632, p = 0.53).
6 Conclusions and Future Work
In this paper, we present an interactive visual interface that supports the browsing and exploration of image search results (Vibe). Two different versions of Vibe were created and studied in comparison to the commonly used grid layout. The messy-desk layout version of Vibe places images on a 2D virtual desktop, using the distance between images to represent their similarity. The neat-desk layout adds structure to the image arrangement. Both versions of Vibe provide dynamically generated collages of images, which can be interactively panned and zoomed. As the searcher zooms into an area of interest and more space is created in the view, more images from the search space are dynamically displayed. This interaction results in a filtering and focusing of the search space, supporting the searcher in discovering relevant images. As a result of the user evaluation, we conclude that Vibe can improve the time it takes to find relevant images from a collection of search results. However, there are situations where the overhead of browsing and exploring outweighs the time saved in finding relevant images. Further study is required to examine the boundary conditions for increasing or decreasing searcher performance. During the study, the perception of search results quality and ease of completing the tasks was higher for Vibe than for the grid layout. However, the degree and significance of this result was dependent on the task. By the end of
the study (after each participant was exposed to each of the three interfaces), measurements of usefulness and ease of use showed a clear and statistically significant preference for Vibe. These results indicate that the participants were able to see the value in using Vibe for their image search tasks, even though the time taken to find relevant images was not necessarily improved. Further validation of this outcome was provided by the fact that 83% of the participants preferred to use Vibe over a grid layout. In terms of the differences between the messy-desk and neat-desk layout, no clear conditions were found in this study indicating when one layout method was superior to the other. Whether a participant found one or the other easier to use may simply be a matter of personal preference. However, further study to identify the advantages of one layout method over the other would be worthwhile.
References 1. Chen, C., Gagaudakis, G., Rosin, P.: Similarity-based image browsing. In: Proceedings of the IFIP International Conference on Intelligent Information Processing, Beijing, China, pp. 206–213 (2000) 2. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys 40(2), 1–60 (2008) 3. Davis, F.D.: Perceived usefulness, perceived ease of use, and user acceptance of information technology. Management Information Systems Quarterly 13(3), 319– 340 (1989) 4. Heesch, D.: A survey of browsing models for content based image retrieval. Multimedia Tools and Applications 42(2), 261–284 (2008) 5. Hoeber, O.: Web information retrieval support systems: The future of Web search. In: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence - Workshops (International Workshop on Web Information Retrieval Support Systems), pp. 29–32 (2008) 6. Hoeber, O.: User evaluation methods for visual Web search interfaces. In: Proceedings of the International Conference on Information Visualization, pp. 139–145. IEEE Computer Society Press, Los Alamitos (2009) 7. Smeulders, A., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(12), 1349–1380 (2000) 8. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: Exploring photo collections in 3d. In: Proceedings of the ACM International Conference on Computer Graphics and Interactive Techniques, pp. 835–846 (2006) 9. Strong, G., Gong, M.: Browsing a large collection of community photos based on similarity on GPU. In: Proceedings of the International Symposium on Advances in Visual Computing, pp. 390–399 (2008) 10. Strong, G., Gong, M.: Organizing and browsing photos using different feature vectors and their evaluations. In: Proceedings of the International Conference on Image and Video Retrieval, pp. 1–8 (2009) 11. Torres, R.S., Silva, C.G., Medeiros, C.B., Rocha, H.V.: Visual structures for image browsing. In: Proceedings of the International Conference on Information and Knowledge Management, pp. 49–55 (2003)
Contextual Recommendation of Social Updates, a Tag-Based Framework

Adrien Joly 1,2, Pierre Maret 3, and Johann Daigremont 1

1 Alcatel-Lucent Bell Labs France, Site de Villarceaux, F-91620 Nozay, France
  [email protected]
2 Université de Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205, F-69621, France
3 Université de Lyon, Laboratoire Hubert Curien, UMR CNRS 5516, F-42000 Saint-Etienne, France
  [email protected]
Abstract. In this paper, we propose a framework to improve the relevance of awareness information about people and subjects, by adapting recommendation techniques to real-time web data, in order to reduce information overload. The novelty of our approach lies in the use of contextual information about people's current activities to rank social updates which they are following on Social Networking Services and other collaborative software. The two hypotheses that we support in this paper are: (i) a social update shared by person X is relevant to another person Y if the current context of Y is similar to X's context at the time of sharing; and (ii) in a web-browsing session, a reliable current context of a user can be processed using metadata of web documents accessed by the user. We discuss the validity of these hypotheses by analyzing their results on experimental data.
1 Introduction
On Social Networking Services (such as Facebook (http://www.facebook.com/), Twitter (http://www.twitter.com/), or LinkedIn (http://www.linkedin.com/)) and other collaboration software, people maintain and create new social ties by sharing personal (but not necessarily private) social updates regularly with their community, including status messages and bookmark notifications. As depicted in Figure 1, a social update is a short message sent to a group of interested persons (e.g. a community). It can consist of a one-sentence news item or question, an anchor, a picture, or a comment, to share their current thoughts, activities, intentions and needs. On most of these tools, social updates are not meant to be consumed in a push fashion, i.e. like emails that are aimed at specific recipients and stacked in their inboxes. Instead, community members can go through the list of short social
Fig. 1. A status update from Twitter, and a bookmark update from Delicious
updates of the people or subjects (e.g. hashtags on Twitter) they follow, to get a quick feeling of awareness about those they care about. However, as the number of people and subjects being followed increases, the time required to get through the social updates they emit also increases, causing a loss of productivity. Additionally, as social updates are broadcast in real-time, they create frequent interruptions that can reduce people's ability to focus on a demanding task, especially when the social updates are not relevant to this task (because an irrelevant update induces a costly cognitive switch). In response to this emerging problem of information overload, we propose a framework to rank social updates according to real-time distances between users' contexts, which are computed from social descriptions and metadata of the web documents the users are looking at. The two underlying hypotheses that we support in this paper are: (i) a social update shared by person X is relevant to another person Y if Y's current context is similar to X's context at the time of sharing; and (ii) in a web-browsing session, a reliable current context of a user can be processed using a combination of tags and metadata of web documents accessed by the user. In the next section, we motivate our approach by explaining how context-awareness and current web techniques can be leveraged to improve awareness. In section 3, we survey existing work related to our problem. In section 4, we describe our contextual recommendation framework to provide relevant social updates to people. In section 5, we present our experimental setup and results to evaluate the human response to this approach. We will then discuss these results and propose future work.
2 Motivation
Vyas et al. proved [18] that informal ties between co-workers are essential to improve awareness, and thus better collaboration. In previous studies [9,8], we have identified that contextual information about users could be leveraged to assist the sharing of social updates, and thus maintain these ties, while reducing interruptions. Context was defined by Dey [5] as ”information that can be used to characterize the situation of an entity. An entity is a person, place, or object that is considered relevant to the interaction between a user and an application, including the user and applications themselves”. In most context-aware applications, researchers have been relying on sensors to extract contextual information. As tagging becomes a common practice on the Internet, rich contextual information can also emerge from human-generated content.
Despite some semantic issues related to the ambiguity of terms combined in folksonomies, the increasing amount of tags given by Internet users on digital resources [12,11] (e.g. web pages tagged on Delicious (http://del.icio.us/)) has made such tags good indexing features for recommender systems [7,15]. With the growing use of Twitter and geotagging applications on mobile devices, tags are now emerging from places, events and other real-world entities [13], which opens exciting opportunities to create new ambient intelligence, ambient awareness, augmented reality, and other social applications.
3 Background
To the best of our knowledge, the closest existing solution to our problem is a web-based service and mobile application called My6sense (http://www.my6sense.com/). This software can filter entries from RSS and other feeds (including social updates) according to the user's preferences. This content-based filtering technique relies on a profile which contains the user's subjects of interest, and this profile is continuously evolving by tracking which entries are consulted by the user. Similarly, SoMeONe [1] is a collaboration tool that can recommend contacts (and their selection of web resources) to users by identifying communities of interest around topics, using collaborative filtering. The names of users' bookmark categories are leveraged as topics. groop.us [3] applied the same approach while relying on tags attributed to web pages by users (folksonomies from a social bookmarking website) instead of hierarchical categories. In these approaches, recommendations are based on documents that were explicitly selected and shared (bookmarked) by users. Despite the evolving design of user profiles, the filtering is not adaptive to the user's current context. It is possible to provide collaboration opportunities by recommending people that are currently browsing similar documents [4,2], based on a TF-IDF analysis [17] of their content, users' context being represented by weighted term vectors. These recommendations can also include some information about people's current activities, as identified by a software module that tracks users' actions on their computer (e.g. chat sessions, office documents being edited, etc.) [6]. However, these efforts do not leverage tags proposed by web users. In the PRUNE framework [10], contextual entities (e.g. person, place or resource) and events can be extracted from heterogeneous sources like RSS feeds, web APIs and manual user entry. The Notes that Float application leverages this framework to attach such contextual information to notes added by the user, so that their individual visibility depends on their relevance to the user's current context, which relies on their similarity with the context at the time these notes were added. However, we have found no evidence that tags were leveraged in this application. Moreover, previous collaborative systems imply potential privacy issues.
4 Contextual Recommendation Framework
After having identified that contextual information about people can be leveraged to further describe the documents they are browsing/editing, and thus to recommend these documents to people that are in a similar context, we have reviewed several techniques and applications that are relevant for ranking documents. In this section, we study the case of enterprise employees working on computers; we then present a framework and software implementation of an adapted social update recommender system which considers web-browsing context as a relevance criterion, and which leverages tags as features.
4.1 Case Study
As a motivating case, we propose to consider an enterprise environment where employees work on individual networked computers. They don't know everybody in the enterprise, and are possibly spread across several offices in several cities, or even different countries. Such organizations traditionally rely on hierarchies of managers to coordinate the efforts of workers and transfer information to the relevant parties. We propose an internal social networking tool that allows every worker to share and retrieve relevant information about the current interests and status of their colleagues, while reducing unnecessary task interruptions and network maintenance time. This system will rely on various streams/feeds to leverage personal current thoughts, activities, intentions and questions, and must respect users' privacy (e.g. in the case of private browsing).
4.2 Contextual Tag Clouds
As users are working on computers in the case study presented above, most contextual information about their current activity can be extracted from the software they use (e.g. document editing and viewing). In this study, we assume that descriptions of the web sites they are currently browsing (e.g. to find some reference on an ongoing task) can provide clues on the user's current activity. Users' context can thus be modeled as a set of weighted terms, combining metadata and tags about these browsed web pages. As these terms can potentially reveal private or confidential information, users must be able to quickly visualize and edit them before submitting them to a recommender system.
Fig. 2. Sample contextual tag cloud: a set of words describing one’s context
We propose the name “Contextual Tag Clouds” to refer to these human-readable contexts based on a weighted combination of tags and other descriptive terms, as depicted in Figure 2.
4.3 Data Flow and User Interaction
As depicted in Figure 3, contextual information is extracted from user-manipulated content (in our case, descriptions of web pages currently browsed) by sniffers running on every user's computer. For privacy control reasons, no contextual information will ever be sent to any remote party without user confirmation. Contextual information is represented by a set of weighted keywords, and represented as a tag cloud. Also running on the user's computer, an aggregator gathers events from all these sniffers, and queries several web services to generate weighted tags, in order to combine them in a contextual tag cloud that represents the user's current context.
Fig. 3. Overview of the contextual recommendation loop
When posting a social update (e.g. a tweet ), the user can attach his/her current contextual tag cloud, so that the contextual filter (i.e. a recommender system running in the infrastructure) can match it with other users’ last contextual tag cloud, using a relevance function. This social update will then be recommended to users whose last contextual tag cloud is relevant to the one attached to the social update. That way, every user gets a dynamic (near real-time) list of recent social updates, sorted by decreasing relevance score as they browse the web. Like with regular social networking and microblogging services, these short social updates can be quickly read by users to remain aware of relevant activities going on in their communities. They can also decide to reply to social updates or to call their author.
4.4 Ranking Model
The theoretical framework that we designed to solve our relevance ranking problem relies on a vector space model, five weighting functions, an aggregation operator and a relevance function. The weighting functions are equivalent to context interpreters [5]: they transform raw data (in our case, URLs of web documents being browsed by the user) into higher-level information (contextual tags) by abstraction. The contextual tag cloud model is equivalent to the vector space model proposed by Salton [17]. Notice that, in this paper, the word tag refers to terms, whatever their origin. Traditionally, the set of terms resulting from the analysis of a document $d$ is formalized as a vector of weights $v(d) = [w(t_1, d), w(t_2, d), \ldots, w(t_N, d)]$, one real-valued weight per term $t \in [t_1, t_2, \ldots, t_N]$. The specificity of our model lies in the combination of five functions applied to browsed documents (and their crowd-sourced descriptions: tags) to compute the weights:

1) the Metadata function counts occurrences of term $t$ with different coefficients $(\alpha, \beta, \gamma)$, depending on the position of this term in document $d$'s metadata:

$$w_1(t, d) = \alpha \cdot |t \in T_d| + \beta \cdot |t \in K_d| + \gamma \cdot |t \in D_d|$$

where $|t \in T_d|$ is the number of occurrences of the term $t$ in the title of the document $d$, $|t \in K_d|$ in its keywords set, and $|t \in D_d|$ in its description text.

2) the SearchQuery function counts the number of occurrences of term $t$ in a search query $Q_d$, when the analyzed document $d$ contains a list of search results:

$$w_2(t, d) = |t \in Q_d|$$

3) the DomainNames function adds the domain names (including subdomains) $N_d$ from document $d$'s URL as terms:

$$w_3(t, d) = |t \in N_d|$$

4) the SocialBookmarks function counts the number of people who publicly bookmarked the document $d$ using the term $t$ as a tag:

$$w_4(t, d) = \sum_{p \in P} tag(p, d, t)$$

where $P$ is the set of people using this bookmarking service, and $tag(p, d, t)$ equals 1 if person $p$ bookmarked document $d$ using term $t$ as a tag, and 0 otherwise.
5) the SemanticAnalyzer function counts the number of occurrences of semantically-defined entities (i.e. concepts and instances) that are represented by the term $t$, when they are identified in the document $d$:

$$w_5(t, d) = |t \in R_d|, \qquad R_d = [\forall e \in E_d,\ repr(e)]$$

where $repr(e)$ is the textual representation of a semantic entity $e$, and $R_d$ is the set of textual representations of the entities $E_d$ found in the document $d$. This function is further described in the next part of this section.

Additionally, we define an aggregation operator and a relevance function that leverage the vectors resulting from the weighting functions above. The aggregation operator is the addition of the given weighted term vectors, after their individual normalization. The normalized form $\overline{v}$ of a vector $v$ conforms to $\sum_{t=t_1}^{t_N} \overline{v}_t = 1$, with weight values $\overline{v}_t \in \mathbb{R}$ in the range $[0, 1]$. Thus, the aggregation operator applied to a set of vectors $V = [v_1, v_2, \ldots, v_M]$ acts as the following function:

$$aggr(V) = \sum_{t=1}^{M} \overline{v_t}$$
The relevance function between normalized vectors, like in traditional vector-based models, relies on cosine similarity. Thus, the relevance of a tag cloud vector $R$ with another tag cloud vector $S$ is computed by:

$$relevance(R, S) = \frac{R \cdot S}{\|R\| \, \|S\|}$$
which returns a relevance score $r \in \mathbb{R}$ in the range $[0, 1]$, 1 being the maximum relevance score (i.e. contextual equality).
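To make the ranking model concrete, the sketch below implements the Metadata weighting function $w_1$, the normalization and aggregation of weighted term vectors, and the cosine relevance function in Python. The dictionary-based document format and the example pages are assumptions made for illustration; this is not the authors' code, and the coefficient values are those reported later in Sect. 4.5.

```python
from collections import Counter
import math

ALPHA, BETA, GAMMA = 50, 10, 1          # w1 coefficients reported in Sect. 4.5

def w1_metadata(doc):
    """w1: weight terms by their position in the metadata (title, keywords, description)."""
    weights = Counter()
    for term in doc.get("title", []):
        weights[term] += ALPHA
    for term in doc.get("keywords", []):
        weights[term] += BETA
    for term in doc.get("description", []):
        weights[term] += GAMMA
    return weights

def normalize(vector):
    """Scale weights so that they sum to 1 (empty vectors stay empty)."""
    total = sum(vector.values())
    return {t: w / total for t, w in vector.items()} if total else {}

def aggregate(vectors):
    """aggr(V): sum of the individually normalized weighted term vectors."""
    cloud = Counter()
    for v in vectors:
        cloud.update(normalize(v))
    return dict(cloud)

def relevance(r, s):
    """Cosine similarity between two contextual tag clouds (dicts of term weights)."""
    dot = sum(w * s.get(t, 0.0) for t, w in r.items())
    norm_r = math.sqrt(sum(w * w for w in r.values()))
    norm_s = math.sqrt(sum(w * w for w in s.values()))
    return dot / (norm_r * norm_s) if norm_r and norm_s else 0.0

# Hypothetical usage: compare the cloud attached to a social update with a reader's cloud.
author_page = {"title": ["active", "media"], "keywords": ["amt"], "description": ["conference"]}
reader_page = {"title": ["active", "media", "technology"]}
author_cloud = aggregate([w1_metadata(author_page)])
reader_cloud = aggregate([w1_metadata(reader_page)])
print(round(relevance(author_cloud, reader_cloud), 2))
```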
4.5 Software Implementation
The framework described above was designed as a modular architecture, according to the data flow depicted in Figure 3, in which software modules communicate through RESTful HTTP requests. In this section, we present the implementation of these modules:
– A Firefox (http://www.firefox.com/) extension acts as a context sniffer and a notifier. For sniffing, it hooks into the browser's events related to opening, closing and switching web pages, and transmits these events with the corresponding URLs to the local Context Aggregator for processing. At the end of the flow, the recommended social updates are displayed in the side-bar of the browser.
– The Context Aggregator handles local events with their attached contextual information, and runs weighting functions on this information to produce an aggregated (and thus normalized) contextual tag cloud for the Contextual Filter, using the aggr() function defined in the previous section. The weighting functions are implemented as five interpreters that turn URLs into contextual clouds (i.e. weighted term vectors). The Metadata interpreter parses the title, description and keywords elements from the HTML source code of each web page to produce the corresponding weighted terms, with the following parameter values: α = 50 per term appearing in the title, β = 10 in the keywords field, and γ = 1 in the description field. The SearchQuery interpreter extracts query terms from Google Search (http://www.google.com/) result pages. The SocialBookmarks interpreter gathers tags given by users about a web page, when existing on the Delicious social bookmarking service. The SemanticAnalyzer gathers textual representations of semantic entities that were identified in the web page, thanks to the SemanticProxy web service (http://semanticproxy.opencalais.com/).
– The Contextual Filter receives contextual clouds gathered and interpreted by users' aggregators, computes relevance scores between them using the relevance() function, and recommends best-ranked social updates to each user (through their notifier). Social updates are gathered by subscription to the users' declared third-party social feeds/streams (e.g. their Twitter account).
This software ecosystem is functional and gives a good sense of the benefits of our approach. In the next section, we present an evaluation of the underlying framework.
5 Evaluation
In order to evaluate the validity of our hypotheses on the relevance of contextually recommended social updates, we gathered browsing logs and social updates from 8 volunteers during one week, ran our algorithms on these logs to generate 1846 contextual clouds (every 10 minutes), and asked the volunteers to rank the quality of a selection of social updates. In this section, we define the experimentation plan we followed, explain its setup, then discuss the results obtained.
5.1 Experimentation Plan
The evaluation of our hypotheses relies on two measures: (i) the relevance of social updates with the context of their author at the time of posting, and (ii) their relevance for other users in similar contexts. As the quality of recommendations is to be evaluated by users with their own browsing behavior, implied contexts, and own social updates, we did not rely on existing evaluation data sets such as the ones from TREC, nor follow a scenario-based experiment.
During one week, volunteers browsed web pages using Firefox and produced social updates (i.e. shared statuses and bookmarks), while the provided sniffing extension was logging the required browsing events to a local database. At the end of this period, they were asked to review these log entries, so that they could remove privacy-critical entries when needed (e.g. private activities, and other noisy data that is irrelevant to this study), and then send us their log. Afterwards, we ran our algorithms on the browsing logs and social updates provided by volunteers, to produce personalized survey forms containing ranked recommendations for each volunteer. We asked each volunteer to fill in two personalized surveys. In the first survey, we asked volunteers to rate the perceived relevance of three random social updates with five contextual clouds generated from their own web browsing data. For each context, only one of the proposed social updates was actually a well-ranked match. In order to support them in remembering those contexts, we provided the list of web pages that were being browsed by the volunteer at that time. In the second survey, volunteers rated the relevance of their own social updates with their contextual tag cloud at the time of posting.
5.2 Experimental Setup and Process
In this section we provide the process and parameters that we set to generate these personalized surveys from the logs provided by volunteers. Because the experiment was not interactive, we indexed contextual clouds and social updates on a common time line with a period of 10 minutes. Contextual clouds are generated from the list of URLs involved in a web browsing event, i.e. when the page was opened, selected or closed. Indexing a social update consists of associating it with the contextual cloud of the last context snapshot at the time of posting this update. If there is no known context information in the previous snapshot, we use the one before the previous. Every indexed contextual cloud is processed to split multiple-word tags, cleaned from punctuation and other non-literal characters, filtered against a stop-words list, and then normalized so that the sum of its tags' weights equals 1. Only the first 20 tags (with highest weights) are displayed to volunteers. As shown in Figure 2, a contextual tag cloud can contain diverse kinds of terms, such as words in various languages, word combinations and acronyms. Then, we ran the recommendation algorithm on the contextual and social indexes in order to produce a relevance matrix for each participant. In order to generate a participant's personalized survey, we selected 5 heterogeneous contexts (i.e. the most dissimilar to each other) that were matched (by the recommender) with at least one highly-ranked social update. The second survey was simply generated by correlating users' social updates with their corresponding context.
5.3 Results
As stated above, the results are twofold: we gathered scores given by every participant on (i) the relevance of social updates with the context of their posting, and (ii) the relevance of social updates for other people with similar contexts.
Relevance of contextualized social updates: In order to measure the consistency of contextual clouds as reference documents for recommending social updates, we asked the participants to rate the relevance of each of their own social updates (e.g. their tweets and bookmarks) to the contextual cloud representing their situation at the time of posting/sharing. Over a total of 59 social updates, their authors rated an average relevance to context of 50.3%. The following distribution of ratings was observed: 19 social updates were rated 1 (low relevance), 10 were rated 2, 14 were rated 3, and 16 were rated 4 (high relevance). These social updates were gathered from several social streams: 54% are status updates posted on Twitter and 29% are bookmarks instantly shared through Delicious. By further analyzing these specific types of social streams, we found an average relevance score of 71% for shared bookmarks and 38% for Twitter status updates. It is natural that new bookmarks are more relevant to their context, as the web document being bookmarked is usually being browsed by the user and is thus represented in the corresponding contextual cloud. Concerning status updates, Naaman et al. [14] showed that only 41% of social updates from Twitter are actual statuses about the current activity of the person (categorized as "me now" by the authors). The similarity between this proportion and our average contextual relevance score for status updates provides some preliminary evidence of the consistency of our results.

Relevance of recommendations: As explained in the previous section, the social updates proposed to users are deliberately not all relevant. Our goal is to observe a correlation between the relevance scores given by participants and the rankings computed by the system. Thus, we rely on a Mean Percentage Error (based on the MAE, Mean Absolute Error) to define the following accuracy function:

accuracy = 1 - \frac{1}{Q} \sum_{q=1}^{Q} \left| relevance(C_q, U_q) - rating(C_q, U_q) \right|
in which, for each proposed social update q, relevance(C_q, U_q) is the relevance score of the social update U_q with respect to the contextual tag cloud C_q, as evaluated by the ranking algorithm, and rating(C_q, U_q) is the actual relevance score, as given by the volunteer. Both scores are values in the range [0, 1], expressed as percentages. As the rating() scores are given by volunteers on a [1, 4] grade scale, they are converted to percentages with the following formula:
rating = \frac{grade - 1}{3}
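Read together, the two formulas amount to the small computation sketched below in Python. The averaging over the Q proposed updates follows from the term "Mean Percentage Error"; the function names are ours and purely illustrative.

def to_percent(grade):
    # convert a volunteer grade in [1, 4] to a relevance score in [0, 1]
    return (grade - 1) / 3.0

def accuracy(pairs):
    # pairs: list of (system_relevance, volunteer_grade) tuples,
    # with system_relevance already in [0, 1] and grade in [1, 4]
    errors = [abs(relevance - to_percent(grade)) for relevance, grade in pairs]
    return 1.0 - sum(errors) / len(errors)

# e.g. accuracy([(0.8, 4), (0.2, 1), (0.5, 3)]) returns a value in [0, 1]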
We observed an average accuracy value of 72%. As is natural for recommender systems, the best-ranked ratings (mostly Rank 3) are slightly overestimated by the recommendation algorithm, whereas the low relevance ratings (Rank 1) given by participants are higher than expected. Of the relevance scores predicted by the recommendation system, 63% are low (Rank 1), whereas 19% are medium-high (Rank 3). The high
number of low-ranked scores and the medium ranking of the better scores predicted by the algorithm reveal that highly similar contextual clouds were rare in our small-scale experiment. By increasing the number of participants, more similar contexts would be found, and thus the average scores would naturally increase.
6 Conclusion
In this paper, we proposed a theoretical framework, a privacy-aware implementation and its evaluation to rank social updates by contextual relevance, in order to reduce information overload. Through the analysis of experimental results, we evaluated a combined weighting scheme based on social and meta-descriptions of the web pages being accessed by the user as a contextual criterion for recommending relevant social updates. This study explores the potential of our novel recommendation approach based on contextual clouds. Despite the small scale of this preliminary experiment, our results are promising. The average accuracy of recommended social updates (72%) is significant for a web recommender system. We observed that the relevance perceived by users increases when social updates reflect the current activity of their authors. In order to improve the performance of our system, we intend:
– to improve the quality of context with emergent semantics of tags [16];
– to broaden the range of context by developing additional context sniffers, including documents and physical context information from mobile devices;
– to find more precise relevance factors between specific types of social updates and contextual properties, after having carried out a larger-scale experiment;
– and to improve the scalability of the system when used simultaneously by numerous users (currently O(n^2) complexity), e.g. using tag clustering.
References
1. Agosto, L.: Optimisation d'un Réseau Social d'Échange d'Information par Recommandation de Mise en Relation. Ph.D. thesis, Université de Savoie, France (2005)
2. Bauer, T., Leake, D.B.: Real time user context modeling for information retrieval agents. In: CIKM 2001: Proceedings of the Tenth International Conference on Information and Knowledge Management, pp. 568–570. ACM, New York (2001)
3. Bielenberg, K., Zacher, M.: Groups in social software: Utilizing tagging to integrate individual contexts for social navigation. Master of Science in Digital Media thesis, University of Bremen, Bremen, Germany, p. 120 (2005)
4. Budzik, J., Fu, X., Hammond, K.: Facilitating opportunistic communication by tracking the documents people use. In: Proc. of Int. Workshop on Awareness and the WWW, ACM Conference on CSCW 2000, Philadelphia, Citeseer (2000)
5. Dey, A.K.: Providing Architectural Support for Building Context-Aware Applications. Ph.D. thesis, Georgia Institute of Technology (2000)
6. Dragunov, A., Dietterich, T., Johnsrude, K., McLaughlin, M., Li, L., Herlocker, J.: TaskTracer: a desktop environment to support multi-tasking knowledge workers. In: Proceedings of the 10th International Conference on Intelligent User Interfaces, pp. 75–82. ACM, New York (2005)
7. Hotho, A., Jäschke, R., Schmitz, C., Stumme, G.: Information retrieval in folksonomies: Search and ranking. In: The Semantic Web: Research and Applications, pp. 411–426 (2006)
8. Joly, A.: Workspace Awareness without Overload: Contextual Filtering of Social Interactions. In: Smart Offices and Other Workspaces, Workshop of the Intelligent Environments 2009 Conference, Ambient Intelligence and Smart Environments, pp. 297–304. IOS Press, Amsterdam (2009)
9. Joly, A., Maret, P., Daigremont, J.: Context-Awareness, the Missing Block of Social Networking. International Journal of Computer Science and Applications 4(2) (2009), Special Issue on Networking Mobile Virtual Knowledge
10. Kleek, M.V., Karger, D.R., Schraefel, M.C.: Watching through the web: Building personal activity and Context-Aware interfaces using web activity streams. In: Proceedings of the Workshop on Understanding the User - Logging and Interpreting User Interactions in Information Search and Retrieval (UIIR-2009), in Conjunction with SIGIR-2009, Boston, MA, USA (2009)
11. Marlow, C., Naaman, M., Boyd, D., Davis, M.: Position paper, tagging, taxonomy, flickr, article, toread. In: Collaborative Web Tagging Workshop at WWW 2006, pp. 31–40 (2006), http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.74.8883
12. Mathes, A.: Folksonomies - cooperative classification and communication through shared metadata. Computer Mediated Communication (2004)
13. Naaman, M., Nair, R.: ZoneTag's collaborative tag suggestions: What is this person doing in my phone? IEEE Multimedia 15(3), 34–40 (2008)
14. Naaman, M., Boase, J., Lai, C.: Is it really about me? Message content in social awareness streams. In: Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work, pp. 189–192. ACM Press, Savannah (2010)
15. Niwa, S., Doi, T., Honiden, S.: Web page recommender system based on folksonomy mining for ITNG 2006 submissions. In: Third International Conference on Information Technology: New Generations, ITNG 2006, pp. 388–393 (2006)
16. Rattenbury, T., Good, N., Naaman, M.: Towards automatic extraction of event and place semantics from Flickr tags. In: SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 103–110. ACM, New York (2007)
17. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York (1986)
18. Vyas, D., Van De Watering, M., Eliëns, A., Van Der Veer, G.: Engineering Social Awareness in Work Environments. In: Stephanidis, C. (ed.) UAHCI 2007 (Part II). LNCS, vol. 4555, pp. 254–263. Springer, Heidelberg (2007)
Semantic Web Portal: A Platform for Better Browsing and Visualizing Semantic Data Ying Ding1, Yuyin Sun1, Bin Chen2, Katy Borner1, Li Ding3, David Wild2, Melanie Wu2, Dominic DiFranzo3, Alvaro Graves Fuenzalida3, Daifeng Li1, Stasa Milojevic1, ShanShan Chen1, Madhuvanthi Sankaranarayanan2, and Ioan Toma4 1
School of Library and Information Science, Indiana University 2 School of Computing and Informatics, Indiana University 47405 Bloomington, IN, USA 3 Tetherless World Constellation, Rensselaer Polytechnic Institute, NY, USA 4 School of Computer Science, University of Innsbruck, Austria {dingying,yuysun,binchen,katy,djwild,daifeng,madhu, yyqing,chenshan}@indiana.edu, {dingl,agraves,difrad}@cs.rpi.edu, {ioan.toma}@uibk.ac.at
Abstract. One of the main shortcomings of Semantic Web technologies is that there are few user-friendly ways for displaying, browsing and querying semantic data. In fact, the lack of effective interfaces for end users significantly hinders further adoption of the Semantic Web. In this paper, we propose the Semantic Web Portal (SWP) as a light-weight platform that unifies off-the-shelf Semantic Web tools helping domain users organize, browse and visualize relevant semantic data in a meaningful manner. The proposed SWP has been demonstrated, tested and evaluated in several different use cases, such as a middle-sized research group portal, a government dataset catalog portal, a patient health center portal and a Linked Open Data portal for bio-chemical data. SWP can be easily deployed into any middle-sized domain and is also useful to display and visualize Linked Open Data bubbles. Keywords: Semantic Web data, browsing, visualization.
1 Introduction
The current Web is experiencing tremendous changes with respect to its intended functions of connecting information, people and knowledge. It is also facing severe challenges in assisting data integration and aiding knowledge discovery. Among a number of important efforts to develop the Web to its fullest potential, the Semantic Web is central to enhancing human/machine interaction through the representation of data in a machine-readable manner, allowing for better mediation of data and services [1]. The Linked Open Data (LOD) initiative, led by the W3C SWEO Community Project, is representative of these efforts to interlink data and knowledge using a semantic approach. The Semantic Web community is particularly excited about LOD, as it marks a critical step needed to move the document Web to a data Web, toward enabling powerful data and service mashups to realize the Semantic Web vision.
The Semantic Web is perceived to lack user-friendly interfaces to display, browse and query data. Those who are not fluent in Semantic Web technology may have difficulty rendering data in an RDF triple format. Such a perceived lack of user-friendly interfaces can hinder further adoption of necessary Semantic Web technologies. The D2R server and various SPARQL endpoints display query results in pure triple formats, for example DBPedia (e.g., displaying the resource Name: http://dbpedia.org/page/Name) and Chem2Bio2RDF (e.g., displaying the SPARQL query result on "thymidine" as http://chem2bio2rdf.org:2020/snorql/?describe=http%3A%2F%2Fchem2bio2rdf.org%3A2020%2Fresource%2FBindingDBLigand%2F1); such displays, however, are not intuitive or user-friendly. Enabling user-friendly data display, browsing and querying is essential for the success of the Semantic Web. In this paper, we propose a lightweight Semantic Web Portal (SWP) platform to help users, including those unfamiliar with Semantic Web technology, efficiently publish and display their semantic data. This approach generates navigable faceted interfaces allowing users to browse and visualize RDF triples meaningfully. SWP is aligned with similar NIH-funded efforts in the medical domain in the USA toward facilitating social networking for scientists and easy sharing of medical resources. The main architecture of the SWP is based upon Longwell (http://simile.mit.edu/wiki/Longwell_User_Guide) and the Exhibit widget (http://simile-widgets.org/exhibit/) from MIT's SIMILE project (http://simile.mit.edu/). We further extend the system by adding a Dynamic SPARQL Query module, a Customized Exhibit View module, a Semantic Search module and a SPARQL Query Builder module to enhance the functionality and portability of the system. This paper is organized as follows: Section 2 discusses related work; Section 3 introduces the SWP infrastructure; Section 4 discusses and exemplifies the portal ontology; Section 5 demonstrates four use cases for deploying SWP; Section 6 evaluates and compares SWP to related systems; and Section 7 presents future work.
2 Related Work
Research on Semantic Web portals began fairly early, in the early 2000s. A number of Semantic Web portal designs and implementations were published in the research literature, such as SEAL (SEmantic portAL) [2] and the Semantic Community Portal [3]. Lausen et al. [4] provided an extensive survey of a selection of Semantic Web portals published before 2005. Many research groups currently maintain their group portals using Semantic Web technologies. For example, Mindswap.org was deployed as "the first OWL-powered Semantic Web site" [5], and Semantic MediaWiki [6] has been used to power several groups' portals, such as those of the Institute of Applied Informatics and Formal Description Methods (AIFB, aifb.kit.edu) and the Tetherless World Constellation (tw.rpi.edu). Meanwhile, there are many domain-specific Semantic Web portals coming from winners of the "Semantic Web Challenge" [7], including CS AKTive Space [8], Museum Finland [9], the Multimedia E-Culture demonstrator [10], HealthFinland [11] and TrialX [12]. While these Semantic Web portals are nicely crafted, most of them are too complicated to be replicated by non-specialists. Visualizations are one of the key components of a Semantic Web portal ([13], [14]). There are some general-purpose tools for visually presenting Semantic Web data, including
linked data browsers such as Tabulator (http://dig.csail.mit.edu/2005/ajar/ajaw/tab.html) and OpenLink Data Explorer (http://linkeddata.uriburner.com/ode), as well as data mashup tools such as Sig.ma (aggregated instance descriptions, sig.ma) and Swoogle (aggregated Semantic Web term definitions, swoogle.umbc.edu). These tools render RDF triples directly via faceted filtering and customized rendering. SIMILE's Longwell can be used to enable faceted browsing of RDF data, and Exhibit can further enable faceted visualization (e.g., map, timeline). It is notable that these tools differ from information visualization tools, which place more emphasis on rendering data in a graphical format.
3 SWP Architecture
The SWP is a lightweight portal platform to ingest, edit, display, search and visualize semantic data in a user-friendly and meaningful way. It can convert a current portal based on relational databases into a Semantic Web portal, and allows non-Semantic Web users to create a new Semantic Web portal in a reasonable period of time without professional training. Fig. 1 shows the overall architecture, which contains the following main components:
Fig. 1. SWP overall architecture
Data Ingestion (DI) Component: Its main function is to facilitate the conversion of input data in various formats into RDF triples. It provides different templates and wrappers to handle common data formats, such as text files, relational databases and Excel sheets. For example, it uses D2R MAP and offers templates to help non-Semantic Web users semi-automatically create D2R rules to convert their relational data into RDF triples.
Ontology Management (OM) Component: Its main function is to enable easy online ontology creation, editing, browsing, mapping and annotation. It is based on Vitro, developed by Cornell University [15]. Vitro provides functions similar to Protégé (http://protege.stanford.edu/), but is Web based. Vitro will be further developed and improved by the NIH-funded VIVO project.
Faceted Browsing (FB) Component: Based on Longwell, SWP mixes the flexibility of the RDF data model with faceted browsing to enable users to explore complex RDF triples in a user-friendly and meaningful manner. This faceted browser can be multi-filtered: for example, for a research group portal, users can browse either all
the existing presentations by one research group or only those within one specific year AND at a specific location; for a health center portal, a doctor can find the number of patients who have diabetes AND live in Monroe County, Indiana.
Semantic Visualization (SV) Component: It is based on Exhibit, developed by the MIT SIMILE project, and on Network Workbench, by the Cyberinfrastructure for Network Science Center at Indiana University ([16], [17], [18]). It displays or visualizes RDF data in tile, timeline, Google map and table formats. It also enables faceted visualization, so that users can visualize all of the research group members, or only those group members who share common research interests.
Semantic Search (SS) Component: It enables a type-based search that can categorize federated RDF triples into different groups based on ontologies. It is based on Lucene (http://lucene.apache.org/) and is integrated with the pre-defined portal ontologies to provide type-based searches. For example, if users key in "semantic web" as a search query to SWP, they will receive RDF resources that contain the string "semantic web," and these resources are further categorized as person, project, publication, presentation, and event. Subclasses of the Person group can be further categorized into Academic, Staff or Student.
SWP acts as a stand-alone Semantic Web portal platform that can be deployed in any domain or application to input, output, display, visualize and search semantic data. Currently, it has been deployed to: (1) a middle-sized research group, to semantically manage topics of people, papers, grants, projects, presentations and research; (2) a specialty Linked Open Data chem2bio2rdf dataset, to display the relationships and associations among gene, drug, medicine and pathway data; (3) an eGov dataset, to facilitate faceted browsing of governmental data; and (4) a health center, to enable federated patient, disease, medication and family-tie data to be grouped, associated and networked. For more details, please see Section 5.
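As an illustration of the kind of output the Data Ingestion component produces, the sketch below converts one row of a hypothetical relational "person" table into RDF triples that reuse the FOAF vocabulary. It uses the Python rdflib library rather than the D2R MAP rules that SWP actually relies on, and the base URI and column names are placeholders.

from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import FOAF

BASE = Namespace("http://example.org/portal/")   # placeholder namespace

def person_row_to_rdf(row):
    # row is one record from a relational "person" table
    g = Graph()
    person = URIRef(BASE["person/" + str(row["id"])])
    g.add((person, RDF.type, FOAF.Person))
    g.add((person, FOAF.name, Literal(row["name"])))
    g.add((person, FOAF.mbox, URIRef("mailto:" + row["email"])))
    return g

g = person_row_to_rdf({"id": 1, "name": "Jane Doe", "email": "jane@example.org"})
print(g.serialize(format="turtle"))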
4 Portal Ontology
Deploying SWP is domain specific. The user needs to create one or more portal ontologies to convert current relational databases into RDF triples. Creating an appropriate ontology is therefore a critical part of SWP. It should facilitate user queries, and meaningfully display and visualize RDF data. There are some generic requirements for creating ontologies for SWP: 1) the ontology should reflect the database schema of its original datasets; 2) the main concepts and relationships identified from commonly used user queries should be included in the ontologies; 3) to enable interoperability, the portal ontologies should try to reuse existing popular ontologies, such as FOAF (http://en.wikipedia.org/wiki/FOAF_%28software%29) to represent people, DOAP (http://en.wikipedia.org/wiki/Description_of_a_Project) to represent projects, the Bibliographic Ontology (http://bibliontology.com/) to represent publications and SIOC (http://sioc-project.org/) to represent online communities; and 4) the ontologies should obey the Linked Open Data (LOD) rules (http://www.w3.org/DesignIssues/LinkedData.html): use HTTP URIs for naming items, make URIs dereferenceable and reuse URIs from other Linked Open Data as much as possible to facilitate easy mapping. Here we use the Information Networking Ontology Group (INOG) to demonstrate the principle of creating an ontology for research networking of people and sharing medical resources. Part of this ontology group has been implemented in the Research
Group Portal use case in Section 5. INOG is one of the efforts funded by NIH and led by the University of Florida [19] and Harvard University [20]. It aims to create modularized ontologies to enable a semantic "facebook" for medical scientists to network and share lab resources. The overall INOG framework is shown in Fig. 2. The core of the framework is the INOG, comprising the VIVO ontology (modeling research networking) and the Eagle-I ontology (modeling medical resources). These two ontologies share some common URIs, map other related URIs, and are aligned with popular ontologies such as FOAF, SIOC, DOAP and BIBO. This enables us to link our data with some existing Linked Open Data sets, such as FOAF, DBPedia and DBLP. Also, in order to model the expertise of scientists and categorize medical resources, we use existing domain ontologies such as MeSH (http://www.ncbi.nlm.nih.gov/mesh), SNOMED (http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html), the Biomedical Resource Ontology (http://bioportal.bioontology.org/visualize/43000) and the Ontology for Biomedical Investigations (http://obi-ontology.org/page/Main_Page) to provide categories or controlled vocabularies.
Fig. 2. Information Networking Ontology Group framework
5 Use Cases
In this section, we demonstrate that SWP can be easily deployed to different domains to create various Semantic Web portals.
Research Group Portal
Research group portals are among the most common portals used in academic settings. Professors need to manage their research labs, groups or centers efficiently in order to conduct, disseminate and promote their research. Traditional research group websites are normally not easy to maintain, browse and search, especially when the size of the group reaches a certain level. The following use case is based on a mid-size research group, the Information Visualization Lab (IVL) in the School of Library and Information Science at Indiana University Bloomington (http://ivl.slis.indiana.edu/). There are approximately 30 group members, consisting of
one professor, several senior research staff and programmers, PhD and master's students, and hourly students. It has, at any point in time, around ten externally funded projects, mostly from NIH and NSF. The major activities and datasets for this research group are people, papers, courses, presentations, events, datasets, software, hardware and funding. Previously, all data had been stored in a relational database (e.g., PostgreSQL) with about 20 main tables and more than 50 bridge tables to inter-connect different datasets. One of the major bottlenecks is that it is not simple to harvest all items relating to one entity. For example, it is very difficult to group all information about one group member. Users have to go to the publication page to get information on publications, the presentation page to get information on presentations and the research page to get information on projects. This harvesting limitation also generates problems for maintaining and updating the data.
Fig. 3. List view of SWP
Fig. 4. Graph view of SWP
Fig. 5. Screenshots of SWP’s semantic visualization
Using SWP, we created a machine-readable semantic version of this research group portal (http://vivo-onto.slis.indiana.edu/ivl/). We used D2R to convert around 70 relational tables into RDF triples based on the VIVO ontology version 0.9. This portal enables faceted browsing and semantic visualization. For example, by clicking People, users see the list view of federated information for each group member, including his or her publications, presentations, research interests and projects. Using the faceted browser, users can further narrow down their searches: among all the group members, SWP can display only those who are interested in the Network Workbench Tool research topic. The default view is the List view (see Fig. 3), and the Graph view provides a basic graph overlay of the RDF triples and highlights some nodes with labels (see Fig. 4). The Exhibit view contains several view formats, such as tile, timeline, map and table views (see Fig. 5). Tile view groups entities based on multiple criteria, such as grouping presentations first by year, then by presenter's last name. Timeline view shows timelines of grouped entities, such as presentations in different time slots. Table view displays entities in table format. Map view uses Google Maps to show grouped entities based on locations. All of these views enable faceted visualization so that users, for example, can view presentations in 2005 AND in Indianapolis. The current semantic search function is very limited: Longwell only provides Lucene text search. Since the People page groups all the related information about one person together, by going to the People page and searching "network," users can locate people who are interested in the "Network Workbench Tool" or who published papers in the "Network Science" conference.
Fig. 6. Screenshots of the Health Center Portal Fig. 7. Screenshots of eGov Portal
Health Center Portal
Indiana University (IU) Health Center (http://healthcenter.indiana.edu/index2.html) provides comprehensive health services to meet the medical and psychological needs of students, spouses and dependents. It serves more than 40,000 potential patients around campus, and each patient can access his or her information online. Doctors and medical staff can pull out the related information about a group of patients from this portal for diagnosis and analysis purposes. It currently uses a relational database and is powered by workflow.com enterprise solutions. IU Health Center data are stored in more than 100 tables and contain information such as person, insurance, medication, clinical document, surgery, immunization, allergies and family ties. We deployed SWP to IU Health Center and created an easy-to-use Semantic Web portal (see Fig. 6). As it is useful for doctors and staff to look at the overall
information in one place, this portal groups together all information related to one patient, such as medication, diagnosis, doctor, disease, location and time factors. The faceted browser allows users to select different criteria by which to view data. For example, the right side of Fig. 6 shows the geographical distribution of H1N1 flu patients in the Bloomington area. Doctors can further narrow down the geo maps by selecting different time periods or patient statuses.
eGov Portal
eGov's current initiative of adopting Semantic Web technology makes it essential to convert governmental data into RDF triples and to provide meaningful browsing and searching support. In this example, we use Ozone and Visibility data from the EPA's Castnet project (http://www.epa.gov/castnet/) and convert them into RDF triples. The problem here is that while these datasets have data on Ozone and Visibility for each of the Castnet sites, they do not have data on where these sites are located. Using a second dataset from the EPA's site (http://www.epa.gov) that has data on the location of each Castnet site, we created the Web application seen in Fig. 7. On the left side of Fig. 7, each yellow dot represents a single Castnet site and the size of the dot corresponds to the average Ozone reading for that site. Users can apply filters to narrow down the set of Castnet sites. When a Castnet site is clicked, a small pop-up opens that displays more information on that site and provides a Web link which takes users to another page. The right side of Fig. 7 displays a timeline of all the Ozone and Visibility data available for that site, based on the Google Visualization API.
Chem2bio2rdf Portal/Linked Open Data Portal
This use case demonstrates the potential of using SWP to provide better browsing and searching support for some of the LOD bubbles. A systems chemical biology network called chem2bio2rdf has been created by integrating bio2rdf and Linking Open Drug Data (LODD) to allow links between compounds, protein targets, genes and diseases. The chem2bio2rdf contains 18 datasets in the domain of systems chemical biology, grouped into six categories: chemical (PubChem, ChEBI), chemogenomics (KEGG Ligand, CTD chemical, BindingDB, Matador, PubChem BioAssay, QSAR, TTD, DrugBank), biological (UniProt), systems (KEGG, Reactome, PPI, DIP), phenotype (OMIM, CTD disease, SIDER) and literature (PubMed). The result is a SPARQL endpoint to support RDF queries (http://chem2bio2rdf.org) and a user-friendly SWP at http://chem2bio2rdf.org/exhibit/drugbank.html.
6 Evaluation
To evaluate SWP's usability, we conducted a user evaluation with 14 users. The survey results indicate that Semantic Web technology provides better integrated information, with positive feedback from 78% of our users. As for the faceted browser, more than 57% of the users agreed that this function shortens the time required to find the desired information. Users were also very positive about the visualization functions of SWP. Among the six available visualization methods, map view received the highest aggregate satisfaction score, while graph view received the lowest. The survey did reveal limitations to user satisfaction with SWP: some users
felt that too much information is integrated, and the predefined filtering conditions of the faceted-browsing function need refinement. Users also suggested that visualization views should be based on the data type, potential user needs, user system configuration and final output; currently these views did not match their expectations. Another evaluation approach is a straightforward comparison of portals with and without SWP, where we take the afore-mentioned Research Group Portal and chem2bio2rdf Portal as examples. The Research Group Portal comparison demonstrates that the SWP version provides several value-added features (e.g., federating related information about one entity in one place) that the non-SWP version lacks. The second comparison, on the chem2bio2rdf Portal, shows that SWP can provide more user-friendly browsing support for Linked Open Data bubbles than normal SPARQL endpoints (see Fig. 8).
Fig. 8. Normal LOD display vs. SWP LOD display
Nine related systems have been identified herein: Disco (http://www4.wiwiss.fu-berlin.de/bizer/ng4j/disco/), Marbles (http://marbles.sourceforge.net/), Zitgist (http://zitgist.com/), Dipper (http://api.talis.com/stores/iand-dev1/items/dipper.html), mSpace (http://mspace.fm/), jSpace (http://www.clarkparsia.com/jspace/), Sig.ma (http://sig.ma), Exhibit (http://www.simile-widgets.org/exhibit/) and Tabulator (http://www.w3.org/2005/ajar/tab). We compare SWP with these nine systems (see Table 1); the major function of all of them is to display RDF triples. Except for Dipper and mSpace, these systems only display RDF triples as plain property-value pairs. mSpace provides an RSS-news-style display with headings, pictures and content. Dipper displays RDF triples as plain property-value pairs and provides further categorization of these RDF triples. Sig.ma allows users to provide feedback on each triple by either accepting or rejecting it. Disco and Marbles only display RDF triples based on the input URI, while the others have their own data sources and ontologies. Sig.ma has the largest data source compared to the others, and also mashes up data from other APIs. Exhibit and Tabulator
both provide different view types to render the data, such as table view, map view and timeline view. Only mSpace, jSpace and Exhibit provide faceted browsers. In mSpace and jSpace, users can add or delete facets based on their own needs. None of the systems, however, provides semantic search and visualization. Marbles, Zitgist and Tabulator trace data provenance by adding the data source from which each RDF triple is derived. Sig.ma provides data provenance by allowing users to indicate their trust in these data sources. Only jSpace provides a user-friendly SPARQL template based on user-selected paths. Tabulator uses the selected data to generate a SPARQL query. These comparisons suggest that SWP can be enhanced by adding provenance to RDF triples (e.g., as in Sig.ma), improving the SPARQL query builder (e.g., as in jSpace) and providing more output formats (e.g., as in Dipper).
7 Conclusion and Future Work
In this paper, we propose a SWP platform which enables faceted browsing, semantic visualization and semantic search functions over RDF triples. It can be deployed to any domain or application that needs to integrate, federate and share data. It has been tested in several different domains, and requires users to create their own portal ontologies. Some future improvements to this platform include:
• Dynamic SPARQL queries: Currently the MIT SIMILE toolsets (e.g., Exhibit) cannot process dynamic SPARQL queries; they can only read static JSON files. In order to make searching and browsing more interactive, we need to find a way to let Exhibit handle dynamically generated JSON files, mainly via asynchronous service requests (one possible conversion is sketched after this list).
• Online ontology management: Currently the OM component is not fully integrated from Vitro into SWP.
• Data ingestion: Currently, SWP only has a read function for RDF triples, to display them in different ways. To implement a write function for SWP, data has to be converted separately to become the input of SWP. Also, there is no user-friendly way to let end users add, delete and update their instance data. Vitro provides some good examples for addressing this issue, but the integration of Vitro and SWP has to be investigated.
• Semantic visualization: Currently the semantic visualization of SWP is very limited, with only naïve displays of RDF graphs and labeled nodes. Network analysis is not yet implemented. Future work will focus on visualizing the network and identifying paths of the network that are associated with user queries.
• Semantic search: Currently SWP uses Lucene indexing, and the type-based search is very limited. We need to identify a better way to integrate Vitro semantic search with SWP. Meanwhile, we are exploring the potential integration of semantic associations to discover complex relationships in semantic data. As RDF data forms semantic graphs, with nodes and links that have embedded semantics, graph mining technologies can be applied to identify and rank semantic nodes and relationships. By weighing the semantics of surrounding nodes and links, semantic associations can be calculated based on ranking the available paths of nodes [21].
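For the first item above, one possible conversion is to run the SPARQL query on demand and wrap the bindings in the item-list JSON that Exhibit reads. The sketch below only illustrates the idea: the endpoint URL is hypothetical and the output follows Exhibit's general "items" convention rather than the module actually implemented in SWP.

import json
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://example.org/sparql"   # hypothetical portal endpoint

def sparql_to_exhibit(query):
    # run a SELECT query and wrap the bindings as an Exhibit-style item list
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    items = [{var: cell["value"] for var, cell in binding.items()}
             for binding in results["results"]["bindings"]]
    return json.dumps({"items": items})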
Table 1. Comparison of SWP with related systems (criteria: major functions, display of RDF triples, own data and ontology, faceted browser, semantic search, visualization, provenance, and user-friendly SPARQL template; the per-criterion entries are summarized in the text above). Major functions of the compared systems:
Disco: displays the RDF triples contained in a given URI.
Marbles: displays the RDF triples contained in a given URI; provides three views (full, summary and photo).
Zitgist: provides a Data Viewer and a Query Builder for RDF triples.
Dipper: displays the RDF triples in a given URI; categorizes properties into several pre-defined classes; exports the output data in different formats (JSON, RDF/XML, Turtle, N-Triples).
mSpace: views data with a faceted browser.
jSpace: displays RDF triples; provides three views (data, web and social network); user-friendly SPARQL builder through user-selected paths.
Sig.ma: displays RDF triples gathered from crawled sources or other APIs; users can accept or reject resources for their own purposes.
Exhibit: displays RDF triples in different views, including tabular, timeline, map and tile views.
Tabulator: browses RDF data and selects part of it for display in different view types, such as table, map, calendar, timeline and SPARQL template.
SWP: browses RDF data in different view types (list, graph, map, timeline, table); provides a user-friendly SPARQL query builder and semantic search.
This paper addresses the lack of user-friendly display and browsing support for semantic data. The Semantic Web is moving successfully from theory development to real data gathering and application building. It is now important to provide user-friendly methods that allow ordinary users to appreciate the value of semantic data and Semantic Web technologies. This paper confirms that SWP can make the Semantic Web meaningful to both Semantic Web specialists and the general public. SWP can be easily deployed into any middle-sized domain, and is also useful for displaying and visualizing Linked Open Data bubbles.
Acknowledgments. This work is funded by the NIH VIVO Project (UF09179).
References
1. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American (2001)
2. Maedche, A., Staab, S., Stojanovic, N., Studer, R., Sure, Y.: SEAL - A Framework for Developing SEmantic Web PortALs. In: Proceedings of the 18th British National Conference on Databases: Advances in Databases, pp. 1–22 (2001)
3. Staab, S., Angele, J., Decker, S., Erdmann, M., Hotho, A., Maedche, A., Schnurr, H., Studer, R., Sure, Y.: Semantic community Web portals. Comput. Netw., 473–491 (2006)
4. Lausen, H., Ding, Y., Stollberg, M., Fensel, D., Hernandez, R., Han, S.: Semantic web portals: state-of-the-art survey. Journal of Knowledge Management 9(5), 40–49 (2005)
5. Maryland Information and Network Dynamics Lab Semantic Web Agents Project (2004), http://www.mindswap.org/first.shtml
6. Krötzsch, M.: Semantic MediaWiki (2010), http://semantic-mediawiki.org/wiki/Semantic_MediaWiki
7. Bizer, C., Maynard, D.: Semantic Web Challenge (2010), http://challenge.semanticweb.org/
8. Schraefel, M.C., Shadbolt, N.R., Gibbins, N., Glaser, H., Harris, S.: CS AKTive Space: Representing Computer Science in the Semantic Web. In: Proceedings of the 13th International Conference on World Wide Web, pp. 384–392. ACM Press, New York (2004)
9. Hyvönen, E., Junnila, M., Kettula, S., Mäkelä, E., Saarela, S., Salminen, M., Syreeni, A., Valo, A., Viljanen, K.: Publishing Museum Collections on the Semantic Web: The MuseumFinland Portal. In: Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, pp. 418–419. ACM Press, New York (2004)
10. MultimediaN N9C Eculture project: Multimedia E-Culture demonstrator, http://e-culture.multimedian.nl/index.shtml
11. Suominen, O., Hyvönen, E., Viljanen, K., Hukka, E.: HealthFinland - A National Semantic Publishing Network and Portal for Health Information. Web Semantics: Science, Services and Agents on the World Wide Web 7(4), 287–297 (2009)
12. Applied Informatics, Inc.: TrialX: Enabling Patients to Find New Treatments (2010), http://trialx.com/
13. Cole, W.G., Shererta, D.D., Hsu, G., Fagan, L.M., Carlson, R.W.: Semantic Visualization of Oncology Knowledge Sources. In: Proc. Annu. Symp. Comput. Appl. Med. Care, pp. 67–71 (1995)
14. Padgett, T., Maniquis, A., Hoffman, M., Miller, W., Lautenschlager, J.: A Semantic Visualization Tool for Knowledge Discovery and Exploration in a Collaborative Environment, https://analysis.mitre.org/proceedings/Final_Papers_Files/171_Camera_Ready_Paper.pdf
460
Y. Ding et al.
15. Devare, M., Corson-Rikert, J., Caruso, B., Lowe, B., Chiang, K., McCue, J.: VIVO: Connecting People, Creating a Virtual Life Sciences Community. D-Lib Magazine 13(7/8) (2007), http://www.dlib.org/dlib/july07/devare/07devare.html
16. Bruce, H., Huang, W., Penumarthy, S., Börner, K.: Designing Highly Flexible and Usable Cyberinfrastructures for Convergence. In: Bainbridge, W.S., Roco, M.C. (eds.) Progress in Convergence - Technologies for Human Wellbeing, vol. 1093, pp. 161–179. Annals of the New York Academy of Sciences, Boston, MA (2007)
17. Neirynck, T., Börner, K.: Representing, Analyzing, and Visualizing Scholarly Data in Support of Research Management. In: Proceedings of the 11th Annual Information Visualization International Conference, pp. 124–129. IEEE Computer Society Conference Publishing Services, Zürich, Switzerland (2007)
18. NWB Team: Network Workbench Tool, http://nwb.slis.indiana.edu
19. Conlon, M.W.: VIVO: Enabling National Networking of Scientists. University of Florida, Cornell University, Indiana University, Washington University in St. Louis, Ponce School of Medicine, Weill Cornell Medical College, The Scripps Research Institute: NIH/NCRR, 1U24RR029822-01 (2009)
20. Nadler, M.D., Marshall, L.: Networking Research Resources Across America (Eagle-I project). Harvard Medical School, Dartmouth College, Jackson State University, Morehouse School of Medicine, Montana State University, Oregon Health and Science University, University of Alaska Fairbanks, University of Hawaii Manoa, University of Puerto Rico: NIH CTSA (2009)
21. Anyanwu, K., Maduko, A., Sheth, A.: SemRank: ranking complex relationship search results on the semantic web. In: Proceedings of the 14th International Conference on World Wide Web, pp. 117–127 (2005)
NicoScene: Video Scene Search by Keywords Based on Social Annotation Yasuyuki Tahara, Atsushi Tago, Hiroyuki Nakagawa, and Akihiko Ohsuga Graduate School of Information Systems, The University of Electro-Communications, Tokyo, Japan [email protected]
Abstract. As there is an increasing need to view a huge number of videos on the Web in a short time, video summary technology is being actively investigated. However, there is a trade-off between the cost and the precision of summaries. In this paper, we propose a system called NicoScene to search for desirable scenes in the videos provided by a video hosting service called Nico Nico Douga. We use the feature of the service by which comments can be attached to videos, and treat the comments as social annotation. Through experiments, we demonstrate the advantages of NicoScene, in particular its search precision.
1 Introduction
Huge amounts of storage and the rapidly spreading broadband Internet have enabled video hosting services such as YouTube (http://youtube.com/). According to a report [Australian 10], the TV audience is decreasing while the audience of video hosting services is increasing. In addition, people are beginning to use services such as Nico Nico Douga (http://www.nicovideo.jp/; "Nico Nico Douga" is literally translated from Japanese as "smiley video"), in which they can annotate shared videos with comments synchronized with the playing time. This is because such services can be used as highly bidirectional communication media. These services may become more important in building culture and arousing public opinion. For example, when asked in a questionnaire which political parties they support, users of these services gave answers that differed from those of the general public. However, the videos need to be viewable efficiently in a limited time, because the easy procedures for publishing videos on the Web make the number of videos increase rapidly. The means of efficient viewing can be divided into two categories: video classification and video summary. Video classification decreases the number of videos to be viewed. This approach classifies the videos by analyzing tags and contents so that users can easily find the videos they want to view. It also uses techniques such as filtering to recommend videos matching the users' preferences. The video summary approach enables users to view videos in a shorter time by summarizing
the videos. The summaries are made by classifying the scenes in the videos and picking out and playing only the scenes of some specific classes. The word "scene" here means a fragment of a video that includes some specific events. Annotation is a technique for attaching meta-information to scenes, which is then used for searching the scenes. There are two approaches to annotation: automatic methods such as video analysis, and manual annotation. The approach of attaching comments in the bidirectional media mentioned above can be considered social annotation, in which the general public annotate manually. In this paper, we focus on scene search for the video summary approach and propose a search technique using social annotation. The reason for using social annotation is to decrease the annotation costs for the huge number of videos on the Web. In addition, we demonstrate by experiments that our technique can search for scenes with high precision and sufficient recall. This paper is organized as follows. Section 2 describes Nico Nico Douga, a representative example of a bidirectional service and the target of our research. Section 3 clarifies the issues in using the comments as social annotation. Section 4 proposes the NicoScene system to address the issues. Section 5 describes the implementation of NicoScene. Section 6 describes the experiments that demonstrate the effect of our approach and discusses the results. Section 7 discusses how well our system addresses the issues described in Section 3. Section 8 compares related work with our proposal and examines the advantages of our approach. Section 9 provides some concluding remarks and future work.
2 Nico Nico Douga
We deal with videos published on Nico Nico Douga as socially annotated media. In this section, we give an overview of Nico Nico Douga and of its feature of attaching comments to videos.
2.1 Summary
Nico Nico Douga is one of the most popular video hosting services in Japan, with more than thirteen million subscribers (users cannot view videos on Nico Nico Douga without subscribing), about three million videos, and more than two billion comments. This popularity is due to its unique user interface (shown in Figure 1), in which users can attach comments to each scene of a video individually by designating the time from the beginning of the video, and the comments are superimposed on the video. This commenting feature provides a sense of togetherness to the audience and highly bidirectional communication between video uploaders and the audience. The uploaders have an incentive because they can easily obtain the audience's opinions. Nico Nico Douga is now a field of very lively and creative activities, such as videos with advanced video processing techniques and original stories. It also has a considerable impact on the research community, for example through an analysis of video reuse networks [Hamasaki 08].
Fig. 1. User interface of Nico Nico Douga: playing a video and attaching comments
2.2 Attaching Comments to Scenes
A Nico Nico Douga user can attach comments to any scene while viewing a video. Other users can read the comments, which move from right to left over the corresponding scene. The uploader of the video can use the uploader-specific commenting functionality to attach texts such as explanations of the scenes and the lyrics of songs. The types of comments vary depending on the scenes. For example, "gooru !!!" (Japanese for "goal") is attached to a goal scene of a soccer video, "ooooo" (Japanese for "ooh") to a surprising scene, and "wwwww" ("w" is the acronym of a Japanese word for "laugh") to a funny scene. On a scene in which a specific character or person appears, we sometimes see comments related to it. More comments appear on scenes with more interesting content. The comments in Nico Nico Douga have the potential to be useful video annotation because they usually include information about the contents of the scenes. However, there is so much noise in the comments that we cannot use all of them as annotation. In this paper, we propose a system that searches for scenes by estimating the degree of attention to the scenes on the basis of the number of comments and by examining the contents of the scenes on the basis of the contents of the comments.
3 Issues in Scene Search Based on Comments
3.1 Dependency on the Number of Comments during a Fixed Time
The number of viewers and comments of a video increases as the time length of the video does. However, we consider that the time when a scene appears in the
video does not affect the number or the contents of the comments. Therefore it is not desirable for the number of comments to affect the search results. For example, the search results should not differ between the case in which five hundred comments are attached to a video and the case in which one thousand comments are attached to the same video.
3.2 Dependency on Video Clipping
As Nico Nico Douga limits the time length of a video, uploaders need to clip long videos. Which part of a video they clip and upload is left to the uploaders' discretion, even when they deal with the same video. Therefore it is not desirable if the way of clipping affects the search results. For example, suppose that we have the following three videos of the same soccer match: (1) a video including both the first and the second halves, (2) one including only the first half, and (3) one including only the second half. It is not desirable if the results differ when searching for scenes of the first half in (1) and in (2) with the same query.
3.3 Representations of Attributes of Scenes
In order to search for scenes using the comments, the users need to know the words that explicitly represent the attributes of the scenes they want to find. However, the users usually do not know which comments are annotated to which scenes. Therefore we need to bridge the gap between the queries the users input and the representations of the scene attributes included in the comments.
3.4 User Interface for Easy Understanding of Relationships between Scenes
Many existing user interfaces dealing with videos display the scenes in time-sequence order. Although such interfaces are useful if the users want to view the search results on the basis of time, it is not easy to understand the relationships between scenes if the users have searched for them on the basis of their contents. In addition, if the users search for scenes across multiple videos, user interfaces based on time sequences are not an appropriate way of displaying the search results.
4 Proposed System and Approach
4.1 NicoScene: Scene Search System
We propose NicoScene (Figure 2) as a scene search system addressing the issues described in Section 3. NicoScene users can carry out the following operations.
1. Video search and identification
2. Scene search in the identified videos with queries of scene attributes
3. Search result display according to the relationships to be focused on
4. Collaborative editing of the keyword ontology specifying the scene attributes
After the users search for videos on NicoScene, they search for scenes by inputting scene attributes. The relationships to be focused on are switched by checkboxes. The ontology specifying the scene attributes is included in NicoScene and can be edited using a dedicated editor.
Fig. 2. NicoScene system
4.2 Unit Scoring Based on Comments
NicoScene treats each scene as a unit, that is, a fragment of a video whose time length is specified in advance. In order to address the issues described in 3.1 and 3.2, the system calculates a score for each unit of a video on the basis of the comments and outputs the units whose scores exceed a threshold as the scene search results. The procedure is as follows.
1. Divide a video with time length T and number of comments C into units U_i (i = 1, 2, ..., N). Each unit is a video fragment with time length t, which the user can change according to the category or type of the video (therefore T = Nt).
2. For each unit U_i, count the number c_i of all the comments (so that \sum_{i=1}^{N} c_i = C) and the number K_i of the comments that include keywords of the keyword ontology ("keyword comments" hereafter) corresponding to the scene attributes in the query.
3. Calculate the basic score S_i = c_i + αK_i, where α is a weight representing how much more important the keyword comments are than the other comments.
4. Calculate the average number of comments per unit, Ct/T, in order to cancel out differences due to the total number of comments C and the unit length t, and normalize S_i with Ct/T into S'_i:

S'_i = \frac{T}{Ct} S_i    (1)
5. Output the units U_i whose scores S'_i are larger than the threshold δ, together with their previous units U_{i-1}, as the search results.
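The five steps above can be read as the following Python sketch. It is an illustration of the scoring procedure rather than NicoScene's actual Java implementation; the layout of the comment data is assumed.

def score_units(comments, keywords, video_length, unit_length, alpha, delta):
    # comments: list of (time_in_seconds, text); keywords: ontology keywords
    # for the queried scene attribute; returns indices of matching units
    n_units = max(1, int(video_length // unit_length))
    c = [0] * n_units      # c_i: all comments per unit
    k = [0] * n_units      # K_i: keyword comments per unit
    for time, text in comments:
        i = min(int(time // unit_length), n_units - 1)
        c[i] += 1
        if any(kw in text for kw in keywords):
            k[i] += 1
    total = sum(c) or 1    # C
    hits = set()
    for i in range(n_units):
        s = c[i] + alpha * k[i]                            # S_i
        s_norm = s * video_length / (total * unit_length)  # S'_i = T*S_i / (C*t)
        if s_norm > delta:
            hits.update({i, max(i - 1, 0)})                # unit and its predecessor
    return sorted(hits)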
4.3 Collaborative Editing of Keyword Ontology
We adopt the ontology approach to address the issue described in 3.3. In order to represent the attributes of the scenes to be searched, NicoScene uses a set of keywords that frequently appear in the comments on such scenes. The keywords are organized in the keyword ontology (Table 1). A keyword ontology is a set of relations between scene attributes and keywords. This keyword ontology is created and edited by audience members who use Nico Nico Douga heavily and have knowledge about the videos and the scenes. The ontology thus becomes useful because it incorporates the knowledge of an audience that is well acquainted with the comments of Nico Nico Douga.

Table 1. Example of keyword ontology

Scene Attributes | Keywords
Goal | "goal", "here comes", "good", "ooh", "first score"
Red card | "red", "sent out", "suspension"
Rough play | "dangerous", "rough", "red card", "yellow card"
Overtake | "overtake", "OT", "ooh", "overtaking"
Taro Aso | "Aso", "prime minister", "His Excellency", "Rosen", "Miyabe"
Yayoi Takatsuki | "yayoi", "u'uh", "bean sprout", "35 yen"
... | ...
4.4 User Interface Displaying Relationships to Be Focused on
In order to address the issue described in 3.4, we provide a user interface that visualizes videos, scenes, and keywords as nodes of a SpringGraph [SpringGraph] (Figure 3). SpringGraph is a component that draws graphs automatically on the basis of a physical spring model. It makes the relationships between the nodes easy to understand. For example, in Figure 3, we can focus on video 1 by removing the node of video 2 and displaying only the relationships to be focused on. We can also focus on the keyword "goal" by removing the node of the keyword "poor" (Figure 4).
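The graph handed to the SpringGraph component can be thought of as a plain node/edge list, as in the sketch below. This is an illustration only: NicoScene's client is an Adobe AIR/Flex application, and the field names used here are assumptions.

def build_relation_graph(scene_hits):
    # scene_hits: list of dicts such as
    # {"video": "video 1", "scene": "scene 2", "keywords": ["goal"]}
    nodes, edges = set(), set()
    for hit in scene_hits:
        nodes.update({("video", hit["video"]), ("scene", hit["scene"])})
        edges.add((hit["video"], hit["scene"]))
        for kw in hit["keywords"]:
            nodes.add(("keyword", kw))
            edges.add((kw, hit["scene"]))
    return {"nodes": sorted(nodes), "edges": sorted(edges)}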
Fig. 3. Example of user interface arranging videos (rectangle nodes), keywords (round rectangle nodes), and scenes (oval nodes) in SpringGraph
Fig. 4. Left: focusing on video 1, right: focusing on the keyword “goal”
5 NicoScene Implementation
5.1 Summary
NicoScene is composed of a server-side program written in Java and a client-side component implemented using Adobe AIR [Adobe AIR]. The client-side component accepts queries for video and scene search, displays the search results, and builds and edits the keyword ontology. The server-side program accesses Nico Nico Douga according to the input search queries. It also obtains video information and comments, and manages the keyword ontology.
5.2 Displaying Scene Search Results
We implemented the user interface described in 3.5 using SpringGraph [SpringGraph] as well as an interface arranging the scenes along the usual time sequence. SpringGraph displays a pair of linked nodes as close to each other as possible and a pair of unlinked nodes as far apart as possible. This feature enables easily and visually understandable representations of the relationships between videos, as shown in Figure 5.
Fig. 5. Connections of content (oval nodes), scenes (smaller rectangles), and videos (larger rectangles)
6 Experimental Evaluation of Search Precision
In this section, we evaluate NicoScene by measuring the precision of scene search through experiments. The targets of the experiments include searches for objective scenes, such as goal scenes in soccer, and subjective ones, such as impressive scenes. The correct answers for the objective scene searches are the video units including the times at which the corresponding events occur. For the subjective ones, we picked the units including the times recorded as impressive scenes by multiple persons who carried out the experiments. The subjects of these experiments are four students of our university.
6.1 Method of Experiments
We used a set of keywords listed in advance by users who frequently use Nico Nico Douga. We considered a search result of NicoScene as correct if the searched unit includes a correct scene. The evaluation criteria are precision = (the number of correct search results) / (the number of searched scenes) and recall = (the number of correct search results) / (the number of correct answers).
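As a concrete reading of these criteria, the sketch below (our own, with hypothetical inputs) counts a searched unit as correct when it contains at least one annotated event time; it reproduces the worked example of Figure 6 that follows.

def evaluate(searched_units, event_times, unit_length):
    """Precision/recall for unit-based scene search.

    searched_units: indices of units returned by the system.
    event_times: ground-truth times (seconds) at which the target events occur;
                 the correct answers are the units containing these times.
    """
    answer_units = {int(t // unit_length) for t in event_times}
    correct = sum(1 for u in searched_units if u in answer_units)
    precision = correct / len(searched_units) if searched_units else 0.0
    recall = correct / len(answer_units) if answer_units else 0.0
    return precision, recall

# Mirrors the spirit of Figure 6: three events, two searched units, one hit.
p, r = evaluate(searched_units=[2, 7], event_times=[290, 450, 700], unit_length=60)
# p == 0.5 (1/2), r == 0.333... (1/3)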
Fig. 6. Definition of correct answers
For example, in Figure 6, the triangles represent the times at which the events occur. As the correct answers are the units including those times, the number of correct answers is three. The orange units are the search results of the system (searched scenes), and there are two of them. In this case, the first search
result does not overlap with any unit of the correct answers and is therefore "incorrect". Since the other result includes a correct answer, it is a correct result. Therefore, because one of the two searched scenes is correct and the number of correct answers is three, the precision is 50% (1/2) and the recall is 33.3% (1/3).
Experiment 1: Objective Scene Search. As the targets of the experiment on searching for scenes that include objectively observable events, we examined soccer goal scenes and overtaking scenes in Formula One auto racing ("F1" hereafter).
1. Experiment 1-(1): Soccer goal scene search. We searched for goal scenes in seventeen soccer videos whose time lengths range from 20 to 65 minutes and measured the precisions and recalls. We examined the correctness of the results by comparing them with the times of the goal scenes identified by actually viewing the videos. We fixed the time length of one unit t at 38 seconds because a preliminary inquiry showed that average viewers consider this time length appropriate for one scene of a soccer match. We used the keywords corresponding to the "Goal" attribute in Table 1.
2. Experiment 1-(2): F1 overtaking scene search. We searched for overtaking scenes in nine F1 videos whose time lengths are around 25 minutes and measured the precisions and recalls. We examined the correctness of the results by comparing them with the times of the overtaking scenes identified by actually viewing the videos. We fixed the time length of one unit t at 60 seconds. We used the keywords corresponding to the "Overtake" attribute in Table 1.
Experiment 2: Subjective Scene Search. In this experiment, we asked four people to view a soccer match of 160 minutes (including extra time and the penalty shoot-out) and to record the beginning times of the scenes they found impressive. After that, we compared the records with the search results of the system and calculated the total precision and recall as the evaluation measures. We used the keyword set { "goal", "here comes", "good", "ooh", "first score", "rough", "dangerous", "red card", "sent off", "red", "suspension", "yellow card", "yellow", "caution", "great", "unlucky" }.
Experiment 3: Scene Search in Other Types of Videos. The aim of this experiment is to examine how widely applicable our system is. We tried the following tasks and measured the precisions and recalls while changing α and δ.
1. Experiment 3-(1): Searching for scenes in which a specific person is speaking in videos of political discussions, which differ from sports videos in that they include only small movements of the people
2. Experiment 3-(2): Searching for scenes in which a specific character appears in entertainment videos
As for 3-(1), we searched for scenes in which Mr. Taro Aso, the former prime minister of Japan, is speaking in four videos of party leader meetings. As for 3-(2), from four videos of original stories based on a video game called "THE IDOLM@STER", we searched for scenes in which Yayoi Takatsuki, a character of the game, appears. We fixed the time lengths of the units at 60 seconds (3-(1)) and 40 seconds (3-(2)) by considering the length of one scene in each type of video. We remark that the distributions of the comments have larger deviations than in the videos used in Experiments 1 and 2. We used the keywords appearing in Table 1. In all the experiments, we changed the weight α of the keyword comments from 0 to 100 and the score threshold δ from 0 to 12.
6.2 Results of Experiments
Figure 7 shows the results of the experiments. In each graph, the solid lines denote the precisions while the dotted lines denote the recalls. The results show that scene search based on social annotation achieves sufficient precision. In detail, for all types of videos, the experiments provided high precisions while keeping the recalls above 0.5 when we fix the weight of keyword comments α to 70 and the threshold δ to 6 (Table 2).

Table 2. Precisions and recalls of each experiment with weight α = 70 and threshold δ = 6
Experiment | Precision | Recall
Soccer goal scenes | 0.77 | 0.78
F1 overtake scenes | 0.65 | 0.62
Soccer impressive scenes | 0.81 | 0.58
Mr. Aso speaking scenes | 0.77 | 0.77
Yayoi appearing scenes | 0.66 | 0.56
Next, we evaluate the results in detail. The precision of Experiment 1-(2) is lower than that of 1-(1) because the numbers of videos, scenes, and comments are small. Although our approach normalizes the comment scores, a considerable number of comments is still needed. The results of Experiment 2 are worse than those of 1-(1), which deals with the same type of videos (soccer). This is probably because of the wide variety of the scenes picked by each person on the basis of his or her subjective criteria of impressiveness, whereas goal scenes are identified by objective events. For example, one subject of this experiment picked player substitution scenes; another subject picked scenes in which the supporters are singing. The results of Experiment 3 are worse than those of Experiment 1, although the former searches for objective scenes in which a specific person is speaking or a specific character appears. This is probably because, while the audience respond
much to the first speaking scene or the first appearance scene, their responses become weaker for the following similar scenes due to the characteristics of videos of discussions or stories.
Fig. 7. Precisions (solid lines) and recalls (dotted lines) of each experiment
7 Discussions
In this section, we discuss how well our system addresses the issues described in Section 3. First we discuss the issues described in 3.1 and 3.2, that is, the dependency on the number of comments during a fixed time and the dependency on video clipping. The experiments show that NicoScene can find the required scenes with considerable precision by scoring the units on the basis of the number and the contents
of comments and normalizing them. This holds in particular for searching scenes that include objective events to which the general public respond. However, the experiments also show that our approach does not fit well with searching for scenes that each person subjectively feels are important. Even when we search for objective scenes, a sufficient number of comments needs to be attached to each video. We tried to address the issue described in 3.3, that is, the representation of attributes of scenes, by using the keyword ontology. The experiments show that building an appropriate keyword set helps effective scene search. However, it is not easy to list keywords appropriate to various situations. We need to examine additional approaches, such as keyword reuse based on the structure of the ontology, to make our system practical. As for the issue described in 3.4, that is, a user interface for easy understanding of the relationships between scenes, SpringGraph visualizes the relationships, because the spring model puts closely related nodes near each other. Our user interface also makes it easy to grasp scenes that match multiple keywords. Thus we can understand the relationships intuitively with NicoScene.
8 Related Work
There are various approaches to annotating multimedia contents. One popular approach is automatic annotation. This approach identifies the types of scenes by analyzing the images and/or the sounds and attaches metadata to the scenes. For example, Bloehdorn et al. [Bloehdorn 05] relate low-level descriptions of multimedia data, such as colors, with content-level descriptions, such as persons, and create and attach semantic annotations. Automatic annotation approaches are expected to achieve considerable precision and have produced several applications, such as removing commercial messages. However, most of them suffer from limitations on the target videos. For example, we need to identify each speaking person for news videos, changes of the background images for sports videos, and scene switching for movies. Therefore, since different techniques are needed for different types of videos, it is difficult for a single system to deal with various types of videos. This means that we cannot deal with the huge number and variety of videos on the Web at the same time. On the other hand, manual annotation approaches are divided into two categories: experts' annotations and annotations by the general public. In the former approach, experts such as producers and broadcasting company staff attach metadata, mainly for commercial purposes. Such annotations include captions in media such as DVDs, teletext, and closed captioning. This approach produces highly precise annotations because experts provide them. However, it is not practical for video hosting services since it is more expensive than other approaches and can be used only for commercial media. Manual annotation by the general public is the social annotation this paper deals with. While Nico Nico Douga is an application of social annotation in practice, there are other applications in the research field, such as the video tag
game [Video Tag Game]. In social annotation, general users voluntarily attach metadata to scenes. Social annotation is considered more useful as more users are involved. Miyamori et al. [Miyamori 05] obtained considerable results in summarizing American football match videos. In their approach, they aligned the upload times of texts written about a TV broadcast of a match on Internet BBSs with the video of the match, and thus treated the texts as annotations to the video. Because the annotations are attached by users with background knowledge about American football, their approach has the advantages that it can precisely describe the contents of scenes and that the annotation costs are low. Masuda et al. [Masuda 08] adopt a similar approach. However, if we use the annotations without any processing, we cannot obtain appropriate information because of the large amount of noise. In addition, a large amount of information often makes it difficult to identify and extract specific scenes. Nico Nico Douga is attracting attention from various communities because of its large scale, and research on it has recently become active. Nakamura et al. [Nakamura 08] measured the reliability of videos by tracing the changes of comments along the playing time and the time since the upload of each video. Hamasaki et al. [Hamasaki 08] analyzed the connections and the extent of the spread of users' creative activities by tracing the relations between the user-generated contents in Nico Nico Douga. Our approach differs from these studies of scene search and annotation as follows. We attach annotations automatically and at low cost to the huge number of videos stored in bidirectional media by estimating the degree of attention to scenes on the basis of quantitative comment analysis and by analyzing the types of the comments. We can thus search scenes efficiently with considerable precision.
9 Conclusions
In this paper, we proposed NicoScene, a system that searches for scenes in videos by comprehensively examining the number and the contents of the comments on Nico Nico Douga. Experiments showed that our system can search for scenes with sufficient precision and recall in various types of videos by fixing an appropriate weight of keyword comments and an appropriate score threshold. In addition, we can adjust the trade-off between precision and recall by changing the threshold. This means that the threshold can be used as a parameter to adjust the number of search results. The current research status is an intermediate step toward video summarization technology. We therefore consider the precisions of our experiments sufficient if we apply our approach to a system with which we manually extract scenes from the search results and create a summary by combining them. As future work, we will carry out experiments similar to those of this paper on other various types of videos. These experiments would clarify the applicability of our approach and increase the precision. We also want to improve our system so that it can be better applied to video summarization. Because the keyword set of
NicoScene is manually built as an ontology, we will investigate automatic approaches, such as identifying comments with high frequency in each scene as keyword candidates by supervised machine learning techniques. We are also going to investigate knowledge-based search methods, such as analyzing the relationships between comments.
Acknowledgement. We would like to thank the members of our laboratory for their cooperation in the experiments and for their discussions with us.
References
Adobe AIR. Adobe Systems Incorporated: Adobe AIR, http://www.adobe.com/products/air/
Australian 10. The Australian: Net plan triggers 'digital divide' at Seven (2005), http://www.theaustralian.com.au/business/media/net-plan-triggers-digital-divide-at-seven/story-e6frg996-1225840643633
Bloehdorn 05. Bloehdorn, S., Petridis, K., Saathoff, C., Simou, N., Tzouvaras, V., Avrithis, Y., Handschuh, S., Kompatsiaris, Y., Staab, S., Strintzis, M.G.: Semantic Annotation of Images and Videos for Multimedia Analysis. In: Gómez-Pérez, A., Euzenat, J. (eds.) ESWC 2005. LNCS, vol. 3532, pp. 592–607. Springer, Heidelberg (2005)
Hamasaki 08. Hamasaki, M., Takeda, H., Nishimura, T.: Network Analysis of Massively Collaborative Creation of Multimedia Contents – Case Study of Hatsune Miku Videos on Nico Nico Douga. In: First International Conference on Designing Interactive User Experiences for TV and Video (uxTV 2008), pp. 165–168 (2008)
Masuda 08. Masuda, T., Yamamoto, D., Ohira, S., Nagao, K.: Video Scene Retrieval Using Online Video Annotation. In: Proc. of JSAI 2007, pp. 54–62 (2008)
Miyamori 05. Miyamori, H., Nakamura, S., Tanaka, K.: Generation of Views of TV Content Using TV Viewers' Perspectives Expressed in Live Chats on the Web. In: Proceedings of the 13th Annual ACM International Conference on Multimedia, pp. 853–861 (2005)
Nakamura 08. Nakamura, S., Shimizu, M., Tanaka, K.: Can Social Annotation Support Users in Evaluating the Trustworthiness of Video Clips? In: Proceedings of the 2nd ACM Workshop on Information Credibility on the Web, pp. 59–62 (2008)
SpringGraph. Shepherd, M.: SpringGraph, http://mark-shepherd.com/SpringGraph/
Video Tag Game. Zwol, R., Garcia Pueyo, L., Ramirez, G., Sigurbjörnsson, B., Labad, M.: Video Tag Game. In: Proc. of WWW 2008 (2008)
Social Relation Based Search Refinement: Let Your Friends Help You!
Xu Ren1, Yi Zeng1, Yulin Qin1,2, Ning Zhong1,3, Zhisheng Huang4, Yan Wang1, and Cong Wang1
1 International WIC Institute, Beijing University of Technology, Beijing 100124, P.R. China, [email protected]
2 Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A., [email protected]
3 Department of Life Science and Informatics, Maebashi Institute of Technology, Maebashi-City 371-0816, Japan, [email protected]
4 Department of Artificial Intelligence, Vrije University Amsterdam, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands, [email protected]
Abstract. One of the major problems for search at Web scale is that the search results on large-scale data might be huge, so that users have to browse a great deal to find the most relevant ones. In addition, because of differences in context, user requirements may differ even though the input query is the same. In this paper, we try to achieve scalability for Web search through the social relation diversity of different users. Namely, we utilize one of the major contexts of users, their social relations, to help refine the search process. Social network based group interest models are developed from collaboration networks and are designed to be used in a wider range of Web-scale search tasks. The experiments are based on the SwetoDBLP dataset, and we conclude that the proposed method is potentially effective in helping users find the most relevant search results in the Web environment.
Keywords: social relation, retained interest, social network based group interest model, personalized search, search refinement.
1 Introduction
Formulating a good query for search is an everlasting topic in the fields of information retrieval and semantic search, especially when the data grows to Web scale. The hard part is that users sometimes cannot provide enough constraints for a query, since many users are not experienced enough. The user background is a source that can be used to find user interests, and the acquired interests can be added as constraints to the original vague query to refine the query process and help users get the most relevant results.
In our setting for this study, we define a user interest as a concept that the user is interested in or at least familiar with. In addition to our study in [1], which shows that users' recent interests may help to build a better refined query, we propose that in some cases users' social relations and social network based group interest models can help to refine a vague query too, since social relations serve as an environment for users when they perform query tasks. From the perspective of scalable Web search, this paper aims at achieving scalability by providing the most important search results to users: no matter how fast the data grows, the set of the most important search results for a user remains relatively small. Users' social relations can be represented in the form of semantic data and serve as one kind of background information that can be used to help users acquire the most important search results. In this paper, based on SwetoDBLP [2], an RDF version of the DBLP dataset, we provide some illustrative examples (mainly concentrating on expert finding and literature search) of how social relations and social network based group interest models can help to refine searching on the Web.
2 Social Relations and Social Networks
Social relations can be built based on friendships, coauthorships, work relationships, etc. The collection of social relationships of different users forms a social network. As an illustrative example, we build a coauthor network based on the SwetoDBLP dataset, representing the coauthor information for each author with the FOAF vocabulary term "foaf:knows".
Fig. 1. Coauthor number distribution in the SwetoDBLP dataset
Fig. 2. log-log diagram of Figure 1
The social network can be considered as a graph. Each node can be an author name and the relationships among nodes can be coauthorships. An RDF dataset that contains all the coauthor information for each of the authors in the SwetoDBLP dataset has been created and released1. Through an analysis of
1 The coauthor network RDF dataset created based on the SwetoDBLP dataset can be acquired from http://www.wici-lab.org/wici/dblp-sse
node distribution for this DBLP coauthor network, we find that it has the following statistical properties, as shown in Figure 1 and Figure 2 [3,4]: the distribution can be approximately described as a power law, which means that only a few authors have a lot of coauthors while most authors have very few. This distribution characteristic suggests that, considering the scalability issue, when the number of authors expands rapidly it will not be hard to rebuild the coauthor network, since most authors have only a few links. The purpose of this RDF dataset is not just to create a coauthor network, but to extract social relations from it and use them to refine the search process.
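A minimal sketch of how such a coauthor network could be loaded and its degree distribution computed is given below. It assumes the released RDF dataset with foaf:knows triples and the Python rdflib library; the file name and the function names are our own.

from collections import Counter

from rdflib import Graph
from rdflib.namespace import FOAF

def coauthor_degrees(rdf_path):
    """Load the coauthor RDF dataset (foaf:knows triples) and count,
    for each author, how many coauthors he or she is linked to."""
    g = Graph()
    g.parse(rdf_path)                      # format inferred from the file; adjust if needed
    degree = Counter()
    for author, coauthor in g.subject_objects(FOAF.knows):
        degree[author] += 1
    return degree

def degree_distribution(degree):
    """Number of authors having exactly k coauthors, for each k
    (the quantity plotted in Figures 1 and 2)."""
    return Counter(degree.values())

# Hypothetical usage with the released dataset:
# dist = degree_distribution(coauthor_degrees("swetodblp_coauthors.rdf"))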
3 Search Refinement through Social Relationship
In enterprise information retrieval, expert finding is an emerging research topic [5]. The main task in this research area is to find relevant experts for a specific domain [6]. Nevertheless, a list of expert names that has nothing to do with the end user often confuses them. More convenient search refinement strategies should be developed. We propose that if the end users are familiar with the retrieved expert names, the search results may be more convenient to use. As an illustrative example, we propose a search task that needs to find "Artificial Intelligence authors" based on the SwetoDBLP dataset.

Table 1. A partial result of the expert finding search task "Artificial Intelligence authors" (user name: John McCarthy)
Satisfied authors without social relation refinement | Satisfied authors with social relation refinement
Carl Kesselman (312) | Hans W. Guesgen (117) *
Thomas S. Huang (271) | Virginia Dignum (69) *
Edward A. Fox (269) | John McCarthy (65) *
Lei Wang (250) | Aaron Sloman (36) *
John Mylopoulos (245) | Carl Kesselman (312)
Ewa Deelman (237) | Thomas S. Huang (271)
... | ...
Table 1 provides a partial result for the proposed expert finding search task (here we only consider a very simple and incomplete strategy, namely, finding the author names that have at least one paper with "Artificial Intelligence" in its title). The left column is a partial list of results without social relation based refinement, which is just a list of author names without any relationship to the user. The right column is a partial list of results with social relation based refinement (the refinement is based on the social relations of the specified user that are extracted from the social network created in Section 2). Namely, the "Artificial Intelligence" authors whom the user "John McCarthy"
knows are ranked to the front (as shown in the table, including himself). The results of the right-column type seem more convenient for a user, since the results ranked first are familiar to the user, compared with a list of irrelevant names. In an enterprise setting, if the found experts have some previous relationship with the employer, the cooperation may be smoother. In this example, a user's collaborators appear in two different scenarios, namely, in the coauthor network and in the domain experts knowledge base (here we consider SwetoDBLP as the experts knowledge base). Both of them are represented as semantic datasets using RDF, which enables the following connection. When a user tries to find domain experts, his social relations in the coauthor network are linked together with the domain experts knowledge base through the user's name or URI. This connection brings two separate datasets together and helps to refine the expert finding task.
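The re-ranking idea behind Table 1 can be sketched as follows; the function and the example data are hypothetical and only illustrate moving the authors known to the querying user to the front.

def refine_by_social_relation(candidate_experts, user_coauthors):
    """Re-rank expert-finding results: authors whom the querying user knows
    (extracted from the coauthor network) are moved to the front, keeping
    the original order within each group, as in Table 1."""
    known = [a for a in candidate_experts if a in user_coauthors]
    others = [a for a in candidate_experts if a not in user_coauthors]
    return known + others

# Hypothetical data in the spirit of Table 1:
experts = ["Carl Kesselman", "Thomas S. Huang", "Hans W. Guesgen", "John McCarthy"]
coauthors = {"Hans W. Guesgen", "John McCarthy", "Aaron Sloman"}
print(refine_by_social_relation(experts, coauthors))
# ['Hans W. Guesgen', 'John McCarthy', 'Carl Kesselman', 'Thomas S. Huang']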
4 Social Network Based Group Interest Models
A user and his/her friends and collaborators form a social network. The user's interests may be affected by this social network, since the network contains a group of other users who also have interests of their own. If they frequently communicate with each other, in the form of talking, collaboration, coauthoring, etc., their interests may be affected by each other's. If the user is affected by the social network based "group interests", he/she may begin to search on the interesting topic to find relevant information. Hence, social network based group interests may serve as an essential environmental factor from the user background for search refinement.
Group Interest. For a specific interest t(i), its group interest for a specific author u, namely GI(t(i), u), can be quantitatively defined as:

GI(t(i), u) = Σ_{c=1}^{m} E(t(i), u, c),   E(t(i), u, c) = 1 if t(i) ∈ I_c^topN, and 0 if t(i) ∉ I_c^topN,   (1)
where E(t(i), u, c) ∈ {0, 1}: if the interest t(i) appears both in the top N interests of the user and in those of one of his/her friends, then E(t(i), u, c) = 1; otherwise, E(t(i), u, c) = 0. For a specific user u with m friends in all, the group interest of t(i) is the cumulative value of E(t(i), u, c) over the m friends. In a word, group interest focuses on the accumulation of ranked interests from a specific user's social network. Various models can be used to quantitatively measure and rank interests so that one can obtain the top N interests used to produce the group interest values. We defined four such models in [7]; here we briefly review them so that group interests can be compared from the four perspectives. Let i, j be positive integers and y_{t(i),j} be the number of publications related to topic t(i) during the time interval j.
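A minimal sketch of Equation (1): the group interest of a topic is simply the number of friends whose top-N interest sets contain it. The data layout (one set of terms per friend) is our own assumption.

def group_interest(topic, friends_top_n):
    """GI(t(i), u) of Equation (1): the number of the user's friends whose
    top-N interest set contains the topic t(i).

    friends_top_n: one set of top-N interest terms per friend of user u."""
    return sum(1 for top_n in friends_top_n if topic in top_n)

# Hypothetical friend profiles:
friends = [{"Web", "Search", "Query"}, {"Retrieval", "Search"}, {"Logic", "Reasoning"}]
assert group_interest("Search", friends) == 2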
Cumulative Interest. Cumulative interest, denoted as CI(t(i), n), counts the cumulative number of appearances of the interest t(i) during the n time intervals. It can be acquired through:

CI(t(i), n) = Σ_{j=1}^{n} y_{t(i),j}.   (2)

It is used to reflect a user's overall interest in the specified topic within a time interval.
Retained Interest. A person may be interested in a topic for a period of time but is likely to lose interest in it as time passes if it has not appeared in some way for a long time. This phenomenon is very similar to the forgetting mechanism of cognitive memory retention. In [1] we introduced a retained interest model based on the power law function that cognitive memory retention [8] follows:

RI(t(i), n) = Σ_{j=1}^{n} y_{t(i),j} × A T_{t(i)}^{-b},   (3)
where T_{t(i)} is the duration of interest in topic t(i) up to a specified time. For each time interval j, the interest t(i) may appear y_{t(i),j} times, and y_{t(i),j} × A T_{t(i)}^{-b} is the total retention of an interest contributed by that time interval. According to our previous studies, the parameters satisfy A = 0.855 and b = 1.295 [1].
Interest Longest Duration. Interest longest duration, denoted as ILD(t(i)), represents the longest duration of the interest t(i):

ILD(t(i)) = max_n (ID(t(i))_n),   (4)

where n ∈ I+ and ID(t(i))_n is the interest duration when t(i) discretely appears for the nth time (the time interval of the appearing interest is not directly continuous with that of the previous appearance).
Interest Cumulative Duration. Interest cumulative duration, denoted as ICD(t(i)), represents the cumulative duration of the interest t(i):

ICD(t(i)) = Σ_n (ID(t(i))_n),   (5)

where the sum runs over all discrete appearances n ∈ I+ of the interest t(i).
The above four interest models are used for producing the top N interests. The corresponding group interests based on the proposed models are: group cumulative interest (GCI(t(i), u)), group retained interest (GRI(t(i), u)), group cumulative duration (GCD(t(i), u)), and group longest duration (GLD(t(i), u)), respectively. Their calculation function is the same as that of GI(t(i), u); namely, GCI(t(i), u), GRI(t(i), u), GCD(t(i), u), and GLD(t(i), u) are special cases of GI(t(i), u).
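The four interest models can be sketched as follows. This is our own illustration: in particular, how T_{t(i)} is measured for each interval in the retained interest model is an assumption, while the parameter values A = 0.855 and b = 1.295 follow the paper.

def cumulative_interest(counts):
    """CI(t(i), n), Equation (2): total number of publications on the topic
    over the n time intervals; counts[j] is y_{t(i),j}."""
    return sum(counts)

def retained_interest(counts, durations, A=0.855, b=1.295):
    """RI(t(i), n), Equation (3): each interval contributes y_{t(i),j} * A * T^(-b).
    durations[j] is taken here as the time the topic has been held up to
    interval j (our reading of T_{t(i)})."""
    return sum(y * A * (T ** -b) for y, T in zip(counts, durations) if T > 0)

def interest_longest_duration(discrete_durations):
    """ILD(t(i)), Equation (4): the longest of the discrete appearance durations."""
    return max(discrete_durations) if discrete_durations else 0

def interest_cumulative_duration(discrete_durations):
    """ICD(t(i)), Equation (5): the sum of all discrete appearance durations."""
    return sum(discrete_durations)

# Hypothetical publication counts for one topic over five intervals,
# with the topic held for 1..5 years at the respective intervals:
counts, held = [2, 0, 3, 1, 4], [1, 2, 3, 4, 5]
ci = cumulative_interest(counts)        # 10
ri = retained_interest(counts, held)    # decayed, time-weighted score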
As a foundation for the development of social "group interests", we analyzed the retained interest values of all the authors in the SwetoDBLP dataset (more than 615,000 authors) using the introduced model; an RDF version of the interest-enhanced DBLP author set has been released on the project page2. Here we give an illustrative example of producing group interests based on retained interests. Using formulas 3 and 1, and taking "Ricardo A. Baeza-Yates" as an example, a comparative list of the top 7 retained interests of his own and his group retained interests (with 132 authors involved) is shown in Table 2.
Table 2. A comparative study of the top 7 retained interests of a user and his/her group retained interests (user name: Ricardo A. Baeza-Yates)
Self Retained Interests | Value | Group Retained Interests | Value
Web | 7.81 | Search (*) | 35
Search | 5.59 | Retrieval | 30
Distributed | 3.19 | Web (*) | 28
Engine | 2.27 | Information | 26
Mining | 2.14 | System | 19
Content | 2.10 | Query (*) | 18
Query | 1.26 | Analysis | 14
Through Table 2 we can find that group retained interests may not be the same as, but are to some extent related to, the user's own retained interests (interest terms marked with "*" appear in both lists). As a step forward, we analyzed the overlap between a specific user's own interests and his/her group interests. The 50 most productive authors from the DBLP dataset (May 2010 version) were selected for the evaluation. The analysis considers 4 types of overlaps:
– cumulative interest (CI(t(i), n)) and group cumulative interest (GCI(t(i), u)),
– retained interest (RI(t(i), n)) and group retained interest (GRI(t(i), u)),
– interest longest duration (ILD(t(i), j)) and group interest longest duration (GLD(t(i), u)),
– interest cumulative duration (ICD(t(i), j)) and group interest cumulative duration (GCD(t(i), u)).
The values of the overlaps are averages over the selected 50 authors. As shown in Figure 3, from the 4 perspectives, the overlaps are within the interval [0.593, 0.667]. This means that, from whichever of these perspectives, the overlap between the users' own interests and their group interests is at least 59%. Taking RI(t(i), n) vs. GRI(t(i), u) and CI(t(i), n) vs. GCI(t(i), u) as examples, Figure 4 shows that for most of the 50 authors the overlaps are within the interval [0.4, 0.9].
2 http://www.wici-lab.org/wici/dblp-sse and http://wiki.larkc.eu/csri-rdf
Fig. 3. Ratio of overlap between different group interest values and the specified author's interest values
Fig. 4. A comparative study on the overlap between RI and GRI, CI and GCI
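The overlap analysis reported in Figures 3 and 4 can be sketched as below; the exact overlap measure is not spelled out in the text, so the fraction used here is one plausible reading.

def interest_overlap(own_top_n, group_top_n):
    """Fraction of the user's own top-N interest terms that also appear in the
    group's top-N terms."""
    if not own_top_n:
        return 0.0
    return len(set(own_top_n) & set(group_top_n)) / len(set(own_top_n))

own = ["Web", "Search", "Distributed", "Engine", "Mining", "Content", "Query"]
group = ["Search", "Retrieval", "Web", "Information", "System", "Query", "Analysis"]
print(round(interest_overlap(own, group), 3))   # 0.429 for the Table 2 example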
Based on the results and analysis above, besides users' own interests, their group interests can be used as another source to refine the search process and satisfy various user needs.
5 Group Interests Based Search Refinement
In [1], following the idea of the retained interest model of a specific user, we developed a DBLP Search Support Engine (DBLP-SSE), which utilizes the user's own retained interests to refine search on the SwetoDBLP dataset [2]. Based on the idea of the group retained interest model introduced in Section 4, we developed a search support engine based on the SwetoDBLP dataset [2]. Figure 5 is a screen shot of the current version of the DBLP Search Support Engine (DBLP-SSE).
Fig. 5. A screen shot of the DBLP Search Support Engine (DBLP-SSE)
Table 3. Search refinement using the user's retained interests and group retained interests
Name: Ricardo A. Baeza-Yates
Query: Intelligence
List 1: without any refinement (top 7 results)
1. PROLOG Programming for Artificial Intelligence, Second Edition.
2. Artificial Intelligence Architectures for Composition and Performance Environment.
3. The Mechanization of Intelligence and the Human Aspects of Music.
4. Artificial Intelligence in Music Education: A Critical Review.
5. Readings in Music and Artificial Intelligence.
6. Music, Intelligence and Artificiality.
7. Regarding Music, Machines, Intelligence and the Brain: An Introduction to Music and AI.
List 2: with the user's own interest constraints (top 7 results)
Interests: Web, Search, Distributed, Engine, Mining, Content, Query
1. SWAMI: Searching the Web Using Agents with Mobility and Intelligence.
2. Moving Target Search with Intelligence.
3. Teaching Distributed Artificial Intelligence with RoboRally.
4. Prototyping a Simple Layered Artificial Intelligence Engine for Computer Games.
5. Web Data Mining for Predictive Intelligence.
6. Content Analysis for Proactive Intelligence: Marshaling Frame Evidence.
7. Efficient XML-to-SQL Query Translation: Where to Add the Intelligence?
List 3: with group retained interest constraints (top 7 results)
Interests: Search, Retrieval, Web, Information, System, Query, Analysis
1. Moving Target Search with Intelligence.
2. A New Swarm Intelligence Coordination Model Inspired by Collective Prey Retrieval and Its Application to Image Alignment.
3. SWAMI: Searching the Web Using Agents with Mobility and Intelligence.
4. Building an information on demand enterprise that integrates both operational and strategic business intelligence.
5. An Explainable Artificial Intelligence System for Small-unit Tactical Behavior.
6. Efficient XML-to-SQL Query Translation: Where to Add the Intelligence?
7. Intelligence Analysis through Text Mining.
Table 3 shows a comparative study of search results without refinement, with user retained interest based refinement, and with group retained interest based refinement. Different search results are selected and provided to users to meet their diverse needs. One can see how the social network based group interests serve as an environmental factor that affects the search refinement process and helps to obtain more relevant search results.
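The idea behind Table 3 can be sketched as follows: the top interest terms (the user's own or the group's) are added as soft constraints and used to re-rank the titles matching the query. This is only an illustration of the idea, not the DBLP-SSE implementation.

def refine_query(query, interests, top_k=7):
    """Return a search function that ranks candidate titles containing the
    query by how many of the top-k interest terms they also contain."""
    constraints = [t.lower() for t in interests[:top_k]]

    def rank(title):
        text = title.lower()
        return sum(term in text for term in constraints)

    def search(candidate_titles):
        hits = [t for t in candidate_titles if query.lower() in t.lower()]
        return sorted(hits, key=rank, reverse=True)

    return search

group_interests = ["Search", "Retrieval", "Web", "Information", "System", "Query", "Analysis"]
search = refine_query("Intelligence", group_interests)
results = search(["Readings in Music and Artificial Intelligence.",
                  "Moving Target Search with Intelligence."])
# The title matching the "Search" constraint is ranked first.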
6 Evaluation and Analysis
Since the user interests and group interests are obtained from an analysis of real authors in the DBLP system, the participants in the evaluation of the experimental results also need to be real authors in the system, preferably those with publications distributed over different years. These constraints made finding participants difficult. The participants were required to search for "intelligence" in the DBLP Search Support Engine (DBLP-SSE)3 that we developed based on the SwetoDBLP dataset [2]. Three lists of query results were provided to each of them: one acquired with the unrefined query, and another two refined by the user's own top 9 retained interests and by the top 9 group retained interests. They were required to judge which list of results they prefer. Currently, we have received evaluation results from 7 authors who have publications listed in DBLP. Through an analysis of these results, we find that: 100% of these authors feel that the refined search results using the users' most recent RI(t(i), n) and GRI(t(i), u) are much better than the result list without any refinement; 100% of them feel that the satisfaction degrees of the two refined result lists are very close; 83.3% of them feel that the results refined by the users' own RI(t(i), n) are better than the others; and 16.7% of them feel that the results refined by GRI(t(i), u) are better than the others. The list refined with the authors' own RI(t(i), n) is expected to be the best one, since the query constraints are the most related information that the users are interested in. The average overlap between users' RI(t(i), n) and GRI(t(i), u) is around 63.8%, which means that interests from the author's social network are very relevant to his/her own interests. That is why the list refined with GRI(t(i), u) is also welcome and considered much better than the one without any refinement. It indicates that if one's own interests cannot be acquired, his/her friends' interests can also help to refine the search process and results.
7 Conclusion and Future Work
In this study, we illustrate how social relations and social network based interest models can help to refine searching on large-scale data. Regarding the scalability issue, this approach scales in the following way: no matter how large the dataset is, through the social relation based group interest models the amount of most relevant results remains relatively small, and these results are always ranked at the top for user investigation. The methods introduced in this paper are related to, but different from, traditional collaborative filtering methods [9,10]. Firstly, neither the user nor their friends (e.g., coauthors, collaborators) comment on or evaluate any search results (items) in advance. Secondly, the interest retention models (both the users' own and their group ones) track the retained interests as time passes. The retained interests change dynamically, but some of the previous interests are retained according
3 DBLP-SSE is available at http://www.wici-lab.org/wici/dblp-sse
to the proposed retention function. Thirdly, following the idea of linked data [11], there is no need to have all relevant information in one dataset or system. As shown in Section 3, user interests stored in different data sources are linked together for search refinement (user interest data and collaboration network data). As another example, if someone who is recorded in the DBLP system wants to buy books on Amazon, he/she does not need to have a social relation on Amazon that can be used to refine the product search; through the linked data from the group interests based on SwetoDBLP, the search process could also be refined. For now, the semantic similarities of the extracted terms have not been added into the retained interest models. Some preliminary experiments show that this may reduce the correlation between an author's own retained interests and his/her group interest retention. For example, for the user "Guilin Qi", both his current retained interests and his group interests contain "OWL" and "ontology", which appear to be two different terms. But in practice, "OWL" is very related to "ontology" (their Normalized Google Distance [12] is NGD(ontology, owl) = 0.234757, and if NGD(x, y) ≤ 0.3, then x and y are considered to be semantically very related [12]). For the user "Zhisheng Huang", the terms "reasoning" and "logic" are two important interests, and "reasoning" is very related to "logic" (NGD(logic, reasoning) = 0.2808). In our future work, we would like to use the Google distance [12] to calculate the semantic similarities of interest terms so that more accurate retained interests can be acquired and better search constraints can be found. We would also like to see whether other social network theories (such as six degrees of separation) could help semantic search refinement in a scalable environment.
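For reference, the Normalized Google Distance of [12] can be computed from page counts as sketched below; obtaining the counts from a live search engine is omitted, and the function is our own illustration.

import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance [12] computed from page counts:
    fx, fy -- hits for the single terms,
    fxy    -- hits for both terms together,
    n      -- (an estimate of) the total number of indexed pages."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# With the rule of thumb cited above, terms with NGD(x, y) <= 0.3 would be
# treated as semantically very related (e.g. "ontology" and "OWL").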
Acknowledgement. This study is supported by a research grant from the European Union 7th Framework Programme project FP7-215535, the Large-Scale Integrating Project LarKC (Large Knowledge Collider). We thank Yiyu Yao for his ideas and discussions on the Search Support Engine, and Yang Gao for his involvement in the program development of interest retention for authors in the SwetoDBLP dataset.
References
1. Zeng, Y., Yao, Y.Y., Zhong, N.: DBLP-SSE: A DBLP Search Support Engine. In: Proceedings of the 2009 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 626–630 (2009)
2. Aleman-Meza, B., Hakimpour, F., Arpinar, I., Sheth, A.: SwetoDblp Ontology of Computer Science Publications. Web Semantics: Science, Services and Agents on the World Wide Web 5(3), 151–155 (2007)
3. Elmacioglu, E., Lee, D.: On Six Degrees of Separation in DBLP-DB and More. SIGMOD Record 34(2), 33–40 (2005)
4. Zeng, Y., Wang, Y., Huang, Z., Zhong, N.: Unifying Web-Scale Search and Reasoning from the Viewpoint of Granularity. In: Liu, J., Wu, J., Yao, Y., Nishida, T. (eds.) AMT 2009. LNCS, vol. 5820, pp. 418–429. Springer, Heidelberg (2009)
5. Balog, K., Azzopardi, L., de Rijke, M.: Formal Models for Expert Finding in Enterprise Corpora. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2006)
6. Yimam-Seid, D., Kobsa, A.: Expert-Finding Systems for Organizations: Problem and Domain Analysis and the DEMOIR Approach. In: Sharing Expertise: Beyond Knowledge Management, 1st edn., pp. 327–358. The MIT Press, Cambridge (2003)
7. Zeng, Y., Zhou, E., Qin, Y., Zhong, N.: Research Interests: Their Dynamics, Structures and Applications in Web Search Refinement. In: Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence (2010)
8. Anderson, J., Schooler, L.: Reflections of the Environment in Memory. Psychological Science 2(6), 396–408 (1991)
9. Goldberg, D., Nichols, D., Oki, B.M., Terry, D.: Using Collaborative Filtering to Weave an Information Tapestry. Communications of the ACM 35(12), 61–70 (1992)
10. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In: Proceedings of the Conference on Computer Supported Cooperative Work, pp. 175–186 (1994)
11. Bizer, C.: The Emerging Web of Linked Data. IEEE Intelligent Systems 24(5), 87–92 (2009)
12. Cilibrasi, R., Vitanyi, P.M.B.: The Google Similarity Distance. IEEE Transactions on Knowledge and Data Engineering 19(3), 370–383 (2007)
An Empirical Approach for Opinion Detection Using Significant Sentences
Anil Kumar K.M. and Suresha
Department of Studies in Computer Science, University of Mysore, Manasagangothri, Mysore, India
{anilkmsjce,sureshabm}@yahoo.co.in
Abstract. In this paper we present an unsupervised approach to identify the opinion of web users from a set of significant sentences in an opinionated text and to classify the web users' opinion as positive or negative. Web users document their opinions on opinionated sites, shopping sites, personal pages, etc., to express and share them with other web users. The opinions expressed by web users may be on diverse topics such as politics, sports, products, and movies. These opinions are very useful to others, such as leaders of political parties, selection committees of various sports, business analysts and other stakeholders of products, directors and producers of movies, as well as to other concerned web users. We use an unsupervised semantics-based approach to find users' opinions. Our approach first detects subjective phrases and uses these phrases along with semantic orientation scores to identify the user's opinion from a set of empirically selected significant sentences. Our approach provides better results than the other approaches applied on different data sets.
1 Introduction
The rapid development of the World Wide Web and its related technologies has fueled the popularity of the web with all sections of society. The web has been used by many organizations such as governments, business houses, industries, and educational institutions to make themselves known and accessible globally. An individual web user is provided with an opportunity to obtain and share knowledge. The web is the origin of many research activities, and one interesting area of research is mining users' opinions from the web on diverse topics like politics, movies, educational institutions, and products. The study of opinions is useful to both the producers and the consumers of a topic. The producers can be manufacturers of automobiles, movie producers, editors of news articles, digital product manufacturers, etc., who are very much interested in finding the opinions of users. The consumers are individual users who express their opinions and want to share them with other web users. In this paper, we attempt to find the opinion of users using a set of significant sentences from the opinionated texts. The set of significant sentences is empirically selected after observing a large collection of opinionated texts available at opinionated sites, e-commerce sites, blogs, etc. Our intuition is that the opinion from such a set of
significant sentences reflects the user's opinion on a subject. The other sentences of the opinionated text carry opinions on features of the subject. An opinionated text discussing a digital camera may have sentences with an overall opinion of the digital camera as well as opinions on features of the digital camera such as the lens, battery backup, and size. It becomes critical to detect the web user's opinion from such significant sentences and classify it as positive or negative. For example, consider the following opinionated texts of web users obtained from an opinionated site, reviewcentre. The collected opinionated texts were retained in their original form; no attempt was made to correct the web users' grammatical mistakes in these opinionated texts.

Example 1. It is a good player. I guess I was kind of expecting a better picture on my Insignia 27" LCD with a regular dvd boosted up to "near HD DVD" quality. In the end it's all good though. If I had to do it over again I would have saved some of my money and bought a Philips DVP3140/37 player which plays just about every dvd I own without a single buzz, click, skip or lockup. That's more than I can say for any other dvd player I have ever seen.

Example 2. Easy setup and easy to use. I connected it to a 19" HDTV using the component cables that came with the TV. Using the upscale setting and a standard DVD with TV at 720p, the picture is bright and beautifully detailed. Much better picture than I had expected. I definitely recommend this player. It is thinner but a bit wider than my last player, so be sure it will fit your allotted space.

Example 3. Bought a PS3 so the old Toshiba standard def just wouldn't do anymore. Found this tv at HH Greggs about 4 months ago during a big sale they were having for about $1250. This tv overall has been great so far. The full 1080p works great for gaming and playing blu-rays. Directv HD programming is also crystal clear even though it is not in 1080p resolution. If your looking for a great tv with the same quality as a Sony, but slightly cheaper, I can fully recommend this TV.

Example 1 refers to the opinion of a user on a Philips DVD product. The overall opinion of the user is positive, and it is recorded in the first sentence of the opinionated text. Example 2 shows the user's opinion on a DVD player in a sentence between the first and last sentences. Example 3 shows the user's opinion on a PS3 TV in the last sentence. The sentences in bold refer to the overall user's opinion of the product in an opinionated text. We observe from the aforementioned examples, as well as from other opinionated texts, that users document their overall opinion in a few sentences and use the other sentences to express opinions on different features of the product. We believe that it is important to identify a set of significant sentences that provides the overall opinion of the user. The opinion obtained from such significant sentences is considered the actual opinion of a web user. In this paper, we focus on finding the opinion of web users only on products, using a set of significant opinionated sentences. A set of significant sentences here refers to the first sentence, the last sentence, and the sentence with the maximum semantic orientation score, represented by SF, SL and Smax respectively. The remainder of this paper is organized as follows. In Section 2 we give a brief description of related work. In Section 3, we discuss our methodology. In Section 4, the experiments and results are discussed. The conclusion is given in Section 5.
2 Related Work
Opinion mining is a recent subdiscipline of information retrieval which is concerned not with the topic of a document but with the opinion it expresses [1]. In the literature, opinion mining is also known as sentiment analysis [7], sentiment classification [8], affective classification [21] and affective rating [16]. It has emerged in the last few years as a research area, largely driven by interest in developing applications such as mining opinions in online corpora or customer relationship management, e.g., customer review analysis [21]. Hatzivassiloglou and McKeown [19] attempted to predict the semantic orientation of adjectives by analyzing pairs of adjectives (i.e., adjectives conjoined by and, or, but, either-or, or neither-nor) extracted from a large unlabelled document set. Turney [14] obtained remarkable results on the sentiment classification of terms by considering the algebraic sum of the orientations of terms as representative of the orientation of the document. Turney and Littman [15] bootstrapped from a seed set containing seven positive and seven negative words and determined semantic orientation according to the Pointwise Mutual Information-Information Retrieval (PMI-IR) method. Wang and Araki [20] proposed a variation of the Semantic Orientation-PMI algorithm for Japanese for mining opinions in weblogs. They applied Turney's method to Japanese web pages and found the results slanting heavily towards positive opinions. They proposed a balancing factor and a neutral expression detection method and reported a well-balanced result. Opinion Observer [6] is a sentiment analysis system for analyzing and comparing opinions on the web. The product features are extracted from nouns or noun phrases by an association miner. They use adjectives as opinion words and assign the prior polarity of these through a WordNet exploration method. The polarity of an opinion expression, which is a sentence containing one or more feature terms and one or more opinion words, is assigned the dominant orientation. The extracted features are stored in a database in the form of feature, number of positive expressions, and number of negative expressions. Kamps et al. [12] focused on the use of lexical relations defined in WordNet. They defined a graph on the adjectives contained in the intersection between Turney's seed set and WordNet, adding a link between two adjectives whenever WordNet indicates the presence of a synonymy relation between them. The authors defined a distance measure d(t1, t2) between terms t1 and t2, which amounts to the length of the shortest path that connects t1 and t2. The orientation of a term is then determined by its relative distance from the seed terms good and bad. Our work differs from the aforementioned studies by finding the opinion of a user using only a few significant opinionated sentences from an opinionated text. We do not consider the opinions of all the other sentences found in an opinionated text. Our work uses not only adjectives but also other parts of speech, such as verbs and adverbs, to capture opinion words for efficient opinion detection.
3 Methodology
We collected six data sets for our work on different products such as the Sony Cybershot, Nikon Coolpix, Canon G3, and a Philips DVD player. The first data set consists of 250 opinionated texts on five different products, collected from the results of various search engines.
The second data set is a collection of 400 opinionated texts obtained from different opinionated sites like Amazon, CNet, reviewcentre, bigadda, and rediff. The third data set, consisting of 140 opinionated texts on products, is obtained from [3]. These data sets contain a balanced set of positive and negative opinionated texts. The remaining three data sets, obtained from [22], contain an unbalanced set of positive and negative opinionated texts: the fourth data set contains 32 opinionated texts, the fifth contains 95, and the final one contains 45. In our approach, we pass an opinionated text to a sentence splitter program. The sentences obtained from the program are input to a part-of-speech tagger; the tagger used in our approach is Monty Tagger [11]. Extraction patterns are applied to the tagged opinionated sentences to obtain opinionated phrases that are likely to contain the user's opinion. In this paper we use only two-word phrase extraction patterns. Table 1 shows a few extraction patterns used to obtain opinionated phrases from opinionated sentences. Here, JJ represents an adjective, and NN/NNS, VB/VBD/VBN/VBG, and RB/RBR/RBS represent different forms of nouns, verbs and adverbs. The opinionated phrases are then subjected to the Sentiment Product Lexicon (SPL) to capture only subjective or opinionated phrases. This is necessary as some phrases obtained after application of the extraction patterns may be non-subjective.

Table 1. Extraction patterns
Slno. | First Word | Second Word | Third Word
1 | JJ | NN or NNS | anything
2 | RB, RBR or RBS | JJ | not NN nor NNS
3 | JJ | JJ | not NN nor NNS
4 | NN or NNS | JJ | not NN nor NNS
5 | RB, RBR or RBS | VB, VBD, VBN or VBG | anything
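The pattern matching of Table 1 can be sketched as follows. This is our own code; the (word, tag) input format is an assumption about the tagger output.

# Patterns follow Table 1: (tags of first word, tags of second word,
# tags the third word must NOT have -- None means "anything").
PATTERNS = [
    ({"JJ"},               {"NN", "NNS"},                None),
    ({"RB", "RBR", "RBS"}, {"JJ"},                       {"NN", "NNS"}),
    ({"JJ"},               {"JJ"},                       {"NN", "NNS"}),
    ({"NN", "NNS"},        {"JJ"},                       {"NN", "NNS"}),
    ({"RB", "RBR", "RBS"}, {"VB", "VBD", "VBN", "VBG"},  None),
]

def extract_phrases(tagged):
    """Extract candidate two-word opinion phrases from a POS-tagged sentence,
    given as a list of (word, tag) pairs."""
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        for first, second, third_not in PATTERNS:
            if t1 in first and t2 in second and (third_not is None or t3 not in third_not):
                phrases.append((w1, w2))
                break
    return phrases

print(extract_phrases([("This", "DT"), ("is", "VBZ"), ("a", "DT"),
                       ("bad", "JJ"), ("phone", "NN"), (".", ".")]))
# [('bad', 'phone')]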
The Sentiment Product Lexicon is a collection of a General lexicon and a Domain lexicon. The General lexicon maintains a list of positive and negative words collected from sources like the General Inquirer [17], subjective clues [18] and a list of adjectives [13]. The Domain lexicon maintains a list of positive or negative words from the domain context. We found that words like "cool" and "revolutionary" appeared in the negative list of the General lexicon although web users used them to express positive opinions; hence we created a Domain lexicon to hold opinion words from the domain perspective. The details of the construction of the General lexicon and the Domain lexicon are available in [4]. In this paper we use these lexicons to identify neutral phrases. The need to identify neutral phrases arises because the extraction patterns yield phrases that are mostly opinionated, but also a few non-opinionated ones. For example, we obtain poor/JJ quality/NN, too/RB grainy/JJ, not/RB recommend/VB, not/RB working/VBG, etc., as opinionated phrases, but also a few non-opinionated phrases like main/JJ theme/NN, however/RB is/VBZ, and not/RB be/VB. It is important to identify and
discard these kinds of neutral phrases, as they can influence the polarity of an opinionated sentence. The Sentiment Product Lexicon can be expressed as

SPL = {GLP, GLN, DLP, DLN},   (1)

where
GLP : positive words in the General lexicon,
GLN : negative words in the General lexicon,
DLP : positive words in the Domain lexicon,
DLN : negative words in the Domain lexicon.
For example, consider the opinionated sentence "This is a bad phone." When the tagger is applied to the input sentence, we get the tagged sentence "This/DT is/VBZ a/DT bad/JJ phone/NN ./.". Application of the extraction patterns from Table 1 obtains bad/JJ phone/NN as the opinion phrase from the sentence. The Sentiment Product Lexicon is used to detect neutral phrases. We consider the extracted words of a phrase, namely word1 and word2, from an opinionated sentence as neutral if none of the extracted words are found in the Sentiment Product Lexicon. In the above example, word1 is "bad" and word2 is "phone". We first check whether word2 is in the positive or negative list of the Domain lexicon. If word2 is present in either list of the Domain lexicon, the polarity of the word is that of the list in which it is found. If it is in neither list of the Domain lexicon, the positive and negative lists of the General lexicon are consulted to find the polarity of the word. If word2 is present neither in the Domain lexicon nor in the General lexicon, we assume word2 to have neutral polarity; in such a case we use word1 instead of word2 and find the polarity of word1 in the same way as described for word2. If a polarity is found, it is assigned to the phrase consisting of both word1 and word2; if no polarity is found, we assume both word1 and word2 to be neutral. If a word, either word1 or word2, is present in both the Domain lexicon and the General lexicon, its polarity is that given by the Domain lexicon. If word1 is a negator such as "not", the polarity of word2 is the opposite of the polarity obtained for word2 alone. For example, in the phrase "not good", word1 is "not" and word2 is "good". The polarity of word2 is positive, but since word2 is prefixed by word1, i.e., "not", the polarity of the phrase is negative. We retain only those phrases that have a polarity and discard phrases that are neutral. We compute the strength of the semantic orientation of phrases using Equation 2:

SO(phrase) = log2 [ (hits(ws10(phrase, "excellent")) · hits("poor")) / (hits(ws10(phrase, "poor")) · hits("excellent")) ]   (2)
where SO is the semantic orientation. The seed words excellent and poor are anchored in the five-star review rating system. SO is computed by measuring the association of a phrase with these seed words in a web corpus: the search engine is queried, and the number of hits returned for the phrase occurring within a window size (ws) of ten words of a seed word is recorded.
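A minimal sketch of Equation (2); the hits function is a hypothetical stand-in for the search-engine queries just described (counting pages where the phrase occurs within a ten-word window of a seed word), not an actual API.

import math

def semantic_orientation(phrase, hits):
    """Equation (2): association of a phrase with the seed words excellent and poor.

    hits(query, near=seed) is assumed to return the number of search-engine hits
    for the query occurring within a ten-word window of the seed word, and
    hits(word) the plain hit count of a word.
    """
    num = hits(phrase, near="excellent") * hits("poor")
    den = hits(phrase, near="poor") * hits("excellent")
    # A small constant avoids division by zero for rare phrases.
    return math.log2((num + 0.01) / (den + 0.01))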
We used the Google search engine to find the semantic orientation of the extracted phrases and to compute the average semantic orientation of opinionated sentences. We chose Google because it indexes more pages than other search engines. Even though its index is dynamic in nature, we believe the semantic orientation of phrases obtained using Google reflects the diverse ways in which web users disseminate information. From the above example, we obtain SO(bad phone) = 4.20. The actual polarity of the phrase is negative, but the SO value obtained is positive. We shift the polarity of the phrase in consultation with the Sentiment Product Lexicon: we multiply the strength of the semantic orientation of a phrase by +1 if the phrase is positive, and by -1 if it is negative. The new value for our example is SO(bad phone) = -4.20. We compute the average semantic orientation of all phrases in an opinionated sentence, and do this for all the opinionated sentences in an opinionated text. We obtain the opinion of an opinionated text by considering a set of significant opinionated sentences. Edmundson [9] used key words, cue words, title words and structural indicators (sentence location) to identify significant sentences, which were used to convey to the reader the substance of the document. One of the methods proposed is the sentence location method, based on the hypothesis that sentences occurring under certain headings are positively relevant and that topic sentences tend to occur very early or very late in a document. The aforementioned study highlights the importance of sentence location. We empirically select the first opinionated sentence, the opinionated sentence with the maximum semantic orientation value, and the last opinionated sentence of the opinionated text as the set of significant opinionated sentences (S_OS):

S_OS = S_F + S_max + S_L    (3)
An opinionated text is classified as positive if the semantic orientation of its significant opinionated sentences, computed using Equation 3, is greater than a threshold, and as negative if it is less than the threshold. The threshold used to classify opinionated texts as positive or negative is 0.
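Assuming the per-sentence average semantic orientation values have already been computed and sign-corrected with SPL as described above, the classification step of Equation 3 reduces to a few lines. A sketch under that assumption:

def classify_text(sentence_scores, threshold=0.0):
    """Classify an opinionated text from the average SO values of its
    opinionated sentences, given in document order."""
    s_f = sentence_scores[0]        # first opinionated sentence
    s_l = sentence_scores[-1]       # last opinionated sentence
    s_max = max(sentence_scores)    # sentence with maximum SO value
    s_os = s_f + s_max + s_l        # Equation 3
    return "positive" if s_os > threshold else "negative"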
4 Experiments and Results

We use the approach discussed above to find an opinion from a few significant opinionated sentences. We compare the result of our approach with the approach discussed in [14] as well as with the approach discussed in [6]. We implemented both approaches to find the opinion of a user on three data sets. The first approach aims to find the opinion from all the sentences of an opinionated text. It makes use of the Google search engine to find the semantic orientation of phrases, instead of the Altavista search engine used in [14], since Altavista no longer supports proximity search. The second approach, discussed in [6], finds the opinion of a user on the features of a product by considering only adjectives as potential opinion words. The polarity of the opinion words is determined from a seed list. Here, we do not identify product features, but use the same method to obtain an opinion from all sentences that contain an adjective in
an opinionated text. We use SPL to determine the polarity of opinion words rather than the original seed list used in [6]. Table 2 shows the results of the two approaches discussed in [14] and [6] on data set 1 through data set 6. We compute the classification accuracy by dividing the sum of true positives and true negatives by the total number of items to be classified.

Table 2. Results of Approaches

Sl. no.  Approach  Data set    Accuracy
1        [14]      Data set 1  71.85%
2        [14]      Data set 2  65.75%
3        [14]      Data set 3  65.42%
4        [14]      Data set 4  68.80%
5        [14]      Data set 5  75.14%
6        [14]      Data set 6  73.80%
7        [6]       Data set 1  74%
8        [6]       Data set 2  76.06%
9        [6]       Data set 3  73.56%
10       [6]       Data set 4  78.16%
11       [6]       Data set 5  80.30%
12       [6]       Data set 6  78%
Table 3. Results of Our Approaches

Sl. no.  Data set    S_F      S_L      S_F + S_L  S_max    S_OS
1        Data set 1  44.28%   47.14%   45.71%     75.71%   77.14%
2        Data set 2  46.29%   48.89%   47.59%     80.87%   81.24%
3        Data set 3  27.54%   43.52%   35.53%     71.01%   69.59%
4        Data set 4  65.65%   65.65%   65.65%     84.37%   87.5%
5        Data set 5  45.26%   50.52%   47.89%     82.10%   84.21%
6        Data set 6  62.22%   55.55%   58.88%     82.22%   86.77%
We conducted a series of experiments to find the opinion from the sentences of an opinionated text. As already mentioned, our intuition is that a few users express their actual opinion of the product in the first sentence before elaborating on it in the remaining sentences. A few other users express their actual opinion of the product in the last sentence, after initially expressing their opinion on different features of the product. The remaining users document their opinion in a sentence other than the first and last sentences. Table 3 shows the results of our experiment on the different data sets. We started the experiment by finding the opinion of the user from the first sentence of the opinionated text. We obtain accuracies of 44.28%, 46.29%, 27.54%, 65.65%, 45.26% and 62.22% for data set 1 through data set 6. Accuracies of 47.14%, 48.89%, 43.52%, 65.65%, 50.52% and 55.55% were obtained while considering the user's opinion
Fig. 1. Accuracy of Our Approach for Different values of W
only from the last sentence of the opinionated texts of the different data sets. We achieved accuracies of 45.71%, 47.59%, 35.53%, 65.65%, 47.89% and 58.88% when considering the opinion from the first and last sentences of the opinionated texts. Next, we used the sentence of an opinionated text with the maximum semantic orientation value to find the user's opinion, obtaining accuracies of 75.71%, 80.87%, 71.01%, 84.37%, 82.10% and 82.22% on the different data sets. After obtaining the results on the first sentence, the last sentence and the sentence with the maximum semantic orientation value, we compute the accuracy of the user's opinion using Equation 3. Accuracies of 77.14%, 81.24%, 69.59%, 87.5%, 84.21% and 86.77% were obtained on the different data sets. The first and last sentences of a few opinionated texts in data set 3, in spite of recording the actual user's opinion, had smaller semantic orientation values for their opinionated phrases. These values become insignificant in the presence of a sentence with the maximum semantic orientation value. This factor contributed to a loss of accuracy on data set 3 using Equation 3 as compared to the accuracy of S_max. We therefore reformulated Equation 3 to capture the user's opinion from first and last sentences with smaller semantic orientation values:

S_OS = S_max + W(S_F + S_L)    (4)
We experimented with different values of W. Figure 1 shows the accuracy of our results for different values of W. We observe that good accuracy is obtained for W = 10 and that the accuracy remains consistent for higher values of W. Figures 2, 3, 4, 5, 6 and 7 show the accuracy of our approach using Equation 4 against the other implemented approaches on the different data sets. The results obtained by our approach using Equation 4 are better than the results documented in Table 2 and Table 3. We observe from Figure 4 that the result obtained by the approach discussed in [6] is better than our approach on data set 3. We obtain an average accuracy of 73.56% with the approach discussed in [6], with a positive accuracy of 94.28% and a negative accuracy of 52.85%. Our approach provides an average accuracy of 69.59%, with a positive accuracy of 67.14% and a negative accuracy of 72.05%. Our approach thus provides a more balanced result than the approach discussed in [6] on the same data set. It is also better than the average accuracy of 69.3% recorded in [3] on data set 3.
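For completeness, a sketch of the reformulated score of Equation 4; the sweep over candidate values of W mirrors the experiment summarized in Figure 1, and the score list in the example is a placeholder.

def classify_weighted(sentence_scores, w=10.0, threshold=0.0):
    """Classification with the weighted score of Equation 4."""
    s_f, s_l = sentence_scores[0], sentence_scores[-1]
    s_max = max(sentence_scores)
    s_os = s_max + w * (s_f + s_l)   # Equation 4
    return "positive" if s_os > threshold else "negative"

for w in (1, 5, 10, 20, 50):
    print(w, classify_weighted([0.4, -1.2, 2.3, 0.1], w=w))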
Fig. 2. Accuracy of Three Approaches on Data Set 1
Fig. 3. Accuracy of Three Approaches on Data Set 2
Fig. 4. Accuracy of Three Approaches on Data Set 3
Fig. 5. Accuracy of Three Approaches on Data Set 4
Fig. 6. Accuracy of Three Approaches on Data Set 5
Fig. 7. Accuracy of Three Approaches on Data Set 6
5 Conclusion

We have conducted a set of experiments with our approach to find the opinion of web users from a fixed number of significant opinionated sentences in an opinionated text. The results obtained from the experiments are encouraging. They also highlight the importance of sentence position in detecting the opinion of an opinionated text. We use the Sentiment Product Lexicon to remove neutral phrases and to shift the polarity of a few phrases based on some heuristics. Our proposed approach using Equation 4 provides better results than considering all sentences of an opinionated text. Our approach provides better results on data sets comprising both balanced and unbalanced positive and negative opinionated texts.
References

1. Andrea, E., Fabrizio, S.: Determining term subjectivity and term orientation for opinion mining. In: Proceedings of 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy (2006)
2. Andrea, E., Fabrizio, S.: Determining the semantic orientation of terms through gloss classification. In: Proceedings of 14th ACM International Conference on Information and Knowledge Management, Bremen, Germany, pp. 617–624 (2005)
3. Alistair, K., Diana, I.: Sentiment Classification of Movie and Product Reviews Using Contextual Valence Shifters. In: Proceedings of FINEXIN 2005, Workshop on the Analysis of Informal and Formal Information Exchange during Negotiations, Canada (2005)
4. Anil Kumar, M.K., Suresha: Identifying Subjective Phrases From Opinionated Texts Using Sentiment Product Lexicon. International Journal of Advanced Engineering & Applications 2, 63–271 (2010)
5. Anil Kumar, M.K., Suresha: Detection of Neutral Phrases and Polarity Shifting of Few Phrases for Effective Classification of Opinionated Texts. International Journal of Computational Intelligence Research 6, 43–58 (2010)
6. Bing, L., Minqing, H., Junsheng, C.: Opinion Observer: Analyzing and Comparing Opinions on the Web, Chiba, Japan (2005)
7. Bo, P., Lee, L.: A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In: Proceedings of 42nd Meeting of the Association for Computational Linguistics, Barcelona, Spain, pp. 271–278 (2004)
8. Bo, P., Lee, L., Shivakumar, V.: Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of 7th Conference on Empirical Methods in Natural Language Processing, Philadelphia, US, pp. 79–86 (2002)
9. Edmundson, H.P.: New Methods in Automatic Extracting. Journal of the Association for Computing Machinery 16(2) (1969)
10. Review centre, http://www.reviewcentre.com/
11. Hugo: MontyLingua: An end-to-end natural language processor with common sense (2003)
12. Jaap, K., Maarten, M., Robert, J., Mokken, M., De, R.: Using WordNet to measure semantic orientation of adjectives. In: Proceedings of 4th International Conference on Language Resources and Evaluation, Lisbon, Portugal, pp. 1115–1118 (2004)
13. Maite, T., Jack, G.: Analyzing appraisal automatically. In: Proceedings of the AAAI Symposium on Exploring Attitude and Affect in Text: Theories and Applications, California, US (2004)
14. Turney, P.D.: Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In: Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, US, pp. 417–424 (2002)
15. Turney, P.D., Littman, M.L.: Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems, 315–346 (2003)
16. Owsley, S., Sood, S., Hammond, K.J.: Domain specific affective classification of documents. In: Proceedings of AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs, California, US (2006)
17. Stone, P.J.: Thematic text analysis: New agendas for analyzing text content. In: Roberts, C. (ed.) Text Analysis for the Social Sciences. Lawrence Erlbaum, Mahwah (1997)
18. Theresa, W., Janyce, W., Paul, H.: Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. In: Proceedings of HLT/EMNLP, Vancouver, Canada (2005)
19. Vasileios, H., McKeown, K.R.: Predicting the semantic orientation of adjectives. In: Proceedings of 35th Annual Meeting of the Association for Computational Linguistics, Madrid, Spain, pp. 174–181 (1997)
20. Wang, A.: Modifying SO-PMI for Japanese Weblog Opinion Mining by Using a Balancing Factor and Detecting Neutral Expressions. In: Proceedings of NAACL HLT 2007, New York, US, pp. 189–192 (2007)
21. Youngho, K., Sung, H.M.: Opinion Analysis based on Lexical Clues and their Expansion. In: Proceedings of NTCIR-6 Workshop Meeting, Tokyo, Japan, pp. 308–315 (2007)
22. http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
Extracting Concerns and Reports on Crimes in Blogs

Yusuke Abe1, Takehito Utsuro1, Yasuhide Kawada2, Tomohiro Fukuhara3, Noriko Kando4, Masaharu Yoshioka5, Hiroshi Nakagawa6, Yoji Kiyota6, and Masatoshi Tsuchiya7

1 University of Tsukuba, Tsukuba, 305-8573, Japan
2 Navix Co., Ltd., Tokyo, 141-0031, Japan
3 National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
4 National Institute of Informatics, Tokyo, 101-8430, Japan
5 Hokkaido University, Sapporo, 060-0814, Japan
6 University of Tokyo, Tokyo, 113-0033, Japan
7 Toyohashi University of Technology, Toyohashi, 441-8580, Japan
Abstract. Among other domains and topics on which some issues are frequently argued in the blogosphere, the domain of crime is one of the most seriously discussed by various kinds of bloggers. Such information on crimes in blogs is especially valuable for those who are not familiar with tips for preventing being victimized. This paper proposes a framework of extracting people’s concerns and reports on crimes in their own blogs. In this framework, we solve two tasks. In the first task, we focus on experts in crime domain and address the issue of extracting concerns on crimes such as tips for preventing being victimized. In the evaluation of this first task, we show that we successfully rank expert bloggers high in the results of blog feed retrieval. In the second task, on the other hand, we focus on victims of criminal acts, and address the issue of extracting reports on being victimized. In the evaluation of this second task, we show that we outperform blog post ranking based on the search engine API by incorporating dependency relations for identifying victims’ blog posts.
1 Introduction
Weblogs, or blogs, are a form of personal journal and of market or product commentary. While traditional search engines continue to discover and index blogs, the blogosphere has produced custom blog search and analysis engines, systems that employ specialized information retrieval techniques. With respect to blog analysis services on the Internet, there are several commercial and non-commercial services such as Technorati, BlogPulse [1], kizasi.jp (in Japanese), and blogWatcher (in Japanese) [2]. With respect to multilingual blog services, Globe of Blogs provides a retrieval function for blog articles across languages. Best Blogs in Asia Directory also provides a retrieval function for Asian-language blogs. Blogwise also analyzes multilingual blog articles.
Fig. 1. Overall Framework of Extracting Concerns and Reports on Crimes in Blogs
Among other domains and topics on which some issues are frequently argued in the blogosphere, the domain of crime is one of the most seriously discussed by various kinds of bloggers. One type of such bloggers are those who have expert knowledge in crime domain, and keep referring to news posts on criminal acts in their own blogs. Another type of bloggers who have expert knowledge also often post tips for how to prevent certain criminal acts. Furthermore, it is surprising that victims of certain criminal acts post blog articles on their own experiences. Blog posts by such various kinds of bloggers are actually very informative for both who are seeking for information on how to prevent certain criminal acts and who have been already victimized and are seeking for information on how to solve their own cases. Such information on crimes in blogs is especially valuable for those who are not familiar with tips for preventing being victimized. Based on this observation, this paper proposes a framework of extracting people’s concerns and reports on crimes in their own blogs. The overview of the proposed framework is shown in Figure 1. In this paper, we focus on those which represent relatively narrow range of concepts of criminal acts, namely, “fraud” and “Internet crime”. We also extract concerns and reports on crimes from English and Japanese blogs. In this framework, we solve two tasks. In the first task, we focus on experts in crime domain and address the issue of extracting concerns on crimes such as tips for preventing being victimized (to be presented in section 4). Its major component is designed as the blog feed retrieval procedure recently studied in the blog distillation task of TREC 2007 blog track [3]. In this first task, we regard blog feeds as a larger information unit in the blogosphere. We intend to retrieve blog feeds which roughly follow the criterion studied in the blog distillation task, which can be summarized as Find me a blog with a principle, recurring interest
in X. More specifically, in the blog distillation task, for a given target X, systems should suggest feeds that are principally devoted to X over the time span of the feed, and would be recommended to subscribe to as an interesting feed about X. In the evaluation of this first task, we show that we successfully rank expert bloggers high in the results of blog feed retrieval. In the second task, on the other hand, we focus on victims of criminal acts, and address the issue of extracting reports on being victimized (to be presented in section 5). In this second task, we propose a technique which is based on detecting linguistic expressions representing experiences of being a victim of certain fraud, such as “being a victim of ” and “deceived”. In the evaluation of this second task, we show that we outperform blog post ranking based on the search engine API by incorporating dependency relations for identifying victims’ blog posts. We also show that victims of criminal acts such as fraud and Internet crime sometimes post one or two articles to their own blogs just after they are victimized. In most cases, those victims do not keep posting articles related to those crimes, and hence, their blog feeds are not ranked high in the result of the first task of extracting expert bloggers in crime domain.
2 Related Works
There exist several works on studying cross-lingual analysis of sentiment and concerns in multilingual news [4,5,6,7], but not in blogs. [4] studied how to combine reports on epidemic threats from over 1,000 portals in 32 languages. [5] studied how to combine name mentions in news articles of 32 languages. [6] also studied mining comparative differences of concerns in news streams from multiple sources. [7] studied how to analyze sentiment distribution in news articles across 9 languages. Those previous works mainly focus on news streams and documents other than blogs. As another type of related work, [8,9] studied how to collect linguistic expressions which represent trouble situation, where Web documents including writer’s own trouble experiences such as blogs are used for evaluation. Their technique itself as well as the collected expressions representing trouble situation can be applicable to our task of identifying blog posts including blogger’s own experiences of being victimized in certain criminal acts.
3 Topics in the “Fraud / Internet Crime” Domain
In this paper, as topics in the domain of criminal acts, we focus on “fraud” and “Internet crime”. We first refer to Wikipedia (English and Japanese versions1) 1
http://{en,ja}.wikipedia.org/. The underlying motivation of employing Wikipedia is in linking a knowledge base of well known facts and relatively neutral opinions with rather raw, user generated media like blogs, which include less well known facts and much more radical opinions. We regard Wikipedia as a large scale ontological knowledge base for conceptually indexing the blogosphere. It includes about 3,321,00 entries in its English version, and about 682,000 entries in its Japanese version (checked at June, 2010).
(Note: “A term tx” (“a term ty”) in the nodes above indicates that ty is not listed as a Wikipedia entry, nor extracted from any of the Wikipedia entries, but translated from tx by Eijiro.)

Fig. 2. Topics and Related Terms in the “Fraud / Internet Crime” Domain

Table 1. Statistics of “Fraud / Internet Crime”

ID  Topic                  # of Cases (sent to the court in 2008)   # of Hits in the Blogosphere (checked at Sept. 2009)
                           U.S.A.    Japan                          English    Japanese
1   Internet fraud         72,940    N/A                            21,300     61,600
2   (Auction fraud)        18,600    1,140                          1,760      44,700
3   (Credit card fraud)    6,600     N/A                            43,900     8,590
4   (Phishing)             N/A       -                              479,000    136,000
5   Bank transfer scam     N/A       4,400                          30         349,000
6   Counterfeit money      N/A       395                            16,800     40,500
7   Cyberstalking          N/A       -                              20,300     32,100
8   Cyber-bullying         N/A       -                              38,900     45,700
and collect entries listed at the categories named as “fraud” and “Internet crime” as well as those listed at categories subordinate to “fraud” and “Internet crime”. Then, we require entry titles to have the number of hits in the blogosphere over 10,000 (at least for one language)2 . At this stage, for the category “fraud”, we 2
We use the search engine“Yahoo!” API (http://www.yahoo.com/) for English, and the Japanese search engine “Yahoo! Japan” API (http://www.yahoo.co.jp/) for Japanese. Blog hosts are limited to 2 for English (blogspot.com,wordpress.com) and 3 for Japanese (FC2.com,goo.ne.jp,Seesaa.net).
have 68 entries for English and 20 for Japanese, and for the category “Internet crime”, we have 15 entries for English and 8 for Japanese. Next, we manually examine all of those categories, and select those which exactly represent certain criminal acts. Then, for the category “fraud”, we have 14 entries for English and 10 for Japanese, and for the category “Internet crime”, we have about 6 entries for English and 5 for Japanese. Out of those entries, Figure 2 shows some sample topics3 . In the figure, the category “Internet fraud” is an immediate descendant of both “fraud” and “Internet crime”, where three entries listed at this category are selected as sample topics. The figure also shows samples of related terms extracted from those selected Wikipedia entries which are to be used for ranking blog feeds/posts in section 4.1. For those selected sample topics, Table 1 shows the number of cases actually sent to the court in U.S.A and in Japan4 . The table also shows the number of hits of those sample entry titles in the blogosphere.
4 Extracting Concerns on Crimes in Blogs by Applying General Framework of Expert Blog Feeds Retrieval
4.1 Retrieving and Ranking Blog Feeds/Posts
Out of the two tasks introduced in section 1, this section describes the first one. In this first task, we simply apply our general framework of expert blog feeds retrieval [10,11] to extracting expert bloggers in crime domain. First, in order to collect candidates of blog feeds for a given query, we use existing Web search engine APIs, which return a ranked list of blog posts given a topic keyword. We use the search engine “Yahoo!” API for English, and the Japanese search engine “Yahoo! Japan” API for Japanese. Blog hosts are limited to those listed in section 3. Then, we employ the following procedure for blog feed ranking: i) Given a topic keyword, a ranked list of blog posts are returned by a Web search engine API. ii) A list of blog feeds is generated from the returned ranked list of blog posts by simply removing duplicated feeds. We next automatically select blog posts that are closely related to the given query, which is a title of a Wikipedia entry. To do this, we first automatically extract terms that are closely related to each Wikipedia entry. More specifically, 3
4
For some of those selected samples, only English or Japanese term is listed as a Wikipedia entry, and the entry in the opposite language is not listed as a Wikipedia entry. In such a case, its translation is taken from an English-Japanese translation lexicon Eijiro (http://www.eijiro.jp/, Ver.79, with 1.6M translation pairs). Statistics are taken from the Internet Crime Complaint Center (IC3), U.S.A. (http://www.ic3.gov/), National Police Agency, Japan (http://www.npa.go.jp/english/index.htm), and NPA Japan Countermeasure against Cybercrime (http://www.npa.go.jp/cyber/english/index.html).
Table 2. Statistics of # of topic-related terms extracted from Wikipedia entries, and blog feeds/posts (English/Japanese)

ID  Topic                  # of topic-related terms from Wikipedia   # of blog feeds   # of blog posts
1   Internet fraud         182 / 76                                  48 / 60           1576 / 353
2   (Auction fraud)        24 / 36                                   40 / 38           224 / 121
3   (Credit card fraud)    28 / 181                                  50 / 31           1086 / 143
4   (Phishing)             172 / 63                                  49 / 118          8982 / 1118
5   Bank transfer scam     60 / 96                                   4 / 132           13 / 2617
6   Counterfeit money      175 / 84                                  41 / 96           186 / 695
7   Cyberstalking          33 / 29                                   49 / 39           727 / 242
8   Cyber-bullying         52 / 65                                   49 / 89           4278 / 613
from the body text of each Wikipedia entry, we extract bold-faced terms, anchor texts of hyperlinks, and the title of a redirect, which is a synonymous term of the title of the target page. Then, blog posts which contain the entry title or at least one of the extracted related terms as well as synonymous terms are automatically selected. For each of the sample topics shown in Figure 2, Table 2 shows the numbers of terms that are closely related to the topic and are extracted from each Wikipedia entry. Then, according to the above procedure, blog posts which contain the topic name or at least one of the extracted related terms (including synonymous terms) are automatically selected. Table 2 also shows the numbers of the selected blog feeds/posts. Finally, we rank the blog feeds/posts in terms of the entry title and the related terms (including synonymous terms) extracted from the Wikipedia entry.

– Blog posts are ranked according to the score

    score(post) = Σ_t weight(type(t)) × freq(t)

  where t is the topic name or one of the extracted related terms (including synonymous terms), and weight(type(t)) is defined as 3 when type(t) is the entry title or the title of a redirect, as 2 when type(t) is a bold-faced term, and as 0.5 when type(t) is an anchor text of a hyperlink to another entry in Wikipedia. Note that those weights are optimized with a development data set.
– Blog feeds are ranked according to the total scores for all the blog posts ranked above, where the total score for each blog post is calculated as above, in terms of the entry title and the related terms (including synonymous terms).
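A compact sketch of the two ranking rules above. Representing a post by the list of topic-related terms matched in it, and mapping each term to its Wikipedia-derived type, are simplifying assumptions of the sketch.

from collections import Counter, defaultdict

TYPE_WEIGHT = {"title_or_redirect": 3.0, "bold": 2.0, "anchor": 0.5}

def post_score(post_terms, term_type):
    """Sum over matched terms of weight(type(t)) * freq(t)."""
    freq = Counter(post_terms)
    return sum(TYPE_WEIGHT.get(term_type[t], 0.0) * n for t, n in freq.items())

def rank_feeds(posts, term_type):
    """Rank feeds by the total score of their selected posts.
    posts is a list of (feed_id, post_terms) pairs."""
    totals = defaultdict(float)
    for feed_id, terms in posts:
        totals[feed_id] += post_score(terms, term_type)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)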
4.2 Evaluation Results
This section shows the evaluation results of ranking expert bloggers in crime domain. Out of the sample topics shown in Figure 2, for “auction fraud” and “phishing” (for both English and Japanese), we manually examined top ranked 10 blog feeds. As shown in Table 3, for each topic and for each language, out of the top ranked 10 blog feeds, most are those of expert bloggers in crime domain. Here, as we introduced in section 1, when judging whether a blogger is an expert in crime domain, we employ the criterion studied in the blog distillation task. For the specific topics above, we have not quantitatively compared this result with that of the original rankings returned by search engine APIs. However, we have already reported in [10] that, with our general framework of expert blog feeds retrieval, we achieved improvement over the original rankings returned by “Yahoo!” API and “Yahoo! Japan” API. We then manually categorize those retrieved blog feeds into the following four types: (1) the blogger is a victim or one who personally knows a victim, (2) the blogger warns others with reference to news posts on criminal acts or to some other official sites on the Web, (3) the blogger introduces tips on how to prevent criminal acts, (4) the blogger is closely related to the given query topic, although none of the above three, such as only stating blogger’s own opinion. Here, for (1), we distinguish the two cases: the blogger himself/herself is a victim, or, the Table 3. Evaluation Results of Extracting Expert Bloggers on “Fraud” (English / Japanese) ratio of relevant blog feeds out of top ranked 10 feeds (%) Topic Auction fraud 90 / 90 Phishing 100 / 90 Table 4. Results of Manually Categorizing the Results of Top Ranked 10 Expert Bloggers on “Fraud” (%) (1) closely (2) referring related to to official victims sites blogger (3) introducing blog of personally other Topic news (4) rest prevention a victim knows Web sites tips a victim (a) English Auction fraud 0 0 30 40 70 0 Phishing 0 0 50 60 90 0 (b) Japanese Auction fraud 20 10 20 10 30 10 Phishing 0 0 40 30 70 0
blogger personally knows a victim. For (2), we distinguish the two cases: the blogger’s post is with reference to news posts, or, with reference to some other official sites on the Web. The results are shown in Table 4. It is important to note that we can collect many blog posts which refer to official sites such as news sites or which introduce prevention tips.
5 Extracting Reports on Crimes by Identifying Victims’ Blogs
5.1 Detecting Dependency Relations and Bunsetsu Patterns in Victims’ Blogs in Japanese
This section describes the task of extracting reports on being victimized from blogs. In this second task, we propose a technique which is based on detecting linguistic expressions representing experiences of being a victim of certain fraud, such as “being a victim of” and “deceived”. In this paper, we give the details of the technique as well as the evaluation results only for Japanese. However, its fundamental methodology is applicable to any other language by manually examining expressions representing experiences of being a victim of certain fraud.

Table 5. Expressions for Detecting Experiences of being a Victim of “Fraud”

Type of expressions                    Example                            Weight   # of expressions
dependency relation of two bunsetsus   sagi - au (“be victimized”),       10       19 (base form) + 84 (conjugated form)
                                       sagi - hikkakaru (“be scammed”)
single bunsetsu                        damasa-reta (“be deceived”)        2        13
                                       higai-todoke (“offense report”)    1        113
                                       keisatsu (“police”)                0.5      17
As shown in Table 5, expressions representing experiences of being a victim of certain fraud can be roughly decomposed into two classes: dependency relations of two bunsetsus and a single bunsetsu5 . A dependency relation of two bunsetsus can be regarded as much stronger evidence of representing experiences of being a victim than a single bunsetsu. Each expression is then assigned a certain weight, where those weights are considered when measuring the score of each blog post. We give each dependency relation of two bunsetsus a weight of 10, while we 5
In Japanese, a bunsetsu corresponds to a phrase in English such as a subject phrase, an object phrase, and a verb phrase. A bunsetsu consists of at least one content word and zero or more functional words. In this paper, we use KNP (http://www-lab25.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html) as a tool for bunsetsu analysis and dependency analysis of Japanese sentences.
Fig. 3. Evaluation Results of Extracting Reports on Crimes by Identifying Victims’ Blogs: (a) “auction fraud”, (b) “phishing”
give each single bunsetsu one of the weights 2, 1, and 0.5 based on intuition and empirical tuning. Finally, blog posts are ranked according to the score

    score(post) = Σ_{e∈D} weight(type(e)) + ( Σ_{e∈S} weight(type(e)) ) / (# of words in the blog post)

where D is the set of matched dependency relations of two bunsetsus, S is the set of matched single bunsetsus, and e is an expression in D or S. Single bunsetsus are much weaker evidence and are frequently detected even in posts from bloggers who are not victims; we therefore normalize the sum of the scores of single bunsetsus by the length of each blog post. For the topics “auction fraud” and “phishing”, we collected 20 and 3 blog posts for training, respectively, and manually extracted 103 dependency relations of bunsetsus as well as 143 single bunsetsus in total (as shown in Table 5).
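A minimal sketch of this scoring scheme; the function and argument names are assumptions, and the matched expressions in the usage example are hypothetical.

def victim_score(dep_matches, single_matches, num_words, dep_weight=10.0):
    """Matched dependency-relation expressions each contribute dep_weight;
    matched single-bunsetsu expressions contribute their weights (2, 1 or 0.5)
    normalized by the length of the post in words."""
    dep_part = dep_weight * len(dep_matches)
    single_part = sum(single_matches.values()) / max(num_words, 1)
    return dep_part + single_part

# Hypothetical post with two dependency-relation hits and two single-bunsetsu
# hits, 250 words long.
score = victim_score(["sagi - au", "sagi - hikkakaru"],
                     {"higai-todoke": 1.0, "keisatsu": 0.5}, 250)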
Fig. 4. Topics, Terms in Blogs, and Summaries of Blog Posts: Examples
5.2 Evaluation Results
In order to evaluate the technique presented in the previous section, we first collect 403 test blog posts for the topic “auction fraud” and 466 for the topic “phishing”6. Those test blog posts are then ranked according to the score presented in the previous section. We manually examined whether the blogger of each of the top ranked 50 blog posts is actually a victim of the respective fraud, and plot the changes in precision as shown in Figure 3. Here, as the baseline, we simply show the original rankings returned by the “Yahoo! Japan” API with the same search query “topic name t AND higai (victimized)”.
We use the Japanese search engine “Yahoo! Japan” API (http://www.yahoo.co. jp/). Blog hosts are limited to 8 (FC2.com,yahoo.co.jp,ameblo.jp,goo.ne.jp, livedoor.jp,Seesaa.net,yaplog.jp,hatena.ne.jp). As the search query, we employ “topic name t AND higai (victimized)”.
As can be clearly seen from these results, the proposed technique drastically outperforms the baseline. For the topic “phishing”, especially, the proposed technique detects just a small number of victims’ blogs, none of which can be ranked high by the baseline. The difference in the performance between the two topics “auction fraud” and “phishing” is estimated to be mainly because the number of bloggers who are actually victims of “phishing” is much less than that for “auction fraud” in the Japanese blogosphere. Furthermore, with the proposed technique, sometimes it can happen that blog posts containing a dependency relation assumed to be an evidence of a victim’s blog are ranked high, even though the blogger is not a victim. We observed that such over-detection occurs when those dependency relations are embedded in adnominal clauses, and the bloggers are actually experts who introduce prevention tips. Both for the first task (presented in section 4) and the second task (presented in this section), Figure 4 shows sample summaries of retrieved blog feeds/posts which are categorized into (1) closely related to victims, (2) referring to official sites, and (3) introducing prevention tips, as in section 4.2. In the figure, samples of related terms extracted from Wikipedia entries (those shown in Figure 2) are marked. Characteristic terms included in the blog posts are also marked.
6 Conclusion
This paper proposed a framework of extracting people’s concerns and reports on crimes in their own blogs. In this framework, we solved two tasks. In the first task, we focused on experts in crime domain and addressed the issue of extracting concerns on crimes such as tips for preventing being victimized. We showed that we successfully ranked expert bloggers high in the results of blog feed retrieval. In the second task, on the other hand, we focused on victims of criminal acts, and addressed the issue of extracting reports on being victimized. We showed that we outperformed blog post ranking based on the search engine API by incorporating dependency relations for identifying victims’ blog posts. Future works include incorporating multilingual sentiment analysis techniques [12,13]. and automatic extraction of reports or experiences of victims of crimes.
References

1. Glance, N., Hurst, M., Tomokiyo, T.: Blogpulse: Automated trend discovery for Weblogs. In: WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics (2004)
2. Nanno, T., Fujiki, T., Suzuki, Y., Okumura, M.: Automatically collecting, monitoring, and mining Japanese weblogs. In: WWW Alt. 2004: Proc. 13th WWW Conf. Alternate Track Papers & Posters, pp. 320–321 (2004)
3. Macdonald, C., Ounis, I., Soboroff, I.: Overview of the TREC-2007 Blog Track. In: Proc. TREC-2007 (Notebook), pp. 31–43 (2007)
4. Yangarber, R., Best, C., von Etter, P., Fuart, F., Horby, D., Steinberger, R.: Combining Information about Epidemic Threats from Multiple Sources. In: Proc. Workshop: Multi-source, Multilingual Information Extraction and Summarization, pp. 41–48 (2007)
5. Pouliquen, B., Steinberger, R., Belyaeva, J.: Multilingual Multi-document Continuously-updated Social Networks. In: Proc. Workshop: Multi-source, Multilingual Information Extraction and Summarization, pp. 25–32 (2007)
6. Yoshioka, M.: IR Interface for Contrasting Multiple News Sites. In: Proc. 4th AIRS, pp. 516–521 (2008)
7. Bautin, M., Vijayarenu, L., Skiena, S.: International Sentiment Analysis for News and Blogs. In: Proc. ICWSM, pp. 19–26 (2008)
8. De Saeger, S., Torisawa, K., Kazama, J.: Looking for Trouble. In: Proc. 22nd COLING, pp. 185–192 (2008)
9. Torisawa, K., De Saeger, S., Kakizawa, Y., Kazama, J., Murata, M., Noguchi, D., Sumida, A.: TORISHIKI-KAI, an Autogenerated Web Search Directory. In: Proc. 2nd ISUC, pp. 179–186 (2008)
10. Kawaba, M., Nakasaki, H., Utsuro, T., Fukuhara, T.: Cross-Lingual Blog Analysis based on Multilingual Blog Distillation from Multilingual Wikipedia Entries. In: Proc. ICWSM, pp. 200–201 (2008)
11. Nakasaki, H., Kawaba, M., Yamazaki, S., Utsuro, T., Fukuhara, T.: Visualizing Cross-Lingual/Cross-Cultural Differences in Concerns in Multilingual Blogs. In: Proc. ICWSM, pp. 270–273 (2009)
12. Evans, D.K., Ku, L.W., Seki, Y., Chen, H.H., Kando, N.: Opinion Analysis across Languages: An Overview of and Observations from the NTCIR6 Opinion Analysis Pilot Task. In: Proc. 3rd Inter. Cross-Language Information Processing Workshop (CLIP 2007), pp. 456–463 (2007)
13. Wiebe, J., Wilson, T., Cardie, C.: Annotating Expressions of Opinions and Emotions in Language. Language Resources and Evaluation 39, 165–210 (2005)
Automatically Extracting Web Data Records

Dheerendranath Mundluru, Vijay V. Raghavan, and Zonghuan Wu

IMshopping Inc., Santa Clara, USA
University of Louisiana at Lafayette, Lafayette, USA
Huawei Technologies Corp., Santa Clara, USA
[email protected], [email protected], [email protected]
Abstract. It is essential for Web applications such as e-commerce portals to enrich their existing content offerings by aggregating relevant structured data (e.g., product reviews) from external Web resources. To meet this goal, in this paper, we present an algorithm for automatically extracting data records from Web pages. The algorithm uses a robust string matching technique for accurately identifying the records in the Webpage. Our experiments on diverse datasets (including datasets from third-party research projects) show that the proposed algorithm is highly effective and performs considerably better than two other state-of-the-art automatic data extraction systems. We made the proposed system publicly accessible in order for the readers to evaluate it. Keywords: Structured data extraction, Web content mining.
1 Introduction

It is often important for applications such as e-commerce portals and local search engines to enrich their existing content by aggregating relevant data displayed on external Websites. Such data to be aggregated is usually displayed as regularly structured data records. Fig. 1, for example, shows a sample Web page from USPS.com that displays a list of post office locations as regularly structured records (in dashed boxes). Each record displays attributes such as address, phone, and business hours. The records are displayed in one particular region of the Web page and are also formatted regularly, i.e., the HTML tags that make up these records are similar in syntax, thus making similar attributes appear in similar positions. Though some large Websites expose their data through APIs, data from most other sources will have to be extracted programmatically. Extracting structured data programmatically is a challenging task because [1]: (a) HTML tags only convey presentation details rather than the meaning of the Web page content, (b) different Web sources display information in different formats, (c) Web pages also include other content such as navigation links and sponsored results that need to be filtered out, (d) attributes present in one record may be missing from another record in the same Web page, (e) a few attributes may have multiple values and the number of such values for an attribute may also vary across records in the same Web page.
Fig. 1. Webpage displaying structured data
The area of research that specifically deals with extracting structured data from Web pages has been very well studied and is referred to as structured data extraction [1]. Structured data extraction algorithms are based on the assumption that structured data is rendered regularly usually in the form of records as shown in Fig. 1. Such algorithms typically build data extraction rules called wrappers for extracting structured data. Wrappers can be constructed manually, semi-automatically, or automatically. Manual approaches are tedious and are not scalable when there are a large number of sources to extract data from. Semi-automated methods are very popular and are referred to as wrapper induction techniques [2]. In such methods, a human first labels the target data to be extracted in a few training Web pages collected from the target resource. A machine learning algorithm then uses these training pages to learn a wrapper. The learned wrapper can then be used to extract target data from similar, unseen Web pages returned by the target resource. A limitation of this approach is the time involved in labeling the training pages, which may be performed only by trained users. Scalability can still be an issue if there are thousands of sources to handle. Nevertheless, based on our experience, wrapper induction techniques are still very effective as they can build robust wrappers with reasonable effort [3]. Finally, automatic methods can avoid the limitations of wrapper induction methods by constructing wrappers and extracting structured data using completely automated heuristic techniques [4][5][7][9]. These methods use a pattern discovery algorithm in combination with other heuristics such as HTML tags, visual cues and regular expressions for building wrappers and subsequently extracting target data. Though current automated approaches are very effective, we believe that still more significant improvements are needed for them to achieve near perfect accuracy when dealing with large number of
data sources. However, using such algorithms as a component in wrapper induction systems can provide the benefit of reducing the labeling time [3]. In this paper, we propose an algorithm called Path-based Information Extractor (PIE) for automatically extracting regularly structured data records from Web pages. PIE uses a robust approximate tree matching technique for accurately identifying the target records present in a Webpage. PIE filters out content present in other nonrelevant sections such as navigation links and sponsored results. Due to space constraints we only present the algorithm that identifies target records in a Web page and do not discuss the component that generates wrappers for subsequent fast extraction of data. Though the discussed algorithm is a prerequisite for constructing wrappers, it may still be used independently for data extraction tasks. The rest of the paper is organized as follows: In section 2, we will briefly discuss the related work. In section 3, we discuss the proposed algorithm in detail. In Section 4, we will present our experiment results. Finally, we conclude in section 5.
2 Related Work Automatic structured data extraction systems have been well studied in the past several years [4][5][7][9]. Of all the automatic extraction systems, MDR is most similar to ours [4]. It uses an approximate string matching algorithm for discovering records. However, as we will show in section 4, MDR performed poorly on all the datasets that we used in our experiments. It was not effective when the degree of regularity across the records was not very high. MDR was also not effective when the Web page contains several regions (e.g., advertisements) in addition to target region. It also makes a strong assumption about the regularity in the structure of records in a region. Specifically, MDR views each Web page as a DOM tree (referred to as tag-tree) and defines a region as two or more generalized nodes having the same parent node. A generalized node is in turn a combination of one or more adjacent nodes in the tree. The authors, in [4], make the strong assumption that all the generalized nodes, which collectively comprise a region have the same length i.e., each of them is composed of the same number of adjacent nodes. However, in many Web pages, generalized nodes forming a region may not have the same length. As will be discussed later, we address this issue by relaxing the definition for regularity in the structure of records. Through this work, we propose several important improvements over MDR. In section 4, we also compare PIE to another automatic data extraction system called ViNTs [5]. ViNTs relies on both visual cues (as appearing in a Web browser) and HTML tag analysis in identifying records and building wrappers. For example, it uses a visual cue-based concept called content line for identifying candidate records. A content line is defined as a group of characters that visually form a horizontal line in the same region on the rendered page. A blank line is an example of a content line. Though ViNTs is very effective on result pages of search engines, we found in our experiments that its accuracy reduces significantly when handling Web pages in vertical portals such as content review sites.
3 Proposed Algorithm 3.1 Observations and Overview PIE is based on the following three important observations. The first two observations have been used by most prior research projects and we present them here from [4]. Observation 3 can be considered as an extension to Observation 1. Observation 1: “A group of data records that contains descriptions of a set of similar objects are typically presented in a contiguous region of a page and are formatted using similar HTML tags. Such a region is called a data record region.” Observation 2: “A group of similar data records being placed in a specific region is reflected in the tag-tree by the fact that they are under one parent node.” Observation 3: If a region consists of n records displayed contiguously, then the left most paths of the sub-trees corresponding to all n records under the parent are identical. First observation is straightforward and has been discussed through Fig. 1 in section 1. Since HTML source of every Web page is inherently a tree, according to Observation 2, records displayed in a region will also be present under the same parent in the tree. For example, the tag-tree in Fig. 5 displays five records R1-R5 (in dashed boxes), which are present under the same parent P. Similarly, as specified in Observation 3, the left most paths of the sub-trees of these records are also identical.
Fig. 2. Three-steps in the PIE algorithm
As depicted in Fig. 2, the PIE algorithm involves three steps for extracting records. The input to the algorithm is a Web page and the output is the target records extracted from the Web page. The first step (Parent Discoverer) is based on the second observation. It identifies a set of candidate parent nodes present in the tag-tree of the input Web page. One of the identified candidate parent nodes is expected to be the parent node of the target records. The second step (Record Discoverer) takes candidate parent nodes as the input and outputs a list of candidate regions along with the records
discovered in those regions. The Record Discoverer is based on the first and third observations. The third and final step (Target Region Selector) takes the different candidate regions and their records as input and identifies exactly one region as the target region. Records present in this region form the final output. A constraint for the record discovery process is that the input Web page should have at least K (2 in our case) target records. 3.2 Three-Step Record Discovery Process In this section, we discuss in detail the three steps in the PIE algorithm. Parent Discoverer. Parent Discoverer is based on Observation 2. It takes the input Web page and outputs a list of candidate parent nodes. One of the candidate parent nodes is expected to contain the target records. Parent Discoverer first builds a tagtree of the input Web page. An example tag-tree is shown in Fig. 3. The root of the tag-tree is and each tag node can be located by following a path from the root to the node. The system finds the candidate parent nodes by analyzing such paths. Two types of paths, called Relaxed Path (rpath) and Indexed Path (ipath), are used for very effectively identifying the candidate parent nodes. rpath and ipath are defined as follows. Definition 1 (Relaxed Path): If n1, n2… np-1, np are p tag nodes in a tag-tree T containing N tag nodes (p ≤ N), where n1 is the root of T and also parent of n2, n2 is parent of n3 and so on, then the relaxed path of np is defined as rpath(np) = n1.n2…np-1.np Definition 2 (Indexed Path): If n1, n2, n3 are 3 tag nodes in a tag-tree T containing N tag nodes (3 ≤ N), where n1 is the root of T, n2 is the ith immediate child of n1 and n3 is the jth immediate child of n2, then the indexed path of n3 is defined as ipath(n3) = n1.n2[i].n3[j] Fig. 4 shows the algorithm for discovering candidate parent nodes while the tagtree in Fig. 3 is used to illustrate the algorithm. Some of the nodes in the tree have been indexed for clarity. Let sub-trees of UL0, TABLE0 and TABLE1 in Fig. 3 represent three different regions in the Web page. After constructing a tag-tree of the input Web page, the algorithm constructs all unique rpaths and ipaths that lead to leaf nodes with textual content (line 2). This generates two unique rpaths and seven ipaths: rpath1: HTML.BODY.UL.LI.A rpath2: HTML.BODY.TABLE.TR.TD ipath1: HTML.BODY.UL0.LI0.A0 ipath2: HTML.BODY.UL0.LI1.A0 ipath3: HTML.BODY.TABLE0.TR0.TD0 ipath4: HTML.BODY.TABLE0.TR1.TD0 ipath5: HTML.BODY.TABLE0.TR2.TD0 ipath6: HTML.BODY.TABLE1.TR0.TD0 ipath7: HTML.BODY.TABLE1.TR1.TD0 In line 3, we map ipaths to rpaths, i.e., we group all ipaths having the same rpath. In our example, ipath3-ipath7 are mapped to rpath2 as their leaf nodes have the same rpath HTML.BODY.TABLE.TR.TD. Similarly, ipath1-ipath2 are mapped to rpath1. Assumption here is that due to regularity in the rendered structured data, rpaths of similar attributes in different records of a region are identical, but their ipaths differ
due to presence of index numbers. Assuming that each text leaf node corresponds to an attribute in a record, by grouping ipaths of these attributes with identical rpaths, we aim to conceptually group the records in a region. In line 4, we discard rpaths with less than K associated ipaths as the constraint to the algorithm is for the Web page to have at least K records. For each remaining rpath, we perform a pair-wise comparison of all its associated ipaths to generate a longest common prefix from each comparison (line 6). In our example, for rpath1, the longest common prefix generated by comparing ipath1 and ipath2 is HTML.BODY.UL0. Similarly, the longest common prefixes generated for rpath2 are: HTML.BODY, HTML.BODY.TABLE0 and HTML.BODY.TABLE1. A map of each longest common prefix and the list of ipaths that lead to its generation is maintained (line 7). In line 8, we discard longest common prefixes generated by less than K ipaths. Finally, trailing tags of the remaining longest common prefixes are returned as candidate parents.
Fig. 3. Sample tag-tree with three regions
Procedure: parentDiscoverer(Web page P)
1: construct tag-tree T of P
2: find all unique rpaths & ipaths in T leading to leaf nodes with text
3: map ipaths to rpaths
4: discard rpaths with less than K associated ipaths
5: for each rpath r do
6:   find longest-common-prefixes by doing pair-wise comparison of all ipaths of r
7:   M = map longest-common-prefix to ipath-list
8:   discard longest-common-prefixes generated by less than K ipaths
9: return trailing tag of longest-common-prefixes

Fig. 4. Parent discovery process
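To make the rpath/ipath machinery concrete, the following is a compact sketch of the parent discovery procedure over a toy tag-tree. It is not the authors' implementation: the tuple-based tree representation and the sibling-index naming (e.g., BODY0 for the first BODY child) are assumptions of the sketch.

from collections import defaultdict
from itertools import combinations

# Toy tag-tree: a node is (tag, children), where children is a list of nodes,
# or a text string for a leaf that carries text.
TREE = ("HTML", [("BODY", [
    ("UL", [("LI", [("A", "link 1")]), ("LI", [("A", "link 2")])]),
    ("TABLE", [("TR", [("TD", "row 1")]), ("TR", [("TD", "row 2")]),
               ("TR", [("TD", "row 3")])]),
])])

def collect_paths(node, name, rpath, ipath, out):
    """Collect (rpath, ipath) pairs for text leaves; the rpath drops sibling
    indices, the ipath keeps them (e.g. HTML.BODY0.TABLE0.TR1.TD0)."""
    tag, payload = node
    rpath = rpath + [tag]
    ipath = ipath + [name]
    if isinstance(payload, str):               # text leaf
        out.append((".".join(rpath), ipath))
        return
    seen = defaultdict(int)
    for child in payload:
        child_name = f"{child[0]}{seen[child[0]]}"
        seen[child[0]] += 1
        collect_paths(child, child_name, rpath, ipath, out)

def parent_discoverer(tree, k=2):
    """Return candidate parent nodes: trailing tags of longest common prefixes
    generated by at least k ipaths that share the same rpath."""
    leaves = []
    collect_paths(tree, tree[0], [], [], leaves)
    by_rpath = defaultdict(list)
    for rpath, ipath in leaves:
        by_rpath[rpath].append(ipath)
    candidates = set()
    for ipaths in by_rpath.values():
        if len(ipaths) < k:                    # lines 3-4 of Fig. 4
            continue
        prefix_support = defaultdict(set)
        for a, b in combinations(ipaths, 2):   # line 6: pair-wise comparison
            common = []
            for x, y in zip(a, b):
                if x != y:
                    break
                common.append(x)
            if common:
                prefix_support[tuple(common)].update((tuple(a), tuple(b)))
        for prefix, support in prefix_support.items():
            if len(support) >= k:              # line 8
                candidates.add(prefix[-1])     # line 9: trailing tag
    return candidates

print(parent_discoverer(TREE))                 # e.g. {'UL0', 'TABLE0'}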
Record discoverer. Record discoverer takes candidate parent nodes as input and outputs a list of candidate regions along with records discovered in those regions. Sometimes, a parent may have more than one region in its sub-tree. For Web pages that display only one record in each row (e.g., Fig. 1), we define a record as follows. Definition 3 (Record): A record in a region is represented wholly by one or more immediate contiguous child nodes (of a parent) and their corresponding sub-trees.
Record discoverer is based on Observations 1 and 3. Due to Observation 3, the goal is to discover identical left most paths of sub-trees corresponding to records under the parent. Such left most paths are nothing but rpaths with parent node as their first node in the rpath. Once such paths are identified for a region, extracting records from that region is trivial as a record exists between two successive paths. We refer to such paths as Record Identifying Paths (RIP) as they can separately identify the records in a region. A robust string matching algorithm based on edit distance [6] is used in PIE to identify such RIPs. Fig. 5 is used to illustrate the record discovery process. It displays a tag-tree with parent P and its children labeled 1-8. For clarity, numbers are used instead of HTML tag names. The tag-tree has eleven candidate RIPs and five records R1-R5 that are to be discovered. A candidate RIP is the left most path of every sub-tree under P.
Fig. 5. Record discoverer illustration
We begin our search with the first candidate RIP P.1.2, left most path under P. We check if it appears at least K times as the algorithm requires the input page to have at least K records i.e., if there are K records, then we also have at least K identical RIPs. Since P.1.2 appears less than 2 times (since K=2), we discard it and consider the next candidate RIP P.3.4.5. We consider P.3.4.5 for further processing as it appears 5 times. P.3.4.5 appearing in the last sub-tree under P is not considered as a candidate as it is not the left most path in its sub-tree. We next construct a tag-string using the tags between the first and second appearances of P.3.4.5 (shown in dashed arrows). If this tag-string is represented as an ordered pair (1,2), then (1,2) would be P.3.4.5.6.7. Tags appearing in the second appearance of P.3.4.5 are excluded in the tag-string. Since goal is to identify contiguous records, we construct another tag-string using the tags between second and third appearances of P.3.4.5, which is represented using (2,3). The pair (2,3) corresponds to the tag-string P.3.4.5.6.7. Since from Observation 1, contiguous records in a region are formatted using similar HTML tags, we use edit distance (ED) to capture the similarity between the two tag-strings by computing ED((1,2),(2,3)). Here, ED is defined as the smallest number of operations (insertion, deletion and substitution of tags) needed to convert one tag-string into another. If ED is less than or equal to a certain threshold, then the two tag-strings are considered similar and thus correspond to two contiguous records. ED threshold is calculated automatically depending on the
characteristics of the sub-tree under the parent [1]. Specifically, if the different instances of the candidate RIP appear at regular positions under the parent, then a small value (1 in our case) is used as the ED threshold. Otherwise, a higher value (7 in our case) is used as the threshold. We found that using a static threshold for all Web pages, as in MDR, reduces the extraction accuracy. Therefore, setting the ED threshold based on the sub-tree characteristics under the parent gives us more flexibility. For the current example, let us assume an ED threshold of 7. Since ED((1,2),(2,3)) is equal to 0, we consider the corresponding records as contiguous. We save the left ordered pair (1,2) as the first discovered record. We next compute ED((2,3),(3,4)), which is equal to 1 as the ordered pair (3,4) has one extra node 8. Since the newly computed ED is less than the ED threshold, we save (2,3) as the second record. We next save (3,4), as ED((3,4),(4,5)) is equal to 1. At this point, since we have no more appearances of P.3.4.5, we extract the tag-string starting at the last instance of P.3.4.5. The choice of the ending path for this last tag-string depends on the number of paths between the previous two appearances of P.3.4.5. Since there is only one path (P.6.7) between (4,5), there should also be only one path between the last appearance of the candidate RIP P.3.4.5 and the ending path. If the ending path happens to be part of a sub-tree 's' that is different from the sub-tree of the starting path (P.3.4.5), then we use the leftmost path in 's' as the ending path. But if the ending path falls outside the sub-tree under P, then we set the rightmost path under P as the ending path. In our case, the candidate ending path (P.3.4.5) is part of a different sub-tree, and since it is not the leftmost path in its sub-tree, we set the ending path to P.3.7. As a result, the last tag-string will be P.3.4.5. The ED between (4,5) and the last tag-string is 2. Therefore, both (4,5) and the last tag-string are saved as newly discovered records. Since the number of records discovered so far is greater than or equal to K, we consider that a region has been found. It can be seen that all saved ordered pairs reflect the target records R1 through R5. If the number of records discovered had been less than K, we would have discarded all the records discovered so far and repeated the process with the next immediate candidate RIP, P.6.7. Similarly, if there were more candidate RIPs appearing after the discovered region, we would continue the above process searching for new regions.

Virtual-tree Based Edit Distance (VED). The above record discovery process is effective on Web pages whose records display high regularity in their tag structures. It is, however, ineffective when records display less regularity. For example, it failed on the Web page displayed in Fig. 1, as the tag structures of the records were very dissimilar due to the absence of the "Business Hours" attribute (and its sub-tree) from the second record. To handle such complex cases, we designed a robust string matching function called the virtual-tree based edit distance (VED). VED considers the tree structure of one of the tag-strings to identify the similarity between the two tag-strings more effectively. After integrating VED, if the ED of two tag-strings is greater than the threshold, then we invoke VED with the tree structures of the two tag-strings. VED returns true if it considers the two input tree structures to be contiguous records.
Otherwise, it returns false. The VED algorithm is given in Fig. 6. The input to VED includes the tree structures corresponding to the two tag-strings along with the threshold (the same as the ED threshold). The larger of the two trees (in terms of the number of tag nodes) is the big-tree, while the other is the small-tree. The algorithm traverses each node of the big-tree, and at each node it creates a
new virtual tree (line 5) from the big-tree by deleting the node and its sub-tree. The tree traversal continues until either a newly created virtual tree is very similar to the small-tree (lines 6-7) or all the nodes in the big-tree have been traversed. Tree traversal proceeds from the leftmost path to the rightmost path (lines 1 and 9), and within each path we traverse from the leaf node upwards (lines 3 and 8) until we reach a node that was traversed earlier or we reach the root (line 4). If n and m represent the total numbers of tags in the big-tree and small-tree, then the complexity of VED is O(n²m). Though the complexity of the algorithm has increased, it should be noted that the record discovery process is usually only performed as part of the wrapper creation process, where some delay in extraction is not an issue.

Procedure: VED(big-tree, small-tree, threshold)
1: curr-path = getNextPath(big-tree)
2: while curr-path != NULL do
3:   curr-node = leaf(curr-path)
4:   while curr-node != root of big-tree && curr-node NOT YET traversed do
5:     virtual-tree = createNewTree(big-tree, curr-node)
6:     if ED(virtual-tree, small-tree) <= threshold then
7:       return true
8:     curr-node = parent(curr-node)
9:   curr-path = getNextPath(big-tree)
10: return false

Fig. 6. VED algorithm
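A rough Python sketch of the two matching functions is given below. It flattens sub-trees into tag-strings, computes the standard Levenshtein distance over tags, and emulates VED by deleting one candidate node (and its sub-tree) at a time; for brevity the traversal is a plain depth-first walk rather than the leftmost-to-rightmost, leaf-upward order of Fig. 6, so it is an approximation of the published algorithm rather than a faithful re-implementation.

class Node:
    def __init__(self, tag, children=None):
        self.tag, self.children = tag, children or []

def tags(tree):
    # flatten a (sub-)tree into its tag-string (pre-order list of tags)
    out = [tree.tag]
    for c in tree.children:
        out.extend(tags(c))
    return out

def edit_distance(s, t):
    # standard Levenshtein distance over tag sequences
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    return d[-1][-1]

def prune(tree, victim):
    # copy of the tree with the node `victim` and its sub-tree removed
    kept = [prune(c, victim) for c in tree.children if c is not victim]
    return Node(tree.tag, kept)

def ved(big, small, threshold):
    # delete one candidate node (and its sub-tree) at a time from the big-tree and
    # test whether the resulting virtual tree is close enough to the small-tree
    stack = list(big.children)       # every node except the root
    while stack:
        node = stack.pop()
        if edit_distance(tags(prune(big, node)), tags(small)) <= threshold:
            return True
        stack.extend(node.children)
    return False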
Fig. 7. Combination generation example
If both the ED and VED of a particular combination ((i,j),(j,k)) fail (i.e., ED > ED-threshold and VED returns false), then instead of stopping the record discovery process for the current candidate RIP, we still try other combinations for the pair (i,j). Fig. 7 shows how different combinations for a particular RIP are tried when the ED-VED of a particular combination results in a success S (ED ≤ ED-threshold or VED returns true) or a failure F (ED > ED-threshold and VED returns false). This combination generation makes the record discovery process even more effective.
Target region selector. The target region selector takes the candidate regions and their records as input and selects only one of them as the target region, whose records form the final output. It first selects the three best candidate regions from all input candidate regions and subjects them to additional processing to select the final target region. The total number of characters appearing in all the records of each candidate region is used as a heuristic to select the three best candidate regions. Usually the target region has the highest number of characters among all regions in a Web page, though occasionally other regions have more characters. Moreover, the record discoverer sometimes incorrectly constructs a region that has more characters than the target region. Four HTML features are then used to select the final target region from the three selected candidate regions: (a) anchor tag, (b) image tag, (c) bold text, and (d) presence of any text outside anchor tags. The system looks for these features in each record of a region. If a feature appears in at least half of the records in a region, we consider it a region feature. Among the three candidate regions, the candidate region with the highest number of region features is selected as the target region. If more than one candidate region has the same number of region features, the region with the maximum number of characters is selected as the target region. The motivation behind using the feature count is that the target region usually has the maximum number of the above features compared to all other regions.
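A compact sketch of this selection heuristic is shown below; the representation of a region as a dictionary holding its records, each with pre-computed boolean layout flags and its text, is purely hypothetical.

def char_count(region):
    return sum(len(r["text"]) for r in region["records"])

def feature_count(region):
    # a feature counts for the region if at least half of its records have it
    feats = ("has_anchor", "has_image", "has_bold", "has_text_outside_anchor")
    n = len(region["records"])
    return sum(1 for f in feats
               if sum(1 for r in region["records"] if r.get(f)) * 2 >= n)

def select_target_region(candidate_regions):
    # keep the three regions with the most text, then rank by region-feature
    # count, breaking ties by total character count
    top3 = sorted(candidate_regions, key=char_count, reverse=True)[:3]
    return max(top3, key=lambda reg: (feature_count(reg), char_count(reg)))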
4 Experiments

We conducted experiments on three datasets to evaluate PIE and compare it with MDR and ViNTs [4][5]. The first two datasets were randomly collected by us from the Web, while the third was taken from third-party data extraction research projects. As MDR returns all the regions in a Web page, only its performance on the target region is considered for evaluation. The PIE prototype system and the datasets used in this experiment are publicly accessible [10].

Evaluation Measures. Recall and precision, defined below, are used as the evaluation measures [5]: recall = Ec/Nt and precision = Ec/Et, where Ec is the total number of correctly extracted target records from all Web pages in the dataset, Nt is the total number of target records that appear in all Web pages in the dataset, and Et is the total number of target records extracted. Et includes target records that are extracted both correctly and incorrectly. An incorrect extraction involves partial extraction of a record, merging two or more records into a single record, or extracting content that appears outside the target region.

Experiment on Dataset 1. The first dataset includes 60 Web pages taken from 60 Web sources that mostly include general and special purpose search engines. Each source was given one query and the returned result page was included in our dataset. There were a total of 879 target records across all 60 Web pages. Results of the experiment on this dataset are summarized in Table 1. Detailed results of the three systems on all the Web pages used in this and the remaining two datasets are available at [10]. As we can see from Table 1, the performance of PIE is significantly better than that of MDR and marginally better than that of ViNTs.
Table 1. Summary of experiment on dataset 1
            PIE       ViNTs     MDR
#Correct    810       748       592
#Incorrect  45        54        176
Precision   94.73%    93%       77.08%
Recall      92.15%    85.09%    67.34%
Most of the Web pages on which MDR failed completely (extracted zero records) were result pages of general-purpose search engines that displayed multiple regions. Interestingly, MDR either failed completely or extracted all the records in a Web page perfectly.

Experiment on Dataset 2. While dataset 1 included mostly search engine result pages (as normally seen in the metasearch domain), dataset 2 includes 45 Web pages taken randomly from review sites and local search engines. Compared to dataset 1, Web pages in this dataset usually have distinct characteristics, such as displaying records with varying lengths and displaying several non-target regions. Using such diverse datasets has allowed us to evaluate the three systems more effectively. Results on this dataset, which has a total of 652 records, are summarized in Table 2. As we can see, PIE significantly outperformed both ViNTs and MDR. Unlike ViNTs, PIE performed consistently across datasets 1 and 2, showing that it is effective in handling diverse Web pages. ViNTs and MDR performed almost equally; ViNTs' precision was affected mainly because it wrongly extracted 889 records from one Web page.

Table 2. Summary of experiment on dataset 2
            PIE       ViNTs     MDR
#Correct    607       469       463
#Incorrect  21        1027      170
Precision   96.65%    31.35%    73.14%
Recall      93.09%    71.93%    71.01%
Experiment on Dataset 3. Dataset 3 includes a total of 58 Web pages taken from several prior third-party data extraction projects. This dataset was chosen to avoid any bias in our experiments. Like dataset 1, Web pages in this dataset are also mostly search engine result pages. Of the 58 Web pages, 40 were taken from Omini [7], 10 from RISE [8], and 8 from RoadRunner [9]. Results on this dataset, which had a total of 1623 records, are summarized in Table 3. As we can see, PIE once again outperformed MDR. In terms of precision, PIE performed considerably better than ViNTs. In terms of recall, the two performed almost equally, although PIE was higher; the difference was mainly due to the failure of ViNTs on one particular Web page that had a large number of records.

Table 3. Summary of experiment on dataset 3
            PIE       ViNTs     MDR
#Correct    1467      1338      928
#Incorrect  68        315       161
Precision   95.57%    80.94%    85.21%
Recall      90.38%    82.43%    57.17%
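As a sanity check, the precision and recall values reported in Tables 1-3 follow directly from the definitions recall = Ec/Nt and precision = Ec/Et, with Et = #Correct + #Incorrect; for example, the Table 3 figures can be reproduced (up to rounding) with the short computation below.

# counts from Table 3 (dataset 3, 1623 target records in total)
results = {"PIE": (1467, 68), "ViNTs": (1338, 315), "MDR": (928, 161)}
n_target = 1623
for system, (correct, incorrect) in results.items():
    precision = correct / (correct + incorrect)   # Ec / Et
    recall = correct / n_target                   # Ec / Nt
    print(f"{system}: precision = {precision:.2%}, recall = {recall:.2%}")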
5 Conclusion and Future Work

In this paper, we proposed an algorithm for automatically extracting data records from Web pages. The algorithm is based on three important observations about the regularity in displaying records and uses a robust string matching technique to accurately identify the records. Specifically, we propose an approximate tree matching algorithm to effectively handle Web pages containing records with greater structural variation. Experiments on diverse datasets showed that the proposed system, PIE, is highly effective and performed considerably better than two other state-of-the-art automatic record extraction systems. In the future, we would like to extend this work by: (1) further improving the algorithm to extract records more effectively and efficiently, and (2) automatically extracting attributes from each record.
References
[1] Mundluru, D.: Automatically Constructing Wrappers for Effective and Efficient Web Information Extraction. PhD thesis, University of Louisiana at Lafayette (2008)
[2] Muslea, I., Minton, S., Knoblock, C.: A Hierarchical Approach to Wrapper Induction. In: Proceedings of the 3rd International Conference on Autonomous Agents, Seattle, pp. 190–197 (1999)
[3] Mundluru, D., Xia, S.: Experiences in Crawling Deep Web in the Context of Local Search. In: Proceedings of the 5th Workshop on Geographical Information Retrieval, Napa Valley (2008)
[4] Liu, B., Grossman, R., Zhai, Y.: Mining Data Records in Web Pages. In: Proceedings of the ACM International Conference on Knowledge Discovery & Data Mining, Washington, D.C., pp. 601–606 (2003)
[5] Zhao, H., Meng, W., Wu, Z., Raghavan, V., Yu, C.: Fully Automatic Wrapper Generation for Search Engines. In: Proceedings of the 14th International World Wide Web Conference, Chiba, pp. 66–75 (2005), http://www.data.binghamton.edu:8080/vints/
[6] Hall, P., Dowling, G.: Approximate String Matching. ACM Computing Surveys 12(4), 381–402 (1980)
[7] Buttler, D., Liu, L., Pu, C.: A Fully Automated Extraction System for the World Wide Web. In: Proceedings of the International Conference on Distributed Computing Systems, Phoenix, pp. 361–370 (2001)
[8] RISE: A Repository of Online Information Sources Used in Information Extraction Tasks, University of Southern California (1998), http://www.isi.edu/info-agents/RISE/index.html
[9] Crescenzi, V., Mecca, G., Merialdo, P.: RoadRunner: Towards Automatic Data Extraction from Large Web Sites. In: Proceedings of the 27th International Conference on Very Large Data Bases, Rome, pp. 109–118 (2001)
[10] PIE Demo System, http://www.fatneuron.com/pie/
Web User Browse Behavior Characteristic Analysis Based on a BC Tree

Dingrong Yuan and Shichao Zhang
The International WIC Institute, Beijing University of Technology, Beijing 100022, China
College of Computer Science, Guangxi Normal University, Guilin 541004, China
{dryuan,Zhangsc}@mailbox.gxnu.edu.cn
Abstract. Analysis of Web user browse behavior characteristics is a key technology in domains such as initiative Web information retrieval and information recommendation. Taking into account the layout of Web pages, in this paper we construct a user browse behavior characteristic (BC) tree based on the browsing history of Web users, and then establish a new approach for analyzing Web user BC trees. This method reveals the topics a user is interested in, the extent and depth of these topics, as well as the frequency with which hierarchic block paths on Web pages are accessed. We illustrate the efficiency with experiments and demonstrate that the proposed approach is promising for initiative Web information retrieval and information recommendation.
1 Introduction
Behavior characteristic analysis of user browsing is an important topic in Web intelligence [12,11]. For example, merchants recommend products to clients according to their purchasing habits [2,5,6]. Recommendation systems provide information according to user browsing behavior [7]. Web intelligence behavior has been studied extensively together with brain informatics [4,13]. Enterprises and companies identify potential clients by user access behavior patterns [14]. Consequently, understanding the behavior characteristics of users is crucial in many areas, such as document pre-sending, information recommendation systems, and recognizing potential clients. Among existing techniques, initiatively retrieving the information required by users, called initiative information retrieval, has become a promising and active research topic in information retrieval. Initiative information retrieval is essentially different from traditional information retrieval because such a retrieval system initiatively forms multidimensional retrieval conditions according to the behavior characteristics of a user. Traditional retrieval systems, however, acquire the required characteristics from session information contained in user logs located on a server. These characteristics tend to lack generality, or are specific to a particular site. In particular, the session information is unstructured, which makes it difficult to use in real applications.
This paper is focused on the issue of initiative information retrieval. The needed hierarchical information is extracted from the layout of a Web page so as to construct a BC (behavior characteristic) tree. To identify potentially useful information from the BC tree, a new approach is designed for analyzing Web user behavior. The rest of this paper is organized as follows: Section 2 briefly recalls the main related work. Section 3 describes the construction of a behavior characteristic tree based on the browsing history of Web users. Section 4 provides methods for analyzing browsing behavior and acquiring the behavior characteristics of Web users. Section 5 evaluates the efficiency with experiments. Finally, the paper is concluded in Section 6.
2 Related Work
Related work includes browse behavior modeling and analysis, Web page analysis and hierarchical information extraction, and Web user access behavior pattern discovery. Behavior is a representation of intelligence, recorded by data. Extracting useful patterns or knowledge from such behavior data requires data mining techniques. However, most existing data mining algorithms and tools stop at discovered models, which require human experts to do some post-processing. Yang and Cao have upgraded data mining to behavior analysis [1,9]. They established a basic framework of behavior analysis, constructed a behavior model including plan, object and action, and analyzed the behavior model to obtain the direction of the next behavior. Yang suggested moving customers from an undesired status to a desired one to maximize the objective function [8]. Cao developed a technique for exploring frequent impact-oriented activity patterns, impact-contrasted sequential activity patterns and impact-reversed sequential activity patterns [1]. A Web page can be divided into different blocks by its layout. Each block holds a topic. In fact, a block is an information class on a page, and its class name is the topic of the block. Information items on a Web page belong to one block or another, but usually come mixed with useless information called garbage or junk. Therefore, Web information preprocessing is becoming more important than before. For example, Wang et al. developed a technique to analyze the layout characteristics of Web pages [7]. Song et al. designed algorithms to extract the topics of blocks on a Web page, perform version comparison, and so on [6]. All these works aim to structure Web page information and make it more suitable to access and process. Web user browse behavior analysis aims to find behavior characteristics from user browse records. Present techniques mainly mine access patterns from user logs and discover behavior characteristics on the basis of the vector space model and Markov models. For example, Zhou et al. mine access patterns in the light of the EOEM model to discover corresponding potential clients [14]. Zhu et al. pre-send documents according to time-sequence-related document requirements and a user session model [15]. Zhao obtained anonymous user browse behavior patterns from a session feature space generated from session information [10]. Present techniques of behavior characteristic analysis mainly consider session information and neglect the layout characteristics of information on Web pages.
Therefore, we propose a novel technique to analyze Web user browse behavior from the history records of Web users. The technique makes use of the layout characteristics of information items on a Web page.
3 Constructing a User Browse Characteristic (BC) Tree
There is a hierarchical structure in the layout of Web pages. A Web page can be parsed as a Web tree. Figure 1 gives a sample Web tree, in which there are four blocks in the page Mypage, and the block Sports includes three sub-blocks. Some definitions are given as follows:
Fig. 1. A Web page tree
Fig. 2. A BC tree
Definition 1. Suppose a user accesses information in the block Worldcup; the corresponding path in the Web tree is Mypage → Sports → Football → Worldcup → Drawresult. This path is called a user-browse-behavior path.

Definition 2. Let P_set = {p1, p2, ..., pn}, where pi (i = 1, 2, ..., n) is a user behavior path. A tree is constructed from the pi as follows. The root of the tree is labeled user. A node is defined as <topic, frequency, link>, where topic is the theme of a block, frequency is the visiting count of the block, and link is a pointer to the next node. For each pi, if pi shares no prefix with any branch of the tree, then pi is inserted into the tree as a new branch; otherwise the shared prefix is merged into that branch of the tree, and the differing part of pi is linked as a sub-branch of the prefix. Such a tree is called a browse behavior characteristic tree (BC tree for short). According to the behavior paths shown in Table 1, a BC tree can be constructed as shown in Fig. 2. Algorithm 1 is used to construct such a BC tree.
Table 1. User behavior path table

PID  Path
1    a → c → g
2    a → c → m → k
3    c → k
4    c → f → k → t
5    c → f → k → t
6    a → c → g
Algorithm 1. Constructing a BC tree
Input: a path set PSET
Output: a BC tree
Procedure:
1. Open a node and label it as BC
2. For 1 to |PSET|
   2.1 p = read PSET
   2.2 if p is a branch of BC or p shares a prefix with a branch of BC
       then frequency++ for all nodes on the path or on the shared prefix, and link the remaining part of the path into the branch of the prefix
       else link p in as a sub-branch of BC
   2.3 read the next path until NULL in PSET
3. Output the BC tree.

Definition 3. Let N = {node1, node2, ..., noden}, where nodei (i ∈ {1, 2, ..., n}) is a node of a BC tree. For ∀ε ≥ 0, if

  Sup(nodei) = support(nodei) / Σ_{j=1..n} support(nodej) = nodei.frequency / Σ_{j=1..n} nodej.frequency ≥ ε

then nodei is called a frequent node.

Definition 4. Let T ∈ BC, P ∈ T and P = node1 → node2 → ... → noden. For ∀ε ≥ 0, if every nodei (i ∈ {1, 2, ..., n}) of P satisfies support(nodei) ≥ ε, then P is called a frequent path of T.

Definition 5. Let tree = {tree1, tree2, ..., treem} be a BC tree, where treei is a branch of the tree, hi is the height of treei and di denotes the width of treei. Then

  h = (1/m) Σ_{i=1..m} hi,    d = (1/m) Σ_{i=1..m} di

where h denotes the height balance factor and d denotes the width balance factor. For ∀ε ≥ 0 and ∀i ∈ {1, 2, ..., m}, if |hi − h| ≤ ε the tree is called balanced in height, and if |di − d| ≤ ε the tree is called balanced in width; if a tree satisfies both conditions, we call it a balance BC tree.

Definition 6. Let tree = {tree1, tree2, ..., treem} be a BC tree, where treei is a branch of the tree, hi is the height of treei, di denotes the width of treei, and h and d denote the height and width balance factors of the tree, respectively. Then

  Sh = (1/m) Σ_{i=1..m} (hi − h)²,    Sd = (1/m) Σ_{i=1..m} (di − d)²

where Sh denotes the height deflection factor of the tree and Sd denotes the width deflection factor of the tree. For ∀ε ≥ 0, if ∃i ∈ {1, 2, ..., m} such that |hi − Sh| ≥ ε, the
tree is called deflected in height; if |di − Sd| ≥ ε, the tree is called deflected in width; and if a tree satisfies both conditions, the tree is called a deflection BC tree.

Theorem 1. Let topici be an interesting topic of a user and nodei a node of T; then topici ⇔ nodei. Let T = {topic1, topic2, ..., topicn}, where topici (i = 1, 2, ..., n) is an interesting topic of a user; I = {I1, I2, ..., In}, where Ii (i = 1, 2, ..., n) is an information block on a page; and N = {node1, node2, ..., noden}, where nodei (i = 1, 2, ..., n) is a node of T. According to the principle of browsing a Web page, ∃Ii = τ(topici). According to the principle of constructing a BC tree, ∃nodei = ζ(Ii). So we have topici ⇒ nodei. On the other hand, since both τ and ζ are reversible, τ⁻¹ and ζ⁻¹ exist and satisfy topici = τ⁻¹(Ii) = τ⁻¹(ζ⁻¹(nodei)). So we have nodei ⇒ topici. Therefore topici ⇔ nodei.

Theorem 2. Let p_treei (i = 1, 2, ..., n) be a branch path of a BC tree and l_topici an interesting-topic hierarchical path; then p_treei ⇔ l_topici. Let Layer_block = {l_block1, l_block2, ..., l_blockn} be the hierarchical paths of blocks in a Web page, and Lay_topic = {l_topic1, l_topic2, ..., l_topicn} be the hierarchical paths of the interesting topics of a user. According to the principle of browsing a Web page, ∃l_blocki = φ(l_topici). According to our strategy of constructing a BC tree, ∃p_treei = ψ(l_blocki). So we have l_topici ⇒ p_treei. On the other hand, since both φ and ψ are reversible, φ⁻¹ and ψ⁻¹ exist and satisfy l_topici = φ⁻¹(l_blocki) = φ⁻¹(ψ⁻¹(p_treei)). So we have p_treei ⇒ l_topici. Therefore p_treei ⇔ l_topici.

Theorem 3. A deflected branch of a BC tree is an interesting-information preference of the user. Let tree be a BC tree and treei a deflected branch of the tree with width dtreei and height htreei, and let Sd and Sh be the width and height deflection factors, respectively. According to Definition 6, we have |dtreei − Sd| ≥ ε and |htreei − Sh| ≥ ε. According to Theorems 1 and 2, treei is the Web user's interesting-information preference.
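To make the construction and the support measure of Definition 3 concrete, the following Python sketch builds the BC tree for the six behavior paths of Table 1 (topics abbreviated to single letters, as in the table); the node class and function names are illustrative, not part of the paper.

class BCNode:
    # BC-tree node <topic, frequency, children>, as in Definition 2
    def __init__(self, topic):
        self.topic, self.frequency, self.children = topic, 0, {}

def build_bc_tree(paths):
    root = BCNode("user")
    for path in paths:                # a path is a list of block topics
        node = root
        for topic in path:
            node = node.children.setdefault(topic, BCNode(topic))
            node.frequency += 1       # shared prefixes are merged, counts added
    return root

def nodes_of(node):
    for child in node.children.values():
        yield child
        yield from nodes_of(child)

def support(node, root):
    # Sup(node) of Definition 3: node frequency over the total frequency in the tree
    total = sum(n.frequency for n in nodes_of(root))
    return node.frequency / total

# the six behavior paths of Table 1
paths = [list("acg"), list("acmk"), list("ck"),
         list("cfkt"), list("cfkt"), list("acg")]
bc = build_bc_tree(paths)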
4 Discovering User Behavior Characteristics
On the basis of the above discussion, we designed three algorithms to mine user behavior characteristics from the BC tree, such as frequent paths, interesting topics, and interesting preferences. Algorithm 2 is used to find frequent paths. For example, if the frequency of the path sports → worldcup → football satisfies the support condition, we take the path as a frequent path.

Algorithm 2. Mining a frequent path
Input: a BC tree, ξ
Output: a frequent path
Procedure:
1. Calculate the support of all son-nodes of the BC tree and cut out all branches of the BC tree that satisfy support(son_node) ≤ ξ
2. Repeat step 1 until no branch is cut out
3. Go through the BC tree and output all frequent paths.
Algorithm 3 is used to discover the user behavior characteristic of an interesting topic. For example, a user who is interested in national and international news would browse news topics on the corresponding websites, perhaps the BBC website or Yahoo. As long as a page includes some news, the user will be interested in the page. This algorithm can help a user find websites of interest.

Algorithm 3. Mining an interesting topic
Input: a BC tree, ξ
Output: an interesting topic set
Procedure:
1. For every node in the BC tree, calculate support(node)
2. Merge nodes with the same name
3. For every node, if support(node) > ξ then insert the node into InterestingTopicSet
4. Output InterestingTopicSet.

Algorithm 4 is used to discover the interest preference of a user. This is an interesting result of our work. Usually, there are many sub-topics in one topic. For example, local news, national news and international news are sub-topics of news. Furthermore, international news includes news about the Middle East, America, Korea and Japan; Korean news includes military affairs, politics, etc. Two Web users may both be interested in this news, but one is more interested in news of Korean military affairs, while the other is only interested in the news in general.

Algorithm 4. Mining interesting preference from a deflection BC tree
Input: a deflection BC tree, ξ
Output: an interesting preference
Procedure:
1. Calculate d, h of every branch of the BC tree
2. Calculate Sd, Sh of the BC tree
3. For every branch of the BC tree:
   if |d − Sd| ≥ ξ:
     if |h − Sh| ≥ ξ, then output the deflection branch in d and h,
     else output the deflection branch only in d
   else:
     if |h − Sh| ≥ ξ, then output the deflection branch only in h,
     else the branch is not a deflection branch.
   Next branch.
End.
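The balance and deflection factors of Definitions 5 and 6, and the branch test of Algorithm 4, reduce to simple arithmetic once the height and width of each first-level branch are known; the sketch below operates directly on such (height, width) pairs rather than on a full BC tree, which is a simplification for illustration only.

def balance_factors(branches):
    # branches: list of (height, width) pairs, one per first-level branch
    m = len(branches)
    h_bar = sum(h for h, _ in branches) / m                 # height balance factor (Def. 5)
    d_bar = sum(d for _, d in branches) / m                 # width balance factor (Def. 5)
    s_h = sum((h - h_bar) ** 2 for h, _ in branches) / m    # height deflection factor (Def. 6)
    s_d = sum((d - d_bar) ** 2 for _, d in branches) / m    # width deflection factor (Def. 6)
    return h_bar, d_bar, s_h, s_d

def deflection_branches(branches, xi):
    # Algorithm 4: report branches deviating in height and/or width
    _, _, s_h, s_d = balance_factors(branches)
    report = []
    for i, (h, d) in enumerate(branches):
        in_h, in_d = abs(h - s_h) >= xi, abs(d - s_d) >= xi
        if in_h or in_d:
            report.append((i, in_h, in_d))
    return report

print(deflection_branches([(3, 2), (3, 2), (7, 6)], xi=2))   # [(2, True, True)]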
5 Experiments
In the experiments, we used Matlab as a tool and synthesized 3000 browse paths of a user. The experiments were run on a DELL PC with 2 GB main memory and a 2.6 GHz CPU. It took only two seconds to construct a BC tree, and all the other tasks,
which obtain behavior characteristics such as interesting topics, interesting information paths and topic preferences, could be finished within one second. The results of our experiments are listed in Tables 2, 3, 4 and 5. Table 2 shows the supports of frequent access paths. Table 3 shows the supports of interesting topics. Tables 4 and 5 show the width and height deflections.

Table 2. Frequent path supports
Table 3. Interesting topic supports
Table 4. Width deflections
Table 5. Height deflections
Based on the above discussion, we can instantly acquire characteristic information about Web user browse behavior, such as a user's interesting topics, frequent behavior paths and interest preferences. Furthermore, we can answer questions such as what the extent and depth of an interesting topic are, according to the interest preference. These behavior characteristics of Web users cannot be discovered by other techniques. The frequent path table tells us what the frequent behavior paths of a Web user on a Web page are. Such a path is a hierarchy in the layout of a page, not a link path in the network; it is, in fact, a hierarchic classification, and a node in the path is a class in the logical hierarchy. Therefore, our technique is practical and valuable for initiative retrieval and information recommendation.
6 Conclusion
Taking into account the layout of Web pages, we have constructed a BC tree from the browsing history of Web users. We then proved that the BC tree is
equivalent to user browse behavior. The BC tree is mined to discover user browsing behavior characteristics. The technique is valuable for Web document pre-sending, link recommendation, and personalized information retrieval. It can answer questions such as "what is the extent of an interesting topic?", "how deep is an interesting topic?" and "what are the frequent paths on a Web page?". Previous techniques only describe user characteristics on a specific site, and the characteristics they acquire are only link-related, not related to page layout. The layout characteristics classify the information items on a page, but they are neglected by other techniques.
References
1. Cao, L., Zhao, Y., Zhang, H., Luo, D., Zhang, C.: Mining impact-targeted activity patterns in imbalanced data. IEEE Transactions on Knowledge and Data Engineering 20(8), 1053–1066 (2008)
2. Cho, Y.B., Cho, Y.H., Kim, S.H.: Mining changes in customer buying behavior for collaborative recommendations. Expert Systems with Applications 28, 359–369 (2005)
3. Kobayashi, I., Saito, S.: A study on an information recommendation system that provides topical information related to user's inquiry for information retrieval. New Generation Computing 26(1), 39–48 (2008)
4. Ohshima, M., Zhong, N., Yao, Y.Y., Liu, C.: Relational peculiarity oriented mining. Data Mining and Knowledge Discovery 15(2), 249–273 (2007)
5. Park, Y.-J., Chang, K.-N.: Individual and group behavior-based customer profile model for personalized product recommendation. Expert Systems with Applications 36, 1932–1939 (2009)
6. Song, Q., Shepperd, M.: Mining Web browsing patterns for E-commerce. Computers in Industry 57, 622–630 (2006)
7. Wang, L., Meinel, C.: Behaviour recovery and complicated pattern definition in Web usage mining. In: Koch, N., Fraternali, P., Wirsing, M. (eds.) ICWE 2004. LNCS, vol. 3140, pp. 531–543. Springer, Heidelberg (2004)
8. Yang, Q., Yin, J., Ling, C., Pan, R.: Extracting actionable knowledge from decision trees. IEEE Transactions on Knowledge and Data Engineering 19(1), 43–56 (2007)
9. Zhang, H., Zhao, Y., Cao, L., Zhang, C., Bohlscheid, H.: Customer activity sequence classification for debt prevention in social security. J. Comput. Sci. and Technol. (2009)
10. Zhao, L., Zhang, S., Fan, X.: Anonymous user network browser feature mining. Computer Research and Development 39(12), 1758–1764 (2002)
11. Zhong, N., Liu, J., Yao, Y.Y., Wu, J., Lu, S., Li, K. (eds.): Web Intelligence Meets Brain Informatics. LNCS (LNAI), vol. 4845. Springer, Heidelberg (2007)
12. Zhong, N., Liu, J., Yao, Y.Y.: Envisioning Intelligent Information Technologies through the Prism of Web Intelligence. CACM 50(3), 89–94 (2007)
13. Zhong, N., Liu, J. (eds.): Intelligent Technologies for Information Analysis. Springer, Heidelberg (2004)
14. Zhou, B., Wu, Q., Gao, H.: On model and algorithms for mining user access patterns. Computer Research and Development 36(7), 870–875 (1999)
15. Zhu, P., Lu, X., Zhou, X.: Based on customer behavior patterns Web document present. Journal of Software 10(11), 1142–1148 (1999)
Clustering Web Users Based on Browsing Behavior

Tingshao Zhu
School of Information Science and Engineering
Graduate University of Chinese Academy of Sciences
Beijing 100190, P.R. China
Abstract. It is critical to acquire Web users' behavior models in the E-commerce community. In this paper, we propose to learn web users' browsing behavior and to cluster web users based on that behavior. In particular, our method uses page-content information extracted from the user's click stream and then trains a behavior model that describes how a web user locates useful information on the Internet. The classifier is trained on data that describe how the user treats the information she has visited, that is, her browsing behavior. We find that, for some user groups formed on the basis of browsing behavior, prediction accuracy is much higher.
1 Introduction
While the World Wide Web contains a vast amount of information, people spend more and more time browsing to find useful information, such as web pages that contain information they are interested in, or products they are willing to buy. Meanwhile, because of the massive amount of information on the Internet, it is often very difficult for web users to find the particular pieces of information they are looking for. This has led to the development of a number of recommendation systems, which typically observe a user's navigation through a sequence of pages and then suggest pages that may provide relevant information (see, e.g., [7], [11]). There is also some research focusing on web science [1] [4], on how people interact with the web, in order to understand more about web users. In this paper, we propose to cluster web users based on their browsing behaviors. Intuitively, a group-based behavior model performs better than the population model [17], which is expected to produce Information-Content (IC) pages, i.e., those pages that the user must examine to complete her task.
2 Related Work
There are many ways to generate recommendations for web users. Collaborative Filtering [13] was the first attempt to use AI technology for personalization [9], but it is unrealistic to ask the user to rank all the pages she has explored, and it is very
difficult to get enough manually labeled web pages in the real world. Some frequency-based methods such as [7], [11] can also be used to predict specific URLs, but they fall short in some cases, especially across web sites. Since very few web pages are accessed very frequently, few or even no clusters, rules or sequential patterns can be obtained, so the recommendation system remains silent almost all the time. Most of all, these systems are trained on specific web pages or one particular web site; that is, they cannot give any recommendations when applied in a new environment. In our research we want to acquire a web user behavior model describing how a user finds the information she really wants, and the model is trained on generalized click streams. Since such a model is not based on specific URLs or web pages, it can be used even in a totally new web environment. We also propose that some web users may have very similar browsing behavior, since some users have similar interests, backgrounds, and browsing preferences. Each such group should exhibit a very strong behavior model; thus it is much easier to acquire the group's behavior, and a recommendation system adapted to the group may be truly useful for its members. Web user behavior model extraction is also a critical task for the E-commerce community, and much work has been done here. Lynch and Ariely [6] show that if customers can be presented with useful product-related information, it will increase their satisfaction with the merchandise they buy. In other words, if we can infer the goals of web users, we can not only retrieve related information but also help them dramatically. Bucklin and Sismeiro [2] developed a model to describe within-site browsing behavior: the visitor's decisions to stay or exit and the duration of each page view. But their research is at the individual level and says nothing about what kind of information users want. Park and Fader [10] incorporate observable outcomes and latent propensities across web sites to build a web browsing behavior model. Moe et al. [8] use a Bayesian tree model to describe online purchasing. Johnson, Moe, Fader, Bellman, and Lohse [5] propose a model of users' search behavior, but it gives only a brief description, not explicit enough to infer what users want. However, research in this area is limited: it either does not take into account the content of the pages viewed, or the model is too general to be used in real applications. Moreover, the search behavior model is usually proposed by an expert after examining the recorded log data, which may lose some important aspects of the real web user model. In our research, we propose to acquire the user information-seeking model by machine learning, based on the content of her observed click stream. In particular, our method uses page-content information extracted from the user's click stream to learn a classifier that captures how a web user locates useful information on the Internet. The classifier is not trained on specific words or URLs, but on generalized information indicating how the user treats the information she has visited, that is, her browsing behavior. We also cluster web users based on this browsing behavior, that is, not on what kind of information they want, but on how they find useful information.
This paper describes our research on learning web user browsing behavior; we do not aim to find explicit rules or models, but to train a machine learning algorithm to capture web browsing patterns and then cluster web users based on such browsing behavior. Section 3 describes how we collected the data. Section 3.2 shows how we used the collected information to learn a classifier that captures web users' browsing behavior. In Section 4, we introduce our algorithm for finding user groups based on browsing behavior and report the results of a first test of the performance of this approach, based on the data collected in our study.
3 Data Preprocessing

3.1 User Study
To learn, and later evaluate, we collected a set of annotated web logs: a sequence of web pages that a user visits, where each page is labeled with a bit that indicates whether the page is "IC", i.e., essential to achieving the user's specific task goal. We collected these annotated web logs in a laboratory study. A total of 128 subjects participated in the study. Each participant was asked to perform a specific task:
1. Identify 3 novel vacation destinations (i.e., places you have never visited) that you are interested in.
2. Prepare a detailed plan for a vacation at each destination (including specific travel dates, flight numbers, accommodation, activities, etc.).
They were given access to our browsing tool (AIE, the Annotation Internet Explorer, described below) [16], which recorded their specific web logs and required them to provide the IC annotation. Each participant also had to produce a short report summarizing her specific vacation plans; AIE was engineered to help the user remember these citations and insert them into her travel plan. To help motivate participants to take this exercise seriously, they were informed that two (randomly selected) participants would win $500 to help pay for the specific vacation they had planned.
3.2 Learning Problem
In the dataset, we know which pages are IC pages and the sequence of pages that led to each of them. Our goal is a classifier that can predict which words will appear in the IC page. We do not train the classifier on specific words, but on features describing how the user treats these words, because we believe that by observing the user's actions on the words we can predict what information she wants. To train such a classifier, we first gather all the words from the observed page sequence, then compute certain features for each word based on how that word appeared within the sequence. The label for each word is whether it appears in the IC page that terminates the session (thanks to AIE, we know these IC labels).
3.3 IC-Session Identification
To make it easy for our learner, we divide the whole page sequence of each subject into several IC-sessions. An "IC session" is a consecutive sequence of pages that ends with an IC page or with the end of the user's traversal sequence. In our case, since the browsing is task driven, we terminate a session on reaching an IC page. However, it is not clear that the next session should begin on the subsequent page. For example, imagine reaching an index page I after visiting a sequence of pages A → B → C → I, and moreover, I contains a number of useful links, say I → P1 and I → P2, where both P1 and P2 are ICs. Here, each IC session should contain the sequence before the index page, since those pages also contribute to locating each of the IC pages; i.e., given the browsing sequence A → B → C → I → P1 → I → P2, we would produce the two IC-sessions A → B → C → I → P1 and A → B → C → I → P2. We use some heuristics to identify these IC-sessions, including the idea that sessions end with search engine queries, since it is very common that when one task is done, people go to a search engine to begin the next task.
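A simplified sketch of this session splitting is shown below; it handles the index-page example above by truncating the running prefix whenever an already-visited page is revisited, and it omits the search-query heuristic, so it is only an approximation of the heuristics actually used.

def ic_sessions(clickstream):
    # clickstream: list of (page, is_ic) pairs; a session ends at each IC page;
    # returning to an already-visited page (e.g. an index page) truncates the
    # running prefix back to that page, so the shared prefix is reused
    sessions, prefix = [], []
    for page, is_ic in clickstream:
        if page in prefix:
            prefix = prefix[:prefix.index(page) + 1]
        else:
            prefix.append(page)
        if is_ic:
            sessions.append(list(prefix))
    return sessions

stream = [("A", False), ("B", False), ("C", False), ("I", False),
          ("P1", True), ("I", False), ("P2", True)]
print(ic_sessions(stream))
# [['A', 'B', 'C', 'I', 'P1'], ['A', 'B', 'C', 'I', 'P2']]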
3.4 Feature Extraction
We consider all words that appear in all pages, removing stop words and stemming [12], and calculate a weight for each of the remaining words. The word's frequency in the page is its initial weight; we add more weight according to its location and layout in the web page, e.g., in the title, bold, strong size, etc. [14]. We next compute the following attributes for each word from the IC-session; for a detailed description of these features, please refer to [15].

Search Query Category. As our data set includes many requests to search engines, we include several attributes that relate to the words in the search result pages. Each search engine generates a list of results according to the query, but the content of each result may differ across search engines. For example, one result from Google contains a Description and a Category, but in Altavista's search results there is no such information. In our research, we only considered information produced by every search engine: the title (i.e., the first line of the result) and the snippet (i.e., the text below the title). We tag each title-snippet pair in each search result page as one of: Skipped, Chosen, and Untouched. If the user follows a link, the state of its title and snippet is "Chosen". All links in the list that the user did not follow, before the last chosen one, are deemed "Skipped", and all results after the last chosen link in the list are "Untouched".

Sequential Attributes. All the following measures are extracted from pages in an IC session except the search result pages and the last destination page. If the URL refers to a frame page, then we calculate all the following measures
based on the page view. We say a hyperlink (in page P) is backed if the user followed that link to another page but later went back to page P. A page is backward if it has been visited before; otherwise we say the page is forward.
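The following sketch illustrates the kind of word weighting and search-result tagging described above; the stop-word list, the boost values, and the omission of stemming are illustrative simplifications rather than the settings used in the study.

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}   # tiny illustrative list

def word_weights(page_text, title="", bold_words=(), title_boost=2.0, bold_boost=1.0):
    # initial weight = term frequency; extra weight is added for words that also
    # occur in emphasized positions (title, bold); boost values are illustrative
    tokens = [w for w in re.findall(r"[a-z]+", page_text.lower()) if w not in STOPWORDS]
    weights = Counter(tokens)
    for w in re.findall(r"[a-z]+", title.lower()):
        if w in weights:
            weights[w] += title_boost
    for w in bold_words:
        if w in weights:
            weights[w] += bold_boost
    return weights

def tag_search_results(num_results, chosen_indices):
    # label each result position as Chosen / Skipped / Untouched: results before
    # the last chosen link that were not followed are Skipped, the rest Untouched
    last = max(chosen_indices) if chosen_indices else -1
    labels = []
    for i in range(num_results):
        if i in chosen_indices:
            labels.append("Chosen")
        elif i < last:
            labels.append("Skipped")
        else:
            labels.append("Untouched")
    return labels

print(tag_search_results(5, {1, 3}))   # ['Skipped', 'Chosen', 'Skipped', 'Chosen', 'Untouched']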
3.5 Classifier Training
After preparing the data, we learned a Naive Bayes (NB) classifier. Recall that NB is a simple belief net structure which assumes that the attributes are independent of one another, conditioned on the class label [3]. The dataset we collected is very imbalanced: the number of negative instances (non-IC words) is far greater than the number of positive ones (IC words). To generate the training and testing data, we randomly selected positive and negative instances of equal size as testing data, then reduced the negative samples by random selection to obtain equal numbers of positive and negative training samples. For each IC session, let wseq denote all the words in the sequence except the last page, which is an IC page, and wdest the words in that final IC page; then

  coverage = |wseq ∩ wdest| / |wdest|.

To better understand this, we computed precision and recall values. The "Precision" for each IC page is TruePositive/PredictedAsPositive and the "Recall" for IC words is TruePositive/AllRealPositive. Similarly, we define Precision and Recall for non-IC words as TrueNegative/PredictedAsNegative and TrueNegative/AllRealNegative, respectively. For each trial, we built 10-fold training/testing datasets and computed the median of the 10 precision/recall results as the final precision/recall. (We used medians because they are less sensitive to outliers than means.) For positive and negative prediction, we compute the F-measure

  F-Measure = 2 * Precision * Recall / (Precision + Recall)

and take

  AF-Measure = (PositiveF-Measure + NegativeF-Measure) / 2

as the final prediction accuracy of the trial.
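The two measures can be computed directly from the per-class precision and recall values; the numbers in the example call below are illustrative only.

def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

def af_measure(pos_p, pos_r, neg_p, neg_r):
    # average of the F-measures for IC words (positive) and non-IC words (negative)
    return (f_measure(pos_p, pos_r) + f_measure(neg_p, neg_r)) / 2

print(af_measure(0.70, 0.65, 0.80, 0.83))   # illustrative values only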
4 Greedy Web User Clustering
We propose that there exists a general model of goal-directed information search on the web, and that users who share similar behavior can be clustered into user groups. The generic method for clustering web users is to represent each user with a feature vector and then apply a clustering algorithm (K-means, etc.) to find user groups. But for web users it is very difficult to define the exact features, since they may have very different interests. Sometimes it is impossible to obtain users' profiles for traditional clustering analysis. Recall that the features we extract for each word can be considered as the user's browsing preference; we therefore propose to find user clusters based on such feature data.
Algorithm WebUserClustering
Input: the subjects U (ui, i = 1, 2, 3, ..., n)
CandidateL: Queue; stores the candidate groups.
BEGIN
  Clear CandidateL
  For any 2-user group {ui, uj}, i, j = 1...n and i ≠ j:
    if its AF-Measure ≥ threshold, put it into CandidateL
  while CandidateL is not empty
  begin
    Remove head of CandidateL as maxGroup
    For each group (checkGroup) remaining in CandidateL
    begin
      if checkGroup ⊆ maxGroup, remove checkGroup;
      else if checkGroup ∩ maxGroup is not empty
      begin
        Merge them together; if the new group's AF-Measure ≥ threshold,
        take it as maxGroup and remove checkGroup;
      end
    end
    Output maxGroup
  end
END
Fig. 1. Testing Result for Web User Clustering
In Figure 1, we compare the AF-Measure of our greedy algorithm with two naive methods. The first simply puts all users in one big group; the second takes each user as a group and calculates the average AF-Measure of all such 1-user groups. From Fig. 1, it is easy to see that prediction accuracy increases dramatically for the greedy groups. This supports our proposal that strong regularities exist among web user groups, and that within such groups we can acquire a more accurate browsing model and provide better services for group members.
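A sketch of the greedy grouping procedure is given below; the scoring function af_measure_of, which would train and evaluate a classifier on the pooled data of a group, is left abstract, and the toy scorer in the usage example is purely illustrative.

from itertools import combinations

def greedy_user_clustering(users, af_measure_of, threshold):
    # af_measure_of(group) is assumed to return the AF-Measure obtained on the
    # pooled data of `group` (a frozenset of user ids)
    candidates = [frozenset(pair) for pair in combinations(users, 2)
                  if af_measure_of(frozenset(pair)) >= threshold]
    groups = []
    while candidates:
        max_group = candidates.pop(0)
        remaining = []
        for g in candidates:
            if g <= max_group:                 # already covered by maxGroup
                continue
            if g & max_group:                  # overlapping: try to merge
                merged = g | max_group
                if af_measure_of(merged) >= threshold:
                    max_group = merged
                    continue
            remaining.append(g)
        candidates = remaining
        groups.append(max_group)
    return groups

# toy usage with a dummy scorer that favors groups of users whose ids share parity
users = list(range(6))
score = lambda group: 0.9 if len({u % 2 for u in group}) == 1 else 0.3
print(greedy_user_clustering(users, score, threshold=0.8))
# [frozenset({0, 2, 4}), frozenset({1, 3, 5})]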
5 Conclusion and Future Work
Our ultimate goal is to get web users where they want to go more quickly. The first step towards this is to identify the web user's behavior model from the click stream. In particular, we acquire the user model by training a classifier on the generalized click stream. The training data for the classifier are not specific words or URLs, but the way the web user treats the observable information. By clustering based on such generalized data, we can find web user groups, and for each group we can achieve fairly high accuracy. We are currently investigating more efficient ways to improve the accuracy of the prediction, and ways to further increase the recall of the positive predictions. We also plan to explore the potential of content and structure mining, as well as tools for learning from imbalanced datasets, to aid us in this endeavor.
References
1. Berners-Lee, T., Hall, W., Hendler, J., Shadbolt, N., Weitzner, D.J.: Creating a science of the web. IEEE Transactions on Systems, Man and Cybernetics 36 (2006)
2. Bucklin, R., Sismeiro, C.: A model of web site browsing behavior estimated on clickstream data
3. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. Wiley, New York (1973)
4. Hendler, J., Shadbolt, N., Hall, W., Berners-Lee, T., Weitzner, D.: Web science: An interdisciplinary approach to understanding the web. Communications of the ACM 51(7) (2008)
5. Johnson, E., Moe, W., Fader, P., Bellman, S., Lohse, J.: On the depth and dynamics of world wide web shopping behavior. Management Science 50(3), 299–308 (2004)
6. Lynch, J., Ariely, D.: Wine online: Search costs and competition on price, quality, and distribution. Marketing Science 19(1), 83–103 (2000)
7. Mobasher, B., Cooley, R., Srivastava, J.: Automatic personalization through web usage mining. Technical Report TR99-010, Department of Computer Science, DePaul University (1999)
8. Moe, W., Chipman, H., George, E., McCulloch, R.: A Bayesian treed model of online purchasing behavior using in-store navigational clickstream
9. Mulvenna, M., Anand, S., Büchner, A.: Personalization on the net using web mining: introduction. Communications of the ACM 43(8), 122–125 (2000)
10. Park, Y.-H., Fader, P.: Modeling browsing behavior at multiple websites
11. Perkowitz, M., Etzioni, O.: Adaptive sites: Automatically learning from user access patterns. Technical Report UW-CSE-97-03-01, University of Washington (1997)
12. Porter, M.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
13. Resnick, P., Iacovou, N., Suchak, M., Bergstorm, P., Riedl, J.: GroupLens: An open architecture for collaborative filtering of netnews. In: Proceedings of ACM 1994 Conference on Computer Supported Cooperative Work, Chapel Hill, North Carolina, pp. 175–186. ACM, New York (1994)
14. W3C: HTML 4.01 specification
15. Zhu, T.: Goal-Directed Complete-Web Recommendation. PhD thesis, University of Alberta, Edmonton AB, Canada (2006)
16. Zhu, T., Greiner, R., Häubl, G.: Learning a model of a web user's interests. In: Brusilovsky, P., Corbett, A.T., de Rosis, F. (eds.) UM 2003. LNCS, vol. 2702, Springer, Heidelberg (2003)
17. Zhu, T., Greiner, R., Häubl, G., Jewell, K., Price, B.: Goal-directed site-independent recommendations from passive observations. In: Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI 2005), Pittsburgh, Pennsylvania, pp. 549–557 (July 2005)
Privacy Preserving in Personalized Mobile Marketing

Yuqing Sun and Guangjun Ji
School of Computer Science and Technology, Shandong University
sun [email protected], jgj [email protected]
Abstract. With the popularity of portable smart devices and advances in wireless technologies, mobile marketing is growing quickly. Among various methods, the short message is regarded as the most efficient mode. While mobile advertising enhances communication with consumers, messages sent without the required permission from users cause privacy violations. How to simultaneously support personalization and privacy preservation in mobile marketing is therefore a challenging problem. In this paper, we investigate this problem and propose a Privacy Preserving Model for Personalized Mobile Marketing (P2PMM) that can protect both users' preferences and their location privacy while supporting personalization in mobile marketing. We propose an efficient coding method to manage hierarchical categories and users' preferences. We also investigate how such a model can be applied in practice, and a prototype has been implemented.
1 Introduction

Mobile marketing is a set of practices that enables organizations to communicate and engage with their audience in an interactive and relevant manner through any mobile device or network [1]. With the advances in wireless technologies and the popularity of portable smart devices, mobile marketing is growing quickly. Among various methods, the short message (SMS) is regarded as the most efficient mode [2], as businesses start to collect mobile phone numbers and send wanted (or unwanted) content to users. Another trend in mobile marketing is location-based services (LBS), which offer messages to users based on their current location. A cell phone service provider obtains the location from a GPS (Global Positioning System) chip built into a phone, or by radiolocation and trilateration based on the signal strength of the closest cell-phone towers. While mobile marketing enhances communication with users, it may cause privacy violations when messages are sent without the required permission from consumers. A number of concerns, such as mobile spam, personal identification, location information and wireless security, mainly stem from the fact that mobile devices are intimately personal and are always with the user [3]. Experts cited fear of spam as the strongest negative influence on consumer attitudes towards SMS advertising [4]. In fact, no matter how well advertising messages are designed, if consumers do not have confidence that their privacy will be protected, this will hinder widespread deployment [5]. To support personalization, messages should be appropriately tailored before being sent to consumers. Solutions have been deployed to personalize text messages based on a consumer's local time, location, and preferences [6], e.g., directions to the nearest vegetarian restaurant open at the time of request.
However, personalization in mobile marketing means collecting and storing information about a particular person, such as monitoring of user behavior, which causes other privacy concerns [7]. To address such problems, different techniques have been proposed that are based on two main approaches: location cloaking, under which a suitably large region is returned to the service provider instead of the precise user location [8]; and location k-anonymization, under which the location of an individual is returned to the service provider only if it is indistinguishable from the locations of at least k-1 other individuals [9,10]. Some recent works discuss preferences on privacy, such as the Privacy-preserving Obfuscation Environment system (PROBE) [11]. But these works are not suitable for privacy preservation in mobile marketing since they do not consider the customization requirement of message contents. In this paper, we investigate the problem of how to protect users' privacy while supporting personalization. In the proposed Privacy Preserving Model for Personalized Mobile Marketing (P2PMM), users can customize their preferences for messages without any privacy leakage to the information provider. A trusted third party collects the marketing messages and classifies them according to predefined categories. Users thus have options on location, time, categories, and so on. We investigate how such a model can be realized in GPS and cellular network systems. The prototype system is designed and implemented. The remainder of the paper is organized as follows. Section 2 presents the main components of the proposed model. Section 3 investigates the problem of efficient information organization and query processing in the model. Section 4 discusses the system architecture and describes the details of the implementation of the prototype. Section 5 concludes the paper and outlines future research directions.
2 The Privacy Preserving Model for Personalized Mobile Marketing

In this section we introduce the Privacy Preserving Model for Personalized Mobile Marketing (P²PMM for short), depicted in Figure 1. There are four entities in this model, whose roles are described as follows.

– Mobile network operator (MNO): a telephone company that provides communication services for mobile phone subscribers.
– Intermediary service provider (ISP): a third party trusted by users and independent of merchants. It provides a platform for merchants to manage their advertisements, and for users to subscribe to messages of interest and maintain their private individual data. If required, it can be integrated with the MNO.
– Users: the cell phone subscribers. After registering with the ISP, they can select their preferred messages from the ISP based on location, time, or topics of interest.
– Merchants: the organizations that want to advertise their business messages. After registering with the ISP, they can publish their advertisements to interested users.

Three characteristics distinguish this model from other mobile marketing models.
Fig. 1. The Privacy Preserving Model for Personalized Mobile Marketing
First, it is active marketing: the P²PMM model follows a "PULL" schema rather than the traditional "PUSH" approach, so that all messages sent to users are messages they want. Second, users' preference privacy is preserved: sensitive information about an individual, such as the user profile, the user preferences in each query, and the current location, is stored at the ISP, which avoids every merchant holding a copy of the user profile. Third, users' location privacy is also preserved: the ISP only acquires an approximate location square and never learns the exact real-time position of a user.

We now formalize the basic notions of our model.

Definition 1 (Position). A geographic position Loc = [lngt, lat] denotes a point on a map, where lngt and lat are the longitude and latitude values of this point. Let POS denote the class of positions.

U and M respectively denote the set of users and the set of merchants. We assume that every merchant is associated with an exact position. When a merchant registers with the ISP, its position can be acquired by a positioning technology such as Google Maps. Similarly, every mobile phone user has an exact position at any time, which can be acquired by GPS or wireless positioning technologies. We introduce two predicates, LocU(u ∈ U): POS and LocM(m ∈ M): POS, to obtain the location of a user and of a merchant, respectively.

Definition 2 (Message). Let ISSUE denote the set of all issues considered in the model. A message msg is specified as a tuple msg = <ID, TXT, Issues, TW, Loc>, where ID is the unique identifier of msg, TXT is the message content, Issues ⊆ ISSUE is the set of issues correlated with msg, TW is of the form [t1, t2] representing the time window during which msg is effective, and Loc ∈ POS is of the form [lngt, lat] denoting the position of the merchant who launches msg.

For example, suppose the department store Macy's in West Lafayette, IN wants to advertise a sales promotion. It launches the following message to the ISP:

msg = <201006081123, "There is a 10% discount on mobile phones for MACY VIP in West Lafayette", {discount, mobile phone}, [20100601, 20100630], [40.25N, 86.54W]>,

in which msg.ID = 201006081123 and msg.Loc = [40.25N, 86.54W] are automatically generated and attached by the ISP system, according to the Macy's profile, when the message is accepted.
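To make the tuple structures concrete, the following Java fragment is a minimal sketch of Definitions 1 and 2 that builds the example message. The class and field names are illustrative and are not taken from the paper's prototype.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative Java types for Definitions 1 and 2 (not the prototype's own classes).
final class Position {
    final double lngt, lat;                       // longitude and latitude
    Position(double lngt, double lat) { this.lngt = lngt; this.lat = lat; }
}

final class Message {
    final long id;                                // unique identifier
    final String txt;                             // message content
    final Set<String> issues;                     // issues correlated with the message
    final int twStart, twEnd;                     // effective time window [t1, t2], as yyyyMMdd
    final Position loc;                           // position of the launching merchant

    Message(long id, String txt, Set<String> issues, int twStart, int twEnd, Position loc) {
        this.id = id; this.txt = txt; this.issues = issues;
        this.twStart = twStart; this.twEnd = twEnd; this.loc = loc;
    }
}

class MessageExample {
    public static void main(String[] args) {
        // The running example: Macy's promotion; west longitude is written as a negative value.
        Message msg = new Message(
            201006081123L,
            "There is a 10% discount on mobile phones for MACY VIP in West Lafayette",
            new HashSet<String>(Arrays.asList("discount", "mobile phone")),
            20100601, 20100630,
            new Position(-86.54, 40.25));
        System.out.println(msg.txt);
    }
}
```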
Fig. 2. Hierarchical Categories Mapping
Definition 3 (Message Request). A message request is a tuple MsgR = <ID, PRE_issue, T_expire, Loc_squ>, where ID is the unique identifier of the user who initiates the request, PRE_issue ⊆ ISSUE is the set of preferred issues, T_expire denotes the latest acceptable expiry date of the requested messages, and Loc_squ is of the form (loc1, loc2), denoting the square surrounding the user's position with top-left point loc1 ∈ POS and bottom-right point loc2 ∈ POS.

An example message request is MsgR = <13001234567, {discount, clothes}, 20100701, ([40.25N, 84.54W], [39.25N, 86.27W])>. This request is sent by the user whose mobile phone number is "13001234567". The user's topics of interest are "discount" and "clothes", and the expiry date of the requested messages should be no later than July 1, 2010. The location square is generated by the phone application according to the user's preference (discussed in detail in Section 4).
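A matching sketch for Definition 3, reusing the Position type from the previous fragment; the names are again illustrative, and representing ⊥ (no location constraint) by null fields is an assumption made for the example only.

```java
import java.util.Set;

// Illustrative Java type for Definition 3 (not the prototype's own class).
final class MessageRequest {
    final String id;                       // identifier of the requesting user (phone number)
    final Set<String> prefIssues;          // preferred issues; an empty set encodes the LoQ case
    final int tExpire;                     // latest acceptable expiry date, as yyyyMMdd
    final Position topLeft, bottomRight;   // Loc_squ; null fields stand in for "no range" (⊥)

    MessageRequest(String id, Set<String> prefIssues, int tExpire,
                   Position topLeft, Position bottomRight) {
        this.id = id; this.prefIssues = prefIssues; this.tExpire = tExpire;
        this.topLeft = topLeft; this.bottomRight = bottomRight;
    }
}
```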
3 Information Organization and Query Processing

In this section we investigate how to organize message information so as to respond quickly to users' requests. A foldable bit-mapping schema is proposed to code issues for efficient storage. This method is especially valuable for mobile applications, given their limited computing power and transmission speed. We also present an efficient algorithm to process user queries.

The Efficient Coding of Issues and Preferences. The issues considered in an ISP system are hierarchically organized. The categories can be defined by specialists according to semantic structures, with the help of a classification tool such as WordNet [12]. Each category can be further divided into sub-issues. The categories of issues and their semantics are called metadata. The foldable bit-mapping coding schema is as follows:
– Issues are classified into two kinds: leaf issues and non-leaf issues. Each issue occupies one bit in a code tuple.
– Each bit is either "0" or "1", denoting that the issue is "unselected" or "selected", respectively.
– In a folded code, if the current bit is mapped to a leaf issue, the bit is either 1 or 0; otherwise (non-leaf) the bit is one of {1, 0, *}, where "0" (or "1") means that this issue together with all of its sub-issues is unselected (or selected), and "*" means that only some of the sub-issues in this category are selected, with the detailed sub-issue bits following this issue bit.

For example, Figure 2(a) shows the given hierarchical categories, in which ISSUE = {Electric appliance, Sports, Clothes, Entertainment}, Electric appliance = {mobile phone, TV, refrig, microwave}, and Clothes = {man, woman, kid}. Figure 2(b) shows how a user's preference, say mobile phone and clothes, is encoded into the folded code "*1000010". The folded code clearly reduces storage and thus transmission time. A folded code is decoded under the same principle (a Java sketch of the decoding is given below, before Algorithm 1).

Message Management and User Query Processing. For efficient management, we divide messages into two classes, effective and pending, according to their active time window. A message is effective if and only if the current date is within its active time window, and pending if and only if the current date is before its active time window. A message is discarded when it expires. Accordingly, two queues are maintained: a pending message is added to the pending queue in ascending order of the beginning of its time window and is removed from this queue when it enters its active window; effective messages are kept in the effective queue in ascending order of expiry date and are removed when they expire.

Definition 4 (User Query). A user query is a set UQ = {MsgR1, MsgR2, ..., MsgRk} of message requests, where k is an integer and each MsgRi, i ∈ [1..k], is of the form MsgRi = <ID, PRE_issue, T_expire, Loc_squ>, including the preferred issues PRE_issue, the latest acceptable expiry date T_expire, and the geographical range Loc_squ.

In general, multiple user requests arrive at the ISP server at the same time, and they fall into the following modes. Category Only Query (CoQ): the user only cares about the issues of the advertisements, and the query range is set to ⊥. Location Only Query (LoQ): the user only cares about the geographical range of the messages, and the preference field of the request is set to ∅. Hybrid Query (HQ): the combination of the two cases above.

We formalize a general user request in Definition 4 and present the algorithm for processing user requests as Algorithm 1. The main idea of the algorithm is to make three comparisons between the user's preferences and the message information. First we compare the issues. Since a user's preferences are encoded, we decode them, as discussed above, and store the unfolded code in a variable IssuesBIT. If the user query is of LoQ type, each bit of IssuesBIT is set to 0. We adopt a predicate that performs a logical AND on each pair of bits of IssuesBIT and the issues of a message, as in step 12; many programming languages, such as C, provide such bitwise operations.
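Before turning to the remaining checks, the following Java sketch illustrates the unfolding of a folded preference code such as "*1000010" into the standard bit mapping IssuesBIT on which the bitwise AND of step 12 operates (step 5 of Algorithm 1). The category-tree representation and the traversal order are assumptions based on the description above, not the paper's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of unfolding a folded preference code (e.g. "*1000010")
// into a flat bit map with one entry per leaf issue. Names are illustrative.
public class FoldedCodeDecoder {

    // A category node: a leaf issue, or a non-leaf issue with sub-issues.
    static class Category {
        final String name;
        final List<Category> children = new ArrayList<Category>();
        Category(String name) { this.name = name; }
        boolean isLeaf() { return children.isEmpty(); }
    }

    private int pos; // current position in the folded code

    /** Decodes a folded code against the top-level categories of the hierarchy. */
    public List<Integer> decode(String folded, List<Category> topLevel) {
        pos = 0;
        List<Integer> leafBits = new ArrayList<Integer>();
        for (Category c : topLevel) {
            decodeCategory(folded, c, leafBits);
        }
        return leafBits;
    }

    private void decodeCategory(String folded, Category c, List<Integer> out) {
        char ch = folded.charAt(pos++);
        if (c.isLeaf()) {
            out.add(ch == '1' ? 1 : 0);          // leaf issue: the bit is taken as-is
        } else if (ch == '*') {                   // '*': detailed sub-issue bits follow
            for (Category sub : c.children) {
                decodeCategory(folded, sub, out);
            }
        } else {                                  // '0'/'1': covers all sub-issues
            fillAll(c, ch == '1' ? 1 : 0, out);
        }
    }

    private void fillAll(Category c, int bit, List<Integer> out) {
        if (c.isLeaf()) {
            out.add(bit);
        } else {
            for (Category sub : c.children) {
                fillAll(sub, bit, out);
            }
        }
    }
}
```

On the hierarchy of Figure 2(a), with Sports and Entertainment treated as leaves, decoding "*1000010" under this sketch sets the mobile phone bit and the three Clothes bits and clears the rest.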
Algorithm 1: Handling user queries
Require: the effective message queue ActQ; the received user query UQ
Ensure: return the messages MSG that satisfy the conditions in UQ
1:  MSG = ∅
2:  for each MsgRi ∈ UQ do
3:      get the coded preferred issues MsgRi.PRE_issue
4:      if MsgRi.PRE_issue ≠ ∅ then
5:          unfold MsgRi.PRE_issue into a standard bit mapping IssuesBIT
6:      else
7:          IssuesBIT = 0
8:      end if
9:      STATE = TRUE
10:     while STATE do
11:         sequentially select the next element msg ∈ ActQ
12:         if IssuesBIT = 0 OR BitLogicAnd(IssuesBIT, msg.Issues) then
13:             if msg.TW.t2 ≤ MsgRi.T_expire then
14:                 if MsgRi.Loc_squ = ⊥ OR ((MsgRi.Loc_squ.lngt1 ≤ msg.Loc.lngt ≤ MsgRi.Loc_squ.lngt2) AND (MsgRi.Loc_squ.lat1 ≤ msg.Loc.lat ≤ MsgRi.Loc_squ.lat2)) then
15:                     MSG = MSG ∪ {msg.TXT}
16:                 end if
17:             else
18:                 STATE = FALSE
19:             end if
20:         end if
21:     end while
22: end for
Then the algorithm determines whether the expiry date of a message is within the user's preferred period, as in step 13. Finally, if the position of the message's sender lies within the user's preferred geographical range, the message is added to the result set MSG. In the special case that the user does not care about the geographical position, i.e., Loc_squ = ⊥, this check is omitted. The time complexity of Algorithm 1 is O(|UQ| · m), where |UQ| is the number of message requests in the user query and m is the number of active messages in the ISP store. Since the dominant computation is the comparison between requests and messages, the use of bitwise logic operations greatly improves efficiency.
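To complement Algorithm 1, the Java fragment below sketches the three per-message checks (steps 12–14) under the simplifying assumption that the unfolded issue sets fit into 64-bit masks. The method names, the mask encoding, and the "at least one shared issue" reading of the AND predicate are assumptions for illustration, not part of the paper's prototype.

```java
// A minimal sketch of the per-message tests of Algorithm 1, assuming
// issue sets are packed into 64-bit masks. All names are illustrative.
final class QueryMatcher {

    /** Step 12: issue check. Passes if the user states no issue preference (LoQ)
        or shares at least one issue with the message. */
    static boolean issuesMatch(long issuesBit, long msgIssues) {
        return issuesBit == 0L || (issuesBit & msgIssues) != 0L;
    }

    /** Step 13: time check. The message must expire no later than T_expire. */
    static boolean timeMatches(int msgTwEnd, int tExpire) {
        return msgTwEnd <= tExpire;
    }

    /** Step 14: location check. The merchant's position must lie inside Loc_squ;
        a query without a range (Loc_squ = ⊥) simply skips this check. */
    static boolean locationMatches(double lngt, double lat,
                                   double lngt1, double lngt2,
                                   double lat1, double lat2) {
        return lngt1 <= lngt && lngt <= lngt2 && lat1 <= lat && lat <= lat2;
    }
}
```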
4 The Prototype System

In this section we describe how the proposed model can be realized with GPS and cellular network systems. Our prototype is implemented on a PC running Windows XP SP2 with a 2 GHz E5200 processor and 2 GB of RAM. We adopt Tomcat 6.0 as the Web server on the ISP side and MySQL 5.0 as the database. The development environments on the mobile end are MyEclipse 6 and NetBeans 5.5.1.
Fig. 3. The ISP system architecture
The architecture of the ISP is depicted in Figure 3. There are three main components: repositories, interfaces, and internal modules. The repositories comprise the Store of Registered Merchant Information (RMI), the Store of Registered User Profiles (RUI), and the Store of Hierarchical Categories (HCS). RMI contains the information associated with registered merchants, such as geographical information, while RUI records users' private information, such as mobile phone numbers. HCS stores the metadata and the mapping schema of the hierarchical categories. The Merchant Interface provides the operations to register and to publish an advertisement. The User Interface allows users to register on the ISP server and manage their profiles. The Administrator Interface offers management of users, merchants, and advertisements.

The internal modules, depicted as rectangles in the figure, implement the main functions for processing merchants' advertisements and users' queries. The Identity Management module is responsible for identifying individuals and merchants and for controlling access to system resources by placing restrictions on the established identities. The User Preferences Decoding module transforms users' folded preferences into the standard bit mapping schema; it accesses the Hierarchical Categories Store (HCS). The Categories Maintenance module provides a platform for administrators to create, delete, or modify categories, and to maintain the bit mapping schema of the categories and their storage. The Ad Management module processes requests from registered merchants and manages the pending and effective message queues. The Match Preferred Messages module carries out the routine processing of user queries over messages, which requires access to both HCS and the effective message queue.

Location technologies perform the localization of a target and make the resulting location data available to external actors. In our model, to capture a user's geographical position in real time, we adopt A-GPS technology to locate a mobile phone user and associate the user with a place in the real world. To use the location class getGps.java, we import two libraries, javax.microedition.location.Criteria and javax.microedition.location.Location. The adopted classes include Criteria, Location, and Coordinates. The longitude and latitude of a mobile end can be acquired through getLocation(), LocationProvider.getInstance(), getLongitude(), and getLatitude().
To protect users' location privacy, we adopt cloaking technologies. A user is allowed to define a query range, range. After the geographical position, say [x, y], is obtained, where x and y are the coordinates, the square [x − range, y − range, x + range, y + range] is generated and sent to the ISP server. To ensure the efficiency and stability of the communication between user and server, a separate thread is invoked and the content is transferred as hexadecimal code.
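The fragment below sketches this client-side step: obtaining the coordinates through the JSR-179 Location API classes mentioned above and building the cloaked square from a user-chosen range. The accuracy value, the timeout, the treatment of range as a plain coordinate offset, and the class name are assumptions for illustration, not details given in the paper.

```java
import javax.microedition.location.Criteria;
import javax.microedition.location.Location;
import javax.microedition.location.LocationException;
import javax.microedition.location.LocationProvider;
import javax.microedition.location.QualifiedCoordinates;

// A sketch of client-side positioning and cloaking (class name illustrative).
public class CloakedLocator {

    /** Returns the cloaked square [x - range, y - range, x + range, y + range]
        around the user's position [x, y]; only this square is sent to the ISP. */
    public static double[] cloakedSquare(double range)
            throws LocationException, InterruptedException {
        Criteria criteria = new Criteria();
        criteria.setHorizontalAccuracy(500);            // desired accuracy in metres (assumed value)
        LocationProvider provider = LocationProvider.getInstance(criteria);
        Location location = provider.getLocation(60);   // timeout in seconds (assumed value)
        QualifiedCoordinates coords = location.getQualifiedCoordinates();
        double x = coords.getLongitude();
        double y = coords.getLatitude();
        return new double[] { x - range, y - range, x + range, y + range };
    }
}
```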
5 Conclusions

Simultaneously supporting personalization and privacy preservation is a challenging problem in mobile marketing. In this paper we investigate this problem by proposing a Privacy Preserving Model for Personalized Mobile Marketing (P²PMM). It protects both users' preference privacy and location privacy while supporting personalization in mobile marketing. We present an efficient coding method to manage hierarchical categories and users' preferences. We discuss how to apply the model in practice and implement a prototype. As future work, we plan to optimize the algorithm so as to efficiently process large volumes of queries and to consider more sophisticated constraints.

Acknowledgement. This work is supported in part by the 863 Program of China (2006AA01A113) and the Science Foundation of Shandong Province (Y2008G28).
References

1. MMA: Mobile Marketing Association updates definition of mobile marketing (2009), http://mmaglobal.com/news/mma-updates-definition-mobile-marketing
2. Marsit, N., Hameurlain, A., Mammeri, Z., Morvan, F.: Query processing in mobile environments: a survey and open problems. In: Proceedings of the First International Conference on Distributed Frameworks for Multimedia Applications, pp. 150–157 (2005)
3. Hinde, S.: Spam: the evolution of a nuisance. Computers and Security 22(6), 474–478 (2003)
4. Scharl, A., Dickinger, A., Murphy, J.: Diffusion and success factors of mobile marketing. Electronic Commerce Research and Applications 4(2), 159–173 (2005)
5. Cleff, E.B.: Privacy issues in mobile advertising. In: British and Irish Law, Education and Technology Association Annual Conference, Hertfordshire (2007)
6. Balasubramanian, S., Peterson, R., Jarvenpaa, S.: Exploring the implications of m-commerce for markets and marketing. Journal of the Academy of Marketing Science 30(4), 348–361 (2002)
7. Bertino, E.: Privacy-preserving techniques for location-based services. SIGSPATIAL Special 1(2), 2–3 (2009)
8. Cheng, R., Zhang, Y., Bertino, E., Prabhakar, S.: Preserving user location privacy in mobile data management infrastructures. In: Danezis, G., Golle, P. (eds.) PET 2006. LNCS, vol. 4258, pp. 393–412. Springer, Heidelberg (2006)
9. Kalnis, P., Ghinita, G., Mouratidis, K., Papadias, D.: Preventing location-based identity inference in anonymous spatial queries. IEEE Transactions on Knowledge and Data Engineering 19(12), 1719–1733 (2007)
10. Mokbel, M., Chow, C.Y., Aref, W.: The new Casper: query processing for location services without compromising privacy. In: Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB 2006 (2006)
11. Ghinita, G., Damiani, M.L., Bertino, E., Silvestri, C.: Interactive location cloaking with the PROBE obfuscator. In: International Conference on Mobile Data Management (2009)
12. Princeton University: WordNet, http://www.wordnet.org/