Giovanna Castellano, Lakhmi C. Jain, and Anna Maria Fanelli (Eds.) Web Personalization in Intelligent Environments
Studies in Computational Intelligence, Volume 229
Editor-in-Chief: Prof. Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447 Warsaw, Poland
Prof. Giovanna Castellano
Computer Science Department, University of Bari, Via Orabona, 4, 70125 Bari, Italy

Prof. Anna Maria Fanelli
Computer Science Department, University of Bari, Via Orabona, 4, 70125 Bari, Italy

Prof. Lakhmi C. Jain
University of South Australia, Adelaide, Mawson Lakes Campus, South Australia, Australia
ISBN 978-3-642-02793-2
e-ISBN 978-3-642-02794-9
DOI 10.1007/978-3-642-02794-9 Studies in Computational Intelligence
ISSN 1860-949X
Library of Congress Control Number: Applied for
© 2009 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India. Printed on acid-free paper. springer.com
Foreword
At first sight, the concept of web personalization looks deceptively simple. A web personalization system is a software component that collects information on visitors to a web site and leverages this knowledge to deliver them the right content, tailoring presentation to the user's needs. All over the world, web designers and web content managers rely on web personalization solutions to improve the effectiveness and usability of their web-based applications. Still, the scientific foundation of web personalization remains a controversial issue. Practitioners know very well that when properly implemented, personalization delivers a much better user experience; but when it is poorly implemented, personalization may backfire and even distract the user's attention away from some useful (and costly-to-develop) enriched content. In other words, tailoring content, and varying it routinely, may make a site more attractive; but an unstable site look can have a negative impact on the overall message.

Everybody seems to agree that this is a real danger; but there are specific questions that are much harder to answer convincingly. For example, when does excessive customization become noise? How can we measure the effects of content tailoring on users' experience and cognitive gain? Without a clear answer to these questions, organizations that extensively use personalization in their content management projects have to take the risk of compromising the effectiveness of the underlying message. Historically, this factor kept the number of adopters low: most businesses are reluctant to risk jeopardizing their core message in exchange for some non-quantified future benefit of personalization. A sound scientific approach is needed to reverse this trend; but until quite recently, web personalization had little to do with scientific research. As a communication strategy, it was considered more an art than a science.
This book provides an entirely different point of view, advocating a scientific approach to web personalization without forgetting the interdisciplinary nature of this field and its practical goals. Editors Giovanna Castellano, Lakhmi Jain and Anna Maria Fanelli, themselves outstanding researchers in this area, have successfully put together a self-contained book: it provides a comprehensive view of the state of the art, including a description of the personalization process and a classification of the current approaches to Web personalization. The book also delves deeply into current research on intelligent techniques in the realm of Web personalization.
I leave it to the Editors' introduction to comment individually on the excellent selected chapters, which are authored by some of the leading international research teams working in this field. Here, it is more important to remark that these chapters collectively show what intelligent techniques can do to tackle two open research problems:

• discovering useful knowledge about users from the (uncertain) information collected during interactions;
• using such knowledge to deliver customized recommendations, tailor-made to the needs of the users.
Solving the first problem means providing a scientifically sound definition of user model. To put it simply, such models are composed of a visitor profile and a visitor segment. A visitor profile is a collection of attributes that must be known or guessed in order to support personalization. Explicit profile attributes are the easier part: they are data about the user, coming from online surveys, registration forms, integrated CRM or sales automation tools, and legacy or existing databases. Still, this multiplicity of sources poses uncertainty problems in case of conflicts (in which age group do we classify a user who declared that her age is 15 but also provided her driving license number?) and limited trustworthiness (e.g. due to data aging) of some information sources.

Implicit profile attributes are much more uncertain than explicit ones: they are derived from browsing patterns, cookies, and other sources, i.e. from watching or interpreting customer behavior, a process which may be slow and is subject to error. Here, however, one must clarify how uncertainty arises. There is little uncertainty in the data collection process: personalization systems are probes, not sensors, and register user behavior exactly in terms of clicks and page visits. Uncertainty comes in when mapping profile attributes to profile segments. A segment is just a collection of users with matching profiles; so segment membership is usually uncertain, or better, a matter of degree. Visitor segments have different granularity depending on the application, and are crucial for developing and maintaining classification rules. How organizations collect and store visitor segments is a sensitive topic, as it gives rise to a number of privacy issues. Finally, gaming, i.e. intentionally attacking the classification system by providing wrong information or acting erratically, is also not unheard-of on the Web and can worsen the situation.

The second problem is the holy grail of web personalization.
Web-based recommendation systems aggregate the online behavior of many people to find trends, and then make recommendations based on them. This involves some sophisticated mathematical modeling to compute how similar one user's behavior is to another's. Once again, uncertainty mostly comes from the interaction between recommendation and segmentation: recommender systems will try to advise us based on the past behavior of our peers, but their notion of “peer” is only as good as their profile segment construction algorithm. When segmentation fails (e.g. due to gaming, or wrong interpretation of implicit parameters), recommendations sometimes turn out plainly wrong, and in some extreme cases they can even be offensive to the users.

Intelligent techniques map the above issues to data mining and machine learning problems. Namely, they use mining and learning to build intelligent (e.g., neuro-fuzzy or temporal) models of user behavior that can be applied to the task of predicting user needs and adapting future interactions. The techniques described in this book are flexible enough to handle the various sources of data available to personalization systems; also, they lend themselves to experimental validation. Thanks to the combined effort of the volume's editors and of its outstanding authorship, this book demonstrates that intelligent approaches can provide a much needed hybrid solution to both these problems, smoothly putting together symbolic representation of categories and segments with quantitative computations. While much work remains to be done, the chapters in this volume provide convincing evidence that intelligent techniques can actually pave the way to a scientifically sounder (and commercially more effective) notion of Web personalization.
Ernesto Damiani Università di Milano, Italy
Preface
The Web has emerged as both a technical and a social phenomenon. It affects business and everybody's life, and has considerable social implications. In this scenario, Web personalization arises as a powerful tool to meet the needs of daily users and make the Web a friendlier environment. Web personalization includes any action that adapts the information or services provided by a Web site to the needs of users, taking advantage of the knowledge gained from the users' navigational behavior and individual interests, in combination with the content and the structure of the Web site. In other words, the aim of a Web personalization system is to provide users with the information they want or need, without expecting them to ask for it explicitly.

The personalization process plays a fundamental role in an increasing number of application domains such as e-commerce, e-business, adaptive web systems, and information retrieval. Depending on the application context, personalization functions may change, ranging from improving the organization and presentation of Web sites to enabling better searches. Regardless of the particular application domain, the development of Web personalization systems gives rise to two main challenging problems: how to discover useful knowledge about the user's preferences from the uncertain Web data collected during the interactions of users with the Web site, and how to deliver intelligent recommendations, tailor-made to the needs of the users, by exploiting the discovered knowledge.

The book aims to provide a comprehensive view of Web personalization and to investigate the potential of intelligent techniques in the realm of Web personalization. The book includes six chapters.

Chapter one provides an introduction to innovations in Web personalization. A roadmap of Web personalization is delineated, emphasizing the different personalization functions and the variety of approaches proposed for the realization of personalization systems. In this chapter, a Web personalization process is presented as a particular data mining application with the goal of acquiring all possible information about users accessing the Web site in order to deliver personalized functionalities. In particular, according to the general scheme of a data mining process, the main steps of a Web personalization process are distinguished, namely Web data collection, Web data preprocessing, pattern discovery and personalization. This chapter provides a detailed description of each of these steps. To complete the introduction, the different techniques proposed in the literature for each personalization step are reviewed, providing a survey of works in this field.
Chapter two by Pasquale Lops et al. investigates the potential of folksonomies as the source of information about user interests for recommendation. The authors introduce a semantic content-based recommender system integrating folksonomies for personalized access. The main contribution is a novel integrated strategy that enables a content-based recommender to infer user interests by applying machine learning techniques, both on official item descriptions provided by a publisher and on tags which users adopt to freely annotate relevant items.

Chapter three by John Garofalakis and Theodoula Giannakoudi shows how to exploit ontologies for Web search personalization. Ontologies are used to provide a semantic profiling of users’ interests, based on the implicit logging of their behavior and the on-the-fly semantic analysis and annotation of the Web result summaries.

Chapter four by Giovanna Castellano and M. Alessandra Torsello shows how to derive user categories for Web personalization. It presents a Web Usage Mining (WUM) approach based on fuzzy clustering to categorize users by grouping together users sharing similar interests. Unlike conventional fuzzy clustering approaches that employ distance-based metrics (such as the Euclidean measure) to evaluate similarity between user interests, the approach described in this chapter makes use of a fuzzy similarity measure that enables identification of user categories by capturing the semantic information incorporated in the original Web usage data.

Chapter five by Fabián P. Lousame and Eduardo Sánchez presents an overview of recommender systems based on collaborative filtering, which represents one of the most successful recommendation techniques to date. The chapter contributes a general taxonomy useful for classifying algorithms and approaches according to a set of relevant features, and finally provides some guidelines to decide which algorithm best fits a given recommendation problem or domain.

In Chapter six, Corrado Mencar et al. present a user profile modeling approach conceived to be applicable in various contexts, with the aim of providing personalized contents to different categories of users. The proposed approach is based on fuzzy logic techniques and exploits the flexibility of fuzzy sets to define an innovative scheme of metadata. Along with the modeling approach, the design of a software system based on a Service Oriented Architecture is presented. The system exposes a number of services to be consumed by information systems for personalized content access.

We are grateful to the authors and reviewers for their excellent contributions. Thanks are due to Springer-Verlag and the SCI Data Processing Team of Scientific Publishing Services for their assistance during the preparation of the manuscript.
May 2009
Giovanna Castellano Lakhmi C. Jain Anna Maria Fanelli
Editors
Giovanna Castellano is Assistant Professor at the Department of Computer Science of the University of Bari, Italy. She received a Ph.D. in Computer Science from the same University in 2001. Her recent research interests focus on the study of Computational Intelligence paradigms and their applications in Web-based systems, image processing and multimedia information retrieval.
Professor Lakhmi C. Jain is a Director/Founder of the Knowledge-Based Intelligent Engineering Systems (KES) Centre, located in the University of South Australia. He is a fellow of the Institution of Engineers Australia. His interests focus on the artificial intelligence paradigms and their applications in complex systems, art-science fusion, e-education, e-healthcare, unmanned air vehicles and intelligent agents.
Professor Anna Maria Fanelli is Full Professor at the Department of Computer Science of the University of Bari, Italy, where she plays several roles. She is Director of the Computer Science Department, Director of the PhD School in Computer Science and chair of the CILab (Computational Intelligence Laboratory). Her recent research interests focus on the analysis, synthesis, and application of Computational Intelligence techniques with emphasis on the interpretability of fuzzy rule-based classifiers and Web Intelligence.
Contents
Chapter 1 Innovations in Web Personalization
Giovanna Castellano, Anna Maria Fanelli, Maria Alessandra Torsello, Lakhmi C. Jain

Chapter 2 A Semantic Content-Based Recommender System Integrating Folksonomies for Personalized Access
Pasquale Lops, Marco de Gemmis, Giovanni Semeraro, Cataldo Musto, Fedelucio Narducci, Massimo Bux

Chapter 3 Exploiting Ontologies for Web Search Personalization
John Garofalakis, Theodoula Giannakoudi

Chapter 4 How to Derive Fuzzy User Categories for Web Personalization
Giovanna Castellano, Maria Alessandra Torsello

Chapter 5 A Taxonomy of Collaborative-Based Recommender Systems
Fabián P. Lousame, Eduardo Sánchez

Chapter 6 A System for Fuzzy Items Recommendation
Corrado Mencar, Ciro Castiello, Danilo Dell’Agnello, Anna Maria Fanelli

Author Index
1 Innovations in Web Personalization

Giovanna Castellano¹, Anna Maria Fanelli¹, Maria Alessandra Torsello¹, and Lakhmi C. Jain²

¹ Computer Science Department, University of Bari, Italy, Via Orabona, 4 - 70125 Bari, Italy
² University of South Australia, Mawson Lakes Campus, South Australia, Australia
Abstract. The diffusion of the Web and the huge amount of information available online have given rise to the urgent need for systems able to intelligently assist users when they browse the network. Web personalization offers this invaluable opportunity, representing one of the most important technologies required by an ever increasing number of real-world applications. This chapter presents an overview of Web personalization in the context of intelligent systems.
G. Castellano, L.C. Jain, A.M. Fanelli (Eds.): Web Person. in Intel. Environ., SCI 229, pp. 1–26. © Springer-Verlag Berlin Heidelberg 2009, springerlink.com

1 Introduction

With the explosive growth of the Internet and the easy availability of information on the Web, we have entered a new information age. Today, the Web provides a new medium for communication, changing the traditional way of gathering, presenting, sharing and using information. In the era of the Web, the problem of information overload is continuously expanding. When browsing the Web, users are very often overwhelmed by the huge amount of information available online. Indeed, the ever more complex structure of sites, combined with the heterogeneous nature of the Web, makes Web navigation difficult for ordinary users, who are often faced with the challenging problem of finding the desired information at the right time.

An important step in the direction of alleviating the problem of information overload is represented by Web personalization. Web personalization can be simply defined as the task of adapting the information or services provided by a Web site to the needs and interests of users, exploiting the knowledge gained from the users’ navigational behavior and individual interests, in combination with the content and the structure of the Web site. The need to offer personalized services to users and to provide them with information tailored to their needs has prompted the development of new intelligent systems able to collect knowledge about the interests of users and adapt their services in order to meet the users’ needs.

Web personalization is a fundamental task in an increasing number of application domains, such as e-commerce, e-business and information retrieval. Depending on the context, the personalization functions may change. In e-commerce, for example, personalization can offer the useful function of suggesting interesting products or advertising on the basis of the interests of online customers. This function is generally realized through recommendation systems, which represent one of the most popular approaches to Web personalization. In information retrieval, personalization makes it possible to tailor the search process to the users' needs, providing them with more appropriate results for their queries. These are only a few examples among the variety of personalization functions that could be offered.

Web personalization has attracted the interest of the scientific community. Many research efforts have been devoted to the investigation of new techniques for the development of systems endowed with personalization functionalities. This has led to the growth of a new flourishing research area, known as Web Intelligence (WI), which has been recognized as the research branch that applies principles of Artificial Intelligence and Information Technology in the Web domain. The main objective of WI is the development of Intelligent Web Information Systems, i.e. systems endowed with mechanisms associated with human intelligence, such as reasoning, learning, and so on. The growing development of WI is strongly related to the complexity and the heterogeneity of the Web, due to the variety of objects included in the network and the complex way in which these are connected. Indeed, Web data are characterized by uncertainty and are fuzzy in nature. In this context, a big challenge is how to develop intelligent techniques able to cope with uncertainty and complexity.

This chapter provides a comprehensive view of Web personalization, which is presented as a particular data mining application with the goal of acquiring all possible information about users accessing the Web site in order to deliver them personalized functionalities.
In particular, according to the general scheme of a data mining process, the main steps of a Web personalization process are distinguished, namely Web data collection, Web data preprocessing, pattern discovery and personalization. This chapter provides a detailed description of each of these steps. To complete the introductory treatment of Web personalization, the different techniques that have been proposed in the literature for each personalization step are examined, providing a review of works in this field.

Once the motivations for Web personalization have been explained, a roadmap of Web personalization is delineated, emphasizing the different personalization functions which can be offered and the variety of approaches proposed for the realization of personalization systems. Subsequently, the Web personalization process is described as a data mining application, and the ideas behind Web usage mining and its use for Web personalization are presented. Finally, the stages involved in a usage-based Web personalization system are discussed in detail, with reference to the majority of the existing methods.
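The four steps named above (data collection, preprocessing, pattern discovery, personalization) can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration and is not taken from the chapter: the log records in `RAW_LOG`, the 30-minute session timeout, the file-extension filter, and the use of simple page co-occurrence counts as the "pattern". Real systems typically use richer log formats and mining techniques.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Step 1, data collection (hypothetical records): (user_id, seconds, url).
RAW_LOG = [
    ("u1", 0, "/home"), ("u1", 30, "/laptops"), ("u1", 60, "/laptops/x1"),
    ("u1", 5000, "/home"), ("u1", 5030, "/phones"),
    ("u2", 10, "/home"), ("u2", 40, "/laptops"), ("u2", 70, "/laptops/x1"),
    ("u2", 90, "/logo.png"),
    ("u3", 20, "/laptops"), ("u3", 50, "/laptops/x1"), ("u3", 80, "/bags"),
]

def preprocess(log, timeout=1800):
    """Step 2: drop requests for embedded resources, then split each
    user's clicks into sessions using an (assumed) inactivity timeout."""
    by_user = defaultdict(list)
    for user, ts, url in sorted(log, key=lambda r: (r[0], r[1])):
        if not url.endswith((".png", ".gif", ".jpg", ".css", ".js")):
            by_user[user].append((ts, url))
    sessions = []
    for clicks in by_user.values():
        current, last_ts = [], None
        for ts, url in clicks:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append(current)
                current = []
            current.append(url)
            last_ts = ts
        if current:
            sessions.append(current)
    return sessions

def discover_patterns(sessions):
    """Step 3: count how often two pages occur in the same session."""
    pairs = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            pairs[(a, b)] += 1
    return pairs

def personalize(current_pages, pairs, top_n=2):
    """Step 4: suggest unseen pages that co-occur most often with the
    pages visited so far in the active session."""
    scores = Counter()
    for (a, b), n in pairs.items():
        if a in current_pages and b not in current_pages:
            scores[b] += n
        elif b in current_pages and a not in current_pages:
            scores[a] += n
    return [page for page, _ in scores.most_common(top_n)]

sessions = preprocess(RAW_LOG)
patterns = discover_patterns(sessions)
suggestions = personalize({"/laptops"}, patterns)
```

With this toy log, a visitor currently on `/laptops` would be pointed first to `/laptops/x1` and then to `/home`, because those pages share the most sessions with `/laptops`. Co-occurrence counting is used here only because it is the shortest pattern-discovery step to write down; it stands in for whatever mining technique a real system would apply.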
2 Web Personalization Roadmap

Web personalization can be defined as any set of actions that can tailor the Web experience to a particular user or set of users. The actions can range from simply making the presentation more pleasing to anticipating the needs of a user and providing customized and relevant information. As a consequence, a Web personalization system can be developed in order to offer a variety of personalization functions, making the Web a friendlier environment for its individual users and hence creating trustworthy relationships between a Web site and its visitors. However, different approaches have been proposed to develop effective Web personalization systems. In the following subsections, the variety of functions that can be offered by a Web personalization system is described first. Then, the different approaches which have been proposed to develop several kinds of personalization are discussed.

2.1 Web Personalization Functions
According to Pierrakos et al. [2003], four basic classes of personalization functions can be distinguished, namely memorization, guidance, customization and task performance support. Each of these functions is examined below in more detail, from the simplest to the most complicated.

Memorization

Memorization represents the simplest and the most widespread class of personalization functions. In this form of personalization, the system records in its memory information about users accessing the Web site (e.g. using cookies), such as the name, the browsing history, and so on. Then, this information is used by the personalization system as a reminder of the user’s past behavior. In fact, when the user returns to the Web site, the stored information is exploited, without further processing, to recognize and greet the returning user. Memorization is not offered as a stand-alone function but is usually part of a more complete personalization solution. Examples of personalization functions belonging to this class are:

• User Salutation: The Web personalization system recognizes the returning user and displays a personalized message, generally including the user’s name together with a welcome sentence. Though user salutation is one of the simplest forms of personalization, it represents a first step towards increasing the user’s loyalty in most commercial Web applications. In fact, users feel more comfortable accessing Web sites that recognize them as individuals, rather than as regular visitors.
• Bookmarking: In this personalization function, the system records the pages that a user has visited during his/her past accesses. The lists of visited pages are then used in subsequent visits of the same user: when the user returns to the Web site, the personalization system presents them by means of a personalized bookmarking scheme for that site, supporting the user in the navigation activity.
• Personalized access rights: A Web site can define personalized access rights that make it possible to distinguish different types of users (for example, common users and authorized users). Different access rights are useful to differentiate the category of information that users may access (product prices, reports) or to establish the set of operations that a category of users may execute (download files, e-mail).
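The three memorization functions listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any particular system: the in-memory `STORE` keyed by a cookie value, the `VisitorRecord` schema, and the role and operation names in `ACCESS_RIGHTS` are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class VisitorRecord:
    """What the site remembers about one visitor (hypothetical schema)."""
    name: str
    role: str = "common"  # assumed categories: "common" or "authorized"
    visited: list = field(default_factory=list)

# Hypothetical in-memory store keyed by the value of an identifying cookie.
STORE = {}

# Hypothetical mapping from user category to the operations it may execute.
ACCESS_RIGHTS = {
    "common": {"view_catalog"},
    "authorized": {"view_catalog", "view_prices", "download_files"},
}

def record_visit(cookie_id, name, page, role="common"):
    """Remember the visitor on first contact and log each new page once."""
    rec = STORE.setdefault(cookie_id, VisitorRecord(name=name, role=role))
    if page not in rec.visited:
        rec.visited.append(page)
    return rec

def salutation(cookie_id):
    """User salutation: greet a recognized visitor by name."""
    rec = STORE.get(cookie_id)
    return f"Welcome back, {rec.name}!" if rec else "Welcome!"

def bookmarks(cookie_id):
    """Bookmarking: pages seen in past visits, in order of first visit."""
    rec = STORE.get(cookie_id)
    return list(rec.visited) if rec else []

def may(cookie_id, operation):
    """Personalized access rights: unknown visitors count as common users."""
    rec = STORE.get(cookie_id)
    role = rec.role if rec else "common"
    return operation in ACCESS_RIGHTS[role]

record_visit("c42", "Ada", "/reports", role="authorized")
record_visit("c42", "Ada", "/prices")
```

After these two visits, `salutation("c42")` greets Ada by name, `bookmarks("c42")` lists `/reports` and `/prices`, and `may("c42", "view_prices")` holds while the same check fails for a cookie the site has never seen. Note that the stored data is used as-is, which matches the description of memorization as requiring no further processing.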
Guidance

Guidance (or recommender system) represents the class of personalization functions consisting in the ability of a Web site to assist users by quickly providing them with relevant information according to their interests, or by suggesting alternative browsing options [Mobasher et al., 2000, Nasraoui et al., 2002, Schafer et al., 1999]. In this case, the personalization system relies on data reflecting the user preferences, collected both implicitly (user browsing history stored by the Web server in access log files) and explicitly (through the completion of questionnaires or dedicated registration forms). In the following, examples of guidance functions are described.

• Recommendation of hyperlinks: This function consists in the dynamic recommendation of links deemed to be interesting according to the user preferences. The suggested links can be presented in the form of a recommendation list displayed in a separate frame of the current Web page or listed in a dedicated pop-up window. In Kobsa et al. [2001], recommendation of links is presented as one of the most developed personalization functionalities, for the suggestion of links to topics of information or to an advised navigational path that the user might follow. Recommender systems are especially employed in the e-business field and in many e-commerce applications in order to suggest products useful to the clients/users and to increase their loyalty.
• User Tutoring: For this guidance function, basic notions of Adaptive Educational Systems have been applied to personalize Web sites. A Web site can offer personalized guidance functions to an individual in each step of the user's interaction with the site, taking into account the knowledge and the interests of that user. This functionality is achieved by the Web site, for example, by recommending other Web pages to the user or by adding explanatory content to the Web pages. An application of this personalization function can be found in Webinars (Web seminars), which are live or replayed multimedia presentations conducted from a Web site.

Customization

In general, in this form of personalization, the system takes as input the user preferences (generally collected by means of registration forms) and exploits these to customize the content or the structure of a Web page. This process generally tends to be manual or semi-automatic. The major goal of this personalization function is the efficient management of the information load by alleviating and facilitating the user interactions with the site. Examples of this class of personalization functions are:

• Personalized Layout: This customization function refers to changing the layout, color or local information of the Web pages according to the profile of the connected user. Personalized layout is usually exploited by Web portals, such as Yahoo and Altavista, which offer customized functionalities in order to create personalized “My-Portals” Web sites.
• Content Customization: The content of Web pages is modified in order to meet the interests and the preferences of the users. For example, this personalization function allows a Web page to be displayed in different ways (in summarized or in extended form), depending on the type of user accessing the site. To make the appropriate modifications to the Web page content, the user knowledge is also taken into account. An example of a Web site with content customization functions can be found in Schwarzkopf [2001].
• Customization of hyperlinks: A Web site can also offer customized functionalities by adding or removing links within a particular page. In this way, rarely used links are eliminated, changing the topology of the Web site and improving its usability. This form of customization is described in Chignoli et al. [1999].
• Personalized pricing scheme: Together with the recommendation of hyperlinks, this personalization functionality can be employed in e-commerce applications in order to attract users who are not usual visitors or to strengthen the client/user loyalty. For example, a personalized pricing scheme grants special discount percentages to users that have been recognized as loyal customers. Acquisti and Varian [2005] present a model which allows sellers to offer enhanced services to previous customers by conditioning their price offers on the prior purchase behavior of consumers.
• Personalized product differentiation: The aim of this form of personalization is to satisfy the customer needs by transforming a standard product into a personalized solution for an individual. This personalization function proves to be a powerful method, especially in the field of marketing. Voloper Global Merchant (VGM) is an example of a Web site which offers multiple pricing levels and product differentiation according to the user needs.

A description of these last two kinds of personalization functions can be found in Choudhary et al. [2005].
Task performance support

Task performance support represents the most advanced personalization function, inherited from a category of Adaptive Systems known as personal assistants [Mitchell et al., 1994]. In these client-side personalization systems, a personal assistant executes actions on behalf of the user in order to facilitate access to relevant information. This approach requires the involvement of the user, including access to, installation of and maintenance of the personal assistant software. Examples of personalization functions included in this class are described below.

• Personalized errands: A Web personalization system offers this form of personalization by executing a number of actions in order to assist the work of the users, such as sending an e-mail, downloading various items, and so on. Depending on the sophistication of the personalization system, these errands may vary from simple routine actions to more complex ones that take into account the personal circumstances of the user.
• Personalized query completion: This personalization function is generally used to improve the performance of information retrieval systems. In
fact, a system can add terms to the user queries submitted to a search engine or to a Web database system with the aim of enhancing or completing the user requests and making them more comprehensible.
• Personalized negotiations: This represents one of the most advanced task performance support functions and it requires a high degree of sophistication from the personalization system in order to earn the trust of the user. Here, the system can play the role of negotiator on behalf of a user and may participate in Web auctions [Bouganis et al., 1999].

2.2 Approaches to Web Personalization
Web personalization has been recognized as one of the major remedies to the information overload problem and as a means to increase the loyalty of Web site users. Due to the importance of providing personalized Web services, different approaches have been proposed in the past few years in order to develop systems provided with personalization functionalities. Starting from architectural and algorithmic considerations, Mobasher et al. [2000] have categorized the approaches and techniques used to develop the existing personalization systems into three general groups: rule-based systems, content-based filtering systems and collaborative filtering systems. However, a great deal of work has also been devoted to hybrid personalization systems, which combine elements of the three approaches. In the following, a brief overview of the most influential approaches proposed for the development of personalization systems is presented.

Rule-based personalization systems

Rule-based personalization systems recommend items to their users by means of a number of decision rules, generated either automatically or manually. Many e-commerce Web sites that are provided with recommendation technologies employ manual rule-based systems to offer personalized services to their customers. In this kind of system, decision rules are manually generated by the Web site administrator on the basis of demographic and other personal information about users. These rules are exploited to modify, for example, the content served to a user whose profile satisfies one or more decision rules. A first drawback of personalization systems based on decision rules lies in the knowledge engineering bottleneck. In fact, in such systems the type of personalization depends heavily on the knowledge engineering effort of the system designers, who have to create a rule base that takes into account the specific characteristics of the domain or the results of market research.
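The manual rule-based mechanism described above can be illustrated with a minimal sketch. The rule names, profile fields and action labels below are hypothetical and chosen only for illustration; they are not taken from any of the cited systems. Each rule pairs a condition on the user profile with an action, and every action whose condition holds is applied.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """A hand-written decision rule: if condition(profile) holds, apply action."""
    name: str
    condition: Callable[[Dict], bool]
    action: str


# Hypothetical rule base, as a site administrator might write it from
# demographic and other personal information about users.
RULES = [
    Rule("student_offers", lambda p: p.get("occupation") == "student",
         "show_student_offers"),
    Rule("local_headlines", lambda p: p.get("country") == "IT",
         "show_italian_headlines"),
    Rule("large_font", lambda p: p.get("age", 0) >= 65,
         "use_large_font"),
]


def personalize(profile: Dict, rules: List[Rule] = RULES) -> List[str]:
    """Return the actions of all rules satisfied by the given user profile."""
    return [rule.action for rule in rules if rule.condition(profile)]


actions = personalize({"occupation": "student", "country": "IT", "age": 22})
print(actions)
```

The sketch also makes the knowledge engineering bottleneck visible: every new personalization behavior requires a designer to write and maintain another rule by hand.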
A further drawback of this kind of system lies in the methods used for the generation of user profiles. Here, user profiles are generally created explicitly, during the interactions of users with the site. To classify users into different categories (or user profiles) and to derive the rules to be used for personalization, research has mainly focused on the employment of machine learning techniques. In these tasks, the input is usually affected by the subjective description of users or their
interests given by the users themselves. Moreover, the generated user profiles are often static, so the performance of personalization systems based on this approach decreases over time as the profiles age. Examples of products which adopt this kind of approach are the personalization engine of Yahoo [Manber et al., 2000], Websphere Personalization of IBM (www306.ibm.com/software/websphere/) and Broadvision (www.bradvision.com).

Content-based filtering personalization systems

Personalization systems which fall in this category exploit various elements concerning the Web content in order to discover the personal preferences of the current user. The basic assumption of this approach is that the choices a user makes in the immediate future will be very similar to the choices made by the same user in his/her immediate past. In content-based filtering personalization systems, recommendation generation is based on the analysis of the items previously rated by a user and on the derivation of a user profile from the content descriptions of these items. The content description of an item generally includes a set of features or attributes that characterize it. In particular, in such systems, the content descriptions of the items for which the user has previously expressed interest constitute the user profile. The user profile is then used to predict a rating for previously unseen items, and those deemed potentially interesting are recommended to the user. In content-based filtering systems, the task of recommendation generation involves the comparison between the extracted features of unseen or unrated items and the content descriptions characterizing the user profile. Items that are deemed sufficiently similar to the identified user profile are recommended to the current user.
In most e-commerce applications and other Web-based applications where personalization functions are developed through the content-based filtering approach, the content descriptions of the items are usually represented by textual features extracted from Web pages or product descriptions. In this kind of personalization system, well-known document modeling techniques are exploited together with other principles derived from research in the fields of information retrieval and information filtering. Generally, user profiles are expressed in the form of vectors, where each component represents a weight or a degree of interest related to an item. Predictions about the user interest in a particular item can be derived through the computation of vector similarities, using measures such as the cosine similarity, or through probabilistic approaches such as Bayesian classification. In content-based filtering personalization systems, the constructed user profiles do not have a collective (or aggregate) nature: each profile refers to an individual user and is built only on the basis of the characteristics of items previously seen or rated by the active user. Examples of early systems which use the content-based filtering approach to implement personalization functions are NewsWeeder [Lang, 1995], Letizia [Lieberman, 1995], PersonalWebWatcher [Mladenic, 1996], InfoFinder [Krulwich and Burkey, 1996], Syskill and Webert [Pazzani and Billsus, 1997], and the naive Bayes nearest neighbor approach proposed in Schwab et al. [2000].
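The vector-based matching just described can be sketched as follows. This is only an illustrative example under stated assumptions: profiles and item descriptions are represented as sparse dictionaries mapping textual features (terms) to weights, the function names `cosine` and `recommend`, the sample data and the 0.5 threshold are all hypothetical, and none of the cited systems is claimed to work exactly this way.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(profile, items, threshold=0.5):
    """Return (item, score) pairs whose similarity to the profile reaches
    the threshold, most similar first."""
    scored = [(name, cosine(profile, feats)) for name, feats in items.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical user profile built from items the user liked in the past.
profile = {"football": 0.9, "tennis": 0.4}

# Hypothetical content descriptions of unseen items.
items = {
    "match_report": {"football": 1.0, "league": 0.5},
    "recipe": {"pasta": 1.0, "tomato": 0.7},
    "open_preview": {"tennis": 1.0, "football": 0.2},
}

recommendations = recommend(profile, items)
for name, score in recommendations:
    print(name, round(score, 3))
```

Because the profile is built only from the features of items the user already liked, an item such as `recipe`, which shares no feature with the profile, can never be recommended; this is the overspecialization that the text attributes to the approach.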
NewsWeeder is a tool which adaptively constructs user models starting from the browsing behavior of a user, based on the similarity between Web documents containing news items. The constructed models can be used to filter news items taking into account the requirements of each user. Syskill and Webert generates user profiles from previously rated Web pages on a particular topic in order to distinguish between interesting and irrelevant pages. To learn user profiles, it uses the 128 most informative words from a page and trains a naive Bayes classifier to predict which of the unseen pages are interesting and which are uninteresting for the user. This system requires the user to provide initial ratings for Web pages. To spare the user from explicitly rating Web documents, Letizia defines implicit interest indicators to compute the content similarity between previously visited pages and candidate pages for recommendation. The naive Bayes nearest neighbor approach, proposed by Schwab et al. [2000], is used to build user profiles from implicit observations. In their recommendation system, the authors modify nearest neighbor and naive Bayes to deal with positive observations only, by using distance and probability thresholds.

The content-based filtering approach to personalization suffers from several limitations. The primary drawback of personalization systems based on this approach is strictly related to the way user profiles are generated. In fact, these are derived by considering only the descriptions of items previously rated or seen by the user. As a result, user profiles are overspecialized and may often miss important pragmatic relationships between Web objects, such as their common utility in the context of a particular task. Also, the system depends heavily on the availability of content descriptions of the items being recommended.
Moreover, approaches based on individual profiles lack serendipity, as recommendations are narrowly focused on the past preferences of the users. In addition, given the heterogeneous nature of Web data, the extraction of textual features for the content descriptions of items is not always a simple task.

Collaborative filtering personalization systems

To overcome the limitations of content-based filtering systems, Goldberg et al. [1992] introduced the collaborative filtering approach for generating a personalized Web experience for a user. Collaborative (also named social) filtering personalization systems aim to personalize a service without considering features of the Web content. This personalization approach rests on a basic idea: the interests of the current user are considered similar to the interests of users who have made similar choices in the past, referred to as the neighborhood of the current user. Hence, in this kind of system, personalization is achieved by searching for common features in the preferences of different users, generally expressed explicitly by the users in the form of item ratings stored by the system. More specifically, personalization systems based on this approach match the ratings of the current user against those expressed by similar users in order to produce recommendations for items not yet rated or seen by the current user. One of the primary techniques to accomplish the task of recommendation
generation is the standard memory-based k-Nearest-Neighbor (kNN) classification approach. This approach consists in comparing the current user profile with the historical user profiles stored by the system in order to find the top k users whose expressed preferences are most similar to those of the current user. The kNN classification approach gives rise to an important limitation of collaborative filtering techniques, namely their lack of scalability. Essentially, kNN requires the neighborhood formation phase to be performed as an online process. As the number of users and items increases, this approach may lead to unacceptable latency in providing recommendations during the interaction of users. The sparsity of the available Web data represents another relevant weakness of the collaborative filtering approach to personalization. In fact, as the number of items increases, the density of each user record decreases, and a record often contains only a small number of rating values for the rated or visited items. As a consequence, establishing the similarity between pairs of users becomes a complicated task, since a significant overlap between the visited or rated items in the corresponding user records becomes less likely. The collaborative filtering approach suffers from additional disadvantages. Ratings for an item have to be available before it can be recommended. This is referred to as the new item rating problem. Another disadvantage is referred to as the new user problem: a new user has to rate a certain number of items before he/she can obtain appropriate recommendations from the system. A number of optimization strategies have been proposed in order to remedy these shortcomings [Aggarwal et al., 1999, O'Connor and Herlocker, 1999, Sarwar et al., 2000].
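The memory-based kNN scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited system: the rating data and user names are invented, cosine similarity is one possible choice of user similarity, and the prediction is taken as the similarity-weighted average of the ratings given by the k nearest neighbors who rated the target item.

```python
import math


def user_similarity(a, b):
    """Cosine similarity between two user rating records (dicts item -> rating)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(r * r for r in a.values()))
    norm_b = math.sqrt(sum(r * r for r in b.values()))
    return dot / (norm_a * norm_b)


def knn_predict(ratings, user, item, k=2):
    """Predict the rating of `user` for `item` from the k most similar users
    who have rated that item. Returns None if no prediction is possible."""
    candidates = [
        (user_similarity(ratings[user], record), record[item])
        for other, record in ratings.items()
        if other != user and item in record
    ]
    # Neighborhood formation happens here, online, at prediction time.
    neighbors = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None
    return sum(sim * rating for sim, rating in neighbors) / total


# Hypothetical rating database (ratings on a 1-5 scale).
ratings = {
    "ann": {"a": 5, "b": 4, "c": 1},
    "bob": {"a": 5, "b": 5, "d": 4},
    "cat": {"a": 1, "c": 5, "d": 2},
    "dan": {"b": 4, "c": 1, "d": 5},
}

prediction = knn_predict(ratings, "ann", "d", k=2)
print(prediction)
```

Each call to `knn_predict` scans all stored user records, which illustrates the scalability problem; the case where no neighbor has rated the item (the function returns `None`) corresponds to the new item rating problem.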
The proposed strategies rely on dimensionality reduction to alleviate the sparsity problem of the data, as well as on the offline categorization of the user records by means of different clustering techniques, which allows the online component of the personalization system to search only within a matching cluster. A growing body of work has also been devoted to enhancing collaborative filtering by integrating data from other sources such as content and user demographics [Claypool et al., 1999, Pazzani and Billsus, 2006]. Among all the proposed strategies, model-based collaborative filtering systems have been developed as one of the most relevant variants of the traditional collaborative filtering approach to Web personalization. A representative example of the model-based variants of collaborative filtering is known as item-based collaborative filtering. In item-based collaborative filtering systems, the offline component builds, starting from the user rating database, the item-item similarity matrix, where each component expresses the similarity between a pair of the considered items. The item similarity is not based on the content descriptions of the items but only on the user ratings. Each item is generally represented in the form of an m-dimensional vector (where m is the number of users) and the similarities between pairs of items are computed by using similarity measures such as the cosine similarity or the correlation-based similarity. The item similarity matrix is used in the online prediction phase of the system to generate recommendations by predicting the ratings for items not previously seen by the current user. The predicted rating values are
calculated as a weighted sum of the ratings of the items in the neighborhood of the target item, which consists only of those items that have been previously rated by the current user. As the number of considered items increases, storing the item similarity matrix may require a huge quantity of memory. Rather than considering all item similarity values, a proposed alternative consists in storing only the similarity values of the k most similar items. Here k represents the model size, which affects the accuracy of the recommendation approach: as k decreases, both the coverage and the recommendation accuracy are reduced. Collaborative filtering personalization systems have gained popularity and commercial success in a huge number of e-commerce applications for recommending products. An example of such a system is GroupLens [Konstan et al., 1997]. In this recommendation system, a user profile is defined as an n-dimensional vector, where n is the number of netnews articles. If an article has been rated, its corresponding element in the user profile vector contains the specified rating value. Articles not rated by the current user but highly rated by the neighborhood users are candidates to be recommended to the current user.
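The item-based scheme can be illustrated with a short sketch that separates the offline model construction from the online prediction. The data, the function names and the choice of cosine similarity are assumptions made for the example; the normalization of the weighted sum by the sum of the similarities is also an assumption, adopted here so that predictions stay on the rating scale.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_model(ratings, k=2):
    """Offline phase: represent each item as a vector over users and keep,
    for every item, only its k most similar items (the model size)."""
    vectors = defaultdict(dict)
    for user, record in ratings.items():
        for item, rating in record.items():
            vectors[item][user] = rating
    model = {}
    for i in vectors:
        sims = [(j, cosine(vectors[i], vectors[j])) for j in vectors if j != i]
        sims = [(j, s) for j, s in sims if s > 0]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        model[i] = dict(sims[:k])
    return model


def predict(model, user_ratings, target):
    """Online phase: weighted sum over the stored neighbors of the target
    item that the current user has already rated."""
    num = den = 0.0
    for item, sim in model.get(target, {}).items():
        if item in user_ratings:
            num += sim * user_ratings[item]
            den += sim
    return num / den if den else None


# Hypothetical rating database (ratings on a 1-5 scale).
ratings = {
    "ann": {"a": 5, "b": 4, "c": 1},
    "bob": {"a": 5, "b": 5, "d": 4},
    "cat": {"a": 1, "c": 5, "d": 2},
    "dan": {"b": 4, "c": 1, "d": 5},
}

model = build_model(ratings, k=2)
prediction = predict(model, ratings["ann"], "d")
print(sorted(model["d"]), prediction)
```

With k=1 only the single most similar item is stored for each item, so the memory footprint shrinks but a prediction becomes impossible whenever the user has not rated that one neighbor, which is the loss of coverage mentioned in the text.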
3 The Web Personalization Process

Generally speaking, the ability of a Web personalization system to tailor content and recommend items to a user presupposes that the system can infer the needs of that user, based on previous or current interactions with him/her and possibly on the behavior of other users. In other words, the system has to obtain information about the user and exploit it to infer his/her needs. Hence, central to any personalization system is a user-centric data model. Information about user interests may be collected implicitly or explicitly, but in either case such information should be attributable to a specific user. The association of Web data with a specific user is not always a simple task, especially when data are implicitly collected. This is one of the first problems to be addressed in the Web personalization process. The subsequent analysis of the data characterizing the user interests aims to learn user profiles that are used to predict the future interests of connected users. Thus, in terms of the learning task, personalization can be viewed as a:

• Prediction Task: a model has to be built in order to predict ratings for items not currently seen or rated by the user.
• Selection Task: a model has to be built in order to select the N most interesting items that the current user has not already rated.

The incorporation of machine learning techniques in the context of Web personalization can provide a complete solution to the overall adaptation task. It proves to be an appropriate way to analyze data collected on the Web and to extract useful knowledge from them. The effort carried out in this direction has led to the growth of a new research area, named Web mining [Arotaritei and Mitra, 2004, Furnkranz, 2005, Kosala and Blockeel, 2000], which refers to the application of Data Mining methods to automatically discover and extract knowledge
from data generated by the Web. Commonly, according to the different types of Web data which can be considered in the process of personalization, Web mining is split into three categories, namely Web content mining, Web structure mining, and Web usage mining.

Web content mining [Cimiano and Staab, 2004, Liu and Chang, 2004] is concerned with the discovery of useful information from Web contents. Web content can encompass a very broad range of data, such as text, images, audio, video, metadata and hyperlinks. Moreover, Web content data can be unstructured, such as free text; semi-structured, such as HTML documents; or more structured, such as data in tables or database-generated HTML pages. Recently, research in this field has been focusing on mining multiple types of data, leading to a new branch called multimedia data mining, which is an instance of Web content mining.

Web structure mining [Costa and Gong, 2005, Furnkranz, 2002] discovers the model underlying the link structures of the Web. The model is based on the topology of the hyperlinks, which characterizes the structure of the Web graph. This can be used to categorize Web pages and to generate information about the relationships or the degrees of similarity among different Web pages.

Web usage mining [Facca and Lanzi, 2005, Mobasher, 2005, Pierrakos et al., 2003, Zhou et al., 2005] aims at discovering interesting patterns in the usage data generated during the interactions of users with the Web site, which generally characterize the navigational behavior of users. Web usage data include the data from Web server access logs, proxy server logs, registration data, mouse clicks, and any other data resulting from the user interactions. Web usage mining can be a valuable and important source of ideas and solutions for realizing Web personalization.
It provides an approach to the collection and preprocessing of usage data, and constructs models representing the behavior and the interests of users. These models can be used by a personalization system automatically, i.e. without the intervention of any human expert, to realize the required personalization functions. Web usage mining represents the most widely employed approach to the development of personalization systems, as also demonstrated by the large number of research papers published on this topic [Abraham, 2003, Facca and Lanzi, 2005, Mobasher, 2006, Pierrakos et al., 2003]. In this chapter, attention is mainly focused on the Web personalization process based on the Web usage mining approach. In general, a usage-based Web personalization process, being essentially a data mining process as stated before, consists of the following basic data mining stages [Mobasher et al., 2000]:

• Web data collection: Web data are gathered from various sources by using different techniques that make it possible to obtain efficient collections of user data for personalization.
• Web data preprocessing: Web data are preprocessed to obtain data in a form suitable for analysis in the next step. In particular, in this stage, data are cleaned of noise, inconsistencies are resolved, and finally data are organized in an integrated and consolidated manner.
• Pattern discovery: the collected data are analyzed in order to extract correlations between data and to discover usage patterns, corresponding to the behavior and the interests of users. In this stage, learning methods such as clustering, association rule discovery and sequential pattern discovery are applied in order to automate the construction of user models.
• Personalization: the extracted knowledge is employed to implement the actual personalization functions. The knowledge extracted in the previous stage is evaluated and the set of actions necessary for generating recommendations is determined. In a final step, the generated recommendations are presented to the users using proper visualization techniques.

In the overall process of usage-based Web personalization, two principal and related modules can be identified: an offline module and an online module. In the offline module, Web usage data are collected and preprocessed. Subsequently, the specific usage mining tasks are performed in order to derive the knowledge useful for the implementation of personalization functions. Hence, the offline module generally covers the first three stages previously identified: Web data collection, Web data preprocessing and pattern discovery. The online module mainly comprises a personalization engine which exploits the knowledge derived by the offline activities in order to provide users with interesting information according to their needs and interests.
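The division of labor between the offline and the online module can be sketched as follows. The example is deliberately simplified and rests on assumptions: sessions are given as lists of page URLs (invented here), and pattern discovery is reduced to counting how often two pages occur in the same session, a plain co-occurrence count used as a stand-in for the clustering or association rule methods named above.

```python
from collections import Counter
from itertools import combinations


def discover_patterns(sessions):
    """Offline module: count how often each pair of pages co-occurs
    in the same session (a simple stand-in for pattern discovery)."""
    cooccurrence = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            cooccurrence[(a, b)] += 1
    return cooccurrence


def recommend(cooccurrence, active_session, n=2):
    """Online module: score pages not yet visited in the active session by
    their co-occurrence with the visited ones and return the top n."""
    seen = set(active_session)
    scores = Counter()
    for (a, b), count in cooccurrence.items():
        if a in seen and b not in seen:
            scores[b] += count
        elif b in seen and a not in seen:
            scores[a] += count
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [page for page, _ in ranked[:n]]


# Hypothetical preprocessed sessions produced by the earlier stages.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/reviews"],
    ["/products", "/cart", "/checkout"],
    ["/home", "/about"],
]

patterns = discover_patterns(sessions)          # run periodically, offline
suggestions = recommend(patterns, ["/home", "/products"])  # run per request
print(suggestions)
```

The point of the split is that the expensive pass over all sessions happens offline, while the online step only looks up precomputed counts for the pages of the active session.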
Fig. 1. The scheme of a usage-based Web personalization process
Figure 1 depicts a generalized framework for the entire Web personalization process based on Web usage mining. In the following sub-sections, a comprehensive view of this process is presented, providing a detailed discussion of each activity involved. Additionally, an overview is given of the works and methods that have been proposed as solutions for each stage.

3.1 Web Data Collection
As in any data mining application, in the Web personalization process data collection represents the primary task, which has to be performed with the aim of gathering the relevant Web data that will be analyzed to provide useful information about the user behavior [Srivastava et al., 2000]. There are two main sources of data for Web usage mining, corresponding to the two software systems interacting during a Web session: the Web server side and the client side. When intermediaries, such as proxy servers and packet sniffers, take part in the client-server communication, they become another important source of usage data. Usage data collected at the different sources represent the navigation patterns of different segments of the overall Web traffic on the site. In the following, each source of usage data is examined.

Server Side Data

Web servers are surely the richest and the most common source of Web data, because they can explicitly record large amounts of information characterizing the browsing behavior of site visitors. Data collected at the server side principally include various types of log files generated by the Web server. Data recorded in the server log files reflect the (possibly concurrent) accesses to a Web site by multiple users in chronological order. These log files can be stored in various formats. Most Web servers support as a default option the Common Log File Format (CLF), which typically includes information such as the IP address of the connected user, the time stamp of the request (date and time of the access), the URL of the requested page, the request protocol, a code indicating the status of the request, and the size of the page (if the request is successful). Other log file formats are the Extended Log Format (W3C), supported by Web servers such as Apache and Netscape, and the very similar W3SVC format, supported by Microsoft Internet Information Server.
These formats are characterized by the inclusion of additional information about the user requests, such as the URL of the page that referred the user to the requested page, the name and the version of the browser used for the navigation, and the operating system of the host machine. Data recorded in log files may not always be entirely reliable. The unreliability of these sources of data is mainly due to the presence of various levels of caching within the Web environment and to the misinterpretation of user IP addresses.
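To make the content of a server log concrete, the following sketch parses a single entry in the Common Log File Format into the fields listed above. The regular expression reflects the usual CLF layout (host, identity, user, time stamp, request line, status code, size); the sample line is invented for illustration, and the function name `parse_clf` is hypothetical.

```python
import re
from datetime import datetime

# host ident authuser [time] "method url protocol" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line):
    """Parse one CLF entry into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Month abbreviations are parsed according to the current locale;
    # English month names are assumed here.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A size of "-" means that no content was returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Invented sample entry.
sample = '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
entry = parse_clf(sample)
print(entry["host"], entry["url"], entry["status"], entry["size"])
```

Note that the only user identifier available in such an entry is the IP address in the `host` field, which is why the caching and IP misinterpretation problems discussed in the text directly affect the reliability of this data source.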
Web caching is a mechanism developed to reduce latency and Web traffic. It consists in keeping track of the Web pages requested by the users and in saving a copy of these pages for a certain period of time. Web caches can be configured either at the level of the client local browser or at an intermediate proxy server. Requests for cached Web pages are not recorded in log files. In fact, when a user accesses the same Web page again, the cached copy is returned to the user rather than a new request being made to the server. In this way, the user request does not reach the Web server holding the page and, as a result, the server is not aware of the actions and the page accesses made by the users. Cache-busting represents one solution to this first problem. It involves the use of special headers, defined either in Web servers or in Web pages, that include directives establishing which objects should be cached and for how long. The second problem, IP address misinterpretation, has essentially two causes. When an intermediate proxy server assigns the same IP address to all its users, the requests from different host machines passing through the proxy server are recorded in the log files with the same IP address. The same problem occurs when different users use the same host machine. Dynamic IP allocation gives rise to the opposite situation, where different addresses may be assigned to the same user. Both these problems may cause serious complications in the whole Web personalization process, where it is fundamental to identify individual users in order to discover their interests.

The Web server can also store other kinds of usage data by issuing and tracking cookies. Cookies are tokens (short strings) generated by the Web server for individual client browsers in order to automatically track the site users.
Through this mechanism, the Web server can store its own information about the user in a cookie log within the client machine. This information usually consists of a unique ID, created by the server, which the same server will use to recognize the user on subsequent visits to the site. The use of cookies has raised growing concerns regarding user privacy and security. In fact, cookies require the cooperation of users, who, for various reasons, can choose to disable the option for accepting them.

Another kind of data useful for Web personalization which the Web server can collect are the data explicitly supplied by the users during their interactions with the site. This kind of data is typically obtained through the completion of registration forms, which can provide important demographic and personal information as well as knowledge about the user preferences. However, these data are not always reliable, since users often provide incomplete and inaccurate information. Additional explicit user data collected at the server side can be represented by the query data generated by online visitors while searching for pages relevant to their information needs [Buchner and Mulvenna, 1999].

Client Side Data

Usage data collected at the client side are data originated by the host accessing the Web site.
A first method to collect client side data consists in the use of remote agents (generally implemented in Java or JavaScript) which are embedded in Web pages, for example as Java applets [Shahabi et al., 2001]. These agents make it possible to collect information directly from the client, such as the user browsing history, the pages visited before the current page, the sites visited before and after the current site, and the times at which the user accesses and leaves the site. This mechanism of client side data collection provides more reliable data, since it overcomes the limitations of Web caching and IP misinterpretation (seen before) that affect the use of server log files to collect information about the user browsing behavior. However, this method of usage data collection requires the cooperation of users in enabling JavaScript and Java applets on their machines. In fact, since the employment of remote agents may affect the performance of the client system, introducing additional overhead whenever the users try to access the Web site, users may choose to disable these functionalities on their systems.

An older mechanism used to collect usage data from the client host consists in modifying the source code of an existing browser, such as Mosaic or Mozilla, to enhance its data collection capabilities [Cunha et al., 1995]. Browsers are modified so that they can store information about the user navigational behavior, such as the Web pages visited by users, the access time and the response time of the server. As for the use of remote agents, the user cooperation is necessary in this case too. Modified versions of browsers are often considered a threat to user privacy. Thus, one of the main difficulties inherent in this method of data collection consists in convincing users to use these modified browser versions.
A way often used to overcome this difficulty consists in offering incentives to users, such as the additional software or services offered by companies like AllAdvantage (www.alladvantage.com) and NetZero (www.netzero.com). Moreover, modifying a modern browser is not a simple task, even when its source code is available.

Intermediary Data

Another important source of data reflecting the user browsing behavior is the proxy server. A proxy server is a software system that acts as an intermediary between the client browser and the Web server, providing security, administrative control and caching services. Proxy caching is a way to reduce the loading time of a Web page as well as the network traffic load at the server and client sides [Cohen et al., 1998]. This intermediary stores the Web page requests and the corresponding server responses in logs whose format is similar to that of server log files. The main advantage of using these logs is that, since proxy caching reveals the requests of multiple clients to multiple servers, they can be considered a valuable source of data characterizing the navigational behavior of a group of anonymous users sharing a common proxy server [Srivastava et al., 2000]. Packet sniffers provide an alternative method of collecting intermediary usage data. A packet sniffer is a piece of software (sometimes a hardware device)
which is able to monitor the network traffic coming to a Web server and to extract usage data directly from TCP/IP packets. On the one hand, the use of packet sniffers has the advantage that data are collected and analyzed in real time. On the other hand, since data are not logged, data can be lost because of problems with the packet sniffer or with the data transmission.

3.2 Web Data Preprocessing
The second stage in any usage-based Web personalization process is the preprocessing of Web data. Web data collected from the various sources described above are usually voluminous and characterized by noise, ambiguity and incompleteness. As in most data mining applications, these data need to be assembled into consistent and integrated data collections that can be used as input to the next step of pattern discovery. To accomplish this, a preliminary activity of data preprocessing is necessary. Data preprocessing involves a set of operations such as the elimination of noise, the resolution of inconsistencies, the filling in of missing values, the removal of redundant or irrelevant data, and so on. In the particular context of Web personalization, the goal of data preprocessing is to transform and aggregate the raw data into different levels of abstraction that can properly be employed to characterize the behavior of users in the overall process of personalization. Among the various levels, the pageview represents the most basic level of data abstraction. A pageview is a set of Web objects or resources, such as frames, graphics and scripts, corresponding to a single user event. Mobasher [2007] identifies the session as the most basic level of behavioral abstraction, defining a session as a sequence of pageviews referring to a single user during a single visit. The same work states that a session can be used directly as a user profile, since it captures the user behavior over time. To construct significant data abstractions, the data preprocessing stage typically includes three main activities, namely data filtering, user identification and user session identification. Data preprocessing is strongly related to the problem domain and to the quality and type of available data.
Hence, this step needs an accurate analysis of data and constitutes one of the hardest tasks in the overall Web personalization process. An additional facet to be taken into account is the trade-off involved in the preprocessing step. On the one hand, insufficient preprocessing could make the subsequent pattern analysis task more difficult. On the other hand, excessive preprocessing could remove data carrying implicit knowledge useful for the subsequent steps of the personalization process. As a consequence, the success of pattern discovery is highly dependent on the correct application of data preprocessing tasks. An extensive description of data preparation and preprocessing methods can be found in Cooley et al. [1999]. In the following, a brief description of the activities involved in the data preprocessing stage is given, focusing on the techniques applied to perform the respective tasks.
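Before any of these activities can be carried out, the raw log lines have to be turned into structured records. The following is a minimal sketch of that first step, not a method taken from the works cited above. It assumes that the server writes the widely used common or combined log format; the names `LOG_PATTERN`, `LogRecord` and `parse_line` are hypothetical and introduced only for illustration.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed input: common/combined log format, e.g.
# host ident user [time] "METHOD url PROTOCOL" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


class LogRecord(NamedTuple):
    """One parsed request (hypothetical record type for this sketch)."""
    host: str
    time: datetime
    method: str
    url: str
    status: int
    referrer: str
    agent: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Parse one log line; return None when the line does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return LogRecord(
        host=m["host"],
        time=datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z"),
        method=m["method"],
        url=m["url"],
        status=int(m["status"]),
        # The referrer and agent fields are absent in the plain common format.
        referrer=m["referrer"] or "",
        agent=m["agent"] or "",
    )
```

Lines that do not match are returned as `None` rather than raising an error, so that a caller can count or inspect them instead of silently losing data.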
Data Filtering

Data filtering is the first activity of the data preprocessing stage. It is a fundamental task devoted to cleaning raw Web data of noise. This activity mainly concerns server side data, since these can be particularly noisy. Hence, the rest of the discussion about data filtering will focus on log files. Since Web log files record all the interactions between a Web site and its users, they also contain information that is useless for describing the navigational behavior of visitors, and often a large amount of noise. The aim of data filtering is to clean Web data by analyzing the available data and removing from log files those records corresponding to irrelevant and redundant requests. Redundant records in log files are mainly due to the HTTP protocol, which issues a separate request for every file, image or multimedia object embedded in the Web page requested by the user. In this way, a single user request for a Web page may often result in several log entries that correspond to files automatically downloaded without an explicit request by the user. Since these records do not represent the actual browsing activity of the connected user, they are deemed redundant and have to be removed. Elimination of these items can be reasonably accomplished by checking the suffix of the URL name. For example, all log entries with filename suffixes such as gif, jpeg, GIF, JPEG, jpg, JPG and map can be removed. The list can be modified depending on the type of site being analyzed. Indeed, for a site consisting mainly of multimedia content, eliminating requests for these types of files would cause the loss of important and useful information [Cooley, 2000]. In addition, records corresponding to failed user requests, for example those with an error status code, are also filtered out.
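The suffix and status checks described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of any cited work: the suffix list is the example list from the text, an "error status code" is assumed to mean any HTTP status of 400 or above, and the names `IGNORED_SUFFIXES`, `is_relevant` and `filter_requests` are hypothetical.

```python
from urllib.parse import urlsplit

# Example suffix list from the text, compared case-insensitively.
# It should be adapted to the site being analyzed.
IGNORED_SUFFIXES = frozenset({"gif", "jpeg", "jpg", "map"})


def is_relevant(url: str, status: int, ignored=IGNORED_SUFFIXES) -> bool:
    """Return True when a request should be kept after data filtering."""
    # Assumption: failed requests are those with status 400 or above.
    if status >= 400:
        return False
    # Look only at the path, so that query strings do not hide the suffix.
    filename = urlsplit(url).path.rsplit("/", 1)[-1]
    if "." in filename:
        suffix = filename.rsplit(".", 1)[-1].lower()
        if suffix in ignored:
            return False
    return True


def filter_requests(requests, ignored=IGNORED_SUFFIXES):
    """Keep only the relevant (url, status) pairs, preserving their order."""
    return [(url, status) for url, status in requests
            if is_relevant(url, status, ignored)]
```

For a site consisting mainly of multimedia content, passing a smaller or empty `ignored` set keeps the media requests that would otherwise be discarded.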
Another crucial task of data filtering is the identification and elimination of accesses generated by Web robots. Web robots (also known as Web crawlers or Web spiders) are programs which traverse the Web in a methodical and automated manner, downloading complete Web sites in order to update the index of a search engine. The entries generated by these programs are not considered usage data representative of user browsing behavior, so they are filtered out from the log files. In conventional techniques, Web robot sessions are detected in different ways: by examining sessions that access a specially formatted file called robots.txt, by exploiting the User Agent field of log files, in which most crawlers identify themselves, or by matching the IP address of sessions with those of known robot clients. A robust technique to detect spider sessions has been proposed by Tan and Kumar [2002]. Based on the assumption that the behavior of robots is different from that of human users, they recognize Web robots with high accuracy by using a set of relevant features extracted from access logs (percentage of media files requested, percentage of requests made by HTTP methods, average time between requests). Another simple method to recognize robots is to monitor the navigational behavior pattern of the user.
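The three conventional checks listed above (access to robots.txt, the User Agent field, and known robot IP addresses) can be combined in a few lines. The sketch below is only illustrative and does not reproduce the feature-based technique of Tan and Kumar [2002]. The substrings in `ROBOT_AGENT_HINTS` and the `known_robot_ips` parameter are assumptions made for the example.

```python
# Assumed substrings by which crawlers identify themselves in the User Agent
# field; a real deployment would maintain its own list.
ROBOT_AGENT_HINTS = ("bot", "crawler", "spider")


def is_robot_session(requests, known_robot_ips=frozenset()):
    """Flag a session as robot-generated using the conventional checks.

    `requests` is an iterable of (ip, url, user_agent) tuples that belong
    to one session.
    """
    for ip, url, agent in requests:
        # Check 1: the session comes from a known robot client.
        if ip in known_robot_ips:
            return True
        # Check 2: the session requested the robots.txt file.
        if url.split("?", 1)[0] == "/robots.txt":
            return True
        # Check 3: the crawler identifies itself in the User Agent field.
        lowered = agent.lower()
        if any(hint in lowered for hint in ROBOT_AGENT_HINTS):
            return True
    return False
```

Sessions flagged in this way would then be removed from the log data before user and session identification.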
In particular, a visitor that follows all the links of all the pages of a Web site is considered a crawler.

User Identification

User identification is one of the most delicate and complicated steps in the overall Web personalization process. Indeed, identifying a single user is fundamental in order to distinguish his/her browsing behavior. Various methods have been proposed to automatically recognize a user. Some of the most important techniques are illustrated below. Many Web applications require explicit user registration. However, a potential problem in using such methods is the reluctance of users to share personal information. Besides, this approach has another important limitation: on many Web sites, the burden it places on users discourages navigation and visits. As a consequence, a number of methods able to automatically identify users have been developed. Among these, the simplest and most widely adopted approach consists in assigning a distinct user to each different IP address present in log files [Nasraoui and Petenes, 2003, Suryavanshi et al., 2005]. However, this method is not very accurate because, for example, a visitor may access the Web from different computers, or many users may share the same IP address (if a proxy is used). Other Web usage mining tools use more accurate approaches for a priori identification of unique visitors, such as cookies [Kamdar and Joshi, 2000]. The use of cookies is not without problems: as already discussed for server side data, users can disable cookies on their systems. An alternative method of user identification is that proposed by Pitkow [1997]. This method consists in the use of special Internet services, such as inetd and fingerd, which provide the user name and other information about the user accessing the Web server.
However, as with cookies, these services can also be disabled by users. To overcome this limitation, further methods have been proposed in the literature. Cooley et al. [1999] have proposed two different heuristics for user identification. The first analyzes Web log files expressed in the Extended Log Format, searching for different browsers or different operating systems even when the IP address is the same; such differences suggest that the requests originate from different users. The second exploits knowledge about the topology of the Web site to recognize requests of different users. More precisely, if a request for a Web page comes from the same IP address as requests for other Web pages, but no link exists between these pages, a new user is recognized.

User Session Identification

In personalization systems based on Web usage mining techniques, the usage data are analyzed in order to discover the user browsing behavior on a specific Web site, which is embedded, as specified above, in user sessions. For this reason, the
identification of user sessions represents a fundamental task for the subsequent development of personalization functions and constitutes another important step in Web data preprocessing. Based on the definitions found in the scientific literature, a user session can be defined as a delimited set of URLs corresponding to the pages visited by a user from the moment the user enters a Web site to the moment the same user leaves it [Suryavanshi et al., 2005]. Starting from this definition, the problem of user session identification is strictly related to the previous problem of identifying a single user. Assuming a user has been identified by one of the methods previously described, the next step of Web data preprocessing is to perform user session identification, by dividing the clickstream of each user into sessions. Spiliopoulou [1999] has divided the existing approaches to user session identification into two main categories: time-based and context-based methods. In time-based methods, the usual solution is either to set a minimum timeout and assume that consecutive accesses within it belong to the same session, or to set a maximum timeout, such that two consecutive accesses separated by more than it belong to different sessions. Different values have been chosen for this timeout, depending on the content of the examined site and on the particular purpose of the personalization process. Context-based methods, on the other hand, consider the access to specific kinds of pages, or refer to the definition of conceptual units of work, to identify the different user sessions. Here, transactions are recognized, where a transaction represents a subset of pages that occur in a user session. Based on the assumption that transactions depend on contextual information, Web pages are classified as auxiliary, content and hybrid pages.
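The time-based approach described above can be sketched as follows. The code assumes that each request already carries a user identifier produced by one of the identification methods, and it uses a maximum gap of 30 minutes between consecutive accesses. That value is only a placeholder for this example, since the text notes that the timeout depends on the site and on the purpose of the personalization process. The function name `sessionize` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(requests, timeout=timedelta(minutes=30)):
    """Split the clickstream of each user into time-based sessions.

    `requests` is an iterable of (user_id, timestamp, url) tuples.
    Returns a dict mapping each user_id to a list of sessions, where each
    session is the list of URLs visited in chronological order.
    """
    clickstreams = defaultdict(list)
    for user_id, timestamp, url in requests:
        clickstreams[user_id].append((timestamp, url))

    sessions = {}
    for user_id, hits in clickstreams.items():
        hits.sort(key=lambda hit: hit[0])
        user_sessions = []
        previous = None
        for timestamp, url in hits:
            # Open a new session on the first access, or when the gap from
            # the previous access exceeds the timeout.
            if previous is None or timestamp - previous > timeout:
                user_sessions.append([])
            user_sessions[-1].append(url)
            previous = timestamp
        sessions[user_id] = user_sessions
    return sessions
```

A context-based method would replace the gap test with a test on the kind of page being accessed, which requires the page classification discussed next.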
Auxiliary pages contain links to other pages of the site; content pages contain the information of interest to the user; and hybrid pages combine both kinds. Starting from this classification, Cooley et al. [1999] have distinguished content-only transactions from auxiliary-content transactions. The former include all the content pages visited by the user, whereas the latter refer to the paths followed to retrieve a content page. Several methods have been developed to identify transactions, but none of them is without problems.

3.3 Pattern Discovery
Once Web data have been preprocessed, the next stage of the Web personalization process consists in discovering patterns of usage of the Web site through the application of Web usage mining techniques. To this end, methods and algorithms from several fields, such as statistics, data mining, machine learning and pattern recognition, are applied to discover knowledge useful for the final personalization process. Most commercial applications derive knowledge about users by performing statistical analysis on session data. Many Web traffic analysis tools produce periodic reports including important statistical information describing user browsing patterns, such as the most frequently accessed pages,
average view time and average length of navigational paths. This kind of extracted knowledge may be useful to improve the system performance and to facilitate site modification. In the context of knowledge discovery techniques specifically designed for the analysis of Web usage data, research effort has mainly focused on three distinct paradigms: association rules, sequential patterns and clustering. Han and Kamber [2001] give an exhaustive review of these techniques. The most straightforward technique employed in Web usage mining is association rule mining, which captures associations among Web pages that frequently appear together in user sessions. Typically, an association rule is expressed in the following form: A.html, B.html ⇒ C.html, which states that if a user has visited page A.html and page B.html, it is very likely that in the same session the same user has also visited page C.html. This kind of approach has been used in [Joshi et al., 2003, Nanopoulos et al., 2002], while some measures of interest to evaluate association rules mined from Web usage data have been proposed by Huang et al. [2002]. Fuzzy association rules, obtained by combining association rules and fuzzy logic, have been extracted in Wong and Pal [2001]. Sequential pattern discovery turns out to be particularly useful for the identification of navigational patterns in Web usage data. In this kind of approach, the element of time is introduced in the process of discovering patterns which frequently appear in user sessions. To extract sequential patterns, two main classes of algorithms are employed: methods based on association rule mining, and methods based on the use of tree structures and Markov chains. Some well-known algorithms for mining association rules have been modified to obtain sequential patterns. For example, the Apriori algorithm has been extended to derive two new algorithms, AprioriAll and GSP, proposed in Huang et al. [2002] and Mortazavi-Asl [2001].
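The difference between an association rule and a sequential pattern can be made concrete on a handful of sessions. The sketch below is not an implementation of Apriori, AprioriAll or GSP. It only evaluates one given rule and one given ordered pattern, using the standard support and confidence measures, which are assumed here because the text does not name specific measures. The function names are hypothetical.

```python
def rule_measures(sessions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    Order within a session is ignored: a session matches a set of pages
    when it contains all of them.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    both = antecedent | consequent
    n_antecedent = sum(1 for s in sessions if antecedent <= set(s))
    n_both = sum(1 for s in sessions if both <= set(s))
    support = n_both / len(sessions) if sessions else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


def contains_in_order(session, pattern):
    """True when the pages of `pattern` occur in `session` in that order."""
    remaining = iter(session)
    # `page in remaining` advances the iterator past the first match, so
    # each page must be found after the previous one.
    return all(page in remaining for page in pattern)


def sequence_support(sessions, pattern):
    """Fraction of sessions that contain `pattern` as an ordered pattern."""
    if not sessions:
        return 0.0
    return sum(contains_in_order(s, pattern) for s in sessions) / len(sessions)
```

For the four sessions `[A, B, C]`, `[A, B]`, `[B, A, C, D]` and `[B, D]` (page suffixes omitted), the rule A, B ⇒ C has support 0.5 and confidence 2/3, whereas the ordered pattern A, B, C has support 0.25, because the third session visits B before A.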
An alternative algorithm based on the use of a tree structure has been presented in Pei et al. [2000]. Tree structures have also been used in Menasalvas et al. [2002]. Clustering is the most widely employed technique in the pattern discovery process. Clustering techniques look for groups of similar items among large amounts of data, based on a distance function which measures the similarity between items. Vakali et al. [2004] provide an exhaustive overview of Web data clustering methods used in different research works in this area. Following the classification suggested by Vakali et al., two kinds of interesting clusters can be discovered in the Web usage domain: usage clusters and Web document clusters. Xie and Phoha [2001] were the first to suggest that the focus of Web usage mining should be shifted from single user sessions to groups of user sessions. Subsequently, usage clustering techniques have been used in a large number of works on Web usage mining to group together similar sessions [Banerjee and Ghosh, 2001, Heer and Chi, 2002, Huang et al., 2002]. Clustering of Web documents aims to discover groups of pages having related content. In general, a Web document can be considered as a collection of Web pages (a set of related Web resources, such as HTML files, XML files, images, applets, multimedia resources). In this framework, the Web topology can be regarded as a directed graph, where the nodes
represent the Web pages with URL addresses and the edges among nodes represent the hyperlinks among Web pages. In this context, the concepts of compound documents [Eiron and McCurley, 2003] and logical information units [Tajima et al., 1999] have been introduced. A compound document is a set of Web pages whose link graph contains a vertex from which a path leads to every other part of the document. Moreover, a Web community is defined as a set of Web pages that link to more Web pages in the community than to pages outside of the community [Greco et al., 2004]. The main benefits derived from clustering include increasing Web information accessibility, understanding users' navigation behavior, identifying user profiles, and improving information retrieval in search engines and content delivery on the Web.

3.4 Personalization
The knowledge extracted through the process of knowledge discovery has to be exploited in the actual, final personalization process. Personalization functions can be accomplished manually or in a manner that is automatic and transparent to the user. In the first case, the discovered knowledge has to be expressed in a manner comprehensible to humans, so that it can be analyzed to support human experts in making decisions. To accomplish this task, different approaches have been introduced in order to provide useful information for personalization. An effective method for presenting comprehensible information to humans is the use of visualization tools such as WebViz [Pitkow and Bharat, 1994], which represents navigational patterns as graphs. Reports are also a good method to summarize and visualize previously generated statistical information. Personalization systems such as WUM [Spiliopoulou and Faulstich, 1998] and WebMiner [Cooley et al., 1997] use SQL-like query mechanisms for the extraction of rules from navigational patterns. Nevertheless, decisions made by the user may introduce delay and loss of information. As a consequence, a more interesting approach consists in the integration of Web usage mining techniques into the personalization process. In particular, the knowledge extracted from Web data is automatically exploited in a personalization process which adapts the Web-based application according to the discovered patterns. The discovered knowledge is subsequently delivered to the users by means of one or more personalization functions. Thus, the activities performed in the actual personalization step strongly depend on the personalization functions which the system offers. For instance, if the system offers the function of adapting the content of the Web site to the needs of current users, the content of Web pages is adapted to the interests of users, also modifying the graphical interface.
In the case of link suggestion, for example, a list of links considered interesting for users is displayed in the page currently visited. In e-commerce applications, a list of products is recommended to the online customer, taking into account the user interests. These are only a few examples of the personalization tasks performed in the actual personalization step.
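As an illustration of link suggestion, the sketch below matches the pages of the current session against previously discovered association rules and returns the best-scoring pages not yet visited. It assumes that the discovered knowledge takes the form of rules with a confidence value. This is one possible choice among the pattern types discussed earlier, not the scheme of a specific system. The name `suggest_links` and the rule representation are hypothetical.

```python
def suggest_links(current_session, rules, max_links=3):
    """Suggest pages for the current session from association rules.

    `rules` is an iterable of (antecedent, page, confidence) triples, where
    `antecedent` is a set of pages and `page` is the suggested consequent.
    """
    visited = set(current_session)
    scores = {}
    for antecedent, page, confidence in rules:
        # A rule fires when all its antecedent pages have been visited and
        # its consequent would be a new page for this user.
        if set(antecedent) <= visited and page not in visited:
            scores[page] = max(confidence, scores.get(page, 0.0))
    # Highest confidence first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda page: (-scores[page], page))
    return ranked[:max_links]
```

With rules such as `({"A.html", "B.html"}, "C.html", 0.8)` and `({"A.html"}, "D.html", 0.6)`, a session that has visited `A.html` and `B.html` receives the suggestions `C.html` and then `D.html`.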
Following the scheme of a general Web usage based personalization system, this final phase is included in the online module, which realizes the personalization functionalities offered by the Web site. All the other steps involved in the Web personalization system, i.e. Web data preprocessing and pattern discovery, are periodically performed in the offline module.
4 Conclusions

This chapter provided a comprehensive view of Web personalization, focusing especially on the different steps involved in a general usage-based Web personalization system and on the variety of approaches to Web personalization. In the last few years, research has achieved encouraging results in the field of Web personalization. However, a number of challenges and open research questions still have to be addressed. One of the key aspects of a personalization process is the derivation of user models that are able to encode the preferences and the needs of users. In this context, much work has still to be done to derive adaptive user models that are able to capture dynamically the continuous changes in the interests of users. Another important aspect that needs to be investigated concerns the definition of more appropriate metrics for evaluating user satisfaction with the generated recommendations. Also, the exploitation of relevance feedback (explicitly expressed by the users or implicitly derived by observing the behavior of users once they receive recommendations) could be useful not only to dynamically adapt user models to the changing interests of users but also to provide indicators that quantify the quality of the provided suggestions. A further, extremely interesting aspect that deserves more attention in the literature is the identification of suitable measures able to estimate the benefits obtained by endowing Web applications with personalization functionalities. This could justify the huge research efforts devoted to developing adaptive Web applications that incorporate personalization processes able to support their users by providing them with the right contents or services at the right time.
References

Abraham, A.: Business intelligence from web usage mining. Journal of Information & Knowledge Management 2(4), 375–390 (2003)
Acquisti, A., Varian, H.: Conditioning prices on purchase history. Marketing Science 24(3), 367–381 (2005)
Aggarwal, C.C., Wolf, J., Yu, P.S.: A new method for similarity indexing for market data. In: Proceedings of the 1999 ACM SIGMOD Conference, Philadelphia, PA, pp. 407–418 (1999)
Arotariteia, D., Mitra, S.: Web mining: a survey in the fuzzy framework. Fuzzy Sets and Systems 148(1), 5–19 (2004)
Banerjee, A., Ghosh, J.: Clickstream clustering using weighted longest common subsequences. In: Proceedings of the Web Mining Workshop at the 1st SIAM Conference on Data Mining (2001)
Bouganis, C., Koukopoulos, D., Kalles, D.: A real time auction system over the www. In: Proceeding of Conference on Communication Networks and Distributed Systems Modeling and Simulation, San Francisco, CA, USA (1999)
Buchner, A.G., Mulvenna, M.D.: Discovering internet marketing intelligence through online analytical web usage mining. SIGMOD Record 27(4), 54–61 (1999)
Chignoli, R., Crescenzo, P., Lahire, P.: Customization of links between classes. Technical report, Laboratoire d'Informatique, Signaux and Systèmes de Sophia-Antipolis (1999)
Choudhary, V., Ghose, A., Mukhopadhyay, T., Rajan, U.: Personalized pricing and quality differentiation. Management Science 51(7), 1120–1130 (2005)
Cimiano, P., Staab, S.: Learning by googling. SIGKDD Explorations special issue on Web Content Mining 6(2), 24–33 (2004)
Claypool, M., Gokhale, A., Miranda, T., Murnikov, P., Netes, D., Sartin, M.: Combining content-based and collaborative filters in an online newspaper. In: Proceedings of the ACM SIGIR 1999 Workshop on Recommender Systems: Algorithms and Evaluation, Berkeley, California (1999)
Cohen, E., Krishnamurthy, B., Rexford, J.: Improving end-to-end performance of the web using server volumes and proxy filters. In: Proceedings of ACM SIGCOMM (1998)
Cooley, R.: Web usage mining: discovery and application of interesting patterns from Web data. PhD thesis, University of Minnesota (2000)
Cooley, R., Mobasher, B., Srivastava, J.: Grouping Web page references into transactions for mining world wide web browsing patterns. Technical report TR 97-021, Dept. of Computer Science, University of Minnesota, Minneapolis, USA (1997)
Cooley, R., Mobasher, B., Srivastava, J.: Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems 1(1), 32–55 (1999)
Costa, M., Gong, Z.: Web structure mining: an introduction. In: Proceedings of IEEE International Conference on Information Acquisition (2005)
Cunha, C., Bestavros, A., Crovella, M.E.: Characteristics of www client-based traces. Technical report TR-95-010, Boston University, Department of Computer Science (1995)
Eiron, N., McCurley, K.: Untangling compound documents on the web. In: Proceedings of ACM Hypertext, pp. 85–94 (2003)
Facca, F.M., Lanzi, P.: Mining interesting knowledge from weblogs: a survey. Data & Knowledge Engineering 53, 225–241 (2005)
Furnkranz, J.: Web structure mining - exploiting the graph structure of the world-wide web. GAI-Journal 21(2), 17–26 (2002)
Furnkranz, J.: Web mining. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook. Springer, Heidelberg (2005)
Goldberg, D., Nichols, D., Oki, B.M., Terry, D.: Using collaborative filtering to weave an information tapestry. Communications of the ACM 35(12), 61–70 (1992)
Greco, G., Greco, S., Zumpano, E.: Web communities: models and algorithms. World Wide Web 7(1), 58–82 (2004)
Han, J., Kamber, M.: Data Mining Concepts and Techniques. Morgan Kaufmann, San Francisco (2001)
Heer, J., Chi, E.: Mining the structure of user activity using cluster stability. In: Proceedings of the Workshop on Web Analytics (2002)
Huang, X., Cercone, N., An, A.: Comparison of interestingness functions for learning web usage patterns. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management, pp. 617–620 (2002)
Kamdar, T., Joshi, A.: On creating adaptive web sites using web log mining. Technical report TR-CS-00-05, Department of Computer Science and Electrical Engineering, University of Maryland (2000)
Kobsa, A., Koenemann, J., Pohl, W.: Personalized hypermedia presentation techniques for improving online customer relationships. The Knowledge Engineering Review 16(2), 111–155 (2001)
Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L., Riedl, J.: Grouplens: Applying collaborative filtering to usenet news. Communications of the ACM 40(3), 77–87 (1997)
Kosala, R., Blockeel, H.: Web mining research: a survey. ACM SIGKDD Explorations Newsletter 2, 1–15 (2000)
Krulwich, B., Burkey, C.: Learning user information interests through extraction of semantically significant phrases. In: Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, Stanford, California (1996)
Joshi, K., Joshi, A., Yesha, Y.: On using a warehouse to analyse web logs. Distributed and Parallel Databases 13(2), 161–180 (2003)
Lang, K.: Newsweeder: Learning to filter netnews. In: Proceedings of the 12th International Conference on Machine Learning (1995)
Lieberman, H.: Letizia: An agent that assists web browsing. In: Proceedings of the 14th International Joint Conference in Artificial Intelligence (IJCAI 1995), Montreal, Quebec, Canada, pp. 924–929 (1995)
Liu, B., Chang, K.C.C.: Editorial: Special issue on web content mining. SIGKDD Explorations special issue on Web Content Mining 6(2), 1–4 (2004)
Manber, U., Patel, A., Robison, J.: Experience with personalization on yahoo. Communications of the ACM 43(8), 35–39 (2000)
Menasalvas, E., Millan, S., Pena, J., Hadjimichael, M., Marban, O.: Subsessions: a granular approach to click path analysis. In: Proceedings of FUZZ-IEEE Fuzzy Sets and Systems Conference, at the World Congress on Computational Intelligence, pp. 12–17 (2002)
Mladenic, D.: Personal web watcher: Implementation and design. Technical report, Department of Intelligent Systems, J. Stefan Institute, Slovenia (1996)
Mitchell, T., Caruana, R., Freitag, D., McDermott, J., Zabowski, D.: Experience with a learning personal assistant. Communications of the ACM 37(7), 81–91 (1994)
Mobasher, B.: Web usage mining and personalization. In: Singh, M.P. (ed.) Practical Handbook of Internet Computing. CRC Press, Boca Raton (2005)
Mobasher, B.: Web usage mining. In: Web Data Mining: Exploring Hyperlinks, Contents and Usage Data. Springer, Heidelberg (2006)
Mobasher, B.: Data mining for personalization. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web 2007. LNCS, vol. 4321, pp. 90–135. Springer, Heidelberg (2007)
Mobasher, B., Cooley, R., Srivastava, J.: Automatic personalization based on web usage mining. Communications of the ACM 43(8), 142–151 (2000)
Mortazavi-Asl, B.: Discovering and mining user web-page traversal patterns. Master's thesis, Simon Fraser University (2001)
Nanopoulos, A., Katsaros, D., Manolopoulos, Y.: Exploiting web log mining for web cache enhancement. In: Kohavi, R., Masand, B., Spiliopoulou, M., Srivastava, J. (eds.) WebKDD 2001. LNCS, vol. 2356, pp. 68–87. Springer, Heidelberg (2002)
Nasraoui, O., Krishnapuram, R., Joshi, A., Kamdar, T.: Automatic web user profiling and personalization using robust fuzzy relational clustering. In: Segovia, J., Szczepaniak, P., Niedzwiedzinski, M. (eds.) E-Commerce and Intelligent Methods in the series Studies in Fuzziness and Soft Computing, Springer, Heidelberg (2002)
Nasraoui, O., Petenes, C.: Combining web usage mining and fuzzy inference for website personalization. In: Proceedings of WEBKDD 2003: Web mining as premise to effective Web applications, pp. 37–46 (2003)
O'Connor, M., Herlocker, J.: Clustering items for collaborative filtering. In: Proceedings of ACM SIGIR 1999 Workshop on Recommender Systems: Algorithms and Evaluation, Berkeley, California (1999)
Pazzani, M., Billsus, D.: Learning and revising user profiles: The identification of interesting web sites. Machine Learning 27, 313–331 (1997)
Pazzani, M., Billsus, D.: Content-based recommendation systems. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web 2007. LNCS, vol. 4321, pp. 325–341. Springer, Heidelberg (2007)
Pei, J., Han, J., Mortazavi-Asl, B., Zhu, H.: Mining access patterns efficiently from web logs. In: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 396–407 (2000)
Pierrakos, D., Paliouras, G., Papatheodorou, C., Spyropoulos, C.D.: Web usage mining as a tool for personalization: a survey. User Modeling and User-Adapted Interaction 13(4), 311–372 (2003)
Pitkow, J.: In search of reliable usage data on the www. In: Proceedings of the 6th Int. World Wide Web Conference, Santa Clara, CA (1997)
Pitkow, J., Bharat, K.: Webviz: A tool for world wide web access log visualization. In: Proceedings of the 1st International World Wide Web Conference, pp. 271–277 (1994)
Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J.: Application of dimensionality reduction in recommender system - a case study. In: Proceedings of the WebKDD 2000 Web Mining for E-Commerce Workshop at ACM SIGKDD 2000, Boston (2000)
Schafer, J.B., Konstan, J., Riedl, J.: Recommender systems in E-commerce. In: Proceeding of ACM Conf. E-commerce, pp. 158–166 (1999)
Schwab, I., Kobsa, A., Koychev, I.: Learning about users from observation. In: Adaptive User Interfaces. AAAI Press, Menlo Park (2000)
Schwarzkopf, E.: An adaptive web site for the UM 2001 conference. In: Proceeding of the UM 2001 Workshop on Machine Learning for User Modelling (2001)
Shahabi, C., Banaei-Kashani, F., Faruque, J.: A reliable, efficient, and scalable system for web usage data acquisition. In: Proceedings of WebKDD 2001 Workshop in conjunction with the ACM SIGKDD (2001)
Spiliopoulou, M.: Data mining for the web. In: Żytkow, J.M., Rauch, J. (eds.) PKDD 1999. LNCS, vol. 1704, pp. 588–589. Springer, Heidelberg (1999)
Spiliopoulou, M., Faulstich, L.: WUM: A web utilization miner. In: Proceedings of the International Workshop on the Web and Databases, Valencia, Spain, pp. 109–115 (1998)
Srivastava, J., Cooley, R., Deshpande, M., Tan, P.-N.: Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations 1(2), 1–12 (2000)
Suryavanshi, B., Shiri, N., Mudur, S.: An efficient technique for mining usage profiles using relational fuzzy subtractive clustering. In: Proceedings of the 2005 Int. Workshop on Challenges in Web Information Retrieval and Integration (WIRI 2005), pp. 23–29 (2005)
Tajima, K., Hatano, K., Matsukura, T., Sano, R., Tanaka, K.: Discovery and retrieval of logical information units in web. In: Proceedings of the Workshop on Organizing Web Space, WOWS 1999 (1999)
Tan, P.N., Kumar, V.: Discovery of web robot sessions based on their navigational patterns. Data Mining and Knowledge Discovery 6(1), 9–35 (2002)
Vakali, A., Pokorn, J., Dalamagas, T.: An overview of web data clustering practices. In: Lindner, W., Mesiti, M., T¨ urker, C., Tzitzikas, Y., Vakali, A.I. (eds.) EDBT 2004. LNCS, vol. 3268, pp. 597–606. Springer, Heidelberg (2004) Wong, S., Pal, S.: Mining fuzzy association rules for web access case adaptation. In: Proceedings of the Workshop on Soft Computing in Case-Based Reasoning (2001) Xie, Y., Phoha, V.V.: Web user clustering from access log using belief function. In: Proceedings of the First International Conference on Knowledge Capture, K-CAP 2001 (2001) Zhou, B., Hui, S.C., Fong, A.C.M.: Web usage mining for semantic web personalization. In: Proceedings of the Workshop on Personalization on the Semantic Web, PerSWeb 2005 (2005)
2 A Semantic Content-Based Recommender System Integrating Folksonomies for Personalized Access

Pasquale Lops, Marco de Gemmis, Giovanni Semeraro, Cataldo Musto, Fedelucio Narducci, and Massimo Bux

Department of Computer Science, University of Bari “Aldo Moro”, Bari, Italy
{lops,degemmis,semeraro,musto,narducci,bux}@di.uniba.it
Summary. Basic content personalization consists in matching the attributes of a user profile, in which preferences and interests are stored, against the attributes of a content object. The Web 2.0 (r)evolution and the advent of user generated content (UGC) have changed the game for personalization, since the role of people has evolved from passive consumers of information to that of active contributors. One of the forms of UGC that has drawn the most attention from the research community is folksonomy, a taxonomy generated by users who collaboratively annotate and categorize resources of interest with freely chosen keywords called tags. In this chapter, we investigate whether folksonomies might be a valuable source of information about user interests for a recommender system. To that end, folksonomies have been included in ITR (ITem Recommender), a content-based recommender system developed at the University of Bari [7]. Specifically, static content consisting of the descriptions of the items in a collection has been enriched with dynamic UGC through social tagging techniques. The new recommender system, called FIRSt (Folksonomy-based Item Recommender syStem), extends the original ITR system by integrating UGC management: users can express their preferences for items by entering a numerical rating and can annotate rated items with free tags. The main contribution of the chapter is an integrated strategy that enables a content-based recommender to infer user interests by applying machine learning techniques both to official item descriptions provided by a publisher and to the tags which users adopt to freely annotate relevant items. Static content and tags are analyzed beforehand by advanced linguistic techniques in order to capture the semantics of the user interests, often hidden behind keywords. The proposed approach has been evaluated in the domain of cultural heritage personalization.
Experiments involving 40 real users show an improvement in the predictive accuracy of the tag-augmented recommender compared to the pure content-based one.

Keywords: Content-based Recommender Systems, Web 2.0, Folksonomy, Machine Learning, Semantics.

G. Castellano, L.C. Jain, A.M. Fanelli (Eds.): Web Person. in Intel. Environ., SCI 229, pp. 27–47. © Springer-Verlag Berlin Heidelberg 2009, springerlink.com
1 Introduction

The amount of information available on the Web and in Digital Libraries is increasing over time. In this context, the role of user modeling and personalized information access is becoming crucial: users need personalized support in sifting through large amounts of retrieved information according to their interests. Information filtering systems, relying on this idea, adapt their behavior to individual users by learning their preferences during the interaction in order to construct a profile of the user that can later be exploited in selecting relevant items. Indeed, content personalization basically consists in matching the attributes of a user profile, in which preferences and interests are stored, against the attributes of a content object.

Recent developments at the intersection of Information Filtering, Machine Learning, User Modeling and Natural Language Processing offer novel solutions for personalized information access. Most work focuses on the use of Machine Learning algorithms for the automated induction, from text documents, of a structured model of user interests and preferences, referred to as the user profile. If a profile accurately reflects user preferences, it is of tremendous advantage for the effectiveness of an information access process. For instance, it could be used to filter search results, by deciding whether a user is interested in a specific Web page or not and, in the negative case, preventing it from being displayed.

The problem with this approach is that traditional keyword-based profiles are unable to capture the semantics of user interests because they are primarily driven by a string matching operation. If a string, or some morphological variant, is found in both the profile and the document, a match is made and the document is considered relevant. String matching suffers from problems of:
• polysemy, the presence of multiple meanings for one word;
• synonymy, multiple words with the same meaning.
The result is that, due to synonymy, relevant information can be missed if the profile does not contain the exact keywords in the documents while, due to polysemy, wrong documents could be deemed relevant. Semantic analysis and its integration in personalization models is one of the most innovative and interesting approaches currently proposed in the literature to solve these problems. Semantic analysis is the key to learning more accurate profiles that capture concepts expressing user interests from relevant documents. These semantic profiles contain references to concepts defined in lexicons or ontologies.

The Web 2.0 (r)evolution and the advent of user generated content (UGC) have changed the game for personalization, since the role of people has evolved from passive consumers of information to that of active contributors. UGC refers to various kinds of publicly available media content that are produced by end users. For example, on Amazon.com the majority of content is prepared by administrators, but numerous user reviews of the products being sold are submitted by regular visitors to the site. One of the forms of UGC that has drawn the most attention from the research community is folksonomy, a taxonomy generated by users who collaboratively
annotate and categorize resources of interest with freely chosen keywords called tags. Therefore, it should be investigated whether folksonomies might be a valuable source of information about user interests and whether they could be included in semantic user profiles.

The main contribution of this chapter is a strategy to infer user profiles by applying machine learning techniques both to the “official” item descriptions provided by a publisher and to the tags which users adopt to freely annotate relevant items. Static content and tags are analyzed beforehand by advanced linguistic techniques in order to capture the semantics of the user interests often hidden behind keywords. The goal of the chapter can be formulated as the following research question:
• Does the integration of tags increase the prediction accuracy in the process of filtering relevant items for users?

This research has been conducted within the CHAT project (Cultural Heritage fruition and e-learning applications of new Advanced multimodal Technologies), which aims at developing new systems and services for multimodal fruition of cultural heritage content. Data were gathered from the collections of the Vatican picture-gallery, for which both images and detailed textual information about paintings are available, by letting the users involved in the study both rate the paintings and annotate them with tags.

The chapter is structured as follows. Section 2 briefly introduces Information Filtering and Recommender Systems. Section 3 provides details about the strategies adopted by the content-based recommender for performing semantic document indexing and profile learning, and about how users' tagging activity is handled by the recommender when building user profiles. Section 4 presents the experimental sessions carried out to evaluate the proposed idea and discusses the main findings of the study. Related work is briefly analyzed in Section 5, while conclusions and directions for future work are drawn in Section 6.
2 Information Filtering at Work: Recommender Systems

Starting from a corpus containing all the informative content, Information Filtering techniques perform a progressive removal of non-relevant content according to information about user interests, previously acquired and stored in a user profile [12]. Recommender Systems represent the main area where principles and techniques of Information Filtering are applied. Nowadays many web sites embody recommender systems as a way of personalizing their content for users [25]. Recommender systems have the effect of guiding users in a personalized way to interesting or useful objects in a large space of possible options [4]. Recommendation algorithms use input about a customer's interests to generate a list of recommended items. At Amazon.com, recommendation algorithms are used to personalize the online store for each customer, for example showing programming titles to a software engineer and baby toys to a new mother [18].
Among the different recommendation techniques that have been put forward in studies on this matter, the content-based and the collaborative filtering approaches are the most widely adopted to date. Systems implementing the content-based approach analyze a set of documents, usually textual descriptions of the items previously rated by an individual user, and build a model or profile of user interests based on the features of the objects rated by that user [24]. In this approach the static content associated with items (the plot of a film, the description of an artwork, etc.) is usually exploited. The profile is then exploited to recommend new relevant items. Collaborative recommender systems differ from content-based ones in that user opinions are used instead of content. User ratings about objects are gathered and stored in a centralized or distributed database. To provide recommendations to user X, the system first computes the neighborhood of that user (i.e. the subset of users that have a taste similar to X). Similarity in taste is measured by computing the closeness of ratings for objects that were rated by both users. The system then recommends objects that users in X’s neighborhood indicated they like, provided that they have not yet been rated by X. Each type of filtering method has its own weaknesses and strengths [31, 1, 17]. This work is focused on content-based recommender systems. In the next section we will introduce FIRSt (Folksonomy-based Item Recommender syStem), a content-based recommender system that implements the proposed idea of building user profiles by exploiting both static and dynamic content (UGC).
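The collaborative scheme outlined above (compute the neighborhood of X, then recommend what the neighbors liked and X has not rated) can be illustrated with a minimal sketch. The chapter does not prescribe a particular closeness measure; the one used here (an inverse root-mean-square difference over co-rated items), the neighborhood size `k`, the `like_threshold`, and all user and item names are assumptions made purely for illustration.

```python
from math import sqrt

def taste_similarity(ratings_a, ratings_b):
    """Closeness of two users' ratings on the items both have rated (1.0 = identical)."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    # Root-mean-square rating difference over co-rated items, mapped into (0, 1].
    rms = sqrt(sum((ratings_a[i] - ratings_b[i]) ** 2 for i in common) / len(common))
    return 1.0 / (1.0 + rms)

def recommend(target, all_ratings, k=2, like_threshold=4):
    """Items liked by the k nearest neighbours of `target` and not yet rated by `target`."""
    others = [u for u in all_ratings if u != target]
    neighbours = sorted(
        others,
        key=lambda u: taste_similarity(all_ratings[target], all_ratings[u]),
        reverse=True,
    )[:k]
    seen = set(all_ratings[target])
    liked = {
        item
        for u in neighbours
        for item, rating in all_ratings[u].items()
        if rating >= like_threshold and item not in seen
    }
    return sorted(liked)

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "X": {"a": 5, "b": 1, "c": 4},
    "U1": {"a": 5, "b": 2, "c": 4, "d": 5},
    "U2": {"a": 1, "b": 5, "e": 5},
    "U3": {"a": 4, "b": 1, "d": 2, "f": 4},
}
print(recommend("X", ratings))
```

In this toy data U1 and U3 rate the shared items much as X does, so they form X's neighborhood, while U2's opposite tastes keep item "e" out of the suggestions.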
3 FIRSt (Folksonomy-Based Item Recommender syStem)

FIRSt is a semantic content-based recommender system integrating UGC (tags) in the process of learning user profiles. FIRSt is built upon ITem Recommender (ITR), a system capable of providing recommendations for items in several domains (e.g., movies, music, books), provided that descriptions of items are available as text documents (e.g. plot summaries, reviews, short abstracts) [19, 7, 29]. In the following, we will refer to documents as textual descriptions of items to be recommended. FIRSt adds new functionalities to ITR for processing tags in order to include them in semantic profiles. Sections 3.1 through 3.3 describe the general architecture of ITR, providing details about the strategies adopted for semantic document indexing and profile learning. The evolution of ITR towards FIRSt is presented in Section 3.4, which describes how users’ tagging activity is handled for building user profiles.

3.1 ITR General Architecture
The general architecture of ITR is depicted in Figure 1. The recommendation process is performed in three steps, each of which is handled by a separate component:
• Content Analyzer – it allows introducing semantics in the recommendation process by analyzing documents in order to identify relevant concepts
Fig. 1. ITR General Architecture
representing the content. This process selects, among all the possible meanings (senses) of each polysemous word, the correct one according to the context in which the word occurs. In this way, documents are represented using concepts instead of keywords, in an attempt to overcome the problems due to natural language ambiguity. The final outcome of the preprocessing step is a repository of disambiguated documents. This semantic indexing is strongly based on natural language processing techniques, such as Word Sense Disambiguation (WSD) [20], and heavily relies on linguistic knowledge stored in the WordNet lexical ontology [23]. Details are provided in Section 3.2.
• Profile Learner – it implements a supervised learning technique for learning a probabilistic model of user interests from disambiguated documents rated according to her interests. This model represents the semantic profile, which includes those concepts that turn out to be the most indicative of the user preferences. Details are provided in Section 3.3.
• Recommender – it exploits the user profile to suggest relevant documents by matching concepts contained in the semantic profile against those contained in documents to be recommended. Details are provided in Section 3.3.

3.2 Semantic Indexing of Documents
Semantic indexing of documents is performed by the Content Analyzer, which relies on META (Multi Language Text Analyzer) [2], a natural language processing tool developed at the University of Bari, able to deal with documents in English or Italian. The goal of the semantic indexing step is to obtain a concept-based document representation. To this purpose the text is first tokenized, then for each word, possible lemmas as well as their morpho-syntactic features are collected. Part
of speech ambiguities are solved before assigning the proper sense (concept) to each word. This last step requires the identification of a repository for word senses and the design of an automated procedure for performing word-concept association. As regards the first issue, WordNet version 2.0 has been embodied in the semantic indexing module. The basic building block of WordNet is the synset (SYNonym SET), a set of words with synonymous meanings that represents a specific meaning of a word. As regards the second issue, we designed a WSD algorithm called JIGSAW [3]. It takes as input a document $d = [w_1, w_2, \ldots, w_h]$ encoded as a list of words in order of their appearance, and returns a list of WordNet synsets $X = [s_1, s_2, \ldots, s_k]$ ($k \le h$), in which each element $s_j$ is obtained by disambiguating the target word $w_i$ based on the semantic similarity of $w_i$ with the words in its context, that is, a set of words that precede and follow $w_i$. Notice that $k \le h$ because some words, such as most proper names, might not be found in WordNet, or because of bigram recognition. Semantic similarity measures the relatedness of two words. We adopted the Leacock-Chodorow measure [16], which is based on the length of the path between concepts in an IS-A hierarchy. The complete description of the adopted WSD strategy is not given here, because it has already been published in [30]. What we would like to point out here is that the WSD procedure yields a synset-based vector space representation, called bag-of-synsets (BOS), which is an extension of the classical bag-of-words (BOW) model. In the BOS model a synset vector, rather than a word vector, corresponds to a document. ITR is able to suggest potentially relevant items to users, as long as item properties can be represented in the form of textual slots.
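As a toy illustration of how the BOS model differs from BOW, the sketch below counts synsets instead of words. It does not reproduce JIGSAW: the lookup table `TOY_SENSES` is a hypothetical stand-in for the output of a WSD step (it ignores context altogether), and the synset identifiers are invented rather than real WordNet offsets.

```python
from collections import Counter

# Hypothetical word -> synset-id table standing in for a WSD step.
# The identifiers are invented for illustration; they are not WordNet offsets.
TOY_SENSES = {
    "painting": "syn:picture",
    "picture": "syn:picture",
    "virgin": "syn:madonna",
    "madonna": "syn:madonna",
}

def bag_of_words(tokens):
    """Classical BOW: one count per distinct surface form."""
    return Counter(tokens)

def bag_of_synsets(tokens, senses=TOY_SENSES):
    """BOS: one count per synset; tokens with no entry in `senses` are dropped."""
    return Counter(senses[t] for t in tokens if t in senses)

doc = ["painting", "of", "the", "madonna", "picture", "virgin"]
bow = bag_of_words(doc)
bos = bag_of_synsets(doc)
print(bow)  # six distinct words, each counted once
print(bos)  # two synsets, each counted twice
```

Synonymous words collapse onto the same feature, and tokens without a sense are dropped, which mirrors the remark that the list of synsets can be shorter than the list of words.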
The adoption of slots does not jeopardize the generality of the approach, since the case of documents not structured into slots corresponds to having just a single slot in our document representation strategy. The text in each slot is represented by the BOS model by counting separately the occurrences of a synset in the slots in which it appears. More formally, assume that we have a collection of $N$ documents structured in $M$ slots. Let $s$ be the index of the slot; the $n$-th document is reduced to $M$ bags of synsets, one for each slot:

$$d_n^s = \langle t_{n1}^s, t_{n2}^s, \ldots, t_{nD_{ns}}^s \rangle$$

where $t_{nk}^s$ is the $k$-th synset in slot $s$ of document $d_n$ and $D_{ns}$ is the total number of synsets in slot $s$ of document $d_n$. For all $n$, $k$ and $s$, $t_{nk}^s \in V_s$, which is the vocabulary for the slot $s$ (the set of all different synsets found in slot $s$). Document $d_n$ is finally represented in the vector space by $M$ synset-frequency vectors:

$$f_n^s = \langle w_{n1}^s, w_{n2}^s, \ldots, w_{nD_{ns}}^s \rangle$$

where $w_{nk}^s$ is the weight of the synset $t_k$ in the slot $s$ of document $d_n$ and can be computed in different ways: it can be the frequency of synset $t_k$ in $s$ or a more complex feature weighting score. By invoking META on a text $t$, we get META($t$) = ($x$, $y$), where $x$ is the BOS containing the synsets obtained by applying JIGSAW on $t$, and $y$ is the
corresponding synset-frequency vector. BOS-indexed documents are used in a content-based information filtering scenario for learning accurate sense-based user profiles, as discussed in the following section.

3.3 Multivariate Poisson Model for Learning User Profiles
The problem of learning user profiles can be cast as a binary Text Categorization task [28], since each document has to be classified as interesting or not with respect to the user preferences. Therefore, the set of categories is restricted to $c_+$, the positive class (user-likes), and $c_-$, the negative one (user-dislikes). The algorithm for inferring user profiles is naïve Bayes text learning, widely adopted in content-based recommenders [24]. Although the performance of naïve Bayes is not as good as that of some other statistical learning methods, such as nearest-neighbor classifiers or support vector machines, it has been shown that it can perform surprisingly well in classification tasks where the computed probability itself is not important [10]. Another advantage of the naïve Bayes approach is that it is very efficient and easy to implement compared to other learning methods.

There are two different probabilistic models in common use, both of which assume that all features are independent of each other, given the context of the class. In the multivariate Bernoulli model a document is a binary feature vector over the space of words, representing whether each word is present or absent. In contrast, the multinomial model captures word frequency information in documents: when calculating the probability of a document, the probabilities of the words that occur are multiplied. Although the classifiers based on the multinomial model significantly outperform those based on the multivariate Bernoulli model at large vocabulary sizes [21], their performance is unsatisfactory when: 1) documents in the training set have different lengths, thus resulting in a rough parameter estimation; 2) handling rare categories (few training documents available). These conditions frequently occur in the user profiling task, where no assumptions can be made on the length of training documents, and where obtaining an appropriate set of negative examples (i.e., examples of the user-dislikes class) is problematic.
Indeed, since users do not perceive immediate benefits from giving negative feedback to the system [27], the training set for the class user-likes might often be larger than the one for the class user-dislikes. In [14], the authors propose a multivariate Poisson model for naïve Bayes text classification that allows a more reasonable parameter estimation under the above-mentioned conditions. We adapt this approach to the user profiling task. The probability that a document $d_j$ belongs to a class $c$ (user-likes/user-dislikes) is calculated by Bayes’ theorem as follows:

$$P(c|d_j) = \frac{P(d_j|c)\,P(c)}{P(d_j|c)\,P(c) + P(d_j|\bar{c})\,P(\bar{c})} = \frac{\frac{P(d_j|c)}{P(d_j|\bar{c})}\,P(c)}{\frac{P(d_j|c)}{P(d_j|\bar{c})}\,P(c) + P(\bar{c})} \tag{1}$$
If we set:

$$z_{jc} = \log \frac{P(d_j|c)}{P(d_j|\bar{c})} \tag{2}$$

then Eq. (1) can be rewritten as:

$$P(c|d_j) = \frac{e^{z_{jc}}\,P(c)}{e^{z_{jc}}\,P(c) + P(\bar{c})} \tag{3}$$

Using Eq. (3) we can get the posterior probability $P(c|d_j)$ by calculating $z_{jc}$. In the Poisson model proposed in [14] for learning the naïve Bayes text classifier:

$$z_{jc} = \sum_{i=1}^{|V|} w_{ij} \cdot \log \frac{\lambda_{ic}}{\mu_{i\bar{c}}} \tag{4}$$

where $|V|$ is the vocabulary size, $w_{ij}$ is the frequency of term $t_i$ in $d_j$, and $\lambda_{ic}$ ($\mu_{i\bar{c}}$) is the Poisson parameter that indicates the average number of occurrences of $t_i$ in the positive (negative) training documents. The flexibility of this model relies on the fact that it can be expanded by adopting various methods to estimate $w_{ij}$, $\lambda_{ic}$ and $\mu_{i\bar{c}}$. In the following, the strategies to adapt this model to the specific task of user profiling are described.

The first adaptation is needed because, as described in Section 3.2, documents are subdivided into slots; therefore the model should take into account that $d_j$ is the concatenation of $M$ documents $d_j^s$, $M$ being the number of slots, $s = 1, \ldots, M$. According to the naïve assumption of feature independence, slots are independent of each other, given the class (i.e. the token probabilities for one slot are independent of the tokens that occur in other slots), therefore:

$$P(d_j|c) = \prod_{s=1}^{M} P(d_j^s|c) \tag{5}$$

then Eq. (1) can be rewritten as:

$$P(c|d_j) = \frac{\prod_{s=1}^{M} \frac{P(d_j^s|c)}{P(d_j^s|\bar{c})}\,P(c)}{\prod_{s=1}^{M} \frac{P(d_j^s|c)}{P(d_j^s|\bar{c})}\,P(c) + P(\bar{c})} \tag{6}$$

If we set:

$$z_{jc}^s = \log \frac{P(d_j^s|c)}{P(d_j^s|\bar{c})} \tag{7}$$

then Eq. (6) can be rewritten as:

$$P(c|d_j) = \frac{\prod_{s=1}^{M} e^{z_{jc}^s}\,P(c)}{\prod_{s=1}^{M} e^{z_{jc}^s}\,P(c) + P(\bar{c})} \tag{8}$$
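Since the product of exponentials in Eq. (8) equals the exponential of the sum of the slot scores, the posterior can be evaluated in logistic form, which avoids overflow when the scores are large. The following is a minimal sketch under the assumption of a binary problem, so that $P(\bar{c}) = 1 - P(c)$; the function name and argument names are hypothetical.

```python
from math import exp, log

def posterior_from_slot_scores(slot_scores, prior_c):
    """P(c | d_j) as in Eq. (8), given the per-slot scores z^s_jc and the prior P(c)."""
    if not 0.0 < prior_c < 1.0:
        raise ValueError("prior_c must lie strictly between 0 and 1")
    # Total log-odds: sum of the slot scores plus the log prior odds.
    log_odds = sum(slot_scores) + log(prior_c) - log(1.0 - prior_c)
    # Logistic form of Eq. (8); branch on the sign so exp() never overflows.
    if log_odds >= 0:
        return 1.0 / (1.0 + exp(-log_odds))
    e = exp(log_odds)
    return e / (1.0 + e)

# Two slots with hypothetical scores and a prior of 0.6 for the class user-likes.
print(posterior_from_slot_scores([0.5, -0.2], 0.6))
```

With all slot scores equal to zero the document carries no evidence, and the function returns the prior unchanged.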
In the Poisson model with slots, Eq. (4) becomes:

$$z_{jc}^s = \sum_{i=1}^{|V|} w_{ij}^s \cdot \log \frac{\lambda_{ic}^s}{\mu_{i\bar{c}}^s} \tag{9}$$

where $w_{ij}^s$ is the frequency of term $t_i$ in the slot $s$ of $d_j$. Using Eq. (6) and (9), the posterior probability $P(c|d_j)$ can be computed by estimating the Poisson parameters $\lambda_{ic}^s$ and $\mu_{i\bar{c}}^s$. Since we want to normalize term frequencies according to document lengths, we compute $\lambda_{ic}^s$ ($\mu_{i\bar{c}}^s$) as an average of the normalized frequency of $t_i$ in the slot $s$ over the number of documents in class $c$ ($\bar{c}$):

$$\lambda_{ic}^s = \frac{1}{|D_c|} \sum_{j=1}^{|D_c|} \hat{w}_{ij}^s \qquad \mu_{i\bar{c}}^s = \frac{1}{|D_{\bar{c}}|} \sum_{j=1}^{|D_{\bar{c}}|} \hat{w}_{ij}^s \qquad s = 1, \ldots, M \tag{10}$$

where $|D_c|$ ($|D_{\bar{c}}|$) is the number of documents in class $c$ ($\bar{c}$),

$$\hat{w}_{ij}^s = \frac{w_{ij}^s}{\alpha \cdot \mathit{avgtf}^s + (1 - \alpha) \cdot \mathit{avgtf}_j^s} \tag{11}$$
$\mathit{avgtf}_j^s$ is the average frequency of a token in the slot $s$ of $d_j$, while $\mathit{avgtf}^s$ is the average frequency of a token in the slot $s$ in the whole collection. This linear combination smooths the term frequency using the characteristics of the entire document collection.

For the training step we assume that each user provided ratings on items using a discrete scale ranging from MIN (strongly dislikes) to MAX (strongly likes). Items whose ratings are greater than or equal to (MIN + MAX)/2 are assumed to be liked by the user and are included in the positive training set, while items with lower ratings are included in the negative training set. The user profile is learned from rated items by adopting the approach described above. Therefore, given a new document $d_j$, the recommendation step consists in computing the a-posteriori classification scores $P(c_+|d_j)$ and $P(c_-|d_j)$ (Eq. 6) by using the Poisson parameters for synsets estimated in the training step as in Eq. (10). Classification scores for the class $c_+$ are used to produce a ranked list of potentially interesting items, from which items to be recommended can be selected.

3.4 From ITR to FIRSt: Integrating Folksonomies into Semantic Profiles
In order to involve folksonomies in the processing performed by ITR, static content describing the items is integrated with dynamic UGC (tags). Tags are collected during the training step by letting users:
1. express their preferences for items through a numerical rating;
2. annotate rated items with free tags.
Given an item I, the set of tags provided by all the users who rated I is denoted as SocialTags(I), while the set of tags provided by a specific user U on I is denoted by PersonalTags(U,I). In addition, PersonalTags(U) denotes the set of tags provided by U on all the items in the collection. Tags are stored in an additional slot, different from those containing static content. For example, in the context of cultural heritage personalization an artwork can generally be represented by at least three slots, namely artist, title, and description. Provided that users have digital support for annotating artifacts, tags can easily be stored in a fourth slot, say tags, which is not static like the other three slots because tags evolve over time. The distinction between personal and social tags aims at evaluating whether including either just personal tags or social tags in user profiles produces beneficial effects on the recommendations. The inclusion of social tags in the personal profile of a user also extends the pure content-based recommendation paradigm, previously adopted by ITR, toward a hybrid content-collaborative paradigm [4]. The architecture described in Figure 1 has been modified in order to include tags in the recommendation process. The main adaptation was due to the need to define an appropriate indexing strategy for the slot containing tags, in addition to that already defined for static slots (Figure 2).
Fig. 2. Architecture of FIRSt
Since tags are freely chosen by users and their actual meaning is usually not very clear, the identification of user interests from tags is a challenging task. We tackle this problem by applying WSD to tags as well. This process allows us to enhance the document model from representing tags as mere keywords or strings to exploiting tags as pointers to WordNet synsets (semantic tags).
Semantic tags are obtained by disambiguating tags in a folksonomy, thus producing as a result a synset-based folksonomy. More specifically, we denote as SemanticSocialTags(I) the set of synsets obtained by disambiguating SocialTags(I). In fact, META applied to SocialTags(I) produces the synset-based folksonomy corresponding to SocialTags(I). SemanticPersonalTags(U,I) is the set of synsets obtained by disambiguating the tags given by U on I, thus it is the result of invoking META on PersonalTags(U,I). The algorithm used by META for tag disambiguation is JIGSAW, with a different setting for the context compared to that adopted for disambiguating static content. Indeed, while for static content the context for the target word is the text in the slot in which it occurs, this strategy is not suitable for tags, since the number of tags provided by users is generally low. This may result in a poor context and consequently in a high percentage of WSD errors on tags. The intent is to exploit a more reliable context, when available. Therefore, if the target tag occurs in one of the static slots, the text in that slot is used as the context; otherwise we are forced to accept all the other tags as the context.

Semantic tags are exploited by the Profile Learner to include information about tags in the user profiles. The profile learning process for user U starts by selecting all items (disambiguated documents) and the corresponding ratings provided by U. Each item falls into either the positive or the negative training set depending on the user rating, in the same way as described in Section 3.3. Let $TR^+$ and $TR^-$ be the positive and negative training set respectively for user U. Several options for generating the user profile can be chosen at this point, depending on the type of content involved in the process.
If we would like to infer a user profile strictly related to personal preferences (one-to-one user profile), all the semantic tags obtained from personal tags provided by U on all items she rated should be exploited in the learning step. This means that, for each $d_j \in TR^+ \cup TR^-$, the additional slot for $d_j$ is SemanticPersonalTags(U,$d_j$). On the other hand, if we would like to build a content-collaborative profile for U, semantic tags obtained from social tags provided by users on all items rated by U should be exploited in the learning step. This means that, for each $d_j \in TR^+ \cup TR^-$, the additional slot for $d_j$ is SemanticSocialTags($d_j$). The generation of the user profile is performed by the Profile Learner, which infers the profile as a binary text classifier as described in Section 3.3. The profile contains the user identifier and the a-priori probabilities of liking or disliking an item. Moreover, the profile is structured in two main parts: profile like contains features describing the concepts able to deem items relevant, while features in profile dislike should help in filtering out non-relevant items. Each part of the profile is structured in four slots, mirroring the representation adopted for items, which are artworks represented by title, artist, description and tags in this case. Each slot reports the features (WordNet identifiers) occurring in the training examples, whose frequencies are computed in the training step. Frequencies are used by the Bayesian learning algorithm to induce the
classification model (i.e. the user profile) exploited to suggest relevant items in the recommendation phase.
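The training procedure of Section 3.3 (rating threshold, Eq. (10) and (11)) and the slot score of Eq. (9) can be put together in a minimal sketch of a slot-structured profile that includes a tags slot. Several points are assumptions of this sketch and are not taken from the chapter: documents are represented as dictionaries mapping a slot name to synset counts; `collection_avg_tf` interprets the collection-wide average as total occurrences divided by the number of (document, synset) pairs; `alpha` is set to 0.5; parameters equal to zero are replaced by a small `floor` so that the logarithm in Eq. (9) stays defined; and all synset identifiers are invented.

```python
from math import log

def split_by_rating(rated_docs, min_rating=1, max_rating=5):
    """Positive set: rating >= (MIN + MAX) / 2; negative set: all other items."""
    threshold = (min_rating + max_rating) / 2
    pos = [d for d, r in rated_docs if r >= threshold]
    neg = [d for d, r in rated_docs if r < threshold]
    return pos, neg

def avg_tf(bag):
    """Average frequency of a synset within one slot of one document."""
    return sum(bag.values()) / len(bag) if bag else 0.0

def collection_avg_tf(docs, slot):
    """Average synset frequency in `slot` over the whole collection (assumed definition)."""
    bags = [d.get(slot, {}) for d in docs]
    pairs = sum(len(b) for b in bags)
    return sum(sum(b.values()) for b in bags) / pairs if pairs else 0.0

def poisson_params(docs, slot, coll_avg, alpha=0.5):
    """Eq. (10)-(11): mean normalised frequency of each synset over the class documents."""
    if not docs:
        return {}
    totals = {}
    for doc in docs:
        bag = doc.get(slot, {})
        denom = alpha * coll_avg + (1 - alpha) * avg_tf(bag)
        for synset, freq in bag.items():
            totals[synset] = totals.get(synset, 0.0) + freq / denom
    return {synset: total / len(docs) for synset, total in totals.items()}

def slot_score(bag, lam, mu, floor=1e-3):
    """Eq. (9) for one slot; `floor` replaces zero parameters (assumption of this sketch)."""
    return sum(
        freq * log(max(lam.get(syn, 0.0), floor) / max(mu.get(syn, 0.0), floor))
        for syn, freq in bag.items()
    )

# Hypothetical rated items: (document, rating on a 1-5 scale).
rated = [
    ({"description": {"syn:madonna": 2, "syn:child": 1}, "tags": {"syn:sacred": 1}}, 5),
    ({"description": {"syn:madonna": 1}, "tags": {"syn:sacred": 2}}, 4),
    ({"description": {"syn:battle": 2}, "tags": {"syn:war": 1}}, 1),
]
pos, neg = split_by_rating(rated)
all_docs = [d for d, _ in rated]
slots = ["description", "tags"]

# The profile stores, for every slot, the "like" and "dislike" parameters.
profile = {}
for s in slots:
    c_avg = collection_avg_tf(all_docs, s)
    profile[s] = (poisson_params(pos, s, c_avg), poisson_params(neg, s, c_avg))

new_doc = {"description": {"syn:madonna": 1}, "tags": {"syn:sacred": 1}}
scores = [slot_score(new_doc.get(s, {}), *profile[s]) for s in slots]
print(scores)  # positive scores favour the class user-likes
```

The per-slot scores can then be combined with the class prior as in Eq. (8) to rank unseen items. Switching between a one-to-one and a content-collaborative profile only changes what is placed in the `tags` slot of each training document.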
4 Experimental Evaluation of FIRSt

The goal of the experimental evaluation was to measure the predictive accuracy of FIRSt when different types of content are used in the training step. Preliminary experiments have been presented in [8]. In order to properly investigate the effects of including social tagging in the recommendation process, a distinction has to be made between considering, for an artifact I rated as interesting by a user, either the whole folksonomy SocialTags(I), or only the tags entered by that user for that artifact, i.e. PersonalTags(U,I). Moreover, tags produced by expert users are distinguished from those of non-expert users, with the aim of investigating the impact of a more specific lexicon in producing recommendations. In the cultural heritage domain, expert users are assumed to have specific knowledge of the art domain, such as museum curators, while non-expert users are assumed to be naïve museum visitors.

4.1 Users and Dataset
The dataset considered for the experiments consists of 45 paintings chosen from the collection of the Vatican picture-gallery. The dataset was collected using screen-scraping bots, which captured the required information from the official website of the Vatican picture-gallery. In particular, for each element in the dataset an image of the artifact was collected, along with three textual properties: its title, artist, and description.
Fig. 3. Collecting users’ ratings and tags
A Semantic Content-Based Recommender System Integrating Folksonomies
30 non-expert users and 10 expert users voluntarily took part in the experiments. Notice that users were selected according to the availability sampling strategy. Even though random sampling is the best way of obtaining a representative sample, that strategy requires a great deal of time and money. Therefore much research in psychology is based on samples obtained through non-random selection, such as availability sampling, i.e. a sampling of convenience based on users available to the researcher, often used when the population source is not completely defined [26]. According to this strategy, non-expert users were selected among young people having a master's degree in Computer Science or Humanities, while expert users were selected among teachers in Arts and Humanities disciplines. Users were requested to interact with a web application (Figure 3), in order to express their preferences for all the 45 paintings in the collection. A preference was expressed as a numerical vote on a 5-point scale (1=strongly dislike, 5=strongly like). Moreover, users were left free to annotate the paintings with as many tags as they wished. For the 45 paintings in the dataset overall, 4300 tags were provided by non-expert users, while 1877 were provided by expert users. Some statistics about tag distribution are reported in Table 1.
Table 1. Tag distribution in the dataset
Type of tags         Avg. expert users   Avg. non-expert users
PersonalTags(U,I)    4.17                3.18
PersonalTags(U)      187.7               143.33
SocialTags(I)        41.71               95.55
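The averages in Table 1 can be re-derived from the tag totals, the number of users in each group and the 45 paintings. The sketch below works in integer hundredths to avoid floating-point noise; the helper name `averages_hundredths` is an assumption. Under this derivation the reported figures correspond to values truncated (not rounded) to two decimals.

```python
PAINTINGS = 45


def averages_hundredths(users, tags, items=PAINTINGS):
    """Return the three Table 1 averages, truncated, in hundredths of a tag."""
    per_user_item = tags * 100 // (users * items)  # PersonalTags(U,I)
    per_user = tags * 100 // users                 # PersonalTags(U)
    per_item = tags * 100 // items                 # SocialTags(I)
    return per_user_item, per_user, per_item


for name, users, tags in (("expert", 10, 1877), ("non-expert", 30, 4300)):
    values = averages_hundredths(users, tags)
    print(name, [v / 100 for v in values])
```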
Each user provided about 3 to 4 tags for each rated item, so the additional workload due to tagging is moderate. The average number of tags associated with each painting is about 95 for non-expert users and 41 for expert users, so the experiments relied on a sufficient number of user annotations.
4.2 Design of the Experiments and Evaluation Metrics
Since FIRSt is conceived as a text classifier, its effectiveness can be evaluated by classification accuracy measures, namely Precision and Recall [28]. Precision (Pr) is defined as the number of relevant selected items divided by the number of selected items. Recall (Re) is defined as the number of relevant selected items divided by the total number of relevant items. The Fβ measure, a combination of precision and recall, is also used to have an overall measure of predictive accuracy (β sets the relative degree of importance attributed to Pr and Re):
$$F_\beta = \frac{(1 + \beta^2) \cdot Pr \cdot Re}{\beta^2 \cdot Pr + Re}$$
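As a quick illustration of how β shifts the balance between the two measures, the Fβ formula can be written as a small function. The function name `f_beta`, the zero-division guard and the input values are assumptions chosen for illustration, not figures from the experiments.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta as defined above; returns 0.0 when both inputs are 0 (an assumption)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical classifier with high precision and low recall.
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(0.8, 0.5, beta), 4))
# 0.5 0.7143
# 1.0 0.6154
# 2.0 0.5405
```

With precision higher than recall, the score grows as β decreases, which is the sense in which values of β below 1 give more weight to precision.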
For the evaluation of recommender systems, these measures have been used in [13]. Since users should trust the recommender, it is important to reduce false positives. It is also desirable to provide users with a short list of relevant items (even if not all the possible relevant items are suggested), rather than a long list containing a greater number of relevant items mixed with non-relevant ones. Therefore, we set β = 0.5 for the Fβ measure in order to give more weight to precision. These classification measures do not consider predictions and their deviations from actual ratings; rather, they compute the frequency with which a recommender system makes correct or incorrect decisions about whether a painting is advisable for a user. These specific measures were adopted because we are interested in measuring how relevant a set of recommendations is for a user. In the experiment, a painting is considered relevant for a user if the rating is greater than or equal to 4, while FIRSt considers a painting relevant for a user if the a-posteriori probability of the class likes is greater than 0.5.
We organized three different experimental sessions, each one with the aim of evaluating the accuracy of FIRSt for a specific community of users:
1. session#1: non-expert user community – All paintings are rated and tagged by 30 non-expert users, for whom recommendations are computed.
2. session#2: whole user community – All paintings are rated and tagged both by expert and non-expert users. Recommendations are provided for the whole set of 40 users.
3. session#3: non-expert user community supported by experts' tags – In this session we evaluate whether tags provided by experts have positive effects on recommendations generated for non-expert users. All paintings are rated solely by non-expert users, but tags used for generating non-expert user profiles are provided by expert users.
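The relevance criteria and the two measures, which apply to all three sessions, can be sketched as follows. The thresholds (rating of at least 4, a-posteriori probability above 0.5) come from the text; the function names, the painting identifiers and the sample values are hypothetical.

```python
def is_relevant(rating):
    """User side: a painting is relevant if rated 4 or 5."""
    return rating >= 4


def is_recommended(p_like):
    """System side: a painting is selected if P(likes | painting) > 0.5."""
    return p_like > 0.5


def precision_recall(ratings, posteriors):
    """ratings and posteriors map painting ids to a vote and a probability."""
    selected = {i for i, p in posteriors.items() if is_recommended(p)}
    relevant = {i for i, r in ratings.items() if is_relevant(r)}
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


ratings = {"p1": 5, "p2": 4, "p3": 2, "p4": 4, "p5": 1}
posteriors = {"p1": 0.9, "p2": 0.7, "p3": 0.6, "p4": 0.3, "p5": 0.1}
print(precision_recall(ratings, posteriors))  # two hits out of three selected and three relevant
```

Here p3 is a false positive (selected but rated 2) and p4 a missed relevant item, so both precision and recall equal 2/3.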
For SESSION#1 and SESSION#2, 5 different experiments were designed, depending on the type of content used for training the system:
• Exp#1: Static Content – only title, artist and description of the paintings, as collected from the official website of the Vatican picture-gallery
• Exp#2: SemanticPersonalTags(U,I)
• Exp#3: SemanticSocialTags(I)
• Exp#4: Static Content+SemanticPersonalTags(U,I)
• Exp#5: Static Content+SemanticSocialTags(I)
For example, SemanticSocialTags(I) in SESSION#1 includes the set of synsets obtained by disambiguating tags provided by all non-expert users who rated I, while in SESSION#2 it includes the set of synsets obtained by disambiguating tags provided by both expert and non-expert users who rated I.
For SESSION#3, 2 different experiments were designed, depending on the type of content used for training the system:
• Exp#1: SemanticSocialTags(I) – SemanticSocialTags(I) includes the set of synsets obtained by disambiguating tags provided by all experts on I. In this way tags provided by experts contribute to the profiles of non-expert users. The aim of the experiment is to measure whether accuracy of recommendations for non-expert users is improved by tags provided by expert users.
• Exp#2: Static Content+SemanticSocialTags(I) – SemanticSocialTags(I), as intended in Exp#1 in this session, are combined with static content.
All experiments were carried out using the same methodology, consisting in performing one run for each user, scheduled as follows:
1. select the appropriate content depending on the experiment being executed;
2. split the selected data into a training set Tr and a test set Ts;
3. use Tr for learning the corresponding user profile;
4. evaluate the predictive accuracy of the induced profile on Ts.
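The per-user run listed above can be sketched as follows, assuming the 5-fold split over the 45 paintings that is used in these experiments. The shuffling with a fixed seed, and the callables `learn` and `evaluate`, are hypothetical placeholders: the text does not state how the partitions were drawn.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (Tr, Ts) pairs over k disjoint folds; shuffling is an assumption."""
    items = list(items)
    random.Random(seed).shuffle(items)
    fold_size = len(items) // k
    folds = [items[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    # Leftover items (none when 45 items are split into 5 folds) join the last fold.
    folds[-1].extend(items[k * fold_size:])
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def run_for_user(items, learn, evaluate, k=5):
    """Learn a profile on each Tr, score it on the matching Ts, average the scores."""
    scores = [evaluate(learn(train), test) for train, test in k_fold_splits(items, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_splits(range(45)))
print([(len(tr), len(ts)) for tr, ts in splits])  # five pairs of (36, 9)
```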
The methodology adopted for obtaining Tr and Ts was K-fold cross validation [15], with K = 5. Given the size of the dataset (45), applying 5-fold cross validation means that the dataset is divided into 5 disjoint partitions, each containing 9 paintings. The learning of profiles and the test of predictions were performed in 5 steps. At each step, 4 (K-1) partitions were used as the training set Tr, whereas the remaining partition was used as the test set Ts. The steps were repeated until each of the 5 disjoint partitions had been used as Ts. Results were averaged over the 5 runs.
4.3 Results
Table 2 reports results for Exp#1-Exp#5 in SESSION#1. Table 3 reports results for Exp#1-Exp#5 in SESSION#2.
Table 2. Results of Exp#1-Exp#5 in SESSION#1
Exp.   Type of Content                            Precision  Recall  Fβ=0.5
Exp#1  Static Content                             77.01      93.54   79.83
Exp#2  SemanticPersonalTags(U,I)                  77.63      86.57   79.27
Exp#3  SemanticSocialTags(I)                      77.40      91.87   79.92
Exp#4  Static Content+SemanticPersonalTags(U,I)   78.63      92.79   81.11
Exp#5  Static Content+SemanticSocialTags(I)       77.78      93.35   80.46
Table 3. Results of Exp#1-Exp#5 in SESSION#2
Exp.   Type of Content                            Precision  Recall  Fβ=0.5
Exp#1  Static Content                             75.17      92.63   78.11
Exp#2  SemanticPersonalTags(U,I)                  76.60      89.86   78.93
Exp#3  SemanticSocialTags(I)                      74.91      89.93   77.50
Exp#4  Static Content+SemanticPersonalTags(U,I)   77.31      90.61   79.65
Exp#5  Static Content+SemanticSocialTags(I)       76.60      91.58   79.19
4.4 Results
The first outcome of the experiments in SESSION#1 is that the integration of social or personal tags increases precision in the process of recommending artifacts to users. More specifically, the precision of profiles learned from both static content and tags (hereafter, augmented profiles) outperformed the precision of profiles learned from either static content (hereafter, content-based profiles) or just tags (hereafter, tag-based profiles). The improvement of augmented profiles with personal tags (Exp#4) is 1.62 percentage points with respect to content-based profiles (Exp#1), while it is about 1 point with respect to tag-based profiles (Exp#2 and Exp#3). Lower improvements are observed by comparing results of Exp#5 with those of Exp#2 and Exp#3. The increase in precision of augmented profiles corresponds to a slight and expected loss of recall. The lowest recall was observed for Exp#2. This result is not surprising, since personal tags summarize cultural interests and represent them in a deeper and "more precise" way compared to static content, which, on the other hand, covers a broader range of user preferences. To sum up, by observing the Fβ figures, we can conclude that for non-expert users the highest accuracy is achieved by augmented profiles with personal tags. Similar results are observed in SESSION#2, where the community also includes expert users. It is interesting to compare the results of Exp#1, Exp#2 and Exp#4 in SESSION#1 with those of the same experiments in SESSION#2, in order to evaluate the accuracy of recommendations provided by content-based profiles, tag-based profiles built using just personal tags, and augmented profiles with personal tags in both communities. The values of Fβ in SESSION#2 are lower than those observed in SESSION#1, thus we can conclude that it is more difficult to provide recommendations for expert users. Another interesting finding regards profiles built by using social tags (Exp#3). 
A comparison between the results obtained in SESSION#1 and SESSION#2 highlights a significant loss both in precision and recall when expert users are included in the community. Since social tags represent the lexicon of the community, this result might be interpreted as showing that tagging with a more specific and technical lexicon does not bring a significant improvement in the predictive accuracy of the system. SESSION#3 provides more insight into the impact of the lexicon introduced by expert users on recommendations provided to non-expert users (Table 4).
Table 4. Results of Exp#1-Exp#2 in SESSION#3
Exp.   Type of Content                        Precision  Recall  Fβ=0.5
Exp#1  SemanticSocialTags(I)                  76.98      92.40   79.64
Exp#2  Static Content+SemanticSocialTags(I)   77.47      93.51   80.22
By analyzing the results of Exp#1, we observed that the precision and recall of tag-based profiles do not outperform those obtained in Exp#3 in SESSION#1, thus
suggesting that the specific lexicon adopted by expert users does not positively affect recommendations for non-expert users. Anyway, the slight improvement in recall (+0.53) suggests that the more technical tags adopted by experts might help to select relevant items missed by profiles built with simple tags. Even integrating social tags provided by experts with content does not improve accuracy of recommendations for non-expert users. Indeed, precision and recall observed in Exp#2 do not significantly change compared to results of Exp#5 in SESSION#1. The general conclusion is that the expertise of users contributing to the folksonomy does not actually affect the accuracy of recommendations.
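The comparisons discussed in this section can be reproduced with simple differences over the figures reported in Tables 2 and 4. The dictionary layout and the helper name `delta` are assumptions made for this sketch; the numbers are those of the tables.

```python
# (precision, recall, F0.5) as reported in Tables 2 and 4
session1 = {
    "Exp#1": (77.01, 93.54, 79.83),
    "Exp#2": (77.63, 86.57, 79.27),
    "Exp#3": (77.40, 91.87, 79.92),
    "Exp#4": (78.63, 92.79, 81.11),
    "Exp#5": (77.78, 93.35, 80.46),
}
session3 = {
    "Exp#1": (76.98, 92.40, 79.64),
    "Exp#2": (77.47, 93.51, 80.22),
}


def delta(a, b):
    """Element-wise difference a - b, rounded to two decimals."""
    return tuple(round(x - y, 2) for x, y in zip(a, b))


# Augmented profiles with personal tags vs. content-based profiles (SESSION#1)
print(delta(session1["Exp#4"], session1["Exp#1"]))  # (1.62, -0.75, 1.28)
# Experts' social tags vs. non-experts' social tags, both for non-expert users
print(delta(session3["Exp#1"], session1["Exp#3"]))  # (-0.42, 0.53, -0.28)
# Same comparison with static content added
print(delta(session3["Exp#2"], session1["Exp#5"]))  # (-0.31, 0.16, -0.24)
```

The first line reproduces the 1.62 precision gain, and the second the +0.53 change in recall mentioned above.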
5 Related Work
To the best of our knowledge, few studies have investigated how to exploit tag annotations in order to build user profiles. In [9], the user profile is represented in the form of a tag vector, with each element indicating the number of times a tag has been assigned to a document by that user. A more sophisticated approach is proposed in [22], which takes into account tag co-occurrence. The matching of profiles to information sources is achieved by using simple string matching. As the authors themselves foresee, the matching could be enhanced by adopting WordNet, as in the semantic document indexing strategy proposed in this work. In the work by Szomszor et al. [33], the authors describe a movie recommendation system built purely on the keywords assigned to movies via collaborative tagging. Recommendations for the active user are produced by algorithms based on the similarity between the keywords of a movie and those of the tag-clouds of movies she rated. As the authors themselves state, their recommendation algorithms can be improved by combining tag-based profiling techniques with more traditional content-based recommender strategies, as in the approach we have proposed. In [11], different strategies are proposed to build tag-based user profiles and to exploit them for producing music recommendations. Tag-based user profiles are defined as collections of tags, which have been chosen by a user to annotate tracks, together with corresponding scores representing the user interest in each of these tags, inferred from tag usage and frequencies of listened tracks. While in the approaches described above only a single set of popular tags represents user interests, in [36] it is observed that this may not be the most suitable representation of a user profile, since it is not able to reflect the multiple interests of users. 
Therefore, the authors propose a network analysis technique (based on clustering), performed on the personal tags of a user to identify her different interests. About tag interpretation, Cantador et al. [5] proposed a methodology to select “meaningful” tags from an initial set of raw tags by exploiting WordNet, Wikipedia and Google. If a tag has an exact match in WordNet, it is accepted, otherwise possible misspellings and compound nouns are discovered by using the Google “did you mean” mechanism (for example the tag sanfrancisco
or san farncisco is corrected to san francisco). Finally, tags are correlated to their appropriate Wikipedia entries. The main differences between the tag-based profiling process we proposed in this chapter and the previously discussed ones are:
1. we propose a hybrid strategy that learns the profile of the user U from both static content and tags associated with items rated by U, instead of relying on tags only;
2. we elaborate on including in the profile of user U not only her personal tags, but also the tags adopted by other users who rated the same items as U. This aspect is particularly important when users who contribute to the folksonomy have different expertise in the domain;
3. we propose a solution to the challenging task of identifying user interests from tags. Since the main problem lies in the fact that tags are freely chosen by users and their actual meaning is usually not very clear, we have suggested semantically interpreting tags by means of WordNet.
Indeed, some ideas on how to analyze tags by means of WordNet in order to capture their intended meanings are reported in [6], but the suggested ideas are not supported by empirical evaluations. Another approach in which tags are semantically interpreted by means of WordNet is the one proposed in [37]. The authors demonstrated the usefulness of tags in collaborative filtering, by designing an algorithm for neighbor selection that exploits a WordNet-based semantic distance between tags assigned by different users. 
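For reference, the simplest of the tag-based representations surveyed in this section, the tag vector that counts how often a user assigned each tag, can be sketched in a few lines. Comparing two such vectors with cosine similarity is an assumption added here for illustration and is not claimed to be the matching used in the cited works; the users, documents and tags are hypothetical.

```python
import math
from collections import Counter


def tag_vector(assignments):
    """assignments: iterable of (document, tag) pairs made by one user."""
    return Counter(tag for _, tag in assignments)


def cosine(u, v):
    """Cosine similarity between two tag-count vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


alice = tag_vector([("d1", "madonna"), ("d2", "madonna"), ("d2", "fresco")])
bob = tag_vector([("d1", "madonna"), ("d3", "portrait")])
print(alice, round(cosine(alice, bob), 4))
```

Such a vector works on raw strings only, so two users tagging the same concept with different words share no dimension, which is the limitation that a WordNet-based interpretation of tags is meant to address.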
When focusing on the application of personalization techniques in the context of cultural heritage, it is worth noting that museums have recognized the importance of providing visitors with personalized access to artifacts. The projects PEACH (Personal Experience with Active Cultural Heritage) [32] and CHIP (Cultural Heritage Information Personalization) [35] are only two examples of the research effort devoted to supporting visitors in having a personalized experience and tour when visiting artwork collections. In particular, the recommender system developed within CHIP aims at providing personalized access to the collections of the Rijksmuseum in Amsterdam. It combines Semantic Web technologies and content-based algorithms for inferring visitors' preferences from a set of scored artifacts and then recommending other artworks and related content topics. The Steve.museum consortium [34] has begun to explore the use of social tagging and folksonomy in cultural heritage personalization scenarios, to increase audience engagement with museums' collections. Supporting social tagging of artifacts and providing access based on the resulting folksonomy opens museum collections to new interpretations, which reflect visitors' perspectives rather than curators' ones, and helps to bridge the gap between the professional language of the curator and the popular language of the museum visitor. Preliminary explorations conducted at the Metropolitan Museum of Art of New York have shown that professional perspectives differ significantly from those of naïve visitors. Hence, if tags are associated with artworks, the resulting folksonomy can be
used as a different and valuable source of information to be carefully taken into account when providing recommendations to museum visitors.
6 Conclusions and Future Work
The research question we have tried to answer in this chapter was: Does the integration of tags cause an increase of the prediction accuracy in the process of filtering relevant items for users? The main contribution of the chapter is a technique to infer user profiles from both static content, as in classical content-based recommender systems, and tags provided by users to freely annotate items. Being free annotations, tags also tend to suffer from syntactic problems, like polysemy and synonymy. We addressed this problem by applying WSD to content as well as tags. Static content and tags, semantically indexed using a WordNet-based WSD procedure, are exploited by a naïve Bayes learning algorithm able to infer user profiles in the form of binary text classifiers. As a proof of concept, we developed the FIRSt recommender system, whose recommendations were evaluated in a cultural heritage scenario. Experiments aimed at evaluating the predictive accuracy of FIRSt when different types of content were used in the training step (pure content, personal tags, social tags, content combined with tags). We also distinguished tags provided by non-expert users from those provided by expert ones. The main outcomes of the experiments are:
• the highest overall accuracy is reached when profiles learned from both content and personal tags are exploited in the recommendation process;
• the expertise of users contributing to the folksonomy does not actually affect the accuracy of recommendations.
We are currently working on the integration of FIRSt in an adaptive platform for multimodal and personalized access to museum collections. In this context, specific recommendation services, based upon augmented profiles, are being developed. Each visitor is supposed to be equipped with a mobile terminal supporting her during the visit to the museum. 
For example, the intelligent guide provided by the terminal might help the visitor to find the most interesting artworks according to her profile and contextual information, such as her current location in the museum.
References
1. Balabanovic, M., Shoham, Y.: Fab: content-based, collaborative recommendation. Commun. ACM 40(3), 66–72 (1997)
2. Basile, P., de Gemmis, M., Gentile, A., Iaquinta, L., Lops, P., Semeraro, G.: META - MultilanguagE Text Analyzer. In: Proc. of the Language and Speech Technology Conference, pp. 137–140 (2008)
3. Basile, P., Degemmis, M., Gentile, A., Lops, P., Semeraro, G.: UNIBA: JIGSAW algorithm for Word Sense Disambiguation. In: Proceedings of the 4th ACL 2007 International Workshop on Semantic Evaluations (SemEval-2007), Prague, Czech Republic, June 23-24. Association for Computational Linguistics, pp. 398–401 (2007)
4. Burke, R.: Hybrid recommender systems: survey and experiments. User Model. User-Adapt. Interact. 12(4), 331–370 (2002)
5. Cantador, I., Szomszor, M., Alani, H., Fernández, M., Castells, P.: Enriching Ontological User Profiles with Tagging History for Multi-Domain Recommendations. In: Proc. of the Collective Semantics: Collective Intelligence and the Semantic Web, CISWeb2008, Tenerife, Spain (2008)
6. Carmagnola, F., Cena, F., Cortassa, O., Gena, C., Torre, I.: Towards a tag-based user model: How can user model benefit from tags? In: Conati, C., McCoy, K., Paliouras, G. (eds.) UM 2007. LNCS, vol. 4511, pp. 445–449. Springer, Heidelberg (2007)
7. Degemmis, M., Lops, P., Semeraro, G.: A content-collaborative recommender that exploits WordNet-based user profiles for neighborhood formation. User Model. User-Adapt. Interact. 17(3), 217–255 (2007)
8. Degemmis, M., Lops, P., Semeraro, G., Basile, P.: Integrating tags in a semantic content-based recommender. In: Pu, P., Bridge, D.G., Mobasher, B., Ricci, F. (eds.) Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys 2008, Lausanne, Switzerland, October 23-25, 2008, pp. 163–170. ACM, New York (2008)
9. Diederich, J., Iofciu, T.: Finding communities of practice from user profiles based on folksonomies. In: Innovative Approaches for Learning and Knowledge Sharing, EC-TEL Workshop Proc., pp. 288–297 (2006)
10. Domingos, P., Pazzani, M.J.: On the optimality of the simple bayesian classifier under zero-one loss. Machine Learning 29(2-3), 103–130 (1997)
11. Firan, C.S., Nejdl, W., Paiu, R.: The benefit of using tag-based profiles. In: Proc. of the Latin American Web Conference, Washington, DC, USA, pp. 32–41. IEEE Computer Society, Los Alamitos (2007)
12. Hanani, U., Shapira, B., Shoval, P.: Information Filtering: Overview of Issues, Research and Systems. User Model. User-Adapt. Interact. 11(3), 203–259 (2001)
13. Herlocker, J.L., Konstan, J.A., Terveen, L.G., Riedl, J.T.: Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 22(1), 5–53 (2004)
14. Kim, S.-B., Han, K.-S., Rim, H.-C., Myaeng, S.-H.: Some effective techniques for naive bayes text classification. IEEE Trans. Knowl. Data Eng. 18(11), 1457–1466 (2006)
15. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proc. of IJCAI-1995, pp. 1137–1145 (1995)
16. Leacock, C., Chodorow, M., Miller, G.: Using corpus statistics and wordnet relations for sense identification. Computational Linguistics 24(1), 147–165 (1998)
17. Lee, W.S.: Collaborative learning for recommender systems. In: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 314–321. Morgan Kaufmann, San Francisco (2001)
18. Linden, G., Smith, B., York, J.: Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Comp. 7(1), 76–80 (2003)
19. Lops, P., Degemmis, M., Semeraro, G.: Improving Social Filtering Techniques Through WordNet-Based User Profiles. In: Conati, C., McCoy, K., Paliouras, G. (eds.) UM 2007. LNCS, vol. 4511, pp. 268–277. Springer, Heidelberg (2007)
20. Manning, C., Schütze, H.: Foundations of Statistical Natural Language Processing, ch. 7: Word Sense Disambiguation, pp. 229–264. The MIT Press, Cambridge (1999)
21. McCallum, A., Nigam, K.: A comparison of event models for naive bayes text classification. In: AAAI 1998 Workshop on Learning for Text Categorization, pp. 41–48 (1998)
22. Michlmayr, E., Cayzer, S.: Learning User Profiles from Tagging Data and Leveraging them for Personal(ized) Information Access. In: Proc. of the Workshop on Tagging and Metadata for Social Information Organization, Int. WWW Conf. (2007)
23. Miller, G.: Wordnet: An on-line lexical database. International Journal of Lexicography 3(4) (Special Issue) (1990)
24. Mladenic, D.: Text-learning and related intelligent agents: a survey. IEEE Intelligent Systems 14(4), 44–54 (1999)
25. Resnick, P., Varian, H.: Recommender systems. Communications of the ACM 40(3), 56–58 (1997)
26. Royce, S.A., Straits, B.C.: Approaches to Social Research, 3rd edn. Oxford University Press, New York (1999)
27. Schwab, I., Kobsa, A., Koychev, I.: Learning user interests through positive examples using content analysis and collaborative filtering (2001)
28. Sebastiani, F.: Machine learning in automated text categorization. ACM Comp. Surveys 34(1), 1–47 (2002)
29. Semeraro, G., Basile, P., de Gemmis, M., Lops, P.: User Profiles for Personalizing Digital Libraries. In: Theng, Y.-L., Foo, S., Lian, D.G.H., Na, J.-C. (eds.) Handbook of Research on Digital Libraries: Design, Development and Impact, pp. 149–158. IGI Global (2009) ISBN 978-159904879-6
30. Semeraro, G., Degemmis, M., Lops, P., Basile, P.: Combining learning and word sense disambiguation for intelligent user profiling. In: Proc. of IJCAI 2007, pp. 2856–2861. M. Kaufmann, California (2007)
31. Shardanand, U., Maes, P.: Social information filtering: algorithms for automating "word of mouth". In: Proceedings of ACM CHI 1995 Conference on Human Factors in Computing Systems, Denver, Colorado, United States, vol. 1, pp. 210–217 (1995)
32. Stock, O., Zancanaro, M., Busetta, P., Callaway, C., Krüger, A., Kruppa, M., Kuflik, T., Not, E., Rocchi, C.: Adaptive, intelligent presentation of information for the museum visitor in PEACH. User Modeling and User-Adapted Interaction 17(3), 257–304 (2007)
33. Szomszor, M., Cattuto, C., Alani, H., O'Hara, K., Baldassarri, A., Loreto, V., Servedio, V.D.P.: Folksonomies, the semantic web, and movie recommendation. In: Proc. of the Workshop on Bridging the Gap between Semantic Web and Web 2.0 at the 4th ESWC (2007)
34. Trant, J., Wyman, B.: Investigating social tagging and folksonomy in art museums with steve.museum. In: Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland (May 2006)
35. Wang, Y., Aroyo, L., Stash, N., Rutledge, L.: Interactive user modeling for personalized access to museum collections: The Rijksmuseum case study. In: Conati, C., McCoy, K., Paliouras, G. (eds.) UM 2007. LNCS, vol. 4511, pp. 385–389. Springer, Heidelberg (2007)
36. Yeung, C.M.A., Gibbins, N., Shadbolt, N.: A study of user profile generation from folksonomies. In: Proc. of the Workshop on Social Web and Knowledge Management, WWW Conf. (2008)
37. Zhao, S., Du, N., Nauerz, A., Zhang, X., Yuan, Q., Fu, R.: Improved recommendation based on collaborative tagging behaviors. In: Proc. of Int. Conf. on Intelligent User Interfaces. ACM Press, New York (2008)
3 Exploiting Ontologies for Web Search Personalization
John Garofalakis1,2 and Theodoula Giannakoudi2
1 RA Computer Technology Institute, Telematics Center Department, N. Kazantzaki str., 26500, Greece
2 University of Patras, Computer Engineering and Informatics Dept, 26500 Patras, Greece
[email protected],
[email protected]
Summary. In this work, we present an approach to web search personalization that exploits ontologies. Our approach aims to provide personalization in web search engines by coupling data mining techniques with the underlying semantics of the web content. To this purpose, we exploit reference ontologies that emerge from web catalogs (such as ODP - Open Directory Project), which can scale to the growth of the web. Our methodology uses ontologies to provide the semantic profiling of users' interests, based on the implicit logging of their behavior and the on-the-fly semantic analysis and annotation of the web result summaries. Experimental evaluation of our approach shows that the objectives expected from semantic clustering of users in search engines are achievable.
Keywords: Web Usage Mining, Semantic Annotation, Clustering, Ontology, User Profiles, Web Search, Personalization.
1 Introduction
While the Web is constantly growing, web search has to deal with many challenges. The collection of web documents expands rapidly, and users demand to find the desired information directly. The vital question is what the right information for a specific user is and how this information can be delivered efficiently, saving the web user from consecutively submitted queries and time-consuming navigation through numerous web results. Most existing Web search engines return a list of results based on the query, without paying any attention to the underlying user's interests or to the searching behavior of other users with common interests. There is no prediction of the user's information needs, and problems of polysemy and synonymy often arise. Thus, when a user submits search keywords with multiple meanings (polysemy), or when several words have the same meaning as the submitted keyword (synonymy), he will probably get a large number of web results, most of which will not meet his need. For example, a user submitting the term “opera” may be interested in arts or computers, but the results will be the same regardless of what he is looking for.
G. Castellano, L.C. Jain, A.M. Fanelli (Eds.): Web Person. in Intel. Environ., SCI 229, pp. 49–64. © Springer-Verlag Berlin Heidelberg 2009, springerlink.com
J. Garofalakis and T. Giannakoudi
Fig. 1. The overall personalization methodology
Some current search engines such as Google or Yahoo! have hierarchies of categories that let users explicitly specify their interests. However, these hierarchies are usually very large, which discourages users from browsing them to define the paths of interest. To overcome this overload in the users' searching tasks, the user's interests may be detected implicitly by tracking his search history and personalizing the web results. In this work, we propose a personalization method (Figure 1) which couples data mining techniques with the underlying semantics of the web content in order to build semantically enhanced clusters of user profiles. In our methodology, apart from exploiting a specific user's search history, we further exploit the search history of other users with similar interests. The user is assigned to relevant conceptual classes of common interest, so as to predict the relevance score of the results with respect to the user's goal and finally re-rank them. To this purpose, we exploit reference ontologies that emerge from web catalogs (such as ODP, the Open Directory Project, http://www.dmoz.org/), which can scale to the growth of the web. Ontologies provide for the semantic profiling of users' interests based on the implicit logging of their behavior and the on-the-fly semantic analysis and annotation of summaries of the web results. The semantic clusters comprise taxonomical subsets of a general category hierarchy, such as ODP, representing the categories of interest for groups of web users with similar search tasks. Specifically, our methodology consists of five tasks: it (1) gathers the user's search history, (2) processes the user activity, taking into consideration other users' activities and constructing clusters of commonly preferred concepts, (3) defines ontology-based profiles for the active user based on the interests detected from his current activity and the interests depicted by the semantic cluster to which he has been assigned from previous searching sessions, (4) re-ranks the web results, combining the above information with the semantics of the delivered results, and (5) constantly re-organizes the conceptual clusters in order to be up-to-date with the users' interests. Our approach has been experimentally evaluated by utilizing the Google Web Service and delivering a transparent Google search web site, and the results show that semantically clustering users in terms of detecting commonly interesting ODP categories in search engines is effective.
The remainder of the paper is structured as follows: Section 2 discusses related work. In Section 3, we describe the reference ontology that our approach uses, based on the ODP categorization. Using this ontology, we outline the semantic annotation of web results to the ontology classes. Moreover, we present how the user profiles are defined over the reference ontology, referred to earlier as task (2), and how the semantic user clusters are formed, referred to as task (3). In Section 4, we discuss what sort of ontology we can discover from a set of positive documents. We also present an ontology mining algorithm. 
In Section 4, we propose a novel technique for web search personalization combining profiles of semantic clusters with the emerging profile of the active user referred as tasks (4) and (5). In Section 5, we exhibit and discuss our experiments to show the performance of the proposed approach for the Web search personalization. In this section we describe the task (1) and the experimental results of the implementation. Section 7 presents the conclusions and gives an outlook on further work.
2 Related Work
In this section, we present work that has been conducted in similar contexts, such as personalized web searching, usage-based personalization and semantic-aware personalization. Several ontology-based approaches have been proposed for user profiling that take advantage of the knowledge contained in ontologies ([6], [13]) in personalization systems. In [5], an aggregation scheme towards more general concepts is presented. Clustering of the user sessions is provided to identify related concepts at different levels of abstraction in a recommender system. Significant studies have been conducted on personalization based on user search history. A general framework for personalization based on aggregate usage
J. Garofalakis and T. Giannakoudi
profiles is presented in [15]. This work distinguishes between the offline tasks of data preparation and usage mining and the online personalization components. [17] suggests learning a user’s preferences automatically from their past click history and shows how to use this learning for result personalization. Many researchers have proposed ways to personalize web search by biasing ranking algorithms towards pages that are likely to interest the user. For example, [18] extends the HITS algorithm to promote pages marked “relevant” by the user in previous searches. A great step towards biased ranking is made in [9], where a topic-oriented PageRank is built, considering the first-level topics listed in the Open Directory. The authors show that this algorithm outperforms the standard PageRank if the search engine can effectively estimate the query topic. Regarding the exploitation of large-scale taxonomies in personalized search specifically, a number of interesting works have been presented. In [4], several ways of extending ODP metadata to personalized search are explored. In [12], users’ browsing history is exploited to construct a much smaller subset of user-specific categories than the entire ODP, and a novel ranking logic is implemented among categories. In [10], sets of known user interests are automatically mapped onto a group of categories in ODP, and manually edited ODP data are used for training text classifiers to perform search result categorization and personalization. Our work differs from previous works in several respects. We exploit large-scale taxonomies, such as ODP, to construct combinative semantic user profiles. In our emerging profiles, both user browsing history and automatically created clusters of user categories are incorporated in personalizing web results. In this way, we re-rank search results by taking into consideration, apart from the active user’s tasks, the subsets of “interesting” taxonomy categories that co-occur in other users’ searches, provided that these users exhibit behavior similar to that of the active one.
3 Ontology-Based User Clusters
The general aim of this work is to introduce a method for personalizing the results of web searching. For this reason, we focused on constructing user profiles implicitly and automatically, according to the users’ interests and their previous search behavior. In this direction, we built on the work described in [3].
3.1 Reference Ontology
Our first goal was to create a reference ontology upon which to base the user profiles. The profile of each user is represented by a weighted ontology, depicting the user’s interest in every class of the reference ontology. Rather than creating a new ontology from scratch, we decided to base our reference ontology on already existing subject hierarchies. Some of them are Yahoo.com², About.com³,
² Yahoo! Search Engine. http://www.yahoo.com
³ About. http://www.about.com
Fig. 2. A depiction of the ODP
Lycos⁴ and the Open Directory Project, which provide manually created online subject hierarchies. Our implementation of the reference ontology was finally based on the Open Directory Project. Figure 2 depicts some of the concepts of the first three levels of the ODP taxonomy. The choice of the Open Directory Project over the other directories for the construction of the reference ontology made no difference, because there is a correspondence among them. The ontology created is actually a directed acyclic graph (DAG). Since we wish to create a relatively concise user profile that identifies the general areas of a user’s interests, we created our reference ontology by using concepts from only the first three levels of the Open Directory Project [19], which are the directories used by the Google search engine. In addition, since we want concepts that are related by a generalization-specialization relationship, we removed subjects that were linked based on other criteria, e.g. alphabetic or geographic associations. The ontology was created with Protege⁵, the free, open source ontology editor and knowledge-base framework, and the language used for development was OWL.
3.2 Semantic Annotation
⁴ Lycos Search Engine. http://www.lycos.com/
⁵ The Protege Ontology Editor and Knowledge Acquisition System: http://protege.stanford.edu
The construction of the profile, i.e. the weighted ontology, for every user includes the semantic annotation of the user’s previous choices. The semantic characterization of the user choices is based on the methodology proposed in [7]. Therefore, the user’s previous choices are analyzed into keywords extracted
from the visited web pages, and the keywords are semantically characterized. The semantic similarity between each keyword and each term of the ontology was computed by using semantic similarity measures with WordNet [13]. In WordNet [14], English nouns, verbs, adjectives and adverbs are organized into synonym sets (synsets), each representing one underlying lexical concept. The measure applied in our methodology is that of Wu and Palmer [19]. This measure calculates relatedness by considering the depths of the two synsets in the WordNet taxonomies, along with the depth of their least common subsumer (LCS):
\[ \mathrm{Score} = \frac{2 \times \mathrm{depth}(lcs)}{\mathrm{depth}(k) + \mathrm{depth}(c)} \tag{1} \]
where k is the user keyword, c is the ontology class and lcs is their nearest common ancestor. This means that Score ∈ (0, 1]. The score can never be zero because the depth of the LCS is never zero (the depth of the root of a taxonomy is one). The score is one if the two input synsets are exactly the same. The assignment process is time-expensive; therefore, we have implemented a caching policy to improve system response. The assignments of words to ontology classes are kept in cache, to minimize response time in case these words are met again. Each time this process is executed, only the users’ choices that were not annotated during its last execution are semantically annotated. This saves execution time, since semantic annotation is a quite time-consuming step of the overall method. As a result, the keywords and, consequently, the users’ choices are assigned to the relevant classes of the ontology when the score is over a threshold. Experimentation and fine tuning with different threshold values resulted in the choice of 0.7 as the concept similarity threshold.
3.3 Definition of User Profiles
In this step, our methodology uses the semantic annotations of the users’ choices to construct the profile of every user. After the semantic characterization of the user’s choices against the ontology concepts, our methodology moves on to profile creation. From the web access logs kept on the web server, our method extracts the user’s previous choices, which have already been semantically annotated. Therefore, for every user we extract the concepts and their frequency of appearance in the previous choices that the specific user has made. At the end of this step, we have accumulated, for every user, the preferred concepts and the frequency of every concept, which serves as the weight of the corresponding class (preference) in the ontology.
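As an illustration, the annotation and accumulation steps described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the toy taxonomy `PARENT` stands in for WordNet, the function names are hypothetical, and the cache is a plain in-memory memoization of keyword-concept scores. The score is the Wu and Palmer measure with root depth one, and a concept is counted for a keyword only when the score is over the 0.7 threshold.

```python
from collections import Counter
from functools import lru_cache

# Hypothetical toy taxonomy (child -> parent); a real system would use WordNet.
PARENT = {
    "entity": None,
    "art": "entity",
    "music": "art",
    "opera": "music",
    "jazz": "music",
    "technology": "entity",
    "software": "technology",
    "browser": "software",
}

def path_to_root(node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

@lru_cache(maxsize=None)  # each (keyword, concept) pair is scored only once
def wu_palmer(keyword, concept):
    ancestors = set(path_to_root(keyword))
    # Walking up from `concept`, the first shared node is the deepest common ancestor.
    lcs = next(n for n in path_to_root(concept) if n in ancestors)
    return 2 * depth(lcs) / (depth(keyword) + depth(concept))

def concept_frequencies(keywords, concepts, threshold=0.7):
    """Count how often each ontology concept is assigned to the user's keywords."""
    counts = Counter()
    for kw in keywords:
        for c in concepts:
            if wu_palmer(kw, c) > threshold:
                counts[c] += 1
    return counts

print(concept_frequencies(["opera", "jazz", "browser"], ["music", "software"]))
# prints Counter({'music': 2, 'software': 1})
```

In this toy run, "opera" and "jazz" are both assigned to the concept "music" (score 6/7), while "browser" is assigned to "software"; the resulting counts are the concept frequencies from which the profile weights are derived.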
In this step of the proposed methodology, apart from accumulating the concepts for which the user has shown interest, we construct the vector that represents each user’s profile. The vector’s size is the number of concepts in the ontology. The value of each element of the vector corresponds to the weight of the user’s interest in the respective concept. We propose that the weight of a concept i for the user u is calculated as:
\[ w_{iu} = \frac{cf_{iu}}{\mathrm{sum}(cf_u)} \tag{2} \]
where $cf_{iu}$ is the number of times that the concept i has been assigned to the user u, and $\mathrm{sum}(cf_u)$ is the total number of times that the concepts of the ontology have been assigned to the user u. For concepts to which none of the user’s previous choices has been assigned, the value is set to zero. So for a user u the profile is represented as follows:
\[ p_u = \langle w_{1u}, w_{2u}, \ldots, w_{nu} \rangle \tag{3} \]
where n is the number of concepts in the ontology and
\[ w_{iu} = \begin{cases} \mathrm{weight}(concept_i, u), & \text{if } concept_i > 0 \\ 0, & \text{otherwise} \end{cases} \]
Therefore, the weight of each concept is the relative frequency of the concept among all concepts of the ontology. The weights sum to one, each representing the share of the user’s interest that falls on the respective concept. Moreover, for each user we create a file that holds the profile vector.
3.4 Semantic Clustering of User Profiles
After creating each user profile, the suggested methodology moves on to profile clustering. From the profile creation step, a profile for every user is stored in the database and a file with the user’s weighted-ontology vector is created. At this step of the methodology, the profiles of all the users that have interacted with the search engine are accumulated and grouped into clusters with similar interests. This procedure is carried out for the users that have already interacted with the search engine and whose previous interaction has been stored in the web access logs. The clustering algorithm applied in the profile clustering step is the K-Means algorithm [11]. K-Means is one of the most common clustering algorithms; it groups data with similar characteristics or features together into clusters. The data in a cluster have similar features or characteristics, which are dissimilar from those of the data in other clusters. The K-Means algorithm accepts as input the number of clusters to group the data into and the dataset to cluster. It then creates the first K initial clusters (K = number of clusters needed) by choosing K rows of data randomly from the dataset. It calculates the Arithmetic Mean of each cluster
formed in the dataset. The Arithmetic Mean of a cluster is the mean of all the individual records in the cluster. Next, K-Means assigns each record in the dataset to only one of the initial clusters. Each record is assigned to the nearest cluster (the cluster to which it is most similar) using a measure of distance or similarity such as the Euclidean distance, which was used in this module. K-Means then re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. The preceding steps are repeated until stable clusters are formed and the K-Means clustering procedure is completed. Stable clusters are formed when new iterations of the K-Means algorithm do not create new clusters, because the cluster centre (Arithmetic Mean) of each cluster is the same as the old cluster centre. At the end of this step, the users are grouped into clusters with similar interests and the clusters are stored in the database. Thus, a cluster profile is built, utilizing the sum of the preferences of all cluster members:
\[ p_c = \langle w_1, w_2, \ldots, w_n \rangle \tag{4} \]
We should note that every time this step is executed, the clusters are constructed from scratch and the users are grouped again. Thus, the clustering procedure is not based on the previously constructed clusters. This design was chosen because a user’s choices change over time, so he may no longer share interests with the users of the cluster to which he was assigned in a previous execution of the clustering procedure. The construction of the clusters of semantic user profiles is presented in Figure 3.
Fig. 3. Creation of the clusters with the semantic users’ profiles
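The profile and clustering steps of Sections 3.3 and 3.4 can be sketched as follows. This is a minimal pure-Python sketch, not the implementation used in the experiments: the function names, the `seed` parameter and the toy counts are assumptions made for illustration. In particular, `cluster_profile` reads "the sum of preferences of all cluster members" as the per-concept sum of the members' weights, renormalised so that the cluster weights sum to one, which is one plausible interpretation.

```python
import math
import random

def profile_vector(counts, concepts):
    """Relative frequency of each concept (Eq. 2); zero for unseen concepts."""
    total = sum(counts.get(c, 0) for c in concepts)
    return [counts.get(c, 0) / total if total else 0.0 for c in concepts]

def kmeans(vectors, k, seed=0, max_iter=100):
    """Plain K-Means with Euclidean distance; returns one cluster label per vector."""
    rng = random.Random(seed)
    centres = [list(v) for v in rng.sample(vectors, k)]  # K random rows as initial centres
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(v, centres[j])) for v in vectors
        ]
        if new_labels == labels:  # assignments are stable, so the centres are too
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centre if a cluster ends up empty
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_profile(vectors):
    """Summed member weights per concept, renormalised to sum to one (assumed reading of Eq. 4)."""
    sums = [sum(col) for col in zip(*vectors)]
    total = sum(sums)
    return [s / total if total else 0.0 for s in sums]

# Hypothetical concept counts for four users over two concepts.
concepts = ["music", "software"]
user_counts = [
    {"music": 9, "software": 1},
    {"music": 8, "software": 2},
    {"music": 1, "software": 9},
    {"music": 2, "software": 8},
]
profiles = [profile_vector(c, concepts) for c in user_counts]
labels = kmeans(profiles, k=2)
music_cluster = [p for p, lab in zip(profiles, labels) if lab == labels[0]]
print(labels, cluster_profile(music_cluster))
```

With these toy counts, the two music-oriented users end up in one cluster and the two software-oriented users in the other, whichever rows are drawn as initial centres. Because the clusters are rebuilt from scratch on every run, no state is carried over between calls.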
4 Personalization Algorithm
The preprocessed users’ choices, their semantic characterization and the users’ clusters are used for processing and personalizing the results from a search engine. At this point, every user that has previously interacted with the online search engine has been put in one cluster. This cluster consists of users with similar interests and, like the profiles, can be depicted as a weighted ontology. This weighted ontology is represented as a vector, too. Each element of the vector is the sum of the interests of all the users belonging to the cluster in a concept, divided by the sum of the interests of all the users of the cluster in all the concepts of the ontology. The formulation is the same as that followed for the users’ profiles described in paragraph 3.3. The personalized search includes the calculation of the similarity of each result returned by the search engine with the cluster’s interests. This calculation requires the execution of all the steps of the ontology-based user clusters for each result returned by the search engine. Therefore, for every query that is submitted to the search engine, the proposed methodology performs the following steps:
1. Extracts the keywords from the users’ previous choices, i.e. the users’ previously visited result pages.
2. Applies the semantic annotation step, with the difference that in this assignment the ontology is not the whole reference ontology but the part of it that consists of the concepts for which the cluster to which the user belongs has a non-zero weight. The output of this step is a vector containing the similarity values of the keywords with the concepts of the ontology, depicted as:
\[ result\_sim_{jc} = \langle sim_{j1}, sim_{j2}, \ldots, sim_{jm} \rangle \tag{5} \]
where j denotes the jth result of the search engine and m is the number of concepts in the cluster.
3. Since we have calculated the similarity of each result to the cluster, we calculate the score value for each result. This score is calculated as the inner product of the cluster vector in relation (4) and the similarity vector in relation (5). So the score will be:
\[ \mathrm{Score} = p_c \cdot result\_sim_{jc} \tag{6} \]
The above three steps are executed for every result and the score value is kept in cache. Afterwards, the results of the search engine are ordered for presentation to the user according to the calculated score, beginning with the one with the highest score (Figure 4). During the user’s interaction with the search engine, the user’s choices are stored in the database so as to be processed in the next run of the method.
Fig. 4. The Personalization algorithm
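The scoring and re-ranking step can be sketched as follows. This is a minimal sketch with hypothetical names and made-up numbers: the cluster weights and the per-result similarity vectors are illustrative values, and the way keyword similarities are aggregated into one value per concept is assumed to have been done beforehand. Each score is the inner product of the cluster vector and the result's similarity vector, and results are sorted by descending score.

```python
def result_score(cluster_weights, result_sims):
    """Inner product of the cluster profile and a result's similarity vector (Eq. 6)."""
    return sum(w * s for w, s in zip(cluster_weights, result_sims))

def rerank(results, cluster_weights):
    """Return (title, score) pairs sorted by descending score.

    `results` holds (title, similarity_vector) pairs in search-engine order;
    Python's sort is stable, so results with equal scores keep that order.
    """
    scored = [(title, result_score(cluster_weights, sims)) for title, sims in results]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical cluster weights over the concepts (Music, Computers).
cluster_weights = [0.4, 0.6]
# Hypothetical similarity vectors for three results of the query "opera".
results = [
    ("Download Opera Web Browser", [0.1, 0.6]),
    ("The Metropolitan Opera", [0.9, 0.1]),
    ("Opera Software-Company", [0.1, 0.9]),
]
for title, score in rerank(results, cluster_weights):
    print(f"{score:.2f}  {title}")
```

With these illustrative numbers, a cluster that weights computers above music still places a strongly computer-related result first, while the music-related result overtakes a weakly matching computer result.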
5 Testing and Evaluation
5.1 Experimental Implementation of the Methodology
We developed a WWW search engine utilizing the Google Search API⁶ so as to test our methodology. The Search API returns the URL, the title and a short summary for each of the first ten results of the Google search engine. At first, we ran this limited search engine without personalizing the results, only accumulating the users’ choices. Next, we applied the proposed method and compared the personalized presentation of the results with the non-personalized one.
⁶ Google Web Apis Home Page. http://code.google.com/apis
Logging Search History
The Google search API, used for the experimental implementation, returns the URL, the title and a short summary for every result, just like the results of the Google search engine. For our experimental implementation, we use a database to store the users’ choices for every query submitted to the limited search engine used for testing. Through the website of this limited search engine we store the IP address, the domain name and the user agent for the identification of each user. Every time a user enters the search engine, the IP address, the agent and the domain name are identified, which prevents a user from being stored in the database multiple times. Moreover, the search engine stores in the database the query and the choices of the user for every query. So, for every result that is clicked by the user, the search engine stores the title, the URL and the short summary returned. This database holds the history of the requests and is therefore used as the web access logs in this methodology. Next, we apply the steps of the methodology proposed earlier to the web access logs to create the clusters of semantic user profiles. In the web
access logs, i.e. in the database, there are the choices of all the users. For every choice that has been selected we extract the keywords. For the experimental implementation, the keyword extraction methodology is similar to the one proposed in [3] for the keywords of the pages that have a link to a specific page. The keywords extracted for every URL are gathered from the title of the URL and the short summary returned by the Google search API. The title and the summary are parsed and cleaned of HTML tags, and the stop words (very common words, numbers, symbols, articles) are removed, since they are considered not to contribute to the semantic denotation of the web page’s content. The words that remain are considered the keywords for every URL; since their number is small, no frequency is taken into consideration. After this step, the keywords for every URL are stored in the database. Next, the keywords are semantically characterized in the way described in paragraph 3.2. Afterwards, the profiles of the users are created as analyzed in paragraph 3.3, and finally the users are grouped into clusters as described in paragraph 3.4.
5.2 Experimental Results and Evaluation
In order to evaluate the proposed method and demonstrate the effectiveness of our personalization method, we performed some polysemous queries, expecting the results to be personalized according to the profile of the cluster to which the user has been assigned, and aiming to verify that our method can improve the ranking quality of the results as desired. We applied the queries in the experimental implementation that returns the first ten results from the Google search engine through the Google search API. In one case we applied our personalization methodology, whereas in the other case we took the results as they were returned by the search API. We evaluated the use of our automatically created user profiles for personalized search using a ranking approach. A function is applied to the document-query match values and the rank orders returned by the search engine, so that relevant documents are moved higher in the result set and non-relevant documents are demoted. Our experimental implementation was online for one month, and twenty users interacted with it. The choices that they made for every query were stored in the database. The choices were processed and the user profiles were created. Next, we grouped the users into three clusters. The user that made the queries has already been put in a cluster, and the reference ontology of the cluster, upon which the score of the results will be based, has been created. We should note that the cluster has users that are interested in Acting, Advertising, American, Animation, Apple, Appliances, Artists, Audio, Ballet, Ballroom, Biography, Bonsai, Buses, Cables, Choices, Companies, Darwin, DEC, Exploits, Flowers, Fraud, Games, Journals, Licenses, Mach, Mainframe, Morris, Mosaics, Music, Oceania, Opera, Painters, People, Pick, Programs, Quotations, Reference, Representatives, Roleplaying, Security, Series, Soaps, Sports, Sun, Supplies, Syllable, Telephony, Test Equipment, Youth, Assemblage, Characters, Christian,
Computer, Cracking, Creativity, Creators, Drawing, Editorial, Home, Instruments, Internet, Organizations, Radio, Searching, Unix, with various weights for each concept of the reference ontology.
Methodology Performance under Polysemy Queries
An example query that was submitted to the search engine was “opera”. The word “opera” has a twofold meaning: opera is a form of musical and dramatic work, and it is also a very commonly used web browser. Thus, it is a query for which the results of search engines will refer both to music and to computers. The user that submits the query to the search engine asks for information about opera as a kind of music and expects results related to music. Table 1 lists the titles of the results for the query “Opera”. The first column gives the order of the results of the search API without the application of the personalization methodology, while the second column gives the order of the personalized results of the experimental application, according to the score of each result.
Table 1. Personalized and non-personalized results for the query “Opera” for a user who is searching for opera related to music and whose cluster is interested in Arts but in Computers as well
| Non-personalized results | Personalized results |
| 1. Download Opera Web Browser (computers) | 1. Opera Software-Company (computers) |
| 2. Opera Software-Company (computers) | 2. Welcome to LA Opera - LA Opera (music) |
| 3. Opera - Wikipedia the free encyclopedia (music) | 3. Opera - Wikipedia the free encyclopedia (music) |
| 4. Opera (Internet suite) - Wikipedia, the free encyclopedia (computers) | 4. Opera Community (computers) |
| 5. Opera Mini - Free mobile Web browser for your phone (computers) | 5. Opera (Internet suite) - Wikipedia, the free encyclopedia (computers) |
| 6. Welcome to LA Opera - LA Opera (music) | 6. Opera in to the Ozarks (music) |
| 7. OperaGlass (computers) | 7. Opera Mini - Free mobile Web browser for your phone (computers) |
| 8. The Metropolitan Opera (music) | 8. The Metropolitan Opera (music) |
| 9. Opera in to the Ozarks (music) | 9. OperaGlass (computers) |
| 10. Opera Community (computers) | 10. Download Opera Web Browser (computers) |
Next to each title we give in parentheses the general concept of the result, which we have concluded after reading the summary. The user searches for results related to music. The first column represents the results that are returned from the search API without personalization. In this column the results that the user
searches are in places 3, 6, 8 and 9. On the other hand, the second column has the personalized results, and the results related to music are in places 2, 3, 6 and 8. It is obvious that, after the application of the proposed personalization methodology, the results related to music are pushed closer to the top. The cluster to which the user belongs, as we have mentioned, has many interests that include music, and this has been taken into consideration while calculating the score of each result, pushing the results related to music to a higher place in the list of results. Also, because the results returned have high similarity with the concepts of the cluster reference ontology, the music-related results are pushed closer to the top. Apart from this query, we have tested the proposed methodology on a second query, “Apple Company”. “Apple Company” has many meanings. It is the name of a company that develops and sells products related to computers. Moreover, Apple is the name of the record company that the Beatles created, and of another company related to music, the “Mountain Apple Company”. Also, there is a company named “Green Apple Co.” which is related to handcraft, a company named “Apple Canyon Company” related to food, a company named “Little Apple Brewing Company” related to entertainment and a company named “Apple Moving Company”, which is a moving company. In Table 2, the first column gives the results as they are returned by the search API, whereas the second column gives the results as they are reorganized according to the score calculated by the personalization methodology we propose. For each result, next to the title there is a general description in parentheses.
Table 2. Personalized and non-personalized results for the query “Apple Company” for a user who is searching for a company related to music and whose cluster is interested in Arts but in Computers as well
| Non-personalized results | Personalized results |
| 1. Apple Inc. - Wikipedia, the free encyclopedia (computers) | 1. Apple Moving Company, Austin, Texas (moving company) |
| 2. Welcome to the Apple Company Store (computers) | 2. Hawaiian Music - The Mountain Apple Company (music) |
| 3. Apple-Quicktime (computers) | 3. Little Apple Brewing Company - Something New is Brewing (entertainment) |
| 4. Apple Inc. and the Beatles’ Apple Corps Ltd. Enter into New Agreement (music) | 4. Apple Inc. - Wikipedia, the free encyclopedia (computers) |
| 5. Apple company and contact information (computers) | 5. Apple Canyon Company - Specialty Foods From the Heart of New Mexico (food) |
| 6. Hawaiian Music - The Mountain Apple Company (music) | 6. Apple company and contact information (computers) |
| 7. Green Apple Co. Inc. (handcraft) | 7. Green Apple Co. Inc. (handcraft) |
| 8. Little Apple Brewing Company - Something New is Brewing (entertainment) | 8. Welcome to the Apple Company Store (computers) |
| 9. Apple Moving Company, Austin, Texas (moving company) | 9. Apple Inc. and the Beatles’ Apple Corps Ltd. Enter into New Agreement (music) |
| 10. Apple Canyon Company - Specialty Foods From the Heart of New Mexico (food) | 10. Apple-Quicktime (computers) |
The user keeps on searching for results related to music. The results related to music in the non-personalized presentation are in places 4, 6 and 8, while in the personalized presentation they are in places 2, 3 and 9. The personalization methodology has pushed the desired results towards the first places of the list of results returned by the search engine. In both examples, the cluster to which the user belongs shows, besides the interest in music, an interest in computers, and this interest is reflected in the results of the personalization methodology. The first result in both queries was about computers because the weighted ontology depicting the cluster has higher weights for concepts related to computers than for concepts related to arts. However, given the relatedness of the results to the cluster’s preferences, the methodology has pushed the desired results to higher places than those they occupied without personalization.
Precision Evaluation
The twenty users were asked to characterize the top five results in the personalized and non-personalized result sets as “relevant” or “non-relevant”. On average, before re-ranking, only 40% of the top retrieved pages were found to be relevant. This amount is remarkably lower than the findings in [1], which reports that roughly 50% of documents retrieved by search engines are irrelevant. The
Fig. 5. Average precision of the semantic personalization search engine compared with the non-personalization search engine
reason is that the queries tested by the users were polysemous, so the probability of retrieving irrelevant results was higher. Re-ranking the results by promoting those that are classified into concepts belonging to the user’s cluster profile produced an overall performance increase, as shown in Figure 5. We see that the ontology-based system consistently outperforms the simple search, supporting our approach of using a reference ontology for clustering user profiles in semantic search.
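The precision measure used in this evaluation can be illustrated with a short sketch. The function names are hypothetical, and the judgements below are not the users' actual answers: they are derived from the category labels of Table 1 under the assumption that a result counts as relevant exactly when it is labelled "music".

```python
def precision_at_k(judgements, k=5):
    """Fraction of the top-k results judged relevant (True)."""
    top = judgements[:k]
    return sum(top) / len(top) if top else 0.0

def mean_precision(all_judgements, k=5):
    """Average precision at k over several users or queries."""
    return sum(precision_at_k(j, k) for j in all_judgements) / len(all_judgements)

# Relevance of the ten "Opera" results in rank order (assumption: music = relevant).
non_personalized = [False, False, True, False, False, True, False, True, True, False]
personalized = [False, True, True, False, False, True, False, True, False, False]

print(precision_at_k(non_personalized))  # 0.2
print(precision_at_k(personalized))      # 0.4
```

Under this assumption, the top five of the non-personalized list contain one music-related result and the top five of the personalized list contain two.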
6 Conclusions and Future Work
We presented a personalization methodology which is based on clustering semantic user profiles. The method analyzes and semantically annotates the web access logs. Next, it organizes the users’ profiles and groups the users into clusters. The results returned by the search engine are personalized through an on-the-fly semantic characterization, from which the score of each result is calculated. The scores of the results are kept in cache, and the results are reorganized and presented to the user according to this score, with the highest-scoring result first. Through the experimental implementation, we showed that the proposed personalization method has notable potential to change the personalization landscape. Future work includes the use of Fuzzy K-Means [2], which allows the creation of overlapping clusters, so that a user may belong to different cluster profiles with different weights. It also includes the development of a reference ontology with more levels, and changes to factors such as the score of each result, so that the user’s own preferences are taken into consideration with greater weight than those of the other users of the cluster.
References
1. Casasola, E.: ProFusion Personal Assistant: An Agent for Personalized Information Filtering on the WWW. Master’s thesis, The University of Kansas (1998)
2. Castellano, G., Torsello, A.: Categorization of web users by fuzzy clustering. In: Lovrek, I., Howlett, R.J., Jain, L.C. (eds.) KES 2008, Part II. LNCS, vol. 5178, pp. 222–229. Springer, Heidelberg (2008)
3. Gauch, S., Chaffee, J., Pretschner, A.: Ontology-Based User Profiles for Search and Browsing. Web Intelligence and Agent Systems 1(3-4), 219–234 (2003)
4. Chirita, P.A., Nejdl, W., Paiu, R., Kohlschutter, C.: Using ODP metadata to personalize search. In: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, Salvador, Brazil (2005)
5. Dai, H., Mobasher, B.: Using Ontologies to Discover Domain-Level Web Usage Profiles. In: Proceedings of the 2nd Workshop on Semantic Web Mining at PKDD 2002, Helsinki, Finland (2002)
6. Eirinaki, M., Vazirgiannis, M., Varlamis, I.: SEWeP: Using Site Semantics and a Taxonomy to Enhance the Web Personalization Process. In: Proceedings of the 9th SIGKDD Conference (2003)
7. Garofalakis, J., Giannakoudi, T., Sakkopoulos, E.: An Integrated Technique for Web Site Usage Semantic Analysis: The ORGAN System. Journal of Web Engineering (JWE). Special Issue Logging Traces of Web Activity 6(3), 261–280 (2007)
8. Gauch, S., Madrid, J., Induri, S., Ravindran, D., Chadlavada, S.: KeyConcept: A Conceptual Search Engine. Information and Telecommunication Technology Center. Technical Report: ITTC-FY2004-TR-8646-37, University of Kansas.
9. Haveliwala, T.: Topic-Sensitive PageRank. In: Proceedings of the Eleventh International World Wide Web Conference (2002)
10. Ma, Z., Pant, G., Sheng, O.: Interest-based personalized search. ACM Transactions Information Systems 25(1) (2007)
11. MacQueen, J.B.: Some Methods for classification and Analysis of Multivariate Observations. In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, vol. 1, pp. 281–297. University of California Press (1967)
12. Makris, C., Panagis, Y., Sakkopoulos, E., Tsakalidis, A.: Category ranking for personalized search. Data and Knowledge Engineering Journal (DKE) 60(1), 109–125 (2007)
13. Middleton, S., Shadbolt, de Roure, D.C.: Ontological User Profiling in Recommender Systems. ACM Transactions Information Systems 22(1), 54–88 (2004)
14. Miller, G.A.: WordNet: A lexical database for English. Communications of the ACM 38(11), 39–41 (1995)
15. Mobasher, B., Cooley, R., Srivastava, J.: Automatic Personalization based on web usage Mining. Communications of the ACM 43(8), 142–151 (2000)
16. Pedersen, T., Patwardhan, S., Michelizzi, J.: WordNet: Similarity - Measuring the Relatedness of Concepts. In: Proceedings of the Nineteenth National Conference on Artificial Intelligence, pp. 1024–1025. AAAI, San Jose (2004)
17. Qiu, F., Cho, J.: Automatic identification of user interest for personalized search. In: Proceedings of the 15th International World Wide Web Conference, Edinburgh, Scotland, U.K. ACM Press, New York (2006)
18. Tanudjaja, F., Mui, L.: Persona: A contextualized and personalized web search. In: Proceedings of the 35th Annual Hawaii International Conference on System Sciences (2002)
19. Wu, Z., Palmer, M.: Verb semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico, pp. 133–138 (1994)
4 How to Derive Fuzzy User Categories for Web Personalization

Giovanna Castellano and Maria Alessandra Torsello

University of Bari, Department of Informatics, Via Orabona, 4 - 70126 Bari, Italy
{castellano,fanelli,torsello}@di.uniba.it
Summary. Web personalization offers effective tools for developing applications that meet the needs of their users more closely. To do so, Web developers must address an important challenge: discovering knowledge about the interests that users exhibit during their interactions with Web sites. Web Usage Mining (WUM) is an active research area aimed at discovering useful patterns of typical user behavior by exploiting usage data. Among the techniques proposed for WUM, clustering has been widely employed to categorize users by grouping together those who share similar interests. In particular, fuzzy clustering proves especially suitable for deriving user categories from the Web usage data available in log files. Usually, fuzzy clustering relies on distance-based metrics (such as the Euclidean measure) to evaluate the similarity between user preferences. However, such measures may lead to ineffective results by identifying user categories that do not capture the semantic information incorporated in the original Web usage data. In this chapter, we propose an approach based on a relational fuzzy clustering algorithm equipped with a fuzzy similarity measure to derive user categories. As an application example, we apply the proposed approach to usage data extracted from the log files of a real Web site. A comparison with the results obtained using the cosine measure is shown to demonstrate the effectiveness of the fuzzy similarity measure.

Keywords: fuzzy similarity measures, relational fuzzy clustering, Web personalization, Web user categorization, Web Usage Mining.
G. Castellano, L.C. Jain, A.M. Fanelli (Eds.): Web Person. in Intel. Environ., SCI 229, pp. 65–79. © Springer-Verlag Berlin Heidelberg 2009, springerlink.com

1 Introduction

The growing diffusion of the Internet as a new medium of information dissemination and the increasing number of users who browse the network daily have led more and more organizations to publish their information and provide their services on the Web. However, the explosive growth in the use and size of the Internet has made this information harder to manage and has generated growing interest in the development of personalized Web applications, i.e. applications able to adapt their content or services to user interests. Today, Web personalization represents one of the most powerful tools for improving Web-based applications, since it makes it possible to provide content tailor-made to
the needs of users, thus satisfying their actual desires without explicitly asking for them. Hence, one of the main challenges that Web applications have to face is understanding user preferences and interests in order to provide personalized functions that appeal to users. As a result, knowledge discovery about user interests proves to be a crucial activity in the overall personalization process. In particular, this activity is aimed at identifying user behavior patterns, i.e. discovering common behaviors exhibited by groups of users during their visits to Web sites.

Advanced technologies, such as those coming from data mining and Web mining, may offer valid tools to reach this aim. Among these, Web Usage Mining (WUM) [15], [7] is an important branch of Web mining devoted to the discovery of interesting patterns in user browsing behavior through the analysis of Web usage data characterizing the interactions of users with sites. Since access log files store a huge amount of data about user access patterns, they represent the most important source of usage data. If properly exploited, log files can reveal useful information about the browsing behavior of users in a site. As a consequence, these data can be employed to derive categories of users that capture common interests and trends among the users accessing the site. The discovered user categories can then be exploited to deliver personalized functions to currently connected users.

In the absence of any a priori knowledge, unsupervised classification, or clustering, seems to be the most promising way of learning user behavior patterns and identifying user categories by grouping together users with common browsing behavior [24], [25]. In the choice of an effective clustering method for WUM, several factors have to be considered.
Early research efforts relied on clustering techniques that often proved inadequate to deal with the noise typically present in Web usage data. In this context, desirable techniques should be able to handle the uncertainty and vagueness underlying data about the interactions of users with sites. Another important aspect is the possibility of obtaining overlapping clusters, so that a user can belong to more than one group. Indeed, the browsing behavior of users is highly uncertain and fuzzy in nature. A Web site is generally visited by a huge number of users with a variety of needs. Moreover, a user may access the same page of a site for different purposes and may have several goals whenever he or she visits a site. Such overlapping interests cannot be adequately captured by the crisp partitions obtained with hard clustering techniques, which assign each object exclusively to a single cluster. Thanks to their capacity to derive clusters with hazy boundaries, where objects may have characteristics of different classes to certain degrees, fuzzy clustering methods are particularly suitable for usage mining [17], [10], [23]. The main advantage of fuzzy clustering over hard clustering is that it yields more detailed information about the underlying structure of the data.

Another main challenge in the use of clustering for the categorization of Web users is the definition of an appropriate measure that can capture the similarity between user interests. In fact, the choice of the distance measure incorporated in a clustering algorithm strongly affects the quality of the obtained partitions.
In this chapter, we focus on the adoption of fuzzy clustering for the categorization of users visiting a Web site. In particular, to extract user categories, we propose the use of CARD+, a fuzzy relational clustering algorithm that works on data quantifying the similarity between user interests. Instead of using standard similarity measures, such as the cosine-based similarity, we equip CARD+ with a fuzzy similarity measure in order to evaluate the similarity degree between each pair of Web users. The adopted measure is directly derived from the similarity quantification of fuzzy sets.

Similarity metrics based on fuzzy logic theory prove particularly effective for evaluating the similarity among Web users for several reasons. A first advantage of the fuzzy paradigm is the possibility of defining a measure that can deal with data of a symbolic nature. In fact, while measures based on the concept of distance in metric spaces handle this kind of data poorly, fuzzy similarity measures reflect the semantics of the employed data and, hence, allow clustering processes to be applied also to data of a hybrid nature (e.g. numerical, ordinal, and categorical). Moreover, similarity metrics based on fuzzy logic theory are especially appropriate for dealing with the vague and imprecise nature of Web usage data, whereas classical distance-based metrics may not cope effectively with the uncertainty and ambiguity that underlie Web interaction data.

The chapter is organized as follows. Section 2 briefly overviews works that employ different fuzzy clustering techniques for user categorization. Section 3 describes our approach for the categorization of Web users. First, we detail the creation of the relation matrix through the computation of the similarity degree among users.
Then, we describe CARD+, the clustering algorithm that we employ to extract user categories. In section 4, we present the results obtained by applying CARD+ on real-world data and we show the values obtained for some validity metrics in order to evaluate the effectiveness of the proposed approach. Finally, section 5 concludes the chapter by summarizing the key points.
2 Fuzzy Clustering for User Categorization

One active area of research in WUM is the clustering of users based on their Web access patterns. User clustering provides groups of users that seem to behave similarly when they browse a Web site. The knowledge discovered by analyzing the characteristics of the identified clusters can be exploited in a variety of application domains. For example, in e-commerce applications, clustering of Web users can be used to perform market segmentation. In e-learning contexts, the user categories discovered by clustering algorithms can be employed to suggest learning objects that meet the information needs of users or to provide personalized learning courses. Clusters of users can also be exploited in the personalization of a Web site, where the aims can be different. For example, user clustering results can help to re-organize the Web
portal by restructuring the site content more efficiently, or even to build adaptive Web portals, i.e. portals whose organization and presentation of content change depending on specific user needs.

Clustering is a well-known data mining technique that has been widely used in WUM to categorize preprocessed Web log data. More precisely, user clustering groups users with similar navigational behavior (and, hence, common interests) in the same cluster (or user category) and puts users exhibiting dissimilar browsing behavior in different clusters. In WUM, among the different clustering techniques adopted to extract user categories, fuzzy clustering proves particularly effective for mining significant browsing patterns from usage data thanks to its capacity to handle the uncertain and vague nature of Web data. In this section, we give an overview of works that employ fuzzy clustering methods for the categorization of Web users.

In the literature, surveys of works that propose fuzzy clustering techniques to support the WUM methodology are presented in [11] and [13]. In [16], different kinds of fuzzy clustering techniques are used to discover user categories. The well-known Fuzzy C-Means (FCM) algorithm has been employed in [14] for mining user profiles by partitioning user sessions identified from log data. Here, a user session is defined as the set of consecutive accesses made by a user within a predefined time period. The FCM algorithm has been successfully applied to Web mining in different works such as [2] and [9]. In [1], the authors proposed a novel 'intelligent miner' that combines a fuzzy clustering algorithm and a fuzzy inference system to analyze the trends of the network traffic flow. Specifically, a hybrid evolutionary FCM approach is adopted to identify groups of users with similar interests.
Clustering results are then used to analyze the trends by means of a Takagi-Sugeno fuzzy inference system learned through a combination of an evolutionary algorithm and neural network learning.

Lazzerini and Marcelloni [12] presented a system that uses a fuzzy clustering approach to derive a small number of profiles of typical Web site users from the analysis of Web access log files and to associate each user with the proper profile. The system is composed of two subsystems: the profiler and the classifier. In the profiler subsystem, the authors applied an Unsupervised Fuzzy Divisive Hierarchical Clustering (UFDHC) algorithm to cluster the users of the Web portal into a hierarchy of fuzzy groups characterized by a set of common interests. Each user group is represented by a cluster prototype that defines the profile of the group members. To identify the profile a specific user belongs to, the classifier employs a classification method that fully exploits the information contained in the hierarchy. In particular, a user is associated with a profile by visiting the tree from the root to the deepest node to which the user belongs with a membership value higher than a fixed threshold. The profile corresponding to this last node is assigned to the user.

In [3], Runkler and Bezdek focused on the use of a relational fuzzy clustering approach for Web mining. This approach is particularly suitable for managing datasets that include non-numerical patterns. In fact, this kind of
data can be properly represented numerically by pairwise relations among objects. The resulting relational datasets can subsequently be clustered by means of appropriate clustering algorithms. Specifically, as an application, the authors proposed the use of Relational Alternating Cluster Estimation (RACE) for the identification of prototypes that can be interpreted as typical user interests. In [19], the authors proposed an extension of the Competitive Agglomeration clustering algorithm so that it can work on relational data. The resulting Competitive Agglomeration for Relational Data (CARD) algorithm is able to automatically partition session data into an optimal number of clusters. Moreover, CARD can deal with complex and subjective distance/similarity measures, which are not restricted to be Euclidean. Another relational fuzzy clustering method was proposed in [10] for grouping user sessions. In that work, each session includes the pages of a certain traversal path, and the Web site topology is considered as a bias in the calculation of the similarity between sessions, depending on the relative position of the corresponding pages in the site. In [18], the Relational Fuzzy Clustering-Maximal Density Estimator (RFC-MDE) algorithm was employed to categorize user sessions identified by analyzing Web log data. The authors demonstrated that this algorithm is robust and can deal with the outliers that are typically present in this application. RFC-MDE was applied to real-world examples for the extraction of user profiles from log data. Many other fuzzy relational clustering algorithms have been used for mining Web usage profiles. Among these, we mention the Fuzzy c-Trimmed Medoids algorithm [9], the Fuzzy c-Medoids (FCMdd) algorithm [20], and the Relational Fuzzy Subtractive clustering algorithm [23].

In the present work, we propose an approach based on the use of relational fuzzy clustering for the categorization of Web site users.
In particular, we propose the use of CARD+, a relational fuzzy clustering algorithm derived from a modified version of CARD. CARD+ can incorporate a similarity measure based on fuzzy logic theory, which better captures the similarity degrees among user interests. In the following sections, we describe in more detail the approach that we propose for the identification of fuzzy user categories.
3 Categorization of Web Users

To discover Web user categories encoding interests shared by groups of users, a preliminary activity has to be performed to extract a collection of patterns that model user browsing behaviors. In our work, the information contained in access log files is exploited to derive such data. Log files are important sources of information in the process of knowledge discovery about user browsing behavior, since they store in chronological order all the information concerning the accesses made by all users to the Web site. However, access log files contain a huge and noisy amount of data, often comprising a high number of irrelevant and useless records. As a
consequence, log files have to be preprocessed so as to retain only the data that can be effectively exploited to model user navigational behavior. In this work, log files are preprocessed by means of LODAP, a software tool that we have implemented for the analysis of Web log files in order to derive models characterizing user browsing behaviors. To this end, based on the information stored in log files, LODAP executes a first process, known in the literature as sessionization [6], aimed at deriving a set of user sessions. More precisely, for each user, LODAP determines the sequence of pages accessed during a predefined time period. User sessions are then exploited to create models expressing the interest degree exhibited by each user for each visited page of the site. Briefly, log file preprocessing is performed through four main steps:

1. Data Cleaning, which removes all redundant and useless records contained in the Web log file (e.g. accesses to multimedia objects, robots' requests, etc.) so as to retain only information concerning accesses to pages of the Web site.
2. Data Structuration, which groups the significant requests into user sessions. Each user session contains the sequence of pages accessed by the same user during an established time period.
3. Data Filtering, which selects only the significant pages accessed in the Web site. In this step, the least visited pages, as well as the most visited ones, are removed.
4. Interest degree computation, which exploits information about the accessed pages to create a model of visitor behavior by evaluating a degree of interest of each user for each accessed page.

Further details about the working scheme of LODAP can be found in [5]. As a result, LODAP extracts data that are summarized in a behavior matrix $B = [b_{ij}]$, where the rows $i = 1, \ldots, n$ represent the users and the columns $j = 1, \ldots, m$ correspond to the Web pages of the site.
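The Data Structuration step can be illustrated with a minimal sketch. It is not the LODAP implementation, whose details are given in [5]; it assumes that cleaned log records are available as hypothetical `(user, timestamp_in_seconds, page)` tuples and that a session collects the requests made within a fixed window (here an arbitrary 30 minutes) after the first request of that session.

```python
from collections import defaultdict

def sessionize(records, window_seconds=1800):
    """Group (user, timestamp, page) records into per-user sessions.

    Assumption: a session holds the pages requested by one user within
    `window_seconds` of the first request of that session.
    """
    by_user = defaultdict(list)
    for user, timestamp, page in records:
        by_user[user].append((timestamp, page))

    sessions = []  # list of (user, [pages in access order])
    for user, hits in by_user.items():
        hits.sort()  # chronological order
        start, pages = None, []
        for timestamp, page in hits:
            # Open a new session when the time window has expired.
            if start is None or timestamp - start > window_seconds:
                if pages:
                    sessions.append((user, pages))
                start, pages = timestamp, []
            pages.append(page)
        if pages:
            sessions.append((user, pages))
    return sessions

# Hypothetical records: user "u1" returns after more than 30 minutes.
records = [("u1", 0, "a"), ("u1", 100, "b"), ("u1", 5000, "c"), ("u2", 50, "a")]
sessions = sessionize(records)
```

The window length and the record layout are illustrative choices only; the interest degree computed from such sessions in step 4 is not reproduced here because the chapter does not state its formula.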
Each component $b_{ij}$ of the matrix indicates the interest degree of the $i$-th user for the $j$-th page. The $i$-th user behavior vector $b_i$ ($i$-th row of the behavior matrix) characterizes the browsing behavior of the $i$-th user.

Starting from the derived behavior data, CARD+ can be applied to categorize users. In the categorization process, two main activities can be distinguished:

• The creation of the relation matrix containing the dissimilarity values among all pairs of users;
• The categorization of users by grouping similar users into categories.

In the following subsections, we detail the activities performed in the categorization process of Web users.

3.1 Computing Similarity among Web Users
Once the log file preprocessing step has been completed and behavior data are available, the actual categorization process of Web users can start. The first
activity in the categorization of similar users by relational fuzzy clustering is the creation of the relation matrix containing the dissimilarity values among all pairs of users. To create the relation matrix, an essential task is the evaluation of the (dis)similarity between two generic users on the basis of a proper measure. In our case, based on the behavior matrix, the similarity between two generic users is expressed by the similarity between the two corresponding user behavior vectors. In the literature, different metrics have been proposed to measure the similarity degree between two generic objects. One of the most common measures employed for this purpose is the angle cosine measure [21]. In the specific context of user category extraction, the cosine measure computes the similarity between any two behavior vectors $b_x$ and $b_y$ as follows:

\[
Sim_{Cos}(b_x, b_y) = \frac{b_x \cdot b_y}{\|b_x\| \, \|b_y\|} = \frac{\sum_{j=1}^{m} b_{xj} b_{yj}}{\sqrt{\sum_{j=1}^{m} b_{xj}^2} \, \sqrt{\sum_{j=1}^{m} b_{yj}^2}} . \tag{1}
\]
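As an illustration, Eq. (1) can be written in a few lines of plain Python. The function name and the convention of returning 0 for an all-zero behavior vector are assumptions made for this sketch; the chapter does not discuss that corner case.

```python
import math

def cosine_similarity(bx, by):
    """Cosine of the angle between two behavior vectors, as in Eq. (1)."""
    dot = sum(x * y for x, y in zip(bx, by))
    norm_x = math.sqrt(sum(x * x for x in bx))
    norm_y = math.sqrt(sum(y * y for y in by))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: an empty behavior vector is similar to nothing
    return dot / (norm_x * norm_y)

# Two users with proportional interest degrees are maximally similar,
# while users with no page in common have similarity 0.
same_direction = cosine_similarity([1, 2], [2, 4])
no_overlap = cosine_similarity([5, 0], [0, 7])
```

In the numerator, a page contributes only when both users have a non-zero interest degree for it, which is the behavior discussed in the text that follows.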
The cosine measure might be ineffective for defining the similarity between two users visiting a Web site. Indeed, to evaluate the similarity between two generic users (rows of the available matrix), the cosine measure takes into account only the common pages visited by the considered users. This approach may produce ineffective results, leading to the loss of the semantic information underlying Web usage data, which is related to the relevance of each page for each user.

To better capture the similarity between two generic Web users, we propose the use of a fuzzy similarity measure. Specifically, two generic users are modeled as two fuzzy sets, and the similarity between these users is expressed as the similarity between the corresponding fuzzy sets. To do so, the user behavior matrix $B$ is converted into a matrix $M = [\mu_{ij}]$, which expresses the interest degree of each user for each page in a fuzzy way. A very simple characterization of the matrix $M$ is the following:

\[
\mu_{ij} =
\begin{cases}
0 & \text{if } b_{ij} < ID_{min} \\
\dfrac{b_{ij} - ID_{min}}{ID_{max} - ID_{min}} & \text{if } b_{ij} \in [ID_{min}, ID_{max}] \\
1 & \text{if } b_{ij} > ID_{max}
\end{cases}
\tag{2}
\]

where $ID_{min}$ is a minimum threshold for the interest degree, under which the interest for a page is considered null, and $ID_{max}$ is a maximum threshold of the interest degree, above which the page is considered surely preferred by the user. Starting from this fuzzy characterization, the rows of the new matrix $M$ are interpreted as fuzzy sets defined on the set of Web pages. Each fuzzy set $\mu_i$ is related to a user $b_i$ and is simply characterized by the following membership function:

\[
\mu_i(j) = \mu_{ij} \qquad \forall j = 1, 2, \ldots, m \tag{3}
\]

In this way, the similarity of two generic users is intuitively defined as the similarity between the corresponding fuzzy sets. The similarity among fuzzy sets
can be evaluated in different ways [26]. One of the most common measures for evaluating the similarity between two fuzzy sets is the following:

\[
\sigma(\mu_1, \mu_2) = \frac{|\mu_1 \cap \mu_2|}{|\mu_1 \cup \mu_2|} \tag{4}
\]
According to this measure, the similarity between two fuzzy sets is given by the ratio of two quantities: the cardinality of the intersection of the fuzzy sets and the cardinality of their union. The intersection of two fuzzy sets is defined by the minimum operator:

\[
(\mu_1 \cap \mu_2)(j) = \min\{\mu_1(j), \mu_2(j)\} \tag{5}
\]

The union of two fuzzy sets is defined by the maximum operator:

\[
(\mu_1 \cup \mu_2)(j) = \max\{\mu_1(j), \mu_2(j)\} \tag{6}
\]
The cardinality of a fuzzy set (also called "σ-count") is computed by summing up all its membership values:

\[
|\mu| = \sum_{j=1}^{m} \mu(j) \tag{7}
\]
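Equations (2) and (4)-(7) can be combined in a short sketch. The threshold values used in the example (10 and 60) and the convention of returning 1 for two empty fuzzy sets are assumptions made for illustration; the sketch also assumes $ID_{min} < ID_{max}$.

```python
def fuzzify(b, id_min, id_max):
    """Membership degree of an interest degree b, following Eq. (2).

    Assumes id_min < id_max.
    """
    if b < id_min:
        return 0.0
    if b > id_max:
        return 1.0
    return (b - id_min) / (id_max - id_min)

def fuzzy_similarity(mu_x, mu_y):
    """Ratio of the sigma-counts of intersection and union, Eqs. (4)-(7)."""
    intersection = sum(min(a, b) for a, b in zip(mu_x, mu_y))  # Eqs. (5), (7)
    union = sum(max(a, b) for a, b in zip(mu_x, mu_y))         # Eqs. (6), (7)
    if union == 0.0:
        return 1.0  # assumption: two empty fuzzy sets are treated as identical
    return intersection / union

# Hypothetical interest degrees of two users for three pages.
ID_MIN, ID_MAX = 10, 60
mu_x = [fuzzify(b, ID_MIN, ID_MAX) for b in (5, 35, 80)]   # [0.0, 0.5, 1.0]
mu_y = [fuzzify(b, ID_MIN, ID_MAX) for b in (60, 35, 10)]  # [1.0, 0.5, 0.0]
similarity = fuzzy_similarity(mu_x, mu_y)                  # 0.5 / 2.5
```

Unlike the numerator of the cosine measure, the union in the denominator grows with every page that only one of the two users finds relevant, so such pages lower the similarity.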
Summarizing, the similarity between any two users $b_x$ and $b_y$ is defined as follows:

\[
Sim_{fuzzy}(b_x, b_y) = \frac{\sum_{j=1}^{m} \min\{\mu_{xj}, \mu_{yj}\}}{\sum_{j=1}^{m} \max\{\mu_{xj}, \mu_{yj}\}} . \tag{8}
\]

This fuzzy similarity measure makes it possible to embed the semantic information incorporated in the user behavior data. In this way, a better estimate of the true similarity degree between two user behaviors is obtained. Similarity values are mapped into the similarity matrix $Sim = [Sim_{ij}]_{i,j=1,\ldots,n}$, where each component $Sim_{ij}$ expresses the similarity value between the user behavior vectors $b_i$ and $b_j$ calculated by using the fuzzy similarity measure. Starting from the similarity matrix, the dissimilarity values are simply computed as $Diss_{ij} = 1 - Sim_{ij}$, for $i, j = 1, \ldots, n$. These are mapped into an $n \times n$ matrix $R = [Diss_{ij}]_{i,j=1,\ldots,n}$ representing the relation matrix.

3.2 Grouping Users by Fuzzy Clustering
Once the relation matrix has been created, the next activity is the categorization of user behaviors in order to group users with similar interests into a number of user categories. To this aim, we adopt a fuzzy relational clustering approach. In particular, in this work, we employ CARD+, which we proposed in [4] as an improved version of the CARD (Competitive Agglomeration for Relational Data) clustering algorithm [17]. A key feature of CARD+ is its ability to automatically
categorize the available data into an optimal number of clusters starting from an initial random number. In [17], the authors stated that CARD was able to determine a final partition containing an optimal number of clusters. However, in our experience, CARD proved very sensitive to the initial number of clusters, often providing different final partitions and thus failing to find the actual number of clusters buried in the data. Indeed, we observed that CARD produces redundant partitions, with clusters having a high degree of overlap (very low inter-cluster distance). CARD+ overcomes this limitation by adding a post-clustering process to the CARD algorithm in order to remove redundant clusters.

Like common relational clustering approaches, CARD+ obtains an implicit partition of the object data by deriving the distances from the relational data to a set of $C$ implicit prototypes that summarize the data objects belonging to each cluster in the partition. Specifically, starting from the relation matrix $R$, the following implicit distances are computed at each iteration step of the algorithm:

\[
d_{ci} = (R z_c)_i - \frac{z_c^{T} R z_c}{2} \tag{9}
\]
for all behavior vectors $i = 1, \ldots, n$ and for all implicit clusters $c = 1, \ldots, C$, where $z_c$ is the membership vector for the $c$-th cluster, defined on the basis of the fuzzy membership values $z_{ci}$ that describe the degree of belongingness of the $i$-th behavior vector to the $c$-th cluster. Once the implicit distance values $d_{ci}$ have been computed, the fuzzy membership values $z_{ci}$ are updated to optimize the clustering criterion, resulting in a new fuzzy partition of behavior vectors. The process is iterated until the membership values stabilize. Finally, a crisp assignment of behavior vectors to the identified clusters is performed in order to derive a prototype vector for each cluster, representing a user category. Precisely, each behavior vector is crisply assigned to the closest cluster, creating $C$ clusters:

\[
\chi_c = \{ b_i \in B \mid d_{ci} < d_{ki} \;\; \forall k \neq c \}, \qquad 1 \leq c \leq C. \tag{10}
\]
Then, for each cluster $\chi_c$, a prototype vector $v_c = (v_{c1}, v_{c2}, \ldots, v_{cm})$ is derived, where

\[
v_{cj} = \frac{\sum_{b_i \in \chi_c} b_{ij}}{|\chi_c|}, \qquad j = 1, \ldots, m. \tag{11}
\]

The values $v_{cj}$ represent the significance (in terms of relevance degree) of a given page $p_j$ to the $c$-th user category. In summary, CARD+ mines a collection of $C$ clusters from behavior data, representing categories of users that have accessed the Web site under analysis. Each category prototype $v_c = (v_{c1}, v_{c2}, \ldots, v_{cm})$ describes the typical browsing behavior of a group of users with similar interests in the most visited pages of the Web site.
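The final steps of this section can be sketched as follows. The code is not CARD+ itself: the membership update and the post-clustering removal of redundant clusters are not reproduced, and the membership vectors $z_c$ are simply taken as given inputs. The function names, the tie-breaking rule (lowest cluster index) and the zero prototype for an empty cluster are assumptions of this sketch.

```python
def relation_matrix(sim):
    """Dissimilarities Diss_ij = 1 - Sim_ij from a similarity matrix."""
    return [[1.0 - s for s in row] for row in sim]

def implicit_distances(R, Z):
    """d[c][i] = (R z_c)_i - (z_c^T R z_c) / 2, as in Eq. (9)."""
    n = len(R)
    D = []
    for z in Z:  # one given membership vector z_c per cluster
        Rz = [sum(R[i][j] * z[j] for j in range(n)) for i in range(n)]
        zRz = sum(z[i] * Rz[i] for i in range(n))
        D.append([Rz[i] - zRz / 2.0 for i in range(n)])
    return D

def crisp_assignment(D):
    """Index of the closest cluster for each object, as in Eq. (10).

    Assumption: ties are broken in favor of the lowest cluster index.
    """
    n = len(D[0])
    return [min(range(len(D)), key=lambda c: D[c][i]) for i in range(n)]

def prototypes(B, labels, C):
    """Per-cluster mean of the behavior vectors, as in Eq. (11)."""
    m = len(B[0])
    result = []
    for c in range(C):
        members = [B[i] for i, lab in enumerate(labels) if lab == c]
        if not members:
            result.append([0.0] * m)  # assumption: empty cluster, zero prototype
            continue
        result.append([sum(row[j] for row in members) / len(members)
                       for j in range(m)])
    return result

# Hypothetical data: four users, two pages, two clearly separated groups.
B = [[10, 0], [8, 0], [0, 9], [0, 11]]
Sim = [[1.0, 0.9, 0.0, 0.0],
       [0.9, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.9],
       [0.0, 0.0, 0.9, 1.0]]
R = relation_matrix(Sim)
Z = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # given membership vectors
labels = crisp_assignment(implicit_distances(R, Z))
V = prototypes(B, labels, C=2)
```

In this toy example the first two users fall into one category and the last two into the other, and each prototype is the average behavior vector of its members.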
4 Simulation Results

To show the suitability of CARD+ equipped with the fuzzy measure for identifying Web user categories, we carried out an experimental simulation. We used the
access logs from a Web site targeted at young users (average age 12 years), i.e. the Italian Web site of the Japanese movie Dragon Ball (www.dragonballgt.it). This site was chosen because of its high daily number of accesses (thousands of visits each day).

The preprocessing of log files

First, log files were preprocessed to derive models of user behavior. To this aim, LODAP was used to identify user behavior vectors from the log data collected during a period of 12 hours (from 10:00 a.m. to 10:00 p.m.). Once the four steps of LODAP had been executed, a 200 × 42 behavior matrix was derived. The 42 pages in the Web site were labeled with numbers (see Table 1) and their content was specified to facilitate the analysis of results.

Table 1. Description of the retained pages in the Web site

Pages                          Content
1, ..., 8                      Pictures of characters
9, ..., 13                     Various kinds of pictures related to the movie
14, ..., 18                    General information about the main character
19, 26, 27                     Matches
20, 21, 36                     Services (registration, login, ...)
22, 23, 24, 25, 28, ..., 31    General information about the movie
32, ..., 37                    Entertainment (games, videos, ...)
38, ..., 42                    Description of characters
Categorization of Web users

Starting from the available behavior matrix, the relation matrix was created by using the fuzzy similarity measure. Next, the CARD+ algorithm (implemented in the Matlab 6.5 environment) was applied to the behavior matrix in order to obtain clusters of users with similar browsing behavior. We carried out several runs, each with a different initial number of clusters, $C_{max} = 5, 10, 15$. To assess the goodness of the derived partitions of behavior vectors, two indexes were calculated at the end of each run: Dunn's index and the Davies-Bouldin index [8]. These have been used in different works to evaluate the compactness of the partitions obtained by several clustering algorithms. Good partitions correspond to large values of Dunn's index and low values of the Davies-Bouldin index. We observed that CARD+ with the fuzzy similarity measure provided data partitions with the same final number of clusters, $C = 5$, independently of the initial number of clusters $C_{max}$. The validity indexes took the same values in all runs: Dunn's index was always equal to 1.35 and the Davies-Bouldin index was 0.13. As a consequence, the CARD+ algorithm equipped with the fuzzy similarity measure proved to be quite stable, partitioning the available behavior data into 5 clusters corresponding to the identified user categories.
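To make the role of the validity indexes concrete, the following sketch computes one common form of Dunn's index directly from a dissimilarity matrix and a crisp labeling: the smallest dissimilarity between objects of different clusters divided by the largest dissimilarity between objects of the same cluster. The chapter does not state which variant of the index was used in the experiments (see [8]), so this definition and the toy data are assumptions; the Davies-Bouldin index is not sketched here.

```python
def dunn_index(R, labels):
    """Smallest between-cluster dissimilarity over largest within-cluster one.

    Assumption: single-linkage separation and complete-diameter compactness,
    which is only one of several variants of Dunn's index.
    """
    n = len(R)
    min_between = float("inf")
    max_within = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                max_within = max(max_within, R[i][j])
            else:
                min_between = min(min_between, R[i][j])
    if max_within == 0.0 or min_between == float("inf"):
        raise ValueError("index undefined for this partition")
    return min_between / max_within

# Hypothetical dissimilarities among four users forming two tight groups.
R = [[0.0, 0.1, 1.0, 1.0],
     [0.1, 0.0, 1.0, 1.0],
     [1.0, 1.0, 0.0, 0.1],
     [1.0, 1.0, 0.1, 0.0]]
good = dunn_index(R, [0, 0, 1, 1])  # groups match the data
bad = dunn_index(R, [0, 1, 0, 1])   # groups cut across the data
```

The partition that matches the structure of the data obtains the larger value, in line with the remark that good partitions correspond to large values of Dunn's index.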
Fig. 1. Comparison of the Dunn’s index obtained by the employed algorithms and similarity measures
Fig. 2. Comparison of the Davies-Bouldin index obtained by the employed algorithms and similarity measures
Evaluation results

To evaluate the effectiveness of the employed fuzzy similarity measure, we compared it to the cosine measure within the CARD+ algorithm, carrying out the same trials as in the previous experiments. Moreover, to establish the suitability of CARD+ for the task of user categorization, we applied the original CARD algorithm to categorize user behaviors, employing both the cosine measure and the fuzzy similarity measure for the computation of the relation matrix. Figures 1 and 2 compare the values obtained for the validity indexes. In these figures, the final number of clusters extracted by the employed clustering algorithm is also indicated for each trial. As can be observed, CARD+ with the cosine measure derived partitions that categorized data into 4 or 5 clusters, proving less stable than CARD+ equipped with the fuzzy similarity measure. Moreover, the CARD algorithm showed unstable behavior with both similarity measures, providing data partitions with a different final number of clusters in each trial. Analyzing the results obtained in the different runs, we can conclude that CARD+ with the fuzzy similarity measure was able to derive the best partition in terms of compactness; hence, it proved to be a valid approach for the identification of user categories.
The information about the user categories extracted by CARD+ equipped with the fuzzy similarity measure is summarized in Table 2. In particular, for each user category (labeled with numbers 1, 2, ..., 5), the pages with the highest degree of interest are indicated. It can be noted that some pages (e.g. P1, P2, P3, P10, P11, and P12) are included in more than one user category, showing how different categories of users may exhibit common interests.

Table 2. User categories identified on real-world data

User category   Relevant pages (interest degrees)
1               P1 (55), P2 (63), P3 (54), P5 (52), P7 (48), P8 (43), P14 (66), P28 (56), P29 (52), P30 (37)
2               P1 (72), P2 (59), P3 (95), P6 (65), P7 (57), P10 (74), P11 (66), P13 (66)
3               P1 (50), P2 (50), P3 (45), P4 (46), P5 (42), P6 (42), P8 (34), P9 (37), P12 (40), P15 (41), P16 (41), P17 (38), P18 (37), P19 (36)
4               P2 (49), P10 (47), P11 (38), P12 (36), P14 (27), P31 (36), P32 (29), P33 (39), P34 (36), P35 (26), P36 (20), P37 (37), P38 (29), P39 (30), P40 (34), P41 (28), P42 (24)
5               P4 (70), P5 (65), P20 (64), P21 (62), P22 (54), P23 (63), P24 (54), P25 (41), P26 (47), P27 (47)
We can interpret the identified user categories by identifying the interests of the users belonging to each of them. The interpretation is given below.
• Category 1. Users in this category are mainly interested in information about the movie characters.
• Category 2. Users in this category are interested in the history of the movie and in pictures of the movie and its characters.
• Category 3. These users are mostly interested in the main character of the movie.
• Category 4. These users prefer pages that link to entertainment objects (games and video).
• Category 5. Users in this category prefer pages containing general information about the movie.
The extracted user categories may be used to implement personalization functions in the considered Web site.
5 Conclusions
The implicit discovery of knowledge about the interests and preferences of users through the analysis of their navigational behavior has become a crucial task for the development of personalized Web applications able to provide information or services adapted to the needs of their users.
How to Derive Fuzzy User Categories for Web Personalization
To discover significant patterns in user browsing behavior, the WUM methodology has been widely used in the literature. Based on this methodology, knowledge about user interests is discovered by analyzing the usage data that describe the interactions of users with the considered Web site. Among the different techniques proposed in the literature for this purpose, clustering has been widely employed. Specifically, user clustering derives groups of users sharing similar interests, also called user categories. In WUM, fuzzy clustering techniques have proved especially suitable because they can capture the overlapping interests that users exhibit when they visit a Web site. Indeed, the same user may fall into different categories with a certain membership degree, reflecting the fact that a user may have different kinds of interests or needs when visiting a site. In addition, fuzzy clustering allows a more efficient management of data permeated by uncertainty and ambiguity, which are characteristic of Web interaction data.
In this chapter, to derive user categories from access log files, we proposed an approach based on relational fuzzy clustering. In particular, we presented CARD+, a fuzzy clustering algorithm that works on relational data (expressed in terms of dissimilarities among all pairs of users) to partition user behavior data. To evaluate similarity between Web users, a fuzzy measure has been proposed. Unlike the traditional distance-based measures typically used in the literature, such as the cosine measure, the fuzzy similarity measure makes it possible to incorporate the semantic information embedded in the data, better reflecting the concept of similarity among the interests expressed by two generic Web users. In particular, we showed through comparative results that CARD+ equipped with the proposed fuzzy similarity measure outperforms CARD+ equipped with the standard cosine similarity measure. We also showed that it outperforms the original CARD algorithm, whichever measure is adopted. Clusters derived by CARD+ using the fuzzy measure are sufficiently separate and correspond to actual user categories embedded in the available log data. The identified user categories will be exploited to realize personalization functionalities in the considered Web site, such as the dynamic suggestion of links to pages considered interesting for the current user, according to his category membership.
This chapter was intended as a contribution to research in the WUM field, emphasizing the suitability and effectiveness of fuzzy clustering techniques in the process of discovering typical patterns in user navigational behavior. In particular, this work focused on the importance of defining new and more appropriate measures for the evaluation of similarity between Web users in order to obtain more robust clustering results (and, hence, more significant user categories). We highlighted the advantages deriving from the use of fuzzy logic for the definition of similarity measures. Indeed, similarity measures based on fuzzy logic theory may provide additional value by introducing a bias into the clustering process, through the definition of a measure that embeds context-specific a priori knowledge expressed in linguistic terms. Additionally, the fuzzy definition of the similarity concept may be much more interpretable since it is more intuitive and closer
to human ways of perceiving and understanding. This could enable a better comprehension of the clustering results and their translation into natural language constructs. Other important facets may be addressed in the process of deriving Web user categories. For example, one of the most interesting aspects concerns the possibility of creating adaptive models of user categories that are able to identify the continuous changes in the interests or needs of users and to dynamically adapt user categories according to these changes. This opens a new challenge in WUM and a promising research direction for the development of Web applications equipped with even more refined and effective personalization functions.
References
1. Abraham, A., Wang, X.: i-Miner: A Web Usage Mining Framework Using Hierarchical Intelligent Systems. In: The IEEE Int. Conf. on Fuzzy Systems, pp. 1129–1134. IEEE Press, Los Alamitos (2003)
2. Arotaritei, D., Mitra, S.: Web Mining: a survey in the fuzzy framework. Fuzzy Sets and Systems 148, 5–19 (2004)
3. Bezdek, J.C.: Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York (1981)
4. Castellano, G., Fanelli, A.M., Torsello, M.A.: Relational Fuzzy approach for Mining User Profiles. LNCI, pp. 175–179. WSEAS Press (2007)
5. Castellano, G., Fanelli, A.M., Torsello, M.A.: LODAP: A Log Data Preprocessor for mining Web browsing patterns. In: Proc. of The 6th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases (AIKED 2007), Corfu Island, Greece (2007)
6. Cooley, R., Mobasher, B., Srivastava, J.: Data preparation for mining world wide web browsing patterns. Journal of Knowledge and Information Systems 1, 5–32 (1999)
7. Facca, F.M., Lanzi, P.L.: Mining interesting knowledge from weblogs: a survey. Data and Knowledge Engineering 53, 225–241 (2005)
8. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: Cluster Validity Methods: Part II. SIGMOD Record (2002)
9. Krishnapuram, R., Joshi, A., Nasraoui, O., Yi, L.: Low-complexity fuzzy relational clustering algorithms for web mining. Journal IEEE-FS 9, 595–607 (2001)
10. Joshi, A., Joshi, K.: On mining Web access logs. In: ACM SIGMOD Workshop on Research issues in Data Mining and Knowledge discovery, pp. 63–69 (2000)
11. Joshi, A., Krishnapuram, R.: Robust Fuzzy Clustering Methods to Support Web Mining. In: Proc. ACM SIGMOD Workshop on Data Mining and Knowledge Discovery (August 1998)
12. Lazzerini, B., Marcelloni, F.: A hierarchical fuzzy clustering-based system to create user profiles. International Journal on Soft Computing 11, 157–168 (2007)
13. Liu, M., Lui, Y., Hu, H.: Web Fuzzy Clustering Web and its applications in Web Usage Mining. In: 9th International Symposium on future Software Technology ISFST, Xi'an, China (October 20-23, 2004)
14. Martin-Bautista, M.J., Vila, M.A., Escbar-Jeria, V.H.: In: IADIS European Conference Data Mining, pp. 73–76 (2008)
15. Mobasher, B., Cooley, R., Srivastava, J.: Automatic personalization based on Web usage mining. TR-99010, Department of Computer Science, DePaul University (1999)
16. Mobasher, B.: Web Usage Mining and Personalization. In: Practical Handbook of Internet Computing. CRC Press LLC, Boca Raton (2005)
17. Nasraoui, O., Frigui, H., Joshi, A., Krishnapuram, R.: Mining Web access log using relational competitive fuzzy clustering. In: Proc. of the Eighth International Fuzzy Systems Association World Congress (1999)
18. Nasraoui, O., Krishnapuram, R., Joshi, A.: Relational Clustering based on a new robust estimator with application to Web mining. In: Proc. of the North American Fuzzy Information Society, pp. 705–709 (1999)
19. Nasraoui, O., Krishnapuram, R., Frigui, H., Joshi, A.: Extracting Web user profiles using relational competitive fuzzy clustering. International Journal on Artificial Intelligence Tools 9(4), 509–526 (2000)
20. Nasraoui, O., Krishnapuram, R., Joshi, A., Kamdar, T.: Automatic Web User Profiling and Personalization using a Robust Fuzzy Relational Clustering. ECommerce and Intelligent Methods in Studies in Fuzziness and Soft Computing (2002)
21. Rossi, F., De Carvalho, F., Lechevallier, Y., Da Silva, A.: Dissimilarities for Web Usage Mining. Data Science and Classification, Studies in Classification, Data Analysis and Knowledge Organization, 39–46 (2006)
22. Runkler, T.A., Bezdek, J.C.: Web mining with relational clustering. International Journal of Approximate Reasoning 32, 217–236 (2003)
23. Suryavanshi, B.S., Shiri, N., Mudur, S.P.: An efficient technique for mining usage profiles using Relational Fuzzy Subtractive Clustering. In: Proc. of WIRI 2005, Tokyo, Japan (2005)
24. Vakali, A., Pokorny, J., Dalamagas, T.: An Overview of Web Data Clustering Practices. In: EDBT Workshops, pp. 597–606 (2004)
25. Wang, X., Abraham, A., Smith, K.A.: Intelligent web traffic mining and analysis. Journal of Network and Computer Applications 28, 147–165 (2005)
26. Zhizhen, L., Pengfei, S.: Similarity measures on intuitionistic fuzzy sets. Pattern Recognition Letters 24, 2687–2693 (2003)
27. Kajan, E.: Information technology encyclopedia and acronyms. Springer, Heidelberg (2002)
28. Broy, M.: Software engineering – From auxiliary to key technologies. In: Broy, M., Denert, E. (eds.) Software Pioneers. Springer, Heidelberg (2002)
29. Che, M., Grellmann, W., Seidler, S.: Appl. Polym. Sci., vol. 64, pp. 1079–1090 (1997)
30. Ross, D.W.: Lysosomes and storage diseases. MA Thesis, Columbia University, New York (1977)
5 A Taxonomy of Collaborative-Based Recommender Systems
Fabián P. Lousame and Eduardo Sánchez
1 Introduction
The explosive growth in the amount of information available on the WWW and the emergence of e-commerce in recent years have demanded new ways to deliver personalized content. Recommender systems [51] have emerged in this context as a solution based on collective intelligence to either predict whether a particular user will like a particular item or identify the collection of items that will be of interest to a certain user. Recommender systems have an excellent ability to characterize and recommend items within huge collections of data, which makes them a computerized alternative to human recommendations. Since useful personalized recommendations can add value to the user experience, some of the largest e-commerce web sites include recommender engines. Three well-known examples are Amazon.com [1], LastFM [4] and Netflix [6].
Although the first studies can be traced back to cognitive science, approximation theory and information retrieval, among other fields, recommender systems became an independent research area in the mid-1990s, when Resnick et al. [50], Hill et al. [29] and Shardanand et al. [56] proposed recommendation techniques explicitly based on user rating information. Since then, numerous approaches have been developed that use content or historical information: user-item interactions, explicit ratings, or web logs, among others. Nowadays, recommender systems are typically classified into the following categories:
• content-based, if the user is recommended items that are content-similar to the items the user already liked;
• collaborative, if the user is recommended items that people with similar tastes and preferences liked in the past;
• hybrid, if the user is recommended items based on a combination of both collaborative and content-based methods.
This chapter presents a study focused on recommender systems based on collaborative filtering, the most successful recommendation technique to date. The chapter provides the reader with an overview of recommender systems based on collaborative filtering, contributes a general taxonomy to classify the algorithms and approaches according to a set of relevant features, and finally provides some guidelines to decide which algorithm best fits a given recommendation problem or domain.

G. Castellano, L.C. Jain, A.M. Fanelli (Eds.): Web Person. in Intel. Environ., SCI 229, pp. 81–117. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com
2 Recommending Based on Collaborative Filtering
The term Collaborative Filtering (CF) was first introduced by Goldberg et al. [23]. They presented Tapestry, an experimental mail system that combined both content-based filtering and collaborative annotations. Although the system was enriched with collaborative information, users were required to write complex queries. The first system that automated recommendations was the GroupLens system [50, 37], which helped users find relevant netnews from a huge stream of articles using ratings given by other similar users. Since then, many relevant research projects have been developed (Ringo [56], Video Recommender [29], Movielens [19, 5], Jester [24]) and the results have positioned CF techniques as the most successful ones for building recommender engines. Popular e-commerce systems, such as Amazon [1], CDNow [3] or LastFM [4], are taking advantage of these engines.
CF relies on the assumption that finding users similar to a new one and examining their usage patterns leads to useful recommendations for the new user. Users usually prefer items that like-minded users prefer, or even that dissimilar users don't prefer. This technology does not rely on the content descriptions of the items, but depends on preferences expressed by a set of users. These preferences can either be expressed explicitly by numeric ratings or be indicated implicitly by user behaviors, such as clicking on a hyperlink, purchasing a book or reading a particular news article. CF requires no domain knowledge and offers the potential to uncover patterns that would be difficult or impossible to detect using content-based techniques. Besides that, collaborative filtering has proved its ability to identify the most appropriate item for each user, and the quality of recommendations improves over time as the user database grows.
Two different approaches have been explored for building pure CF recommenders. The first approach, referred to as memory-based [56, 37, 15, 50, 27, 54], essentially makes rating predictions based on the entire collection of rated items. Items frequently selected by users of the same group can be used as the basis to build a list of recommended items. Memory-based methods produce high-quality recommendations but suffer serious scalability problems as the number of users and items grows. The other approach, known as model-based [56, 14, 15, 9], analyzes historical interaction information to build a model of the relations between different items/users, which is then used to find the recommended items. Model-based schemes produce recommendations faster than memory-based ones, but require a significant amount of time to build the models and lead to lower-quality recommendations.

Definitions and Notation
In the context of recommender systems, a dataset is defined as the collection of all transactions about the items that have been selected by a collection of users.
The symbols $n$ and $m$ will be used in this text to denote the number of distinct users and items in a particular dataset, respectively. Each dataset will be represented formally by an $n \times m$ matrix that will be referred to as the user-item matrix, $A = U \times I$, where $U$ denotes the set of all users and $I$ the set of all items available in the database. The value of element $a_{k,i} \in \{0, 1\}$ denotes whether an interaction between user $k$ and item $i$ has been observed or not. In a recommendation problem, there usually exists additional information about the utility of the user-item interactions, commonly captured as a rating that indicates how much a particular user liked a particular item. This rating information is represented in a different $n \times m$ matrix that will be denoted $R$. The rating that user $k$ expressed for item $i$ is in general a real number and will be referred to as $r_{k,i}$; $r_k$ denotes the vector of all ratings of user $k$. In recommender systems terminology, the active user is the user that queries the recommender system for recommendations on some items. The symbol $a$ will be used to refer to the active user's rating vector. By convention, if $d_i$ denotes the vector that results from taking row $i$ of a certain matrix $D$, $d_j^T$ will be used to denote the vector that results from taking column $j$ of that matrix. The symbol $A_k$ refers to the set of items that user $k$ has already experienced and $R_k$ is the set of items for which user $k$ has actually given ratings. Note that $R_k \subseteq I$ and $R_k \subseteq A_k$.

Problem Formulation
In its most common formulation, the CF recommendation problem is reduced to the problem of estimating, using collaborative features, the utility of the items that have not been selected by the active user. Once these utilities for unseen items are estimated, a top-N recommendation can be built for every user by recommending the items with the highest estimated values.
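As a minimal sketch of the notation and of the top-N construction described above, the following Python fragment builds a toy rating matrix and selects the best unseen items. The data, the function names `rated_items` and `top_n`, and the use of NaN for missing ratings are illustrative assumptions, and the interaction matrix is derived from the ratings, which assumes that every observed interaction is a rating.

```python
import numpy as np

# Hypothetical toy data: n = 3 users (rows) and m = 4 items (columns).
# np.nan marks a rating that user k has not given to item i.
R = np.array([
    [5.0, np.nan, 3.0, np.nan],
    [np.nan, 4.0, np.nan, 1.0],
    [2.0, 5.0, np.nan, np.nan],
])

n, m = R.shape
# a[k, i] = 1 if an interaction (here: a rating) was observed, else 0.
A = (~np.isnan(R)).astype(int)

def rated_items(R, k):
    """Return R_k, the set of item indices rated by user k."""
    return set(np.flatnonzero(~np.isnan(R[k])).tolist())

def top_n(utilities, excluded, N):
    """Return the N items with the highest estimated utility, skipping `excluded`."""
    candidates = [j for j in range(len(utilities)) if j not in excluded]
    candidates.sort(key=lambda j: utilities[j], reverse=True)
    return candidates[:N]

# Made-up utility estimates for user 0, standing in for the output of a predictor.
estimated = np.array([4.8, 3.9, 3.1, 4.2])
recommendation = top_n(estimated, rated_items(R, 0), N=2)
```

Items already rated by the user are excluded before ranking, so the recommendation never overlaps with $R_k$.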
This estimation is usually computed from the ratings explicitly given by the active user to a specific set of items (rating-based filtering), but ratings could also be derived from historical data (purchases, ...) or from other sources of information. In the rest of the chapter we will assume without loss of generality that interactions are based on rating activity. In movie recommendation, for instance, the input to the recommender engine would be a set of movies the user has seen, with a numerical rating associated with each of these movies. The output of the recommender system would be another set of movies, not yet rated by the user, that the recommender predicts the user would rate highly. More formally, given the user-item rating matrix $R$ and the set of ratings $a$ specified by the active user, the recommender engine tries to identify an ordered set of items $X$ such that $X \cap R_k = \emptyset$. To achieve this, the recommendation engine defines a function

$$\nu : U \times I \to \mathbb{R}, \qquad (k, j) \mapsto \nu(k, j) = E(r_{k,j}) \tag{1}$$

that predicts the utility of the interactions between each user $k$ and every item $j$. Note that for a given user $k$, the utilities need to be computed only for items $j \in I - R_k$. Once all utilities are predicted, recommendations to the active user are made by selecting the items with the highest estimated utility (see figure 1). The prediction computation is usually performed on a sparse user-item matrix. Typical sparsity values are on the order of 98%, which means that the interaction matrix is almost empty.

Fig. 1. Illustration of the recommendation process. Given the vector of ratings of the active user, the collaborative filtering algorithm produces a recommendation by selecting the N items with the highest estimated predictions.

In addition to recommender systems that predict the absolute values of ratings, there are other proposals focused on preference-based filtering, i.e., predicting the relative preferences of users [18, 35, 36]. These techniques predict the correct relative order of the items, rather than their individual ratings.

2.1 Memory-Based Collaborative Filtering
Memory-based collaborative filtering is motivated by the observation that users usually trust the recommendations of like-minded neighbors. These methods aim to compute unknown relations between users and items by means of nearest-neighbor schemes that identify either pairs of items that tend to be rated similarly or users with a similar rating history. Memory-based collaborative filtering became very popular because these methods are easy to implement, very intuitive, avoid the need to train and tune many parameters, and let the user easily understand the rationale behind each recommendation. Three components characterize this approach: (1) data preprocessing, in which the input data to the recommender engine is preprocessed to remove global effects, to normalize ratings, etc.; (2) neighborhood selection, which consists of selecting the set of K users [items] that are most similar to the active user [to the set of items already rated by the active user]; and (3) prediction computation, which generates predictions and aggregates items in a top-N recommendation. Table 1 summarizes the different memory-based algorithms that are briefly explained in the next subsections.
Table 1. Summary of memory-based algorithms based on the different components of the recommendation process

Algorithm | Data (preprocessing) | Neighborhood selection | Prediction computation
User-based | Ratings (default voting) | Pearson correlation; Vector similarity → Inverse user frequency; Mean squared difference | Rating aggregation; Most frequent item
Predictability paths | Ratings | Predictability condition heuristics | Linear rating transformation
Item-based | Ratings (adjusted ratings) | Vector similarity; Pearson correlation; Conditional probability based similarity | Rating aggregation; Regression based
Item-to-item co-occurrence | | |
Cluster-based smoothing | Ratings (cluster-based smoothing) | Pearson correlation | Rating aggregation
Trust inferences | Ratings | Compute trust of users → Pearson correlation; Weighted average composition | Rating aggregation
Improved neighborhood | Ratings (remove global effects) | Weight optimization | Rating aggregation
User-Based
This CF approach estimates unknown ratings based on the recorded ratings of like-minded users. The predicted rating of the active user $k$ for item $j$ is a weighted sum of the ratings of other users,

$$\nu_{k,j} = \bar{r}_k + \frac{\sum_{l \in U_k} w_{k,l} \cdot (r_{l,j} - \bar{r}_l)}{\sum_{l \in U_k} |w_{k,l}|} \tag{2}$$

where $U_k$ denotes the set of users in the database that satisfy $w_{k,l} \neq 0$. These weights can reflect distance, correlation or similarity between each user and the active user. $\bar{r}_k$ and $\bar{r}_l$ represent the mean rating of the active user $k$ and of user $l$, respectively. Different weighting functions can be considered: Pearson correlation, cosine vector similarity, Spearman correlation, entropy-based uncertainty and mean-square difference are some examples. The Pearson correlation (eq. 3)¹ was the first measure used to compute these weights [50]. Breese et al. [15] and Herlocker et al. [27] reported that Pearson correlation performs better than other metrics.

$$w_{k,l} = \frac{\sum_{i \in R_k \cap R_l} (r_{k,i} - \bar{r}_k)(r_{l,i} - \bar{r}_l)}{\sqrt{\sum_{i \in R_k \cap R_l} (r_{k,i} - \bar{r}_k)^2 \sum_{i \in R_k \cap R_l} (r_{l,i} - \bar{r}_l)^2}} \tag{3}$$

¹ Note that the Pearson correlation is defined in [−1, +1]; for negative weights to make sense, ratings should be re-scaled to fit [−r, +r].
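Equations 2 and 3 can be illustrated with a short Python sketch. The toy matrix and the function names `pearson_weight` and `predict_user_based` are hypothetical. Two choices here are assumptions rather than part of the original formulation: user means are taken over all of a user's ratings, and the sum in equation 2 is restricted to neighbors who actually rated item $j$.

```python
import numpy as np

def pearson_weight(R, k, l):
    """Pearson correlation (eq. 3) between users k and l over co-rated items."""
    common = ~np.isnan(R[k]) & ~np.isnan(R[l])
    if common.sum() < 2:
        return 0.0
    # Deviations from each user's mean rating (mean over all of the user's ratings).
    dk = R[k, common] - np.nanmean(R[k])
    dl = R[l, common] - np.nanmean(R[l])
    denom = np.sqrt((dk ** 2).sum() * (dl ** 2).sum())
    return float((dk * dl).sum() / denom) if denom > 0 else 0.0

def predict_user_based(R, k, j):
    """Weighted-sum prediction (eq. 2) of user k's rating for item j."""
    mean_k = np.nanmean(R[k])
    num = den = 0.0
    for l in range(R.shape[0]):
        # Skip the active user and users who did not rate item j (assumption).
        if l == k or np.isnan(R[l, j]):
            continue
        w = pearson_weight(R, k, l)
        num += w * (R[l, j] - np.nanmean(R[l]))
        den += abs(w)
    return float(mean_k) if den == 0 else float(mean_k + num / den)

# Hypothetical ratings: user 1 agrees with user 0, user 2 disagrees.
R = np.array([
    [4.0, 2.0, np.nan, np.nan],
    [5.0, 3.0, 5.0, 3.0],
    [1.0, 3.0, 1.0, 3.0],
])
prediction = predict_user_based(R, 0, 2)
```

In this toy example the negatively correlated user 2 rated item 2 below its own mean, so its contribution pushes the prediction up rather than down, which is the effect of allowing negative weights in equation 2.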
Vector similarity is another weighting function that can be used to measure the similarity between users:

$$w_{k,l} = \frac{\sum_{i \in R_k \cap R_l} r_{k,i} \cdot r_{l,i}}{\sqrt{\sum_{i \in R_k \cap R_l} r_{k,i}^2 \sum_{i \in R_k \cap R_l} r_{l,i}^2}} \tag{4}$$

Though Pearson correlation and vector similarity are the most popular, other metrics are also used. For instance, Shardanand and Maes [56] used a Mean Squared Difference to compute the degree of dissimilarity between users $k$ and $l$. Predictions were made by considering all users whose dissimilarity to the active user was below a certain threshold and computing the weighted average of the ratings provided by these most similar users, where the weights were inversely proportional to the dissimilarity. They also presented a Constrained Pearson correlation to take into account the positivity and negativity of ratings in absolute scales.

Most frequent item recommendation. Instead of using equation 2 to compute predictions and then constructing a top-N recommendation by selecting the highest predicted items, each candidate item could be ranked according to how many similar users selected it,

$$s_{k,j} = \sum_{l \in U_k \,:\, a_{l,j} = 1} 1 \tag{5}$$

and the recommendation list would then be computed by selecting the N most frequently selected items.

Weighting Schemes
Breese et al. [15] investigated different modifications to the weighting function that have been shown to improve the performance of this memory-based approach.
Default voting was proposed as an extension of the Pearson correlation (equation 3) that improves the similarity measure in cases in which either the active user or the matching user has relatively few ratings ($R_k \cap R_l$ has very few items). Refer to [15] for a mathematical formulation.
Inverse user frequency tries to reduce the weights of commonly selected items, based on the idea that commonly selected items are not as useful in characterizing the user as items that are selected less frequently. Following the original concepts in the domain of information retrieval [10], the inverse user frequency can be defined as:

$$f_i = \log \frac{|\{u_k\}|}{|\{u_k : i \in B_k\}|} = \log \frac{n}{n_i} \tag{6}$$

where $B_k$ denotes the set of items rated by user $k$, $n_i$ is the number of users who rated item $i$ and $n$ is the total number of users in the database. To use the inverse user frequency in equation 4, the transformed rating is simply the original rating multiplied by the inverse user frequency. It can also be used in correlation, but the transformation is not direct (see Breese et al. [15] for a detailed description).
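The following Python sketch illustrates equations 4, 5 and 6 under stated assumptions. The function names, the toy data and the convention of assigning a frequency of 0 to items that nobody rated are hypothetical choices made for this example only.

```python
import numpy as np

def inverse_user_frequency(R):
    """f_i = log(n / n_i) (eq. 6), with n_i the number of users who rated item i."""
    n = R.shape[0]
    n_i = (~np.isnan(R)).sum(axis=0)
    # Items nobody rated get frequency 0 here (an arbitrary choice for this sketch).
    return np.where(n_i > 0, np.log(n / np.maximum(n_i, 1)), 0.0)

def vector_similarity(R, k, l, item_weights=None):
    """Cosine similarity (eq. 4) over co-rated items, optionally on transformed ratings."""
    common = ~np.isnan(R[k]) & ~np.isnan(R[l])
    if not common.any():
        return 0.0
    f = np.ones(R.shape[1]) if item_weights is None else item_weights
    x = R[k, common] * f[common]   # transformed rating = rating * frequency
    y = R[l, common] * f[common]
    denom = np.sqrt((x ** 2).sum() * (y ** 2).sum())
    return float((x * y).sum() / denom) if denom > 0 else 0.0

def most_frequent_items(A, neighbors, seen, N):
    """Rank unseen items by how many neighbors selected them (eq. 5)."""
    counts = A[neighbors].sum(axis=0)
    ranked = [int(j) for j in np.argsort(-counts, kind="stable")]
    return [j for j in ranked if j not in seen and counts[j] > 0][:N]

# Hypothetical ratings: item 1 is rated by every user, item 2 by one user only.
R = np.array([
    [4.0, 2.0, np.nan],
    [4.0, 1.0, 5.0],
    [np.nan, 3.0, np.nan],
])
A = (~np.isnan(R)).astype(int)
f = inverse_user_frequency(R)
plain = vector_similarity(R, 0, 1)
weighted = vector_similarity(R, 0, 1, item_weights=f)
frequent = most_frequent_items(A, neighbors=[1, 2], seen={0, 1}, N=2)
```

Because item 1 is rated by all users, its frequency is log(1) = 0 and it no longer contributes to the weighted similarity, which then depends only on the less common item 0.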
Predictability Paths
Aggarwal et al. [9] proposed a graph-based recommendation algorithm in which users are represented as nodes of a graph and the edges between nodes indicate the degree of similarity between users. The recommendations for a user are computed by traversing nearby nodes in this graph. The graph representation can capture transitive relations that cannot be captured by nearest-neighbor algorithms. The authors reported better performance than the user-based schemes. The approach is based on the concepts of horting and predictability. The horting condition states whether there is enough overlap between each pair of users $(k, l)$ to decide whether the behavior of one user could predict the behavior of the other. By definition, user $k$ horts user $l$ if the following condition is satisfied:

$$\mathrm{card}(R_k \cap R_l) \ge \min(F \cdot \mathrm{card}(R_k),\, G) \tag{7}$$

where $F \le 1$ and $G$ is some predefined threshold. The predictability condition establishes that user $l$ predicts the behavior of user $k$ if there exists a linear rating transformation

$$T_{s_{k,l},\,t_{k,l}} : x_{k,j} = s \cdot r_{l,j} + t \tag{8}$$

that carries ratings $r_{l,j}$ of user $l$ into ratings $x_{k,j}$ of user $k$ with an acceptable error. The $(s, t)$ pair of real numbers is chosen so that transformation 8 keeps at least one value in the rating domain (see [9] for further details on the restrictions on the s-t value pair). More formally, user $l$ predicts user $k$ if user $k$ horts user $l$ (eq. 7) and if there exists a linear rating transformation $T_{s,t}$ such that expression 9 is satisfied, with $\beta$ a positive real number:

$$\frac{\sum_{j \in R_k \cap R_l} |r_{k,j} - x_{k,j}|}{\mathrm{card}(R_k \cap R_l)} < \beta \tag{9}$$

Each arc between users $k$ and $l$ indicates that user $l$ predicts user $k$, and therefore it has an associated linear transformation $T_{s_{k,l},\,t_{k,l}}$. Using an appropriate graph search algorithm, a set of optimal directed paths between user $k$ and any user $l$ that selected item $j$ can be constructed. Each directed path allows a rating prediction to be computed from the composition of transformations (eq. 8). For instance, given the directed path $k \to l_1 \to \dots \to l_n$ with predictor values $(s_{k,1}, t_{k,1}), (s_{1,2}, t_{1,2}), \dots, (s_{n-1,n}, t_{n-1,n})$, the predicted rating of item $j$ will be $T_{s_{k,1},t_{k,1}} \circ (T_{s_{1,2},t_{1,2}} \circ (\dots \circ T_{s_{n-1,n},t_{n-1,n}}(r_{n,j}) \dots))$. Since different paths may exist, the average of these predicted ratings is computed as the final prediction. A top-N recommendation is constructed by aggregating the N items with the highest predicted ratings.

Item-Based
The item-based algorithm is an analogous alternative to the user-based approach that was proposed by Sarwar et al. [53] to address the scalability problems of
the user-based approach. The algorithm, in its original formulation, generates a list of recommendations for the active user by selecting new items that are similar to the collection of items already rated by the user. As with the user-based approach, the item-based approach consists of two different components: the similarity computation and the prediction computation. There are different ways to compute the similarity between items. Here we present four of these methods: vector similarity, Pearson correlation, adjusted vector similarity and conditional probability-based similarity.

Vector similarity. One way to compute the similarity between items is to consider each item $i$ as a vector in the $n$-dimensional user space. The similarity between any two items $i$ and $j$ is measured by computing the cosine of the angle between these two vectors:

$$w_{i,j} = \frac{\sum_{k \in U_i \cap U_j} r_{k,i} \, r_{k,j}}{\sqrt{\sum_{k \in U_i \cap U_j} r_{k,i}^2 \sum_{k \in U_i \cap U_j} r_{k,j}^2}} \tag{10}$$

where the summation extends over the users who rated both items, $k \in U_i \cap U_j$.

Pearson correlation. Similarly to equation 3, the Pearson correlation between items $i$ and $j$ is given by:

$$w_{i,j} = \frac{\sum_{k \in U_i \cap U_j} (r_{k,i} - \bar{r}_i)(r_{k,j} - \bar{r}_j)}{\sqrt{\sum_{k \in U_i \cap U_j} (r_{k,i} - \bar{r}_i)^2 \sum_{k \in U_i \cap U_j} (r_{k,j} - \bar{r}_j)^2}} \tag{11}$$

where $\bar{r}_i$ and $\bar{r}_j$ denote the average rating of items $i$ and $j$, respectively.

Adjusted vector similarity. Computing the similarity between items using the vector similarity has one important drawback: the difference in rating scale between different users is not taken into account. The adjusted vector similarity addresses this problem by subtracting the corresponding user average from each rating:

$$w_{i,j} = \frac{\sum_{k \in U_i \cap U_j} (r_{k,i} - \bar{r}_k)(r_{k,j} - \bar{r}_k)}{\sqrt{\sum_{k \in U_i \cap U_j} (r_{k,i} - \bar{r}_k)^2 \sum_{k \in U_i \cap U_j} (r_{k,j} - \bar{r}_k)^2}} \tag{12}$$

Conditional probability-based similarity. An alternative way to compute the similarity between each pair of items is to use a measure based on the conditional probability of selecting one of the items given that the other item was selected.
This probability can be expressed as the number of users that selected both items $i$ and $j$ divided by the total number of users that selected item $i$:

$$w_{i,j} = P(j|i) = \frac{|\{u_k : i, j \in R_k\}|}{|\{u_k : i \in R_k\}|} \tag{13}$$

Note that this similarity measure is not symmetric: in general, $P(j|i) \neq P(i|j)$.
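The four item similarity measures above (equations 10 to 13) might be sketched in Python as follows. The helper names and the toy matrix are hypothetical; item and user means are computed over all available ratings of the item or user, which is one possible reading of the definitions.

```python
import numpy as np

def _co_raters(R, i, j):
    """Boolean mask of the users who rated both items i and j."""
    return ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])

def _cosine(x, y):
    """Cosine of the angle between x and y, or 0.0 if either vector is null."""
    denom = np.sqrt((x ** 2).sum() * (y ** 2).sum())
    return float((x * y).sum() / denom) if denom > 0 else 0.0

def item_vector_similarity(R, i, j):
    """Eq. 10: cosine of the two item columns over co-rating users."""
    u = _co_raters(R, i, j)
    return _cosine(R[u, i], R[u, j])

def item_pearson(R, i, j):
    """Eq. 11: deviations are taken from each item's mean rating."""
    u = _co_raters(R, i, j)
    return _cosine(R[u, i] - np.nanmean(R[:, i]), R[u, j] - np.nanmean(R[:, j]))

def item_adjusted_similarity(R, i, j):
    """Eq. 12: deviations are taken from each user's mean rating."""
    u = _co_raters(R, i, j)
    user_means = np.nanmean(R[u], axis=1)
    return _cosine(R[u, i] - user_means, R[u, j] - user_means)

def item_conditional_probability(R, i, j):
    """Eq. 13: P(j | i) = users who rated both / users who rated i."""
    rated_i = ~np.isnan(R[:, i])
    if not rated_i.any():
        return 0.0
    return float(_co_raters(R, i, j).sum() / rated_i.sum())

# Hypothetical ratings: 4 users (rows) and 3 items (columns).
R = np.array([
    [5.0, 3.0, np.nan],
    [4.0, np.nan, 2.0],
    [1.0, 5.0, 4.0],
    [3.0, 2.0, 4.0],
])
p_1_given_0 = item_conditional_probability(R, 0, 1)
p_0_given_1 = item_conditional_probability(R, 1, 0)
```

With this data, item 0 is rated by four users and item 1 by three of them, so the two conditional probabilities differ, which illustrates the asymmetry noted for equation 13.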
To compute predictions using the item-based approach, a recommendation list is generated by ranking items with a prediction measure computed as a weighted average over all of the active user's ratings for items in the collection $R_k$:

$$\nu_{k,j} = \frac{\sum_{i \in R_k} r_{k,i} \cdot w_{i,j}}{\sum_{i \in R_k} |w_{i,j}|} \tag{14}$$

Model-Based Interpretation
Since similarities among items do not change frequently, relations between items can be stored in a model $M$. This is why some researchers consider the item-based algorithm a model-based approach to collaborative filtering. Model $M$ could contain all relations between pairs of items, but one common approach is to store, for each item $i$, only its top-K similar items. This parameterization of $M$ on K is motivated by performance considerations. With a small value of K, $M$ is very sparse and the similarity information can be stored in memory even when the number of items in the dataset is very large. However, if K is very small, the resulting model will contain limited information and could potentially lead to low-quality recommendations (see [53] for further reading).

Item-to-Item Extension
Greg Linden et al. [41] proposed this extension to the item-based approach, which is capable of producing recommendations in real time, scaling to massive datasets and generating high-quality recommendations. The algorithm is essentially an item-based approach, but includes several refinements that make the item-to-item algorithm faster than the item-based one: (1) the similarity computation is restricted to item pairs with common users (co-occurrent items) and (2) the recommendation list is computed by looking into a small set that aggregates items that were found similar to a certain basket of user selections. To determine the most similar match for a given item, the algorithm builds a co-occurrence matrix by finding items that users tend to select together.
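As a rough illustration of equation 14 combined with the top-K model $M$ discussed above, consider the following Python sketch. The similarity matrix `W` is assumed to be precomputed with any of the item similarity measures, and the function names `build_model` and `predict_item_based` are hypothetical.

```python
import numpy as np

def build_model(W, K):
    """Keep, for each item i, only its K most similar items (row i of M)."""
    m = W.shape[0]
    M = np.zeros_like(W, dtype=float)
    for i in range(m):
        sims = W[i].astype(float).copy()
        sims[i] = -np.inf                      # an item is not its own neighbor
        top = np.argsort(-sims, kind="stable")[:K]
        M[i, top] = W[i, top]
    return M

def predict_item_based(r_k, M, j):
    """Eq. 14: weighted average of user k's ratings, weighted by w_{i,j}."""
    rated = ~np.isnan(r_k)
    w = M[rated, j]
    den = np.abs(w).sum()
    # No rated item has j among its stored neighbors: no prediction is possible.
    return float((r_k[rated] * w).sum() / den) if den > 0 else float("nan")

# Hypothetical precomputed item-item similarities for 3 items.
W = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.5],
    [0.1, 0.5, 1.0],
])
M = build_model(W, K=1)
r_k = np.array([4.0, np.nan, 2.0])   # the active user rated items 0 and 2
prediction = predict_item_based(r_k, M, 1)
```

With K = 1 most entries of `M` are zero, which mirrors the trade-off described in the text: the model is cheap to store, but an item can only be predicted if it appears among the stored neighbors of something the user rated.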
The similarity between two items $i$ and $j$ is non-zero if at least $q+1$ users have selected the pair $(i, j)$, where $q \geq 0$ is a predefined threshold. The similarity between two items satisfying this property can be computed in various ways, but a common method is to use the cosine similarity described in equation 10. Predictions for new items are computed with equation 14 (see [41] for further details).

Cluster-Based Smoothing

Xue et al. [59] proposed a collaborative filtering algorithm that provides higher accuracy as well as increased efficiency in recommendations. The algorithm is a user-based algorithm that has been enhanced with clustering and a rating smoothing mechanism based on the clustering results. Clustering was performed using the K-means algorithm with the Pearson correlation coefficient (equation 3) as the similarity measure between users. Data smoothing is a mechanism to fill in the missing values of the rating matrix. To smooth the data, Xue et al. [59] made explicit use of the
clusters as smoothing mechanisms. Based on the clustering results, they applied the following smoothing strategy:

$$r_{k,j} = \begin{cases} r_{k,j} & \text{if } r_{k,j} \neq \emptyset \\ \bar{r}_{k,j} & \text{otherwise} \end{cases} \qquad (15)$$

where $\bar{r}_{k,j}$ denotes the smoothed value for user $k$'s rating of item $j$. To account for the diversity of users, Xue et al. [59] proposed the following equation to compute the smoothed rating:

$$\bar{r}_{k,j} = \bar{r}_k + \Delta r_{C(k),j} = \bar{r}_k + \frac{1}{|C(k,j)|} \sum_{l \in C(k,j)} (r_{l,j} - \bar{r}_l) \qquad (16)$$
where $C(k)$ denotes the cluster of user $k$ and $C(k, j)$ the subset of users in cluster $C(k)$ that rated item $j$. Smoothed ratings are used to pre-select neighbors. Given the active user $k$, a set of most similar clusters is selected to build a neighborhood of similar users. After this pre-selection, the similarity between each user $l$ in the neighborhood and the active user is computed using the smoothed representation of the user ratings:

$$w_{k,l} = \frac{\sum_{j \in R_k} \delta_{l,j} \cdot (r_{k,j} - \bar{r}_k)(r_{l,j} - \bar{r}_l)}{\sqrt{\sum_{j \in R_k} (r_{k,j} - \bar{r}_k)^2} \sqrt{\sum_{j \in R_k} \delta_{l,j}^2 (r_{l,j} - \bar{r}_l)^2}} \qquad (17)$$

where

$$\delta_{l,j} = \begin{cases} 1 - \lambda & \text{if } r_{l,j} \neq \emptyset \\ \lambda & \text{otherwise} \end{cases} \qquad (18)$$

represents the confidence weight for user $l$ on item $j$, and $\lambda \in [0, 1]$ is a parameter that tunes the weight between the user rating and the cluster rating. Predictions for the active user are computed by aggregating ratings from the top-$K$ most similar users, $U_k$, in the same manner as for the user-based algorithm (see equation 2):

$$\nu_{k,j} = \bar{r}_k + \frac{\sum_{l \in U_k} \delta_{l,j} \cdot w_{k,l} \cdot (r_{l,j} - \bar{r}_l)}{\sum_{l \in U_k} \delta_{l,j} \cdot |w_{k,l}|} \qquad (19)$$

By assigning different values to $\lambda$, Xue et al. [59] adjusted the weighting scheme. For instance, if $\lambda = 0$ the algorithm uses only the original rating information for the similarity computation and prediction. But if $\lambda = 1$ the algorithm is a cluster-based CF that uses the average ratings of the clusters for similarity and prediction computation.

Trust Inferences

This approach focuses on developing a computational model that makes it possible to explore transitive user similarities based on trust inferences. Papagelis et al. [46] presented the concept of associations between users as an expression of established
trust between each other. This trust is defined in the context of similarity conditions and is computed by means of the Pearson correlation (see equation 3). The more similar two users are, the greater their established trust. While the computation of trust in direct associations is based on user-to-user similarity, for length-$K$ associations a transitive rule is adopted. Under this rule, trust is propagated through the network and associations between users are built even if they have no co-rated items. If $V = \{V_i;\ i = 1, 2, \ldots, K\}$ is the set of all intermediate nodes in a trust path that connects user $k$ with user $l$, then their associated inferred trust is given by:

$$T_{k \to l}^{V_1 \to \ldots \to V_K} = \left(\left(\left(T_{k \to V_1} \oplus T_{V_1 \to V_2}\right) \oplus \ldots\right) \oplus T_{V_{K-1} \to V_K}\right) \oplus T_{V_K \to l} \qquad (20)$$

The symbol $\oplus$ denotes a special operation that is best understood for the case of only one intermediate node $Z$:

$$T_{k \to l}^{Z} = T_{k \to Z} \oplus T_{Z \to l} = \delta \cdot \left( \frac{|B_{k,Z}|}{|B_{k,Z}| + |B_{Z,l}|} |T_{k \to Z}| + \frac{|B_{Z,l}|}{|B_{k,Z}| + |B_{Z,l}|} |T_{Z \to l}| \right)$$

where $B_{k,Z} = R_k \cap R_Z$, $B_{Z,l} = R_Z \cap R_l$ and

$$\delta = \begin{cases} +1 & \text{if } T_{k \to Z} > 0,\ T_{Z \to l} > 0 \\ -1 & \text{if } T_{k \to Z} \cdot T_{Z \to l} < 0 \end{cases}$$

The inferred trust is not applicable if $T_{k \to Z} < 0$ and $T_{Z \to l} < 0$. In this case the length of the path between users $k$ and $l$ is taken to be infinite. To build a recommendation for the active user, a collection of paths between the user and other trusted users is selected in a first step. Papagelis et al. [46] proposed different selection mechanisms, and one of the best approaches was Weighted Average Composition, which computes the trust between any two unconnected users $k$ and $l$ using the following equation:

$$T_{k \to l} = \frac{1}{\sum_{i=1}^{|P|} C_{k \to l}^{P_i}} \sum_{i=1}^{|P|} C_{k \to l}^{P_i} \cdot T_{k \to l}^{P_i} \qquad (21)$$

where $C_{k \to l}^{P_i}$ expresses the confidence of the association $k \to l$ through the path $P_i$,

$$C_{k \to l}^{V_1 \to \ldots \to V_K} = \left(\left(C_{k \to V_1} \cdot C_{V_1 \to V_2} \cdot \ldots\right) \cdot C_{V_{K-1} \to V_K}\right) \cdot C_{V_K \to l} \qquad (22)$$

and the confidence of each direct association $k \to l$ is assumed to be directly related to the number of co-rated items between the users:

$$C_{k \to l} = \frac{|R_k \cap R_l|}{|R_k \cap R_{u_{max}}|} \qquad (23)$$

where $u_{max}$ represents the user who rated the most items in common with user $k$. Predictions for unseen items can be computed using equation 2, in which each weight $w_{k,l}$ is given by equation 21.
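The single-intermediate-node operation and equations 21 and 22 can be sketched as follows. The function names are hypothetical, and two details are assumptions made only for this illustration: a zero trust value is treated as non-negative, and `None` stands for an inferred trust that is not applicable.

```python
def combine_trust(t_kz, t_zl, n_kz, n_zl):
    """Inferred trust k -> l through one intermediate user Z.

    t_kz, t_zl: direct trust values T_{k->Z} and T_{Z->l}.
    n_kz, n_zl: numbers of co-rated items |B_{k,Z}| and |B_{Z,l}|.
    """
    if t_kz < 0 and t_zl < 0:
        return None  # not applicable: the path length is taken to be infinite
    total = n_kz + n_zl
    if total == 0:
        return None  # no co-rated items on either link (assumption)
    sign = -1.0 if t_kz * t_zl < 0 else 1.0
    return sign * (n_kz / total * abs(t_kz) + n_zl / total * abs(t_zl))


def path_confidence(direct_confidences):
    """Equation 22: product of the direct confidences along a path."""
    result = 1.0
    for c in direct_confidences:
        result *= c
    return result


def weighted_average_trust(paths):
    """Equation 21: paths is a list of (confidence, trust) pairs, one per path."""
    total_confidence = sum(c for c, _ in paths)
    if total_confidence == 0:
        return None
    return sum(c * t for c, t in paths) / total_confidence


positive = combine_trust(0.8, 0.4, 6, 2)   # both links positive
mixed = combine_trust(0.8, -0.4, 6, 2)     # links of opposite sign
paths = [(path_confidence([0.5, 1.0]), 0.7), (path_confidence([0.5, 0.5]), 0.1)]
print(positive, mixed, weighted_average_trust(paths))
```

In the first call the link with more co-rated items dominates the result, which is the weighting that the operation is designed to provide.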
Improved Neighborhood-Based

The success of neighborhood-based algorithms depends on the choice of the interpolation weights (equations 2 and 14), which are used to compute unknown ratings from neighboring known ones. But the aforementioned user- and item-oriented approaches lack a rigorous way to derive these weights. Different algorithms use different heuristics to compute them, and there is no fundamental justification for choosing one over another. Bell and Koren [13] proposed a method to learn interpolation weights directly from the ratings. Their approach improved prediction accuracy by means of two mechanisms: (1) preprocessing the user-item rating matrix to remove global effects, which makes the different ratings more comparable, and (2) deriving interpolation weights from the rating matrix. The preprocessing step consists of a set of rating transformations that prepare the input data: removing systematic user or item effects (for example, to adjust for the fact that some items were mostly rated by users who tend to rate high), adjusting ratings using item variables (such as the number of ratings given to an item or the average rating of an item) and adjusting ratings by analyzing characteristics (such as the date of rating) that may explain some of the variation in ratings². Interpolation weights are computed by modeling the relations between item $j$ and its neighbors through the following optimization problem:

$$\min_{w} \sum_{k:\ j \in R_k} \Big( r_{k,j} - \sum_{i \in R_k} w_{i,j} \cdot r_{k,i} \Big)^2 \qquad (24)$$

and are used with equation 14 to predict $r_{k,j}$. The authors reported that this approach can be very successful when combined with model-based approaches that use matrix factorization techniques (see section 2.2). An alternative user-based formulation can be derived analogously by simply switching the roles of users and items.

2.2 Model-Based Collaborative Filtering
Model-based collaborative filtering first learns a descriptive model of user preferences and then uses it for predicting ratings. Many of these methods are inspired by machine learning algorithms: neural-network classifiers [14], induction rule learning [61], Bayesian networks [15], dependency networks [26], latent class models [31, 38], principal component analysis [24] and association rule mining [39]. Table 2 summarizes some of the model-based algorithms that are described in the next subsections.

² Further information about the mathematical formulation of these preprocessing steps can be found in [13].

Table 2. Different model-based algorithms based on the different components of the recommendation process: data preprocessing, model building and prediction computation

| Algorithm | Data processing | Model building | Prediction computation |
|---|---|---|---|
| Bayesian networks | Instance-based representation | Probabilistic clustering → EM fitting; Bayesian classifier; Dependency networks | Probabilistic aggregation |
| Latent class models | Binary preference representation | Latent class models → EM fitting | Probabilistic selection |
| SVD | Low dimensional representation → SVD | | Neighborhood formation in the reduced space → User-based |
| Simple Bayesian classifier | Instance-based representation | Naive Bayes classifier | Probabilistic classification |
| Association rule mining | Binary rating representation; Instance-based representation | Association rule mining | Selection based on support and confidence of rules |
| Eigentaste | PCA rating transformation; Low dimensionality reduction | Recursive rectangular clustering | Most frequent item |
| PMCF | | Generative probabilistic model | Probabilistic aggregation |

Cluster Models and Bayesian Classifiers

From a probabilistic perspective, the collaborative filtering task can be viewed as calculating the expected value of the active user's rating on an item given what we know about the user:

$$\nu_{k,j} = \sum_{x} p(r_{k,j} = x \mid r_k) \cdot x \qquad (25)$$
where the probability expression is the probability that the active user will give a particular rating to item $j$ given the previously observed ratings $r_k = \{r_{k,i},\ i \in R_k\}$. The character $x$ denotes rating values in the interval $[r_{min}, r_{max}]$. Breese et al. [15] presented two different probabilistic models for computing $p(r_{k,j} = x \mid r_{k,i},\ i \in R_k)$. In the first algorithm, users are clustered using the conditional Bayesian probability, based on the idea that there are certain groups that capture common sets of user preferences. The probability of observing a user belonging to a particular cluster $c_s \in C = \{C_1, C_2, \ldots, C_K\}$ given a certain set of item ratings $r_k$ is estimated from the probability distribution of ratings in each cluster:

$$p(c_s, r_k) = p(c_s) \prod_{i} p(r_{k,i} \mid c_s) \qquad (26)$$

The clustering solution (parameters $p(c_s)$ and $p(r_{k,i} \mid c_s)$) is computed from data using the expectation maximization (EM) algorithm. The second algorithm is based on Bayesian network models in which each item in the database is modeled as a node having states corresponding to the rating of that item. The learning problem consists of building a network on these nodes such that each node has a set of parent nodes that are the best predictors for the child's rating. They presented a detailed comparison of these two model-based
approaches with the user-based approach and showed that the Bayesian network model outperformed the clustering model as well as the user-based scheme. A related algorithm was proposed by Heckerman et al. [26] based on dependency networks instead of Bayesian networks. Although the accuracy of dependency networks is lower than that of Bayesian networks, they learn faster and have smaller memory requirements.

Latent Class Models

Latent class models can be used in collaborative filtering to produce recommendations. This approach is similar to probabilistic models, but the resulting recommendations are generated based on a probabilistic classification scheme. Using latent class models, a latent class $z \in Z = \{z_1, z_2, \ldots, z_K\}$ is associated with each observation $(x, y)$. The key assumption is that $x$ and $y$ are independent given $z$. In the context of collaborative filtering, observations are transactions, and the probability of observing a transaction between user $k$ and item $j$ can be modeled via latent class models as follows³:

$$p(k, j) = \sum_{z \in Z} p(z)\, p(k \mid z)\, p(j \mid z) \qquad (27)$$

where $p(k \mid z)$ denotes the probability of having user $k$ given latent variable $z$ and $p(j \mid z)$ represents the probability of observing item $j$ given variable $z$. The standard procedure to compute the probabilities $p(k \mid z)$ and $p(j \mid z)$ is to use an EM algorithm (see [31] for further details). Recommendation is performed by simply selecting the most probable latent classes given the active user $k$ and, for each latent class, the most probable observations $p(j \mid z)$ such that $j \notin R_k$. Hofmann et al. [31] extended this formulation by introducing an additional random variable that captures additional binary preferences (like and dislike).

Singular Value Decomposition

Singular Value Decomposition (SVD) is a matrix-factorization technique that factors an $m \times n$ matrix $R$ into three matrices:

$$R = V \cdot E \cdot W^T \qquad (28)$$
where $V$ and $W^T$ are two orthogonal matrices of size $m \times r$ and $r \times n$, respectively, with $r$ the rank of the matrix $R$. $E$ is a diagonal matrix that contains all singular values of $R$. The matrices obtained by performing SVD are particularly useful to compute recommendations and have been used in different research works to address the problem of sparsity in the user-item matrix [54, 22]. If the $r \times r$ matrix $E$ is reduced to keep only the $q$ largest diagonal values, $E_q$, and the matrices $V$ and $W^T$ are reduced accordingly, the reconstructed matrix $R_q = V_q \cdot E_q \cdot W_q^T$ is the closest rank-$q$ matrix to $R$. If $R$ is the original user-item rating matrix, SVD produces a low-dimensional representation of the user-item matrix that can be used as a basis to compute recommendations.

³ For a detailed description of latent class models refer to Hofmann et al. [31, 30].
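For the special case $q = 1$, the reduced SVD can be illustrated without a linear algebra library. The sketch below uses power iteration, a technique swapped in here for brevity rather than a full SVD routine, to estimate the largest singular value and its singular vectors; the function name and the toy matrix are hypothetical, and the starting vector is assumed not to be orthogonal to the leading right singular vector.

```python
import math


def rank_one_approximation(R, iterations=50):
    """Estimate the leading singular triplet of R and return sigma * u * v^T."""
    m, n = len(R), len(R[0])
    v = [1.0 / math.sqrt(n)] * n  # starting vector (assumed not orthogonal to the answer)
    for _ in range(iterations):
        Rv = [sum(R[a][b] * v[b] for b in range(n)) for a in range(m)]
        w = [sum(R[a][b] * Rv[a] for a in range(m)) for b in range(n)]  # R^T R v
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    Rv = [sum(R[a][b] * v[b] for b in range(n)) for a in range(m)]
    sigma = math.sqrt(sum(x * x for x in Rv))          # largest singular value
    u = [x / sigma for x in Rv] if sigma > 0.0 else [0.0] * m
    approx = [[sigma * u[a] * v[b] for b in range(n)] for a in range(m)]
    return sigma, u, v, approx


# Toy matrix of rank 1: every row is a multiple of (1, 2).
R = [[2.0, 4.0], [1.0, 2.0], [3.0, 6.0]]
sigma, u, v, R1 = rank_one_approximation(R)
print(sigma)
print(R1)
```

Since the toy matrix already has rank 1, its rank-1 approximation reproduces it up to rounding; for a general rating matrix the same routine keeps only the dominant direction of variation.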
Sarwar et al. [52] used SVD to build recommendations following a user-based-like approach. They successfully applied SVD to obtain an $m \times q$ representation of the users, $V_q \cdot E_q^{1/2}$, and computed the user similarity from that low-dimensional representation. Compared to correlation-based systems, results showed good quality predictions and the potential to provide better online performance. Drineas et al. [22] further studied SVD and showed from a mathematical point of view that this approach can produce competitive recommendations.

Simple Bayesian Classifier

Most collaborative filtering systems adopt numerical ratings and try to predict numerical ratings. However, there are other systems that produce recommendations by accurately classifying items and selecting those that are predicted to be relevant to the user. The simple Bayesian classifier [44] is one of the most successful algorithms in many classification domains (text categorization, content-based filtering, etc.) and has been shown to be competitive for collaborative filtering. To use this algorithm for CF, a special representation that merges both the interaction matrix and the rating matrix $R$ is required. Suppose that $D$ is a $2n \times m$ matrix in which each user rating vector $r_l$ is divided into two binary vectors, $d_l^{lik}$ and $d_l^{dis}$, whose boolean values indicate whether the user liked the item and did not like the item, respectively. Making the naïve assumption that features are independent given the class label, the probability of observing that an item belongs to $c_s \in \{lik, dis\}$ given its $2(n - 1)$ feature values is:

$$p(c_s, d_i^T) = p(c_s) \prod_{l=1}^{2(n-1)} p(d_{l,i} \mid c_s) \qquad (29)$$
where both the probability of observing the active user labeling item $i$ with $c_s$, $p(c_s)$, and the probability of having feature $d_{l,i}$ if the active user labeled the item with class $c_s$, $p(d_{l,i} \mid c_s)$, are estimated from the database:

$$p(c_s) = \frac{|\{d_{k,i} = c_s\}|}{m}; \qquad p_{k,i}(d_{l,i} \mid c_s) = \frac{|\{d_{l,i} = 1;\ d_{k,i} = c_s\}|}{|\{d_{k,i} = c_s\}|} \qquad (30)$$
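Equations 29 and 30 can be sketched as a small classifier. Several details are assumptions made only for this illustration: the function names, the toy data, the use of the number of labeled items as $m$, the Laplace smoothing that avoids zero probabilities, and the evaluation in log space. Each feature tuple holds the like/dislike bits of the other users for one item, and the label is the active user's class for that item.

```python
import math

CLASSES = ("lik", "dis")


def train(items):
    """items: list of (features, label); features is a tuple of 0/1 values."""
    n_features = len(items[0][0])
    model = {}
    for c in CLASSES:
        members = [features for features, label in items if label == c]
        # Laplace-smoothed estimates (smoothing is an assumption, not in equation 30).
        prior = (len(members) + 1) / (len(items) + len(CLASSES))
        cond = [
            (sum(f[l] for f in members) + 1) / (len(members) + 2)
            for l in range(n_features)
        ]
        model[c] = (prior, cond)
    return model


def classify(model, features):
    """Return the class with the highest score of equation 29 (in log space)."""
    best, best_score = None, -math.inf
    for c, (prior, cond) in model.items():
        score = math.log(prior)
        for value, p in zip(features, cond):
            score += math.log(p if value else 1.0 - p)
        if score > best_score:
            best, best_score = c, score
    return best


# Features: (user A liked, user A disliked, user B liked, user B disliked).
labeled = [
    ((1, 0, 1, 0), "lik"),
    ((1, 0, 0, 1), "lik"),
    ((0, 1, 0, 1), "dis"),
    ((0, 1, 1, 0), "dis"),
]
model = train(labeled)
print(classify(model, (1, 0, 1, 0)), classify(model, (0, 1, 0, 1)))
```

In this toy data the active user agrees with user A on every item, so user A's bits carry most of the evidence.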
To determine the most likely class of a new item for the active user, the probability of each class is computed and the item is assigned to the class with the highest probability. Items that are classified into the like class are aggregated in a recommendation list.

Association Rule Mining

Within the context of using association rules to derive top-N recommendations, Lin et al. [39] developed a method for collaborative recommendation based on association rule mining. Given a set of user transactions, an association rule is a rule of the form $X \to Y$ where both $X$ and $Y$ are sets of items. The standard problem of mining association rules is to find all association rules that are above
a certain minimum support and confidence for the user⁴. The recommendation strategy is based on mining two types of associations: user associations (where both $X$ and $Y$ are sets of users) and item associations (where both $X$ and $Y$ are sets of items). To produce recommendations, user and item associations are combined in the following way: if user association rule mining reaches the minimum support, recommendations are based on user associations; otherwise item associations are used to compute recommendations. Mobasher et al. [45] also presented an algorithm for recommending additional webpages to be visited by a user based on association rules. In this approach, the historical information about users and their web-access patterns was mined using a frequent itemset discovery algorithm and was used to generate a set of high-confidence association rules. The recommendations were computed as the union of the consequents of the rules that were supported by the pages visited by the user. In the same context, Demiriz et al. [20] studied the problem of how to weight the different rules that are supported by the active user to generate recommendations. Each item the user did not select was scored by finding the corresponding rules and aggregating the scores between the rules and the active user. These scores are computed by multiplying the similarity between the active user and the rule by the confidence of the rule. To compute the similarity between the active user and the rules, a Euclidean distance was used. The approach was compared with both the user-based scheme and the dependency network-based algorithm [26]. Experiments showed that the proposed association rule-based scheme is superior to dependency networks but inferior to the user-based schemes.

Eigentaste

Goldberg et al. [24] proposed a collaborative filtering algorithm that applies a dimensionality reduction technique (Principal Component Analysis, PCA) for clustering users and computing recommendations quickly. PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. It can be applied to collaborative filtering to find a transformed representation of the user-item matrix:

$$R' = R \cdot W \qquad (31)$$

where $W$ is an orthogonal matrix. By keeping the $q$ lower-order principal components and ignoring higher-order ones, Goldberg et al. [24] used the resulting 'principal' transformation matrix $W_q$ to cluster users in a low-dimensional space and compute recommendations by aggregating ratings from users in the same cluster. The resulting algorithm, Eigentaste, is essentially a user-based approach in which users are clustered based on their representations in the transformed space.

⁴ Further details about association rule mining and algorithms can be found in Lin et al. [40].
The resulting algorithm is as good as the classical user-based approach in terms of accuracy, but the computation of recommendations is much faster and more scalable.

Probabilistic Memory-Based

Probabilistic memory-based collaborative filtering (PMCF) was proposed by Yu et al. [60] as an efficient approach that generates predictions from a carefully selected small subset of the overall database of user ratings (the profile space). The algorithm is similar to a memory-based approach but uses a probabilistic approach to build a compact model from which recommendations are generated. This probabilistic approach assumes that user $k$'s real ratings can be described as a vector $x_k = \{x_{k,i};\ i = 1, 2, \ldots, m\}$ that encodes the underlying, 'true' preferences of the user (i.e. his/her personality). Assuming a generative probabilistic model, the ratings of an active user $a$ are generated based on a probability density given by:

$$p(a \mid \mathcal{P}) = \sum_{l=1}^{|\mathcal{P}|} p(a \mid x_l) \cdot p(x_l \mid \mathcal{P}) = \frac{1}{|\mathcal{P}|} \sum_{l=1}^{|\mathcal{P}|} p(a \mid x_l) \qquad (32)$$
where $\mathcal{P}$ is the profile space, which consists of a subset of rows of the original rating matrix $R$. Assuming that ratings on individual items are independent given a profile $x_l$, the probability of observing the active user's ratings $a$ if the user has the prototype profile $x_l$ is

$$p(a \mid x_l) = \prod_{j=1}^{m} p(r_{k,j} = a_j \mid r_{k,j}^{true} = x_{l,j}) \qquad (33)$$

Both Yu et al. [60] and Pennock et al. [48] assume that users report ratings for items they have selected with Gaussian noise. This means that user $k$'s reported rating for item $i$ is drawn from an independent normal distribution with mean $r_{k,i}^{true}$:

$$p(r_{k,i} = x \mid r_{k,i}^{true} = y) \propto e^{-(x-y)^2 / 2\sigma^2} \qquad (34)$$

where $\sigma$ is a free parameter. The posterior density of the active user's ratings on not yet rated items, $a^n$, based on the ratings the user has already specified, $a^r$, can be computed using equation 32 and gives:

$$p(a^n \mid a^r, \mathcal{P}) = \frac{p(a^n, a^r \mid \mathcal{P})}{p(a^r \mid \mathcal{P})} = \frac{\sum_{l=1}^{|\mathcal{P}|} p(a^n \mid x_l) \cdot p(a^r \mid x_l)}{\sum_{l=1}^{|\mathcal{P}|} p(a^r \mid x_l)} \qquad (35)$$

With this probabilistic model, predictions for the active user are computed by combining the predictions based on other prototype users $x_l$, weighted by their degree of like-mindedness to the active user.

2.3 Limitations of Collaborative Filtering
Pure collaborative filtering does not suffer from some of the problems that content-based recommenders do. For instance, content-based recommenders require explicit textual
information, which may not be available in some domains (multimedia recommendation, etc.). Since collaborative filtering systems use other users' ratings, they can deal with any kind of item, whether or not content information is available. Besides, content-based systems generally recommend items that score highly against the user's profile, so that only items that are very similar to those already rated high will be recommended. In contrast, CF recommenders are able to recommend items that are very dissimilar to those already seen in the past. Despite its popularity and advantages over content-based filtering, pure CF has several shortcomings:

Sparsity. This problem has been identified as one of the main technical limitations of CF. Commercial recommender systems are used to evaluate large collections of items [1, 3, 4] in which even very active users may have purchased less than 1% of the items (1% of 2 million movies is 20,000 movies!). This implies that memory-based recommender systems may be unable to make any recommendations, or that their accuracy may be poor. Even very active users rate only a small fraction of the available items and, conversely, even very popular items have been rated by only a few users. As a consequence, the similarity between two users may be undefined, making CF useless. Even if the similarity can be evaluated, it may not be reliable if there is not enough information.

Cold start problem. CF requires users to rate a sufficient number of items before getting accurate and reliable recommendations. Therefore, unless the user rates a substantial number of items, the recommender system will not provide accurate results. This problem applies to new users but also to non-regular users (with rare tastes), for whom similarities cannot be computed with sufficient reliability.

New item problem. Collaborative filtering algorithms rely only on users' preferences to make recommendations. Therefore, in a situation in which new items are added regularly, they cannot be recommended until they have been rated by a certain number of users.

Scalability. The computational complexity of collaborative, memory-based methods grows linearly with the number of users, which in typical commercial applications can reach several millions. In this situation the recommender could suffer serious scalability problems, and algorithms may have performance problems with individual users for whom the system has large amounts of information.

Different memory-based algorithms have been proposed to address the problems of scalability and sparsity. For instance, Sarwar et al. [53] proposed the item-based algorithm to address the scalability problems of the user-based approaches, and Aggarwal et al. [9] and Papagelis et al. [46] proposed different graph-based approaches to exploit transitive relations among users. To address the new user problem, Rashid et al. [49] and Yu et al. [60] proposed different techniques based on item popularity, item entropy and user personalization to determine the best items for a new user to rate. Dimensionality reduction techniques such as Singular Value Decomposition could reduce the dimensionality of the original sparse
matrix [14, 52] and provide faster recommendations. Therefore, model-based approaches can partially address some limitations of memory-based collaborative filtering, such as sparsity and scalability, but others, such as the new item problem, remain unsolved.
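The sparsity limitation can be made concrete with a few lines of code. The sketch below, with hypothetical function names and toy data, computes the density of a small rating matrix and the set of co-rated items on which any user-to-user correlation would have to be based.

```python
def density(ratings, n_items):
    """Fraction of the user-item matrix that holds an observed rating."""
    observed = sum(len(user_ratings) for user_ratings in ratings.values())
    return observed / (len(ratings) * n_items)


def co_rated(ratings, k, l):
    """Items rated by both users; the similarity is undefined when this is empty."""
    return set(ratings[k]) & set(ratings[l])


ratings = {
    "ana": {"i1": 5, "i2": 3},
    "ben": {"i3": 4},
    "eva": {"i2": 2, "i4": 5},
}
print(density(ratings, n_items=6))        # 5 observed ratings out of 18 cells
print(co_rated(ratings, "ana", "ben"))    # empty: no basis for a correlation
print(co_rated(ratings, "ana", "eva"))    # a single shared item: weak evidence
```

Even in this tiny example one pair of users shares no items at all and another shares a single item, which is the situation described above as undefined or unreliable similarity.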
3 Hybrid Filtering

Different experiments have shown that collaborative filtering systems can be enhanced by incorporating content-based characteristics. Hybrid recommender systems combine different types of recommender systems, usually collaborative and content-based filtering methods, and are essentially intended to avoid the limitations of both technologies. There are different ways in which content-based and collaborative filtering methods can be combined. For instance, collaborative filtering can be enhanced with content-based characteristics, results from separate collaborative and content-based recommenders can be merged into a single recommendation, or recommendations can be generated from a unifying recommendation model. There are also other recommender systems that are basically content recommenders whose recommendations are enhanced with collaborative features, but they are beyond the scope of this text. Table 3 summarizes some of the hybrid approaches that are explained here.

Table 3. Summary of different hybrid-based algorithms based on the different components of the recommendation process

Enhance collaborative filtering with content-based characteristics

| Algorithm | Input data | CBF component | CF component |
|---|---|---|---|
| Content-boosted CF | User-item ratings; Item features | Bayesian text classifier: build a pseudo-rating matrix using content features | Memory-based CF: build predictions from the pseudo-rating matrix using a user-based approach |
| Feature-based CF | User-item matrix; Item-feature matrix | Content matching: neighborhood formation based on item features | Memory-based CF: filter recommended items using an item-based approach |

Combine separate recommenders

| Algorithm | Recommender components | Prediction computation |
|---|---|---|
| Weighted CBF-CF | CBF: content matching (match user profiles to item contents); CF: memory-based (build predictions from the user-rating matrix using a user-based approach) | Linear combination; combination weights adjusted from data |
| Similarity fusion | CF: memory-based (probabilistic user-based); CF: memory-based (probabilistic item-based) | Linear combination; combination weights adjusted from data |

Develop a unifying recommendation model

| Algorithm | Input data | Background model | Prediction computation |
|---|---|---|---|
| Spread activation | User-item matrix; Item contents; Demographic data | 2-layer graph; enhanced graph with content features | Direct retrieval; Association mining; Spread-activation → Hopfield Net algorithm |
3.1 Enhance Collaborative Filtering with Content-Based Characteristics
Content-based recommender systems evolved from information retrieval [10] and information filtering [12] systems and are designed mostly to recommend text-based items. In content-based filtering, items are recommended to a certain user based on similarities between new items and the corresponding user profile. The content of these items is usually described by keywords. User profiles contain information about the users' tastes, preferences and needs that can be extracted from different types of information: the collection of items the user has rated high in the past, keywords that represent topics of interest, text queries, transactional information from web logs, etc. Despite the significant and early advancements made in information retrieval and information filtering, the importance of several text-based applications and new improvements such as the use of user profiles, content-based recommenders suffer from several limitations. Limited understanding of users and items and overspecialization are some examples. But content-based filtering may be used in conjunction with collaborative filtering to enhance recommendations. Several hybrid recommender systems essentially use collaborative filtering techniques and maintain content-based user profiles that store useful information and from which user similarities are computed. This makes it possible to overcome problems such as sparsity and provides a mechanism to recommend new items to users not only when they are rated highly by similar users but also when they score highly against the user profile, so that both the new item and the cold start problems can be tackled.

Content-Boosted CF

Melville et al. [43] proposed a system to overcome two of the main limitations of pure collaborative filtering, namely sparsity and the new user problem. Their method, content-boosted collaborative filtering (CBCF), uses a pure content-based predictor to convert a sparse user ratings matrix into a full ratings matrix and then uses pure collaborative filtering to provide recommendations. The content-based predictor was implemented using a Bayesian text classifier that learned a user model from a set of rated items. The user model was used to predict ratings of unrated items and create a pseudo-ratings matrix as follows:

$$r'_{k,j} = \begin{cases} r_{k,j} & \text{if } r_{k,j} \neq \emptyset \\ c_{k,j} & \text{if } r_{k,j} = \emptyset \end{cases} \qquad (36)$$

where $c_{k,j}$ is the rating of item $j$ for user $k$ predicted by the pure content recommender. The collaborative filtering component was implemented following the user-oriented approach (equation 2) with a slightly modified version⁵ of the Pearson correlation (equation 3) to compute user similarity from the dense representation $R'$. Further details can be found in [43].

⁵ They multiplied the correlation by a significance weighting factor (see [27]) that gives less confidence to correlations computed from users with few co-rated items.
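The construction of the pseudo-ratings matrix in equation 36 can be sketched as follows. The function names and toy data are hypothetical, and the predictor used here (the user's mean rating) is only a stand-in for the Bayesian text classifier of [43].

```python
def pseudo_ratings(ratings, items, content_predict):
    """Build the dense matrix of equation 36 from sparse ratings.

    ratings: dict user -> dict item -> rating (missing entries are unrated).
    content_predict: hypothetical callable (user, item) -> predicted rating.
    """
    dense = {}
    for user, user_ratings in ratings.items():
        dense[user] = {
            item: user_ratings[item] if item in user_ratings
            else content_predict(user, item)
            for item in items
        }
    return dense


ratings = {"ana": {"i1": 5, "i2": 3}, "ben": {"i2": 4}}
items = ["i1", "i2", "i3"]


def mean_predictor(user, item):
    """Toy stand-in for the content-based predictor: the user's mean rating."""
    values = ratings[user].values()
    return sum(values) / len(values)


dense = pseudo_ratings(ratings, items, mean_predictor)
print(dense)
```

Observed ratings are kept unchanged and only the gaps are filled, so any user-based algorithm can then be run on `dense` without special handling of missing values.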
Soboroff et al. [57] described a similar hybrid filtering technique that combined collaborative data with content descriptions of items to generate recommendations. The approach used Latent Semantic Indexing (LSI) with SVD to create a simplified view of a user-profile matrix built from relevant item contents.

Feature-Based CF

Han and Karypis [25] presented several feature-based recommendation algorithms to enhance collaborative filtering with content-based filtering in contexts in which there is not enough historical data for measuring similarity between items, i.e. frequently changing items and product catalogs with tailored items. In the first context, using content-based filtering, a set of similar items was computed by matching the set of items selected by the active user with the items in the catalog. Using an item-oriented approach to collaborative filtering, recommended items were selected and the collection of most representative features was extracted as the recommended features. From the real catalog of items, a top-N recommendation was generated by selecting products with these recommended features. An alternative method, using association rules, was proposed to generate recommendations in this context. A similar approach, based on feature recommendation, was presented for the context of product catalogs with custom items (see [25] for details).

3.2 Combine Separate Recommenders
Weighted CBF-CF

One of the first approaches that combined recommenders was proposed by Claypool et al. [17]. Rating predictions were obtained from separate content-based and collaborative recommenders and merged into one recommendation using a linear combination of ratings, keeping the basis of each approach separate. To perform the content-based filtering, each user is represented with a three-component profile that gathers information about user preferences for items, explicit keywords from search queries and implicit keywords extracted from highly rated items. Content-based filtering is performed by matching the active user's profile to the textual representation of new items. Collaborative filtering is performed following a user-based approach (see equation 2) with weights computed using the Pearson correlation (equation 3). The weights of the linear combination are dynamically adjusted to minimize past rating prediction errors. Their approach realizes the strengths of content-based filtering and mitigates the effects of both the sparsity and the new item problems. The combination of content-based and collaborative filtering results can be tuned either to avoid the cold start problem, by giving more weight to the content-based component for new users, or to weight the collaborative component more heavily as the number of users and ratings for each item increases. A similar approach was presented by Pazzani [47]. This hybrid recommender combined recommendation results from three different approaches: content-based,
102
F.P. Lousame and E. S´ anchez
demographic-based and collaborative. Content-based was performed by applying a content-based learning algorithm, called Winnow [42], that estimated the relative weights of each keyword of the content model of an item so that the aggregation of these weights was highly correlated with the rating associated by the user. Similarly, demographic-based recommendations were computed by applying the Winnow algorithm to demographic features that represent users. Finally collaborative filtering was performed following a pure user-oriented approach (equation 2 combined with 3). The combination was shown to have the potential of improving the precision of recommendations. Similarity Fusion Most collaborative recommenders [15, 53] produce recommendations based only on partial information from the data in the user-item matrix (using either correlation between user data or correlation between item data). Wang et al., [58] recently proposed a probabilistic approach to exploit more of the data available in the user-item matrix, by combining all ratings with predictive value into a single recommendation. The confidence of each individual prediction can be estimated by considering its similarity towards both the test user and the test item. The overall prediction is made by averaging the individual ratings weighted by their confidence. The confidence of each rating is computed using a probabilistic approach (equation 25) that combines three different probabilistic models that estimate predictions based on user similarity, item similarity and rating similarity. Two linear combination weights, λ and δ, control the importance of the different prediction sources and were determined experimentally. This similarity fusion scheme was proved to improve prediction accuracy in collaborative filtering and, at the same time, was more robust against data sparsity. For further details about implementations and results, read [58]. 3.3
Develop a Unifying Recommendation Model
Spread-Activation. This graph-based algorithm was proposed to provide a more comprehensive representation of the data gathered in the user-item matrix and to support flexible recommendations by using different strategies [34, 33, 32]. The approach is hybrid in the sense that both collaborative and content features are merged to generate recommendations, but also in the way that different collaborative filtering strategies can be combined to find relevant items. Recommendations are generated from a background two-layer graph-theoretic representation of the user-item matrix. Nodes represent users and items. Input information about users (demographic data, answers to questionnaires, query inputs, web usage patterns, etc.), items (textual descriptions, etc.) and transactions (purchase history, explicit ratings, browsing behavior, etc.) is transformed into links between nodes that capture user similarity, item similarity or associations between users and items, respectively. This results in a very flexible recommendation engine that may combine different recommendation methods, different types of information to model the links and different measures to compute the strength of these relations:
• Direct retrieval. Generates recommendations by retrieving items similar to the active user's previous selections and items selected by users similar to the active user. Depending on the algorithm used to form neighbors from the graph, the engine can generate content-based, collaborative or hybrid recommendations.
• Association mining. Generates recommendations by first building a model of association rules. Two different types of association rules are generated: content-based rules, built from content similarity among items; and transaction-based rules, built from transaction history data. Depending on the type of association rules considered, the engine can produce content-based, collaborative or hybrid recommendations.
• High-degree association. Recommendations are generated from a graph that combines information from the previous approaches and uses the Hopfield net algorithm [16] to produce recommendations. After setting the activation level that corresponds to the active user to $\mu_{u_k} = 1$, the algorithm repeatedly performs the following activation procedure

\[ \mu_j(t+1) \propto \sum_{i=0}^{n-1} t_{ij}\,\mu_i(t) \tag{37} \]

until the activation levels of all nodes converge. Here $t_{ij}$ represents the weight of the link between nodes $i$ and $j$. Depending on the nature of the links that are enabled, the algorithm can produce content-based, collaborative or hybrid recommendations.
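The activation procedure of equation 37 can be illustrated with a minimal sketch. It is not the implementation of [16] or [32]: the source only states that the update is proportional to the weighted sum, so the damping factor `decay`, the clamping of the active user's node to 1, the stopping tolerance and the toy graph are all assumptions introduced here so that the iteration converges and stays personalized.

```python
def spread_activation(weights, active, decay=0.5, tol=1e-9, max_iters=1000):
    """Iterate mu_j(t+1) = decay * sum_i t_ij * mu_i(t) until activations settle.

    weights[i][j] holds the link weight t_ij; `active` is the index of the
    active user's node. `decay` and the clamping below are assumptions of
    this sketch, not part of equation 37.
    """
    n = len(weights)
    mu = [0.0] * n
    mu[active] = 1.0  # activation level of the active user is set to 1
    for _ in range(max_iters):
        nxt = [decay * sum(weights[i][j] * mu[i] for i in range(n))
               for j in range(n)]
        nxt[active] = 1.0  # assumption: the source node stays clamped
        if max(abs(a - b) for a, b in zip(nxt, mu)) < tol:
            return nxt
        mu = nxt
    return mu


# Hypothetical toy graph: two users, three items, symmetric links.
NODES = ["u0", "u1", "item_A", "item_B", "item_C"]
LINKS = {(0, 2): 1.0, (1, 2): 1.0, (1, 3): 1.0}  # u0-A, u1-A, u1-B

n = len(NODES)
weights = [[0.0] * n for _ in range(n)]
for (i, j), w in LINKS.items():
    weights[i][j] = weights[j][i] = w

activation = spread_activation(weights, active=0)

# Rank the items the active user has not selected yet by activation level.
seen = {2}
ranking = sorted((j for j in range(2, n) if j not in seen),
                 key=lambda j: activation[j], reverse=True)
print([(NODES[j], round(activation[j], 3)) for j in ranking])
```

In this toy graph item_B receives activation through the path u0, item_A, u1, item_B, while the isolated item_C receives none, so item_B is ranked first. Enabling only user-item links, as here, gives a collaborative recommendation; adding item-item content links to `LINKS` would make the same routine hybrid.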
4 Evaluation of Recommender Systems

4.1 Datasets
To evaluate the performance of recommender systems, a number of different datasets have been used:
• EachMovie was one of the most widely used datasets in recommender systems but it is no longer available for download. It contained 2,811,983 ratings (discrete values from 0 to 5) entered by 72,916 users for 1,628 different movies.
• MovieLens has over 10 million ratings and 100,000 tags for 10,681 movies by 71,567 users. Ratings are on a scale from 1 to 5. It contains additional data about movie titles and genres. Tags are user-generated metadata about the movies.
• Jester contains about 4.1 million continuous ratings (ranging from -10.00 to +10.00) of 100 jokes from 73,421 users, collected between April 1999 and May 2003.
• Book-Crossing was collected between August and September 2007 from the Book-Crossing community [2]. It contains 278,858 users providing 1,149,780 ratings about 271,379 books. User demographic data and content information such as title, author and year of publication are also provided. Ratings may be explicit (expressed on a scale from 1 to 10) or implicit.
• Netflix is a movie rating dataset collected between October 1998 and December 2005 that contains over 100 million ratings from 480,000 randomly chosen Netflix [6] users over 17,000 movie titles. Ratings are on a scale from 1 to 5. It also contains the title and year of release of each movie.

Some researchers [9, 21] have also evaluated recommender systems using synthetic datasets in order to characterize the proposed recommendation algorithms in a controlled setting.

4.2 Accuracy Evaluation Metrics
Research methods in recommender systems include several types of measures for evaluating the quality of recommendations. Measures can be categorized mainly into two classes: predictive accuracy metrics and decision-support accuracy metrics.
• Predictive accuracy metrics evaluate the accuracy of a system by comparing the numerical recommendation scores (predictions) against the real user ratings for each user-item interaction in the test dataset. Mean Absolute Error (MAE) is one of the most frequently used.
• Decision-support accuracy metrics evaluate how effective a recommendation engine is at helping a user select high-quality items from the set of all items. These metrics consider the prediction process as a binary operation (items are predicted as either relevant or not). The most commonly used decision-support accuracy metric is Precision/Recall.

Mean Absolute Error and Related Measures. MAE is a widely used measure of the deviation of recommendations from their true user-specified values. It is computed by averaging the absolute errors $|r_i - \nu_i|$ over the $N$ rating-prediction pairs,

\[ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |r_i - \nu_i| \tag{38} \]

The lower the MAE, the better the accuracy of the generated predictions. Some research papers compute the Normalized MAE, or NMAE, which is the regular MAE divided by the rating scale. Similar measures are the Mean Squared Error (MSE), which is computed by averaging squared errors; and the Root Mean Squared Error (RMSE), which is computed from MSE by taking the square root.
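As a small worked illustration of equation 38 and its related measures, the following sketch computes MAE, NMAE, MSE and RMSE for a handful of rating-prediction pairs. The function names and the sample values are made up for the example, and the "rating scale" used by NMAE is assumed here to mean the width of the rating interval (maximum minus minimum rating).

```python
import math


def mae(ratings, predictions):
    """Mean Absolute Error: average of |r_i - v_i| over all pairs."""
    return sum(abs(r - v) for r, v in zip(ratings, predictions)) / len(ratings)


def nmae(ratings, predictions, r_min, r_max):
    """MAE divided by the rating scale (assumed to be r_max - r_min)."""
    return mae(ratings, predictions) / (r_max - r_min)


def mse(ratings, predictions):
    """Mean Squared Error: average of squared errors."""
    return sum((r - v) ** 2 for r, v in zip(ratings, predictions)) / len(ratings)


def rmse(ratings, predictions):
    """Root Mean Squared Error: square root of the MSE."""
    return math.sqrt(mse(ratings, predictions))


# Hypothetical test pairs on a 1-to-5 rating scale.
ratings = [4, 2, 5, 3]
predictions = [3.5, 2.5, 4.0, 3.0]

print("MAE :", mae(ratings, predictions))         # absolute errors 0.5, 0.5, 1, 0
print("NMAE:", nmae(ratings, predictions, 1, 5))
print("MSE :", mse(ratings, predictions))
print("RMSE:", round(rmse(ratings, predictions), 4))
```

Because MSE and RMSE square the errors, the single error of 1.0 weighs more in them than in MAE, which is why the two families of measures can rank algorithms differently.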
Precision/Recall Measures. Precision and recall are the most popular metrics for evaluating Information Retrieval systems and they have also been used in collaborative filtering by many authors. If $L = L_r + L_{nr}$ is the list of items that are recommended to the active user and $H = H_r + H_{nr}$ denotes the rest of the items in the dataset, Precision and Recall are computed as

\[ \mathrm{Precision} = \frac{L_r}{L_r + L_{nr}}, \qquad \mathrm{Recall} = \frac{L_r}{H_r + L_r}. \tag{39} \]

Subscripts 'r' and 'nr' stand for 'relevant' and 'not relevant', respectively.

4.3 Other Quality Metrics
The first recommender systems primarily focused on exploring different techniques to improve prediction accuracy. Other important aspects, such as scalability, adaptation to incoming data and comprehensibility, have received little attention. Recommender systems must provide not only accuracy, but also usefulness. These quality aspects can be quantified through different measures [28] such as coverage (the rate of items for which the system is capable of making recommendations), adaptation/learning rate (how the recommender improves as new data is gathered), novelty/serendipity (how good the recommender is at giving non-obvious results) or confidence (measured, for instance, as the percentage of recommendations that are accepted by users).
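The decision-support metrics of Section 4.2 and the coverage measure mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and item identifiers are hypothetical, the set of relevant items is assumed to be known for the test user, and coverage is taken as the fraction of catalog items for which the system can produce a prediction.

```python
def precision_recall(recommended, relevant):
    """Return (precision, recall) for one recommendation list.

    hits corresponds to L_r, len(recommended) to L_r + L_nr and
    len(relevant) to H_r + L_r in the chapter's notation.
    """
    recommended = list(recommended)
    relevant = set(relevant)
    hits = sum(1 for item in recommended if item in relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def coverage(predictable_items, catalog):
    """Fraction of catalog items for which a recommendation can be made."""
    catalog = set(catalog)
    return len(set(predictable_items) & catalog) / len(catalog)


# Hypothetical data: four recommended items, three relevant ones.
recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
precision, recall = precision_recall(recommended, relevant)
print("precision:", precision, "recall:", round(recall, 3))

# Hypothetical catalog of 10 items, 8 of which have enough data to be scored.
catalog = list("abcdefghij")
predictable = list("abcdefgh")
print("coverage:", coverage(predictable, catalog))
```

Two of the four recommended items are relevant, and two of the three relevant items were recommended, so precision and recall differ even on the same list; lengthening the list typically trades one for the other.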
5 A Taxonomy for CF

Several works have proposed taxonomies that classify recommender systems according to different aspects. Huang et al. [33] presented a taxonomy of recommender systems based on 3 dimensions: the system input, the representation methods and the recommendation approach. Table 4 summarizes this taxonomy. Adomavicius et al. [7] categorized recommender systems using only 2 dimensions: the recommendation approach and the recommendation technique. Based on the recommendation approach, recommender systems were classified as content-based, collaborative or hybrid; and based on the type of recommendation technique used for rating estimation, they were classified as heuristic-based or model-based. Table 5 shows this second classification.

However, the classification schemes presented so far do not differentiate systems by their real contributions and originality, but by their recommendation approach or technique (which in most cases is irrelevant for the user). Aspects such as the associations that are modeled among the entities and how they are built are essential to gain a deep understanding of how these systems work and what their real benefits and requirements are. In this section, the aforementioned classification schemes are extended by proposing a taxonomy that classifies algorithms according to 4 main aspects: (1) the entities involved and their representation, (2) the associations among the entities, (3) the techniques used to build the relations, and (4) the recommendation method.

Table 4. Recommender systems' taxonomy according to Huang et al. [33]. Recommender systems are classified in terms of the input data, its representation and the recommendation approach
System input Type
Data Content
User Item
Factual data Factual data
Transaction
Transactional data
Acquisition
Explicit or implicit feedback
Data representation Type User Item Transaction
Method User attributes, items associated, transactions, item attributes Item attributes, users associated Transaction attributes, items Recommendation approach Method
Type Basis
Technique
Knowledge engineering Content-based Collaborative
Hybrid
kNN, Classification User-based, Item-based, Transaction-based
kNN, Association rule mining, Machine learning
CBF + CF
Merge results from different approaches, CF augmented with content information, CBF augmented with CF, Comprehensive model
CF + Knowledge engineering
Table 5. Recommender systems' taxonomy presented by Adomavicius et al. [7]. Recommender systems are classified according to the recommendation approach and the recommendation technique

Recommendation approach | Heuristic-based technique | Model-based technique
Content-based | TF-IDF, Clustering | Bayesian classifiers, Clustering, Decision trees, Artificial neural networks
Collaborative | kNN, Clustering, Graph theory | Bayesian networks, Clustering, Artificial neural networks, Linear regression, Probabilistic models
Hybrid | CBF+CF: Linear combination of predicted ratings, Various voting schemes, Incorporating CBF as part of the heuristic for CF | CBF+CF: Incorporating CBF as part of the model for the other, Building one unifying model
5.1 Entities and Representation
Recommender systems studied so far generate recommendations by using information modeled in 2 different entities (user and item; see footnote 6) and in their relations. The entity user contains characteristics that differentiate the users of the system. The entity item models information that characterizes and identifies each single item. In a recommendation problem, entities may be represented with different types of information, depending on the requirements of the recommendation technique. Users are usually represented with a unique id, but some recommender systems may use additional factual information such as demographic information (name, gender, date of birth, address, etc.), textual preferences about the features of the items or keywords that describe general user interests. Depending on the recommendation approach, items may be represented only by a unique id (which is the most common approach in CF) or by content information, usually in the form of textual attributes (for content-based or hybrid recommenders) such as brand, price, title or description.

5.2 Associations among Entities
The term association or relation describes a certain degree of dependence between entities. The majority of the approaches to the problem of recommendation assume a data representation for each entity and focus on a single relation between the entities, commonly the one derived from rating activity. But other relations may be examined to build richer models. In the context of recommender systems, relations may record the user's explicit expression of interest in an item, such as a rating or a comment; or the implicit interaction between users and items, including examination (selection, purchase, etc.), retention (annotation, print, etc.) and reference, for instance. These relations are explored in order to infer information about user tastes, item similarities, etc. and to generate recommendations. Table 6 summarizes some examples of associations.

5.3 Association Building Techniques
Recommender systems can be distinguished by the methods involved in building the associations among entities. Associations can be obtained via 2 mechanisms: (1) explicit, using the information provided by users directly, such as ratings or comments, which is usually stored in the user-item matrix; or (2) implicit, by computing new associations from existing ones or from sources such as purchase history or user behavior patterns. Implicit associations are derived using different techniques: knowledge engineering (case-based reasoning, etc.), neighborhood formation techniques (kNN, clustering, etc.), association rule mining, machine learning, etc.

6 An entity is defined as an object that has a distinct, separate existence. It models a fictitious or a real thing and may have stated relations to other entities.
Table 6. Examples of relations among entities. Symbols E, I and D denote an explicit, implicit and derived association, respectively.

Type | Association | Kind | Description | Examples
U-I | Associated items | E | Explicit expression of interest for items | User ratings and comments
U-I | Associated items | I | Implicit interaction between users and items | Examination of items (selection, purchase); Retention of items (save, annotate, print); Reference to items
U-I | Item attributes | D | Expression of user preferences or satisfaction |
U-U | User similarity | D | Expression of user similarity, trust or confidence |
I-I | Item similarity | D | Expression of item similarity or dependence |

5.4 Recommendation Method
Different recommender systems produce recommendations based on different techniques. As a result, recommendations may have slightly different semantics. For instance, the user-based approach produces recommendations by 'recommending items selected (liked) by users similar to the active user', whereas an item-oriented approach will produce recommendations based on 'items similar to those the active user already selected (liked)'. Therefore, recommender systems could be further classified depending on the meaning of the recommendations they produce:
• User similarity. Recommendations are generated by exploiting user similarity patterns, which are computed using different metrics and sources of information (the way items are rated, the profiles of preferences and tastes of the users, etc.).
• Item similarity. In this case recommendations are computed by selecting a neighborhood of items with a certain degree of similarity. Again, the similarity between items can be computed using different metrics and sources of information (item ratings, selections made by the users, inherent item features).
• Item features. Recommendations are generated by matching textual item features and textual user preferences stored in user profiles.
• Item association. Recommendations are produced by exploring item association rules, which are frequently derived from user selection patterns.
• Item relevance. This method is not used much since it does not produce personalized recommendations, but it may be useful to address the cold-start problem or to get a kind of 'smart' set of items from which the recommender can start building the collaborative user profile. Recommendations are built from relevance statistics of items: the most popular items, the top-N rated items, etc. could be recommended to new users.
• Expert's relevance. This method builds recommendations by analyzing statistics about users as experts in recommending to other users. Following this method, a top-N list of items could be built from the items liked by users who usually are good mentors (experts) to other users.
• Hybrid method. In this case, recommendations are built by combining some of the previous methods.

Table 7. Proposed taxonomy, which classifies recommender systems according to the entities and their representation, the associations among these entities, the association building techniques and the recommendation method

Entities and representation
User | Factual data | Demographic information (name, gender, birth date, address, etc.)
User | Textual preferences | Features of the items or keywords that describe general user interests
Item | Content information | Textual attributes (brand, price, title or description)

Associations among entities: see Table 6

Association building
Behavior based | Explicit | User-item matrix | Interactions (binary), Satisfaction (ratings)
Behavior based | Implicit | Behavior patterns | Examination, Retention, Reference
Inferred | Knowledge engineering |
Inferred | Neighborhood formation | kNN, Clustering
Inferred | Association rule mining |
Inferred | Probabilistic models |

Recommendation method
User similarity | Items selected (liked) by users similar to the active user
Item similarity | Items that are similar to those selected (liked) by the active user
Item features | Recommend items based on the similarity between the active user's profile and the textual content of the items
Item association | Items highly associated with items selected (liked) by the active user
Item relevance | Most popular items, the top-N rated items, etc. to the active user
Expert relevance | Items from popular users, whose recommendations are universally accepted
Hybrid method | Recommend items by combining some of the previous methods
Depending on the type of associations explored to compute recommendations and on the information used to build the relations, the association building techniques can lead to the different recommendation approaches: knowledge engineering, collaborative filtering, content-based filtering or hybrid filtering. Following this taxonomy definition, table 8 summarizes some of the recommender systems previously explained.
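To make the four aspects of the taxonomy concrete, the sketch below encodes them as a small record type and describes two classic systems with it. The class and field names are hypothetical and chosen only for illustration; the two example entries follow the user-based and item-based descriptions given in this chapter.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Method(Enum):
    """Recommendation methods listed in Section 5.4."""
    USER_SIMILARITY = "user similarity"
    ITEM_SIMILARITY = "item similarity"
    ITEM_FEATURES = "item features"
    ITEM_ASSOCIATION = "item association"
    ITEM_RELEVANCE = "item relevance"
    EXPERT_RELEVANCE = "expert relevance"
    HYBRID = "hybrid method"


@dataclass(frozen=True)
class Association:
    entities: str     # "U-I", "U-U" or "I-I"
    description: str  # what the relation expresses
    building: str     # how it is obtained (explicit, kNN, clustering, ...)


@dataclass(frozen=True)
class RecommenderProfile:
    name: str
    user_representation: str
    item_representation: str
    associations: Tuple[Association, ...]
    method: Method


user_based = RecommenderProfile(
    name="User-based CF",
    user_representation="unique id",
    item_representation="unique id",
    associations=(
        Association("U-I", "numeric ratings", "explicit"),
        Association("U-U", "user similarity, based on ratings",
                    "memory-based heuristic (e.g. Pearson correlation)"),
    ),
    method=Method.USER_SIMILARITY,
)

item_based = RecommenderProfile(
    name="Item-based CF",
    user_representation="unique id",
    item_representation="unique id",
    associations=(
        Association("U-I", "numeric ratings", "explicit"),
        Association("I-I", "item similarity, based on ratings",
                    "memory-based heuristic (e.g. vector similarity)"),
    ),
    method=Method.ITEM_SIMILARITY,
)


def by_method(profiles, method):
    """Names of the systems whose recommendations carry the given meaning."""
    return [p.name for p in profiles if p.method is method]


print(by_method([user_based, item_based], Method.ITEM_SIMILARITY))
```

Describing systems this way makes it visible that the two approaches share entities and explicit associations and differ only in the derived association (U-U versus I-I) and in the meaning of the resulting recommendations.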
Table 8. Entities and representation, associations among entities, association building techniques and recommendation method for several recommender systems. Unless stated otherwise, both user and item entities are represented with a unique id Entity & Associations → [entity representation]
Association building
Recommendation method
User-based Resnick et al., 1994 Shardanand et al., 1995 Breese et al., 1998
U-I: numeric ratings
Explicit
U-U: user similarity, based on ratings
Memory Heuristic: vector similarity, mean squared difference, Pearson correlation
User similarity · Weighted aggregation of ratings from similar users
Predictability paths Aggarwal et al., 1999
U-I: numeric ratings
Explicit
U-U: predictability conditions, based on interactions
Memory - Predictability condition estimation
Item-based Shardanand et al., 1995 Sarwar et al., 2001
U-I: numeric ratings
Explicit
I-I: item similarity, based on ratings
Memory - Heuristic: vector similarity, constrained Pearson correlation
Cluster-based smoothing Xue et al., 2005
Trust inferences Papagelis et al., 2005 Improved neighborhood Bell et al., 2007 Bayesian networks Breese et al., 1998 Association rule mining Lin et al., 2000 Eigentaste Goldbert et al., 2001
Content-boosted Melville et al., 2002
Similarity fusion Wang et al., 2006
U-I: numeric ratings
Explicit
U-I2 : smoothed ratings
Memory - K-means clustering
U-U: user similarity, based on smoothed ratings
Memory Heuristic: vector similarity, mean squared difference, Pearson correlation
U-I: numeric ratings
Explicit
U-U: user similarity, based on ratings
Memory - Heuristic: propagation of trust and confidence
User similarity · Linear rating transformations and aggregation of ratings from similar users Item similarity · Weighted aggregation of similar item ratings
User similarity · Weighted aggregation of ratings from similar users
User similarity · Weighted aggregation of ratings from trusted users
Item similarity · Weighted aggregation of ratings from similar items
U-I: numeric ratings
Explicit
I-I: item similarity, based on ratings
Memory - Optimization of weights
I-U: instance-based representation
Model Probabilistic Bayesian classifier
Item similarity · Classification
U-U: user associations
Model - Association rule mining
Item association
U-I: numeric ratings
Explicit
U-U: user clustering, based on ratings
Model - PCA
User similarity · Cluster selection + Aggregation of ratings from similar users
I-I: item associations
U-I: numeric ratings
Explicit
U-U: user similarity
Memory - Heuristic: Pearson correlation based on pseudo-ratings
U-I2 : pseudo-ratings → Item content features
Model - Bayesian classifier
U-I: numeric ratings
Explicit
U-U: item similarity, based on ratings
Model Probabilistic Bayesian model
User similarity · Weighted aggregation of ratings from similar users
User similarity · Cluster selection + Aggregation of ratings from similar users
I-I: user similarity, based on ratings U-I: transaction history → Binary transactions U-U: user similarity, based on demographic data → Demographic data Spread-activation Huang et al., 2004
Implicit
Memory - Vector similarity
I-I: item similarity, based on content features → Content features I-I2 : item similarity, based on content features → Content features I-I3 : item similarity, based on transactions → Binary transactions
Model - Association rule mining
Hybrid method: · User similarity · Item similarity · Item association →Hopfield net algorithm
6 Conclusion

The selection of the appropriate algorithm may depend on different aspects, such as the type of information available to represent both users and items, or scalability restrictions. In this section, general guidelines for choosing an algorithm are provided on the basis of the following key aspects: accuracy, meaning of recommendations, scalability and performance, new data, application domain, user activity and prior information.

Accuracy. As a central issue in CF research, prediction accuracy has received much attention and various methods have been proposed to improve it. Still, conventional memory-based methods using the Pearson correlation coefficient remain among the most successful. In domains where content information is available, hybrid methods can provide more accurate recommendations than pure collaborative or content-based approaches (see [11, 47, 57, 43] for empirical comparisons). Figure 2 shows some experimental NMAE results compiled from different research works in different domains.
Fig. 2. Experimental accuracy NMAE results from different research works. Results are shown for different datasets with colored bars.
Meaning of recommendations. As shown in the proposed taxonomy, recommendations can carry slightly different semantics. While user and item similarity are probably the most frequently used recommendation strategies, other methods, such as item association, may be of interest in a recommendation engine as well.

Scalability and performance. Memory-based CF often suffers from slow response times, since each single prediction requires scanning a whole database of user ratings. This is a clear disadvantage when compared to the typically fast responses of model-based CF. Recommending items in real time requires the underlying engine to be highly scalable. To achieve this, recommendation algorithms usually divide recommendation generation into two parts: the off-line and the on-line component. The first is the part of the algorithm that requires an enormous number of operations, and the second is the part that is dynamically computed to provide predictions using data from the stored component. In this sense, model-based approaches may be more suitable in terms of scalability and performance than hybrid and neighborhood-based ones.

New data. With high volumes of new data, model-based approaches have to be retrained and updated too often, which makes them computationally expensive and intractable. In this situation, memory-based solutions can easily accommodate new data by simply storing it.

Application domain. Depending on the application domain, one algorithm may fit better than another. For instance, in domains such as music recommendation, approaches based on content-based filtering are useless and pure collaborative filtering is still the only way to perform personalization. On the contrary, in domains such as movie recommendation, where content information is available, the quality of the recommender will probably be enhanced by adding content-based features.

User activity/sparsity. Users do not show the same degree of activity in all domains. For instance, a movie or music recommendation site may have thousands of transactions per day, while in other domains, such as tourism, users may be less active, which accentuates the sparsity problem. As a result, in low-activity domains, either content-based filtering or hybrid filtering would produce more accurate results than pure collaborative filtering approaches.

Prior information. If an initial preference/rating database is not available, only content-based or hybrid recommenders can cope with both the new user and the new item problems. Learning extensions are essential to select informative query items the user is likely to rate and thus keep the information gathering stage as short as possible. To address the limitations of collaborative filtering, it is often a good idea to ask for the creation of a user profile for each newcomer. This ensures that the new user has the opportunity to rate items which others have also rated, so that there is some commonality among users' profiles.

6.1 Future Directions of CF
Better methods for representing user behavior and product items, more advanced recommendation modeling methods, the introduction of various contextual information into the recommendation process, the use of multicriteria ratings, and the provision of more flexible and less intrusive types of recommendations are some ways to improve recommender systems [55, 21, 7]. The most promising research lines are discussed here:

Context-aware recommenders. Most CF methods use neither user nor item profiles during the recommendation process. Hybrid methods incorporate user and item profiles, but these profiles are still quite simple. New research in context-aware recommenders essentially tries to model additional information that may be relevant to recommendations in different senses: (1) for identifying pertinent subsets of data when computing recommendations, (2) for building richer rating estimation models, or (3) for providing constraints on recommendation outcomes. There are different active research directions in context-aware recommenders, such as: (1) establishing relevant contextual features, (2) advanced techniques for learning context from data, (3) contextual modeling techniques, and (4) developing richer interaction capabilities for context-aware recommender systems (recommendation query languages, intelligent user interfaces).

Flexibility. Flexibility stands for the ability of the recommender system to allow the user to query the system with his/her specific needs in real time. REQUEST (REcommendation QUEry STatements) [8] is a language that allows users to customize recommendations to fit individual needs more accurately. The language is based on a multidimensional data model in which users, items, ratings and other relevant contextual information are represented together following the OLAP-based paradigm. In this sense, the flexibility of recommenders is closely related to context-rich applications. For instance, the query 'recommend me and my girlfriend top-3 movies and moments based on my personal ratings' could be expressed as:

    RECOMMEND Movie,Time TO Peter, Lara
    USING MovieRecommender
    BASED ON PersonalRating
    RESTRICT Companion.Type='Girlfriend'
    SHOW TOP 3
Non-Intrusiveness. Many recommender systems are intrusive in the sense that they get ratings explicitly from users. Other systems get implicit feedback from users, but non-intrusive ratings are often inaccurate and are not as reliable as the explicit ratings provided by users. Minimizing intrusiveness while maintaining the accuracy of recommendations is a critical issue in designing recommender systems: if the system demands bigger user involvement, users are more likely to reject the recommender system. Methods aimed at reducing either the required user feedback, by means of attentive interfaces, or the set of required item ratings to maintain a representative user model, while maintaining a reasonable degree of confidence in predictions, could be promising directions.
References

1. Amazon.com (March 2008)
2. Book-crossing site (March 2008)
3. Cdnow.com (March 2008)
4. Lastfm site (March 2008)
5. Movielens site (March 2008)
6. Netflix site (March 2008)
7. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. on Knowl. and Data Eng. 17(6), 734–749 (2005)
8. Adomavicius, G., Tuzhilin, A., Zheng, R.: Rql: A query language for recommender systems. Information Systems Working Papers Series (2005) 9. Aggarwal, C.C., Wolf, J.L., Wu, K.-L., Yu, P.S.: Horting hatches an egg: a new graph-theoretic approach to collaborative filtering. In: KDD 1999: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 201–212. ACM, New York (1999) 10. Baeza-Yates, R.A., Ribeiro-Neto, B.A.: Modern Information Retrieval. ACM Press / Addison-Wesley (1999) 11. Balabanovi´c, M., Shoham, Y.: Fab: content-based, collaborative recommendation. ACM Commun. 40(3), 66–72 (1997) 12. Belkin, N.J., Croft, W.B.: Information filtering and information retrieval: two sides of the same coin? ACM Commun. 35(12), 29–38 (1992) 13. Bell, R., Koren, Y.: Improved neighborhood-based collaborative filtering. In: KDDCup 2007: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, San Jose, California, USA, pp. 7–14. ACM, New York (2007) 14. Billsus, D., Pazzani, M.J.: Learning collaborative information filters. In: ICML 1998: Proceedings of the Fifteenth International Conference on Machine Learning, pp. 46–54. Morgan Kaufmann Publishers Inc., San Francisco (1998) 15. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: UAI 1998: Proceedings of the fourteenth conference on uncertainty in artificial intelligence, pp. 43–52 (1998) 16. Chen, H., Ng, T.: An algorithmic approach to concept exploration in a large knowledge network (automatic thesaurus consultation): symbolic branch-and-bound search vs. connectionist hopfield net activation. J. Am. Soc. Inf. Sci. 46(5), 348–369 (1995) 17. Claypool, M., Gokhale, A., Mir, T., Murnikov, P., Netes, D., Sartin, M.: Combining content-based and collaborative filters in an online newspaper. In: Proceedings of ACM SIGIR Workshop on Recommender Systems (1999) 18. 
Cohen, W.W., Schapire, R.E., Singer, Y.: Learning to order things. In: NIPS 1997: Proceedings of the 1997 conference on Advances in neural information processing systems, vol. 10, pp. 451–457. MIT Press, Cambridge (1998) 19. Dahlen, B.J., Konstan, J.A., Herlocker, J.L., Good, N., Borchers, A., Riedl, J.: Jump-starting movielens: User benefits of starting a collaborative filtering system with “dead-data”. University of Minnesota TR 98-017 (1998) 20. Demiriz, A.: Enhancing product recommender systems on sparse binary data. Data Min. Knowl. Discov. 9(2), 147–170 (2004) 21. Deshpande, M., Karypis, G.: Item-based top-n recommendation algorithms. ACM Trans. Inf. Syst. 22(1), 143–177 (2004) 22. Drineas, P., Kerenidis, I., Raghavan, P.: Competitive recommendation systems. In: STOC 2002: Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pp. 82–90. ACM, New York (2002) 23. Goldberg, D., Nichols, D., Oki, B.M., Terry, D.: Using collaborative filtering to weave an information tapestry. ACM Commun. 35(12), 61–70 (1992) 24. Goldberg, K., Roeder, T., Gupta, D., Perkins, C.: Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval 4(2), 133–151 (2001) 25. Han, E.-H(S.), Karypis, G.: Feature-based recommendation system. In: CIKM 2005: Proceedings of the 14th ACM international conference on Information and knowledge management, pp. 446–452. ACM, New York (2005)
A Taxonomy of Collaborative-Based Recommender Systems
26. Heckerman, D., Chickering, D.M., Meek, C., Rounthwaite, R., Kadie, C.: Dependency networks for inference, collaborative filtering, and data visualization. J. Mach. Learn. Res. 1, 49–75 (2001) 27. Herlocker, J.L., Konstan, J.A., Borchers, A., Riedl, J.: An algorithmic framework for performing collaborative filtering. In: SIGIR 1999: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 230–237. ACM, New York (1999) 28. Herlocker, J.L., Konstan, J.A., Terveen, L.G., Riedl, J.T.: Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 22(1), 5–53 (2004) 29. Hill, W., Stead, L., Rosenstein, M., Furnas, G.: Recommending and evaluating choices in a virtual community of use. In: CHI 1995: Proceedings of the SIGCHI conference on Human factors in computing systems, New York, USA, pp. 194–201. ACM Press/Addison-Wesley Publishing Co. (1995) 30. Hofmann, T.: Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst. 22(1), 89–115 (2004) 31. Hofmann, T., Puzicha, J.: Latent class models for collaborative filtering. In: IJCAI ’99: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 688–693. Morgan Kaufmann Publishers Inc., San Francisco (1999) 32. Huang, Z., Chen, H., Zeng, D.: Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Transactions on Information Systems 22(1), 116–142 (2004) 33. Huang, Z., Chung, W., Chen, H.: A graph model for E-commerce recommender systems. Journal of the American Society for Information Science and Technology 55(3), 259–274 (2004) 34. Huang, Z., Chung, W., Ong, T.-H., Chen, H.: A graph-based recommender system for digital library. In: JCDL 2002: Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries, pp. 65–73. ACM, New York (2002) 35. Jin, R., Si, L., Zhai, C.: Preference-based graphic models for collaborative filtering. 
In: UAI 2003: Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence, pp. 329–336 (2003) 36. Jin, R., Si, L., Zhai, C.X., Callan, J.: Collaborative filtering with decoupled models for preferences and ratings. In: CIKM 2003: Proceedings of the twelfth international conference on Information and knowledge management, pp. 309–316. ACM Press, New York (2003) 37. Konstan, J.A., Miller, B.N., Maltz, D., Herlocker, J.L., Gordon, L.R., Riedl, J.: Grouplens: Applying collaborative filtering to usenet news. Communications of the ACM 40(3), 77–87 (1997) 38. Lee, W.S.: Collaborative learning and recommender systems. In: ICML 2001: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 314–321. Morgan Kaufmann Publishers Inc., San Francisco (2001) 39. Lin, W., Alvarez, S.A., Ruiz, C.: Collaborative recommendation via adaptive association rule mining. In: Data Mining and Knowledge Discovery, vol. 6, pp. 83–105 (2000) 40. Lin, W., Ruiz, C., Alvarez, S.A.: A new adaptive-support algorithm for association rule mining. Technical report (2000) 41. Linden, G., Smith, B., York, J.: Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing 7(1), 76–80 (2003) 42. Littlestone, N.: Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Mach. Learn. 2(4), 285–318 (1988)
43. Melville, P., Mooney, R.J., Nagarajan, R.: Content-boosted collaborative filtering for improved recommendations. In: Eighteenth national conference on Artificial intelligence, pp. 187–192. AAAI, Menlo Park (2002) 44. Miyahara, K., Pazzani, M.J.: Collaborative filtering with the simple bayesian classifier. In: Proceedings of the 6th Pacific Rim International Conference on Artificial Intelligence, pp. 679–689 (2000) 45. Mobasher, B., Dai, H., Luo, T., Nakagawa, M.: Discovery and evaluation of aggregate usage profiles for web personalization. Data Mining and Knowledge Discovery 6, 61–82 (2002) 46. Papagelis, M., Plexousakis, D., Kutsuras, T.: Alleviating the sparsity problem of collaborative filtering using trust inferences. In: Herrmann, P., Issarny, V., Shiu, S.C.K. (eds.) iTrust 2005. LNCS, vol. 3477, pp. 224–239. Springer, Heidelberg (2005) 47. Pazzani, M.J.: A framework for collaborative, content-based and demographic filtering. Artif. Intell. Rev. 13(5-6), 393–408 (1999) 48. Pennock, D.M., Horvitz, E., Lawrence, S., Lee Giles, C.: Collaborative filtering by personality diagnosis: A hybrid memory and model-based approach. In: UAI 2000: Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pp. 473–480. Morgan Kaufmann Publishers Inc., San Francisco (2000) 49. Rashid, A.M., Albert, I., Cosley, D., Lam, S.K., McNee, S.M., Konstan, J.A., Riedl, J.: Getting to know you: learning new user preferences in recommender systems. In: IUI 2002: Proceedings of the 7th international conference on Intelligent user interfaces, pp. 127–134. ACM, New York (2002) 50. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: Grouplens: an open architecture for collaborative filtering of netnews. In: CSCW 1994: Proceedings of the 1994 ACM conference on Computer supported cooperative work, pp. 175–186. ACM, New York (1994) 51. Resnick, P., Varian, H.R.: Recommender systems. Communications of the ACM 40(3), 56–58 (1997) 52. 
Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Application of dimensionality reduction in recommender systems–a case study. In: ACM WebKDD Workshop (2000) 53. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Item-based collaborative filtering recommendation algorithms. In: WWW 2001: Proceedings of the 10th international conference on World Wide Web, pp. 285–295. ACM, New York (2001) 54. Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J.: Analysis of recommendation algorithms for E-commerce. In: ACM Conference on Electronic Commerce, pp. 158–167 (2000) 55. Schafer, J.B., Konstan, J., Riedl, J.: Recommender systems in E-commerce. In: EC 1999: Proceedings of the 1st ACM conference on Electronic commerce, pp. 158–166. ACM, New York (1999) 56. Shardanand, U., Maes, P.: Social information filtering: algorithms for automating “word of mouth”. In: CHI 1995: Proceedings of the SIGCHI conference on Human factors in computing systems, New York, USA, pp. 210–217. ACM Press/Addison-Wesley Publishing Co. (1995) 57. Soboroff, I.M., Nicholas, C.K.: Combining content and collaboration in text filtering. In: Proceedings of the IJCAI 1999 Workshop on Machine Learning for Information Filtering, pp. 86–91 (1999) 58. Wang, J., de Vries, A.P., Reinders, M.J.T.: Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In: SIGIR 2006: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 501–508. ACM Press, New York (2006)
59. Xue, G.-R., Lin, C., Yang, Q., Xi, W., Zeng, H.-J., Yu, Y., Chen, Z.: Scalable collaborative filtering using cluster-based smoothing. In: SIGIR 2005: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 114–121. ACM, New York (2005) 60. Yu, K., Schwaighofer, A., Tresp, V., Xu, X., Kriegel, H.-P.: Probabilistic memory-based collaborative filtering. IEEE Transactions on Knowledge and Data Engineering 16(1), 56–69 (2004) 61. Zhang, T., Iyengar, V.S.: Recommender systems using linear classifiers. J. Mach. Learn. Res. 2, 313–334 (2002)
6 A System for Fuzzy Items Recommendation

Corrado Mencar, Ciro Castiello, Danilo Dell’Agnello, and Anna Maria Fanelli
Università degli Studi di Bari
[email protected], {mencar,castiello,fanelli}@uniba.it
Summary. This contribution presents a user profile modelling approach based on fuzzy logic techniques. The proposed approach is conceived to find application in various contexts, with the aim of providing personalised contents to different categories of users. Both contents and users are described by metadata, so a description language is introduced along with a formal model defining their association mechanism. The strength of the model is the use of the expressive flexibility of fuzzy sets exploited by an innovative scheme of metadata. Along with the formal presentation of the profile modelling approach, the design of a software system based on a Service Oriented Architecture is presented. The system exposes a number of services to be consumed by information systems for personalized content access. In this way the system can be used in different application contexts.
1 Introduction

Personalisation is one of the key issues pervading most technological applications designed for content provision, such as e-commerce systems, web portals, e-learning platforms and so on [1]. In the diverse contexts where they find application, personalisation mechanisms are mainly based on the definition of user profiles. These are formal structures representing different pieces of information related to the user, ranging from their expressed preferences or previous knowledge to their specific role within the area of interest. Usually, profiles are defined to represent categories of users sharing common features; in this way, user profiles act as filters which favour the allocation of personalised contents. Generally speaking, the definition of user profiles determines a specific granularity level within the area of interest. This information granulation can be set on a range where the coarsest case is a single profile for the whole set of users (no personalisation allowed), while the finest case is a distinct profile for each user (maximum level of personalisation). Inside this range, the choice of a particular granularity level is mainly driven by the trade-off between the involved costs and the produced benefits.

G. Castellano, L.C. Jain, A.M. Fanelli (Eds.): Web Person. in Intel. Environ., SCI 229, pp. 119–140. © Springer-Verlag Berlin Heidelberg 2009. springerlink.com
More fine-grained profiles can be achieved through automated profiling systems, which rely on data mining and machine learning techniques [2]. Nevertheless, this kind of approach requires a significant learning time, during which the user behaviour must be monitored by the system in order to build customised profiles. The work described in this chapter refers to personalisation processes whose final aim is to provide items to the users, so as to best satisfy their needs and goals [3]. The proposed approach aims at combining the benefits of automated profiling mechanisms with some form of available a priori knowledge about the domain. In practice, the users are assigned to pre-established user profiles, shared among the entire user community. At the same time, each user is also associated with individual profiles, which can be used to track their specific behaviour. The definition of complex profile structures (on the basis of simpler profile components) makes it possible to build any kind of user profile, responding to the articulated conditions of real world applications. Another peculiarity of the proposed approach is the introduction of fuzzy logic for modelling the association of users and profiles. In typical real world situations, a user can hardly be characterised in terms of a single profile: the specification of fuzzy degrees of membership makes it possible to associate a single user with multiple profiles. Moreover, fuzzy logic is also useful for defining a suitable metadata specification, to be adopted for the description of the items (such as Learning Objects, items in an e-commerce platform and so on). Indeed, metadata are largely used in profiling systems to characterise the objects involved in the personalisation process [4]. The usual association mechanisms based on common metadata schemes simply identify a number of items to be connected with the requesting users.
Obviously, this means that a great many items are left out of the association process. Fuzzy logic allows for a more comprehensive metadata specification, including the description of imprecise properties of the items. Consequently, a gradual association between users and items can be realised, configured as a ranking where degrees of compatibility are used to identify the most suitable items for each user, without excluding those characterised by lower degrees of compatibility. The chapter is organised as follows. The next section presents a brief overview of the state of the art. Section 3 introduces the profile modelling approach. Section 4 formalises the model for the description of an item, while Section 5 is devoted to the formalisation of the model describing the profiles of the actors involved in the item fruition process. Section 6 describes a proposal for a software system implementing the model. A metric-based evaluation of a prototype of this system is provided in Section 7, along with some architectural remarks. Finally, Section 8 closes the chapter with some concluding considerations.
2 Related Work

In the last decade Soft Computing techniques (including Fuzzy Logic, Neural Networks, Probabilistic Reasoning, Genetic Algorithms etc.) have been successfully applied in user modeling [5, 6].
Fuzzy Logic is usually employed in user modeling for its intrinsic ability to represent and manipulate imprecise and graded concepts. Its usefulness is generally recognized when, as in many real world cases, user models cannot be precisely defined without arbitrary approximations (for a survey of user modeling with several paradigms, including Fuzzy Logic, see [7]). A noteworthy application of Fuzzy Logic for user modeling in e-learning systems is given in [8]. Here, fuzzy sets are used to model the user knowledge and are dynamically adapted while the user learns new concepts from the e-learning platform. In [9] fuzzy rules are employed to register user actions and to refine the strength of the relationship between the user model attributes and the concepts of the knowledge domain. In [10] fuzzy sets are used to model beliefs about the interactions of students with items and quizzes; in this way the educational system is able to evaluate how plausible it is that a student actually studied the assigned items. Fuzzy Logic has also been used for user modeling in several areas other than e-learning systems. As an example, in [11] a fuzzy nearest neighbor approach is used in a collaborative filtering system to guess user preferences on the basis of historical records. In [12] a Fuzzy Logic based approach is adopted for the modeling of users to improve the interaction between the user and information retrieval systems. In [13] fuzzy logic techniques applied to recommender systems are presented. In [14] Fuzzy Multiple Criteria Analysis is used as a tool for user modeling in Sales Assistance software. In most works on user modeling with Fuzzy Logic, the representation of a user model is flat (i.e. usually based on a collection of fuzzy sets, or a vector of fuzzy values).
However, in some industrial contexts, users of an e-learning system may require more complex representations that better represent the role, the knowledge, the preferences of each user in her professional context. Besides, the roles of a user could be described in complex terms, such as a composition of sub-roles. In the subsequent sections an approach is proposed to account for these complexities by providing a very flexible framework for representing user profiles.
3 Rationale of the Profile Modelling Approach

The activity described in this contribution starts from the assumption that the proposed approach for profile modelling can find application in different contexts. Our investigation therefore aims to formalise the association process between a set of items (Its), each describing an object, and users, on the basis of suitable metadata specifications. The main concern of our approach is to provide a modelling strategy independent of the system that owns the items. As for the independence requirements, we intend to preserve:
• the independence from the actual representation of the items inside the platform;
• the independence from the actual representation of the users inside the platform;
• the independence from the specific technologies adopted for the representation of metadata inside the platform.

By conforming to these independence requirements it is possible to devote particular attention to the management of the profile modelling process, regardless of the constraints related to the practical realisation of the platform. This kind of approach makes it possible to set aside a proper definition of users and items: they can be simply acknowledged as class instances, without additional specifications. As for the capability requirements, we intend to provide:
• the capability to employ metadata specifications allowing for the representation of imprecise properties;
• the capability to formalise profiles of high complexity;
• the capability to perform (possibly partial) associations of a single user with several profiles.

The key to the association between users and items is metadata, which connect an item with an attribute and its respective value. Our approach differs from usual metadata specifications since we assume that the value of an attribute, far from being simply an element inside the attribute domain, can be specified as a fuzzy set. The theory of fuzzy sets basically modifies the concept of membership: the classical binary membership leaves room for a more comprehensive variety of membership degrees, defined in terms of a mathematical function (as detailed in the next section). In this way, fuzzy sets allow for a partial membership of their elements [15]. The employment of a fuzzy metadata characterisation enables the definition of different properties related to an item.
In particular, we can distinguish among:
• simple properties (the point-wise evaluation of an attribute, determined by a single value inside the possibly infinite set of values);
• collective properties (the extensional specification of a discrete set of values for an attribute);
• imprecise properties (the intensional definition of a qualitative value for an attribute).

It should be noted how this kind of approach produces a granulation of the attribute domains, where fuzzy sets are adopted to represent each information granule. This favours a mechanism of elaboration of concepts that is in agreement with human reasoning schemes [16]. Indeed, the formalisation of imprecise properties is included in our model to cope with the intrinsic difficulties related to some metadata characteristics, which cannot be described in terms of simple or collective values. Attempts to formalise such properties by means of discretisation processes lead to arbitrariness, resulting in a poor management of the involved items. The introduction of fuzzy sets is intended to overcome these difficulties, together with the adoption of particular mathematical operators that are especially suitable for handling imprecise information. In this way, gradual associations can be realised between users and items, on the basis of a compatibility ranking. As a result, each user can be ultimately addressed to the most compatible item, without arbitrarily discarding those characterised by a lower degree of compatibility.
The user profiles are used to represent stereotypical categories of learners. In order to take into account stereotypes of high complexity, in this work the user profiles are formalised as collections of profile components. Analogously to the metadata specification, the profile components are characterised in terms of fuzzy sets: this homogeneity simplifies the comparison process that defines a compatibility degree between profile components and items. The aggregation of such compatibility degrees produces the final association of a profile with an item. Users are characterised by their corresponding profiles; however, a single user is hardly ever fully represented by a single profile. For that reason, the proposed modelling approach allows a partial membership of users to different profiles, as happens in real world situations. Therefore, the final association of a user with an item is evaluated by considering the compatibility degrees related to the different profiles the user belongs to. In the following sections we detail the profile modelling approach by distinguishing the characterisation of the items from the description of the user profiling mechanisms. All the involved entities are formally defined in terms of mathematical concepts, and suitable examples are provided to illustrate the working scheme of the modelling approach.
4 Modelling Items

To provide a general way to describe a generic item, a description by means of metadata is considered. Regardless of the context in which the item is employed (e.g. e-learning, e-commerce, item recommendation and so on), the model deals only with the description of items.

4.1 Items and Attributes
An item (It) is any object owned by the platform which a user can be interested in. The proposed model leaves aside the peculiar structure of an object description, which is simply defined as an element of a set. Let $\mathcal{O}$ be a non-empty set of physical objects, namely the item space.

Definition 1. An item is an element $o$ in the item space $\mathcal{O}$, i.e. $o \in \mathcal{O}$.

With reference to a particular scenario, an item may be represented by a multimedia support, a learning object in an e-learning platform, a document file, a presentation, a book, a hardware component and so on. Each item can be associated with a set of attributes. Generally speaking, an attribute may be numeric or symbolic and it is related to a (possibly infinite) number of distinct values. Let $\mathcal{A}$ be a non-empty set, namely the attribute space.

Definition 2. An attribute is an element $A$ in the attribute space $\mathcal{A}$, i.e. $A \in \mathcal{A}$. In particular, an attribute $A$ is a set of values $a \in A$.

Example 1. If we consider an item represented by a book in an e-commerce platform, a list of related attributes may include:
1. the name of the item;
2. the difficulty level of the item (e.g. undergraduate, professional etc.);
3. the publishing year of the item;
4. the author of the item;
5. the topic of the item (e.g. fiction, scientific and so on).
Again, if we consider an item represented by a Learning Object (LO) in an e-learning system, a list of related attributes may include:
1. the name of the LO;
2. the difficulty level of the LO;
3. the fruition time of the LO.

The peculiarity of the proposed modelling approach consists in associating a particular item with the imprecise values of its attributes. To manage these associations, the concept of fuzzy set is employed, standing as a generalisation of the classical concept of mathematical set. By defining a fuzzy set over a domain, it is possible to extend the membership evaluation process to every element inside the domain, thus moving from binary membership values (0/1) to a gradation of membership values over a continuous range. For our purposes, we define a fuzzy set over each attribute of an item as follows. Let $A \in \mathcal{A}$ be an attribute.

Definition 3. A fuzzy set defined over $A$ is a function:

$$F_A : a \in A \to F_A(a) \in [0, 1] \qquad (1)$$

$F_A(a)$ is called the membership degree of value $a$ in the fuzzy set $F_A$. The definition of fuzzy sets enables the characterisation of items in terms of the correspondence between the attributes and their possible values. This kind of relationship can be defined in terms of a set of Attribute-Value pairs, specified as follows. Let $\mathcal{A}$ be the attribute space.

Definition 4. An Attribute-Value pair is an ordered pair $(A, F_A)$, with $A \in \mathcal{A}$ and $F_A$ a fuzzy set defined over the attribute $A$. An Attribute-Value set is a set of Attribute-Value pairs: $f = \{(A, F_A) \mid A \in \mathcal{A}\}$.

Remark 1. An Attribute-Value set can be formalised as the function $f : A \in \mathcal{A} \to F_A \in \mathcal{F}_A$, where $\mathcal{F}_A$ is the space of all the possible fuzzy sets which can be defined over the attribute $A$.

4.2 Metadata and Item Description
Attributes and values are strictly connected with an item. Therefore, it is useful to introduce the metadata concept (to be defined for every attribute), associating an item with a fuzzy set which represents the attribute value. Let $A \in \mathcal{A}$ be an attribute of an item $o \in \mathcal{O}$.
Definition 5. A metadata $m_A$ is a function associating the item $o$ with a fuzzy set defined on $A$: $m_A : \mathcal{O} \to \mathcal{F}_A$.

In order to obtain a thorough description of an item, it is necessary to refer to its attributes and their related values. A straightforward mechanism to generate an item description is the simple enumeration of the attributes, together with the fuzzy sets reporting the corresponding values. This kind of description is based on the set of metadata that can be defined for an item. Let $\mathcal{A}$ be the attribute space.

Definition 6. The description of an item $o \in \mathcal{O}$, with respect to $\mathcal{A}$, is the set of all the Attribute-Metadata pairs associated with $o$:

$$D(o) = \{(A, m_A(o)) \mid A \in \mathcal{A}\} \qquad (2)$$
Remark 2. The description of an item $o \in \mathcal{O}$ can be formalised as the function $D(o) : A \in \mathcal{A} \to F_A \in \mathcal{F}_A$.

Remark 3. The description $D(o)$ is an Attribute-Value set.

Remark 4. We admit the presence of attributes associated with the entire set of values, i.e. when $m_A(o) = A$. This condition holds when no values are specified for the attribute $A$ in the characterisation of the item $o$.

Example 2. In the illustrative scenario introduced in Example 1, the item description (here a LO representation is considered) can be expressed by listing the attributes together with the fuzzy sets reporting the corresponding values:
1. Name → {“Introduction to word processing”/1};
2. Fruition time → about 10 = T[8, 10, 15];
3. Creation date → {“07-06-07”/1};
4. Complexity → {“Easy”/1, “Average”/0.3, “Expert”/0};
5. Scope → {“ICT”/0.7, “Word Processing”/1}.

It can be observed that the fuzzy sets reporting the values for the attributes «Name» and «Creation date» refer to simple properties of the item. They assign the maximum membership degree (equal to 1) to only one of the infinite values the attributes may assume. All the other values are not reported inside the characterisations of the fuzzy sets, since their membership degree is equal to zero. This peculiar condition can be graphically represented by means of fuzzy singletons, as depicted in Fig. 1. The fuzzy sets reporting the values for the attributes «Complexity» and «Scope» refer to collective properties of the item. They are defined over discrete sets and assign a membership degree to each one of the possible values, as depicted in Fig. 2. Finally, the fuzzy set reporting the value for the attribute «Fruition time» refers to an imprecise property of the item. It is defined over a continuous set and assigns a membership degree to each one of the possible values by means of a triangular function, as depicted in Fig. 3. (It should be noted that different kinds of membership functions may be adopted, such as trapezoidal or Gaussian functions.)

Fig. 1. Fuzzy singletons representing simple properties of an item: the «Name» attribute (a) and the «Creation date» attribute (b)

Fig. 2. Fuzzy sets representing collective properties of an item: the «Complexity» attribute (a) and the «Scope» attribute (b)

Fig. 3. Fuzzy set representing an imprecise property of an item: the «Fruition time» attribute

Definition 7. An item collection $O$ is a subset of the space $\mathcal{O}$, i.e. $O \subseteq \mathcal{O}$.

The description of an item collection can be further specified with reference to the definition of the single item description as follows.

Definition 8. The description of an item collection $O$ is the union set of all the Attribute-Value sets defined by the description of each item:

$$D(O) = \bigcup_{o \in O} \{(D(o))\} \qquad (3)$$
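The three kinds of properties discussed in Example 2 can be mirrored in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the helper names `singleton`, `discrete` and `triangular` are hypothetical, and the code assumes that T[8, 10, 15] denotes a triangular membership function with feet at 8 and 15 and peak at 10.

```python
from typing import Callable, Dict

# A fuzzy set is modelled as a membership function mapping a value to [0, 1].
FuzzySet = Callable[[object], float]


def singleton(value) -> FuzzySet:
    """Simple property: degree 1 for one value, 0 for every other value."""
    return lambda a: 1.0 if a == value else 0.0


def discrete(degrees: Dict[object, float]) -> FuzzySet:
    """Collective property: explicit degrees; unlisted values get 0."""
    return lambda a: degrees.get(a, 0.0)


def triangular(left: float, peak: float, right: float) -> FuzzySet:
    """Imprecise property: T[left, peak, right], assuming left < peak < right."""
    def mu(a: float) -> float:
        if a <= left or a >= right:
            return 0.0
        if a <= peak:
            return (a - left) / (peak - left)
        return (right - a) / (right - peak)
    return mu


# Description D(o) of the Learning Object of Example 2:
# a mapping from attribute names to fuzzy sets (an Attribute-Value set).
lo1 = {
    "Name": singleton("Introduction to word processing"),
    "Fruition time": triangular(8, 10, 15),
    "Creation date": singleton("07-06-07"),
    "Complexity": discrete({"Easy": 1.0, "Average": 0.3, "Expert": 0.0}),
    "Scope": discrete({"ICT": 0.7, "Word Processing": 1.0}),
}

print(lo1["Fruition time"](10))       # membership at the peak of the triangle
print(lo1["Fruition time"](9))        # membership halfway up the left side
print(lo1["Complexity"]("Average"))   # degree listed in the discrete set
```

Representing every value as a membership function keeps simple, collective and imprecise properties interchangeable, which is the homogeneity the model relies on when descriptions are later compared.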
Example 3. The mathematical formalisation of the item description can be extended to manage several distinct items. In this case, the formula in Definition 6 should be generalised by means of a matrix representation, where rows and columns correspond to the items and their attributes, respectively. To this aim, we rely on the concept of item collection (Definition 7). The previously described illustrative scenario can be expanded by involving a number of different items. Table 1 reports the matrix describing a sample item collection.

Table 1. The matrix describing a sample item collection

| Item | Name | Fruition time | Creation date | Complexity | Scope |
|---|---|---|---|---|---|
| LO1 | {“Introduction to Word”/1} | about 10’ | “07-06-07”/1 | {“Easy”/1, “Average”/0.3, “Expert”/0} | {“ICT”/0.7, “WordProcessing”/1} |
| LO2 | {“Introduction to Latex”/1} | 1/2 hour | “22-05-07”/1 | {“Easy”/0.4, “Average”/0.7, “Expert”/0.1} | {“ICT”/0.7, “WordProcessing”/1} |
| LO3 | {“HTML for Dummies”/1} | about 40’ | “22-04-07”/1 | {“Easy”/0.3, “Average”/0.8, “Expert”/0.2} | {“ICT”/0.7, “Web”/1} |
5 Modelling the Actors of the Item Fruition Process

In the previous section a way to model items by means of a set of metadata has been provided. In the same way, this section presents a description of users by means of a set of metadata. The structure of the user description is more complex than that of the items, reflecting the fact that a user can assume diverse roles at the same time.

5.1 Profile Components and Compatibility Degrees
The user profiles are regarded as complex concepts whose analysis can be performed on the basis of simpler elements, namely the profile components. Each of them is formalised in terms of the previously introduced Attribute-Value pairs, so that the fuzzy valuation of attributes can be replicated. Let $\mathcal{A}$ be the attribute space.

Definition 9. A profile component $c$ is defined as the set of ordered pairs:

$$c = \{(A, F_A) \mid A \in \mathcal{A}\} \qquad (4)$$
Remark 5. A profile component can be formalised as the function $c : A \in \mathcal{A} \to F_A \in \mathcal{F}_A$.

Remark 6. The ensemble of the profile components spans the set $\mathcal{C}$, namely the space of the profile components.

The formalisation of the profile components is useful to define the concept of user profile.

Definition 10. A user profile $p$ is a set of profile components, i.e. $p \subseteq \mathcal{C}$.

Remark 7. The ensemble of the user profiles spans the power set $\mathcal{P} = 2^{\mathcal{C}}$ of user profiles.

Example 4. A specific user profile can be constituted by a number of profile components. As an example, we refer to a pair of profile components. The first one ($c_1$) is characterised by the following Attribute-Value pairs:
1. Fruition time → short = T[0, 15, 30];
2. Complexity → {“Easy”/1, “Average”/1, “Expert”/0.5};
3. Scope → {“ICT”/0.5, “Word Processing”/0.8}.

The second profile component ($c_2$) is characterised by the following Attribute-Value pairs:
1. Complexity → {“Easy”/0.5, “Average”/1, “Expert”/1};
2. Scope → {“Management”/1}.

Such a user profile can be properly associated with a «secretary» profile and it is defined in terms of the same attributes employed for the item descriptions reported in the previous examples. Here the «Fruition time», «Complexity» and «Scope» attributes refer to the characteristics of the items that the user is supposed to be addressed to. The attributes not appearing in this example are not deemed useful for describing the profile components. The pieces of information reported in the example illustrate the usefulness of profile components. In fact, the first component $c_1$ is related to the ICT competence of the secretary, with special reference to the use of word processing software. This kind of competence can be reasonably regarded as a non-priority issue for the secretary profile; for that reason the related items are characterised by a low level of complexity and a short fruition time.
Conversely, the secretary profile is fully qualified in terms of management activities, as represented by the maximum membership degree associated with the value of the «Scope» attribute in the second component c2. As a consequence, more complex items are to be considered, without a specification for the «Fruition time» attribute: in this case the user should be directed to items with any fruition time.
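As a minimal sketch of how the Attribute-Value pairs of Example 4 might be held in code, the Python fragment below stores discrete fuzzy sets as dictionaries and the «Fruition time» value as a membership function. The reading of T[0, 15, 30] as a triangle with feet at 0 and 30 and peak at 15, as well as the names `triangular`, `c1`, `c2` and `secretary`, are assumptions made for illustration only.

```python
def triangular(a, b, c):
    """Membership function of a triangular fuzzy set with feet a, c and
    peak b (assumed reading of the notation T[a, b, c])."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu

# First profile component of Example 4 (ICT / word processing competence).
c1 = {
    "Fruition time": triangular(0, 15, 30),
    "Complexity": {"Easy": 1.0, "Average": 1.0, "Expert": 0.5},
    "Scope": {"ICT": 0.5, "Word Processing": 0.8},
}

# Second profile component (management competence); no fruition time given.
c2 = {
    "Complexity": {"Easy": 0.5, "Average": 1.0, "Expert": 1.0},
    "Scope": {"Management": 1.0},
}

# A user profile is a collection of profile components.
secretary = [c1, c2]

print(c1["Fruition time"](7.5))  # 0.5
```

Attributes that a component does not mention are simply absent from its dictionary, which mirrors the statement that they are not deemed useful for describing that component.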
A System for Fuzzy Items Recommendation
The homogeneity between the item description and the profile components is straightforward, as results from the comparison of Definitions 6 and 9. The common structure of these elements allows the definition of a compatibility degree between them, which is actually evaluated between a pair of Attribute-Value sets. For this purpose, it is possible to exploit the possibility measure between fuzzy sets and the aggregation operators. In particular, the possibility measure [17],[18] verifies the existence of an attribute value both in the profile component and in the item description; the aggregation process, performed over the evaluated possibility measures, produces a compatibility degree between the profile component and the item.
Definition 11. The possibility degree between two fuzzy sets $F'_A$, $F''_A$, defined on the same attribute A, is defined as follows: $\Pi(F'_A, F''_A) = \sup_{a \in A} \min\{F'_A(a), F''_A(a)\}$.
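For discrete fuzzy sets the supremum in Definition 11 reduces to a maximum over the values the two sets share. The following Python sketch illustrates this case only (continuous sets would need a search over the attribute domain); the function name `possibility` and the item value `item_scope` are hypothetical.

```python
def possibility(f1, f2):
    """Possibility degree of two discrete fuzzy sets, each given as a
    {value: membership} dictionary defined on the same attribute."""
    # A value missing from a dictionary has membership 0, so only the
    # values present in both sets can contribute to the supremum.
    common = set(f1) & set(f2)
    return max((min(f1[a], f2[a]) for a in common), default=0.0)

profile_scope = {"ICT": 0.5, "Word Processing": 0.8}
item_scope = {"Word Processing": 1.0}  # hypothetical item description

print(possibility(profile_scope, item_scope))        # 0.8
print(possibility({"Management": 1.0}, item_scope))  # 0.0
```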
An example is shown in Fig. 4. The possibility degree provides a measure of the compatibility of two granular values defined on the same attribute. It is hence the basic operation for the definition of the compatibility degree between an item and a profile component. The calculation of the possibility degree spans all the attributes in $\mathcal{A}$. As a consequence, given two Attribute-Value sets f1, f2, the related possibility degree can be specified.
Definition 12. The possibility degree between two Attribute-Value sets f1, f2 is defined as follows: $\Psi(f_1, f_2) : A \in \mathcal{A} \to \Pi(f_1(A), f_2(A)) \in [0, 1]$. The definition of the compatibility degree of the two Attribute-Value sets f1, f2 requires the aggregation of the possibility degrees attained for each attribute.
Definition 13. The compatibility degree between f1 and f2 is defined as: $K_\omega(f_1, f_2) = \omega(\Psi(f_1, f_2))$. Function ω is an OWA (Ordered Weighted Average, [19]) aggregation operator $\omega : [0, 1]^{|\mathcal{A}|} \to [0, 1]$, defined as:
$$\omega(\pi_1, \pi_2, \ldots, \pi_{|\mathcal{A}|}) = \sum_{j=1}^{|\mathcal{A}|} \pi_{i_j} \cdot w_j,$$
where $\pi_{i_1} \le \pi_{i_2} \le \cdots \le \pi_{i_{|\mathcal{A}|}}$ and $w_1, w_2, \ldots, w_{|\mathcal{A}|} \in [0, 1]$ are weight factors such that $\sum_{j=1}^{|\mathcal{A}|} w_j = 1$.
C. Mencar et al.
Remark 8. By changing the weight factors, several OWA operators can be defined, such as the minimum function (by setting w1 = 1 and wj = 0 for j > 1) or the mean value function (by setting wj = 1/|A| for all j). The choice of a specific OWA is a matter of design.
Remark 9. The compatibility degree Kω(c, D(o)) between a profile component c and an item description D(o) can be defined in terms of the compatibility degree between a pair of Attribute-Value sets introduced by Definition 13. Generally speaking, a user profile is compatible with an item if at least one of its profile components is compatible with the item. Since we are dealing with fuzzy evaluations, it is necessary to refer to the maximum compatibility degree evaluated over the profile components.
Definition 14. The compatibility degree between a profile p and an item o is defined as the maximum compatibility degree of the profile components: $K_\omega(p, D(o)) = \max_{c \in p} K_\omega(c, D(o))$.
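The OWA operator and the two weight choices named in Remark 8 can be sketched in a few lines of Python. The function names `owa`, `min_weights` and `mean_weights` and the input values are illustrative assumptions; the inputs are sorted in ascending order so that the first weight applies to the smallest possibility degree.

```python
def owa(values, weights):
    """Ordered Weighted Average: sort the inputs in ascending order and
    return their weighted sum with the given weights."""
    if len(values) != len(weights):
        raise ValueError("one weight per value is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(values)
    return sum(p * w for p, w in zip(ordered, weights))

def min_weights(n):
    """Weights (1, 0, ..., 0): the OWA becomes the minimum."""
    return [1.0] + [0.0] * (n - 1)

def mean_weights(n):
    """Weights (1/n, ..., 1/n): the OWA becomes the arithmetic mean."""
    return [1.0 / n] * n

degrees = [1.0, 0.8, 1.0]  # illustrative possibility degrees
print(owa(degrees, min_weights(3)))   # 0.8
print(owa(degrees, mean_weights(3)))  # close to 0.9333
```

With the minimum weights a single incompatible attribute drives the whole compatibility degree to zero, whereas the mean lets well-matched attributes compensate; this is the design trade-off that the choice of weights controls.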
Example 5. It is possible to evaluate the compatibility degree between the user profile defined in Example 4 and the item description reported in Example 2. The compatibility degree is equal to the maximum of the compatibility degrees between each of its profile components (namely c1 and c2) and the item description. Considering the profile component c1, the evaluation of the possibility measures between the fuzzy sets defined for the attributes «Scope», «Complexity» and «Fruition time» is illustrated in Fig. 4, with the assistance of the graphical representations of the fuzzy sets involved.
Fig. 4. Evaluation of the possibility measures among fuzzy sets
By adopting the minimum function as OWA aggregation function, the compatibility degree Kω (c1 , LO) between the profile component c1 and the item can be properly evaluated as: Kω (c1 , LO) = ω(0.8, 1, 1) = 0.8. An analogous process can be performed with reference to the profile component c2 , yielding the compatibility degree: Kω (c2 , LO) = ω(0, 1) = 0. According to the definition 14, the final degree of compatibility between the «secretary» user profile and the item is equal to: max(Kω (c1 , LO), Kω (c2 , LO)) = max(0.8, 0) = 0.8.
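The arithmetic of Example 5 can be checked with a short Python sketch. It takes the per-attribute possibility degrees quoted above as given and applies the minimum OWA followed by the maximum of Definition 14; the names `owa_min` and `profile_compatibility` are hypothetical.

```python
def owa_min(values):
    """OWA with weights (1, 0, ..., 0) on ascending inputs: the minimum."""
    return min(values)

def profile_compatibility(component_degrees, aggregate=owa_min):
    """Definition 14: the best compatibility over the profile components.
    component_degrees maps each component name to the list of its
    per-attribute possibility degrees against the item."""
    return max(aggregate(d) for d in component_degrees.values())

# Possibility degrees of Example 5 for components c1 and c2.
secretary_vs_item = {"c1": [0.8, 1.0, 1.0], "c2": [0.0, 1.0]}

print(profile_compatibility(secretary_vs_item))  # 0.8
```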
5.2 Users and User Profiles
Items are meant to be requested by users. Each user can be associated with multiple profiles: these associations are characterised by fuzzy membership degrees. Three kinds of profiles have been conceived in our profile modelling approach:
1. competence profiles (characterising the users in terms of their specific roles or working activities);
2. preference profiles (characterising the users in terms of their specific choices during the interaction with the system);
3. acquaintance profiles (characterising the users in terms of the specific information they have collected during the interaction with the system).
In any case, the structure of the profiles is the same as defined in the previous section for all the above categories. A user can be defined in terms of membership degrees with reference to a profile base. Let U be a non-empty set of users.
Definition 15. A user is an element u in the set of users U, i.e., u ∈ U.
Definition 16. A profile base is a subset P of the profile space $\mathcal{P}$, i.e. $P \subseteq \mathcal{P}$. Let $P \subseteq \mathcal{P}$ be a profile base and let u ∈ U be a user.
Definition 17. The description of the user u is defined by the fuzzy set: $D_P(u) : p \in P \to [0, 1]$.
Example 6. The scenario illustrated in Example 4 can be further detailed by supposing that a user associated with the «secretary» profile may also be associated with some other user profile (possibly corresponding to some other working function). As an example, we could think of a person inside a company who plays the
different roles of secretary and (to a lesser extent) of tax consultant. In the context of the profile modelling approach, such a user u is represented by the following description: D(u) = {“secretary”/0.8, “tax consultant”/0.2}. The above formalisation is based on the assumption that there exist both the user profile «secretary» and the user profile «tax consultant»: the latter may be described in a similar way as illustrated in Example 4. The compatibility degree between a user and an item can be defined on the basis of the compatibility degree between the description of the item and the profiles associated with the user. In practice, several degrees of compatibility should be taken into account, weighted by the user membership degrees with respect to the profile base. Let u ∈ U be a user and let $P \subseteq \mathcal{P}$ be a profile base.
Definition 18. The compatibility degree between the description of the user u and the description of the item o ∈ O is defined as: $K_\omega(D_P(u), D(o)) = \max_{p \in P} \min\{K_\omega(p, D(o)), D_P(u)(p)\}$.
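Definition 18 is a max-min composition, which the following Python sketch spells out. The membership degrees are those of the user description D(u) above; the profile-to-item degrees, in particular the 0.1 for the «tax consultant» profile, are assumed figures, and the function name `user_compatibility` is hypothetical.

```python
def user_compatibility(profile_degrees, memberships):
    """Definition 18: maximum over the profiles of
    min(K(p, D(o)), D_P(u)(p))."""
    return max(
        (min(k, memberships.get(p, 0.0)) for p, k in profile_degrees.items()),
        default=0.0,
    )

# Membership of the user in each profile (the description D(u)).
memberships = {"secretary": 0.8, "tax consultant": 0.2}
# Compatibility of each profile with the item (assumed values).
profile_degrees = {"secretary": 0.8, "tax consultant": 0.1}

print(user_compatibility(profile_degrees, memberships))  # 0.8
```

The inner minimum caps the contribution of a profile at the user's membership in it, so a profile that fits the item perfectly still counts for little if the user barely belongs to it.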
Example 7. With reference to Example 6, the compatibility degree between the user and the item is obtained by combining, for each user profile (namely, the «secretary» and the «tax consultant» profiles), the compatibility degree between the profile and the item with the membership degree of the user in that profile. As for the «secretary» profile, we have already evaluated its compatibility degree with the item, which is equal to 0.8. Supposing a compatibility degree equal to 0.1 for the (undefined) «tax consultant» profile, Definition 18 gives max{min(0.8, 0.8), min(0.1, 0.2)} = max{0.8, 0.1} = 0.8 as the ultimate compatibility degree between the user and the item. Finally, it is possible to formalise the different roles of the previously specified profile categories for the user characterisation. Let u ∈ U be a user and let C (Competence), A (Acquaintance) and P (Preference) be three profile bases.
Definition 19. The compatibility degree between the description of the user u and the description of the item o ∈ O is defined as: $K_\omega(u, o) = \min\{\max\{K_\omega(D_P(u), D(o)), K_\omega(D_C(u), D(o))\}, 1 - K_\omega(D_A(u), D(o))\}$. (5)
The relationship expressed by (5) represents the logical property associating an item with a specific user on the basis of his competence, the preferences he has expressed during the interaction process and the items he has had the opportunity to get acquainted with. Specifically, relationship (5) expresses the logical property that associates an item with a user if the latter has competence or preference for the item, but is not yet acquainted with it.
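Relationship (5) reads as "(preference OR competence) AND NOT acquaintance", with max as the fuzzy OR, min as the fuzzy AND and 1 − x as the negation. A minimal Python sketch follows; the function name `overall_compatibility` and the numeric inputs are illustrative assumptions.

```python
def overall_compatibility(k_preference, k_competence, k_acquaintance):
    """Relationship (5): min(max(preference, competence), 1 - acquaintance)."""
    return min(max(k_preference, k_competence), 1.0 - k_acquaintance)

# An item matching the user's competence that the user has never seen.
print(overall_compatibility(0.3, 0.8, 0.0))  # 0.8

# The same item once the user is largely acquainted with it: the
# negated acquaintance term now dominates and the degree drops.
already_seen = overall_compatibility(0.3, 0.8, 0.9)
print(round(already_seen, 3))
```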
6 Defining the System Architecture
The main issue in designing a system conforming to the model discussed in the previous sections is dealing with its very general nature. Since the model provides a tool to associate items with users regardless of the application context, the architecture must reflect this generality. One can imagine the need for such an association mechanism in an e-learning system, in an e-commerce platform, and so on. The main concern is to develop a component that can be used as a service provider by existing systems, so that the integration effort is minimized. The proposed architecture has three layers, each of them related to a specific function:
1. a Frontend layer acting as a request dispatcher;
2. a Backend layer dealing with the implementation of the model;
3. a Persistence Abstraction Layer dealing with data stored on the physical system (i.e. item and profile descriptions).
The first one accepts incoming service requests and sends back the result computed by the system, the second one performs operations according to incoming requests, and the third one is responsible for the management of database transactions. Each component offers an interface used by the other components in their interactions. An overview of the system components is provided in Fig. 5.
Fig. 5. A general overview of the system architecture. The system is enclosed in the boxed area. Each external component, namely databases and service consumers, has a label reflecting its stereotype.
Fig. 6. The architecture of the Frontend Layer. The bounded region identifies the boundary of the Frontend Layer.
6.1 The Frontend Layer
The task of this layer is to provide an external interface to the system. Requests from service consumers are decoded and forwarded to the Backend layer, while the result of the processing performed by the system is encoded and sent back to the requester. Given the need to design a Service Provider, a Service Oriented Architecture paradigm has been chosen. There are several advantages with this approach:
1. the use of a mature protocol for communications between service provider and service consumer;
2. the use of a simple, up-to-date architecture;
3. the implementation of a platform-independent system.
The SOAP protocol has been chosen to manage requests for system services (for details on the SOAP specification see http://www.w3.org/TR/soap/). Every request coming from clients is encapsulated into a SOAP envelope and delivered to the system. Every envelope has a standard format with a header and a body section. Envelopes are carried over the network via HTTP, and their bodies embed both the information on the service request and the data to be processed. A Dispatcher component is responsible for receiving requests and sending back responses. An Encoder component is responsible for encoding outgoing messages and decoding incoming ones. A Forwarder component is responsible for forwarding the decoded request to the Backend layer. The result is encapsulated into a SOAP envelope and sent back to the service consumer. A diagram of the architecture of this layer is shown in Fig. 6.
6.2 The Backend Layer
The task of this layer is to take care of the computational effort of the system. It provides the mechanism to associate items with the user according to the model formalized in the previous sections. Knowledge about matching strategies, the internal description of the objects involved in the matching process and the fuzzy operators is held by the components in this layer. This layer is decoupled from the other layers and operates on data translated into an appropriate internal format. This makes it possible to realize the general purpose matching strategy formalized by the model. A Matcher component is responsible for the association between user and item descriptions, while a Fuzzy Inference Engine component takes care of the semantics expressed by the fuzzy operators presented in the model. Other components can be inserted in this layer, one for each further service exposed by the system. The architecture of this layer is shown in Fig. 7.
Fig. 7. The architecture of the Backend Layer. The bounded region identifies the boundary of the Backend Layer.
6.3 The Persistence Abstraction Layer
This layer is responsible for database connections and transactions. It contains the components that convert data between the form in which they are physically stored in databases and an internal format that the system can process. The rationale underlying this choice is the need to provide a general way to process information regardless of the format in which it is stored. It may be possible to use a relational database, an XML document or any other medium to store data about users and items. A Translator component is responsible for adapting data between this layer and the Backend Layer above it. A mechanism has been designed to decouple the implementation of the Persistence Abstraction Layer from the underlying database. For this reason the responsibility for interacting with the database is delegated to a single component. This component, namely the Data Access Manager, knows the format in which data are stored on the physical system and the mechanism to retrieve them. Fig. 8 shows the architecture of this layer.
Fig. 8. The architecture of the Persistence Abstraction Layer. Connections to both the items and users databases are shown. The bounded region identifies the boundary of the Persistence Abstraction Layer.
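The division of responsibilities among the three layers can be illustrated with a deliberately small Python sketch. It is not the prototype's implementation: the class and method names are hypothetical, an in-memory dictionary stands in for the databases, a plain dictionary stands in for the SOAP envelope, and the scoring rule is a toy exact match rather than the fuzzy model.

```python
class DataAccessManager:
    """Persistence layer: the only component that knows how data are
    physically stored (here, plain in-memory dictionaries)."""
    def __init__(self, items, users):
        self._items, self._users = items, users

    def load_items(self):
        return self._items

    def load_user(self, user_id):
        return self._users[user_id]


class Translator:
    """Persistence layer: adapts stored records to the internal format
    used by the Backend (here, simple dictionary copies)."""
    def __init__(self, access_manager):
        self._dam = access_manager

    def items(self):
        return dict(self._dam.load_items())

    def user(self, user_id):
        return dict(self._dam.load_user(user_id))


class Matcher:
    """Backend layer: ranks items for a user. The scoring function is
    injected, so this layer is independent of storage and transport."""
    def __init__(self, translator, score):
        self._translator, self._score = translator, score

    def recommend(self, user_id):
        user = self._translator.user(user_id)
        items = self._translator.items()
        return sorted(items, key=lambda i: self._score(user, items[i]),
                      reverse=True)


class Frontend:
    """Frontend layer: decodes a request, forwards it to the Backend and
    encodes the reply."""
    def __init__(self, matcher):
        self._matcher = matcher

    def handle(self, request):
        if request.get("service") != "recommend":
            return {"status": "error", "reason": "unknown service"}
        return {"status": "ok",
                "items": self._matcher.recommend(request["user"])}


items = {"lo1": {"Scope": "Word Processing"}, "lo2": {"Scope": "Management"}}
users = {"u1": {"Scope": "Management"}}
toy_score = lambda user, item: 1.0 if user["Scope"] == item["Scope"] else 0.0

frontend = Frontend(Matcher(Translator(DataAccessManager(items, users)), toy_score))
print(frontend.handle({"service": "recommend", "user": "u1"}))
```

Replacing the storage medium would only require a different `DataAccessManager`, which is the decoupling this layer is meant to provide.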
7 Evaluating the Prototype
The implemented prototype has been tested in order to evaluate some of its relevant characteristics. The main point of interest was to test how closely the system conforms to the model. To this end two aspects have been considered: functionality and efficiency. Functionality measures how well the software satisfies the needs expressed in the analysis phase, while efficiency measures the association time between users and items. A testing environment has been built by populating the item database with learning objects (hereafter LOs) and the user description database with user profiles (hereafter UPs). The sets of metadata describing UPs and LOs were constrained to have a non-empty intersection, so that at least one attribute in the UP descriptions could match the corresponding one in the LO descriptions. At the end of the building process the items database stored five items representing LOs whose sets of describing metadata had various cardinalities. In the same way the user description database stored nine UPs whose sets of metadata had various cardinalities. The test phase consisted of the computation of a score for each of the two aspects mentioned above. To inspect the functionality of the system the Semantic Consistency of Matching Operator (hereafter SCMO) has been defined. Efficiency is evaluated by means of the Profile-Items Association indicator (hereafter PIA), which measures the time the system needs to perform an association between a user description and a set of items. For this indicator the average value (PIA_AVG) and the standard deviation (PIA_STD) over a battery of tests have been considered.
7.1 The Estimation of the SCMO Indicator
To estimate the SCMO the following process has been defined:
1. a set I of items and a set U of users are considered;
2. for each user in the set we manually define a list of items ordered with an empirical criterion that estimates the order of preferences on the basis of the semantics of the item descriptions;
3. an association test with the system is performed for each user in the set so that a set of ordered lists of items is obtained;
4. differences in element ordering between the manually defined and the system-obtained lists are evaluated, by assigning a score $S_i$ with respect to successful comparisons;
5. the average of the scores is evaluated in order to obtain the value of the SCMO indicator by means of the formula
$$SCMO = \frac{\sum_{i=1}^{|U|} S_i}{|U|}.$$
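The averaging in step 5 can be sketched as follows. The source does not specify how the score $S_i$ is computed from the comparison of the two lists, so the rule used here (percentage of positions at which the manual and the system rankings agree) is an assumption, as are the function names and the sample rankings.

```python
def list_score(expected, obtained):
    """Percentage of positions at which two rankings agree. This is an
    assumed scoring rule: the text only states that S_i rewards
    successful comparisons."""
    hits = sum(1 for e, o in zip(expected, obtained) if e == o)
    return 100.0 * hits / len(expected)

def scmo(expected_lists, obtained_lists):
    """Average of the per-user scores S_i over all users."""
    scores = [list_score(e, o) for e, o in zip(expected_lists, obtained_lists)]
    return sum(scores) / len(scores)

# Hypothetical rankings for two users: manual lists vs. system output.
manual = [["a", "b", "c", "d"], ["b", "a", "c", "d"]]
system = [["a", "b", "c", "d"], ["a", "b", "c", "d"]]

print(scmo(manual, system))  # 75.0
```

Other rank-agreement measures could be substituted for `list_score` without changing the averaging step.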
After the testing phase the value of this indicator was estimated to be $SCMO \approx 86\%$, showing a high index of functionality.
7.2 The Estimation of the PIA Indicators
To estimate the efficiency the following process has been defined:
1. a set of resources and a set of users of cardinality |U| are considered;
2. an association test with the system is performed for each user in the set and the association times $T_i$ are registered;
3. the average time $\bar{T} = \frac{\sum_{i=1}^{|U|} T_i}{|U|}$ is evaluated;
4. the PIA_AVG value is evaluated with the formula:
$$PIA\_AVG = 100 \cdot \exp(-a \cdot \bar{T}) \qquad (6)$$
where the parameter a is obtained with the formula $a = \frac{\ln(2)}{T\_FAIR^2}$ and T_FAIR = 500 msec, thus leading to a convergence of (6) to a value of 50. T_FAIR is the maximum time considered acceptable for the system to provide a result;
5. the standard deviation of the times $\sigma_T = \sqrt{\frac{\sum_{i=1}^{|U|} (T_i - \bar{T})^2}{|U|}}$ is evaluated;
6. the PIA_STD value is evaluated with the formula:
$$PIA\_STD = \exp(-b \cdot \sigma_T) \qquad (7)$$
where the parameter b is obtained with the formula $b = \frac{\ln(2)}{S\_FAIR^2}$ and S_FAIR = 100 msec, thus leading to a convergence of (7) to a value of 100. S_FAIR is the maximum time considered acceptable for the system to provide a result.
After the testing phase the values of these indicators were estimated to be PIA_AVG = 63 and PIA_STD = 91, showing a high index of efficiency.
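The following Python sketch shows one way the two indicators could be computed from a list of measured association times. It rests on two assumptions that go beyond the printed formulas: the time enters the exponent squared, i.e. the score is scale · exp(−ln(2) · (x / fair)²), which is the reading under which a parameter of the form ln(2)/T_FAIR² yields 50 when the average time equals T_FAIR; and the same 0 to 100 scale is applied to PIA_STD. The function names and the sample times are hypothetical.

```python
import math
from statistics import fmean, pstdev

T_FAIR = 500.0  # msec, threshold for the average time (from the text)
S_FAIR = 100.0  # msec, threshold for the standard deviation (from the text)

def fair_score(value, fair, scale=100.0):
    """Assumed reading of (6) and (7): scale * exp(-ln(2) * (value/fair)**2).
    The score is `scale` at value 0 and scale/2 when value equals `fair`."""
    return scale * math.exp(-math.log(2) * (value / fair) ** 2)

def pia_indicators(times_ms):
    """Return (PIA_AVG, PIA_STD) for a list of association times."""
    mean_t = fmean(times_ms)
    std_t = pstdev(times_ms)  # population deviation: divides by |U|
    return fair_score(mean_t, T_FAIR), fair_score(std_t, S_FAIR)

times = [350.0, 420.0, 390.0, 440.0]  # hypothetical measurements in msec
pia_avg, pia_std = pia_indicators(times)
print(round(pia_avg), round(pia_std))
```

Under this reading both indicators decrease monotonically from 100 as the measured quantity grows, so higher values indicate faster and more regular response times.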
8 Conclusions
In this contribution a profile modelling approach has been proposed, to be applied in every context in which a system has to provide an item to a user on the basis of an estimated preference. The peculiarity of the illustrated approach consists in the employment of fuzzy logic for modelling both the descriptions of the items to be provided by a system and the user profiles. In this way, it is possible to formalise a mathematical scheme of metadata to describe simple as well as complex attributes characterised by collective and imprecise properties. That is done by defining a fuzzy set over each attribute, so that a fuzzy attribute valorisation can be determined. Moreover, the profiling mechanism benefits from the use of fuzzy membership values, since each user can be partially associated with more than one profile. Finally, the adoption of fuzzy operators provides further association mechanisms, enabling the evaluation of compatibility degrees, which constitute the basis for building a ranking of the items to be associated with a specific user. A system architecture based on this model has also been designed. The aim of providing a general system is reflected by the use of a Service Oriented Architecture for the design of a service provider component. A prototype has also been tested with respect to functionality and efficiency. Results show high values for the defined indexes. Future work will address a more comprehensive study of the fuzzy operators involved in the association mechanisms, in order to define the most suitable functions for modelling the different semantics of the personalisation process. In fact, the model considers only a possibilistic semantics for the compatibility among the metadata describing attributes. As a future direction, the veristic semantics [18] should also be explored to provide a more flexible way to express the relationships occurring among metadata.
References
1. Riecken, D.: Introduction: personalized views of personalization. Communications of the ACM 43(8), 26–28 (2000)
2. Eirinaki, M., Vazirgiannis, M.: Web mining for web personalization. ACM Transactions on Internet Technology 3(1), 1–27 (2003)
3. De Bra, P., Brusilovsky, P., Houben, G.: Adaptive hypermedia: from systems to framework. ACM Computing Surveys 31(4es), Article No. 12 (1999)
4. Neven, F., Duval, E.: Reusable learning objects: a survey of LOM-based repositories. In: MULTIMEDIA 2002: Proceedings of the tenth ACM international conference on Multimedia, pp. 291–294. ACM, New York (2002)
5. Azvine, B., Wobcke, W.: Human-centred intelligent systems and soft computing. BT Technology Journal 16(3), 125–133 (1998)
6. Frías-Martínez, E., Magoulas, G., Chen, S., Macredie, R.: Recent soft computing approaches to user modeling in adaptive hypermedia. In: De Bra, P.M.E., Nejdl, W. (eds.) AH 2004. LNCS, vol. 3137, pp. 104–114. Springer, Heidelberg (2004)
7. Brusilovsky, P., Millán, E.: User models for adaptive hypermedia and adaptive educational systems. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web 2007. LNCS, vol. 4321, pp. 3–53. Springer, Heidelberg (2007)
8. Kavcic, A.: Fuzzy user modeling for adaptation in educational hypermedia. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 34(4), 439–449 (2004)
9. Martinovska, C.: A fuzzy-based approach to user model refinement in adaptive hypermedia systems. In: De Bra, P., Brusilovsky, P., Conejo, R. (eds.) AH 2002. LNCS, vol. 2347, pp. 411–414. Springer, Heidelberg (2002)
10. Kosba, E., Dimitrova, V., Boyle, R.: Using fuzzy techniques to model students in web-based learning environments. In: Palade, V., Howlett, R.J., Jain, L. (eds.) KES 2003. LNCS, vol. 2773, pp. 222–229. Springer, Heidelberg (2003)
11. Suryavanshi, B.S., Shiri, N., Mudur, S.P.: A fuzzy hybrid collaborative filtering technique for web personalization. In: Proceedings of the 3rd Workshop on Intelligent Techniques for Web Personalization (ITWP 2005), pp. 1–8 (2005)
12. John, R.I., Mooney, G.J.: Fuzzy user modeling for information retrieval on the world wide web. Knowl. Inf. Syst. 3(1), 81–95 (2001)
13. Yager, R.R.: Fuzzy logic methods in recommender systems. Fuzzy Sets and Systems 136, 133–149 (2003)
14. Popp, H., Lödel, D.: Fuzzy techniques and user modeling in sales assistants. User Modeling and User-Adapted Interaction 5(3), 349–370 (1995)
15. Zadeh, L.: Fuzzy sets. Information and Control 8, 338–353 (1965)
16. Zadeh, L.: A note on web intelligence, world knowledge and fuzzy logic. Data & Knowledge Engineering 50, 291–304 (2004)
17. Dubois, D., Prade, H.: Possibility theory: an approach to computerized processing of uncertainty. Plenum Press (1988)
18. Yager, R.R.: Veristic variables. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 30, 71–84 (2000)
19. Yager, R.R.: On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Transactions on Systems, Man and Cybernetics 18, 183–190 (1988)