The Information Retrieval Series
Series Editor W. Bruce Croft
Jin Zhang
Visualization for Information Retrieval Foreword by Edie Rasmussen
Jin Zhang University of Wisconsin School of Information Studies 532 Bolton Hall 53211 Milwaukee, WI, USA E-mail:
[email protected]
ISBN: 978-3-540-75147-2
e-ISBN: 978-3-540-75148-9
Library of Congress Control Number: 2007937243 ACM Codes: H.3, H.4, H.5 © 2008 Springer-Verlag Berlin Heidelberg This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Cover Design: Künkel Lopka, Heidelberg Printed on acid-free paper 9 8 7 6 5 4 3 2 1 springer.com
This book is dedicated to my parents, wife Yi, and son Theodore
Foreword
It was my good fortune, as a relatively new professor at the University of Pittsburgh’s School of Information Sciences, to meet Jin Zhang when he was first sent to the US for a year of study by his university in Wuhan, China. Jin impressed me with his energy and enthusiasm for research and I welcomed the chance to work with him. Knowing he had only a year, he accomplished in that time what many take two or three years to do, laying the foundation for his PhD in information sciences. Though he had to return to China after that first year, he continued to actively develop his ideas and models on the mathematical foundations of visualization for information retrieval. He was able to return with his family a few years later to complete his degree, when he again worked at a feverish pace to complete his research and thesis on “Visual Information Retrieval Environments”. Jin received his PhD from the University of Pittsburgh in 1999 and moved to the University of Wisconsin – Milwaukee to take up a faculty position in the School of Information Studies, where he is now an Associate Professor. Jin was driven in large part by his passion for his field of study, visualization models for information retrieval. His inspiration at the University of Pittsburgh was Professor Robert Korfhage. A few years earlier Bob had been involved in the design of VIBE (Visualization by Example), one of the earliest models for visualization in information retrieval. The problem of projecting an n-dimensional space onto a two-dimensional one is elegantly but simply solved in VIBE, but it is only one of many possible solutions. Bob’s background in mathematics was a good match for Jin’s, and when he introduced Jin to the problem of visualization for information retrieval, they began a collaboration. There is a fascinating challenge in developing models that are mathematically interesting while producing a display that can be unambiguously interpreted to produce effective retrieval, efficiently calculated and capable of handling large databases. Taking VIBE as a starting point, Jin developed new models, and—not always the case in those early years—insisted on implementing them and evaluating their performance as well. Their collaboration ended with Bob’s death in 1998, six months before Jin completed his PhD. Though his research interests have broadened to include other areas, Jin has continued to work on developing mathematical models and prototypes for information visualization, including some in the Web environment. Visualization for Information Retrieval is the result of over ten years of research in this field. In this book Jin presents the models, limitations and challenges of visualization for information retrieval, and he provides a significant resource for new researchers in the field. Edie Rasmussen University of British Columbia
Preface
The dynamics, diversity, heterogeneity, and complexity of information on the dramatically growing Internet and in other information retrieval systems have posed an unprecedented challenge to traditional information retrieval techniques and theories. These challenges have driven the need for more interactive, intuitive, and effective systems for information retrieval. The situation has generated intense interest in looking for new ways to help users retrieve relevant information. Information visualization techniques, which can demonstrate data relationships in a visual, transparent, and interactive environment, have become our best hope in dealing with this challenge. Information visualization has a very natural relationship to information retrieval. In fact, information retrieval is a thread that runs through all information visualization systems. Information visualization offers a unique way to reveal hidden information in a visual presentation and allows users to seek information from that presentation. Browsing, as a powerful information seeking means, is fully utilized and strengthened in such a visualization environment. Visualization techniques hold a lot of promise for information retrieval. Addressing information visualization from an information retrieval perspective would benefit both information retrieval and information visualization. The book Visualization for Information Retrieval provides a systematic explanation of the latest advances in information retrieval visualization from both theoretical and practical perspectives. It reviews the main approaches and techniques available in the field. It explicates the theoretical relationships between information retrieval and information visualization and introduces major information retrieval visualization algorithms and models. The book addresses crucial and common issues of information retrieval visualization, such as elusive evaluation, notorious ambiguity, and intriguing metaphorical applications, in depth. It also takes a detailed look at the theory and applications of information retrieval visualization for Internet traffic analysis and for Internet information searching and browsing. At the end of the book, the introduced information retrieval visualization models are compared from multiple perspectives. Finally, the book discusses important issues of information retrieval visualization and research directions for future explorations.
Readers of this book will gain a good understanding of the current status of information retrieval visualization, technical and theoretical findings and advances made by leading researchers, sufficient and practical details for implementation of an information retrieval visualization system, and existing problems for researchers and professionals to be aware of. The book is organized and presented as follows: Chap. 1 provides answers to the fundamental questions about information retrieval visualization such as why the information visualization technique is vital and necessary for information retrieval, how it enhances information retrieval on two fronts: querying and browsing, what are the basic information retrieval visualization paradigms, what are the potential applications and implications of information visualization in information retrieval, and what are the basic procedures for the development of an information retrieval visualization model. Chap. 2 covers the basic and necessary concepts and theories of information retrieval. These concepts and theories such as similarity measures, information retrieval models, and term weighting algorithms, are prerequisites for the following chapters about information retrieval visualization models. Putting these concepts and theories together as a chapter would not only avoid unnecessary duplicative introduction of these concepts and theories in the following chapters, but also lay a theoretical foundation and better prepare readers to understand the information retrieval visualization models. Chaps. 3 through 7 address the multiple reference point based models, Euclidean spatial characteristics based models, self-organizing map models, Pathfinder associative network models, and multidimensional scaling models, respectively. The history, concept definition, categorization, algorithm description, algorithm procedure, and applications and implications of these major information retrieval visualization models on information retrieval are discussed in depth. These chapters are at the heart of the book. Chap. 8 introduces the application of information retrieval visualization to the Internet. The Internet not only poses unprecedented challenges for information retrieval visualization but also provides an enormous opportunity for its application. Information retrieval visualization techniques can be used to alleviate the notorious lost in cyberspace syndrome or disorientation during navigation, making navigation smoother and more comfortable. In addition information visualization applications in related fields such as hyperlink hierarchies, subject directories, browsing history, visual search engine results presentation, Web user information seeking behavior patterns, networking security, and user online discussions are included. Chap. 9 addresses the notorious concept of ambiguity in a visual space. Reasons for the ambiguity phenomenon are analyzed in different information retrieval visualization environments, both positive and negative implications on information retrieval are expounded, types of ambiguity are defined, and solutions to the problems are also included. Chap. 10 discusses the basic elements of a metaphor and cognitive implication of a metaphorical interface on communication among users, system developers, and system designers. Metaphorical applications in information retrieval
visualization in various situations and at different levels are analyzed. Procedures and principles of metaphorical application in the field are presented. Chap. 11 focuses on the evaluation issue. Evaluation of information retrieval visualization is both important and difficult. Two aspects, visualization environment evaluation and visualization retrieval evaluation, are distinguished and analyzed. An evaluation standard system for information retrieval visualization, including information exploration, query search, visual information presentation, and controllability, is proposed. The last chapter of the book is titled "Afterthoughts". This chapter briefly recapitulates the main ideas of the preceding chapters. It compares the five major information retrieval visualization models from the angles of visual space, semantic framework, projection algorithm, ambiguity, and information retrieval. Finally, it addresses important issues, challenges, and future research directions of information retrieval visualization.
The information retrieval visualization models selected for this book are based on the following criteria:
[1] They are mainstream and mature algorithms or models in information retrieval visualization. These models are widely used and recognized.
[2] They are representative of various types of information retrieval visualization. Each of the introduced models is sophisticated enough to derive a cluster of related models.
[3] They must reflect information retrieval characteristics. Unique features of information retrieval in the context of information visualization are included.
[4] They can reveal deep semantic and comprehensive relationships of displayed objects.
Although five information retrieval visualization models are introduced in depth, many other models are also included in various contexts in the book, such as metaphorical application and information retrieval visualization evaluation. In each of the model chapters, a complete example of a visualization model is given and its implications for information retrieval are presented. Internet information visualization is an independent chapter because the Internet offers an ideal stage for information visualization techniques, and a wide spectrum of information retrieval visualization approaches can be applied to it. I would like to take this opportunity to thank Dr. Edie Rasmussen for writing the foreword for the book and for her inspiration and support; Dr. Robert Korfhage for introducing me to this amazing and intriguing field of information retrieval visualization when I pursued my Ph.D. at the University of Pittsburgh; Dr. Dietmar Wolfram for reviewing this book and providing valuable suggestions; the anonymous proposal reviewers and final manuscript reviewers for their insightful comments; Ralf Gerstner and the staff at Springer for their excellent and professional work; Ms. Lynda Citro for editing the book; and other people who made a contribution to the book. I am also grateful to the publishers Elsevier, Wiley, and IEEE for permission to use their figures in the book. The work is in part sponsored by the Program of Introducing Talents of Discipline to Universities from the Chinese Ministry of Education and the State Administration of Foreign Experts Affairs of China (Grant No.: B07042). Furthermore, the University of Wisconsin-Milwaukee has been very supportive of the work. Finally, thanks must go to my family for their support.
Contents
Chapter 1 Information Retrieval and Visualization........................................... 1 1.1 Visualization................................................................................................ 3 1.1.1 Definition............................................................................................ 3 1.1.2 Scientific visualization and information visualization........................ 3 1.2 Information retrieval.................................................................................... 4 1.2.1 Browsing vs. query searching............................................................. 5 1.2.2 Information at micro-level and macro-level ....................................... 7 1.2.3 Spatial characteristics of information space ....................................... 8 1.2.4 Spatial characteristics of browsing ................................................... 10 1.3 Perceptual and cognitive perspectives of visualization.............................. 11 1.3.1 Perceptual perspective ...................................................................... 11 1.3.2 Cognitive perspective ....................................................................... 12 1.4 Visualization for information retrieval ...................................................... 13 1.4.1 Rationale........................................................................................... 13 1.4.2 Three information retrieval visualization paradigms ........................ 16 1.4.3 Procedures of establishing an information retrieval visualization model........................................................................... 16 1.5 Summary.................................................................................................... 20 Chapter 2 Information Retrieval Preliminaries ............................................... 21 2.1 Vector space model.................................................................................... 22 2.2 Term weighting methods ........................................................................... 24 2.2.1 Stop words ........................................................................................ 25 2.2.2 Inverse document frequency............................................................. 25 2.2.3 The Salton term weighting method................................................... 26 2.2.4 Another term weighting method....................................................... 26 2.2.5 Probability term weighting method .................................................. 26
2.3 Similarity measures....................................................................................27 2.3.1 Inner product similarity measure ......................................................28 2.3.2 Dice co-efficient similarity measure.................................................28 2.3.3 The Jaccard co-efficient similarity measure .....................................28 2.3.4 Overlap co-efficient similarity measure............................................29 2.3.5 Cosine similarity measure.................................................................29 2.3.6 Distance similarity measure..............................................................30 2.3.7 Angle-distance integrated similarity measure...................................32 2.3.8 The Pearson r correlation measure....................................................33 2.4 Information retrieval (evaluation) models .................................................34 2.4.1 Direction-based retrieval (evaluation) model ...................................34 2.4.2 Distance-based retrieval (evaluation) model ....................................35 2.4.3 Ellipse retrieval (evaluation) model..................................................36 2.4.4 Conjunction retrieval (evaluation) model .........................................36 2.4.5 Disjunction evaluation model ...........................................................38 2.4.6 The Cassini oval retrieval (evaluation) model ..................................39 2.5 Clustering algorithms.................................................................................40 2.5.1 Non- hierarchical clustering algorithm .............................................42 2.5.2 Hierarchical clustering algorithm .....................................................43 2.6 Evaluation of retrieval results ....................................................................45 2.7 Summary....................................................................................................46 Chapter 3 Visualization Models for Multiple Reference Points ......................47 3.1 Multiple reference points ...........................................................................48 3.2 Model for fixed multiple reference points .................................................49 3.3 Models for movable multiple reference points ..........................................52 3.3.1 Description of the original VIBE algorithm .....................................52 3.3.2 Discussions about the model.............................................................59 3.4 Model for automatic reference point rotation ............................................66 3.4.1 Definition of the visual space ...........................................................67 3.4.2 Rotation of a reference point ............................................................69 3.5 Implication of information retrieval...........................................................70 3.6 Summary....................................................................................................72 Chapter 4 Euclidean Spatial Characteristic Based Visualization Models .........73 4.1 Euclidean space and its characteristics ......................................................73 4.2 Introduction to the information retrieval evaluation models......................75 4.3 The distance-angle-based visualization model...........................................79 4.3.1 The visual space definition 
...............................................................79 4.3.2 Visualization for information retrieval evaluation models ...............81 4.4 The angle-angle-based visualization model ...............................................88 4.4.1 The visual space definition ...............................................................88 4.4.2 Visualization for information retrieval evaluation models ...............89 4.5 The distance-distance-based visualization model ......................................97 4.5.1 The visual space definition ...............................................................97 4.5.2 Visualization for information retrieval evaluation models ...............99 4.6 Summary.................................................................................................. 104
Chapter 5 Kohonen Self-Organizing Map--An Artificial Neural Network .... 107 5.1 Introduction to neural networks............................................................... 107 5.1.1 Definition of neural network .......................................................... 108 5.1.2 Characteristics and structures of neuron network........................... 109 5.2 Kohonen self-organizing maps ................................................................ 111 5.2.1 Kohonen self-organizing map structures ........................................ 112 5.2.2 Learning processing of the SOM algorithm.................................... 113 5.2.3 Feature map labeling ...................................................................... 119 5.2.4 The SOM algorithm description...................................................... 120 5.3 Implication of the SOM in information retrieval ..................................... 121 5.4 Summary.................................................................................................. 124 Chapter 6 Pathfinder Associative Network..................................................... 127 6.1 Pathfinder associative network properties and descriptions .................... 128 6.1.1 Definitions of concepts and explanations ....................................... 128 6.1.2 The algorithm description............................................................... 131 6.1.3 Graph layout method ...................................................................... 136 6.2 Implications on information retrieval ...................................................... 137 6.2.1 Author co-citation analysis ............................................................. 137 6.2.2 Term associative network............................................................... 139 6.2.3 Hyperlink........................................................................................ 140 6.2.4 Search in Pathfinder associative networks...................................... 141 6.3 Summary.................................................................................................. 142 Chapter 7 Multidimensional Scaling ............................................................... 143 7.1 MDS analysis method descriptions .......................................................... 144 7.1.1 Classical MDS ................................................................................ 144 7.1.2 Non-metric MDS ............................................................................ 151 7.1.3 Metric MDS .................................................................................... 157 7.2 Implications of MDS techniques for information retrieval ...................... 158 7.2.1 Definitions of displayed objects and proximity between objects ... 158 7.2.2 Exploration in a MDS display space............................................... 160 7.2.3 Discussion ...................................................................................... 161 7.3 Summary.................................................................................................. 163 Chapter 8 Internet Information Visualization................................................ 165 8.1 Introduction ............................................................................................. 165 8.1.1 Internet characteristics.................................................................... 165 8.1.2 Internet information organization and presentation methods ......... 
166 8.1.3 Internet information utilization....................................................... 168 8.1.4 Challenges of the internet ............................................................... 170 8.2 Internet information visualization............................................................ 171 8.2.1 Visualization of internet information structure............................... 172 8.2.2 Internet information seeking visualization ..................................... 180
8.2.3 Visualization of web traffic information.........................................183 8.2.4 Discussion history visualization .....................................................188 8.3 Summary..................................................................................................189 Chapter 9 Ambiguity in Information Visualization .......................................191 9.1 Ambiguity and its implication in information visualization ....................192 9.1.1 Reason of ambiguity in information visualization..........................192 9.1.2 Implication of ambiguity for information visualization..................193 9.2 Ambiguity analysis in information retrieval visualization models ..........194 9.2.1 Ambiguity in the Euclidean spatial characteristic based information models.........................................................................194 9.2.2 Ambiguity in the multiple reference point based information visualization models .......................................................................202 9.2.3 Ambiguity in the Pathfinder network .............................................207 9.2.4 Ambiguity in SOM..........................................................................209 9.2.5 Ambiguity in MDS..........................................................................210 9.3 Summary..................................................................................................211 Chapter 10 The Implication of Metaphors in Information Visualization........215 10.1 Definition, basic elements, and characteristics of a metaphor ...............215 10.2 Cognitive foundation of metaphors........................................................218 10.3 Mental models, metaphors, and human computer interaction................219 10.3.1 Metaphors in human computer interaction..................................219 10.3.2 Mental models.............................................................................220 10.3.3 Mental models in HCI.................................................................220 10.4 Metaphors in information visualization retrieval ...................................223 10.4.1 Rationales for using metaphors...................................................223 10.4.2 Metaphorical information retrieval visualization environments ..............................................................................225 10.5 Procedures and principles for metaphor application ..............................231 10.5.1 Procedure for metaphor application ............................................231 10.5.2 Guides for designing a good metaphorical visual information retrieval environment..................................................................232 10.6 Summary................................................................................................236 Chapter 11 Benchmarks and Evaluation Criteria for Information Retrieval Visualization......................................................................................239 11.1 Information retrieval visualization evaluation .......................................239 11.2 Benchmarks and evaluation standards ...................................................243 11.2.1 Factors affecting evaluation standards ........................................243 11.2.2 Principles for developing evaluation benchmarks.......................244 11.2.3 Four proposed categories for evaluation criteria.........................244 11.2.4 Descriptions of proposed benchmarks 
........................................246 11.3 Summary................................................................................................253
Chapter 12 Afterthoughts................................................................................. 255 12.1 Introduction ........................................................................................... 255 12.2 Comparisons of the introduced visualization models ............................ 257 12.3 Issues and challenges............................................................................. 260 12.4 Summary................................................................................................ 268 Bibliography ...................................................................................................... 269 Index ................................................................................................................... 287
Chapter 1 Information Retrieval and Visualization
Available digitized information on the Internet, in OPAC systems, digital libraries, and other forms of information retrieval systems grows at an exponential rate. About 1 million terabytes of data are generated annually and more than 99% are in digital form (Keim, 2001). Data in these information systems is becoming more complex and more dynamic. More and more people are accessing these data on a daily basis. As users with different backgrounds, traits, abilities, dispositions, and intentions increase dramatically, users' needs also become more diverse and complicated. Therefore, the demand for a more effective and efficient means of exploiting and exploring data is a pressing issue. This poses a challenge to the traditional approaches and techniques used in current information retrieval systems. In a traditional information retrieval system, information retrieval is primarily keyword-based search and the search process is discontinuous because users have no control over the internal matching process. The internal matching process is not transparent to users, search result list presentation is linear and has a limited display capacity, relationships and connections among documents are rarely illustrated, and the retrieval environment lacks an interactive mechanism for users to browse. These inherent weaknesses of traditional information retrieval systems prevent them from coping with the sheer complexity of information needs and the multitude of data dimensionality. Powered by ever-growing megahertz and megabytes, computers and their powerful graphic capacity, in conjunction with mature modern information retrieval theory and human-computer interaction theory, have enabled information visualization techniques to emerge as an innovative solution to these problems. Information visualization is an emerging field whose primary goal is the spatialization of information for users to interact with. Windows, icons, menus, and pointing devices equip interfaces with an unprecedented interactive capacity. Graphically agile computers have made sophisticated visual presentations feasible. As a result, new intuitive and interactive information visualization methods for information organization, presentation, explanation, and retrieval can provide decent insight into a data collection, capture the richness of both the data contexts and contents, and discover patterns in the data. The methods of information retrieval visualization empower people to make full use of their flexibility, creativity, and imagination to search for information. Truly elegant information retrieval visualization techniques should serve both the young and the old, both the experienced and the inexperienced, both people with information retrieval expertise and those without, and both English and non-English speakers.
Images have been a constant presence in human intellectual activity throughout human civilization because they are a primary means of information expression and communication. Early visual presentations date back to the 14th century and even earlier, for example the cosmographical diagram in which the earth is situated at the center of the universe, surrounded by concentric circles representing the four elements, the seven planets, the signs of the zodiac, and the positions and phases of the moon (Cresques, 1978). Visualization applications have been in existence almost since the modern computer was invented. The inquiry into modern graphics theory started with pioneering work that outlined the principles and theory for the visual presentation of quantitative data (Tufte, 1983). The concept of information visualization, used to describe 2D and 3D animation for exploring information and its structure, was coined by Robertson et al. (1989). Korfhage (1988) made a significant contribution to the early research on information retrieval visualization. He focused on the application of information visualization in information retrieval, introduced important visualization concepts such as reference points or interest points, integrated traditional information retrieval theories with information visualization (such as the visualization of a conventional information retrieval model), and came up with new information visualization models for information retrieval. Information retrieval and information visualization have a natural and inherent relationship. A visual presentation, regardless of its content and form, is supposed to convey information to people by a visual means. People receive the information by browsing it. From the information retrieval perspective, that is a process of information retrieval because people use browsing, one of the two information seeking means, to seek information through a special medium. Converting information from an original form to a visual presentation, in a broad sense, is a process of subject analysis and information organization. Impacts and benefits are mutual for both information retrieval and information visualization. Information retrieval has had a profound impact on the evolution of information visualization as a field. Many task analyses and user studies have framed interacting with information visualization as an information retrieval task (Chen, 2005). The spatial characteristics of both information space and information seeking lay the theoretical foundation for the application of information visualization in information retrieval. The spatial, perceptual, and cognitive advantages of information visualization can be used to strengthen and enhance information retrieval in multiple ways. Throughout this book, terms such as information retrieval visualization environment, visual space, semantic framework, visual presentation, and visualization configuration are frequently used, so it is necessary to define and distinguish them. A visual space refers to a 2-dimensional or 3-dimensional space where projected objects are displayed and internal semantic relationships are illustrated. A semantic framework refers to a structure onto which objects can be projected. A visual configuration or visual presentation refers to the visual display which is constituted by a semantic framework, the projected objects, and their contexts in a visual space. A visual configuration reveals internal semantic relationships of objects from a data set.
The same visual space can hold different visual configurations
from different data sets. An information retrieval visualization environment includes a visual space, interactive information retrieval features and functionality, and visual configurations. An information retrieval visualization environment should include all of the elements of an information retrieval visualization system.
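To make these distinctions concrete, the following sketch models the four terms as simple data structures. It is only an illustrative reading of the definitions above; the class names and fields are assumptions introduced here, not constructs from the book or from any particular system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SemanticFramework:
    """A structure onto which objects can be projected, e.g. a set of
    named reference points at fixed positions (an assumed representation)."""
    reference_points: Dict[str, Tuple[float, float]]

@dataclass
class VisualConfiguration:
    """A semantic framework plus the projected objects and their contexts;
    it reveals the internal semantic relationships of one data set."""
    framework: SemanticFramework
    object_positions: Dict[str, Tuple[float, float]]

@dataclass
class VisualSpace:
    """A 2-D or 3-D space; the same space can hold different visual
    configurations produced from different data sets."""
    dimensions: int
    configurations: List[VisualConfiguration] = field(default_factory=list)

@dataclass
class IRVisualizationEnvironment:
    """A visual space, its configurations, and interactive retrieval features."""
    space: VisualSpace
    retrieval_features: List[str]  # e.g. ["query search", "browsing", "zooming"]
```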
1.1 Visualization
1.1.1 Definition
According to McCormick et al. (1987), visualization is a method of computing which transforms the symbolic into the geometric, enables researchers to observe their simulations and computations, offers a method for seeing the unseen, enriches the process of scientific discovery, and fosters profound and unexpected insights. Visualization is the process of transforming data, information, and knowledge into graphic presentations to support tasks such as data analysis, information exploration, information explanation, trend prediction, pattern detection, rhythm discovery, and so on. Without visualization assistance, people perceive or comprehend less of the data, information, or knowledge, for a variety of reasons. These reasons may include the limitations of human vision, or the invisibility and abstractness of the data, information, and knowledge. Visualization requires certain methods or algorithms to convert raw data into a meaningful, interpretable, and displayable form to visually convey information to users. In this sense, visualization is the process of crystallizing a mental image, a value-added process of information reorganization and knowledge reconstruction, or a special process of communication between users and data. Visualization is also a visual data analysis method that outperforms numerical and statistical methods because data contexts and relationships are maintained during visual data analysis.
1.1.2 Scientific visualization and information visualization
Generally speaking, visualization can be classified into two categories: scientific visualization and information visualization. Scientific visualization is often used as an augmentation of the human sensory system by showing things that are on timescales too fast or slow for the eye to perceive, or structures much smaller or larger than human scale, or phenomena such as x-ray or infrared radiation that people cannot directly sense (Munzner, 2002). Examples of scientific visualization applications include, but are not limited to, shapes of molecules, missile tracking, astrophysics, fluid dynamics, medical images, ozone layer display, and fluid flow patterns over a hemispherical surface. Information visualization is generally utilized to view abstract information. An incomplete list of information visualization applications includes visual reasoning, visual data modeling,
visual programming, information retrieval visualization, visualization of program execution, visual languages, spatial reasoning, and visualization of systems (Morse et al., 1995). Scientific visualization is informative, and information visualization is also scientific. Scientific visualization and information visualization share similarities, and both employ a visual means to present and explore information. Although their fundamental design principles, implementation means, and concerns are largely common, there is a striking difference between scientific visualization and information visualization. Information visualization does not have an inherent spatial structure or geometry of data to display, whereas scientific visualization possesses an inherent spatial structure of data to illustrate. In other words, unlike scientific visualization, a spatial structure or framework for semantic relationships among data must be created in information visualization. As a result, the primary task of scientific visualization is to faithfully reflect and render the inherent structure, while information visualization has to define a spatial structure suitable for the display of abstract data. On the one hand, finding or defining a spatial structure for information visualization is challenging because data in an information space may be multi-faceted, relationships among data are interwoven and complicated, and the diverse nature of data also contributes to the complexity. On the other hand, the characteristic of not inheriting a spatial structure from the original data gives people broader imaginative room to define and create any meaningful and interpretable spatial structure for visualization. Defining such a spatial structure for information visualization is not simply a process of drawing the objects in a visual space. It is a process of extracting salient displayable attributes from objects, establishing a semantic framework for the displayed objects, organizing the information, projecting the objects onto the structure, and synthesizing search features, objects, and object relationships into the visual space. Therefore it is a creative and sophisticated process.
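As a rough, deliberately simplified illustration of that process, the sketch below extracts term-count attributes from two toy documents, uses two chosen terms as a minimal semantic framework, and projects each document onto it. The documents, vocabulary, and projection rule are all assumptions made for the example; real systems use far richer attributes and far more sophisticated frameworks, such as those introduced in later chapters.

```python
from collections import Counter
from typing import Dict, List, Tuple

def extract_attributes(text: str, vocabulary: List[str]) -> Dict[str, int]:
    """Extract salient displayable attributes: here, simple counts of
    vocabulary terms (a deliberate simplification)."""
    counts = Counter(text.lower().split())
    return {term: counts[term] for term in vocabulary}

def project(attrs: Dict[str, int], axes: Tuple[str, str]) -> Tuple[float, float]:
    """Project an object onto a two-term semantic framework: each axis
    carries the relative weight of one chosen term."""
    total = sum(attrs.values()) or 1
    return attrs[axes[0]] / total, attrs[axes[1]] / total

documents = {
    "d1": "visualization of retrieval results and retrieval models",
    "d2": "browsing a visualization interface",
}
vocabulary = ["visualization", "retrieval", "browsing"]
axes = ("retrieval", "browsing")  # the assumed semantic framework
layout = {doc_id: project(extract_attributes(text, vocabulary), axes)
          for doc_id, text in documents.items()}
print(layout)  # positions of the projected objects in the visual space
```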
1.2 Information retrieval
Information retrieval is an important and long-standing research field. Information retrieval refers to a process of searching, exploring, and discovering information from organized data repositories to satisfy users' information needs. Information retrieval, in the broad sense, contains two fundamental components: information retrieval itself and information organization. They are dependent upon each other like the two sides of a coin; one cannot be discussed while ignoring the other. From the perspective of common users, information organization in an information retrieval system is an internal process. Although information organization is essential for information retrieval, it may be invisible and not transparent to users. But from the system perspective, information organization is indispensable, vital, and fundamental. The way information is organized and stored affects and determines the way information is retrieved. Information retrieval visualization systems are no exception. Information retrieval visualization
requires appropriate information organization methods to support visual information presentations and their retrieval features. It is worthwhile to analyze the two fronts of information retrieval (browsing and query searching) and the spatial characteristics of an information space before embracing information visualization techniques wholeheartedly. This analysis helps us understand the necessity of applying information visualization to information retrieval.
1.2.1 Browsing vs. query searching
There are two basic and widely recognized paradigms for information retrieval: browsing and query searching. These paradigms reflect two basic kinds of information seeking behaviors. Each of the paradigms has its own strengths and weaknesses, and they are complementary to each other. Query searching is a complex task which involves the articulation of a dynamic information need into a logical group of relevant keywords. The relationships among the keywords in a query are parsed and the keywords are matched with the surrogates of documents/objects in a database. Consequently, a list of the best matched documents is provided to users. The risk of query-based search is that if a user's vocabulary does not match the index vocabulary in a database, search failure is inevitable. Browsing refers to viewing, looking around, glancing over, and scanning information in an information environment. Browsing is an extremely important means to explore and discover information. An information environment is indispensable, essential, and vital for browsing. A well-organized information environment assures smooth and successful browsing. Widely used information organization methods for browsing are hyperlink structures and hierarchical structures. However, browsing capacity is not fully utilized in these information environments, though they are much better than a linear list environment. Differences between browsing and query searching are summarized as follows.
• Relevance judgment. Query searching is based on keyword matching between query terms and surrogates of documents in a database at a lexical level rather than at a conceptual level. Keyword matching is done automatically by an information retrieval system. Relevance judgment of query searching, which determines whether a document is retrieved or not, is done by the system. However, the relevance judgment of browsing is completed by users, and it is a concept-matching process instead of a keyword-matching process. Browsing is a heuristic search through a well-connected collection in order to find information relevant to one's need (Thompson and Croft, 1989).
• Continuity. A retrieval process is continuous for browsing, while it is discrete for query searching. Every step of the entire retrieval process, such as selecting a browsing path, examining a context, and making a relevance judgment, is continuous and controlled by users during browsing. Query searching is discontinuous in some sense. After a query is submitted to an information retrieval system, users lose control over further internal query
processing. The internal query processes such as query parsing, term matching, and result ranking are a "black box" for users. Users cannot control them. Users regain control over the process only after search results are returned to them.
• Time and effort costs. Browsing is a laborious and lengthy task compared to query searching in general. Browsing can be time-consuming because users have to remember the browsing path, digest the contents, and constantly make decisions. This may result in information overload in a poorly designed information environment. Browsing may not be efficient, especially for an exhaustive search in a large data set. Query searching involves term selection and query formulation, and has fewer steps to complete a search. Query searching may be more efficient in this sense.
• Information seeking behavior. Browsing is a kind of "what can you (system) offer" information seeking behavior while query searching is a kind of "what do I (user) want" information seeking behavior. Information seeking is similar to shopping in a store. When a customer shops in a store, he/she prefers to have a salesperson discuss what the store carries in terms of the needed merchandise rather than directly asking for what he/she is looking for. That is because the salesperson could give more options in terms of prices, styles, and types for the customer to compare and make a smart decision. This also holds true for information seeking. Browsing allows users to compare the contents of browsed information or data, guided by a variety of controls, in a very flexible way.
• Iteration. Browsing involves successive acts of glimpsing, fixing on a target to examine visually or manually more closely, examining, then moving on to start the cycle over again (Bates, 2002). It is clear that a retrieval task is completed by a series of browsing acts. Query searching also involves acts of defining search terms, formulating a query, and examining results to complete a search. Query searching may be iterative, but the way and degree of iteration of browsing and query searching are different.
• Granularity. Granularity refers to the number of relevant items that are evaluated at one time in the process of feedback (Thompson and Croft, 1989). Browsing allows the user to manually examine one item at a time to evaluate its relevance. Query searching provides a group of retrieved documents for feedback processing.
• Clarity of information need. Not everyone begins his/her search with a clearly defined information need. The vagueness of an information need may result from a lack of domain knowledge or an uncertain relationship between what users want and the related concepts and contexts. Browsing is distinguished from querying by the absence of a definite target in the mind of the user (Waterworth and Chignell, 1991). Browsing is especially appropriate for an ill-defined problem or for exploring new task domains (Marchionini and Shneiderman, 1988). Although browsing has a poorly conceived and unplanned nature, it may be both goal-directed and non-goal-directed rather than simply aimless (Chang and Rice, 1993; Wiesman et al., 2004). Query searching usually
requires a relatively well-conceived information need for which keywords can be chosen and a query formulated.
• Interactivity. The nature of browsing is its interactivity in exploration. Almost all steps of browsing exploration involve interaction between users and an information environment. This characteristic of heavy interaction makes browsing more complicated and challenging because of the dynamic human factor. Query searching has fewer steps to complete a search and therefore involves less interaction than browsing.
• Retrieval results. Query searching primarily focuses on looking for individual items or documents stored in a database. Browsing can lead to a wide range of retrieval results, from contextual information, to structural information, to relational information, to, of course, individual items or documents. The results of browsing are richer and more diverse than those of query searching.
Conventional information retrieval systems like OPAC systems or search engines are primarily based on the query searching paradigm and have only limited browsing ability. For instance, if a thesaurus is provided, users can scan it to look for synonyms, antonyms, related terms, broader terms, or narrower terms for a query. Users can also traverse a returned results list to examine the relevant documents. It is evident that browsing lags behind query searching in conventional information retrieval systems.
1.2.2 Information at micro-level and macro-level
A well-organized data collection or database should provide users with information at two different levels: micro-level and macro-level. Information at the micro-level refers to individual objects or documents such as their contents, subject surrogates, and even full texts. Information at the macro-level refers to the aggregate information of objects or documents in a data collection. Information at the micro-level is direct and obvious while information at the macro-level is indirect and sophisticated. The aggregate information is derived, or generated, from individual objects in a data collection. It is an important asset of the data set and it is also vital and valuable for users because the aggregate information at the macro-level is unique, heuristic, holistic, rich, and useful. The two kinds of information at different levels are different in nature. The value-added aggregate information provides users with object connections, rhythms, trends, and patterns which transcend individual objects at the micro-level. The information at the macro-level also helps users to explain the information at the micro-level, and to locate related information of a particular item/object at the micro-level by illustrating a holistic overview, heuristic contexts, and other rich information. It is the result of information integration, information organization, and information generalization for a data collection. The form and contents of aggregate information lean heavily on the way and approach of information organization and information presentation. In other words, the aggregate information at the macro-level can vary with information organization methods and information presentations for the same data set. It is not the result of simply putting all of the objects together. The
value-added information at the macro-level reflects the characteristics of the whole data collection, interconnection of its objects, and the interdependence of its objects. It is beyond the individual objects. The significance of information at the macro-level on information retrieval resides in that it enables users to discover new emerging topics which may be the future trend, explore related objects which can be used to adjust their search strategy and reformulate new queries, reveal the internal structural distribution patterns of objects which can be used to optimize internal data structures by minimizing space density of a data collection, expose the intrinsic semantic clues which can be used for clustering analysis and correlation analysis, and provide a fundamental base for data browsing and data mining. Certainly the ultimate aim of an information retrieval system is to provide users with accurate, relevant, and reliable information. Clearly the information should not include only the information at the micro-level. Toward this aim, information at the two different levels in an information retrieval system should be available and accessible to users by both browsing and query searching. It is crystal clear for users that objects/documents in a data collection are always apparent targets of information retrieval. However, information retrieval, in a broader sense, should not only be limited to retrieving individual objects or documents of a data collection. Information retrieval should go beyond searching for individual items or objects. After objects or documents are ordered by an information organization method in an information retrieval system, what the system can provide users with are not only the apparent individual objects or documents, but also the aggregate information such as the relationships of these individual objects, the contexts of these objects, and the semantic frameworks which hold the individual objects. It is apparent that query searching primarily targets information at the micro-level by word matching between query terms and terms from individual documents and then returns the individual documents. In other words, aggregate information at the macro-level is hardly utilized, if not totally ignored, in the query searching paradigm. In contrast, browsing can target information at both the micro-level and macro-level by examining individual documents and contextual information derived from individual documents (See Fig. 1.1.).
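The sketch below is a minimal illustration of how macro-level information can be derived from micro-level objects, using collection-wide term frequencies and term co-occurrence as two simple kinds of aggregate information; the documents and keyword lists are invented for the example.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List

# Micro-level: individual documents, reduced here to keyword lists (toy data).
documents: Dict[str, List[str]] = {
    "d1": ["visualization", "retrieval", "browsing"],
    "d2": ["retrieval", "query", "ranking"],
    "d3": ["visualization", "query", "interface"],
}

# Macro-level: aggregate information derived from the whole collection,
# e.g. collection-wide term frequencies and term co-occurrence, which no
# single document exposes on its own.
term_frequency = Counter(term for terms in documents.values() for term in terms)
co_occurrence = Counter(
    pair
    for terms in documents.values()
    for pair in combinations(sorted(set(terms)), 2)
)

print(term_frequency.most_common(3))  # dominant topics of the collection
print(co_occurrence.most_common(3))   # strongest term-term relationships
```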
1.2.3 Spatial characteristics of information space
An information space is multidimensional, abstract, and invisible. It possesses two basic characteristics: semantic and spatial characteristics. The semantic characteristic is apparent because it results from the information organization of a data set, reveals semantic relationships among data, and enables users to explore and discover information from the data collection. The spatial characteristic is not as obvious as the semantic characteristic. Abstract information per se has no shape
(Koike, 1993).

Fig. 1.1. Information retrieval and information at the two levels. [Figure: query searching targets objects/items at the micro-level, while browsing targets both objects/items and the aggregate information of objects at the macro-level.]
Information itself does not constitute a space. Instead, semantic relationships among the data/information constitute the structure of the information space. An information space can be constituted by intrinsic attributes such as shared keywords/subjects, citations, hyperlinks, and authors; or extrinsic structures like a subject directory, a thesaurus system, and an organized search result list; or the combination of both the intrinsic and extrinsic. Web pages can be connected by hyperlinks. Documents can be linked by their citations, categorized onto a hierarchical structure such as a subject directory, classification system, or a thesaurus, indexed by a group of keywords in a Boolean-based system, and described in a document-term vector form. As an important property of a space, the distance between two objects in an information space can be defined as the shortest hyperlink path in a hyperlink-based system, the shortest citation path in a citation-based system, the shortest path on a hierarchical structure, the similarity in a Boolean system, and the Euclidean distance in the vector model, respectively. Direction, another property of a space, has a special meaning in a hyperlink-based system and a citation-based system. If an object links to or cites another object in a database, it means that one object is directed to another object, but it does not mean that the reverse also holds. In a hierarchical structure, moving up (down) indicates a jump from a node at a lower (higher) level to another node at a higher (lower) level. Moving left (right) in such a system means shifting the current node to a left (right) sibling node. In a hyperlink system, "Back" and "Forward" imply returning to the previous webpage and moving on to the next webpage, respectively, in a browsing path. As we know, a vector-based information retrieval model defines a high-dimensional space. We have the distance-based information retrieval model and the angle-based information retrieval model in a vector-based information retrieval system. Retrieval boundary, retrieval area, overlapping area, and size of
an area are basic concepts of the information retrieval models used in the cosine model, ellipse model, conjunction evaluation model, disjunction evaluation model, and so on. In fact, the vector document model corresponds to a hyperspace where all the special properties of a space are preserved, although they are invisible to people. The spatial characteristic of an information space is also confirmed by the impact of users' spatial ability on their information retrieval performance: individuals with high spatial ability tend to outperform individuals with low spatial ability when information retrieval requires the construction of spatial structures and spatial relations (Seagull and Walker, 1992; Vicente et al., 1987). Because of these spatial characteristics of an information space, it is no coincidence that people may become "disoriented in an information space" and "lost in cyberspace".
1.2.4 Spatial characteristics of browsing

Browsing depends upon an information environment and is clearly associated with direction, distance, position, and other fundamental spatial properties. Browsing constitutes a series of spatial movements from one attention point to another. An attention point can be a Web page in a hyperlink-based system, a node of a hierarchical subject directory, a document from a returned result list of an information retrieval system, a subject term from a thesaurus, or a citation from a citation system. When users stay at an attention point, they inspect the content of an object, examine the contexts of the attention point, make a relevance judgment about a document, select appropriate search terms from a thesaurus, identify a potential trend, analyze meaningful clusters, compare useful patterns, interpret interesting information, find new search clues, evaluate search results, or reformulate their search strategy. The spatial patterns of browsing in an information space are formed directly from these attention points. Spatial browsing movements are directed. A series of spatial movements produces a visible or invisible browsing path for users. The terms start point, end point, and attention point correspond to positions or nodes in a browsing path. The distance between two nodes on the path is defined as the number of nodes between the two end nodes on the path minus one; distance is an important and meaningful concept in this context. Browsing may be forward or backward. Backward browsing involves revisiting or reviewing an already browsed attention point on a browsing path, while forward browsing may increase the path length by adding new attention points. Thus, browsing has a natural and inseparable relationship with a space. In fact, browsing relies on an information space, which can be one-dimensional (like a list of returned search results or subject terms), two-dimensional, or three-dimensional. It is the space that underlies the browsing paths. The browsing paths actually constitute a browsing space derived from the space that users browse. For instance, when browsing in a thesaurus system, users enter a term as a start attention point; the next possible attention points are synonymous, antonymous, related, broader, or narrower terms. Selecting a new term increases the length of the browsing path. After browsing is completed, the browsing paths form a sub-browsing space that results from the thesaurus space.
It is apparent that browsing in an information space requires guidance so that users do not become tired and disoriented.
1.3 Perceptual and cognitive perspectives of visualization

Without a doubt, interaction with visual information involves both human cognitive and perceptual activities. A picture, as a special vehicle for thought, inspires spatial and holistic thinking. Perceiving and thinking are intertwined, and truly productive thinking takes place in the realm of imagery (Arnheim, 1972). Although perceiving, recognizing, understanding, and reasoning about objects in an environment seem simple, researchers are still far from achieving a complete understanding of how these processes function in the human brain. Information seeking itself involves heavy cognitive activity. Information retrieval visualization should be grounded in the fundamentals of cognition in order to maximize perceptual ability and minimize the cognitive load in information seeking.
1.3.1 Perceptual perspective

People perceive information primarily through vision. Visualization capitalizes on the innate abilities of the human perceptual system, because human vision is the most highly developed human sense for receiving, recognizing, and understanding information in our environment (Colonna, 1994). A picture is worth a thousand words! Pictures naturally appeal to humans because they instantly convey information to our minds for easier analysis and assimilation. The human visual system can rapidly identify and distinguish between an incredibly diverse variety of objects that may be chromatic or achromatic, dynamic or static, regular or irregular, in a two-dimensional or three-dimensional space. The visual cortex consists of approximately thirty interconnected visual areas in the brain. It is responsible for processing visual stimuli, excels at pattern recognition, and holds a very well-defined map of spatial information in vision. According to one study (Zeki, 1992), four parallel systems within the human visual cortex work simultaneously to process visual input received from the retina: one system is responsible for motion, one is responsible for color, and two are responsible for form. It is this parallel processing mechanism that makes perceptual processing in the brain amazingly rapid and efficient. This may explain why people respond more naturally to a visual presentation than to a language presentation, since their perceptual system processes pictures in a parallel way and textual messages in a linear way. In addition, graphic representations can show the spatial relationships among a large number of objects much more quickly and with less memory than natural language (Morse et al., 1995). However, this does not mean that visual presentation can replace language expression, nor that a visual presentation is more easily created than a language expression.
The human perceptual system not only receives but also understands visual information. If conceptual information is presented spatially, this helps users understand, learn, and remember it (Paivio, 1990). Most of the concepts that a human establishes within an environment are acquired through visual perception, because graphic entities like the point, line, shape, color, size, location, and motion of objects may form a variety of patterns. These patterns reveal information, encapsulate knowledge, and elucidate properties of data.
1.3.2 Cognitive perspective

It is widely recognized that a visual presentation extends the cognitive ability of humans to some extent. Visualizations are regarded as a form of external cognition in which internal mental representations are offloaded onto an external medium to relieve the cognitive burden and speed up processing (Scaife and Rogers, 1996). These visual presentations amplify cognitive ability by increasing resources, reducing search efforts, enhancing the recognition of patterns, utilizing perceptual inference, and allowing for perceptual monitoring and manipulation of the medium (Card et al., 1999). The theory of cognitive facilities was introduced by Jackendoff (1992). This theory revealed two fundamental cognitive mechanisms, or cognitive facilities, each responsible for a different kind of knowledge processing and knowledge representation. One facility processes spatial structures and objects, whereas the other processes symbols such as language. Each has its own ways of acquiring, analyzing, transforming, classifying, organizing, integrating, and representing knowledge. Although the two cognitive facilities process different objects in different ways, they are definitely not mutually exclusive in cognitive processing. In fact they are complementary, and they benefit from each other when both facilities are applied to the same cognitive process. As a special cognitive process, information retrieval is affected by the two cognitive facilities. The significance of the theory hinges on the fundamental cognitive principle it describes: the two facilities should be fully utilized in information retrieval. That is, an information retrieval system should provide users with an environment where both cognitive facilities can be fully used to maximize cognitive ability in the information seeking process. Unfortunately, query-searching-based information retrieval systems are primarily built in favor of one cognitive facility, and they consider little (if not totally ignore) the other. That is because the majority of current approaches to information retrieval are linguistic in nature, requiring the use of vocabulary and syntax (Allen, 1998). In order to ensure effective and efficient communication and interaction in a visualization environment, any visual presentation should facilitate the human cognitive process. Any design which conflicts with the preferences of the human cognitive process would definitely increase the cognitive load of users. The introduction of information visualization attempts to address the inherent problems of information retrieval systems by utilizing human perceptual ability and amplifying
cognitive capacity. Despite the potential and promise of visualization, an information visualization environment may not be effective if it is poorly designed. That is because the information retrieval visualization environment itself, which provides users with new interactive means such as query searching, browsing, pattern detecting, navigating, and so on, may create new cognitive loads for users. Users must understand the visual configuration and features offered in a visual space and interact with them. This reminds us that a visualization environment should strike a balance between the new cognitive load and the new visualization features when it pursues the maximization of new features. The potential cognitive load in an information visualization environment can be minimized by an array of methods and principles that we will discuss in later chapters. If it is well designed, the overall cognitive benefits of implementing information visualization should surpass the negative cognitive impact.
1.4 Visualization for information retrieval
Information retrieval visualization refers to a process that transforms the invisible abstract data and their semantic relationships in a data collection into a visible display and visualizes the internal retrieval processes for users. Basically, information retrieval visualization is comprised of two components: visual information presentation and visual information retrieval. The visual information presentation provides a platform where visual information retrieval is performed or conducted.
1.4.1 Rationale

The benefits of applying visualization to information retrieval range from using human perceptual ability, to reducing cognitive workload, to enhancing retrieval effectiveness. Let us address this issue in detail.

1. Information retrieval visualization provides an ideal and natural platform for browsing. Both browsing and query searching can be conducted effectively and achieve mutual benefits in a visual space. Browsing can be fully supported and accommodated thanks to the spatial characteristic of an information retrieval visualization environment, which provides rich information for browsing. Browsing within an information retrieval visualization environment makes the relevance judgment of objects more intuitive and the clarification of users' information needs more convenient. Browsing in a visualization environment goes far beyond simply scanning, casually looking around, or superficially glancing over; it is associated with an array of rich interactive activities used to fulfill information retrieval tasks. These interactive activities are supported by interactive visualization techniques such as brushing and linking, focus and context, panning and zooming, overview and details, and various lens approaches (Hearst, 1999). These interactive activities play a
crucial role in a successful information retrieval task. They help users to define their information needs, narrow down to spots of interest, examine details, compare related objects, and identify new territory. An interactive presentation transcends a static presentation because, with the addition of interactive features, visual presentations may be customized and personalized. An interactive information environment offers users multiple ways to interact with a system and supports them in achieving their goals and completing their tasks effectively. In conclusion, browsing within an information retrieval visualization environment becomes more efficient and effective than within a traditional information retrieval environment.

2. Information retrieval visualization realizes the spatialization of an information space. This is done by projecting an invisible and abstract information space onto a visible visual space. An organized data collection or information space has its intrinsic spatial structures, and it is these intrinsic structures that define the internal semantic relationships among the objects in the data collection. The abstract and invisible structures may be linear, hierarchical, network, or combinations of these. Although there is a wide spectrum of approaches to defining the semantic relationships among objects, they may change the spatial forms but not the spatial nature of the collection; they only increase the diversity of the spatial structures. Browsing is fundamentally spatial as well. It is not a coincidence that a browsing process, which consists of a series of spatial attention points, can generate a browsing space which is usually a sub-space of the information space. Therefore, the spatial characteristics of both an information space and browsing make a spatial visual presentation of a data collection not only necessary but also promising. Spatialization or visualization of an abstract information space opens promising territory for rich expression of the information space. It can make full use of spatial properties such as point, line, plane, distance, and direction to describe and illustrate objects, object contents, object contexts, and the semantic relationships of objects. Visual properties such as color hue, color saturation, and flash rate can add another powerful dimension to the spatial description and illustration of the information space. In addition, motion and sound can be integrated into the visual presentation as unique means of presenting information.

3. Information retrieval visualization elucidates the aggregate information at the macro-level in a data collection and makes it available and accessible to people. This valuable aggregate information, which is rarely available in a traditional information retrieval system, provides contextual information, relational information, heuristic information, structural information, and holistic overview information. The information is generated from the individual items of a data collection but transcends those individual items; it definitely enriches and enhances the resources of a data collection. As a result, it allows users to discover meaningful trends, detect patterns, make inferences from the visual configurations, recognize important information clusters and themes in a data set, gain a better understanding of a data collection as a whole, and orient themselves to set the right search direction in the information space.
4. Information retrieval visualization may provide an avenue for developing new information retrieval means. An information retrieval visualization environment can visualize not only traditional information retrieval models whose retrieval contours are symmetric in an information space, such as the distance model, the conjunction model, the disjunction evaluation model, and the ellipse model, but also new non-traditional models whose retrieval contours can be asymmetric in the information space. Traditional retrieval models usually require one or two reference points, but in a visualization environment the number of reference points involved in a retrieval process can be extended to demonstrate the impact of multiple interest points on information retrieval.

5. Information retrieval visualization can supply a unique method for information analysis. Information visualization is a powerful tool for information analysis. For instance, a traditional information space density analysis is based on calculation and its final result is a simple number. It does not answer these questions: How are documents distributed in an information space? How many clusters are there? Which clusters are the largest? Which clusters are the smallest? What are the related clusters of a specific cluster? Which cluster contributes to density change? How does a selected term affect the space density? These questions are crucial for information analysis, and they can easily be answered in an information retrieval visualization environment. Citation analysis is another example: visual citation analysis overshadows traditional citation analysis by displaying both the citation connections and their strengths.

6. Information retrieval visualization opens a broad territory for developing a variety of visualization presentation approaches. One of the salient characteristics of information retrieval visualization is its spatiality. It is this spatiality that gives people great flexibility to define visual spaces, choose coordinate systems, select semantic framework presentation methods, determine projection algorithms, and integrate information retrieval features. As a result, diverse and rich information retrieval visualization models burgeon.

7. Information retrieval visualization enriches information retrieval and empowers users. It lifts information retrieval to an unprecedented level and makes the process of finding information intuitive and simple. No complex technical background is required to manipulate the system, and comprehension demands minimal cognitive effort. Since a visual presentation, as a special means of communication between systems and users, is a universal "language", it overcomes the language barrier for users with different language backgrounds. Because of the spatialization of an information space, interactive browsing, and new features of visual exploration, information retrieval is no longer a simple process of finding information; it becomes a process of knowledge discovery and knowledge acquisition. The visual data exploration process can be viewed as a hypothesis-generation process, whereby through visualization of the data users are allowed to gain insight into the data and come up with new hypotheses (Keim, 2001).
1.4.2 Three information retrieval visualization paradigms

Query searching and browsing are the two fronts of information retrieval. Although query searching and browsing are different ways of seeking information, they can be synthesized in an information retrieval visualization environment to take advantage of both. There are three basic paradigms for this synthesis. The first is the QB paradigm (Query searching and Browsing): an initial query is submitted to an information retrieval system to narrow things down to a limited result set, the result set is then visualized in a visualization environment, and finally users may follow up with browsing in the visual space to home in on more specific information. The second is the BQ paradigm (Browsing and Query searching): a visual presentation of a data set is first established for browsing, then users submit their search queries to the visualization environment and the corresponding search results are highlighted or presented within the visual presentation contexts. The third is the browsing-only paradigm (BO), which obviously does not integrate any query searching component. Query searching alone is not categorized as a paradigm because it is the traditional information retrieval paradigm, which does not require a visual space. It is clear that the QB paradigm visualizes only a sub-set of an entire data collection, so connections between the retrieved documents and un-retrieved documents are missing in the visual space. This problem is alleviated if a retrieval threshold control mechanism is available so that the size of the retrieved result set can be adjusted. However, if the amount of data in a database is huge, as with information on the Internet, and it is impossible to visualize the entire database, the first paradigm may fit this type of database better. One of the advantages of the second paradigm is that it offers an overview of an entire database and maintains semantic clues for further exploring un-retrieved documents. The third paradigm cannot satisfy a user's specific information need through a query search.
1.4.3 Procedures of establishing an information retrieval visualization model

Building an information retrieval visualization environment is a complicated process affected by a multitude of variables because of the diversity of visualization frameworks, visualization objects, information organization methods, visual presentation approaches, and search controls. Establishing an information retrieval visualization model consists of a series of steps. Let us discuss these steps in detail.

1. Determination of an information retrieval visualization paradigm

The information retrieval visualization paradigm affects the source and amount of raw input data for visualization. The entire data set is considered as the
source for both the BQ and BO paradigms. For the QB paradigm, a front-end information retrieval system must be provided, and the results retrieved from that system are used as the source of the raw input data. The raw input data of the BQ and BO paradigms are stable, while those of the QB paradigm are dynamic. Because of this dynamic characteristic, the QB paradigm may require constant changes and reconstruction of its visual configurations in the visual space. The number of displayed objects in the QB paradigm may be relatively small compared with that in the BQ and BO paradigms.

2. Identification of displayed objects

The identification of displayed objects refers to the selection of the items/objects from a data set which are visualized in a visual space. In a data collection there may be multiple kinds of items, and any of them can be defined as the displayed objects in a visual space: for instance, a document, keyword, journal, or author in a bibliographic database, or a Web page, user, or server on the Internet. The objects identified from a data set should be meaningful for the data set, the users, and the subsequent information retrieval.

3. Extraction of attributes

An object can be described by a group of attributes. These attributes not only define the properties of an object but also determine its position in a visual space. Therefore, the extraction of attributes from an object is an important and necessary step. Selected attributes should be representative, apply to all objects, and reveal fundamental and significant retrieval characteristics of the objects. The extracted attributes can be either homogeneous, such as a group of subject keywords, or heterogeneous, such as publishing time, author, and title. They should be coherent with the semantic framework of the information retrieval visualization environment and be measurable, because in some models attributes are expressed in quantitative form for later calculation. The results of attribute extraction are usually described in an object-attribute matrix. The methods of applying the attributes to a visualization environment vary in different situations. Attributes can be applied to an information retrieval visualization environment directly, or they can be converted to a meaningful form and then applied indirectly. In most cases the latter happens for homogeneous attributes: for instance, the similarity or proximity between two objects is calculated based on the homogeneous attribute-object matrix to produce a new object-object proximity matrix. This new object-object proximity matrix is used as the raw input data format for many information visualization models.

4. Structural definition of a visual space

The structural definition of a visual space refers to determining the dimensionality of the low-dimensional visual space and defining the axes of its coordinate system. A visual space can be one-, two-, or three-dimensional; in order to take spatial advantage, most information retrieval visualization models are two- or three-dimensional. A coordinate system in the visual space can be orthogonal, polar, or parallel. Orthogonal coordinate
systems are widely used. Parallel coordinates can transform multivariate relations into meaningful patterns (Inselberg, 1997). Based on the nature of the extracted data attributes, the type of an axis in a coordinate system can be nominal, ordinal, or quantitative. It is worth pointing out that the types of the axes in a coordinate system do not have to be the same in some cases. Selected attributes may be assigned to the axes of a coordinate system either directly or indirectly.

5. Definition of a visual semantic framework

The definition of a visual semantic framework is vital and essential because it defines the structure in which the objects are projected, aggregate information is formed, patterns are derived, internal structures are demonstrated, and interactions are conducted. A semantic framework defines a valid display area, and all objects are supposed to be projected within that area. Semantic frameworks range from simple shapes to complex ones, such as a grid, tree, circle, line, rectangle, triangle, or polygon; some frameworks do not even have a fixed shape. A defined semantic framework, which may be too abstract for ordinary users, can be represented and rendered in a special form to facilitate users' understanding. For instance, a metaphorical presentation such as a landscape, the solar system, a river, or a room can be chosen to render a complicated semantic framework. Choosing an appropriate representation can provide the key to critical and comprehensive appreciation of the data, thus benefiting subsequent analysis, processing, or decision making (Robertson, 1991).

6. Projection of objects onto a defined semantic framework

The projection of objects onto a defined framework is the core part of the entire procedure. It determines the final position of each individual object in the visual space and therefore the ultimate visual configuration of a data set in the visual space. It is clear that a projection algorithm or approach leans heavily upon the defined coordinate system and the semantic framework of the visual space. The complexity of a projection algorithm varies considerably across information retrieval visualization models. Projection depends on the relationships between the projected objects and the projection criteria. An object can be projected onto a semantic framework against criteria such as a time line, a subject theme, a reference system defined by users' information needs, relevance to related objects which already exist in the framework, linkage to other objects, or a pseudo-dynamic object in the visual space. As a result of the projection, a visual configuration is generated in the visual space. A visual configuration can be a local view, if objects are projected against a special reference system defined by users' interests, or a global view, if objects are projected against the mutual relationships among the objects in the data set. The position of an object can be relative, which means that the projected objects in the framework are movable. This happens when the compared criteria are not directly assigned to the axes of the coordinate system; in this case a visualization model can achieve better flexibility and controllability over the objects by taking advantage of this characteristic. A projection procedure can be iterative or non-iterative. An iterative projection procedure attempts to achieve optimal distributions for objects via repeated
position adaptations (a minimal sketch of such an iterative adjustment is given at the end of this section). Consequently, the position of an object in the visual space produced by an iterative projection algorithm is not unique. As we know, an object may have multiple facets, which define a multidimensional information space. When an object is projected onto a visual space, only the significant and salient facets are chosen for projection, due to the limited dimensionality of the visual space. In this sense, the projection is a process of dimensionality reduction of an information space. Because of the dimensionality reduction, relationships of objects in the visual space may be "distorted", and they may only partially reflect the relationships of the data in the original data set after mapping.

7. Development of interactive means for information retrieval

Without a doubt, a static visual configuration can reveal rich information for users. However, interactive information retrieval tools make information exploration and knowledge discovery more effective and efficient. There are many mature interactive techniques that can be applied to support browsing in a visual space. With these interactive tools, users may browse information at will, from the detailed content of an individual object, to the local context of an area of interest, to a global overview of a data set. Query searching should be integrated into the information retrieval visualization environment to meet the need for seeking specific information. An information retrieval model such as the conjunction model or disjunction model corresponds to a retrieval contour in a high dimensional space. Like an object, a contour in a high dimensional space can also be projected onto a low dimensional visual space by a projection algorithm, so that an internal retrieval process can be observed and controlled. If the contour is projected onto a visual space, users can manipulate its size and position to control a retrieval process in the visual space at will. However, contour projection is much more complex than point projection: a projection contour function must be found to generate the projected contour in the visual space. New information retrieval models and means may be developed based on the structure of the semantic framework and the structural definition of the visual space, and then synthesized into the environment to enhance information retrieval.

8. Evaluation

The last step is the evaluation of the developed information retrieval visualization model. Evaluation examines whether the objects, extracted attributes, defined coordinate system, designed semantic framework, and developed visual information retrieval means are coherently and seamlessly synthesized in the visualization environment; whether the data is displayed adequately, clearly, accurately, and comprehensively, expressing significant attributes and salient relationships of the original data set; whether the visual presentations are meaningful, interpretable, and explainable; and whether the interactive information retrieval means are well integrated into the visualization environment.
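The iterative, stress-reducing adjustment mentioned in step 6 can be illustrated with a minimal sketch (not from the book; all function and parameter names are illustrative, and the simple gradient-descent scheme in the spirit of multidimensional scaling is only one of many possible choices). It maps an object-object dissimilarity matrix onto a two-dimensional visual space by repeatedly nudging positions so that on-screen distances approximate the given dissimilarities; because the starting layout is random, the resulting positions are not unique, as noted above.

```python
import numpy as np

def project_iteratively(dissim, dims=2, iters=500, lr=0.05, seed=42):
    """Iteratively adjust low-dimensional positions so that inter-point
    distances approximate the given object-object dissimilarities
    (an MDS-style stress reduction). Returns an (m, dims) coordinate array."""
    rng = np.random.default_rng(seed)
    m = dissim.shape[0]
    pos = rng.normal(scale=0.1, size=(m, dims))      # random initial layout
    for _ in range(iters):
        for i in range(m):
            for j in range(m):
                if i == j:
                    continue
                diff = pos[i] - pos[j]
                d = np.linalg.norm(diff) + 1e-9      # current visual distance
                # move point i to shrink the mismatch with the target dissimilarity
                pos[i] -= lr * (d - dissim[i, j]) * diff / d
    return pos

# Toy object-object dissimilarity matrix (symmetric, zero diagonal).
D = np.array([[0.0, 0.2, 0.9],
              [0.2, 0.0, 0.8],
              [0.9, 0.8, 0.0]])
print(project_iteratively(D))
```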
1.5 Summary

Information retrieval has two basic paradigms: query searching and browsing. It is widely recognized that both paradigms, as information seeking means, have their strengths and weaknesses. They are not exclusive; in fact, they are complementary. Browsing cannot be fully utilized in a traditional information retrieval system because of the inherent weaknesses in its structures for information organization, information storage, and information presentation. By nature, an abstract data collection is spatial. The aggregate information at the macro level derives from the relationships and connections of the data. This aggregate information is valuable but hidden and unavailable in a traditional information retrieval system, because the system's internal data structures are not transparent to users and the system's focus is on searching individual items.

Information visualization is a burgeoning field whose goal is to capitalize on the human perceptual system's ability to understand abstract information. Visualization transcends the visual boundary and facilitates the understanding of complex information because a visual presentation is not simply a picture but a mirrored image of mental thought. Visualization circumvents inherent human limitations of vision and extends visual capacity significantly. Information retrieval visualization spatializes an information space. As a result, information retrieval visualization renders and reflects the spatial characteristic of an information space and provides a natural and ideal environment for browsing. In addition, information retrieval visualization underlies a semantic framework, elucidates relationships among concepts, illustrates a holistic overview, demonstrates patterns, and facilitates interaction between systems and users. These make information retrieval a process of data mining, information exploration, and knowledge discovery.

The gem of information retrieval visualization is the diversity of information retrieval visualization models. The high dimensionality of object attributes and the sophisticated relationships among objects in a database, in conjunction with the low dimensionality of a visual space, mean that the high dimensionality has to be reduced so that objects can fit in the low-dimensional visual space. For this reason, the salient and meaningful attributes of an object are identified and preserved in the visual space while insignificant attributes are sacrificed. As a result, people can come up with various ways to identify the salient attributes and various methods to present them in a visual space. These, in part, account for the diversity of information retrieval visualization models and algorithms. The spatialization of an information space leaves ample room for people to come up with various information retrieval visualization models or systems. Although there are a variety of information retrieval visualization models and systems, a basic procedure for establishing an information retrieval visualization model is applicable to all of them. Defining a semantic framework and projecting objects onto the framework are extremely important and fundamental, and developing interactive information retrieval mechanisms is crucial for end-users to explore the visual information space.
Chapter 2 Information Retrieval Preliminaries
In this book we shall address the topic of information retrieval visualization. In an attempt to deal with a variety of state-of-the-art systems, concepts, theories, models, and methodologies in information retrieval visualization, we cannot ignore or circumvent the basic concepts, models, and theories of information retrieval. Many advanced models and theories in information retrieval visualization cannot exist without the support of the underlying information retrieval theories, models, and concepts. In other words, these complicated information retrieval visualization models cannot be explained explicitly until the principles of information retrieval are fully introduced. Information retrieval is a long-standing research area with a relatively mature theoretical system. In this chapter, the vector space model, term weighting algorithms, similarity measures, information retrieval (evaluation) models in a vector-based space, distance metrics, and reference points are introduced. Information retrieval, which runs as a primary thread through almost every chapter of this book, is essential, fundamental, and indispensable for information retrieval visualization. For instance, the vector space model lays the fundamental data organization structure for self-organizing maps, Pathfinder associative networks, multidimensional scaling models, multiple reference point based visualization models, and Euclidean spatial characteristic based models. Term weighting algorithms are employed to assign weights to keywords automatically extracted from documents before the similarity between documents is calculated. Various similarity measures determine the semantic relationships of objects, which are used to project the objects onto visual spaces in almost all of the information retrieval visualization models. These algorithms are ultimately used to generate document-term matrices or object-object matrices, which are employed as raw data input for information retrieval visualization environments.

Information visualization has a very natural connection with traditional clustering algorithms. Information visualization can be regarded as a special visual clustering approach, because a graphic display from any information visualization method visually "clusters" projected objects in its presentation space in a particular way. Clustering algorithms can be used to alleviate the notorious information overload in a visual display space. When a huge number of objects/documents are projected onto a limited visual space, the graphic presentation no longer makes sense to users; visualizing clusters, rather than individual objects, in such an object-overwhelmed visual space can effectively solve the problem. By nature, data clustering is a process of information organization. Since the process of manual classification is labor-intensive and time-consuming, it is not competitive and would be intolerable for processing a large dataset like the
Internet. Automatic clustering algorithms can significantly reduce the time lag of information organization and processing, which is essential and crucial for an information system. A categorical structure generated by a clustering algorithm can serve, to some extent, as a means of subject guidance. It provides a view of the data at different levels of abstraction, and clustering solutions at different levels of granularity make ideal interactive explorations (Zhao and Karypis, 2002). The hierarchical structure can also be used to discover possible association patterns among the identified cluster groups.
2.1 Vector space model

In the vector space model, which was first introduced by Salton (1989), a document is defined by n independent features or attributes. These features are used to describe the subject characteristics of the document. In most cases, these features are keywords extracted from the title, abstract, or full-text of the document.

$$ d_i = (a_{i1}, a_{i2}, \ldots, a_{ij}, \ldots, a_{in}) \qquad (2.1) $$
In Eq. (2.1), $d_i$ is a document; $a_{ij}$ is a feature describing the document, whose value or weight reflects the importance of feature $a_{ij}$ to document $d_i$, a valid value ranging from 0 to infinity; and $n$ is the number of features, or the dimensionality of the vector space. $d_i \in R^n$ denotes a vector with dimensionality $n$. A vector corresponds to a visible point in a low dimensional space (for instance, a two or three dimensional space), or an invisible point in a higher dimensional space.

For a linear vector space $R^n$, if $d_1$, $d_2$, $d_3 \in R^n$ and $c$ is a constant, the following equations always hold true.
$$ (d_1 + d_2) \times c = d_1 \times c + d_2 \times c \qquad (2.2) $$

$$ d_1 + d_2 = d_2 + d_1 \qquad (2.3) $$

$$ (d_1 + d_2) + d_3 = d_1 + (d_2 + d_3) \qquad (2.4) $$
Eqs. (2.2), (2.3), and (2.4) are called distributivity, commutativity, and associativity, respectively. A matrix is a rectangular array of elements (or entries) set out by rows and columns. A document-term matrix consists of a group of document vectors; its rows and columns correspond to documents and features, respectively (see Eq. (2.5)).
$$ D = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & a_{ij} & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \qquad (2.5) $$
Here $a_{ij}$ is the weight of document $d_i$ for feature $j$, and $m$ is the number of documents in the document collection. Similar to a document representation in the vector space model, a query can also be defined as a vector (see Eq. (2.6)):

$$ q = (q_1, q_2, \ldots, q_j, \ldots, q_n) \qquad (2.6) $$

where $q_j$ is the weight of feature $j$, whose value depends upon a user's information need, and $n$ is the number of unique features, which is equal to the $n$ in Eq. (2.1). Because the query representation has the same structure as a document representation, various similarity calculations and other calculations between a document and a query become possible.
Notice that in a vector space model the number of unique features ($n$) in a document-term matrix can be extremely large, because the features are the unique indexing terms used in a document collection. When the number of documents indexed in a collection increases, the number of features ($n$) also increases. However, the relationship between the number of indexed documents and the number of features ($n$) is not simply linear: when the number of indexed documents reaches a certain level, the number of features ($n$) stays stable. Looking at each document in the matrix, we find that the number of non-zero features used to index that document is relatively small compared to the total number of features ($n$); it is usually determined by the indexing policy and the length of the document. As a result, the document-term matrix is a sparse matrix in which most elements are 0.

The strengths of the vector space model are summarized as follows:

1. The vector-based structure is suitable for representing an object with multiple attributes. The vector space is a natural way to represent a document because a document has multiple attributes or keywords.

2. Weights can be assigned to indexing terms to distinguish their significance to a document, allowing terms to become more or less important within a document or within the entire document collection as a whole.

3. Similarly, weights can also be assigned to query terms, which makes the expression of users' information needs more accurate and flexible.

4. Based on a vector space model, a variety of similarity calculation methods can be developed, such as distance-based measures and angle-based measures. The wide range of applicable similarity methods in a vector space enables people to make an appropriate choice based on the desired similarity measure. By applying different similarity measures to compare a query and a document, or a document and a document, people can reveal various properties of the compared objects.

5. Information retrieval (evaluation) models such as the distance model, angle model, ellipse model, conjunction model, and disjunction model are available for users to control a search in a vector space.

6. The partial match technique in a vector-space-based model makes it possible to describe the degree of a match between a query and a document representation. The varying degree of match can be used to rank the retrieved documents with respect to how well each document responds to the query (Fowler and Dearholt, 1990). The size of the retrieved document set can be controlled according to users' desired size by setting a retrieval threshold.

7. The iterative nature of information retrieval calls for relevance feedback means to dynamically adjust a search strategy. The vector space model can easily accommodate dynamic query revision based upon feedback information.

8. A vector space model provides an ideal environment where sophisticated information processing techniques and methods, such as self-organizing maps, Pathfinder associative networks, multidimensional scaling models, the distance-and-angle based visualization model, the distance-and-distance based visualization model, the angle-and-angle based visualization model, and so on, can be developed and implemented.

The major weaknesses of the vector space model are as follows:

1. One problem with the vector-based model is its inherent high dimensionality, which makes it applicable only to relatively small collections due to intensive computing.

2. Theoretically, multiple features/attributes (terms, in this case) can be extracted from a document. These extracted features are used to describe the subject domain of the document. However, when these terms are extracted from a document and used to construct a document-term matrix for future retrieval, the semantic contexts of the terms in the document are lost. Since terms possess multiple meanings and the exact meaning can only be judged in the proper context, it is difficult to make a correct judgment about a term within the document-term matrix context. This can cause a potential term ambiguity problem when a query is matched against a document.

3. The vector space model is subject to the assumption that all terms describing documents are independent. It has been recognized that this assumption may over-simplify the interrelationship between the use of a term and its context. The assumed orthogonality between two terms in the matrix is at odds with reality.
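To make the document-term matrix of Eq. (2.5) concrete, here is a minimal sketch (not from the book; the function name and the toy collection are illustrative) that builds a vocabulary and a raw term-frequency matrix. A real system would first apply stop-word filtering and one of the weighting schemes of Sect. 2.2 rather than raw frequencies.

```python
from collections import Counter

def build_document_term_matrix(documents):
    """Return (matrix, vocabulary), where matrix[i][j] is the raw frequency
    of vocabulary[j] in documents[i] -- the a_ij of Eq. (2.5)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted(set(term for tokens in tokenized for term in tokens))
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [0] * len(vocabulary)
        for term, freq in counts.items():
            row[index[term]] = freq
        matrix.append(row)
    return matrix, vocabulary

docs = ["information retrieval visualization",
        "visualization of information spaces",
        "query searching and browsing"]
D, vocab = build_document_term_matrix(docs)
print(vocab)
for row in D:
    print(row)      # most entries are 0: the matrix is sparse
```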
2.2 Term weighting methods

Term weighting or automatic indexing is fundamental, essential, and vital for information retrieval visualization. It is not surprising that a considerable number of
research papers potentially applicable to the research field have emerged. Any information visualization model, which employs a document-term matrix, needs to use automatic term weighting approaches to fill in the cells of the matrix. There are many factors affecting term weighting, for instance, frequency of a term in a document, length of the document, distribution of the term in a document collection, location of a term in full-text, etc. Several term weighting methods will be discussed.
2.2.1 Stop words

The stop word method is a common strategy used to filter out useless keywords and to reduce the number of indexed terms in a document. Certain words, which are deemed to be of insignificant importance within a full-text, are added to the stop list. When a word in the stop list matches a keyword extracted from a document during text parsing, the keyword is ignored; otherwise it is kept. Stop words are common, grammatical, and relational words such as "a", "the", "and", etc. They are rarely used as search terms in queries that are not based on natural language, and they are meaningless in terms of information retrieval. A subject stop list is different from a general stop list. Words in a subject stop list are not grammatical or relational words like those in a general stop list; they are very high-frequency keywords in a certain subject domain. These words are so general that they lose retrieval meaning for the subject domain. In other words, almost every document in the subject domain is related to these keywords, and therefore they should not be used as index terms. For example, in a medical database, the term "medical" should be included in the subject stop list, because every document in that database addresses medical issues and no one would use "medical" as a search term in a medical database. It is worth pointing out that the words in a subject stop list are relative to the subject domain. When the subject domain changes, the words in the subject stop list should change accordingly. For instance, the term "medical", which belongs in the subject stop list of a medical database, should no longer be included in the subject stop list of a computer science database.
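A minimal sketch of this filtering step might look as follows (the word lists and names are illustrative only; the subject stop list assumes a medical database, as in the example above):

```python
GENERAL_STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "to"}
SUBJECT_STOP_WORDS = {"medical"}   # assumed domain: a medical database

def filter_keywords(tokens, subject_stop_words=SUBJECT_STOP_WORDS):
    """Drop general stop words and overly general domain terms."""
    return [t for t in tokens
            if t.lower() not in GENERAL_STOP_WORDS
            and t.lower() not in subject_stop_words]

print(filter_keywords("the diagnosis of a medical condition".split()))
# ['diagnosis', 'condition']
```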
2.2.2 Inverse document frequency

The Inverse Document Frequency (IDF) method was introduced by Sparck Jones (1972) and has been widely used in information retrieval. It takes the database size and the distribution of a term in the database into account. The approach is described by the following equation:

$$ w_i = f_i \times \log\left(\frac{N}{d_i}\right) \qquad (2.7) $$

In Eq. (2.7), $f_i$ is the frequency of term $i$ in a document, $N$ is the number of documents in the database, and $d_i$ is the number of documents containing term $i$ in the entire database. The log function applied to the ratio of the total number of documents $N$ to the number of documents containing term $i$ softens the impact of $d_i$ and $N$ on the final term weight $w_i$. It is clear that term frequency is the major factor affecting the term weight. Eq. (2.7) is also called TF×IDF (Term Frequency × Inverse Document Frequency). This approach has proved extraordinarily robust and difficult to beat, even by much more carefully worked out term weighting models or theories (Robertson, 2004).
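A direct sketch of Eq. (2.7) is shown below (the function name is illustrative, and a base-10 logarithm is assumed since the book does not state the base):

```python
import math

def tf_idf(term_freq, num_docs, doc_freq):
    """Eq. (2.7): w_i = f_i * log(N / d_i)."""
    return term_freq * math.log10(num_docs / doc_freq)

# A term occurring 4 times in a document, in a 1000-document database
# where 50 documents contain the term:
print(tf_idf(4, 1000, 50))   # 4 * log10(20) ≈ 5.2
```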
2.2.3 The Salton term weighting method

The Salton term weighting method is a revised TF×IDF approach which normalizes TF×IDF by considering document length (see Eq. (2.8)). The length normalization ensures that all documents, whatever their lengths, have an equal chance of being retrieved (Salton et al., 1996). The definitions of $N$, $f_i$, and $d_i$ are the same as those in the inverse document frequency method. The parameter $m$ is the number of unique terms in the document vector space.

$$ w_i = \frac{f_i \times \log(N / d_i)}{\sqrt{\sum_{j=1}^{m} f_j^{\,2} \times \left[\log(N / d_j)\right]^2}} \qquad (2.8) $$
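A sketch of Eq. (2.8), computing the length-normalized weights of one document's terms (names illustrative; base-10 logarithm assumed as before):

```python
import math

def salton_weights(freqs, doc_freqs, num_docs):
    """Eq. (2.8): TFxIDF weights divided by the document's vector length.
    freqs[j] and doc_freqs[j] are f_j and d_j for the document's terms."""
    raw = [f * math.log10(num_docs / d) for f, d in zip(freqs, doc_freqs)]
    length = math.sqrt(sum(w * w for w in raw))
    return [w / length for w in raw]

# Three terms of one document in a 1000-document collection:
print(salton_weights([4, 1, 2], [50, 500, 10], 1000))
```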
2.2.4 Another term weighting method

The approach integrates term frequency retrieval characteristics, term frequency, document collection characteristics, and both the term depth and term width distribution characteristics as well (Zhang and Nguyen, 2005).

$$ w_i = c^{-(f_i - f_a)^2} \times \log\!\left(\frac{N \times D_i}{d_i \times L_i}\right) \qquad (2.9) $$
In Eq. (2.9), $f_a$ is the middle value of the term frequency range in a document; $f_i$ is the raw frequency of term $i$ in the document; $L_i$ is the number of occurrences of term $i$ in the document collection; $D_i$ is the number of all terms in the documents containing term $i$; $w_i$ is the term significance, or weight, of term $i$ in the document; and the constant $c$ (> 0) is used to adjust the impact of term frequencies on the weight. The definitions of $N$, $d_i$, and $f_i$ are the same as the previous definitions.
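Under the reconstruction of Eq. (2.9) given above, a sketch might read as follows (the function name and sample values are hypothetical, and c = 2 is an arbitrary choice):

```python
import math

def zhang_nguyen_weight(f_i, f_a, N, d_i, D_i, L_i, c=2.0):
    """Sketch of Eq. (2.9) as reconstructed above:
    w_i = c**(-(f_i - f_a)**2) * log(N * D_i / (d_i * L_i))."""
    return c ** (-(f_i - f_a) ** 2) * math.log10(N * D_i / (d_i * L_i))

# Hypothetical values: term frequency 3, mid-range frequency 2, 1000 documents,
# the term appears in 40 documents which hold 8000 terms in total, and the
# term occurs 120 times in the whole collection.
print(zhang_nguyen_weight(f_i=3, f_a=2, N=1000, d_i=40, D_i=8000, L_i=120))
```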
2.2.5 Probability term weighting method

The probability-based term weighting algorithm is a different way to calculate term significance. The binary independence model was introduced in 1976 by Robertson and Sparck Jones. The model is described as follows:

$$ w_i = \frac{p(occ_i \mid rel) \times \left(1 - p(occ_i \mid \overline{rel})\right)}{p(occ_i \mid \overline{rel}) \times \left(1 - p(occ_i \mid rel)\right)} \qquad (2.10) $$
Here $w_i$ is the weight of term $i$, $p(occ_i \mid rel)$ is the probability that term $i$ occurs in relevant documents, and $p(occ_i \mid \overline{rel})$ is the probability that term $i$ occurs in non-relevant documents. It is apparent that the model is based on these two probabilities, which can be estimated by the following equations:

$$ p(occ_i \mid rel) \approx \frac{r_i}{R} \qquad (2.11) $$

$$ p(occ_i \mid \overline{rel}) \approx \frac{n_i - r_i}{N - R} \qquad (2.12) $$
Here $N$ is the total number of documents in the document collection, $n_i$ is the number of documents containing term $i$, $R$ is the size of the relevant document set, and $r_i$ is the number of relevant documents that contain term $i$. However, an assumption of this term weighting model is statistical independence among the terms: it supposes that the terms are statistically independent in both relevant documents and non-relevant documents.
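A sketch of Eqs. (2.10)-(2.12) (names illustrative; no smoothing is applied, although practical implementations often add 0.5 to each count to avoid zero probabilities):

```python
def bim_weight(r_i, R, n_i, N):
    """Binary independence weight of term i, Eqs. (2.10)-(2.12):
    p_rel ~ r_i / R,  p_nonrel ~ (n_i - r_i) / (N - R)."""
    p_rel = r_i / R
    p_nonrel = (n_i - r_i) / (N - R)
    return (p_rel * (1 - p_nonrel)) / (p_nonrel * (1 - p_rel))

# 8 of 10 relevant documents contain the term; 100 of 1000 documents overall.
print(bim_weight(r_i=8, R=10, n_i=100, N=1000))
```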
2.3 Similarity measures

A similarity measure is used to indicate the resemblance between two objects as a numeric value. The degree of similarity between two objects is reflected in their similarity value; a higher value usually indicates greater similarity, and vice versa. We introduce several widely used similarity measures, all of which work in the vector space model. Assume $x$ and $y$ are two objects in a vector space; they can be documents or queries, and $n$ is the dimensionality of the vector space.

$$ x = (a_1, a_2, \ldots, a_j, \ldots, a_n) \qquad (2.13) $$

$$ y = (b_1, b_2, \ldots, b_j, \ldots, b_n) \qquad (2.14) $$
S(x,y) denotes the similarity between x and y.
2.3.1 Inner product similarity measure

$$ S(x, y) = \sum_{i=1}^{n} a_i \times b_i \qquad (2.15) $$
A valid weight is always equal to or larger than 0. In the inner product similarity measure, the weights of features shared by both vectors are considered, while weights of features possessed by only one vector are excluded; in other words, only overlapping features between the two vectors are included. This method is simple, but because it does not consider all features possessed by the two vectors, it can produce biased results. For instance, assuming for simplicity that the element values of the vectors are binary (0 and 1), the following two cases have the same similarity value: a document and a query are each indexed by the same five keywords, giving an inner product value of 5; a document and a query are each indexed by 10 keywords of which 5 are shared, also giving an inner product value of 5. However, the similarities for these two cases should be different, because the former case is a 100% match in terms of index terms while the latter case is only a 50% match.
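A minimal sketch of Eq. (2.15), reproducing the bias just described (names illustrative):

```python
def inner_product_similarity(x, y):
    """Eq. (2.15): sum over shared features of the weight products."""
    return sum(a * b for a, b in zip(x, y))

# A 5-of-5 keyword match and a 5-of-10 keyword match both score 5.
print(inner_product_similarity([1]*5 + [0]*10, [1]*5 + [0]*10))   # 5
print(inner_product_similarity([1]*10 + [0]*5, [0]*5 + [1]*10))   # 5
```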
2.3.2 Dice co-efficient similarity measure

$$ S(x, y) = \frac{2 \sum_{i=1}^{n} a_i \times b_i}{\sum_{i=1}^{n} a_i + \sum_{i=1}^{n} b_i} \qquad (2.16) $$

The Dice co-efficient measure considers both the features shared by the two vectors and the features possessed by either of them. The measure looks like the inner product measure except for the added denominator: the sum of the weights of both vectors serves to normalize the inner product. This normalization avoids the problem of unfair calculations described for the inner product measure.
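A sketch of Eq. (2.16), showing how the normalization separates the two cases that the inner product could not distinguish (names illustrative):

```python
def dice_similarity(x, y):
    """Eq. (2.16): twice the inner product over the sum of both weight sums."""
    num = 2 * sum(a * b for a, b in zip(x, y))
    return num / (sum(x) + sum(y))

print(dice_similarity([1]*5 + [0]*10, [1]*5 + [0]*10))   # 1.0  (full match)
print(dice_similarity([1]*10 + [0]*5, [0]*5 + [1]*10))   # 0.5  (half match)
```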
2.3.3 The Jaccard co-efficient similarity measure

$$ S(x, y) = \frac{\sum_{i=1}^{n} a_i \times b_i}{\sum_{i=1}^{n} a_i^{2} + \sum_{i=1}^{n} b_i^{2} - \sum_{i=1}^{n} a_i \times b_i} \qquad (2.17) $$
The Jaccard co-efficient measure (or Tanimoto measure) is similar to the Dice co-efficient measure in that it considers features possessed by either of the two vectors, and it also resembles the Dice co-efficient measure in applying a normalization, but the methods of normalization are quite different. In the denominator, the impact of shared features is reduced by subtracting $\sum_{i=1}^{n} a_i \times b_i$, and the sums of the squared term weights of both vectors are included in the denominator.
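A sketch of Eq. (2.17) (name illustrative):

```python
def jaccard_similarity(x, y):
    """Eq. (2.17), Tanimoto: inner product divided by the sum of squared
    weights of both vectors minus the inner product."""
    inner = sum(a * b for a, b in zip(x, y))
    return inner / (sum(a * a for a in x) + sum(b * b for b in y) - inner)

print(jaccard_similarity([1]*10 + [0]*5, [0]*5 + [1]*10))   # 5 / (10+10-5) = 1/3
```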
2.3.4 Overlap co-efficient similarity measure

$$ S(x, y) = \frac{\sum_{i=1}^{n} a_i \times b_i}{\min\left(\sum_{i=1}^{n} a_i, \; \sum_{i=1}^{n} b_i\right)} \qquad (2.18) $$

The difference between the overlap co-efficient measure and the Dice co-efficient measure is also reflected in their denominators. The overlap measure takes the minimum of the two vectors' weight sums as its denominator, which is another way to normalize the inner product measure.
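A sketch of Eq. (2.18) (name illustrative); note that the overlap measure reaches 1.0 whenever one vector's features are fully contained in the other's:

```python
def overlap_similarity(x, y):
    """Eq. (2.18): inner product divided by the smaller weight sum."""
    inner = sum(a * b for a, b in zip(x, y))
    return inner / min(sum(x), sum(y))

print(overlap_similarity([1]*10 + [0]*5, [1]*5 + [0]*10))   # 5 / min(10, 5) = 1.0
```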
2.3.5 Cosine similarity measure

$$ S(x, y) = \frac{\sum_{i=1}^{n} a_i \times b_i}{\left(\sum_{i=1}^{n} a_i^{2} \times \sum_{i=1}^{n} b_i^{2}\right)^{1/2}} \qquad (2.19) $$
The cosine similarity method measures the similarity between two objects based on the angle formed by the two objects in the vector space. As we know, any document corresponds to a point in a vector space; the two points of two documents, taken against the origin of the vector space, define an angle. The cosine value of that angle, described in the above equation, is used as the similarity value between the two documents. It is evident that the valid similarity value ranges from 0 to 1 in this case. The cosine similarity measure works best for identifying the similarity between two objects which are proportionally similar in a vector space, where proportional similarity refers to the relative magnitudes of the two objects in terms of their weight distributions. We can also interpret the cosine measure in a different way: it uses $\left(\sum_{i=1}^{n} a_i^{2} \times \sum_{i=1}^{n} b_i^{2}\right)^{1/2}$ as a denominator to normalize the inner product measure.
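A sketch of Eq. (2.19) (name illustrative), showing that proportionally similar weight distributions score 1.0 regardless of their absolute magnitudes:

```python
import math

def cosine_similarity(x, y):
    """Eq. (2.19): inner product normalized by the product of vector lengths."""
    inner = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return inner / norm

print(cosine_similarity([1, 2, 0, 4], [2, 4, 0, 8]))   # 1.0 (same proportions)
print(cosine_similarity([1, 2, 0, 4], [4, 0, 2, 1]))   # ~0.38
```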
2.3.6 Distance similarity measure

A distance between two objects in a vector space should satisfy the following properties: the distance is always non-negative; the distance from point A to point B is equal to the distance from point B to point A; the distance from a point to itself is 0; if the distance between two points is equal to 0, then the two points overlap; and the distance between points A and B is always smaller than or equal to the distance between points A and C plus the distance between points B and C, where C is any point in the space. Before introducing the distance between two documents in a vector space, we must first address the metric of a distance, because two points in the vector space can generate a family of distances as a parameter in the metric changes. The Minkowski metric is defined as:

$$ \delta(x, y) = \left(\sum_{i=1}^{n} |a_i - b_i|^{k}\right)^{1/k}, \quad k = 1, \ldots, \infty \qquad (2.20) $$
The relationship between the parameter k and the distance \delta(x, y) is shown in Fig. 2.1. For simplicity, we choose a three-dimensional vector space as an example in which the differences between two points along the X-axis, Y-axis, and Z-axis are 3, 4, and 5, respectively; the resulting distance is given by Eq. (2.21). Notice that as the parameter k increases, the corresponding distance value decreases dramatically.

\delta(x, y) = \left(3^k + 4^k + 5^k\right)^{1/k}    (2.21)
Fig. 2.1. Relationship between distance and the parameter k
When k is equal to 1, it is called the Manhattan distance measure, City block measure, or Hamming distance measure. Using |a_i - b_i| rather than (a_i - b_i) assures that the final distance value is equal to or greater than 0.

\delta(x, y) = \sum_{i=1}^{n} |a_i - b_i|    (2.22)
When k is equal to 2, it becomes the famous Euclidean distance measure, which is used to describe the distance between two objects in a Euclidean space.

\delta(x, y) = \left(\sum_{i=1}^{n} (a_i - b_i)^2\right)^{1/2}    (2.23)
When k is equal to \infty, it becomes the Supremum distance measure or Dominance distance measure.

\delta(x, y) = \max_{i} |a_i - b_i|    (2.24)
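As a brief illustration, the Python sketch below computes the Minkowski distance of Eq. (2.20) for several values of k, together with the three special cases in Eqs. (2.22)-(2.24); the vectors reproduce the 3-4-5 coordinate differences used in the Fig. 2.1 example.

def minkowski(a, b, k):
    # Eq. (2.20): k-th root of the sum of the k-th powers of absolute differences
    return sum(abs(ai - bi) ** k for ai, bi in zip(a, b)) ** (1.0 / k)

def manhattan(a, b):
    # Eq. (2.22): k = 1
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def euclidean(a, b):
    # Eq. (2.23): k = 2
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def supremum(a, b):
    # Eq. (2.24): k = infinity, the largest coordinate difference
    return max(abs(ai - bi) for ai, bi in zip(a, b))

# Coordinate differences of 3, 4, and 5, as in Eq. (2.21)
a, b = (0, 0, 0), (3, 4, 5)
for k in (1, 2, 5, 20):
    print(k, round(minkowski(a, b, k), 3))   # decreases toward 5 as k grows
print(manhattan(a, b), euclidean(a, b), supremum(a, b))  # 12, ~7.07, 5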
For brevity, the relationships among the three metrics are illustrated in a two-dimensional space in Fig. 2.2, which shows the unit circles of the three different metrics, that is, the sets of points at unit distance from the origin. If k = \infty, the metric yields a square contour outside the circular contour; if k = 2, it yields a circular contour; if k = 1, it yields a smaller, diamond-shaped contour inside the circular contour. Distance, by its nature, measures dissimilarity between two objects in a space, which is natural and intuitive: when two objects are far apart in the vector space they are more dissimilar, and when they are located close to each other they are more similar. Since any document corresponds to a point in the vector space, it is no surprise that the distance between two documents is employed to measure their similarity. To describe this mathematically, an inverse transformation of the distance between two objects is used as their similarity (see the following equation). When the distance between two objects is infinite, the similarity is 0; when the distance is 0, the similarity is 1. Another benefit of the inverse transformation is that it normalizes all similarity values to the range from 0 to 1.

S(x, y) = \frac{1}{c^{\delta(x, y)}}    (2.25)
Fig. 2.2. Display of three Minkowski metrics (unit contours for k = 1, k = 2, and k = \infty)
In Eq. (2.25), x and y stand for two objects in the vector space, and the constant c is always larger than 1. This assures that when the distance between two objects increases, their corresponding similarity decreases. The constant c controls how strongly the distance between two objects affects the similarity: the larger the constant c is, the stronger its impact on the similarity, and vice versa. Both the distance-based similarity measure and the cosine similarity measure are widely used in information retrieval. The former uses the spatial distance characteristics of the investigated documents to measure their similarity, while the latter uses their spatial direction characteristics. It is worth pointing out that two documents with a high similarity value under one measure may have a very low similarity value under the other. In other words, a high similarity value under one measure does not necessarily mean a high similarity value under the other; the similarity values of two documents under the two measures depend on their spatial locations in a high-dimensional space.
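A minimal sketch of this point, using hypothetical vectors: the inverse transformation of Eq. (2.25) turns a distance into a similarity between 0 and 1, and two documents that are proportionally similar (a high cosine value) can still receive a very low distance-based similarity.

def euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def cosine(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return dot / (sum(ai**2 for ai in a) * sum(bi**2 for bi in b)) ** 0.5

def distance_similarity(a, b, c=2.0):
    # Eq. (2.25): S(x, y) = 1 / c**delta(x, y), with the constant c > 1
    return 1.0 / c ** euclidean(a, b)

x = [1, 2, 1]
y = [10, 20, 10]                 # same direction as x, but much farther from the origin
print(cosine(x, y))              # 1.0: proportionally identical
print(distance_similarity(x, y)) # close to 0: spatially far apart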
2.3.7 Angle-distance integrated similarity measure
S(x, y) = c^{\left(\left(\sum_{i=1}^{n} a_i^2\right)^{1/2} - \left(\sum_{i=1}^{n} b_i^2\right)^{1/2}\right)^2} \times \frac{\sum_{i=1}^{n} a_i \times b_i}{\left(\sum_{i=1}^{n} a_i^2\right)^{1/2} \times \left(\sum_{i=1}^{n} b_i^2\right)^{1/2}}    (2.26)
The angle-distance integrated similarity measure (Zhang and Rasmussen, 2001) takes the strengths of both the angle-based similarity measure and the distance-based similarity measure into account. By adding the distance modifier c^{\left(\left(\sum_{i=1}^{n} a_i^2\right)^{1/2} - \left(\sum_{i=1}^{n} b_i^2\right)^{1/2}\right)^2} to the cosine measure, the method combines the distance strength of the two compared objects with the angle-based measure. In the above equation the positive constant c (0 < c < 1) determines how strongly the distance difference between the two objects reduces the similarity value.
2.3.8 The Pearson r correlation measure

The Pearson Product Moment Correlation Coefficient is the most widely used measure for correlation or association analysis. It was named after Karl Pearson, the designer of the co-relational method used in agricultural research. The product-moment correlation, commonly expressed as r, indicates the strength of a relationship between two variables that are assumed to be measured on an interval or ratio scale. The two variables should have a linear relationship, and each of the variables should be normally distributed. There are two equations which can be used to calculate the Pearson r of two vectors x and y.

S(x, y) = r = \frac{n \sum_{i=1}^{n} a_i b_i - \left(\sum_{i=1}^{n} a_i\right)\left(\sum_{i=1}^{n} b_i\right)}{\left(n \sum_{i=1}^{n} a_i^2 - \left(\sum_{i=1}^{n} a_i\right)^2\right)^{1/2} \times \left(n \sum_{i=1}^{n} b_i^2 - \left(\sum_{i=1}^{n} b_i\right)^2\right)^{1/2}}    (2.27)

S(x, y) = r = \frac{\sum_{i=1}^{n} \left(a_i - \frac{\sum_{i=1}^{n} a_i}{n}\right)\left(b_i - \frac{\sum_{i=1}^{n} b_i}{n}\right)}{\left(\sum_{i=1}^{n} \left(a_i - \frac{\sum_{i=1}^{n} a_i}{n}\right)^2\right)^{1/2} \times \left(\sum_{i=1}^{n} \left(b_i - \frac{\sum_{i=1}^{n} b_i}{n}\right)^2\right)^{1/2}}    (2.28)
Pearson r ranges from -1.0 to +1.0; the sign indicates whether the relationship is positive (+) or negative (-), and the absolute value of the coefficient indicates its strength. Eq. (2.28) is easier to explain than Eq. (2.27). The numerator of Eq. (2.28) sums the joint deviations of the two variables from their
respective means. The denominator is the product of the square roots of the two variables' sums of squared deviations, which adjusts for the overall variation. In fact, Eq. (2.28) is a special form of the cosine similarity measure: it first subtracts the mean of the elements from each variable and then computes the cosine similarity. In this sense, the Pearson r is regarded as a special type of the cosine similarity measure.
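As a hedged illustration of this equivalence, the sketch below computes Pearson r both from the raw sums of Eq. (2.27) and as the cosine similarity of the mean-centered vectors in the spirit of Eq. (2.28); the sample vectors are hypothetical.

def pearson_raw(a, b):
    # Eq. (2.27): computed from raw sums
    n = len(a)
    sa, sb = sum(a), sum(b)
    sab = sum(ai * bi for ai, bi in zip(a, b))
    saa = sum(ai**2 for ai in a)
    sbb = sum(bi**2 for bi in b)
    return (n * sab - sa * sb) / ((n * saa - sa**2) ** 0.5 * (n * sbb - sb**2) ** 0.5)

def pearson_centered(a, b):
    # Eq. (2.28): cosine similarity of the mean-centered vectors
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [ai - ma for ai in a]
    db = [bi - mb for bi in b]
    num = sum(x * y for x, y in zip(da, db))
    den = (sum(x**2 for x in da) ** 0.5) * (sum(y**2 for y in db) ** 0.5)
    return num / den

a = [2.0, 4.0, 6.0, 9.0]
b = [1.0, 3.0, 5.0, 10.0]
print(pearson_raw(a, b), pearson_centered(a, b))  # the two values agree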
2.4 Information retrieval (evaluation) models

Information retrieval (evaluation) models in a vector space can be used to define a special retrieval contour based on a query, and all documents located within the contour are regarded as the documents retrieved for the query. The location of the contour depends on the query submitted by users, the shape of the contour depends on the information retrieval (evaluation) model chosen by users, and the size of the contour is determined by a threshold controlled by users. It is quite clear that the larger the contour, the more documents may be retrieved, and vice versa. The retrieval contour is always situated in the first quadrant of the vector space because all weights of document vectors are positive or equal to zero. There are two basic types of information retrieval (evaluation) models: one-reference-point-based and two-reference-point-based. A reference point refers to users' information needs in a broad sense. A reference point in a vector space can be described in vector form, like a document, and may be defined from users' previous queries, the current query, users' research interests, or anything else related to the users' search; a query can be viewed as a special reference point. The one-reference-point-based models include the direction-based evaluation model and the distance-based evaluation model. The two-reference-point-based models include the ellipse evaluation model, the conjunction evaluation model, the Cassini oval evaluation model, and the disjunction evaluation model.
2.4.1 Direction-based retrieval (evaluation) model

The direction-based evaluation model, also called the angle-based or cosine evaluation model, is a one-reference-point-based model. The retrieval contour is a cone. A reference point (R) or query defines a cone in the vector space (see Fig. 2.3); the reference point and the origin of the vector space form the central line of the cone. In this case the angle (\alpha) of the cone is the threshold controlled by users. Documents situated within the cone are the retrieved documents. The angle between any document and the central line can be calculated using the cosine similarity measure and then compared with the threshold \alpha: if it is smaller than or equal to \alpha, the document is retrieved; otherwise, it lies outside the cone and is excluded.
Fig. 2.3. Display of the direction based evaluation model
2.4.2 Distance-based retrieval (evaluation) model

The distance-based evaluation model is a one-reference-point-based model. The retrieval contour of the distance-based retrieval model is a sphere (see Fig. 2.4). The center of the sphere is the reference point (R) or query, and the retrieval threshold is the radius (r) of the sphere. Documents within the sphere are the retrieved documents.
Fig. 2.4. Display of the distance based evaluation model
Whether a document is within the sphere depends upon the distance between the reference point/query and the document, which can be calculated using the distance equations discussed in the distance similarity measure section. The larger the radius (r), the more documents may be included in the sphere, and vice versa. However, as more documents are added to the sphere, the relevance between the query and the newly added documents decreases, because they are relatively farther away from the query in the vector space.
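A minimal sketch of the two one-reference-point models, assuming the caller supplies the document vector, the reference point R, and the thresholds: the direction-based model keeps a document whose angle to R is at most alpha, and the distance-based model keeps a document whose Euclidean distance to R is at most the radius r.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def retrieved_by_direction(doc, R, alpha):
    # Direction-based model: the angle between the document and the
    # central line (origin to R) must not exceed the threshold alpha (radians).
    return math.acos(min(1.0, cosine(doc, R))) <= alpha

def retrieved_by_distance(doc, R, r):
    # Distance-based model: the document must lie inside the sphere
    # of radius r centered at the reference point R.
    return euclidean(doc, R) <= r

R = [0.8, 0.6, 0.0]                      # hypothetical reference point
doc = [0.7, 0.7, 0.1]
print(retrieved_by_direction(doc, R, math.radians(15)))
print(retrieved_by_distance(doc, R, 0.5))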
2.4.3 Ellipse retrieval (evaluation) model

The ellipse evaluation model is a two-reference-point-based model. The shape of the ellipse evaluation model looks like an oval (see Fig. 2.5). The two reference points are R1 and R2, and E is any point on the ellipse retrieval contour. The location of the oval is determined by the two reference points defined by users. The sum of the distance from R1 to E (|R1E|) and the distance from R2 to E (|R2E|) is always equal to a constant c, and the oval is symmetric about the axis formed by R1 and R2. It is evident that the constant c is the retrieval threshold: the larger the threshold c, the more documents may be retrieved. Given that x is a document in the vector space, if x satisfies the following equations it is located within the oval and therefore retrieved.

R_1 = (r_{11}, r_{12}, \ldots, r_{1n})    (2.29)

R_2 = (r_{21}, r_{22}, \ldots, r_{2n})    (2.30)

x = (a_1, a_2, \ldots, a_n)    (2.31)

c = |R_1E| + |R_2E|    (2.32)

c \geq \left(\sum_{i=1}^{n} (a_i - r_{1i})^2\right)^{1/2} + \left(\sum_{i=1}^{n} (a_i - r_{2i})^2\right)^{1/2}    (2.33)
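A small sketch of the ellipse criterion in Eq. (2.33), with hypothetical reference points and threshold: a document is retrieved when the sum of its Euclidean distances to R1 and R2 does not exceed the constant c.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def in_ellipse(doc, R1, R2, c):
    # Eq. (2.33): sum of the distances to the two reference points <= c
    return euclidean(doc, R1) + euclidean(doc, R2) <= c

R1, R2 = [1.0, 0.0], [3.0, 0.0]                # hypothetical reference points
print(in_ellipse([2.0, 0.5], R1, R2, c=3.0))   # True: near the axis between R1 and R2
print(in_ellipse([6.0, 0.0], R1, R2, c=3.0))   # False: too far from both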
2.4.4 Conjunction retrieval (evaluation) model

This model is a two-reference-point-based model. The two reference points R1 and R2 correspond to two spheres in the vector space, and the overlapping part of these two spheres is defined as the retrieval contour (see the shaded part in Fig. 2.6). In fact, the conjunction evaluation model functions similarly to the logic AND operation in a Boolean retrieval system. A retrieved document x in the vector space must satisfy the following equation (Eq. (2.34)). The parameter k, the retrieval threshold, is the radius of both spheres.
Fig. 2.5. Display of the ellipse evaluation model
Definitions of the variables are the same as those in the ellipse evaluation model. The retrieval contour is affected by the parameter k and the locations of the two reference points R1 and R2. The retrieval set may be empty if the distance between the two reference points is too great or the radius k is too small; in other words, if the two spheres do not intersect in the vector space, the retrieval set is empty. Even if the two spheres do intersect, it is still possible that no document lies within the overlap.
Fig. 2.6. Display of the conjunction evaluation model
k \geq \max\left(\left(\sum_{i=1}^{n} (a_i - r_{1i})^2\right)^{1/2}, \left(\sum_{i=1}^{n} (a_i - r_{2i})^2\right)^{1/2}\right)    (2.34)
2.4.5 Disjunction evaluation model

The disjunction evaluation model is a two-reference-point-based model. The two reference points R1 and R2 correspond to two spheres, as in the conjunction model, but here the two spheres together define the retrieval contour (see Fig. 2.7). In fact, the disjunction evaluation model functions similarly to the logic OR operation in a Boolean retrieval system. A retrieved document x in the vector space must satisfy the following equation (Eq. (2.35)); in other words, as long as it is within either of the two spheres, it is regarded as retrieved. The definitions of the variables and parameters are the same as those in the conjunction evaluation model. The retrieval contour is a single contour if the two spheres intersect, and two separate spheres if they do not intersect in the vector space.

k \geq \min\left(\left(\sum_{i=1}^{n} (a_i - r_{1i})^2\right)^{1/2}, \left(\sum_{i=1}^{n} (a_i - r_{2i})^2\right)^{1/2}\right)    (2.35)

Fig. 2.7. Display of the disjunction evaluation model
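The conjunction and disjunction criteria can be sketched in the same style, with hypothetical reference points and radius k; the conjunction test corresponds to a Boolean AND over the two spheres and the disjunction test to a Boolean OR.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def in_conjunction(doc, R1, R2, k):
    # Retrieved only if the document lies inside BOTH spheres of radius k.
    return max(euclidean(doc, R1), euclidean(doc, R2)) <= k

def in_disjunction(doc, R1, R2, k):
    # Retrieved if the document lies inside EITHER sphere of radius k.
    return min(euclidean(doc, R1), euclidean(doc, R2)) <= k

R1, R2 = [0.0, 0.0], [4.0, 0.0]
doc = [1.0, 0.0]
print(in_conjunction(doc, R1, R2, k=2.5))   # False: outside the sphere around R2
print(in_disjunction(doc, R1, R2, k=2.5))   # True: inside the sphere around R1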
2.4.6 The Cassini oval retrieval (evaluation) model

Like the ellipse model, the conjunction model, and the disjunction model, the Cassini oval evaluation model is based on two reference points R1 and R2, and its definition is similar to that of the ellipse model. The two reference points are located within the contour, and the product of the two distances from any point on the contour to the two reference points (or foci) is a constant. That is, if the product of the two distances from a document to the two reference points satisfies Eq. (2.36), the document is within the contour and is therefore retrieved. In Eq. (2.36), the retrieval threshold c is a constant, one of the two parameters determining the shape of the Cassini oval, and it is always equal to or larger than 0.

c \geq \left(\sum_{i=1}^{n} (a_i - r_{1i})^2\right)^{1/2} \times \left(\sum_{i=1}^{n} (a_i - r_{2i})^2\right)^{1/2}    (2.36)
Denote by h half the distance between the two reference points in the vector space; it is also always equal to or larger than 0. Then we have:

2h = \left(\sum_{i=1}^{n} (r_{2i} - r_{1i})^2\right)^{1/2}    (2.37)
It is interesting that as the ratio of \sqrt{c} to h changes, the shape of the contour alters accordingly. If h is smaller than \sqrt{c}, the contour looks like an oval or a dumbbell with the two reference points inside it (see Figs. 2.8 (a) and (b)). If h is equal to \sqrt{c}, it becomes two connected loop contours (see Fig. 2.8 (c)). If h is larger than \sqrt{c}, it becomes two separate loop contours (see Fig. 2.8 (d)), and the two reference points are still within the contours. This suggests that, for a fixed \sqrt{c}, as the distance (2h) between the two reference points increases, the contour gradually separates into two parts in the space. These characteristics have an apparent information retrieval implication. When the two reference points are close to each other in the vector space and the documents in their neighboring area are relevant to both reference points, the entire area around the two reference points is included in the retrieval contour. As the two reference points move away from each other and the middle area between them becomes less relevant to either reference point, the surrounding areas of the two reference points are still included while the middle area of the retrieval contour shrinks. The entire middle area is excluded from the retrieval contour when the two reference points are too far away from each other, because in this case the middle area between them is no longer relevant to either of them. The oval contour and the separated sub-contours are always symmetric about the axis formed by the two reference points, regardless of the distance between them in the space.
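A brief sketch of the Cassini oval criterion in Eq. (2.36) and of the shape regimes just described, using hypothetical reference points: a document is retrieved when the product of its distances to R1 and R2 does not exceed c, and comparing h with the square root of c indicates whether the contour is a single oval or dumbbell, two connected loops, or two separate loops.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def in_cassini(doc, R1, R2, c):
    # Eq. (2.36): product of the distances to the two reference points <= c
    return euclidean(doc, R1) * euclidean(doc, R2) <= c

def contour_shape(R1, R2, c):
    # Eq. (2.37): h is half the distance between the two reference points.
    h = euclidean(R1, R2) / 2.0
    if h < c ** 0.5:
        return "single oval or dumbbell"
    if h == c ** 0.5:
        return "two connected loops"
    return "two separate loops"

R1, R2 = [0.0, 0.0], [4.0, 0.0]
print(in_cassini([1.0, 0.0], R1, R2, c=3.5))   # True: 1 * 3 = 3 <= 3.5
print(contour_shape(R1, R2, c=3.5))            # h = 2 > sqrt(3.5): two separate loops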
Fig. 2.8. Display of the Cassini oval model
2.5 Clustering algorithms

A clustering algorithm is a data analysis method that organizes a dataset into categorical groups based on certain data association criteria. The association measure in a clustering algorithm can range from the co-citation analysis method and the cosine similarity measure to the distance-based similarity measure and the other similarity measures discussed previously, and different similarity measures can result in different clustering results. Items or objects within the same group/cluster are more similar than items from two distinct groups/clusters. An automatic clustering process is considered an unsupervised learning process because it can automatically reveal intrinsic categorical patterns in a dataset. Unlike a traditional classification method, where the hierarchy and categories are predefined, the categories generated by a clustering algorithm are usually dynamic and not predetermined; they depend on the nature of the dataset, the association criteria of clustering, and the distribution of data items in the dataset. Clustering algorithms can basically be classified into two types: non-hierarchical clustering algorithms (partitioning clustering algorithms) and hierarchical clustering algorithms (Rasmussen, 1992). The major difference between them is that a hierarchical clustering algorithm generates a multiple-level categorical structure for the clustered items, while a non-hierarchical clustering algorithm partitions the items into a one-level categorical structure.
Fig. 2.9. Display of hierarchical clustering results
Figs. 2.9 and 2.10 show results of a hierarchical clustering algorithm and a non-hierarchical clustering algorithm, respectively. {A, B, C, D, E, F, G, H, I, J} are items in a dataset.
Fig. 2.10. Display of non-hierarchical clustering results
In Fig. 2.9, each circle represents a cluster, and the small circles are sub-clusters of the large circle that contains them. C, D, and E are three individual items; they form a new cluster, which is a sibling cluster of clusters A and B. The cluster {A, B, {C, D, E}} in turn yields another cluster, which is a sibling cluster of clusters {I, J} and {F, G, H}. In Fig. 2.10, there is no hierarchical structure; therefore, all items are located at the same level.
2.5.1 Non-hierarchical clustering algorithm

Non-hierarchical clustering algorithms partition N items into K categorical groups. The number of categorical groups must be predefined; in other words, the final number of categorical groups is independent of the number of clustered items, the distribution of the items, the association/similarity measure selected, and the cluster membership function. The produced categorical groups are mutually exclusive, which means that an item can fall into only one categorical group. The K-means clustering algorithm, one of the popular non-hierarchical clustering algorithms, is based on a simple iterative scheme for finding a locally minimal solution (MacQueen, 1967). The algorithm starts with a guess about the solution and then readjusts the cluster centroids until a local optimum is reached. A centroid is a special, artificially created item that represents its cluster for various purposes; it is defined as the average coordinates of all items in the cluster it represents, and it should possess the traits of that cluster. A cluster membership function is a method for judging whether an item is assigned to a cluster during the clustering process, and it plays an extremely important role in that process.

M(c_j, x_i) = \min\left(|x_i - c_j|\right), \quad j = 1, \ldots, k    (2.38)
M(cj, xi) is the cluster membership function. Here i (i = 1, …, n) indexes the investigated items and k is the number of predetermined clusters. The function assigns item xi to the centroid cj with the smallest distance |xi - cj|. The K-means clustering algorithm is described as follows:

L1   Begin
L2     Set the number of clusters k;
L3     Put all items in a process list;
L4     Randomly assign k items from the list to k initial clusters; they also serve as centroids of the clusters;
L5     Do
L6       Use the membership function to calculate the difference between an item and each of the centroids;
L7       Assign the item to the cluster whose centroid is the most similar to the item;
L8       Recalculate the centroids whose clusters change because of item removal or item addition;
L9     DoEnd While Convergence is achieved;
L10  End.
L2 to L4 prepare for the later process. L5 to L9 form an iterative process that optimizes the k clusters for a dataset. During the clustering process, if an item violates the membership criteria, it is removed from its cluster and added to the closest cluster, the one whose centroid is most similar to the item. Changes to these clusters can in turn change the corresponding centroids, so after cluster adjustments the centroids of the involved clusters must be recalculated. Convergence means that no item will be removed from or added to any cluster. The K-means clustering algorithm is simple and easy to implement, and it minimizes local (intra-cluster) variance through an iterative process. How the number of clusters (k) is defined is important for the K-means clustering algorithm; it can be determined by the elbow criterion. A plot whose X-axis and Y-axis are the number of produced clusters and the percentage of variance explained by the clusters, respectively, is drawn and analyzed for this purpose. Basically, as the number of clusters increases, the corresponding percentage of explained variance increases. From the plot we can find an elbow point after which increasing the number of clusters no longer yields an appreciable increase in the percentage of variance; the number of clusters at the elbow point can then serve as the expected cluster number.
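A minimal Python sketch of the K-means procedure outlined above, using the Euclidean distance as the membership function in the spirit of Eq. (2.38); the small two-dimensional dataset is hypothetical.

import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def centroid(cluster):
    # Average coordinates of all items in the cluster (an artificial item).
    n = len(cluster)
    return [sum(item[d] for item in cluster) / n for d in range(len(cluster[0]))]

def kmeans(items, k, max_iter=100, seed=0):
    random.seed(seed)
    centroids = random.sample(items, k)          # k initial items serve as centroids
    assignment = [None] * len(items)
    for _ in range(max_iter):
        # Membership step: assign each item to the nearest centroid.
        new_assignment = [min(range(k), key=lambda j: euclidean(x, centroids[j]))
                          for x in items]
        if new_assignment == assignment:          # convergence: no item changes cluster
            break
        assignment = new_assignment
        # Recalculate the centroids of the (possibly changed) clusters.
        for j in range(k):
            members = [x for x, a in zip(items, assignment) if a == j]
            if members:
                centroids[j] = centroid(members)
    return assignment, centroids

items = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
print(kmeans(items, k=2))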
2.5.2 Hierarchical clustering algorithm

The hierarchical clustering algorithm yields a categorical tree structure, also called a dendrogram. In such a structure, a child sub-cluster is contained within its parent cluster: sibling clusters located at the same level of the structure are mutually exclusive, while a cluster and its parent overlap. For the hierarchical clustering algorithm, the clustering process is recursive; in other words, successive sub-clusters are generated from an existing cluster, and new sub-sub-clusters are produced from one of the sub-clusters. The hierarchical cluster structure can be produced by two basic strategies: the agglomerative (or bottom-to-top) algorithm and the divisive (or top-to-bottom) algorithm. The former first treats the input items as an initial set of clusters and then merges close clusters from the existing cluster set to form a parent cluster; it ends when all existing clusters have merged into one large cluster, the root of the tree. The divisive (top-to-bottom) algorithm takes the opposite direction, as its name suggests: it starts with the root of the tree, breaks the one large cluster down into several smaller clusters, and keeps recursively partitioning clusters until certain criteria are met. In practice, agglomerative clustering algorithms are more popular than divisive ones, so the following discussion focuses on the agglomerative clustering algorithm.
In order to merge two clusters effectively, we have to define a cluster merging function to select the target clusters. The idea behind the merging function is to find the two most similar clusters in a cluster set and merge them into a new, larger cluster. Therefore we have to define the similarity between two clusters. There are many approaches available for calculating the similarity of two clusters; three popular equations are listed below.

Complete linkage clustering:

CS(C_i, C_j) = \min\left(IC(x, y)\right), \quad x \in C_i,\; y \in C_j,\; i \neq j    (2.39)
In the above equation, CS(Ci, Cj) is the similarity between two clusters Ci and Cj, and IC(x, y) is the similarity between two items x and y drawn from the two clusters respectively. Under complete linkage, the similarity between two clusters is defined by the least similar pair of items in the two clusters.

Single linkage clustering:

CS(C_i, C_j) = \max\left(IC(x, y)\right), \quad x \in C_i,\; y \in C_j,\; i \neq j    (2.40)
The definitions of CS(Ci, Cj), Ci, Cj, and IC(x, y) are the same as in the previous equation. Under single linkage, the similarity between two clusters is defined by the most similar pair of items in the two clusters.

Average linkage clustering:

CS(C_i, C_j) = \frac{\sum_{k=1}^{|C_i|} \sum_{l=1}^{|C_j|} IC(x_k, y_l)}{|C_i| \times |C_j|}, \quad x_k \in C_i,\; y_l \in C_j,\; C_i \neq C_j    (2.41)
|C| denotes the number of elements in cluster C. Under average linkage, the similarity between two clusters is defined as the average similarity value over all pairs of items in the two clusters. The agglomerative clustering algorithm is described as follows:

L1   Begin
L2     Input items into the R list;
L3     Select an appropriate cluster merging function CS;
L4     Do
L5       Find the two most similar clusters (Ci and Cj) in R,
L6         FIND(Ci, Cj) = MAX(CS(Ci, Cj)), Ci and Cj in R;
L7       Merge Ci and Cj as a new cluster and put it into R;
L8       Remove Ci and Cj from R;
L9     DoEnd While R has more than one cluster;
L10  End.
L2 to L3 prepare for the later process. R is a storage area used to store clusters; at the beginning, each item is treated as a cluster in R. L4 to L9 repeatedly find the two most similar clusters in R, merge them into one cluster, and remove the old clusters. The algorithm ends when only one cluster is left.
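The sketch below follows the agglomerative procedure above, using single linkage (Eq. (2.40)) as the cluster merging function CS; the item-level similarity function and the one-dimensional items are hypothetical stand-ins.

def item_similarity(x, y):
    # Hypothetical item-level similarity IC(x, y): an inverse of the distance
    return 1.0 / (1.0 + abs(x - y))

def single_linkage(ci, cj):
    # Eq. (2.40): similarity of the most similar pair of items across the two clusters.
    return max(item_similarity(x, y) for x in ci for y in cj)

def agglomerate(items, linkage=single_linkage):
    clusters = [[x] for x in items]              # each item starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the two most similar clusters under the chosen linkage.
        best = max(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((clusters[i], clusters[j]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges                                # the sequence of merges (the dendrogram)

print(agglomerate([1.0, 1.2, 5.0, 5.1, 9.0]))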
2.6 Evaluation of retrieval results

Evaluation of a search result is important both for information retrieval system designers and for system users. Search results can be affected by factors on the system side, such as database coverage, database record quality, system interface design, and system features, and also by factors on the users' side, such as the user's expertise in information retrieval, experience, and familiarity with his/her search domains. There is an array of search evaluation approaches, but recall and precision are basic and fundamental. When a user submits a query to an information retrieval system and the system responds with a results list, the query actually partitions the entire database into four distinct sets: retrieved and relevant (A); retrieved and non-relevant (B); not retrieved but relevant (C); and not retrieved and non-relevant (D). The four sets are shown in Table 2.1. It is clear that a good search result should maximize A and minimize B and C.

Recall is defined as:

Recall = \frac{A}{A + C}    (2.42)

Precision is defined as:

Precision = \frac{A}{A + B}    (2.43)
Eqs. (2.42) and (2.43) show that valid values for both recall and precision fall between 0 and 1. The recall of a search result indicates how well a search finds what a user needs against what the system should provide, while the precision of a search indicates how well a search finds what a user needs against what the search has actually found. It is not surprising that recall and precision maintain an inverse relationship; in other words, a high recall usually corresponds to a low precision, and vice versa.

Table 2.1. Partition of a search result in a database

                  Relevant    Non-relevant
Retrieved         A           B
Not retrieved     C           D
As users try to improve recall by maximizing set A, the chance of adding “noise” (non-relevant documents) to the search result, that is, enlarging set B, also increases, which results in an unexpected drop in precision; the degree to which such “noise” is added varies among searchers. The major strength of recall and precision rests upon their simplicity and operability. There are also problems with recall and precision. Since the relevance judgment of a retrieved document can be both objective and subjective, it brings uncertainty to the calculation of recall and precision. When computing recall, we have to determine the not retrieved but relevant set (C), and determining this set has proven to be a tough job, especially in a large database like the Internet. Recall and precision are meant to measure the quality of a search result; they say nothing about the search process. Evaluating a dynamic and interactive search in a more advanced information retrieval system, such as an information retrieval visualization system, is far beyond their reach.
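A tiny sketch of Eqs. (2.42) and (2.43), using hypothetical document-identifier sets: the counts A, B, and C are derived from the retrieved set and the relevant set, and recall and precision follow directly.

def recall_precision(retrieved, relevant):
    A = len(retrieved & relevant)          # retrieved and relevant
    B = len(retrieved - relevant)          # retrieved and non-relevant
    C = len(relevant - retrieved)          # not retrieved but relevant
    recall = A / (A + C)                   # Eq. (2.42)
    precision = A / (A + B)                # Eq. (2.43)
    return recall, precision

retrieved = {1, 2, 3, 4, 5}
relevant = {2, 3, 5, 8, 9, 10}
print(recall_precision(retrieved, relevant))   # (0.5, 0.6)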
2.7 Summary

In this chapter, the vector space model and its characteristics have been discussed. Term weighting methods and related techniques such as stop word removal, inverse document frequency, Salton's term weighting method, the probability term weighting method, and so on, were introduced. Similarity measures such as the inner product measure, the Dice co-efficient measure, the Jaccard co-efficient measure, the overlap co-efficient measure, the cosine measure, the distance measure, the angle-distance integrated similarity measure, and the Pearson r correlation measure were addressed. Many similarity measures were derived from the basic inner product measure by using different normalization techniques. Similarity measures can be employed to determine relationships among projected documents in a visual space. Information retrieval (evaluation) models such as the direction-based model, the distance-based model, the ellipse model, the conjunction model, and the disjunction model were presented. Users can apply information retrieval (evaluation) models to narrow down their search scopes in a vector space, and these models can be visualized in several information retrieval visualization models based on Euclidean spatial characteristics. Both the hierarchical clustering algorithm and the non-hierarchical clustering algorithm were discussed. Definitions and potential problems of recall and precision were given. These concepts, algorithms, and theories of information retrieval are prerequisites for information retrieval visualization. For many issues, such as similarity measurement, a wide spectrum of solutions is offered; the choice among them is at the discretion of the researchers.
Chapter 3 Visualization Models for Multiple Reference Points
Visualization models for multiple reference points have many unique characteristics. They can effectively handle the complex information needs of a user by using multiple reference points, and they offer excellent operational flexibility in manipulating those reference points. A traditional information retrieval system responds to a user's search query with a linear results list. Retrieved documents may be ranked based upon their similarities to the query if the system offers a relevance ranking mechanism. However, such a system does not provide users with a flexible and powerful browsing environment in which they can make relevance judgments about the retrieved documents, because the linear list offers only relevance information between the query and the retrieved documents and no relevance information among the retrieved documents themselves. Originally, the visualization algorithms for multiple reference points were designed to visualize the results of a search query, overcoming the weakness of the linear structure of a results list by projecting the retrieved documents onto a low-dimensional visual space. This visual space is defined by the multiple reference points. Basically, a reference point represents a user's information need; we will discuss this concept and its implications in depth in the following section. In this sense, the visualization models for multiple reference points were directly driven or motivated by information retrieval. The visualization algorithms can be classified into two groups based upon the kind of information retrieval system they work with: algorithms that work with a Boolean information retrieval system and algorithms that work with a vector information retrieval system. The algorithms for multiple reference points can also be divided into three categories based upon the status of the reference points in the visual space: algorithms for fixed reference points, algorithms for manually movable reference points, and algorithms for automatically rotating reference points. The status changes of reference points not only facilitate users' manipulation of the reference points, but also assist users in clarifying the notorious ambiguity of overlapped documents in the visual space and reveal new visual semantic relationships among displayed documents. Among the three categories, a large body of research and applications has gravitated to the second category because the algorithms for manually movable reference points are flexible, simple, and powerful. The representative and pioneering paradigm in this category was VIBE (Visual Information Browsing
Environment) (Olsen and Korfhage, 1994). Other related algorithms and applications in this category were derived directly from the VIBE algorithm. The underlying principle of the VIBE algorithm is that the position of a displayed document between two reference points indicates its relevance to them in the visual space, and the position of a reference point can be any location in the visual space. Due to the algorithm's simplicity and flexibility, it has been widely used and adapted to many application domains. For instance, Web pages as objects were visualized in the visual environment WebVIBE (Morse and Lewis, 2002). The model was integrated in a multilingual and multimodal system to visualize video objects (Lyu et al., 2002). A visualization environment and a geographic environment were combined to generate a new environment in which users could search and browse geographic information in both Geo-VIBE (Cai, 2002) and Visual Digest (Christel, 1999; Christel and Huang, 2001). Radial (Carey et al., 2003) provided a two-dimensional circular space where terms or reference points were limited to positions on its boundary and documents were situated within the circle. Virtual reviewer applied the VIBE algorithm to the visualization of movie reviews, where reference points were virtual reviewers of movies while the displayed objects were movies (Tatemura, 2000). Experimental studies on the VIBE systems were also conducted (Koshman, 2004; Morse and Lewis, 2002).
3.1 Multiple reference points

A reference point (RP), or a point of interest (POI), is a search criterion against which database documents or document surrogates are matched and from which search results are generated and presented to users. In a broad sense, a reference point represents users' information needs and any information related to those needs. For instance, it can range from general information such as a user's past and current research interests, previous search histories, reading preferences, research projects, affiliation, and educational background, to specific information like search terms from a complicated query, browsed documents, and a group of the user's queries. A reference point may correspond to either a term or a group of terms, and the terms in a reference point can be weighted or unweighted. Since a reference point can cover all aspects of a user, it is also called a user profile. Now let us discuss the implication of multiple reference points for information retrieval. The primary implication of multiple reference points is that they can form a low-dimensional visual space onto which documents can be mapped based upon their attractions to the reference points. The contribution of each individual keyword of a query to the final retrieval result can hardly be observed in a linear results list, which reflects the combined impact of all terms involved in the query. The search process is further complicated when a user formulates a multi-faceted query. The user may want to see not only the result set of the original query but also result sets based upon component terms of
the query. This would give the user some idea of the contribution of the query parts to the full query (Havre et al. 2001). This is important for a searcher because searching is an iterative process, and each iteration needs to optimize the search query by adjusting search terms based upon search feedback. If users are not satisfied with the returned search results, a query reformulation follows, which may include removing useless search terms, adding new related terms, revising the weights assigned to the terms, and so on. All of these actions require a better understanding of the degree to which an individual term affects the returned search results. That is, if users can understand how and which search terms affect the search results, they can revise and modify a search strategy more accurately to perform a better search. For instance, if the size of a search result set is too small, users may want to know exactly which search terms led to the results. After identifying the terms, users can either change the weights assigned to them or replace them with other terms to increase the result size. If search terms are assigned to reference points and presented in the visual space, the impact of each individual term on the search results can be easily and intuitively perceived. Therefore, users can make a judgmental decision on query reformulation based upon the provided visual display. Multiple search terms, instead of a single search term, can also be grouped and assigned to a reference point. Previous search queries and the current search query can serve as reference points and be presented in the same visualization environment, so that both the degree to which the revised search strategies affect the final results and comparisons of the impact of these strategies are visually displayed. Similarly, a user's preferences, research interests, or other related information can also serve as reference points if users want to know their impact upon the search results.
3.2 Model for fixed multiple reference points

The representative two-dimensional visualization model for fixed multiple reference points was InfoCrystal (Spoerri, 1993 a and b). The model was originally designed as a visual query language to visualize a query result from a Boolean-based information retrieval system. In the Boolean context, each reference point may be equivalent to a term or a sub-Boolean logic expression from a Boolean query. The visual space is a polygon whose vertices are the reference points and within which the visual results are displayed. The side lengths of the polygon are equal so that the reference points are evenly configured in the visual space; for instance, if there are three, four, or five reference points, the corresponding polygons are an equilateral triangle, a square, and a pentagon, respectively. There are two basic types of icons within the polygon: the criterion icons, which are the reference points, and the interior result icons, which show the retrieval results. The polygon is partitioned into N exclusive tiers, where N is the number of reference points. The tiers are represented by concentric rings with different radii. Each of these rings defines a special interior display area where interior result icons between certain reference points/criterion icons are
placed. In other words, result icons in the same tier share some commonality with respect to the related reference points. The first tier covers result icons that represent documents related to only one inclusive reference point, the second tier covers result icons that represent documents related to only two inclusive reference points, …, and the last tier (the circle) covers one result icon that represents documents related to all N reference points. The shapes, directions, and sizes of these result icons vary across tiers so that users may easily distinguish and identify them. Each side of a result icon is designed and placed to face one of the related criterion icons it meets. The positions of result icons within a tier are fixed, and the number of documents satisfying the criterion is displayed on the corresponding result icon. The degree to which the related documents match the related reference points can also be visualized in the result icons by partitioning the icons proportionally and coloring them differently. For example, Fig. 3.1 shows the configuration for four reference points, where r1, r2, r3, and r4 are reference points located at the four corners of the large square as criterion icons. In this figure, there are four tiers due to the four reference points: four circles as interior result icons are situated in the first tier, eight rectangles in the second tier, four triangles in the third tier, and one square in the fourth tier. The circle close to r1 in the first tier shows the results of the documents meeting the criterion r1 AND NOT (r2 OR r3 OR r4), because it is located in the first tier and the related documents are relevant to r1 but not to r2, r3, or r4. The rectangle icon between reference points r1 and r2 in the second tier indicates the results of the documents satisfying the criterion r1 AND r2 AND NOT (r3 OR r4). The triangle icon between the center and reference point r1 in the third tier illustrates the results of the documents meeting the criterion r1 AND r2 AND r4 AND NOT r3. The square icon in the fourth tier shows the results of the documents meeting the criterion r1 AND r2 AND r3 AND r4. In this model, result icons between two non-adjacent reference points may appear more than once. For instance, both the rectangle icon between the center icon and reference point r1 and the rectangle icon between the center icon and reference point r3 in the second tier show the same results, satisfying the criterion r1 AND r3 AND NOT (r2 OR r4). If a criterion icon is a sub-Boolean expression (for instance, in the Boolean expression A AND (B OR C), B OR C is a sub-Boolean expression), it can be connected to another independent visual display that represents the sub-Boolean expression and has a display structure similar to its parent. In this way, the visualization model can be extended to display a hierarchy for a very complicated nested Boolean query. This model can easily be applied to visualize documents in a vector space. If it is adapted to a vector space, the display framework or structure of the visual space has to be changed accordingly.
Fig. 3.1. Display of 4 reference points in a fixed reference point environment. (Spoerri, 1993a). © 2003 IEEE. Reprinted with permission
Reference points are still situated at the corners of the polygon, and the center of the polygon is defined as the origin of the new display coordinate system. In this case, the objects in the visual space are individual document icons instead of interior result icons that represent a set of related documents. Documents are mapped within the polygon. The position of a document in the visual space is determined by two basic parameters: direction and distance. For instance, if a document is relevant to a reference point, it is located on the segment defined by the origin of the visual space and that reference point (criterion icon). A document is placed on the segment according to the principle that a document with a high similarity to the reference point is close to the origin, and vice versa. So, the position of a projected document is affected not only by its attractions to the criterion icons/reference points but also by the degree to which the document is relevant to the reference points. As we know, in a vector space such a relevance degree between a document and a reference point can be measured and calculated using any of the similarity measures discussed in Chap. 2. It is apparent that the visualization models for both Boolean and vector information retrieval systems support multiple reference points. Reference point icons are fixed and the valid display areas are the same in both models. However, the visualized objects in these two models are quite different: one visualizes interior result icons, which reflect a result set of retrieved documents, while the other displays individual documents. Interior result icon positions are fixed and the number of result icons is constant in the Boolean-based model, while the positions of projected documents are not fixed and the number of document icons is variable in the vector-based model. Notice that in the Boolean-based model, an interior result icon within a tier represents results in which every reference point is either inclusive (AND) or exclusive (NOT) in the criterion. When the number of exclusive (NOT) reference
points in a criterion increases, it can easily lead to an empty set of retrieved documents. This may happen when inclusive (AND) reference points are highly associated with exclusive (NOT) reference points. To avoid this phenomenon, the system should allow users to enable and disable exclusive (NOT) reference points in a criterion. The uniqueness of the visualization models for fixed multiple reference points is that they may be applied to both a Boolean-based retrieval model and a vector-space-based retrieval model. Because all reference points are evenly placed and fixed in the visual space, the visual area layout is symmetrical and well balanced, and the interface therefore achieves an aesthetically appealing effect.
3.3 Models for movable multiple reference points

The VIBE model (Olsen et al., 1993 a), one model for movable multiple reference points, is distinguished from other visualization models by the fact that its ratio-based similarity scales make displayed objects movable while the semantic connections of these objects are still maintained in the visual space. Unlike in other visualization models, the similarity between a displayed object and a reference point is not directly assigned to any Cartesian coordinate of the display space. The Cartesian system is usually employed to present objects in a visualization model; Cartesian coordinates can be two-dimensional or three-dimensional, each dimension corresponds to a linear axis, the axes are mutually orthogonal (perpendicular to each other), and any of the axes can range from -\infty to +\infty. The uniqueness of the algorithm means that the logical relationships among the displayed objects and reference points are independent of their physical locations in the visual space. Position changes of reference points may result in a reconfiguration of the projected objects/documents in the visual space. The primary benefit of this uniqueness is that users may arbitrarily place a reference point anywhere in the visual space, near any area of interest (for instance, another reference point, a document of interest, or a cluster of documents), and observe the impact of that reference point on the area.
3.3.1 Description of the original VIBE algorithm

Since the VIBE model works in a vector space, a document Ds can be described by a group of indexing keywords Ds(k1, k2, …, ki, …, kg), where ki is a keyword and g is the number of index keywords for the document. Rj(k1, k2, …, ki, …, km) is a reference point (j = 1, …, q), where ki is a keyword, q is the number of reference points predefined by users, and m is the number of keywords for the reference point Rj. According to the algorithm, the position of a document is strongly related to the similarities between the document and the group of predefined reference points. The impact of these reference points on the document is defined by a document
reference point vector DRPVs(r1, r2, …, ri, …, rq), where ri is the relevance value between document Ds and reference point Ri(k1, k2, …, ki, …, km) (i = 1, …, q).

r_i = \frac{|D_s \cap R_i|}{\sum_{j=1}^{q} |D_s \cap R_j|}    (3.1)
Eq. (3.1) shows that the relevance between document Ds and reference point Ri is determined by the ratio of the number of keywords shared by document Ds and reference point Ri to the sum of the numbers of keywords shared by document Ds and all reference points. The function |X| indicates the number of elements in a set X. Notice that there may be many other ways to calculate the relevance between a document and a reference point. Because documents and reference points are described in a vector space, the Euclidean distance similarity model, the cosine similarity model, and the other models discussed in Chap. 2 are all applicable. If document index keywords and reference point keywords are associated with weights, the relevance between document Ds and reference point Ri can also be defined as:

r_i = \frac{\sum_{a=1}^{|D_s \cap R_i|} \min\left(W_{D_s}(k_a), W_{R_i}(k_a)\right)}{\sum_{j=1}^{q} \left(\sum_{a=1}^{|D_s \cap R_j|} \min\left(W_{D_s}(k_a), W_{R_j}(k_a)\right)\right)}    (3.2)
In Eq. (3.2), i is always between 1 and q, and WDs(ka) and WRi(ka) are the weights of a shared keyword ka in document Ds and reference point Ri, respectively. Keyword ka is an element of the intersection of the keyword sets of document Ds and reference point Ri (Ds \cap Ri). In this equation, the sum of the minimum weights of the keywords shared by document Ds and reference point Ri replaces the number of shared keywords. The equation is also normalized by the total of the minimum weights of the keywords shared by document Ds and all reference points Rj (j = 1, …, q) to avoid an unnecessary scale effect. A reference point can be located at any meaningful point in the two-dimensional visual space. P(Rj) denotes the location of reference point Rj in the visual space.

P(R_j) = V_j(x_j, y_j), \quad j = 1, \ldots, q    (3.3)
In fact, Vj(xj, yj) defines the position of the reference point icon in the two-dimensional visual space. The positions of all related reference points in the visual space play a very important role in positioning a projected document. In addition, the relevance between a document and the related reference points also helps determine the position of
the document in the visual space. When these two factors are considered together, the ultimate position of the document is determined. Suppose a document Ds is related to two reference points R1 and R2, whose positions in the two-dimensional visual space are P(R1) = V1(x1, y1) and P(R2) = V2(x2, y2), respectively, and let the position of Ds be P(Ds) = Vs(xs, ys). The similarities between document Ds and the two reference points R1 and R2 are expressed as DRPVs(r1, r2), and we assume that both r1 and r2 are available from either Eq. (3.1) or Eq. (3.2). Then document Ds is located on the segment between R1 and R2 in the visual space. The distances between R1 and R2, R1 and Ds, and R2 and Ds are d1, d2, and d3, respectively. The exact location of Ds depends on its similarities to the two reference points, DRPVs(r1, r2): the more similar it is to a reference point, the closer it is located to that reference point, and vice versa (see Fig. 3.2). Let us discuss the calculation of the document projection position Vs(xs, ys) in the visual space. It is clear that d1, d2, and d3 maintain the relationships in Eqs. (3.4) and (3.5).

d_1 = \left((x_2 - x_1)^2 + (y_2 - y_1)^2\right)^{1/2}    (3.4)

d_1 = d_2 + d_3    (3.5)
In Fig. 3.2, V3(x1, ys) and V4(x1, y2) are two auxiliary points. The right triangle \triangle V_1V_sV_3 is similar to the right triangle \triangle V_1V_2V_4 because they share the angle at V1, and both angles V1V3Vs and V1V4V2 are right angles. Therefore, we have the following equations.

\frac{d_2}{d_1} = \frac{x_1 - x_s}{x_1 - x_2}    (3.6)

\frac{d_2}{d_1} = \frac{y_1 - y_s}{y_1 - y_2}    (3.7)
For DRPVs(r1, r2), r1 and r2 are the similarities between document Ds and R1, and between Ds and R2, respectively. In the visual space, the larger the similarity value between a document and a reference point, the smaller the distance between them should be. This suggests that the relationship between a similarity r and its corresponding distance d in the visual space should be inverse. The inverse relationships are described in Eqs. (3.8) and (3.9).

\frac{d_2}{d_1} = \frac{d_2}{d_2 + d_3} = \frac{(r_1 + r_2) - r_1}{r_1 + r_2} = \frac{r_2}{r_1 + r_2}    (3.8)
Fig. 3.2. Display of a projected document and two related reference points

\frac{d_3}{d_1} = \frac{d_3}{d_2 + d_3} = \frac{(r_1 + r_2) - r_2}{r_1 + r_2} = \frac{r_1}{r_1 + r_2}    (3.9)
Based upon Eqs. (3.6) and (3.8), we have:

x_s = \left(\frac{r_2}{r_1 + r_2}\right) \times (x_2 - x_1) + x_1    (3.10)

x_s = \frac{r_1 x_1 + r_2 x_2}{r_1 + r_2}    (3.11)
Similarly, based on Eqs. (3.7) and (3.8), we have:

y_s = \frac{r_1 y_1 + r_2 y_2}{r_1 + r_2}    (3.12)

When r1 (r2) is equal to 0, it implies that the document Ds is irrelevant to the reference point R1 (R2). Both Eqs. (3.11) and (3.12) tell us that the position of the document Ds would then be the same as that of R2 (R1); in other words, it overlaps with reference point R2 (R1) in the visual space. And the smaller r1 (r2) is, the farther away the document Ds is from R1 (R2). In the above equations, r2/(r1 + r2) or r1/(r1 + r2) is called the partition coefficient of document Ds with respect to the reference points R1 and R2. It is the ratio of the similarity between the document and one reference point to the sum of the similarities between the document and the two involved reference points. It underlies where
the document is located on the segment defined by the two reference points in the visual space. Now let us discuss the scenario where multiple reference points are involved. Basically, the VIBE algorithm requires three or more reference points to support document projection (Olsen et al., 1993 b). According to the original algorithm description (Olsen et al., 1993 a), if a document is related or similar to multiple reference points and the similarities between the document and the related reference points are available, then the first two related reference points from the multiple reference point set are selected, and the document's position on the segment defined by these two reference points is calculated by Eqs. (3.11) and (3.12). The new position of the document on the segment serves as an intermediate reference point for further consideration, and the relevance of this newly generated intermediate reference point to the document is the sum of the similarities between the document and the two newly merged reference points. The next position of the document (the next intermediate reference point) is determined by the intermediate reference point and another unprocessed reference point from the multiple reference point set. This process stops when all related reference points from the multiple reference point set have been considered and processed. The final intermediate reference point is the ultimate position of the projected document with respect to the multiple reference point set in the visual space. It can be proved that the sequence in which reference points are taken from the multiple reference point set does not affect the final result, that is, the ultimate location of the document in the visual space. For simplicity, consider three reference points R1, R2, and R3 whose positions are P(R1) = V1(x1, y1), P(R2) = V2(x2, y2), and P(R3) = V3(x3, y3), respectively, and whose similarities to the document Ds are given in DRPVs(r1, r2, r3). If the reference points R1 and R2 are considered first, then the position of the newly generated intermediate reference point R4(x4, y4) is given by Eqs. (3.13) and (3.14), and the similarity value of the intermediate reference point is equal to (r1 + r2).
x_4 = \frac{r_1 x_1 + r_2 x_2}{r_1 + r_2}    (3.13)

y_4 = \frac{r_1 y_1 + r_2 y_2}{r_1 + r_2}    (3.14)
Take the intermediate reference point R4 and the reference point R3 to form another intermediate reference point R5(x5, y5).
x_5 = \frac{r_3 x_3 + \left(\frac{r_1 x_1 + r_2 x_2}{r_1 + r_2}\right) \times (r_1 + r_2)}{(r_1 + r_2) + r_3}    (3.15)
x_5 = \frac{r_3 x_3 + r_1 x_1 + r_2 x_2}{r_1 + r_2 + r_3}    (3.16)
Similarly the Y-axis value of the reference point R5 can be calculated and the result is displayed in Eq. (3.17). r3 y 3 r1 y1 r2 y 2 r1 r2 r3
y5
(3.17)
Using the same approach, merge R3 and R2 first and then merge R1; we would get the same results as Eqs. (3.16) and (3.17). In general, the position Vs(xs, ys) of a document Ds in a two-dimensional visual space can be computed from Eqs. (3.18) and (3.19) in terms of multiple reference points. The position of a reference point Rj (j=1, …, q) is denoted Vj(xj, yj), and the similarities between the document Ds and the reference points Rj (j=1, …, q) are DRPVs(r1, …, rq). In fact, both Eqs. (3.18) and (3.19) show that the sequence of reference point considerations has nothing to do with the ultimate position of a projected document in the visual space.

x_s = \frac{\sum_{i=1}^{q} r_i x_i}{\sum_{i=1}^{q} r_i}    (3.18)

y_s = \frac{\sum_{i=1}^{q} r_i y_i}{\sum_{i=1}^{q} r_i}    (3.19)
Assume that there are three reference points R1, R2, and R3. Their initial positions in the visual space are V1(5, 5), V2(30, 5), and V3(15, 25), respectively. The position of a document Ds is Vs(xs, ys). Assume the similarities between this document and the three reference points are shown in Eq. (3.20).

DRPV_s(r_1, r_2, r_3) = (0.2, 0.4, 0.6)    (3.20)
And therefore the coordinates of the document Ds are:

x_s = \frac{r_1 x_1 + r_2 x_2 + r_3 x_3}{r_1 + r_2 + r_3} = \frac{0.2 \times 5 + 0.4 \times 30 + 0.6 \times 15}{0.2 + 0.4 + 0.6} = 18.3    (3.21)

y_s = \frac{r_1 y_1 + r_2 y_2 + r_3 y_3}{r_1 + r_2 + r_3} = \frac{0.2 \times 5 + 0.4 \times 5 + 0.6 \times 25}{0.2 + 0.4 + 0.6} = 15    (3.22)
As the reference point R3 moves from its original position to a new position V'3(30, 25), the document Ds is pulled to a new position D's by R'3. And the coordinates of the document D's are:

x'_s = \frac{r_1 x_1 + r_2 x_2 + r_3 x'_3}{r_1 + r_2 + r_3} = \frac{0.2 \times 5 + 0.4 \times 30 + 0.6 \times 30}{0.2 + 0.4 + 0.6} = 25.8    (3.23)

y'_s = \frac{r_1 y_1 + r_2 y_2 + r_3 y'_3}{r_1 + r_2 + r_3} = \frac{0.2 \times 5 + 0.4 \times 5 + 0.6 \times 25}{0.2 + 0.4 + 0.6} = 15    (3.24)
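The worked example can be checked with a few lines of code. The following Python sketch is not from the book; it simply applies the weighted-average projection of Eqs. (3.18) and (3.19) to the numbers in Eqs. (3.20)-(3.24).

    # Minimal sketch (assumed): reproducing the worked example above.
    def project(positions, drpv):
        """Weighted average of reference point positions; weights are the similarities."""
        total = sum(drpv)
        x = sum(r * px for r, (px, _) in zip(drpv, positions)) / total
        y = sum(r * py for r, (_, py) in zip(drpv, positions)) / total
        return x, y

    refs = [(5, 5), (30, 5), (15, 25)]        # V1, V2, V3
    drpv = (0.2, 0.4, 0.6)                    # Eq. (3.20)
    print(project(refs, drpv))                # about (18.3, 15), as in Eqs. (3.21)-(3.22)

    refs_moved = [(5, 5), (30, 5), (30, 25)]  # R3 moved to V'3(30, 25)
    print(project(refs_moved, drpv))          # about (25.8, 15), as in Eqs. (3.23)-(3.24)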
Fig. 3.3 gives both the position of the document before reference point R3 moves and the position of the document after R3 moves to R'3.
The following algorithm is adapted from the original VIBE algorithm (Olsen et al., 1993 b).

    L1   Begin
    L2     Define multiple reference points Rj (j=1, …, p);
    L3     Get RP positions in the visual space (Vj(xj, yj), j=1, …, p);
    L4     Compute the vector DRPV(r1, …, rp) for each document
    L5       based on a selected similarity measure;
    L6     Discard documents that are irrelevant to
    L7       the reference points;
    L8     While documents with unprocessed DRPVs are still available
    L9       Get a document D and its DRPV(r1, …, rp);
    L10      If only one ri in DRPV(r1, …, rp) (1 ≤ i ≤ p) ≠ 0
    L11      Then
    L12        Assign the corresponding reference point as the final
    L13        position of D in the visual space;
    L14      Else
    L15        Select Ri whose ri is not equal to 0 from
    L16        DRPV(r1, …, rp) (1 ≤ i ≤ p);
    L17        While an unprocessed Rj whose rj is not equal to 0
    L18        in DRPV(r1, …, rp) is available
    L19          Merge Ri and Rj to form a new intermediate RP,
    L20          and assign it to Ri;
    L21        EndWhile;
    L22        Assign the position of the last intermediate RP Ri
    L23        as the final position of the document D;
    L24      ElseEnd;
    L25      EndIf;
    L26    EndWhile;
    L27  End.
Fig. 3.3. Impact of a moving reference point on a document
L2 to L7 initialize variables before processing. According to this algorithm, documents which are not relevant to any of the defined reference points are discarded and not considered in the visual space (See L6 to L7). If a document is relevant to only one reference point, its position overlaps with that of the reference point in the visual space (See L10 to L13). If a document is relevant to multiple reference points, a series of intermediate reference points is generated: a related reference point in conjunction with the current intermediate reference point forms a new intermediate reference point, and the merging process continues until all relevant reference points are processed. The position of the final intermediate reference point is the position of the projected document in the visual space (See L14 to L24).
Notice that if users change the position of any reference point in the visual space, add a new reference point, remove a reference point, or revise the weights of the involved reference points, the projected documents are reconfigured in the visual space. Each of these actions triggers the algorithm. Because the algorithm calculates a series of intermediate reference points, it is a multiple-step generation algorithm. However, Eqs. (3.18) and (3.19) suggest that the final position of a document with respect to multiple reference points can be computed in a single step, which would make the algorithm more efficient.
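The equivalence of the two strategies is easy to demonstrate. The Python sketch below is an illustration only (not the author's implementation): it runs the pairwise merging described in the algorithm and the single-step form of Eqs. (3.18)-(3.19) on the same data and confirms they agree.

    # Sketch (assumed): iterative reference point merging vs. the one-step formula.
    from math import isclose

    def project_iterative(positions, drpv):
        merged_pos, merged_sim = None, 0.0
        for (x, y), r in zip(positions, drpv):
            if r == 0:
                continue                      # irrelevant reference points are skipped
            if merged_pos is None:
                merged_pos, merged_sim = (x, y), r
            else:
                s = merged_sim + r            # similarity of the new intermediate RP
                merged_pos = ((merged_sim * merged_pos[0] + r * x) / s,
                              (merged_sim * merged_pos[1] + r * y) / s)
                merged_sim = s
        return merged_pos                     # final intermediate RP = document position

    def project_one_step(positions, drpv):
        total = sum(drpv)
        return (sum(r * x for (x, _), r in zip(positions, drpv)) / total,
                sum(r * y for (_, y), r in zip(positions, drpv)) / total)

    refs = [(5, 5), (30, 5), (15, 25)]
    drpv = (0.2, 0.4, 0.6)
    a, b = project_iterative(refs, drpv), project_one_step(refs, drpv)
    assert all(isclose(p, q) for p, q in zip(a, b))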
3.3.2 Discussions about the model
Suppose that there are N reference points (N ≥ 3) in a visual space. These N reference points can form a convex polygon in which all reference points are covered.
The number of polygon vertices is M, where M is always smaller than or equal to N. If some of the reference points are located inside the polygon, then M is smaller than N; otherwise, M is equal to N. The polygon should be constructed in such a way that the angle formed by any vertex of the polygon and its two adjacent vertices covers the entire polygon. If this condition is met for all vertices, the polygon defines a valid display area for projected documents. In other words, all documents are projected within this special convex polygon generated by the multiple reference points. The reference points can be classified into two categories: one includes those reference points which lie inside the polygon, and the other includes those which form the polygon as its vertices.
The above analysis shows that the sequence in which related reference points are taken into consideration to project a document does not affect the ultimate location of the document in the visual space. We first consider those reference points which are vertices of the polygon. Start with a vertex and one of its adjacent reference points in the polygon to calculate the first intermediate reference point with respect to a document. According to the previous analysis, the first intermediate reference point is located on the side connected by the two selected reference points, so the document is not projected outside the polygon. Use the first intermediate reference point and another adjacent reference point (vertex) to generate the second intermediate reference point. Since the second intermediate reference point is situated on the segment between the first intermediate reference point and the newly selected adjacent vertex, it lies within the triangle formed by the three reference points already selected. Thus, the document is definitely within the polygon. Then consider the second intermediate reference point and the third adjacent vertex to produce the third intermediate reference point. Because the second intermediate reference point is within the polygon, the segment formed by it and the third adjacent vertex is also inside the polygon. This process continues until all vertices of the polygon are considered. The analysis shows that after the reference points in the second category are processed, the last intermediate reference point is still within the polygon. Notice that all reference points in the first category are within the polygon. This implies that the segments formed by any of these reference points and an internal intermediate reference point are also situated within the polygon, so all further intermediate reference points remain within the polygon. According to the previous discussion, the location of the final intermediate reference point is the final position of the projected document in the visual space, which, of course, is within the polygon. It is important that all reference points are included in the polygon and that the polygon is a convex polygon which meets the condition discussed before. These two conditions ensure that all projected documents are mapped within the polygon.
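The containment property also follows directly from Eqs. (3.18) and (3.19): the weights ri / Σri are non-negative and sum to one, so a projected document is a convex combination of the reference point positions. The small Python sketch below (illustrative only; the polygon and weights are made up) verifies this with a simple point-in-convex-polygon test.

    # Sketch (assumed): a projected document never leaves the convex polygon of its
    # relevant reference points, because Eqs. (3.18)-(3.19) form a convex combination.
    def inside_convex_polygon(pt, vertices):
        """Vertices listed counter-clockwise; True if pt is inside or on the boundary."""
        px, py = pt
        for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
            cross = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
            if cross < 0:            # pt lies to the right of an edge -> outside
                return False
        return True

    polygon = [(0, 0), (10, 0), (12, 8), (5, 12), (-2, 7)]   # hypothetical reference points (CCW)
    drpv = [0.1, 0.5, 0.2, 0.7, 0.3]
    total = sum(drpv)
    doc = (sum(r * x for r, (x, _) in zip(drpv, polygon)) / total,
           sum(r * y for r, (_, y) in zip(drpv, polygon)) / total)
    print(doc, inside_convex_polygon(doc, polygon))          # always True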
Fig. 3.4. Display of a polygon of a valid display area
For example, there are seven reference points and they are put in the visual space (Rj (j=0, …, 6)). Ds is a document. {R0} is within the first category, and {R1, R2, R3, R4, R5, R6} fall in the second category which forms a polygon (See Fig. 3.4). Segments with broken lines are produced by intermediate reference points and adjacent vertices or reference points in the second category. Solid lines are the sides of the polygon. R8 is the first intermediate reference point produced from both reference points R1 and R2. R9 is the second intermediate reference point produced from the first intermediate reference point R8 and an adjacent vertex R3,…. R12 is the intermediate reference point produced from the intermediate reference point R11 and an adjacent vertex R6. The final intermediate reference point or projected document Ds is determined by an internal reference point R0 and the intermediate reference point R12 (See the small circle in Fig. 3.4). After all reference points are positioned in the visual space, some or all of them can actually define a polygon onto which all documents are mapped. Documents within the polygon may be affected by one reference point or many reference points. If a document is only relevant to a certain reference point, it is easy to identify it in the visual space because it is overlapped with the related reference point. However, when a document is relevant to multiple reference points, it is difficult to specify which reference points are related to the document because its position in the visual space is a combination of the impacts of all relevant reference points. To solve this problem, we introduce a concept, reference point monopoly triangle (Korfhage, 1991). A reference point monopoly triangle is a special display area in the visual space that is defined by a vertex of a polygon (a reference
point) and its two adjacent reference points on the polygon. In the special display triangle area, all projected documents are related to the reference point.
As we know, a group of reference points can define a convex display polygon and all documents are mapped within the polygon. When a new reference point is added outside the polygon in the visual space, this new reference point forms a new vertex of the polygon. This newly added reference point and its two neighboring reference points constitute a triangle area. After the reference point is added outside the polygon, related documents are reconfigured due to its attraction. A document within the old polygon that is related to the newly added reference point moves towards the new reference point along the line formed by the newly added reference point and the document. The extent to which the document moves towards the reference point depends upon its similarity to the reference point. This suggests that the related documents may end up within the triangle area if they are strongly attracted by the reference point. In other words, all documents within the triangle are related to the reference point; however, documents which are not within the triangle area may also be related to the reference point. If documents are not related to the reference point, or are not very similar to it, they stay in their original positions or are not “pulled” into the triangle area.
Users can take advantage of these triangle characteristics to manipulate a reference point, create a reference point monopoly triangle, and identify the documents related to a particular reference point. When creating a reference point monopoly triangle area, make sure that the angle formed by the special reference point and the two neighboring vertices of the polygon is large enough to cover the entire polygon. If this condition is not satisfied, related documents may be “pulled” out of the monopoly triangle area: when a document is located in a part of the polygon not covered by the angle and is related to the reference point, it may end up outside the monopoly triangle area.

Fig. 3.5. Explanation of reference point monopoly triangle
For instance, there are six reference points {R1, R2, R3, R4, R5, R6} and they define a polygon (See Fig. 3.5). R8 and R9 are two newly added reference points outside the original polygon. They generate two reference point monopoly triangle areas (See the two shaded parts in Fig. 3.5). The triangle formed by reference point R8 is correctly defined because the angle covers entire polygon area. However, the triangle formed by reference point R9 is not properly defined because some areas of the polygon are not within the angle area (See the angle formed by two broken lines in Fig. 3.5). Assume that a document D is situated within the polygon but outside the angle area and is related to the reference point R9. Pulled by R9, D moves toward R9 after R9 is added to the visual space. Obviously, it may end up outside the reference point monopoly triangle area (See the icon D’ in Fig. 3.5). It is worthy to point out that documents situated within a reference point monopoly triangle are definitely related to the reference point and the documents located outside the monopoly triangle area may also be related to the reference point. They are not pulled into the triangle area because the impact of the reference point on these documents is not strong enough compared to other related reference points in the visual space. This occurs often in the visual space. A new question is raised: how can we identify all of the related documents to a specific reference point in the visual space? We must figure out a way to squeeze the polygon and force these documents out of the polygon. A simple way to do this is to put all of the reference points except a selected reference point along a line and position the selected reference point above the line; then all documents related to the selected reference point would be singled out (See Fig. 3.6). In Fig. 3.6, the polygon defined by reference points (R1, …, Ri, …, Rn) becomes the reference point monopoly triangle. Therefore, all documents related to a special reference point Rselected would be displayed within the triangle. Generally speaking, it is true that the more similar to a reference point a document is, the closer it is to the reference point in the visual space. However, when a document is compared to another document with respect to the same reference point, we cannot simply use the distances from the two documents to the reference point to judge which one is more relevant to the reference point. That is, if a document (D1) is closer to a reference point than another document (D2) in the visual space, it does not conclude that document (D1) is more relevant to the reference point than document (D2). This may confuse people a little bit. The reason for this phenomenon is that the similarity between a document and a reference point is not the only factor which affects its position in the visual space. Position of a document in the visual space is a collective effort of multiple factors. These factors range from similarity between the document and the special reference point, and the position of the reference point, to similarities between the document and other related reference points, and positions of other related reference points. Each of these factors plays a role in determining the document’s position.
Fig. 3.6. Display of all related documents to a special reference point
Therefore, the position of a document is dynamic and relative against the positions of related reference points. It is difficult to judge the similarities of two documents to a special reference point depending upon their distances to the special reference point in the visual space. On the other hand, comparing two reference points with respect to the same document is relatively easy. Users can neutralize all reference points except the two compared reference points by putting them on a line similar to the one in Fig. 3.6. The user may then put one of the compared reference points above the line and the other below the line. Distances between the two reference points to the line are the roughly same. All related documents would then be spread out around the line to their related reference points. In this way, the impacts of the two reference points on the related documents are shown in the contexts of the other reference points. If a document has an identical similarity to the two reference points, the attractions from the two opposite directions cancel each other because the reference points are located in opposite positions against the line in the visual space. This would cause the document to stay on the line like the other unrelated documents. This problem can be solved by moving any of the two reference points to the line. As a result, the documents would leave the line and move towards the other reference point. The VIBE algorithm was used for a full-text environment (Korfhage and Olsen, 1994). In this case, the visual environment was limited to a single full-text based document instead of a group of documents. Displayed objects were replaced by meaningful semantic logic units within a document. The semantic logic units may be defined as sentences, paragraphs, sections, or chapters within that document. Segmentation of a full text should depend upon the nature and length of the full text. A semantic unit is indexed by keywords extracted from that semantic
unit. Keyword weights can be assigned by simply using a term's raw occurrence count in that semantic unit. The weight of a term can also be calculated by a more sophisticated approach; for instance, we can apply TF×IDF (Term Frequency × Inverse Document Frequency) to measure the weight of a term.

w_i = f_i \times \log\left(\frac{N}{d_i}\right)    (3.25)
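A small Python sketch of this weighting applied to semantic units follows; it is illustrative only (the tokenized paragraphs are placeholders), with the symbols of Eq. (3.25) spelled out in the comments.

    # Sketch (assumed): Eq. (3.25) over semantic units.
    # w_i = f_i * log(N / d_i), where f_i is the term frequency in a unit,
    # N the number of units, and d_i the number of units containing the term.
    import math
    from collections import Counter

    def weight_units(units):
        """units: list of token lists, one per semantic unit; returns one weight dict per unit."""
        N = len(units)
        doc_freq = Counter()
        for tokens in units:
            doc_freq.update(set(tokens))
        weights = []
        for tokens in units:
            tf = Counter(tokens)
            weights.append({t: f * math.log(N / doc_freq[t]) for t, f in tf.items()})
        return weights

    paragraphs = [["visual", "retrieval", "model"],
                  ["retrieval", "contour", "model", "model"],
                  ["rotation", "reference", "point"]]
    for w in weight_units(paragraphs):
        print(w)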
In this equation, wi is the weight of the term i, fi is the frequency of the term i in a semantic unit such as a paragraph, N is the number of all semantic units segmented from the full text, and di is the number of semantic units that contain the term i.
Each of the semantic units is assigned a sequential number based upon its appearance sequence in the full text. When all semantic units are visualized in the visual space, two adjacent semantic units are connected by a sequence link. Therefore, another dimension is added to the visual space to facilitate analyzing the relationships of semantic units in a full-text analysis, in addition to the existing visualization features provided in VIBE.
The original two-dimension-based VIBE model was also adapted to support three-dimension-based visual environments for greater interaction, in two distinctive approaches. The first approach added a third dimension, the overall significance of a document, to the two-dimensional VIBE space (Benford et al., 1995; Benford et al., 1997). This new dimension was used to demonstrate the relevance between a document and all reference points. The overall significance between a document Ds and all involved reference points Rj (j=1, …, q) was defined as the sum of its similarities to all reference points and used as the Z-axis value of the document in the visual space:

z_s = \sum_{i=1}^{q} r_i    (3.26)
In this equation, the similarities between the document Ds and the reference points Rj (j=1, …, q) are described in DRPVs(r1, …, rq). The definitions of the X-axis and Y-axis values for the document Ds are still the same as those in Eqs. (3.18) and (3.19). Similarly, the Z-axis value of a reference point in the three-dimensional space is defined as the sum of the similarities between this reference point Rj and all related documents:

z_{R_j} = \sum_{i=1}^{n} r_{iR_j}    (3.27)

In this equation, riRj is the similarity between the document i and the reference point Rj, and n is the number of the involved documents in the visual space. The X-axis and Y-axis values for a reference point Rj are dynamic and determined by users, while the Z-axis value is not.
It is evident that the newly added Z-axis value for a document or reference point is an absolute value and is no longer dynamic and relative like the X-axis and Y-axis values. As long as the involved reference points and documents are determined, the Z-axis values for these documents and reference points are unchanged. The significance of adding a fixed Z-axis value for a document or reference point rests upon the fact that it can alleviate, to some degree, the ambiguity phenomenon of projected documents in the visual space, and therefore it more accurately reveals relationships among the projected documents. For instance, given two reference points R1 and R2 and two documents D1 and D2, suppose the similarities between D1 and R1 and between D1 and R2 are 0.4 and 0.8, respectively, and the similarities between D2 and R1 and between D2 and R2 are 0.1 and 0.2, respectively. Without the third overall significance dimension, these two documents are projected onto the same location between the two reference points, because the position of a document is determined by relative attraction, that is, by the ratios of its similarities to the reference points rather than by their absolute values. Users simply cannot tell to what extent each document is related to the reference points if this ambiguity phenomenon occurs. With the third overall significance dimension, the two overlapped documents are separated along the third axis because they have different overall significance values, which distinguishes them in the visual space.
The second three-dimensional model was LyberWorld (Hemmje et al., 1994). Unlike the first model, the Z-axis value of a document Ds in this model (See Eq. (3.28)) is defined similarly to the X-axis and Y-axis values (See Eqs. (3.18) and (3.19)).
z_s = \frac{\sum_{i=1}^{q} r_i z_i}{\sum_{i=1}^{q} r_i}    (3.28)
All documents are located on the boundary of the so-called relevance sphere. The indexing terms of a document are always situated within the sphere. In this case, the indexing terms replace the reference points in the visual space. Users may manipulate the indexing terms like reference points, rotate the relevance sphere, and zoom in and out of the space at will.
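The disambiguating effect of the overall-significance axis can be checked numerically. The Python sketch below is not from the book; it reuses the D1/D2 example above with two hypothetical reference point positions and shows that both documents receive the same (x, y) under Eqs. (3.18)-(3.19) but different z values under Eq. (3.26).

    # Sketch (assumed): same (x, y), different overall significance z.
    def xy(drpv, refs):
        s = sum(drpv)
        return (sum(r * x for r, (x, _) in zip(drpv, refs)) / s,
                sum(r * y for r, (_, y) in zip(drpv, refs)) / s)

    refs = [(0.0, 0.0), (10.0, 0.0)]      # R1, R2 (hypothetical positions)
    d1, d2 = (0.4, 0.8), (0.1, 0.2)       # DRPVs from the example
    print(xy(d1, refs), sum(d1))          # (6.67, 0.0), z = 1.2
    print(xy(d2, refs), sum(d2))          # (6.67, 0.0), z = 0.3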
3.4 Model for automatic reference point rotation
In this section, a new model for multiple reference points is introduced (Zhang and Nguyen, 2005). The model for automatic reference point rotation is, to some extent, a similarity ratio based model. Its uniqueness is that it adds a new feature, automatic reference point rotation, to the two-dimensional visual space. It enables users to observe the relevance between a rotating reference point and its related documents in the visual space.
3.4.1 Definition of the visual space
The visual space is two dimensional and is built on a polar coordinate system, that is, a plane with a point O (the pole) and a ray from O (the polar axis). Each point P in the plane is assigned polar coordinates: the directed distance from O to P and the directed angle whose initial side is on the polar axis and whose terminal side is on the line PO. Notice that the polar coordinate system can easily be converted to the Cartesian coordinate system. The valid display area is a sphere in the space. All reference points are positioned on the sphere. The reference points Rj (j=1, …, q) are evenly distributed on the sphere as their default positions. Once a reference point is activated by users, it automatically rotates counterclockwise around the sphere. In other words, the boundary of the sphere is the orbit of a moving reference point. The radius of the sphere is MR. The center of the sphere (O) is the focus point. The focus point can be given different meanings in different contexts; this issue will be discussed later. All of the documents are scattered within the visual space. The position of a document Di is denoted DPi(li, αi). Here li and αi are the projection distance (directed distance) and the projection angle (directed angle) of the document, respectively. These two parameters determine its position in the visual space.

l_i = MR \times (1 - S_{oi})    (3.29)
S_{oi} = \left(\frac{c}{\left(\sum_{t=1}^{n} a_t^2\right)^{1/2}}\right) \times \left(\frac{\sum_{t=1}^{n} a_t \times x_{ti}}{\left(\sum_{t=1}^{n} x_{ti}^2\right)^{1/2}}\right)    (3.30)
In this equation, O(a1, a2, …, an) is the focus point vector, Di(xi1, xi2, …, xin) is the document vector, both aj and xij are keyword weights (0 ≤ j ≤ n), and n is the dimensionality of the vector space. It is clear that li basically reflects the relationship between the document Di and the focus point O. Eq. (3.30) calculates the similarity between Di and the focus point O, and c is a control constant. The characteristics of this similarity measure were discussed in Chap. 2. In Eq. (3.29) the factor (1 − Soi) is used to convert the similarity so that the more relevant the document Di is to the focus point O, the closer they are in the visual space, and vice versa. Notice that the valid value of Soi is between 0 and 1. MR defines the radius of the sphere; by changing MR, users may zoom in and out of the visual sphere at will.
The projection angle of the document Di is defined in Eq. (3.31). βj is the angle formed by the reference point Rj and the polar axis against the origin of the visual space (the focus point), and Sjk is the similarity between the reference point Rj and the document Dk. If a document Di is irrelevant to all of the reference points (\sum_{k=1}^{q} S_{ki} = 0), then the angle αi is defined as zero to avoid a meaningless value.
The equation shows that if a document is irrelevant to all reference points, it is situated on the polar axis, and its distance to the origin is determined solely by its similarity to the focus point. The equation also demonstrates that the projection angle of a document is dominated by a similarity ratio rather than by an absolute sum of the angles multiplied by similarities. This characteristic underlies the reference point rotation feature. The display sphere, the reference point rotation direction, the projection distance, and the projection angle are shown in Fig. 3.7. In the figure, R1, R2, R3, R4, and R5 are five reference points.
\alpha_i = \frac{\sum_{j=1}^{q} (\beta_j \times S_{ji})}{\sum_{j=1}^{q} S_{ji}}, \quad \text{if } \sum_{k=1}^{q} S_{ki} \neq 0
\alpha_i = 0, \quad \text{if } \sum_{k=1}^{q} S_{ki} = 0    (3.31)
Fig. 3.7. Display of the WebStar visual space. Source: Zhang and Nguyen (2005)
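A minimal Python sketch of the static WebStar placement follows. It is an assumption, not the published WebStar code: the similarity values are taken as given (any measure from Chap. 2 could supply them), Eq. (3.29) yields the radius, and Eq. (3.31) yields the angle.

    # Sketch (assumed): polar coordinates of a document in the WebStar space.
    import math

    def webstar_position(MR, s_focus, betas, sims):
        """MR: sphere radius; s_focus: similarity to the focus point O;
        betas: reference point angles (radians); sims: similarities S_ji to each reference point."""
        l_i = MR * (1.0 - s_focus)                                  # Eq. (3.29)
        total = sum(sims)
        alpha_i = 0.0 if total == 0 else sum(b * s for b, s in zip(betas, sims)) / total  # Eq. (3.31)
        return l_i, alpha_i

    betas = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]   # four evenly spaced reference points
    print(webstar_position(MR=100.0, s_focus=0.35, betas=betas, sims=[0.0, 0.6, 0.2, 0.0]))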
3.4.2 Rotation of a reference point
Eq. (3.31) describes the projection angle of a document against static reference points. However, when a selected reference point is activated to rotate around the sphere, related documents no longer stay static. Attracted by the moving reference point, the related documents also rotate around the orbit whose radius is defined by Eq. (3.29). The speed of a related document primarily depends upon its relevance to the moving reference point, and it can be derived from Eq. (3.32). In this equation, Rr is the activated reference point, βr is its initial angle in the visual space, θ is the rotation speed of the activated reference point Rr, t is a time variable, Sri is the similarity between Rr and Di, and α'i is the projection angle of the related document Di at time t.

\alpha'_i = \frac{\sum_{j=1, j \neq r}^{q} (\beta_j \times S_{ji}) + (\theta \times t + \beta_r) \times S_{ri}}{\sum_{j=1}^{q} S_{ji}}    (3.32)
From Eq. (3.32) we have the speed of any related document:

\frac{d\alpha'_i}{dt} = \frac{\theta \times S_{ri}}{\sum_{j=1}^{q} S_{ji}}    (3.33)
Eq. (3.33) suggests that if a document is not relevant to the rotating reference point, it stays unchanged in the visual space. The more relevant the document is to the rotating reference point, the faster the document rotates around the center, and vice versa. The maximum speed of a document is equal to that of the rotating reference point. The similarities between the related document and the other reference points also play a role in the document speed because of the divisor \sum_{k=1}^{q} S_{ki} in Eq. (3.33).
Fig. 3.8 gives a series of WebStar snapshots of a rotating reference point. The four reference points are sport, research, international, and library. Starting from the upper left corner figure and moving clockwise, the first figure is the initial status, the second figure is the status when the reference point international rotates to about 85 degrees, the third figure is the status when the reference point international rotates to 0 degrees, and the fourth figure is the status when the reference point international is activated and rotates to 270 degrees. As the reference point international orbits, all of the related documents rotate along the same direction but at different speeds.

Fig. 3.8. A series of the WebStar snapshots of a rotating reference point

The differences between this model and other similarity ratio based visualization models are that this model requires a focus point located at the origin of the visual space, it supports single reference point projection, only one of the location parameters (the projection angle) is similarity-ratio-based while the other (the projection distance) is not, and the reference point rotation feature utilizes the movement of objects as an indicator of relevance between a rotating reference point and related documents.
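To make the dynamics of Eqs. (3.32) and (3.33) concrete, here is a small Python sketch (assumed, not part of the original WebStar implementation) that computes the angular speed of each document while one reference point rotates; the DRPVs and the speed θ are made-up values.

    # Sketch (assumed): per-document angular speeds under Eq. (3.33).
    def document_speeds(theta, sims_to_rotating, sims_all):
        """sims_to_rotating: S_ri per document; sims_all: full DRPV per document."""
        return [theta * s_ri / sum(drpv) if sum(drpv) > 0 else 0.0
                for s_ri, drpv in zip(sims_to_rotating, sims_all)]

    theta = 0.05                                   # radians per time step (assumed)
    drpvs = [(0.9, 0.1, 0.0), (0.2, 0.2, 0.6), (0.0, 0.5, 0.5)]
    s_r = [drpv[0] for drpv in drpvs]              # the first reference point is rotating
    print(document_speeds(theta, s_r, drpvs))      # the first document rotates fastest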
3.5 Implication of information retrieval
The visualization models for multiple reference points have a very natural and direct relationship to information retrieval. The model for fixed multiple reference points was designed to visualize the results of a Boolean query. The original models for movable multiple reference points were aimed at visualizing the results of a query in a vector-based space. In fact, a query can be regarded as a special reference point in a broader sense. Documents related to these reference points are displayed in the visual space. The VIBE visual space can even integrate a search mechanism in which users submit their queries and matched document icons are highlighted so that they can easily be distinguished from un-retrieved document icons. In the WebStar visual space, users may employ the retrieval contour to retrieve documents. The retrieval contour is a circle which shares its center with the display sphere.
Fig. 3.9. Display of the retrieval contour. Source: Zhang and Nguyen (2005)
The radius of the retrieval contour can be manipulated by users to control the size of the retrieved document set (See Fig. 3.9). As the radius increases, more documents are retrieved, and vice versa. Notice that as the number of retrieved documents increases, the similarities between the newly added documents and the focus point decrease, because they are farther away from it.
The focus point in the model for automatic reference point rotation is extremely important. The visual context changes if the focus point changes. If it is assigned as a website, then the projected documents are outgoing Web pages. If it is assigned as a search query submitted to an information retrieval system, then the mapped documents are returned results. If it is assigned as a regular paper, then the displayed documents are its citations. If it is assigned as the root of a subject hierarchy, then the projected objects are its children nodes. The focus point enables users to narrow down their interests. As long as the focus point is clearly defined, its relationships to the projected objects are illustrated by the distances from the origin to the displayed objects in the visual space. Of course, like other multiple reference point based visualization models, it demonstrates the relationships between reference points and documents in the visual space; that is, in the static status a document is located near the reference points which are relevant to it. In addition, the model reveals which documents are related to a rotating reference point and the extent to which these documents are similar to the moving reference point.
A similarity-ratio-based visualization model may also be used for term discriminative analysis (Dubin, 1995). In the VIBE space, each of the tested keywords is assigned to a reference point, and all of the reference points are scattered onto the
visual space to form a circle. Terms with poor discriminative capacity tend to stay around the center of the circle because they are not affected by any reference points. Terms with good discriminative capacity tend to be spread out in the visual space because a good discriminative term will attract related documents. Based upon a distribution of documents, people can make a correct judgment about whether a term is a discriminative term or not.
3.6 Summary Reference point is an important concept in information retrieval. Multiple reference points were introduced not only to visualize documents but also to solve the inherent problems of a traditional information retrieval system. In this chapter, visualization models based on multiple reference points were classified into three categories: the models for fixed multiple reference points, the models for movable multiple reference points, and the model for automatic reference point rotation. Each of them has its own unique qualities. The models for fixed multiple reference points can be used for both a vector-based information retrieval system and Boolean based information retrieval system. They can handle a complex Boolean query. But the algorithm is not based upon the similarity ratio method, even though it can be applied to a vector-based information retrieval system. Most models for movable multiple reference points are similarity ratio based, except for the VR-VIBE model, whose Z-axis of the visual space is sum of similarities of all related reference points for a document unlike the X-axis and Y-axis. The visualization model for automatic reference point rotation is based upon the similarity ratio in part because only the projection angle of a document, one of the two projection parameters, is calculated based upon similarity ratio of related reference points while projection distance is not. These two parameters of a document dominate its position in the visual space. The power of the similarity ratio based algorithms relies on the manipulative flexibility for reference points in the visual space. The position of any reference points can be controlled and manipulated by users at will. It is the manipulative flexibility that enables users to compare and analyze the impact of two reference points on documents, and identify good/poor discriminative terms. Both the models for fixed multiple reference points and the models for movable multiple reference points require at least three reference points to project documents in their visual spaces, while the model for automatic reference point rotation requires at least one reference point in conjunction with the focus point to construct its visual space. Visualization models for multiple reference points can be two-dimensional or three-dimensional. They can be applied to either Boolean based information systems or vector based information systems. They can be used to visualize Internet hyperlinks, search results from an information retrieval system, a full-text, and term discriminative analysis.
Chapter 4 Euclidean Spatial Characteristic Based Visualization Models
A vector based information retrieval model has many unique advantages over a Boolean based information retrieval model. The similarity between two objects in a vector based model can be calculated and measured in multiple approaches and information retrieval evaluation models are applicable to the database as an effective retrieval mechanism. Furthermore, a vector has a natural and isomorphic relationship to the Euclidean space. The basic Euclidean spatial elements such as point, distance, and angle may have a special connection to information retrieval in the contexts of the vector-based space. For instance, a document or reference point in a vector based space corresponds to a spatial point in the Euclidean space. Euclidean distance between two documents/ reference points can be used as an indicator of their similarity. It is not coincidence that the angle of two objects in the Euclidean space underlies the cosine evolution model. It is the natural connection between Euclidean spatial characteristics and information retrieval that can be utilized to construct visualization environments for users to browse and search information. Motivated by this connection, researchers employed the two distinctive spatial characteristics of distance and angle, and their combination to establish several information retrieval visualization models such as DARE(Zhang and Korfhage, 1999; Zhang, 2000), TOFIR (Zhang, 2001), and GUIDO (Korfhage, 1991; Nuchprayoon and Korfhage, 1994; Nuchprayoon and Korfhage, 1997). These researchers constructed their visual spaces by preserving the Euclidean spatial characteristics of objects in a high dimensional space and mapping them onto low two dimensional spaces. Euclidean distance and angle, angle and angle, and Euclidean distance and distance were employed in these visualization models respectively for the common purpose of revealing document’s relationships and retrieving the documents.
4.1 Euclidean space and its characteristics Euclid of Alexandria, the father of geometry, was an ancient Greek mathematician. The book The Elements, describing all of the rules of geometry, made him one of the most famous and prestigious mathematicians in the world. Without his significant contribution to geometry the world might not be as far in the field of mathematics. The Euclidean space was named after Euclid because it
is the generalization of the low dimensional spaces (two or three dimensions) studied by Euclid. Usually the n-dimensional mathematical space is denoted R^n. The Euclidean space is also called the Cartesian space. Most geometric properties can be perceived in a low dimensional space (two or three dimensions). There are three important geometric concepts underlying the Euclidean space: point, distance, and angle. An n-dimensional vector space is isomorphic to the n-dimensional mathematical space R^n. The reason for working with a vector space instead of R^n is that it is easy to accurately describe a point, a distance, and an angle in a vector space.
A point in an n-dimensional Euclidean space is defined as X(x1, x2, …, xn), where n is a positive integer, the dimensionality of the space. In other words, a vector X(x1, x2, …, xn) corresponds to a point in the Euclidean space. Assume Y(y1, y2, …, yn) is another point in the Euclidean space. The distance between the two points X and Y is defined as:
|XY| = d(X, Y) = \left(\sum_{i=1}^{n} (x_i - y_i)^2\right)^{1/2}    (4.1)
Eq. (4.1) is also called the Euclidean distance of the two points. By using this Euclidean distance, the Euclidean space becomes a metric space. In this distance metric, the following axioms are always satisfied:
- The distance between two points in the Euclidean space is always equal to or larger than 0.
- The distance between two points is equal to 0 if and only if the two points overlap in the Euclidean space.
- The distance from a point X to another point Y is always equal to the distance from the point Y to the point X.
- If Z is a third point, then the distance between two points X and Y is always smaller than or equal to the sum of the distance between X and Z and the distance between Z and Y in the Euclidean space.
That is:

|XY| \geq 0    (4.2)

|XY| = 0 \iff X = Y    (4.3)

|XY| = |YX|    (4.4)

|XY| \leq |XZ| + |ZY|    (4.5)
The angle of the two vectors X and Y or the angle formed by the two points X and Y against the origin is defined as:
\alpha = \cos^{-1}\left(\frac{\sum_{i=1}^{n} x_i \times y_i}{\left(\sum_{i=1}^{n} x_i^2\right)^{1/2} \times \left(\sum_{i=1}^{n} y_i^2\right)^{1/2}}\right)    (4.6)
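These two Euclidean characteristics are used throughout the chapter. The following Python sketch (illustrative only, with made-up vectors) computes them for n-dimensional points exactly as in Eqs. (4.1) and (4.6).

    # Sketch (assumed): Euclidean distance and angle of two n-dimensional vectors.
    import math

    def euclidean_distance(x, y):
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))      # Eq. (4.1)

    def angle(x, y):
        dot = sum(xi * yi for xi, yi in zip(x, y))
        nx = math.sqrt(sum(xi ** 2 for xi in x))
        ny = math.sqrt(sum(yi ** 2 for yi in y))
        return math.acos(max(-1.0, min(1.0, dot / (nx * ny))))             # Eq. (4.6), clamped

    print(euclidean_distance((1, 0, 2), (0, 1, 2)), angle((1, 0, 2), (0, 1, 2)))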
4.2 Introduction to the information retrieval evaluation models In this chapter, the three visualization models introduced are based upon the Euclidean spatial characteristics. That is, they are built upon the vector-based space. These models share features with respect to the dimensionality of the visual spaces, the number of reference points employed, visualization for retrieval mechanisms, and even the characteristics of their visual spaces. The visual spaces for DARE, TOFIR, and GUIDO are two dimensional based upon either distances, or angles, or their combinations. Two reference points are used to construct their visual spaces respectively. The selections and roles of these two reference points may vary in different visual environments. These two reference points can serve two view points for the object visualization projection. One is called the key view point KVP (or major view point) and the other is called the auxiliary view point AVP (or minor view point). These two view points form a reference axis RA. DRP stands for the distance from one reference point to the other. KVP, AVP, DRP, and RA play an important role in terms of projecting documents onto a corresponding visual space. Notice that KVP, AVP, DRP, and RA are defined by users and based upon the two reference points. In other words, a change in the employed reference points can lead to changes in these important parameters. The origin of the vector space can also be defined as a view point in some cases. The most distinguished feature of these three visualization models is their capacity to visualize information retrieval evaluation models. These models are widely used in traditional information retrieval and they usually define invisible and various retrieval contours in the high-dimensional vector document space. The corresponding contours in the high dimensional space can be projected onto the visual spaces for users to manipulate, interact with, and observe. It is not surprising that the shapes of these contours are distorted after they are projected onto the low dimensional visual spaces because of the significant dimensionality reduction. The key point is to find effective mathematical projection conversion approaches (equations) to carry out the projections for these retrieval contours. It is apparent that the complexities of the mathematical conversion approaches vary in different visual environments. Even the possibility of projecting an information retrieval evaluation model varies in different visual environments. For instance, the Cassini oval evaluation model can be visualized in the distance-distance-based GUIDO environment and cannot be visualized in the angle-angle-based TOFIR environment
or the distance-angle-based DARE environment. It is exciting that within the visualization environments, new non-traditional information retrieval evaluation models can be developed.
Suppose R1 and R2 are two reference points. They can be assigned as KVP or AVP, and Di is a document in the n-dimensional vector space.

R_1 = (r_{11}, r_{12}, \ldots, r_{1n})    (4.7)

R_2 = (r_{21}, r_{22}, \ldots, r_{2n})    (4.8)

D_i = (x_{i1}, x_{i2}, \ldots, x_{in})    (4.9)
For the cosine evaluation model, the corresponding equation and display figure are given in Eq. (4.10) and Fig. 4.1. Here α is a retrieval threshold, and it may be controlled by users. In the figure, h is DRP, and d1 and d2 are the distance between the document D2 and the reference point R1 and the distance between D2 and the reference point R2, respectively. The model defines an angle retrieval area within which documents are retrieved. DARE, TOFIR, and GUIDO can visualize this model.

\alpha \geq \cos^{-1}\left(\frac{\sum_{j=1}^{n} x_{ij} \times r_{2j}}{\left(\sum_{j=1}^{n} x_{ij}^2\right)^{1/2} \times \left(\sum_{j=1}^{n} r_{2j}^2\right)^{1/2}}\right)    (4.10)

Fig. 4.1. Display of the cosine model. Source: Zhang and Korfhage (1999)
For the distance evaluation model, the corresponding equation and display figure are given in Eq. (4.11) and Fig. 4.2. Here r is a retrieval threshold, d is the distance between Di and the origin of the vector space, h is the distance between the reference point R1 and the origin, α1 is equal to the angle R1ODi, α2 is equal to the angle R1DiO, α3 is equal to the angle DiR1O, and the lines OD1 and OD2 are tangent to the sphere whose center is R1 and whose radius is r at D1 and D2, respectively. DARE and TOFIR can visualize this model.

r \geq \left(\sum_{j=1}^{n} (x_{ij} - r_{1j})^2\right)^{1/2}    (4.11)
For the ellipse evaluation model, the corresponding equation and display figure are given in Eq. (4.12) and Fig. 4.3. Here c is a retrieval threshold, a is DRP (the distance between the two reference points), α1 is equal to the angle R2R1Di, α2 is equal to the angle R1R2Di, d1 is the distance between R1 and Di, and d2 is the distance between R2 and Di. DARE, TOFIR, and GUIDO can visualize this model.

c \geq \left(\sum_{j=1}^{n} (x_{ij} - r_{1j})^2\right)^{1/2} + \left(\sum_{j=1}^{n} (x_{ij} - r_{2j})^2\right)^{1/2}    (4.12)

Fig. 4.2. Display of the distance based evaluation model. Source: Zhang (2001)
Fig. 4.3. Display of the ellipse evaluation model. Source: Zhang and Korfhage (1999)
The equations for the conjunction and disjunction models are given in Eqs. (4.13) and (4.14), respectively. Here k is a predefined retrieval threshold, θ is equal to the angle R2OR1, α1 is equal to the angle R2R1Di, α2 is equal to the angle R1R2Di, a is DRP, h1 is the distance between the origin of the vector space and the reference point R1, and h2 is the distance between the origin of the vector space and the reference point R2. For the corresponding display figure, see Fig. 4.4. In this figure, the overlapping area of the two spheres is the result of the conjunction model, while the union of the two spheres is the result of the disjunction model. DARE, TOFIR, and GUIDO can visualize this model.

k \geq MIN\left(\left(\sum_{j=1}^{n} (x_{ij} - r_{1j})^2\right)^{1/2}, \left(\sum_{j=1}^{n} (x_{ij} - r_{2j})^2\right)^{1/2}\right)    (4.13)

k \geq MAX\left(\left(\sum_{j=1}^{n} (x_{ij} - r_{1j})^2\right)^{1/2}, \left(\sum_{j=1}^{n} (x_{ij} - r_{2j})^2\right)^{1/2}\right)    (4.14)
For the Cassini oval model, the equation is given in Eq. (4.15). Here c is the retrieval threshold. This model can only be visualized by GUIDO.

c \geq \left(\sum_{j=1}^{n} (x_{ij} - r_{1j})^2\right)^{1/2} \times \left(\sum_{j=1}^{n} (x_{ij} - r_{2j})^2\right)^{1/2}    (4.15)
Fig. 4.4. Display of the conjunction and disjunction models. Source: Zhang and Korfhage (1999)
The cosine model does need the origin and another reference point. In this case the origin is KVP and the other reference point is AVP. The distance model only needs one reference point, which is assigned as KVP. The other models need two reference points which are assigned KVP and AVP respectively. The three different metrics (City block, Euclidean distance, and Dominance distance) discussed in Chap. 2 are applicable to the DARE, TOFIR, and GUIDO visualization environments. Application of the three different metrics can produce three similar document distributions. Although application of any metric cannot change the nature of a document distribution, it does change the display density for a document distribution. Using this feature, users can observe a subtle discrepancy among the document clusters for a dataset.
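Written as retrieval decisions, the evaluation models above are straightforward to implement. The Python sketch below is an illustration only (function names and threshold handling are assumptions): each predicate tests whether a document vector x falls inside the retrieval contour defined by the reference points and a user-set threshold, with the overlap of the two spheres giving the conjunction result and their union the disjunction result, as described for Fig. 4.4.

    # Sketch (assumed): retrieval predicates behind Eqs. (4.10)-(4.15).
    import math

    def dist(x, r):
        return math.sqrt(sum((xi - ri) ** 2 for xi, ri in zip(x, r)))

    def cosine_model(x, r2, alpha):                       # cf. Eq. (4.10)
        dot = sum(xi * ri for xi, ri in zip(x, r2))
        norm = math.sqrt(sum(xi ** 2 for xi in x)) * math.sqrt(sum(ri ** 2 for ri in r2))
        return math.acos(max(-1.0, min(1.0, dot / norm))) <= alpha

    def distance_model(x, r1, r):                         # cf. Eq. (4.11)
        return dist(x, r1) <= r

    def ellipse_model(x, r1, r2, c):                      # cf. Eq. (4.12)
        return dist(x, r1) + dist(x, r2) <= c

    def within_either_sphere(x, r1, r2, k):               # k >= MIN(d1, d2), cf. Eq. (4.13)
        return min(dist(x, r1), dist(x, r2)) <= k

    def within_both_spheres(x, r1, r2, k):                # k >= MAX(d1, d2), cf. Eq. (4.14)
        return max(dist(x, r1), dist(x, r2)) <= k

    def cassini_model(x, r1, r2, c):                      # cf. Eq. (4.15)
        return dist(x, r1) * dist(x, r2) <= c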
4.3 The distance-angle-based visualization model 4.3.1 The visual space definition DARE is a distance-angle-based visualization model. In the vector space, if two reference points KVP (R1) and AVP (R2) are clearly defined, the visual projection distance and visual projection angle, two important parameters for any document Di in the vector space, can be defined in Eqs. (4.16) and (4.17), respectively.
d = \left(\sum_{j=1}^{n} (x_{ij} - r_{1j})^2\right)^{1/2}    (4.16)

\alpha = \cos^{-1}\left(\frac{\sum_{j=1}^{n} (x_{ij} - r_{1j}) \times (r_{2j} - r_{1j})}{\left(\sum_{j=1}^{n} (x_{ij} - r_{1j})^2\right)^{1/2} \times \left(\sum_{j=1}^{n} (r_{2j} - r_{1j})^2\right)^{1/2}}\right)    (4.17)
That is, the visual projection distance is the distance from the document Di to the key view point KVP (R1), and the visual projection angle is the angle formed by the two lines R1Di and R2R1 against KVP in the high dimensional vector space. Notice that these two parameters of a document are always available in the vector space as long as both KVP and AVP are defined (KVP does not overlap with AVP). The visual projection angle and distance can be employed to define the X-axis and the Y-axis of a low dimensional visualization space, respectively. In such a visual space, the document can be uniquely located based upon the two parameters. If all documents in the vector space follow the same projection procedure, then they are successfully projected onto the visual space. This procedure converts documents in an invisible high dimensional space to a perceivable two dimensional space, so the relationships among these documents can be preserved and observed in the visual space.
Since the Y-axis is defined as the distance between a document and KVP, its legitimate value ranges from zero to infinity. The X-axis is defined as the angle DiR1R2. The minimum value and maximum value for the angle DiR1R2 are 0 and 2π, respectively. Here angles are measured in radians. The angle DiR1R2 is symmetric against the reference axis RA formed by R1 and R2. The reference axis and the document generate a plane. On the plane, the angle DiR1R2 can be divided into two areas: the angle located on one side of the reference axis is positive and the angle located on the other side of the reference axis is negative. On each side, the maximum angle is π. Therefore, the valid value of the angle DiR1R2 ranges from −π to π. For display simplicity, the valid value of the angle DiR1R2 can be defined from 0 to π due to the symmetry of a document against the reference axis in the vector space. So the final valid visual display area, the semantic framework, consists of three boundary lines:

Y = 0, \quad 0 \leq X \leq \pi    (4.18)

X = 0    (4.19)

X = \pi    (4.20)
The visual display area defines a zone within which all documents are projected. It is apparent that the zone is a half-infinite plane: one side overlaps with the X-axis, its width is equal to 2π (or π), and it is parallel to the Y-axis, with one side overlapping the Y-axis. KVP is always mapped onto the origin of the visual space because its visual projection distance is 0 and its angle is defined as 0. AVP is mapped onto the Y-axis of the visual space because its visual projection distance is the length between the two reference points R1 and R2 and its visual projection angle is defined as 0.
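A minimal Python sketch of the DARE projection of Eqs. (4.16)-(4.17) follows; it is an assumption rather than the published DARE code, and the document and reference point vectors are placeholders.

    # Sketch (assumed): projecting a document onto the DARE plane.
    import math

    def dare_coordinates(doc, kvp, avp):
        """doc, kvp, avp: n-dimensional vectors; returns (visual angle X, visual distance Y)."""
        d_vec = [xi - ki for xi, ki in zip(doc, kvp)]
        a_vec = [ai - ki for ai, ki in zip(avp, kvp)]
        y = math.sqrt(sum(v ** 2 for v in d_vec))                     # Eq. (4.16)
        na = math.sqrt(sum(v ** 2 for v in a_vec))
        if y == 0 or na == 0:
            return 0.0, y                                             # KVP itself maps to the origin
        cos_val = sum(dv * av for dv, av in zip(d_vec, a_vec)) / (y * na)
        x = math.acos(max(-1.0, min(1.0, cos_val)))                   # Eq. (4.17), clamped
        return x, y

    doc = (0.2, 0.7, 0.1)
    kvp = (0.0, 0.0, 0.0)       # hypothetical reference points
    avp = (1.0, 0.0, 0.0)
    print(dare_coordinates(doc, kvp, avp))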
4.3.2 Visualization for information retrieval evaluation models As we know, documents in the vector space can be scattered onto the visual space as previously described. Similarly, an information retrieval evaluation model can also be visualized in the visual space. The equation and corresponding hypercontour of an information retrieval evaluation model are available and clear. The point is to find out its mathematical conversion equation when it is projected onto a low dimensional visual space. The complexity of such a mathematical conversion equation varies in different visualization models. Some are straightforward and some are very complex. It depends upon the definition of a visual space, the nature of an evaluation model, and the adaptability of an information retrieval evaluation model in the context of the visual environment. The significance of the visualization for these information retrieval evaluation models relies upon that fact that users may manipulate the size and position of a projected information retrieval evaluation contour in the visual space to change the information retrieval area size and focus, in addition to observing how documents are retrieved by the model. This would definitely make the interior information retrieval process transparent to users. The cosine evaluation model The equation and spatial display of the cosine model are illustrated in Eq. (4.10) and Fig. 4.1 respectively. The angle D is the retrieval threshold, the origin of the vector space is KVP while R2 is AVP, D1 is a document situated within the retrieval area defined the angle D , and D2 is any document located on one boundary of the angle D in Fig. 4.1. The document D2 always has a constant visual projection angle regardless of its visual projection distance because it is located on the boundary. It suggests that the boundary line of the retrieval area D2O should be projected onto a vertical line in the visual space.
Fig. 4.5. Display of the projected cosine model in DARE. Source: Zhang and Korfhage (1999)

X = \alpha, \quad \alpha \leq \pi    (4.21)
If D1 is a document within the defined retrieval area, its visual projection angle is always smaller than α. This implies that it will be mapped within the zone defined by the Y-axis and Eq. (4.21). Users may drag the vertical retrieval line (Eq. (4.21)) to any place within the valid display area, decreasing the retrieval area by moving it toward the Y-axis or increasing the retrieval area by moving it away from the Y-axis at will (See Fig. 4.5). R2 is projected onto the Y-axis because its projection angle is always equal to 0.
The distance evaluation model
The equation and spatial display of the distance model are illustrated in Eq. (4.11) and Fig. 4.2, respectively. In this case, the key view point is the origin of the vector space and the auxiliary view point is R1. In the figure, Di is any point on the circle whose center is R1 and whose radius is r. It is evident that the origin, the document Di, and the auxiliary view point R1 form a triangle in which the following equations always hold:

h \times \sin(\alpha_1) = r \times \sin(\alpha_2)    (4.22)

d = h \times \cos(\alpha_1) + r \times \cos(\alpha_2)    (4.23)
In fact, Eq. (4.22) defines the height of the triangle R1ODi, and Eq. (4.23) defines the height of its one side. From the two equations, we may calculate the relationship between d and D1 of a document Di on the circle because they are the visual projection distance and
visual projection angle of the document Di. If their relationship can be calculated, the mathematical conversion equation has been found, which is ultimately used to draw the projected contour of the distance model in the visual space. We also have:

\cos^2(\alpha_2) + \sin^2(\alpha_2) = 1    (4.24)
From Eqs. (4.23) and (4.22):

d^2 - 2d \times h \times \cos(\alpha_1) + h^2 \times \cos^2(\alpha_1) = r^2 \times \cos^2(\alpha_2)    (4.25)

h^2 \times \sin^2(\alpha_1) = r^2 \times \sin^2(\alpha_2)    (4.26)
According to Eqs. (4.24), (4.25), and (4.26):

r^2 = d^2 - 2hd\cos(\alpha_1) + h^2    (4.27)
Finally we have:

d = h \times \cos(\alpha_1) \pm \left(r^2 - h^2 \times \sin^2(\alpha_1)\right)^{1/2}    (4.28)
In Eq. (4.28), which describes the final projected distance model in the visual space, α1 is a variable, while h and r are constants: once the reference point is defined and the retrieval threshold is set, they remain the same. When α1 is equal to 0, d has two solutions (h+r and h−r; see the points V1 and V2 in Fig. 4.6). Because r² − h²·sin²(α1) in Eq. (4.28) must always be equal to or larger than zero, the range of the variable α1 is from 0 to sin⁻¹(r/h). The point where α1 reaches this maximum value is H1, which overlaps with D1 and D2 in Fig. 4.6. For the projected contour, see Fig. 4.6; the resulting figure resembles a bullet. The origin of the vector space is projected onto the origin of the visual space because it is the KVP of the projection. The projected position of the reference point R1 is on the Y-axis and its height is equal to h. The above analysis shows that when the retrieval contour size (r) increases/decreases while its center is fixed (the location of the reference point R1 is fixed), the corresponding projected contour size increases/decreases as well, but its relative position in the visual space remains the same due to the stable position of R1. This suggests that the larger the contour in the vector space, the bigger the projected contour in the visual space, and vice versa. When the contour size in the vector space is fixed and its center moves away from the origin of the vector space, the corresponding contour in the visual space moves up along the Y-axis and shrinks in size.
Fig. 4.6. Display of the projected distance model in the DARE visual space (I). Source: Zhang and Korfhage (1999)
When the contour size in the vector space is fixed and its center moves towards the origin of the vector space, the corresponding contour in the visual space moves down along the Y-axis and swells. This implies that we cannot judge the contour size in the vector space based only upon its projected contour size in the visual space, because the size in the vector space depends not only on the projected contour size but also on its position on the Y-axis. It is evident that the closer the projected contour is to the origin in the visual space, the closer the retrieval contour is to the origin in the vector space, and vice versa.

If the reference point R1 is selected as KVP and the origin of the vector space is selected as AVP, the display scenario for the distance model is totally different from the previous one. Observe that in this case r is the projection distance, which remains constant as α3 (the projection angle) changes (see Fig. 4.2). The projection conversion equation is therefore easy to find because of the constant r (Eq. (4.29)); the display of the projected distance model in the visual space is shown in Fig. 4.7. Clearly, this approach is much simpler than the previous one.

Y = r    (4.29)
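To make the conversion concrete, the following is a minimal Python sketch (the function name and the sample values of h and r are illustrative, not taken from the book) that evaluates both branches of Eq. (4.28) over the valid range 0 ≤ α1 ≤ sin⁻¹(r/h); plotting the returned pairs reproduces the bullet-shaped contour of Fig. 4.6.

```python
import numpy as np

def dare_distance_contour(h, r, num=200):
    """Projected contour of the distance model in DARE (Eq. (4.28)).

    h : distance from the origin (KVP) to the reference point R1
    r : retrieval threshold (radius of the distance model in the vector space)
    Returns alpha1 samples and the two branches of Eq. (4.28).
    """
    if r >= h:
        raise ValueError("Eq. (4.28) assumes the origin lies outside the circle (r < h).")
    alpha_max = np.arcsin(r / h)          # valid range of the projection angle
    alpha1 = np.linspace(0.0, alpha_max, num)
    root = np.sqrt(r**2 - h**2 * np.sin(alpha1)**2)
    d_far = h * np.cos(alpha1) + root     # outer branch (point V1 when alpha1 = 0)
    d_near = h * np.cos(alpha1) - root    # inner branch (point V2 when alpha1 = 0)
    return alpha1, d_near, d_far

# Example: reference point at distance 2 from the origin, retrieval radius 0.5.
a, d_lo, d_hi = dare_distance_contour(h=2.0, r=0.5)
print(d_hi[0], d_lo[0])   # h + r and h - r, the points V1 and V2
```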
The disjunction and conjunction evaluation models

The equations of the conjunction and disjunction models are illustrated in Eq. (4.13) and Eq. (4.14), respectively. For their spatial display, see Fig. 4.4. Since these two models share many commonalities, we discuss them together. Each
Fig. 4.7. Display of the projected distance model in the DARE visual space (II)
of these two models involves two reference points and each reference point defines a retrieval contour similar to the distance model. If one reference point of a retrieval contour is assigned as KVP and the other is assigned as AVP, the first visualized contour will be a horizontal line and the second projected contour will look like a bullet in the visual space. Then the two visualized contours can form
Fig. 4.8. Display of the conjunction and disjunction models in the DARE visual space
results for both the conjunction and disjunction models (see Fig. 4.8). The overlapping part of the two retrieval areas is the result of the conjunction model, while the union of the two retrieval areas is the result of the disjunction model. It is clear that the selection of KVP and AVP makes a difference for the projection. If the projection of both contours used the origin of the vector space as KVP and the two reference points as AVP respectively, the projection conversion equations would become more complicated.

The ellipse evaluation model

The equation and spatial display of the ellipse model are illustrated in Eq. (4.12) and Fig. 4.3, respectively. It requires two reference points. R1 and R2 are assigned as KVP and AVP respectively. Di is any point on the ellipse. In the triangle DiR1R2, the following equations are always satisfied:

c = d_1 + d_2    (4.30)

d_1 \sin(\alpha_1) = d_2 \sin(\alpha_2)    (4.31)

a = d_1 \cos(\alpha_1) + d_2 \cos(\alpha_2)    (4.32)

\cos^2(\alpha_2) + \sin^2(\alpha_2) = 1    (4.33)
Now we use these four equations to derive the functional relationship between the two variables d1 and α1, which are the projection distance and the projection angle of Di, respectively. From Eq. (4.31):

d_1^2 \sin^2(\alpha_1) = d_2^2 \sin^2(\alpha_2)    (4.34)

From Eq. (4.32):

d_2^2 \cos^2(\alpha_2) = a^2 - 2 a d_1 \cos(\alpha_1) + d_1^2 \cos^2(\alpha_1)    (4.35)
From Eqs. (4.33), (4.34), and (4.35):

d_2^2 = a^2 - 2 a d_1 \cos(\alpha_1) + d_1^2    (4.36)
From Eqs. (4.36) and (4.30):

\cos(\alpha_1) = \frac{a^2 - c^2}{2 a d_1} + \frac{c}{a}    (4.37)
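Eq. (4.37) can be rearranged to give d1 directly as a function of α1, which is convenient for tracing the projected contour. The sketch below does exactly that; the rearrangement is simple algebra on Eq. (4.37), and the function name and the sample values of a and c are illustrative only.

```python
import numpy as np

def dare_ellipse_contour(a, c, num=200):
    """Projected ellipse contour in DARE, obtained by solving Eq. (4.37) for d1.

    a : distance between the two reference points R1 and R2 (DRP)
    c : retrieval threshold of the ellipse model (c >= a)
    Returns (alpha1, d1) samples of the projected contour.
    """
    if c < a:
        raise ValueError("The ellipse model requires c >= a.")
    alpha1 = np.linspace(0.0, np.pi, num)
    # Eq. (4.37): cos(alpha1) = (a^2 - c^2)/(2*a*d1) + c/a, solved for d1.
    d1 = (a**2 - c**2) / (2.0 * (a * np.cos(alpha1) - c))
    return alpha1, d1

alpha1, d1 = dare_ellipse_contour(a=1.0, c=1.5)
print(d1[0], d1[-1])   # (c + a)/2 on the Y-axis and (c - a)/2 on the boundary X = pi
```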
Fig. 4.9. Display of the ellipse model in the DARE visual space. Source: Zhang and Korfhage (1999)
Eq. (4.37) defines the projected ellipse contour, or conversion equation, in the visual space. For the display of the projected ellipse model, see Fig. 4.9. It is a curve which intersects the Y-axis at (0, (c+a)/2) and the other boundary line (X = π) at (π, (c−a)/2). The reference point R1 is mapped onto the origin of the visual space and R2 is mapped onto the Y-axis at (0, a). As the retrieval threshold c increases, the curve moves up in the visual space, and vice versa. The DRP (a) determines the curve's shape: the larger the DRP (a), the steeper the curve in the visual space, and vice versa.

Discussion

The iso-extent contour is a concept introduced for a new similarity measure that distinguishes documents with different distance characteristics within the retrieved area of an angle model (Zhang and Rasmussen, 2002). For instance, in Fig. 4.1, IEC1 and IEC2 are two iso-extent contours. This concept can be easily visualized in the visual space and can be used to develop a new information evaluation model. In the vector space R1 is assigned as KVP and R2 as AVP. R2 is located in a particular area defined by the two sides of the angle α and the two arcs IEC1 and IEC2 in the vector space. A projected iso-extent contour is a horizontal line in the visual space because the distance between any point on the contour and the origin is a constant. This area has a special information retrieval implication because it considers both angle and distance factors with respect to the reference point R2. In other words, documents within this particular area are relevant to the reference point R2 in terms of both distance and angle. This area is
Fig. 4.10. Display of a new evaluation model in DARE Source: Zhang and Korfhage (1999)
visualized as a rectangle whose width is equal to the angle α and whose length is equal to the distance (d) between the two arcs IEC1 and IEC2 in the vector space. R2 is located in the middle of one side of the rectangle in the visual space (see Fig. 4.10). As the parameter d decreases, the length of the retrieved area shrinks and the retrieved documents within the area become more relevant in terms of the distance measure, and vice versa. As the parameter α decreases, the width of the retrieved area shrinks and the retrieved documents become more relevant in terms of the angle measure, and vice versa.
4.4 The angle-angle-based visualization model

4.4.1 The visual space definition

TOFIR is an angle-angle-based visualization model. In the high-dimensional vector space, if two reference points R1 and R2 are clearly defined and assigned to KVP and AVP respectively, two visual projection angles (two important parameters for any document Di in the vector space) can be defined in Eqs. (4.17) and (4.38), respectively.
\beta = \cos^{-1}\left( \frac{\sum_{j=1}^{n} (x_{ij} - r_{2j})(r_{1j} - r_{2j})}{\left(\sum_{j=1}^{n} (x_{ij} - r_{2j})^2\right)^{1/2} \left(\sum_{j=1}^{n} (r_{1j} - r_{2j})^2\right)^{1/2}} \right)    (4.38)
In the vector space the three points R1, R2, and the document Di generate a plane in which the angle DiR1R2 is α (see Eq. (4.17)), the angle DiR2R1 is β (see Eq. (4.38)), and the angle R1DiR2 is γ. We have:

\pi = \alpha + \beta + \gamma, \qquad \alpha, \beta, \gamma \ge 0    (4.39)
The two angles α and β, the two visual projection angles of the document Di, are assigned to the X-axis and Y-axis of the visual space, respectively. Due to Eq. (4.39) they satisfy:

\pi \ge \alpha + \beta    (4.40)

The minimum and maximum values of the two angles α and β are 0 and π, respectively. So the valid visual display area, the semantic framework, is a triangular area bounded by three lines:

Y = 0    (4.41)

X = 0    (4.42)

X + Y = \pi    (4.43)
The two reference points R1 and R2 are projected at (π/2, 0) and (0, π/2), respectively. That is, they are located on the X-axis and Y-axis of the visual space, respectively.
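As a concrete illustration, the following Python sketch computes the two projection angles of a document in the TOFIR visual space. It assumes, by symmetry with Eq. (4.38), that Eq. (4.17) gives the corresponding angle at R1; the function name and the toy vectors in the example are arbitrary illustrative values.

```python
import numpy as np

def tofir_angles(x, r1, r2):
    """Visual projection angles of a document in TOFIR.

    alpha: angle Di-R1-R2 at the key view point R1 (assumed form of Eq. (4.17))
    beta : angle Di-R2-R1 at the auxiliary view point R2 (Eq. (4.38))
    """
    x, r1, r2 = map(np.asarray, (x, r1, r2))

    def angle(at, p, q):
        u, v = p - at, q - at
        c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(c, -1.0, 1.0))

    alpha = angle(r1, x, r2)   # angle measured at R1
    beta = angle(r2, x, r1)    # angle measured at R2, Eq. (4.38)
    return alpha, beta

# Toy 3-term document and reference vectors (illustrative values only).
alpha, beta = tofir_angles([0.2, 0.7, 0.1], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(alpha, beta, alpha + beta <= np.pi)   # the point (alpha, beta) lies inside the triangle
```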
4.4.2 Visualization for information retrieval evaluation models

The cosine evaluation model

The equation and spatial display of the cosine model are illustrated in Eq. (4.10) and Fig. 4.1, respectively. The angle α is the retrieval threshold, the origin of the vector space is KVP while R2 is AVP, D1 is a document situated within the retrieval area defined by the angle α, and D2 is any document located on one side of the angle α in Fig. 4.1. In the visual space, O' is the projected origin of the vector space and R2 is the other reference point (see Fig. 4.11). Since the document D2 always has a constant visual projection angle regardless of the other visual projection angle, the boundary line of the
Fig. 4.11. Display of the projected cosine model in TOFIR Source: Zhang (2001)
retrieval area D2O will be projected onto a horizontal line in the visual space (see Fig. 4.11). This line can be manipulated by users at will.

Y = \alpha, \quad \alpha \le \pi    (4.44)
The distance evaluation model

The equation and spatial display of the distance model are illustrated in Eq. (4.11) and Fig. 4.2, respectively. In this case, KVP is the origin of the vector space and AVP is R1. The point D3 is a point on the line defined by DiO. Now we need to find the relationship between α1 and α3, which are the two visual projection angles of the document Di. In the figure, Eq. (4.45) always holds because in a triangle the sum of two interior angles is equal to the exterior angle of the third interior angle:

\angle R_1 D_i D_3 = \alpha_1 + \alpha_3    (4.45)

\alpha_2 = \pi - \angle R_1 D_i D_3    (4.46)

From Eqs. (4.45), (4.46), and (4.22):

r \sin(\alpha_1 + \alpha_3) = h \sin(\alpha_1)    (4.47)
Finally we may calculate the relationship between α1 and α3 in the following equations, which show that one value of α1 corresponds to two values of α3.

\alpha_3 = \sin^{-1}\!\left(\frac{h}{r}\sin(\alpha_1)\right) - \alpha_1, \qquad \alpha_1 + \alpha_3 \le \pi/2    (4.48)

\alpha_3 = \pi - \sin^{-1}\!\left(\frac{h}{r}\sin(\alpha_1)\right) - \alpha_1, \qquad \alpha_1 + \alpha_3 > \pi/2    (4.49)
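The two branches can be traced numerically. The sketch below assumes h > r, as in the scenario just derived; the function name and the sample values are illustrative.

```python
import numpy as np

def tofir_distance_contour(h, r, num=200):
    """Projected distance-model contour in TOFIR (Eqs. (4.48) and (4.49)).

    h : distance between KVP (the origin) and AVP (R1), with h > r
    r : retrieval threshold (radius of the circle centred at R1)
    Returns alpha1 samples and the two corresponding alpha3 branches.
    """
    if h <= r:
        raise ValueError("Eqs. (4.48)-(4.49) assume h > r.")
    alpha1 = np.linspace(0.0, np.arcsin(r / h), num)   # valid range of alpha1
    s = np.arcsin((h / r) * np.sin(alpha1))
    alpha3_near = s - alpha1            # Eq. (4.48): alpha1 + alpha3 <= pi/2
    alpha3_far = np.pi - s - alpha1     # Eq. (4.49): alpha1 + alpha3 > pi/2
    return alpha1, alpha3_near, alpha3_far

a1, near, far = tofir_distance_contour(h=2.0, r=0.8)
print(near[-1], far[-1])   # the two branches meet where alpha1 = arcsin(r/h)
```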
For the visual display of the distance model in TOFIR, see Fig. 4.12, where O' is the projected origin of the vector space. It looks like half of a falling water drop. In Fig. 4.2, observe that the line DiO intersects the circle at two points and each point has its own projection angles. These two points share the same value of the angle α1 but have different values of the angle α3. That is why α3 has two solutions (see Eqs. (4.48) and (4.49)). Eq. (4.48) describes the arc D1D2 near the origin of the vector space while Eq. (4.49) describes the arc D1D2 far away from the origin (see Fig. 4.2). It is interesting that when Di moves to the position of the document D1, the line D1O is tangent to the circle. This means that the line intersects the circle at only one point, and at this point α1 reaches its maximum value. That is, when α2 is equal to π/2, α1 reaches its maximum value. According to Eq. (4.48), when α2 is equal to π/2, α1 is equal to sin⁻¹(r/h) and α3 is equal to π/2 − sin⁻¹(r/h), which gives the coordinates of the document D1 in the visual space.

Discussion of the position of the document D1 leads to a better understanding of the projected distance model in the visual space. Suppose h, the distance from the reference point to the origin of the vector space, is fixed. As the radius r increases, the X-axis value of the document D1 increases and the Y-axis value decreases at the same time. This implies that D1 moves toward the point (π, 0) in the visual space. As the radius r decreases, the X-axis value of the document D1 decreases and the Y-axis value increases at the same time. In this case, D1 moves toward the point (0, π) in the visual space. We can also fix the radius r and perform a similar analysis for h. As h increases/decreases, the document D1 moves in the opposite direction compared to what it does when the radius r changes. In fact, the position of the document D1 is affected by both parameters r and h.

Notice that both Eqs. (4.48) and (4.49) suggest that the DRP, the distance (h) between KVP and AVP, may totally change the shape of the distance model in the visual space. In a broader sense, KVP and AVP can be any two reference points R1 and R2; R1 is the center of the distance model, and KVP does not have to be the origin of the vector space. Let us address this issue in detail: the shape of the distance model varies with the DRP (h). There are three possible scenarios: the DRP (h) is smaller than the radius (r) of the distance model; h is equal to the radius (r); and h is larger than the radius (r). When h is smaller than r, the relationships among the involved variables are shown in Fig. 4.13. Di is any point on the circle of the distance model, and D1 and D2 are the two intersections between the reference axis and the circle. Here α1 and α3 are equal to the angle DiR1R2 and the angle DiR2R1, respectively. In the triangle DiR1R2, we have the following equation.
Fig. 4.12. Visual display of the distance model in TOFIR (I) Source: Zhang (2001)
\alpha_1 + \alpha_3 = \pi - \angle R_1 D_i R_2    (4.50)
In the triangle DiD1D2, the following equation holds because Di, D1, and D2 are on the circle and D1D2 is a diameter.
Fig. 4.13. Display of the distance model with h < r
Fig. 4.14. Visual display of the distance model in TOFIR (II)
\angle D_1 D_i D_2 = \pi/2    (4.51)

Because h < r, the reference point R2 lies inside the circle, so:

\angle R_1 D_i R_2 < \angle D_1 D_i D_2    (4.52)

Based on Eqs. (4.50), (4.51), and (4.52):

\alpha_1 + \alpha_3 > \pi/2    (4.53)
The above equation shows that when h is smaller than r, the sum of the two visual projection angles of the document Di is always larger than π/2. This conclusion, in conjunction with Eq. (4.49), implies that the shape of the distance model differs from the previous one. For the corresponding display of the distance model in the visual space, see Fig. 4.14. The corresponding curve starts at the point (0, π) and ends at the point (π, 0) in the visual space.

The second scenario happens when h is equal to r (for the relationships among the involved variables, see Fig. 4.15). In this case, R1 is the center of the circle and R2 is located on the circle of R1. Here α1 is equal to the angle DiR1R2 and α3 is equal to the angle DiR2R1. In the triangle DiD1R1, since both points Di and D1 are on the circle, we have:

\angle D_i D_1 R_1 = \angle D_1 D_i R_1    (4.54)

Because the sum of two interior angles is always equal to the exterior angle of the third interior angle in a triangle, we have:

\angle D_i D_1 R_1 + \angle D_1 D_i R_1 = \alpha_1    (4.55)

\angle D_i D_1 R_1 + \alpha_3 = \pi/2    (4.56)

Based on Eqs. (4.54) to (4.56):

\frac{\alpha_1}{2} + \alpha_3 = \pi/2    (4.57)
It is clear that we may now project the distance model in the visual space according to Eq. (4.57), because α1 and α3 are the projection angles of the document Di. The projected distance model is a straight line. For the display of the distance model, see Fig. 4.16. The last scenario (when h is larger than r) was addressed above; for the display of the distance model in that scenario, see Fig. 4.12.

The disjunction and conjunction evaluation models

The equations of the conjunction and disjunction models are illustrated in Eqs. (4.13) and (4.14), respectively (for their spatial display, see Fig. 4.4). Since the distance model in TOFIR has been discussed thoroughly, the discussion of the disjunction and conjunction models is relatively easy. The two reference points R1 and R2 correspond to two circles in the vector space. When the DRP (a), the distance from R1 to R2, is smaller than the retrieval threshold (k), the display of the disjunction and conjunction models is shown in Fig. 4.17(a); when the DRP (a) is equal to twice the retrieval threshold (2k), that is, the two circles are tangent to each other in the vector space, the display is shown in Fig. 4.17(b); when the DRP (a) is smaller than twice the retrieval threshold
Fig. 4.15. Display of the distance model with h = r
Fig. 4.16. Visual display of the distance model in TOFIR (III)
(2k) and larger than the retrieval threshold (k), that is, the two circles still overlap each other in the vector space, the display is shown in Fig. 4.17(c); and when the DRP (a) is larger than twice the retrieval threshold (2k), that is, the two circles no longer overlap in the vector space, the display is shown in Fig. 4.17(d). In these figures, the overlapping part of the retrieval areas is the result of the conjunction model while the union of the retrieval areas is the result of the disjunction model.

The ellipse evaluation model

The equation and spatial display of the ellipse model are illustrated in Eq. (4.12) and Fig. 4.3, respectively. It requires two reference points. The two reference points R1 and R2 are assigned as KVP and AVP respectively. Di is any point on the ellipse. Eq. (4.37) described the relationship between d1 and α1; similarly, the relationship between d2 and α2 is described in Eq. (4.58).

\cos(\alpha_2) = \frac{a^2 - c^2}{2 a d_2} + \frac{c}{a}    (4.58)

Based on Eqs. (4.37), (4.58), and (4.12), we have:

c = d_1 + d_2 = \frac{a^2 - c^2}{2}\left(\frac{1}{a\cos(\alpha_1) - c} + \frac{1}{a\cos(\alpha_2) - c}\right)    (4.59)
Fig. 4.17. Displays of disjunction and conjunction models in TOFIR Source: Zhang (2001)
Eq. (4.59) is the projection conversion equation for the ellipse model. In other words, it gives the relationship between the two visual projection angles of a document Di on the ellipse contour in the visual space. Observe that α1 and α2 are exchangeable in the equation; this means that the projected contour in the visual space is symmetric about the line X = Y (for the display of the ellipse model, see Fig. 4.18). The point C = (cos⁻¹(a/c), cos⁻¹(a/c)) is a special point on the contour. From the position change of this special point we can understand the projected contour better. Since c is always larger than or equal to a, the point C moves along the line X = Y between the origin and the boundary line Y = −X + π in the visual space. The position of the point C can be used to make a judgment about the ellipse retrieval contour. For a fixed a (the distance between the two reference points), the closer the special point C is to the origin, the smaller the retrieval contour, and vice versa. For a fixed c (the retrieval threshold), the closer the special point C is to the origin, the farther apart the two reference points within the retrieval contour, and vice versa. It is apparent that the curve starts from the
Fig. 4.18. Visual display of the ellipse model in TOFIR. Source: Zhang (2001)
point (0, π), then passes the point C = (cos⁻¹(a/c), cos⁻¹(a/c)), and finally ends at the point (π, 0). The projected contour of the ellipse model is similar to that of the distance model (see Fig. 4.14) in the visual space. They share the same starting and ending points, but they are different: the projected contour of the ellipse model is always symmetric about the line Y = X, while the projected contour of the distance model is generally asymmetric about the line Y = X.
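Solving Eq. (4.59) for cos(α2) given α1 is straightforward algebra, and doing so makes it easy to trace the projected contour and to check the special point C numerically. The following sketch is such a rearrangement (not an equation quoted from the book); the function name and the values of a and c are illustrative.

```python
import numpy as np

def tofir_ellipse_alpha2(alpha1, a, c):
    """Second projection angle alpha2 of a point on the projected ellipse contour,
    obtained by solving Eq. (4.59) for cos(alpha2) given alpha1."""
    A = (a**2 - c**2) / 2.0
    inner = c - A / (a * np.cos(alpha1) - c)      # intermediate term from the rearrangement
    cos_a2 = (c + A / inner) / a
    return np.arccos(np.clip(cos_a2, -1.0, 1.0))

a, c = 1.0, 1.6                                   # DRP and retrieval threshold (c >= a)
alpha_c = np.arccos(a / c)                        # the special point C on the line X = Y
print(np.isclose(tofir_ellipse_alpha2(alpha_c, a, c), alpha_c))   # True: C satisfies Eq. (4.59)
print(tofir_ellipse_alpha2(0.0, a, c))            # equals pi: the curve starts at (0, pi)
```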
4.5 The distance-distance-based visualization model
4.5.1 The visual space definition

GUIDO is a distance-distance-based visualization model. In the vector space, if two reference points R1 and R2 are clearly defined and assigned to KVP and AVP respectively, the two visual projection distances with respect to the two reference points, two important parameters for any document Di, can be defined in the following equations.

d_1 = \left(\sum_{j=1}^{n} (x_{ij} - r_{1j})^2\right)^{1/2}    (4.60)
d_2 = \left(\sum_{j=1}^{n} (x_{ij} - r_{2j})^2\right)^{1/2}    (4.61)
If the two parameters d1 and d2 are assigned to the X-axis and Y-axis of the visual space respectively, the document Di has the position (d1, d2) in the visual space. Given that h is the DRP (the distance between the two reference points R1 and R2), then

h = \left(\sum_{j=1}^{n} (r_{1j} - r_{2j})^2\right)^{1/2}    (4.62)
Here the parameters d1, d2, and h are always larger than or equal to 0. Because the sum of the lengths of any two sides of a triangle is always larger than or equal to the length of the third side, and the three points R1, R2, and Di form a triangle in the vector space, its three sides satisfy the following inequalities (equivalently expressed in the visual-space coordinates X and Y):

d_1 + d_2 \ge h \quad (X + Y \ge h)    (4.63)

d_1 + h \ge d_2 \quad (X + h \ge Y)    (4.64)

d_2 + h \ge d_1 \quad (Y + h \ge X)    (4.65)
In fact, this group of equations defines the valid display zone, the semantic framework, for all projected documents in the visual space. Each of the three equations determines a boundary line of the display zone in the visual space. In Fig. 4.19 the reference points R1 and R2 are projected onto the Y-axis and the
Fig. 4.19. Visual display zone of GUIDO. (Nuchprayoon and Korfhage, 1994). © 1994 IEEE. Reprinted with permission
X-axis at the positions (0, h) and (h, 0), respectively. The display zone is a half-infinite plank which forms a π/4 angle with the X-axis (or Y-axis). Its width is determined by the distance between the two reference points through the following equation.

w = \frac{h}{\sin(\pi/4)}    (4.66)
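A small Python sketch of the GUIDO mapping may help: it computes the visual-space position (d1, d2) of a document from Eqs. (4.60)-(4.61) and checks the display-zone constraints of Eqs. (4.63)-(4.65). The function names and the toy vectors are illustrative, not part of the original system.

```python
import numpy as np

def guido_position(x, r1, r2):
    """GUIDO visual-space position of a document: (d1, d2) from Eqs. (4.60)-(4.61)."""
    x, r1, r2 = map(np.asarray, (x, r1, r2))
    d1 = np.linalg.norm(x - r1)        # Eq. (4.60): distance to R1 (X-axis)
    d2 = np.linalg.norm(x - r2)        # Eq. (4.61): distance to R2 (Y-axis)
    return d1, d2

def in_display_zone(d1, d2, h, tol=1e-9):
    """Check the triangle inequalities (4.63)-(4.65) defining the semantic framework."""
    return (d1 + d2 >= h - tol) and (d1 + h >= d2 - tol) and (d2 + h >= d1 - tol)

r1 = np.array([1.0, 0.0, 0.0])         # illustrative reference points (query vectors)
r2 = np.array([0.0, 1.0, 0.0])
h = np.linalg.norm(r1 - r2)            # Eq. (4.62): DRP
d1, d2 = guido_position(np.array([0.3, 0.5, 0.2]), r1, r2)
print(d1, d2, in_display_zone(d1, d2, h))   # every real document satisfies the zone constraints
```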
4.5.2 Visualization for information retrieval evaluation models

The distance evaluation model

The equation and spatial display of the distance model are illustrated in Eq. (4.11) and Fig. 4.2, respectively. If KVP is the origin of the vector space and AVP is R1, the center of the distance model contour, the projection conversion equation is shown in Eq. (4.67). It is a horizontal line because the distance between a point on the circle and the reference point R1 is a constant (r), the retrieval threshold, regardless of the other projection distance (see Fig. 4.20).

Y = r    (4.67)
The disjunction and conjunction evaluation models

The equations of the conjunction and disjunction models are illustrated in Eqs. (4.13) and (4.14), respectively; their spatial display is shown in Fig. 4.4. Reference
Fig. 4.20. Visual display of the distance model in GUIDO
Fig. 4.21. Visual display of the conjunction and disjunction models in GUIDO. (Nuchprayoon and Korfhage, 1994). © 1994 IEEE. Reprinted with permission
points R1 and R2 are assigned to KVP and AVP respectively. The conversion equations are listed as follows; these lines, in conjunction with the boundary lines of the valid display area, form the retrieval areas. The visual display is shown in Fig. 4.21. The overlapping part of the two retrieval areas is for the conjunction model and the union of the two retrieval areas is for the disjunction model.

Y = k    (4.68)

X = k    (4.69)
The ellipse evaluation model

The equation and spatial display of the ellipse model are illustrated in Eq. (4.12) and Fig. 4.3, respectively. The reference points R1 and R2 are assigned to KVP and AVP respectively. The conversion equation is very simple; it is listed in Eq. (4.70) and the visual display of the model is shown in Fig. 4.22.

X + Y = c    (4.70)
The Cassini evaluation model

The equation is illustrated in Eq. (4.15). Like the ellipse model, it requires two reference points. R1 and R2 are assigned to KVP and AVP respectively. The conversion equation is very simple (see Eq. (4.71)). The visual display of the model is shown in Fig. 4.23.

X \times Y = c    (4.71)
Fig. 4.22. Visual display of the ellipse model in GUIDO. (Nuchprayoon and Korfhage, 1994). © 1994 IEEE. Reprinted with permission
The cosine evaluation model

The equation and spatial display of the cosine model are illustrated in Eq. (4.10) and Fig. 4.1, respectively. The angle α is the retrieval threshold; the origin of the vector space (or R1) is assigned to KVP while R2 is assigned to AVP. The document D2 is a point on one side of the angle. Clearly, the three points, the origin of the vector space, the reference point R2, and the document D2, form a
Fig. 4.23. Visual display of the Cassini model in GUIDO. (Nuchprayoon and Korfhage, 1994). © 1994 IEEE. Reprinted with permission
Fig. 4.24. Visual display of the cosine model in the distance-distance–based visual space
triangle in the vector space. Within this triangle, we define the distance (d1) between D2 and R1 as one visual projection distance of the document D2, for the X-axis, and the distance (d2) between D2 and R2 as the other visual projection distance of D2, for the Y-axis. The following equations are always true (see Fig. 4.1).

d_1 = d_2 \cos(\angle R_2 D_2 R_1) + h \cos(\angle D_2 R_1 R_2)    (4.72)

d_2 \sin(\angle R_1 D_2 R_2) = h \sin(\angle D_2 R_1 R_2)    (4.73)

In the equations, h is the DRP, the distance between the two reference points R1 and R2; it is a constant. Based on the two equations, we obtain:

d_2 = (d_1^2 - 2 h d_1 \cos(\angle D_2 R_1 R_2) + h^2)^{1/2}    (4.74)

Or:

d_2 = (d_1^2 - 2 h d_1 \cos(\alpha) + h^2)^{1/2}    (4.75)
Eq. (4.75) is the projection conversion equation for the cosine model in the distance-distance-based visual environment. The visual display of the cosine model is shown in Fig. 4.24. The retrieval area is defined by the three lines described in Eqs. (4.75), (4.65), and (4.63), respectively. It is clear that the retrieval area is an open area. R2 is always within the retrieval area due to its special location in the retrieval model. When d1 is equal to 0, the curve intersects the Y-axis at (0, h). When d1 increases, the corresponding d2 value also increases for a fixed α.
When the angle α increases, the value of cos(α) decreases. Consequently, the d2 value increases and the curve moves up, which enlarges the retrieval area in the visual space. Eq. (4.75) indicates that when the threshold α is equal to 0, it becomes Eq. (4.76).

d_2 = d_1 - h    (4.76)

In fact, Eq. (4.76) is one boundary of the valid display area; it implies that the retrieval area is empty. When the threshold α is equal to π, Eq. (4.75) becomes the following equation, which is another boundary of the valid display area; it means that the entire display area is included in the retrieval area.

d_2 = d_1 + h    (4.77)
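The boundary curve of Eq. (4.75) and its two limiting cases can be checked with a few lines of Python; the function name and sample values are illustrative. Note that for α = 0 the expression evaluates to |d1 − h|, which coincides with Eq. (4.76) on the part of the display zone where d1 ≥ h.

```python
import numpy as np

def guido_cosine_boundary(d1, h, alpha):
    """Boundary curve of the cosine model in GUIDO (Eq. (4.75)):
    d2 as a function of d1 for a given DRP h and angle threshold alpha."""
    return np.sqrt(d1**2 - 2.0 * h * d1 * np.cos(alpha) + h**2)

h = 1.0
d1 = np.linspace(0.0, 3.0, 7)
print(guido_cosine_boundary(d1, h, alpha=0.0))       # reduces to |d1 - h| (cf. Eq. (4.76))
print(guido_cosine_boundary(d1, h, alpha=np.pi))     # reduces to d1 + h (Eq. (4.77))
```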
Discussion

In the visual space, we can also generate a circle whose center is the origin of the visual space and whose radius is h/2, half the distance between the two reference points. This circle intersects the two corners of the display zone (see Fig. 4.25). The arc within the visual display zone has an obvious implication for information retrieval. In fact, this arc corresponds to a sphere in the vector space. The center of the sphere is the centroid (Rc) of the two reference points R1 and R2, and its radius is equal to h/2. In other words, the two reference points are located on the sphere.

R_c = \left(\frac{r_{11} + r_{21}}{2}, \frac{r_{12} + r_{22}}{2}, \ldots, \frac{r_{1n} + r_{2n}}{2}\right)    (4.78)
If the document Di is any point on the retrieval circle contour, the retrieval contour is described as:

\frac{h}{2} \ge \left(\sum_{j=1}^{n} \left(x_{ij} - \frac{r_{1j} + r_{2j}}{2}\right)^2\right)^{1/2}    (4.79)
Its corresponding projection conversion equation in the visual space is shown in Eq. (4.80). Here d1 and d2 are the two projection distances between Di and the two reference points R1 and R2, respectively. Observe that in this case the locations of the two reference points R1 and R2 in the visual space are R1(0, h/2) and R2(h/2, 0) respectively, because Rc, instead of R1 or R2, is KVP.

\frac{h}{2} = (d_1^2 + d_2^2)^{1/2}    (4.80)
Unfortunately, the contour in the visual space is not changeable because the two reference points are fixed and located on the contour in the vector space.
Fig. 4.25. Discussion of the arc in the distance-distance-based visual space
4.6 Summary

In this chapter three visualization models were introduced. These visualization models are constructed by using distances, angles, or their combination. One of their distinguishing characteristics is the capacity to visualize traditional information retrieval evaluation models in addition to visualizing relationships among documents. Basically, their visual spaces are two-dimensional and built on two reference points defined by users. Notice that document distributions in these visual spaces change accordingly when the reference points change. This implies that the displayed document configurations in the visual spaces can be customized based upon users' dynamic information needs. It is this characteristic that differentiates these visualization models from other visualization models.

In the distance-distance-based visual space the valid display area (the semantic framework) is a half-infinite plank, and both the X-axis and Y-axis are assigned as visual projection distances. It forms a π/4 angle with the X-axis (or the Y-axis), its two corners are connected to the X-axis and Y-axis respectively, and its width is dynamic, determined by the distance between the two reference points.

In the distance-angle-based visual space the valid display area (the semantic framework) is a half-infinite plank, where the X-axis and Y-axis are defined as the visual projection angle and distance, respectively. The plank forms a π/2 angle with the X-axis, and one side overlaps with the Y-axis of the visual space. Its width
is a constant, always equal to π; it is not affected by the distance between the two reference points.

In the angle-angle-based visual space the valid display area (the semantic framework) is a right-triangular area. Both the X-axis and Y-axis are defined as visual projection angles. The two sides of the triangle overlap with the X-axis and Y-axis respectively, and the lengths of the two sides are equal to π. It is not affected by the distance between the two reference points.

Calculating a projection conversion equation for an information retrieval evaluation model is crucial for visually displaying the model in the visual space. The complexity of a conversion equation depends upon multiple factors, such as the definition of the visual space and the nature of the retrieval evaluation model. Some equations are simple and straightforward while others are complicated. The significance of visualizing an information retrieval evaluation model lies not only in making the invisible internal retrieval process transparent to users but also in allowing them to manipulate the model in the visual space at will. It is worth pointing out that new information evaluation models can be developed within these visual environments.
Chapter 5 Kohonen Self-Organizing Map–An Artificial Neural Network
The human brain has amazing and powerful capacities for analyzing and generalizing complex information. It is natural and understandable that human beings attempt to simulate the structures and functionality of the brain to develop theoretical models for complex problems they confront. Neural network techniques were inspired by the way that the human brain responds to and handles human sensory signals. An artificial neural network is also a visualization method that can reveal semantic relationships among data from sophisticated connections in the network. As a result, the network can be employed as a topology-preserving map to explore information.
5.1 Introduction to neural networks

Although human beings have developed state-of-the-art technologies for understanding nature, exactly how the human brain stores and processes information is still a mystery. The brain is a very complex and profound biological "machine" in terms of its structures and functionality. Understanding of the brain is quite limited and superficial. Biological evidence shows that there are more than a billion neurons in the brain. The neurons are connected in order to communicate with each other. Each neuron consists of the following basic components: dendrites, which are responsible for receiving outside signals; an axon, a channel used to pass output signals; synapses, located at the end of an axon, which convert outgoing signals into a meaningful format that other neurons can recognize and receive; and a cell body, where a neuron processes received signals and decides whether to send out a signal based upon certain thresholds. All semantically related signals are neurologically mapped onto the same areas, and relations among the input signals are preserved. The brain receives outside signals from human sensory organs. All received signals are organized and stored in the brain. Different areas in the brain are responsible for different functions and tasks. For instance, the left and right sides of the human brain have different functions. The left side of the human brain is responsible for analytical calculation and handling abstract information such as numbers. This side, scientists believe, also has the capacity to analyze detailed information. On the other hand, the right side of the human brain is responsible for spatial, intuitive, and holistic thinking. It handles graphic information such as graphic representations. Received signals are
automatically mapped onto the most relevant areas based upon the nature of the task. Concepts stored in the brain are not isolated from each other; in fact, they are semantically associated in various ways. Although the biological brain structure is genetically predetermined and understanding of the brain's working mechanism is still inadequate, we know that the brain has memory, stores previous experiences, and that a concept can trigger other associated concepts during information processing. These understandings of the brain inspired the bold imagination needed to develop artificial neural networks. The first artificial neuron network was introduced in 1943 by the neurophysiologist Warren McCulloch and the logician Walter Pitts. A learning rule for neural networks, an essential part, was proposed by Hebb in 1949. In the 1980s, more and more practical artificial neural networks were introduced, such as the Hopfield network presented in 1982.
5.1.1 Definition of neural network

It is no surprise that there are many definitions of neural networks because of the complexity of the concept. Each definition may emphasize a different perspective on the concept. A neural network is a system composed of many simple processing elements operating in parallel whose function is determined by network structure, connection strengths, and the processing performed at computing elements or nodes (DARPA, 1988). A neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:

- Knowledge is acquired by the network through a learning process.
- Interneuron connection strengths known as synaptic weights are used to store the knowledge (Haykin, 1994).

An artificial neural network is an information processing model which simulates the human biological nervous system to solve complex problems. An artificial neural network attempts to mimic the brain structure in architecture, function, and behavior to complete a task. Neural network techniques are intuitive and intriguing because of their learning and generalizing ability, efficient and robust performance, and fault and noise tolerance. Artificial neural network techniques can be applied to many fields such as investment analysis, process control, monitoring, marketing, automotive, banking, defense, electronics, entertainment, industrial, insurance, manufacturing, medical, robotics, speech, securities, telecommunications, and, of course, information retrieval visualization. The research field has attracted attention from computer scientists, mathematicians, cognitive scientists, industrial engineers, biologists, information scientists, and even philosophers, who attempt to apply neural network techniques to their research or practical problems.
5.1.2 Characteristics and structures of neural networks

Neural networks possess many characteristics, but adaptive learning, self-organization, real-time operation, and tolerance of imprecise input data are the most prominent and unique factors that differentiate them from other conventional information processing models and systems.

Adaptive learning means the ability to solve a problem through training and previous experience based on representative input data or signals. A neural network can automatically detect and reveal hidden intrinsic structures and patterns of the representative input data; it learns by example. In other words, neural networks have a capacity to learn and generalize automatically from training data. This is one of the most prominent features of neural networks, and there are various learning algorithms for them.

Self-organization refers to the ability to produce an output automatically as a result of the training process. The experiences and knowledge learned from training need to be stored and represented in the system for a variety of reasons. Neural networks can not only learn automatically from input signals but also automatically organize the learned experiences and knowledge. It is the self-organization ability that makes the visualization application of neural networks in information retrieval possible.

Real-time operation refers to the ability to carry out complex and sophisticated computation in a parallel fashion, which means that a model can process a group of data simultaneously. This parallel computational method consists of multiple processing units connected in an integrative and collaborative way in order to perform multiple tasks at the same time. It differs from the traditional von Neumann architecture, which executes one instruction at a time in sequential order. Neural network processing is nonlinear; it would not be efficient if it ran in a linear way.

Tolerance of imprecise input data refers to the ability to learn from complex and imprecise data and to generate meaningful output patterns. In reality, input data collected by people may be vague simply because the environment in which the data is collected is too complicated. But people still attempt to understand and find possible hidden patterns in such "messy" data sets. This noise-tolerant characteristic enables neural networks to handle situations that are beyond the reach of traditional systems, and neural networks are widely used in various fields partly because of this unique characteristic.

Although many neural network models are available, the basic structure of a neural network is similar across them. Generally speaking, a neural network has three layers: an input layer, a hidden layer, and an output layer. The input layer is responsible for gathering raw representative information from the outside and for initial processing; it is the first step of neural network processing. The hidden layer further handles signals received from the input layer and generates analysis results for the output layer. It is called the hidden layer because it is located between the input layer and the output layer; for this reason it is invisible and not transparent to system users. It is worth pointing out that a neural network may have multiple hidden layers, with the output from one hidden layer serving as input to the next
Fig. 5.1. Neural network layer display
hidden layer, or it may have no hidden layer at all. The number of hidden layers depends on the neural network algorithm. The output layer is the final representation of a neural network's result for users; it serves as the interface between the neural network and its users and is the foundation of the visualization application. A neural network must have both an input layer and an output layer, but it does not have to have hidden layer(s). The relations of the input layer, hidden layer, and output layer are illustrated in Fig. 5.1.

Learning is an indispensable and essential component of a neural network. Based on whether the learning process is guided by an outside "teacher", learning can be classified into two categories: supervised learning and unsupervised learning. Supervised learning involves outside intervention: the learning output is taught by an outside "teacher" how to respond to input signals and to reveal or generalize properties of the input data. Unsupervised learning does not incorporate any outside intervention, and the output result is automatically generated based upon the input data. Unsupervised learning is also called self-organization due to its unsupervised nature.

The way that a neural network handles input signals varies among neural network models. An input signal may need a single pass to generate results or may need multiple iterative passes to yield final results. According to the direction in which input data flows within the neural network, models can be categorized into two groups: feedback and feed-forward structures. For the feed-forward structure, input signals pass through the input layer, then the possible hidden layers, and finally reach the output layer directly; it is a loop-free process. For the feedback structure, however, the input signals can go back and forth between the input layer, the hidden layers, and the output layer before the final result is yielded. The relationships of these neural network concepts are shown in Fig. 5.2.
Fig. 5.2. Neural network classification
5.2 Kohonen self-organizing maps

Willshaw and von der Malsburg (1976) came up with the original idea of self-organization. Takeuchi and Amari (1976) then further extended the convergence properties and dynamic stability of Willshaw and von der Malsburg's theory. Finally, Kohonen (1982) simplified and optimized the neural network theories of Willshaw and von der Malsburg and of Takeuchi and Amari, and proposed the more practical and robust self-organizing map algorithm that bears his name.

Basically, there are three primary Kohonen networks: vector quantization, self-organizing maps, and learning vector quantization. The vector quantization method can convert high-dimensional data into low-dimensional data; it uses unsupervised density estimators or autoassociators (Kohonen, 2001). The learning vector quantization method is used for supervised classification purposes; but the self-organizing map (SOM) network is the most popular and widely used of the three. The Kohonen self-organizing map is a kind of unsupervised feedback neural network. Self-organizing maps are designed not for pattern recognition but for data clustering, information visualization, data mining, and data abstraction (Kohonen, 2001). SOM is regarded as a special topology-preserving map because intrinsic topological structures and important features of the input data are revealed and kept in the resulting output grid (the feature map). SOM inherits all fundamental characteristics of an artificial neural network: it is unsupervised, competitive, and self-organizing. The term competitive in this context means that during learning the algorithm finds the best match between the input signal and all nodes in the output grid. In other words, all neurons/nodes in SOM compete for the input signal; the input signal finds a winning node after the competition and is then associated with it. SOM can be
applied to information visualization, data dimensionality reduction, data property generation, data clustering analysis, data mining, and other related fields. SOM was first introduced to visualize document relations from document titles; later, full texts or documents were categorized into a two-dimensional grid based upon their contents (Lin et al., 1991; Lin, 1997). The self-organizing map technique has been applied to visualize more dynamic and diverse Internet information, as in WEBSOM (Kohonen et al., 2000; Lagus et al., 1999). An aligned-SOM approach attempted to visualize the influence of different parameters by training multiple related self-organizing maps representing the same data set (Pampalk et al., 2003). In order to avoid the problem of misclassification resulting from an imposed output grid size, an approach for automatically controlling the grid size or topological map size was proposed (Nurnberger and Detyniecki, 2002). To tackle massive amounts of information, a multi-layered graphic SOM approach to Internet information categorization was presented; in that approach, a recursive process of analyzing Web pages and creating submaps was undertaken (Chen et al., 1996).
5.2.1 Kohonen self-organizing map structures

SOM has only an input layer and an output layer, without a hidden layer. Input data represents the raw information that is fed into the neural network system. The characteristics of the output results displayed in the final output grid are determined by the input data. All input data is organized and represented in an input vector space whose dimensionality is n. In Eq. (5.1), Di denotes document i (or input signal i), each dij defines one of the multiple attributes of document Di, and p is the number of documents/signals in a vector-based database. If the corresponding attribute dij relates to the document Di, a non-zero value is assigned to dij; otherwise, a zero is assigned to dij. In reality, an object has multiple facets describing its characteristics, and a vector is a natural data structure to represent it. Here ℝⁿ signifies a vector set whose dimensionality is n, the number of different unique attributes in the database. Although n ends up being a large number, the number of non-zero elements in a vector (representing a document) is relatively small. It is apparent that the input document vector space is a high-dimensional space.

D_i = [d_{i1}, d_{i2}, d_{i3}, \ldots, d_{in}]^T \in \mathbb{R}^n, \quad i = 1, \ldots, p    (5.1)
The output layer of SOM is also called the topological map, self-organizing map, property map, output grid, result grid, or feature map. It can be one-dimensional, two-dimensional, or three-dimensional; in most cases it is two-dimensional. It is usually defined as a grid structure for computational simplicity. The grid, the semantic framework, is composed of a group of arranged neurons, cells, or nodes. The final intrinsic features of the input data set are displayed in the grid. A neuron/node in the feature map is associated with a weight vector defined as:

M_i = [m_{i1}, m_{i2}, m_{i3}, \ldots, m_{in}]^T \in \mathbb{R}^n, \quad i = 1, \ldots, k    (5.2)
In Eq. (5.2), k is the number of all neurons/nodes in the grid/feature map. The weight vector records gained experience and defines the characteristics of its corresponding node in the feature map. Here mij is an element of Mi. Each node/neuron corresponds to a coordinate which determines its position in the feature map. The weight vector of a node plays an extremely important role in training: it is used to store all learned experience and knowledge. The contents of the weight vector associated with a node in the feature map are dynamic during the learning process; for this reason, it is regarded as the memory of SOM. The weight vector is invisible and not transparent to system users. The structure of a weight vector is the same as that of an input data vector. That is, they share the same vector dimensionality, attribute definitions, and attribute sequence. These two structures have to be the same because the later learning computation in the algorithm requires that they be computationally compatible. The structures and relationships between the input data vectors, output node weight vectors, and output feature map are illustrated in Fig. 5.3.

The Kohonen self-organizing algorithm produces a low-dimensional feature map from an input space, which is a high-dimensional document vector space and is spatially continuous. In other words, a point between any two points in a spatially continuous space is definable and legitimate. However, a resulting feature map does not have this property. In fact, the nodes in the visual space of a self-organizing map are spatially discrete, and although each neuron/node in SOM is associated with a high-dimensional weight vector, this weight vector has nothing to do with the coordinates of the node in the feature map. That is, the associated high-dimensional weight vectors cannot determine the positions of their corresponding nodes in the visual space. This is one of the most important differences between the input space and the output space in SOM. The two-dimensional grid can be arranged in either a rectangular structure or a hexagonal structure (see Fig. 5.4); each structure defines different neighborhoods, as discussed later. The feature maps are "elastic" networks because the values of the weight vectors in the feature maps are dynamic and always try to adapt to the input signals during training and learning.
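The data structures described above can be sketched in a few lines of Python (using NumPy); the sizes, the sparsity level, and the near-zero initialization below are illustrative choices consistent with the description, not values prescribed by the SOM algorithm itself.

```python
import numpy as np

# A minimal sketch of the SOM data structures of Eqs. (5.1) and (5.2).
rng = np.random.default_rng(0)

p, n = 500, 200                      # p input document vectors, each with n attributes (Eq. (5.1))
documents = rng.random((p, n)) * (rng.random((p, n)) < 0.05)   # sparse non-zero attribute values

rows, cols = 10, 10                  # rectangular output grid; k = rows * cols nodes
k = rows * cols
weights = rng.random((k, n)) * 1e-3  # weight vectors Mi (Eq. (5.2)), initialized near zero

# Grid coordinates of each node, used later for feature-map distances (Eq. (5.7)).
grid = np.array([(x, y) for x in range(rows) for y in range(cols)])
print(documents.shape, weights.shape, grid.shape)
```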
5.2.2 Learning processing of the SOM algorithm

All weight vectors of nodes in the feature map must be initialized before training starts. Very small values near zero are randomly assigned to all
Fig. 5.3. The structures and relations between the input vector space and the output feature map
elements of a node weight vector in the feature map. This initialization is necessary for producing legitimate and reasonable output results. During the iterative training and learning process, each input vector is randomly picked and is typically selected multiple times before training is completed. The way that self-organizing maps process input data is classified as the feedback type because of this iterative processing of the raw input data. After an input data vector is randomly selected, all neurons in the output feature map compete to be the winning node for this input vector. The winning node is defined as the closest neuron, the one whose weight vector is most relevant to the input vector; it is also called the best matching neuron, the best matching unit, or the most relevant node in the feature map. Many similarity measures are available for this calculation, such as the cosine measure, the Euclidean distance measure, and so on. The most used and intuitive measure is the Euclidean distance. Suppose Di is a randomly selected input document/signal; then the winning neuron is defined by Eq. (5.3).

C(D_i) = \min_{l=1,2,\ldots,k} \left\{ \left( \sum_{r=1}^{n} (d_{ir} - m_{lr})^2 \right)^{1/2} \right\}    (5.3)
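A minimal sketch of this winner-selection step, using the Euclidean distance of Eq. (5.3), might look as follows; the array shapes are illustrative.

```python
import numpy as np

def best_matching_unit(d, weights):
    """Index of the winning node for input vector d (Eq. (5.3)):
    the node whose weight vector has the smallest Euclidean distance to d."""
    distances = np.linalg.norm(weights - d, axis=1)   # distance of d to every node's weight vector
    return int(np.argmin(distances))

# Illustrative arrays: 100 nodes with 200-dimensional weight vectors, one input vector.
rng = np.random.default_rng(1)
weights = rng.random((100, 200)) * 1e-3
d = rng.random(200)
print(best_matching_unit(d, weights))
```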
The definitions of the parameters k, n, dir, and mlr are the same as before. This step calculates the winning node for a randomly selected input vector or document. As discussed earlier, neurons in the feature map should not be independent of each other; they should impact and be associated with each other. The degree of association and the impact of a node on its neighborhood should be calculated and recorded in the weight vectors of the feature map. It is crucial to
Fig. 5.4. The relationships between a winning node and its surrounding neighboring nodes. (Kohonen, 1990). © 1990 IEEE. Reprinted with permission
clearly define an impact neighborhood and to properly calculate the impact extent for each node in the defined impact neighborhood. The degree of the impact on the neighborhood areas should vary. Toward this aim, the surrounding impact areas of the winning node should change gradually to accommodate the impact of the input data vector, because the impact of the winning node on nodes at different distances from it in the output grid varies. In other words, the degree of the impact of the winning node on the surrounding neighboring nodes should decrease as the distance between the winning node and a neighboring node increases. This adjustment strategy makes sense because the farther a neighboring node is from the winning node, the less impact the winning node has on it; the impact gets softer as the distance between the two gets larger. The relationships between a winning node and its surrounding neighboring nodes are shown in Fig. 5.4, which shows two scenarios of a feature map: one for a hexagonal structure and the other for a rectangular structure. The dark nodes in both structures are the winning nodes, and the corresponding neighboring areas with the same impact extent are linked by lines. For instance,
the input vector should have a stronger impact on the nodes linked by N(t3) than on the nodes linked by N(t2), and similarly a stronger impact on the nodes linked by N(t2) than on the nodes linked by N(t1). According to the Kohonen SOM algorithm, the size of the defined impact neighborhood of a winning node is dynamic and changeable. As training and learning proceed, the impact neighborhood size of a winning node shrinks; in other words, the neighborhood size of a winning node is a time-sensitive variable. The reason for this strategy is that at the beginning of training and learning, a winning node with a relatively large neighborhood area may achieve a global impact order in the feature map, where local feature orders have not yet been formed because few training input data have been processed; at the end of training and learning, the impact neighborhood of a winning node should reduce to the winning node itself in the feature map, where local feature orders have been generated. The neighborhood size change during the generation of the feature map affects the smoothness of the feature map. Initially, the size of the neighborhood should be large enough; for example, the radius of the neighborhood may be set to half the diameter of the feature map to make a fully global adjustment. In addition, this neighborhood shrinking strategy ensures that the feature map converges at the end of training and learning. In Fig. 5.4 three different neighborhoods at three different times are marked separately. Since t3 > t2 > t1, we have the neighborhood relationships N(t1) > N(t2) > N(t3). The parameter ti is a time variable and N(ti) is the defined neighborhood at time ti. The Gaussian neighborhood function is introduced to describe the dynamic neighborhood size change:
h_{ci}(t) = \alpha(t) \exp\!\left( -\frac{\| M_c - M_i \|^2}{2\sigma^2(t)} \right)    (5.4)

0 < \alpha(t) < 1    (5.5)

\alpha(t+1) = \frac{\alpha(t)}{1 + h_{ci}(t)\,\alpha(t)} \quad \text{or} \quad \alpha(t) = \frac{A}{t + B}    (5.6)

\| M_c - M_i \| = \max(|x_1 - x_2|, |y_1 - y_2|)    (5.7)
In Eq. (5.4), t is a time variable, Mc and Mi denote a winning node of a current input data vector and one of its potential neighboring nodes in the feature map respectively, hci(t) is defined as the neighborhood value for the winning node Mc and a neighboring neuron/node Mi. In Eq. (5.7), Mc(x1, y1) and Mi(x2, y2) define positions of Mc and Mi, respectively in a rectangular-grid-structure-based output feature map (See Fig. 5.4). Here x1, y1, x2, and y2 are integers. M c M i stands for
the distance between the winning node Mc and the neighboring neuron/node Mi in the output feature map. The larger the distance between Mc and Mi, the smaller their corresponding neighborhood value hci(t) at a given time, and vice versa. Notice that ||Mc - Mi|| defines a feature map distance between two nodes in the feature map rather than the Euclidean distance in the weight vector space. These two nodes also correspond to a Euclidean distance in the weight vector space, which is determined by their two weight vectors, but these two distances are totally different concepts. For instance, in Fig. 5.4 the feature map distance ||Mc - Mi|| between the marked node (the winning node) and any node on the ring specified by N(t1) is 3, whereas the distance between the marked node (the winning node) and any node on the ring specified by N(t3) in the weight vector space can be a different value, determined by the weight vectors of the winning node and the corresponding node on the ring. Fig. 5.5 clearly demonstrates the change of the impact degree of a winning node on its surrounding neighboring nodes. The X-axis and the Y-axis construct the output feature map, and the Z-axis is the impact degree of the winning node. The peak of the bell-like shape is the winning node. Eq. (5.4) contains two new functions, α(t) and σ(t), and both relate to the time variable t. The first function, α(t), is the learning rate function. Its legitimate value always falls between zero and one, and it is a monotonically decreasing function of the time variable t. When t → ∞, we have α(t) → 0. Eq. (5.6) requires that the parameters A and B are positive constants. In fact, Eqs. (5.5) and (5.6) give two learning rate functions: the second is simple and straightforward, while the first involves hci(t) and is therefore recursive. Researchers prefer the second one because of its simplicity. The second function in Eq. (5.4) is σ(t), referring to the width of the neighborhood function. It also decreases monotonically as the training iterations progress. According to Eqs. (5.4) and (5.6), it is evident that as t → ∞, we have hci(t) → 0. In other words, as the training and learning time increases, the neighborhood of a winning node reduces to the node itself. A scalar kernel function is used to update the neighboring nodes, that is, to adjust the weight vectors of the neighborhood nodes in the feature map after an input vector is fed in, its winning node is identified, and its surrounding neighborhood is determined.
$$M_i(t+1) = M_i(t) + h_{ci}(t)\times\big(D_j(t) - M_i(t)\big), \qquad j = 1, \ldots, p \qquad (5.8)$$
In Eq. (5.8), Dj(t) is an input data vector, Mi(t) is a node in the neighborhood area, and Mi(t+1) is the updated node of Mi(t) after learning and training. Here p is the number of input signals. The equation modifies the nodes
Fig. 5.5. Bubble neighborhood learning kernel
within the defined neighborhood in a way that makes their weight vectors move toward the input data vector Dj(t). The degree to which a neighboring node's weight vector moves toward the input data vector in the weight vector space depends on the distance from the node to the winning node in the feature map (see Eq. (5.7)). The impact degree is controlled by the parameter hci(t). Figs. 5.6 (a) and (b) show the relationships among Dj(t) (an input data vector), Mi(t) (a node in the neighborhood area), and its updated node Mi(t+1) in the weight vector space. Fig. 5.6 (a) illustrates the impact of a winning node on a neighboring node without considering the impact of the Gaussian neighborhood function; in this case, the weight vector Mi(t) would move directly to the vector Dj(t) after the update. Fig. 5.6 (b) shows the impact of a winning node on a neighboring node when the Gaussian neighborhood function is taken into account. Two possible updated results of Mi(t) are marked as Mi(t+1) and M'i(t+1) respectively in the figure to show the impact discrepancy between two neighboring nodes whose distances to the input vector Dj(t) differ in the output feature map. Vector M'i(t+1) is closer to Dj(t) than Mi(t+1) is in the feature map, so the final updated position of M'i(t+1) in the weight vector space moves closer to Dj(t) due to the stronger effect of the Gaussian neighborhood function. Notice, however, that the weight vectors of both Mi(t+1) and M'i(t+1) move in the same direction (toward Dj(t)).
Fig. 5.6. Impact of a winning node on a neighboring node
The state of the output feature map changes continuously until it reaches an equilibrium point, that is, the map convergence status. Map convergence indicates the end of the iterative training and learning process. In fact, there are two control mechanisms that ensure the convergence of the feature map: one is the neighborhood shrinking strategy and the other is the decreasing learning rate. In other words, when the neighborhood area shrinks to the winning node itself and the learning rate reduces to zero, the map converges and the training and learning process ends. Training and learning may require more than one hundred iterations over the input data. After the training and learning process is finished, each of the documents in the collection is projected onto the final feature map by using the same projection algorithm: each document finds its ultimate winning node in the feature map and is then assigned to that node. Finally, all documents are scattered onto the feature map, and documents similar to a node are associated with each other under the umbrella of the same node. After this process, a node/neuron has a weight vector and may also have a group of relevant documents.
5.2.3 Feature map labeling

After training and learning, nodes with similar weight vectors can be merged into an area, and each of these areas represents one or more subject topics. Feature map labeling is the process of assigning proper terms to an area in the feature map. The labeled term(s) are supposed to reflect the subject topic(s) of the area. After feature map training and learning is finished, one of the major tasks is to interpret the feature map. As we know, the feature map contains rich information about the
database. Appropriately labeling the partitioned map would definitely help users to understand the topic distributions in the feature map, guide them to the right locations in the map during navigation, and facilitate information retrieval in the map. In this sense the labeled terms can serve as landmarks of the map. However, labeling terms for a node/area has proven to be no easy task. First, an area/node may involve multiple subject topics, based upon its weight vector information or the documents associated with it. Second, the spatial limitation of the feature map prevents assigning too many terms to a local area; cramming a map display with excessive terms would not only confuse users but also lead to a poor aesthetic visual effect. Therefore, the principle of labeling is to find the most appropriate term(s) and make the best use of the limited space in the feature map. A variety of labeling algorithms are available, and each has its advantages and disadvantages. In the term labeling method of Lagus and Kaski (1999), two factors of a term in the map were identified to describe and define a good term for a cluster/area: one was its prominent status in the cluster, and the other was its prominent status in the entire data collection. These two factors were combined to measure the importance or significance of a term in a given cluster. Merkl and Rauber (1997) presented a method for term labeling in SOM. It utilized the similarity between the weight vectors of two neighboring nodes to define their degree of connectivity, and in order to enhance the visual display, a set of thresholds was used to differentiate the similarity degrees of neighboring nodes. LabelSOM took a different term labeling strategy (Rauber, 1999): the quantization error for all individual features served as the relevance reference for a cluster label, where the quantization error was defined as the accumulated distances between the weight vector elements of a node and all documents associated with the node. The uniqueness of this method is that it considered both the weight vector of a node and the associated documents. However, the most popular term labeling approach is very simple and straightforward: select a node from the SOM, locate the largest weight value in its weight vector, find the corresponding term, assign that term to the node as the winning term, and merge nodes sharing the same term into a region.
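As an illustration of this last, simple approach, the following Python sketch labels each node with its highest-weighted term and groups nodes that share a label. It is a minimal sketch with hypothetical names: it assumes each node's weight vector is a NumPy array aligned with a shared term vocabulary, and it omits the adjacency check when merging.

import numpy as np

def label_feature_map(weight_vectors, vocabulary):
    """Label each SOM node with the term holding the largest weight in its
    weight vector, then group nodes that share the same winning term.

    weight_vectors: dict mapping node coordinates (x, y) -> 1-D numpy array
    vocabulary:     list of terms, aligned with the weight vector positions
    """
    labels = {}
    for node, w in weight_vectors.items():
        labels[node] = vocabulary[int(np.argmax(w))]   # winning term of the node

    # Merge nodes sharing the same winning term into one region
    # (a full implementation would also check that the nodes are adjacent).
    regions = {}
    for node, term in labels.items():
        regions.setdefault(term, []).append(node)
    return labels, regions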
5.2.4 The SOM algorithm description

The input data structures, the visual space structure, and the characteristics and processing of SOM have been analyzed and discussed above. In order to put all components of the self-organizing map together and give a complete picture of SOM processing, a detailed algorithm is presented as follows. The algorithm input is a group of raw data vectors and the algorithm output is the SOM. Lines L2 to L3 initialize variables. Lines L4 to L13 generate the feature map based on the input signals; this is an iterative learning and adaptation process. After the final feature map is yielded and the node weight vectors are stable, all
documents are projected onto the map (L14 to L17), and lines L18 to L21 describe the feature map labeling process.

L1   Begin
L2     Initialize parameters, the neighborhood radius, and the
L3       weight vectors in the feature map;
L4     While the convergence condition is not satisfied Do
L5       Randomly pick a raw data vector as an input vector;
L6       Calculate the winning node, i.e. the node whose weight
L7         vector has the smallest Euclidean distance to the
L8         input data vector;
L9       For all neighboring nodes of the winning node Do
L10        Update their weight vectors;
L11      Endfor;
L12      Adjust the Gaussian neighborhood function;
L13    Endwhile;
L14    For each of the raw data vectors in the collection Do
L15      Find its winning node in the feature map;
L16      Assign it to the winning node;
L17    Endfor;
L18    For each of the nodes in the feature map Do
L19      Label the selected node;
L20      Merge adjacent nodes sharing the same term(s);
L21    Endfor;
L22  End
An example of a SOM feature map is shown in Fig. 5.7. Kohonen (1990 and 2001) came up with another simple and intuitive representation of the self-organizing maps, called the minimal spanning tree. It looks like a tree structure, and each node or leaf in the minimal spanning tree represents an input vector. It does not need iterations over the input data vectors: when an input vector is submitted, it is compared with all existing nodes in the minimal spanning tree and linked to the node that is most relevant to it. This process is repeated until all input vectors are processed and linked into the minimal spanning tree.
5.3 Implication of SOM in information retrieval

Basically, users may employ the SOM to perform document cluster analysis, to browse and explore information, and to search information. SOM may be utilized to analyze document distributions in a collection and to give users an overview of what the entire database looks like, as well as a valuable insight into the intrinsic structures of the database. Each partitioned area in the map clearly represents one or more concepts and the documents associated with those concepts. It is apparent that the size of each area in
Fig. 5.7. A SOM feature map. Reprinted with permission of Xia Lin
the feature map indicates term occurrence frequencies or the possible number of projected documents representing the area. The larger an area in the feature map, the more documents within that area, and vice versa. After term labeling, semantically related areas are also connected. The neighboring relations of areas show intrinsic semantic associations among the neighboring areas because, according to the algorithm, only relevant concepts are adjacent in the feature map. The degree of relevance between two neighboring areas can be judged by the shape and length of the border separating them: the longer the shared border, the more relevant the two neighboring areas, and the converse also holds. During feature map navigation, users can select any interesting concept term labeled on an area by clicking it in the map, which makes the system list all document titles, or even full texts, associated with the selected area. Then users can directly read the titles or full texts. The map can also be used to find documents similar to a particular document: users can browse the feature map to find a document of interest, and the system can then show all semantically relevant documents by pulling out all documents associated with the area that the document of interest belongs to. The feature map can further be used to detect whether a particular input document fits into a certain class/category in the self-organizing map. If it does not fit, this suggests that the input document is new to the dataset. In other words, if the self-organizing map fails to recognize a new input as belonging to an existing output pattern, this indicates the novelty of the input data. In a retrieval algorithm based on the self-organizing map (Lagus, 2002), after the SOM map was created, each node in the feature map was assigned a centroid vector. The centroid vector was generated from the average weights of all document vectors associated with the subject area, and it was used as a surrogate of all documents associated with the area. After a query was submitted to the system,
the query was compared with all centroid vectors. The best matching centroid vectors were selected and the corresponding associated documents were pulled out as search results. If necessary, the retrieved document set could be refined to narrow down its size. Users can also submit a query to the feature map directly: the query terms are compared with the weight vectors, the nodes with the best matched weight vectors are highlighted in the context of the feature map, and users can identify the retrieved nodes, the associated documents, and their distributions as well. Using SOM, people can explore and discover a complex hidden term semantic network. SOM can provide related terms at the following three levels.

• A group of terms determined by a specified node. As we know, a node in a self-organizing map corresponds to a weight vector formed over a group of predefined terms. After training and learning is done, the non-zero elements of the weight vector define a group of related terms for the node, with each non-zero element corresponding to a term in the weight vector. These identified terms are the most closely related terms in the database, and the relevance degree among them can be determined by their corresponding weight values (a small sketch of extracting such a term group is given below).
• A group of terms determined by a set of nodes located in the same cluster area in the SOM. If nodes are located in the same area, they are within the same subject cluster; in other words, they address a similar topic or topics. Therefore, terms extracted from these nodes should be relevant to each other to some degree, though the relevance degree of this group should be lower than that of the group of terms extracted from a single node. It is interesting to note that the relevance degree between two of these terms extracted from the same area can be measured by the similarity between the weight vectors of the two nodes from which they are extracted.
• A group of terms determined by the neighboring areas of an area. Geographically, an area in the SOM defines a group of neighboring areas, and each of these neighboring areas has a group of related terms as discussed in the second level above. This group of terms is considered the least relevant of the three because the terms are neither within the same node nor within the same area in the feature map.

Notice that the term semantic network built with the above method is formed and generated purely from document semantic associations, so it is database-content-based. Because this term network is rooted in the original document semantic associations, it is a more user-oriented term network and reflects users' preferences. The term semantic network can be employed to assist users in formulating a search query by suggesting potential query terms, and to aid in constructing a thesaurus by recommending a group of related terms for a particular term. Notice also that since the feature map is generated based on the weight vectors and displays the partitioned subject areas rather than the projected objects/documents, the feature map can effectively present the results of a large amount of input data in the visual space.
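A minimal sketch of the first level, assuming a node's weight vector is aligned with a predefined term list (the names are illustrative):

def related_terms_for_node(weight_vector, vocabulary, top_k=None):
    """Return the terms with non-zero weights in a node's weight vector,
    ranked by weight; the weights serve as relevance degrees (level 1)."""
    ranked = sorted(
        ((w, term) for w, term in zip(weight_vector, vocabulary) if w > 0),
        reverse=True)
    pairs = [(term, w) for w, term in ranked]
    return pairs[:top_k] if top_k else pairs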
5.4 Summary

The SOM technique is a nonlinear, topology-preserving projection method that converts a high-dimensional space into a low-dimensional grid. Learning and training, the core part of the SOM algorithm, are done by finding the winning node and adjusting the weight vectors of the affected neighboring nodes of the winning node to resemble the input vector. Two parameters play an important role in the learning and training process: the learning rate parameter controls the magnitude of the neighborhood node adjustments, and the neighborhood parameter controls the scope and coverage of the adjustment. SOM can offer an agreeable visual environment for information retrieval, document cluster analysis, and term association analysis. There are three spaces involved in SOM: the high-dimensional document vector space, the high-dimensional weight vector space, and the low-dimensional visual space (the display grid). The three spaces are different and play different roles. The two high-dimensional spaces are compatible in terms of their structures, and neither is transparent to users; one is associated with documents/objects and is used to describe characteristics of documents, while the other is associated with the nodes of the display grid and is used to preserve the experience and knowledge learned from the training process. The low-dimensional visual space is also associated with the nodes of the grid; it is the space in which users observe and interact with visual information. Despite its appeal, the SOM technique has some restrictions and weaknesses. Computational complexity is one of the disadvantages of SOM, especially for a large data set. Training and learning of the self-organizing maps requires many iterations over the input signals to reach convergence, and the number of training iterations depends on the parameter setup and the size of the raw input dataset. If a database is very large, the training and learning process is time-consuming. Because the SOM algorithm generates feature patterns by its iterative training and learning, its operations and results can be unpredictable. SOM also cannot properly visualize a regular pattern of a high-dimensional space in its low-dimensional space. For instance, an ellipse pattern in a high-dimensional space cannot be projected onto the low-dimensional feature map with its geometric characteristics preserved meaningfully. This implies that traditional information retrieval models such as the cosine model, the ellipse model, the Euclidean distance model, the conjunction model, the disjunction model, and so on, which all correspond to regular hyper-geometric patterns in a high-dimensional space, cannot be effectively visualized and displayed in the SOM environment as they can in other information retrieval visualization models such as DARE, TOFIR, and GUIDO. Notice that after training and learning, the SOM structure stays stable. Without doubt, the static feature maps provide users with rich information about the intrinsic structures of the database they represent. However, this static characteristic of SOM prevents it from incorporating users' needs into the feature maps. Although a zoom in/out feature implemented in some SOM systems allows
users to observe the feature maps at various detail levels, it does not change the basic contextual structure of the maps. It is widely recognized that users' information needs are dynamic and diverse during information retrieval. If SOM could be customized to each individual user's needs, its flexibility would definitely be enhanced.
Chapter 6 Pathfinder Associative Network
The Pathfinder associative network (PFNET) was originally designed to assist researchers with psychological analysis based on a proximity data set (Schvaneveldt et al., 1989). It is a structural and procedural modeling technique that extracts underlying connection patterns in proximity data and represents them spatially in a class of networks (Cooke et al., 1996). The power of the Pathfinder associative network rests on its ability to discard insignificant links in the original network while preserving the salient semantic structure of the network. The simplified network still maintains the proximity connections and fundamental characteristics of the original network. PFNET can be used to visualize semantic relations of related nodes in a more effective and meaningful way. The Pathfinder associative network can handle data of both an ordinal and a ratio nature. The triangle inequality principle, which is central to the Pathfinder associative network algorithm, is applied to simplify an original network: it is used to identify the paths with the lowest weights in the network, eliminate redundant ones, and make the network more economical. In Euclidean space the triangle inequality can be easily interpreted and illustrated. Given three points (A, B, and C) in the Euclidean two-dimensional plane, the distance AB is always smaller than or equal to the sum of the distances AC and CB (see Fig. 6.1). When C is situated on the line segment determined by A and B, the distance AB is equal to the sum of the distances AC and CB. In other words, AB is always the shortest path between A and B. If a network consisting of multiple connected points is pruned in such a way that all shortest paths are preserved and redundant paths are discarded, the final pruned network is a Pathfinder network. The main idea of the Pathfinder associative network is to discard the redundant paths and keep the significant ones in a network. The principle of the triangle inequality can be extended to an abstract space; in that case, the connection proximity between two points may be measured in other forms, such as an invisible semantic similarity between two objects, rather than distance. Pathfinder associative networks can be applied to many different fields of study, such as cognitive science, artificial intelligence, psychological analysis, information retrieval, knowledge organization, and information visualization as well.
Fig. 6.1. Display of three points in the Euclidean space
6.1 Pathfinder associative network properties and descriptions
6.1.1 Definitions of concepts and explanations

A graph can be defined as G(V, E), where V is a set of vertices (or nodes) {N1, N2, …, Nn} and E is a set of edges, each edge connecting a pair of vertices (nodes) in V. |V| = n is defined as the number of nodes in V. In a Euclidean plane, a graph can be depicted with vertices as points and edges as segments linking these vertices. A graph G is also called a network. The connections and relationships of all edges in E can be described in an adjacency n×n matrix EG (see Eq. (6.1)). The headings of both the columns and the rows are nodes, and the orders of these nodes in the columns and rows are exactly the same. The matrix EG is represented as:
$$E_G = \begin{pmatrix} e_{11} & \cdots & \cdots & e_{1n} \\ \cdots & \cdots & \cdots & \cdots \\ \cdots & \cdots & e_{ij} & \cdots \\ e_{n1} & \cdots & \cdots & e_{nn} \end{pmatrix}_{n \times n}, \qquad 1 \le i, j \le n \qquad (6.1)$$
where eij is defined as an edge from node Ni to node Nj. If there is an edge between Ni and Nj, then the corresponding eij is equal to 1; otherwise eij is equal to 0. We define eii = 0, assuming that a node is not linked to itself. This means that the
diagonal elements of the matrix are always equal to zero. The constant n is the number of nodes in the graph. If a graph is undirected, then we have eij = eji, and the matrix EG is therefore symmetric about its diagonal. If a graph is directed, the equation eij = eji may not hold, and the corresponding matrix EG is asymmetric about its diagonal. Parallel to the matrix EG, the weight matrix W (see Eq. (6.2)) defines a weight wij that is associated with an edge eij in the graph. In other words, wij is the weight assigned to eij.
$$W = \begin{pmatrix} w_{11} & \cdots & \cdots & w_{1n} \\ \cdots & \cdots & \cdots & \cdots \\ \cdots & \cdots & w_{ij} & \cdots \\ w_{n1} & \cdots & \cdots & w_{nn} \end{pmatrix}_{n \times n}, \qquad 1 \le i, j \le n \qquad (6.2)$$
Similar to eii, wii is always equal to 0. W and EG have the same matrix structure but different contents and meanings. It is clear that if eij = 0, then wij = 0; that is, if there is no link between two nodes, the weight is zero. As we know, the Pathfinder associative network is a simplified network: it always has the same nodes as the original network but possesses fewer edges. Therefore, the Pathfinder associative network can also be defined as a matrix PF, where pij is the weight assigned to the edge eij.
$$PF = \begin{pmatrix} p_{11} & \cdots & \cdots & p_{1n} \\ \cdots & \cdots & \cdots & \cdots \\ \cdots & \cdots & p_{ij} & \cdots \\ p_{n1} & \cdots & \cdots & p_{nn} \end{pmatrix}_{n \times n}, \qquad 1 \le i, j \le n \qquad (6.3)$$

$$PF \subseteq W \qquad (6.4)$$
A path in the graph/network is comprised of several connected edges. For instance, P = {eab, ebc, ecd} is a path consisting of the three edges eab, ebc, and ecd. The weight of a path is calculated by the Minkowski r-metric (see Eq. (6.5)):

$$W(Path) = \left( \sum_{i=1}^{k} w_i^r \right)^{1/r}, \qquad r = 1, \ldots, \infty \qquad (6.5)$$
In the above equation, Path = (e1, e2, …, ek) is a path, wi is the weight of edge ei, and (w1, w2, …, wk) are the weights of the edges on the path. The legitimate value of the parameter r in Eq. (6.5) ranges from 1 to ∞, and r affects the path weight significantly. When r is equal to 1, the path weight is the sum of all edge weights along the path; when r is equal to 2, the path weight is the Euclidean combination of the edge weights; and when r is equal to ∞, the path weight is equal to the maximum edge weight among all involved edge weights.
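A small Python sketch of the path weight calculation in Eq. (6.5), with r = ∞ handled as the maximum edge weight:

def path_weight(weights, r):
    """Minkowski r-metric weight of a path (Eq. 6.5).
    weights: edge weights (w1, ..., wk) along the path; r may be float('inf')."""
    if r == float('inf'):
        return max(weights)            # r = infinity: maximum edge weight
    return sum(w ** r for w in weights) ** (1.0 / r)

# Examples: r=1 gives the plain sum, r=2 the Euclidean combination.
# path_weight([3, 4], 1) -> 7.0, path_weight([3, 4], 2) -> 5.0,
# path_weight([3, 4], float('inf')) -> 4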
Path length is defined as the number of edges along a path. For instance, the length of Path = (e1, e2, …, ek) is k:

$$L(Path) = k \qquad (6.6)$$
Notice that the concept of path length is quite different from that of path weight, even though the two have a very close relationship: the path length does not depend on the edge weights along the path, whereas the path weight is calculated from these edge weights. A graph is q-triangular with the Minkowski r-metric if and only if all paths in the network whose path lengths are smaller than or equal to the parameter q meet the triangle inequality (see Eqs. (6.7) and (6.8)):

$$w_{ag} \le \left( w_{ab}^r + w_{bc}^r + \cdots + w_{fg}^r \right)^{1/r} \qquad (6.7)$$

$$m = L\big((e_{ab}, \ldots, e_{fg})\big), \qquad m = 1, 2, 3, \ldots, q \qquad (6.8)$$
In G(V, E), the valid value of q ranges from 1 to n-1. The weights associated with eab, ebc, …, efg are wab, wbc, …, wfg, respectively, and the parameter m is the path length. The two parameters q and r determine a family of similar Pathfinder associative networks, also called the isomorphic Pathfinder associative networks. EG^i is a path-length-i matrix: in this matrix, if there is a path from node l to
node k with path length i, then the element e^i_{lk} is equal to 1; otherwise it is 0. Now let us define another very important concept: the path-length-i minimum weight matrix, which contains the most economical weights for a certain path length in a network. For the definition, see Eqs. (6.9) and (6.10):

$$w_{jk}^{i+1} = \min\Big\{ \big(w_{j1}^{\,r} + (w_{1k}^{i})^{r}\big)^{1/r},\ \ldots,\ \big(w_{jm}^{\,r} + (w_{mk}^{i})^{r}\big)^{1/r},\ \ldots,\ \big(w_{jn}^{\,r} + (w_{nk}^{i})^{r}\big)^{1/r} \Big\}$$
$$\text{for } w_{mk}^{i},\ m \ne k, \ \text{ and } \ w_{jm},\ m \ne j, \qquad 1 \le m \le n \qquad (6.9)$$

$$W^1 = W, \qquad 1 \le i \le n-1 \qquad (6.10)$$
W^1 is the original weight matrix W, and the parameter n is the number of all nodes in the network. The above two equations are used to calculate the weight of a path when the path length increases by 1. Observe that when path growth happens in a network, all possibilities of path growth should be considered and the most economical one selected from all possible paths. For instance, suppose an existing path with path length i is to grow by 1, that is, W^i is to be converted to W^{i+1}. The algorithm first consults W^1 to
determine all possibilities for path growth. For the weight w^{i+1}_{jk}, the possible paths with path length i+1 are e^1_{j1} combined with e^i_{1k}, e^1_{j2} combined with e^i_{2k}, …, and e^1_{jn} combined with e^i_{nk}; each is considered only if the corresponding e^1_{jm} exists for the path increase. The next step is to use the Minkowski r-metric to calculate new path weights for all newly generated paths with path length i+1. The final step is to select the best (lowest-weight) path from all the newly calculated path weights. The reason that m cannot be equal to k for the weight w_{mk}, and m cannot be equal to j for the weight w_{jm}, is that adding either w_{kk} or w_{jj} cannot result in an increase in the path length; in other words, the path length from a node to itself is defined as 0.
In the path-length-i minimum weight matrix, an element w^i_{jk} is defined as the lowest weight of a path that starts at node j, ends at node k, and has a path length exactly equal to i. The path-length-i minimum weight matrix W^i (1 ≤ i ≤ n) is introduced to calculate the path-length-i complete minimum weight matrix D^i (see Eqs. (6.11) and (6.12)).
$$D^i = \begin{pmatrix} d_{11}^i & \cdots & \cdots & d_{1n}^i \\ \cdots & \cdots & \cdots & \cdots \\ \cdots & \cdots & d_{lk}^i & \cdots \\ d_{n1}^i & \cdots & \cdots & d_{nn}^i \end{pmatrix}_{n \times n}, \qquad 1 \le l, k \le n, \quad 1 \le i \le n-1 \qquad (6.11)$$

$$d_{lk}^i = \min\left( w_{lk}^1,\ w_{lk}^2,\ \ldots,\ w_{lk}^i \right), \qquad l \ne k \qquad (6.12)$$
D^i is also a square matrix, like W^i, but the path-length-i complete minimum weight matrix D^i is different from W^i: the former is generated based upon the latter. The element d^i_{lk} is the weight of a path that meets two conditions: it comes from the group of paths whose path lengths are equal to 1, 2, 3, …, i, respectively, and its weight is the lowest among the weights of these paths. Notice that when the value of i increases, the values of the elements in D^i may decrease. That is because the number of paths from any node A to another node B increases as i increases; as a result, the possibility of finding a lower path weight increases. When this happens, according to the algorithm, the path with the lower weight replaces the old path in D^i, which leads to lower values of elements in D^i. When i is equal to n-1, it reaches its maximum because linking a node to itself in a network does not construct a valid edge.
6.1.2 The algorithm description

The Pathfinder associative network (PFNET(r, q)) generation algorithm is described as follows. PFNET(r, q) means that the produced Pathfinder associative network
is q-triangular with the Minkowski r-metric. W is the input original weight matrix, the PF matrix is the output matrix of the generation algorithm, and pij is an element of PF (1 ≤ i, j ≤ n). The algorithm is adapted from the original Pathfinder algorithm (Dearholt and Schvaneveldt, 1990) and can handle both symmetric and asymmetric matrices.
L1   Begin
L2     Initialize PF matrix;
L3     Input parameters r, q, and the proximity matrix W;
L4     For m=1 To q-1 Step 1
L5       For k=1 To n Step 1
L6         For l=1 To n Step 1
L7           w^{m+1}_{lk} = MIN( (w_{l1}^r + (w^m_{1k})^r)^{1/r}, ..., (w_{ln}^r + (w^m_{nk})^r)^{1/r} );
L8         Next l;
L9       Next k;
L10    Next m;
L11    For k=1 To n Step 1
L12      For l=1 To n Step 1
L13        d^q_{lk} = MIN( w^1_{lk}, w^2_{lk}, ..., w^q_{lk} );
L14      Next l;
L15    Next k;
L16    For k=1 To n Step 1
L17      For l=1 To n Step 1
L18        If w_{lk} = d^q_{lk} Then let p_{lk} = w_{lk};
L19        EndIf;
L20      Next l;
L21    Next k;
L22  End.
In the algorithm, in lines L2 to L3 all variables are initialized, and the parameters and the proximity matrix are received. Lines L4 to L10 calculate a group of path-length-i minimum weight matrices W^i, which are used as inputs for the calculation of the path-length-q complete minimum weight matrix D^q. Lines L11 to L15 compute the path-length-q complete minimum weight matrix D^q. Lines L16 to L21 examine whether each of the edges in the matrix D^q meets the condition; if so, these edges are moved from the matrix D^q to the final PF matrix. The condition is that the weight of an edge from the matrix D^q is equal to the corresponding weight from W^1. If the condition is satisfied, the direct edge in W^1 is at least as economical as any path of length up to q between the two nodes. Notice that as the value of the parameter q increases, fewer elements in the matrix D^q may qualify for the condition according to the
algorithm, because more paths whose path lengths are larger than 1 and whose path weights are lower than those in W may be found. If so, they replace the old ones, and consequently fewer edges of W have a chance to be added to PF. The inputs for this generation algorithm are the two parameters r and q and the matrix W, which describes the proximity among objects; here n is the number of all nodes in the network. The output of the algorithm is the Pathfinder matrix PF, which may be employed to draw a Pathfinder associative network graph in the visual space. Parameters l, k, and m are control variables, and we have 1 ≤ q ≤ n-1 and 1 ≤ r ≤ ∞. Let us use an example to illustrate the generation process. The original network (see Fig. 6.2) can be described by the following weight matrix W:
$$W = \begin{pmatrix}
0 & 1 & 7 & 3 & 7 & 5 \\
1 & 0 & 1 & 7 & 8 & 8 \\
7 & 1 & 0 & 2 & 7 & 7 \\
3 & 7 & 2 & 0 & 2 & 7 \\
7 & 8 & 7 & 2 & 0 & 1 \\
5 & 8 & 7 & 7 & 1 & 0
\end{pmatrix} \qquad (6.13)$$
For simplicity, the two parameters r and q for the Pathfinder associative network are set to ∞ and 2, i.e. PFNET(∞, 2). First, the algorithm needs to calculate the path-length-2 minimum weight matrix W^2. For instance, we can calculate w^2_{12} as follows:

$$w_{12}^2 = \min\big\{\max(w_{13}, w_{32}),\ \max(w_{14}, w_{42}),\ \max(w_{15}, w_{52}),\ \max(w_{16}, w_{62})\big\}$$
$$w_{12}^2 = \min(7,\ 7,\ 8,\ 8)$$
Fig. 6.2. Original network display of an example

$$w_{12}^2 = 7 \qquad (6.14)$$
Following a similar calculation procedure, we can calculate the rest of the elements in W^2. The final result is shown in Eq. (6.15).
$$W^2 = \begin{pmatrix}
0 & 7 & 1 & 7 & 3 & 7 \\
7 & 0 & 7 & 2 & 7 & 5 \\
1 & 7 & 0 & 7 & 2 & 7 \\
7 & 2 & 7 & 0 & 7 & 2 \\
3 & 7 & 2 & 7 & 0 & 7 \\
7 & 5 & 7 & 2 & 7 & 0
\end{pmatrix} \qquad (6.15)$$
The next step is to calculate the path-length-2 complete minimum weight matrix D^2 based on both W and W^2: compare w_{ij} and w^2_{ij} in W and W^2, find the minimum weight, and put the minimum weight into d^2_{ij}. For the results of the calculations, see Eq. (6.16). D^1 is equal to W, so we do not need to calculate it.
$$D^2 = \begin{pmatrix}
0 & 1 & 1 & 3 & 3 & 5 \\
1 & 0 & 1 & 2 & 7 & 5 \\
1 & 1 & 0 & 2 & 2 & 7 \\
3 & 2 & 2 & 0 & 2 & 2 \\
3 & 7 & 2 & 2 & 0 & 1 \\
5 & 5 & 7 & 2 & 1 & 0
\end{pmatrix} \qquad (6.16)$$
The final step is to compare D^2 and W, identify the edges which satisfy the condition d^2_{ij} = w_{ij}, and put the satisfying edges into the PF matrix. It is clear that the edges e12, e14, e16, e23, e34, e36, e45, and e56 meet the condition and should be added to the PF matrix (see Eq. (6.17)). The resulting final Pathfinder associative network is shown in Fig. 6.3. This network demonstrates two characteristics of the triangle inequality: no link violates the triangle inequality within path length 2 in terms of the Minkowski ∞-metric, and there may be some links that violate the triangle inequality when path lengths longer than 2 are considered.
$$PF = \begin{pmatrix}
0 & 1 & 0 & 3 & 0 & 5 \\
1 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 2 & 0 & 7 \\
3 & 0 & 2 & 0 & 2 & 0 \\
0 & 0 & 0 & 2 & 0 & 1 \\
5 & 0 & 7 & 0 & 1 & 0
\end{pmatrix} \qquad (6.17)$$
Fig. 6.3. Final display of PFNET(∞, 2)
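The worked example can be reproduced with a short Python sketch of the generation procedure (Eqs. (6.9) to (6.12) and the final comparison step). Treating missing edges as infinite weights is an implementation assumption rather than part of the original description; running the sketch on the example matrix reproduces the PF matrix of Eq. (6.17).

import numpy as np

def pfnet(W, r=np.inf, q=2):
    """Generate PFNET(r, q) from a weight matrix W following Eqs. (6.9)-(6.12).
    A zero entry in W means 'no edge'; the diagonal is ignored."""
    n = W.shape[0]
    INF = np.inf
    # Replace missing edges (weight 0) with infinity for the minimization.
    W1 = np.where((W == 0) & ~np.eye(n, dtype=bool), INF, W).astype(float)
    np.fill_diagonal(W1, INF)          # a node is not linked to itself

    def combine(a, b):                 # Minkowski r-metric of two path weights
        if np.isinf(r):
            return max(a, b)
        return (a ** r + b ** r) ** (1.0 / r)

    Wi = W1.copy()                     # W^1
    D = W1.copy()                      # running minimum over W^1 .. W^i
    for _ in range(q - 1):             # grow the path length up to q, Eq. (6.9)
        Wnext = np.full((n, n), INF)
        for j in range(n):
            for k in range(n):
                if j == k:
                    continue
                Wnext[j, k] = min(combine(W1[j, m], Wi[m, k])
                                  for m in range(n) if m != j and m != k)
        Wi = Wnext
        D = np.minimum(D, Wi)          # path-length-i complete minimum, Eq. (6.12)

    # Keep an edge only if its direct weight equals the minimum weight
    # (the check in lines L16-L21 of the algorithm above).
    return np.where((W1 == D) & np.isfinite(W1), W1, 0.0)

W = np.array([[0, 1, 7, 3, 7, 5],
              [1, 0, 1, 7, 8, 8],
              [7, 1, 0, 2, 7, 7],
              [3, 7, 2, 0, 2, 7],
              [7, 8, 7, 2, 0, 1],
              [5, 8, 7, 7, 1, 0]], dtype=float)

print(pfnet(W, r=np.inf, q=2))         # reproduces the PF matrix of Eq. (6.17)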
6.1.3 Graph layout method

Unlike other information visualization approaches such as the self-organizing maps, DARE, TOFIR, and so on, the Pathfinder associative network has the particular problem of graph drawing in a visual space. The problem arises because the logical relationships of the nodes in a PFNET are separate from the physical relationships of those nodes in the visual space. The logical relationships of the nodes are described in the matrix PF, but PF does not specify how these nodes are projected onto the visual space. The physical relationships refer to the nodes' positions and locations, and to the edges linking these nodes, in a visual space. For instance, if nodes A and B are linked in PF, they can be positioned anywhere in a visual space as long as they are connected there. Graph drawing, an independent research field, addresses how to effectively arrange connected nodes in a low-dimensional visual space while preserving the logical connections and relationships of the nodes. Issues regarding aesthetics for drawing an undirected graph include graph symmetry, minimal edge crossing, bending of edges, uniform edge length, reflection of inherent symmetry, conformation to the frame, and uniform vertex distribution (Battista et al., 1994; Fruchterman and Reingold, 1991). A spring model for graph drawing was introduced by Kamada and Kawai (1989). The model simulates a dynamic spring system where an edge in a graph stands for a spring, a ring stands for a node, and two springs are linked by a ring in the system. When new springs are added to the system (or existing springs are deleted from it), or an external force is imposed upon the system, the previous balance of the spring system is no longer maintained. The system reaches a new equilibrium when the energies of all springs are released to the minimum status, and this optimal state is used to draw the graph in the visual space. The energy of a spring is given in the following equation:
$$E = \frac{1}{2} K X^2 \qquad (6.18)$$
where X is the length the spring is stretched from its free position, and K is the force constant of the spring, which is primarily determined by its material quality. Eq. (6.18) can be extended to a multiple-spring system. Consider a dynamic spring system in which n nodes are mutually linked by springs, and denote by pi (i = 1, 2, 3, …, n) a node in the graph. The energy of the whole system is defined as:
$$E = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{1}{2}\, k_{ij} \left( \left| p_i - p_j \right| - l_{ij} \right)^2 \qquad (6.19)$$
where lij is the natural length of the free spring determined by pi and pj, kij is the force constant of the corresponding spring, and |pi - pj| is the distance between pi and pj in the graph. In order to achieve the equilibrium status, in which the springs are as close to their free lengths as possible, E must reach its minimum value.
Finally, the Newton-Raphson method (Rowe et al., 1987) is used to solve for the variables in the equation, which determine the positions of all involved nodes in the graph.
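A minimal sketch of the spring model is given below. The energy function follows Eq. (6.19) under the assumption that the matrices of free lengths and force constants are symmetric; a plain gradient descent step stands in for the Newton-Raphson solution used in the original method, so it should be read as an illustration of the idea rather than the published layout algorithm.

import numpy as np

def spring_energy(P, L, K):
    """Total energy of the spring system (Eq. 6.19).
    P: (n, 2) node positions; L, K: (n, n) free lengths and force constants."""
    n = len(P)
    E = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            d = np.linalg.norm(P[i] - P[j])
            E += 0.5 * K[i, j] * (d - L[i, j]) ** 2
    return E

def layout(L, K, steps=2000, lr=0.01, seed=0):
    """Move the nodes downhill on the energy surface until it (approximately)
    settles; gradient descent replaces the Newton-Raphson step here."""
    rng = np.random.default_rng(seed)
    n = L.shape[0]
    P = rng.random((n, 2))
    for _ in range(steps):
        G = np.zeros_like(P)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                diff = P[i] - P[j]
                d = np.linalg.norm(diff) + 1e-9
                G[i] += K[i, j] * (d - L[i, j]) * diff / d   # dE/dP_i
        P -= lr * G
    return P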
6.2 Implications on information retrieval

Applying a PFNET to a domain problem requires identifying two basic, necessary, and indispensable elements from the application domain: the first is the objects, which are used as the nodes of the network; the second is the proximity relationship between two objects, which is used to form a link between them. For a given type of object there may be multiple methods for defining the proximity relationship between two objects, and different types of objects may have different proximities. Proximity can be procured either by a human-interference method or by an automatic computation method. It is not surprising that different objects and proximity methods can lead to different Pathfinder associative networks. Clearly defining the objects and the proximity method is therefore essential to the construction of a Pathfinder associative network.
6.2.1 Author co-citation analysis

Author co-citation refers to the phenomenon that occurs when the authors of two different papers both cite the same paper(s) in their works. The concept is also called bibliographic coupling. Usually papers are cited to acknowledge previous related research or to support the author's arguments, and it is a very common and natural phenomenon that two authors cite the same paper(s) if they address the same or a related topic. As a supplement to subject analysis, author co-citation analysis is unique and important because the cited documents have a close semantic relationship with the citing document: views, themes, ideas, concepts, theories, issues, problems, trends, approaches, and people from the cited documents are naturally embedded in the contexts of the citing paper. It is believed that the concepts and conceptual relations based on cited documents have an advantage over the concepts and conceptual relations created from conventional co-term analysis (Rees-Potter, 1989). Author co-citation analysis uses co-citation data to structure and summarize a scientific field, which can be depicted in a co-citation network or a collaboration graph. It is apparent that in this case the objects, one of the two basic elements for the Pathfinder associative network construction, are the documents that cite or are cited in author co-citation analysis. There are various approaches to defining the proximity relationship between documents in co-citation analysis. The proximity relationship between two documents is used to produce a document-document proximity matrix providing the input for the Pathfinder associative network generation algorithm.
The first approach, the cosine similarity measure, was described in Eq. (6.20) (Chen and Morris, 2003):

$$S(d_i, d_j) = \frac{cocit(d_i, d_j)}{\big( cit(d_i) \times cit(d_j) \big)^{1/2}} \qquad (6.20)$$
In Eq. (6.20), cit(x) denotes the number of all citations of a document x, and cocit(x, y) stands for the number of co-citations, that is, the number of papers that both document x and document y cite. Here di and dj are two citing documents in a document collection. The equation suggests that the proximity or similarity between two documents increases if the number of citations that the two documents share grows, and vice versa; it also increases if the number of all citations of either document decreases while the co-cited documents stay the same, and vice versa. The second proximity approach is the Jaccard or Tanimoto similarity measure, shown in Eq. (6.21); it has been used in co-citation analysis studies (Schneider and Borland, 2004; Schneider, 2005). The definitions of cit(x) and cocit(x, y) in Eq. (6.21) are the same as those in Eq. (6.20). The difference between the two equations is reflected in their denominators, that is, in the way they normalize co-citations:

$$S(d_i, d_j) = \frac{cocit(d_i, d_j)}{cit(d_i) + cit(d_j) - cocit(d_i, d_j)} \qquad (6.21)$$
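Both similarity measures reduce to one-line Python functions; the example values in the comment are made up for illustration.

def cosine_cocitation(cocit_ij, cit_i, cit_j):
    """Cosine-style co-citation similarity, Eq. (6.20)."""
    return cocit_ij / (cit_i * cit_j) ** 0.5

def jaccard_cocitation(cocit_ij, cit_i, cit_j):
    """Jaccard (Tanimoto) co-citation similarity, Eq. (6.21)."""
    return cocit_ij / (cit_i + cit_j - cocit_ij)

# Two documents with 40 and 90 citations, 20 of them shared:
# cosine  -> 20 / 60  = 0.333...
# Jaccard -> 20 / 110 = 0.1818...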
The author co-citation Pathfinder network can be used to visualize progress in a knowledge domain (Chen, 2004). In order to illustrate the progress, a time interval was divided into a number of meaningful time slices (say, one year or five years), and an individual co-citation Pathfinder network was derived from each time slice. The final time series of co-citation networks was generated when all time slices were connected according to their time sequence; that is, the time slices constituted a continuous time series. In the time series of co-citation Pathfinder associative networks, salient changes between neighboring time slices were identified, and the evolution of scientific research was visualized and analyzed. In the co-citation Pathfinder associative networks, the size of a node was proportional to the number of citations of a document, and the width and length of a link were proportional to the co-citation similarity value. Nodes in the network were classified as landmark nodes, which had significant attribute values; hub nodes, which were widely co-cited documents; and pivot nodes, which were joints between different sub-networks. The third proximity approach is the Pearson r correlation coefficient. It is widely recognized and used in author co-citation analysis, and it is an easily understood concept. Many commercial statistical packages support the Pearson r correlation coefficient. Author co-citation matrices can serve as input to principal component analysis as well as to multidimensional scaling and hierarchical clustering routines. The Pearson r coefficient method can produce highly intelligible results (White, 2003).
Correlation analysis addresses measuring the degree of association between two variables. For the Pearson r correlation coefficient (or Pearson product-moment correlation coefficient), the two variables should have a linear relationship, each of the variables should be normally distributed, and the data should be interval or ratio. The Pearson r can be computed from Eq. (6.22):

$$r = \frac{n \sum XY - \sum X \times \sum Y}{\sqrt{\left( n \sum X^2 - \left( \sum X \right)^2 \right)\left( n \sum Y^2 - \left( \sum Y \right)^2 \right)}} \qquad (6.22)$$
where n is the number of observations and X and Y are the two variables. The result r ranges from -1 to 1. If r is larger than 0, there is a positive relationship between the two variables; if r is smaller than 0, there is a negative relationship; and if r is equal to 0, there is no relationship between them. As a rough guide, an r from 0.9 to 1 indicates very high correlation, from 0.7 to 0.9 high correlation, from 0.5 to 0.7 moderate correlation, from 0.3 to 0.5 low correlation, and from 0 to 0.3 little correlation. However, there are debates over the application of the Pearson correlation coefficient approach to co-citation analysis (White, 2003). The issues include the following: Pearson r becomes unstable when smaller co-citation count matrices are combined; the treatment of the diagonal in the matrices from which measures like r are produced remains a problem; Pearson's r is supposed to handle data with a normal distribution, while author co-citation data is highly skewed; the standard significance test for r assumes random sampling of independent observations from a population; and so on.
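A sketch of Eq. (6.22) in Python, with a tiny made-up example in the comment:

import math

def pearson_r(X, Y):
    """Pearson product-moment correlation coefficient, Eq. (6.22)."""
    n = len(X)
    sx, sy = sum(X), sum(Y)
    sxy = sum(x * y for x, y in zip(X, Y))
    sxx = sum(x * x for x in X)
    syy = sum(y * y for y in Y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

# pearson_r([1, 2, 3, 4], [2, 4, 6, 8]) -> 1.0 (perfect positive correlation)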
6.2.2 Term associative network

Term co-occurrence analysis addresses term co-occurrence behavior in full-text documents. Keywords appearing together within a predefined length of text in the same document are regarded as co-occurring terms. Term co-occurrence information can be utilized to produce a Pathfinder associative term network, which may be utilized to explore and discover related terms in a domain that users are not familiar with. For instance, the idea of the so-called "term seeding" method (Buzdlowski et al., 2001) is that a user starts with a seed term as a starting point, which can then trigger other associative terms that most frequently co-occur with the seed term; documents including the seed term are systematically examined to return the co-occurring terms. Term Pathfinder associative networks are also expected to help users to better formulate their queries. It is clear that the objects of term Pathfinder associative networks are terms, and the proximity is the relationship among terms in full-text contexts. There are several options for defining such proximity in full-text contexts. The first proximity method is based on term adjacency information. In analyzing the provided texts, all stop words are filtered out with a predefined stop word list, and
the remaining words are stemmed. Term-pair proximity or similarity is calculated as the sum of values accumulated when the two terms are adjacent or occur in the same sentence, paragraph, or document. For each term pair, the similarity is increased by 5 if the terms are adjacent in the same sentence, by 4 for a nonadjacent term pair in the same sentence, by 3 for a nonadjacent term pair in the same paragraph, by 2 for a nonadjacent term pair in the same section/chapter, and by 1 for a term pair in the same document. The results of this processing lead to a final term-term proximity matrix that is used for the construction of a term PFNET (Fowler and Dearholt, 1990). The term co-occurrence matrix is organized as follows: both the columns and the rows are defined as terms, the order of the terms in the columns and rows is the same, and the intersection of a column and a row in the matrix holds the term proximity value between the two terms. The second proximity method is based on term probability in a full text. The association between two terms in a full text can be calculated by the equivalence index (Turner et al., 1988; Schneider and Borland, 2004); see Eq. (6.23):
$$S(t_i, t_j) = \frac{f_{ij}^2}{f_i \times f_j} \qquad (6.23)$$
where fij is the number of co-occurrences of term ti and term tj in the citation contexts, and fi and fj are the numbers of occurrences of term ti and term tj, respectively. S(ti, tj) indicates the probability of term ti (tj) appearing simultaneously with term tj (ti) in a set of citation contexts; for this reason S(ti, tj) is also called a coefficient of mutual inclusion. Eq. (6.23) can be used to produce the term-term proximity matrix.
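A sketch of building the term-term proximity values with the equivalence index of Eq. (6.23); representing each citation context as a set of terms is an assumption made for illustration.

from collections import Counter
from itertools import combinations

def term_proximity_matrix(contexts):
    """Build term-term proximity values with the equivalence index of Eq. (6.23).
    `contexts` is a list of term sets, one per citation context (or per
    sentence/window, depending on how co-occurrence is delimited)."""
    f = Counter()        # term occurrence counts f_i
    f_pair = Counter()   # co-occurrence counts f_ij
    for terms in contexts:
        terms = set(terms)
        f.update(terms)
        f_pair.update(frozenset(p) for p in combinations(sorted(terms), 2))

    S = {}
    for pair, fij in f_pair.items():
        ti, tj = sorted(pair)
        S[(ti, tj)] = fij ** 2 / (f[ti] * f[tj])   # Eq. (6.23)
    return S

# Example: term_proximity_matrix([{"som", "map"}, {"som", "map"}, {"som"}])
# gives S[("map", "som")] = 2**2 / (2 * 3) = 0.666...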
6.2.3 Hyperlink

The PFNET technique can also be applied to Internet information representation. In this case, the objects of the Pathfinder associative network are Web pages and the proximity is the strength of the hyperlinks connecting Web pages. There are two approaches to constructing a PFNET based on hyperlink strength. The first one is similar to the author co-citation analysis method: the number of co-cited hyperlinks can be used to measure the similarity between two Web pages, and the cell value in the webpage co-citation matrix is defined as the number of Web pages that the two Web pages both cite. The webpage co-citation matrix is symmetric, because if webpage A cites webpage C and webpage B also cites webpage C, the direction of the citation does not play any role; therefore, the final PFNET is an undirected graph. The second approach is based on hyperlink connections between two pages (Chen, 1997). In this case, a webpage connection matrix is defined as follows: the cell value of the webpage connection matrix is the number of hyperlinks by which one webpage cites another webpage. It is apparent that the webpage connection matrix is asymmetric. That is because if webpage A cites webpage B, it
does not necessarily mean that webpage B also cites webpage A; therefore, the final PFNET is a directed graph.
6.2.4 Search in Pathfinder associative networks

Users are allowed to search an established Pathfinder associative network (Chen, 1999). After a query is submitted to the network, the relevance between the query and a document is calculated by the Pearson correlation coefficient, and the search results can then be demonstrated or highlighted on the network. The relevance magnitude of a search query and a document is indicated by the height of a spike rising from the document sphere: the longer the spike, the more relevant the document is to the query, and vice versa. Users can also browse the Pathfinder associative network at will; clicking a document sphere lets users view its contents in detail. Documents on the center ring in a Pathfinder associative network appear to be more generic than leaf documents on a branch. Query search in a Pathfinder associative network can follow a different scenario (Fowler et al., 1991; Fowler and Dearholt, 1990), where both the query and a document are converted into two Pathfinder associative networks and the similarity between the query and the document hinges on the similarity between the two Pathfinder networks. The query process can begin with the user's entry of a natural language request for information, and query revision can be accomplished by deleting nodes, entering more text, or dragging any terms that the system displays into the query Pathfinder associative network. Keyword adjacency information in both a natural-language-based query and a full-text-based document can be employed to generate a query Pathfinder network and a document Pathfinder network, respectively. Since both the query and the document are represented in PFNET form, the matching technique between a query and a document is a little different from traditional ones. The proximity algorithm for a query network structure and a document network structure consists of two parts. The first part is defined as the ratio of the number of terms common to the query and the document to the number of all terms in the query; it is clear that this part only measures the term relevance between the query and the document. The second part is supposed to measure the network structure similarity between the query network and the document network. The value of this part increases when nodes (terms) connected in the query network also appear closely connected in the document network. For instance, the similarity value of the two network structures increases by 2 when two terms appear in both the query and the document and are directly linked in both networks, and it increases by 1 when two terms appear in both the query and the document but are only indirectly linked in both networks. All network similarity values are summed up and the total is divided by 2 times the number of links in the query, which normalizes the structure similarity to between 0 and 1. Finally, the two parts are weighted and integrated into a final similarity value, which is used to decide whether the document is relevant to the query or not.
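One possible reading of this two-part score is sketched below in Python. The networks are represented as adjacency dictionaries, the part weights are illustrative assumptions, and a breadth-first search stands in for the notion of "indirectly linked".

def pfnet_query_match(query_net, doc_net, w_terms=0.5, w_structure=0.5):
    """Sketch of the two-part query/document PFNET similarity described above.
    A network is given as a dict: term -> set of directly linked terms."""
    q_terms, d_terms = set(query_net), set(doc_net)
    common = q_terms & d_terms
    term_part = len(common) / len(q_terms) if q_terms else 0.0

    def linked(net, a, b):                 # directly linked in the network
        return b in net.get(a, set())

    def reachable(net, a, b):              # linked directly or indirectly (BFS)
        seen, frontier = {a}, [a]
        while frontier:
            nxt = []
            for node in frontier:
                for nb in net.get(node, set()):
                    if nb == b:
                        return True
                    if nb not in seen:
                        seen.add(nb)
                        nxt.append(nb)
            frontier = nxt
        return False

    # Undirected query links, each counted once.
    q_links = [(a, b) for a in query_net for b in query_net[a] if a < b]
    score = 0
    for a, b in q_links:
        if a in common and b in common:
            if linked(doc_net, a, b):
                score += 2                 # directly linked in both networks
            elif reachable(doc_net, a, b):
                score += 1                 # only indirectly linked in the document
    structure_part = score / (2 * len(q_links)) if q_links else 0.0
    return w_terms * term_part + w_structure * structure_part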
6.3 Summary

In a Pathfinder associative network PFNET(r, q), the triangle inequality is always satisfied in terms of the path weight calculated by the Minkowski r-metric within path length q. The characteristics of a Pathfinder associative network are determined by the two important parameters r and q: the weight of a path is affected by the Minkowski r-metric, while the considered path length is governed by the parameter q. PFNETs show systematic variations when the two parameters q and r are varied, and changes of these two parameters affect the complexity of the Pathfinder associative network. The complexity of the network decreases as either or both of these two parameters increase; in other words, when the parameters r and q are equal to their maximum values ∞ and n-1, respectively (where n is the number of all nodes in a network), the PFNET is the simplest and most economical network. However, an increase of q results in an increase of computational complexity. The strength of PFNET lies in revealing accurate, detailed, and specific connections of nodes in a network. The weaknesses of the Pathfinder associative network include its computational complexity, which may prevent PFNET not only from visualizing a large dataset but also from dynamically modifying a PFNET in response to interactions between users and the network. The PFNET generation algorithm requires many large intermediate matrices to yield the final result, which may occupy a large amount of memory. Another disadvantage of PFNETs in the present state of development is that people have no way of knowing the features upon which the similarity judgments are made, with the result that the semantic content of links is not easily discernible (Dearholt and Schvaneveldt, 1990). It is also clear that PFNET cannot generate a local visual configuration based on users' individual information needs; it only produces a global overview of a data collection. Since the logical relationships of the nodes in a Pathfinder associative network are separate from the physical relationships of the nodes, the logical relations are not directly assigned to a coordinate system of the visual space. This leads to a graph drawing problem when Pathfinder associative networks are projected onto a 2D or 3D visual space; fortunately, effective solutions to this problem have been found. The Pathfinder network technique is very effective and efficient for the display of complex relationships among objects, such as sophisticated semantic networks. As an information visualization means, it can be applied to a wide spectrum of information retrieval environments, ranging from information searching, author co-citation analysis, term co-occurrence analysis, and thesaurus construction to Internet information representation.
Chapter 7 Multidimensional Scaling
The multidimensional scaling (MDS) technique consists of a group of methods used to discover empirical relationships among investigated objects by visualizing them and presenting a geometric representation of them in a low-dimensional display space. It can be used to reveal and illustrate hidden patterns in a set of proximity measures among objects for multivariate, exploratory, and visual data analysis, and it is much more vivid and intuitive than a traditional data analysis approach. People can observe the proximity relationships among the investigated objects intuitively in a low-dimensional MDS display space, leading to a better understanding of individual or group differences among the investigated objects. The input data for MDS analysis is usually a measure of proximity (similarity or dissimilarity) of the investigated objects in a high-dimensional space, while its output is a spatial object configuration in a low-dimensional space where users may perceive and analyze the relationships among the displayed objects. It is apparent that in such an MDS display space, the more similar two objects are, the closer to each other they appear, and vice versa. The data used in MDS analysis is relatively free of distributional assumptions, a characteristic that makes the technique applicable to many fields of study. MDS originated in psychology: psychophysical approaches (Young and Householder, 1941; Torgerson, 1952; Guttman, 1968) led to successful algorithmic developments which soon came to be known as the MDS technique. The technique has been widely applied in other fields such as econometrics, the social sciences, sociology, physics, political science, biology, information science, archaeology, and chemistry. MDS comprises a series of algorithms, each handling different situations; in fact, one of the MDS technique's advantages is the diversity of its algorithms. These MDS algorithms can be classified into metric and non-metric MDS algorithms, primarily based upon the type of input proximity data; another category of MDS technique is the classical MDS algorithm. The non-metric MDS algorithm is applied to qualitative proximity data, while metric MDS is applied to quantitative proximity data. Qualitative proximity data refers to ordinal data, in which case the investigated proximities in one category are ordered relative to those in another; quantitative proximity data refers to ratio-scaled data. Both metric and non-metric MDS algorithms attempt to achieve an optimal Euclidean distance configuration of the projected objects in a low-dimensional space by minimizing the so-called stress value. Classical MDS is used for a single proximity matrix with un-weighted, quantitative proximity data; the locations of the projected objects in a low-dimensional display space are computed through linear
algebra approaches in the classical MDS algorithm. Proximity data in classical MDS is either interval-scale or ratio-scale data. Interval-scale data and ratio-scale data are applicable to both the metric MDS algorithm and the classical MDS algorithm, while ordinal data is applicable only in non-metric MDS.
7.1 MDS analysis method descriptions
7.1.1 Classical MDS

Classical MDS was first introduced by Torgerson (1952). Unlike both metric and non-metric MDS, classical MDS does not involve any iterative procedure. Instead it offers a direct analytical solution for the locations of projected objects in a low dimensional space. The non-iterative nature of the algorithm reflects its uniqueness. The proximity matrix of a classical MDS is a square matrix. An analytical solution for the locations of projected objects in a low dimensional space is central and vital to classical MDS. It involves a series of linear algebra reasoning steps. Let us discuss these in detail. The process of converting a square matrix into eigenvalues and eigenvectors in linear algebra is called an eigen-decomposition. A well-known eigen-decomposition theorem states that such a decomposition is always possible for a symmetric square matrix, which a proximity matrix is. This theorem guarantees that both eigenvalues and eigenvectors can be derived from the matrix. This is extremely important and necessary.

$$A_{n\times n} = Q \Lambda Q' \qquad (7.1)$$
In Eq. (7.1), matrix An×n is a square matrix and matrix Q' is the transpose of matrix Q. By definition, if Q' is the transpose of matrix Q, the rows of matrix Q become the columns of matrix Q'. Matrix I in Eq. (7.2) is an identity matrix, where all diagonal elements are equal to 1 and all off-diagonal elements are equal to zero. Matrix Q is an orthonormal matrix if Eq. (7.2) is satisfied.
$$Q' Q = I = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} \qquad (7.2)$$
Matrix Λ is a diagonal matrix whose off-diagonal elements are always equal to 0.
First we modify Eq. (7.1) by multiplying both sides of the equation by matrix Q on the right, then use Eq. (7.2) to simplify it. Because ΛI is always equal to Λ, we finally have:

$$AQ = Q \Lambda Q' Q = Q \Lambda I = Q \Lambda \qquad (7.3)$$
Eq. (7.3) can also be presented as:
$$A_{n\times n} V_{n\times 1} = \lambda V_{n\times 1} \qquad (7.4)$$
In Eq. (7.4), An×n is the same as in the definition in Eq. (7.1), and Vn×1 is a column vector (a column vector is a special matrix which contains only one column); it is an eigenvector of An×n. Here λ is one of the so-called eigenvalues, or characteristic values, of the matrix An×n. Notice that λ is a scalar, which means it can be either real or complex. According to the definition, I·Vn×1 is always equal to Vn×1, so we replace Vn×1 with I·Vn×1 in Eq. (7.4) and move the right side of the equation to the left side; then we have the following equations:

$$A_{n\times n} V_{n\times 1} - \lambda I V_{n\times 1} = 0 \qquad (7.5)$$
According to matrix properties:

$$(A_{n\times n} - \lambda I)\, V_{n\times 1} = 0 \qquad (7.6)$$
If the column vector Vn×1 is non-zero, then the matrix (An×n − λI) must have a zero determinant, which is defined as:

$$Det(A_{n\times n} - \lambda I) = 0 \qquad (7.7)$$
Eq. (7.7) is also called the characteristic equation of An×n. The solutions of the equation are the eigenvalues of An×n. Each of the eigenvalues corresponds to an eigenvector for which the eigen equation is always true. According to the Laplace (cofactor) expansion, a determinant can be decomposed along a row (or column) of the matrix. This process reduces the dimensionality of the matrix. By repeating the Laplace expansion, the dimensionality of a matrix can be reduced step by step until it reaches 1, and the determinant, and hence the eigenvalues, of the matrix can then be easily calculated.

$$Det(B) = \sum_{j=1}^{n} b_{ij} \times C_{ij} \qquad (7.8)$$
In the above equation, bij is an element of matrix B, and Cij denotes the matrix cofactor:

$$C_{ij} = (-1)^{i+j} \times Det(M_{ij}) \qquad (7.9)$$
Mij is a minor of matrix B. Mij is derived by removing the ith row and the jth column of matrix B. This yields a new, dimension-reduced (n−1)-by-(n−1) square matrix.
For instance, we use the Laplace expansion to calculate the determinant of the given matrix B (see Eq. (7.10)). In this case, we expand along the second column of matrix B.
$$B = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 1 & 4 \\ 2 & 0 & 5 \end{pmatrix} \qquad (7.10)$$

$$Det(B) = (-1)^{1+2} \times 2 \times Det\begin{pmatrix} 2 & 4 \\ 2 & 5 \end{pmatrix} + (-1)^{2+2} \times 1 \times Det\begin{pmatrix} 1 & 3 \\ 2 & 5 \end{pmatrix} \qquad (7.11)$$
Notice that the dimensionality of the original matrix is reduced from 3 to 2 now (the third term of the expansion vanishes because b32 = 0). From Eq. (7.11), we continue to apply the Laplace expansion to calculate the determinants of the two smaller 2-by-2 matrices, which gives:

$$Det(B) = -5 \qquad (7.12)$$
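To make the expansion concrete, the cofactor expansion of Eqs. (7.8) and (7.9) can be written as a small recursive routine. The following is only an illustrative sketch in Python, added here and not part of the original text; applied to the matrix B of Eq. (7.10), it reproduces the value −5 of Eq. (7.12).

```python
def det(matrix):
    """Determinant by Laplace (cofactor) expansion along the first row, cf. Eqs. (7.8)-(7.9)."""
    n = len(matrix)
    if n == 1:
        return matrix[0][0]
    total = 0
    for j in range(n):
        # minor: remove the first row and the j-th column
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        total += (-1) ** j * matrix[0][j] * det(minor)  # cofactor times element
    return total

print(det([[1, 2, 3], [2, 1, 4], [2, 0, 5]]))  # -> -5, matching Eq. (7.12)
```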
Now let us go back to Eq. (7.7) and use the Laplace expansion to obtain the eigenvalues of a given matrix. For simplicity, we use a 2-by-2 matrix as an example.

$$A_{2\times 2} = \begin{pmatrix} -1 & 2 \\ 3 & 4 \end{pmatrix} \qquad (7.13)$$

$$Det(A_{2\times 2} - \lambda I) = Det\begin{pmatrix} -1-\lambda & 2 \\ 3 & 4-\lambda \end{pmatrix} = 0 \qquad (7.14)$$
Apply the Laplace expansion to Eq. (7.14):
$$\lambda^2 - 3\lambda - 10 = 0 \qquad (7.15)$$
Solutions of Eq. (7.15) are the two eigenvalues of Eq. (7.13):
$$\lambda_1 = -2, \quad \lambda_2 = 5 \qquad (7.16)$$
Next, we need to get the corresponding eigenvectors of the two eigenvalues. First, get the eigenvector of λ1 from Eq. (7.6):

$$(A_{2\times 2} - \lambda_1 I)\begin{pmatrix} v_{11} \\ v_{12} \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 3 & 6 \end{pmatrix}\begin{pmatrix} v_{11} \\ v_{12} \end{pmatrix} = 0 \qquad (7.17)$$

From Eq. (7.17) we have the following equation:

$$v_{11} = -2\, v_{12} \qquad (7.18)$$

Setting v12 = k1, where k1 is an arbitrary constant, the eigenvector of λ1 is:
$$V_1 = k_1 \begin{pmatrix} -2 \\ 1 \end{pmatrix} \qquad (7.19)$$
Similarly, we have the eigenvector of λ2 from Eq. (7.6), where k2 is another arbitrary constant:

$$V_2 = k_2 \begin{pmatrix} 1 \\ 3 \end{pmatrix} \qquad (7.20)$$
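The same eigenvalues and eigenvectors can be obtained numerically. The following minimal sketch (an illustration added here, assuming NumPy is available) applies a standard eigen-solver to the matrix of Eq. (7.13); note that a numerical library returns unit-length eigenvectors, i.e. particular choices of the constants k1 and k2.

```python
import numpy as np

A = np.array([[-1.0, 2.0],
              [3.0, 4.0]])              # the example matrix of Eq. (7.13)

eigenvalues, eigenvectors = np.linalg.eig(A)

print(eigenvalues)    # the eigenvalues -2 and 5 of Eq. (7.16), in some order
print(eigenvectors)   # columns proportional to (-2, 1) and (1, 3), cf. Eqs. (7.19)-(7.20)
```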
One of the major tasks of the classical MDS algorithm is to use the eigen-decomposition approach to generate a low dimensional coordinate matrix Xn×m, based on which the final object configuration in a low dimensional space is depicted. Here n and m are the number of investigated objects and the dimensionality of the low display space, respectively. In order to map objects described in a proximity matrix Mn×n onto a low dimensional coordinate matrix Xn×m, we have to introduce the scalar product matrix Bn×n, which serves as an intermediate matrix bridging an input proximity matrix Mn×n to a final low dimensional coordinate matrix Xn×m.

$$B_{n\times n} = X_{n\times m}\, X'_{m\times n} \qquad (7.21)$$
Observe that a proximity matrix Mn×n only records the proximity relationships among objects; it does not include any coordinate information about these objects in the low dimensional display space. The relationship between these two matrices must be investigated and determined. Fortunately, the double centering method (Borg and Groenen, 1997) can successfully solve the problem. If the investigated objects are projected onto the low dimensional display space and their proximity relationships described in Mn×n are preserved after projection, the scalar product matrix of the low dimensional coordinate matrix and the squared proximity matrix satisfy Eq. (7.22). The significance of the double centering is that it associates a high dimensional proximity matrix with the scalar product matrix of a low dimensional coordinate matrix.

$$B_{n\times n} = -\frac{1}{2}\, J_{n\times n}\, M^{(2)}_{n\times n}\, J_{n\times n} \qquad (7.22)$$
Matrix Jn×n is defined in Eq. (7.23), where 1 is an n-dimensional column vector in which all element values are equal to one, and 1' is its transpose, an n-dimensional row vector with all element values equal to one.

$$J_{n\times n} = I_{n\times n} - \frac{1}{n}\, \mathbf{1}\mathbf{1}' \qquad (7.23)$$
$$\mathbf{1}\mathbf{1}' = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix}_{n\times n} \qquad (7.24)$$
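As a small illustration of Eqs. (7.22) and (7.23), the double centering step can be written as a short helper function. This sketch is added here for illustration only (assuming NumPy); it is not the book's implementation. Feeding it the squared city distances of Eq. (7.29) below yields the scalar product matrix of Eq. (7.31), up to rounding.

```python
import numpy as np

def double_center(squared_proximity):
    """Eq. (7.22): B = -1/2 * J * M^(2) * J, with J = I - (1/n) * 1 1' from Eq. (7.23)."""
    n = squared_proximity.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J @ squared_proximity @ J
```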
Since Λm^(1/2) Λm^(1/2) is equal to Λm, the eigen-decomposition of Eq. (7.1), applied to Bn×n, can also be presented as:

$$B_{n\times n} = Q_m \Lambda_m Q'_m = Q_m \Lambda_m^{1/2} \Lambda_m^{1/2} Q'_m \qquad (7.25)$$
Because of the matrix property that (AB)' is equal to B'A', Eq. (7.25) can be converted to:

$$B_{n\times n} = Q_m \Lambda_m^{1/2} \Lambda_m^{1/2} Q'_m = (Q_m \Lambda_m^{1/2})(Q_m \Lambda_m^{1/2})' \qquad (7.26)$$
Here Qm is the matrix of the m eigenvectors and Λm is the diagonal matrix of the m eigenvalues of matrix Bn×n. Comparing Eqs. (7.21) and (7.26), we finally have:

$$X_{n\times m} = Q_m \Lambda_m^{1/2} \qquad (7.27)$$
That is an important equation. The analysis shows that we can project objects in a high dimensional space onto a low dimensional space with the following procedure: first square a given proximity matrix; second, use the double centering equation to get the scalar product matrix of the low dimensional coordinate matrix; then calculate the eigenvalues and eigenvectors of the scalar product matrix; and finally use the eigenvalues and eigenvectors to compute the coordinate matrix Xn×m. This procedure locates the positions of all objects in a low dimensional MDS display space. The following example is a distance-based proximity matrix for four Wisconsin cities. The first, second, third, and fourth columns (rows) are the four cities Milwaukee, Green Bay, La Crosse, and Wausau, respectively. The distance relationships among these four cities are described in Eq. (7.28).
$$M_{city} = \begin{pmatrix} 0 & 112 & 201 & 184 \\ 112 & 0 & 196 & 93 \\ 201 & 196 & 0 & 141 \\ 184 & 93 & 141 & 0 \end{pmatrix} \qquad (7.28)$$
Square the Mcity proximity matrix:
$$M^{(2)}_{city} = \begin{pmatrix} 0 & 12544 & 40401 & 33856 \\ 12544 & 0 & 38416 & 8649 \\ 40401 & 38416 & 0 & 19881 \\ 33856 & 8649 & 19881 & 0 \end{pmatrix} \qquad (7.29)$$
Because there are four cities, the number of objects in this case (n) is equal to 4, and matrix J4×4 is computed in the following equation.

$$J_{4\times 4} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} - \frac{1}{4}\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 0.75 & -0.25 & -0.25 & -0.25 \\ -0.25 & 0.75 & -0.25 & -0.25 \\ -0.25 & -0.25 & 0.75 & -0.25 \\ -0.25 & -0.25 & -0.25 & 0.75 \end{pmatrix} \qquad (7.30)$$
The corresponding scalar product matrix B4×4 of M4×4 is shown in the following equation.

$$B_{4\times 4} = -\frac{1}{2}\, J_{4\times 4}\, M^{(2)}_{city}\, J_{4\times 4} = \begin{pmatrix} 12090 & 2420 & -6622 & -7889 \\ 2420 & 5293 & -9029 & 1316 \\ -6622 & -9029 & 15070 & 585.8 \\ -7889 & 1316 & 585.8 & 5987 \end{pmatrix} \qquad (7.31)$$
Since the MDS display space is a two-dimensional plane, the parameter m is equal to 2. By using the eigen-decomposition approach, the eigenvalues and eigenvectors of the scalar product matrix B4×4 are calculated; they are listed in Eqs. (7.32) and (7.33) respectively.

$$E_{values} = \begin{pmatrix} 1.11\times 10^{-12} \\ -602.66 \\ 1.379\times 10^{4} \\ 2.525\times 10^{4} \end{pmatrix} \qquad (7.32)$$

$$E_{vector} = \begin{pmatrix} 0.5 & 0.356 & 0.558 & 0.559 \\ 0.5 & -0.679 & -0.387 & 0.372 \\ 0.5 & -0.263 & 0.427 & -0.706 \\ 0.5 & 0.585 & -0.597 & -0.225 \end{pmatrix} \qquad (7.33)$$
From Eq. (7.32), the two largest positive eigenvalues, 25250 and 13790, are identified, because the number of selected eigenvalues depends on the dimensionality of the visual space; in this case, it is equal to 2. Their corresponding eigenvectors are located in the fourth and third columns of the matrix in Eq. (7.33), respectively. Based on Eq. (7.27), the final coordinate matrix of the four cities (X) in the low dimensional presentation space is calculated in Eq. (7.34). The final locations of the four cities are Milwaukee (88.8, 65.5), Green Bay (59.1, -45.4), La Crosse (-112.2, 50.1), and Wausau (-35.8, -70.1) (See Fig.
7.1). It is worth pointing out that because the final visual display of the four cities is based on the distances among the four cities, it may not reflect the real city directions on a geographic map. It simply reflects relative relationships among the four cities in terms of distance.
$$X = E_2\, \Lambda_2^{1/2} = \begin{pmatrix} 0.559 & 0.558 \\ 0.372 & -0.387 \\ -0.706 & 0.427 \\ -0.225 & -0.597 \end{pmatrix}\begin{pmatrix} 158.9 & 0 \\ 0 & 117.4 \end{pmatrix} = \begin{pmatrix} 88.8 & 65.5 \\ 59.1 & -45.4 \\ -112.2 & 50.1 \\ -35.8 & -70.1 \end{pmatrix} \qquad (7.34)$$
The procedure of the classical MDS algorithm is described as follows.

L1  Begin
L2    Generate an object similarity matrix Mn×n as input of the algorithm;
L3    Compute the matrix of squared similarities Mn×n(2) from Mn×n;
L4    Generate a scalar product matrix Bn×n by applying the double centering
L5    method to Mn×n(2);
L6    If the dimensionality of MDS presentation space is 2
L7    Then m=2
L8    Else m=3
L9    EndIf;
L10   Calculate both eigenvalues and eigenvectors of matrix Bn×n, rank the
L11   eigenvalues, and select the m top largest positive eigenvalues and their
L12   corresponding eigenvectors;
L13   Generate the final output coordinate matrix Xn×m from the following
L14   equation: Xn×m = Qm Λm^(1/2);
L15   Project objects in the low dimensional multidimensional space based on
L16   coordinate matrix Xn×m;
L17  End.

Lines L2 to L5 calculate a scalar product matrix. Lines L6 to L9 determine the dimensionality of the low display space. Lines L10 to L12 calculate eigenvalues and their corresponding eigenvectors. Lines L13 to L16 compute the coordinate matrix and then project objects based on it. It is clear that the algorithm is not iterative. The input of the algorithm is a proximity matrix M of investigated objects while its output is a coordinate matrix X of objects in the low dimensional presentation space. The key part of the algorithm is to bridge the proximity matrix to a coordinate matrix through its scalar product matrix and its eigenvalues and corresponding eigenvectors. Since the final visual MDS display is presented in either a two-dimensional or a three-dimensional space, the possible values of parameter m are 2 or 3.
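The whole procedure fits in a few lines of NumPy. The sketch below is an illustrative implementation of steps L2-L16 above (added here; it is not the book's code). Run on the city distance matrix of Eq. (7.28), it reproduces coordinates equivalent to those of Eq. (7.34), possibly with axes reflected because eigenvector signs are arbitrary.

```python
import numpy as np

def classical_mds(proximity, m=2):
    """Classical (Torgerson) MDS for an n-by-n distance matrix."""
    n = proximity.shape[0]
    M2 = proximity ** 2                              # squared proximities (L3)
    J = np.eye(n) - np.ones((n, n)) / n              # centering matrix, Eq. (7.23)
    B = -0.5 * J @ M2 @ J                            # double centering (L4-L5)
    eigvals, eigvecs = np.linalg.eigh(B)             # eigen-decomposition (L10)
    top = np.argsort(eigvals)[::-1][:m]              # m largest eigenvalues (L11)
    return eigvecs[:, top] * np.sqrt(eigvals[top])   # X = Q_m * Lambda_m^(1/2) (L13-L14)

M_city = np.array([[  0, 112, 201, 184],
                   [112,   0, 196,  93],
                   [201, 196,   0, 141],
                   [184,  93, 141,   0]], dtype=float)   # Eq. (7.28)

print(np.round(classical_mds(M_city), 1))                # compare with Eq. (7.34)
```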
Fig. 7.1. MDS display of the four Wisconsin cities
7.1.2 Non-metric MDS

The non-metric MDS analysis method is applied to ordinal proximity (dissimilarity/similarity) data. In order to project ordinal proximity data onto a low dimensional configuration of investigated objects, as classical MDS does, the ordinal proximity data must be converted into non-ordinal scaled data by a so-called monotonic transformation process. Such a transformation process may add "noise" to the data. After the investigated objects in a high dimensional space are projected onto a low dimensional space, relationships among the objects are expected to be preserved in the low dimensional space (see Eq. (7.35)).
$$\delta_{ij} \rightarrow d_{ij}, \quad \text{or} \quad f(\delta_{ij}) = d_{ij} \qquad (7.35)$$
In the context of non-metric MDS, δij is the proximity between two investigated objects (say Di and Dj) in a high dimensional space, and dij is the Euclidean distance between the projected objects Di and Dj in a low dimensional MDS space. Here f(x) is a projection function that converts objects in a high dimensional space to a low dimensional space. It is apparent that the location of a projected object Di in the low dimensional MDS space is affected by the proximities between it and all its relevant objects. Suppose Di is an object (or document) in a low dimensional MDS space which is m-dimensional, and xik (k = 1, 2, …, m) is the coordinate of an investigated object (Di or Dj) in the low dimensional space (see Eq. (7.36)). Here m is usually equal to 2 or 3 so that the projected configuration lies within a visible display space.
The Euclidean distance dij between two objects (Di and Dj) in the low dimensional MDS space is defined in Eq. (7.37).

$$D_i = (x_{i1}, x_{i2}, \ldots, x_{im}), \quad D_j = (x_{j1}, x_{j2}, \ldots, x_{jm}) \qquad (7.36)$$

$$d_{ij} = \left( \sum_{k=1}^{m} (x_{ik} - x_{jk})^2 \right)^{1/2} \qquad (7.37)$$
The quality of the projection is described by a loss function which measures how close the projection arrangement is to the optimum, i.e., its minimum "distortion". The loss function, also called Kruskal stress, stress formula 1, or the goodness-of-fit value, is defined in Eq. (7.38). It states that a least-squares method defines a loss function that is a normalized sum of projection errors over all pairs of objects. Here n is equal to the number of objects processed, and the term in the denominator of Eq. (7.38) is used to normalize the stress value so that stress values fall between zero and one. As a result, it avoids an unnecessary scale impact on the result.

$$S = \left( \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} \left( f(\delta_{ij}) - d_{ij} \right)^2}{\sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}^{\,2}} \right)^{1/2} \qquad (7.38)$$
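For illustration, the stress of Eq. (7.38) can be computed directly from the disparities f(δij) and the configuration distances dij. The following is a minimal sketch added here (assuming NumPy), not part of the original text.

```python
import numpy as np

def kruskal_stress(disparities, distances):
    """Kruskal's stress formula 1 (Eq. 7.38), evaluated over all object pairs."""
    f = np.asarray(disparities, dtype=float)   # f(delta_ij) values
    d = np.asarray(distances, dtype=float)     # d_ij values
    return np.sqrt(np.sum((f - d) ** 2) / np.sum(d ** 2))
```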
There are other loss functions. For instance, Eq. (7.39) is another formula for stress (S-stress) calculation (Takane et al., 1977). This equation is quite similar to the previous one in terms of error calculation and normalization, but slightly different in the way they are calculated.

$$SS = \left( \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} \left( (f(\delta_{ij}))^2 - (d_{ij})^2 \right)^2}{\sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}^{\,4}} \right)^{1/2} \qquad (7.39)$$
The idea behind these equations is that in reality, for a variety of reasons, dij is not always ideally equal to f(δij) in Eq. (7.35). Basically, the stress value is affected by many factors, such as the number of investigated objects, the dimensionality of the low MDS space, the quality of the collected proximities, the quality of the selected regression function, and so on. It is clear that the smaller the differences between them, the more accurate the positions of the projected objects in the low dimensional space. In other words, the greater the stress value is, the greater the distortion is. A rule of thumb for judging the goodness-of-fit of a stress value was provided by Kruskal (1964) (see Table 7.1). For instance, the selected goodness-of-fit values for Kruskal's stress 1 were 0.18 (early period), 0.17 (middle period), and 0.14 (late period) respectively for 100 cases in a study (White, 1998).
Table 7.1. Criteria of goodness of fit

Stress value   Goodness of fit
0              Perfect
0-0.025        Excellent
0.025-0.05     Good
0.05-0.1       Fair
0.1-0.2        Poor
In order to achieve the most faithful configuration of investigated objects in the low dimensional MDS space, we must find a way to minimize the stress value. Minimization of the stress value requires an optimal placement of the projected objects, achieved by adjusting the locations of the projected objects in the low dimensional MDS space to reduce the stress value. This problem can be solved by a so-called regression method. For non-metric MDS, a monotonic regression method is used to find a minimized stress value for an object configuration.

$$\forall\, i, j, m, n: \quad \text{If } d_{ij} \le d_{mn}, \text{ then } \delta_{ij} \le \delta_{mn} \qquad (7.40)$$
The function described in Eq. (7.35) is a monotonic transformation function if the condition in Eq. (7.40) is satisfied. That is, for any objects i, j, m, and n, if the distance between objects i and j is smaller than the distance between objects m and n in the low dimensional multidimensional display space, then in the high dimensional space the proximity between objects i and j is smaller than or equal to the proximity between objects m and n. Because of the equality sign in Eq. (7.40), it is a weak monotonic transformation function. In non-metric MDS, the key point is to find a monotonic transformation function for an optimal object configuration. Kruskal (1964) came up with a successful solution to the problem. His method is quite simple but effective. The actual algorithm is based upon the following rule: rank the proximities of all object pairs in a list, randomly generate a configuration of the projected objects in a low dimensional MDS space, and calculate and record the Euclidean distances among the projected objects in the visual space. Check the list from the beginning; if consecutive distance values are not in proper order, replace them with their average distance value. In other words, if they are out of order, add them up, divide by the number of the disordered values, and update them with their average distance in the visual space. This replacement process caused by disordered distances is also called unification. The replacement leads to rearrangements of the affected objects in the visual space. The unification continues until all distance values in the list are in perfect order; whenever a unification happens, another round of order checking follows. The monotonic transformation process is also called monotonic regression. Note that there are many other monotonic transformation algorithms available, such as the Guttman algorithm (1968). The following example illustrates the Kruskal regression algorithm. There are four objects, and their proximity relationships are shown in Eq. (7.41). The proximities of the four objects and the process of the Kruskal regression are listed in Table 7.2. In Table 7.2, δij is the initial proximity between objects i and j, dij is the distance between objects i and j after they are randomly projected onto the low dimensional MDS space, and Iij, IIij, IIIij, and IVij are the adjusted distances between objects i and j in the successive replacement phases. Examining the randomly generated object distances in the given list, we find that d12 and d13 are out of order.
Table 7.2. An example of the Kruskal regression process

Link(i, j)   δij   dij   Iij   IIij   IIIij   IVij
(1,2)        1     7     4.5   4.3    3.5     3.5
(1,3)        2     2     4.5   4.3    3.5     3.5
(1,4)        3     4     4     4.3    3.5     3.5
(2,3)        4     1     1     1      3.5     3.5
(2,4)        5     6     6     6      6       5.5
(3,4)        6     5     5     5      5       5.5
They are replaced by their average distance 4.5. That is, both Euclidean distances d12 and d13 are set to 4.5, and the corresponding objects are rearranged in the visual space. In the second round of adjustment, d12, d13, and d14 are out of order; therefore they are replaced by their average distance 4.3 and adjusted in the visual space. In the third round, d12, d13, d14, and d23 are replaced by their average distance 3.5 and adapted in the visual space. In the fourth round, d24 and d34 are replaced by their average distance 5.5 and rearranged in the visual space. After four rounds of adjustment, all distance data is in good order and the monotonic data transformation from proximities to distances is completed (see columns δij and IVij in Table 7.2).
$$M_{4\times 4} = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 1 & 0 & 4 & 5 \\ 2 & 4 & 0 & 6 \\ 3 & 5 & 6 & 0 \end{pmatrix} \qquad (7.41)$$
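The unification (averaging) procedure of Table 7.2 is essentially the pool-adjacent-violators idea. The sketch below is an illustrative Python version (added here; not the book's implementation) that, given the initial random distances listed in proximity order, produces the final column IVij of Table 7.2.

```python
def monotonic_regression(distances):
    """Unify out-of-order distances by averaging (cf. the Kruskal regression in Table 7.2).

    'distances' are the d_ij values listed in increasing order of the proximities delta_ij;
    the result is a non-decreasing list of disparities."""
    blocks = []                                   # each block: [pooled average, item count]
    for d in distances:
        blocks.append([float(d), 1])
        # merge backwards while consecutive values are out of order
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            total = blocks[-2][0] * blocks[-2][1] + blocks[-1][0] * blocks[-1][1]
            count = blocks[-2][1] + blocks[-1][1]
            blocks[-2:] = [[total / count, count]]
    result = []
    for value, count in blocks:
        result.extend([round(value, 2)] * count)
    return result

print(monotonic_regression([7, 2, 4, 1, 6, 5]))   # -> [3.5, 3.5, 3.5, 3.5, 5.5, 5.5]
```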
The Scree plot is designed to analyze and display the relationship between stress values and the dimensionality of the low dimensional space in MDS analysis. Its X-axis and Y-axis are dimensionality and stress value, respectively. Studies show that as the dimensionality of the low dimensional MDS space increases, the corresponding stress value decreases. This means that a relatively large dimensionality of the low display space tends to generate a better result for the monotonic transformation. However, if the dimensionality of the low MDS space is larger than 3, the object configuration is no longer visible to people; therefore it is not acceptable for visualization analysis. The Shepard plot illustrates the relationship between the object proximity in the high dimensional space and the transformed distance in the low dimensional configuration space. In this case its X-axis and Y-axis are proximity and distance, respectively. A distribution with little spread (small deviations) along an increasing/decreasing line in the plot indicates a good monotonic transformation fit. It is clear that this plot can be used to evaluate the quality of a monotonic transformation of proximity data. Object location adjustment in the low dimensional space has to be done after disordered distances are replaced by their average distance in the monotonic transformation process. In the monotonic transformation, when average distance replacement or unification is completed, it does not mean that the corresponding
locations of the involved objects in the low dimensional MDS space change automatically. Their locations must change accordingly to reflect the distance changes caused by the replacement or unification. This is achieved by rearranging or adjusting the involved objects in the low dimensional MDS space. As a result, a new object configuration in the low dimensional space is generated. For instance, in Table 7.2, d12 (=7) and d13 (=2) are out of order in column dij. According to the Kruskal regression algorithm, both d12 and d13 are replaced by their average distance 4.5. This implies that the locations of the involved objects 1, 2, and 3, which determine both d12 and d13, have to be adjusted or relocated in the low dimensional MDS space so that the newly replaced distances d12 and d13 are equal to 4.5. In fact, the non-metric MDS algorithm basically includes two optimizations. The first is the optimization of the monotonic transformation of the input ordinal proximities. The second is the optimization of the object configuration (or object location adjustment) in the low dimensional MDS space. Optimization of the monotonic transformation happens first, and then it triggers optimization of the object configuration in the low dimensional space. After the two optimization processes are completed, they are evaluated by calculating a stress value. That is, a stress value is used to measure the goodness-of-fit. The two iterative optimization processes do not stop until the generated stress value reaches an acceptable level. The non-metric MDS algorithm is described as follows.

L1  Begin
L2    Input an ordinal proximity matrix;
L3    Set an acceptable stress threshold (Sa) as comparison
L4    criteria;
L5    Define the dimensionality of a low dimensional MDS display space;
L6    Randomly project objects in the low dimensional MDS display space, or
L7    randomly assign addresses of objects in the coordinate
L8    matrix;
L9    Repeat
L10     Calculate the Euclidean distances between objects in the low
L11     dimensional space based on the coordinate matrix;
L12     Monotonically transform the disordered data;
L13     Adjust addresses of disordered objects in the coordinate matrix;
L14     Calculate current stress value S;
L15   Until S ≤ Sa
L16  End.
Lines L2 to L8 initialize the visual space and other variables. Lines L9 to L15 are a loop which has two optimizations: optimization of monotonic transformation of the input ordinal proximities and consequent optimization of object configuration in the visual space. The loop does not stop until the dynamic stress value reaches an acceptable level.
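As a rough illustration of this loop (added here; it is not the implementation used in the book), scikit-learn's MDS class can run a non-metric MDS directly on the ordinal proximity matrix of Eq. (7.41), treating the proximities as dissimilarities; parameter names may vary slightly between library versions.

```python
import numpy as np
from sklearn.manifold import MDS

M = np.array([[0, 1, 2, 3],
              [1, 0, 4, 5],
              [2, 4, 0, 6],
              [3, 5, 6, 0]], dtype=float)     # ordinal dissimilarities of Eq. (7.41)

nmds = MDS(n_components=2, metric=False, dissimilarity='precomputed', random_state=0)
X = nmds.fit_transform(M)    # coordinate matrix of the four objects in the visual space
print(X)
print(nmds.stress_)          # final stress value of the iterative optimization
```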
Fig. 7.2. MDS visual display of nine papers (axes: Dimension 1 and Dimension 2)
Both the input and the output of this algorithm are the same as for the previous classical MDS algorithm. The difference between the two algorithms is the way that the coordinate matrix of investigated objects is found for a graphic representation. The non-metric MDS algorithm attempts to minimize the squared differences between optimally scaled proximities of objects in a high dimensional space and distances of the objects in a low dimensional visual space through iterative processes. It is worth pointing out that after a unification process the newly generated stress value should be smaller than the stress value yielded before the unification. This is very important because it ensures that as the number of unifications increases, the corresponding stress value will eventually drop below a given threshold. The following is a simple example of co-citation analysis. Nine papers in information visualization are analyzed. The first three papers (pfn1, pfn2, and pfn3) are about Pathfinder associative networks, the next three papers (som1, som2, and som3) are about self-organizing maps, and the last three papers (mds1, mds2, and mds3) are about MDS analysis. The proximity of two papers is defined as the number of their co-citations. The proximity matrix of these 9 papers is shown in Eq. (7.42). In this MDS analysis, the S-stress formula (see Eq. (7.39)) is used, the stress threshold is set to 0.001, and the final round stress value is 0.00048. The MDS visual display of the 9 papers is shown in Fig. 7.2. It is apparent that (som1 and som2), (mds1 and mds3), and (pfn1 and pfn2) are clustered together in the MDS
Fig. 7.3. Shepard plot of the monotonic transformation (X-axis: observations; Y-axis: disparities)
display space, respectively. The corresponding Shepard plot is shown in Fig. 7.3. The dots in the figure form an increasing line with little spread, indicating a good monotonic fit.
$$M_{9\times 9} = \begin{pmatrix} 0 & 5 & 2 & 0 & 0 & 1 & 0 & 0 & 0 \\ 5 & 0 & 3 & 0 & 0 & 0 & 0 & 0 & 0 \\ 2 & 3 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 3 & 6 & 0 & 0 & 0 \\ 0 & 0 & 0 & 3 & 0 & 7 & 0 & 0 & 0 \\ 1 & 0 & 1 & 6 & 7 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 6 \\ 0 & 0 & 0 & 0 & 0 & 0 & 2 & 0 & 2 \\ 0 & 0 & 0 & 0 & 0 & 0 & 6 & 2 & 0 \end{pmatrix} \qquad (7.42)$$
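A rough way to reproduce an analysis like the one behind Fig. 7.2 is to turn the co-citation counts of Eq. (7.42) into dissimilarities and feed them to a non-metric MDS routine. The sketch below is added here purely for illustration: the book's figure was produced with an S-stress based tool, and the simple count-to-dissimilarity conversion used here is an assumption, not the book's procedure.

```python
import numpy as np
from sklearn.manifold import MDS

# Co-citation counts of the nine papers, Eq. (7.42); larger = more similar
C = np.array([[0, 5, 2, 0, 0, 1, 0, 0, 0],
              [5, 0, 3, 0, 0, 0, 0, 0, 0],
              [2, 3, 0, 0, 0, 1, 0, 0, 0],
              [0, 0, 0, 0, 3, 6, 0, 0, 0],
              [0, 0, 0, 3, 0, 7, 0, 0, 0],
              [1, 0, 1, 6, 7, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 2, 6],
              [0, 0, 0, 0, 0, 0, 2, 0, 2],
              [0, 0, 0, 0, 0, 0, 6, 2, 0]], dtype=float)

D = C.max() - C              # assumed similarity-to-dissimilarity conversion
np.fill_diagonal(D, 0)

labels = ['pfn1', 'pfn2', 'pfn3', 'som1', 'som2', 'som3', 'mds1', 'mds2', 'mds3']
X = MDS(n_components=2, metric=False, dissimilarity='precomputed',
        random_state=0).fit_transform(D)
for name, (x, y) in zip(labels, X):
    print(name, round(x, 2), round(y, 2))
```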
7.1.3 Metric MDS

Like non-metric MDS, the proximities of objects in metric MDS analysis can also be transformed into Euclidean distances in a low dimensional MDS display space by iterative processing. Fortunately, there is a plethora of metric transformation
algorithms available for metric MDS analysis. They include the absolute transformation (the proximity between two objects is simply equal to their corresponding Euclidean distance in the low display space), the ratio transformation (the proximity between two objects is proportional to the corresponding Euclidean distance), the interval transformation (see Eq. (7.43)), the logarithmic transformation (see Eq. (7.44)), and other types of transformation functions.

$$f(p_{ij}) = a + b \cdot p_{ij} \qquad (7.43)$$

$$f(p_{ij}) = a + b \cdot \log(p_{ij}) \qquad (7.44)$$
In the above equations, a and b are two parameters of the metric proximity data transformation. For metric MDS, a stress value also needs to be calculated. For a given metric transformation function, finding the stress-minimizing transformation is considerably easier than the monotonic regression required by non-metric MDS. In other words, optimization of the input proximity transformation in metric MDS is simpler than that of non-metric MDS. Optimization of the object configuration (or object location adjustment) in the low dimensional MDS space is still necessary for metric MDS. The basic procedure of metric MDS is similar to that of non-metric MDS.
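For instance, the parameters a and b of the interval transformation in Eq. (7.43) can be estimated by a simple least-squares fit between the proximities and the current configuration distances. The sketch below is an added illustration under that assumption (using NumPy), not a prescribed procedure.

```python
import numpy as np

def interval_transform(proximities, distances):
    """Fit f(p) = a + b * p (Eq. 7.43) by least squares and return the disparities."""
    p = np.asarray(proximities, dtype=float)
    d = np.asarray(distances, dtype=float)
    b, a = np.polyfit(p, d, 1)          # slope b and intercept a of the linear fit
    return a + b * p                    # transformed proximities used in the stress formula
```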
7.2 Implications of MDS techniques for information retrieval
7.2.1 Definitions of displayed objects and proximity between objects

Basically, the application of MDS techniques to a domain is affected by many factors, such as identification of the domain, definition of the objects in the domain, definition of the proximity among objects, the dimensionality of the low dimensional MDS space, selection of a suitable MDS algorithm, selection of a monotonic regression function if the selected proximity data type is ordinal, selection of an acceptable stress threshold if a non-metric/metric MDS algorithm is chosen, and so on. Among them the most important and fundamental factors are the definition of the objects in the domain and the definition of the proximity among objects. That is, one must explicitly define the meaningful objects which are investigated and ultimately displayed in a low dimensional MDS display space, the proximity between a pair of investigated objects, and the proximity data type. Implications and applications of MDS in information retrieval can be roughly categorized into two groups in terms of proximity definition: one uses a co-citation method to define the proximity metric, and the other uses a non-co-citation method such as a traditional distance-based similarity or an angle-based similarity measure. Of course, the applications can
also be classified by the object types that are investigated and displayed in MDS visual analysis. Citing papers and cited papers, which are published prior to the citing papers, can literally form a citation network if citing papers and cited papers are connected properly. The citation networks embody the communication patterns of millions of scholars both living and dead. These patterns show how researchers go about embedding their work, both cooperatively and competitively, in the works of prior authors (Small, 1999). Citation information can be used not only to connect a citing paper to its cited papers to form a citation network, but also to analyze the semantic connection between two citing papers. It is natural and intuitive to employ the co-citation information of two papers to define their proximity. The proximity between papers can be further utilized for the construction of an input proximity matrix for MDS analysis. Similarly, the way of using co-citation to define proximity between papers can be expanded to proximity between authors who co-cite the same authors, or journals which share the same journals appearing in references, or research areas/topics that share the same papers. The input proximity matrices are author-author proximity matrices, journal-journal proximity matrices, and topic-topic proximity matrices, respectively. Therefore, investigated and displayed objects in a MDS analysis can take different forms such as regular journal papers (York et al., 1995; Chalmers and Chitson, 1992), authors (White, 1998), journals (Nelson, 2005), or research areas (Small and Garfield, 1985; Small, 1973). As investigated object types change, MDS analysis presents and reveals different semantic relationship pictures at different levels. Notice that the number of investigated objects at a high level such as journals or research areas is usually smaller than that of investigated objects at a low level such as papers or authors. It implies that a high level MDS configuration may display fewer objects in the visual space than a low level MDS configuration does. Using the traditional co-citation method and strategy, MDS analysis can be expanded to visualization of hyperlink-based Internet information because hyperlinks and citations share a similar structure and function in nature. In this case, Web pages, the investigated objects, are equivalent to papers or documents; hyperlinking and hyperlinked relationships are equivalent to citing and cited relationships; and co-hyperlinks, which are employed to define the proximity between two Web pages, are equivalent to co-citations. Since hyperlink co-citation analysis can also be applied to Web page affiliations, portals, or fields, the displayed objects in the MDS display space can also be affiliations, portals, or fields. The proximity between two objects can also be defined by using a non-co-citation method. An object, say a document, can be characterized by a group of attributes or keywords. The relationships between objects and attributes can be described in an object-attribute matrix like a document-term matrix. The proximity between two documents can be measured or calculated by a distance-based similarity measure, an angle-based similarity measure, or other applicable similarity measures in the vector space. Therefore, the final object-object (document-document) proximity matrix as input of MDS analysis can be produced based upon the object-attribute matrix and a selected similarity method. It is apparent that the
input object-object proximity matrix is generated indirectly through an object-attribute matrix. Another scenario for using a non-co-citation method in MDS visual analysis is the visualization of information system users. Users generally submit a query to an information retrieval system to search for relevant information. Each query consists of several query terms representing the users' information needs. Users with similar information needs tend to use similar search terms in their queries. The query terms submitted by users can therefore be used to define and categorize the users. Relationships between users and query search terms can be utilized for MDS analysis. In this case, the displayed objects in the MDS display space are users who search an information retrieval system, and the proximity between two users can be calculated based upon the shared query terms submitted by the two users. In other words, a user-user proximity matrix is generated based on the number of shared query terms. Finally, users are projected onto the low dimensional MDS space, and users with similar search behavior are clustered in the display space.
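As an illustration of the non-co-citation approach described above (added here; the document-term matrix used is purely hypothetical), an object-object proximity matrix can be derived from an object-attribute matrix with an angle-based (cosine) similarity measure:

```python
import numpy as np

def document_proximity(doc_term):
    """Document-document proximity from a document-term matrix via cosine similarity."""
    doc_term = np.asarray(doc_term, dtype=float)
    norms = np.linalg.norm(doc_term, axis=1, keepdims=True)
    unit = doc_term / np.where(norms == 0, 1, norms)   # guard against empty documents
    return unit @ unit.T                               # proximity[i, j] = cos(D_i, D_j)

# Hypothetical document-term matrix: 4 documents described by 5 term weights
doc_term = np.array([[2, 0, 1, 0, 0],
                     [1, 1, 0, 0, 0],
                     [0, 0, 0, 3, 1],
                     [0, 0, 1, 2, 2]])
print(np.round(document_proximity(doc_term), 2))
```

A dissimilarity matrix such as one minus this similarity could then serve as the MDS input proximity matrix.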
7.2.2 Exploration in a MDS display space

After all investigated objects are projected onto a low dimensional space, an object configuration is produced and the MDS display space is completed. Then users can explore the MDS display space and begin an information discovery journey. According to MDS algorithms, the MDS display should provide a holistic overview of the investigated objects, where relevant or related objects are supposed to be projected into a cluster in the space. Users can zoom in on an interesting object in the visual space and observe how it is related to its neighboring objects. Clicking an object triggers the interface to present more detailed information about the object. Selecting a neighboring object allows users to observe related objects or discover new objects of interest. Users can use an adjustable radius to narrow down a particular area; the objects outside the defined sphere are colored grey so as to highlight the objects in the selected area (Chalmers and Chitson, 1992). In a co-citation based MDS visual display, if a large cluster is identified, it implies that active intellectual activity is taking place in the field and progress is being made. If research areas are visualized, interdisciplinary research areas can be illustrated and easily identified; if journals are visualized, citation patterns among the involved journals can be demonstrated; and if authors are visualized, changes in leading authors and their impacts can be shown. It is interesting that if a time dimension is added to the MDS analysis, a series of MDS visual displays produced in different time periods can be compared and analyzed. This reveals the evolution of the investigated objects during a certain period of time. For example, if the objects of the MDS visual displays are authors in a research field, such a series would show dynamic changes of leading and influential authors in that field over a period of time. In order to search in a MDS display, users can submit a query consisting of search terms. Query terms are matched with terms in the projected objects (documents). Search results are demonstrated by highlighting the related projected objects which contain the search terms in the MDS visual space. To facilitate understanding
of the search results, some systems color relevant objects and non-relevant objects differently in the MDS display space. The degree of darkness of a highlighted document reflects its relevance to the query: the darker the object icon in the MDS display space is, the more relevant it is to the query, and vice versa (Chalmers and Chitson, 1992). Responding to a query term can be as computationally expensive as a sequential search; adding a keyword index would definitely accelerate search processing. Users may explore a MDS space in another, quite different way. They can submit a seed document which best fits their current interests; this seed document initiates the MDS system to automatically create a local cluster related to the seed document; users can observe the cluster on the screen, and then select another seed document from the cluster and start another round of exploration (Small, 1994).
7.2.3 Discussion

Without doubt the MDS technique can be applied to the information retrieval field to provide users with a visual browsing and search environment. Unlike the application of MDS in psychology, in which the number of investigated objects is relatively small, the application of the MDS technique in information retrieval is a quite different scenario. For information retrieval, the number of involved objects (say documents) could reach thousands or even more. This huge number of displayed objects in a low dimensional space raises concerns in terms of efficient system implementation and information representation in the MDS display space. As a result, it may lead to intolerable implementation and system response times if the MDS system has an interactive mechanism between users and the interface, and to a possible overload of objects in the visual display space. To solve the problem, people use the supernode method (Small and Garfield, 1985), which visualizes object clusters and individual objects at different levels. In the MDS visual display space, based on a clustering algorithm, documents are clustered first so that highly related documents in terms of co-citation form new supernodes. These supernodes, instead of individual documents, are presented as displayed objects at a high level. Documents within a supernode might also be visualized, but at a lower level, if users zoom in on a selected cluster. The multiple level presentations in the MDS display space not only simplify object presentation in the visual space, but also provide users with a meaningful categorical hierarchy for information browsing. The categorical hierarchy reveals hidden, complex, and structural inter-citation relationships among objects. Users can drill down to an area of interest in a multiple level display environment, and observe documents in that area in detail. When users keep progressively drilling down, they eventually reach the lowest level of the display to browse details of an individual document. Finding an efficient computation of the stress value in each of the iterations can result in a dramatic decrease in implementation time for a large number of
investigated objects. As we know, the MDS technique minimizes the stress, or difference, between the high dimensional proximities and the low dimensional distances among objects in order to achieve the optimal configuration. As the number of involved documents increases, the computational complexity grows rapidly. A study (York et al., 1995) introduced a so-called anchored least stress method to tackle the problem. All documents are roughly clustered at the beginning and the centroids (or geometric centers) of these clusters are computed. During MDS document projection and iterative position adjustment processing, the distance between two objects in the low dimensional space is determined by calculating the distance from the position of an object to the centroid of the cluster which contains the other compared object. In other words, the linear regression minimizes the squared differences between the observed proximities and the fitted distances to the centroids. This significantly reduces the number of object pair comparisons and thus decreases the computational complexity of the algorithm, because the number of clusters is much smaller than the number of documents in a collection. However, the efficiency improvement of the method comes at the cost of accuracy. Another potential problem in the application of MDS to information retrieval is the intuitive representation of projected objects in a low dimensional MDS space. It is extremely important for users to easily understand and meaningfully interpret the graphic presentation. Toward that aim, the MDS approach in combination with the so-called "ecological approach" was introduced to take advantage of the visual appearances of natural formats that humans have learned to interpret visually as part of the biological heritage of their species' history on the Earth (Wise, 1999). In the visual "ecological landscape", a MDS display space consisted of a group of ecologically connected local landscapes. Each landscape represented an object cluster. The size of each local landscape was related to the number of documents containing the thematic term which defined the local landscape. The shape of each local landscape was influenced by related clusters. A standard Gaussian function was employed to smooth the local landscape surface, making the landscape look more natural and smooth. A document was positioned in the "ecological landscape" based upon its indexing terms, the thematic term, and the category assigned to the document. This method mapped the content of documents directly onto the visual presentation and was scalable to larger document collections. There are a variety of MDS analysis software packages available. Both commercial statistics software packages like SPSS and customized MDS packages like XGvis (Buja et al., 2001) provide basic or advanced MDS analysis features. The major difference between commercial statistics software and customized MDS packages is that the former usually does not offer an interactive means that allows users to navigate and explore the MDS space, while the latter enables users not only to interact with the MDS display, but also to search it.
7.3 Summary

The aim of MDS visual analysis is to generate a low dimensional space from a high dimensional space to illustrate the hidden relationships of investigated objects. However, the way of generating a low dimensional configuration of objects differs between classical MDS and non-metric (metric) MDS. Classical MDS uses a linear algebra solution for the problem. It requires double centering of the squared proximity matrix and calculation of the eigenvalues and eigenvectors of the resulting scalar product matrix. Although the classical MDS algorithm does not involve an iterative process, each step is computationally complex and makes heavy demands on storage. As the number of analyzed objects increases, the computational complexity and the demands for storage resources increase dramatically. The non-metric (metric) MDS method looks for the best match between the original proximity of two objects and their Euclidean distance in a low dimensional space in terms of a least sum-of-squares error. In the low dimensional MDS space, the Euclidean distance between any two investigated objects should reflect the degree of their proximity in the high dimensional space. The initial configuration of investigated objects is based on a random mapping in the low dimensional space. A stress function that compares object proximities with their distances in the low dimensional space is employed to evaluate the quality of the projection. An iterative improvement procedure is applied until an acceptable minimum of the stress function value is reached. The Kruskal algorithm, which is used for the minimization, is iterative and simple. Its computational complexity is about O(N²), and in practice it is almost O(N) (Sidiropoulos, 1999). Even so, applying traditional MDS to a very large data set may be prohibitively slow. The non-metric (metric) MDS algorithm does not guarantee the uniqueness of its output configurations because of possible errors in the minimization of stress values (Quist and Yona, 2004). The rule of thumb for the relationship between the dimensionality of a MDS display space and the number of projected objects is that a k-dimensional representation requires at least 4k objects (Borg and Groenen, 1997). It is worth pointing out that the classical MDS approach requires that the MDS input proximity matrix be a square matrix. In other words, the number of columns should be equal to the number of rows in the matrix, and both the column heads and row heads of the proximity matrix are the investigated objects. In reality, many proximity matrices may not be square object-object matrices. In this case, a classical MDS method is not directly applicable. There are two possible solutions. One is to construct a square object-object matrix by using an object-attribute matrix; the other is to employ a non-metric (metric) MDS approach, which can handle a proximity matrix that is not a square matrix (Bartell et al., 1992). Notice that there are many other complex MDS algorithms, such as replicated MDS, weighted MDS, and so on, which are not discussed in this chapter.
Chapter 8 Internet Information Visualization
The Internet has become the primary information resource for people. It not only poses unprecedented theoretical and practical challenges for information retrieval visualization but also provides an enormous opportunity for its application. Because of the richness and diversity of Internet information, almost all information retrieval visualization approaches and techniques can find their niches in such a dynamic environment. The Internet makes information retrieval visualization transcend the traditional methods, objects, and application domains. Information visualization techniques can be used to alleviate the notorious "lost in cyberspace" syndrome, or disorientation during Internet navigation, making navigation smoother and more comfortable. The Internet has been the driving force behind major changes in information retrieval and other related fields. It has opened a new chapter for the field of information retrieval.
8.1 Introduction
8.1.1 Internet characteristics

The emergence of the Internet, first as a communication infrastructure, later as a distributed computing environment and information repository, has transformed the way that information is exchanged and shared. Standardized formats and protocols like HTML and HTTP have been successfully defined along with the development of browsers; they make accessing and publishing information on the Internet very easy and convenient. Data published on the Internet can be accessed globally from a wide range of platforms. With the Internet, global and convenient information access is taken for granted. No other information system besides the Internet has had such a significant impact on education, culture, economy, science, technology, and society.

• Internet scale. There are a vast number of Web sites on the Internet. No one can give an exact number for the size of the Web. Many surveys and studies have been conducted regarding the dramatic growth of the Internet. One study indicated that the number of hosts has been roughly doubling every year (Kobayashi and Takeda, 2000). It appears that existing estimates significantly underestimate the size of the Web (Lawrence and Giles, 1998).
• Complexity of contexts. The contents of a Web site can be very complex. Although English is still the dominant language, a Web page can be in any language. The size of a Web site varies significantly, ranging from a single page to hundreds, if not thousands, of pages. The Internet can cover any topic such as history, science, culture, sport, entertainment, news, business, etc. The ways of describing the contexts of Web pages are quite different. Some use metadata while others do not. Even for those with metadata descriptions, usage behaviors may vary in different user groups.
• Diversity of information types. Without a doubt, textual information is the primary information format. But information in other formats such as images, animation, music, audio-video, games, computer applications, and so on, is growing dramatically.
• Dynamics of Web pages. Unlike the traditional print counterpart, whose contents no longer change after it is formally published, Web site contents change constantly. The contents of a Web site can be revised, added to, appended, and even deleted. This characteristic can create inconsistencies between database indexes and Web pages, and possible dead links if a database is not updated regularly.
• Users' diversity. Heterogeneous user populations pose a new challenge. Internet users range from elementary school students to experts, from inexperienced users to computer wizards. Every user group may have its own information seeking behavior. In addition, the number of Web users is large and growing. Millions and millions of users all over the world access the Internet for a variety of purposes on a daily basis.
• Variability of Web page quality. An open Internet is a double-edged sword. It makes online publishing easy, convenient, and affordable for ordinary people, but it raises a new question regarding quality. Unlike a regular journal paper that is peer-reviewed, Web pages may be posted on the Internet without any quality control process. The authenticity and quality of a Web page are major concerns for users.
8.1.2 Internet information organization and presentation methods

Hyperlinks

Hyperlink techniques allow users to present, organize, access, and browse information within a hyper-space. The hyper-space consists of two basic elements: page nodes and anchors (hyperlinks) embedded within the page nodes. Page nodes are literally connected by embedded hyperlinks. A hyperlink creates a meaningful association between two nodes. The hyper-space is a huge network connected by hyperlinks. Internet browsers are capable of displaying Web pages with embedded hyperlinks and allowing users to jump from a Web page to another hyperlinked Web page.
A hyperlink is created by Web authors to link an embedded object item such as a keyword, sentence, title, topic, or image in a textual context to other related Web pages. A hyperlink in a textual context is supposed to make topic connections and associations in a more natural, convenient, and smooth way. In contrast to the traditional linear information presentation method, the hyperlink technique is non-linear. When readers simply click a hyperlink, they are automatically directed to a Web page which contains the concept or topic that the hyperlink represents. Readers may then easily get more detailed and explanatory information about the concept or topic on a different page and continue their navigation journey. It is clear that the hyperlink technique can accommodate the various needs of users due to its incredible flexibility. However, since hyperlinks within a Web page are arranged and embedded by Web authors rather than Web surfers, the hyperlink framework of a Web page may or may not fit the mental model of some users. That is because, in part, there is no mandatory hyperlink implementation standard, and hyperlink implementation is totally dependent on Web authors' interests, preferences, and understanding of the Web page content. When users visit a Web page, they have to keep making decisions on whether they should click an embedded hyperlink and jump to another Web page. If a hyperlink misleads them to an unwanted Web page, or they want to finish the unfinished parts of previous Web pages, or they are no longer interested in the new Web pages, then users have to return to previously traversed Web pages. Current Web browsers are not equipped with an effective means to illustrate both the physical and the information contexts for users. These factors force users to remember the contexts of the previous Web pages from which they jumped to other Web pages so that they can return to the appropriate location. Without a doubt, this increases users' short-term memory burden. This kind of information seeking behavior can directly result in cognitive overload. As a result, users may easily fall into a state of disorientation, frustration, discomfort, and confusion. This is also called the "lost in cyberspace" syndrome. If the "lost in cyberspace" syndrome occurs, it jeopardizes users' confidence, increases anxiety, and causes users to lose patience in the course of exploring the cyberspace. They may spend more time and effort and get nothing out of searching the Internet. Users may be more interested in content-based relationships rather than author-created hyperlink-based relationships. In most cases, content-based relationships between two Web pages are only partially reflected in hyperlink-based relationship structures (Fowler et al., 1996). In addition, Web authors may not be aware of all related Web pages that should be linked, and it is not possible or necessary to hyperlink all of these related pages to their Web pages. In other words, if two Web pages are not connected by a hyperlink, it does not mean that the two Web pages are not semantically related. Hyperlink techniques are powerful and flexible for the organization and presentation of Internet information. But there are some inherent weaknesses with hyperlinks that also create problems for Internet navigation. A network based upon embedded hyperlinks can reveal some semantic relationships among hyperlinked Web pages, but certainly not all.
Subject directory

In order to improve and facilitate users' information exploration in the cyberspace, subject hierarchical frameworks similar to traditional classification systems have been developed to categorize and organize the cluttered Web pages on the Internet. The subject hierarchical framework is also called a subject directory. It is expected that with the help of such a subject directory, Internet surfers become better oriented in the cyberspace. A subject directory is regarded as subject guidance that provides users with predefined meaningful categories, making users' navigation on the Internet more efficient and effective. For this reason more and more portals and search engines have integrated a subject directory component. Most subject directories support keyword searching features, but entries are also listed under one or more hierarchical subject terms. Because Web pages are selected and categorized by human experts in most subject directories, a number of problems arise. Categorization processing is labor-intensive; it needs hundreds of people, if not thousands, to index and classify Web pages. Only a very limited portion of Web pages (compared to the existing and available Web pages on the Internet) can be handled. The time lag of categorized Web pages in subject directories also becomes a concern. Since indexing and classifying a Web page is time-consuming, it barely keeps pace with the dramatically growing number of Web pages on the Internet. Unsatisfactory category granularity (specificity) in a subject directory prevents users from effectively accessing more specific information. For example, Yahoo, the most famous subject directory on the Internet, is substantially large, but the average depth of the Yahoo directory is around 4-5 levels; it is still regarded as not specific enough for the billions and billions of Web sites on the Internet. It is important to understand that a subject directory will not cover every piece of information on the Internet. If a subject directory is built by humans (rather than by computer programs), its coverage is much narrower than a search engine database.
8.1.3 Internet information utilization

Searching
The most used and powerful way of finding information on the Internet is to use a search engine. It is not surprising that Internet users increasingly rely on search engines to meet their information needs. About 85% of the Web users surveyed claimed to use search engines to find specific information of interest (Kobayashi and Takeda, 2000). As a primary information search means, a search engine accepts a query which usually consists of several keywords in conjunction with Boolean operators, parses the submitted query to determine the logical relationship among the keywords, finds the best-matching Web pages in its huge and regularly updated databases, ranks them based upon the relevance between the query and the retrieved results, and finally presents them to searchers. It is important to understand that search engines do not directly search the Web pages themselves. Instead, they search indexed Web pages in databases. Crawlers are responsible for traversing Web pages on the Internet, extracting keywords, and building the databases.
Each search engine looks through different databases, and it is the databases that determine, to some extent, the scope of the retrieved results. Differences in retrieval results also depend on the sophistication of the algorithms a search engine employs when it looks through its databases. Users do not respond well to the linear results lists returned by search engines. Mounting evidence shows that most users only browse the first screen of a search results list and rarely venture beyond the first page. This suggests that the linear ranking method, though simple and intuitive, is not very productive or effective. Voorbij (1999) found in a survey that 67% of Internet users agreed or strongly agreed that it is difficult to conduct a search on the Internet, especially for inexperienced users.

Browsing
The earliest way to look for information on the Internet is browsing by hyperlinks. It is simple and natural, and it can be done without the assistance of any search mechanism. Users simply land at any Web page, browse the contents of the page, and click an interesting hyperlink for another page; then another similar browsing cycle begins. The browsing cycle stops when users find satisfactory information. An Internet browser is the dominant means of supporting hyperlink browsing. Current Web browsers can effectively handle texts, images, animations, and other types of multimedia information. Unfortunately, a browser basically displays only one Web page per window; multiple Web pages cannot be displayed in one browser window. Notice that Internet navigation involves traversing multiple Web pages. Although current browsers support flexible hyperlink techniques which can embed multiple hyperlinks in one Web page, they do not display the semantic relationships of multiple Web pages in a one-page context. A hyperlink just directs users to another page. Surfers need this kind of in-depth neighborhood information to facilitate decisions about where they should explore next and to regain control if they get lost. Web browsers provide only limited navigation assistance features. These features, such as back, forward, home, and bookmarks, are ineffective in resolving complex navigation problems.

Another way to facilitate browsing is subject directory browsing. A hierarchical subject directory can help users to establish a spatial and mental information model which can alleviate disorientation in cyberspace. Subject directories organize Web sites by subject categories, allowing users to choose a subject of interest and to browse the list of resources within that category. Users conduct their searches by selecting a series of progressively narrower search topics from the lists of descriptors provided in the directory. In this fashion, users are directed to a destination through each more specific layer of the hierarchy. At each layer, they may choose an appropriate category from a pool of related sibling categories. Some users prefer subject directories because they can control and change traversal paths at will. If a category looks promising, they can select it and explore it. In a subject directory users can easily return to upper (lower) level layers for broader (narrower) information.

In summary, the searching/querying paradigm on the Internet is suitable for using search engines to look for specific information which can be used as a
navigational hub for further exploration, while the subject directory browsing paradigm is suitable for selecting categories of a subject directory to look for more general and vague information. The number of resources that users can find in a subject directory is generally far less than that found through a search engine. Search engine querying and subject directory browsing are complementary; they accommodate different kinds of users' information needs.

Web traffic analysis
As we know, countless people use the Internet every day. The number of Web users is phenomenal. Online activities and information about Internet users, such as visitors' IP addresses, entry ports, access times, paths traversed, Web pages browsed, files downloaded, search terms used, and so on, are faithfully recorded and kept in the Web logs of visited servers. These logs may be converted into valuable data for Internet traffic analysis. Internet traffic analysis can be used to (1) identify and understand hidden and invisible user visit patterns, and evaluate the features and services provided by an investigated Web site; (2) interpret and solve potential problems of Web site use; (3) detect and prevent malicious attacks from hackers; (4) optimize a Web site in terms of its information organization structures and presentations; and (5) maximize utilization of Web site information resources. Web traffic analysis has attracted more and more attention from both researchers and practitioners. Webmasters, network administrators, system managers, and system developers, as a growing Internet user group, directly benefit from Internet traffic analysis.
8.1.4 Challenges of the Internet

Nielsen (2000) identified three fundamental questions that Internet surfers face when they navigate cyberspace: "where am I now?", "where have I been?", and "where can I go next?". It seems that although present Internet information organization and presentation approaches, Internet search mechanisms, and Web browsers are indispensable and valuable for a Web search, they cannot help surfers answer these questions satisfactorily. Users still suffer from the "lost in cyberspace" and disorientation syndrome. The inherent weaknesses and limitations of hyperlink techniques, search engine tools, and subject directory methods are primarily to blame. The situation calls for new methods and techniques to navigate the Internet smoothly and to locate relevant information accurately in an extremely large, dynamic, and real-time Web. Analysis of users' online activities adds a new and unique dimension to Internet information use. More and more organizations are interested in users' online activities for a variety of reasons, ranging from business decision making, network security, and user visit behavior studies to system maintenance. These organizations are seeking a convenient, intuitive, and powerful solution for mining rich and pertinent information.
8.2 Internet information visualization

Information visualization techniques open a new avenue toward solving these problems. Spatial visualization ability is one of the most important predictors of task performance for retrieving information in an information space. Thus, when users retrieve information on a Web site with a conventional design or information structure, such as subject directories and flexible hyperlink networks, they have to construct the information structures mentally and hold them in working memory while searching for information on the Web. This increases the cognitive burden on Internet surfers. Visualizing information structures, which reduces memory overload, can benefit Internet users (Zhang and Salvendy, 2001).

Information visualization can convert a large amount of information into meaningful and interpretable visual representations in which the relationships among a Web page and its "neighboring" Web pages are demonstrated. It provides users with intuitive visual contexts for information browsing. Since the "neighborhood" of a Web page is presented, both what has been browsed and what will be browsed become much clearer for Web surfers. Two- (or three-) dimensional spaces can hold much information in a more meaningful way. Moreover, information visualization techniques are usually equipped with powerful interactive mechanisms. These mechanisms give Web surfers more control over cyberspace navigation than traditional searching and browsing methods. Finally, information visualization allows people to use both perceptual and cognitive reasoning to carry out complex tasks like Internet information retrieval. These unique characteristics of information visualization can mitigate, if not totally eradicate, the notorious "lost in cyberspace" and disorientation syndrome. Information visualization techniques therefore hold great promise for Internet users.

Pioneering work on Web-based visualization was done by Ang et al. (1994), who exploited the MIME-typing idea to allow visualization data to be sent over the Web and processed by a Web browser. Since then, more Internet information visualization techniques have emerged. Three areas were identified in which information visualization techniques can make significant contributions (Eick, 2001): visualizing Web site structure as a visitor's navigation aid; illustrating visit paths and flows through a Web site to help Web site designers build a more effective site; and monitoring the site's real-time activities to help Web site administrators and managers run their businesses more efficiently and effectively.

Due to the particular characteristics of Internet information, Web-based information retrieval visualization should possess both good portability and flexibility. Web-based visualization systems should be portable and applicable to all sorts of computing platforms to accommodate global access, because computing environments such as operating systems and Web browsers are very diverse. For instance, operating systems may differ (UNIX, LINUX, WINDOWS NT, WINDOWS XP, WINDOWS VISTA) and versions of the same operating system may also differ. The Web-based visualization techniques
should allow users to visualize data in a customized way with a minimum amount of work to configure settings and to set parameters for operations. In other words, they should require minimum effort to install the systems and visualize their data.
8.2.1 Visualization of Internet information structure

Subject directory visualization
An intuitive graphic subject directory structure can serve as subject guidance for Internet navigation. In a visualized subject directory structure, categories and sub-categories serve as nodes, parent-child relations in the subject directory serve as edges, Web pages serve as leaves, and the main entry of the subject directory serves as the root of the hierarchical graph. This structure is supposed to facilitate users' navigation and provide insight into a Web site's subject organization, and therefore help users gain confidence in information browsing. A hyperbolic technique is widely used for this purpose.

Euclidean geometry was invented two thousand years ago, and many visualization methods are built in Euclidean spaces. In a non-Euclidean (hyperbolic) space, multiple lines may pass through a given point and still be parallel to a given line, whereas in a Euclidean space only one such line exists. In other words, in a hyperbolic space a "straight line" looks like a curve. An object grows exponentially in a hyperbolic space as it approaches an observer, whereas it grows only linearly in a Euclidean space. These unique features of the hyperbolic space can be utilized to maximize the use of screen real estate and to exaggerate local details without losing their contexts. It can be used to organize and present a Web subject directory in a flexible hierarchical layout. The flexible and movable tree structure allows users to drag a part of the tree of interest to the center of the screen, and all branches are automatically rearranged accordingly. The technique distorts the tree layout in its space so that the part of the tree of interest is emphasized visually in full detail, while distant branches and associated leaves on the periphery are sidelined. In this way the hyperbolic method provides users with a huge amount of dynamic display room, accommodating a large subject directory with an exponential number of nodes and branches. It possesses good manipulability because users may drag any object to any place in the hyperbolic space. Both Inxight (http://www.inxight.com) and Webbrain (http://www.thebrain.com/) are good examples of this type of visualization tool.

Another method for subject directory visualization uses a familiar book metaphor. The book content structure stretches out from left to right. All categories and sub-categories are arranged as rows, like a table of contents. An indented structure often indicates parent-child relationships among the presented items: all sibling items are at the same level, and indented items/rows are the children. WebToc (Nation, 1998) used this approach to offer a hierarchical structure indicating the number of elements on branches as well as elements' individual and cumulative sizes. This hierarchical structure is also expandable.

A SOM approach can also be used to organize Internet information. The power of the self-organizing map approach rests on the automatic generation of a
subject hierarchical structure based on Web page distribution without human intervention. WEBSOM (Kaski et al., 1998), ET-Map (Chen et al., 1996), and SOM (http://www.csis.hku.hk/~yang/visualization/frac.htm) are examples of this type. They were designed to visualize Web information by automatically creating a hierarchical category system. Such category systems could serve to classify the vast Internet resources into subject-specific categories and databases upon which search and browsing operations may be performed. WebMap used a "topographic" information map to visualize categories of the Open Directory Project (Dürsteler, 2001). The visual space consisted of three integrative layers, each with a special meaning. The first layer was composed of the visual representation of the many information directories mapped into a two-dimensional space; each item was depicted as a pixel, and the distance between items reflected their similarity. As a result, the Web pages were distributed in the visual space according to their similarities to categories. The second layer was elevation, depicted as in a topographic map and proportional to the relevance of the items. Finally, the third layer was composed of icons that represented customized items, similar to favorites or bookmarks. In the node-edge graph method, an aggregation algorithm can simplify the connections by using the thickness of an edge to indicate the number of Web pages within a category. Color and size can also be used to indicate the density and number of Web pages in a visualized area.

Hyperlink hierarchy visualization
The hyperlink technique is a primary means for Web publishers to organize and present their Web information. A Web page can embed as many hyperlinks as necessary, and a Web page can also be embedded in multiple Web pages. It is natural that people directly employ existing and semantically meaningful hyperlink connections among Web pages to construct a visual hierarchical structure to facilitate cyberspace navigation. Although connected Web pages form a hyperlink-based network, it may be reduced to a hierarchical structure for the sake of simplicity. Selecting a Web page as the hierarchy root, each of its outgoing hyperlinks is treated as a first-generation child, and all of these children are first-generation siblings. The outgoing hyperlinks of these first-generation Web pages are the second-generation children, and so on. In this way, a Web site can be collapsed into a tree structure. However, to establish a multiple-level hierarchical structure, we have to deal with two problems. One of them is parent node ambiguity. Parent node ambiguity occurs when a Web page has multiple potential citing Web pages and each of the citing pages could serve as the parent node in the hierarchical structure. This is because many Web pages can cite exactly the same Web page. In order to maintain a hierarchical structure, a node in a visual tree graph must have only one parent node. Therefore, the algorithm must offer a mechanism to include one citing Web page as the parent and to exclude the others. There are different ways to handle this ambiguity problem. In fact, the hyperlink-based hierarchy methods are usually classified by the way that the parent node ambiguity is addressed.
The other problem that must be addressed is the selection of a tree search method, which determines the sequence in which the tree structure is created. Given a tree structure, there are two basic traversal algorithms for visiting all nodes of the tree: the breadth first search method and the depth first search method. Each method traverses all nodes of the tree in a quite different way.

The breadth first search method prioritizes sibling nodes at the same level in terms of traversal sequence. It starts with the root of a tree and then traverses all of its children. After that, the children's children are traversed, and so on. If there are no special requirements or preferences, all sibling nodes at the same level are traversed sequentially, that is, from left to right. The breadth first search algorithm gives a general overview of the data set before introducing any detail. In other words, it searches nodes from general to specific. To illustrate this method, we use Fig. 8.1, a tree display. For the breadth first search method, the traversal sequence is {A1}, which is the root, then {A11, A12, A13, A14} located at the second level of the tree, next {A111, A112, A113, A121, A122, A123, A131, A132, A133, A141, A142, A143} located at the third level of the tree, and finally {A1111} located at the fourth level. The breadth first search algorithm description follows:

L1  Begin
L2    Put the root of a tree into Queue;
L3    While Queue is not empty
L4      Remove a node from Queue;
L5      For each valid child of the node
L6        Put the child into Queue;
L7      EndFor;
L8      The node is traversed;
L9    EndWhile;
L10 End.

L2 puts the root into the queue to prepare for later processing. L3 examines the loop end condition. L4 gets a node from the queue as the current node. Lines L5 to L7 put all children of the current node into the queue. L8 processes the current node. A queue uses a FIFO (First In First Out) approach; it is a linear storage structure used for storing data elements, and the element which is put into the structure first is processed and removed from it first.

If the breadth first based method is employed to create a tree structure, the priority-based traversal method (Chi, 2002) can be used to determine the branch sequence of a network structure. For instance, when a Web page is selected as the root of a hierarchical structure, its potential children are put into the queue with the most important Web pages inserted first and the least important Web pages inserted later. Children of a child are processed using the same strategy. The priority scheme ensures that the visualization of the tree structure emphasizes and represents route popularity.
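To make the listing concrete, the following short Python sketch (not part of the original text; the child lists simply encode the tree of Fig. 8.1) performs the same breadth first traversal:

from collections import deque

# Adjacency list for the tree of Fig. 8.1.
CHILDREN = {
    "A1":   ["A11", "A12", "A13", "A14"],
    "A11":  ["A111", "A112", "A113"],
    "A12":  ["A121", "A122", "A123"],
    "A13":  ["A131", "A132", "A133"],
    "A14":  ["A141", "A142", "A143"],
    "A111": ["A1111"],
}

def breadth_first(root):
    """Visit nodes level by level, mirroring the queue-based pseudocode."""
    order = []
    queue = deque([root])                      # L2: put the root into the queue
    while queue:                               # L3: loop until the queue is empty
        node = queue.popleft()                 # L4: remove a node (FIFO)
        for child in CHILDREN.get(node, []):   # L5-L7: enqueue its children
            queue.append(child)
        order.append(node)                     # L8: the node is traversed
    return order

print(breadth_first("A1"))
# ['A1', 'A11', 'A12', 'A13', 'A14', 'A111', 'A112', ..., 'A143', 'A1111']

Replacing the deque with a priority queue (for example Python's heapq) would give the priority-based variant, in which more important pages are dequeued before less important ones.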
Fig. 8.1. A tree display (root A1 with children A11, A12, A13, and A14; their children are A111-A113, A121-A123, A131-A133, and A141-A143 respectively, and A111 has a single child A1111)
On the other hand, the depth first search method emphasizes searching nodes at lower levels in a tree. It starts with the root and then searches one of its children. Next, instead of searching the other sibling nodes, the algorithm picks one of that child's children. The algorithm keeps searching lower nodes until it reaches the lowest leaves of the tree. After the lowest node is traversed, the algorithm returns to the nearest sibling node to continue its uncompleted traversal. The depth first search method uses the stack technique to preserve unprocessed sibling node information so that it can return to the right node for further traversal after the lowest leaf is traversed. The depth first search algorithm gives users all details of one subject and then shifts to the next subject; in other words, it searches nodes from one subject to another. Using the same tree sample shown in Fig. 8.1, we can determine the traversal sequence for the depth first search method: {A1, A11, A111, A1111, A112, A113}, which covers the first branch of the tree root, then {A12, A121, A122, A123}, which is the second branch of the tree, next {A13, A131, A132, A133}, which is the third branch of the tree, and finally {A14, A141, A142, A143}, which is the last branch of the tree. The depth first search algorithm description follows:

L1  Begin
L2    Push the tree root onto Stack;
L3    While Stack is not empty
L4      Remove a node from Stack;
L5      For each valid child of the node
L6        Push the child onto Stack;
L7      EndFor;
L8      The node is traversed;
L9    EndWhile;
L10 End.
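A corresponding stack-based sketch of the depth first traversal follows (again an illustration only, reusing the CHILDREN dictionary from the breadth first sketch above):

def depth_first(root):
    """Visit one branch down to its deepest leaf before backtracking (LIFO stack)."""
    order = []
    stack = [root]                             # L2: push the root onto the stack
    while stack:                               # L3: loop until the stack is empty
        node = stack.pop()                     # L4: remove the most recently pushed node
        # Push children in reverse so the leftmost child is processed first.
        for child in reversed(CHILDREN.get(node, [])):
            stack.append(child)                # L5-L7
        order.append(node)                     # L8: the node is traversed
    return order

print(depth_first("A1"))
# ['A1', 'A11', 'A111', 'A1111', 'A112', 'A113', 'A12', 'A121', ...]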
The procedure of the depth first search algorithm is similar to that of the breadth first search algorithm except for the ways of storing and retrieving nodes. A stack uses a LIFO (Last In First Out) approach. A stack is also a linear storage structure used for storing data elements; the element which is put into the structure last is processed and removed from it first. After these two problems are solved, hyperlink-based hierarchy generation algorithms can be addressed.

Inverted hyperlink tree method
This is the most intuitive and simple method for constructing a tree structure, using a top-down approach. The root of the tree is a specially selected Web page, for instance, a Web portal. All of its directly outgoing Web pages are treated as lower branches of the tree. Each branch is processed in the same way to set up sub-branches of the tree, and this process continues until the entire tree is established. The breadth first method is employed to construct the hierarchical structure; in other words, the hierarchical structure is built level by level. If a Web page is cited by multiple Web pages which are potential parents, the citing Web page with the shortest distance to the root is selected as the parent node. If many Web pages meet this condition, one of them is randomly selected as the parent. The distance between two Web pages is defined as the number of hyperlinks connecting them. It is clear that this kind of hyperlink hierarchical structure only reflects Web authors' preferences, knowledge, and understanding. It reflects neither the relationships among the Web page contents nor the users' intended navigation habits.

User-interference method
The uniqueness of this method is that the construction of a hyperlink hierarchical structure requires external user input. One of the benefits is that it can include and link non-adjacent Web pages in the tree. In other words, a lower level Web page can be connected to a higher branch in the tree structure even if the two are not linked by a hyperlink. MAPA (Durand and Kahn, 1998) is an example of this type of method. A heuristically determined, user-assigned weight was associated with each connected hyperlink to measure the semantic similarity between two Web pages. The larger the weight on a hyperlink, the less relevant the two Web pages, and vice versa. Each Web page was projected onto the hierarchical structure based upon the minimum-weight path from the root to the Web page. Other possible paths with larger weights between the root and the Web pages which were potential parent nodes were excluded and no longer considered. Weights were assigned to all involved Web pages by users, and then the Web pages were put onto the tree in the same fashion.

In Eq. (8.1), a user-interference Euclidean method was introduced (Wishart, 2001). Given two Web pages Wi(ai1, ai2, …, ain) and Wj(aj1, aj2, …, ajn), each Web page is described by n attributes. Wi(ai1, ai2, …, ain) and Wj(aj1, aj2, …, ajn) may be two rows extracted from a term-document matrix. Parameter hijl is a weight assigned by people to indicate the significance of the difference between Wi and Wj on attribute
l. Then the path weight between Wi(ai1, ai2, …, ain) and Wj(aj1, aj2, …, ajn) is defined as:

$$PW_{ij} = \left[ \frac{\sum_{l=1}^{n} h_{ijl}\,(a_{il} - a_{jl})^{2}}{\sum_{l=1}^{n} h_{ijl}} \right]^{1/2} \qquad (8.1)$$
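A minimal sketch of the computation in Eq. (8.1), assuming the attribute vectors and the user-assigned weights are supplied as plain Python lists (the sample values at the bottom are invented for illustration):

import math

def path_weight(wi, wj, h):
    """Weighted Euclidean path weight between two Web pages (Eq. 8.1).

    wi, wj : attribute vectors of the two pages (e.g., rows of a term matrix);
    h      : user-assigned weights h_ijl, one per attribute.
    """
    if not (len(wi) == len(wj) == len(h)):
        raise ValueError("vectors and weights must have the same length")
    numerator = sum(hl * (ail - ajl) ** 2 for hl, ail, ajl in zip(h, wi, wj))
    denominator = sum(h)
    return math.sqrt(numerator / denominator)

# Invented example values:
print(path_weight([1.0, 0.0, 2.0], [0.5, 1.0, 2.0], [1.0, 0.5, 2.0]))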
The final hierarchical structure is more subjective than objective due to human interference/input. It is flexible but also labor-intensive.

Subject-directory-assistance method
Notice that more and more Web sites maintain a subject directory to make themselves more accessible from a subject perspective. Online search tools such as Yahoo, Google, and others also maintain comprehensive and well-maintained subject directories. Categorized Web pages are organized and associated with subjects or sub-subjects in subject directories. The relationship between two Web pages can be judged if such a subject directory system is available. Taking advantage of the existing subject hierarchy relationships among Web pages, this method (Munzner, 2000) created its hyperlink hierarchical structure based not only on the hyperlinks between Web pages but also on their categorical relationships in an existing subject directory. This strategy is helpful when a processed Web page is cited by multiple Web pages and its final parent node in the hierarchy needs to be determined based on that relationship. With the help of an existing subject directory, the Web page which is nearest to the processed node in the existing subject hierarchy can be defined as its parent node. The method seems very natural, but the potential risk is that if a Web page is not available in an existing subject directory, or there is no such subject directory at all, the method must look for other alternatives to handle the parent node ambiguity phenomenon.

User-usage-based method
Another prominent method for constructing a hyperlink hierarchical tree employed users' Web log data (Zhu et al., 2004). The rationale for this method was that users' traversal patterns derived from Internet log data analysis more appropriately reflect users' information seeking behavior, and therefore hierarchical structures based on Web log analysis are more user-oriented and relevant. Users' visit activities can be faithfully recorded and kept in the Web log of a server. These activities may be used to identify important path patterns by analyzing visit behaviors. After Web log analysis, each hyperlink was assigned a visit-frequency-based weight. The assigned weight for a hyperlink had a simple linear relationship with its recorded visit frequency: the smaller the weight value of a hyperlink, the less important and relevant the hyperlink, and vice versa. As we know, a Web page can correspond to a pool of potential parent candidates because multiple Web pages can hyperlink to it. It is
the weight that ultimately determines the parent of the Web page in the hierarchical structure. Web log data was presented and converted to a matrix, see Eq. (8.2).
$$V_{tf} = \begin{pmatrix} tf_{11} & \cdots & tf_{1n} \\ \vdots & tf_{ij} & \vdots \\ tf_{n1} & \cdots & tf_{nn} \end{pmatrix}_{n \times n} \qquad (8.2)$$
Parameter n was the number of the involved Web pages in the database and tfij was the traversal frequency from Web page Wi to Web page Wj. The rows and columns corresponded to Web pages. Using the Euclidean distance approach, we can also calculate the similarity between any two Web pages in terms of traversal behavior based on the Vtf matrix. Another probabilistic similarity algorithm based on Web log data (Chen, 1996) addressed the same issue but in a different way. Suppose Pij is the estimated traversal probability from Web page Wi to Web page Wj; then Pij can be defined as:

$$P_{ij} = \frac{tf_{ij}}{\sum_{r=1}^{N} tf_{ir}} \qquad (8.3)$$
Parameter tfij is the same as in Eq. (8.2); it refers to the Web traversal frequency from Web page i to Web page j. Parameter N is the total number of all involved Web pages in the database. The similarity, or path weight, between Web page Wi and Web page Wj based on user visiting behavior is then defined as:

$$S_{ij} = \frac{P_{ij}}{\sum_{k=1}^{N} P_{ik}} \qquad (8.4)$$

That is, the similarity is the ratio of the estimated traversal probability from Web page Wi to Web page Wj to the total of the estimated probabilities from Web page Wi to all related Web pages.
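The following small sketch (an illustration only; the frequency matrix is invented) computes the estimated traversal probabilities of Eq. (8.3) and the similarity of Eq. (8.4) from a matrix of the form shown in Eq. (8.2):

def traversal_probabilities(tf):
    """Row-normalize a traversal frequency matrix (Eq. 8.3)."""
    probs = []
    for row in tf:
        total = sum(row)
        probs.append([f / total if total else 0.0 for f in row])
    return probs

def usage_similarity(p, i, j):
    """Similarity between pages i and j based on visiting behavior (Eq. 8.4)."""
    total = sum(p[i])
    return p[i][j] / total if total else 0.0

# Invented 3x3 frequency matrix: tf[i][j] = traversals from page i to page j.
tf = [[0, 5, 1],
      [2, 0, 3],
      [4, 1, 0]]
p = traversal_probabilities(tf)
print(usage_similarity(p, 0, 1))   # similarity of page 0 to page 1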
It is apparent that these methods require a well-recorded and comprehensive Web log to support the construction of a hyperlink hierarchical structure. The structure and layout of the hyperlink hierarchy may change if users' visitation patterns change.

Webpage-content-based method
If two Web pages are connected by a hyperlink, it indicates that the author of the Web pages believes that they are somehow related. However, whether the two Web pages are semantically related, and the degree to which they are related, should rest upon the contents of the two pages. Thus, creating a hyperlink-based hierarchy should take both the hyperlink and the contents of the hyperlinked
Web pages into consideration. The method of Zhang and Nguyen (2005) used a similarity algorithm to calculate the similarity between two hyperlinked Web pages. The Web page similarity, serving as the hyperlink weight, in conjunction with the easily identified hyperlink relationships, was utilized to construct a one-level hyperlink hierarchy. The similarity can be calculated using the direction-based, Euclidean-distance-based, or another approach discussed in Chap. 2. This method is more objective because it is built upon both Web page contents and hyperlink connections. Moreover, it can also demonstrate the extent to which two Web pages are related. But the method has to set up a Web page term matrix in order to calculate Web page content similarity.

Linkage-similarity-based method
The linkage-similarity-based method (Zhu et al., 2004) displayed hyperlink connections based upon hyperlink similarity. However, the way it demonstrated the degree of relatedness was quite different from the webpage-content-based hierarchy method. It computed the similarity of hyperlinks by Web page citation analysis: the more hyperlink paths two Web pages share, the more relevant they are, and vice versa. In order to calculate hyperlink similarity, a webpage-webpage linkage matrix was defined as follows:
$$V_{path} = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & a_{ij} & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix}_{n \times n} \qquad (8.5)$$
Vpath was an n×n matrix where n is the number of all involved Web pages projected onto the hierarchical structure. The rows and columns of the matrix were defined as Web pages, and the position of a Web page was consistent in both the column and the row. Element aij was equal to the shortest distance from Web page i to Web page j, that is, the number of hyperlinks on the path between the two pages. This matrix can effectively describe both the citing and cited data of a Web page. Notice that the matrix describes a network in which a path from node A to node B is not equivalent to a path from node B to node A; the fact that Web page A contains a hyperlink path to Web page B does not mean that Web page B also includes a hyperlink path to Web page A. Linkage similarity between two Web pages can be measured by using the Euclidean distance approach based on this matrix. The similarity method depends on Web page linkage patterns. However, if two pages are not directly or indirectly linked, it does not mean they are not relevant; in this case it is hard to make a judgment about their similarity, and other alternative methods must be used.

An integrative method
Several methods have been introduced for creating a hyperlink-based hierarchy. Each of them has its strengths and weaknesses, and each addresses a different perspective of hyperlinks, each of which is important and vital. This suggests that an integrative method, which includes multiple perspectives of a hyperlink
and hyperlinked Web pages, may be more robust and sound. For example, the similarity between two pages can be calculated by the linkage-similarity-based method, the webpage-content-based method, the user-usage-based method, and the subject-directory-assistance method. If each of these similarities is treated as a meta-attribute and S1, S2, S3, and S4 stand for the similarity values of the linkage-similarity-based, webpage-content-based, user-usage-based, and subject-directory-assistance methods respectively, then the final integrated similarity between the two pages is defined as:
$$S = \frac{w_{s1} \times S_1 + w_{s2} \times S_2 + w_{s3} \times S_3 + w_{s4} \times S_4}{S_1 + S_2 + S_3 + S_4} \qquad (8.6)$$
In Eq. (8.6), ws1, ws2, ws3, and ws4 are weights assigned to the four similarity measures S1, S2, S3, and S4 respectively. The weights are used to control the impact of each similarity measure on the integrated similarity measure. Valid values of the weights range from 0 to 1, and the sum of ws1, ws2, ws3, and ws4 should be equal to 1. The denominator in Eq. (8.6), which is the sum of S1, S2, S3, and S4, is employed to normalize the integrated similarity measure so that its valid value lies between 0 and 1.

Theoretically speaking, the number of hierarchical structure levels may be as large as needed. However, as the number of levels increases, the number of displayed Web pages increases dramatically. This raises many problems, one of which is the effective display of the growing hierarchy on limited screen real estate. An overloaded number of Web pages would definitely make the screen display appear cluttered. To alleviate this problem, people usually take a Web page clustering approach. Web pages are categorized and clustered based upon some criteria, and then clusters, rather than Web pages, are used as the basic display objects presented in the graph. If users are interested in one or several clusters, they can drill down to find a more detailed display. The previously discussed content-based similarity algorithm, user-usage-based similarity algorithm, subject-directory-assistance algorithm, and linkage similarity algorithm can be employed to categorize and cluster Web pages.
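Returning to Eq. (8.6), a minimal sketch of the integrated measure, assuming the four component similarities have already been computed by the respective methods (the numbers below are invented):

def integrated_similarity(sims, weights):
    """Combine component similarities into one measure (Eq. 8.6).

    sims    : [S1, S2, S3, S4] from the linkage-, content-, usage-, and
              subject-directory-based methods.
    weights : [ws1, ws2, ws3, ws4], each in [0, 1] and summing to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    total = sum(sims)
    if total == 0:
        return 0.0
    return sum(w * s for w, s in zip(weights, sims)) / total

print(integrated_similarity([0.6, 0.4, 0.8, 0.2], [0.25, 0.25, 0.25, 0.25]))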
8.2.2 Internet information seeking visualization

Visualization of Internet information seeking focuses upon individual information seeking behaviors on the Internet, usually including Web browsing and querying. Visualization of Internet information seeking is more dynamic because individual information needs vary across both people and contexts.

Browsing history visualization
Trial-and-error is the most used strategy when users browse an unfamiliar domain on the Internet. Users click a hyperlink that looks relevant and interesting. The selected Web page may not satisfy their needs, and then they have to return
to previous locations to try new pages. How far they need to go back to a previous location depends on a number of factors; obviously, the farther they need to return, the harder it is to remember the visited contexts. Sometimes simply clicking a browser's back button does not solve the problem because of a complex traversal history. Studies show that displaying a search history can reduce the demand for working memory and potentially improve users' performance in a complex search (Fang, 2000). Illustration of search history requires the presentation of a series of browsed Web sites in visiting sequence order. The idea is similar to an online map system like Yahoo Maps or MapQuest which can highlight routes from a start location to a destination. If a visual navigation map is provided, users can easily understand which Web sites have been visited, which subject directory paths have been traversed, and what decisions have been made during navigation. It helps users to orient themselves in cyberspace, trace back previous activities, and redefine their navigation directions. A visual map that can illustrate detailed traversal history would definitely ameliorate browsing conditions and provide for smoother navigation.

Most of these visualization graphs consist of nodes representing visited Web sites and directed edges representing a path between two visited Web sites. The start site is marked as the start point in the graph. Visited Web sites are connected by directed edges, since visit direction really means something. A visit history graph is usually a network formed by visited paths. Traversal sequence numbers can be labeled in some graphs for users to trace browsing sequences (Ellis and Dix, 2004; Herder and Weinreich, 2005). In order to put visit history graphs into more meaningful contexts, graphs can integrate related unvisited paths into the visit history graph. The related unvisited paths refer to those that are connected to visited Web pages but not visited by users. One of the benefits of this strategy is to facilitate users in exploring unvisited Web pages. The edge type of these unvisited paths should be designed differently (possibly as a broken line) from that of the visited paths, which may be represented by a solid line, so that the two can be distinguished. Repeatedly visited paths, which indicate that users spent more time and effort on them, were represented by thicker edges so that users could easily identify them (Czerwinski et al., 1999). Hy+ (Hasan et al., 1996) offered more context information by listing all hyperlinks of a visited Web page in the graph and connecting only the clicked/visited hyperlinks. Time spent on a Web site is also an important factor: the longer users stay on a Web site, the more important the Web site may be, and vice versa. This kind of information can also be visualized; the size of a node can be used to indicate the length of the visit. The larger a node icon in the graph, the longer users stayed there, and vice versa. Visit history graphs should enable users to drill down to an individual Web page to browse its contents at any time.
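To make the idea concrete, here is a small sketch (an illustration, not any of the cited systems) that turns a recorded visit sequence into a directed graph whose edge counts could drive edge thickness and whose dwell times could drive node size; the session data is invented:

from collections import Counter, defaultdict

def build_visit_graph(visits):
    """visits: list of (url, seconds_spent) tuples in traversal order."""
    edge_counts = Counter()             # repeated edges -> thicker lines
    dwell_time = defaultdict(float)     # longer dwell -> larger node
    for (src, t_src), (dst, _) in zip(visits, visits[1:]):
        edge_counts[(src, dst)] += 1
        dwell_time[src] += t_src
    if visits:                          # time spent on the last page
        last_url, last_t = visits[-1]
        dwell_time[last_url] += last_t
    return edge_counts, dwell_time

# Invented browsing session:
session = [("a.html", 30), ("b.html", 5), ("a.html", 12), ("c.html", 90)]
edges, dwell = build_visit_graph(session)
print(edges)   # Counter({('a.html', 'b.html'): 1, ('b.html', 'a.html'): 1, ('a.html', 'c.html'): 1})
print(dwell)   # {'a.html': 42.0, 'b.html': 5.0, 'c.html': 90.0}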
Search engine result visualization
Search engines respond to users' queries by returning a linear results list. Retrieved Web pages in the list are usually ranked according to their similarities to the query. The ranked list helps users to identify the most relevant Web pages. However, a search engine can return an overwhelming number of Web pages to users. If users want to browse all of these pages, they must keep scrolling down the screen or turning pages. As the screen is scrolled down, the relevance between the query and the returned Web pages in the current screen becomes lower. As a result, most users only read and browse the first page of the results list and ignore the rest. Visualization techniques can project returned Web pages onto one non-linear visual space which can hold more returned data from a search engine. Moreover, the visual space offers more information about the relationships among the retrieved pages, which is beyond what a traditional search engine can achieve within its linear framework.

Kartoo (www.kartoo.com) represented information using a cartographic interface where the Web pages returned from search engines were cities and the semantic link between two cities was a road. It allowed users to submit a query to multiple search engines simultaneously. Returned results were compiled and presented in a two-dimensional space. All returned Web pages were sorted and classified into clusters in the space according to their similarities to the submitted query. Related Web pages were connected by semantic links in the space. The semantic links were generated by semantic strength rather than by the hyperlinks between two Web pages. Web pages connected by semantic links were treated as a cluster. The semantic links among clustered Web pages were invisible in the map until the cursor moved over the Web page cluster. Web pages in a cluster shared the same background color so that they could be distinguished from Web pages in different clusters. Users could interact with the map by clicking any area of interest for detailed information.

Returned Web pages from a search engine can also be presented as groups of circles. This is a top-down generation method. Each circle represents a category or sub-category, and all returned items are categorized. Sibling circles are presented in such a way that sub-category circles are embedded in their parent category circle; that is, the sibling circles are positioned one after another within larger circles placed earlier. The size of a circle is related to the number of Web sites it represents: the larger a circle, the more Web sites it holds. All categories and subcategories are arranged graphically by the same method until the leaves of the subject hierarchy are reached. Finally, a root circle encircles all of its sub-circles. Users can click any of the positioned circles to collapse a category of interest. Grokker (www.grokker.com) is an example of this kind of visualization tool.

Lighthouse (Leuski and Allan, 2000) was a novel interface concept for browsing retrieval results from search engines. It integrated a traditional ranked results list with a clustering visualization of pages in its visual space. It visually clustered the search results returned from a search engine. All results were positioned between two columns of titles which held the titles of the returned Web pages. In this way users could easily associate an object in the visual space with its title in the columns. The visualization space was located in the central part of the screen between the two columns. Each little sphere represented a returned Web page. Sphere size was related to relevance to the query. All projected spheres appeared to be floating in the visual space, and semantically related spheres were located
together. It was claimed that such an approach would make full use of precious screen space and highlight the integration of returned results and visualization.

In People Map (Konchady et al., 1998), users' queries rather than returned Web pages were visualized. It attempted to identify similar user groups by analyzing and clustering their submitted queries. The idea was based on the assumption that people with similar interests usually share similar queries; that is, if queries are clustered, then the people whose queries fall within the same cluster are also grouped. Toward this aim, all queries submitted to search engines by users were collected. Each query represented a different user in the visual graph. All queries were projected onto the visual space based upon query keyword relatedness. In the visual space, related queries were linked and grouped; in other words, a people relationship network was established.
8.2.3 Visualization of Web traffic information

Internet information consists of more than the obvious Web pages and embedded hyperlinks. Both Web pages and hyperlinks are associated with very rich and complicated statistical traffic data which is usually preserved in usage logs. A usage log saves all traces that visitors leave on the Web servers. For instance, it includes the visited page type (such as html, doc, pdf, etc.), page errors, average visit time, the number of pages viewed per visitor, actions taken (such as browsing, downloading, searching, and other transactions), the number of visitors per page, and so on. From a Web site management perspective, people desperately need to identify the most visited Web pages, where users come from, how long they stay on and patronize their Web sites, where they exit, the most used search terms, the most favored subject directories, and the most visited times of day. This information can be used to optimize the Web site design in terms of its contents, organizational structures, and presentations, to identify and fix Web site problems, and to promote business.

There are three basic Web log types (Hong and Landay, 2001): (1) server-side log: the logging is done on a Web server, and the log data is available only to the owner of the server; (2) client-side log: the logging is done on a client computer, requires special software to be installed on the local computer, and only collects local traffic data; (3) proxy-based log: the logging is done on an intermediate computer. The first type, the server-side log, is the most popular.

Traditional Web traffic analysis methods employ simple graphs and charts based upon Web traffic log data. These graphs and charts lack interactive facilities; they must be examined sequentially. They generally focus on aggregations, with minimal (if any) support for direct examination of records relating to an individual request. As a result, users are unable to explore the data dynamically at will. Furthermore, the traditional methods often fail to integrate analysis output with available information regarding site topology (Hochheiser and Shneiderman, 2001). Visualization techniques offer a visual means for the qualitative analysis of Internet traffic characteristics and the impact of existing parameters on the traffic
due to their powerful interactivity and good intuition. Visualization techniques are used to generalize and visually display the massive amount of available Internet traffic data in order to identify potential patterns, trends, and anomalies. This can shed light on Internet traffic predictability, optimize the information resources of a Web portal, identify valuable online services for business, enhance Web site usability, assure network security, and improve visitor retention.

Visualization for Web usage
Web log data can be used to yield a usage-based Web graph network. The method connects two Web pages in the network based upon user traversal information rather than a hyperlink between the two pages designed and implemented by the Web page authors. In other words, if two Web pages are linked in the Web graph network, a hyperlink does not necessarily exist between the two pages. The two pages are linked in the graph network if they are traversed one right after another by users. Jumping from one Web page to another can be caused by a variety of reasons, such as use of the back button, clicking a favorite Web page in a bookmark folder, selecting a search result item returned from a search engine, guidance by a subject directory, selecting a hyperlink within a Web page, and so on. The usage-based graph network is different from the user-usage-based hierarchy methods discussed earlier. The usage-based graph network is used for Web traffic analysis, and it should faithfully reflect users' usage activities, whereas the user-usage-based hierarchy is designed for navigation guidance. The usage-based graph network should avoid any data loss or revision similar to the processing for parent node ambiguity in the user-usage-based hierarchy construction.

Flow analysis (Eick, 2001) could be conducted in the graph network. Flow analysis focused on users' visiting flows into and out of a Web portal. Flow analysis told users which Web sites directed visitors to this Web site and where visitors left the site; the former was called the entry point and the latter the exit point. The percentage of visitors entering and exiting a Web site should be illustrated. The flow analysis enabled users to select any meaningful Web site as the object of analysis. By progressively selecting Web sites, users could see the flow of users' navigation and discover useful visitor patterns. Web sites could be ranked in ascending (descending) order by flow traffic volume. Users' activities may also be included in the usage-based graph network, for instance users' responses to promotion items and plans, particular services, recommended product lists, and so on. Through visualized user activities, decision makers would have a clear picture of what services and products users were interested in, how long users lingered on these services and products, and what actions they took after visiting them.

The longest repeating subsequence (LRS) method (Pitkow and Pirolli, 1999) identified and preserved significant traversal paths based on Web log data analysis. A repeating subsequence path was defined as a path traversed by visitors multiple times. It efficiently identifies the more informative and prevalent paths in the usage graph. The method can be used to effectively optimize the graph network and still retain the predictive power of the full data set.
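A simplified sketch of the repeating-subsequence idea (not the exact Pitkow and Pirolli algorithm; the session data is invented): it collects every contiguous subpath of length two or more from a set of sessions, keeps those that occur more than once, and discards any repeating subpath that is contained in a longer repeating one.

from collections import Counter

def longest_repeating_subpaths(sessions, min_len=2):
    """Return frequently repeated contiguous subpaths from visit sessions."""
    counts = Counter()
    for session in sessions:
        for i in range(len(session)):
            for j in range(i + min_len, len(session) + 1):
                counts[tuple(session[i:j])] += 1
    repeating = {p for p, c in counts.items() if c > 1}

    def contained_in_longer(p):
        # True if p occurs as a contiguous slice of a longer repeating subpath.
        return any(p != q and len(q) > len(p) and
                   any(q[k:k + len(p)] == p for k in range(len(q) - len(p) + 1))
                   for q in repeating)

    return sorted(p for p in repeating if not contained_in_longer(p))

# Invented sessions:
sessions = [["home", "news", "sports"],
            ["home", "news", "sports", "scores"],
            ["home", "news", "weather"]]
print(longest_repeating_subpaths(sessions))   # [('home', 'news', 'sports')]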
DiskTree (Chi et al., 1998) used a circular layout to display a tree structure for Web traffic data. Using the breadth first search algorithm, it yielded a disk-like graph. The center or root of the graph was the home page of a given start Web site, and nodes on the same perimeter were siblings from the same level with regard to the root. At each node, the algorithm first counted the number of children and then allocated angular disc space to each child according to Eq. (8.7), where N is the number of children, α is the angle allocated to each child, and β is the angular disc space available for the children; its valid value ranges from 0 to 2π. As the level of the hierarchy increased, the corresponding β decreased; at the first level it was initially equal to 2π. It is clear that the disc space allocation process was recursive.
$$\alpha = \frac{\beta}{N} \qquad (8.7)$$
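A minimal recursive sketch of the angular allocation in Eq. (8.7); the small site hierarchy below is invented, and each node is assigned the wedge [start, start + beta) of the disc:

import math

# Hypothetical site hierarchy (the home page is the root).
TREE = {
    "home":     ["products", "support", "about"],
    "products": ["p1", "p2"],
    "support":  ["faq"],
}

def allocate_angles(node, start=0.0, beta=2 * math.pi, wedges=None):
    """Recursively divide the angular space beta among children (Eq. 8.7)."""
    if wedges is None:
        wedges = {}
    wedges[node] = (start, start + beta)
    children = TREE.get(node, [])
    if children:
        alpha = beta / len(children)          # Eq. (8.7): alpha = beta / N
        for k, child in enumerate(children):
            allocate_angles(child, start + k * alpha, alpha, wedges)
    return wedges

for name, (a, b) in allocate_angles("home").items():
    print(f"{name}: {math.degrees(a):6.1f} .. {math.degrees(b):6.1f} degrees")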
A hierarchical structure was visualized compactly in the DiskTree display, and it was easily understood by people due to the familiar disc metaphor. Web traffic information was then added to this hierarchy. The size and brightness of a line between the center of the DiskTree and a node on the perimeter indicated Web page access frequency, and the color of the line indicated the page lifecycle status. In addition, a time parameter was integrated as another dimension for comparative analysis. TimeTube was a series of visualized DiskTree structures in consecutive time periods. People could observe and compare changes in Web traffic and Web site contents on a Web site during a certain period of time.

The visual usage pattern subtraction method (Chi, 2002) depicted traffic changes in a more economical way. In a visual traffic graph determined by this method, a series of traffic differences within a certain timeframe, one week or one day, was illustrated. Since only the traffic differences rather than the traffic amounts were displayed, and overlaid parts were omitted or excluded, the required display room was reduced significantly. Therefore the method made better use of screen real estate, alleviating the notorious data overload problem to some extent. Notice that the traffic difference between two consecutive time periods may be either positive or negative. If positive and negative differences were colored differently in the graph, users could easily distinguish not only the degree of change but also the direction of change.

The DiskTree method was extended to illustrate multiple layers (Chen et al., 2004). The idea was to use the DiskTree algorithm to generate multiple disc-like graphs from various perspectives. These perspectives were (a) the hyperlink structure of a Web site; (b) Web usage, such as visit statistics per page, usage statistics per hyperlink, average access time per page, and access probability of the links; and (c) Web page clusters and classifications. Instead of visualizing the Web paths visited by users, WebCANVAS (Cadez et al., 2000) visualized the series of categories within which the traversed Web pages were located. Seventeen small and informative categories were predefined. The visual graph was a simple list, and each visit corresponded to a row in the list. Each row consisted of a series of squares; each square represented a predefined category, and different category squares were colored differently. All users' visits were put
in the list, and their traversal behavior patterns were easily identified from the differently colored square patterns of the rows. The visual space of Starfield (Hochheiser and Shneiderman, 2001) consisted of an X-axis and a Y-axis; the visual space was a two-dimensional grid. Two meaningful attributes from a group of attributes can be selected and assigned to the X-axis and Y-axis respectively. All Web pages were projected onto the grid defined by the X-axis and Y-axis, and patterns were then displayed in the grid. The identified attributes were client host, top-level Internet host name, second-level Internet name, time stamp, category, HTTP status, delivered volume, HTTP referrer, and user agent.

Security visualization
Network security has become a major concern. Any computer network connected to the Internet is likely to be attacked by hackers. Unprotected and poorly protected ports can be accessed by uninvited attackers, and viruses also look for vulnerable ports to invade. Network systems can crash because of these malicious attacks and viruses. Fortunately, all of these activities, like those of regular visitors, are recorded in security log files. Visually analyzing security log data can help network administrators to detect possible attacks, prevent attacks, maintain network stability, and gain insight into attackers' methods and techniques. Some stealthy attacks are resistant to detection by traditional intrusion detection systems but are easily identified by appropriate visualization approaches (Conti and Abdullah, 2004). Research has suggested that three items should be addressed in network security visualization (Ma, 2004): a time-ordered overview of the entire data set; a detailed view that represents a relatively small subset of time units; and a feature view representing events occurring in a particular part of some address space during a particular time. A feature is defined as a port scan, an attack, or other spyware activity.

An IP address is used to uniquely identify a computer on a network for Internet information exchange. It is similar to the VIN (Vehicle Identification Number) of a car. As a fingerprint of a visitor, it is recorded in a Web security log for every visit. Visualizing and analyzing IP addresses can quickly detect network anomalies. A quad-tree was used as the visual space for visualizing IP addresses (Teoh et al., 2002). The quad-tree was recursively defined as four equal sub-quad-trees; it can be collapsed into a tree structure where each node has four equally divided children. For each recorded 32-bit IP address, the first two binary digits were extracted and projected onto the first level of the quad-tree. For instance, the first quadrant of the quad-tree was reserved for an IP address starting with "11", the second quadrant for "01", the third for "00", and the fourth quadrant for "10". Then the next two adjacent digits of the IP address were pulled out and mapped onto a sub-quadrant of its parent quadrant in the same way. The process was repeated until all digits of the IP address were exhausted. After all IP addresses were mapped onto the quad-tree, hidden patterns of visits were revealed. A quad-tree is displayed in Fig. 8.2.
Fig. 8.2. Quad-tree display. (Teoh, Ma, Wu, and Zhao, 2002). © 1994 IEEE. Reprinted with permission
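The mapping from a 32-bit address to quad-tree cells can be sketched as follows (an illustration of the idea, not the authors' code): each pair of bits selects one of the four quadrants at successively deeper levels.

import ipaddress

def quadtree_path(ip):
    """Return the sequence of quadrant labels ('00', '01', '10', '11') for an IP."""
    bits = format(int(ipaddress.IPv4Address(ip)), "032b")   # 32-bit binary string
    return [bits[i:i + 2] for i in range(0, 32, 2)]         # two bits per level

print(quadtree_path("192.168.1.10"))
# ['11', '00', '00', '00', '10', '10', '10', '00', ...]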
The primary advantage of the parallel coordinate plot method is its ability to demonstrate multiple attributes of an object in one display, breaking the traditional bounds of two- or three-dimensional scatter plot representations. The parallel coordinate plot method consists of multiple vertical lines that are parallel to each other in the visual space. Each vertical line represents an attribute of an object, and the order of these vertical lines can be arranged arbitrarily. The number of attributes displayed is theoretically unlimited. The scale of a vertical line varies across attributes. Each attribute of an observed object corresponds to a point on its vertical line in the visual space. Two points of an observed object on two adjacent vertical lines are connected to form a segment. After all attributes of an observed object are processed in the same way, the object is represented as a series of straight line segments which intersect the defined vertical lines. The attribute value for each observation is plotted along each axis relative to the minimum and maximum attribute values of all observed data (see Fig. 8.3). In Fig. 8.3, A1, A2, A3, and A4 are four attributes of an object, and the object is represented by three connected segments. The result is a "signature" line across the n dimensions (attributes) for each observation. After all data are projected onto the visual space, objects with similar characteristics share similar signature lines, and clustered objects can thus be visually discerned. Conti and Abdullah (2004) used the parallel coordinate plot method to visualize network visitors in order to prevent network attacks. In their case, the defined attributes for the parallel coordinate plots were source IP, destination IP, source port, destination port, and protocol type such as TCP.
Fig. 8.3. Parallel coordinate plot method display
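The following is a small illustrative sketch of how signature lines can be constructed; the traffic attributes and values are invented for illustration, and matplotlib is assumed only as a convenient plotting backend.

```python
import matplotlib.pyplot as plt

# Sketch of a parallel coordinate plot: each record becomes a "signature"
# line across the attribute axes. Attribute values are scaled relative to
# the minimum and maximum observed for that attribute, as described above.
records = [
    {"src_port": 443, "dst_port": 52100, "bytes": 1200, "duration": 3.2},
    {"src_port": 80,  "dst_port": 49152, "bytes": 880,  "duration": 1.1},
    {"src_port": 22,  "dst_port": 60222, "bytes": 150,  "duration": 9.7},
]
attributes = ["src_port", "dst_port", "bytes", "duration"]

# Per-attribute minimum and maximum over all observations.
lo = {a: min(r[a] for r in records) for a in attributes}
hi = {a: max(r[a] for r in records) for a in attributes}

for r in records:
    # Scale each value into [0, 1] on its own axis, then connect the points.
    ys = [(r[a] - lo[a]) / (hi[a] - lo[a] or 1) for a in attributes]
    plt.plot(range(len(attributes)), ys, marker="o")

plt.xticks(range(len(attributes)), attributes)
plt.ylabel("scaled attribute value")
plt.title("Signature lines in a parallel coordinate plot")
plt.show()
```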
8.2.4 Discussion history visualization

Online discussion, which allows users to participate and contribute freely in order to share personal experiences, ideas, and information, has become popular thanks to the ubiquitous Internet. Since the discussion contents on a particular topic keep changing due to the contributions of discussion group members, a mechanism is needed to show the evolution of the discussion contents and to indicate who makes what contribution. Visualized interaction history information makes a discussion forum a more social space for users. History Flow (Viégas et al., 2004) was designed to show relationships between multiple online document versions. Exploratory analysis with visualization revealed complex patterns of both cooperation and conflict. In History Flow, each version was represented by a vertical version line. All vertical lines were parallel to each other, ordered from older versions to newer ones. Each vertical line was partitioned proportionally into sections. Each section, with a different color, represented a contribution by an author. The same sections of two neighboring vertical lines were linked so that newly added parts could be easily identified in the graph. Moreover, the distance between two neighboring versions (vertical lines) was proportional to the time between the two consecutive versions. Therefore, the time spent on each version and the content changes could be effectively visualized.
8.3 Summary

The Internet has posed unprecedented challenges for information retrieval due to the dynamics, diversity, and complexity of its information sources. The situation calls for new methods and means to address these challenges. Visualization techniques can be applied to visual presentations of subject directories and hyperlinks, which is expected to help users better understand the Internet's information organization structures. Visualization techniques alleviate the "lost in cyberspace" syndrome, or disorientation, during navigation. These techniques can also be used to assist searching and browsing, which makes Internet information seeking more intuitive and effective. Visualization techniques can facilitate Web traffic analysis to discover information use patterns for both optimization of information resources and prevention of malicious attacks from hackers. The major difference between Web traffic visualization and browsing history visualization is that the former attempts to discover information seeking patterns based upon a group of users while the latter tries to visually display the search history of an individual user. Visualization of hyperlink-based hierarchies and visualization of usage-based Web graph networks share commonalities on many fronts. Their purposes, however, are quite different. The former is designed for information navigation guidance while the latter is for Internet information usage analysis. Web traffic visualization for security may be regarded as part of visualization for Web usage to some degree. Both attempt to look for patterns within Web log files, but their purposes are quite different. Visualization for Web usage attempts to facilitate benign uses while Web traffic visualization for security tries to prevent malicious attacks.
Chapter 9 Ambiguity in Information Visualization
Ambiguity may occur whenever a message is conveyed through a medium from a source to a destination. In this broad sense, ambiguity is a property of communication. The medium can be language, which we use in everyday life and are quite familiar with. From a linguistic point of view, both syntax and lexicon can cause language ambiguity in a certain context. Multiple meanings of a term in conjunction with a flexible syntactic structure make ambiguity an important property of language. It is no surprise that language ambiguity is one of the long-standing research topics in the linguistics field. The medium can also be a form of art such as a painting or sculpture. Ubiquitous ambiguity in art has become an indispensable and essential convention for many artists, one which makes artistic expression more aesthetic, powerful, and sophisticated.

It is no coincidence that ambiguity also occurs in a computer interface, a special communicative medium between a human and a computer. Ambiguity, as the nemesis of usefulness and usability, is considered an anathema in the human computer interaction field (Gaver et al., 2003). As such, ambiguity in interface design has attracted the attention of several researchers (Futrelle, 1999; Mankoff et al., 2000; Gaver et al., 2003). A good interface should communicate information clearly with users, and its structure and layout should be explicit. The design of an interactive interface requires effective communication with end users, and a "well-designed" interface is one that easily facilitates information exchange between a user and a system.

Ambiguity arises when the interpretation of a message received from the medium in a communication system is indistinct. When people handle an ambiguous message, they momentarily assess it, try to rule out irrelevant senses based on the context, and then explain and understand it. The brain may respond emotionally and often illogically when it is forced to make a decision based on inadequate and uncertain information. The uncertainty comes from multiple interpretations of the same phenomena without further explanatory information. Psychologists would say that ambiguity can cause the discomfort and anxiety which result from knowing there is something people should know but do not. Ellsberg's (1961) pioneering work has produced a significant body of empirical and theoretical research aimed at increasing the understanding of the impact of ambiguity on individual decision-making. Generally speaking, people try to avoid ambiguity in communication in order to convey information precisely and accurately.
9.1 Ambiguity and its implication in information visualization
9.1.1 Reason of ambiguity in information visualization

Information visualization has a very close relationship with computer interface design, but it is not equivalent to computer interface design. A visual space for information visualization, where objects are visually displayed and their relationships are illustrated, makes not only its interface design distinct from other information system interface designs but also its communication with humans sophisticated and challenging. Information retrieval visualization methods can be used to reveal hidden or invisible relationships among objects in a high dimensional space such as a document vector space. In order to perceive, observe, and interact with these objects in the high dimensional space, the high dimensionality has to be reduced to 2 or 3 dimensions so that the objects are visible in the low dimensional visual space. In other words, an information visualization method provides a projection approach that maps the objects in a high dimensional space onto a low dimensional visual space where their relationships are preserved, revealed, and perceived. It is not surprising that the relationships among the objects may be distorted to some extent and that insignificant relationships among the objects may be compromised and ignored due to the dimensionality reduction. When a dimensionality reduction happens, some object attributes and connections between objects, which are associated with the reduced dimensions, are lost or changed unintentionally and inevitably. Which relationships are kept and which relationships are lost after projection depends upon the projection algorithm involved. For both the Euclidean spatial characteristic based information visualization models and the multiple reference point based information visualization models, the relationships between the defined reference points and the documents/objects are identified and preserved while the relationships among documents are compromised and undermined during projection. Even though relationships among documents can be indirectly reflected via the involved reference points in the low dimensional visual space, the indirect relationships may be "distorted" to some degree. For the self-organizing map information visualization model, the relationship between a winning node in the visual space and an input object is primarily considered, while relationships among objects are not considered in projection. For the Pathfinder network model, the most economical (salient) paths are specified and kept while other "insignificant" paths are totally ignored and discarded in the ultimate Pathfinder network. For the multidimensional scaling model, during an iterative optimization process the relationships among a group of disordered neighboring objects in terms of similarity are included while other object relationships are disregarded. For this reason, a visual configuration of objects in the visual space may faithfully reveal only part of the relationships among the objects in a high dimensional space.
This suggests that the relationship between a high dimensional space and a low dimensional space in a visualization method is not bidirectional in terms of projection. That is, we can employ a visualization projection method to project objects in a high dimensional space onto a low dimensional visual space, but the reverse projection process does not hold. People simply cannot use a visual configuration in the low dimensional visual space to restore exactly the previous relationships among the projected objects in the high dimensional space. That is because an information visualization configuration in the visual space may correspond to multiple possible interpretations in its corresponding high dimensional space. The exact meaning of an ambiguous configuration cannot be determined from its visual context. That is the deep-seated reason for ambiguity in information visualization. An ambiguity phenomenon in a visual space is, in this sense, an inherent property of dimensionality reduction in information visualization. It is clear that the ambiguity can lead to uncertainty, inexplicitness, and lack of clarity in the interpretation of the visual configuration, which can cause discomfort for users. The discomfort may be elevated to confusion, frustration, and even distress if the ambiguity situation deteriorates and no means is provided for users to disambiguate the phenomenon.
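As a toy illustration of this one-way property (an assumed example, not from the text), consider a projection that simply drops one coordinate: distinct high dimensional objects can share the same visual position, and the visual configuration alone cannot tell them apart.

```python
import numpy as np

# Illustration of why dimensionality reduction is not reversible: a simple
# projection that discards the third coordinate maps two distinct 3-D points
# onto the same 2-D position, so the 2-D view alone cannot recover them.
def project(p):
    return p[:2]          # keep x and y, discard z

a = np.array([0.4, 0.7, 0.1])
b = np.array([0.4, 0.7, 0.9])   # differs only in the discarded dimension

print(project(a), project(b))               # identical 2-D images
print(np.allclose(project(a), project(b)))  # True, yet a != b
```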
9.1.2 Implication of ambiguity for information visualization

Ambiguity can be a two-edged sword. On the one hand, the ambiguity resulting from poor interface design, such as inconsistent and inaccurate terminology, improper application of controls, lack of basic information, or dimensionality reduction, can generate discomfort and confusion. Ambiguity can mislead users in terms of understanding relationships among objects, prevent users from explaining and exploring the information space properly, and undermine the confidence of users. A feeling of incompetence to interpret and handle a visual configuration may lead to ambiguity aversion. On the other hand, people may also make full use of the virtue of ambiguity. For instance, it is the ambiguity in a piece of art that not only gives the creator of the piece a creative space but also gives its viewers a rich imaginative space, which is crucial and fundamental. Poetry cannot exist without language ambiguity, and impressionism cannot survive without ambiguity. The same is true in visualization for information retrieval. Ambiguity can be a driving force behind the design and development of new information retrieval visualization models. As we know, a series of information retrieval visualization models such as LyberWorld (Hemmje et al., 1994), VR-VIBE (Benford et al., 1995), and WebStar (Zhang and Nguyen, 2005) were derived from the original VIBE algorithm to address the ambiguity problem to some degree. LyberWorld and VR-VIBE each added an extra dimension to the original two-dimensional visual space, while WebStar integrated an object movement dimension into the visual space.
These derived algorithms were developed to overcome the inherent ambiguity problem in the original visual space, which we will discuss later. An ambiguous configuration in a visual space may convey unexpected useful information beyond unambiguous information, and it may be employed by users to reveal hidden information about displayed objects in the visual space. For instance, in DARE, TOFIR, and GUIDO, if objects overlap in their visual spaces, this is a typical symptom of ambiguity in these visual spaces. However, these overlapping objects share some important Euclidean spatial characteristics. It is these common characteristics that categorize the overlapping objects together in the visual spaces. The clustering feature can definitely be used to perform object clustering analysis and to identify relevant objects. Ambiguity can also promote users' creative exploration and deeper engagement in the visual space. Ambiguity is not just an inconvenience for users who interact with an information visualization system. By compelling users to interpret an uncertain visual configuration by themselves, it encourages them to grapple conceptually with the system and its contexts, and thus establishes a more personal relationship with the configuration (Gaver et al., 2003). A study showed that users were comfortable making a best guess from incomplete information for an appropriate interpretation in a visualized social activity environment (Erickson, 2003) if proper interactive means are provided. As a positive result, ambiguity may challenge users' flexibility and willingness to consider new ways to tap information in the visual space and to try new, unfamiliar interactive behaviors based on personal interest and preference in order to clarify the ambiguity and gain solid control. If users learn to live with ambiguity in the visual space, and to tolerate and negotiate it, ambiguity can become a positive factor in human-machine interaction.
9.2 Ambiguity analysis in information retrieval visualization models

In this section, ambiguity in each individual information visualization model is discussed. The reasons for, types of, and implications of ambiguity, as well as solutions to ambiguity in the visual space, are addressed.
9.2.1 Ambiguity in the Euclidean spatial characteristic based information models

The Euclidean spatial characteristic based information models need a reference axis, which consists of two reference points, to measure two important visual projection parameters: the visual projection distance and/or the visual projection angle. Documents are mapped onto the visual space based on these projection parameters. According to the projection algorithms, the high dimensionality of a document space has to be reduced to construct a distance (angle) and angle (distance) visual space.
Toward this aim, the relationships between a document and the defined reference points, reflected in the visual projection parameters, are preserved and presented in the visual space. Unfortunately, ambiguity inevitably occurs when the dimensionality is cut short. In Fig. 9.1, the two reference points R1 and R2 are the key view point (KVP) and the auxiliary view point (AVP), respectively; R1R2 is the reference axis for projection; r is the distance between R1 and R2; Di and Dj are two documents in a high dimensional document space; d1 and d2 are the two visual projection distances of the document Di against the two reference points; and α1 and α2 are the two visual projection angles of the document Di against the two reference points. It is apparent that the parameters d1 and α1 are used for DARE, d1 and d2 for GUIDO, and α1 and α2 for TOFIR if the document Di is mapped onto their visual spaces. Here we first introduce a new concept, the Distance to Reference Axis (DTRA) of a document (Zhang, 2001b). In a document space, as long as the reference axis is defined, the Euclidean distance from any document to the reference axis can be measured and computed. In Fig. 9.1, the DTRA of the document Di is DiR'. DTRA is an important concept for explaining the ambiguity phenomenon in the Euclidean spatial characteristic based information models. The expression of DTRA varies across the visualization models because their coordinate systems in the visual spaces and their variables are different. In DARE, the DTRA of the document Di is defined by the visual distance and visual angle in Eq. (9.1). It is simple to express using the parameters d1 and α1:

D_iR' = d_1 \sin(\alpha_1)    (9.1)
Fig. 9.1. A group of documents with the same DTRA value
In TOFIR, the DTRA of the document Di is defined by the visual angles in Eq. (9.3), which follows from Eq. (9.2) according to Fig. 9.1:

D_iR' \cot(\alpha_1) + D_iR' \cot(\alpha_2) = r    (9.2)

Then we have:

D_iR' = \frac{r}{\cot(\alpha_1) + \cot(\alpha_2)}    (9.3)
In GUIDO, the calculation of DTRA is not as simple as in DARE and TOFIR because it is described by the two visual distances of the document Di:

d_1^2 = D_iR'^2 + R_1R'^2    (9.4)

d_2^2 = D_iR'^2 + R_2R'^2    (9.5)

R_1R' + R_2R' = r    (9.6)

From Eqs. (9.4) to (9.6):

R_1R' - R_2R' = \frac{d_1^2 - d_2^2}{r}    (9.7)

From Eqs. (9.6) and (9.7):

R_1R' = \frac{1}{2}\left(\frac{d_1^2 - d_2^2 + r^2}{r}\right)    (9.8)

From Eqs. (9.4) and (9.8):

D_iR' = \pm\left\{ d_1^2 - \frac{1}{4}\left(\frac{d_1^2 - d_2^2 + r^2}{r}\right)^2 \right\}^{1/2}    (9.9)
Since DiR' is the distance between Di and R', the negative solution should be ignored in Eq. (9.9). It is interesting that, in Fig. 9.1, documents with the same DTRA value form a circle which is symmetric about the reference axis in the high dimensional space. All documents on the circle whose radius is DiR' share the same two visual distances (d1 and d2) and the same two visual angles (α1 and α2). This implies that all of the documents on the circle are projected onto the same spot in the three visual spaces because they have identical visual projection distances and visual projection angles. For this reason the circle is called the object overlapping circle. The documents on the circle can be close to each other in the high dimensional document space. They can also be far away from each other, like Di and Dj in Fig. 9.1. The maximum distance between two overlapping documents is 2DiR'.
In other words, a cluster of projected documents in the visual spaces may not correspond to a real cluster of documents in the high dimensional space. Whether a cluster of projected documents in the visual space corresponds to a real cluster in the high dimensional space depends not only on their locations on the special circle but also on the size of DiR'. If the overlapping documents are associated with a small DiR', they correspond to a real cluster whose size is defined by DiR'. If the overlapping documents are associated with a large DiR', then the locations of these overlapping documents on the special circle determine whether they are clustered in the high dimensional space or not. Since the DTRA of the overlapping documents plays a significant role in judging the degree to which the overlapping documents are ambiguous in the high dimensional space, it is necessary to discuss the DTRA distribution in the visual spaces. As we know, each projected document in the visual space possesses a DTRA value against the reference axis, and different projected documents may vary in their DTRA values. If we can identify the DTRA distribution in the visual spaces, it would definitely facilitate understanding of the ambiguity in these visual spaces. In Fig. 9.2, there is a line which is parallel to the reference axis and whose distance to the reference axis is a in the high dimensional space. D is a point on the parallel line, and D can define a group of object overlapping circles like the one in Fig. 9.1. The radius of these circles is always equal to a because the line is always parallel to the reference axis. Based upon the earlier analysis and discussion, if the projection equation of the parallel line with distance a to the reference axis in the visual space can be calculated, a projection area whose DTRA is smaller than a can be identified. A cluster of documents within this area would definitely correspond to a real cluster in the high dimensional space, while a cluster outside the area may not.
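The three DTRA expressions can also be checked numerically. The sketch below (illustrative code with an assumed document geometry) recovers the same DTRA value from the DARE parameters (Eq. 9.1), the TOFIR parameters (Eq. 9.3), and the GUIDO parameters (Eq. 9.9).

```python
import math

# DTRA (distance to the reference axis) recovered from each model's visual
# projection parameters, following Eqs. (9.1), (9.3), and (9.9).

def dtra_dare(d1, alpha1):
    """DARE: DiR' = d1 * sin(alpha1)."""
    return d1 * math.sin(alpha1)

def dtra_tofir(alpha1, alpha2, r):
    """TOFIR: DiR' = r / (cot(alpha1) + cot(alpha2))."""
    return r / (1.0 / math.tan(alpha1) + 1.0 / math.tan(alpha2))

def dtra_guido(d1, d2, r):
    """GUIDO: DiR' = sqrt(d1^2 - ((d1^2 - d2^2 + r^2) / (2r))^2)."""
    r1_rprime = (d1 ** 2 - d2 ** 2 + r ** 2) / (2 * r)
    return math.sqrt(d1 ** 2 - r1_rprime ** 2)

# Consistency check on an assumed configuration: reference axis of length
# r = 100 and a document whose foot of perpendicular R' lies 30 from R1,
# with DTRA 40. All three formulas should recover the same value, 40.
r, x, h = 100.0, 30.0, 40.0
d1, d2 = math.hypot(x, h), math.hypot(r - x, h)
alpha1, alpha2 = math.atan2(h, x), math.atan2(h, r - x)
print(dtra_dare(d1, alpha1), dtra_tofir(alpha1, alpha2, r), dtra_guido(d1, d2, r))
```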
Fig. 9.2. Display of a parallel line to the reference axis
Fig. 9.3. Display of a low DTRA area in the DARE visual space
For DARE, any point D meets the following relationship (Eq. 9.10) in terms of the visual distance d1 and visual angle α1. Variables in Fig. 9.2 are the same as those in Fig. 9.1. Here a is a constant.

d_1 = \frac{a}{\sin(\alpha_1)}    (9.10)
Fig. 9.3 shows the curve of the projected parallel line in the DARE visual space. Here the distance d1 and the angle α1 are assigned to the Y-axis and X-axis, respectively, in the visual space. In this case, the constant a is equal to 50. Notice that the curve fits in the valid display area described in the previous chapter, which is defined by the X-axis, the Y-axis, and X = π. The area between the curve and the boundary of the valid display area is the low DTRA area (a = 50). A document cluster within this area corresponds to a real cluster in the high dimensional document space for a small a. The curve is symmetric about the line X = π/2. The larger the value of a is, the closer the curve is to X = π/2, and vice versa. For TOFIR, the projected parallel line in the visual space can be described by the two visual angles (α1 and α2) as shown in Eq. (9.12), which follows from Eq. (9.11):

\frac{a}{\tan(\alpha_1)} + \frac{a}{\tan(\alpha_2)} = r    (9.11)

\alpha_1 = \cot^{-1}\left(\frac{r}{a} - \cot(\alpha_2)\right)    (9.12)
Fig. 9.4 shows the curve of the projected parallel line in the TOFIR visual space. Here α1 and α2 are assigned to the Y-axis and X-axis, respectively, in the visual space.
In this case, the constant a is equal to 50 and the length of the reference axis, or the distance between the two reference points, is equal to 100. Observe that the curve just fits in the valid TOFIR display area, which is defined by the X-axis, the Y-axis, and Y = −X + π. The curve in conjunction with the X-axis and Y-axis defines the low DTRA area in the TOFIR visual space. It is clear that the curve is symmetric about the line Y = X. When the value of the parameter a decreases, the intersection between the curve and the line Y = X moves toward the origin of the visual space. When the value of the parameter a increases, the intersection moves away from the origin of the visual space. For GUIDO, three different scenarios of the projected parallel line in the visual space are discussed separately. The first scenario is that a point on the parallel line is mapped within the segment formed by the two reference points R1 and R2, like the point D in Fig. 9.2. In this case, from the triangle R1DR2, we have:

(d_1^2 - a^2)^{1/2} + (d_2^2 - a^2)^{1/2} = r    (9.13)

Therefore, the projected parallel line equation is shown in Eq. (9.14):

d_1 = \pm\left\{ a^2 + \left( r - (d_2^2 - a^2)^{1/2} \right)^2 \right\}^{1/2}    (9.14)
Since a visual distance is always positive, the negative solution from Eq. (9.14) is no longer considered. The second scenario is that a point on the parallel line is mapped outside the segment formed by the two reference points R1 and R2, like the point D' (on the R2 side) in Fig. 9.2. The projected parallel line equation in the visual space is shown in Eq. (9.16). For the same reason, the positive solution is kept and the negative one is ignored in Eq. (9.16).
Fig. 9.4. Display of a low DTRA area in the TOFIR visual space
(d_1^2 - a^2)^{1/2} - (d_2^2 - a^2)^{1/2} = r    (9.15)

d_1 = \pm\left\{ a^2 + \left( r + (d_2^2 - a^2)^{1/2} \right)^2 \right\}^{1/2}    (9.16)
Similarly, when a point on the parallel line is mapped outside the reference point R1 of the reference axis, the projected parallel line in the visual space is illustrated in Eqs. (9.17) and (9.18). The negative solution is not considered.

(d_2^2 - a^2)^{1/2} - (d_1^2 - a^2)^{1/2} = r    (9.17)

d_1 = \pm\left\{ a^2 + \left( r - (d_2^2 - a^2)^{1/2} \right)^2 \right\}^{1/2}    (9.18)
In fact, Eq. (9.18) is the same as Eq. (9.14). Thus, both Eqs. (9.14) and (9.16) are used to describe the projected parallel line in the visual space. The curve in the GUIDO visual space is shown in Fig. 9.5. In the figure, the constant a is equal to 50, d1 is assigned to the Y-axis, and x and t are the two variables of the two equations assigned to the X-axis of the visual space. They correspond to Eqs. (9.14) and (9.16), respectively. The two equations generate two curves which merge near the origin of the visual space. The two curves fall within the valid display area, which is a half-infinite plank. The plank forms an angle (π/4) against both the X-axis and the Y-axis. The boundary of this valid display area and the two curves define the low DTRA area in the GUIDO visual space. The curve is symmetric about the line Y = X in the visual space. The smaller the constant a is, the farther the projected parallel curve is from the line Y = X in the visual space, which creates a smaller low DTRA area, and vice versa. In summary, the projected parallel line always falls in the valid display area in the visual spaces. It is evident that all low DTRA areas are located near the boundaries of the valid display areas in the visual spaces. If a low DTRA area in the visual spaces is identified, then a cluster in that area corresponds to a real cluster in the high dimensional space. However, a cluster in other areas may or may not be associated with a real cluster in the high dimensional space. To solve this problem, users can select a document from an overlapping document set in a high DTRA area of the visual space, replace the auxiliary view point (AVP) with the selected document, and then re-project all documents based upon the new reference system in the visual space. After the AVP is replaced by a selected overlapping document, a new reference axis is produced. The previous overlapping documents on the prior overlapping circle will no longer stay on the same overlapping circle of the new reference axis due to the auxiliary view point replacement. Consequently, these documents would be spread out in the new visual configuration. In Fig. 9.6, suppose that the document Dj is selected to replace the auxiliary view point R2 as the new auxiliary view point R2'; the newly created reference axis now is R1R2'. The documents located on the previous overlapping circle would not stay on the new overlapping circle. Therefore they would have different DTRA values against the newly created reference axis, and they would no longer be projected onto the same spot in the visual spaces.
Fig. 9.5. Display of a low DTRA area in the GUIDO visual space
In the new configuration, there may be new ambiguous phenomena. Users can keep substituting the auxiliary view point with one of the overlapping documents to solve the problem. The interaction between users and the system continues until users are satisfied with the visual configuration.
Fig. 9.6. Impact of a new reference axis on the overlapping objects
Notice that since the key view point (KVP) stays the same as the previous one during the replacement, the newly created configuration still primarily reflects the user's previous information need even though the new visual configuration is different from the previous one. The Euclidean spatial characteristic based information visualization models can provide multiple views for a specific user interest by changing the auxiliary view point. It is this feature that enables users to disambiguate overlapping documents in the visual spaces. This unique feature distinguishes the Euclidean spatial characteristic based information visualization models from other information visualization models. It is worth pointing out that, to some extent, the ambiguity in the visual space serves as a sort of categorization function. If objects overlap in the visual spaces, it indicates that they share four of the same important parameters in the document space: two visual angles and two visual distances. That is, they are similar vis-à-vis the reference axis. This has obvious implications for information retrieval. These overlapping documents are exactly similar vis-à-vis both the key view point and the auxiliary view point in terms of a distance measure. In addition, they are exactly similar vis-à-vis both the key view point and the auxiliary view point in terms of the cosine measure if the origin of the coordinate system shifts to the key view point and the auxiliary view point, respectively. In this sense, overlapping objects in the visual space are a group of highly categorized objects. That is, of course, the positive side of ambiguity. The implication of overlapping objects for information retrieval is that if users identify an interesting object during browsing in the visual space, a group of similar overlapping objects can be presented to users as related documents. Users can use them to reformulate their search strategies.
9.2.2 Ambiguity in the multiple reference point based information visualization models

The multiple reference point based information visualization models are known for their algorithmic simplicity and the flexibility of reference point manipulation. The essence of these models hinges on the fact that they support multiple reference points in the visual spaces, take the impact of all related reference points on every object into consideration, and normalize the projection position of an object by factoring in the impacts of all involved reference points. However, a multiple reference point visual environment can easily create ambiguity. There are two kinds of ambiguity in the visual space: the ambiguity created by the relativity of the object distance to the defined reference points, and the ambiguity created by the arbitrariness of reference point placement in the visual space. These two kinds of ambiguity have different causes and, therefore, different characteristics and solutions as well. Thus, we discuss them separately.
Ambiguity created by the relativity of object distance to reference points

The uniqueness of the multiple reference point based algorithm relies on the distance relativity of a projected object to the related reference point(s) in the visual space, which is caused by the position normalization over the similarities to all reference points when the position of an object is calculated (see Chap. 3 for details). It implies that the distance of an object to its related reference points in the visual space is relative and changeable. On the one hand, this object distance relativity characteristic gives powerful flexibility for both object presentation and object manipulation in the visual space. That is one of the primary reasons that the algorithm is widely applied in many application domains. On the other hand, it can lead to a type of ambiguity: the relationship between objects and reference points may be misinterpreted when the similarities between the objects and reference points meet certain conditions in the visual space. In the visual space, if a document (D1) is more relevant to a reference point R1 than another document (D2), the document D1 may nevertheless be farther away from the reference point R1 than the document D2. This happens when a second reference point R2 is involved and is related to both documents D1 and D2. Suppose that the similarity between R1 and D1 is r1, the similarity between R1 and D2 is r2, the similarity between R2 and D1 is r3, the similarity between R2 and D2 is r4, and r1 is larger than r2. The positions of the two reference points R1 and R2 in the two dimensional visual space are (xR1, yR1) and (xR2, yR2), respectively. The positions of the two documents D1 and D2 in the two dimensional visual space are (xD1, yD1) and (xD2, yD2), respectively. According to Eqs. (3.18) and (3.19) in Chap. 3, the positions of the documents D1 and D2 are calculated as follows. The X-axis and Y-axis coordinates of the document D1 are given in Eqs. (9.19) and (9.20):

x_{D1} = \frac{x_{R1} r_1 + x_{R2} r_3}{r_1 + r_3}    (9.19)

y_{D1} = \frac{y_{R1} r_1 + y_{R2} r_3}{r_1 + r_3}    (9.20)

The X-axis and Y-axis coordinates of the document D2 are given in Eqs. (9.21) and (9.22):

x_{D2} = \frac{x_{R1} r_2 + x_{R2} r_4}{r_2 + r_4}    (9.21)

y_{D2} = \frac{y_{R1} r_2 + y_{R2} r_4}{r_2 + r_4}    (9.22)
Thus, the distance between the document D1 and R1 is shown in Eq. (9.23):

\delta(D_1, R_1)^2 = (x_{R1} - x_{D1})^2 + (y_{R1} - y_{D1})^2    (9.23)

Based on Eqs. (9.19), (9.20), and (9.23):

\delta(D_1, R_1)^2 = \left(x_{R1} - \frac{x_{R1} r_1 + x_{R2} r_3}{r_1 + r_3}\right)^2 + \left(y_{R1} - \frac{y_{R1} r_1 + y_{R2} r_3}{r_1 + r_3}\right)^2    (9.24)

\delta(D_1, R_1)^2 = \left(\frac{r_3}{r_1 + r_3}\right)^2 \left((x_{R1} - x_{R2})^2 + (y_{R1} - y_{R2})^2\right)    (9.25)

Similarly, we have the distance between the document D2 and the reference point R1:

\delta(D_2, R_1)^2 = \left(\frac{r_4}{r_2 + r_4}\right)^2 \left((x_{R1} - x_{R2})^2 + (y_{R1} - y_{R2})^2\right)    (9.26)

The ratio of the distance between the document D1 and R1 to the distance between the document D2 and R1 is shown in the following equations:

\frac{\delta(D_1, R_1)}{\delta(D_2, R_1)} = \frac{r_3/(r_1 + r_3)}{r_4/(r_2 + r_4)}    (9.27)

\frac{\delta(D_1, R_1)}{\delta(D_2, R_1)} = \frac{r_2/r_4 + 1}{r_1/r_3 + 1}    (9.28)
Eq. (9.28) indicates that the distances from a reference point to its related documents ultimately depend on the two similarity ratios r1/r3 and r2/r4. Even if the similarity between the document D1 and the reference point R1 is larger than the similarity between the document D2 and the reference point R1 (that is, r1 is larger than r2), the document D1 can still be farther away from the reference point R1 than D2 in the visual space. That is because Eq. (9.28) is larger than 1 when the ratio of the similarity between R1 and D1 to the similarity between R2 and D1 is smaller than the ratio of the similarity between R1 and D2 to the similarity between R2 and D2 (that is, when r1/r3 is smaller than r2/r4). The conclusion can easily be extended to a multiple reference point environment. In other words, the distance between a document and a reference point depends not only upon the similarity between them but also upon the similarities between the document and other related reference points in the visual space. Conversely, if two objects overlap in the visual space, it does not mean that they are equally similar to the related reference points. In fact, Eq. (9.28) clearly explains this. As long as the similarity ratio r1/r3 is equal to the similarity ratio r2/r4, the documents D1 and D2 overlap between the two related reference points in the visual space. In other words, if documents are proportionally similar to the involved reference points, they overlap in the visual space.
However, the positive side of the ambiguity is that documents that address the same topics (they are related to the same reference points) but to different extents (they have the same similarity ratios to the related reference points) are easily identified, because they are grouped and mapped onto the same spot in the visual space. Since this kind of ambiguity stems from the similarity ratios between documents and reference points and has nothing to do with the visual space coordinates of the reference points (see Eq. (9.28)), the ambiguity cannot be avoided or assuaged by moving the involved reference points in the visual space. However, if the involved reference points are redefined by reassigning different weights to keywords in the reference points and/or changing keywords in the reference points, then the objects are re-projected based upon the redefined reference points and the ambiguity can be alleviated or avoided. That is because redefinition of the reference points can change the two important similarity ratios r1/r3 and r2/r4, and therefore it separates the overlapping objects in the visual space. The true reason for this kind of ambiguity is that the distance calculation of a document to its related reference points relies on the relative similarities between the document and the related reference points, caused by the normalization process. To solve this problem, VR-VIBE added a new dimension to the visual space, illustrating and indicating the overall similarity of a document to all reference points. The added dimension shows absolute overall similarity rather than relative similarity. In this sense, overlapping documents with proportional similarities to the related reference points can be distinguished in the third dimension because their overall similarities are different. The larger the overall similarity an object has, the higher it is located in the third dimension, and vice versa.

Ambiguity created by arbitrariness of reference point placement

Documents which are totally unrelated may be projected onto the same spot in the visual space. This kind of ambiguity is caused not by the similarities between an object and reference points but by the positions of the reference points. Based on the algorithm, if a document is related to two reference points, it is situated somewhere on the segment produced by the two reference points in the visual space. Its position is not affected by other unrelated reference points. Two unrelated documents can overlap if they are related to different reference points, respectively, and their positions happen to fall at the intersection of the segments formed by their related reference points in the visual space. Moving either of the related reference points leads to a position change of the object in the visual space, but the object is still on the line formed by the two related reference points, and the ratio of its distances to the two reference points stays the same no matter where the moving reference point ends up in the visual space. It means that moving a related reference point can separate the overlapping objects in the visual space. For instance, there are four reference points R1, R2, R3, and R4. A document D1 is relevant to R1 and R2 and not relevant to R3 and R4, while another document D2 is not relevant to R1 and R2 and relevant to R3 and R4. For simplicity, suppose that the similarities between D1 and R1, R2 are 0.5 and 0.5, respectively, and the similarities between D2 and R3, R4 are 0.3 and 0.7, respectively. If the four reference points are situated in the visual space as in Fig. 9.7, the two documents D1 and D2 overlap in the visual space.
Fig. 9.7. Impact of a moving reference point on overlapping documents
D1 is located in the middle of the segment generated by the reference points R1 and R2. D2 is positioned at 3/10 of the segment formed by the reference points R3 and R4. A broken line indicates the influence of related reference points on a document, and a solid line indicates the movement of an influenced document or a reference point in the visual space. After the reference point R2 moves to R2', the object D1 is attracted to D1' accordingly, in order to stay on the new segment generated by R1R2' and maintain its relative position to the reference points. Therefore the two documents D1 and D2 no longer overlap in the visual space. This kind of ambiguity is affected by both the number of reference points in the visual space and the dimensionality of the visual space. The maximum number of reference points in a two dimensional VIBE space that produces an unambiguous framework is three; the maximum number in an n-dimensional VIBE visual environment would be n+1 (Benford et al., 1995). WebStar added movement of a reference point to its visual space. The automatic reference point rotation feature serves both as an identification of related objects and as a disambiguation feature in the WebStar visual space. When a reference point rotates around the contour of the display area, the ambiguity which is related to the moving reference point and created by the arbitrariness of reference point location can be eliminated. It is clear that the ambiguity created by the arbitrariness of reference point location can be effectively avoided by moving any of the involved reference points, whereas the ambiguity produced by the relativity of object distance to reference points requires revision of the related reference points to separate the overlapping documents in the visual space.
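The following sketch illustrates the placement rule of Eqs. (9.19)–(9.22) and the distance-relativity ambiguity of Eq. (9.28); the reference point coordinates and similarity values are arbitrary examples.

```python
import numpy as np

def vibe_position(ref_points, similarities):
    """Place a document at the similarity-weighted average of the positions
    of its related reference points, as in Eqs. (9.19)-(9.22)."""
    pos = np.array(ref_points, dtype=float)
    sim = np.array(similarities, dtype=float)
    return (sim[:, None] * pos).sum(axis=0) / sim.sum()

# Two reference points; similarities chosen so that r1 > r2 but r1/r3 < r2/r4.
R1, R2 = (0.0, 0.0), (10.0, 0.0)
D1 = vibe_position([R1, R2], [0.6, 0.9])   # r1 = 0.6, r3 = 0.9
D2 = vibe_position([R1, R2], [0.5, 0.5])   # r2 = 0.5, r4 = 0.5
print(D1, D2)  # D1 lands farther from R1 than D2, although r1 > r2

# Proportional similarities overlap: (0.3, 0.6) and (0.1, 0.2) share the
# ratio r1/r3 = r2/r4 = 0.5, so both documents map onto the same spot.
print(vibe_position([R1, R2], [0.3, 0.6]), vibe_position([R1, R2], [0.1, 0.2]))
```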
9.2.3 Ambiguity in the Pathfinder network

The Pathfinder network algorithm transforms a complex object proximity matrix into a simplified network structure to reduce redundant links in the network. Since the primary aim of the Pathfinder network algorithm is to produce a special network with the lowest cost paths rather than to preserve all relationships among the objects as faithfully as possible, some insignificant object proximity relationships in the original matrix have to be compromised or ignored. According to the algorithm, only a subset of the relationships in the original proximity matrix is used to construct the Pathfinder network: the edges whose weights are equal to the corresponding entries of the path-length-q complete minimum weight matrix are included. In other words, the other edges in the original proximity matrix are not part of the final Pathfinder network. These "insignificant" (not salient) edges are pruned and no longer exist in the final Pathfinder network. The Pathfinder network algorithm only prunes edges of the original network, not its nodes. Exclusion of these edges in the Pathfinder network not only results in the loss of the direct relationship between two objects, but also causes ambiguity for objects that are close to each other in the original network and may be far away from each other in the Pathfinder network in terms of path length. This ambiguity may be misleading and cause misinterpretation of the network. For instance, there are four objects (O1, O2, O3, and O4). For simplicity, the two important parameters r and q for the Pathfinder associative network are set to ∞ and 2, respectively, giving PFNET(∞, 2). The objects are connected to each other by edges (see Fig. 9.8) according to the original input proximity matrix W (see Eq. (9.29)).
Fig. 9.8. Display of four connected objects
W = \begin{pmatrix} 0 & 2 & 1 & 3 \\ 2 & 0 & 2 & 1 \\ 1 & 2 & 0 & 1 \\ 3 & 1 & 1 & 0 \end{pmatrix}    (9.29)
Based on the above equation and the Pathfinder network algorithm, the path-length-2 exact minimum weight matrix W² is shown in Eq. (9.30).
W^2 = \begin{pmatrix} 0 & 2 & 2 & 1 \\ 2 & 0 & 1 & 2 \\ 2 & 1 & 0 & 2 \\ 1 & 2 & 2 & 0 \end{pmatrix}    (9.30)
From Eqs. (9.29) and (9.30), we then obtain D²:

D^2 = \begin{pmatrix} 0 & 2 & 1 & 1 \\ 2 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}    (9.31)
The final step is to compare D² and W to find the final Pathfinder network result (see Eq. (9.32) and Fig. 9.9).
Fig. 9.9. Final result of the Pathfinder network
PFNET(\infty, 2) = \begin{pmatrix} 0 & 2 & 1 & 0 \\ 2 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}    (9.32)
The above analysis shows that both edges e14 and e23 are removed from the original network. The path between the objects O2 and O3 in the Pathfinder network is longer than it is in the original network due to the removal of the edge e23, even though the weight of the path (e24, e43) is no greater than that of the edge e23. That is the ambiguity. As we know, there are two important parameters r and q associated with the Pathfinder associative network (PFNET(r, q)). They determine the characteristics of the Pathfinder network. The parameter r denotes the Minkowski metric used in the network, and the parameter q indicates that the Pathfinder network is q-triangular, which means that the weights of all paths in the network whose lengths are smaller than or equal to q satisfy the triangle inequality condition. Notice that the edges in the Pathfinder network are a subset of those in the original proximity matrix, and as the value of the parameter q increases, the number of edges in the Pathfinder network decreases, and vice versa. As the number of edges decreases, the Pathfinder network becomes more efficient and the possibility of ambiguity also increases. When the parameter q is equal to 1, the Pathfinder network is exactly equal to the original proximity matrix. This implies that decreasing the parameter q can mitigate the ambiguity in the Pathfinder network at the price of increasing path redundancy in the network. In other words, the parameter q can be used to control ambiguity in the Pathfinder network.
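The pruning rule for this example can be reproduced with a short script. The sketch below is limited to the r = ∞ case used above (a path's weight is the maximum of its edge weights); for the proximity matrix of Eq. (9.29) it reproduces the PFNET(∞, 2) of Eq. (9.32), pruning e14 and e23.

```python
import numpy as np

INF = float("inf")

def pfnet_inf_q(W, q=2):
    """PFNET(r = infinity, q): prune edge (i, j) unless its weight equals the
    minimum weight over all paths of length <= q, where a path's weight is
    the maximum of its edge weights (Minkowski r = infinity)."""
    n = len(W)
    W = np.array(W, dtype=float)
    D = W.copy()                      # best path weight using one edge
    step = W.copy()
    for _ in range(q - 1):            # extend to paths of length <= q
        nxt = np.full((n, n), INF)
        for i in range(n):
            for j in range(n):
                nxt[i, j] = min(max(step[i, k], W[k, j])
                                for k in range(n) if k != i and k != j)
        step = nxt
        D = np.minimum(D, step)
    keep = np.isclose(W, D) & (W > 0)  # keep only edges matching the minimum
    return np.where(keep, W, 0)

# Proximity matrix W from Eq. (9.29); the result reproduces Eq. (9.32):
# edges e14 and e23 are pruned.
W = [[0, 2, 1, 3],
     [2, 0, 2, 1],
     [1, 2, 0, 1],
     [3, 1, 1, 0]]
print(pfnet_inf_q(W, q=2))
```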
9.2.4 Ambiguity in SOM

The Kohonen self-organizing map (SOM) is composed of a group of neurons/nodes in a display grid. Each of these neurons/nodes is associated with a weight vector used to keep the knowledge and experience learned from an iterative training and learning process. The weight vector describes the characteristics of a neuron in the display grid. The weight vectors are initialized by randomly assigning small non-zero values. An input document/object is randomly selected and compared with the weight vectors of the nodes in the display grid; the node whose weight vector is most relevant to the input document becomes the winning node. Once the winning node is identified, the weight vectors of the neighboring nodes of the winning node are revised or adjusted to adapt to the influence of the winning node. The impact scope of the winning node, which determines the neighborhood size, can be controlled by both a neighborhood width function and a learning rate function. The iterative learning and training process ends when the feature map finally converges. After the feature map is generated, all documents are mapped onto their corresponding winning nodes in the display grid.
The SOM algorithm suggests that the final weight vectors in the display grid may not be unique for the same data set because of the random initialization of the weight vectors, the random selection of input data, the impact of different neighborhood sizes and learning rates, and the iterative nature of learning and training. The uncertainty of the node weight vectors directly leads to potential uncertainty in the positions of projected documents in the display grid. It is this uncertainty of document position in the display grid that makes interpretation of the projection ambiguity elusive. That is because the position of a projected document is determined by the position of the winning node whose weight vector is most relevant to the document. Unlike the Euclidean spatial characteristic based models, the SOM algorithm does not have a stable and customized reference system based on the user's information need during projection. Its reference system is the winning node of the feature map when a document is projected onto the grid. The winning node can be any node in the grid because the weight vector associated with the winning node is uncertain due to the dynamic and iterative nature of the process. This suggests that the final feature map reflects a global overview of a document set rather than a customized local overview based on users' interests. As a result, the SOM algorithm does not offer a customized local overview that can be adjusted by users, which is usually an effective means for disambiguation. The SOM's dynamic reference system for projection contributes to the uncertainty of a projected document's position in the visual space. In addition, when a document is mapped onto the display grid (the visual space), only the relationship between the document and the winning node is considered; relationships between the document and other documents are ignored. Labeling the feature map, which by nature tries to automatically interpret the feature map, can also produce ambiguity. After the feature map is created, the next step is to label the map, or partition the feature map according to the topics that each local area covers. That is, labeling the map is the process of identifying the proper topic terms for each of the local areas in the feature map. Assigning terms to an area in the feature map is primarily based upon the weight vectors of the nodes and/or the documents associated with the nodes. A weight vector and a document are multi-faceted, while the number of terms assigned to an area, which consists of a group of nodes, is relatively small because of the limited display room in the visual space. In addition, the term(s) assigned to an area are directly extracted from the weight vectors of the nodes and/or the documents associated with the nodes. As a result, ambiguity arises because the selected terms may either not represent the topics in that area or not be generic enough to cover all topics in the area.
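A minimal sketch of the training loop described above makes the sources of non-uniqueness visible: the random weight initialization, the random order of input presentation, and the shrinking neighborhood and learning rate. All parameter values below are illustrative rather than prescribed.

```python
import numpy as np

rng = np.random.default_rng()          # different seeds give different maps

def train_som(docs, grid=(8, 8), iters=2000, lr0=0.5, sigma0=3.0):
    """Minimal SOM loop: random initialization, pick a random input, find the
    winning node, pull the winner and its neighbours toward the input."""
    rows, cols, dim = grid[0], grid[1], docs.shape[1]
    weights = rng.random((rows, cols, dim)) * 0.1      # small random init
    coords = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols),
                                   indexing="ij"))
    for t in range(iters):
        x = docs[rng.integers(len(docs))]              # random input document
        dists = np.linalg.norm(weights - x, axis=2)
        winner = np.unravel_index(dists.argmin(), dists.shape)
        lr = lr0 * (1 - t / iters)                     # decaying learning rate
        sigma = sigma0 * (1 - t / iters) + 0.5         # shrinking neighbourhood
        grid_dist = np.linalg.norm(coords - np.array(winner), axis=2)
        influence = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
        weights += lr * influence[..., None] * (x - weights)
    return weights

# Two runs on the same data generally yield different feature maps, which is
# the source of the positional uncertainty discussed above.
docs = rng.random((100, 20))
m1, m2 = train_som(docs), train_som(docs)
print(np.allclose(m1, m2))   # typically False
```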
9.2.5 Ambiguity in MDS

The multidimensional scaling approach is often used to reveal complex data relationships and discover hidden patterns by optimally projecting objects from a high dimensional space onto a low dimensional space. However, the optimal projection can result not only in distortion of pairwise distances introduced by the projection, but also in geometric distortion (Quist and Yona, 2004).
The non-metric multidimensional scaling approach employs an iterative procedure to optimize object placements in the low dimensional visual space when objects are mapped onto it. During the iterative process, a loss function is used to calculate the stress value, which is an indicator of the goodness of fit between the proximities of objects in the high dimensional space and the Euclidean distances of their projections in the low dimensional visual space. That is, the stress value is used to judge the extent to which object relationships in the high dimensional space are faithfully projected onto the low dimensional space. The stress value is compared with a predefined threshold to see whether it reaches an acceptable goodness-of-fit level. The iterative nature of the algorithm, in conjunction with the way the iteration is controlled by a loss function, directly leads to uncertainty in object placements in the low dimensional visual space. For instance, a different stress value can generate a different object configuration for the same input data set in the visual space. As a result, it is difficult to accurately predict the locations of the projected objects in the visual space, which makes interpretation of the object projection ambiguity difficult. Like the SOM algorithm, the MDS algorithm also does not have a stable reference system during object projection. The proximity or similarity between two objects is compared with their Euclidean distance (the projection reference system) in the visual space. This Euclidean distance is dynamic because the positions of the two objects in the visual space are not fixed and are adjusted based on the goodness of the projection. For this reason, the generated final visual configuration is a global overview of the entire dataset rather than a customized local view based on users' interests. The lack of a customized local view usually means that users cannot use it as a disambiguation mechanism.
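For concreteness, the sketch below computes a Kruskal-style stress-1 value for a candidate configuration; for simplicity the raw proximities stand in for the disparities used in non-metric MDS, and the data and threshold are illustrative.

```python
import numpy as np

def stress1(highdim_dist, lowdim_dist):
    """Kruskal-style stress-1: goodness of fit between the original
    proximities (here used directly in place of disparities) and the
    Euclidean distances of the projected configuration."""
    d, dhat = np.asarray(lowdim_dist), np.asarray(highdim_dist)
    return np.sqrt(((d - dhat) ** 2).sum() / (d ** 2).sum())

# Toy example: proximities among three objects and the pairwise distances of
# one candidate 2-D configuration. Iteration stops once stress drops below
# an (illustrative) threshold such as 0.05; different runs can stop at
# different configurations with similar stress, hence the ambiguity.
proximities = np.array([1.0, 2.0, 2.2])          # (O1,O2), (O1,O3), (O2,O3)
config_dist = np.array([1.1, 1.8, 2.4])
print(stress1(proximities, config_dist))
```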
9.3 Summary

Ambiguity in information visualization stems from dimensionality reduction during object projection. When objects in a high dimensional space are projected onto a low dimensional space based on a certain algorithm, some "insignificant" attributes of a projected object have to be sacrificed or "distorted" to preserve the primary attributes of the object in the visual space. The kept attributes are displayed and observed by users in the visual space. In this sense, ambiguity is an inherent characteristic of projection in which high dimensionality is forced to be reduced. That is, projection ambiguity in the visual space is inevitable as long as the dimensionality decreases. Ambiguity in a visual space can have both positive and negative impacts upon system design and use. On the one hand, it can lead to incorrect conclusions and conceal true relationships among projected objects in the visual space. From the psychological point of view, ambiguous information can result in a loss of users' understanding of the information's meaning. On the other hand, ambiguity can motivate system designers to develop new information visualization
models to overcome the negative impact on the visual object configuration. Ambiguity can encourage users to engage in more active interaction with the visualization system to clarify the potential implicit meanings of a visual configuration caused by ambiguity. Ambiguity may provide users with clustered information in some visualization environments. When objects are projected onto the same spot, it implies that these objects share common characteristics which are determined by the projection algorithm. Otherwise, they would not overlap in the visual space. It is these common characteristics that categorize the overlapping objects, and it is these common characteristics that users can make full use of in their information retrieval decision making. Ambiguity originates from dimensionality reduction during object projection. It is no surprise that information visualization models vary in their projection approaches. Different projection approaches can create not only distinct visual configurations but also unique ambiguity in their visual spaces. Basically, projection approaches can be classified into two categories: iterative projection and non-iterative projection. Iterative projection methods such as the multidimensional scaling algorithm and the self-organizing map algorithm do not have a stable and fixed projection reference system. The projection reference system is dynamic and changeable. The position of an object in the visual space is determined by a repeated process. That causes uncertainty in projected object locations in the visual space and, therefore, an elusive explanation for their ambiguity. Non-iterative projection usually has a stable and fixed projection reference system, and the position of an object in the visual space is determined by a non-repeating calculation procedure. As a result, object locations in the visual space are certain and explicit. Therefore, explanation of ambiguity is relatively easy. For instance, both the Euclidean spatial characteristic based models and the multiple reference point based models have a predefined projection reference system which usually consists of reference points, and their projection reference objects are explicit and stable. The algorithms can accurately compute object projection positions in the visual space based upon the given parameters. Observe that an iterative projection approach like SOM or MDS usually generates a global overview configuration based on the entire input data set. The global visual configuration cannot be customized based on a user's interests. In other words, these approaches cannot generate multiple customized views or configurations for the same data set. Generation of multiple customized views or configurations for the same data set is one of the most effective means of disambiguation in a visual space. If a projection reference system, which is usually defined by reference points, changes, then the projection emphasis changes and the similarities between the reference system and the projected objects change accordingly. Objects which share the same projection parameters in the old reference system may no longer share the same projection parameters in the new reference system. Therefore, they become separated in the visual space. This may explain, in part, the fact that information visualization models based upon iterative projection approaches, which yield only one global configuration, lack a disambiguation mechanism in their visual spaces.
In contrast, non-iterative projection approaches, such as the Euclidean spatial characteristic based models and the multiple reference point based models,
produce a local visual configuration based upon predefined reference points. Redefinition of the reference points can not only yield another customized configuration but also provide users with an important means of resolving ambiguity in the visual space. It is worth pointing out that there is a special kind of ambiguity in the multiple reference point based model which is not content-oriented. This ambiguity is created by random reference point placement in the visual space rather than by reference point contents and/or object contents. Fortunately, this kind of ambiguity can easily be circumvented by relocating reference points in the visual space. Ambiguity can also occur in the visualization of a hierarchical structure. For instance, a document may be forced into one primary category of the tree structure even though it possesses multiple facets or covers multiple subjects. The other subjects must be sacrificed if the document is classified into only one primary category of the hierarchy. The sacrifice of these other facets can directly lead to ambiguity of the documents in the hierarchy. The solution to this problem is to establish a visual reference system that connects a document to all related subject categories in the hierarchy.
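To make the contrast concrete, the following sketch (not drawn from any particular system; the reference point positions and document similarity values are hypothetical) illustrates a non-iterative, multiple reference point projection in the spirit of VIBE-like displays: each document is placed at the similarity-weighted average of the reference point positions. Documents with proportional similarity profiles land on exactly the same spot, which is the overlap ambiguity discussed above, and redefining the reference system separates them.

# A minimal sketch (not from any specific system described in this book) of a
# non-iterative, multiple reference point projection in the spirit of VIBE-like
# displays. Reference point positions and document similarities are hypothetical.

def project(doc_sims, ref_positions):
    """Place a document at the similarity-weighted average of reference points."""
    total = sum(doc_sims.values())
    if total == 0:
        return None  # the document is unrelated to every reference point
    x = sum(s * ref_positions[r][0] for r, s in doc_sims.items()) / total
    y = sum(s * ref_positions[r][1] for r, s in doc_sims.items()) / total
    return (x, y)

# Hypothetical reference points (e.g., query terms) with fixed screen positions.
refs = {"visualization": (0.0, 0.0), "retrieval": (1.0, 0.0), "metaphor": (0.5, 1.0)}

# Two documents with proportional similarity profiles overlap: projection ambiguity.
doc_a = {"visualization": 0.4, "retrieval": 0.4, "metaphor": 0.2}
doc_b = {"visualization": 0.2, "retrieval": 0.2, "metaphor": 0.1}
print(project(doc_a, refs), project(doc_b, refs))      # both land on (0.5, 0.2)

# Redefining the reference system (adding a reference point) separates them.
refs2 = dict(refs, ambiguity=(0.5, -1.0))
doc_a2 = dict(doc_a, ambiguity=0.0)
doc_b2 = dict(doc_b, ambiguity=0.3)
print(project(doc_a2, refs2), project(doc_b2, refs2))  # now distinct positions

Iterative approaches such as MDS or SOM, by contrast, have no such fixed reference system to redefine, which is why they offer no comparably simple disambiguation step.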
Chapter 10 The Implication of Metaphors in Information Visualization
The use of metaphors as a means of organizing thoughts or ideas has a long history that dates back to the fifth century B.C. (Yates, 1966). Metaphor study is multi-disciplinary. It relates to psychology, cognitive science, philosophy, education, computer science, and, of course, linguistics. Metaphors are pervasive in language. However, metaphors are not merely linguistic phenomena; they reflect a deep structure of thought (Lakoff and Johnson, 1980). Metaphors are ubiquitous in interface design. The passion for metaphorical interface design stems from the metaphorical tendency of human information processing and its fundamental and indispensable role in the human cognitive process. It is no surprise that the application of metaphors in a computing environment can be traced back to the invention of computers. Once people realized that the process of human computer interaction is also a process of human cognition, it was natural to employ metaphors to shape a variety of models that facilitate human computer interaction. Visualization for information retrieval also embraces metaphors with enthusiasm. The complexity and challenge of visualization for information retrieval go far beyond ordinary human computer interface design in the sense of both system design and system use. Metaphors are urgently needed to establish a cognition-friendly visual environment where users can easily understand the internal working mechanism, intuitively comprehend the objects and their contexts, quickly learn operations and features, and effectively explore the visual environment.
10.1 Definition, basic elements, and characteristics of a metaphor

What is a metaphor? There are plenty of definitions, but the essence of a metaphor is understanding and experiencing one thing in terms of another experience (Lakoff and Johnson, 1980). Since a metaphor involves a comparison between different concepts, many definitions emphasize the comparison. A metaphor is a comparison in which the tenor is asserted to bear a partial resemblance to the vehicle (Tourangeau and Sternberg, 1982). A metaphor hinges upon an unusual juxtaposition of the familiar and the unfamiliar (MacCormac, 1989). It is clear that the purpose of a metaphor is to use one concept to explain another concept, so some definitions focus on the characteristics of the comparing and compared
concepts in terms of explanation. Metaphors bring out the "thisness" of that or the "thatness" of this (Burke, 1962). A metaphor is an analogy (MacCormac, 1989), which uses an experience in one domain to illustrate an experience in a different domain and thus to acquire a better understanding of complex and unfamiliar concepts. Metaphors are also defined as models that apply tangible, concrete, and recognizable objects to abstract concepts and/or processes (Baecker et al., 1995). A metaphor is a description of an object or event, real or imagined, using concepts that cannot be applied to the object or event in a conventional way (Indurkhya, 1992). In other words, metaphors usually explain abstract concepts by using more concrete concepts (Weiner, 1984). In summary, when using a metaphor, a familiar, simple, concrete, intuitive, well-known concept or object is employed to describe another new, complicated, abstract, unknown concept or object by juxtaposition of their similar attributes, which makes the concept more easily recognized, communicated, understood, and remembered.

One of the distinctive characteristics of metaphors is exaggeration. The hyperbolic nature of a metaphor enriches its expression and gives users a remarkable imaginative space. It is this exaggerated nature of a metaphor that produces a sense of humor. Metaphors usually reveal the emotions of their producer, and these emotions may be transferred to users and readers. The use of metaphor also has a strong cultural background because a metaphor is deeply rooted in its cultural and social context. The cultural aspect of a metaphor makes its application more challenging.

A metaphor basically consists of the following elements: target domain, target item, source domain, source item, and mapping/matching (See Fig. 10.1). The target domain is where a relatively unknown and unfamiliar target concept comes from. The target item, or target concept, or target referent, or tenor, is what is to be interpreted and explained in a metaphor. The source domain is the one where a relatively well-known and familiar source item, or source concept, or source referent, or vehicle, comes from. The source referent is metaphorically expressed to explicate the target referent. People are supposed to be quite familiar with both the source domain and the source concept in a metaphor. According to structure-mapping theory (Gentner and Markman, 1997), the mapping between the source referent and the target referent in a metaphor is a process of establishing a structural alignment between the two represented referents, which consists of an explicit set of correspondences between the representational elements of the two referents, and then projecting inferences. As a result, a metaphorical representation is made up of the referents, their properties, relations between the referents, and higher-order relations between the relations. The mapping may happen at different levels, from referent attributes, to higher-order relations, even to the related domains, and then integrate them into the overall alignment. The process of matching between two domains is also called blending (Fauconnier, 1997). Unfortunately, the target concept of a metaphor may never exactly match the source concept in the alignment process. Metaphors possess both epiphoric and diaphoric properties, which arise from the similarity and dissimilarity among the attributes of the two referents, respectively (MacCormac, 1989). It is understandable
because of the domain difference and the referent difference. As a result, mapping between the two referents produces three parts: a matched part, an unmatched part in the target referent, and an unmatched part in the source referent. Some features and attributes of a target referent may not be reflected in a source referent, and similarly some features and attributes of a source referent may not be reflected in a target referent (See Fig. 10.1). Even for the matched features and attributes of the referents, the extent of the match may vary. These are the reasons for the mismatch phenomena in a metaphor.

Fig. 10.1. Illustration of basic metaphor elements (target domain and target item, source domain and source item, the matched part, the unmatched part in the target, and the unmatched part in the source)

Attribute mismatches have a significant impact on a metaphor because the projected inference depends heavily upon the alignment. The audience of a metaphor must be capable, first, of identifying the connection being posited and, second, of making the correct attribute linkages between the referents (Hamilton, 2000). A mismatch, if it is not handled properly, may cause confusion for the audience. Matching between the target and the source in a metaphor creates a juxtaposition and an association between them. This association may sometimes seem counterintuitive or make no sense if the nature and scope of the two involved domains are significantly dissimilar. In other words, a metaphor results from a cognitive process that juxtaposes two not normally associated referents, producing a semantic conceptual anomaly. However, it can be this lack of making sense that generates unexpected and surprising effects that may lead to an easier and more thorough understanding of the target. That is the power of a metaphor. A metaphor constitutes a violation of selection restriction rules within a given context, and it is this violation that is supposed to explain the semantic tension (Johnson, 1980). This means that a metaphor may be contextually irregular if it is interpreted only literally, because it applies a concept from one domain to the context of an "irrelevant" domain. That is, the two referents do not have to be within
similar domains, species, or genera. As long as they share meaningful characteristics and these characteristics make sense in their contexts, one referent can serve as an analog of the other. The matching process is one of integrating two domains. When the two domains map partially, a new domain in the metaphor is derived. Matching generates new features, relationships, and contexts which may not exist in either the target or the source. The new properties, relationships, and contexts are a result of the creative mapping. That is, matching between a target and a source is not simply a mapping process; it produces emotion as well as new attributes and characteristics that belong to neither the tenor nor the vehicle but to the integration of both.
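As a small illustration of the decomposition in Fig. 10.1, the target and source referents can be treated as sets of attributes; the matched part and the two unmatched parts then fall out of ordinary set operations. The attribute lists below are invented purely for illustration.

# Illustrative only: hypothetical attribute sets for a "desktop" metaphor,
# decomposed into the three parts of Fig. 10.1 with set operations.
target = {"stores files", "runs programs", "has folders", "is networked"}   # computer workspace
source = {"stores files", "has folders", "has drawers", "is made of wood"}  # physical desktop

matched             = target & source   # what the metaphor conveys directly
unmatched_in_target = target - source   # target features the metaphor cannot express
unmatched_in_source = source - target   # source features that may mislead users

print(sorted(matched))              # ['has folders', 'stores files']
print(sorted(unmatched_in_target))  # ['is networked', 'runs programs']
print(sorted(unmatched_in_source))  # ['has drawers', 'is made of wood']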
10.2 Cognitive foundation of metaphors

Cognitive science studies various human mental tasks, behaviors, and processes such as thinking, reasoning, planning, learning, memory, attention, and so on. In essence, the interactive process between a human and the information retrieval visualization environment is a cognitive one. Understanding the relationships between metaphors and cognition at a deep mental level helps us master the essence of metaphorical embodiment in the information retrieval visualization environment. Metaphors, as proper cognitive devices (MacCormac, 1989), are essential to learning, to the development of thought, and to a more holistic understanding of a domain. Use of metaphors affects the perception of a concept, its interpretation, and possible subsequent actions. The creation and explanation of metaphors have a close relationship with analogical reasoning and problem solving (Gentner and Gentner, 1983). In a landmark study, Lakoff and Johnson (1980) stated that our ordinary conceptual system is fundamentally metaphorical in nature in terms of how we both think and act, and that human thought processes are largely metaphorical because the human conceptual system is metaphorically structured and defined. They claimed that the nature of metaphors is the nature of cognition. Johnson-Laird (1983) further confirmed that, from the cognitive perspective, metaphors are regarded as examples of mental models. Metaphors can play an extremely important role in understanding more abstract concepts. Lakoff and Johnson (1980) claimed that the reasoning we use for such abstract topics is somehow related to the reasoning we use for mundane topics. In other words, abstract reasoning depends upon concrete and simple facilities like metaphors. The more complicated a concept, the more a metaphor is needed to provide its explanation. Cognitive metaphor theory (Romero and Soria, 2005) assumes that there is a set of ordinary metaphoric concepts around which people conceptualize the world. The concepts in this ordinary concept system structure what we perceive, how we get around the world, and how we relate to other people. People tend to solve problems by drawing on prior experiences and knowledge gained from
similar situations. Cognitive models derive their fundamental meaningfulness directly from their ability to match up with pre-conceptual structures. Such direct matching provides a basis for an account of truth and knowledge (Lakoff, 1987). The pre-conceptual structures are a pre-existing knowledge system. They are built upon long-term accumulated experiences and expertise. When people encounter an unfamiliar concept, the cognitive system looks for the best match between the unfamiliar concept and the pre-conceptual structures, and tries to understand it by relating existing familiar concepts, expertise, and experiences to the unfamiliar concept. Since metaphors are familiar and well-understood concepts, events, and objects, they are primary elements of the pre-conceptual structures. If metaphors in the pre-conceptual structures share common characteristics with a new concept, then they may naturally be used to explain the new concept. This theory is extremely important because it helps us understand the cognitive role of metaphors in learning and comprehending a new concept, especially a complicated and abstract one.
10.3 Mental models, metaphors, and human computer interaction

10.3.1 Metaphors in human computer interaction

Metaphors have a strong influence on computing, especially upon interface design. Of all the cognitive science concepts used in human computer interaction, metaphor has proved to be one of the most durable and accepted (Dillon, 2003). Interface metaphors act as a cognitive shortcut by helping users build on already existing mental models of familiar concepts when learning new systems (Booth, 1989). If users work with a metaphorical interface, previously existing knowledge about the metaphor is bound to affect their perception of the interface and the interaction process. Well designed metaphors have proved to be particularly robust as conceptual aids and were quickly adopted as part of the fabric of the graphical interface (Hamilton, 2000). It is not surprising that the application of metaphors has become a basic guiding principle in general interface design. It is interesting that both system designers and system end-users benefit from metaphors. System analysts and developers use metaphors in system design and programming, such as states, data flow, task, activity, entity, object, overflow, traverse, tree, stack, and queue. On the user front, metaphorical embodiments can easily be found in an application, such as a window, drawer, folder, paper clip, bookmark, trash can, virus, quarantine, and hourglass. The information highway and the Web are almost synonyms for the Internet, and navigation and surfing are equivalent terms for browsing.
10.3.2 Mental models

The theory of mental models was first introduced by Craik (1943). He believed that the mental models of human beings are "small-scale models" of reality which are used to reason, anticipate events, and underlie explanation. Mental models have been defined as the way that people solve deductive reasoning problems (Johnson-Laird, 1983). A mental model is the appropriate organization and representation of data, function, work tasks, activities, and roles that people inhabit within social organizations of work or play (Marcus, 1994). A mental model is an internal explanatory mechanism of human thought that dictates the way in which people perceive, understand, interact with, and make predictions about the real world.
10.3.3 Mental models in HCI

Mental models are extremely important for interface design. Users may not be aware of the formation of mental models or of the impact of mental models on their behavior when they interact with an interface, but people's thinking, behavior, and actions are guided by mental models in interactive contexts. From the human computer interaction perspective, mental models tell people how a system works when they interact with the system (Norman, 1988). By understanding users' mental model of an information system, designers may better know how users perceive the system, how users infer system features based upon the interface, how users react to the system, and what response users expect from a certain function or feature. Therefore, correct mental models may help people avoid unnecessary pitfalls in an interface design and make the interface design more user-centered. The generation of mental models involves the aggregation of experiences and knowledge. Mental models are the result of categorization, classification, and abstraction of a complicated and sophisticated situation or phenomenon, excluding insignificant details and extracting the significant hidden structure. As a result, they simplify the situation or phenomenon, present it structurally, and contain the minimum detailed information needed to maintain the analytic, explanatory, and communicative power of a conceptual model. Mental models vary among people because different people have different cultural and technical backgrounds, mental abilities, experiences with a system, and expertise in a domain. In this sense, mental models are subjective. Mental models are not fixed after they are established in the mind. They can be updated and revised as users interact with new environments, and they evolve based upon both successful and unsuccessful experiences with a system. Users usually form their mental models of an interface through the following channels: training, user manuals, user guides, system help files, exploration of the system by themselves, and previous experiences with similar systems.
Types of HCI mental models

Mental models can be classified into three categories based on the distinctive user groups of an information system: the user mental model, the design model, and the system model (Norman, 1988; Cooper, 1995a). The user mental model of a system refers to the model established from the perspective of how users perceive and understand the system. The design/manifest model of a system refers to the description and explanation from the perspective of the people who make theoretical contributions to the system. It is devised as a means for understanding the system design. In other words, it is an initial conceptual model of the system. The system/implementation model refers to the model of the system from the viewpoint of system developers and implementers. A system model serves as an intermediate bridge between the user model and the design model and directly influences the form of a user model. That is, users perceive and understand the design model through the system model. In an ideal scenario, a design model should be correctly mirrored in an implementation model, and both the design model and the implementation model should be correctly echoed in a user mental model. In other words, a well structured design model should be equivalent to its system model, and a user model should be consistent with both the design model and the system model. However, for a variety of reasons, a user model may bear little resemblance to its system model and design model, which can lead to cognitive confusion and even frustration when users interact with a system. Mental models can also be classified into a structural model and a functional model from the perspective of system use (Preece et al., 1994). In this sense, they can be regarded as sub-categories of the user model. A structural mental model describes the deep internal working mechanism of a system or device; it tells people how the system works and how its internal components are related. Knowing a structural mental model, users can understand the fundamental principles behind the screen, explain system reactions, predict possible reactions and responses triggered by an operation, and therefore effectively interact with the system. A functional mental model describes how users operate a system or device; it guides users in the operation of the system that it describes.

Implication of metaphors in mental models

The role of metaphors in effective interactions between users and information systems has long been a focus of mental model studies. It is widely recognized that users, especially new users, try to understand information systems as analogical extensions of familiar activities and objects (Douglas and Moran, 1983). Metaphors embedded explicitly or implicitly in an application are powerful tools for the development of cognitive and conceptual models (Rubenstein and Hersh, 1984). These observations have inspired a variety of metaphorical interfaces, ranging from pervasive metaphorical icons in interfaces, to a metaphorical paradigm for graphic controls, to a metaphorical interface design principle. In essence, mental models are structural analogies of the world (Johnson-Laird, 1983). Therefore, it is very natural to employ metaphors to construct mental models. Metaphors can be embodied in a conceptual design model and be materialized
in a system model so that the user model may be easily and accurately formed and information in the design model can be conveyed to end-users through the system model in a more effective and efficient fashion. For a variety of reasons, a system model may only partially reflect a design model. User models can vary across different users of the same system because the generation of a user model is affected by a user's knowledge, expertise, experience, and understanding of the system model. A poor design model would definitely result in an incorrect system model, while a good design model does not guarantee a good system model. An improper system model would definitely have a negative impact upon the generation of a correct user model. The role of metaphors in the design model, system model, and user model is to serve as an effective means of communication. An appropriate metaphorical embodiment helps the system designer, the system implementer, and system users to understand the system on the same cognitive ground. Metaphors allow both a design model and a system model to be formed within the same familiar contexts, which would definitely decrease the inconsistencies between the two models. In addition, metaphors enable users to bypass the system model and communicate directly with the design model through the metaphors. As a result, a communication layer is removed and the possible "noise" created by that layer is eliminated. Therefore, a metaphorical embodiment facilitates the correct generation of the models; maximizes the effectiveness of communication among designers, implementers, and end-users; and minimizes the "noise" added to both the system model and the user model when a design model is transformed into a system model and shaped into a user model (See Fig. 10.2).
Fig. 10.2. Role of metaphors in the three models (the metaphor connects the design model, the system model, and the user model)
The key point of metaphorical embodiment in mental models is to find an appropriate and applicable metaphor which fits the design model, can be conveniently and smoothly transformed into a system model, and easily helps users to shape a correct user mental model.
10.4 Metaphors in information retrieval visualization

Metaphorical embodiments in information retrieval visualization environments are quite different from metaphorical applications in general computer interface design. That is because the former involves more variables and concentrates not only on the analogy of each individual control but also on the analogy of an entire visual environment. Metaphors have been applied at different levels of information retrieval visualization environments, ranging from an entire system such as a whole visual semantic framework, to a task such as searching and browsing, to a graphic icon design. Information retrieval visualization environments provide an ideal stage for metaphor application.
10.4.1 Rationales for using metaphors

Spatial, sophisticated, and abstract characteristics of information retrieval visualization

One of the most salient characteristics of an information retrieval visualization environment is its spatial presentation in a two or three dimensional space. Objects, abstract object relationships, semantic structures, and associated retrieval means have to be illustrated within that space for end-users in a meaningful fashion. It is a challenging task to come up with a visualization model which puts these elements together into a theoretical framework that makes sense for information retrieval. The process may require complex mathematical reasoning, spatial imagination, and a solid understanding of data characteristics and user information seeking behavior. But it is even more challenging for users to comprehend the information retrieval visualization environment. This is because:
1. Semantic relationships among objects and structures in a database are complicated, invisible, and abstract.
2. The high dimensionality of a database has to be reduced in order to present objects in the visual space. During dimensionality reduction some attributes of an object may have to be compromised to accommodate significant attributes in the visual configuration.
3. The objects, object relationships, or inherent structures which underlie the visualization environment may be "distorted" after they are projected onto the visual space.
4. Semantic relationships of objects in the visual space may be expressed in a dynamic way rather than a static way.
5. An information retrieval visualization environment may contain multiple types of objects, for instance, documents, queries, reference points, links, retrieved results, browsing paths, and so on. These objects need not only to be distinguished but also to be understood and manipulated by users.
6. Furthermore, an information retrieval visualization environment usually offers richer, more dynamic, and more sophisticated retrieval operations. For example, the search process may require the support of multiple reference points, browsing a visual space is no longer linear and users navigating in a visual space may become disoriented, reformulating a search strategy is affected by multiple factors, and selection of an information retrieval model requires more expertise in information retrieval.

In a broader sense, an information retrieval visualization environment is an interactive interface. It should possess all of the basic characteristics of an ordinary human computer interface. In addition, the unique characteristics listed above differentiate information retrieval visualization from ordinary human computer interface design. A design model of information retrieval visualization becomes much more complex due to these unique characteristics, the design model is less easily reflected and materialized in a system model, and establishment of a proper user mental model is even more challenging. As a result, a visual configuration in a visual space is usually abstract and complex for users, especially users without an information retrieval background. If users do not understand the meaning of the visual configuration in a visual space, it is impossible for them to manipulate it effectively and efficiently. Finding an effective way to simplify the abstract, sophisticated, and complex visual configuration and make the abstruse visual configuration understandable without a huge effort is one of the high priorities for information retrieval visualization.

Metaphors as the solutions

As the levels of complexity are layered one atop the other in order to produce the high-level behaviors that are the actions we recognize while interacting with the computer, the possibility of talking or thinking literally about the computer's behavior vanishes. We deal with this complexity and this plasticity by speaking metaphorically about the computer (Hutchins, 1989). Purely intellectual concepts, the theoretical concepts in science, are often, perhaps always, based upon metaphors. Researchers have found that the spatial and abstract properties, which are basic properties of an information retrieval visualization environment, are inherently metaphorical. Studies have shown that metaphors have a natural connection to abstract and spatial concepts. An abstract concept is intrinsically metaphorical (Lakoff and Johnson, 1980). Our comprehension of abstract domains is often shaped through spatial metaphors, a property which can be and has been directly exploited in a wide variety of interface designs (Kuhn and Blumenthal, 1996). The above analysis suggests that the complexity of an information retrieval visualization environment caused by its inherent spatial and abstract properties can be reduced by metaphorical embodiment in the environment. In other words, metaphors simplify
the complex design model of information retrieval visualization, smooth the progress of materializing the system model, and facilitate the correct shaping of the user mental model. Metaphors can help users overcome the learning curve of information retrieval visualization environments. In fact, a metaphor has become a synonym for easy learning in human-computer interface design because it does not require technical knowledge, making an unfamiliar system look and act like a familiar one. As a result, metaphors can not only minimize cognitive effort and reduce users' learning time, but may also result in long term memory of the system (Allbritton et al., 1995). Metaphors promise a considerable payoff in wide-ranging improvements in learnability and ease of use (Carroll and Thomas, 1982). They can improve the familiarity and predictability of an information retrieval visualization environment. Understanding a visual semantic configuration in a visual space is not only important but also necessary. Users cannot effectively and efficiently interact with an information retrieval visualization system if they are thrust into such a complex environment with little background knowledge. This means that users have to understand its visual semantic configuration and learn its features before they can manipulate it. Metaphors help users instantly grasp the essence of a visual configuration, quickly understand it, and effectively interact with it. Another non-technical benefit of metaphorical applications is to make a monotonous and tedious visual presentation and routine retrieval controls more vivid and interesting. An information explorative process may become an entertaining process in a metaphorical information retrieval visualization environment. For instance, if users navigate in a metaphorical information system where all elements of a library such as books, bookshelves, the information reference desk, storage areas, rooms, check-in, and check-out are integrated, they feel comfortable and excited in the familiar setting and contexts. The fisheye technique magnifies an interesting area and minimizes its surrounding areas, like a fish searching for food in the water. And the flexible control, dynamic focus, and dramatic exaggeration of the fisheye view can make information browsing great fun. Metaphors also inspire new scientific ideas, and they are vital to scientific discovery. Information retrieval visualization is no exception. In the Pathfinder layout algorithm, the spring theory was successfully applied to draw objects in a visual space based on mutual attraction strengths. Metaphor arises from a quite normal creative human cognitive process that combines unrelated concepts to produce new insight (MacCormac, 1989). Such insight would definitely enrich information visualization environments.
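The fisheye behaviour mentioned above is commonly formalized with a degree-of-interest function in the style of Furnas's generalized fisheye views, DOI(x | focus) = API(x) - D(x, focus), where API is a node's a priori importance and D is its distance from the current focus; only items whose DOI exceeds a threshold are shown in detail. The sketch below is a minimal illustration of that idea and is not taken from any system discussed in this chapter; the tiny hierarchy, importance values, and threshold are hypothetical.

# A minimal sketch of a fisheye degree-of-interest filter in the style of
# Furnas's generalized fisheye views: DOI(x | focus) = API(x) - D(x, focus).
# The hierarchy, a priori importance values, and threshold are hypothetical.

tree = {  # node -> parent (None for the root)
    "root": None, "retrieval": "root", "visualization": "root",
    "metaphor": "visualization", "fisheye": "metaphor",
}

def depth(node):
    d = 0
    while tree[node] is not None:
        node, d = tree[node], d + 1
    return d

def distance(a, b):
    """Number of tree edges on the path between two nodes."""
    ancestors = set()
    n = a
    while n is not None:
        ancestors.add(n)
        n = tree[n]
    n, up = b, 0
    while n not in ancestors:          # climb from b to the common ancestor
        n, up = tree[n], up + 1
    return up + (depth(a) - depth(n))  # then descend to a

def fisheye_view(focus, threshold=-3):
    api = {node: -depth(node) for node in tree}   # higher levels matter a priori
    doi = {node: api[node] - distance(node, focus) for node in tree}
    return sorted(node for node, score in doi.items() if score >= threshold)

# Keeps the focus and its ancestor spine; drops the distant "retrieval" branch.
print(fisheye_view("fisheye"))   # ['fisheye', 'metaphor', 'root', 'visualization']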
10.4.2 Metaphorical information retrieval visualization environments

Metaphors are widely applied to visualization for information retrieval. Metaphors are used to illustrate different perspectives of information retrieval visualization, ranging from an individual document, to a hierarchy structure, to hyperlink
structures, to citation linkages, to information retrieval controls and processes, to a customized dataset, and to an entire collection of databases. Applied metaphors include things such as a butterfly, river, disc, the galaxy, the solar system, geographic landscape, islands in an ocean, maps, library, book, bookshelf, fisheye, lens, wall, water flow, etc. The application of metaphors in visualization for information retrieval can in general be categorized into the following three groups: metaphors for semantic framework presentation, metaphors for information retrieval interactions, and metaphors for solving theoretical problems. Most of the metaphorical information retrieval visualization environments are in the first category because a visual presentation is fundamental for the environment. Notice that in reality an information retrieval visualization environment may be classified into multiple metaphorical categories due to multiple embodiments in the visualization environment.

Metaphors for semantic framework presentations

One of the primary characteristics of an information retrieval visualization environment is the demonstration of object semantic relationships in the visual space. These objects are not randomly scattered in the visual space. Objects must be positioned and projected onto a meaningful framework to form a visual configuration where the internal structure, semantic connections among projected objects, and other characteristics of projected objects are illustrated. Metaphors can be embodied in a visualization environment and provide intuitive structures for the display of these objects. A map is a familiar concept, and it is employed in metaphorical visual configurations like Visual Net (2005) and WebMap (2003). Important properties of a map such as location, area, neighborhood, distance, and scale can be used to express semantic relationships about a dataset. Location indicates the position of an object in the semantic context/map. An area includes a group of objects with the same semantic characteristics. A neighborhood shows two groups of objects which share some commonality and have some kind of semantic connection. The distance between two objects in a map implies the degree of similarity between them: the closer they are in a map, the more relevant, and vice versa. A zoom feature allows users to observe an area of interest at different levels, from a large-scale global overview to a small-scale specific local view. Like maps, landscapes are also used in metaphorical interface design. A landscape brings in a variety of physical geographic features like fields, valleys, mountains, paths, rivers, etc. These properties, for instance in SPIRE (Wise, 1999) and VxInsight (Boyack et al., 2002), are used to express data relationships in a dataset. WebStar (Zhang and Nguyen, 2005) uses the solar system as its visualization configuration metaphor. We know that in the solar system, celestial objects such as planets and asteroids have their own orbits and move at a constant speed. These objects revolve around the sun and pull at each other due to gravity. It is gravity that determines the orbit and moving speed of a planet or asteroid. WebStar emulates the solar system. In the WebStar visual space, the defined central (focus) point, or a selected start Web page, is regarded as the sun, and scattered
subject icons which represent users' interests and Web page icons which are outgoing Web pages of the selected start Web page are perceived as planets and asteroids. When a subject rotates, all related Web pages revolve around the central point, the sun. The rotation speed of each Web page icon is determined by the semantic strength between the moving subject and the Web page. Scattered outgoing Web pages are gravitationally affected by the central point. The gravity in the visual space here is defined as the semantic strength between the central point and a scattered Web page: the closer they are to each other, the more relevant they are, and vice versa. Ryukyu ALIVE (Access Log Information Visualizing Engine) (Wakita and Matsumoto, 2003) presents the Information Galaxy metaphor. Stars within the galaxy represent Web pages. A browsed Web page jumps towards the outermost rim of the galaxy for further review. Un-browsed Web pages are gradually drawn towards the center of the galaxy, and eventually disappear. Users browse Web pages by observing the moving galaxy system. Another galaxy metaphor is introduced in SPIRE Galaxies (Wise, 1999), where documents as stars are clustered based on their interrelatedness. In the Topic Islands interface (Miller et al., 1998), islands represent topics. The location and size of an island depend upon the relationships among the involved topic islands and the number of documents associated with the island, respectively. Oceans separate the islands and provide the browsing platform. File Pile (Rose et al., 1993) is designed to support the casual organization of documents. All items/documents in a database can be automatically classified into several meaningful piles on a table. Each pile indicates a certain subject/category into which users can put items or from which they can pull items out. If the number of items in a pile is large enough, it can be subdivided into several related sub-piles at the request of users. Items in a pile can be selected and browsed by users. A library can be utilized as a metaphor to organize documents because a library's elements are very similar to the structure of directories and a library's basic functions are also similar to those of directories. The mapping of the virtual entities of directories onto the structure of a library is very natural and straightforward. The library rooms are a rendition of directories while the books in them are files. Bookshelves represent different subjects. Doors separate the rooms or subjects (Chudý and Kadlec, 2004). A bookshelf as an independent metaphor provides a natural framework to organize data in the Visual Net system (2005), Forager (Card et al., 1996), and the LibViewer system (Rauber and Merkl, 1999). Book icons in the bookshelf can represent categories and classifications, or different book types such as reference books, periodicals, non-reference books, or electronic books. The size and thickness of a displayed book icon can be associated with the number of books within a category or book type. The location of a book icon in the bookshelf can indicate the status of a book, such as whether it is reserved, available, or currently borrowed. The color of a book icon can also be used to show its publication date. Hierarchical structures are widely used to organize information. The parent-child relationship, sibling relationship, and level relationships of a hierarchical structure need to be metaphorically presented. The Disk Tree visualization
method (Chi et al., 1998) selects a disc layout to display a complicated tree structure. Multiple levels of a hierarchy are illustrated by successive concentric circles sharing the same center (the root of the tree). The nearer to the center a circle is, the higher the level the circle represents. The angular size of a slice, which is a category within a circle, corresponds to the number of leaves in that category. The vertex of a slice is the parent and all children are located on the arc of the slice. Within the same category all sibling elements are located on the same edge of a circle. In this manner, a hierarchical structure is visualized in a two dimensional space. WEBKVDS (Web Knowledge Visualization and Discovery System) (Chen et al., 2004) chooses a similar disk tree structure to visualize Web visits, Web usage statistics, average access time per page, and the access possibility of links. Multiple attributes of a single document, rather than a set of documents, can be metaphorically demonstrated in a butterfly structure. The Butterfly Visualizer (Mackinlay et al., 1995) uses the head and two wings of a butterfly to illustrate the relationships between a retrieved document and its citing and cited documents. The head of the butterfly, as an entry point, indicates the basic bibliographic information of a retrieved document; the left and right wings of the butterfly comprise all references of this document and all citers of this document, respectively. Time is sometimes a crucial factor for certain data. It is used as a browsing thread to organize the data and to guide users through a series of events. Time is metaphorically presented in many visualization systems for this purpose. Perspective Wall (Mackinlay et al., 1991) is a three dimensional wall metaphorical configuration. One dimension (the horizontal dimension) of the visual space is reserved for the time variable, and the vertical dimension is used to visualize data layering for its information space. Detailed textual data is organized as small grids (cards) posted on the wall. The position of a textual grid (card) is determined by two important parameters: the publishing time of the data (horizontal location) and the type of data (vertical location). When the wall moves, users browse events that happened in a continuous time period. In this way a large amount of linearly structured data can be effectively displayed. ThemeRiver (Havre et al., 2002) applies a river metaphor to the visual demonstration of topic changes in a database as time passes. The horizontal flow of the river represents the flow of time. That means that each horizontal point indicates a certain time. Since the flow is continuous, a horizontal section of the river flow represents a period of time. Each point of the horizontal flow corresponds to a vertical section which indicates a topic or theme. This vertical section is divided equally by the horizontal line. The width of the vertical section indicates the number of documents related to the corresponding topic or theme. Together, the vertical sections form a dynamic current of the river. The wider the river, the more documents address the topic or theme at that time; the narrower the river, the fewer documents address the corresponding topic at that time. Presence Era (Viégas et al., 2004) uses the geological layers in sedimentary rocks to present the time factor in its interface.
Geologically speaking, the accumulation of geological layers over time can reveal the detailed evolution of geological changes. Users can look into the history by examining various layer patterns. The geological formation of sedimentary rocks is significantly impacted by time and geological environments. Geological properties such as
the number of rock layers, the location of a layer in the rock, and the thickness of a layer are utilized to metaphorically present the time period, the time, and the amount of data, respectively, to visualize a time-sensitive data set such as e-mail files, online news group discussion archives, Internet traffic logs, and citation chains in such a geological information visualization environment.

Metaphors for information retrieval interaction

The interactions between users and an information retrieval visualization environment are vital and sophisticated. Searching, browsing, judging relevance, and other activities are done through interactions. Browsing in an information visualization environment is a necessary and crucial means of finding information. A lens is a special reading tool that allows readers to focus on a special area of interest in a visual space and exclude other irrelevant areas during a browsing process. A visualization space can provide many points of interest and complicated contexts. A special tool like a lens can help users narrow down to a focus area during navigation and avoid overwhelming non-relevant information. For this reason a lens is employed in many visualization applications like VIBE and the previously discussed Perspective Wall. A bifocal lens was used to support the perception of a sequence of messages, with the focal area magnified to emphasize it (Spence and Apperley, 1982). The fisheye view technique can be regarded as a special lens. A fisheye perceives objects around it in a quite unique way. The focus area of a fisheye is magnified to show very fine details of the area, while other areas are intentionally minimized but not totally eliminated in order to maintain a context. A fisheye can smoothly and gradually transfer from one focus area to another focus area so that all areas are naturally connected. Both the size of the focus area and the degree of detail in the focus area can be controlled at will. It is obvious that the focus area is dynamic. As the user's interests change, the focus area changes accordingly. These properties of fisheyes can be analogized to an information retrieval visualization environment to facilitate browsing. In addition, the fisheye view technique maximizes the use of a limited visual display space to demonstrate more data, and illustrates richer and more specific information without losing the context in the visual space. For these reasons, the fisheye technique has been used in a wide spectrum of information visualization environments such as the fisheye map (Yang et al., 2003), fisheye menu (Bederson, 2000), and fisheye hierarchy (Schaffer et al., 1996). Notice that since the non-focus areas are "distorted" to some degree, the distorted areas may confuse users. The walking metaphor (Mackinlay et al., 1990) simulates a browsing operation by using human body movement such as moving forward, moving backward, turning left, and turning right; human head rotation such as left, right, down, and up; and the planar motion of the human body such as left, right, up, and down. These motion combinations enable users to explore the visual space flexibly, as if walking in the physical world. Turning a page, skipping pages, and ruffling pages are common reading behaviors of a reader. In WebBook (Card et al., 1996) these behaviors of a reader are animated in a vivid way. Users can browse the next or previous page by clicking on the right or left page of a metaphorical book. As a page is turned by users, its
contents gradually appear to users as if they were leafing through real pages. A user can also click on the right or left edge of the book. The relative distance from the current page position to the selected position on that edge indicates how many pages to skip. The system even allows users to ruffle through the pages to take a quick glance at all pages. Boolean search is implemented in almost all information retrieval systems, but the correct understanding and proper use of Boolean search are not a simple task. The visualization of Boolean operations may shed light on this problem. Filter/Flow (Young and Shneiderman, 1993) attempts to simplify the complex Boolean query formulation process by using pipelines and water control. The documents of a database are metaphorically depicted as water in a pipeline system. The water is controlled by a series of control valves. A valve is usually connected by two pipelines. Each valve, which consists of a group of search terms selected by users, serves as a filter to control the amount of document flow. The way valves are combined can form a normal Boolean operator like "AND" or "OR", depending on the valve positions in the two dimensional visual space. In other words, the positions of valves in the visual space determine the nature of the Boolean operation. If two valves are connected by pipelines in parallel in the visual space, it suggests there is an "OR" operation between the two valves. If two valves are connected by pipelines in serial order, it implies that there is an "AND" operation between the two valves. The diameter of a pipeline corresponds to the amount of flow in the pipeline. The document flow starting from a dataset finally reaches a result pool as the final search results after a series of filtering processes. In a traditional manual punch card retrieval system, a card represents a topic or subject and it has a grid system. Each grid cell corresponds to a document. The position of a document is the same in all cards. If a document is related to the topic or subject that the card represents, then the corresponding grid cell of the document is punched. Retrieval processing is simple: selecting a group of cards of interest, putting the cards together, and checking the grid cell status. If users can see through a grid cell, then the corresponding document is retrieved, because that document is related to all of the selected topics/subjects. The Semantic Filter (Fishkin and Stone, 1995) simulates this retrieval processing successfully. Metaphorical search strategies are embodied in Book House (Pejtersen, 1991). Book House looks like a familiar library which integrates a library building, rooms, and people. Each of the rooms is equipped with bookshelves and books. Users can seek information by navigating in the visual library setting. Users can enter the library building and roam any available rooms freely, guided by people icons and room titles. Upon entering a room, users may choose among four different search strategies: search by analogy, browse pictures, analytical search, and browse book descriptions. These four search methods are visualized by four metaphorical figures in the context of a room. A classification scheme, represented by an iconic display, can be selected by users to narrow down their search topics.
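Both the filter/flow pipelines and the punch card scheme described above ultimately visualize simple set operations: a valve admits the documents matching any of its terms, valves in series intersect their flows (AND), valves in parallel merge them (OR), and the pipeline width is just the size of the passing set. The following sketch is only an illustration of that reading; the tiny inverted index and query are hypothetical, and the actual Filter/Flow system is an interactive graphical tool rather than code like this.

# A rough sketch of the Filter/Flow idea: valves in series act as AND
# (intersection), valves in parallel act as OR (union), and the pipeline
# "width" is the number of documents flowing through. The small inverted
# index below is hypothetical.

index = {  # term -> set of documents containing the term
    "metaphor":      {1, 2, 3, 5},
    "visualization": {2, 3, 4, 5, 6},
    "retrieval":     {3, 5, 6, 7},
}

def valve(terms, flow):
    """A valve passes only documents that contain at least one of its terms."""
    allowed = set().union(*(index[t] for t in terms))
    return flow & allowed

def serial(flow, *valves):    # valves connected one after another -> AND
    for terms in valves:
        flow = valve(terms, flow)
    return flow

def parallel(flow, *valves):  # valves connected side by side -> OR
    return set().union(*(valve(terms, flow) for terms in valves))

all_docs = set().union(*index.values())
# (metaphor OR retrieval) AND visualization, expressed as pipelines and valves:
branch = parallel(all_docs, ["metaphor"], ["retrieval"])
result = serial(branch, ["visualization"])
print("pipeline widths:", len(all_docs), "->", len(branch), "->", len(result))  # 7 -> 6 -> 4
print("result pool:", sorted(result))                                           # [2, 3, 5, 6]

Stacking punch cards and looking through the holes corresponds to the same serial AND: the retrieved documents are those in the intersection of all selected cards.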
Metaphors for solving theoretical problems

The implication of metaphors for visualization for information retrieval is not only to establish interactive interfaces that facilitate understanding, learning, and manipulating the system for end-users but also to help solve theoretical problems behind the screen for system designers. The spring theory for an optimal object layout in a visual space is a good example of this category. After semantic relationships among objects are clearly defined by a certain approach, these objects are positioned or drawn in a low dimensional visual space based on their existing relationships. This is the so-called graph drawing issue. It has proven to be a difficult task because the drawn objects have to be evenly distributed in the visual space to achieve a spatial balance for display, the mutual attractions of the objects have to be considered, and unnecessary edge crossings should be avoided. The aim can be reached by applying the spring embedder approach (Eades, 1984). Observe that in a physical system, when the ends of different springs are connected by rings and randomly positioned in an initial state, the positions of all connected rings are automatically adjusted by the forces of the connected springs and ultimately reach equilibrium. If rings and springs are replaced by objects and the semantic strengths between objects, respectively, all objects can achieve an optimal display status in the visual space.
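The spring metaphor can be made concrete with a simplified force-directed sketch in the spirit of the spring embedder; the graph, force constants, step size, and iteration count below are arbitrary illustrative choices rather than the values of the original algorithm. Connected objects attract one another like stretched springs, unconnected objects repel, and repeatedly moving every object a small step along its net force lets the layout settle toward an equilibrium.

import math, random

# A simplified force-directed sketch in the spirit of the spring embedder
# (Eades, 1984). The graph, constants, and iteration count are illustrative.
edges = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}   # semantic links
nodes = {n for e in edges for n in e}

random.seed(0)
pos = {n: (random.random(), random.random()) for n in nodes}   # random initial layout

def step(pos, c_spring=2.0, c_repel=1.0, rest=0.3, rate=0.02):
    forces = {n: [0.0, 0.0] for n in pos}
    for a in pos:
        for b in pos:
            if a == b:
                continue
            dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
            d = math.hypot(dx, dy) or 1e-6
            if (a, b) in edges or (b, a) in edges:
                f = c_spring * math.log(d / rest)   # spring: attract when stretched
            else:
                f = -c_repel / math.sqrt(d)         # non-adjacent objects repel
            forces[a][0] += f * dx / d
            forces[a][1] += f * dy / d
    return {n: (pos[n][0] + rate * fx, pos[n][1] + rate * fy)
            for n, (fx, fy) in forces.items()}

for _ in range(300):   # the "rings" settle toward equilibrium
    pos = step(pos)
print({n: (round(x, 2), round(y, 2)) for n, (x, y) in pos.items()})

In an information retrieval setting the edge set and attraction strengths would come from the semantic relationships computed beforehand, for example a Pathfinder network, rather than from a hand-written graph as above.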
10.5 Procedures and principles for metaphor application

10.5.1 Procedure for metaphor application

The previous discussion mentioned that there are five basic elements in a metaphor: source domain, source items, target domain, target items, and mapping/matching. In fact, the application of a metaphorical information visualization environment must include these basic elements. The essence of metaphor application rests on finding a source domain and source items and mapping them to the target items in a meaningful way. In this case the target domain is information retrieval visualization, and the target items may be the information retrieval environment and/or its components. The source domain is open to any meaningful domain. Both finding an appropriate source domain and source items and making a successful match between the source and the target require experience and imagination, in addition to a solid comprehension of the working mechanism of an information retrieval visualization environment. The procedure is as follows.
1. Identify the target domain, which in this case is obviously information retrieval visualization.
2. Identify the target items in the domain. The application of metaphors in an information retrieval visualization environment may occur at multiple levels. The entire visualization environment can be a target item, a task of the visualization environment can be a target item, or a control of a task can be a target item. If the entire visualization environment is defined as a target item, its sub-elements can be further broken down into target sub-items.
3. Identify the source domain. Potential candidates are fields familiar to users, which share commonalities to some degree with the identified target items, or any domains which may trigger an association with the identified items.
4. Identify source items in the source domain. All possible items sharing common attributes with the target items, associated and related objects of the identified target items, the context of the selected target items, and relationships with other items should be taken into consideration.
5. Match or map the target and the source. Be aware that the match between the target and the source may be partial. Make sure that the important attributes of the target are matched to the salient attributes of the source. Analyze the matched parts between the target and the source and their implications for the visualization environment, the unmatched part of the target and its implications, and the unmatched part of the source and its implications, respectively.
6. Evaluate the metaphorical environment. Develop a prototype of the metaphorical visualization environment, conduct a pilot experimental study of the visualization system, and revise it based upon user feedback. Revision may occur in any step: changing the source domain, replacing the source items selected from the source domain, or adjusting the matching. First-hand feedback from users is important for a successful metaphorical application in an information retrieval visualization environment.
10.5.2 Guides for designing a good metaphorical visual information retrieval environment

Metaphors hold a lot of promise for visualization for information retrieval. The application of metaphors to an information retrieval visualization environment is a complicated process. A good metaphorical visualization environment design requires guidance to avoid the improper application of metaphors. Poor metaphorical embodiment is regarded in interface design as not only unhelpful but also harmful (Cooper, 1995b). Improper use of metaphors can cripple the interface with non-relevant limitations and blind the designer to new paradigms more appropriate for a computer-based application (Gentner and Nielson, 1996). The proposed guides attempt to assist in the design of a metaphorical information retrieval visualization environment which should be appropriate and suitable for information retrieval, intuitive and easy for users to learn, applicable to system implementation, and extensible for future expansion. Let us discuss these in depth.

Be aware of cultural differences

A metaphor contains two types of representations: explicit representation, such as models and artifacts, and implicit representation, such as associated background,
role, and culture (Benyon and Imaz, 1999). Without a doubt, cultural factors affect the design, implementation, functionality, and use of a metaphorical interface. Metaphors are deeply rooted in a cultural context, and the cultural impact of metaphors on metaphorical embodiment and use is inevitable. Designers should be fully aware of this cultural impact. Cultural differences in the application and understanding of metaphors go beyond the simple shape and color of metaphorical icons and metaphorical embodiments in an interface. Deeper and more fundamental conflicts are rooted in culturally different cognitive, emotional, behavioral, and social processes and structures which constitute the network of relationships on which metaphors operate in any given culture (Duncker, 2002). For instance, in Chinese culture the calculator concept would instantly be associated with an abacus rather than an electronic calculator because the abacus has been used as a calculator for more than a thousand years. It is natural and widely acceptable there to use an abacus as a calculator icon. In a metaphorical library setting, an automatic check-out machine which people from a developed country are familiar with may be a mystery to people from a developing country.

Smoothly bridging the target and the source

Cognitive dissonance (Festinger, 1957) is created by a mismatch between the source and the target. In a metaphorical human-computer interaction context, cognitive dissonance occurs when users' expectations of the system conflict with their metaphorical beliefs. In that instance, users may lose trust in the metaphorical interface and become confused during interaction with the interface. Cognitive dissonance would definitely degrade system performance, if not cause a complete failure. Properly matching important attributes of the target with salient attributes of the source, minimizing unmatched attributes of the source, and maximizing the matched attributes between the source and target can avoid possible cognitive conflicts. This also facilitates the smooth transition from one referent to the other and therefore alleviates discomfort. For instance, a search feature can be represented by various metaphorical icons: a dog, a pair of binoculars, or a magnifying glass. A dog metaphor may mislead users because a dog as the source has too many possible attributes such as guiding, hunting, searching, racing, rescuing, etc. Users may associate a dog with any of these attributes, which can lead to cognitive dissonance if a wrong attribute is selected. In contrast, a magnifying glass or binoculars generates a better match since the salient attribute of a magnifying glass or binoculars fully maps to the search feature.

Emphasizing important attributes of the target

According to the influential salience imbalance theory (Ortony, 1979), metaphoricity arises from an imbalance in the salience of the common features such that high-salient attributes in the source domain are matched with low-salient features of the target domain. That is, important features or attributes of an information retrieval visualization environment should be mapped to eye-catching attributes of the source so that these attributes are easily perceived by users. The ThemeRiver visualization environment demonstrates a good match between a highly salient attribute of the source and a low-salience but important feature of the target.
The time line is an important attribute because all of the data is organized and presented against it. This important attribute must be emphasized in the metaphorical interface. Water flow in a river is the most salient attribute of the source and the best candidate for the metaphor. Furthermore, the two attributes (time and water) share common characteristics such as dynamics and continuity. It is a perfect match that connects the two attributes. Selecting a distant source domain Selection of a source domain affects not only the further selection of the source items but also the success of a metaphorical embodiment. A good metaphor should involve two different domains and thus have a high between-domain distance, while illustrating a low within-space distance between the source object and the target object in their respective, very distant spaces (Tourangeau and Sternberg, 1982). This suggests that people should choose a metaphor with a high dissimilarity between the target domain and the source domain but a high similarity between the target object/concept and the source object/concept. The high dissimilarity between the domains leaves imaginative room to derive new ideas in the metaphorical design and to create an unexpected "click" effect for users. But this dissimilarity between the two domains must be based upon a high similarity between the two items, which assures the accurate conveyance of structure and information between the source and the target and avoids mismatch. For instance, the source domains for an information retrieval visualization metaphor can be the solar system, a galaxy, a geographic landscape, or a water pipeline system, none of which has anything to do with information retrieval. But the relationships among the sun, asteroids, and stars; the connections among field, valley, mountain, and path; and the associations among water, pipeline, and control valve resemble the semantic relationships among documents/objects in a database. Considering the entire context of a selected source When items from the source domain are identified and used in a metaphor, these items should not be isolated from their contexts. The contexts may provide rich and useful information. The source's related objects, activities, connections with other objects, and environment should be considered for possible use. In this sense, a designer should concentrate not only on the selected items from the source domain but also on their content-rich contexts. The primary benefit of considering the entire context rather than individual items is that all elements of the source are naturally connected as a whole and their relationships are preserved. This achieves a better holistic effect for the metaphorical application. For instance, a fan as the source of a metaphorical application is often used to describe and visually represent a hierarchical structure (the target), because its structure looks like a tree structure and it is quite a familiar concept. But if only this structure is used in the metaphor and its important contexts are ignored, the metaphorical application is not a good one. Notice that a fan has two distinctive statuses: closed and open. If the two statuses are used in the metaphorical hierarchy configuration, they enhance flexibility and allow a vast array of data to be displayed effectively. When users are not interested in certain sub-hierarchies, these can be set in a closed
status, and therefore more display room is saved for the sub-hierarchies of interest. When users interact with a sub-hierarchy in the closed status, they can activate it, and the corresponding sub-hierarchy will be fully displayed in an open status. (A minimal sketch of such a collapsible fan hierarchy appears at the end of this section.) Another example is the book metaphor. When a book is identified as the source item, its associated concepts, contexts, and activities, such as pages, cover, spine, size, color, series, bookshelf, index, classification, lens, turning a page, ruffling pages, etc., may be utilized for further exploration in the book metaphorical embodiment. Information Landscape (Weippl, 2001) is another successful example in which both the source item and its environment are fully used in a metaphorical interface. The metaphorical islands, which represent categorized data, are elegantly integrated with tidal change to illustrate information. The tide is dynamic, and tidal change can lead to a change in sea level. A high sea level can cover relatively small and low islands and therefore make them invisible, while a low sea level makes more islands visible. This feature can be used as a filter to hide unnecessary data in the visual space. By controlling the sea level in the visual space, users can visualize different scenarios in the information landscape at will. Information Mural (Jerding and Stasko, 1995), where information is drawn like a picture on a mural, uses two windows in its visual space to illustrate overview and detail information, respectively, for long message sequences. The lower window contains the entire long message set. Users can use a rectangle to select a section of the messages in the lower window, and details of the selected messages are then displayed in the upper window for users to view. Expanding features of the source The appropriate identification of the items in a source domain is vital to the successful application of a metaphor in information retrieval visualization environments. Identification and use of the source items should not be limited to the selected domain and may go beyond the domain by expanding new features based upon it. Feature expansion makes a metaphorical application more powerful. Take the same fan metaphorical interface as an example: any point on the arc of a fan represents a child node in a hierarchy, and the vertex is the root of the hierarchy. If the metaphorical application is limited to this original form, it can only display a one-level tree structure. But the metaphor can be expanded to display a multi-level tree structure if a point on the arc of a fan can derive a new fan that represents a new sub-hierarchy. Following the same generation rule, new levels can be generated at will. The original single-level fan structure now becomes a multi-level fan structure that accommodates a more complicated hierarchical structure. Applying multiple metaphors An information retrieval visualization environment is complex, and multiple metaphors may be applied at different levels. For instance, they can be used in individual object icons, information retrieval controls, the semantic framework, and even the visualization model design. If multiple metaphors are used, it is important to make sure that these metaphors are compatible and do not conflict in their meanings. In the Visual Net system, two different metaphors are integrated in an OPAC system.
All top categories of a classification system are organized on a bookshelf that serves as the entry point of the OPAC system, and book icons on the bookshelf represent the different top categories. The size of each individual book indicates the number of books within the corresponding category. The second integrated metaphor is a semantic map in which areas represent sub-categories and all documents within a sub-category are located in the corresponding area. By clicking a book icon on the metaphorical bookshelf, users can smoothly enter another metaphorical map where the category's sub-categories are displayed. Understanding users The motivation for developing a metaphorical information visualization environment is to help users better understand, learn, and interact with the environment. Identification and selection of a metaphor should not distract system designers from the ultimate aim of serving users. Designers must understand users and listen to their feedback about the selected metaphors: what their expectations are for the metaphors, what their interactive behaviors are in the metaphorical contexts, and what preferences they have. These questions help designers make an appropriate decision on the identification and selection of the source and provide for a smooth match between the source and the target. This data can be collected from user surveys and interviews. A designer should anticipate that users may interpret the metaphorical interface design beyond the designers' own intentions and expectations. Select concepts or objects that are widely used and recognized by common users and whose definitions and scopes are well understood. A concept or object that requires another metaphor to explain its meaning should not be selected.
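To make the open/closed fan behavior and the multi-level expansion described above more concrete, the following Python fragment offers a minimal sketch. It is only an illustration: the class and method names (FanNode, toggle, visible_labels) are invented for this example and do not describe the implementation of any fan-based system mentioned in this chapter.

# Illustrative sketch only: a hypothetical, simplified data structure for the
# expanded fan metaphor described above.
class FanNode:
    """A node in a multi-level fan hierarchy with an open/closed status."""

    def __init__(self, label, children=None, is_open=False):
        self.label = label
        self.children = children or []   # each child is itself a FanNode (a sub-fan)
        self.is_open = is_open           # closed sub-fans save display room

    def toggle(self):
        """Activate a closed sub-fan (or collapse an open one)."""
        self.is_open = not self.is_open

    def visible_labels(self, depth=0):
        """Return the labels that would currently be drawn in the visual space."""
        lines = ["  " * depth + self.label]
        if self.is_open:                 # children are shown only when the fan is open
            for child in self.children:
                lines.extend(child.visible_labels(depth + 1))
        return lines


if __name__ == "__main__":
    root = FanNode("Science", [
        FanNode("Physics", [FanNode("Optics"), FanNode("Mechanics")]),
        FanNode("Biology", [FanNode("Genetics")]),
    ], is_open=True)

    print("\n".join(root.visible_labels()))   # the Biology sub-fan stays closed
    root.children[1].toggle()                 # user activates the Biology sub-fan
    print("\n".join(root.visible_labels()))

Closed sub-fans contribute nothing to the rendered output, which is precisely how display room is saved for the sub-hierarchies of interest.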
10.6 Summary A metaphor uses a familiar and well-known concept to represent and explain an unfamiliar and complicated concept, drawing on preexisting knowledge and experiences to make sense of the unknown. Since our ordinary conceptual system is fundamentally metaphorical in nature, it is no surprise that metaphors are regarded as effective cognitive devices that help people generate an appropriate mental model for comprehending complex, abstract, and spatial concepts. An information retrieval visualization environment is much more complex than a general user computer interface. It usually includes a semantic framework where semantic relationships are displayed and objects are positioned; diverse objects such as reference points, documents, and information retrieval models; and sophisticated operations such as browsing an area of interest, query searching, navigating the visual space, selecting and manipulating an information retrieval model, and customizing a visual configuration. When interacting with such a visualization environment, users rely upon a mental model to interpret the features of the environment. The mental or cognitive model created in a user's mind regarding the visual space significantly affects the way that they interact with the
visualization environment, because the mental model guides users' responses to it. Undeniably, metaphors can assist users in establishing such a critical mental model. A metaphorical information visualization environment simplifies the cognitive process of mental model generation and places system designers, system implementers, and system users on the same cognitive ground in their understanding of the visualization environment. Consequently, it can effectively reduce misunderstanding and miscommunication among these three groups of people. Metaphorical embodiment in an information retrieval visualization environment can be a double-edged sword: improper use of a metaphor leads to user confusion and frustration. Therefore, guidance is needed to design a good metaphorical visualization environment that is appropriate, intuitive, and robust.
Chapter 11 Benchmarks and Evaluation Criteria for Information Retrieval Visualization
Information retrieval visualization has a history of several decades. Theoretical visualization models, pilot visual retrieval systems, and commercial visualization retrieval software packages have burgeoned. It is understandable that in the early phase researchers and developers paid more attention to innovative visualization retrieval technique development and system implementation, and less attention to research on the evaluation of these systems and models: system evaluation usually lags behind system development and implementation. In the initial phase, the priority is model design and system development; without available models and systems, it is impossible to conduct system evaluation. However, as the techniques and theories of information retrieval visualization mature and the commercialization of information retrieval visualization systems proliferates, evaluation of these systems and models is becoming a pressing issue in the field. The situation calls for better metrics and benchmark repositories to evaluate and examine these tools.
11.1 Information retrieval visualization evaluation Unlike scientific visualization, information retrieval visualization, as a branch of information visualization, does not have a clearly defined inherent physical structure to visualize in a visual space. This leads to a diversity of information retrieval visualization models that are used to reveal and reflect abstract, invisible, semantic relationships among data in a dataset. For instance, in a vector-based information system the spatial-characteristic-based visualization models, the multiple-reference-point-based visualization models, the self-organizing map visualization models, the multidimensional scaling visualization models, etc. can be employed to describe and visualize the same dataset. Each model or environment demonstrates unique perspectives of the dataset. On the other hand, this diversity also increases the difficulty of evaluating these information retrieval visualization models/systems due to the lack of objective comparison standards. Furthermore, the richness of database types and the variety of described objects in a database, in conjunction with the complexity of information retrieval in the visualization environment, make the evaluation of information retrieval visualization
an intriguing and challenging task. There are many database types available for visualization, ranging from vector-based information models to Boolean-based models, hierarchical information organization models, hyperlink-based data models, and so on. Each possesses its own intrinsic data structures, characteristics, and data processing methods. Visualized objects in the same dataset may be quite different, let alone visualized objects in different datasets. As an information visualization environment changes, the ways of both information presentation and the corresponding retrieval change as well. All of these play a role in the evaluation of information retrieval visualization. Users search for information in an information retrieval visualization environment quite differently than in a traditional retrieval system. In a traditional search environment, users usually enter a text-based query, select other search restrictions, choose the presentation structure of search results, such as alphabetical ranking, chronological ranking, or relevance ranking, and make relevance judgments. In a retrieval visualization environment, users may have to visually convert and "spatialize" their information needs in a visual space; understand the framework of a visual presentation, its icons, and its metaphors; interpret the visual display of projected documents or objects; and manipulate the display and interact with it. The search process in a visual retrieval environment is more complex than in a traditional retrieval environment, and the evaluation of a retrieval visualization environment is accordingly more difficult than that of a traditional retrieval system. In fact, the evaluation of information retrieval visualization is twofold: retrieval result evaluation and retrieval environment/interface evaluation. Recall and precision are two primary criteria for retrieval result evaluation, while retrieval environment/interface evaluation has a different criterion system. Recall and precision are widely recognized as evaluation criteria for traditional information retrieval systems, but these criteria are no longer sufficient for information retrieval visualization. In one study, traditional information retrieval with visualization was compared with information retrieval without visualization against proposed criteria such as documents saved per search, interactive task precision, and interactive user precision. The authors found that these precision-based criteria failed to handle the complex visualization situations (Veerasamy and Belkin, 1996). Notice that both recall and precision are basically designed to evaluate retrieved individual documents in a traditional retrieval system. In a retrieval visualization environment users not only retrieve individual documents at the micro-level but also retrieve aggregate information at the macro-level, thanks to the visual configurations. Unfortunately, the latter cannot be measured by either recall or precision. Cugini (2005) addressed performance metrics for presenting search results in a visual space. He examined performance from the following perspectives: percentage of relevant documents found within a given time, relative error of response, relevance score of a selected document, time taken to find a relevant document, and time taken to answer a specific question (an illustrative sketch of such session-based measures appears at the end of this section). People are aware of the importance of information visualization evaluation and have made efforts to solve the problem. One of the pioneering studies in information visualization evaluation was done by Shneiderman (1996). The author
presented seven well-defined general criteria: gaining an overview of an entire database, zooming in on objects of interest, filtering out irrelevant objects, choosing a set of objects of interest to get details if necessary, viewing relationships among objects, keeping a history of users' previous activities, and extracting a subset of a collection. These evaluation criteria are intended to apply to all information visualization environments. After these evaluation criteria were adopted, they were used to evaluate four 3D information visualization designs (Wiss et al., 1998). In one study (Freitas et al., 2005), cognitive complexity, spatial organization, information coding, and state transition were identified as evaluation criteria for the visual representation of information visualization techniques. Orientation and help, navigation and querying, and dataset reduction were also proposed to examine the interaction mechanisms of information visualization. These criteria were applied to evaluate an information visualization application, Bifocal Browser, in their study. Others have analyzed information visualization from the goal and task point of view, and visualization technique evaluation principles were presented along this line (Winckler et al., 2004). The authors specified users' goals and verified whether users could reach these goals with an information visualization application; they then identified the interaction mechanisms that could accomplish the tasks and the graphic rendering functions that show the information related to these goals. These can be summarized as four task levels: goals, generic tasks, interactions, and visual presentation. From the data mining angle, one study came up with the following criteria for information visualization: scalability, expressing domain knowledge, dealing with incorrect data, ease of classification and categorization, high dimensionality, visualization flexibility, query functionality, and summary of results (Grinstein et al., 2005). Komlodi et al. (2004) conducted a survey to summarize information visualization evaluation experiments. After analyzing the natures and designs of fifty experimental studies of information visualization, the authors classified them into four thematic groups of information visualization evaluations: controlled experiments comparing design features of an application, usability studies of an information visualization application, controlled experiments comparing multiple tools, and case studies of an application. A methodology for testing a novel information retrieval visualization system was also introduced (Morse and Lewis, 2002). Instead of testing all static and dynamic features of an information retrieval visualization system, some non-significant features are disabled or "de-featured" and only basic features are studied. The benefits of this strategy include focusing on the visual display, reducing the influence of context variables, simplifying the experimental procedure, and allowing a larger number of subjects. In order to offer a common evaluation testing environment similar to TREC, researchers in the information visualization field have set up a sample dataset, aiming to initiate the development of evaluation benchmarks, to provide a common test environment available to the public, and to establish a forum to promote various evaluation methods. For each of the sub-datasets, the application domain was described and open-ended, domain-specific tasks were provided.
It was found that it was difficult to compare systems even with specific datasets and tasks
(Plaisant, 2004). A special journal issue about the empirical evaluation of information visualizations was organized to address the growing concern about information visualization evaluation (Chen, 2000). Existing research primarily centers on the evaluation of information visualization, which has a much larger scope than the evaluation of information retrieval visualization and a different emphasis. Although information retrieval visualization may be regarded as a special area of information visualization, it has its own unique features and distinctiveness, and these must be integrated and reflected in its evaluation system. The former concentrates more on visual information representation and information expression in a visual environment, while the latter concentrates on information retrieval in addition to visual information representation in a visual context. There are interactive activities in information visualization environments, but there may not be information retrieval activities. However, it is clear that information retrieval visualization has a natural connection to visual information representation: one cannot address information retrieval in a visualization environment without mentioning visual information representation. In fact, visual information representation is the foundation of information retrieval visualization, and its characteristics and structures have a strong impact on the characteristics and features of an information visualization environment. Due to these differences, the evaluation criteria for information retrieval visualization should be different from those of information visualization; the evaluation of visual information retrieval should combine both information retrieval and information visualization. Developing widely accepted and sound evaluation criteria for information retrieval visualization is therefore an important and urgent research topic. Such an evaluation system can contribute to both theoretical research on information retrieval visualization and practical system development, benefiting researchers, designers, system developers, and end-users alike. An evaluation system would guide and steer researchers, developers, and designers toward optimal information retrieval visualization solutions, models, and theories. It would help them discover potential features, identify potential weaknesses of a visualization tool, and avoid design loopholes. It could also be used by ordinary customers to select information retrieval visualization software among rival products. This would maximize efforts to improve information retrieval visualization and encourage its more widespread adoption. The evaluation criteria should be valid, universal, fair, and applicable to every kind of information retrieval visualization environment, offering a standard for comparative evaluation across information retrieval visualization tools and models.
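As a concrete illustration of the session-based performance measures listed earlier in this section, the following Python fragment sketches two of them: the percentage of relevant documents found within a given time and the time taken to find a relevant document. The log format, function names, and time units are assumptions made for this example; they are not the definitions used by Cugini (2005).

# Illustrative sketch only: hypothetical helpers for two session-based measures.
def percent_relevant_found(events, relevant_ids, time_limit):
    """Share of all relevant documents the user found within `time_limit` seconds.

    `events` is a list of (timestamp_seconds, doc_id) pairs recording each
    document the user selected during the session.
    """
    found = {doc for t, doc in events if t <= time_limit and doc in relevant_ids}
    return len(found) / len(relevant_ids) if relevant_ids else 0.0


def time_to_first_relevant(events, relevant_ids):
    """Seconds until the first relevant document was selected (None if never)."""
    times = [t for t, doc in sorted(events) if doc in relevant_ids]
    return times[0] if times else None


if __name__ == "__main__":
    session = [(12, "d3"), (40, "d7"), (95, "d1"), (130, "d9")]
    relevant = {"d1", "d7", "d9"}
    print(percent_relevant_found(session, relevant, time_limit=100))  # 2 of 3 found
    print(time_to_first_relevant(session, relevant))                  # 40 seconds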
11.2 Benchmarks and evaluation standards 11.2.1 Factors affecting evaluation standards It has been shown that developing a benchmark and evaluation system for information retrieval visualization is not a simple task because it is affected by various factors and variables. It is necessary to address the factors that actually play a role in information retrieval visualization evaluation; identifying them gives a better understanding of the evaluation benchmarks proposed later. The first factor is the information visualization task and data. Task and data should be discussed together because the task is usually intertwined with the data, and the nature of the data usually determines the task of a system. Information visualization tasks and data vary widely, which makes a unified evaluation methodology difficult to create. It is clear that no one general set of visualization tools will be suitable to address all problems (Grinstein et al., 2005). Information visualization systems are usually designed to target a specific problem and to support tasks closely associated with that problem. Therefore, they behave differently when they are used to visualize different datasets (Wiss et al., 1998). The true quality of an information visualization system can only be measured in the context of a particular purpose or task (Rushmeier et al., 1995). The second factor is the interactivity of the information search process in a visual environment. A search process is complicated and needs a series of interactions between users and an information retrieval visualization system. Users navigate an information space and discover and explore relevant information based upon their information needs. This may comprise a variety of interactive activities. Wehrend and Lewis (1990) categorized the potential operations users may conduct in visual environments. These include locating an item from a known entry, identifying a set of unknown items, distinguishing objects from a presented set, categorizing objects described by users, clustering linked and grouped objects, showing the distribution of specified categories, ranking objects of interest, comparing entities with different attributes, comparing relations within and between sets of objects of interest, associating displayed objects, and correlating shared attributes between objects. A visual search environment provides users with an intuitive, interactive, and convenient platform for information retrieval and enriches their search activities. However, it is this very interactivity that makes its evaluation more complex. The third factor is dynamic information seeking in the context of a visual environment. Unlike traditional information retrieval systems, information retrieval visualization makes internal objects and relationships among documents/objects transparent to users. A search process in a visual environment is, in fact, a complex decision-making process about information relevance judgment, and a process of information discovery in a dynamic and information-rich visual environment. This sophisticated process may involve users' learning ability, spatial orientation ability, perception ability, and cognitive aspects as well.
The fourth factor is the diversity of information retrieval visualization tools and models. This diversity reflects the dimensionality of a visual space, which can be two-dimensional, three-dimensional, or virtual reality; the semantic frameworks of information representations, which can be a subject directory, a neural network, a hierarchy structure, or a subject map; the projected objects in the visual space, which can be documents, Web pages, stack information, information flow, or traffic information in a server; the semantic relationships among objects, which can be visible hyperlinks, bibliographic citations, or invisible semantic similarities; and the ways of illustrating these relationships, such as the hyperbolic technique. Each of these variables can make a significant contribution to information retrieval visualization evaluation. It is clear that some of these factors for evaluating the effectiveness of information retrieval visualization are by nature subjective and task-oriented. Therefore, it is difficult to find and generalize their characteristics, and it is challenging to present an evaluation benchmark system and to define a measurable metrics system for them.
11.2.2 Principles for developing evaluation benchmarks The proposed evaluation benchmarks and criteria should be comprehensive and exhaustive. All of the effectiveness characteristics of information retrieval visualization should be included in such an evaluation system, ranging from visual information representation and the controllability of interactivity between users and visual information systems to information searching and information browsing. The proposed criteria should be applicable to all data types and tasks of information retrieval visualization models and systems. The criteria should also be measurable; in other words, each of the benchmarks and criteria can be managed in terms of measurement. In reality, however, due to the nature of information retrieval visualization, it is extremely difficult to come up with measurable criteria for each of the proposed benchmarks.
11.2.3 Four proposed categories for evaluation criteria Information retrieval can basically be classified into two categories based upon its search nature and purpose. The first is a search for detailed information about a known item; for instance, users search for the works of a known author, the full text of a given title, or a patent with a particular patent number. The other is a search for uncertain information within a defined topic of interest. In the latter case, searchers know the subject or topic they are looking for, but they do not know exactly which concrete items they are looking for. In reality, the majority of users' searches fall into the second category. Unfortunately, traditional information retrieval systems built upon a query search mechanism, such as Boolean-based information retrieval systems, are more suitable for the first category than the second. The beauty of information retrieval visualization lies in its capacity for information
browsing in a visual information space. Due to the 2D or 3D nature of its information space, users can engage in information discovery, data mining, and data harvesting by browsing in the visual space. This is totally different from a query search. The process of information browsing is, in fact, also a process of clarifying and defining users' needs. Information retrieval visualization really changes the way that people search for information. It is the interactivity, flexibility, and multi-dimensional nature of a visualization environment that makes visual information retrieval more competitive when dealing with the second category of searches. In other words, in a visualization environment, users equipped with a variety of interactive control mechanisms can effectively browse information, navigate a visual information space, find relevant information, and discover new information. Therefore, the proposed benchmark system should include not only evaluation of query search, which is an indispensable perspective, but also evaluation from an information browsing perspective. One of the most prominent characteristics of information retrieval visualization is its visual space. Within a visual space, a semantic framework is presented, visual data/objects are projected onto the framework, logical and semantic relationships in the context of the framework are illustrated, and various interactions are carried out. Visual data, the framework, and the way that visual data is presented within the framework are defined as visual information representation. Visual information representation is fundamental and essential to information retrieval visualization. To a large degree, it determines the foundation, features, functionality, and characteristics of information retrieval visualization; that is, whether visual information representation is successful decides the success of information retrieval visualization. For this reason the proposed evaluation system should include it. As we know, a visual information environment offers an ideal and intuitive interface for end-users to interact with. The environment is a window that enables users to communicate with systems, a place where users exchange information with visual information systems. Through various interactions between users and systems, users may browse information, submit queries, navigate the visual space, perform information cluster analysis, customize a local information space based upon their interests, drill down to the details of an object of interest, and so on. Information retrieval visualization must provide users with control mechanisms to manipulate information, to participate in decision making, and to complete their tasks. Controllability of information retrieval visualization interaction should therefore be considered in the evaluation metrics. Information browsing, querying, visual information representation, and controllability for interaction are the four primary categories within the proposed evaluation system. Querying and information browsing reflect the evaluation requirements of information retrieval. Visual information representation addresses the way that visual data is organized and presented, and it provides the platform on which users control and retrieve information. Controllability for interaction emphasizes interactions between an information retrieval visualization system and its users. The four categories are integrated as a whole and are dependent upon each other in the visual space.
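One hypothetical way to operationalize the four proposed categories is to arrange them as a simple scoring rubric, as in the sketch below. The category names come from the text; the few criteria listed per category are placeholders (the full benchmarks are described in the next subsection), and the 0-5 scale, equal weighting, and function names are assumptions made purely for illustration.

# Illustrative sketch only: a hypothetical scoring rubric built around the
# four proposed evaluation categories. Criteria, scale, and weighting are
# placeholders, not prescriptions from the text.
RUBRIC = {
    "Information browsing": ["guidance", "exploration"],
    "Querying": ["simple search", "query reformulation"],
    "Visual information representation": ["semantic framework", "intuitiveness"],
    "Controllability": ["zooming", "filtering"],
}


def score_system(ratings):
    """Average the 0-5 ratings per category and overall for one evaluated system."""
    per_category = {}
    for category, criteria in RUBRIC.items():
        values = [ratings[c] for c in criteria if c in ratings]
        per_category[category] = sum(values) / len(values) if values else None
    scored = [v for v in per_category.values() if v is not None]
    overall = sum(scored) / len(scored) if scored else None
    return per_category, overall


if __name__ == "__main__":
    example_ratings = {"guidance": 4, "exploration": 3, "simple search": 5,
                       "query reformulation": 2, "semantic framework": 4,
                       "intuitiveness": 3, "zooming": 4, "filtering": 3}
    print(score_system(example_ratings))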
11.2.4 Descriptions of proposed benchmarks It is evident that the four categories are too broad by themselves to measure and examine information retrieval visualization, but they present a structural framework that can guide people in developing more detailed benchmarks within each of the four categories. Information browsing The first criterion within this category is guidance. Users navigate in a two-dimensional or three-dimensional visual space to search for relevant information. Due to the multi-dimensional nature of a visualization environment, users need a guidance mechanism to orient themselves in the visualization environment during navigation, similar to a compass for a tourist traveling in unfamiliar territory. This guidance should not only orient users in a visual space but also lead them to appropriate and desired locations. Some information retrieval visualization systems integrate a subject hierarchy structure to facilitate browsing and locating information (STRETCH, 2005; SPACETREE, 2005; Beaudoin et al., 2005). Displaying information about the area surrounding a focus area helps users to make a decision about the next browsing step. At each navigation point, providing users with available and appropriate information discovery means and disabling inappropriate features would decrease possible disorientation. Finally, a well-designed and user-friendly help file that includes explanations of all features and functions is useful for guidance. The second criterion is exploration. Information retrieval visualization should enable users to get an overview of the entire information space, which is usually set as the starting point of navigation. More importantly, a local information space should be generated and presented to users upon request. Detailed information about an object should be provided if that object is selected during browsing, and the degree of detail of a browsed area should be controllable by users. The local information space should also be easily and smoothly shifted back to the entire information space, and jumping from the overview of the visual information presentation to a local view should be allowed. An overview of the entire area, a local view of the information space, control over the degree of detail of a browsed area, and details of a requested object are the basic elements of information exploration. The third criterion is the dimensionality of a displayed object in the visual space, which refers to the degree to which a displayed object offers information about itself in both depth and width. A displayed object in a visual space is usually an icon that tells users something about the object it represents. The design of an icon should be concise, meaningful, and intuitive. Color, size, and shape, or their combinations, are employed to represent multiple meanings of the represented data. For instance, the size of an icon can represent the relevance degree, the shape of an icon can represent the type of an object, and the color of an icon can represent the status of an object. In Visual Net (Belmont Abbey College North Carolina, 2005), for instance, a holding
item icon consists of a circle, several concentric rings, and associated arrows. A red center circle indicates that the holding is printed material, while a blue center circle means that the holding is electronic material. The thickness of the white concentric ring indicates how large the holding is, and the thickness of the green concentric ring shows how new the holding is. Arrows on the outer ring indicate additional properties, such as whether the holding is in a foreign language or is a reference item. The fourth criterion is the connections or relationships of a displayed object to others in the visual space. When an object is displayed in a visual space, it is not isolated or disconnected; when it is presented, its relationships with other objects are also illustrated. What relationships are illustrated, and how they are illustrated, need to be evaluated. In some systems the connections are visible, in others they are invisible, and some connections become visible only after users request them. Sometimes the relationships are shown by links, as in hyperbolic-technique-based visualization systems (Visual Thesaurus, 2005; Inxight, 2005), by adjacency (Map of the Market, 2005), or by distances and directions, as in TOFIR and DARE. VIBE can use the length of the connecting line between two related objects to represent their connection degree (Olsen et al., 1993). In a subject tree structure, sibling relationships and parent-child relationships may be shown. Querying A query search feature is indispensable for information retrieval visualization; this feature distinguishes it from other information visualization models/systems. The method of querying in information retrieval visualization varies across visual environments and is primarily affected by the visual information representation and the nature of the visualized data. The first criterion is a simple query search. Information retrieval visualization should accept a search query formed by search terms. Unlike a traditional information retrieval system, it maps the matched results onto its visual environment and visually illustrates them for users by highlighting them. In the visual context, users can observe the results, their distribution, and the relationships between the query and the retrieved objects. The matched objects are colored differently in the visual space so that users can easily distinguish them from unmatched objects. Basically, information retrieval visualization in this case does not visualize the internal matching processing; it only visualizes the matched results. In some systems (Visual Thesaurus, 2005; Inxight, 2005), for instance, search query windows are offered, search results are colored differently from unmatched objects, and visual presentations are adjusted and regenerated so that search results are emphasized based on new user needs in the visual space. In VIBE the relevance degree between a query and a result object is presented using different colors. This query search mechanism should be implemented at two levels: global and local. In the former, querying is carried out within the entire database (a global search); in the latter, querying is restricted to a specified local area (a local search). The latter is useful when users navigate into a specific local space, such as a sub-branch
of a subject tree structure or a browsed sub-map area, and they may want to search only within that local area. The second criterion is information retrieval model visualization. In a broad sense, information retrieval is not simple keyword matching; it includes the use of powerful information retrieval models wherein users may control and manipulate the retrieved results. There are various information retrieval models, such as the Boolean retrieval model, cosine model, conjunction model, disjunction model, distance model, ellipse model, and so on. Visualizing these information retrieval models in a visual environment is more challenging than just visualizing the results of a search query, because visualizing an information retrieval model is, in fact, visualizing internal information retrieval processing. Users can manipulate both the information retrieval process and the information retrieval results. This makes both information representation and information retrieval processing transparent to end-users. As we know, an ellipse model can determine a hyper-ellipse contour in a high-dimensional vector space which cannot be observed by users; the contour is invisible in a high-dimensional space. The location of the hyper-ellipse contour is determined by two user information interest points. (A user interest point is a broader concept than a user query; it can include a user's background, reading habits, previous queries, and so on, and a query is represented by multiple interest points in an information space.) The objects within the contour are regarded as retrieved objects. Users can control the size of the contour to change the number of retrieved objects, or they can change the position of the contour if their interests change. In DARE, the ellipse information retrieval model can be visualized as follows. The ellipse contour in the high-dimensional space is mapped onto a low, two-dimensional space which can be observed by people. After it is converted to the low-dimensional space, it no longer preserves its elliptical shape; instead, it becomes a wave-like curve in the visual space. After this conversion, the invisible hyper-ellipse contour of the high-dimensional space becomes visible. The most important and exciting aspect of this conversion is that users can control and manipulate the concrete, visible contour to control information retrieval in the low-dimensional visual space at will. (An illustrative sketch of the ellipse retrieval idea appears after the discussion of semantic attribute revelation below.) Another example is Filter/Flow (Young and Shneiderman, 1993). In Filter/Flow, documents in a database are defined as water flow, and Boolean logic operators such as logical OR and logical AND are defined as valves to control the water flow (documents). Users can add valves to the water control system to include relevant documents and exclude irrelevant documents, and they can observe the flow change in the visual space. The difference between the previous scenario and this one is that the former only visualizes the final results of a search, while the latter visualizes both the internal search processing and the final search results. The third criterion of the query search feature is query reformulation. As we know, the information search process is a dynamic one. It may be affected by the degree of information need understanding, familiarity with the information retrieval system, and the searchers' background and experiences. For these reasons, a multiple-step search is needed to adjust the search strategy and make the search more accurate. In other words, users need to
reformulate their queries based upon initial search results. It is necessary that information retrieval visualization provide users with a feedback mechanism to adjust search queries. Some systems, such as DARE and GUIDO, allow users to pick any documents or combinations of documents in the visual space to replace the current query, add them to the current query, or revise it. DARE allows users to shift the roles of the involved reference points to change the retrieval emphasis. In most environments based on multiple user interest points, the content of a user interest point can be dynamically changed or redefined based upon one's needs. Visual information representation Visual information representation is essential for information retrieval visualization; it is its foundation. Within this category there are seven criteria concerning information retrieval visualization. The first criterion is the dimensionality of the visual space. A visual information space can be two-dimensional, three-dimensional, or virtual reality. Users definitely behave differently in a 2D environment versus a 3D environment (Sebrechts et al., 1999). It is believed that a three-dimensional visual space offers an extra dimension to represent more information. However, adding an extra dimension to a two-dimensional space is not as easy as "2 + 1 = 3"; the impact of the added third dimension on information retrieval visualization may be larger than people imagine. Because of the additional dimension, the presented information may be richer, the illustrated semantic relationships among objects may be more complicated and sophisticated, the visual information representation may be more intuitive and natural, and the presented information may be more informative. On the other hand, adding a dimension to the visual space increases both the technical difficulty of implementing such systems and the operational complexity for users. The second criterion of visual information representation is semantic attribute revelation, which defines the visual space to some degree. It is evident that an object can have multiple attributes, and these attributes define the characteristics of the object. In a visual environment, not all attributes of an object are identified and utilized to represent that object. Useful, meaningful, salient, and necessary attributes are selected and preserved, while others may be sacrificed and excluded in the construction of a visual environment. The identification and revelation of object attributes has a significant and direct impact on visual information representation. Selected attributes may be assigned to the X-axis, Y-axis, or Z-axis of a visual space, respectively, or expressed in other ways. These selected attributes lay the foundation for the visual frameworks. For instance, both the distance and direction attributes of an object in DARE, the direction attributes in TOFIR, the distance attributes in GUIDO, the hierarchy attributes in CHEOPS, the similarity ratio in VIBE, and the time and subject attributes in GRIDL (GRaphical Interface for Digital Libraries, 2005) are identified and represented in their visual spaces (a minimal sketch of such distance-based attribute revelation, combined with the ellipse retrieval model mentioned earlier, follows). In the two-dimensional GRIDL space, the attributes in the visual space can even be redefined and replaced from a pool of attributes such as classification, publishing year, author, title, physical location, and conference place.
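The distance-based attribute revelation just described, together with the ellipse retrieval model discussed under the querying criteria, can be illustrated with a small sketch. The fragment below assumes cosine-based distances over toy term vectors, projects each document to 2D coordinates given by its distances to two user interest points (in the spirit of a GUIDO-like distance-distance display), and marks as retrieved those documents whose summed distances fall within a threshold, which yields an ellipse-like contour with the two interest points as foci. All names, the distance measure, and the threshold are assumptions for this example; this is not the actual algorithm of DARE, GUIDO, or any other system described in this book.

# Hedged illustration only (not the DARE/GUIDO algorithms): distance-based
# attribute revelation and an ellipse-style retrieval contour over toy vectors.
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / (nu * nv) if nu and nv else 0.0)

def project(documents, ref1, ref2):
    """Map each document to 2D coordinates: (distance to ref1, distance to ref2)."""
    return {doc_id: (cosine_distance(vec, ref1), cosine_distance(vec, ref2))
            for doc_id, vec in documents.items()}

def ellipse_retrieve(coords, threshold):
    """Retrieve documents whose summed distances to the two interest points
    fall within the threshold -- an ellipse-like contour with the interest
    points as foci; users could widen or narrow it by changing the threshold."""
    return {doc_id for doc_id, (d1, d2) in coords.items() if d1 + d2 <= threshold}

if __name__ == "__main__":
    docs = {"d1": [3, 0, 1], "d2": [0, 2, 2], "d3": [1, 1, 0], "d4": [0, 0, 4]}
    interest_a, interest_b = [2, 1, 0], [0, 1, 3]   # two user interest points
    coords = project(docs, interest_a, interest_b)
    # Note: distinct vectors could share a spot here -- the projection
    # ambiguity discussed later in this chapter.
    print(coords)
    print(ellipse_retrieve(coords, 1.2))   # a tighter threshold retrieves fewer docs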
The third criterion is the semantic framework of the visual space. A semantic framework is usually associated with the attributes revealed from objects. The semantic framework onto which objects are projected defines the structure of a visual space. Semantic frameworks range from a grid, hierarchy, map, or network to a circle, triangle, rectangle, and so on. A semantic framework should be meaningful in terms of information retrieval, concise in terms of structure, and aesthetic in terms of visual presentation, for example through symmetry. The fourth criterion is the intuitiveness of visual information representation. Intuitiveness includes both easy expression of information and easy understanding of the visual information. The visual information presentation should be expressed in a straightforward manner so that users can easily adapt to the environment. Unfortunately, in reality, due to the complexity of a database, when certain attribute characteristics of the data must be preserved and presented and high dimensionality must be reduced, it is difficult to find a simple and intuitive form for the visual information representation. Without a doubt, users prefer an intuitive visual information representation and are more comfortable and willing to interact with an intuitive interface. Researchers and designers of information visualization have therefore been searching for appropriate and applicable metaphors which may be embedded into the semantic frameworks of visual information representation. Familiar concepts, objects, or environments from the real world help users understand the visual information representation, decrease the learning time for the systems, reduce users' anxiety, and therefore increase effectiveness and efficiency. It is easy to find systems that employ metaphors, for instance a water fluid metaphor (Filter/Flow), a solar system metaphor (WebStar (Zhang and Nguyen, 2005)), a geographic map metaphor (WebMap, 2003), Fisheye (Fisheye menu, 2005), and so on. The fifth criterion of visual information representation is clustering and categorizing. As we know, a displayed object in a visual environment is not isolated or semantically independent of other displayed objects; the displayed objects are semantically connected and associated in some sense. Object location in the context of a semantic framework carries meaning. Objects projected into a close neighborhood are likely to share similar characteristics, because the same projection algorithm placed them near the same spot. This phenomenon can help users perform clustering and categorizing analysis in visual environments. Such analysis can answer questions like: How many objects are grouped in a cluster? What are the relationships among different object clusters? Basically, the attributes identified and employed from an object to construct a visual space decide the nature of the clustering and categorizing. Clustering and categorizing can be used to support search feedback, perform object similarity analysis, understand the overall distribution of documents in a database, and serve other purposes. For instance, all objects are clustered as a group in DARE if they share similar distances and angles with respect to the defined interest points, and semantically relevant objects are clustered and related subjects are adjacent in a semantic map. The sixth criterion is visual information representation customization. The information space for a database should illustrate all data perspectives. However, users'
interests usually concentrate on a limited number of topics/subjects compared to the entire coverage of a database, and during a search users may change their topics/subjects. This suggests that information retrieval visualization should support both an overview of the entire information space and a customized local view of interest. Upon request from users, it should offer a detailed and customized local view. It is clear that a local view based upon users' interests is dynamic: it varies across users and even across different steps of a search by the same user. This visual information representation customization is different from a simple zoom in/out feature in an interface. Views generated by a series of zoom in/out operations preserve hierarchical relationships, while visual information representation customization does not necessarily follow the same principle. In situations where a high-dimensional information space is converted to a low-dimensional visual space, visual information representation is more complicated: the same local area in a high-dimensional space may correspond to multiple visual presentations which emphasize different perspectives of that local area. The last criterion within this category is the disambiguation mechanism. Ambiguity is a unique phenomenon of information visualization that happens when a high-dimensional information space is converted to a low-dimensional visual space. Ambiguity refers to the situation in which objects that are located in different places in a high-dimensional information space are projected onto the same spot in the low-dimensional visual space. Projection ambiguity can clearly mislead users, because objects located in different places in a high-dimensional information space should be projected onto different spots in a low-dimensional visual space. Notice that mathematical projection ambiguity is inevitable when a space with a high dimensionality is reduced to a space with a low dimensionality. When data is processed and projected onto a visual space in a certain way, the data must be customized: some attributes are preserved, some attributes are eliminated, and some attributes are altered or "distorted" after projection. The point is that when this happens, information retrieval visualization should provide a disambiguation mechanism to solve the problem. For example, in DARE, TOFIR, GUIDO, and VIBE, which are built on a vector-based document space, a spot in the visual space can correspond to multiple documents which may be far away from each other in the vector space. Revising user interest point(s), repositioning the affected user interest points, or adding/discarding interest points in the visual space can effectively disambiguate these phenomena in these systems. Controllability The first criterion of controllability is the ability to zoom in/out. The original zooming concept refers to the metaphorical operations of a camera that can scan across a scene, move in for a closer observation, or back away to get a wider view. The concept is incorporated into information visualization to allow for exploration of information at both a specific level and a general level. Toward this aim, all of the displayed data should be organized and categorized in terms of degree of detail. Users should be able to zoom in/out on areas or objects of interest at will. Narrower, more detailed, and more specific information becomes available as users zoom in; broader and more general information becomes available as users zoom out.
When zooming, it is important to keep the zooming path and global context visible. This helps users avoid disorientation and improves the controllability of the zoom operation. The way to zoom in/out and the level of zoom detail should also be considered. The second criterion of controllability is activity history preservation and display. Unlike a traditional information retrieval system, where interactions with the system are relatively simple, interactions with information retrieval visualization are richer, more diverse, and more complicated. They range from query search, navigation, browsing, and disambiguation to visual representation customization. All interaction activities conducted with information retrieval visualization should be preserved in some way, such as in reverse chronological order. Upon request, previous activities should be traceable to allow for replay or a revisit. This is necessary because the information exploration process in a visual space is sometimes a process of trial and error: users reach a correct decision or satisfactory results by trying out various means or features until mistakes are sufficiently reduced or minimized. Activity history preservation and display would reduce the users' burden of recalling all past activities. The third criterion is filtering. When users navigate into a local area of interest, a visual information representation is customized, search results are displayed in an area, or an object cluster is observed in a visual environment, users may be interested in the visual contexts and only some of the objects in those contexts. In other words, inappropriate or unwanted "noise" should be filtered out while the contexts are kept. For example, certain types of objects, objects within certain time periods, or objects with certain attributes may be filtered from the context. In fact, filtering is a process of data refinement. The fourth criterion is selection. Selection includes selecting an object of interest and an area of interest in a visual space. Selection is important for users navigating the visual space: it enables them to examine a focus object or area, investigate its content, distinguish possibly overlapping objects, and make relevance decisions about the involved objects. After objects are selected, detailed information about them should be displayed and associated operations should be provided. All detailed evaluation criteria or benchmarks discussed above are summarized as follows:
Information browsing: guidance; exploration; dimensionality of a displayed object; connections of a displayed object to others.
Querying: simple search and visual display of search results; information retrieval model visualization; query reformulation.
Visual information representation: dimensionality of a visual space; semantic attribute revelation; semantic framework; intuitiveness of visual information representation; clustering/categorizing; visual information representation customization; disambiguation mechanism.
Controllability: zooming in/out; activity history preservation and display; filtering; selection.
Regarding retrieval result evaluation, recall, precision, and other criteria used for individual result evaluation at the micro-level can still be used to evaluate retrieval results in information retrieval visualization. Without a doubt, browsing in a visualization environment makes a positive contribution to retrieving relevant individual objects. On the other hand, browsing requires not only effort but also time. Both the retrieved results and the time factor should be considered when measuring browsing. Therefore, the ratio of the number of retrieved relevant objects/documents to the time spent browsing in a visual space can be used to measure the quality of browsing.
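As a rough illustration of this browsing measure, the ratio can be computed directly. The small sketch below assumes a count of relevant items and a browsing time in minutes; the function name and the time unit are illustrative choices, not part of any benchmark defined in this chapter.

    def browsing_quality(relevant_found, minutes_browsing):
        # Ratio of relevant objects/documents located while browsing to the
        # time spent browsing; a larger value indicates more productive browsing.
        if minutes_browsing <= 0:
            raise ValueError("browsing time must be positive")
        return relevant_found / minutes_browsing

    # e.g. 12 relevant documents found during 8 minutes of browsing -> 1.5 per minute
    print(browsing_quality(12, 8))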
11.3 Summary
Attention to information visualization has increased significantly. More and more information retrieval visualization research models, pilot systems, and commercial applications are available. However, there are still only a limited number of studies on information visualization evaluation, let alone information retrieval visualization evaluation. There are no widely accepted, reliable, accurate, and effective evaluation benchmarks, evaluation criteria, or metric systems available to test the effectiveness and efficiency of information retrieval visualization. What metrics and benchmarks are suitable for information retrieval visualization? From which perspectives can the quality of a visual information retrieval system be measured? How can similar visual information retrieval tools/models be compared? Researchers, developers, designers, and users of information retrieval visualization want answers to these questions. A benchmark system and evaluation criteria for information retrieval visualization are presented in this chapter. Affecting factors from both information retrieval and information visualization are considered in the system. This evaluation standard consists of four categories: information browsing, querying, visual information representation, and controllability for interactivity. Each of the four categories emphasizes a different perspective of information retrieval visualization. Information browsing and querying reflect the fundamental nature and characteristics of information retrieval; visual information representation considers the characteristics of the visual space, the essential part of information retrieval visualization; and finally, controllability for interactivity addresses the indispensable interaction between users and an information retrieval visualization environment. It is clear that information browsing and querying are more associated with tasks, visual information representation is more related to data, and controllability for interactivity is more connected to users. These four categories depend upon and affect each other. When people conduct an experimental study to examine and test an information retrieval visualization system, they should be aware of the prototype effect problem. Since novel information retrieval visualization models are usually first introduced in the form of a proof-of-concept prototype, examining and testing such a system raises concerns: its interface may not be user-friendly, the system may not be robust because of undetected glitches or bugs, and features may be immature or incomplete due to the nature of a prototype. This may have an unexpected impact on experimental study results. People should be especially cautious when an information retrieval visualization prototype system is compared with a commercial information retrieval visualization system, or with a commercial traditional information retrieval system.
Chapter 12 Afterthoughts
12.1 Introduction
In this final chapter we recap the main ideas addressed in the book; compare the five major information retrieval visualization models: the multiple reference points based models (MRPBM), the Euclidean spatial characteristic based models (ESCBM), the Pathfinder associative network (PFNET), the multidimensional scaling models (MDS), and the self-organizing map model (SOM); and finally discuss several important issues and challenges in information retrieval visualization. Information retrieval has two distinct fronts in terms of search: query searching and browsing. These two fronts have their own strengths and weaknesses in information seeking; they are complementary rather than mutually exclusive, and both should be available for users to explore information in an information retrieval system. However, a traditional information retrieval system favors query searching and focuses on finding individual objects/documents. Browsing, an important means of information retrieval, is not fully utilized, because the inherent weaknesses of the intrinsic structures and of the ways of information organization, information presentation, and information seeking prevent users from making full use of the browsing capacity in such an environment. Although information organization and presentation methods such as the popular hyperlink techniques and directory structures are widely applied to support browsing, the browsing potential is far from fully exploited. The significance of browsing resides not only in distinguishing individual objects/documents and revealing their contents at the micro-level but also in exploiting aggregate information and discovering trends and patterns of a data collection at the macro-level. As a result, it turns information retrieval into a process of data mining, information exploration, and knowledge discovery. Effective exploration of aggregate information depends heavily upon an effective browsing environment that can illustrate rich aggregate information and facilitate various browsing activities. Information retrieval visualization offers an ideal environment for both browsing and query searching, and opens a new chapter for information retrieval. Information retrieval visualization, as a branch of information visualization, does not have physical structures to inherit from a data collection. On the one hand this
characteristic increases the complexity of constructing an information retrieval visualization model, because abstract structures must be generated from a data collection. On the other hand, the characteristic implies the diversity and variety of information retrieval visualization models. Information retrieval visualization spatializes the abstract and invisible semantic relationships of data in a collection and allows people to observe rich and diverse aggregate information in multiple ways. Consequently, information retrieval visualization presents a spatial platform where users carry out information retrieval freely and make full use of the powerful browsing capacity. In addition, it can visualize the internal processing of a traditional information retrieval model, such as the cosine retrieval model, provide open territory for developing new information retrieval mechanisms, support information analysis in a more effective visual way, and utilize people's perceptual abilities to minimize cognitive workloads in information seeking. A theoretical model plays a fundamental role in information retrieval visualization. It determines the information organization method and the structures of objects in a database, the objects that will be visualized in a visualization environment, the salient attributes of the objects that will be used for projection, a coordinate system where a visual semantic framework will be established, a semantic framework onto which all objects will be projected, a projection algorithm that ultimately defines the locations of objects in the visual space, and a visual interactive retrieval means that accounts for the way of information seeking in the visual space. In other words, an information retrieval visualization theoretical model underlies the structure, functionality, and features of an information retrieval visualization environment. Although the hyperlink technique is a primary organization approach for Internet information, the application of visualization techniques is not limited to hyperlink structures. It is also applied to search engine results, subject directories, Web traffic information, Web log analysis, and so on. In fact, all five of the discussed information visualization models can be used for Internet information visualization. Without a doubt, an elegant algorithm is an extremely important and promising idea underlying information retrieval visualization. But messages and ideas from a visualization model must be conveyed to and comprehended by end users. The interface of an information retrieval visualization environment bridges end users and a visualization model. The essence of interface design for information retrieval visualization relies upon effective and efficient communication between users and the visualization environment. In this spirit, it is not surprising that both ambiguity and metaphor in information retrieval visualization can be regarded as special communication phenomena during human-system interaction. In order to exchange information effectively and efficiently, users must understand and handle the inherent ambiguity phenomena that may mislead them during information exploration. Metaphorical visual presentations help users, system developers, and model designers understand a complex information retrieval visualization model not only in a more intuitive, natural, and appealing way, but also on the same cognitive ground. For this reason, metaphor reduces communication barriers among the three groups of people by decreasing their cognitive workloads. As a result, the communication between users and the interface is enhanced. Evaluation for information retrieval visualization is more elusive than evaluation for traditional information retrieval. Due to the spatiality of an information space, the interactivity of information seeking, the diversity of information retrieval visualization models and approaches, the complexity of analytical decision making, the nature of information exploration in a visual space, and the dynamics of relevance judgment in an information retrieval visualization environment, evaluation for information retrieval visualization becomes more complicated and challenging. The evaluation complexity is reflected in both the retrieval environment and the retrieval results.
12.2 Comparisons of the introduced visualization models
The introduced information retrieval visualization models have been compared and analyzed at different levels, from input data formats, to coordinate systems, to visual semantic frameworks, to projection approaches, to characteristics of visual spaces, to information retrieval features, and to ambiguity. The comparisons help users understand these visualization models from quite different perspectives. The input data structures for MRPBM and ESCBM are document-attribute matrices, while the input data structures for PFNET, SOM, and MDS are object-object proximity matrices. An input data structure is related to the generation of a visual configuration in a visual space. The user-need-based projection reference systems in MRPBM and ESCBM mean that dynamic visual configurations in the visual spaces are supported: the similarities between documents in a data collection and the compared objects in a reference system are updated accordingly as the reference system changes. A customized visual configuration, reflecting a focus area defined by a reference system based on users' needs, can be produced at will. The document-attribute vector structure supports such dynamic changes of a visual configuration; in other words, it fits this dynamic similarity calculation perfectly. In contrast, PFNET, SOM, and MDS produce a relatively stable overview configuration in their visual spaces. The visual configuration, which is usually not dynamic, is generated based upon similarities among all involved objects in a data collection rather than upon a clearly defined reference system. The object-object proximity matrix is suitable for this similarity calculation requirement. A visual space has to be established upon a properly defined coordinate system. In ESCBM, object attributes such as angle and/or distance are assigned directly to the axes of the coordinate systems, whose axis type is quantitative. In MRPBM the axis type is also quantitative; however, similarities of objects are associated with reference point positions rather than directly with the axes of the coordinate system, in order to maintain flexible manipulation of reference points. The axis type of the SOM coordinate system is nominal because of the grid layout, and object attributes are not directly assigned to the axes. The axis type of the MDS coordinate system is quantitative and object attributes are not assigned to the axes directly. A visual configuration of PFNET is a network, and relationships among objects are kept in the network; this requires a graphic layout algorithm to handle the network drawing and avoid unnecessary edge crossings. The axis type of this coordinate system is likewise quantitative, and object attributes are not assigned to the axes directly. A semantic framework defines a valid area onto which objects are projected in a visual space. Semantic frameworks of ESCBM have stable geometric shapes such as a triangle, a half infinite upright plank, and a half infinite sloping plank. Neither MRPBM nor MDS has a fixed semantic framework shape in its visual space. The semantic frameworks for PFNET and SOM are a network and a grid, respectively. The projection method of an information visualization model determines the position of an object in the visual space. When objects in a high dimensional space are projected onto a low dimensional visual space, the projection needs a reference system against which objects are mapped onto the visual space. A reference system varies across information retrieval visualization environments. It may consist of two reference points, like those in ESCBM, or multiple reference points, like those in MRPBM. It can be explicit and stable during the construction of a visual configuration, as in ESCBM and MRPBM; that is, after a reference system is defined by users, it no longer changes during projection processing. It can also be implicit and dynamic during the construction of a visual configuration, as in PFNET, SOM, and MDS. This means that objects are projected against an adjustable reference system during projection, for instance the weight vectors associated with the output grid in SOM, the adjustable objects in the visual space in non-metric MDS, and the network edge weights in PFNET. It is interesting that projection algorithms with a dynamic and implicit reference system during the construction of a visual configuration usually correspond to iterative projection processing, while projection algorithms with a stable and explicit reference system do not. Because of the iterative nature of the processing, the position of a projected object in these visualization models may not be unique. Many factors affect the termination conditions of the iterative processing; some of them, such as the stress threshold in MDS, the parameter q and the Minkowski metric r in PFNET, and the neighborhood function in SOM, can be manipulated or controlled by users. Projection algorithms with a non-iterative processing nature, like that of ESCBM, usually produce a unique position for a projected object in a visual space. It is worth pointing out that in MRPBM the projection algorithm is not iterative and the position of a projected object in the visual space should be unique. We also claimed in the previous chapter that projected objects in the visual space are movable. These two claims sound contradictory when in fact they are not: the position of a projected object in the visual space is unique once the positions of all involved reference points are fixed, and a position change of a projected object caused by position changes of reference points can be accurately calculated. This is totally different from the uncertainty of a projected object in a visual space caused by iterative processing. Notice that
both ESCBM and MRPBM support a dynamic and customized reference system defined by users. The reference system is defined by users at will; however, once it is defined and used in projection processing, its contents stay stable and explicit. PFNET, SOM, and MDS do not support a dynamic, user-defined, customized reference system. The projection algorithms with an iterative processing nature and dynamic reference systems during the construction of a visual configuration usually produce a visual configuration that is a global overview of a data collection, whereas the projection algorithms with a non-iterative processing nature and stable reference systems generate a visual configuration that may be a local view. The generated local view focuses on the objects and surrounding areas defined by the reference system. This suggests that the projection algorithms with a non-iterative processing nature and stable reference systems can customize a visual configuration based upon a user's preference. In this manner, these visual configurations achieve much greater flexibility. However, as a result, the customized visual configurations may demand more system resources to respond to real-time customization requests from users. In a broad sense, a reference system can be regarded as a special form of a user query. If that is the case, both ESCBM and MRPBM are classified into the query searching and browsing (QB) information retrieval visualization paradigm, because a reference system as a query narrows a search down to a sub-space and then visualizes the results for browsing. PFNET, SOM, and MDS can be classified into the browsing and query searching (BQ) paradigm or the browsing only (BO) paradigm, since they do not have a clearly defined reference system as a query to narrow down their results in the visual spaces. Distance and angle are not only two important spatial characteristics of a document in the high dimensional document space but also two potential similarity measures. After documents are mapped onto the visual space against a reference system, a visual configuration is formed. The documents close to the origin of the ESCBM visual space are more relevant to the reference system in terms of the distance and/or angle measures. In MRPBM, the documents that are related to the reference points are positioned within an area defined by the reference points in the visual space. The position of a document is determined by the relative attractions of the document to all related reference points, and a visual configuration changes as reference points are repositioned in the visual space. The uniqueness of ESCBM is that the internal retrieval process of an information retrieval model such as the cosine model, distance model, conjunction model, disjunction model, ellipse model, and so on, can be visualized in its visual spaces. It is worth pointing out that the visualization of an internal retrieval process of an information retrieval model is built upon a clearly defined reference system; in other words, it is the reference system that is employed to generate a visual information retrieval contour in the visual space, and that contour is used to control the size of the retrieval results. Furthermore, non-traditional information retrieval models can now be developed within the visual spaces. The exclusive characteristic of SOM is its semantic map derived from individual object distributions. The semantic
map at the macro-level, which is generated from the weight vectors in the grid, transcends the individual objects at the micro-level. Automatic interpretation (area labeling) of the partitioned areas in the feature map is a task unique to SOM. The automatic interpretation of the partitioned areas is a value-added knowledge generation process rather than a simple process of area labeling. The salient characteristic of MRPBM is the manipulation flexibility of its reference points. The distinguishing characteristic of PFNET lies in the optimal structure it produces. The power of MDS is that the data used in multidimensional scaling analysis is relatively free of distributional assumptions, and MDS can handle various types of data ranging from ordinal, to interval, to ratio data. Results of a query search in ESCBM, PFNET, MDS, and SOM can be visualized by highlighting the retrieval results within the visual contexts of their visual spaces. Ambiguity is an important issue in information visualization. It happens when objects from a high dimensional space are projected onto the same spot in a low dimensional space. On the one hand, the ambiguity can mislead users, because documents or objects that are separate in a high dimensional space overlap in the low dimensional space after projection. On the other hand, the ambiguity may reveal something useful about the overlapped objects: when objects overlap in a visual space, it implies that they share some common characteristics. Information retrieval visualization environments with an explicit and stable reference system, such as MRPBM and ESCBM, can reconstruct a new visual configuration by redefining the reference system to disambiguate the overlapping objects to an extent. It should be pointed out that the ambiguity can be alleviated by adjusting the positions of reference points in the visual space if the ambiguity is caused by the placement of the reference points in ESCBM.
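To make the ambiguity and disambiguation points above concrete, the following minimal sketch uses a simplified weighted-centroid placement in the spirit of multiple reference point displays; it is an illustration only, not the exact DARE, TOFIR, GUIDO, or VIBE algorithms, and the similarity values are invented. Two distinct documents whose similarities to two reference points stand in the same ratio land on the same spot; adding a reference point to which they relate differently pulls them apart, which corresponds to redefining the reference system as described above.

    import numpy as np

    def place(similarities, ref_positions):
        # Weighted-centroid placement: a document is attracted to each reference
        # point in proportion to its similarity to that point.
        w = np.asarray(similarities, dtype=float)
        return (w[:, None] * ref_positions).sum(axis=0) / w.sum()

    refs = np.array([[0.0, 0.0], [1.0, 0.0]])
    doc_a = [0.2, 0.4]          # similarities of document A to the two reference points
    doc_b = [0.3, 0.6]          # a different document with the same 1:2 ratio
    print(place(doc_a, refs), place(doc_b, refs))    # identical positions: ambiguity

    # A third reference point, to which A and B relate differently, separates them.
    refs3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
    print(place(doc_a + [0.1], refs3), place(doc_b + [0.5], refs3))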
12.3 Issues and challenges
Information retrieval visualization is an emerging field, so it is not surprising that there are still many issues and challenges in the field; we discuss them in detail below.
Integration of existing visualization techniques
One of the most prominent characteristics of information retrieval visualization is the diversity and richness of its approaches, which result from various explanations and interpretations of an abstract and high dimensional data set. Each information visualization approach has its uniqueness in terms of revealing and handling data; each also has its weaknesses. The diversity, richness, and uniqueness of information retrieval visualization approaches pose a new question: can various information retrieval visualization approaches be synthesized into one visualization environment for users to explore information? Such a synthesis could take advantage of their strengths, overcome their weaknesses, and therefore achieve a better understanding and utilization of a data set. In fact, many visualization systems have moved in that direction. MDS, SOM, and the parallel coordinate
visualization method were successfully integrated into one interactive environment (Swayne et al., 2003). With a powerful brushing feature, objects in multiple visual configurations can be identified, associated, and compared efficiently and effectively. The hyperbolic technique was naturally combined with MDS (Walter and Ritter, 2002). The fisheye technique was integrated into the self-organizing map environment for effective information exploration (Yang et al., 2003). The hyperlink technique was coupled with the multiple reference point based visualization method (Olsen et al., 1993). Many popular information visualization methods have been applied to hierarchical structures (Spoerri, 1993a; Koren and Harel, 2003; Hemmje et al., 1994). DARE, TOFIR, and GUIDO are three visualization approaches based on the Euclidean spatial characteristics. Each of these visualization models has its own unique characteristics in terms of the visual spaces, visual semantic frameworks, and ways of visualizing information retrieval models. These three models can be combined into one visualization environment with the brushing technique. In this way, the same data set, the same reference points, the same information retrieval model, and the same metric can be compared and analyzed easily in the three different visualization configurations. These three models can also be synthesized into one three dimensional visualization model. In this case, distance, distance, and angle are assigned to the X-axis, Y-axis, and Z-axis, respectively. The three projection parameters of a document can be calculated against two defined reference points in the document space; a small sketch of this calculation is given below. Similarly, angle, angle, and distance can be assigned to the X-axis, Y-axis, and Z-axis, respectively, to form a new three dimensional visual space. Notice that one of the benefits of synthesizing these three dimensional visualization models is that adding an extra dimension may alleviate the notorious ambiguity phenomenon to a degree. Basically, there are two synthesis strategies for information retrieval visualization models. The first displays multiple visual configurations simultaneously in a larger visualization environment. These visual configurations are generated by different visualization approaches or algorithms based on the same data set, and they are connected by the brushing technique. That is, if an object or area in one visual configuration is selected by users, the object or area is highlighted accordingly in the other visual configurations. In this way, an object/area of interest can be effectively compared and associated across multiple visual configurations. This method does not require significant changes to the involved visualization approaches or algorithms; the only task is to add the brushing technique to the existing visualization approaches. The second strategy is more complex than the first. It synthesizes various visualization approaches into one new visualization approach. In other words, it may change the projection algorithms or visual semantic frameworks of the involved approaches and integrate them into one projection algorithm and one semantic framework. As a result, it produces only one visual configuration in its visualization environment. It is clear that there are other potential integrations of information retrieval visualization approaches that await further exploration.
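As an illustration of the three dimensional synthesis mentioned above, the sketch below computes the three projection parameters of a document against two reference points: its distance to each reference point and an angle. Which angle is paired with which axis is a design choice of the synthesized model; this sketch simply takes the angle between the document vector and the first reference point, and the vectors are invented toy data rather than values from any system discussed in the book.

    import numpy as np

    def projection_parameters(doc, ref1, ref2):
        # Distance to each reference point (X and Y) plus the angle between the
        # document vector and the first reference point (Z).
        d1 = np.linalg.norm(doc - ref1)
        d2 = np.linalg.norm(doc - ref2)
        cosine = doc @ ref1 / (np.linalg.norm(doc) * np.linalg.norm(ref1))
        angle = np.arccos(np.clip(cosine, -1.0, 1.0))
        return d1, d2, angle

    doc  = np.array([0.8, 0.1, 0.3])
    ref1 = np.array([1.0, 0.0, 0.0])
    ref2 = np.array([0.0, 1.0, 1.0])
    print(projection_parameters(doc, ref1, ref2))   # (x, y, z) position in the synthesized space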
When visualization approaches are integrated, their data structures should be compatible, their displayed attributes should be complementary, and the involved approaches must
supply mutual needs or offset mutual shortcomings. Although a suite of visualization techniques can interpret and present the visualized data from different perspectives, multiple visual configurations may impose extra cognitive workload on users. A perceptually, semantically, and topologically smooth and seamless synthesis can decrease any possible cognitive burden.
Full-text visualization
Visualization for a full-text has not attracted enough attention from researchers, yet it is crucial for a lengthy document like a book. As more and more full-text databases become available, visualization for a full-text becomes more important and necessary. Visualization for a full-text differs from visualization for an entire database in many ways. The number of displayed objects in full-text visualization is relatively smaller than that for an entire database. Displayed objects in a visual space can be defined as chapters, paragraphs (as in VIBE), sentences, or keywords (as in LyberWorld) within a full-text. Generally speaking, the semantic connections among those objects in a full-text are much stronger than those among documents in an entire database, because the objects in a full-text are parts of one semantic and logical entity. Most full-text visualization systems focus on an individual full-text, such as VIBE and LyberWorld, while others can visualize multiple full-texts in a visual space, such as TileBars (Hearst, 1995). In TileBars, the full texts of documents returned from a query search are visualized as bars, and different color densities are used to show the distribution of search terms within the full texts. The issues related to visualization for a full-text include, but are not limited to: how to integrate visualization for a full-text and visualization for a data collection into one visualization environment and make a smooth transition from the visualization of a data collection to the visualization of a full-text; how to develop new visualization models for a full-text; how to define objects and effectively calculate the similarity between the defined objects within a full-text; and how to construct meaningful semantic frameworks.
Screen real estate
The screen real estate issue is a long-standing challenge for information visualization. Theoretically speaking, the more data presented and viewed in a visual space, the more patterns and trends there are to be revealed to users. However, the limited screen space makes it impossible to display huge amounts of data. Overwhelming data in a limited display space would reveal nothing and only lead to confusion for users. An overlapped and congested visual environment degrades the visibility of the area of interest, reduces the differentiation of objects in the visual space, and makes it difficult or even impossible to interact with the objects and perceive relationships among the observed objects. Therefore, striking a balance between the appropriate amount of displayed data and the readability of the visual space is a fundamental issue for any serious information visualization approach. There are several methods to manage this problem.
1. Filtering. There are three kinds of filtering. The first is to narrow an overview display area down to a local area to decrease the number of displayed
objects. For instance, one way to filter is to limit the display to a specified category, subject heading, or class. This approach obviously reduces the density of displayed objects by focusing on a local area. The second type of filtering preserves the global presentation context while decreasing the number of displayed objects in the space, for instance by limiting document file types (PDF, DOC, or HTML), size, or a certain time period. This method maintains a global view while removing objects with certain characteristics, whereas the first method maintains a local view and all objects within the specified local area. The third kind of filtering is a combination of the previous two.
2. Overview + detail. This technique uses two displays to present detail and overview information, respectively (Bolt, 1984). By identifying and selecting an area of interest in the overview display window, users can simultaneously observe the detailed information of the selected area in a separate detail display window. The contents of the detail display window update accordingly when the area of interest shifts in the overview window. Following the same line, the overview + detail approach can be expanded to multiple layers of overview + detail. That is, when a detail window is open, it can be regarded as a new overview window which can trigger another sub-detail window. In this way, associative multiple layers of overview + detail can be generated. After the multiple layers of overview + detail are produced, users may return to any of the upper overview windows at will. Multiple layers of overview + detail have the potential to handle huge amounts of data, but be aware that too many layers may impose extra cognitive workload on users. The brushing technique can be used to connect the associated display windows to alleviate this problem.
3. Focus + context. Basically, the focus + context method maintains one window for both overview and detail information (Spence and Apperley, 1982; Furnas, 1986). A visual transformation mechanism magnifies an area of interest within the overview so that more detailed information can be observed in the focus area. The magnified focus area stays within the overview, since the transformation mechanism makes a smooth connection from the focus area to the unfocused neighboring area in the overview. However, as a result of the transformation, the focus area may be distorted to some degree. This may cause disorientation and confusion for end users, because parallel lines may no longer remain parallel, angles may change, and the shape of an object may alter in the area.
4. Zooming in/out. Zooming originally referred to observing an object/area at different viewing distances to get different visual details of the object/area. This concept is widely used in information visualization. There are two kinds of zooming: spatial zooming and semantic zooming. Spatial zooming simply enlarges a focused object/area; in other words, the object keeps the same shape at different zooming levels, and geometric relations between objects proportionally increase or decrease as an area is zoomed in or out. In semantic zooming, when zoomed out, instead of seeing a scaled down version of the object/area, it is potentially more effective to see a different
representation of it (Bederson and Hollan, 1994). That is, the form and shape of an object/area, which depend on its viewing size or other characteristics, can vary at different zooming levels. Semantic zooming does not necessarily maintain the same shape of an object. This characteristic adds flexibility for hiding or revealing both different perspectives and different degrees of detail of an object during zooming. The overview + detail, focus + context, and zooming approaches have different mechanisms for emphasizing a focus area: a focused area appears in a separate window in the overview + detail approach; a magnified focused area is embedded in the overview window in the focus + context approach; and a focused area replaces the overview in the same window in the zooming approach. Items in a data collection need to be stratified to support data display at different detail levels. Data stratification, which can be done either manually or automatically, will underlie the degree of detail and the quality of the displayed data. When a display changes from one level to another, the ratio of an object's size at one level to its size at the other level can be used to measure the extent of object change. The ratio affects the degree of geometrical distortion in the focus + context approach and the smoothness of the display change in the zooming approach. Although several methods are available for conserving screen real estate, people never stop looking for new ones.
2D display vs. 3D display
The debate over a two dimensional environment vs. a three dimensional one is not simply a discussion about the method of data presentation. It affects not only the design of a visual environment but also the use of that environment. In general, 3D displays illustrate an effective overview of a three dimensional space, reveal a variety of shapes, and show more objects. Due to their three dimensional nature, 3D displays provide users with a wide spectrum of interactive controls, such as rotating an object, observing an object from 360 degrees, entering/exiting a meaningful entity, and so on. They naturally present objects in an environment similar to the way humans interact with their real surrounding environment, making users' navigation in the space more comfortable. On the other hand, 3D environments clearly face more technical and theoretical challenges to accommodate these features. For instance, occlusion, shadows, lighting, ground planes, texturing, and multiple degrees of movement control must be handled in a 3D environment. Because of this technical complexity, increased response time can be expected during interaction. 3D does not lend itself to rigorous comparative analysis because of the distortions arising from a perspective view, lighting, shadow, and occlusion (Wright, 1999). Navigation in a 3D environment may require special control devices rather than a simple mouse. 2D displays are relatively easy to implement and achieve faster response times during interaction owing to their structural and algorithmic simplicity. 2D displays are good for reducing occlusion, since there is no depth ambiguity (Tory, 2003). Interacting with and navigating a 2D space does not require strong spatial abilities. People have argued that a 3D display is clearly more effective for physical data that includes 3D spatial variables, while 2D has a long and effective history
for abstract data (Bertin, 1999). This suggests that the advantages of interaction and navigation in a 3D display depend heavily on whether the 3D display simulates a physical environment. However, for information retrieval visualization it is obvious that the visual space is built on an abstract data space and inherits no physical structures from a data collection. This raises a fundamental question: whether 3D displays are suitable for information retrieval visualization, which places high demands on interaction and navigation. Fortunately, a physical metaphor such as a library or a landscape can be embedded to present the semantic and abstract framework of an information retrieval visualization environment, as we discussed in the previous chapter. The metaphorical interface may "change" the nature of the information space from an abstract space to a physical space. The change allows users to take advantage of interaction and navigation in a 3D metaphorical display. A study (Robertson et al., 1998) showed that error rates were lower when retrieving Web pages using their 3D Data Mountain system than when using the standard 2D "Favorites" mechanism of Internet Explorer. The selection of a 2D or 3D display for information retrieval visualization may be based on a user's preference and cognitive capacity, the nature of the visual presentation and displayed data, and the requirements for system response.
Metaphors
Metaphors have a significant impact on information retrieval visualization. Although there are many metaphorical applications in information retrieval visualization, many questions are still left unanswered. A metaphor stems from a culture, and cultural diversity can enrich metaphorical application. The cultural perspective cannot be ignored in metaphorical application due to the diversity of users. Metaphors can be applied to information retrieval visualization at different levels. The search for new metaphors for object icon presentations, semantic framework presentations, information retrieval processes, and visualization algorithm explanations is a research topic in the field. Evaluation of a metaphorical application is a crucial issue. When a metaphor is applied, we need to know to what extent the metaphor matches the target appropriately, to what extent it preserves the salient and meaningful properties of the target, to what extent it reduces users' cognitive workload, to what extent it improves the effectiveness and efficiency of interactive information retrieval, and how it affects users' retrieval behaviors.
Implications for IR research
The application of information visualization in information retrieval is not limited to information presentation, browsing, and query searching; information visualization can also be applied to other information retrieval research. Term discrimination analysis addresses the capacity of an indexing term to distinguish one document from others in a document collection, and it is widely used for the automatic indexing of documents. Traditionally, a term discrimination value is defined as the difference of the document space densities before and after the term is assigned to a document collection. The space density can be computed by employing the average similarity between all document pairs in the collection.
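The following sketch shows one straightforward way to compute these quantities for a small document-term matrix; the cosine measure and the toy data are assumptions made for illustration, not the specific formulation used in the studies cited in this discussion.

    import numpy as np

    def space_density(D):
        # Average pairwise cosine similarity over all document pairs.
        N = D / np.linalg.norm(D, axis=1, keepdims=True)
        S = N @ N.T
        n = len(D)
        return (S.sum() - n) / (n * (n - 1))    # drop the n self-similarities

    def discrimination_value(D, k):
        # Density of the collection without term k minus the density with it.
        return space_density(np.delete(D, k, axis=1)) - space_density(D)

    D = np.array([[3., 0., 1.],
                  [0., 2., 1.],
                  [1., 1., 1.]])
    print(discrimination_value(D, 2))   # negative: term 2 occurs everywhere, a poor discriminator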
Notice that one of the inherent weaknesses of traditional term discrimination analysis is that a single discrimination value does not tell people exactly which documents are affected by the space density change, or to what degree they are affected. For instance, a term discrimination value of 0 admits various interpretations. One scenario may be that no documents in the data collection are affected by removal of the term at all, so the space density remains the same. Another scenario may be that some documents make positive contributions to the space density increase while others make negative contributions; the positive and negative impacts cancel each other out, leading to a final result of 0. The traditional method cannot tell the difference. Applying information visualization techniques to term discrimination analysis may shed light on this issue. A visual display of the space densities, rather than a single difference value, would assist users in revealing and understanding the nature of the density change (Zhang and Wolfram, 2001). It is believed that strongly discriminative terms tend to produce more clustered configurations of documents than poorly discriminative terms. This principle can be successfully applied to the identification of good/bad discriminative terms in a visualization environment (Dubin, 1995). In the VIBE visual environment, a group of terms is selected as reference points and mapped onto the visual space. Due to the mobility of reference points in the visual space, these reference points can be arranged in a circle and all documents are mapped within the circle. The configurations of the projected documents depend on the semantic relationships between the reference points and the projected documents. If the documents are not spread out but concentrate in a certain area, it indicates that they are not sensitive to any of the reference points, which suggests that the selected terms are poor discriminators with respect to the document set. However, if the documents are widely spread out and form multiple document clusters, it means that they are related to the reference points, which implies that these terms are good discriminators for the documents. Information visualization may also be applied to new information retrieval research fields.
Visualization for Internet information
The Internet penetrates every corner of the world: offices, labs, classrooms, libraries, dorms, and houses. No information retrieval system other than the Internet has had such a profound impact upon society. Information dissemination, information sharing, information retrieval, and information utilization have never been easier or more convenient than they are with the Internet. However, the dynamics, vastness, and heterogeneity of the Internet, which alone can be a major obstacle for the user in selecting appropriate sources to search (Mostafa, 2004), coupled with the diversity of users, pose both unprecedented challenges and opportunities for information retrieval visualization. Seeking information on the Internet relies heavily upon both query searching and browsing. Relevance assessment for a large result set from a search engine is not an easy task, and browsing in such a huge cyberspace has proved a very difficult job. Notice that the utilization of Internet resources goes far
beyond Web pages, which are, of course, important. It also extends to Internet traffic, the evolution of a website, the usage patterns of a portal, network security, and so on. People expect new portable, intuitive, interactive, versatile, and powerful visualization techniques and approaches suited to Internet information, such as visualization search engines for more efficient and effective search, new visual navigation mechanisms to alleviate, if not eliminate, the notorious disorientation problem, and other potential innovative visualization applications.
Evaluation
Evaluation for information retrieval visualization refers to measuring the extent to which people use it to achieve retrieval goals in terms of effectiveness, efficiency, and satisfaction in the visualization context. An information retrieval visualization environment is much more complicated than the interface of a common information retrieval system because of the multiformity of semantic frameworks, the complexity of data relationships, the diversity of displayed data, and the interactive nature of exploratory search, in conjunction with the perceptual and cognitive abilities involved in the visualization environment. As a result, users' information tasks may become more complicated and the retrieval process more sophisticated. It is difficult to come up with a universal evaluation system for all information retrieval visualization environments. There are many open issues, such as evaluating the effectiveness of a visual semantic framework, the impact of a metaphor on a visual space and on user behavior, ambiguity in a visual space, disorientation during navigation, and so on. Another important issue is the evaluation of information retrieval visualization results. Traditional information retrieval evaluation criteria are primarily dependent upon the widely accepted recall and precision. But these evaluation criteria are calculated based on individual items, such as retrieved items, relevant retrieved items, and relevant items in a data collection. It is apparent that the evaluated items result mainly from query searching. In an information retrieval visualization environment, both query searching and browsing are utilized by users to seek information. Browsing as an important retrieval mechanism enables users not only to find relevant individual items at the micro-level but also to identify trends, patterns, clusters, and other useful aggregate information at the macro-level in the visual space. It is clear that trends, patterns, clusters, and other useful aggregate information, which may comprise individual items, cannot be measured by individual-item-based recall and precision. But as important parts of the retrieval results, they should be reflected in the information retrieval evaluation criteria. However, it is difficult to define measurable and plausible criteria for retrieved aggregate information due to the lack of a meaningful measurement unit.
New paradigms
As we discussed in Chap. 2, there are three basic information retrieval visualization paradigms. In the query searching and browsing (QB) paradigm, information at the micro-level is searched by queries and then visualized and browsed in a visual space by users. In the browsing and query searching (BQ) paradigm,
information at the macro-level is presented and browsed, and then information at the micro-level is searched and highlighted in the visualization contexts. In the browsing only (BO) paradigm, information at the macro-level is displayed and browsed. In all of these cases, information at the macro-level is only browsable or viewable; it is not searchable. In other words, aggregate information is never searchable in these paradigms. Finding a new information retrieval visualization paradigm, in which aggregate information at the macro-level is organized in such a way that it is searchable, is a huge challenge. The difficulty of this task lies in identifying and defining meaningful and searchable objects from the aggregate information, especially in automatically identifying patterns, clusters, connections, contexts, and the like. For instance, in a SOM feature map, after subject areas are partitioned, how can subject terms be automatically and accurately generalized for a partitioned subject area? If these subject terms can be generated automatically, they may serve as access points to search these subject areas (one simple labeling heuristic is sketched at the end of this section). Another issue is the development of new information retrieval mechanisms within an information retrieval visualization environment. The development of such retrieval mechanisms relies heavily upon the structures of visual semantic frameworks and the data projection methods. For example, in the DARE environment, non-traditional information retrieval models were developed to generate an asymmetric information retrieval contour that treats search focuses differently in the visual space (Zhang and Korfhage, 1999). This is totally different from the contour of a traditional information retrieval model, which is symmetric and treats search focuses equally in the visual space.
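As one simple heuristic for the SOM labeling question raised above, the weight vectors of the nodes in a partitioned area can be averaged and the most heavily weighted terms proposed as candidate access points. This is only an illustrative sketch with an invented vocabulary and invented weights, not a prescribed solution to the automatic interpretation problem.

    import numpy as np

    def label_area(node_weight_vectors, vocabulary, top_n=3):
        # Average the weight vectors of the nodes in a partitioned area and
        # return the most heavily weighted terms as candidate subject labels.
        mean_weights = np.mean(node_weight_vectors, axis=0)
        top = np.argsort(mean_weights)[::-1][:top_n]
        return [vocabulary[i] for i in top]

    vocabulary = ["retrieval", "visualization", "network", "query", "browsing"]
    area_nodes = np.array([[0.1, 0.7, 0.0, 0.3, 0.6],
                           [0.2, 0.8, 0.1, 0.2, 0.5]])
    print(label_area(area_nodes, vocabulary))   # ['visualization', 'browsing', 'query']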
12.4 Summary
It is apparent that both MRPBM and ESCBM have a strong and natural connection to information retrieval; as a matter of fact, the development of both MRPBM and ESCBM was directly driven by information retrieval. MRPBM supports both the Boolean information retrieval model and the vector information retrieval model, and it attempts to attack the inherent problem of a traditional information retrieval system caused by a linear result presentation structure. The visual spaces of ESCBM are defined on the important distance and/or angle similarity measures of information retrieval, and ESCBM also supports the visualization of the various information retrieval evaluation models. A visual configuration of either MRPBM or ESCBM can be generated based on users' information needs, which makes a customized visual presentation and interactive visual retrieval possible. The available powerful computing capacity and mature graphics techniques, in conjunction with growing user demands, make it possible for information retrieval visualization to support visual information discovery and information exploration in an interactive environment. Information retrieval visualization holds great promise, and challenges, for information retrieval. It empowers users and enriches information retrieval. It represents the future of information retrieval.
Bibliography
Allbritton DW, McKoon G, and Gerrig R (1995). Metaphor-based schemas and text representations: Making connections through conceptual metaphors. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21(3), 612-625. Allen B (1998). Information space representation in interactive systems: relationship to spatial abilities. Proceedings of the Third ACM Conference on Digital Libraries’98, June 23-26, 1998, Pittsburgh, Pennsylvania, pp. 1-10. Ang CS, Martin DC, and Doyle MD (1994). Integrated control of distributed volume visualization through the World Wide Web. Proceedings of IEEE Visualization’94, October 17-21, 1994, Washington D.C., pp. 13-20. Arnheim R (1972). Visual Thinking. Berkeley, CA: University of California Press Baecker R, Grudin J, Buxton W, and Greenberg S (1995). Readings in Human Computer Interaction: Toward the Year 2000. San Mateo, CA: MorganKaufmann Bartell BT, Cottrell GW, and Belew RK (1992). Latent Semantic Indexing is an Optimal Special Case of Multidimensional Scaling. Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval’92, June 21-24, 1992, Copenhagen, Denmark, pp. 161-167. Bates, MJ (2002). Speculations on Browsing, Directed Searching, and Linking in Relation to the Bradford Distribution. Emerging Frameworks and Methods: Proceedings of the Fourth International Conference on Conceptions of Library and Information Science’02, July 21-25, 2002, Greenwood Village, pp. 137-150. Battista GD, Eades P, Tamassia R, and Tollis IG (1994). Algorithms for drawing graphs: An annotated bibliography. Computational Geometry: Theory and Applications, 4(5), 235-282. Beaudoin L, Parent MA, and Vroomen LC (2005). Cheops: A compact explorer for complex hierarchies. Retrieved November 13, 2005, from the World Wide Web: http://www.istop.com/~maparent/paper.html
Bederson BB (2000). Fisheye menus. Proceedings of the 13th Annual ACM Symposium on User interface Software and Technology’00. November 5-8, 2000, San Diego, California, pp. 217-225. Bederson BB, and Hollan JD (1994). Pad++: a zooming graphical interface for exploring alternate interface physics. Proceedings of the 7th Annual ACM Symposium on User interface Software and Technology’94, November 2-4, 1994, Marina del Rey, California, pp.17-26. Belmont Abbey College North Carolina (2005). Retrieved November 13, 2005, from the World Wide Web: http://belmont.antarctica.net/ Benford S, Snowdon D, Greenhalgh C, Ingram R, and Knox I (1995). VR-VIBE: A Virtual Environment for Co-operative Information Retrieval. Proceeding of Eurographics’95, August 30th-September 1st , 1995, Maastricht, pp. 349-360. Benford S, Snowdon D, Colebourne A, O’Brien J, and Rodden T (1997). Informing the design of collaborative virtual environments. Proceedings of the international ACM SIGGROUP conference on Supporting group work: the integration challenge, GROUP’97, November 16-19, 1997, Phoenix, Arizona. ACM Press, pp.71-80. Benyon D, and Imaz M (1999). Metaphors and Models: Conceptual Foundations of Representations in Interactive Systems Development. Human-Computer Interaction, 14(1-2), 159-189. Bertin J (1999). Graphics and graphic information processing. In S. K. Card, J. D. Mackinlay, and B. Shneiderman (Ed.), Readings in information Visualization: Using Vision to Think, pp. 62-65. Morgan Kaufmann Publishers, San Francisco, CA. Bolt RA (1984). The human interface-where people and computers meet. Belmont, CA: Lifetime Learning Publications. Booth P (1989). An introduction to human computer interaction. Hillsdale, UK: Lawrence Erlbaum Associates Publisher. Borg I, and Groenen P (1997). Modern multidimensional scaling: theory and application. New York: Springer. Boyack KW, Wylie BN, and Davidson GS (2002). Domain visualization using VxInsight for science and technology management. Journal of the American Society for Information Science and Technology, 53(9), 764-774. Buja A, Swayne DF, Littman ML, Dean N, and Hofmann H (2001). XGvis: Interactive data visualization with multidimensional scaling, Retrieved April 27, 2007, from the World Wide Web: http://public.research.att.com/~stat/xgobi/ papers/xgvis-joc.pdf Burke K. (1962). A grammar of motives, and a rhetoric of motives. Cleveland: World Pub. Co. Buzydlowski JW, White HD, and Lin X (2001). Term co-occurrence analysis as an interface for digital libraries. Proceedings of Joint Conference on Digital Libraries’01, Springer-Verlag, London. ACM Press, pp. 133-144. Cadez I, Heckerman D, and Meek C (2000). Visualization of navigation patterns on a web site using model based clustering. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’00, August 20-23, 2000, Boston, Massachusetts. ACM Press, pp. 280-284.
Cai G (2002). GeoVIBE: A visual interface to geographic digital library. Proceedings of visual interfaces to digital libraries JCDL’02 workshop, London. Springer-Verlag, pp. 171-187. Card SK, Machinlay JD, and Shneiderman B (1999). Readings in information visualization: using vision to think. San Francisco: Morgan Kaufmann, pp, 1-34. Card SK, Robertson GG, and York W (1996). The Web Book and the Web Forager: an information workspace for the World-Wide Web. Proceedings of the SIGCHI Conference on Human Factors in Computing systems: Common Ground, CHI’96, April 13-18,1996, Vancouver, BC, Canada. ACM Press, pp. 111-117. Carey M, Heesch DC, and Ruger SM (2003). Info Navigator: A visualization tool for document searching and browsing. Proceedings of the International Conference on Distributed Multimedia Systems (DMS), September 2003, pp. 23-28. Carroll JM and Thomas JC (1982). Metaphor and the Cognitive Representation of Computing Systems. Trans Systems, Man, and Cybernetics, 12(2), 107-116. Centner D, and Markman AB (1997). Structure mapping in analogy and similarity. American Psychologist, 52(1), 45-56. Chalmers M, and Chitson P (1992). Bead: Explorations in Information Visualization. Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’92, June 21-24, 1992, Copenhagen, Denmark ACM Press, pp. 330-337. Chang SJ and Rice RE (1993). Browsing: A multidimensional framework. Annual Review of Information Science and Technology, 28, 231-276. Chen C (1996). Behavioral patterns of collaborative writing with hypertext: A state-transition approach. Proceedings of HCI on People and Computers XI’96, London, UK. Springer-Verlag, pp. 265-279. Chen C (1997). Structuring and visualising the WWW by generalised similarity analysis. Proceedings of the Eighth ACM Conference on Hypertext’97, April 06-11, 1997, Southampton, United Kingdom. ACM Press, pp. 177-186. Chen C (1999). Visualising semantic spaces and author co-citation networks in digital libraries. Information Processing and Management, 35(3), 401-420. Chen C (2000). Empirical evaluation of information visualizations: An introduction. International Journal of Human-Computer Studies, 53(5), 631-635. Chen C (2004). Searching for intellectual turning points: progressive knowledge domain visualization. Proceedings of National Academy of Sciences of the United States of America, 101(1), 5303-5310. Chen C (2005). Top 10 Unsolved Information Visualization Problems. IEEE Computer Graphics and Applications, 25(4), 12-16. Chen C, and Morris S (2003). Visualizing evolving networks: Minimum spanning trees versus Pathfinder networks. Proceedings of IEEE Symposium on Information Visualization InfoVis’03, Oct 19-24, 2003, Seattle, Washington, pp. 67-74. Chen C, Schuffels C, and Orwig R (1996). Internet categorization and search: a self-organizing approach. Journal of visual communication and image representation. 7(1), 88-102. Chen J, Sun L, Zaiane OR, and Gebel R (2004). Visualizing and discovering web navigational patterns. Proceedings of the 7th International Workshop on the
272
Bibliography
Web and Databases: Colocated with ACM SIGMOD/PODS’04, June 17-18, 2004, Paris, France. ACM Press, pp. 13-18. Chen J, Sun L, Zaïane OR, and Goebel R (2004). Visualizing and discovering web navigational patterns. Proceedings of the 7th international Workshop on the Web and Databases: Colocated with ACM SIGMOD/PODS’04, June 17-18, 2004, Paris, France. ACM Press, pp. 13-18. Chi EH (2002). Improving web usability through visualization. IEEE Internet Computing, 6(2), 64-71. Chi EH, Pitkow J, Mackinlay J, Pirolli P, Gossweiler R, and Card SK (1998). Visualizing the evolution of Web ecologies. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems’98, April 18-23, 1998, Los Angeles, California. ACM Press/Addison-Wesley Publishing Co., pp. 400-407. Chudý R, and Kadlec J (2004). Spatial Interface Design. In: ElectronicsLetters.com, 2004(1), pp. 1-13. Christel MG (1999). Visual digests for news video libraries. Proceedings of ACM international Conference on Multimedia ‘99, October 30- November 05, 1999, Orlando, Florida. ACM Press, pp. 303-311. Christel MG, and Huang CH (2001). SVC for navigating digital news video. Proceedings of the ninth ACM international conference on Multimedia, September 30 - October 05, 2001, Ottawa, Canada, pp. 483-485. ACM Press, New York, NY. Colonna J (1994). Scientific display: a means of reconciling artists and scientists. In C. A. Pickover and S. K. Tewksbury (Ed.), Frontiers of Scientific Visualization, pp. 181-212. John Wiley & Sons, New York, NY. Conti G, and Abdullah K (2004). Passive visual fingerprinting of networking attack tools. Proceedings of the 2004 ACM Workshop on Visualization and Data Mining For Computer Security, VizSEC/DMSSEC’04, October 29-29, 2004, Washington D.C. ACM Press, pp. 45-54. Cooke NJ, Neville KJ, and Rowe AL (1996). Procedural network representations of sequential data. Human-Computer Interaction, 11(1), 29-68. Cooper A (1995 a). About face-the essentials of user interface design. Foster city CA:IDG Books Worldwide. Cooper A (1995 b). The myth of metaphor. Visual Basic Programmer’s Journal, June, 127-128. Craik KJW (1943). The Nature of Explanation. Cambridge, UK: Cambridge University Press. Cresques A (1978). Mapamundi, the Catalan atlas of the year 1375. Urs Graf Publisher. Cugini J (2005). Presenting search results: Design, visualization, and evaluation. Retrieved November 13, 2005, from the World Wide Web: http://www.itl.nist.gov./iaui/vvrg/cugini/irlib/paper-may2000.html Czerwinski M, Dumais S, Robertson G, Dziadosz S, Tiernan S, and van Dantzich M (1999). Visualizing implicit queries for information management and retrieval. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: the CHI Is the Limit, CHI’ 99, May 15-20, 1999, Pittsburgh, Pennsylvania. ACM Press, pp. 560-567.
Bibliography
273
DARPA Neural Network Study. (1988). DARPA Neural Network Study. AFCEA International Press, pp. 60. Dearholt DW, and Schvaneveldt RW (1990). In R.W. Schvaneveldt (Ed.) Pathfinder associative networks: studies in knowledge organization, pp.1-30. Norwood, New Jersey: Ablex Publishing Corporation. Dillon A (2003). User interface design. MacMillan Encyclopedia of Cognitive Science, 4, 453-458. Douglas SA and Moran TP (1983). Learning Text Editor Semantics by Analogy. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems’83, December 12-15, 1983, Boston, Massachusetts. ACM Press, pp. 207-211. Dubin D (1995). Document analysis for visualization. Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval’95, July 09-13, 1995, Seattle, Washington. ACM Press, pp. 199-204. Duncker E (2002). Cross-cultural usability of the library metaphor. Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries JCDL ‘02, July 14-18, 2002, Portland, Oregon, Portland. ACM Press, pp. 223-230. Durand DG, and Kahn P (1998). MAPA: A system for inducing and visualizing hierarchy in websites. Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia : Links, Objects, Time and Space—Structure in Hypermedia Systems: Links, Objects, Time and Space—Structure in Hypermedia Systems’98, June 20-24, 1998, Pittsburgh, Pennsylvania. ACM Press, pp. 66-76. Dürsteler JC (2001). WebMap. Inf@Vis, 55. Retrieved October 4, 2005, from the World Wide Web: http://www.infovis.net/printMag.php?num=55&lang=2 Eades P (1984). A heuristic for graph drawing. Congressus Mumerantium, 42, 149-160. Ellsberg D (1961). Risk, ambiguity and the savage axioms. Quarterly Journal of Economics, 75, 643-669. Eick SG (2001). Visualizing online activity. Communication of the ACM, 44(8), 45-50. Ellis G, and Dix A (2004). Quantum web fields and molecular meanderings: Visualising web visitations. Proceedings of the Working Conference on Advanced Visual interfaces, AVI’04, May 25-28, 2004, Gallipoli, Italy. ACM Press, pp. 197-200. Erickson T (2003). Designing visualizations of social activity: six claims. Proceedings of CHI ‘03 extended abstracts on Human factors in computing systems’03, April 05-10, 2003, Ft. Lauderdale. ACM Press, pp. 846-847. Fang X (2000). A hierarchy search history for web searching. International Journal of Human-Computer Interaction, 12(1), 73-88. Fauconnier G (1997). Mappings in thought and language. Cambridge University Press. Festinger L (1957). Theory of Cognitive Dissonance. Stanford: Stanford University Press. Fisheye menu (2005). Retrieved November 13, 2005, from the World Wide Web: http://www.cs.umd.edu/hcil/fisheyemenu/fisheyemenu-demo.shtml
274
Bibliography
Fishkin K and Stone MC (1995). Enhanced dynamic queries via movable filters. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems’95, May 07-11, 1995, Denver, Colorado. ACM Press/Addison-Wesley Publishing Co., pp. 415-420. Fowler RH, and Dearholt DW (1990). Properties of pathfinder networks. In R.W. Schvaneveldt (Ed.) Pathfinder Associative Networks: Studies in Knowledge Organization, pp. 165-178. Norwood, New Jersey: Ablex Publishing Corporation. Fowler RH, Fowler WAL, and Wilson BA (1991). Integrating query, thesaurus, and documents through a common visual representation. Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval’91, October 13-16, 1991, Chicago, Illinois. ACM Press, pp. 142-151. Fowler RH, Fowler WAL, and Williams JL (1996). Document explorer visualization of WWW document and term space. Department of Computer Science, University of Texas- Pan American, Technical Report, NAG9-551, #96-6. Freitas CMDS, Luzzardi PRG, Cava RA, Winckler MAA, Pimenta MS, and Nedel LP (2005). Evaluation usability of information visualization techniques. Retrieved November 13, 2005, from the World Wide Web: http://www.inf.ufrgs.br/ cg/publications/carla/FreitasEtAl-IHC.pdf Fruchterman TMJ, and Reingold EM (1991). Graph drawing by force-directed placement. Software - Practice and Experience, 21(11), 1129-1164. Furnas GW (1986). Generalized fisheye views. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems’86, April 13-17, 1986, Boston, Massachusetts. ACM Press, pp.16-23. Futrelle RP (1999). Ambiguity in Visual Language Theory and its Role in Diagram Parsing. Proceedings of the IEEE Symposium on Visual Languages, September 13-16, 1999, pp. 172-175. Gaver WW, Beaver J, and Benford S (2003). Ambiguity as a resource for design. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems’03, April 05-10, 2003, Ft. Lauderdale, Florida, ACM Press, pp. 233-240. Gentner D, and Gentner DR (1983). Flowing waters or teeming crowds: mental models of electricity. In D. Gentner and A.L. Stevens (Ed.), Mental models, pp. 99-129. Englewood Cliffs: Lawrence Earlbaum Associates, Inc. Gentner D, and Nielson J (1996). The Anti-Mac Interface. Communications of the ACM, 39(8), 70-82. GRIDL (GRaphical Interface for Digital Libraries) (2005). Retrieved November 13, 2005, from the World Wide Web: http://www.cs.umd.edu/hcil/members/ arose/gridl/ Grinstein GG, Hoffman P, Laskowski SJ, and Pickett RM (2005). Benchmark development for the evaluation of visualization for data mining. Retrieved November 13, 2005, from the World Wide Web: http://home.comcast.net/ ~patrick.hoffman/viz/benchmark.pdf Guttman L (1968). A general nonmetric technique for finding the smallest coordinate space for a configuration of points. Psychometrika, 33, 469-506. Hamilton A (2000). Metaphor in theory and practice: the influence of metaphors on expectations. ACM Journal of Computer Documentation, 24(4), 237-253.
Bibliography
275
Hasan MZ, Mendelzon AO, and Vista D (1996). Applying database visualization to the world wide web. SIGMOD Record, 25(4), 45-49. Havre S, Hetzler E, Perrine K, Jurrus E, and Miller N (2001). Interactive visualization of multiple query results. Proceedings of the IEEE Symposium on Information Visualization 2001 (INFOVIS’01), October 22-23, 2001, pp.105. Havre S, Hetzler E, Whitney P, and Nowell L (2002). ThemeRiver: Visualizing thematic changes in large document collections. IEEE Transactions on Visualization and Computer Graphics, 8(1), 9-20. Haykin S (1994). Neural Networks: A Comprehensive Foundation. NY: Macmillan. Hearst MA (1995). TileBars: visualization of term distribution information in full text information access. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems’95, May 7-11, 1995, Denver, Colorado, pp. 59-66. Hearst MA (1999). User interfaces and visualization. In R. Baeza-Yates and B. Ribeiro-Neto (Ed.), Modern Information Retrieval, chapter 10, pp. 257-323. Harlow, England: Addison-Wesley. Hemmje M, Kunkel C, and Willett A (1994). LyberWorld - A Visualization User Interface Supporting Full Text Retrieval. Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval’94, July 03-06, 1994, Dublin, Ireland, pp. 249-258. Herder E, and Weinreich H (2005). Interactive web usage mining with the navigation visualizer. Proceedings of CHI’05 Extended Abstracts on Human Factors in Computing Systems, April 02-07, 2005. ACM Press, pp. 1451-1454. Hochheiser H, and Shneiderman B (2001). Using interactive visualizations of WWW log data to characterize access patterns and inform site design. Journal American Society for Information Science and Technology, 52(4), 331-343. Hong JI, and Landay JA (2001). WebQuilt: A framework for capturing and visualizing the web experience. Proceedings of the 10th International World Wide Web Conference’01, May 1-5, 2001, Hong Kong, pp. 717-724. Hutchins E (1989). Metaphors for Interface Design. The Structure of Multimodal Dialogue. Amsterdam: North-Holland, pp.11-28. Indurkhya B (1992). Metaphor and Cognition. Kluwer Academic Publishers. Inselberg A (1997). Multidimensional detective. Proceedings of the 1997 IEEE Symposium on information Visualization (infovis ‘97), October 18-25, 1997, pp. 100. Inxight (2005). Retrieved November 13, 2005, from the World Wide Web: http://www.inxight.com/ Jackendoff R (1992). Languages of the minds. Cambridge, MA: MIT Press. Jerding DF, and Stasko JT (1995). The information mural: a technique for displaying and navigating large information spaces. Proceedings of the 1995 IEEE Symposium on information Visualization’95, October 30-31, 1995, Atlanta, GA, pp.43-50. Johnson M (1980). A philosophical perspective on the problems of metaphor. In Honeck, R., Hoffman, R. (Ed.), Cognition and Figurative Language, pp. 47-68. Lawrence Erlbaum Associates Inc., Hillsdale.
276
Bibliography
Johnson-Laird PM (1983). Mental models: Towards a cognitive science of language, inference, and consciousness. Cambridge MA: Harvard University Press. Kamada T, and Kawaii S (1989). An algorithm for drawing general undirected graphs. Information Processing Letters, 31, 7-15. Kaski S, Honkela T, Lagus K, and Kohonen T (1998). WEBSOM: Self-organizing maps of document collections. Neurocomputing, 21, 101-117. Keim DA (2001). Visual exploration of large data sets. Communications of the ACM, 44(8), 38-44. Kobayashi M, and Takeda K (2000). Information retrieval on the web. ACM Computing Survey, 32(2), 144-173. Kohonen T (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59-69. Kohonen T (1990). The self-organizing map. Proceedings of the IEEE, 78 (9), 1464-1480. Kohonen T, Kaski S, Lagus K, Salojärvi J, Honkela J, Paatero V, and Saarela A (2000). Self Organization of a Massive Document Collection. IEEE Transactions on Neural Networks, 11(3), 574-585. Kohonen T (2001). Self-Organizing Maps. Springer Series in Information Sciences. 30, Springer, Berlin, Heidelberg, New York. Koike H (1993). The role of another spatial dimension in software visualization. ACM Transactions on Information Systems, 11(3), 266-286. Komlodi A, Sears A, and Stanziola E, (2004). Information visualization evaluation review, ISRC Tech. Report, Dept. of Information Systems, UMBC. UMBCISRC-2004-1. Retrieved November 13, 2005, from the World Wide Web: http://www.research.umbc.edu/~komlodi/IV-eval Konchady M, D’Amore R, and Valley G (1998). A web based visualization for documents. Proceedings of the 1998 Workshop on New Paradigms in information visualization and manipulation, NPIV’98, November 02-07, 1998, Washington D.C. ACM Press, pp. 13-19. Koren Y, and Harel D (2003). A two-way visualization method for clustering data. Proceedings of the Ninth ACM SIGKDD international Conference on Knowledge Discovery and Data Mining’03, August 24-27, 2003, Washington, D.C. ACM Press, pp. 589-594. Korfhage RR (1988). Information retrieval in the presence of reference points, Part 1. Report LIS001/IS88001, School of Library and Information Science, University of Pittsburgh, 1988. Korfhage RR (1991). To See or Not to See—Is That the Query? Proceeding of 14th Annual ACM SIGIR Conference’91, October 13-16, 1991, Chicago. ACM Press, pp. 134-141. Korfhage RR, and Olsen KA (1994). The role of visualization in document analysis. Proceedings of Third Annual Symposium on Document analysis and Information retrieval’94, April 11-13, 1994, Las Vegas, Nevada, pp.199-207. Koshman S (2004). Comparing usability between a visualization and text-based system for information retrieval. Journal of Documentation, 60(5), 565-590.
Bibliography
277
Kruskal JK (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1-27. Kuhn W, and Blumenthal B (1996). Spatialization: spatial metaphors for user interfaces. Conference Companion on Human Factors in Computing Systems: Common Ground’96, April 13-18, 1996, Vancouver, British Columbia, Canada. ACM Press, pp. 346-347. Lagus K, and Kaski S (1999). Keyword selection method for characterizing text document maps. Proceedings of ICANN99, Ninth International Conference on Artificial Neural Networks’99, September 07-10, 1999, Edinburgh, UK, pp. 371-376. Lagus K, Honkela T, Kaski S, and Kohonen T (1999). WEBSOM for Textual Data Mining. Artificial Intelligence Review, 13(5/6), 345-364. Lagus K (2002). Text retrieval using self-organized document maps. Neural Processing Letters, 15 (1), 21-29. Lakoff G (1987). Women, Fire, and Dangerous Things: what categories reveal about the mind. Chicago, IL: University of Chicago Press. Lakoff G, and Johnson M (1980). Metaphors We Live By. Chicago: University of Chicago Press. Lawrence S, and Giles C (1998). Searching the World Wide Web. Science, 280(5360), 98-100. Leuski A, and Allan J (2000). Improving interactive retrieval by combining ranked lists and clustering. Proceedings of RIAO’ 00, April 12-14, 2000, Paris, France, pp. 665-681. Lin X (1997). Map displays for information retrieval. Journal of the American Society for Information Science, 48(1), 40-54. Lin X, Soergel D, and Marchionini G (1991). A self-organizing semantic map for information retrieval. Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval’91, October 13-16, 1991, Chicago, Illinois. ACM Press, pp. 262-269. Lyu MR, Yau E, and Sze S (2002). A multilingual, multimodal digital video library system. Proceedings of the second ACM/IEEE-CS joint conference on Digital libraries’02, July 14-18, 2002, Portland, Oregon. ACM Press, pp. 145-153. Ma KL (2004). Visualization for security. Computer Graphics, 38(4), 4-6. MacCormac ER (1989). A Cognitive Theory of Metaphor. Cambridge, MA: M.I.T. Press. Mackinlay JD, Card SK, and Robertson GG (1990). A Semantic Analysis of the Design Space of Input Devices. Human-Computer Interaction, 5(2-3), 145-190. Mackinlay JD, Robertson GG, and Card SK (1991). The perspective wall: detail and context smoothly integrated. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Reaching Through Technology’91, April 27- May, 1991, New Orleans, Louisiana. ACM Press, pp.173-176. Mackinlay JD, Rao R, and Card SK (1995). An organic user interface for searching citation links. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems’95, May 07-11, 1995, Denver, Colorado. ACM Press, pp. 67-73.
278
Bibliography
MacQueen J (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1, University of California Press, Berkeley, pp. 281-297. Mankoff J, Abowd GD, and Hudson SE (2000). OOPS: A toolkit supporting mediation techniques for resolving ambiguity in recognition-based interfaces. Computers and Graphics, 24(6), 819–834. Map of the Market (2005). Retrieved November 13, 2005, from the World Wide Web: http://www.smartmoney.com/marketmap/ Marchionini G, and Schneiderman B (1988). Finding Facts vs. Browsing Knowledge in Hypertext Systems. IEEE Computer, 21(1), 70-79, Marcus A (1994). Managing metaphors for advanced user interfaces. Proceedings of the Workshop on Advanced Visual interfaces’94, June 01-04, 1994, Bari, Italy. ACM Press, pp. 12-18. McCormick BH, DeFanti TA, and Brown MD (1987). Visualization in Scientific Computing. Computer Graphics, 21(6), 1-14. Merkl D and Rauber A (1997).Cluster connections - a visualization technique to reveal cluster boundaries in self-organizing maps. Proceedings of 9th Italian Workshop on Neural Nets (WIRN97), May 22- 24, 1997, Vietri sul Mare, Italy, pp. 324-329. Miller NE, Wong CP, Brewster M, and Foote H (1998). TOPIC ISLANDS—a wavelet-based text visualization system. Proceedings of the Conference on Visualization ‘98, October 18-23, 1998, Research Triangle Park, North Carolina. IEEE Computer Society Press, pp.189-196. Morse EL, and Lewis M (2002). Testing visual information retrieval methodologies case study: comparative analysis of textual, icon, graphical, and “spring” displays. Journal of the American Society for Information Science and Technology, 53(1), 28-40. Morse E, Sochats K, and Williams JG (1995). Visualization, Annual Review of Information Science and Technology (ARIST), 30, 161-207. Mostafa J (2004). Document search interface design: background and introduction to special topic section. Journal of the American Society for Information Science and Technology, 55(10), 869-872. Munzner T (2000). Interactive visualization of large graphs and networks. Ph.D. Dissertation, Stanford University. Munzner T (2002). Information visualization. IEEE Computer graphics and applications, 22(1), pp.20-21. Nation D (1998). WebTOC: A tool to visualize and quantify web site using a hierarchical table of contents browser. Proceedings of CHI 98 Conference Summary on Human Factors in Computing Systems’98, April 18-23, 1998, Los Angeles, California, pp.185-186. Nelson MJ (2005). The visualization of the citation patterns of some Canadian journals. Proceedings of Annual Conference of the Canadian Association for Information Science. Retrieved April 27, 2007, from the World Wide Web: http://www.cais-acsi.ca/proceedings/2005/nelson_2005.pdf
Bibliography
279
Nielsen J (2000). Designing the Web Usability. Indianapolis: New Riders Publishing. Norman DA (1988). The design of everyday things. New York: Doubleday/Currency. Nuchprayoon A, and Korfhage RR (1994). GUIDO, a Visual Tool for Retrieving Documents. Proceedings 1994 IEEE Computer Society Workshop on Visual Languages’94, October 04-07, 1994, St. Louis, MO, pp.64-71. Nuchprayoon, A. and Korfhage, R.R. (1997). GUIDO: visualizing document retrieval. Proceedings of the IEEE Information Visualization symposium’97, September 23-26, 1997, Isle Capri, Italy, pp.184-188. Nurnberger, A. and Detyniecki, M., (2002). Visualizing changes in data collections using growing self-organizing maps. Proceeding of International Joint Conference on Neural Networks, IJCNN’02, May 2002, Honolulu, Hawaii, pp. 1912-1917. Olsen KA, Korfhage RR, Sochats KM, Spring MB,.and Williams JG (1993 a). Visualization of a Document Collection: The VIBE System. Information Processing & Management, 29(1), 69-81. Olsen KA, Korfhage RR, Sochats KM, Spring MB, and Williams JG (1993 b). Visualization of a Document Collection with implicit and explicit links: The Vibe system. Scandinavian Journal of Information Systems, 5, 79-95. Olsen KA, and Korfhage RR (1994). Desktop visualization. Proceedings of IEEE/CS Symposium on Visual Languages’94, October 4-7, 1994, St. Louis, Missouri, pp. 239-244. Ortony A (1979). Beyond literal similarity. Psychological Review, 86(3), 161-180. Paivio A (1990). Mental representation: A dual coding approach. New York: Oxford University Press. Pampalk E, Goebl V, and Widmer G (2003). Visualizing changes in the structure of data for exploratory feature selection. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining’03, August 24-27, 2003, Washington D.C. ACM Press, pp. 157 – 166. Pejtersen AM (1991). Icons for presentation of domain knowledge in interfaces. Proceedings of 1st International ISKO Conference: Tools for Knowledge Organization and the Human Interface, Vol.2, pp. 175-193. Pitkow JE, and Pirolli P (1999). Mining longest repeated subsequences to predict Would Wide Web surfing. Proceedings of 2nd USENIX Symposium on Internet Technologies and Systems’99, October 11-14, 1999, Boulder, CO, pp. 139–150. Plaisant C (2004). The challenge of information visualization evaluation. Proceedings of the Working Conference on Advanced Visual Interfaces’04, May 25-28, 2004, Gallipoli, Italy. ACM Press, pp.109-1160. Preece J, Rogers Y, Sharp H, Benyon D, Holland S, and Carey T (1994). HumanComputer Interaction. Harlow, England: Addison-Wesley. Quist M, and Yona G (2004). Distributional scaling: an algorithm for structurepreserving embedding of metric and nonmetric spaces. Journal of machine learning research, 5, 399-420.
280
Bibliography
Rasmussen E (1992). Clustering algorithms. In W. B. Frakes, and R. Baeza-Yates (Ed.), Information retrieval: data structures & algorithms, pp.419-442. Englewood Cliffs, NJ.: Prentice Hall. Rauber A (1999). LabelSOM: on the labeling of self-organizing maps. Proceedings of the International joint conference on neural networks, IJCNN’99, Washington, D.C., July 10-16, 1999, pp. 3527-3532. Rauber A, and Merkl D (1999). SOMLib: a digital library system based on neural networks. Proceedings of the Fourth ACM Conference on Digital Libraries’99, August 11-14, 1999, Berkeley, California. ACM Press, pp. 240-241. Rees-Potter LK (1989). Dynamic thesaural systems: A bibliometric study of terminological and conceptual change in sociology and economics with the application to the design of dynamic thesaural systems. Information and Processing and Management, 25(6), 677-691. Robertson G, Card SK, and Mackinlay JD (1989). The cognitive coprocessor architecture for interactive user interfaces. Proceedings of the 2nd Annual ACM SIGGRAPH Symposium on User Interface Software and Technology’89, November 13-15, 1989, Williamsburg, Virginia. ACM Press, pp. 10-18. Robertson G, Czerwinski M, Larson K, Robbins DC, Thiel D, and van Dantzich M (1998). Data mountain: using spatial memory for document management. Proceedings of the 11th Annual ACM Symposium on User interface Software and Technology’98, November 01-04, 1998, San Francisco, California. ACM Press, pp. 153-162. Robertson PK (1991). A Methodology for Choosing Data Representations. IEEE Computer Graphics and Applications, 11(3), 56-68. Robertson S (2004).Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5), 503-520. Robertson SE, and Sparck Jones K (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 129-146. Romero E, and Soria B (2005). Cognitive metaphor theory revised. Journal of Literary Semantics, 34(1), 2-21. Rose DE, Mander R, Oren T, Ponceleon DB, Salomon G, and Wong YY (1993). Content Awareness in a File System Interface: Implementing the `Pile’ Metaphor for Organizing Information. Proceedings of the International Conference on Research and Development in Information Retrieval’93, June 27- July 01, 1993, Pittsburgh, PA, pp. 260-269. Rowe LA, Davis M, Messinger E, Meyer C, Spirakis C, and Tuan A (1987). A Browser for directed graphs. Software Practice & Experience, 17(1), 67-76. Rubenstein R and Hersh H (1984). The Human Factor: Designing Computer Systems for People. Burlington, MA: Digital Press. Rushmeier H, Botts M, Uselton S, Walton J, Watkins H, and Watson D (1995). Metrics and Benchmarks for Visualization. Proceedings of the 6th Conference on Visualization ‘95, October 29 - November 03, 1995, Washington, D.C. IEEE Computer Society, pp. 422-426. Salton G (1989). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading, MA: Addison-Wesley.
Bibliography
281
Salton G, Allan J, and Singhal AK (1996). Automatic text decomposition and structuring. Information Processing and Management, 32(2), 127-138. Scaife M, and Rogers Y (1996). External cognition: How do graphical representations work. International Journal of Human-Computer Studies, 45(2), 185–213. Schaffer D, Zuo Z, Greenberg S, Bartram L, Dill J, Dubs S, and Roseman M (1996). Navigating hierarchically clustered networks through fisheye and fullzoom methods. ACM Transactions on Computer-Human Interaction, 3(2), 162-188. Schneider JW (2005). Naming clusters in visualization studies: Parsing and filtering of noun phrases from citation contexts. Proceedings of ISSI 2005, 10th International Conference of the International Society for Scientometrics and Informatrics’05, July 24-28, 2005, Stockholm, Sweden. Stockholm: Karolinska University Press, pp. 406-416. Schneider JW and Borland P (2004). Identification and visualization of ‘concept symbols’ and their citation context relations: A semi-automatic bibliometric approach for thesaurus construction and maintenance. Knowledge and Change Proceedings of the 12th Nordic Conference for Information and Documentation’04, September 01-03, 2004, Aalborg, pp. 44-56. Schvaneveldt RW, Durso FT, and Dearholt DW (1989). Networking structure in proximity data. In G. Bower (Ed.), The psychology of learning and motivation, 24, pp.249-284. Academic Press. Seagull FJ, and Walker N (1992). The effects of hierarchical structure and visualization ability on computerized information retrieval. International Journal of Human-Computer Interaction, 4, 369-385. Sebrechts MM, Vasilkis J, Miller MS, Cugini JV, and Laskowski SJ (1999). Visualization of search results: A comparative evaluation of text, 2D, and 3D interface. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval’99, August 15-19, 1999, Berkeley, California. ACM Press, pp. 3-10. Shneiderman B (1996). The eyes have it: A task by data type taxonomy for information visualizations. Proceedings of IEEE Symp. Visual Languages, September 03-06, 1996, Washington D.C. IEEE CS Press, pp. 336-343. Sidiropoulos NS (1999). Mathematical programming algorithms for regressionbased nonlinear filtering in RN. IEEE Transaction on signal processing, 47(3), 771-782. Small H (1999). Visualizing science by citation mapping. Journal of the American Society for Information Science, 50(9), 799-813. Small H (1994). A SCI-Map case study: building a map of AIDS research. Scientometrics, 26, 5-20. Small H, and Garfield E (1985). The geography of science: disciplinary and national mapping. Journal of Information Science, 11(4), 147-159. Small H (1973). Co-citation in the scientific literature: A new measure of the relationship between documents. Journal of the American Society for Information Science, 24, 265-269. SpaceTree: A novel node-link tree browser (2005). Retrieved November 13, 2005, from the World Wide Web: http://www.cs.umd.edu/hcil/spacetree/index.shtml
282
Bibliography
Spark Jones K (1972). A statistical interpretation of term importance in automatic indexing. Journal of Documentation, 28(1), pp.11-21. Spence R, and Apperley M (1982). Data Base Navigation: an Office Environment for the Professional. Behaviour & Information Technology, 1(1), pp. 43-54. Spoerri A (1993 a). InfoCrystal: A Visual Tool for Information Retrieval. Proceedings of Visualization’ 93, San Jose, CA, pp.150-157. Spoerri A (1993 b). InfoCrystal: a visual tool for information retrieval and management, Proceedings of the second international Conference on information and knowledge management’93, November 01-05-1993, New York, NY, pp. 11-20. STRETCH: Visualize, Organize, and Manage Information (2005). Retrieved November 13, 2005, from the World Wide Web: http://www.elastictech.com/ index.html Swayne DF, Lang DT, Buja A, and Cook D (2003). GGobi: evolving from XGobi into an extensible framework for interactive data visualization. Computational Statistics & Data Analysis, 43(4), 423-444. Takane Y, Young FW, and De Leeuw J (1977). Nonmetric individual differences multidimensional scaling: an alternating least squares method with optimal scaling features. Psychometrika, 42(1), 7-67. Takeuchi A, and Amari S (1976). Foundation of topographic maps and columnar microstructures. Biol. Cybernetics, 35, 63-71. Tatemura J (2000). Virtual reviewers for collaborative exploration of movie reviews. Proceedings of the 5th international conference on Intelligent user interfaces’00, January 09-12, 2000, New Orleans, Louisiana, pp. 272-275. Teoh ST, Ma K, Wu SF, and Zhao X (2002). Case study: Interactive visualization for internet security. Proceedings of IEEE Visualization’02, October 27November 01, 2002, Boston,MA, pp. 505 – 508. Thompson RH, and Craft WB (1989). Support for browsing in an intelligent text retrieval system. International Journal of Man-Machine Studies, 30(6), 639-668. Torgerson MS (1952). Multidimensional scaling: I. Theory and method. Psychometrika, 17(4), 401-419. Tory M (2003). Mental Registration of 2D and 3D Visualizations (An Empirical Study). Proceedings of the 14th IEEE Visualization’03 (Vis’03), October 22-24, 2003, Washington, D.C., pp. 371-378. Tourangeau R, and Sternberg R (1982). Understanding and appreciating metaphors. Cognition, 11, 203-244. Tufte ER (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press. Turner WA, Chartron G, Caville F, and Michelet B (1988). Packaging information for peer review: new co-word analysis techniques. In A.F.J.v. Raan (Ed.), Handbook of quantitative studies of science and technology, pp. 291-323. Amsterdam: North Holland. Veerasamy A, and Belkin N (1996). Evaluation of a tool for visualization of information retrieval results. Proceedings of the 19th annual International ACM SIGIR Conference on Research and development in Information Retrieval’96, August 18-22, 1996, Zurich, Switzerland. ACM Press, pp. 85-92.
Bibliography
283
Vicente K, Hayes B, and Williges R (1987). Assaying and Isolating Individual Differences in Searching a Hierarchical File System. Human Factors, 29(3), 349-359. Viégas FB, Perry E, Howe E, and Donath J (2004). Artifacts of the Presence Era: Using Information Visualization to Create an Evocative Souvenir. Proceedings of the IEEE Symposium on information Visualization’04, October 10-12, 2004, Austin, TX, pp. 105-111. Viegas FB, Wattenberg M, and Dave K (2004). Studying cooperation and conflict between authors with history flow visualizations. Proceedings of SIGCHI’04, December 20, 2004, Vienna, Austria. ACM Press, pp. 575-582. Visual Net (2005). Antarct System, Inc. Retrieved March 20, 2005, from the World Wide Web: http://belmont.antarcti.ca/start Visual Thesaurus (2005). Retrieved November 13, 2005, from the World Wide Web: http://www.thinkmap.com/visualthesaurus.jsp Voorbij HJ (1999). Searching scientific information on the internet: A Dutch academic user survey. Journal American Society for Information Science and Technology, 50(7), 598-615. Wakita A, and Matsumoto F (2003). Information visualization with Web3D: spatial visualization of human activity area and its condition. SIGGRAPH Computer Graphics. 37(3), 29-33. Walter JA, and Ritter H (2002). On interactive visualization of high-dimensional data using the hyperbolic plane. Proceedings of the Eighth ACM SIGKDD international Conference on Knowledge Discovery and Data Mining’02, July 2326, 2002, Edmonton, Alberta, Canada. ACM Press, pp. 123-132. Waterworth J, and Chignell M (1991). A model of information exploration. Hypermedia, 3(1), 35-38. WebMap (2003). WebMap Technologies, Inc. Retrieved March 23, 2005, from the World Wide Web: http://www.webmap.com. Weherend S, and Lewis CA (1990). A problem-oriented classification of visualization techniques. Proceedings of the 1st Conference on Visualization’ 90, San Francisco, California. IEEE Computer Society Press. pp. 139-143. Weiner JE (1984). A knowledge representation approach to understanding metaphors. Computational Linguistics, 10(1), 1-15. Weippl E (2001). Visualizing content based relations in texts. Proceedings of the 2nd Australasian conference on User interface’01, January 29-February 01, 2001, Washington D.C., pp. 34-41. White HD (1998). Visualizing a discipline: an author co-citation analysis of information science, 1972-1995. Journal of the American Society for Information Science and Technology, 49(4), 327-355. White HD (2003). Author cocitation analysis and Pearson’s r. Journal of the American Society for Information Science and Technology, 54(13), 1250-1259. Wiesman F, van den Herik HJ, and Hasman A (2004). Information retrieval by metabrowsing. Journal of the American Society for Information Science and Technology, 55(7), 565 – 578. Willshaw DJ, and von der Malsburg C (1976). How patterned neural connections can be set up by self-organization. Poc.Roy.Soc., 194(1117), 431-445.
284
Bibliography
Winckler MA, Palanque P, and Freitas CM (2004). Task and scenario-based evaluation of information visualization techniques. Proceedings of the 3rd Annual Conference on Task Models and Diagrams’04, November 15-16, 2004, New York, NY. ACM Press, pp. 165-172. Wise JA (1999). The ecological approach to text visualization. Journal of the American Society for Information Science, 50(13), 1224-1233. Wishart D (2001). Clustan professional user guide. Edinburgh, Scotland: Cluster Ltd. Wiss U, Carr D, and Jonsoon H (1998). Evaluating three-dimensional information visualization designs: A case study of three designs. Proceedings of the International Conference on Information Visualization’98, July 29-31, 1998, London, England, pp. 137-144. Wright W (1999). Information animation applications in the capital markets. In S. K. Card, J. D. Mackinlay, and B. Shneiderman (Ed.), Readings in information Visualization: Using Vision to Think, pp. 83-91. Morgan Kaufmann Publishers, San Francisco, CA. Yang CC, Chen H, and Hong K (2003). Visualization of large category map for internet browsing. Decision Support Systems, 35(1), 89-102. Yates FA (1966). The Art of Memory. Chicago: University of Chicago Press. York J, Bohn S, Pennock K, and Lantrip D (1995). Clustering and dimensionality reduction in SPIRE. Proceedings of the Symposium on Advanced Intelligence Processing and Analysis, Mar. 28- 30, 1995, Tysons Corner, VA. Washington D.C.: Office of Research and Development, pp. 73. Young D, and Shneiderman B (1993). A graphical filter/flow model for Boolean queries: An implementation and experiment. Journal of the American Society for Information Science, 44(4), 327-339. Young G, and Householder AS (1941). A note on multidimensional psychophysical analysis. Psychometrika, 6(5), 331-333. Zeki S (1992). The visual image in mind and brain. Scientific American, 267(3), 69-76. Zhang H, and Salvendy G (2001). The implications of visualization ability and structure preview design for web information search tasks. International Journal of Human-Computer Interaction, 13(1), 75-95. Zhang J (2000). A Visual Information Retrieval Tool. Proceedings of the 63 Annual Meeting of the American Society for Information Science, November 1216, 2000, Chicago, IL. Medford, NJ. Information today, Inc., pp. 248-257. Zhang J (2001 a). TOFIR: A Tool of Facilitating Information Retrieval--Introduce a visual retrieval model. Information Processing & Management, 37(4), 639-657. Zhang J (2001 b). The characteristic analysis of the DARE visual space. Information Retrieval, 4(1), 61-78. Zhang J, and Korfhage RR (1999). DARE: Distance and Angle Retrieval Environment: A Tale of the Two Measures. Journal of the American Society for Information Science, 50(9), 779-787. Zhang J, and Nguyen T (2005 a). A new term significance weighting approach. Journal of Intelligent Information Systems, 24(1), 61-85.
Bibliography
285
Zhang J, and Nguyen TN (2005 b). WebStar: A visualization model for hyperlink structures. Information Processing and Management, 41(4), 1003-1018. Zhang J, and Rasmussen E (2001). Developing a new similarity measure from two different perspectives. Information Processing & Management, 37(2), 279-294. Zhang J, and Rasmussen E (2002). An experimental study on the iso-contentbased angle similarity measure, Information Processing & Management, 38(3), 325-342. Zhang J, and Wolfram D (2001). Visualization of term discrimination analysis. Journal of the American Society for Information Science and Technology, 52(8), 615-627. Zhao Y, and Karypis G (2002). Evaluation of hierarchical clustering algorithms for document databases. Proceedings of the eleventh international conference on Information and knowledge management’02, November 04-09, 2002, McLean, Virginia, pp. 515-524. Zhu J, Hong J, and Hughes JG (2004). PageCluster: Mining conceptual link hierarchies from web log files for adaptive web site navigation. ACM Transaction on Internet Technology, 4(2), 185-208.
Index
A
abstract information, 8
abstract property, 224
adaptive learning, 109
agglomerative clustering algorithm, 43, 44
aggregate information, 7, 8, 14, 240, 255, 267
aggregation algorithm, 173
alignment process, 216
ambiguity, 24, 47, 66, 173, 191, 251, 256, 260
angle-angle-based visualization model, 88
  conjunction evaluation model, 94
  cosine evaluation model, 89
  disjunction evaluation model, 94
  distance evaluation model, 90
  ellipse evaluation model, 95
angle-distance integrated similarity measure, 33
arbitrariness, 202, 206
artificial neural network, 108
associativity, 22
attention point, 10, 14
attribute extraction, 17
autoassociator, 111
automatic indexing, 24–27
auxiliary view point AVP, 75, 195, 200

B
bibliographic coupling, 137
blending, 216
BO paradigm, 17, 259, 268
Boolean retrieval system, 36, 49, 230, 248
BQ paradigm, 16, 259, 267
breadth first search algorithm, 185
browsing, 5–7, 10, 13, 169, 223, 228, 245, 255
brushing, 13, 261, 263

C
Cartesian coordinate, 52, 67
Cartesian space, 74
Cassini oval evaluation model, 39, 78
categorization, 112, 168, 220
categorizing, 180, 243, 250
centroid, 42, 103
centroid vector, 122
City block measure, 31
clarity, 6
cluster membership function, 42
cluster merging function, 44
clustering, 8, 111, 180, 194, 243, 250
clustering algorithm, 40
clustering analysis, 112
co-citation, 137, 140
cognitive ability, 12
cognitive advantage, 2
cognitive capacity, 13
cognitive device, 218
cognitive dissonance, 233
cognitive ground, 222, 256
cognitive load, 11, 12, 13, 167, 171, 262
cognitive metaphor theory, 218
cognitive model, 219
cognitive principle, 12
cognitive process, 12, 215
cognitive role, 219
cognitive shortcut, 219
color property, 14
commutativity, 22
competition, 111
complexity, 166, 224, 239, 256, 257, 267
conjunction evaluation model, 10, 15, 19, 24, 36, 78, 248, 259
continuity, 5, 234
controllability, 18, 244, 245, 251
convergence, 111, 116, 119
convex polygon, 59
coordination system, 17, 257
cosine evaluation model, 10, 24, 34, 76, 248, 259
cosine similarity measure, 29, 138
crawler, 168
criterion icon, 49
customization, 250

D
data mining, 8, 111, 112, 241, 255
dendrogram, 43
density estimator, 111
design model, 224
diaphoric property, 216
dimensionality, 17, 22, 74, 112, 194, 223, 246, 249, 264
dimensionality reduction, 19, 75, 112, 192, 193, 223
disjunction evaluation model, 10, 15, 19, 24, 38, 78, 248, 259
distance evaluation model, 15, 24, 77, 248, 259
Distance to Reference Axis DTRA, 195
distance-angle-based visualization model, 79
  conjunction evaluation model, 84
  cosine evaluation model, 81
  disjunction evaluation model, 84
  distance evaluation model, 82
  ellipse evaluation model, 86
distance-based evaluation model, 35
distance-distance-based visualization model, 97
  Cassini oval evaluation model, 101
  conjunction evaluation model, 99
  cosine evaluation model, 101
  disjunction evaluation model, 99
  distance evaluation model, 99
  ellipse evaluation model, 100
distributivity, 22
divisive algorithm, 43
document projection position, 54
document reference point vector, 53
document-term matrix, 22
Dominance distance measure, 31
DRP, 75

E
ellipse evaluation model, 10, 15, 24, 36, 77, 248, 259
empirical evaluation, 242
epiphoric property, 216
equivalent index, 140
Euclidean distance, 31, 73, 74, 114, 129, 179, 211
Euclidean plane, 128
Euclidean space, 73, 127, 172
evaluation, 19, 232, 257, 267
evaluation criteria, 241, 242
evaluation of a search result, 45
experiment, 241
external cognition, 12

F
facilitator, 12
facility, 12, 218
familiarity, 225, 248
feature map, 111, 112
feedback, 6, 49, 110, 232, 249
feed-forward, 110
filtering, 241, 252, 262
fisheye, 229, 261
flexibility, 18, 47, 171, 194, 203, 245, 264
flow analysis, 184
focus, 67, 226, 229
focus + context, 263
full-text, 25, 64, 112, 139, 262

G
Gaussian neighborhood function, 116, 118
geographic feature, 226
geographic property, 228
global overview, 210, 211, 259
goodness-of-fit, 211
granularity, 6, 22, 168
graph drawing, 136, 231
graphic entity, 12
grid, 186
guidance, 11, 168, 232, 246

H
Hamming distance measure, 31
heterogeneity, 266
heuristic, 5, 7, 14
hidden layer, 109
hierarchical clustering algorithm, 40, 43
  average linkage clustering, 44
  single linkage clustering, 44
holistic, 7, 11, 14, 107, 218, 234
hyperbolic, 172, 261
hyperlink, 5, 9, 72, 140, 166, 179, 244, 256
hyperlink hierarchy
  inverted hyperlink tree method, 176
  linkage-similarity-based method, 179
  subject-directory-assistance method, 177
  user-interference method, 176
  user-usage-based method, 177
  webpage-content-based method, 178

I
impact neighborhood, 115
inference, 216
Information at the macro-level, 7–8
Information at the micro-level, 7–8
information analysis, 15
information retrieval, 2, 4
information retrieval (evaluation) model, 34
information retrieval means, 15
information retrieval visualization, 13, 240
information retrieval visualization environment, 3, 223
information seeking behavior, 6, 167, 180, 223
information space, 8, 9, 14
information visualization, 3
inner product similarity measure, 28
input layer, 109
interactive activity, 13
interactive behavior, 194
interactive browsing, 15
interactive control, 245
interactive search, 46
interactivity, 7, 243, 245
interface evaluation, 240
interior result icon, 49
intermediate reference point, 56, 61
internal mental presentation, 12
intuitiveness, 250
Inverse Document Frequency IDF, 25
inversion transformation, 31
iso-extent contour, 87
iteration, 6

J
Jaccard similarity measure, 29, 138

K
key view point KVP, 75, 195, 201
K-means clustering algorithm, 42
knowledge acquisition, 15
knowledge discovery, 15, 255

L
labeling, 119, 120, 210, 260
learnability, 225
learning curve, 225
learning rate function, 117, 209
learning vector quantization, 111
linguistic, 12, 191, 215
local view, 18, 210, 211, 246, 251, 259
longest repeating subsequence, 184
loss function, 211
lost in cyberspace, 10, 165, 167, 170

M
Manhattan distance measure, 31
manipulability, 172
matching, 216, 217, 218, 231
mental model, 220
  design model, 221
  functional model, 221
  structural model, 221
  system model, 221
  user mental model, 221
metaphor, 172, 185, 215, 256, 265
metaphorical configuration, 228
metaphorical embodiment, 218, 219, 222, 223, 224, 232, 233, 234, 235, 237
metaphorical presentation, 18
minimal spanning tree, 121
Minkowski metric, 30, 129, 209, 258
mismatch, 217, 233
model for automatic reference point rotation, 66
model for fixed multiple reference points, 49
model for movable multiple reference points, 52
multi-facet, 4, 19, 112, 210
multiformity, 239, 267
mutual inclusion, 140

N
neighborhood function, 117, 209
network security, 186, 267
neural network, 108–11
neuron, 209
non-Euclidean space, 172
non-hierarchical clustering algorithm, 40, 42
non-metric multidimensional scaling, 211
non-traditional evaluation model, 15

O
object overlapping circle, 196
orthogonal coordination system, 17
orthogonality, 24
output grid, 111, 112
output layer, 109
overlap co-efficient measure, 29
overview + detail, 263

P
parallel computational method, 109
parallel coordinate, 18, 187, 260
partition co-efficient, 55
partitioning clustering algorithm, 40
path length, 130
path-length-i complete minimum weight matrix, 131
path-length-i matrix, 130
path-length-i minimum weight matrix, 130
path-length-q complete minimum weight matrix, 207
pattern, 3, 7, 14, 109, 170, 184, 210, 267
Pearson Product Moment Correlation Coefficient, 33
Pearson r, 33, 138
perception ability, 12, 13
perceptual inference, 12
perceptual system, 12
platform, 13, 227, 243, 256
point of interest, 48
polar coordinate, 67
portability, 171
precision, 45, 240, 253, 267
pre-conceptual structure, 219
predictability, 184, 225
priority-based traversal, 174
probability-based term weighting algorithm, 26
projected contour, 19
projection algorithm, 18, 19, 192, 194, 256, 258, 259
projection angle, 67, 80, 88, 194
projection contour function, 19
projection distance, 67, 80, 97, 194
property map, 112
proportional similarity, 29

Q
QB paradigm, 16, 17, 259, 267
q-triangular, 130, 209, 258
quad-tree, 186
quantization error, 120
query reformulation, 49, 248
query searching, 5–7, 19, 141, 168, 169, 223, 245, 247, 255

R
recall, 45, 240, 253, 267
recognition of pattern, 12
reference axis RA, 75
reference point, 34, 48, 192, 224
reference point monopoly triangle, 61
reference system, 18, 210, 257, 258, 260
referent, 216
regression, 117
relativity, 202
relevance feedback, 24
relevance judgment, 5, 229, 240, 257
relevance sphere boundary, 66
result grid, 112
retrieval contour, 34, 35, 36, 38, 71, 75, 83, 259, 268
retrieval result evaluation, 240, 253
rhythm, 3, 7

S
salient imbalance theory, 233
Salton term weighting method, 26
scalar kernel function, 117
scientific visualization, 3
screen real estate, 172, 180, 185, 262
search strategy, 10, 49, 224
selection, 252
self-organization, 109
self-organizing map, 111, 112, 172, 209
semantic framework, 2, 17, 18, 226, 250, 256, 258, 267
semantic zoom, 263
similarity measure, 27–34
simplicity, 202
source domain, 216, 231
source item, 216, 231
space density, 15, 266
spatial ability, 10
spatial characteristic, 8, 14
spatial pattern, 10
spatial property, 10, 14, 224
spatial structure, 12
spatial visualization ability, 171
spatial zoom, 263
spatiality, 15, 257
spatialization, 14, 15
spring theory, 136, 225, 231
stability, 111
stop word list, 139
stop word method, 25
subject directory, 9, 168, 169, 172, 177, 244
subject stop list, 25
supervised learning, 110
Supremum distance measure, 31
synthesis, 260
system model, 224

T
Tanimoto similarity measure, 29, 138
target domain, 216, 231
target item, 216, 231
task, 241, 243
tenor, 215
term co-occurrence analysis, 139
term discrimination analysis, 71, 265
term seeding, 139
term semantic network, 123
term weighting, 24–27
TF×IDF, 26, 65
theory of cognitive facilities, 12
tier, 49
tolerance of imprecise, 109
topographic map, 173
topological map, 112
traffic analysis, 170, 183
traversal, 181
traversal algorithm, 174
  breadth first search method, 174
  depth first search method, 174, 175
trend, 3, 7, 14, 184, 267
triangle inequality principle, 127

U
unsupervised learning, 110
usage pattern subtraction method, 185
user model, 225
user profile, 48
user-interference Euclidean method, 176

V
vector information retrieval system, 51
vector quantization, 111
vector space model, 22–24
vehicle, 216
visual configuration, 2, 224
visual cortex, 11
visual information retrieval, 13
visual presentation, 2, 13, 225
visual space, 2, 245
visualization, 3
von Neumann structure, 109

W
weight vector, 112, 209
winning neuron, 114
winning node, 111, 114, 209

Z
zoom, 241, 251, 263
The Information Retrieval Series

Gerald Kowalski. Information Retrieval Systems: Theory and Implementation. ISBN 0-7923-9926-9
Gregory Grefenstette. Cross-Language Information Retrieval. ISBN 0-7923-8122-X
Robert M. Losee. Text Retrieval and Filtering: Analytic Models of Performance. ISBN 0-7923-8177-7
Fabio Crestani, Mounia Lalmas, and Cornelis Joost van Rijsbergen. Information Retrieval: Uncertainty and Logics: Advanced Models for the Representation and Retrieval of Information. ISBN 0-7923-8302-8
Ross Wilkinson, Timothy Arnold-Moore, Michael Fuller, Ron Sacks-Davis, James Thom, and Justin Zobel. Document Computing: Technologies for Managing Electronic Document Collections. ISBN 0-7923-8357-5
Marie-Francine Moens. Automatic Indexing and Abstracting of Document Texts. ISBN 0-7923-7793-1
W. Bruce Croft. Advances in Information Retrieval: Recent Research from the Center for Intelligent Information Retrieval. ISBN 0-7923-7812-1
Gerald J. Kowalski and Mark T. Maybury. Information Storage and Retrieval Systems: Theory and Implementation, Second Edition. ISBN 0-7923-7924-1
Jian Kang Wu, Mohan S. Kankanhalli, Joo-Hwee Lim, and Dezhong Hong. Perspectives on Content-Based Multimedia Systems. ISBN 0-7923-7944-6
George Chang, Marcus J. Healey, James A.M. McHugh, and Jason T.L. Wang. Mining the World Wide Web: An Information Search Approach. ISBN 0-7923-7349-9
James Z. Wang. Integrated Region-Based Image Retrieval. ISBN 0-7923-7350-2
James Allan. Topic Detection and Tracking: Event-based Information Organization. ISBN 0-7923-7664-1
W. Bruce Croft and John Lafferty. Language Modeling for Information Retrieval. ISBN 1-4020-1216-0
Yixin Chen, Jia Li, and James Z. Wang. Machine Learning and Statistical Modeling Approaches to Image Retrieval. ISBN 1-4020-8034-4
David A. Grossman and Ophir Frieder. Information Retrieval: Algorithms and Heuristics. Second edition; ISBN 1-4020-3003-7; PB: ISBN 1-4020-3004-5
John I. Tait. Charting a New Course: Natural Language Processing and Information Retrieval. ISBN 1-4020-3343-5
Udo Kruschwitz. Intelligent Document Retrieval: Exploiting Markup Structure. ISBN 1-4020-3767-8
Peter Ingwersen and Kalervo Järvelin. The Turn: Integration of Information Seeking and Retrieval in Context. ISBN 1-4020-3850-X
Amanda Spink and Charles Cole. New Directions in Cognitive Information Retrieval. ISBN 1-4020-4013-X
James G. Shanahan, Yan Qu, and Janyce Wiebe. Computing Attitude and Affect in Text: Theory and Applications. ISBN 1-4020-4026-1
Marie-Francine Moens. Information Extraction: Algorithms and Prospects in a Retrieval Context. ISBN 1-4020-4987-0
Maristella Agosti. Information Access through Search Engines and Digital Libraries. ISBN 978-3-540-75133-5
Jin Zhang. Visualization for Information Retrieval. ISBN 978-3-540-75147-2