Topics in Cognitive Science 3 (2011) 1–2 Copyright © 2011 Cognitive Science Society, Inc. All rights reserved. ISSN: 1756-8757 print / 1756-8765 online DOI: 10.1111/j.1756-8765.2010.01129.x
Introduction to Volume 3, Issue 1 of topiCS

Wayne D. Gray, Executive Editor
With this issue, topiCS begins its third year of high-quality papers with on-time print and electronic publication of each issue. In recognition of our fast entry into the ranks of journal excellence, the ISI Web of Science has announced that topiCS will be included in the Social Science Citation Index as well as the Current Contents: Social and Behavioral Sciences Index. We will receive our first Impact Factor in the 2011 edition of the Journal Citation Reports, which will be released in Summer 2012. This is great news for our past, current, and future authors as well as our topiCS Editors. It means that papers in our first two volumes are already highly cited. Of course, we knew we were doing well; however, it is very gratifying to have our collected labors acknowledged by the folks at Thomson Reuters.

This issue of Topics in Cognitive Science introduces a new topic, continues an annual tradition, and includes a commentary and response on a paper published in our January 2009 issue (Volume 1, Issue 1). Computational Methods to Extract Meaning from Text and Advance Theories of Human Cognition has been shepherded by Danielle McNamara (University of Memphis). Counting her excellent introduction, there will be nine papers on this topic, five of which appear in this issue, with the rest appearing in 3:2.

We also continue our tradition of publishing the Best Of papers that have appeared in recent cognitive science conferences. All Best Ofs have been triply reviewed: once to get into a conference, once as a prize winner for that conference, and once for topiCS. This year's crop of four Best Ofs includes two each from the 2010 Cognitive Science Conference and the 2010 International Conference on Cognitive Modeling. Although there were other Best Ofs that met our criteria, those authors declined our invitation because they were already in the process of including their conference work as part of a longer journal publication.

Our final topic is a Commentary by Benjamin Hilbig (University of Mannheim) and Tobias Richter (University of Kassel) and a response by Henry Brighton and Gerd Gigerenzer (Max Planck Institute for Human Development) on the paper "Homo heuristicus: Why Biased Minds Make Better Inferences" (Gigerenzer & Brighton, 2009). We believe this commentary and response serve to deepen the discussion begun in the original paper, and we are sure that the cognitive science community will enjoy this exchange.

In August, we delivered our first annual report on topiCS to the Cognitive Science Society's Governing Board. For that report, we assembled a variety of statistics focused on
estimating how well we did in our first year (Volume 1, 2009) in relation to our big sister journal, Cognitive Science. As we could not compare citation counts, our report highlighted the number of downloads of 2009 papers in the 16-month period from January 2009 to April 2010. This is a partial count because it is limited to papers downloaded from the Wiley-Blackwell journals website. In that period, we had 11,205 downloads of 2009 topiCS papers compared to 9,826 downloads of 2009 Cognitive Science papers. Obviously, with their 30+ years of publication, Cognitive Science had vastly more total downloads than topiCS, but we are very pleased that our first year attracted so much positive attention from the cognitive science community. Note that at that time (April 2010), our number-one download was the paper by Gigerenzer and Brighton (2009) that is the subject of the commentary that appears in this issue.

As always, topiCS encourages commentaries and new topics. Send your commentaries directly to me at
[email protected] along with a short note. If you are proposing a topic, please open communications with a short first note (about 300–650 words) and be sure to consult the topiCS FAQ page, http://csjarchive.cogsci.rpi.edu/topiCS/FAQs.html, for "Preparing a Proposal for topiCS."
Reference

Gigerenzer, G., & Brighton, H. (2009). Homo heuristicus: Why biased minds make better inferences. Topics in Cognitive Science, 1(1), 107–143.
Topics in Cognitive Science 3 (2011) 3–17 Copyright © 2010 Cognitive Science Society, Inc. All rights reserved. ISSN: 1756-8757 print / 1756-8765 online DOI: 10.1111/j.1756-8765.2010.01117.x
Computational Methods to Extract Meaning From Text and Advance Theories of Human Cognition

Danielle S. McNamara
Department of Psychology, University of Memphis
Abstract

Over the past two decades, researchers have made great advances in the area of computational methods for extracting meaning from text. This research has to a large extent been spurred by the development of latent semantic analysis (LSA), a method for extracting and representing the meaning of words using statistical computations applied to large corpora of text. Since the advent of LSA, researchers have developed and tested alternative statistical methods designed to detect and analyze meaning in text corpora. This research exemplifies how statistical models of semantics play an important role in our understanding of cognition and contribute to the field of cognitive science. Importantly, these models afford large-scale representations of human knowledge and allow researchers to explore various questions regarding knowledge, discourse processing, text comprehension, and language. This topic includes the latest progress by the leading researchers in the endeavor to go beyond LSA.

Keywords: Semantic models; Computational techniques; Meaning extraction; Cognition; Memory; Embodiment; Latent representations; LSA
1. Introduction

One method of scientifically investigating human cognition is to examine what we write. This endeavor is facilitated by current technologies that allow scientists to conduct discourse analyses on extremely large scales. Statistical models are applied to large linguistic corpora (i.e., collections of written text) to extract (what we think is) the meaning of the text and, moreover, to enhance our understanding of how the human mind works. A fundamental assumption among this collection of articles is that human cognition manifests itself in our writings.

Correspondence should be sent to Danielle S. McNamara, Department of Psychology / Institute for Intelligent Systems, University of Memphis, 202 Psychology Building, Memphis, TN. E-mail: dsmcnamara1@gmail.com
Large text corpora combined with computational techniques for analyzing these corpora allow scientists to extract meaning from text and, by consequence, to explore various aspects of the human mind and culture that manifest in text. To a large extent, this volume of topiCS is dedicated to showing the value of using corpus analytical techniques to understand cognition.

A well-known statistical method of extracting meaning from text is latent semantic analysis (LSA). In Landauer, McNamara, Dennis, and Kintsch (2007), we provided basic information about LSA as well as a number of examples of its use in furthering our understanding of cognition and in contributing to applied technologies. LSA was a groundbreaking method in which word meanings are extracted by determining the company the word keeps across large corpora of texts. Collections of word meanings then comprise sentence, paragraph, and document meanings. More specifically, given a large corpus of text with millions of words and thousands of documents, a matrix is created that indicates the context in which each word occurs. The context of a word is the document that it occurs in, which may be the sentence, paragraph, or entire text. Each word or term is defined in terms of a vector (i.e., a one-dimensional array of values) over the documents in which it occurs. This is a sparse matrix because most terms occur in few documents, and it is a large matrix because there are many terms across many documents. Thus, the matrix is reduced to discover its latent properties. In LSA, it is reduced by applying singular value decomposition (SVD) and then truncating the number of dimensions to include hundreds rather than thousands of dimensions. This process creates a multidimensional LSA space in which each term is located. Similarity metrics, such as the cosine between words or collections of words, are indicative of how related they are: A higher cosine indicates that the words are more related. By uncovering these latent relations between words, the meanings of the words, sentences, and texts can be derived.

Latent semantic analysis has been extremely successful in helping researchers understand a wide range of theoretical issues concerning meaning and also in scaling up theoretical insights to applied tasks (see e.g., Landauer et al., 2007; Louwerse, 2010, this volume). However, LSA alone is not the focus of this issue of topiCS. Here, we focus on going beyond LSA. In this topic, the featured articles describe new methods of using LSA as well as other mathematical and computational approaches that are comparable to, and sometimes exceed, the capabilities of LSA. Our objective is not to reinforce the multiple uses of LSA, but rather to show how current research goes beyond LSA.
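As a concrete point of reference for readers unfamiliar with this pipeline, the following minimal Python sketch builds a weighted term-by-document matrix, truncates its SVD, and compares two term vectors by cosine. The toy corpus, the two retained dimensions, and the use of scikit-learn's TF-IDF weighting as a stand-in for LSA's usual log-entropy weighting are all illustrative choices, not part of any model described in this issue.

# Minimal LSA-style sketch: weighted term-document matrix -> truncated SVD
# -> cosine similarity between term vectors. Corpus and parameters are toys.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "the doctor examined the patient in the hospital",
    "the nurse helped the doctor treat the patient",
    "the river flooded the valley after heavy rain",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)               # documents x terms
weighted = TfidfTransformer(sublinear_tf=True).fit_transform(counts)

# Keep only a few latent dimensions (hundreds in practice; 2 for this toy).
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(weighted)                   # documents in the LSA space
term_vectors = svd.components_.T                            # terms in the LSA space

terms = list(vectorizer.get_feature_names_out())
i, j = terms.index("doctor"), terms.index("nurse")
print(cosine_similarity(term_vectors[i:i+1], term_vectors[j:j+1])[0, 0])

With a realistic corpus, the same three steps scale up to thousands of documents and a few hundred retained dimensions.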
2. Statistical semantic models

Statistical models of semantics have been a focus of research for over half a century (e.g., Osgood, Suci, & Tannenbaum, 1957; Smith, Shoben, & Rips, 1974). However, the availability of technologies that could handle large corpora and high-dimensional spaces (e.g., Salton, Wong, & Yang, 1975) helped to change the field of semantics. These technological advances helped to spur LSA, which has been largely responsible for the growth in interest
in and use of statistical models of semantics over the last two decades. LSA was immensely successful both as a theoretical model (e.g., Landauer, 2007; Landauer & Dumais, 1997) and as a practical tool (e.g., Landauer et al., 2007). Although LSA has been successful and effective, a good deal of research has been devoted to improving it and to developing and improving semantic models in general. The majority of the articles in this volume are concerned with this objective. Over the past few decades, numerous semantic models have been developed, and thus there is a wide range of models available that are capable of (more or less) successfully extracting meaning from text. Riordan and Jones (2010, this volume) describe nine statistical models of semantics, which they refer to as distributional models:

• LSA (Landauer & Dumais, 1997)
• Probabilistic Topic Model (Steyvers, Chemudugunta, & Smyth, 2010, this volume; Steyvers & Griffiths, 2007)
• Correlated Occurrence Analog to Lexical Semantics (COALS; Rohde, Gonnerman, & Plaut, 2005)
• Contextual Similarity Log-Likelihood (CS-LL; Dunning, 1993)
• Contextual Similarity Log-Odds (CS-LO; Lowe, 2001)
• Hyperspace Analog to Language (HAL; Lund & Burgess, 1996)
• High Dimensional Explorer (HIDEx; Shaoul & Westbury, 2006)
• Positive Pointwise Mutual Information (PosPMI; Church & Hanks, 1990)
• Bound Encoding of the Aggregate Language Environment (BEAGLE; Jones & Mewhort, 2007)

Stone, Dennis, and Kwantes (2010, this volume) describe three additional models:

• Vectorspace Model (Salton et al., 1975)
• Sparse Nonnegative Matrix Factorization (SpNMF; Xu, Liu, & Gong, 2003)
• Constructed Semantics Model (CSM; Kwantes, 2005)

In addition, Howard, Shankar, and Jagadisan (2010, this volume) describe their transformation of an episodic memory model, the temporal context model (TCM), into a new approach to semantic modeling:

• Predictive Temporal Context Model (pTCM)

One means of subdividing semantic models is into two categories: context word and context region (see e.g., Riordan & Jones, 2010, this volume). These two approaches differ in terms of the matrix that comprises the initial representation. Context word models use a word-by-word (term-by-term) matrix that is defined using a moving window approach. Essentially, the document size is reduced and defined only by the words surrounding the target word. Words' co-occurrences within a defined window (e.g., two or more consecutive words) comprise the matrix. These models differ in the size of the window and in whether the window includes both prior and subsequent words. In addition, the words' co-occurrences are differentially weighted (e.g., by multiplying the co-occurrence
matrix by another matrix that includes weights). They may be weighted by distance from the target words (HAL, COALS), by frequency within the corpus (HIDEx), or by their conditionalized co-occurrences (COALS, CS-LL, CS-LO, PosPMI). BEAGLE is also a context word model, but it incorporates convolution (e.g., Murdock, 1992) as a means to compress n-gram information, which captures both syntactic (word order) and semantic (contextual) information about each word.

Context region models, including the Vectorspace model, LSA, Topic, SpNMF, CSM, and pTCM, use a word-by-context (or term-by-document) matrix. In models such as the Vectorspace model, LSA, and SpNMF, the context of a word is the document in which it occurs; as described earlier, the document may be the sentence, paragraph, or entire text. In the Topic model (see Steyvers et al., 2010, this volume), the context is defined by latent topics identified in the corpus. Each document is represented by a probability distribution over multiple topics, and each topic is represented by a probability distribution over words. LSA and SpNMF are much like the Vectorspace model in that they weight the vector representation of each term by the log of the term's frequency within the document and by the inverse document frequency (IDF)—these two weightings ensure that words that are important to a document are heavily weighted, whereas the impact is reduced for words that are frequent across documents (and thus carry less discriminatory information). LSA departed from the Vectorspace model by its use of SVD to reduce the matrices to their latent dimensions. SpNMF differs from other models in that it is constrained to contain only nonnegative values.

The CSM model (Kwantes, 2005) differs from other context region models by using retrieval mechanisms in lieu of dimension reduction techniques. The retrieval mechanisms in CSM are based on the MINERVA 2 model of episodic memory (Hintzman, 1984). Accordingly, memories for events are feature-based representations stored as individual traces, or episodes. Each event or item is stored as a vector of values associated with features, which in turn have a probability value or learning rate. Retrieval depends on similarity-based matching rules and prior probabilities. Semantic similarity in CSM is a function of resonance between a probe vector and the context vectors. The context vectors with the stronger resonance to the probe contribute more strongly to the ultimate outcome. Unlike other distributional models, CSM is an incremental learning model: It updates the representation as the sequence of words is presented.

The pTCM introduced by Howard et al. (2010, this volume) similarly updates the representation incrementally. pTCM is an extension of TCM, a model of episodic memory (Howard & Kahana, 2002). TCM is a distributed memory model that specifies the effects of temporal context on episodic memory (usually memory for words embedded in lists). Rather than using co-occurrences in memory to explain memory phenomena such as primacy and recency, TCM provides an account that is based on mechanisms related to contextual drift and contextual retrieval. Thus, TCM has similarities to models such as LSA due to the overlapping assumptions regarding the effects of context on memory and learning.
pTCM models semantic memory by using the current state of context (i.e., a weighted sum of the semantic representations of recently presented items) as a cue to generate predictions about what word will be presented next. The semantic representation
of a word is formed by averaging all of the predictions that precede its presentation. The model forms these representations incrementally over the text rather than operating on the entire text representation at once, as most semantic memory models do (and thus the time to run the model is somewhat cumbersome). Howard et al. show that the pTCM performs comparably to LSA in the task of identifying semantically related words. Although a disadvantage stems from the time to run the model, its clear advantage is that it is based on a model of episodic memory; thus, this research shows that similar architectures may be used for both episodic memory and semantic representations in memory (see also the BEAGLE model; Jones & Mewhort, 2007).

The deep generative model described by Hinton and Salakhutdinov (2010, this volume) is also iteratively trained and inspired by cognitive models, but it is quite different from other semantic models described in this volume. Hinton and Salakhutdinov describe and test a nonlinear, multilayer model that yields binary codes that can serve to describe documents' meaning. The lower layers of the model are hidden layers with distributed representations that are iteratively trained using a Restricted Boltzmann Machine (Hinton, Osindero, & Teh, 2006). After the features of one layer are learned, those vectors are used as data to train the next hidden layer. This process of iterative learning yields an encoder network that converts the word-count vectors to compact codes. Then the process is reversed such that the Restricted Boltzmann Machine acts as a decoder network that reconstructs the word-count vectors from the compact code vectors. These are combined to produce a multilayer autoencoder network. Backpropagation is used to iteratively improve the solution. This learning process is relatively slow. By contrast, after learning, retrieving solutions is rapid, much more rapid than with techniques such as LSA. This difference would be particularly noticeable with large data sets. In addition, new data can be added to the network iteratively, so the system can continue to learn. Thus, this type of approach would be particularly suitable for handling extremely large corpora such as Wikipedia. The initial training would be very slow, but thereafter, new data could be added, and retrieval based on meaning would be extremely fast.
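The retrieval step behind that speed claim is easy to illustrate. The sketch below is not the authors' model; it skips the Restricted Boltzmann Machine training entirely and uses random stand-in codes simply to show why matching short binary codes (here by Hamming distance) is so much cheaper than comparing documents against high-dimensional real-valued vectors.

# Sketch of retrieval with compact binary document codes. The codes are
# random placeholders; in the actual approach they would be produced by a
# trained autoencoder.
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_bits = 100_000, 32
codes = rng.integers(0, 2, size=(n_docs, n_bits), dtype=np.uint8)

def nearest_by_hamming(query_code, codes, k=10):
    # Hamming distance: number of bit positions in which two codes differ.
    distances = np.count_nonzero(codes != query_code, axis=1)
    return np.argsort(distances)[:k]

query = codes[42]                      # pretend this is the query document
print(nearest_by_hamming(query, codes))

In practice the bits would be packed into machine words and compared with XOR and a population count, which is what makes lookups over millions of documents fast once training is done.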
3. Comparing semantic models

In the quest to improve computational models of semantics, one current research question regards which models more effectively capture meaning and cognition. This topic includes two studies that compared models in terms of their ability to account for cognitive phenomena. Stone et al. (2010, this volume) compared six models' ability to judge similarity between paragraphs (i.e., LSA, SpNMF, Topic, Topic-JS, Vectorspace, and CSM). They found that all of the models fared relatively dismally in comparison to a word overlap model that used the cosine between term frequency vectors. That is, the model that used the nonlatent, overt representation of the text fared best in simulating humans' judgments of paragraph similarity. The latent models, however, showed improved performance when smaller, topic-focused corpora were used to train them (see also Shapiro & McNamara, 2000).
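That word overlap baseline is worth spelling out, because it uses no latent dimensions at all: each paragraph is reduced to raw term frequencies and the two vectors are compared by cosine. The sketch below is a generic illustration of that idea (the tokenizer and the example paragraphs are invented here), not a reimplementation of Stone et al.'s materials.

# Nonlatent word-overlap baseline: cosine between raw term-frequency
# vectors of two paragraphs. Tokenization and example text are illustrative.
import math
import re
from collections import Counter

def term_frequencies(paragraph):
    return Counter(re.findall(r"[a-z]+", paragraph.lower()))

def overlap_cosine(paragraph_a, paragraph_b):
    a, b = term_frequencies(paragraph_a), term_frequencies(paragraph_b)
    dot = sum(a[word] * b[word] for word in set(a) & set(b))
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(overlap_cosine(
    "The storm damaged the harbor and several fishing boats.",
    "Several boats in the harbor were damaged during the storm."))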
Constraining the knowledge space helps to avoid contamination from multiple word meanings across contexts. For example, whereas humans can quickly discern the meaning of bank in the context of finances, most statistical models of semantics will retrieve all senses of a word (Kintsch & Mangalath, 2010, this volume). Thus, more focused, topic-specific spaces avoid the problem of retrieving multiple senses of words that are likely to be irrelevant.

Riordan and Jones (2010, this volume) compared models' ability to capture semantic clusters in comparison to humans' judgments of concept features. They found that among the nine models compared, a context word model, COALS, was best at capturing the semantic clusters. The model's success was attributed to how it conditionalizes the co-occurrences of words. COALS is based on HAL, but it reduces the influence of high-frequency words such as function words (e.g., the, a, that) to reduce the impact of chance co-occurrence. It does so by calculating the conditional rate of co-occurrence (i.e., the frequency of co-occurrence in comparison to its base rate of co-occurrence with other words). This is achieved in COALS using a correlation matrix in which the negative values are discarded and the positive values are square rooted (to increase their weight). However, this explanation for COALS's relative success may not be entirely convincing, because other models also control for chance co-occurrence of words and virtually all models control for highly frequent words. Thus, why this model fared so well in terms of detecting semantic clusters remains a question (but see also Rohde et al., 2005).

The success of COALS in Riordan and Jones (2010, this volume) supports the notion that the meaning of words is in the company they keep (Firth, 1957)—but further implies that the company only includes close neighbors who are not friends with everyone else. Hence, local word co-occurrence (corrected for chance) goes a long way in extracting meaning. Likewise, Stone et al. (2010, this volume) found that the nonlatent word overlap model fared far better than the six statistical models. One important issue is whether meaning must be extracted using a latent or second-order representation or whether it can be extracted with equal success from the words alone.
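To make the COALS-style weighting concrete, the following sketch counts co-occurrences in a small moving window (the counting step shared by context word models such as HAL), converts the counts to correlations against what the words' overall frequencies would predict, discards negative values, and square-roots the positive ones. The toy corpus, the flat two-word window, and the exact normalization formula are illustrative simplifications of the description above, not Rohde et al.'s published implementation.

# COALS-style weighting sketch: moving-window co-occurrence counts ->
# correlation-style normalization -> discard negatives, square-root positives.
import numpy as np

corpus = "the dog chased the cat the cat chased the mouse".split()
window = 2
vocab = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocab)}

counts = np.zeros((len(vocab), len(vocab)))
for pos, word in enumerate(corpus):
    for offset in range(1, window + 1):            # look ahead within the window
        if pos + offset < len(corpus):
            neighbor = corpus[pos + offset]
            counts[index[word], index[neighbor]] += 1
            counts[index[neighbor], index[word]] += 1

total = counts.sum()
row = counts.sum(axis=1, keepdims=True)             # overall frequency of word a
col = counts.sum(axis=0, keepdims=True)             # overall frequency of word b
numerator = total * counts - row * col               # observed versus expected
denominator = np.sqrt(row * (total - row) * col * (total - col))
correlation = np.divide(numerator, denominator,
                        out=np.zeros_like(counts), where=denominator > 0)
coals_like = np.sqrt(np.clip(correlation, 0.0, None))

print(dict(zip(vocab, coals_like[index["cat"]].round(2))))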
4. Improving semantic models

Comparing semantic models improves our understanding of which aspects of the models contribute most to successful meaning extraction. Another approach to improving methods of extracting meaning from text is to augment statistical models with other techniques. Steyvers, Chemudugunta, and Smyth (2010, this volume) do so by combining the Topic model with human judgments. They propose that models of the learning process can be enhanced by using corpus techniques to represent the background knowledge of the learner. By combining information from corpora and human data, statistically derived relationships can complement human-defined concepts by filling in missing information (and vice versa), and the combination of techniques may indicate the relative importance of preexisting knowledge and new knowledge extracted from a text.

Another avenue for improvement concerns the analyses that are conducted with the models' output and how the results are interpreted. Louwerse (2010, this volume) uses multidimensional
scaling (MDS) to extract the representation yielded by the LSA analysis, reducing it from hundreds of dimensions to a two-dimensional visualization. One advantage of MDS (and other similar methods) is that it provides an indication of the distance between concepts. For Louwerse, this technique affords the visualization of spatial and temporal relations present in the text corpora. Similarly, Riordan and Jones (2010, this volume) use MDS to convey the nature of the clusters of concepts. The distance between clusters represents similarity of the clusters; the height represents the internal consistency of the cluster; and the area indicates the number of terms in the cluster. Thus, MDS can provide additional information as well as facilitate interpretation of underlying relationships in the data. Other researchers have also used this technique. For example, O’Rourke and Calvo (2009) have incorporated the use of MDS in combination with LSA to examine the relationships between paragraphs in essay writing. MDS essentially allows the researcher to uncover the clusters of related concepts and the distances between text sections, much like principal component and factor analyses.
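A minimal sketch of that analysis step follows: a matrix of pairwise similarities (here invented cosine values for a handful of terms; in practice they would come from an LSA space) is converted to distances and reduced to two dimensions with scikit-learn's MDS, yielding one plottable point per term.

# Sketch: reducing pairwise semantic similarities to a 2-D MDS layout.
# The cosine values below are made up for illustration.
import numpy as np
from sklearn.manifold import MDS

terms = ["north", "south", "east", "west", "summer", "winter"]
cosines = np.array([
    [1.0, 0.7, 0.6, 0.6, 0.2, 0.2],
    [0.7, 1.0, 0.6, 0.6, 0.2, 0.2],
    [0.6, 0.6, 1.0, 0.7, 0.2, 0.2],
    [0.6, 0.6, 0.7, 1.0, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2, 1.0, 0.6],
    [0.2, 0.2, 0.2, 0.2, 0.6, 1.0],
])
distances = 1.0 - cosines                          # similarity -> dissimilarity

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coordinates = mds.fit_transform(distances)         # one 2-D point per term

for term, (x, y) in zip(terms, coordinates):
    print(f"{term:8s} {x:6.2f} {y:6.2f}")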
5. Embodiment

One important and hotly debated topic in computational semantics regards the potential importance of perceptual simulations and embodiment in cognitive processing. It is clear to most that our experiences are constrained by the world and by how our body fits into that world. By consequence, cognition is shaped by our experiences in the material world. Similarly, it is clear that the mind often represents the world, and our experiences in that world, in some sort of analog form, such as images. If that is the case, some argue, then modeling human cognition using mathematically based or symbolic representations is futile because these models cannot capture a fundamental aspect of cognition: embodiment (e.g., Barsalou, 1999; De Vega, Glenberg, & Graesser, 2008; Glenberg, 1997). Moreover, some argue that meaning as it relates to cognitive processes cannot be extracted from representations such as text because text, being verbal and symbolic, cannot provide a complete picture of human cognition. One potential value of this extremist view is that it has served to fuel the debate. In addition, some researchers seek elegant models or attempt to see how far one can go with a simple model; pushing the limits of simple models can reveal a good deal about the phenomena. Nonetheless, most recognize that there is value to both viewpoints (e.g., Paivio & Sadoski, in press). Most researchers and theorists recognize that cognition comprises symbolic, abstract, and verbal thought and reasoning, in addition to concrete and embodied representations. The notion that we have only one form or the other to represent meaning in the mind is, frankly, absurd.

Two articles in this topic address this issue (Louwerse, 2010, this volume; Riordan & Jones, 2010, this volume). Riordan and Jones (2010, this volume) compare symbolic, distributional models (e.g., LSA) and feature-based models (that rely on human judgments of features) in their ability to represent semantic clusters. Feature-based models have been argued to represent perception and action in cognition more aptly than do purely symbolic, distributional models such as LSA because they represent words' meanings in terms of their descriptive features (e.g., McRae, Cree, Seidenberg, & McNorgan, 2005; see Riordan &
Jones, 2010, this volume, for additional citations). These features are assumed to be closer to sensorimotor experiences, which in turn are assumed to comprise a fundamental aspect of words' meanings (e.g., Barsalou, 2003). Riordan and Jones (2010, this volume) compare feature-based models to the nine distributional models listed earlier (i.e., LSA, Topic, COALS, CS-LL, CS-LO, HAL, HIDEx, PosPMI, and BEAGLE). They examine the ability of these nine models to extract semantic clusters in comparison to human-generated norms of word features. In the first two studies, the models were trained on the TASA corpus and then compared on their ability to cluster words according to semantic class labels used in WordNet. Their performance was compared to the McRae feature norms (McRae et al., 2005) for concrete nouns in the first study and to the Vinson and Vigliocco (2008) feature norms for nouns and verbs in the second study. In the third study, they compared models using semantic classes from the MacArthur-Bates Communicative Development Inventories (Fenson et al., 1994) and trained the models using the CHILDES database (MacWhinney, 2000) of caregiver speech (adults' utterances to children, 12–48 months).

Across the three studies, Riordan and Jones (2010, this volume) found that several models rivaled the human-based feature norms, with COALS most consistently comparable to the feature norms in reproducing semantic classes across data sets. However, they also found that the distributional and featural information were not redundant. The distributional models tended to pick up on actions, functions, and situations, and less on perceptual attributes such as color or texture. Nonetheless, the distributional models' use of linguistic cues in the language rivaled the performance of feature norms (produced by humans). One implication of this research is that a model that combines human-derived norms with a distributional approach may be more successful (see Steyvers et al., 2010, this volume). On the whole, this research indicates that the statistical models are able to pick up on information associated with embodied thought, but not in the same way as humans do.

Louwerse (2010, this volume) also argues that both aspects of processing (i.e., symbolic and embodied) are important to describing human cognition. He refers to this as the Symbol Interdependency hypothesis, which supposes that language and language comprehension depend on interdependencies of abstract linguistic symbols as well as the references these symbols make to perception and action. Furthermore, he provides evidence that perceptual and modal aspects of cognition can be extracted from text corpora, which embodied theorists have considered purely symbolic. The underlying assumption of Louwerse's argument (and of this issue of topiCS) is that how we think manifests itself in the language we use. As such, the effects of perception and action, or embodied thought, on cognition can be extracted from language. Louwerse (2010, this volume) shows that a wide array of results that have been used to support the embodied perspective can be replicated using techniques such as LSA. For example, he demonstrated that word pairs from the same sensory modality (motoric, smell, sound, taste, touch, and vision) have higher LSA cosines than do word pairs from different modalities.
In addition, according to LSA cosine values, concepts have a stronger relationship to their own properties or features than to properties descriptive of other concepts. Thus, like
Riordan and Jones (2010, this volume), Louwerse shows that features can be detected using computational models such as LSA, even if those features might be considered embodied and thus beyond computation by many researchers.
6. Where is meaning in text?

We assume that there is meaning in text and that it resides in the words, sentences, paragraphs, and so on. One question, however, is exactly where the meaning of the text resides. Can the full scope of meaning be extracted solely from the words and their co-occurrences (i.e., the company they keep), or is more context and information needed, such as syntax or human-derived data?

Most if not all text comprehension models assume that comprehension occurs at various levels that together produce the reader's mental representation (see e.g., Kintsch, 1998; McNamara & Magliano, 2009). For example, the Construction-Integration model assumes at least three levels of representation: the surface code (words and syntax), the propositional textbase (deeper meaning of the text), and the situation model (a combination of the text with prior knowledge). Because readers' comprehension is multidimensional, a more complete picture of it is provided when comprehension is assessed in various ways and at multiple levels (e.g., using both text-based and situation model questions; McNamara & Kintsch, 1996).

Multiple levels of comprehension can be attributed in part to cognitive processing mechanisms. However, another cause is the signal itself. Language comprises multiple levels of information—it is multidimensional. If language comprises different levels of meaning, then statistical models that seek to extract meaning from text should extract it at these levels as well. For example, Jones, Kintsch, and Mewhort (2006) demonstrated that including both semantic and syntactic (word order) information improves the ability of semantic models to account for a wider range of cognitive phenomena (see also Dennis, 2005; Dennis & Kintsch, 2008). Similarly, different aspects of a representation can be emphasized by varying parameters within the algorithms. For example, McNamara, Cai, and Louwerse (2007) evaluated variations of the LSA algorithm to examine whether the performance of LSA could be improved by varying two factors: emphasis on high- versus low-frequency words, and similarity strictness. Overall, the study indicated that different algorithms may be more apt to detect differences in meaning depending on the level of analysis. Thus, different algorithms may be more or less appropriate and effective depending on the cognitive processes that are targeted in the particular study.

Indeed, this is an underlying assumption of Coh-Metrix (Graesser & McNamara, 2010, this volume; McNamara, Louwerse, McCarthy, & Graesser, 2010). Coh-Metrix provides information about various levels of language to support investigations of how and when these levels come into play in various situations. Coh-Metrix provides indices on words (e.g., frequency, concreteness, homonymy, polysemy), sentences (e.g., length, noun phrase density, number of words before the main verb), and cohesion (lexical diversity, referential cohesion, semantic cohesion). Cohesion is the level of connectivity in a
text—the degree to which clauses, sentences, ideas, and larger sections of a text are explicitly tied together. If text is conceptualized as a network of nodes and links, then the nodes would represent the words (or the parts of words) of the text. The words of a text can be predominantly abstract or concrete, familiar or unfamiliar, ambiguous or unambiguous. These are characteristics of the words that result from, and are thus signals for, the meaning of a text. Indeed, articles in this topic support the notion that a good deal of a text's meaning can be detected just on the basis of the words in the text. For example, Stone et al.'s (2010, this volume) findings indicate that meaning can be extracted from the nonlatent information available in large text corpora (see also Recchia & Jones, 2009). LSA was a groundbreaking technique because it demonstrated that the text could be reduced to fundamental components comprising a latent representation, which contained the essentials of the text's semantics. However, dimension reduction is not always crucial to successful meaning extraction. Along these lines, Louwerse (2010, this volume) argues that it is not computational models such as LSA that, first and foremost, afford the ability to extract meaning from text, but rather that this ability emerges from the organization of language itself.

Cohesion, in particular connectivity, is one aspect of that organization. Cohesion is the glue that allows the ideas to stick together. At the surface level, word units serve as glue to connect phonemes and morphemes, and syntax serves as glue to combine the words into a meaningful unit. At the textbase level, verbs serve as glue to connect ideas together to form meaningful clauses. Overlapping words and concepts among sentences serve to tie the ideas together. Likewise, at the situation model level, connectives serve as signals of the relationships between ideas and between the larger concepts that are being expressed across the text. Further, rhetoric, pragmatics, and global cohesion cues tie the text or discourse into a meaningful whole. These cohesion cues are important to comprehension because when those cues are not present in the text, the reader must use reasoning and prior knowledge, or retrieve prior text, to infer the missing connections (e.g., McNamara & Kintsch, 1996; O'Reilly & McNamara, 2007). It is the combined information from the words, sentences, and relations that affords the most complete picture of language. Thus, picking up on the multidimensionality of text and its deeper meaning likely depends on assessing the various dimensions of how those words are connected, beyond their particular characteristics, proximity, and co-occurrence (latent or nonlatent).

If meaning is present at multiple levels, then a more complete statistical model of semantics would benefit from taking into account multiple levels of processing. Kintsch and Mangalath (2010, this volume) do just that. They present a model that makes use of both context-word and context-region matrices as well as syntactic information. These sources of information are combined within a modified version of the Construction-Integration model of comprehension (CI-II). The word-document matrix generated using the Topic model provides relational, gist information (called the gist trace), whereas a word-word matrix provides information representative of a surface level of processing (called the explicit relational trace).
Syntactic information comes from a dependency grammar parser, which provides two explicit sequential traces (one for each side of the dependency unit). Thus, this model
potentially captures both textbase (gist) and surface (explicit) level representations of text, as well as syntactic information. The CI-II model randomly samples from the explicit sequential trace with the constraint that the sample be both semantically and syntactically relevant. Kintsch and Mangalath show that the conditionalized combination of these three sources of information is more powerful across a range of semantic, lexical, and sentential tasks than using only one or two sources or using LSA alone.

The CI-II model (Kintsch & Mangalath, 2010, this volume) also emphasizes the importance of context in deriving meaning from words and sentences. Whereas multiple meanings of words may reside in long-term memory, a generative comprehension model is necessary to weed out these multiple word senses when a word is understood in the context of text and discourse. Whereas a word such as band has multiple meanings in long-term memory, its meaning is constrained by context in sentences such as He played in a rock band and He wore his wrist band. The CI-II model narrows down the meaning of words in context by making use of multiple sources of information and basing activation on the combined conditionalized probabilities. This operates somewhat like the predication model (Kintsch, 2001, 2007, 2008). Such an approach allows the contextualized meaning of words to emerge iteratively in working memory. This is particularly important for accounting for more complex cognition, such as the understanding of metaphor.
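Returning to the cohesion indices discussed above, one of the simplest is referential cohesion: the proportion of adjacent sentence pairs that share at least one content word. The sketch below is a generic illustration of that single index (the stop-word list and example text are invented here); Coh-Metrix itself computes a much larger and more carefully validated set of measures.

# Sketch of a referential-cohesion index: the proportion of adjacent
# sentence pairs that share at least one content word.
import re

STOP_WORDS = {"the", "a", "an", "and", "but", "of", "in", "on", "to",
              "is", "was", "were", "it", "this", "that", "as", "for"}

def content_words(sentence):
    return {word for word in re.findall(r"[a-z]+", sentence.lower())
            if word not in STOP_WORDS}

def referential_cohesion(sentences):
    pairs = list(zip(sentences, sentences[1:]))
    overlapping = sum(1 for first, second in pairs
                      if content_words(first) & content_words(second))
    return overlapping / len(pairs) if pairs else 0.0

text = ["The mitochondria produce energy for the cell.",
        "This energy is stored as ATP.",
        "Cells use ATP to power chemical reactions."]
print(referential_cohesion(text))   # 1.0: every adjacent pair shares a word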
7. Conclusion

Collectively, the research presented in this topic supports a number of conclusions. First, numerous models and variations on models have been developed over the last two decades, with a dramatic growth in interest and research in the last decade. Semantic models that apply statistical algorithms to large corpora of text are powerful means for extracting meaning from text. The computational power that has emerged over the last three decades has afforded the ability to analyze large corpora that successfully represent a good deal of human knowledge. Some have questioned the ability of such techniques to capture the true essence of human cognition because "bag of words" techniques potentially miss out on important aspects of cognition. However, we see here that even nonlatent approaches to corpus analysis successfully capture much of what has been deemed well beyond the mere words in text. Indeed, Louwerse (2010, this volume) shows that a good deal of the results reported by embodiment theorists can be simulated using both nonlatent and latent (i.e., LSA) statistical models.

Second, semantic models can be augmented by combining them with data generated by humans, by accounting for word order or syntax, and by accounting for multiple levels of cognitive processing. Essentially, it seems that including multiple sources of information and assuming multiple levels of processing will be necessary for models to account for a wide range of cognitive data. For the most part, semantic models simulate human knowledge as it resides inertly in long-term memory. The contents of a model's knowledge base can be manipulated by controlling the text that comprises the space. For example, to simulate an
8-year-old, one would use text and discourse to which a typical 8-year-old might have been exposed. The performance of semantic models improves when the text corpora used to create the semantic space are contextually and developmentally constrained (e.g., Shapiro & McNamara, 2000; Stone et al., 2010, this volume).

The importance of the ability of semantic models to simulate human knowledge should not be trivialized—it was a path-breaking achievement both theoretically and practically. Nonetheless, one challenge for these models is to go beyond the fine-tuning of extracting semantic similarity based on statistical constraints in corpora, which are in turn aligned with particular mathematical properties. Semantic models must account for complex cognitive phenomena such as humans' understanding of synonymy, paraphrase, metaphor, and coherence. Indeed, several of the researchers featured in this issue have done just that—or at least paved the road to do so in future research.

To go beyond the simulation of knowledge, and to account for performance on a wide variety of tasks, it seems that semantic models must use a combination of approaches. First, models that combine assumptions relevant to both episodic memory and semantic processing are successful in accounting for a variety of phenomena, including incremental learning, memory, and semantic processing (Hinton & Salakhutdinov, 2010, this volume; Howard et al., 2010, this volume; Jones & Mewhort, 2007). Second, information from syntax often plays an important role in the processing of text and discourse, and by consequence including sources of information representative of syntax (e.g., word order, grammatical dependencies) improves model performance (Dennis, 2005; Dennis & Kintsch, 2008; Jones et al., 2006; Kintsch & Mangalath, 2010, this volume; Riordan & Jones, 2010, this volume). Third, comprehension comprises multiple levels of processing, including surface, textbase, and situation model levels of understanding (Kintsch, 1998), and thus including multiple sources of information may be necessary to account for the full scope of comprehension, memory, and learning phenomena (Graesser & McNamara, 2010, this volume; Kintsch & Mangalath, 2010, this volume). Indeed, many practical applications use latent representations extracted using statistical algorithms such as LSA in combination with information from the words and syntax (e.g., McNamara, Boonthum, Levinstein, & Millis, 2007).

The need for multilevel models may not always be apparent because some levels of processing overwhelm the others, depending on the situation and the targeted dependent variable (McNamara & Magliano, 2009). For example, oftentimes prior knowledge overwhelms other factors to the extent that there are few discernible contributions from the text itself (other than word frequency). Likewise, when the targeted construct is global understanding of a text or document, the effects of syntax may be overwhelmed by the meanings of the words and the text as a whole. This likely explains why statistical models such as LSA are oftentimes able to extract meaning from text successfully despite ignoring syntax, a fundamental level of text meaning. Nonetheless, extracting the full meaning of text, including the full glory of its multidimensionality, will require using multiple, complementary approaches. The future likely lies more in the combination of techniques than in determining one winning model.
Better, more complete models of semantics are likely to emerge from measuring multiple levels of meaning.
References

Barsalou, L. W. (1999). Perceptual symbol systems. Behavioral and Brain Sciences, 22, 577–660.
Barsalou, L. W. (2003). Abstraction in perceptual symbol systems. Philosophical Transactions of the Royal Society of London: Series B, 358, 1177–1187.
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16, 22–29.
De Vega, M., Glenberg, A., & Graesser, A. C. (2008). Symbols and embodiment: Debates on meaning and cognition. Oxford, England: Oxford University Press.
Dennis, S. (2005). A memory-based theory of verbal cognition. Cognitive Science, 29, 145–193.
Dennis, S., & Kintsch, W. (2008). Text mapping and inference rule generation problems in text comprehension: Evaluating a memory-based account. In F. Schmalhofer & C. Perfetti (Eds.), Higher level language processes in the brain: Inference and comprehension processes (pp. 105–132). Mahwah, NJ: Erlbaum.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19, 61–74.
Fenson, L., Dale, P., Reznick, S., Bates, E., Thal, D., & Pethick, S. (1994). Variability in early communicative development. Monographs of the Society for Research in Child Development, 59, 1–185.
Firth, J. R. (1957). A synopsis of linguistic theory, 1930–1955. In J. R. Firth (Ed.), Studies in linguistic analysis (pp. 1–32). Special volume of the Philological Society. Oxford, England: Blackwell.
Glenberg, A. M. (1997). What memory is for. Behavioral and Brain Sciences, 20, 1–55.
Graesser, A. C., & McNamara, D. S. (2010). Computational analyses of multilevel discourse comprehension. Topics in Cognitive Science. DOI: 10.1111/j.1756-8765.2010.01081.x
Hinton, G., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
Hinton, G., & Salakhutdinov, R. (2010). Discovering binary codes for documents by learning deep generative models. Topics in Cognitive Science, 3, 74–91. DOI: 10.1111/j.1756-8765.2010.01109.x
Hintzman, D. L. (1984). MINERVA 2: A simulation model of human memory. Behavior Research Methods, Instruments, & Computers, 16, 96–101.
Howard, M. W., & Kahana, M. J. (2002). A distributed representation of temporal context. Journal of Mathematical Psychology, 46, 269–299.
Howard, M., Shankar, K., & Jagadisan, U. (2010). Constructing semantic representations from a gradually changing representation of temporal context. Topics in Cognitive Science, 3, 48–73. DOI: 10.1111/j.1756-8765.2010.01112.x
Jones, M. N., Kintsch, W., & Mewhort, D. J. K. (2006). High-dimensional semantic space accounts of priming. Journal of Memory and Language, 55, 534–552.
Jones, M. N., & Mewhort, D. J. K. (2007). Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114, 1–37.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. New York: Cambridge University Press.
Kintsch, W. (2001). Predication. Cognitive Science, 25, 173–202.
Kintsch, W. (2007). Meaning in context. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 89–105). Mahwah, NJ: Erlbaum.
Kintsch, W. (2008). Symbol systems and perceptual representations. In M. de Vega, A. Glenberg, & A. Graesser (Eds.), Symbols and embodiment: Debates on meaning and cognition (pp. 145–163). New York: Oxford University Press.
Kintsch, W., & Mangalath, P. (2010). The construction of meaning. Topics in Cognitive Science. DOI: 10.1111/j.1756-8765.2010.01107.x
Kwantes, P. J. (2005). Using context to build semantics. Psychonomic Bulletin & Review, 12, 703–710.
Landauer, T. K. (2007). LSA as a theory of meaning. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 3–34). Mahwah, NJ: Erlbaum.
Landauer, T., & Dumais, S. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.
Landauer, T. K., McNamara, D. S., Dennis, S., & Kintsch, W. (Eds.) (2007). Handbook of latent semantic analysis. Mahwah, NJ: Erlbaum.
Louwerse, M. M. (2010). Symbol interdependency in symbolic and embodied cognition. Topics in Cognitive Science. DOI: 10.1111/j.1756-8765.2010.01106.x
Lowe, W. (2001). Towards a theory of semantic space. In J. Moore & K. Stenning (Eds.), Proceedings of the 23rd Conference of the Cognitive Science Society (pp. 576–581). Mahwah, NJ: Erlbaum.
Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28, 203–208.
MacWhinney, B. (2000). The CHILDES Project: Tools for analyzing talk (3rd ed.). Mahwah, NJ: Erlbaum.
McNamara, D. S., Boonthum, C., Levinstein, I. B., & Millis, K. (2007). Evaluating self-explanations in iSTART: Comparing word-based and LSA algorithms. In T. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 227–241). Mahwah, NJ: Erlbaum.
McNamara, D. S., Cai, Z., & Louwerse, M. M. (2007). Comparing latent and non-latent measures of cohesion. In T. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 379–400). Mahwah, NJ: Erlbaum.
McNamara, D. S., & Kintsch, W. (1996). Learning from text: Effects of prior knowledge and text coherence. Discourse Processes, 22, 247–288.
McNamara, D. S., Louwerse, M. M., McCarthy, P. M., & Graesser, A. C. (2010). Coh-Metrix: Capturing linguistic features of cohesion. Discourse Processes, 47, 292–330.
McNamara, D. S., & Magliano, J. P. (2009). Towards a comprehensive model of comprehension. In B. Ross (Ed.), The psychology of learning and motivation (pp. 298–372). New York: Elsevier Science.
McRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37, 547–559.
Murdock, B. B. (1992). Serial organization in a distributed memory model. In A. F. Healy, S. M. Kosslyn, & R. M. Shiffrin (Eds.), From learning theory to connectionist theory: Essays in honor of William K. Estes (pp. 201–225). Hillsdale, NJ: Erlbaum.
O'Reilly, T., & McNamara, D. S. (2007). Reversing the reverse cohesion effect: Good texts can be better for strategic, high-knowledge readers. Discourse Processes, 43, 121–152.
O'Rourke, S. T., & Calvo, R. A. (2009). Visualizing paragraph closeness for academic writing support. In S. Murugesan (Ed.), Handbook of research on the web 2.0, 3.0, and X.0 technologies, business, and social applications (Ch. XLVII). Hershey, PA: IGI Global.
Osgood, C. E., Suci, G., & Tannenbaum, P. (1957). The measurement of meaning. Urbana, IL: University of Illinois Press.
Paivio, A., & Sadoski, M. (in press). Lexicons, contexts, events, and images: Commentary on Elman (2009) from the perspective of dual coding theory. Cognitive Science.
Recchia, G. L., & Jones, M. N. (2009). More data trumps smarter algorithms: Comparing pointwise mutual information with latent semantic analysis. Behavior Research Methods, 41, 657–663.
Riordan, B., & Jones, M. N. (2010). Redundancy in perceptual and linguistic experience: Comparing feature-based and distributional models of semantic representation. Topics in Cognitive Science. DOI: 10.1111/j.1756-8765.2010.01111.x
Rohde, D. L. T., Gonnerman, L. M., & Plaut, D. C. (2005). An improved method of deriving word meaning from lexical co-occurrence. Unpublished manuscript. Available at: http://tedlab.mit.edu/~dr/. Accessed January 15, 2010.
Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18, 613–620.
Shaoul, C., & Westbury, C. (2006). Word frequency effects in high-dimensional co-occurrence models: A new approach. Behavior Research Methods, 38, 190–195.
Shapiro, A. M., & McNamara, D. S. (2000). The use of latent semantic analysis as a tool for the quantitative assessment of understanding and knowledge. Journal of Educational Computing Research, 22, 1–36.
Smith, E. E., Shoben, E. J., & Rips, L. J. (1974). Structure and process in semantic memory: A featural model for semantic decisions. Psychological Review, 81, 214–241.
Steyvers, M., Chemudugunta, C., & Smyth, P. (2010). Combining background knowledge and learned topics. Topics in Cognitive Science, 3, 18–47. DOI: 10.1111/j.1756-8765.2010.01097.x
Steyvers, M., & Griffiths, T. L. (2007). Probabilistic topic models. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 427–448). Mahwah, NJ: Erlbaum.
Stone, B. P., Dennis, S. J., & Kwantes, P. J. (2010). Comparing methods for single paragraph similarity analysis. Topics in Cognitive Science. DOI: 10.1111/j.1756-8765.2010.01108.x
Vinson, D. P., & Vigliocco, G. (2008). Semantic feature production norms for a large set of objects and events. Behavior Research Methods, 40, 183–190.
Xu, W., Liu, X., & Gong, Y. (2003). Document clustering based on non-negative matrix factorization. In C. Clarke & G. Cormack (Eds.), Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval (pp. 267–273). New York: ACM Press.
Topics in Cognitive Science 3 (2011) 18–47 Copyright © 2010 Cognitive Science Society, Inc. All rights reserved. ISSN: 1756-8757 print / 1756-8765 online DOI: 10.1111/j.1756-8765.2010.01097.x
Combining Background Knowledge and Learned Topics

Mark Steyvers (a), Padhraic Smyth (b), Chaitanya Chemudugunta (b)

(a) Department of Cognitive Sciences, University of California, Irvine
(b) Department of Computer Science, University of California, Irvine
Received 12 November 2008; received in revised form 19 June 2009; accepted 08 September 2009
Abstract

Statistical topic models provide a general data-driven framework for automated discovery of high-level knowledge from large collections of text documents. Although topic models can potentially discover a broad range of themes in a data set, the interpretability of the learned topics is not always ideal. Human-defined concepts, by contrast, tend to be semantically richer due to careful selection of the words that define them, but they may not span the themes in a data set exhaustively. In this study, we review a new probabilistic framework for combining a hierarchy of human-defined semantic concepts with a statistical topic model to seek the best of both worlds. Results indicate that this combination leads to systematic improvements in generalization performance as well as to new techniques for inferring and visualizing the content of a document.

Keywords: Topic model; Concept-topic model; Hierarchical concept-topic model; Concepts; Background knowledge; Human-defined knowledge; Data-driven learning; Bayesian models
1. Introduction

Many recent computational approaches to semantic cognition and statistical natural language processing operate on a purely data-driven basis. These models can extract useful information merely on the basis of statistical information contained in large text collections. From a machine-learning perspective, such models are attractive because they allow for a rapid analysis and understanding of new collections of text without significant human coding or annotation effort (e.g., Newman, Chemudugunta, Smyth, & Steyvers, 2006). From a cognitive science perspective, these models are attractive because they show that many findings related to semantic cognition can be explained by simple statistical learning processes.

Correspondence should be sent to Mark Steyvers, Department of Cognitive Sciences, University of California, 3151 Social Sciences Plaza, Irvine, CA 92697-5100. E-mail:
[email protected]
Such learning processes can account for many empirical findings in areas such as language acquisition (Newport & Aslin, 2004; Newport, Hauser, Spaepen, & Aslin, 2004), multimodal language learning (Yu, Ballard, & Aslin, 2005), object perception (Fiser & Aslin, 2005), and eye movements (Najemnik & Geisler, 2005).

In this research, we start with the assumption that much of our semantic representation can be acquired from experience in the form of large text collections, given appropriate statistical learning machinery. However, we also assume that building in some structure and prior knowledge might be required to create suitable representations. It has been shown recently how data-driven learning approaches can be combined with structured representations such as hierarchies, graphs, trees, and rules to create powerful new learning models (Chater & Manning, 2006; Kemp & Tenenbaum, 2008). In our research, we show how structured background knowledge and statistical learning processes can be combined. The combination of prior knowledge and novel information gained from experience raises two broad theoretical questions. First, how can prior knowledge facilitate the acquisition of new knowledge? We will investigate the circumstances under which prior knowledge can significantly help in learning semantic representations. Second, how can new knowledge be used to make changes in our background knowledge? We will demonstrate how corpus-driven learning processes can be used to identify gaps in an existing semantic representation.

1.1. Data-driven learning approaches

There are a variety of unsupervised approaches for extracting semantic representations from large text collections that do not rely on background knowledge. In the context of a general "bag-of-words" framework, each document is represented by a vector that contains counts of the number of times each term (i.e., word or word combination) appears in the document. One general approach is to apply dimensionality reduction algorithms to represent the high-dimensional term vectors in a low-dimensional space. The dimensionality reduction can involve nonlinear projection methods such as self-organizing maps (Kohonen et al., 2000; Lagus, Honkela, Kaski, & Kohonen, 1999) or linear projection methods such as latent semantic analysis (LSA; Landauer & Dumais, 1997; Landauer, Foltz, & Laham, 1998). As a result of the dimensionality reduction, neighboring points in the semantic space often represent words or documents with similar contextual usages or meaning. These representations have been shown to model human knowledge in a variety of cognitive tasks (Landauer & Dumais, 1997) and educational assessment applications (Foltz, Gilliam, & Kendall, 2000). Other recent models in cognitive science have focused on alternative unsupervised methods to extract semantic representations at the sentence or document level (e.g., Dennis, 2004; Jones & Mewhort, 2007).

In a probabilistic framework, a variety of clustering techniques have been developed that characterize each document by a single latent cluster or topic (e.g., Cutting, Karger, Pedersen, & Tukey, 1992; McCallum, Nigam, & Ungar, 2000; Popescul, Ungar, Flake, Lawrence, & Giles, 2000). Through unsupervised learning, these clusters can be learned automatically and give broad information about the content of documents. The drawback of the one-to-one mapping between documents and clusters is that documents that cover a
diverse set of topics can only be represented by a single cluster leading to problems in interpretation (e.g., Newman et al., 2006). A more flexible unsupervised framework, known as statistical topic modeling, allows each document to be represented by multiple topics (Blei, Ng, & Jordan, 2003; Buntine & Jakulin, 2004; Griffiths & Steyvers, 2004; Griffiths, Steyvers, & Tenenbaum, 2007; Hofmann, 1999; Steyvers & Griffiths, 2007). The basic concept underlying topic modeling is that each document is composed of a probability distribution over topics, where each topic represents a probability distribution over words. The topic–document and topic–word distributions are learned automatically from the data and provide information about the semantic themes covered in each document and the words associated with each semantic theme. The underlying statistical framework of topic modeling enables a variety of interesting extensions to be developed in a systematic manner, such as author-topic models (Steyvers, Smyth, Rosen-Zvi, & Griffiths, 2004), correlated topics (Blei & Lafferty, 2006), hierarchical topic models (Blei, Griffiths, Jordan, & Tenenbaum, 2003; Li, Blei, & McCallum, 2007; Teh, Jordan, Beal, & Blei, 2006), time-dependent topics (Wang, Blei, & Heckerman, 2008) and models that combine topics and syntax (Griffiths, Steyvers, Blei, & Tenenbaum, 2005), as well as image features and text (Blei & Jordan, 2003). Topic models have also been useful as cognitive models to explain human associations, gist extraction, and memory errors (Griffiths et al., 2007). One of the drawbacks of a purely data-driven learning process, such as topic modeling, is that the resulting representations can require some effort to interpret. As an illustrative example of the information learned by topic models, Fig. 1 (top row) shows five examples of topics that were derived from the TASA corpus, a collection of over 37,000 text passages from educational materials (e.g., language and arts, social studies, health, sciences) collected by Touchstone Applied Science Associates (TASA; see Landauer et al., 1998). The figure shows the 15 words that have the highest probability under each topic. Each number corresponds to the probability that a word is generated conditioned on a learned topic. It is often easier to interpret topics relative to other representations such as document clusters in cluster models or latent dimensions in latent semantic analysis (e.g., Newman et al., 2006). The words in the topics in the top row of Fig. 1 appear to relate to colors, gases, and the atmosphere, American presidents, European countries in World War II, and Japan and World War II, respectively. However, because topics are defined by probability distributions over words and have no simple names or definitions that can explain their content, an interpretation of the content of a topic often requires a subjective analysis of the connections between the high-probability words in a topic. This subjective process can lead to different outcomes depending on which individual is doing the analysis. Some progress has been made to automate the labeling of topics (Mei, Shen, & Zhai, 2007), but it remains to be seen how easily accessible such statistical representations are to human users. Even with these techniques, topic interpretability remains an issue when faced with small noisy data sets. Data-driven learning models require large amounts of data in order to obtain accurate and useful representations, and such data might not always be available. 
In addition, because the model tunes itself to the dominant semantic themes in a corpus, it might not accurately represent outlier documents.
Fig. 1. Examples of five topics (out of 300) extracted from the TASA corpus (top row). The closest corresponding concepts from the CALD knowledge base (bottom row). The most likely words in each topic along with the matching word in the concept are highlighted. The column ‘‘prob’’ for each topic refers to the probability of each word in that topic.
Although such models might be able to tell that a document falls outside the scope of representative semantic themes, it is difficult to identify the content that is covered by such documents. Therefore, in the absence of a large repository of relevant background documents to build topic models, it can be difficult to get interpretable and effective representations.

1.2. Human-defined semantic representations

An entirely different approach to constructing semantic representations is to rely on human knowledge and judgment. Considerable effort has gone into developing human-defined knowledge databases that characterize commonsense and lexical knowledge in humans. Such knowledge is created by trained experts in projects such as Cyc (Lenat & Guha, 1989; Panton et al., 2006), WordNet (Fellbaum, 1998; Miller, 1990), and Roget's thesaurus (Roget, 1911) or by untrained volunteers in projects such as ConceptNet (Havasi, Speer, & Alonso, 2007). Similarly, in cognitive science, many behavioral experiments have elicited detailed knowledge from many college students about semantic associations (Nelson, McEvoy, & Schreiber, 1998) and concepts and features (McRae, Cree, Seidenberg, & McNorgan, 2005; Ruts et al., 2004). Such human-defined representations can serve as proxies for mental representations.
To highlight the difference between learned topics and human-defined knowledge, we will show some examples of human-defined concepts that were created by lexicographers as part of the Cambridge Advanced Learner's Dictionary (CALD). In contrast to other taxonomies such as WordNet (Fellbaum, 1998; Miller, 1995), CALD groups words primarily according to semantic topics, with the topics hierarchically organized. In addition, CALD provides names for each concept that are helpful for visualization purposes. CALD consists of 2,183 semantic concepts, with each concept consisting of a set of words and a name that describes the concept. Fig. 1 (bottom row) shows examples of CALD concepts that resemble the meaning of the learned topics in the top row of Fig. 1. Each concept is illustrated with the name of the concept (shown on top) and a subset of 15 words that are part of the concept. Apart from alphabetical order, there is no natural way to order words within a concept. To better summarize the sometimes large number of words in each concept, we ordered the words by word frequency in the TASA corpus.

A comparison between learned topics and human-defined CALD concepts in Fig. 1 reveals some interesting differences. The words in each topic are associated with probabilities that indicate how likely each word is to be found in a context of that topic, which is quite useful for getting a fine-grained indication of the relevance of a word to a topic. In contrast, CALD concepts do not provide any information about the prominence, frequency, or representativeness of the words in each concept: either a word is present or it is absent in a concept. A clear advantage of concepts is that they are often more interpretable than learned topics by virtue of having a name (or small number of words) that describes the concept, which also gives concepts more precise coverage than topics. For example, the first topic on colors includes words that are not color words (e.g., bright and look), whereas a color concept will restrict itself to just color words. Concepts can also have broader coverage relative to topics because all words are considered as candidates and not just words occurring in a particular corpus. For example, the concept on chemical elements lists all chemical elements (as known by the lexicographer), whereas a learned topic might focus more on the high-frequency chemical elements. Also, a learned topic could omit certain elements altogether because they did not occur in the corpus. Although there are many advantages of human-defined knowledge databases, a major drawback is that they require extensive manual involvement and are time consuming to build and update given new emerging information. For some applications, such as analyzing and summarizing text collections, no knowledge database with suitable coverage of the domain might even be available. In contrast, data-driven topics can be tuned to themes in a corpus and can easily discover and summarize the dominant semantic themes for a wide variety of text collections.

1.3. Combining human-defined knowledge and data-driven learning approaches

Clearly, representations based on either a purely data-driven approach or human-defined knowledge have limitations. In this article, we will review some of our recent work that combines human-defined concepts with statistical data-driven approaches to learning
semantic representations (Chemudugunta, Holloway, Smyth, & Steyvers, 2008; Chemudugunta, Smyth, & Steyvers, 2008a, 2008b). The objective is to combine these approaches in a way that takes advantage of the best features of both. When there are few documents to learn from, these hybrid models are primarily driven by human-defined concepts. When trained on large document collections, data-driven topics can fill in gaps in the human-defined concepts. From a machine-learning perspective, automatically identifying such gaps can lead to a variety of useful applications where we update existing representations without requiring extensive human effort in discovering new emerging themes. From a cognitive science perspective, the hybrid model leads to novel ways of thinking about semantic representations. Instead of assuming that such representations are purely the result of data-driven learning processes, they might be a combination of preexisting knowledge and new knowledge extracted from a collection of text. We make no theoretical claims about the source of the prior knowledge. Although it is likely that such prior knowledge is itself acquired by experience, we do not attempt to explain how this is learned from experience.

The plan for the rest of the study is as follows. In Section 2, we review the basic principles of topic models and then describe the concept–topic model that combines concepts and topics into a single probabilistic model. We also describe the hierarchical concept–topic model, which takes advantage of known hierarchical structure among concepts. Section 3 describes the text corpus and concept data set that we used to conduct our experiments. In Section 4, we discuss a number of examples that illustrate how documents can be tagged at the word level with human-defined concepts. Section 5 describes a series of experiments that evaluate the predictive performance of a number of different models, showing for example that prior knowledge of concept words and concept relations can lead to better topic-based language models. In Section 6, we discuss the type of information that is learned by topics but not captured by concepts. In Section 7, we show how the concept–topic model can automatically find appropriate concepts for novel words. In the final sections, we conclude the study with a brief discussion of related research, future directions, and final comments.
2. Concept–topic models

A clear advantage of an unsupervised learning approach such as topic modeling is that the model can be tuned to the themes of the particular document collection it is trained on. In addition, the probabilistic model that underlies the topic model allows one to automatically tag each word in a document with the topic most likely to have generated it. In contrast, human-defined concepts such as the CALD knowledge base have much broader coverage of English words and include useful names for concepts that clarify the set of words that could be included in the concept and aid interpretability. In this section, we describe concept–topic and hierarchical concept–topic models that combine data-driven topics and human-defined concepts (Chemudugunta et al., 2008b, 2008c). We begin with a brief review of topic models.
2.1. Topic model

The topic model (or latent Dirichlet allocation [LDA] model; Blei et al., 2003) is a statistical learning technique for extracting a set of topics that describe a collection of documents. A topic t is represented by a multinomial distribution over the V unique word types in the corpus, φ^(t) = [φ_1^(t), ..., φ_V^(t)], where φ_w^(t) = p(w|t) and 1 ≤ w ≤ V. Therefore, a topic can be viewed as a V-sided die, and generating n word tokens from a topic is akin to throwing the topic-specific die n times. There are a total of T topics, and a document d is represented as a multinomial distribution over those T topics, θ^(d) = [θ_1^(d), ..., θ_T^(d)], where θ_t^(d) = p(t|d) and 1 ≤ t ≤ T. The variables φ and θ indicate which words are important for which topic and which topics are important for a particular document, respectively. Generating a word token for a document d involves first selecting a topic t from the document–topic distribution θ^(d) and then selecting a word from the corresponding topic distribution φ^(t). This process is repeated for each word token in the document. Let z be the random variable that represents the topic indices sampled from θ^(d). We write p(z_i = t | d) for the probability that the tth topic was sampled for the ith word token (in document d) and p(w_i | z_i = t) for the probability of word w_i under topic t. The model specifies the following conditional probability of the ith word token in a document:

p(w_i | d) = \sum_{t=1}^{T} p(w_i | z_i = t) p(z_i = t | d)    (1)
In the LDA model, Dirichlet priors are placed on both φ and θ to smooth the word–topic and topic–document distributions (for a description of Dirichlet priors, see Steyvers & Griffiths, 2007; Gelman, Carlin, Stern, & Rubin, 2003). In many applications, symmetric Dirichlet densities with single hyperparameters α and β are used for θ and φ, respectively. For all the topic models in this research, we will use a symmetric Dirichlet prior for φ with a single hyperparameter β. For the topic–document distributions θ, we will use an asymmetric Dirichlet prior with a vector α containing hyperparameter values for every topic (and concept, for concept–topic models). An asymmetric prior is useful when some concepts (or topics) are expressed in many or just a few documents across the collection. With an asymmetric prior, more skewed marginal distributions over θ can be obtained to express rare or frequent topics (or concepts). The sequential process of first picking a topic from a topic distribution, and then picking a word token from a distribution over word types associated with that topic, can be formalized as follows:

1. For each topic t ∈ {1, ..., T}, select a word distribution φ^(t) ~ Dirichlet(β)
2. For each document d ∈ {1, ..., D}
   (a) Select a distribution over topics θ^(d) ~ Dirichlet(α)
   (b) For each word position i in document d
      (i) Select a topic z_i ~ Discrete(θ^(d))
      (ii) Generate a word token from topic z_i, w_i ~ Discrete(φ^(z_i))
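As a rough illustration of this generative process, the following Python sketch forward-samples a small synthetic corpus. The corpus size, vocabulary size, and hyperparameter values are arbitrary choices for the example, not the settings used in the paper, and the sketch uses a symmetric prior on θ for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

V, T, D = 1000, 5, 20          # vocabulary size, number of topics, number of documents (illustrative)
beta, alpha = 0.01, 0.1        # symmetric Dirichlet hyperparameters (illustrative)
doc_lengths = rng.poisson(100, size=D)

# Step 1: one word distribution phi^(t) per topic
phi = rng.dirichlet(np.full(V, beta), size=T)          # shape (T, V)

corpus = []
for d in range(D):
    # Step 2(a): document-specific distribution over topics theta^(d)
    theta = rng.dirichlet(np.full(T, alpha))
    words = []
    for _ in range(doc_lengths[d]):
        z = rng.choice(T, p=theta)                     # step 2(b)(i): sample a topic
        w = rng.choice(V, p=phi[z])                    # step 2(b)(ii): sample a word token
        words.append(w)
    corpus.append(words)

print(len(corpus), "documents;", sum(map(len, corpus)), "tokens")
```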
This generative process can be summarized by the graphical model shown in Fig. 2A. In the graphical notation, shaded and unshaded variables indicate observed and latent (i.e., unobserved) variables, respectively, and the arrows indicate the conditional dependencies between variables. The plates (the boxes in the figure) refer to repetitions of sampling steps, with the variable in the right corner referring to the number of samples. For example, the inner plate over z and w illustrates the repeated sampling of topics and words until N_d word tokens have been generated for document d. The plate surrounding θ illustrates the sampling of a distribution over topics for each document d for a total of D documents. The plate surrounding φ illustrates the repeated sampling of distributions over word types for each topic until T topics have been generated. Given the words in a corpus, the inference problem involves estimating the word–topic distributions φ, the topic–document distributions θ, and the topic assignments z of individual words to topics.
!"#
!$#
Į
Į #
ș β
!
ij
"
!%#
βφ
ij
βφ
!
ij
"
%$'
βψ ψ
%$'
Į
γ
τ !)"
ș
ȟ
ȗ !)"
!
(
Ȝ
&
# &
βψ ψ
" '
#
ș
%$&
Fig. 2. Graphical models for the topic model (A), the concept–topic model (B), and the hierarchical concept– topic model (C).
These distributions can be learned in a completely unsupervised manner without any prior knowledge about topics or which topics are covered by which documents. One efficient technique for obtaining estimates of these distributions is through collapsed Gibbs sampling (Griffiths & Steyvers, 2004). Steyvers and Griffiths (2007) present a tutorial introduction to topic models that discusses collapsed Gibbs sampling. The main idea of collapsed Gibbs sampling is that inference is performed only on z, the assignments of word tokens to topics. The remaining latent variables θ and φ are integrated out (''collapsed''). Words are initially assigned randomly to topics, and the algorithm then iterates through each word in the corpus and samples a topic assignment given the topic assignments of all other words in the corpus. This process is repeated until a steady state is reached, and the topic assignments are then used to estimate the word–topic and topic–document distributions. The vector α that contains the hyperparameter values for every topic (and concept, for concept–topic models; see below) is updated using a process involving fixed-point update equations (Minka, 2000; Wallach, 2006). See Appendix A of Chemudugunta et al. (2008b) for more details.

To summarize, the topic model provides several pieces of information that are useful for understanding documents. The topic–document distributions indicate the important topics in each document. The word–topic distributions indicate which words are important for which topic (e.g., the top row of Fig. 1 shows some example word–topic distributions estimated for the TASA corpus). Finally, the probabilistic assignments z_i of word tokens to topics are useful for tagging purposes, providing information about the role each word is playing in a specific document context and helping to disambiguate multiple meanings of a word (e.g., Griffiths et al., 2007).

2.2. Concept–topic model

The concept–topic model is a simple extension of the topic model in which we add C concepts to the T topics, resulting in an effective set of T + C word distributions for each document. We assume that each of the C concepts (such as the CALD concepts in Fig. 1) is represented as a set of words. Therefore, these human-defined concepts only give us a membership function over words: either a word is a member of the concept or it is not. One straightforward way to incorporate concepts into the topic modeling framework is to convert them to probability distributions over their associated word sets. In the concept–topic model, we will treat each concept c as a multinomial distribution ψ^(c) = [ψ_1^(c), ..., ψ_V^(c)], where ψ_w^(c) = p(w|c) and 1 ≤ w ≤ V. Importantly, each word type that is not part of the concept will have zero probability, that is, ψ_w^(c) = 0 for w ∉ c. Of course, there are no direct observations available about the probabilities of word types within a concept, but we can use a model similar to the topic model to estimate these probabilities from corpus data. Therefore, the concept–topic model is simply an extension of the topic model where we have a number of learned topics as well as constrained topics, where nonzero probability can only be given to words in human-defined concepts. In the concept–topic model, the conditional probability of the ith word token w_i in a document d is
p(w_i | d) = \sum_{t=1}^{T} p(w_i | z_i = t) p(z_i = t | d) + \sum_{t=T+1}^{T+C} p(w_i | z_i = t) p(z_i = t | d),    (2)
where the indices 1 ≤ t ≤ T refer to all topics and indices T + 1 ≤ t ≤ T + C refer to all concepts. In this generative process, an index z_i is sampled from the distribution over topics and concepts for the particular document. If z_i ≤ T, a word token is sampled from topic z_i, and if T + 1 ≤ z_i ≤ T + C, a word token is sampled from concept z_i − T among the word types associated with that concept. The topic model can be viewed as a special case of the concept–topic model when there are no concepts present, that is, when C = 0. At the other extreme of this model, where T = 0, the model relies entirely on predefined concepts. To specify the complete generative model, let φ^(t) = [φ_1^(t), ..., φ_V^(t)], where φ_w^(t) = p(w|t) and 1 ≤ w ≤ V, refer to the multinomial distribution over word types for topic t when 1 ≤ t ≤ T, and let ψ^(c) = [ψ_1^(c), ..., ψ_V^(c)], where ψ_w^(c) = p(w|c) and 1 ≤ w ≤ V, refer to the multinomial distribution over word types for concept c = t − T when T + 1 ≤ t ≤ T + C. As with the topic model, we place Dirichlet priors on the multinomial variables θ, φ, and ψ, with corresponding hyperparameters α, β_φ, and β_ψ. The complete generative process can be described as follows:
! " 1. For each topic t 2 f1; :::; Tg, select a word distribution /ðtÞ ' Dirichlet b/! " 2. For each concept c 2 f1; :::; Cg, select a word distribution wðcÞ ' Dirichlet bw 3. For each document d 2 f1; :::; Dg (a) Select a distribution over topics and concepts hðdÞ ' DirichletðaÞ (b) For each word position i in document d# $ (i) Select a component zi ' Discrete hðdÞ # $ (ii) If zi £ T, generate a word token from topic zi, wi ' Discrete /ðzi Þ# ; other$ wise, generate a word token from concept ci = zi - T, wi ' Discrete wðci Þ
Note that in Step 2, the sampling of words for a concept is constrained to only the words that are members of the human-defined concept. Fig. 2B shows the corresponding graphical model. All the latent variables in the model can be inferred through collapsed Gibbs sampling in a similar manner to the topic model (see Chemudugunta et al., 2008b for details). We note that even though we are partially relying on humans to define the word–concept memberships, we still apply purely unsupervised algorithms to estimate the latent variables in the model. This is in contrast to a supervised learning approach where the human-defined knowledge is used as a target for prediction. Here, the human-defined knowledge is only used as a constraint on the probability distributions that can be learned for each concept. We also note that the concept–topic model is not the only way to incorporate semantic concepts. For example, we could use the concept–word associations to build informative priors for the topic model and then allow the inference algorithm to learn word probabilities for all words (for each concept), given the prior and the data. We chose the restricted vocabulary approach to exploit the sparsity in the concept–word associations (topics are distributions over all the words in the vocabulary but concepts are restricted to just their sets of associated words, which are much smaller than the full vocabulary). This sparsity at the
word level allows us to easily perform inference with tens of thousands of concepts on large document collections. A general motivation for the concept–topic approach is that there might be topics present in a corpus that are not represented in the concept set (but that can be learned). Similarly, there may be concepts that are either missing from the text corpus or are rare enough that they are not found in the data-driven topics of the topic model. The marriage of concepts and topics provides a simple way to augment concepts with topics and has the flexibility to mix and match topics and concepts to describe a document. 2.3. Hierarchical concept–topic model Although the concept–topic model provides a simple way to combine concepts and topics, it does not take into account any hierarchical structure the concepts might have. For example, CALD concepts are arranged in a hierarchy that starts with the concept everything which splits into 17 concepts at the second level (e.g., science, society, general ⁄ abstract, communication). The hierarchy has up to seven levels with each level specifying more specific concepts. In this section, we describe a hierarchical concept–topic model that incorporates hierarchical structure of a concept set. Similar to the concept–topic model described in the previous section, there are T topics and C concepts. However, as opposed to the flat organization of the concepts in the concept–topic model, we now utilize the hierarchical organization of concepts when sampling words from concepts. Before we formally describe the model, we illustrate the basic idea in Fig. 3. Each topic and concept is associated with a ‘‘bag of words’’ that represents a multinomial distribution over word types. In the generative process, word tokens can be generated from the concept part of the model by sampling a path from the root of the concept tree to some distribution over word types associated with the concept (left box in Fig. 3). Alternatively, word tokens can be generated through the topic part of the model (right box). The dashed and dotted lines show examples of two word tokens sampled through the hierarchical concept part of the model and the topic part of the model, respectively. For the first word token, the option ‘‘topic’’ is sampled at the root node, Topic 1 is then sampled, and then a word token is sampled from the multinomial over words associated with Topic 1. For the second word token, the option ‘‘concept’’ is sampled at the root node, then the option science is sampled as a child of the concept everything, the word distribution for science is then selected, and a word from this distribution is sampled. Each transition in the hierarchical part of the model has an associated probability and the transition probabilities are document dependent—some paths are more likely in context of some documents. For example, in physics and chemistry documents, one might expect all transitions toward the science concept to be elevated but differentiated between the transitions toward the physics and chemistry concepts. To preview what information is learned by the model, we need to distinguish between variables learned at the word, document, and corpus levels. At the word level, the model learns the assignments of topics or concepts to word tokens. These assignments can be directly used for tagging purposes and word–sense disambiguation. At the document level,
Fig. 3. An illustration of the hierarchical concept–topic model.
the model learns both topic probabilities and concept–transition probabilities in the concept tree. The latter information is useful because it allows a hierarchical representation of document content. At the document level, the model also learns the switch probability that a word is generated through the topic or concept route. The adaptive nature of the switch probability allows the model to flexibly adapt to different documents. Documents that contain material that has poor concept coverage will have a high probability of switching to the topic route. At the corpus level, the model learns the probabilities of the word–topic and word–concept distributions. The word–topic distributions are useful to learn which semantic themes beyond those covered in the concepts are needed to explain the content of the whole document collections. The word–concept distributions are useful to learn which words are important for each concept. Finally, at the corpus level, the model also learns the hyperparameters for each transition in the concept tree. The learned hyperparameters allow the model to make certain paths more prominent across all documents. For example, if a document collection includes many documents on science, the path toward the science concept could be made more likely (a priori). Our approach is related to the hierarchical pachinko allocation model 2 (HPAM 2) as described by Mimno, Li, and McCallum (2007). In the HPAM 2 model, topics are arranged
in a three-level hierarchy with root, super-topics, and subtopics at Levels 1, 2, and 3, respectively, and words are generated by traversing the topic hierarchy and exiting at a specific level and node. In our model, we use a similar mechanism for word generation via the concept route. There is additional machinery in our model to incorporate the data-driven topics (in addition to the hierarchy of concepts) and a switching mechanism to choose the word generation process via the concept route or the topic route.

To give a formal description of the model, for each document d we introduce a ''switch'' distribution p(x|d) that determines whether a word should be generated via the topic route or the concept route. Every word token w_i in the corpus is associated with a binary switch variable x_i. If x_i = 0, the previously described standard topic mechanism is used to generate the word. That is, we first select a topic t from a document-specific mixture of topics θ^(d) and generate a word token from the word distribution associated with topic t. If x_i = 1, we generate the word token from one of the C concepts in the concept tree. To do that, we associate with each concept node c in the concept tree a document-specific multinomial distribution with dimensionality equal to N_c + 1, where N_c is the number of children of the concept node c. This distribution allows us to traverse the concept tree and exit at any of the C nodes in the tree: given that we are at a concept node c, there are N_c child concepts to choose from and an additional option to choose an ''exit'' child to exit the concept tree at concept node c. We start our walk through the concept tree at the root node and select one of its children. We repeat this process until we reach an exit node, and a word token is then generated from the parent of the exit node. Note that for a concept tree with C nodes, there are exactly C distinct ways to select a path and exit the tree, as there is only one parent for each concept node and thus one path to each of the C concepts. In the hierarchical concept–topic model, a document is represented as a weighted combination of mixtures of T topics and C paths through the concept tree, and the conditional probability of the ith word token in document d is given by

p(w_i | d) = p(x_i = 0 | d) \sum_{t=1}^{T} p(w_i | z_i = t) p(z_i = t | d)
           + p(x_i = 1 | d) \sum_{c=T+1}^{T+C} p(w_i | z_i = c) [p(exit | c, d) p(c | parent(c), d) \cdots p(root | d)]    (3)
The sequential process to generate a document collection with D documents under the hierarchical concept–topic model is as follows:

1. For each topic t ∈ {1, ..., T}, select a word distribution φ^(t) ~ Dirichlet(β_φ)
2. For each concept c ∈ {1, ..., C}, select a word distribution ψ^(c) ~ Dirichlet(β_ψ)
3. For each document d ∈ {1, ..., D}
   (a) Select a switch distribution ξ^(d) ~ Beta(γ)
   (b) Select a distribution over topics θ^(d) ~ Dirichlet(α)
   (c) For each concept c ∈ {1, ..., C}
      (i) Select a distribution over the children of c, ζ^(cd) ~ Dirichlet(τ^(c))
   (d) For each word position i in document d
      (i) Select a binary switch x_i ~ Bernoulli(ξ^(d))
      (ii) If x_i = 0
         (A) Select a topic z_i ~ Discrete(θ^(d))
         (B) Generate a word from topic z_i, w_i ~ Discrete(φ^(z_i))
      (iii) Otherwise, create a path starting at the root concept node, k_1 = 1
         (A) Select a child node k_{j+1} ~ Discrete(ζ^(k_j d)) and increment j; repeat until k_{j+1} is an exit node
         (B) Generate a word from concept c_i = k_j, w_i ~ Discrete(ψ^(c_i)), and set z_i = c_i + T

where φ^(t), ψ^(c), β_φ, and β_ψ are analogous to the corresponding symbols in the concept–topic model described in the previous section. The variable ξ^(d), where ξ^(d) = p(x|d), represents the switch distribution, and θ^(d), where θ^(d) = p(t|d), represents the distribution over topics for document d. The variable ζ^(cd) represents the multinomial distribution over children of concept node c for document d (this has dimensionality N_c + 1 to account for the additional ''exit'' child). The hyperparameters γ, α, and τ^(c) are the parameters of the priors on ξ^(d), θ^(d), and ζ^(cd), respectively. Note that α, as in the previous topic and concept–topic models, is a vector with hyperparameter values for each topic. Similarly, τ^(c) is a vector of hyperparameter values that allows for different a priori probabilities of traversing the concept tree. This allows the model to tune itself to different corpora and, for example, make it more likely to sample a path toward the science concept in a corpus of scientific documents. Fig. 2C shows the corresponding graphical model.

The generative process above is quite flexible and can handle any directed acyclic concept graph (for any nontree, there would be more than one way of reaching each concept, leading to increased complexity in the inference process). The model cannot, however, handle cycles in the concept structure, as the walk of the concept graph starting at the root node is not guaranteed to terminate at an exit node. In the hierarchical concept–topic model, the only observed information is the set of words in each document, the word–concept memberships, and the tree structure of the concepts. All remaining variables are latent and are inferred through a collapsed Gibbs sampling procedure. Details about this procedure are described by Chemudugunta et al. (2008b).
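To make the concept-route walk in this process concrete, here is a minimal Python sketch of sampling an exit concept from a toy hierarchy. The node names, tree structure, and symmetric concentration parameter are invented for illustration; they are not the CALD hierarchy or the priors τ^(c) used in the paper, and this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy concept tree: each node lists its children (the CALD tree is much larger).
children = {
    "everything": ["science", "society"],
    "science":    ["physics", "chemistry"],
    "society":    [],
    "physics":    [],
    "chemistry":  [],
}

def sample_child_distribution(node, concentration=1.0):
    """Document-specific distribution over the children of `node` plus an 'exit' option."""
    k = len(children[node]) + 1                      # children + exit
    return rng.dirichlet(np.full(k, concentration))

def sample_exit_concept(zeta):
    """Walk from the root until the 'exit' option is drawn; return the exit concept."""
    node = "everything"
    while True:
        probs = zeta[node]
        choice = rng.choice(len(probs), p=probs)
        if choice == len(probs) - 1:                 # last index plays the role of the 'exit' child
            return node
        node = children[node][choice]

# One document's transition distributions (in the full model these have Dirichlet priors tau).
zeta = {node: sample_child_distribution(node) for node in children}
print([sample_exit_concept(zeta) for _ in range(5)])
```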
3. Text and concept data

The experiments for all our simulations are based on the TASA corpus (Landauer & Dumais, 1997), consisting of D = 37,651 documents with passages excerpted from educational texts used in curricula from the first year of school to the first year of college. The documents are divided into nine different educational genres. We focus here on a subset of TASA documents classified as science, consisting of D = 5,356 documents. As mentioned previously, CALD consists of 2,183 semantic concepts. CALD groups words primarily according to semantic concepts, with the concepts hierarchically organized. The hierarchy starts with the concept everything, which splits into 17 concepts at the second level (e.g.,
science, society, general/abstract, communication). The hierarchy has up to seven levels, with each interior node splitting into a median of seven child nodes. The concepts vary in the number of words they contain, with a median of 54 word types and a maximum of 3,074. Each word can be a member of multiple concepts, especially if the word has multiple senses. We created two vocabularies. One is a W = 21,072-word vocabulary based on the intersection of the TASA and CALD vocabularies. The other is a W = 142,010-word vocabulary based on the union of the TASA and CALD vocabularies. For both vocabularies, all stop words and infrequent words were removed.
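As a small illustration of how two such vocabularies can be assembled, the sketch below builds intersection and union vocabularies from a corpus token stream and a concept lexicon. The stop-word list, frequency cutoff, and toy inputs are placeholders, not the preprocessing actually used for TASA and CALD.

```python
from collections import Counter

def build_vocabularies(corpus_tokens, concept_words, stop_words, min_count=5):
    """Return (intersection, union) vocabularies after stop-word and frequency filtering.

    corpus_tokens: iterable of tokens from the text collection
    concept_words: set of words appearing in any concept of the lexicon
    """
    counts = Counter(corpus_tokens)
    corpus_vocab = {w for w, c in counts.items() if c >= min_count and w not in stop_words}
    concept_vocab = {w for w in concept_words if w not in stop_words}
    return corpus_vocab & concept_vocab, corpus_vocab | concept_vocab

# Toy usage (real inputs would be the TASA tokens and the CALD word lists):
tokens = ["oxygen"] * 5 + ["planet"] * 5 + ["the", "zork"]
inter, union = build_vocabularies(tokens, {"oxygen", "planet", "saturn"}, {"the"}, min_count=5)
print(sorted(inter), sorted(union))
```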
4. Tagging documents

One application of concept models is to tag unlabeled documents with human-defined concepts. The tagging process involves assigning likely concepts to each word in a document, depending on the context of the document. The document content can then be summarized by the probability distribution over concepts, which reveals the dominant semantic themes. Because the concept models assign concepts at the word level, the results can be aggregated in many ways, allowing for document summaries at multiple levels of granularity. For example, tagging can be performed on snippets of text, individual sections of a document, whole documents, or even collections of documents. For all of our tagging examples, we used the intersection vocabulary (the results are qualitatively similar using the union vocabulary).

4.1. Tagging with the concept–topic model

As an illustration of how the model can be used to quickly summarize a document, Fig. 4 shows the CALD concept assignments to individual words in a TASA document. We used the concept–topic model with concepts only (T = 0). The four most likely concepts are listed for this document. For each concept, the estimated probability distribution over words is shown next to the concept (note that these estimates are over the whole corpus and are not document specific). The probability of words in concepts is not just influenced by the number of tokens across the whole corpus but also by the number of concepts that contain the word type and the relative probability between concepts in each document. For example, the model has estimated that in the conceptual context of chemical elements, the word oxygen is more likely than the word chlorine. This conditional salience is useful for evaluating the relative importance of words to specific concepts, going beyond the logical set definitions provided by the human lexicographers who developed the concepts. In the document, words assigned to the four most likely concepts are tagged with the letters a–d (and color coded if viewing in color). Words assigned to any other concept are tagged with ''o'' and words outside the vocabulary are not tagged. In the concept–topic model, the distributions over concepts within a document are highly skewed such that the probability mass is distributed over only a small number of concepts.
Fig. 4. Illustrative example of tagging a document excerpt using the concept–topic model with concepts from CALD.
In the example document, the four most likely concepts cover about 50% of all words in the document. Fig. 4 illustrates that the model correctly disambiguates between words that have several conceptual interpretations. For example, the word charged has many different meanings and appears in 20 CALD concepts. In the example document, this word is assigned to the physics concept, which is a reasonable interpretation in this document context (the word charged does not appear in the list of words associated with the concept physics because its probability falls below the threshold for visualization). Similarly, the ambiguous words current and flow are correctly assigned to the electricity concept.

4.2. Tagging with the hierarchical concept–topic model

One of the advantages of the hierarchical concept–topic model is that the hierarchical relations between concepts can be used to enhance the visualization of documents. Fig. 5 shows the result of inferring the hierarchical concept mixture for an individual TASA document using the CALD concept sets. For the hierarchy visualization, we selected the seven concepts with the highest probability and included all ancestors of these concepts when visualizing the tree (we selected seven concepts to trade off informativeness and complexity of the display). The ancestors that were not part of the top seven concepts are visualized with dashed ovals. The CALD subtree highlights the specific semantic themes of birth, breathing, and stopping breathing along with the more general themes of science and medicine. This illustration shows that the model is able to give interpretable results for an individual document at multiple levels of granularity.

At a higher level of granularity, the hierarchical concept–topic model can also summarize sets of documents.
Fig. 5. Example of a single TASA document from the science genre (A). The seven most probable concepts inferred by the hierarchical concept–topic model for this document using the CALD concepts (B). The dashed concepts are ancestor concepts of the top seven concepts that were included for visualization purposes.
Across documents, the model learns the hyperparameters associated with the transitions from concept nodes to the children of concept nodes. These hyperparameters determine how likely it is (a priori) for the model to generate a document along the path to the physics and chemistry concepts (for example). Fig. 6 shows the 20 highest probability concepts for a random subset of 200 TASA documents from the science genre. For each concept, the name of the concept is shown in all caps. The visualization also includes the ancestor nodes (shown in dashed ovals) to complete the path to the root node. The numbers in Fig. 6 represent the marginal probability for the concept. The marginal probability is computed based on the product of probabilities along the path of reaching the node as well as the probability of exiting at the node, marginalized (averaged) across all documents:

p(c) \propto \sum_{d} [p(exit | c, d) p(c | parent(c), d) \cdots p(root | d)]    (4)
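A small sketch of the marginal computation in Eq. 4, assuming the per-document path-and-exit probabilities (the bracketed product) have already been inferred. The array layout and the choice to normalize over concepts are ours for illustration; Eq. 4 only specifies the quantity up to a proportionality constant.

```python
import numpy as np

def marginal_concept_probs(path_probs):
    """path_probs: array of shape (D, C); entry (d, c) is the probability of the full
    path root -> ... -> c followed by the exit at c, for document d.
    Returns corpus-level marginals p(c), summed over documents and normalized over concepts."""
    summed = np.asarray(path_probs).sum(axis=0)   # sum over documents, as in Eq. 4
    return summed / summed.sum()

# Toy example with 3 documents and 4 concepts (invented numbers):
toy = [[0.5, 0.2, 0.2, 0.1],
       [0.1, 0.6, 0.2, 0.1],
       [0.3, 0.3, 0.3, 0.1]]
print(marginal_concept_probs(toy))
```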
Many of the most likely concepts as inferred by the model relate to specific science concepts (e.g., geography, astronomy, chemistry). These concepts also all fall under the general science concept, which is itself one of the most likely concepts for this document collection. Therefore, the model is able to summarize the semantic themes in a set of documents at multiple levels of granularity. In the original CALD concept set, each concept consists of a set of words, and no knowledge is provided about the prominence, frequency, or representativeness of words within the concept. In the hierarchical concept–topic model, for each concept a distribution over words is inferred that is tuned to the specific collection of documents. For example, for the concept astronomy, the word planet receives much higher probability than the words Saturn or equinox, all of which are members of the concept.
Fig. 6. Visualization of the marginal concept distributions from the hierarchical concept–topic model learned on science documents using CALD concepts. The 20 most likely concepts are shown, including the five ancestor nodes (shown in dashed ovals) needed to complete the path to the root node.
These differences in word probabilities highlight the ability of the model to adapt to variations in word usage across document collections.
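One simple way to obtain such concept-specific word probabilities is smoothed counting of the word-to-concept assignments produced by the sampler; the sketch below is an illustration under that assumption, with an invented assignment format and toy data, not necessarily the exact estimator used in the paper.

```python
from collections import Counter, defaultdict

def concept_word_distributions(assignments, concept_vocab, beta=0.01):
    """assignments: iterable of (word, concept) pairs taken from the z assignments.
    concept_vocab: dict mapping concept -> set of member words.
    Returns dict concept -> dict word -> p(word | concept), smoothed within the member set."""
    counts = defaultdict(Counter)
    for word, concept in assignments:
        counts[concept][word] += 1
    dists = {}
    for concept, members in concept_vocab.items():
        total = sum(counts[concept][w] for w in members) + beta * len(members)
        dists[concept] = {w: (counts[concept][w] + beta) / total for w in members}
    return dists

# Toy usage with an invented 'astronomy' concept:
toy_assignments = [("planet", "astronomy")] * 8 + [("saturn", "astronomy")] * 1
dists = concept_word_distributions(toy_assignments, {"astronomy": {"planet", "saturn", "equinox"}})
print(sorted(dists["astronomy"].items(), key=lambda kv: -kv[1]))
```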
5. Generalization performance

In the previous section, the tagging illustrations provided a qualitative assessment of concept–topic models. To get a more quantitative evaluation, we assess the performance of the topic model, the concept–topic model, and the hierarchical concept–topic model by evaluating their capability to explain new documents that the model has not been trained on. The idea is that models that are trained on documents of a certain genre should generalize to new documents from the same genre. One formal way to assess generalization performance is through perplexity. Perplexity is a quantitative measure for comparing language models (Brown, deSouza, Mercer, Della Pietra, & Lai, 1992) and is widely used to compare the predictive performance of topic models (e.g., Blei et al., 2003; Wallach et al., 2009). Although perplexity does not directly measure aspects of a model such as interpretability or coverage, it is nonetheless a useful general predictive metric for assessing the quality of a topic model. Perplexity is equivalent to the inverse of the geometric mean of the per-token likelihood of the held-out data. The perplexity of a collection of test documents given the training set is defined as:
Perp(w_test | D_train) = exp( - \frac{\sum_{d=1}^{D_{test}} \log p(w_d | D_train)}{\sum_{d=1}^{D_{test}} N_d} )    (5)
where w_test is the set of word tokens in the test documents, w_d is the set of word tokens in document d of the test set, D_train is the training set, and N_d is the number of word tokens in document d. Lower perplexity scores indicate that the model's predicted distribution of held-out data is closer to the true distribution. The experiments in this section are again based on the TASA data set. We train the models on a random subset of 90% of the documents classified as science, creating a training set of D = 4,820 documents. By training the models, we obtain estimates for the word–topic distributions, the topic–document distributions, the assignments of word tokens to topics and concepts, as well as the hyperparameters on the topic–document distributions (for all models, asymmetric Dirichlet priors were used for the document-specific topic distributions). Note that in all reported simulations, we used the intersection vocabulary (the results are qualitatively similar using the union vocabulary). We then evaluate generalization performance on the remaining documents in the science genre and also on a subset of documents classified as social studies. By testing on science and social studies documents, we evaluate the models' ability to generalize either within the same genre or between genres. For each test document, we used a random 50% of the words of the document to estimate document-specific distributions and measured perplexity on the remaining 50% of the words using the estimated distributions. More details about the perplexity computation are provided in Appendix B of Chemudugunta et al. (2008b).

5.1. Perplexity comparison across models

We compare the perplexity of the topic model (TM), the concept–topic model (CTM), and the hierarchical concept–topic model (HCTM) trained on document sets from the science genre of the TASA collection and using concepts from CALD. Fig. 7A,B shows the perplexity of TM, CTM, and HCTM as a function of the number of data-driven topics T. Panel (A) shows the results when the model is trained and tested on science documents. Panel (B) shows the results when the model is trained on science documents and tested on social studies documents. The point T = 0 indicates that there are no topics used in the model. The results clearly indicate that incorporating concepts greatly improves the perplexity of the models (lower perplexity indicates better predictive performance). Both CTM and HCTM outperform TM, which does not rely on human-defined concepts. The results also show that human-defined concepts by themselves (i.e., the perplexity obtained when the number of learned topics T = 0) are not sufficient to get the best generalization performance; additional learned topics that are tuned to the specific content of the document collection are needed for optimal performance (around 100–300 learned topics). One important point to note is that the improved performance of the concept models is not due to the higher number of word distributions T + C, compared with the topic model that utilizes only T topics. In fact, even with T = 2,000 topics, TM does not improve its perplexity and even shows signs of deterioration in quality.
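For concreteness, the following is a direct transcription of Eq. 5 above, assuming the per-document held-out log-likelihoods have already been computed; the function name and the numbers in the toy call are invented for illustration.

```python
import numpy as np

def perplexity(log_likelihoods, num_tokens):
    """Eq. 5: exponentiate the negative total held-out log-likelihood divided by the total
    number of held-out word tokens.

    log_likelihoods: log p(w_d | D_train) for each test document d
    num_tokens: number of held-out word tokens N_d in each test document d
    """
    return float(np.exp(-np.sum(log_likelihoods) / np.sum(num_tokens)))

# Toy example with three test documents (invented values); lower is better.
print(perplexity(log_likelihoods=[-700.0, -450.0, -1200.0], num_tokens=[100, 60, 180]))
```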
Fig. 7. Comparing perplexity for topic model, concept–topic model, and the hierarchical concept–topic model as a function of number of topics (A–B) and percentage of training documents (C–D). Panels (A) and (C) show the results when the model is trained and tested on documents from the science genre. Panels (B) and (D) show the results when the model is trained on documents from the science genre, but tested on documents from the social studies genre.
We next look at the effect of varying the amount of training data for all models. Fig. 7C shows the results when the model is trained and tested on science documents. Fig. 7D shows the results when the model is trained on science documents and tested on social studies documents. When there is very little training data (e.g., up to 500 documents), both concept– topic models significantly outperform the topic model. Because learned topics in TM are entirely data driven, there is not enough statistical information to build accurate representations on the basis of just a few hundred documents (in the extreme case where there is no training data available, topics of TM will just be uniform distributions and prediction will be at chance). In the regime of little training data, however, the concept models can leverage the human-defined concepts, providing a priori structure to the learning algorithm.
Of the two concept models, HCTM outperforms CTM when little training data are available (see Fig. 7C,D) or when the model generalizes to documents from a different genre (see Fig. 7B,D). HCTM and CTM rely on the same set of human-defined concepts, but HCTM also imposes hierarchical constraints on these concepts. For example, the concept model needs no documents to learn that physics and chemistry are related concepts, since this knowledge is already built in. Therefore, if a document appears to be about physics, the model predicts with small probability that chemistry words can appear in the document. These a priori concept relations are clearly useful when little data are available but are relatively less beneficial with larger amounts of training data. In that scenario, the concepts and topics can be fine-tuned to the data, and the difference in performance between flat concept–topic representations and hierarchical concept representations is less pronounced. When generalizing to new kinds of documents (e.g., when training on science documents and testing on social studies documents), the hierarchical concept–topic model outperforms the concept–topic model regardless of the amount of training data. In this case, the learned knowledge is less useful, and the a priori structure in the hierarchical relations between concepts provides necessary constraints on the inference process.
6. Relation between learned topics and concepts Both the concept–topic and hierarchical concept–topic models allow for a combination of concepts and learned topics. These learned topics are useful to identify different gaps in the existing concept sets and capture semantic themes beyond those covered in the concepts. Such learned topics will depend on the background corpus. To get a better understanding of the kind of information captured by these models, we applied the concept–topic model to the TASA documents in the science genre. We set the number of learned topics T = 50. In one simulation, we ran the model using the union vocabulary that combines words from the TASA corpus and CALD concepts. Importantly, the union vocabulary includes words that are not part of CALD. Because the model gives zero probability for such words under any concepts, these words have to be modeled by the learned topics. Fig. 8A shows examples of topics learned by the model. The words not covered by CALD are shown in bold. The learned topics clearly capture many of the words in the corpus that are not part of CALD, including names of people (Darwin), technical words (axon), but also some words such as later and easiest that one would expect to be present in a thesaurus. Note that the reason words such as later and easiest are excluded in CALD is not because CALD lists only root word forms (related word forms are encoded in the database). These words appear to be genuine omissions in CALD that the concept–topic model handled by learned topics. We also ran a simulation with the concept–topic model using the intersection vocabulary that includes words only present in both the TASA corpus and CALD database. Fig. 8B shows some example topics learned by the model. By definition, all words shown in these topics are members of some concepts so the concept–topic model is able to explain these words by first selecting a concept and then a word from a concept. These learned topics focus on word correlations that are not currently captured by the concepts. For example, the
[Fig. 8 appears here: two panels of example learned topics, each topic listed as a ranked column of words with their probabilities.]

Fig. 8. Examples of learned topics for the CTM model. Panel (A) illustrates a simulation using the union vocabulary, which includes words that are part of the TASA corpus but not part of the CALD vocabulary (these words are shown in bold). Panel (B) illustrates topics from a simulation on the intersection vocabulary, which includes only words present in both TASA and CALD.
For example, the first two words in the left-most topic, universe and galaxy, are clearly related but are not members of the same concept. Similarly, hypothesis and scientific, as well as pollution and chemicals, are word pairs that are not members of the same concept but often co-occur in documents. The learned topics can be used to capture such correlations.
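To make the two vocabulary conditions concrete, here is a small illustration (the word sets below are made up for the example and are not the actual TASA or CALD vocabularies) of how the union and intersection vocabularies are formed, and why words outside CALD can only be absorbed by learned topics:

# Toy word sets; illustrative stand-ins, not the real TASA or CALD vocabularies.
tasa_vocab = {"darwin", "evolution", "axon", "later", "easiest", "water", "galaxy"}
cald_vocab = {"evolution", "water", "galaxy"}   # union of all words appearing in CALD concepts

union_vocab = tasa_vocab | cald_vocab           # first simulation: corpus words plus concept words
intersection_vocab = tasa_vocab & cald_vocab    # second simulation: only words in both sources

# In the union condition, every concept assigns zero probability to these words,
# so the model can only explain them with learned topics.
outside_cald = union_vocab - cald_vocab
print(sorted(outside_cald))  # ['axon', 'darwin', 'easiest', 'later']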
7. Expanding existing concepts with new words

One of the biggest disadvantages of utilizing human-defined knowledge databases is the large amount of manual effort involved in building and updating them.
For example, the lexicographers of the CALD database have to continually update the concepts to include words that have changed meaning or to insert entirely new concepts. In addition, the CALD database has to be checked for human errors, which might be difficult to detect manually. One way to test the utility of the concept model is to see whether it can automatically identify omissions within human-defined concepts, that is, words that should be in a concept but have been omitted. In the previous section, we showed how such words can become part of learned topics. In this section, we test whether a concept–topic model can learn to expand existing concepts with words that appear to have been omitted from them.

In our simulation approach, we removed selected words from the CALD concepts and tested how well a concept–topic model could use the TASA corpus to identify which concept they should be associated with. We evaluate only the concept–topic model with no learned topics (i.e., T = 0) on this concept-recovery task (we expect that the hierarchical concept–topic model would give similar results). As a baseline, we could compare the model against a number of existing methods such as LDA or LSA; for simplicity, we focus here on LSA. We computed the singular value decomposition of the document–word co-occurrence matrix for the TASA corpus and projected the terms onto an N-dimensional space. Given a test word, we return a ranking of the concepts determined by the average similarity between the test word and the m closest words in each concept, where similarity is measured by the cosine. We experimented with different values of N and m, and report our results for N = 150 and m = 5 (results were relatively insensitive to the exact values used).

We compiled a list of 152 terms from the corpus, where each term was a member of only one concept. The concept had to be well represented by the corpus, that is, at least 60% of the words in the concept were present in the corpus. Furthermore, the term had to be a significant member of the concept, that is, the term had the highest corpus frequency among all the terms in the same human-defined concept. In the concept–topic model, if a word is not included in the set of concepts to begin with, the model is unaware of it. So, for the purposes of this experiment, we ''removed'' a word by placing it in all 2,183 concept sets; in effect, this tells the model that the word exists but gives it no clue about which concept the word belongs to. After training the model, we can simply count how often a word is assigned to a particular concept (via the z assignments) to produce a ranked list of concepts for that word.

Fig. 9 shows an example of the rankings returned by the concept–topic model for three test words. For each removed word, the figure shows the top five ranked concepts. We label each concept with one of four letters: M(atch) indicates a match to the target concept, P(arent) indicates a concept on the path from the root concept to the target concept, C(hild) indicates a concept in the subtree rooted at the target concept, and O(ther) indicates any other concept. In the example, the model is able to rank the target concept as the first or second ranked concept, and even the highly ranked mismatching concepts are often quite reasonable target concepts.
For example, the word soot has strong associations with the concepts cleaning and tidying places and things and dirt and untidiness, which are semantically related to soot but did not originally contain the removed word.
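As an illustration of the LSA baseline described above, the sketch below ranks concepts for a test word by the mean cosine to its m closest member words in the latent space. It is our own sketch rather than the authors' implementation; term_vectors (word vectors from the N = 150-dimensional SVD projection) and concept_members (a mapping from concept names to member words) are assumed inputs.

import numpy as np

def rank_concepts(test_word, term_vectors, concept_members, m=5):
    """Rank concepts by the mean cosine between the test word and the m
    closest member words of each concept in the latent space."""
    v = term_vectors[test_word]
    v = v / np.linalg.norm(v)
    scores = {}
    for concept, members in concept_members.items():
        sims = []
        for w in members:
            if w == test_word or w not in term_vectors:
                continue
            u = term_vectors[w]
            sims.append(float(v @ (u / np.linalg.norm(u))))
        if sims:
            # average over the m most similar member words
            scores[concept] = np.mean(sorted(sims, reverse=True)[:m])
    return sorted(scores, key=scores.get, reverse=True)

The concept–topic model's own ranking is even simpler to read off once the model is trained: count how often the test word's tokens are assigned to each concept via the z variables, and sort the concepts by that count.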
Fig. 9. Example of rankings by concept–topic model in word recovery task.
Note that the concept–topic model is also able to identify multiple meanings of a given word. For example, the word directions can refer to north, south, east, and west (points of the compass) as well as to a set of instructions (places and locations). Also note that many words in the CALD concepts are classified according to their definition (e.g., soot is a product of combustion) rather than their descriptive qualities (e.g., soot causes dirtiness and soot is an environmental issue). Table 1 shows the overall results for the concept–topic model and latent semantic analysis. The table shows the probability that a concept is ranked in the top K returned concepts (precision), as a target concept, parent concept, or child concept. The results show that the concept–topic model outperforms the latent semantic analysis approach in this concept-recovery task. The concept–topic model is often able to rank the target concept (out of 2,183 concepts) in the top 10 or 20. For both the concept–topic model and the latent semantic analysis approach, the parent or child of the target concept also appears more often than expected by chance in the top 10 or 20 concepts, indicating that these models are able to recover more specific as well as more general concepts related to the novel word.

Table 1
Precision results for the concept-recovery task

                     LSA    Concept–Topic
Top 10 concepts
  Target             .45    .53
  Parent             .22    .30
  Child              .05    .05
Top 20 concepts
  Target             .55    .57
  Parent             .15    .26
  Child              .04    .05

8. Discussion

Although most of the earlier work on topic modeling is purely data driven, in that no human knowledge is used in learning the topic model, there are some exceptions.
Boyd-Graber, Blei, and Zhu (2007) develop a topic modeling framework that combines human-derived linguistic knowledge with unsupervised topic models for the purpose of word-sense disambiguation. Andrzejewski, Zhu, and Craven (2009) recently introduced an iterative topic modeling process in which a human can inspect the topics and specify which words should have high probability in a topic and which words should not appear together in a topic. By replacing the multinomial distribution over words in a topic with a Dirichlet forest prior, the knowledge expressed by the human can be taken into account in the next iteration of the topic modeling process. Wei and Croft (2007) use manually built topics, constructed from documents and categories in the Open Directory Project, for information retrieval; the manual topics are built by aggregating documents for selected categories and normalizing the word counts of the associated documents to obtain probability distributions. Topic modeling has also been used for finding mappings between ontology pairs (Spiliopoulos, Vouros, & Karkaletsis, 2007). The work of Ifrim and Weikum (2006) and Bundschus, Dejori, Yu, Tresp, and Kriegel (2008) combines topics and concepts for the purposes of text classification. Our framework is somewhat more general in that we not only improve the quality of predictions on text data by using prior human concepts but are also able to make inferences in the reverse direction, about concept words and concept hierarchies, given data. In addition, our concept–topic models do not require labeled data.

Although topic modeling has also been used to semi-automatically build taxonomies from data (e.g., Dietz & Stewart, 2006; Zavitsanos, Paliouras, Vouros, & Petridis, 2007), these approaches do not make use of existing ontologies. There is also a significant amount of prior work on using data to help with ontology construction, evaluation, and document tagging, such as learning ontologies from text data (e.g., Maedche & Staab, 2001), methodologies for evaluating how well ontologies are matched to specific text corpora (Alani & Brewster, 2006; Brewster, Alani, Dasmahapatra, & Wilks, 2004), and systems for tagging documents with semantic concepts using word-level matching techniques (Dill et al., 2003). Our work is broader in scope in that we propose general-purpose probabilistic models that combine concepts and topics within a single framework, allowing us to use the data to make inferences about, for example, how documents and concepts are related. It should be noted that in the work reviewed in this paper we do not explicitly investigate techniques for modifying an ontology in a data-driven manner (e.g., adding/deleting words from concepts
or relationships among concepts); however, the framework we propose could certainly be used as a basis for exploring such ideas.

There are several potentially useful directions in which the hierarchical concept–topic model can be extended. One interesting extension is to substitute the Dirichlet prior on the concepts with a Dirichlet process prior, in which each concept has a potentially infinite number of children, a finite number of which are observed at any given time (e.g., Teh et al., 2006). When we do a random walk through the concept hierarchy to generate a word, we would then have an additional option to create a child topic and generate a word from that topic. There would be no need for the switching mechanism, as data-driven topics would be part of the concept hierarchy itself. Such a model would allow us to add new topics to an existing concept hierarchy and could potentially be useful in building a recommender system for updating concept ontologies. An alternative direction would be to introduce additional machinery in the generative model to handle different aspects of transitions through the concept hierarchy. In HCTM, we currently learn one set of path correlations for the entire corpus (captured by the Dirichlet parameters s in HCTM). It would be interesting to introduce another latent variable to model multiple sets of path correlations. Under this extension, documents from different genres could learn different path correlations (similar to the work of Boyd-Graber et al., 2007). For example, scientific documents could prefer paths involving scientific concepts, while humanities documents could prefer a different set of path correlations when the two are modeled together. A model of this type would also be able to make use of class labels of documents if available.
9. Conclusions

We have proposed a probabilistic framework for combining data-driven topics and semantically rich human-defined concepts. We first introduced the concept–topic model, a straightforward extension of the topic model that incorporates human-defined semantic concepts into the topic modeling framework. The model represents documents as a mixture of topics and concepts, thereby allowing us to describe documents using the semantically rich concepts. We then extended this model to the hierarchical concept–topic model, which incorporates the concept hierarchy into the generative model by modeling the parent–child relationships in the hierarchy. Our experimental results show that the semantic concepts significantly improve the quality of the resulting models. Modeling concepts and their associated hierarchies appears to be particularly useful when training data are limited; the hierarchical concept–topic model has the best predictive performance overall in this regime.

We view the current set of models as a starting point for exploring more expressive generative models that can potentially have wide-ranging applications, particularly in document modeling and tagging, ontology modeling and refinement, and information retrieval. In addition, these models can be used to expand current cognitive science frameworks for characterizing human learning of semantic information.
Many existing models in cognitive science explain how a human learner extracts semantic information from a single source of information: statistical co-occurrence between words and documents. The current set of models suggests that the learning process in such models can be enhanced when additional background information is available. For example, a human learner might already be familiar with certain concepts (and the relations between them), and exposure to new statistical information, such as word–document co-occurrences, serves to refine existing concepts or perhaps learn new ones.
Acknowledgments

This material is based upon work supported in part by the National Science Foundation under Award Number IIS-0083489, and by the Office of Naval Research under Award Number N00014-08-1-1015. We are extremely grateful to three anonymous reviewers, and to Danielle McNamara and Simon Dennis, for their very helpful comments on an earlier version of this article. We thank America Holloway for assistance in providing the results in Section 7.
References

Alani, H., & Brewster, C. (2006). Metrics for ranking ontologies. 4th International EON Workshop, 15th International World Wide Web Conference. New York: ACM.
Andrzejewski, D., Zhu, X., & Craven, M. (2009). Incorporating domain knowledge into topic modeling via Dirichlet forest priors. The 26th International Conference on Machine Learning (ICML). New York: ACM.
Blei, D. M., Griffiths, T. L., Jordan, M. I., & Tenenbaum, J. B. (2003). Hierarchical topic models and the nested Chinese restaurant process. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press.
Blei, D., & Jordan, M. (2003). Modeling annotated data. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 127–134). New York: ACM.
Blei, D., & Lafferty, J. (2006). Correlated topic models. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in Neural Information Processing Systems 18 (pp. 147–154). Cambridge, MA: MIT Press.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Boyd-Graber, J., Blei, D., & Zhu, X. (2007). A topic model for word sense disambiguation. Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (pp. 1024–1033). New York: ACM.
Brewster, C., Alani, H., Dasmahapatra, S., & Wilks, Y. (2004). Data driven ontology evaluation. International Conference on Language Resources and Evaluation. Paris, France.
Brown, P. F., deSouza, P. V., Mercer, R. L., Della Pietra, V. J., & Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4), 467–479.
Bundschus, M., Dejori, M., Yu, S., Tresp, V., & Kriegel, H. (2008). Statistical modeling of medical indexing processes for biomedical knowledge information discovery from text. Proceedings of the 8th International Workshop on Data Mining in Bioinformatics (BIOKDD). New York: ACM.
Buntine, W. L., & Jakulin, A. (2004). Applying discrete PCA in data analysis. In M. Chickering & J. Halpern (Eds.), Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (pp. 59–66). San Francisco, CA: Morgan Kaufmann Publishers.
Chater, N., & Manning, C. (2006). Probabilistic models of language processing and acquisition. Trends in Cognitive Sciences, 10(7), 335–344.
Chemudugunta, C., Holloway, A., Smyth, P., & Steyvers, M. (2008a). Modeling documents by combining semantic concepts with unsupervised statistical learning. 7th International Semantic Web Conference (pp. 229–244). Berlin: Springer-Verlag.
Chemudugunta, C., Smyth, P., & Steyvers, M. (2008b). Combining concept hierarchies and statistical topic models. ACM 17th Conference on Information and Knowledge Management. New York: ACM.
Chemudugunta, C., Smyth, P., & Steyvers, M. (2008c). Text modeling using unsupervised topic models and concept hierarchies. Technical Report. Available at: http://arxiv.org/abs/0808.0973. Accessed on May 16, 2010.
Cutting, D. R., Karger, D., Pedersen, J. O., & Tukey, J. W. (1992). Scatter/Gather: A cluster-based approach to browsing large document collections. Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 318–329). New York: ACM Press.
Dennis, S. (2004). An unsupervised method for the extraction of propositional information from text. Proceedings of the National Academy of Sciences, 101, 5206–5213.
Dietz, L., & Stewart, A. (2006). Utilize probabilistic topic models to enrich knowledge bases. Proceedings of the ESWC 2006 Workshop on Mastering the Gap: From Information Extraction to Semantic Representation.
Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., Rajagopalan, S., Tomkins, A., Tomlin, J., & Zien, J. (2003). SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation. Proceedings of the 12th International Conference on World Wide Web (pp. 178–186). New York: ACM.
Fellbaum, C. (Ed.) (1998). WordNet, an electronic lexical database. Cambridge, MA: MIT Press.
Fiser, J., & Aslin, R. N. (2005). Encoding multi-element scenes: Statistical learning of visual feature hierarchies. Journal of Experimental Psychology: General, 134, 521–537.
Foltz, P. W., Gilliam, S., & Kendall, S. (2000). Supporting content-based feedback in online writing evaluation with LSA. Interactive Learning Environments, 8, 111–129.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228–5235.
Griffiths, T. L., Steyvers, M., Blei, D. M., & Tenenbaum, J. B. (2005). Integrating topics and syntax. In L. K. Saul (Ed.), Advances in Neural Information Processing Systems 17 (pp. 537–544). Cambridge, MA: MIT Press.
Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114(2), 211–244.
Havasi, C., Speer, R., & Alonso, J. (2007). ConceptNet 3: A flexible, multilingual semantic network for common sense knowledge. Proceedings of Recent Advances in Natural Language Processing.
Hofmann, T. (1999). Probabilistic latent semantic indexing. Proceedings of the 22nd Annual ACM Conference on Research and Development in Information Retrieval (pp. 50–57). New York: ACM Press.
Ifrim, G., & Weikum, G. (2006). Transductive learning for text classification. 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 223–234). Berlin, Germany.
Jones, M. N., & Mewhort, D. J. K. (2007). Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114, 1–37.
Kemp, C., & Tenenbaum, J. B. (2008). The discovery of structural form. Proceedings of the National Academy of Sciences, 105(31), 10687–10692.
Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., & Saarela, A. (2000). Self organization of a massive document collection. IEEE Transactions on Neural Networks (Special Issue on Neural Networks for Data Mining and Knowledge Discovery), 11(3), 574–585.
Lagus, K., Honkela, T., Kaski, S., & Kohonen, T. (1999). WEBSOM for textual data mining. Artificial Intelligence Review, 13(5–6), 345–364.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to latent semantic analysis. Discourse Processes, 25, 259–284.
Lenat, D. B., & Guha, R. V. (1989). Building large knowledge-based systems: Representation and inference in the Cyc project. Reading, MA: Addison-Wesley.
Li, W., Blei, D., & McCallum, A. (2007). Nonparametric Bayes pachinko allocation. Conference on Uncertainty in Artificial Intelligence (UAI). Corvallis, OR: AUAI Press.
Maedche, A., & Staab, S. (2001). Ontology learning for the semantic web. IEEE Intelligent Systems, 16(2), 72–79.
McCallum, A., Nigam, K., & Ungar, L. H. (2000). Efficient clustering of high-dimensional data sets with application to reference matching. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 169–178). New York: ACM Press.
McRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37, 547–559.
Mei, Q., Shen, X., & Zhai, C. (2007). Automatic labeling of multinomial topic models. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 490–499). New York: ACM Press.
Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990). Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 235–244.
Mimno, D. M., Li, W., & McCallum, A. (2007). Mixtures of hierarchical topics with pachinko allocation. International Conference on Machine Learning (ICML) (pp. 633–640). Corvallis, OR.
Minka, T. P. (2000). Estimating a Dirichlet distribution. Technical report. Cambridge, MA: Massachusetts Institute of Technology.
Najemnik, J., & Geisler, W. S. (2005). Optimal eye movement strategies in visual search. Nature, 434, 387–391.
Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (1998). The University of South Florida word association, rhyme, and word fragment norms. Available at: http://www.usf.edu/FreeAssociation/.
Newman, D., Chemudugunta, C., Smyth, P., & Steyvers, M. (2006). Analyzing entities and topics in news articles using statistical topic models. Springer Lecture Notes in Computer Science (LNCS) series—IEEE International Conference on Intelligence and Security Informatics. Berlin: Springer-Verlag.
Newport, E. L., & Aslin, R. N. (2004). Learning at a distance: I. Statistical learning of non-adjacent dependencies. Cognitive Psychology, 48, 127–162.
Newport, E. L., Hauser, M. D., Spaepen, G., & Aslin, R. N. (2004). Learning at a distance: II. Statistical learning of non-adjacent dependencies in a non-human primate. Cognitive Psychology, 49, 85–117.
Panton, K., Matuszek, C., Lenat, D., Schneider, D., Witbrock, M., Siegel, N., & Shepard, B. (2006). Common sense reasoning—From Cyc to intelligent assistant. Lecture Notes in Computer Science, 3864, 1–31.
Popescul, A., Ungar, L. H., Flake, G. W., Lawrence, S., & Giles, C. L. (2000). Clustering and identifying temporal trends in document databases. In Proceedings of the IEEE Advances in Digital Libraries (pp. 173–182). Los Alamitos, CA: IEEE Computer Society.
Roget, P. M. (1911). Roget's thesaurus of English words and phrases. New York: Thomas Y. Crowell.
Ruts, W., De Deyne, S., Ameel, E., Vanpaemel, W., Verbeemen, T., & Storms, G. (2004). Flemish norm data for 13 natural concepts and 343 exemplars. Behavior Research Methods, Instruments, and Computers, 36, 506–515.
Spiliopoulos, V., Vouros, G., & Karkaletsis, V. (2007). Mapping ontologies elements using features in a latent space. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (pp. 457–460). Washington, DC: IEEE Computer Society.
Steyvers, M., & Griffiths, T. L. (2007). Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 427–448). Mahwah, NJ: Erlbaum.
Steyvers, M., & Griffiths, T. L. (2008). Rational analysis as a link between human memory and information retrieval. In N. Chater & M. Oaksford (Eds.), The probabilistic mind: Prospects from rational models of cognition (pp. 327–347). Oxford, England: Oxford University Press.
Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic author-topic models for information discovery. In W. Kim, R. Kohavi, J. Gehrke, & W. DuMouchel (Eds.), The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 306–315). New York: ACM.
Teh, Y. W., Jordan, M., Beal, M., & Blei, D. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581.
Wallach, H. (2006). Topic modeling: Beyond bag-of-words. In W. Cohen & A. Moore (Eds.), Proceedings of the 23rd International Conference on Machine Learning (pp. 977–984). Pittsburgh, PA.
Wallach, H. M., Murray, I., Salakhutdinov, R., & Mimno, D. (2009). Evaluation methods for topic models. In A. P. Danyluk, L. Bottou, & M. L. Littman (Eds.), Proceedings of the 26th International Conference on Machine Learning (pp. 1105–1112). New York: ACM.
Wang, C., Blei, D., & Heckerman, D. (2008). Continuous time dynamic topic models. In D. McAllester & A. Nicholson (Eds.), Uncertainty in Artificial Intelligence (pp. 579–586). Corvallis, OR: AUAI Press.
Wei, X., & Croft, W. B. (2007). Investigating retrieval performance with manually-built topic models. Proceedings of the 8th Large-Scale Semantic Access to Content (Text, Image, Video and Sound) Conference (RIAO'07). Paris, France.
Yu, C., Ballard, D. H., & Aslin, R. N. (2005). The role of embodied intention in early lexical acquisition. Cognitive Science, 29, 961–1005.
Zavitsanos, E., Paliouras, G., Vouros, G. A., & Petridis, S. (2007). Discovering subsumption hierarchies of ontology concepts for text corpora. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (pp. 402–408). Washington, DC: IEEE Computer Society.
Topics in Cognitive Science 3 (2011) 48–73 Copyright © 2010 Cognitive Science Society, Inc. All rights reserved. ISSN: 1756-8757 print / 1756-8765 online DOI: 10.1111/j.1756-8765.2010.01112.x
Constructing Semantic Representations From a Gradually Changing Representation of Temporal Context Marc W. Howard, Karthik H. Shankar, Udaya K. K. Jagadisan Department of Psychology, Syracuse University Received 26 January 2009; received in revised form 2 November 2009; accepted 5 November 2009
Abstract

Computational models of semantic memory exploit information about co-occurrences of words in naturally occurring text to extract information about the meaning of the words that are present in the language. Such models implicitly specify a representation of temporal context. Depending on the model, words are said to have occurred in the same context if they are presented within a moving window, within the same sentence, or within the same document. The temporal context model (TCM), which specifies a particular definition of temporal context, has proved useful in the study of episodic memory. The predictive temporal context model (pTCM) uses the same definition of temporal context to generate semantic memory representations. Taken together, pTCM and TCM may prove to be part of a general model of declarative memory.

Keywords: Episodic memory; Semantic memory; Mathematical modeling; Computational modeling
Correspondence should be sent to Marc Howard, Department of Psychology, Syracuse University, 430 Huntington Hall, Syracuse, NY 13244. E-mail: [email protected]

1. Introduction

The importance of temporal context in learning the meaning of words has long been central to our understanding of the acquisition of word meaning. Contemporary computational models of semantic memory exploit this basic idea. However, the definitions of temporal context they use are inconsistent with one another and often not theoretically motivated. For instance, in the BEAGLE model (Jones & Mewhort, 2007), the semantic representation of a word is the weighted average of all other word vectors that were presented in the same sentence as the word. In BEAGLE, temporal context is operationalized as being constant within a sentence but changing completely between sentences. That is, words within the same sentence are in the same temporal context, but words in adjacent sentences are in completely different temporal contexts. Similarly, in latent semantic analysis (LSA)
and the topic model (Griffiths, Steyvers, & Tenenbaum, 2007; Landauer & Dumais, 1997), a word × document matrix is the starting point for the calculations. This implies a representation of temporal context that is constant within a document but that changes completely between documents.1 These approaches (BEAGLE, LSA, and the topic model) share the assumption that temporal context is a categorical variable, but they differ in the time scale associated with the rate of change of temporal context. The fact that temporal context is only implicitly defined by these (and related) models makes the task of comparing the models, which vary on a number of other dimensions as well, considerably more difficult.

The basic strategy of the research program described here is to use an explicit representation of temporal context, inherited from work on episodic memory, as a starting point for developing a computational semantic model. We will first briefly describe temporal context as defined by the temporal context model (TCM; Howard & Kahana, 2002; Howard, Fotedar, Datey, & Hasselmo, 2005; Sederberg, Howard, & Kahana, 2008). We will then describe how retrieval of temporal context can function to efficiently extract relationships between stimuli. Next, we describe the predictive temporal context model (pTCM; Shankar, Jagadisan, & Howard, 2009) as a solution for how to efficiently extract the meanings of words embedded in natural sequences. We then present evidence that pTCM can provide a useful description of information extracted from natural text. We close by describing several significant challenges that remain.

1.1. Temporal context in episodic memory

The initial goal of TCM was to account for the recency and contiguity effects observed in episodic recall tasks. The recency effect refers to the finding that, all other things being equal, memory is better for more recently experienced information. The contiguity effect refers to the finding that, all other things being equal, items experienced close together in time become associated such that when one comes to mind it tends to bring the other to mind as well. The contiguity effect has been extensively studied in episodic recall tasks, where it exhibits a characteristic asymmetry (see Kahana, Howard, & Polyn, 2008a, for a review). Somewhat surprisingly, the contiguity effect, like the recency effect, persists over relatively long time scales, extending at least hundreds of seconds (Howard, Youker, & Venkatadass, 2008b). Similarly, the contiguity effect is observed in the very earliest stages of immediate free recall (Howard, Venkatadass, Norman, & Kahana, 2007), a prediction unique to TCM among models of the recency effect.

Table 1 summarizes verbally the assumptions that constitute TCM. In TCM, episodic recall proceeds by cuing with the current state of a distributed representation of temporal context.

Table 1
Principles of operation of the temporal context model (TCM)
1. Temporal context changes gradually over time
2. Items are cued by a state of context to the extent it overlaps with their encoding contexts
3. Presentation of items causes a change in the state of context
4. Repeated/recalled items can recover the state of context in which they were previously studied
This state of context changes gradually over time. Studied items are activated by a context cue to the extent that it overlaps with the state of context in which they were studied. The recency effect results from the combination of these two properties: after study of a list, items presented more recently in time were encoded in states of context that more strongly resemble the probe context. The idea that a gradually changing memory signal contributes to forgetting is not unique to TCM; it has a long history in the mathematical psychology of learning and memory (e.g., Anderson & Bower, 1972; Estes, 1955; see also Brown, Neath, & Chater, 2007; Mensink & Raaijmakers, 1988; Murdock, 1997). TCM builds on these models, but it makes the additional assumption that contextual drift is caused by the items themselves. This assumption enables the model to account for the contiguity effect in episodic memory (Howard, Kahana, & Wingfield, 2006; Polyn, Norman, & Kahana, 2009a; Sederberg et al., 2008). Because the input to the context vector is caused by items, repetition of an item causes the state of context to change to a state similar to the one that prevailed during study of the neighbors of the original presentation of the repeated item, resulting in a contiguity effect. A further assumption of TCM is that repeated items can recover, or retrieve, their study context; that is, they can cause the context state to be partially reset to the state prior to the previous presentation of the repeated item.

An example may make this more concrete. Suppose that the model is presented with a list of words that includes the sequence … absence, hollow, pupil, river, darling … The temporal context in which pupil is encoded includes input caused by hollow and, to a lesser extent because it was further in the past, input caused by absence. Similarly, the temporal context in which each of the other items was encoded is composed of the input caused by the preceding items, weighted by their recency. If the context immediately after presentation of this sequence is used as a cue, darling would be most strongly activated because its encoding context is most similar to the cue context. In this way the model accounts for the recency effect. The model accounts for contiguity as well. Suppose that pupil is repeated at some later time and successfully recovers its encoding context. Then the context cue recovered by pupil provides a better cue for river than for darling, because the encoding context for river did not drift as far from pupil. Similarly, recovery of pupil's encoding context makes a more effective cue for hollow than for absence, for the same reason. In this way, the model accounts for the contiguity effect in both the forward and backward directions.2

The ability of items to recover their prior contextual states endows TCM with a number of important properties. For instance, the backward association manifest in the contiguity effect depends on the ability of items to recover their prior temporal contexts. Similarly, because the prior state of context includes input caused by the items that preceded the initial presentation of the repeated item, recovering this state results in a mixing of the input patterns caused by items on the basis of their contiguity. This property can be exploited to describe effects in relational memory, as we shall see shortly. A more formal description of TCM follows; readers who wish to avoid a mathematical description may choose to skip this subsection.
1.1.1. Formal description of TCM

We will deviate from some of the details (and notation) used in previous papers in order to create as much consistency as possible with the development of the semantic memory model used here. In the discussion that follows we will assume, unless otherwise noted, that the subject studies an extremely long list without repetitions of items. The state of temporal context at time step i, t_i, is a result of the previous state of context and an input pattern t_i^IN caused by the item presented at time step i:

t_i = ρ t_{i-1} + (1 - ρ) t_i^IN,        (1)
where ρ is a scalar less than one. We assume that the input vectors are chosen such that the sum of their components is unity. Under these conditions, the sum of the components of t is also unity. Eq. 1 implements the assumption that context changes gradually over time; all other things being equal, the state of temporal context at time step i resembles the previous state of context more than states more distant in the past. Temporal context changes gradually as more (unique) items are successively presented.

We use an outer-product matrix associating contexts to items to enable items to be activated by a contextual cue. During study, items are encoded in their temporal context. The matrix M is updated such that the change in M is given by

ΔM = f_i t′_{i-1},        (2)
where f_i is the vector associated with item i and the prime denotes the transpose. To recall an item, the matrix M is multiplied from the right by the current state of context. Eq. 2 results in the property that each item f_i is activated to the extent that its study context overlaps with the context used as a cue.

It remains to describe the properties of the t^IN vectors. The input pattern t^IN caused by an item is composed of a fixed component that does not change across repetitions of the item over the time scale of a laboratory memory experiment, which we will refer to as f, and a changing component, which we will refer to as h. Each f_i and each h_i are caused by the item presented at time step i and depend only on the identity of that item and its previous history. The f vectors for each item are fixed throughout the simulation. If item a is presented at time step i, then

t_i^IN = (1 - γ) f_a + γ ĥ_a.        (3)
The hat in the second term indicates that h is normalized before entering this expression. We fix the f_a to be unit vectors that serve as the basis vectors for the t space. With learning, h_a changes from one presentation of item f_a to the next according to

Δh_a = t_{i-1}.        (4)
The function of h is to enable items to recover the temporal context in which they were studied. This implements property 4 in Table 1.
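For readers who prefer code to equations, here is a minimal sketch of Eqs. 1–4 (our own illustration with one-hot item vectors over a toy vocabulary; the class and parameter names are ours, not the authors'):

import numpy as np

class TCM:
    def __init__(self, n_items, rho=0.8, gamma=0.5):
        self.rho, self.gamma = rho, gamma
        self.F = np.eye(n_items)               # fixed unit vectors f_a (basis of the t space)
        self.H = np.zeros((n_items, n_items))  # slowly learned components h_a (Eq. 4)
        self.M = np.zeros((n_items, n_items))  # context-to-item associations (Eq. 2)
        self.t = np.ones(n_items) / n_items    # current temporal context

    def _input_pattern(self, a):
        # Eq. 3: fixed component plus the normalized retrieved context h_a
        h = self.H[a]
        h_hat = h / h.sum() if h.sum() > 0 else np.zeros_like(h)
        t_in = (1 - self.gamma) * self.F[a] + self.gamma * h_hat
        return t_in / t_in.sum()               # keep the components summing to one

    def present(self, a):
        t_prev = self.t.copy()
        t_in = self._input_pattern(a)           # uses h_a learned at earlier presentations
        self.M += np.outer(self.F[a], t_prev)   # Eq. 2: bind the item to its study context
        self.H[a] += t_prev                     # Eq. 4: store this context for later retrieval
        self.t = self.rho * t_prev + (1 - self.rho) * t_in   # Eq. 1: gradual contextual drift

    def cue(self):
        # Items are activated to the extent their study contexts overlap the current cue
        return self.M @ self.t

Cuing with the context left after a short study list activates the most recently studied items most strongly (the recency effect), and because each h_a stores the context that preceded the item, a repeated item pulls the context back toward its old neighbors (the contiguity effect).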
1.2. Relational memory, retrieved context, and the hippocampus

Consider the case of a repeated item that recovers its prior study context when it is repeated. This means that the input caused by this item is not consistent across its two presentations. The change in the input patterns with repetitions has wide-reaching implications. The mixing of input patterns creates the ability for the model to describe associations between items that do not actually co-occur. Consider the case in which the subject learns a pair of items a–b and then much later learns b–c. If contextual retrieval takes place (i.e., if γ is nonzero), then during learning of a–b, the input pattern caused by b will include the temporal context that preceded it. This state of context includes information contributed by item a. As a consequence, during learning of b–c, the input pattern caused by b includes information about a. This means that the encoding context for c includes ''parts of'' a, even though a and c were not presented close together in time. In fact, such transitive associations among items that have not been presented close together in time are observed (e.g., Bunsey & Eichenbaum, 1996; Howard, Jing, Rao, Provyn, & Datey, 2009; Slamecka, 1976).

For instance, Howard et al. (2009) taught human subjects a long list of paired associates with overlapping pairs. That is, subjects learned a list of 35 pairs of the form a–b, b–c, c–d … presented in a random order for a total of 12 presentations each. During a final free recall session, subjects were asked to recall all the items from all the pairs in the order they came to mind. If a subject had just recalled a double-function word from the list, the next word the subject recalled would tend to come from a nearby pair, even if it was not from the same pair as the just-recalled word. For example, if the subject had just recalled b, the next word that came to mind would more likely be d than e. In other words, subjects showed behavioral evidence for transitive associations between items across pairs that fell off as a function of the number of links in the chain between items. It is as though the subjects were able to integrate the pairs into a common global memory structure. Howard et al. (2009) demonstrated that TCM provides a good description of this behavior.

Although transitive associations actually make paired-associate learning more difficult (see especially Provyn, Sliwinski, & Howard, 2007), they provide an extremely useful computational function by allowing the model to infer relationships between items that are not explicitly instructed. That is, the model does not have to be explicitly instructed that a and c ''go together'' in order for a to evoke c. A successful model of semantic memory needs to be able to place tens of thousands of symbols in the proper relation to one another. If each of those relations needed to be explicitly instructed, the number of presentations necessary would be extremely large. Moreover, the model does not need to make a priori assumptions about the possible structure present in the learning set (Kemp & Tenenbaum, 2009; Tenenbaum, Griffiths, & Kemp, 2006). This is possible because retrieved context ''spreads'' along the links in the chain such that the representation at the end of training reflects the topology of the pairs it was trained on. Note that this functionality depends on a training set in which relationships can be directly inferred from contiguity.
The function of contextual retrieval in TCM is in some sense analogous to the function of dimensional reduction in LSA and the topic model. To illustrate this, Fig. 1 shows results for TCM, LSA, and the topic model trained on a ''corpus'' that consisted of a set of ''documents,'' each of which contained a single double-function pair.3
Fig. 1. The temporal context model (TCM), a model of episodic recall based on contextual overlap, and computational models of semantic memory both predict transitive associations. In all three panels, the figure shows the similarity of the representation of each item in a double-function list of paired associates after training. (a) Retrieved temporal context as defined by TCM shows transitive associations. The shading of each square codes the similarity of the temporal context vector retrieved by the corresponding pair of items after 10 trials of learning on the corresponding double-function list. Vector similarity was assessed using the inner product; high values of the inner product are shaded dark. (b) A representation generated using latent semantic analysis (Landauer & Dumais, 1997) shows transitive associations. A singular value decomposition was computed for an item-context matrix corresponding to training on a double-function list of pairs, and two dimensions were retained. Similarity of each pair of vectors was assessed using the cosine of the angle between them; high values of cosine are dark. (c) The topic model (Griffiths et al., 2007) was trained on a set of contexts simulating presentation of a double-function list. The simulation used two topics and α = 0.1 and β = 0.1 (see Griffiths et al., 2007, for details). The similarity between each pair of items was estimated by comparing the Kullback–Leibler divergence of the distributions over topics induced by each item. Small values of divergence, corresponding to high similarity, are dark.

That is, document 1 consisted of the words a and b, document 2 consisted of the words b and c, and so on. Each panel shows the similarity of the representation of each word to each other word. Transitive associations can be seen in the shading among pairs of items that did not co-occur. TCM, LSA (with two dimensions), and the topic model (with two topics) all build transitive associations that bridge across pairs. Interestingly, LSA only exhibits transitive associations if the number of dimensions retained is less than the number possible. That is, if all seven dimensions are retained, LSA does not exhibit across-pair associations; rather, it only makes words that occur in the same document similar to one another. Similar results are observed for the topic model with seven topics (one for each document). It should be noted that HAL and BEAGLE also exhibit transitive associations, although this is not attributable to dimensional reduction in those models.

Contextual retrieval enables the development of a representation of the items that reflects their global co-occurrence structure. For instance, suppose that we train the model on a set of overlapping pairs a–b, b–c … z–a, with the pairs presented in a random order and each pair completely isolated from the others. After training, the input caused by b will resemble the input caused by c more than the input caused by d. Similarly, the input caused by b will resemble the input caused by d more than that caused by e, and so on. Retrieved temporal context enables TCM to place the input vectors caused by each item in the proper global relationship to each other.
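To make the dimensional-reduction point about Fig. 1b concrete, the following sketch (ours, not the authors' code) builds the toy word-by-document matrix for a double-function list, retains two singular-value dimensions, and compares two words that never co-occur:

import numpy as np

words = list("ABCDEFGH")
docs = [(i, i + 1) for i in range(len(words) - 1)]   # pairs A-B, B-C, ..., G-H

# word-by-document count matrix: each "document" holds one pair
X = np.zeros((len(words), len(docs)))
for d, (w1, w2) in enumerate(docs):
    X[w1, d] = X[w2, d] = 1.0

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                  # retain two dimensions, as in Fig. 1b
W = U[:, :k] * s[:k]                   # low-dimensional word vectors

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# B and D never appear in the same document, yet with k = 2 their cosine is
# large; with k = 7 (all dimensions retained) their similarity is exactly zero.
print(cosine(W[1], W[3]))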
Fig. 2. Contextual retrieval enables the extraction of global structure from isolated episodes. (A) Miniature example of a small-world network with connectivity chosen according to the structure of English as estimated by Steyvers and Tenenbaum (2005). (B) The cue strength between pairs chosen from the small-world network is shown as a function of the shortest path between the items. Filled symbols show TCM with contextual retrieval; the open symbol shows the value for TCM without contextual retrieval. Only one point is shown for the model without contextual retrieval because the cue strength is zero for items not presented as part of the same pair.
Rao and Howard (2008) showed that TCM with retrieved context can not only learn a one-dimensional topology, the ring, but also a two-dimensional topology in which items form a sheet, as well as more realistic topologies corresponding to naturally occurring language. Fig. 2A shows a miniature version of a small-world network (Strogatz, 2001; Watts & Strogatz, 1998) used to train the model. The network was generated with 10,000 nodes (500 are shown in Fig. 2A) with connectivity comparable to that of the English language, as estimated by the network analysis of WordNet performed by Steyvers and Tenenbaum (2005). We trained TCM on pairs chosen by selecting nodes connected by an edge of the graph. Fig. 2B shows the cue strength between items4 as a function of the length of the shortest path between them in the network. Note that pairs with a shortest path greater than one were never presented together during training. Nonetheless, the model correctly describes the distances among items from remote areas of the network. Further, this behavior depends on contextual retrieval: the cue strength is zero for remote items if contextual retrieval does not take place (open symbols).
1.3. pTCM: Learning structure by predicting the future

We have seen that contextual retrieval enables TCM to discover latent structure from presentation of isolated stimulus events and integrate it into a global representation. We have also seen that the model can learn complex topologies believed to underlie natural language. This seems like it might be sufficient to construct a model of semantic structure. Our initial strategy was to take TCM as just described, present it with a very long sequence of natural language, and evaluate the model's behavior. As it turns out, this is a deeply theoretically unsatisfactory model.5 The reason is that, unlike the artificial examples explored above, proximity in natural language is not a strong proxy for similarity. Consider the meaning we would learn for a novel word presented in the following sentence: ''The baker reached into the oven and pulled out the FLOOB.'' What is the meaning of FLOOB? In TCM, the representation of FLOOB would be updated to include information from the preceding context; that is, FLOOB would become similar to the representations of ''out,'' ''pulled,'' ''oven,'' ''baker,'' etc. While it is reasonable to infer that a FLOOB has something to do with those words, it is not at all the case that FLOOB is synonymous with the preceding context. If FLOOB were synonymous with its predecessors, it would be redundant and there would be no purpose in using the word FLOOB in that context. A much more natural way to describe the meaning of FLOOB would be to make FLOOB similar to the words that would have fit into that temporal context, for instance ''cake'' or ''bread'' (Dennis, 2005).

Fig. 3 illustrates this problem more concretely. We trained TCM with a set of sentences generated by the simple language generator program (SLG; Rohde, 1999) using a simple grammar previously used in a connectionist simulation of language acquisition (Borovsky & Elman, 2006). The SLG generated a set of sentences from words drawn from several categories of nouns (e.g., animals, people) and verbs (e.g., perception, action), subject to both syntactic and semantic constraints (examples can be found in Fig. 3A). Fig. 3B reflects the semantic space generated from TCM. More explicitly, we calculated the inner products t_a^IN · t_b^IN between different words and aggregated the results according to their category relationships. As shown by Fig. 3B, words become similar to the words that precede them; because the sentences all have either an N-V-N or an N-V structure, nouns become similar to verbs and vice versa.

pTCM (Shankar et al., 2009) builds on the framework of TCM (Table 2). Just like TCM, it uses a representation of temporal context that changes gradually over time. Similarly, context is used to cue items, and the input to context is caused by items. However, in pTCM the context is used as a cue not only when items are to be recalled, but also at each time step of learning to create a prediction about what will happen next (Property 2, Table 2). The semantic representation of an item is composed of the prediction vectors that obtained when the word was presented over time. This semantic representation for each word becomes part of the input to the temporal context vector when that word is presented. A more formal definition follows in the next subsection, which can be omitted by the reader not interested in the mathematical details of the model's operation. Before that, we briefly demonstrate that the adjustments present in pTCM enable us to solve the problem of learning semantic representations from sequentially organized materials. Fig. 3C shows the results of the simulation with the SLG conducted with pTCM.
In pTCM, the representations of words become similar to those of other words from the same category (dark boxes along the diagonal). To the extent that there is residual similarity across categories, it respects the part of speech of the words. For instance, the shaded box in the upper right of Fig. 3C reflects the fact that verbs are more similar to other verbs than they are to nouns.
[Fig. 3 appears here. Panel (A) lists sample sentences from the simple language (e.g., ''tiger walks.'', ''lion eats bread.'', ''sandwich drops.'', ''animal smells juice.'', ''rabbit enters room.''); panels (B) and (C) show category-by-category similarity matrices over the noun and verb categories (N_animals, N_food, N_humans, N_objects, N_places, VI_CoS, VI_Communication, VI_Motion, VT_Action, VT_Eating, VT_Perception).]

Fig. 3. Predictive temporal context model (pTCM) as a model of semantic learning. (A) Sample sentences generated by the simple language generator. (B–C) Similarity between the representations of words belonging to each category of the simple language. Dark boxes correspond to high similarity. The similarity between each word and itself is excluded from this plot. (B) Category structure for the temporal context model (TCM) after being trained on sentences sampled from the simple language. (C) Same as (B), but for pTCM. Unlike TCM, pTCM has learned the category structure underlying this simple language.
Table 2
Principles of operation of the predictive temporal context model (pTCM). Compare to Table 1
1. Temporal context changes gradually over time
2. Items are cued by a state of context to the extent it overlaps with their encoding context; cuing at each time step of learning yields a prediction
3. Presentation of items causes a change in the state of context partially driven by the stored semantic representations
4. Repeated/recalled items can recover the state of context in which they were previously studied
5. The semantic representation of an item is composed of the prediction vectors that obtained when it was presented
This ability to simultaneously capture syntactic and semantic roles is common to the simple recurrent network (Elman, 1990) and the syntagmatic-paradigmatic model (Dennis, 2004, 2005).

1.3.1. Formal description of pTCM

Let us describe more formally the process of computing the prediction vector and exploiting this information to develop a semantic representation. The prediction vector at time step i, p_i, is calculated using

p_i = M t_i,        (5)
where M differs from the matrix in Eq. 2 by virtue of being row-normalized. The vector $p_i$ can be thought of as the prediction for what item will be presented at time step i + 1. It has been proven that for bigram languages this prediction can be perfect (Shankar et al., 2009). Each word a in the language is associated with a semantic representation $s_a$ that is built up from the prediction vectors available when the item is presented. If word a is presented at time step i, then $s_a$ is updated such that the change in $s_a$ is given by:

$\Delta s_a = p_{i-1}$.    (6)
Finally, the semantic vector contributes to the input pattern ($t^{IN}$) to context when the corresponding item is presented. If item a is presented at time step i, then the input pattern $t^{IN}_i$ is given by:

$t^{IN}_i = (1 - \phi) f_a + \phi \hat{s}_a$.    (7)
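To make Eqs. 5–7 concrete, the following is a minimal sketch of one learning pass in Python. It is not the authors' implementation: the drift rule for the context vector, the outer-product learning of M, and the normalization steps are filled in from standard TCM conventions and should be read as assumptions, and the parameter names (rho, phi) are illustrative.

```python
import numpy as np

def ptcm_pass(corpus, vocab, rho=0.7, phi=0.4):
    """One illustrative pass of pTCM learning (a sketch of Eqs. 5-7, not the published code)."""
    idx = {w: k for k, w in enumerate(vocab)}
    n = len(vocab)
    M = np.zeros((n, n))   # context-to-item associations (one row per item)
    s = np.zeros((n, n))   # semantic representation s_a for each word a
    t = np.zeros(n)        # current temporal context

    def unit(v):
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v

    for word in corpus:
        a = idx[word]
        # Eq. 5: prediction from the row-normalized matrix and the current context
        row_sums = M.sum(axis=1, keepdims=True)
        M_norm = np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums > 0)
        p = M_norm @ t
        # Eq. 6: the prediction that preceded word a is added to its semantic representation
        s[a] += p
        # Eq. 7: input pattern blends the word's unit vector with its normalized semantics
        f_a = np.zeros(n)
        f_a[a] = 1.0
        t_in = (1.0 - phi) * f_a + phi * unit(s[a])
        # assumed Hebbian outer-product learning: bind the item to its encoding context
        M += np.outer(f_a, t)
        # assumed gradual drift of temporal context toward the new input pattern
        t = unit(rho * t + (1.0 - rho) * t_in)
    return M, s
```

In the full simulations described below, the reset of context at paragraph boundaries and the distractor-like treatment of sentence boundaries would be layered on top of a loop of this kind.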
In principle, we could consider a combination of Eq. 3 with Eq. 7, using both γ and φ to generate $t^{IN}_i$. Because our focus in this paper is on learning semantic representations and not on episodic recall, we set γ = 0 and ignore contextual retrieval. Shankar et al. (2009) demonstrated several useful results regarding the behavior of pTCM using toy languages (even simpler than the SLG) to be able to quantitatively compare the model’s behavior to the generating function of the language. One key finding is that φ enables the model to generalize across items that have similar roles in the language in much the same way that γ enables TCM to generalize across contiguity relations among items. In addition, Shankar et al. (2009) derived an expression that enables one to calculate the steady-state behavior of the model in a much more computationally efficient way. In pTCM, calculation of the p vector at each time step requires a matrix multiplication. Hence, pTCM is much more computationally intensive than TCM. The expression for the steady-state behavior of the model exploits the somewhat surprising fact that at steady state the semantic representations can be calculated just with knowledge of M. Similarly, the steady-state M can be calculated directly if the semantic representations are known. Shankar et al. (2009) proved that this steady-state approximation precisely describes the simulation model’s behavior with a sufficiently long training set and also closely approximates the simulation model’s behavior with shorter training sequences. An important point to note is that pTCM is history dependent. That is, the effect of being trained on a particular sequence early in training is different from the effect of being trained on the same sequence later in training. If the model is being used to estimate a language with a constant generating function, this property is a nuisance. The approximation can be thought of as the ‘‘history-independent’’ model that would result from averaging over all possible sequences that could lead to the same initial estimate of M.
2. pTCM as a model of natural language processing

The foregoing results suggest that pTCM ought to be able to learn semantic structure from natural language. In order to test this, we trained pTCM on the TASA corpus and examined the model’s behavior on a synonym test and a free association test. Because of the size of the corpus, it was not practical to run the entire simulation model. Instead, we used the history-independent steady-state approximation of the simulation model (Shankar et al., 2009).

2.1. Simulation methods

In order to test the applicability of the model to real-life language acquisition, we trained pTCM on a widely used corpus of the English language—the Touchstone Applied Science Associates (TASA) college level corpus. The TASA corpus contains about 11 million words across 37,000 documents and 93,000 unique words. To preprocess the corpus, we stripped it of punctuation, numbers, very frequent and commonly occurring words (including function words like ‘‘a,’’ ‘‘the,’’ etc.), and words that occurred in fewer than three documents and fewer than 10 times in the whole corpus. This resulted in a reduced corpus of about 5 million tokens and 48,000 unique words. The context vector was set to zero at the beginning of every paragraph. The sentence separators, on the other hand, were treated as being equivalent to distractor tasks, and ρ was changed transiently during the transition from one sentence to the next, assuming a value of ρ_D between sentences. The computation time for the steady-state approximation was sped up further by writing a parallel sparse implementation using the message passing interface (MPI) library in C++. Throughout the simulations described here, we set the sparsity threshold to 10⁻⁵. However, the amount of time required to run the approximation on a dual Xeon quad core (3.16 GHz)
machine with 8 GB of RAM made it impractical to evaluate the model many times with varying parameters. To reduce the processing time further, we collapsed a large number of low-frequency words remaining in the preprocessed tokens into a single token. This token was treated differently from the others in that it did not have an entry in M or a semantic representation s. The effect of our treatment was such that when this special token was encountered in the corpus, the context vector was multiplied by ρ, but no other action was taken. After reducing the number of words to 10,152 in this way, calculating the model on the corpus with a single set of parameters took about 16 h.

We evaluated the model parameters based on the model’s ability to describe the semantic structure of English. For this, we assembled a pool of cue words such that:
1. Each word was normed in the USF free association database (Nelson, McEvoy, & Schreiber, 2004).
2. Each word was present in our reduced corpus.
3. Each word had a synonym as evaluated by WordNet.
4. Each word’s first associate was present in our reduced corpus.
5. Each word’s first listed synonym was present in our reduced corpus.
There were 1,040 such words. These had 591 unique synonyms and 583 unique first associates.

In order to evaluate the model’s ability to describe performance on the synonym test in a fair manner, it was necessary to find a nonparametric measure of the degree to which the model captures the structure of the space. We arbitrarily separated the synonym pairs into a set of cues and a set of targets. For each cue, we calculated its inner product with the semantic representation of each of the targets, and retained the rank of the correct target relative to the other targets in the set. Ties were addressed by taking the mean rank of the values tied with that of the target. The distribution of ranks was retained and the mean log rank on the synonym test was minimized. The results presented in this paper are based on the parameters for which the mean log rank is minimal. An analogous procedure, wherein the mean log rank on the free association test is minimized, can be adopted to evaluate the model’s parameters based on the model’s performance on the free association test. In this paper, we do not report results from these parameters.

We computed ranks for pTCM using two measures of similarity between words a and b. One measure, which we refer to as the pTCM free associate strength, is constructed by taking the input pattern caused by item a, $t^{IN}_a$, multiplying it by the matrix M, and measuring the degree to which word b is present in the output.6 This is analogous to presenting item a, allowing it to provide input to the state of temporal context, and seeing to what extent b is predicted. The other measures the inner product of $t^{IN}_a$, the input pattern caused by item a, with $t^{IN}_b$, the input pattern caused by item b. We refer to this latter measure as the pTCM vector space model. The time necessary to compute pTCM on the corpus precluded a thorough search of the parameter space. We adopted the strategy of attempting to search the parameter space retaining 3,000 dimensions, then evaluating the best-fitting parameters for the model retaining 10,152 dimensions.
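To make the rank measure described above concrete, here is a minimal sketch: each cue’s semantic vector is compared against every target, the rank of the correct target is recorded (ties receive the mean of the tied ranks), and the mean log rank summarizes performance. The function and variable names are illustrative, not taken from the authors’ code.

```python
import numpy as np

def mean_log_rank(cue_vecs, target_vecs, correct_idx):
    """Nonparametric synonym-test score: lower mean log rank = better semantic structure.

    cue_vecs:    (n_cues, d) semantic vectors for the cue words
    target_vecs: (n_targets, d) semantic vectors for the pool of targets
    correct_idx: for each cue, the index of its correct target in target_vecs
    """
    sims = cue_vecs @ target_vecs.T          # inner products with every target
    log_ranks = []
    for i, correct in enumerate(correct_idx):
        row = sims[i]
        target_sim = row[correct]
        # rank of the correct target; ties get the mean of the tied ranks
        higher = np.sum(row > target_sim)
        tied = np.sum(row == target_sim)
        rank = higher + (tied + 1) / 2.0
        log_ranks.append(np.log(rank))
    return float(np.mean(log_ranks))
```

The same function, with first associates substituted for synonyms and either pTCM similarity measure substituted for the inner product, would cover the free association evaluation as well.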
We used a downhill simplex algorithm to minimize the mean log rank on a variety of synonym tests; these ultimately did not completely converge to a solution and we took the most satisfactory solution that was available. We evaluated the simulation model with the same parameters and 3,000 dimensions. However, optimization of the simulation model was not practical and we only report results from the approximation. In order to compare pTCM to a version of LSA trained on the same inputs, we computed an LSA solution on the reduced corpus. This corpus did not include the stop words, short words, and extremely infrequent words omitted at the parsing stage, but did include the words collapsed into a single string. We varied the number of dimensions retained in steps of 50 to find the value that resulted in the best performance (as measured by mean rank) on our synonym test. This value was 800. One might argue that our use of an impoverished version of LSA is somewhat unfair to that method—unlike pTCM, LSA is not subject to computational limitations that make it impractical to run on the entire corpus. For this reason, we also calculated results for LSA calculated on the entire corpus with 300 dimensions. This calculation was done with the same in-house LSA implementation we used to compute the reduced corpus. The results of this calculation were checked against the SEMMOD package (Stone, Dennis, & Kwantes, 2008).

2.2. Results

The best-fitting value of ρ, 0.68, was much greater than zero, indicating that multiple preceding content words contributed to the model’s estimate of temporal context. The value of ρ_D describing the rate of contextual drift across sentence boundaries was also much greater than zero, indicating that information across sentence boundaries contributed positively to model performance. Critically, the best-fitting value of φ, 0.41, was greater than zero, indicating that the ability to generalize across experiences was important to the model’s performance. We found that a broad variety of values of φ yielded roughly similar ranks on the synonym test as long as the value of φ did not approach zero or one, at which point performance fell off dramatically.

Fig. 4A shows the cumulative distribution of ranks for the pTCM vector space model (dark blue) and LSA (light red) when both are trained on the reduced corpus. The graph gives the cumulative probability that the similarity of the cue word’s synonym obtained a given rank relative to the other targets. Good performance is reflected as a large number of low ranks, which means that the cumulative probability increases at low ranks. Put another way, good performance is manifest as a higher line in Fig. 4A. As can be seen from Fig. 4A, the distribution of ranks generated by the pTCM vector space model for synonyms was robustly lower than the ranks generated by LSA when they were both trained on the reduced corpus. Fig. 4B compares performance on the synonym test for the pTCM vector space model trained on the reduced corpus (dark blue) to LSA trained on the entire corpus (light red). Although the performance of the two models appears very similar, it can be established statistically that LSA trained on the entire corpus outperforms pTCM trained on the reduced corpus. For instance, the rank on the synonym test was lower for LSA trained on the entire corpus for 551 synonyms, whereas pTCM trained on the reduced corpus only produced a lower rank for 461 synonyms (p < .005 by the binomial distribution; there were 28 ties).
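As a quick check on the binomial comparison just reported, the sign-test computation can be reproduced directly. The counts (551 vs. 461, with 28 ties excluded) come from the text; the exact two-sided test under a fair-coin null is a standard calculation, not the authors' specific procedure.

```python
from math import comb

def binomial_sign_test(wins_a, wins_b):
    """Exact two-sided sign test under H0: each non-tied pair favors either model with p = 0.5."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # P(X >= k) for X ~ Binomial(n, 0.5); doubled for a two-sided test (capped at 1)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2.0 ** n
    return min(1.0, 2.0 * upper_tail)

# Synonym test: LSA (full corpus) gave the lower rank on 551 pairs, pTCM (reduced corpus) on 461
print(binomial_sign_test(551, 461))   # prints the two-sided p-value (approximately .005)
```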
[Fig. 4 about here. Panels A and B each plot Cumulative Probability (0.0–1.0) against Rank (0–500).]
Fig. 4. Predictive temporal context model’s (pTCM’s) vector space model performs comparably to latent semantic analysis (LSA) on a synonym test. A set of 1,040 synonym pairs was assembled. For each model, we calculated the similarity between a word and its synonym and expressed that as a rank relative to all the other words’ synonyms. Low ranks indicate the model is doing well at placing words with similar meanings in similar locations. (A) Cumulative probability distribution of the rank of the synonym for the pTCM vector space model (dark blue) and LSA trained on the same words (light red). The higher curve indicates a larger proportion of low ranks, and thus better model performance. pTCM shows a marked improvement over LSA trained on the same words. (B) Same as (A), but comparing pTCM trained on the reduced corpus (dark blue) to LSA trained on the entire corpus (light red). Despite the fact that pTCM only had detailed information about 10,000 words (as opposed to 93,000 for LSA), there are relatively modest differences between the performance of pTCM and LSA.
The results of the analysis of the synonym test indicate that pTCM trained on the reduced corpus (approximately 10,000 unique words) outperforms LSA trained on the same corpus, and is comparable to, although slightly worse than, LSA when it was trained on the entire corpus (approximately 100,000 unique words). Performance by pTCM on the synonym test was moderately correlated with performance by LSA. The correlation across pairs between the rank assigned to synonyms by pTCM and by LSA trained on the reduced corpus was 0.56. The correlation between pTCM and LSA trained on the entire corpus was 0.69. However, both of these numbers were reliably less than the correlation between LSA trained on the reduced corpus and LSA trained on the entire corpus, 0.74. Note that although this comparison led to the highest correlation, it also corresponded to the largest difference in performance.

We obtained comparable results for the free associate test. First, Fig. 5A shows the cumulative probability functions for the distribution of ranks of the first free associates for the pTCM vector space model (light red) and the pTCM free associate model in which the semantic representation of the cue item is used to predict the subsequent item (dark blue). There is a strong advantage for the pTCM free associate model over the pTCM vector space model in modeling free associates. Fig. 5B shows the cumulative probability distribution for the pTCM free associate model trained on the reduced corpus (dark blue), LSA trained on the reduced corpus (light red), and LSA trained on the entire corpus (lighter green). As with the comparison on the synonym test, pTCM produced reliably lower ranks than LSA when they were both trained with the reduced corpus.
[Fig. 5 about here. Panels A and B each plot Cumulative Probability (0.0–1.0) against Rank (0–500).]
Fig. 5. Performance on free association norms. Similarity ratings were evaluated for a list of 1,040 words paired with their strongest associates. The strength of the relationship between the prime and its first associate was calculated and turned into a rank relative to the first associates of the other primes. Cumulative probability distributions of ranks are shown. Lower ranks reflect better model performance, meaning that higher curves reflect better model performance. (A) The cumulative probability distribution of ranks of the first associates for the predictive temporal context model (pTCM) ‘‘recall’’ model (dark blue) and the pTCM vector space model (light red). For the pTCM recall model, the semantic representation of the cue item was used as a cue via the context-to-item matrix. The activation of each target was used to generate its rank. The vector space model simply uses the inner product of the semantic representation of items to generate similarity values. (B) Cumulative probability distributions of ranks for the pTCM recall model (dark blue), latent semantic analysis (LSA) trained on the same words (light red), and LSA trained on the entire corpus (lighter green). pTCM trained on the reduced corpus shows dramatic improvement over LSA when it was trained on the same words. LSA trained on the entire corpus shows a modest improvement over pTCM trained on the reduced corpus.
As with the synonym test, when LSA is trained on the entire corpus, there is a small but reliable advantage over pTCM trained on the reduced corpus. For instance, the rank of the first free associate was lower for LSA trained on the entire corpus for 551 cues, whereas pTCM trained on the reduced corpus only produced a lower rank for 388 cues (p < .001 by the binomial distribution; there were 101 ties). On the free associate test, the pTCM free associate model was only moderately correlated with LSA and with the pTCM vector space model, and the pTCM vector space model was more strongly correlated with LSA trained on the entire corpus. The correlation across pairs between the rank assigned by the pTCM free associate model to the first free associate and the rank assigned by the pTCM vector space model was 0.48. The correlations between the pTCM free associate model and LSA trained on the reduced and entire corpus were also moderate, both r = .49. Interestingly, the correlation between the ranks assigned by the pTCM vector space model and LSA trained on the entire corpus was reliably higher, 0.68, and also higher than the correlation between LSA trained on the entire corpus and LSA trained on the reduced corpus, 0.64.

There are several conclusions that can be reached from these analyses. First, the two measures derived from pTCM, the vector space similarity and free associate strength, produce somewhat different results. In particular, the vector space model was inferior at modeling human free associate performance (Fig. 5A). For both the synonym and free associate test, pTCM produced a dramatic advantage over LSA when both methods were trained on the reduced corpus.
Table 3
Nearest neighbors to the word ‘‘baker’’ using various measures of semantic similarity

pTCM Vector Space    pTCM Free Associate    LSA
Quimby               Helper                 Gaslight
Frits                Tennessee              Pastry
Wiggle               Cindy                  Holmes
Liza                 Peel                   Sherlock
Roberts              Shoemaker              Kendrick
Rogers               Cooper                 Passersby
Miyo                 Loaf                   Sirhan
Mandy                Rotten                 Richard
Frances              Lazy                   Cakes
Cass                 Shop                   Tarts
Handing              Baked                  Humphrey
Pooh                 Blacksmith             Wallace
Jed                  Cakes                  Dough
Gregory              Onion                  Hubert
Nicky                Dough                  Irwin
Oswald               Novels                 Daley
Zaphod               Baking                 Assasinations
Pippi                Huddled                Begrimed
Gran                 Batter                 Leavened

Notes. The first column shows the nearest neighbors in the pTCM semantic space. The second column shows the free associates of ‘‘baker’’ using pTCM. The third column shows the LSA nearest neighbors with pseudodoc weighting from lsa.colorado.edu with words that appeared in the corpus less than or equal to three times removed. The highest ranking word for all three measures was baker, which has been removed from this table. pTCM, Predictive temporal context model; LSA, latent semantic analysis.
When LSA was trained on all the words in the corpus, approximately 100,000 unique words, it produced superior results to pTCM trained on about 10,000 unique words. It is tempting to assume that if pTCM were also provided more information by means of training it on the entire corpus it would dramatically outperform LSA. While this is a possibility, information about low-frequency words may instead function as a source of noise that could reduce pTCM’s performance. Nonetheless, there are clearly qualitative differences between what is being responded to by the different pTCM measures and LSA. Table 3 shows the nearest neighbors of the word baker for the pTCM vector space model, the pTCM free associate model, and LSA trained on the entire corpus. Several results can be obtained from examination of Table 3. First, the pTCM vector space model has exclusively identified proper names, with an emphasis on last names (e.g., quimby, frits, and wiggle all appear as last names in the TASA corpus). The vector space model ultimately rates as similar words that occur in similar contexts. In the TASA corpus, proper names often occur in similar roles in descriptions of conversations, as well as in the context of first names. The pTCM free associate measure of baker generates words with a variety of origins that can be understood from examination of the TASA corpus. For instance, the presence of
tennessee in this list is due to several passages that discuss former Senator Howard Baker of Tennessee. The presence of cindy in the list is attributable to a single passage in which a student (Cindy) has a conversation with her teacher (Mr. Baker). A majority of the entries for baker are related to the baking profession (e.g., loaf, shop, baked). In most cases, the pTCM free associate measure produces words that appear in close proximity to baker in the corpus. In contrast, LSA’s responses are grouped according to several broad themes that occur in the corpus. One easily identifiable theme is words related to the profession of baking (e.g., pastry, cakes, tarts, dough). The documents describing former Senator Howard Baker give rise to several near-neighbors that are related to politics and news in the late sixties and early seventies (e.g., sirhan, humphrey, wallace, daley, assasinations). In addition, multiple LSA near-neighbors are related to passages describing the fictional detective Sherlock Holmes (e.g., gaslight, holmes, sherlock, begrimed), who lived on Baker Street in London. Although these measures provide comparable performance on synonym tests (Fig. 4B) and free associate tests (Fig. 5B), Table 3 suggests that the measures accomplish this level of performance by responding to different sources of information. Examination of Table 3 reflects the fact that LSA responds to thematic information available at the level of the document. The vector space model of pTCM responds by rating words that are preceded by similar temporal contexts as similar to one another. The pTCM free associate measure rates as similar words that occur in close temporal proximity to one another. The fact that these different sources of information lead to similar performance suggests the possibility that these measures could be combined to provide a metric more useful than any one of the measures taken in isolation.
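As an illustration of how nearest neighbors such as those in Table 3 can be read off the two pTCM measures, the sketch below computes both the vector-space similarity (inner product of input patterns) and the free-associate strength (cueing the context-to-item matrix with the cue's input pattern, as in footnote 6). It assumes the matrix M and the input patterns $t^{IN}$ have already been learned; the function and variable names are illustrative.

```python
import numpy as np

def nearest_neighbors(cue, vocab, t_in, M, measure="vector_space", k=10):
    """Top-k neighbors of `cue` under either pTCM measure.

    vocab: list of words; t_in: (n, n) array of input patterns t^IN (one row per word);
    M: (n, n) context-to-item matrix.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    a = idx[cue]
    if measure == "vector_space":
        # similarity of the cue a to each word b is the inner product t^IN_a . t^IN_b
        scores = t_in @ t_in[a]
    else:
        # free associate strength: cue context with t^IN_a and read out item activations
        scores = M @ t_in[a]
    scores = scores.astype(float)
    scores[a] = -np.inf            # drop the cue itself, as in Table 3
    order = np.argsort(scores)[::-1][:k]
    return [(vocab[b], float(scores[b])) for b in order]
```

One simple combination of the two measures, not specified in the text, would be to average the ranks each measure assigns to a candidate word.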
3. General discussion

Previous work on TCM demonstrated that gradually changing temporal context can provide a good account of recency and contiguity effects observed in episodic memory (Howard & Kahana, 2002; Howard et al., 2008b; Polyn et al., 2009a; Sederberg et al., 2008). Here we discuss efforts to construct a model of semantic memory within the same framework. Contextual learning enables generalization across multiple experiences that lack overlapping elements, placing stimuli into the correct global arrangement (Howard et al., 2009; Rao & Howard, 2008). pTCM (Shankar et al., 2009) enables generalization to take place in sequentially organized materials, such as natural language. We then showed several demonstrations that pTCM can be applied to natural language by training the model on a reduced version of the TASA corpus. pTCM dramatically outperformed LSA trained on the same reduced corpus and produced results close, although statistically inferior, to those of LSA when LSA was trained on the entire corpus. The point of this exercise was not to claim that pTCM is of superior practical utility to LSA or other computational models of semantic memory at this time. At present, the computational demands of pTCM and the necessity of fitting multiple parameters make it
unwieldy for many applications. However, because of its tight coupling with a successful model of episodic recall, it has theoretical advantages over other existing computational semantic memory models. The fact that a common description of temporal context can be used as a core concept of both a model of episodic memory performance and a model of semantic memory acquisition suggests that these concepts can form the basis of a common account of declarative memory. pTCM may be uniquely suited for describing the process of learning and memory retrieval that combines both semantic and episodic information.

3.1. Where is pTCM in the space of computational models of semantic learning?

A natural question to ask is how pTCM relates to extant computational models of semantic memory. Here we briefly discuss the commonalities and differences between pTCM and several widely used models.

3.1.1. HAL

The hyperspace analogue to language (HAL, Lund & Burgess, 1996) model uses a semantic representation that codes each word as the vector of counts of the other words in the language that appeared in a moving context window with that word. There are numerous similarities between HAL and pTCM. These include the fact that more recent words within a document contribute more strongly to the meaning of a word (this property is not shared with LSA or the topic model). In HAL, a word recovers the set of words that occurred nearby during learning—a process similar to retrieval of temporal context. This property enables HAL, along with the other models considered here, to account for transitive associations among items that did not co-occur in the corpus. One of the differences between HAL and pTCM is that the range over which temporal context is defined in pTCM can be quite long—many prior words can contribute to the context. Although it remains to be seen how the ‘‘tail’’ of this distribution contributes to the semantic representations obtained from natural language, it is worth noting that the best-fitting parameters obtained here indicate that performance was optimal when many prior items contributed. A quick calculation reveals that in our natural language simulations as many as 27 prior items could have contributed to the context vector before passing under the sparsity threshold with the parameters used.7 This difference between HAL and pTCM is perhaps analogous to the distinction between buffer models of short-term memory (e.g., Atkinson & Shiffrin, 1968; Raaijmakers & Shiffrin, 1980) and TCM in understanding episodic free recall. There, the primary advantage of gradually changing temporal context over buffers with finite range is that temporal context can provide a more natural account of recency and contiguity effects that extend over multiple time scales (see Howard, Kahana, & Sederberg, 2008a; Kahana, Sederberg, & Howard, 2008b; Sederberg et al., 2008; Usher, Davelaar, Haarmann, & Goshen-Gottstein, 2008, for a thorough discussion of the relationship between TCM and buffer models). The other major point of distinction between HAL and pTCM is that in HAL there is no generalization across word meaning during learning.
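For comparison with the pTCM sketch given earlier, HAL's core computation can be illustrated in a few lines: co-occurrence counts are accumulated in a moving window, with nearer neighbors weighted more heavily. The window size and the linear distance weighting below follow the usual description of HAL but should be read as illustrative rather than an exact reproduction of Lund and Burgess (1996).

```python
import numpy as np

def hal_counts(tokens, vocab, window=10):
    """Accumulate HAL-style weighted co-occurrence counts in a moving window.

    Each word's row collects counts of the words that preceded it; closer predecessors
    receive larger weights (window, window-1, ..., 1). Assumes every token is in vocab.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for pos, word in enumerate(tokens):
        a = idx[word]
        for lag in range(1, window + 1):
            if pos - lag < 0:
                break
            b = idx[tokens[pos - lag]]
            counts[a, b] += window - lag + 1   # distance-weighted count
        # note: unlike pTCM, nothing learned about other words feeds back into
        # how the current word is encoded; there is no generalization during learning
    return counts
```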
3.1.2. BEAGLE

In BEAGLE (Jones, Kintsch, & Mewhort, 2006; Jones & Mewhort, 2007), each word is initially assigned a random vector of some length (Jones & Mewhort, 2007 used vectors of dimension 2,048). During learning, an item (or semantic) representation and an order representation are aggregated for each word. This distinction between item and order information, as well as much of the mathematical machinery of BEAGLE, is inherited from the TODAM model of episodic memory tasks (Murdock, 1982, 1997). In BEAGLE, the semantic representation is formed by summing the vectors of the other words that co-occur in the same sentence. The order representation is built by constructing an N-gram convolution between successive words. The pTCM has many commonalities with BEAGLE. BEAGLE’s contextual representation can be understood as analogous to the average prior context in which an item is presented (i.e., it is similar to h if it were averaged over all prior presentations of the item). If it were possible to set γ to zero during learning, but non-zero during retrieval, the h that would result would be very similar to the context representation in BEAGLE. In both models, there is a natural model of free association that emerges—in BEAGLE this is taken from the order representation used to support the cued recall task (Murdock, 1982). There are, however, important differences. As discussed above, context in pTCM changes continuously and persists across sentence boundaries, allowing for long-range contingencies between words.8 In contrast, BEAGLE stops at sentence boundaries. The major difference between the models is that in pTCM the representation of a word that is used to make subsequent predictions changes during learning. BEAGLE relies on statistical averaging of the initial word vectors to build up the representations. In pTCM, the changing semantic representation of a word contributes to the temporal context vector, so that all of the information that has been learned up to that point can be brought to bear in making a prediction and thus updating the representation of the other items in the language. This may result in more robust generalization during learning.

3.1.3. LSA

Latent semantic analysis (Landauer & Dumais, 1997) has set a standard for computational models of semantic knowledge for more than a decade. It has been successful in a broad range of applied settings and has shed considerable light on the basis of knowledge formation in a theoretical sense. Although the end-states of learning in pTCM and LSA are similar to some extent (e.g., Fig. 4B), pTCM and LSA are conceptually very different. The pTCM is a learning model in which information is gradually built up over experience. In contrast, the algorithm of LSA requires that all experience be accessible prior to calculating the semantic representation. LSA is consistent with a representation of temporal context that changes abruptly between documents but does not change at all within a document. The parameters of pTCM are sufficiently flexible to approximate a very slowly changing context vector by setting ρ ≈ 1 and ρ_D = 1. The best-fitting parameters were far from these values, suggesting that there are meaningful changes in temporal context within a document. As mentioned previously, although a vector space can be extracted from
pTCM, this is not the only, or even necessarily the best, representation of meaning possible within pTCM. The free associate measure is not subject to the constraints of a vector space. For instance, the associative strength between two words is asymmetric and can violate the triangle inequality. With all these differences in mind, it is remarkable that the end-state of pTCM is as similar to that of LSA as it is.

3.1.4. The topic model

The probabilistic topic model (Griffiths et al., 2007), like LSA, starts with a word-by-document co-occurrence model. It makes the assumption that the words in a document are sampled from mixtures across a latent variable referred to as a topic. The model constructs a prior on the degree of mixing of topics in a given document, then estimates the probability of sampling a word given each topic using latent Dirichlet allocation (Blei, Ng, & Jordan, 2003). The distribution of words across topics gives an estimate of their meaning. Many of the points of contrast between pTCM and the topic model are the same as those with LSA: The construction of topics makes the assumption that meaning does not change within a document, and the topics calculation is performed after study of the entire corpus. Both pTCM and the topic model have a natural account of retrieval from memory, although in pTCM’s case this is embedded more strongly in a model of episodic retrieval. The primary advantage of the topic model over pTCM is its natural treatment of polysemy, which does not currently have an analog in pTCM.

3.1.5. The syntagmatic-paradigmatic model

The syntagmatic-paradigmatic model (SP, Dennis, 2004, 2005) attempts to automatically extract knowledge from naturally occurring text by using training exemplars to mutually satisfy syntagmatic and paradigmatic constraints. Syntagmatic associations are formed between words that occur in series in language—for instance, run and fast. In contrast, paradigmatic associations are formed between words that have similar meaning—or that fulfill similar roles in language—for example, run and walk. SP utilizes both types of associations to model the generation of language. In pTCM, paradigmatic associations are analogous to those constructed using the vector space representation. Paradigmatic associates are words that fit into similar contextual roles and thus have similar semantic representations in pTCM. Syntagmatic associates also have an analog in pTCM. Given the semantic representation of an item a, multiplying it by M gives the set of items that are predicted to follow a based on experience with the corpus (see Table 3). A single matrix, however, cannot capture the rich syntactic structure of English as may be possible with the SP model.

3.2. Does pTCM provide additional insight into the neural basis of memory?

One of the strengths of TCM as a model of episodic memory is the existence of a linking hypothesis between the structures of the model and physical processes that take place in the medial temporal lobe of the brain (Howard et al., 2005). This linking hypothesis has led to predictions about the behavior of brain states that have been confirmed with measurements from neural ensembles (Manns, Howard, & Eichenbaum, 2007). The confirmation of these
neurophysiological predictions, coupled with the confirmation of numerous behavioral predictions (Howard et al., 2007, 2008b; Howard et al., 2009; Polyn et al., 2009a; Polyn, Norman, & Kahana, 2009b; Schwartz, Howard, Jing, & Kahana, 2005; Unsworth, 2008), makes TCM a strong model of episodic recall. Although our understanding of pTCM is at a much earlier stage, it is possible that the extension to pTCM will enhance the set of neural phenomena that can be addressed in a common cognitive framework. Because pTCM is a superset of TCM (compare Table 1 with Table 2), a linking hypothesis between pTCM and the brain shares many of the same contact points—the context vector should reside in extrahippocampal medial temporal lobe (MTL) regions, especially the entorhinal cortex (see Polyn & Kahana, 2008, for a different hypothesis), and the hippocampus should be responsible for the recovery of temporal context. There are two unique predictions of pTCM. One is that the semantic representation of items should come to reflect the temporal contexts in which they are experienced. The second is that the brain uses the current state of temporal context to generate a prediction about what will happen next. There is neurophysiological evidence that suggests both of these predictions hold. Neurons in area TE of the monkey inferotemporal cortex, a region one synapse away from the MTL, respond to high-level visual stimuli during and following their presentation in a way that is not dependent on their coarse physical properties (Miyashita & Chang, 1988). Remarkably, neurons that respond to a particular stimulus are also more likely to respond to other stimuli that are repeatedly experienced close together in time (Miyashita, 1988) or as members of a bidirectionally presented pair of stimuli that predict one another (Sakai & Miyashita, 1991). Because the neurons are responding to an arbitrary temporal pairing of the stimuli rather than any physical property they have, these findings are as one would expect if the neurons were coding a semantic representation constructed from prediction vectors. This pair-coding phenomenon has been observed both in TE and perirhinal cortex (Erickson, Jagadeesh, & Desimone, 2000; Messinger, Squire, Zola, & Albright, 2001; Naya, Yoshida, & Miyashita, 2003), an extrahippocampal medial temporal lobe region one synapse from the entorhinal cortex. The pair-coding phenomenon also depends on feedback from the medial temporal lobe (Higuchi & Miyashita, 1996; Naya et al., 2003). Both of these properties are as one would expect if the change in the neurons’ responsiveness with experience depended on a prediction generated by a temporal context vector residing in extrahippocampal medial temporal lobe, especially the entorhinal cortex. The other large-scale prediction of pTCM, that the brain generates a prediction about subsequent stimuli based on the current state of temporal context, may also have a neurophysiological analog. The N400 is a negative potential observed most prominently when subjects are perceiving words that are semantically incongruous (Kutas & Hillyard, 1980), that is, ‘‘The baker reached into the oven and pulled out the boat.’’ The N400 is observed to the extent that a word is not well-predicted by its preceding semantic context (Bentin, McCarthy, & Wood, 1985; van Berkum, Hagoort, & Brown, 1999; Federmeier, 2007).
Notably, the prediction can be generated both by proximate words (Bentin et al., 1985) and more remote semantic context (van Berkum et al., 1999), suggesting that the prediction is generated across multiple time scales at once. These findings suggest that the N400 could reflect a mismatch between a prediction generated from a temporal context vector and a presented stimulus.
The identification of these ERPs with the mismatch between a prediction vector and the presented stimulus may facilitate development of another strong link between the mathematical framework of pTCM and MTL physiology. The N400 has a large generator in extrahippocampal MTL cortical regions (McCarthy, Nobre, Bentin, & Spencer, 1995). The N400 may be understood, at least in part, as a modulation of ongoing oscillatory activity in the MTL (Fell et al., 2004). While we do not wish to claim that the MTL generator is the sole source of the scalp ERP, presentation of a stimulus that is poorly predicted by its semantic context apparently has a profound effect on human MTL physiology. Moreover, the N400 in the anterior MTL to a studied stimulus predicts whether that stimulus will subsequently be recalled (Fernandez et al., 1999; similar effects are also recorded at the scalp; see Paller & Wagner, 2002, for a review). The involvement of the N400 in the MTL in both integration with semantic context and episodic memory encoding could eventually lead to a number of interesting constraints on a physical model of declarative memory.

3.3. Episodic memory and semantic memory

We have shown that it is possible to build a model of semantic memory acquisition in the same framework occupied by a model of episodic memory. This framework leads to predictions about a tight coupling between episodic and semantic memory that we have not yet explored. For instance, in the simulations of natural language using pTCM we did not allow contextual recovery, a process we believe to be an essential aspect of episodic memory (Howard et al., 2005; Sederberg, Miller, Kahana, & Howard, in press), to take place. One challenge of this future work is to specify the conditions under which episodic recovery succeeds. One intriguing possibility is that words that are poorly predicted by their study context are bound effectively to that context such that they can recover the context in the future. Another challenge is to determine which context is recovered by a word that is experienced multiple times. On the one hand, the function of episodic memory as recall of a specific event situated in a specific spatiotemporal context is defeated if all prior contexts in which a word has been experienced contribute to the context it recovers. On the other hand, simulations of learning double-function lists within a particular experiment suggest a gradual change in the temporal context recovered by an item, reflecting multiple study events in the same experiment (Howard et al., 2009). It is possible that these simulations mistake recovery of temporal context for the buildup of a prediction vector such as that utilized here. These distinctions may be teased apart by future experimentation. The specific integration of semantic memory into the TCM framework offered by pTCM potentially places strong constraints on TCM as a model of episodic recall. Since the earliest treatments of TCM, the input caused by an item when it is initially presented has been understood as reflecting its prior history. In more recent treatments of TCM, a distinction is made between the preexperimental context-to-item matrix and the newly learned part of the context-to-item matrix, which encodes information about the study items’ encoding context (Polyn et al., 2009a; Sederberg et al., 2008). Polyn et al.
(2009a) used the preexperimental matrix to carry information about semantic relationships among words, which was sufficient to account for the existence of semantic clustering in free recall. In the context of modeling
episodic memory, pTCM may be understood as a method to initialize the values of the preexperimental context-to-item matrix and the input patterns caused by items when they are initially presented. Taken together, the two models reflect a shared hypothesis about the interaction between semantic and episodic factors in memory retrieval.
Notes
1. See Shankar et al. (2009) for a more complete discussion of this point.
2. The asymmetry observed in the contiguity effect is also explained by the model. This is because, unlike this simplified example, the input pattern caused by pupil when it is repeated also includes the input pattern it caused during study (see Eq. 3). Because this overlaps with the encoding context for words that followed pupil but not those that preceded it, this causes an asymmetry.
3. The results for the topic model were generously provided by Mark Steyvers.
4. More explicitly, $f'_b M t^{IN}_a$ is the cue strength between item a and item b.
5. Actually, one can get practically useful results out of the model if one allows γ to be zero during study but nonzero during retrieval. This representation ends up being similar to the semantic representation in BEAGLE or the vectors of the HAL model. Given TCM, though, this account is theoretically unsatisfactory. If retrieved context is useful, why wouldn’t it be used during the thousands of hours of study that are presumably reflected by the corpus?
6. That is, we compute $f'_b M t^{IN}_a$.
7. This is an upper-limit calculation that assumes that there are no sentence boundaries in this string of words.
8. This can be seen from the fact that the best-fitting value of ρ_D was not zero.

Acknowledgment
Supported by NIH grant 1-R01 MH069938 to M.W.H. Thanks to Mark Steyvers, who calculated the predictions of the topic model used in Fig. 1C. We thank Vinayak Rao for developing the software for presenting pairs chosen from a small-world network and performing early simulations. Aditya Datey, Hongliang Gai, and Aditya Udas provided software support. Udaya Jagadisan is now in the Department of Biomedical Engineering, University of Pittsburgh.
References
Anderson, J. R., & Bower, G. H. (1972). Recognition and retrieval processes in free recall. Psychological Review, 79(2), 97–123.
Atkinson, R. C., & Shiffrin, R. M. (1968). Human memory: A proposed system and its control processes. In K. W. Spence & J. T. Spence (Eds.), The psychology of learning and motivation (Vol. 2, pp. 89–105). New York: Academic Press. Bentin, S., McCarthy, G., & Wood, C. C. (1985). Event-related potentials, lexical decision and semantic priming. Electroencephalography and Clinical Neurophysiology, 60(4), 343–355. van Berkum, J. J. Hagoort, P., & Brown, C. M. (1999). Semantic integration in sentences and discourse: Evidence from the N400. Journal of Cognitive Neuroscience, 11(6), 657–671. Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022. Borovsky, A., & Elman, J. (2006). Language input and semantic categories: A relation between cognition and early word learning. Journal of Child Language, 33(4), 759–790. Brown, G. D., Neath, I., & Chater, N. (2007). A temporal ratio model of memory. Psychological Review, 114(3), 539–576. Bunsey, M., & Eichenbaum, H. B. (1996). Conservation of hippocampal memory function in rats and humans. Nature, 379(6562), 255–257. Dennis, S. (2004). An unsupervised method for the extraction of propositional information from text. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl. 1), 5206–5213. Dennis, S. (2005). A memory-based theory of verbal cognition. Cognitive Science, 29, 145–193. Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211. Erickson, C. A., Jagadeesh, B., & Desimone, R. (2000). Clustering of perirhinal neurons with similar properties following visual experience in adult monkeys. Nature Neuroscience, 3(11), 1143–1148. Estes, W. K. (1955). Statistical theory of spontaneous recovery and regression. Psychological Review, 62, 145– 154. Federmeier, K. D. (2007). Thinking ahead: The role and roots of prediction in language comprehension. Psychophysiology, 44(4), 491–505. Fell, J., Dietl, T., Grunwald, T., Kurthen, M., Klaver, P., Trautner, P., Schaller, C., Elger, C. E., & Ferna´ndez, G. (2004). Neural bases of cognitive ERPs: More than phase reset. Journal of Cognitive Neuroscience, 16(9), 1595–1604. Fernandez, G., Effern, A., Grunwald, T., Pezer, N., Lehnertz, K., Du¨mpelmann, M., Van Roost, D., & Elger, C. E. (1999). Real-time tracking of memory formation in the human rhinal cortex and hippocampus. Science, 285, 1582–1585. Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114(2), 211–244. Higuchi, S., & Miyashita, Y. (1996). Formation of mnemonic neuronal responses to visual paired associates in inferotemporal cortex is impaired by perirhinal and entorhinal lesions. Proceedings of the National Academy of Sciences of the United States of America, 93(2), 739–743. Howard, M. W., & Kahana, M. J. (2002). A distributed representation of temporal context. Journal of Mathematical Psychology, 46(3), 269–299. Howard, M. W., Fotedar, M. S., Datey, A. V., & Hasselmo, M. E. (2005). The temporal context model in spatial navigation and relational learning: Toward a common explanation of medial temporal lobe function across domains. Psychological Review, 112(1), 75–116. Howard, M. W., Kahana, M. J., & Wingfield, A. (2006). Aging and contextual binding: Modeling recency and lag-recency effects with the temporal context model. Psychonomic Bulletin & Review, 13, 439–445. Howard, M. W., Venkatadass, V., Norman, K. A., & Kahana, M. J. (2007). 
Associative processes in immediate recency. Memory & Cognition, 35, 1700–1711. Howard, M. W., Kahana, M. J., & Sederberg, P. B. (2008a). Postscript: Distinguishing between temporal context and short-term store. Psychological Review, 115, 1125–1126. Howard, M. W., Youker, T. E., & Venkatadass, V. (2008b). The persistence of memory: Contiguity effects across several minutes. Psychonomic Bulletin & Review, 15, 58–63.
Howard, M. W., Jing, B., Rao, V. A., Provyn, J. P., & Datey, A. V. (2009). Bridging the gap: Transitive associations between items presented in similar temporal contexts. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35, 391–407. Jones, M. N., & Mewhort, D. J. K. (2007). Representing word meaning and order information composite holographic lexicon. Psychological Review, 114, 1–32. Jones, M. N., Kintsch, W., & Mewhort, D. J. (2006). High-dimensional semantic space accounts of priming. Journal of Memory and Language, 55, 534–552. Kahana, M. J., Howard, M., & Polyn, S. (2008a). Associative processes in episodic memory. In H. L. Roediger III (Ed.), Cognitive psychology of memory, Vol. 2 of learning and memory – a comprehensive reference (J. Byrne, Editor) (pp. 476–490). Oxford, England: Elsevier. Kahana, M. J., Sederberg, P. B., & Howard, M. W. (2008b). Putting short-term memory into context: Reply to Usher, Davelaar, Haarmann and Goshen-Gottstein (2008). Psychological Review, 115, 1119–1126. Kemp, C., & Tenenbaum, J. B. (2009). Structured statistical models of inductive reasoning. Psychological Review, 116(1), 20–58. Kutas, M., & Hillyard, S. A. (1980). Reading senseless sentences: Brain potentials reflect semantic incongruity. Science, 207(4427), 203–205. Landauer, T. K., & Dumais, S. T. (1997). Solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240. Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments & Computers, 28(2), 203–208. Manns, J. R., Howard, M. W., & Eichenbaum, H. B. (2007). Gradual changes in hippocampal activity support remembering the order of events. Neuron, 56, 530–540. McCarthy, G., Nobre, A. C., Bentin, S., & Spencer, D. D. (1995). Language-related field potentials in the anterior-medial temporal lobe: I. Intracranial distribution and neural generators. The Journal of Neuroscience, 15, 1080–1089. Mensink, G.-J. M., & Raaijmakers, J. G. W. (1988). A model for interference and forgetting. Psychological Review, 95, 434–455. Messinger, A., Squire, L. R., Zola, S. M., & Albright, T. D. (2001). Neuronal representations of stimulus associations develop in the temporal lobe during learning. Proceedings of the National Academy of Sciences of the United States of America, 98(21), 12239–12244. Miyashita, Y. (1988). Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature, 335(6193), 817–820. Miyashita, Y., & Chang, H. S. (1988). Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature, 331(6151), 68–70. Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609–626. Murdock, B. B. (1997). Context and mediators in a theory of distributed associative memory (TODAM2). Psychological Review, 104(2), 839–862. Naya, Y., Yoshida, M., & Miyashita, Y. (2003). Forward processing of long-term associative memory in monkey inferotemporal cortex. Journal of Neuroscience, 23(7), 2861–2871. Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004). The University of South Florida free association, rhyme, and word fragment norms. Behavior Research, Methods Instruments and Computers, 36(3), 402–407. Paller, K. A., & Wagner, A. D. (2002). Observing the transformation of experience into memory. Trends in Cognitive Science, 6(2), 93–102. Polyn, S. 
M., & Kahana, M. J. (2008). Memory search and the neural representation of context. Trends in Cognitive Science, 12(1), 24–30. Polyn, S. M., Norman, K. A., & Kahana, M. J. (2009a). A context maintenance and retrieval model of organizational processes in free recall. Psychological Review, 116, 129–156. Polyn, S. M., Norman, K. A., & Kahana, M. J. (2009b). Task context and organization in free recall. Neuropsychologia, 47(11), 2158–2163.
Provyn, J. P., Sliwinski, M. J., & Howard, M. W. (2007). Effects of age on contextually mediated associations in paired associate learning. Psychology and Aging, 22, 846–857. Raaijmakers, J. G. W., & Shiffrin, R. M. (1980). SAM: A theory of probabilistic search of associative memory. In G. H. Bower (Ed.), The psychology of learning and motivation: Advances in research and theory (Vol. 14, pp. 207–262). New York: Academic Press. Rao, V. A., & Howard, M. W. (2008). Retrieved context and the discovery of semantic structure. In J. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems 20 (pp. 1193– 1200). Cambridge, MA: MIT Press. Rohde, D. L. T. (1999). The simple language generator: Encoding complex languages with simple grammars. In Technical Report, CMU-CS-99-123. Pittsburgh, PA: Carnegie Mellon, Department of Computer Science. Sakai, K., & Miyashita, Y. (1991). Neural organization for the long-term memory of paired associates. Nature, 354(6349), 152–155. Schwartz, G., Howard, M. W., Jing, B., & Kahana, M. J. (2005). Shadows of the past: Temporal retrieval effects in recognition memory. Psychological Science, 16(11), 898–904. Sederberg, P. B., Howard, M. W., & Kahana, M. J. (2008). A context-based theory of recency and contiguity in free recall. Psychological Review, 115, 893–912. Sederberg, P. B., Miller, J. F., Kahana, M. J., & Howard, M. W. (in press). Temporal contiguity between recalls predicts episodic memory performance. Memory & Cognition. Shankar, K. H., Jagadisan, U. K. K., & Howard, M. W. (2009). Sequential learning using temporal context. Journal of Mathematical Psychology, 53, 474–485. Slamecka, N. J. (1976). An analysis of double-function lists. Memory & Cognition, 4, 581–585. Steyvers, M., & Tenenbaum, J. (2005). The large scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29, 41–78. Stone, B., Dennis, S., & Kwantes, P. J. (2008). A systematic comparison of semantic models on human similarity rating data: The effectiveness of subspacing. In B. C. Love, K. McRae, & V. M. Sloutsky (Eds.), The Proceedings of the Thirtieth Conference of the Cognitive Science Society (pp. 1813–1818). Austin, TX: Cognitive Science Society. Strogatz, S. H. (2001). Exploring complex networks. Nature, 410(6825), 268–276. Tenenbaum, J. B., Griffiths, T. L., & Kemp, C. (2006). Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Science, 10(7), 309–318. Unsworth, N. (2008). Exploring the retrieval dynamics of delayed and final free recall: Further evidence for temporal-contextual search. Journal of Memory and Language, 59, 223–236. Usher, M., Davelaar, E. J., Haarmann, H. J., & Goshen-Gottstein, Y. (2008). Short-term memory after all: Comment on Sederberg, Howard, and Kahana (2008). Psychological Review, 115(4), 1108–1118. Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of ‘small-world’ networks. Nature, 393(6684), 440–442.
Topics in Cognitive Science 3 (2011) 74–91 Copyright © 2010 Cognitive Science Society, Inc. All rights reserved. ISSN: 1756-8757 print / 1756-8765 online DOI: 10.1111/j.1756-8765.2010.01109.x
Discovering Binary Codes for Documents by Learning Deep Generative Models

Geoffrey Hinton,a Ruslan Salakhutdinovb

a Department of Computer Science, University of Toronto
b Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology
Received 12 March 2009; received in revised form 8 January 2010; accepted 8 January 2010
Abstract
We describe a deep generative model in which the lowest layer represents the word-count vector of a document and the top layer represents a learned binary code for that document. The top two layers of the generative model form an undirected associative memory and the remaining layers form a belief net with directed, top-down connections. We present efficient learning and inference procedures for this type of generative model and show that it allows more accurate and much faster retrieval than latent semantic analysis. By using our method as a filter for a much slower method called TF-IDF, we achieve higher accuracy than TF-IDF alone and save several orders of magnitude in retrieval time. By using short binary codes as addresses, we can perform retrieval on very large document sets in a time that is independent of the size of the document set using only one word of memory to describe each document.

Keywords: Deep learning; Semantic hashing; Auto-encoders; Restricted Boltzmann machines; Document retrieval; Binary codes
Correspondence should be sent to Geoffrey Hinton, Department of Computer Science, University of Toronto, 6 King’s College Rd, Toronto, ON, M5S 3G4 Canada. E-mail: [email protected]

1. Introduction

Representing the semantic content of a document is an unsolved problem. We think it is very unlikely that a low-dimensional representation containing only a few hundred numbers will ever be capable of capturing more than a tiny fraction of the content of the distributed representation over millions of neurons that is used by the brain. Even if the documents do lie on (or near) a fairly low-dimensional, nonlinear manifold in a high-dimensional space of word sequences, it is unlikely that the best way to capture the structure of this manifold is by trying to learn explicit coordinates on the manifold for each document. The brain is much
G. Hinton, R. Salakhutdinov ⁄ Topics in Cognitive Science 3 (2011)
75
more likely to capture the manifold structure implicitly by using an extremely high-dimensional space of distributed representations in which all but a tiny fraction of the space has been ruled out by learned interactions between neurons. This type of implicit representation has many advantages over the explicit representation provided by a low-dimensional set of coordinates on the manifold:

1. It can be learned efficiently from data by extracting multiple layers of features to form a "deep belief net" in which the top-level associative memory contains energy ravines. The low energy floor of a ravine is a good representation of a manifold (Hinton, Osindero, & Teh, 2006).
2. Implicit representation of manifolds using learned energy ravines makes it relatively easy to deal with data that contain an unknown number of manifolds, each of which has an unknown number of intrinsic dimensions.
3. Each manifold can have a number of intrinsic dimensions that varies along the manifold.
4. If documents are occasionally slightly ill-formed, implicit dimensionality reduction can accommodate them by using energy ravines whose dimensionality increases appropriately as the allowed energy level is raised. The same approach can also allow manifolds to merge at higher energies.

In addition to these arguments against explicit representations of manifold coordinates, there is not much evidence for small bottlenecks in the brain. The lowest bandwidth part of the visual system, for example, is the optic nerve with its million or so nerve fibers, and there are good physical reasons for that restriction.

Despite all these arguments against explicit dimensionality reduction, it is sometimes very useful to have an explicit, low-dimensional representation of a document. One obvious use is visualizing the structure of a large set of documents by displaying them in a two- or three-dimensional map. Another use, which we focus on in this paper, is document retrieval. We do not believe that the low-dimensional representations we learn in this paper tell us much about how people represent or retrieve documents. Our main aim is simply to show that our nonlinear, multilayer methods work much better for retrieval than earlier methods that use low-dimensional vectors to represent documents. We find these earlier methods equally implausible as cognitive models, or perhaps even more implausible as they do not work as well.

A very unfortunate aspect of our approach to document retrieval is that we initialize deep autoencoders using the very same "pretraining" algorithm as was used in Hinton et al. (2006). When this algorithm is used to learn very large layers, it can be shown to improve a generative model of the data each time an extra layer is added (strictly speaking, it improves a bound on the probability that the model would generate the training data). When the pretraining procedure is used with a central bottleneck, however, all bets are off.

Numerous models for capturing low-dimensional latent representations have been proposed and successfully applied in the domain of information retrieval. Latent semantic analysis (LSA; Deerwester, Dumais, Landauer, Furnas, & Harshman, 1990) extracts
low-dimensional semantic structure using singular value decomposition to get a low-rank approximation of the word-document co-occurrence matrix. This allows document retrieval to be based on ‘‘semantic’’ content rather than just on keywords. Given some desired dimensionality for the codes, LSA finds codes for documents that are optimal in the sense that they minimize the squared error if the word-count vectors are reconstructed from the codes. To achieve this optimality, however, LSA makes the extremely restrictive assumption that the reconstructed counts for each document are a linear function of its code vector. If this assumption is relaxed to allow more complex ways of generating predicted word counts from code vectors, then LSA is far from optimal. As we shall see, nonlinear generative models that use multiple layers of representation and much smaller codes can perform much better than LSA, both for reconstructing word-count vectors and for retrieving semantically similar documents. When LSA was introduced, there were no efficient algorithms for fitting these more complex models, but that has changed. LSA still has the advantages that it does not get trapped at local optima, it is fast on a conventional computer, and it does not require nearly as much training data as methods that fit more complex models with many more parameters. LSA is historically important because it showed that a large document corpus contains a lot of information about meaning that is relatively easy to extract using a sensible statistical method. As a cognitive model, however, LSA has been made rather implausible by the fact that nonlinear, multilayer methods work much better. A probabilistic version of LSA (pLSA) was introduced by Hofmann (1999), using the assumption that each word is modeled as a single sample from a mixture of topics. The mixing proportions of the topics are specific to the document, but the probability distribution over words that is defined by each topic is the same for all documents. For example, a topic such as ‘‘soccer’’ would have a fixed probability of producing the word ‘‘goal’’ and a document containing a lot of soccer-related words would have a high mixing proportion for the topic ‘‘soccer.’’ To make this into a proper generative model of documents, it is necessary to define a prior distribution over the document-specific topic distributions. This gives rise to a model called ‘‘Latent Dirichlet Allocation,’’ which was introduced by Blei, Ng, and Jordan (2003). All these models can be viewed as graphical models (Jordan, 1999) in which hidden topic variables have directed connections to variables that represent word counts. One major drawback is that exact inference is intractable due to explaining away (Pearl, 1988), so they have to resort to slow or inaccurate approximations to compute the posterior distribution over topics. A second major drawback, that is shared by all mixture models, is that these models can never make predictions for words that are sharper than the distributions predicted by any of the individual topics. They are unable to capture an important property of distributed representations, which is that the broad distributions predicted by individual active features get multiplied together (and renormalized) to give the sharp distribution predicted by a whole set of active features. This intersection or ‘‘conjunctive coding’’ property allows individual features to be fairly general but their joint effect to be much more precise. 
The ‘‘disjunctive coding’’ employed by mixture models cannot achieve precision in this way. For example, distributed representations allow the topics ‘‘torture,’’ ‘‘deregulation,’’
and ‘‘oil’’ to combine to give very high probability to a few familiar names that are not predicted nearly as strongly by each topic alone. Since the introduction of the term ‘‘distributed representation’’ (Hinton, McClelland, & Rumelhart, 1986), its meaning has evolved beyond the original definition in terms of set intersections, but in this paper the term is being used in its original sense. Welling, Rosen-Zvi, and Hinton (2005) point out that for information retrieval, fast inference is vital and to achieve this they introduce a class of two-layer undirected graphical models that generalize restricted Boltzmann machines (RBMs; see Section 2) to exponential family distributions, thus allowing them to model nonbinary data and to use nonbinary hidden variables. Maximum likelihood learning is intractable in these models because they use nonlinear distributed representations, but learning can be performed efficiently by following an approximation to the gradient of a different objective function called ‘‘contrastive divergence’’ (Hinton, 2002). Several further developments of these undirected models (Gehler, Holub, & Welling, 2006; Xing, Yan, & Hauptmann, 2005) show that they are competitive in terms of retrieval accuracy to their directed counterparts. There are limitations on the types of structure that can be represented efficiently by a single layer of hidden variables and a network with multiple, nonlinear hidden layers should be able to discover representations that work better for retrieval. In this paper, we present a deep generative model whose top two layers form an undirected bipartite graph (see Fig. 1). The lower layers form a multilayer directed belief network, but unlike Latent Dirichlet Allocation this belief net uses distributed representations. The model can be trained efficiently by using an RBM to learn one layer of hidden variables at a time (Hinton et al., 2006; Hinton, 2007a). After learning the features in one hidden layer, the activation vectors of those features when they are being driven by data are used as the ‘‘data’’ for training the next hidden layer.
Fig. 1. Left panel: The deep generative model. Middle panel: Pretraining consists of learning a stack of restricted Boltzmann machines (RBMs) in which the feature activations of one RBM are treated as data by the next RBM. Right panel: After pretraining, the RBMs are ‘‘unrolled’’ to create a multilayer autoencoder that is fine-tuned by backpropagation.
After this greedy ‘‘pretraining’’ is complete, the composition of all of the RBMs yields a feed-forward ‘‘encoder’’ network that converts word-count vectors to compact codes. By composing the RBMs in the opposite order (but with the same weights) we get a ‘‘decoder’’ network that converts compact code vectors into reconstructed word-count vectors. When the encoder and decoder are combined, we get a multilayer autoencoder network that converts word-count vectors into reconstructed word-count vectors via a compact bottleneck. This autoencoder network only works moderately well, but it is an excellent starting point for a fine-tuning phase of the learning which uses back-propagation to greatly improve the reconstructed word counts. In general, the representations produced by greedy unsupervised learning are helpful for regression or classification, but this typically requires large hidden layers that recode the structure in the input as complex sparse features while retaining almost all of the information in the input. When the hidden layers are much smaller than the input layer, a further type of learning is required (Hinton & Salakhutdinov, 2006). After the greedy, layerby-layer training, the deep generative model of documents is not significantly better for document retrieval than a model with only one hidden layer. To take full advantage of the multiple hidden layers, the layer-by-layer learning must be treated as a ‘‘pretraining’’ stage that finds a good region of the parameter space. Starting in this region, back-propagation learning can be used to fine-tune the parameters to produce a much better model. The backpropagation fine-tuning is not responsible for discovering what features to use in the hidden layers of the autoencoder. Instead, it just has to slightly modify the features found by the pretraining in order to improve the reconstructions. This is a much easier job for a myopic, gradient descent procedure like back-propagation than discovering what features to use. After learning, the mapping from a word-count vector to its compact code is very fast, requiring only a matrix multiplication followed by a componentwise nonlinearity for each hidden layer. In Section 2 we introduce the RBM. A longer and gentler introduction to RBMs can be found in Hinton (2007a). In Section 3 we generalize RBMs in two ways to obtain a generative model for word-count vectors. This model can be viewed as a variant of the Rate Adaptive Poisson model (Gehler et al., 2006) that is easier to train and has a better way of dealing with documents of different lengths. In Section 4 we describe both the layer-by-layer pretraining and the fine-tuning of the deep generative model. We also show how ‘‘deterministic noise’’ can be used to force the fine-tuning to discover binary codes in the top layer. In Section 5 we show that 128-bit binary codes are slightly more accurate than 128 real-valued codes produced by LSA, in addition to being faster and more compact. We also show that by using the 128-bit binary codes to restrict the set of documents searched by TF-IDF (Salton & Buckley, 1988), we can slightly improve the accuracy and vastly improve the speed of TF-IDF. Finally, in Section 6 we show that we can use our model to allow retrieval in a time independent of the number of documents. A document is mapped to a memory address in such a way that a small hamming-ball around that memory address contains the semantically similar documents. We call this technique ‘‘semantic hashing’’ (Salakhutdinov & Hinton, 2007).
2. Learning feature detectors for binary images

A Boltzmann machine is a network of symmetrically connected, neuron-like units that make stochastic decisions about whether to be on or off. Boltzmann machines have a simple learning algorithm (Hinton & Sejnowski, 1983) that allows them to discover features that represent complex regularities in the training data. The learning algorithm is very slow in networks with many layers of feature detectors, but it is fast in the RBM—a network with a single layer of feature detectors that are not directly connected to one another. RBMs have been used extensively for modeling binary images and so they will be explained in this context as in Hinton (2007b) before introducing the modifications that are required for modeling documents.

If there are no direct interactions between the feature detectors and no direct interactions between the pixels, there is a simple and efficient way to learn a good set of feature detectors from a set of binary training images (Hinton, 2002). We start with zero weights on the symmetric connections between each pixel i and each feature detector j. We then repeatedly update each weight, w_ij, using the difference between two measured, pairwise correlations

\Delta w_{ij} = \epsilon\left(\langle v_i h_j\rangle_{\mathrm{data}} - \langle v_i h_j\rangle_{\mathrm{recon}}\right) \quad (1)
where ε is a learning rate, ⟨v_i h_j⟩_data is the frequency with which pixel i and feature detector j are on together when the feature detectors are being driven by images from the training set, and ⟨v_i h_j⟩_recon is the corresponding frequency when the feature detectors are being driven by reconstructed images. A similar learning rule can be used for the biases. Given a training image, we set the binary state, h_j, of each feature detector to be 1 with a probability given by the logistic sigmoid, σ(x) = (1 + exp(−x))^{−1}:

p(h_j = 1) = \sigma\!\left(b_j + \sum_{i \in \mathrm{pixels}} v_i w_{ij}\right) \quad (2)
where b_j is the bias of j and v_i is the binary state of pixel i. Once binary states have been chosen for the hidden units we produce a "reconstruction" of the training image by setting the state of each pixel to be 1 with probability

p(v_i = 1) = \sigma\!\left(a_i + \sum_{j \in \mathrm{features}} h_j w_{ij}\right) \quad (3)

where a_i is the bias of pixel i. The learned weights and biases of the features implicitly define a probability distribution over all possible binary images. Sampling from this distribution is difficult, but it can be done by using "alternating Gibbs sampling." This starts with a random image and then alternates between updating all of the features in parallel using Eq. 2 and updating all of the pixels in parallel using Eq. 3. After Gibbs sampling for sufficiently long, the network reaches "thermal equilibrium." The states of pixels and feature detectors still change, but the probability of finding the system in any particular binary configuration does not.
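To make the learning procedure concrete, the following sketch implements Eqs. 1–3 as a single contrastive-divergence (CD-1) update for a small binary RBM. This is our own illustrative code, not the authors' implementation; the variable names, layer sizes, and learning rate are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, a, b, epsilon=0.1, rng=np.random):
    """One CD-1 update for a binary RBM.

    v_data: (n_cases, n_pixels) binary training images.
    W: (n_pixels, n_features) symmetric weights; a, b: pixel and feature biases.
    Implements Eqs. 2-3 for the up/down passes and Eq. 1 for the weight change.
    """
    # Up-pass: sample binary feature states driven by the data (Eq. 2).
    p_h_data = sigmoid(b + v_data @ W)
    h_data = (rng.random_sample(p_h_data.shape) < p_h_data).astype(float)

    # Down-pass: produce a "reconstruction" of the images (Eq. 3).
    p_v_recon = sigmoid(a + h_data @ W.T)
    # Up-pass on the reconstruction (probabilities suffice for the statistics).
    p_h_recon = sigmoid(b + p_v_recon @ W)

    n = v_data.shape[0]
    pos_corr = v_data.T @ p_h_data / n      # <v_i h_j>_data
    neg_corr = p_v_recon.T @ p_h_recon / n  # <v_i h_j>_recon

    W += epsilon * (pos_corr - neg_corr)    # Eq. 1
    a += epsilon * (v_data.mean(0) - p_v_recon.mean(0))
    b += epsilon * (p_h_data.mean(0) - p_h_recon.mean(0))
    return W, a, b
```

The update follows an approximation to the contrastive divergence gradient rather than maximum likelihood, which is what makes a single up-down pass sufficient in practice.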
3. A generative model of word counts

The binary stochastic units used in Boltzmann machines can be generalized to "softmax" units that have more than two discrete values (Hinton, 2002). We use an RBM in which the hidden units are binary, but each visible unit has as many different values as there are word types. We represent the mth discrete value of visible unit i by a vector containing a 1 at the mth location and zeros elsewhere. Each hidden unit, j, then has many different weights connecting it to each visible unit, i, and it provides top-down support for the mth value of visible unit i via a weight, w_{ij}^m. The activation rule for a softmax unit is

p(v_i^m = 1 \mid \mathbf{h}) = \frac{\exp\!\left(\sum_j h_j w_{ij}^m\right)}{\sum_k \exp\!\left(\sum_j h_j w_{ij}^k\right)} \quad (4)
where the superscript m is used to denote one of the discrete values of i and k is an index over all possible discrete values. Now suppose that for each document we create an RBM with as many softmax units as there are words in the document. Assuming that we are ignoring the order of the words, all of these softmax units can share the same set of weights connecting them to the binary hidden units. The weights can also be shared by the whole family of different-sized RBMs that are required for documents of different lengths. We call this the "Replicated Softmax" model. Using N softmax units with identical weights is equivalent to having one softmax unit which we sample N times. This makes it clear that using N replicated softmax units is equivalent to taking N samples from a multinomial distribution. A pleasing property of softmax units is that the learning rule in Eq. 1 remains the same (Hinton, 2002):

\Delta w_{ij}^m = \epsilon\left(\langle v_i^m h_j\rangle_{\mathrm{data}} - \langle v_i^m h_j\rangle_{\mathrm{recon}}\right) \quad (5)
As all of the N softmax units share the same weights, we can drop the subscript on the v and write the learning rule as:

\Delta w_{j}^m = \epsilon\left(\langle N v^m h_j\rangle_{\mathrm{data}} - \langle N v^m h_j\rangle_{\mathrm{recon}}\right) \quad (6)
where v^m denotes the count for the mth word divided by N. This model was called the "constrained Poisson model" in Hinton and Salakhutdinov (2006).
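As an illustration of the activation rule in Eq. 4, the sketch below computes the softmax distribution over word types given a binary hidden vector and then draws N words for a reconstructed document, which is the sense in which N replicated softmax units amount to N samples from one multinomial. This is our own illustrative code; the sizes and names are assumptions, not the authors' implementation.

```python
import numpy as np

def softmax_visible_distribution(h, W):
    """Eq. 4: probability of each word type given binary hidden states h.

    h: (n_hidden,) binary vector; W: (n_hidden, n_word_types) shared weights.
    Returns a multinomial distribution over the word types.
    """
    scores = h @ W                      # sum_j h_j w_j^m for every word type m
    scores -= scores.max()              # numerical stability
    p = np.exp(scores)
    return p / p.sum()

def reconstruct_counts(h, W, N, rng=np.random):
    """Sample N words (the document length) from the shared softmax unit."""
    p = softmax_visible_distribution(h, W)
    return rng.multinomial(N, p)        # reconstructed word-count vector

# Toy usage: 10 hidden features, 2,000 word types, a 120-word document.
rng = np.random.RandomState(0)
W = 0.01 * rng.randn(10, 2000)
h = (rng.random_sample(10) < 0.5).astype(float)
counts = reconstruct_counts(h, W, N=120, rng=rng)
```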
4. Pretraining and fine-tuning a deep generative model

A single layer of binary features is not the best way to capture the structure in the count data. After learning the first layer of features, a second layer is learned by treating the activation probabilities of the existing features, when they are being driven by real data, as the
data for the second-level RBM (see Fig. 1). The difference from learning the first layer of features is that the "visible" units of the second-level RBM are also binary, as in a standard RBM. This greedy, layer-by-layer training can be repeated several times to learn a deep, hierarchical model in which each layer of features captures strong high-order correlations between the activities of features in the layer below.

Recursive learning of deep generative model:
1. Learn the parameters θ1 = (W1, a1, b1) of the replicated softmax model.
2. Freeze the parameters of the replicated softmax model and use the activation probabilities of the binary features, when they are being driven by training data, as the data for training the next layer of binary features.
3. Freeze the parameters θ2 that define the second layer of features and use the activation probabilities of those features as data for training the third layer of binary features.
4. Proceed recursively for as many layers as desired.

To justify this layer-by-layer approach, it would be good to show that adding an extra layer of feature detectors always increases the probability that the overall generative model would generate the training data. This is almost true: Provided the number of feature detectors does not decrease and their weights are initialized correctly, adding an extra layer is guaranteed to raise a lower bound on the log probability of generating the training data (Hinton et al., 2006).

4.1. Fine-tuning the weights

After pretraining, the individual RBMs at each level are "unrolled" as shown in Fig. 1 to create a deep autoencoder. If the stochastic activities of the binary features are replaced by deterministic, real-valued probabilities, we can then backpropagate through the entire network to fine-tune the weights for optimal reconstruction of the count data. For the fine-tuning, we divide the count vector by the number of words so that it represents a probability distribution across words. Then we use the cross-entropy error function, C, with a "softmax" at the output layer:

C = -\sum_m v_m^{\mathrm{data}} \log v_m^{\mathrm{output}} \quad (7)
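The sketch below shows the reconstruction objective of Eq. 7 for one document: the count vector is normalized to a probability distribution, the autoencoder's output layer is passed through a softmax, and the two are scored with the cross-entropy. This is illustrative only (the function names are ours, and in the paper this gradient is followed with conjugate gradients rather than anything shown here).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_entropy_loss(word_counts, output_logits):
    """Eq. 7: C = -sum_m v_m^data log v_m^output for one document.

    word_counts: raw counts over the 2,000-word vocabulary.
    output_logits: pre-softmax activities of the autoencoder's output layer.
    """
    v_data = word_counts / word_counts.sum()   # normalize to a distribution
    v_out = softmax(output_logits)
    return -np.sum(v_data * np.log(v_out + 1e-12))

# Toy check: the loss is smallest when the output matches the data distribution.
counts = np.array([3.0, 1.0, 0.0, 6.0])
good = cross_entropy_loss(counts, np.log(counts / counts.sum() + 1e-12))
bad = cross_entropy_loss(counts, np.zeros(4))
assert good < bad
```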
The fine-tuning makes the codes in the central layer of the autoencoder work much better for information retrieval.

4.2. Making the codes binary

During the fine-tuning, we want backpropagation to find codes that are good at reconstructing the count data but are as close to binary as possible. To make the codes binary, we add Gaussian noise to the bottom-up input received by each code unit. Assuming
Fig. 2. The distribution of the activities of the 128-code units on the 20 Newsgroup training data before and after adding deterministic noise to the code units. When the fine-tuning of the autoencoder network is performed without adding noise, the code layer learns to transmit a lot of information by using precise real values that lie between 0 and 1. The distribution of values is shown on the left. If these values are thresholded to produce a binary code, the reconstruction will be poor because the fine-tuning did not take the thresholding into account. If additional noise is added to the input to the code units during the fine-tuning, the total input received from the layer below learns to be big and positive or big and negative so that the code unit can still transmit one bit of information despite the noise. This distribution is shown on the right and it makes the codes much more robust to thresholding.
that the decoder network is insensitive to very small differences in the output of a code unit, the best way to communicate information in the presence of added noise is to make the bottom-up input received by a code unit large and negative for some training cases and large and positive for others. Fig. 2 shows that this is what the fine-tuning does. We tried other ways of encouraging the code units to be binary, but Gaussian noise worked better. To prevent the added Gaussian noise from messing up the conjugate gradient fine-tuning, we used "deterministic noise" with mean zero and variance 16. For each training case, the sampled noise values are fixed in advance and do not change during training. With a limited number of training cases, the optimization could tailor the parameters to the fixed noise values, but this is not possible when the total number of sampled noise values is much larger than the number of parameters.

4.3. Details of the training

To speed up the pretraining, we subdivided both datasets into small mini-batches, each containing 100 cases,1 and updated the weights after computing the gradient of the reconstruction error on each mini-batch. For large datasets this is much more efficient than using the entire dataset to compute the gradient. Early in the training, the gradient vector computed from a small mini-batch is likely to point in the same general direction as the gradient vector computed from the whole dataset, so progress can be made rapidly by just using one
small mini-batch per weight update. Of course, there will be sampling noise when the gradient is computed in this way, but that will be corrected on subsequent mini-batches. For both datasets each layer was greedily pretrained for 50 passes (epochs) through the entire training dataset. The weights were updated using a learning rate of 0.1, momentum of 0.9, and a weight decay of 0.0002 · weight · learning rate. The weights were initialized with small random values sampled from a zero-mean normal distribution with variance 0.01. For fine-tuning we used the method of conjugate gradients2 on larger minibatches of 1,000 data vectors, with three line searches performed for each minibatch in each epoch. To determine an adequate number of epochs and to avoid overfitting, we fine-tuned on a fraction of the training data and tested performance on the remaining validation data. We then repeated fine-tuning on the entire training dataset for 50 epochs. Slight overfitting was observed on the 20 Newsgroup corpus but not on the Reuters corpus. After fine-tuning the codes were thresholded to produce binary code vectors. The asymmetry between 0 and 1 in the energy function of an RBM causes the unthresholded codes to have many more values near 0 than near 1, so we used a threshold of s ¼ 0.1. This works well for document retrieval even though it is suboptimal for document reconstruction. We experimented with various values for the noise variance and the threshold, as well as the learning rate, momentum, and weight-decay parameters used in the pretraining. Our results are fairly robust to variations in these parameters and also to variations in the number of layers and the number of units in each layer. The precise weights found by the pretraining do not matter as long as it finds a good region from which to start the fine-tuning (Erhan, Manzagol, Bengio, Bengio, & Vincent, 2009).
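For concreteness, the following sketch applies one mini-batch weight update using the learning rate, momentum, and weight-decay settings quoted above. It is our reconstruction from the stated hyperparameters, not the authors' code; in particular, the exact placement of the weight-decay term inside the momentum update is an assumption.

```python
import numpy as np

LEARNING_RATE = 0.1
MOMENTUM = 0.9
WEIGHT_DECAY = 0.0002   # multiplied by weight and learning rate, as in the text

def minibatch_update(W, W_inc, grad):
    """One pretraining update on a mini-batch of roughly 100 cases.

    W: current weights; W_inc: running momentum term; grad: estimated gradient
    for this mini-batch (e.g., the <v h>_data - <v h>_recon statistics of Eq. 1).
    """
    W_inc[:] = MOMENTUM * W_inc + LEARNING_RATE * (grad - WEIGHT_DECAY * W)
    W += W_inc
    return W, W_inc

# Toy usage with a small matrix standing in for a 2000-by-500 weight matrix.
rng = np.random.RandomState(0)
W = 0.01 * rng.randn(20, 5)             # small zero-mean Gaussian initialization
W_inc = np.zeros_like(W)
for epoch in range(50):                 # 50 pretraining passes in the paper
    grad = rng.randn(*W.shape) * 1e-3   # placeholder for the CD statistics
    W, W_inc = minibatch_update(W, W_inc, grad)
```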
5. Experimental results

To evaluate performance of our model on an information retrieval task we do the following.

Document retrieval procedure:
1. Map all query (test) documents into 128-bit binary codes by performing an up-pass through the model and thresholding top-level activation probabilities at s = 0.1.
2. For each query document: Calculate its similarity to all other test documents in the 128-bit space using Hamming distance. Retrieve the D most similar documents. Measure accuracy by computing the ratio of correctly retrieved documents (which belong to the same category as a query document) to the total number of retrieved documents.
3. Average results across all queries.

Results of Gehler et al. (2006) show that pLSA and LDA models do not generally outperform LSA and TF-IDF. Therefore, for comparison, we only used LSA and TF-IDF as
benchmark methods. For LSA each word count, ci, was replaced by log (1 + ci) before the SVD, which slightly improved performance. TF-IDF computes document similarity directly in the word-count space, which is slow. For both these methods we used the cosine of the angle between two vectors as a measure of their similarity. ‘‘TF-IDF’’ stands for ‘‘Term-Frequency, Inverse-Document-Frequency’’ and it is a very sensible way of deciding how important the counts of a particular word-type are for determining the similarity of two documents. If a particular word has a high count (a high Term Frequency) in both documents this makes them similar. But the similarity is much greater if that word is rare in documents in general (a high Inverse Document Frequency). Generally, the logarithm of the inverse document frequency is used. 5.1. Description of the text corpora In this section we present experimental results for document retrieval on two text datasets: 20-Newsgroups and Reuters Corpus Volume II. The 20 newsgroup corpus contains 18,845 postings taken from the Usenet newsgroup collection. The corpus is partitioned fairly evenly into 20 different newsgroups, each corresponding to a separate topic.3 The data were split by date into 11,314 training and 7,531 test articles, so the training and test sets were separated in time. The training set was further randomly split into 8,314 training and 3,000 validation documents. Some newsgroups are very closely related to each other, for example, soc.religion.christian and talk.religion.misc, while others are very different, for example, rec.sport.hockey and comp.graphics (see Fig. 3). We further preprocessed the data by removing common stopwords, stemming, and then only considering the 2,000 most frequent words in the training dataset. As a result, each posting was represented as a vector containing 2,000 word counts. No other preprocessing was made. The Reuters Corpus Volume II is an archive of 804,414 newswire stories4 that have been manually categorized into 103 topics. The corpus covers four major groups: corporate/industrial, economics, government/social, and markets. Sample topics are displayed in Fig. 3. The data were randomly split into 402,207 training and 402,207 test articles. The training set was further randomly split into 302,207 training and 100,000 validation documents. The available data were already in the preprocessed format, where common stopwords were removed and all documents were stemmed. We again only considered the 2,000 most frequent words in the training dataset. 5.2. Results For both datasets we used the 2000-500-500-128 architecture shown in Fig. 1. To see whether the learned 128-bit codes preserve class information, we used Stochastic Neighbor Embedding (Hinton & Roweis, 2003) to visualize the 128-bit codes of all the documents from five or six separate classes. Fig. 3 shows that for both datasets the 128-bit codes preserve the class structure of the documents.
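A minimal sketch of the retrieval procedure listed at the start of this section is given below, using our own illustrative code (the code-packing, label array, and toy data are assumptions): each test document is reduced to a 128-bit code by thresholding, neighbours are ranked by Hamming distance, and accuracy is the fraction of the D retrieved documents that share the query's category.

```python
import numpy as np

def to_binary_codes(code_probs, threshold=0.1):
    """Threshold top-level activation probabilities to get 128-bit codes."""
    return (code_probs > threshold).astype(np.uint8)   # (n_docs, 128)

def retrieval_accuracy(codes, labels, D):
    """Average accuracy over all queries when retrieving the D nearest codes."""
    n = codes.shape[0]
    accs = []
    for q in range(n):
        dists = np.count_nonzero(codes != codes[q], axis=1)  # Hamming distances
        dists[q] = codes.shape[1] + 1                        # exclude the query
        retrieved = np.argsort(dists, kind="stable")[:D]
        accs.append(np.mean(labels[retrieved] == labels[q]))
    return float(np.mean(accs))

# Toy usage with random codes and category labels.
rng = np.random.RandomState(0)
codes = to_binary_codes(rng.random_sample((200, 128)))
labels = rng.randint(0, 20, size=200)
print(retrieval_accuracy(codes, labels, D=7))
```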
[Fig. 3 shows two panels: (A) "20 Newsgroup 2-D Topic Space," with clusters labeled rec.sport.hockey, comp.graphics, talk.politics.guns, sci.cryptography, talk.politics.mideast, and soc.religion.christian; (B) "Reuters 2-D Topic Space," with clusters labeled Disasters and Accidents, Government Borrowing, European Community, Monetary/Economic, Energy Markets, and Accounts/Earnings.]

Fig. 3. Two-dimensional embedding of 128-bit codes using SNE for 20 Newsgroup data (panel A) and Reuters RCV2 corpus (panel B).
In addition to requiring very little memory, binary codes allow very fast search because fast bit counting routines5 can be used to compute the Hamming distance between two binary codes. On a 3GHz Intel Xeon running C, for example, it only takes 3.6 ms to search through 1 million documents using 128-bit codes. The same search takes 72 ms for 128-dimensional LSA and 1.2 s for TF-IDF, though this could be considerably reduced by using a less naive method that uses inverted indexing.
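The speed comes from packing each 128-bit code into a couple of machine words and counting differing bits. A rough Python sketch of the idea is below; it is illustrative only, since the timings above were obtained with compiled bit-counting routines rather than this code.

```python
import numpy as np

def pack_code(bits):
    """Pack a length-128 array of 0s and 1s into two 64-bit integers."""
    assert bits.size == 128
    words = np.packbits(bits.astype(np.uint8)).view(">u8")  # two 8-byte words
    return int(words[0]), int(words[1])

def hamming(code_a, code_b):
    """Hamming distance via bit counting on the packed representation."""
    return (bin(code_a[0] ^ code_b[0]).count("1") +
            bin(code_a[1] ^ code_b[1]).count("1"))

# Toy usage.
rng = np.random.RandomState(0)
a = pack_code(rng.randint(0, 2, 128))
b = pack_code(rng.randint(0, 2, 128))
print(hamming(a, b))
```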
[Fig. 4 shows two panels plotting accuracy against the number of retrieved documents (1 to 7,531) for the 20 Newsgroup dataset: (A) the fine-tuned 128-bit DGM, the 128-bit DGM prior to fine-tuning, and LSA-128; (B) the hybrid 128-bit DGM using TF-IDF, TF-IDF alone, and LSA-128.]

Fig. 4. Accuracy curves for 20 Newsgroup dataset, when a query document from the test set is used to retrieve other test set documents, averaged over all 7,531 possible queries.
[Fig. 5 shows two panels plotting accuracy against the number of retrieved documents (1 to 1,023) for the Reuters corpus: (A) the fine-tuned 128-bit DGM, the 128-bit DGM prior to fine-tuning, and LSA-128; (B) the hybrid 128-bit DGM using TF-IDF, TF-IDF alone, and LSA-128.]

Fig. 5. Accuracy curves for Reuters RCV2 dataset, when a query document from the test set is used to retrieve other test set documents, averaged over all 402,207 possible queries.
Figs. 4 and 5 (panels A) show that our 128-bit codes are better at document retrieval than the 128 real-values produced by LSA. TF-IDF is slightly more accurate than our 128-bit codes6 when retrieving the top 15–30 documents in either dataset. If, however, we use our 128-bit codes to preselect the top 100 documents for the 20 Newsgroup data or the top 1,000 for the Reuters data, and then re-rank these preselected documents using TF-IDF, we get better accuracy than running TF-IDF alone on the whole document set (see Figs. 4 and 5). This shows that the 128-bit codes can correctly reject some documents that TF-IDF would rank very highly. On 400,000 documents, the naive implementation of TF-IDF takes 0.48 s and our more accurate hybrid method takes .0014 + .0012 s—about 200 times faster.
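A sketch of the hybrid procedure as we understand it is shown below. The shortlist sizes follow the text (100 for 20 Newsgroups, 1,000 for Reuters); everything else, including the function names and toy data, is an illustrative assumption rather than the authors' implementation.

```python
import numpy as np

def hybrid_retrieve(query_idx, codes, tfidf, shortlist_size=100, top_k=15):
    """Preselect by Hamming distance on 128-bit codes, then re-rank with TF-IDF.

    codes: (n_docs, 128) binary codes; tfidf: (n_docs, n_terms) TF-IDF vectors.
    """
    dists = np.count_nonzero(codes != codes[query_idx], axis=1)
    dists[query_idx] = codes.shape[1] + 1          # exclude the query itself
    shortlist = np.argsort(dists, kind="stable")[:shortlist_size]

    # Re-rank the shortlist by cosine similarity in the word-count space.
    q = tfidf[query_idx]
    sims = tfidf[shortlist] @ q / (
        np.linalg.norm(tfidf[shortlist], axis=1) * np.linalg.norm(q) + 1e-12)
    return shortlist[np.argsort(-sims)][:top_k]

# Toy usage with random data.
rng = np.random.RandomState(0)
codes = (rng.random_sample((500, 128)) > 0.5).astype(np.uint8)
tfidf = rng.random_sample((500, 2000))
print(hybrid_retrieve(0, codes, tfidf))
```

Because only the shortlist is ever scored with TF-IDF, the expensive word-count comparison runs over a few hundred documents instead of the whole collection, which is where the large speedup comes from.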
6. Retrieval in constant time using a semantic address space

Using 128-bit codes, we have shown that documents which are semantically similar can be mapped to binary vectors that are close in hamming space. If we could do this for 30-bit codes, we would have the ultimate retrieval method: Given a query document, compute its 30-bit address and retrieve all of the documents stored at similar addresses with no search at all. For a billion documents, a 30-bit address space gives a density of one document per address and a hamming-ball of radius 5 should contain about 175,000 documents. The retrieved documents could then be given to a slower but more precise retrieval method. It is unlikely that the 175,000 documents in the hamming-ball would contain most of the very similar documents, so recall would not be high, but it is possible that precision could still be very good.

Using 20-bit codes, we checked whether our learning procedure could discover a way to model similarity of count-vectors by similarity of 20-bit addresses that was good enough to allow high precision retrieval for our set of 402,207 test documents. For example, a hamming ball of radius 4 contains 6,196 of the million addresses so it should contain about 2,500 documents. Fig. 6 shows that accuracy is not lost by restricting TF-IDF to this preselected set. We can also use a two-stage filtering procedure by first retrieving documents using 20-bit addresses in a hamming ball of larger radius 6 (about 25,000 documents), filtering these down to 1,000 using 128-bit codes, and then applying TF-IDF. This method is faster and achieves higher accuracy as shown in Fig. 6.

Scaling up the learning to a billion training cases would not be particularly difficult. Using mini-batches, the learning time is sublinear in the size of the dataset if there is
[Fig. 6 plots precision (%) against recall (%) on the Reuters corpus for four methods: TF-IDF, TF-IDF using 20 bits, TF-IDF using 20 bits and 128 bits, and locality sensitive hashing.]

Fig. 6. Accuracy curves for Reuters RCV2 dataset, when a query document from the test set is used to retrieve other test set documents, averaged over all 402,207 possible queries. Locality sensitive hashing (LSH) is the fastest current method for finding similar documents. It is less accurate and also much slower than using the Hamming ball around a learned binary code to create a shortlist for TF-IDF.
redundancy in the data as there surely is. Also, different processors can be used to compute the gradients for different examples within a large mini-batch. To improve recall, we could learn several different ‘‘semantic’’ address spaces on disjoint training sets and then preselect documents that are close to the query document in any of the semantic address spaces.
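To make "semantic hashing" concrete, the sketch below (our own illustrative code, not the authors') stores each document under its 20-bit address and answers a query by enumerating every address within a small Hamming radius of the query's address, so the amount of work is independent of how many documents are stored.

```python
from collections import defaultdict
from itertools import combinations

def build_table(addresses):
    """addresses: list of 20-bit integer codes, one per document index."""
    table = defaultdict(list)
    for doc_id, addr in enumerate(addresses):
        table[addr].append(doc_id)
    return table

def hamming_ball(addr, radius, n_bits=20):
    """Yield every address within the given Hamming radius of addr."""
    for r in range(radius + 1):
        for bits in combinations(range(n_bits), r):
            flipped = addr
            for b in bits:
                flipped ^= 1 << b
            yield flipped

def retrieve(table, query_addr, radius=4):
    """Collect every document stored at an address inside the Hamming ball."""
    hits = []
    for a in hamming_ball(query_addr, radius):
        hits.extend(table.get(a, []))
    return hits

# Toy usage: documents 0 and 1 sit at nearby addresses, document 2 does not.
table = build_table([0b00000000000000000011,
                     0b00000000000000000111,
                     0b11111000000000000000])
print(retrieve(table, 0b00000000000000000011, radius=2))   # -> [0, 1]
```

With radius 4 the generator enumerates the 6,196 addresses mentioned above, so the preselection cost depends only on the code length and radius, not on the size of the document set.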
7. Discussion

We have described a rather complicated way of learning efficient, compact, nonlinear codes for documents, and we have shown that these codes are very good for retrieving similar documents. The learning is slow, especially the fine-tuning stage, but it scales very well with the size of the dataset. The use of a simple gradient method, as opposed to a complicated convex optimization, means that it is easy to incorporate new data without starting all over again. Even though the learning is quite slow, the inference is very fast. The code for a new document can be extracted by a single, deterministic feed-forward pass through the encoder part of the network and this requires only a few vector-matrix multiplies.

To gain further insight into the nature of the codes that are learned, it would be helpful if we could visualize what the individual elements of the codes represent. Unfortunately, this is not nearly as easy to do as with topic models because in our codes, each component is activated by a very large set of documents and the precision comes by intersecting all of these sets. By contrast, in topic models each word must be generated by a single topic so the individual topics are, necessarily, far more precise and therefore far easier to visualize. For generative models of labeled data it is possible to fix the class label and see what typical instances of that class look like as is done in Hinton et al. (2006), but this method cannot be applied when the model is learned on unlabeled data.

Our method for learning codes could be applied to the same word-document matrix in a different way by treating each word as a training case rather than each document. The codes would then represent word meanings and it would be interesting to visualize them in a two-dimensional map using t-SNE (van der Maaten & Hinton, 2008), which is very good at organizing the layout so that very similar codes are very close to one another (see http://www.cs.toronto.edu/~hinton/turian.png for an example).

While the use of short codes may be interesting for the technology of information retrieval, we think it is largely irrelevant to the issue of how people represent and retrieve documents for the reasons given in the Introduction. However, the same deep learning methods can be used to extract very large, very sparse codes, and we think this is a far more promising direction for future research on how the brain represents the contents of a document.

Notes

1. The last minibatch contained more than 100 cases.
2. Code is available at http://www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/.
3. The data are available at http://people.csail.mit.edu/jrennie/20Newsgroups (20news-bydate.tar.gz). It has been preprocessed and organized by date.
4. The Reuters Corpus Volume 2 is available at http://trec.nist.gov/data/reuters/reuters.html.
5. Code is available at http://www-db.stanford.edu/~manku/bitcount/bitcount.html.
6. 256-bit codes work slightly better than 128-bit ones.
Acknowledgments

This research was supported by NSERC, CFI, and Google. G. E. H. is a fellow of the Canadian Institute for Advanced Research.
References Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022. Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6), 391–407. Erhan, D., Manzagol, P., Bengio, Y., Bengio, S., & Vincent, P. (2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. AISTATS 2009, 5, 153–160. Gehler, P., Holub, A., & Welling, M. (2006). The rate adapting poisson (RAP) model for information retrieval and object recognition. Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, PA. Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1711–1800. Hinton, G. E. (2007a). Learning multiple layers of representation. Trends in Cognitive Science, 11, 428–434. Hinton, G. E. (2007b). To recognize shapes, first learn to generate images. In P. Cisek, D. Kalaska, & J. Haran (Eds.), Computational neuroscience: Theoretical insights into brain function (pp. 535–547). Montreal: Elsevier. Hinton, G. E., & Roweis, S. T. (2003). Stochastic neighbor embedding. In S. Thrun, L. K. Saul, & B. Scholkopf (Eds.), Advances in neural information processing systems 15 (pp. 833–840). Cambridge, MA: MIT Press. Hinton, G. E., & Salakhutdinov, R. R. (2006). Non-linear dimensionality reduction using neural networks. Science, 313, 504–507. Hinton, G. E., & Sejnowski, T. J. (1983). Optimal perceptual inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, Washington, DC. Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: explorations in the microstructure of cognition. Volume 1: Foundations (pp. 77–109). Cambridge, MA: MIT Press. Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554. Hofmann, T. (1999). Probabilistic latent semantic analysis. In P. Laskey & H. Prade (Eds.), Proceedings of the 15th conference on uncertainty in AI (pp. 286–296). Stockholm, Sweden: Morgan Kaufmann. Jordan, M. (1999). Learning in Graphical Models. Cambridge, MA: MIT Press. van der Maaten, L. J. P., & Hinton, G. E. (2008). Visualizing data using t-sne. Journal of Machine Learning Research, 9, 2579–2605.
Pearl, J. (1988). Probabilistic inference in intelligent systems: networks of plausible inference. San Mateo, CA: Morgan Kaufmann. Salakhutdinov, R. R., & Hinton, G. E. (2007). Semantic hashing. In J. Fernandez-Luna, B. Piwowarski, & J. F. Huete (Eds.), Proceedings of the SIGIR Workshop on Information Retrieval and Applications of Graphical Models). Amsterdam, The Netherlands. Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513–523. Welling, M., Rosen-Zvi, M., & Hinton, G. (2005). Exponential family harmoniums with an application to information retrieval. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems 17 (pp. 1481–1488). Cambridge, MA: MIT Press. Xing, E., Yan, R., & Hauptmann, A. G. (2005). Mining associated text and images with dual-wing harmoniums. In F. Baccus, & T. Jaakkola (Eds.), Proceedings of the 21st conference on uncertainty in artificial intelligence (pp. 633–641). Edinburgh, Scotland: AUAI press.
Topics in Cognitive Science 3 (2011) 92–122 Copyright © 2010 Cognitive Science Society, Inc. All rights reserved. ISSN: 1756-8757 print / 1756-8765 online DOI: 10.1111/j.1756-8765.2010.01108.x
Comparing Methods for Single Paragraph Similarity Analysis

Benjamin Stone,a Simon Dennis,b Peter J. Kwantesc

aSchool of Psychology, The University of Adelaide
bDepartment of Psychology, Ohio State University
cDefence Research and Development Canada (Toronto)
Received 27 February 2009; received in revised form 6 July 2009; accepted 8 September 2009
Abstract The focus of this paper is two-fold. First, similarities generated from six semantic models were compared to human ratings of paragraph similarity on two datasets—23 World Entertainment News Network paragraphs and 50 ABC newswire paragraphs. Contrary to findings on smaller textual units such as word associations (Griffiths, Tenenbaum, & Steyvers, 2007), our results suggest that when single paragraphs are compared, simple nonreductive models (word overlap and vector space) can provide better similarity estimates than more complex models (LSA, Topic Model, SpNMF, and CSM). Second, various methods of corpus creation were explored to facilitate the semantic models’ similarity estimates. Removing numeric and single characters, and also truncating document length improved performance. Automated construction of smaller Wikipedia-based corpora proved to be very effective, even improving upon the performance of corpora that had been chosen for the domain. Model performance was further improved by augmenting corpora with dataset paragraphs. Keywords: Semantic models; Paragraph similarity; Corpus preprocessing; Corpus construction; Wikipedia corpora
1. Introduction
The rate at which man [sic] has been storing up useful knowledge about himself and the universe has been spiralling upwards for 10,000 years. —Toffler, 1973, p. 37

Correspondence should be sent to Simon Dennis, Department of Psychology, Ohio State University, 225 Psychology Building, 1835 Neil Avenue, Columbus, OH 43210. E-mail:
[email protected]
Nearly four decades later, Toffler’s remark is perhaps even more relevant in today’s internet-driven world. ‘‘Information overload’’ may be regarded as pervasive in many professions, and filtering strategies such as the summarization of text are commonplace. Government leaders and company executives make informed decisions based on briefs or short summaries of complex issues, provided by department managers who have in turn summarized longer reports written by their staff. In academia, the abstract is used to provide an overview of a paper’s contents, so that time-pressed researchers can filter and absorb information related to their fields of study. In many areas it is important to be able to accurately judge the similarity between two or more paragraphs of information. Sorting and extracting useful information from large collections of these types of summaries can prove both overwhelming and time consuming for humans. In an attempt to address this issue, semantic models have been successfully employed at these tasks. For example, latent semantic analysis (LSA; Landauer, McNamara, Dennis, & Kintsch, 2007) has been used to grade student essay scripts (Foltz, Laham, & Landauer, 1999). Similarly, the Topic Model has been used to extract scientific themes from abstracts contained in the Proceedings of the National Academy of Sciences (Griffiths & Steyvers, 2004). In a surveillance application, nonnegative matrix factorization has been applied to the large ENRON e-mail dataset to extract topics or themes (Berry & Browne, 2005). Other models such as the vector space model (henceforth called ‘‘Vectorspace’’) were originally designed to index (or order by relevance to a topic) large sets of documents (Salton, Wong, & Yang, 1975). Semantic models have also been shown to reflect human knowledge in a variety of ways. LSA measures correlate highly with humans’ scores on standard vocabulary and subject matter tests; mimic human word sorting and category judgments; simulate word-word and passage-word lexical priming data; and accurately estimate passage coherence (Landauer et al., 2007). The Topic Model has proven adept at predicting human data on tasks including the following: free association, vocabulary tests, lexical decision, sentence reading, and free recall (Griffiths, Tenenbaum, & Steyvers, 2007). Other models have been developed to reflect specific psychological processes. For example, the constructed semantics model (CSM) was developed as a global-matching model of semantic memory derived from the MINERVA 2 architecture of episodic memory (Kwantes, 2005). 1.1. Different types of textual language unit When making similarity comparisons on textual stimuli with semantic models, several researchers have highlighted the need to delineate textual stimuli into different language units (Foltz, 2007; Kireyev, 2008; Landauer & Dumais, 1997; McNamara, Cai, & Louwerse, 2007). Past research has modeled human comparisons of similarity on four types of textual language units: words, sentences, single paragraphs, and chapters or whole documents (Foltz, 2007). 1.1.1. Word comparisons Griffiths et al. (2007) found that the Topic Model outperformed LSA on several tasks, including word association and synonym identification. Griffiths and colleagues compared
performance by the Topic Model and LSA on a word association task using norms collected by Nelson, McEvoy, and Schreiber (1998). The study used 4,471 of these words that were also found in an abridged1 37,651-document (26,243 word, 4,235,314 token) version of the Touchstone Applied Science Associates (TASA) corpus. Moreover, the TASA corpus was used as a knowledge base for both the Topic Model and LSA. Two measures were employed to assess the models’ estimates of word association. The first measure assessed central tendency, focusing on the models’ ability to rank word targets for each word cue. The other measure assessed the proficiency of each model’s estimate of the most likely target response for each word cue. Griffiths and colleagues found that the Topic Model outperformed LSA on both of these performance measures. Furthermore, they reported that both models performed at levels better than chance and a simple word co-occurrence model. In another study, Griffiths et al. (2007) compared the Topic model and LSA on a subset of the synonym section taken from the Test of English as a Foreign Language (TOEFL). The TOEFL was developed in 1963 by the National Council on the Testing of English as a Foreign Language, and it is currently administered by the Educational Testing Service.2 The synonym portion of TOEFL offers four multiple choice options for each probe word, Griffiths and colleagues only included items in which all five words also appeared in the aforementioned abridged version of the TASA corpus. Similarity evaluations between the probes and possible synonyms revealed that the Topics model (70.5%) answered more of the 44 questions correctly than LSA (63.6%). Furthermore, the Topic Model (0.46) predictions captured more of the variance found in the human responses than LSA (0.3). The Topic Model is a generative model that assesses the probability that words will be assigned to a number of topics. One of the key benefits of this generative process is that it allows words to be assigned to more than one topic, thus accommodating the ambiguity associated with homographs (Griffiths et al., 2007). For example, using the Topic Model, the word ‘‘mint’’ may appear in a topic that contains the words ‘‘money’’ and ‘‘coins,’’ and in another topic containing the words ‘‘herb’’ and ‘‘plants.’’ Griffiths et al. (2007) argue that this attribute gives the Topic Model an advantage over models like LSA which represent meanings of words as individual points in undifferentiated Euclidean space (pp. 219–220). 1.1.2. Sentence comparisons McNamara et al. (2007) used several implementations of LSA to estimate the relatedness of sentences. The human judged similarity of these sentences decreased from paraphrases of target sentences, to sentences that were in the same passage as target sentences, to sentences that were selected from different passages to the target sentences. Likewise, comparing sentences using a standard implementation of LSA and the TASA corpus, these researchers found estimates of similarity were greatest for paraphrases, then same passage sentences, with different passage sentences judged least similar. When human estimates were correlated with the LSA estimates of sentences similarity, it was found that a version of LSA that emphasized frequent words in the LSA vectors best captured the human responses. Subsequently, using data collected in the McNamara et al. (2007) study, Kireyev (2008) found that LSA outperformed the Topic Model at this task.
1.1.3. Single paragraph comparisons
Lee et al. (2005) examined similarity judgments made by Adelaide University students on 50 paragraphs that were collected from the Australian Broadcasting Corporation's news mail service. These paragraphs ranged from 56 to 126 words in length, with a median length of 78.5 words. Lee and colleagues compared several models' estimates of similarity to the aforementioned human ratings. These models included word-based, n-gram, and several LSA models. Using a knowledge base of 364 documents also drawn from the ABC news mail service, LSA under a global entropy function was the best performing model,3 producing similarity ratings that correlated about 0.60 with human judgments in this study. LSA's result in this study was also consistent with the inter-rater correlation (approximately 0.605) calculated by these researchers. More recently, Gabrilovich and Markovitch (2007) produced a substantially higher correlation with the human similarity judgments recorded for the Lee paragraphs (0.72) using the model they developed, explicit semantic analysis (ESA). The ESA model uses Wikipedia as a knowledge base, treating Wikipedia documents as discrete human-generated concepts that are ranked in relation to their similarity to a target text using a centroid-based classifier. Kireyev (2008) used LSA and the Topic Model to estimate similarity of pairs of paragraphs taken from 3rd and 6th grade science textbooks. It was proposed that paragraphs that were adjacent should be more similar than nonadjacent paragraphs. Difference scores were calculated between adjacent and nonadjacent paragraphs for both grade levels, with higher scores indicating better model performance. While it was not stated whether one model significantly outperformed the other at this task, on average LSA (0.75) scored higher on the 3rd grade paragraphs than the Topic Model (0.49). However, there was little difference between the two models on the 6th grade paragraphs (LSA 0.33, Topic Model 0.34).
1.1.4. Chapters or whole document comparisons
Martin and Foltz (2004) compared whole transcripts of team discourse to predict team performance during simulated reconnaissance missions. Sixty-seven mission transcripts were used to create the researchers' corpus (UAV-Corpus). LSA was used to measure the similarity between transcripts of unknown missions and transcripts of missions where performance scores were known. To estimate the performance of a team based on their transcript using LSA, an average performance score was calculated from the 10 most similar transcripts found in the UAV-corpus. Performance scores estimated using LSA were found to correlate strongly (0.76) with the actual team performance scores. Kireyev (2008) compared the similarity estimates of LSA and the Topic Model using 46 Wikipedia documents. These documents were drawn from six different categories: sports, animals, countries, sciences, religion, and disease. While both models correctly found more similarity between within-category documents than across-category documents, Kireyev (2008) concluded that LSA performed this task consistently better than the Topic Model.
1.2. The dual focus of this paper This paper describes the outcome of a systematic comparison of single paragraph similarities generated by six statistical semantic models to similarities generated by human participants. Paragraph complexity and length can vary widely. Therefore, for the purposes of this research, we define a paragraph as a self-contained section of ‘‘news’’ media (such as a pre´cis), presented in approximately 50–200 words. There are two main themes that are explored in this paper. At one level it is an evaluation of the semantic models, in which their performance at estimating the similarity of single paragraph documents is compared against human judgments. As outlined above, past research has indicated that performance of some models is clearly better depending on which type of textual units were used as stimuli. For example, the Topic Model was shown to perform better than LSA in word association research, where the textual unit was at the single word level. However, inherent difficulties such as homographs that affect models like LSA at the word unit level, may be less problematic for assessments made on larger textual units (sentences, paragraphs, and chapters or whole documents). These larger textual units contain concurrently presented words that may be less ambiguous and are thus able to compensate for a model’s inability to accommodate homographic words (Choueka & Lusignan, 1985; Landauer & Dumais, 1997). Research has indicated that LSA performs well at the paragraph level (Lee et al., 2005), but there are other models that may perform equally well if not better than LSA at this task. Therefore, in this research we compare six models’ efficiency at the task of modeling human similarity judgments of single paragraph stimuli. The models examined were word overlap, the Vectorspace model (Salton et al., 1975), LSA (Landauer, McNamara, Dennis, & Kintsch, 2007), the Topic Model (Blei, Ng, & Jordan, 2003; Griffiths & Steyvers, 2002), sparse nonnegative matrix factorization (SpNMF; Xu, Liu, & Gong, 2003), and the CSM (Kwantes, 2005). Our evaluation of these models is tempered by factors such a model compilation speed, consistency of performance in relation to human judgments of document similarity, and intrinsic benefits such as producing interpretable dimensions. At another level this paper explores the characteristics of the corpora or knowledge bases utilized by these models that may improve models’ performance when approximating human similarity judgments. With the exception of the word overlap model, a good background knowledge base is essential to the models’ performance. Past research has identified various aspects of corpus construction that affect the performance of the pointwise mutual information co-occurrence model on word-based tasks such as the TOEFL synonym test (Bullinaria & Levy, 2006). These factors included the size and shape of the context window, the number of vectors included in the word space, corpus size, and corpus quality. To address this issue, we have evaluated aspects of corpus composition, preprocessing, and document length in an attempt to produce suitable background corpora for the semantic models. To this end, four studies are described in this paper that examine the semantic models’ performance relative to human ratings of paragraph similarity. In the first study, semantic models use domain-chosen corpora to generate knowledge spaces on which they make
evaluations of similarity for two datasets of paragraphs. Overall, the models performed poorly using these domain-chosen corpora when estimates were compared to those made by human assessors. In the second study, improvements in the models’ performance were achieved by more thoroughly preprocessing the domain-chosen corpora to remove all instances of numeric and single alphabetical characters. In the third study, smaller targeted corpora (subcorpora) constructed by querying a larger set of documents (Wikipedia4) were examined to assess whether they could produce sufficient performance to be generally useful (Zelikovitz & Kogan, 2006). In many applications the hand construction of corpora for a particular domain is not feasible, and so the ability to show a good match between human similarity evaluations and semantic models’ evaluations of paragraph similarity using automated methods of corpus construction is a desirable outcome. Furthermore, document length of the essay-like Wikipedia articles was manipulated to produce better approximations of human judgment by the semantic models. Finally, in the fourth study, several of the models were found to produce better estimates of paragraph similarity when the dataset paragraphs were included in the models’ backgrounding corpus.
2. Semantic models, human datasets, and domain-chosen corpora

2.1. Semantic models

The semantic models examined were word overlap, the Vectorspace model (Salton et al., 1975), LSA (Landauer, McNamara, Dennis, & Kintsch, 2007), the Topic Model (Blei, Ng, & Jordan, 2003; Griffiths & Steyvers, 2002), SpNMF (Xu, Liu, & Gong, 2003), and the CSM (Kwantes, 2005).

Word Overlap: Simple word overlap was used as a baseline in this research. It is the only model that does not use a corpus or knowledge base. Instead, it is a word co-occurrence model. Term frequencies are calculated for each paragraph in the dataset, and similarities are then measured as cosines (see Eq. 1) of the resulting paragraph vectors.

$$\cos\theta = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\lVert\mathbf{v}_1\rVert\,\lVert\mathbf{v}_2\rVert} \qquad (1)$$
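To make this computation concrete, the following minimal sketch (in Python; the function names are ours, not part of the original implementation) computes Eq. 1 over raw term counts, which is all the word overlap baseline requires.

```python
from collections import Counter
import math

def cosine(v1, v2):
    """Cosine of two sparse term-frequency vectors (Eq. 1)."""
    shared = set(v1) & set(v2)
    dot = sum(v1[t] * v2[t] for t in shared)
    norm1 = math.sqrt(sum(f * f for f in v1.values()))
    norm2 = math.sqrt(sum(f * f for f in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def word_overlap_similarity(paragraph_a, paragraph_b):
    """Word overlap baseline: no corpus, just surface term counts."""
    return cosine(Counter(paragraph_a.lower().split()),
                  Counter(paragraph_b.lower().split()))
```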
The Vectorspace model (Salton, Wong, & Yang, 1975): The Vectorspace model assumes that terms can be represented by the set of documents in which they appear. Two terms will be similar to the extent that their document sets overlap. To construct a representation of a document, the vectors corresponding to the unique terms are multiplied by the log of their frequency within the document, and divided by their entropy across documents, and then added. Using the log of the term frequency ensures that words that occur more often in the document have higher weight, but that document vectors are not dominated by words that appear very frequently. Dividing by the entropy or inverse document frequency reduces the impact of high-frequency words that appear in many documents in a corpus. Similarities are measured as the cosines between the resultant vectors for two documents.
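The weighting scheme just described can be sketched as follows. This is an illustrative reading of the text above rather than the authors' code; the exact normalisation (for example, the use of log(1 + tf) and the entropy definition) is our assumption.

```python
import numpy as np

def entropy_weights(tdm):
    """Per-term entropy across documents, computed from a term-by-document
    count matrix (rows are terms). Exact normalisation is an assumption."""
    probs = tdm / np.maximum(tdm.sum(axis=1, keepdims=True), 1e-12)
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = -np.sum(np.where(probs > 0, probs * np.log(probs), 0.0), axis=1)
    return np.maximum(ent, 1e-12)  # avoid division by zero for rare terms

def document_vector(doc_term_counts, tdm, vocab_index):
    """Vectorspace document vector: sum of the term vectors (rows of tdm),
    each scaled by the log of its in-document frequency and divided by its
    entropy across the corpus."""
    ent = entropy_weights(tdm)
    vec = np.zeros(tdm.shape[1])
    for term, freq in doc_term_counts.items():
        i = vocab_index.get(term)
        if i is not None:
            vec += tdm[i] * np.log(1 + freq) / ent[i]
    return vec
```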
Latent semantic analysis (Landauer, McNamara, Dennis, & Kintsch, 2007): LSA starts with the same representation as the Vectorspace model—a term by document matrix with log entropy weighting.5 In order to reduce the contribution of noise to similarity ratings, however, the raw matrix is subjected to singular value decomposition (SVD). The SVD decomposes the original matrix into a term by factor matrix, a diagonal matrix of singular values, and a factor by document matrix. Typically, only a small number of factors (e.g., 300) are retained. To derive a vector representation of a novel document, term vectors are weighted, multiplied by the square root of the singular value vector, and then added. As with the Vectorspace model, the cosine is used to determine similarity.

The Topic Model (Topics; Blei, Ng, & Jordan, 2003; Griffiths & Steyvers, 2002): The Topic Model is a Bayesian approach to document similarity that assumes a generative model in which a document is represented as a multinomial distribution of latent topics, and topics are represented as multinomial distributions of words. In both cases, Dirichlet priors are assumed. The parameters of these models can be inferred from a corpus using either Markov Chain Monte Carlo techniques (MCMC; Griffiths & Steyvers, 2002) or variational Expectation Maximization (Blei, Ng, & Jordan, 2003). We implemented the former. Ideally, document representations should then be calculated by running the MCMC sampler over a corpus augmented with information from the new document. To do this on a document-by-document basis is impractical. In the first instance, we choose to run the sampler over the corpus and then average the word distributions to calculate topic distributions for novel documents. Later in the paper, we investigate the impact of this decision by running the sampler over an augmented corpus containing all of the dataset paragraphs. To calculate the similarity of the topic distributions representing documents, we employed both the Dot Product (see Eq. 2) and the Jensen-Shannon Divergence (JSD, see Eq. 3). While the Dot Product was employed for convenience, the JSD is a symmetric form of the Kullback-Leibler Divergence (D), which derives from information theory and provides a well-motivated way of comparing probability distributions.

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i \qquad (2)$$

$$\mathrm{JSD}(P \parallel Q) = \tfrac{1}{2}\, D(P \parallel M) + \tfrac{1}{2}\, D(Q \parallel M), \quad \text{where } M = \tfrac{1}{2}(P + Q) \text{ and } D(P \parallel Q) = \sum_i P(i)\log\frac{P(i)}{Q(i)} \qquad (3)$$
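A minimal sketch of the two similarity measures of Eqs. 2 and 3 is given below, assuming topic distributions are held as numpy arrays that sum to one (lower JSD indicates greater similarity). The function names are ours.

```python
import numpy as np

def dot_product(p, q):
    """Eq. 2: dot product of two topic distributions."""
    return float(np.dot(p, q))

def kl_divergence(p, q):
    """D(P || Q), summed over topics where P(i) > 0.
    Assumes Q(i) > 0 wherever P(i) > 0 (true for the mixture M below)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jensen_shannon(p, q):
    """Eq. 3: symmetric Jensen-Shannon divergence between topic distributions."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```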
Sparse nonnegative matrix factorization (Xu, Liu, & Gong, 2003): Nonnegative matrix factorization is a technique similar to LSA, which in this context creates a matrix factorization of the weighted term by document matrix. This factorization involves just two matrices—a term by factor matrix and a factor by document matrix—and is constrained to contain only nonnegative values. While nonnegative matrix factorization has been shown to create
meaningful word representations using small document sets; in order to make it possible to apply it to large collections, we implemented the sparse tensor method proposed by Shashua and Hazan (2005). As in LSA, log entropy weighted term vectors were added to generate novel document vectors, and the cosine was used as a measure of similarity.

The CSM (Kwantes, 2005): The final model considered was the CSM (Kwantes, 2005). CSM was developed as a global-matching model of semantic memory derived from the MINERVA 2 architecture of episodic memory. Therefore, CSM is unique in that it was created primarily as a cognitive model to explain the emergence of semantics from experience. To this end, CSM uses a retrieval operation on the contexts in which words occur to generate semantic representations. It operates by taking the term by document matrix (using just log weighting) and multiplying it by its transpose. Consequently, terms do not have to appear together in order to be similar, as is the case in the Vectorspace model. Again, term vectors are added to create novel document vectors and the cosine is used as a measure of similarity.

2.2. The datasets

Two datasets of human ratings of paragraph similarity were used in this study. The first, which we will refer to as the WENN dataset, was composed of similarity ratings generated by subjects comparing celebrity gossip paragraphs taken from the World Entertainment News Network (WENN). The second dataset, which we will refer to as the Lee dataset, was archival data collected by Lee et al. (2005).

2.2.1. The WENN dataset

Seventeen participants provided the paragraph similarity ratings that form the WENN dataset; they were students recruited by advertising the experiment on a local university campus, along with employees of Defence Research and Development Canada—Toronto (DRDC). Participants were paid CA$16.69 for taking part in the study. Participants compared twenty-three6 single paragraphs selected from the archives of WENN made available through the Internet Movie Database7 (see Appendix A in the Supporting Information). Paragraphs were not chosen randomly. First, each paragraph was chosen to be approximately 100 words long. The median number of words contained in paragraphs in the WENN dataset was 126, with paragraph lengths ranging from 79 to 205 words. Paragraphs were also chosen to ensure that at least some of the paragraphs possessed topical overlap. For example, there was more than one paragraph about health issues, drug problems, stalkers, and divorce among those represented in the stimuli.

Participants were shown pairs of paragraphs, side by side, on a personal computer monitor. Pairs were presented one at a time. For each pair, participants were asked to rate, on a scale of 0 to 100, how similar they felt the paragraphs were to each other. Participants were not given any instructions as to the strategy they should use to make their judgments. Once a similarity judgment had been made, the next pair was presented. Each participant rated the similarity of every possible pairing of different paragraphs, for a total of 253 judgments. Pearson correlations were calculated between participants' pairwise comparisons of the
paragraphs in the WENN dataset; the average of these correlation coefficients (0.466) indicates that there was only moderate inter-rater reliability for the WENN dataset.

2.2.2. The Lee dataset

Lee et al. (2005) recorded observations of paragraph similarity made by 83 Adelaide University students to form the Lee dataset. The dataset consists of 10 independent ratings of the similarity of every pair of 50 paragraphs selected from the Australian Broadcasting Corporation's news mail service (see Appendix B in the Supporting Information), which provides text e-mails of headline stories. The 50 paragraphs in the Lee dataset range in length from 56 to 126 words, with a median of 78.5 words. Pairs of paragraphs were presented to participants on a computerized display. The paragraphs in the Lee dataset focused on Australian and international ‘‘current affairs,’’ covering topics such as politics, business, and social issues. Human ratings were made on a 1 (least similar) to 5 (most similar) scale. As mentioned above, Lee et al. (2005) calculated an inter-rater reliability of 0.605.

2.3. Domain-chosen corpora: WENN (2000–2006) and Toronto Star (2005)

Two corpora were chosen to act as knowledge bases for the semantic models to allow similarity estimates to be made on the paragraphs contained in the WENN and Lee datasets. The larger set of 12,787 documents collected from WENN between April 2000 and January 2006 was considered a relevant backgrounding corpus for the 23 paragraphs contained in the WENN dataset; this larger set of documents is henceforth called the WENN corpus. It was not possible to source the original set of 364 headlines and précis gathered by Lee et al. (2005) from the ABC online news mail service. Therefore, in an attempt to provide a news media-based corpus that was similar in style to the original corpus of ABC documents used by Lee and colleagues, articles from Canada's Toronto Star newspaper were used. The Toronto Star corpus comprised 55,021 current affairs articles published during 2005.

Initially, both corpora were preprocessed using standard methods: characters were converted to lower case, numbers were zeroed (i.e., 31 Jan 2007 became 00 jan 0000), punctuation and words from a standard stop-list (see Appendix C in the Supporting Information) were removed, and words that appear only once in a corpus were also removed. Descriptive statistics for both the WENN corpus and the Toronto Star corpus are displayed in Appendix D (see the Supporting Information).
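The standard preprocessing just described can be sketched as a simple token pipeline. This is an illustration of the steps named above, not the authors' code; the stop-word list stands in for the list given in Appendix C.

```python
import re
import string
from collections import Counter

def preprocess_corpus(documents, stop_words):
    """Standard preprocessing (sketch): lower-case, zero digits, strip
    punctuation, drop stop-list words, then drop words that occur only
    once in the whole corpus."""
    cleaned = []
    for doc in documents:
        text = doc.lower()
        text = re.sub(r"\d", "0", text)  # e.g. "31 Jan 2007" -> "00 jan 0000"
        text = text.translate(str.maketrans("", "", string.punctuation))
        tokens = [t for t in text.split() if t not in stop_words]
        cleaned.append(tokens)
    counts = Counter(t for doc in cleaned for t in doc)
    return [[t for t in doc if counts[t] > 1] for doc in cleaned]
```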
3. Study One: Comparison of models on domain-chosen corpora

Comparisons made between all semantic models and human evaluations of paragraph similarity for both datasets are presented in the following two subsections of this paper. For the more complex models (LSA, Topics, and SpNMF) one must select a number of dimensions in which to calculate similarities. Performance is likely to be influenced by this choice; therefore, in each case comparisons were made using 50, 100, and 300 dimensional models.
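As a schematic illustration of this dimensionality sweep, the sketch below correlates LSA cosines with human ratings at each of the three dimensionalities. It uses scikit-learn and scipy as stand-ins, not the authors' pipeline (the paper folds novel documents in by weighting term vectors and rescaling by the square roots of the singular values, which differs slightly from TruncatedSVD.transform), and all argument names are ours.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def lsa_human_correlation(weighted_tdm, paragraph_vectors, human_ratings, pairs, k):
    """Correlate LSA cosines with human ratings at k dimensions (sketch).

    weighted_tdm: documents x terms matrix for the background corpus
    paragraph_vectors: dataset paragraphs in the same term space
    pairs: list of (i, j) paragraph indices matching human_ratings
    """
    svd = TruncatedSVD(n_components=k).fit(weighted_tdm)
    reduced = svd.transform(paragraph_vectors)   # fold paragraphs into the space
    sims = cosine_similarity(reduced)
    model_scores = [sims[i, j] for i, j in pairs]
    r, _ = pearsonr(model_scores, human_ratings)
    return r

# e.g. sweep the three dimensionalities used in Study One
# for k in (50, 100, 300):
#     print(k, lsa_human_correlation(corpus_matrix, paragraph_matrix, ratings, pairs, k))
```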
3.1. WENN dataset and WENN corpus

Using the WENN corpus, correlations between similarity ratings made by humans and the models on paragraphs in the WENN dataset were low (see Fig. 1) for all models except the simple word overlap (0.43). Of the other models, CSM (0.26) and LSA at 50 dimensions (0.21) performed best. Using the Jensen-Shannon metric improved the performance of the Topic Model in all cases when compared to the dot product measure of similarity. It could be argued that both Vectorspace (r = .17, t(250) = 1.61, n.s.)8 and LSA at 50 dimensions (r = .21, t(250) = 1.05, n.s.) performed as well as the CSM on this document set. For LSA, the Topic Model, and SpNMF, increasing the dimensionality or number of topics did not significantly increase or decrease model performance at this task (see Appendix Table E1 in the Supporting Information).

3.2. Lee dataset and Toronto Star corpus
Again, except for the word overlap (0.48), the correlations between similarity ratings made by human participants and the models on the paragraphs in the Lee dataset were very low (see Fig. 2). CSM and SpNMF (300 dimensions) were the next best performing models,
Fig. 1. Correlations (r) between the similarity ratings made on paragraphs in the WENN dataset by human raters and those made by word overlap, LSA, Topics, Topics-JS (with Jensen-Shannon), SpNMF, Vectorspace, and CSM. All models, except word overlap, used the WENN corpus. The effects of dimensionality reduction are displayed at 50, 100, and 300 dimensions for the more complex models that incorporate this reductive process. Error bars are the 95% confidence limits of the correlation. Correlations exclude Same–Same paragraph comparisons.
Fig. 2. Correlations (r) between the similarity ratings made on paragraphs in the Lee dataset by human raters and those made by word overlap, LSA, Topics, Topics-JS (with Jensen-Shannon), SpNMF, Vectorspace, and CSM. All models, except word overlap, used the Toronto Star corpus. The effects of dimensionality reduction are displayed at 50, 100, and 300 dimensions for the more complex models that incorporate this reductive process. Error bars are the 95% confidence limits of the correlation. Correlations exclude Same–Same paragraph comparisons.
correlating 0.15 and 0.14 with human judgments, respectively. In addition, Vectorspace had higher correlations than both LSA and the Topic Model using the dot product similarity measure. In 9 of the 12 possible comparisons, increased dimensionality produced significantly better estimates of paragraph similarity by models when compared to human ratings (see Appendix Table E2 in the Supporting Information).

3.3. Summary of Study One

Overall, the simple word overlap model outperformed the more complex semantic models when paragraph similarities were compared to human judgments made on both the WENN and Lee datasets. On the Lee dataset, semantic models generally performed better when semantic spaces were compiled with higher dimensionality. However, when model dimensionality was increased on the WENN dataset, a similar increase in performance was not found. The generally poor results for the more complex models could be the product of at least one of the following circumstances:

1. The models are unable to generate similarity calculations that are comparable with human judgments.
2. The preprocessing of corpora may have been inadequate, to the extent that noise remained in the corpora and prevented the semantic models from making reasonable estimates of paragraph similarity.
3. The corpora did not represent the knowledge required to make similarity estimates on the paragraphs contained in the WENN and Lee document sets.

Other studies have reported more encouraging results when comparing estimates of paragraph similarity generated by semantic models and humans (Gabrilovich & Markovitch, 2007; Lee et al., 2005). Therefore, the first possible conclusion is likely to be inaccurate: semantic models can make reasonable estimates of the similarity of paragraphs when compared to human judgments. While this was not the case in this study, poor performance by the semantic models may have been driven by a suboptimal match between the background corpus and the paragraphs being tested. The likelihood of this scenario is supported by the generally low correlations with human results obtained by all of the models that required a background corpus.

The following three studies explore the latter two possibilities. In Study Two, a more stringent corpus preprocessing method is used to improve on the results presented in Study One. In Study Three, Wikipedia is used to generate better backgrounding corpora, and this method again improves model estimates of paragraph similarity when compared to the human judgments. Then, in Study Four, paragraphs from the datasets are added to the models' knowledge base to again improve model performance at this task.
4. Study Two: Corpus preprocessing

Generally, corpus preprocessing identifies words that are likely to be informative to the semantic model. In the field of information retrieval, many types of sophisticated term selection functions have been employed by researchers (Sebastiani, 2002, p. 15). Other methods, such as employing a stop-list, are less complex, requiring no mathematical calculation, and simply remove words from the corpus which are deemed uninformative by the researcher. Stop-lists are usually applied to remove words such as articles, pronouns, and conjunctions (Moed, Glänzel, & Schmoch, 2004). Bullinaria and Levy (2006) found that stop-lists reduced model performance when the textual unit under comparison is at a word-word level (such as the TOEFL task described above). However, working with paragraph comparisons, Pincombe (2004) states that ‘‘[u]se of a stop word list almost always improved performance’’ when comparing models' estimates of similarity and human judgments (p. 1).

A closer inspection of the stop-list (Appendix F in the Supporting Information) and preprocessing techniques (p. 14) used by Pincombe (2004) was conducted.9 This review revealed that single letters had been removed by the author and only alphabetical characters had been used in his corpora. The difference between the preprocessing used in Study One (allowing the inclusion of zeroed numbers and single characters) and that used in Pincombe's research raises the question: Can the removal of single letters and numbers from the background corpus improve a semantic model's ability to estimate paragraph similarity?
It is possible that the presence of these types of information (numbers and single letters) in a corpus can create noise for the models. For example, the American Declaration of Independence in 1776 has little to do with Elvis Presley's birthday in 1935. Yet under the preprocessing method of zeroing numbers, models comparing texts that describe these two occasions would erroneously find some similarity between them. Moreover, the zeroing of the aforementioned dates could also suggest commonality with a document describing the distance between two cities, creating noise in the corpus even if this new document described a 1,000 mile drive between Philadelphia (Pennsylvania) and Tupelo (Mississippi). Similarly, the ‘‘Js’’ in ‘‘J F K’’ and ‘‘J K Rowling’’ should not indicate semantic similarity between documents that make reference to these well-known individuals. Therefore, the removal of these items may benefit a model's ability to perform similarity ratings between paragraphs.

4.1. Removing numbers and single letters
All numbers and single letters were removed from both the WENN and Toronto Star corpora10 to test the hypothesis that removing these characters would improve the semantic models’ performance when similarity ratings were compared to human judgments. Figs. 3 and 4 display comparisons between the results generated in Study One (ALL) and the results
Fig. 3. Correlations between similarity estimates made by humans and models on paragraphs in the WENN dataset. Models that employ a knowledge base used the WENN corpus. ‘‘ALL’’ depicts standard corpus preprocessing used in Study One; ‘‘NN-NSL’’ corpora have also had numbers and single letters removed. Error bars are the 95% confidence limits of the correlation. Correlations exclude Same–Same paragraph comparisons.
for spaces compiled on corpora without numbers and single letters (NN-NSL: No Numbers, No Single Letters). Only the results for models compiled at 300 dimensions (where dimensionality is a parameter of the model) are displayed in these figures. It should be noted that while the models compiled at 300 dimensions generally produced the best results,11 models compiled at both 50 and 100 dimensions displayed an identical trend (see Appendix Table G1 in the Supporting Information) of better performance when using the more stringent preprocessing method.

Although it may seem counterintuitive to remove information from a knowledge base or corpus, the removal of numbers and single letters improved correlations between human judgments and similarity ratings produced by the models in nearly all comparisons that were made for both the WENN (see Fig. 3) and Lee (see Fig. 4) datasets. The only model that did not improve in performance was CSM on the WENN dataset. This difference for CSM between the ALL (0.26) and NN-NSL (0.16) corpora was significant (t(250) = −2.48, p < .05). A more promising trend was displayed by the other models, especially on the WENN dataset, with the LSA (0.48) and SpNMF (0.43) models performing best of the more complex semantic models. However, this trend was also displayed by the simple word overlap model, which continued to clearly outperform the other models. When numbers and single letters were removed from the paragraphs used by the overlap model, correlations between this model and the human judgments improved to 0.62 on the WENN dataset and 0.53 on the Lee dataset. In 4 of the 12 comparisons on the WENN dataset, and 5 of the 12 comparisons
Fig. 4. Correlations between similarity estimates made by humans and models on paragraphs in the Lee dataset. Models that employ a knowledge base used the Toronto Star corpus. ‘‘ALL’’ depicts standard corpus preprocessing used in Study One; ‘‘NN-NSL’’ corpora have also had numbers and single letters removed. Error bars are the 95% confidence limits of the correlation. Correlations exclude Same–Same paragraph comparisons.
on the Lee dataset, increased dimensionality led to significant improvements in the models' performance (see Appendix Tables G1 and G2 in the Supporting Information).

Notwithstanding this general improvement in the more complex semantic models' performance, correlations with human judgments of similarity were still low using the Toronto Star (NN-NSL) corpus on the Lee dataset, with the highest being the Vectorspace model (0.2). This suggests that while corpus preprocessing was hindering the models' ability to provide reasonable estimates of paragraph similarity, other factors were also impeding the models' performance. The information and themes contained within corpora clearly constrain the performance of semantic models. However, suitable knowledge bases are not always easy to obtain. In an attempt to address this issue, the third study examines an alternative method of generating corpora that draws sets of knowledge-domain related documents (subcorpora) from the online encyclopedia Wikipedia.
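For concreteness, the stricter NN-NSL step used in this study reduces to a simple token filter applied after the standard preprocessing. This is a minimal sketch, not the authors' code.

```python
def strip_numbers_and_single_letters(tokens):
    """NN-NSL preprocessing (sketch): drop any token that is a single
    character or that contains a digit (including the zeroed dates produced
    by the earlier preprocessing step)."""
    return [t for t in tokens
            if len(t) > 1 and not any(ch.isdigit() for ch in t)]

# e.g. ["elvis", "presley", "born", "0000", "j", "f", "k"]
#   -> ["elvis", "presley", "born"]
```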
5. Study Three: A better knowledge base?

Smaller, more topic-focused subcorpora may provide context for polysemous words that might otherwise take on several meanings in a larger corpus. To this end, Wikipedia was utilized as a generic set of documents from which smaller targeted subcorpora could be sampled and compiled. Wikipedia is maintained by the general public, and it has become the largest and most frequently revised or updated encyclopedia in the world. Critics have questioned the accuracy of the articles contained in Wikipedia, but research conducted by Giles (2005) did not find significant differences in the accuracy of science-based articles contained in Wikipedia when they were compared to similar articles contained in the Encyclopedia Britannica. Furthermore, the entire collection of Wikipedia articles is available to the general public and can be freely downloaded.12

All Wikipedia entries current to March 2007 were downloaded for this research. In total, 2.8 million Wikipedia entries were collected; however, the total number of documents was reduced to 1.57 million after the removal of incomplete articles contained in the original corpus. Incomplete articles were identified and removed if they contained phrases like ‘‘help wikipedia expanding’’ or ‘‘incomplete stub.’’ The resulting Wikipedia corpus was further preprocessed in the same manner as the NN-NSL corpora in Study Two: removing stop-words, punctuation, words that only appeared once in the corpus, and finally removing all numbers and single letters.

To enable the creation of subcorpora, Lucene13 (a high-performance text search engine) was used to index each document in the Wikipedia corpus. Lucene allows the user to retrieve documents based on customized queries. Like the search results provided by Google, the documents returned by Lucene are ordered by relevance to a query. Targeted queries were created for each paragraph rated by humans in the WENN dataset. This WENN-based query was constructed by removing stop-words and punctuation from the title14 that accompanied each paragraph, and then joining the remaining words with ‘‘OR’’ statements (see Appendix H in the Supporting Information). In contrast, the query devised for the paragraphs in the Lee dataset was more complex. For the Lee-based query, the researcher chose several descriptive keywords15 for each paragraph in the Lee dataset,
and used ‘‘AND’’ and ‘‘OR’’ operators to combine these keywords. Moreover, the Lee-based query used Lucene's ‘‘star’’ wild-card operator to return multiple results from word stems. For example, the stem and wild-card combination ‘‘research*’’ would match documents containing the words ‘‘research,’’ ‘‘researcher,’’ and ‘‘researchers’’ (see Appendix I in the Supporting Information).

5.1. Wikipedia subcorpora

Four subcorpora were created using the Lucene queries (described above) on the Wikipedia document set. For each dataset (WENN & Lee), a 1,000-document and a 10,000-document subcorpus was generated. The structure of the Wikipedia articles contained in these subcorpora was substantially different from the documents contained in either the WENN or Toronto Star corpora (see Table D1 in the Supporting Information). Wikipedia articles tend to be longer in format, with documents that approximate the length of a short essay (on average 1,813–2,698 words per document). In contrast, the documents contained in both the WENN and Toronto Star corpora are similar in length to a journal article's abstract (on average 74 to 255 words per document). In addition to the Wikipedia documents being generally much longer than the WENN or Toronto Star documents, the Wikipedia documents also contain on average many more unique words.

The greater size and complexity of the Wikipedia documents may produce noise for the semantic models. However, Lee and Corlett's (2003) findings indicate that decisions about a document's content can be made using only the beginning of a document's text. In their study of Reuters' documents, words found in the first 10% of a document's text were judged to hold greater ‘‘mean absolute evidence’’ characterizing a document's content. Lee and Corlett calculated the ‘‘evidence’’ value of a word given a particular topic. This calculation was made by comparing how often a word appeared in documents related to a topic, relative to the word's frequency in documents that were not topic-related. Their finding may reflect a generally good writing style found in Reuters' documents, where articles may begin with a précis or summary of the information that is contained in the rest of the document. Documents in a Web-based medium such as Wikipedia may also conform to this generalization. Intuitively, it seems likely that important descriptive information displayed on a Web page will be positioned nearer the top of the page (probably within the first 300 words), so as not to be overlooked by the reader as the Web page scrolls or extends beneath the screen.16

To explore the possible effect of document length (number of words) on semantic models, corpora were constructed that contained the first 100, 200, 300, and all words from the Wikipedia subcorpora. To illustrate this point, if the preceding paragraph were considered a document, in the first 100 word condition it would be truncated at ‘‘…by comparing how often a word appeared in.’’ Furthermore, to test whether corpus size influenced the similarity estimates generated by the semantic models, performance was compared on the 1,000- and 10,000-document subcorpora for both datasets. This yielded a 2 × 4 design (number of documents in a corpus by number of words in each document) for each dataset. Each subcorpus was compiled using LSA at 300 dimensions. LSA was chosen for its quick compilation speeds and because of the generally good match that has been reported between
LSA and human performance on tasks comparing paragraph similarity (Landauer & Dumais, 1997; Lee et al., 2005). Moreover, in general LSA was one of the best performing models that incorporates a knowledge base in the previous studies presented in this paper.17 This choice of dimensionality is supported by the findings of the first two studies in this paper, where increased dimensionality improved performance.
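The subcorpus construction just described can be sketched as follows. The query building and truncation steps mirror the description above; `search_wikipedia` is a placeholder for the Lucene retrieval step (note 13 mentions PyLucene), which we do not reproduce, and the exact assembly of the 1,000- and 10,000-document subcorpora across queries is our assumption.

```python
def build_title_query(title, stop_words):
    """WENN-style query (sketch): strip stop-words and punctuation from the
    paragraph title and join the remaining words with OR."""
    words = [w.strip(".,!?:;\"'").lower() for w in title.split()]
    return " OR ".join(w for w in words if w and w not in stop_words)

def truncate_document(text, max_words=100):
    """Keep only the first max_words words of a retrieved article."""
    return " ".join(text.split()[:max_words])

def build_subcorpus(titles, stop_words, search_wikipedia, n_documents=1000):
    """Assemble a targeted subcorpus (sketch). `search_wikipedia` must return
    relevance-ranked article texts for a query string."""
    docs = []
    for title in titles:
        docs.extend(search_wikipedia(build_title_query(title, stop_words)))
    return [truncate_document(d) for d in docs[:n_documents]]
```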
5.1.1. Document length

In general, LSA's performance was better as document length was shortened, with the best results produced by truncating documents' length at 100 words. On both datasets, LSA produced the highest correlations with the human similarity judgments using the 1,000-document subcorpora truncated at 100 words (see Figs. 5 and 6). This configuration produced a result (0.51) that was significantly higher than all other document number and document length combinations for the Lee dataset. On the WENN dataset, the correlation for the 1,000-document corpora with documents truncated at 100 words was higher than all other cases; however, this result was not significantly higher in several cases. On both datasets, truncating documents at 100 words produced significantly higher correlations than the ALL word conditions (where document length was not truncated). These results show that
Fig. 5. Correlations between human judgments of paragraph similarity on the WENN dataset with estimates made using LSA (at 300 dimensions) and the WENN Wikipedia-based corpora containing 1,000 and 10,000 documents retrieved using Lucene with the WENN-based query. Wikipedia documents have been truncated in four ways: first 100, 200, 300, and ALL words. Error bars are the 95% confidence limits of the correlation. Correlations exclude Same–Same paragraph comparisons.
Fig. 6. Correlations between human judgments of paragraph similarity on the Lee dataset with estimates made using LSA (at 300 dimensions) and the Lee Wikipedia-based corpora containing 1,000 and 10,000 documents retrieved using Lucene with the Lee-based query. Wikipedia documents have been truncated in four ways: first 100, 200, 300, and ALL words. Error bars are the 95% confidence limits of the correlation. Correlations exclude Same–Same paragraph comparisons.
improvements to model performance can be achieved by truncating documents to 100 words, and this improvement supports the earlier findings of Lee and Corlett (2003).

5.1.2. Number of documents

LSA's performance on both datasets was best using the smaller 1,000-document subcorpora. On the Lee dataset, when documents are truncated at 100 words, the performance of LSA is better using the 1,000-document subcorpora than the 10,000-document subcorpora (t(1222) = 4.44, p < .05). On the WENN dataset, when documents are truncated at 100 words, performance was also better for the 1,000-document subcorpora, although this difference failed to reach significance (t(250) = 1.63, n.s.).

5.2. All models compared on Wikipedia subcorpora

The results presented in Study Two of this paper for models using the WENN (NN-NSL) and Toronto Star (NN-NSL) corpora have also been included in the findings presented in Figs. 7 and 8 as points of comparison to judge the effectiveness of creating the 1,000- and 10,000-document subcorpora from Wikipedia.

When the results for both the WENN and Lee datasets are taken into consideration, again none of the more complex semantic models performed significantly better than the simple
Fig. 7. Correlations between human judgments of paragraph similarity on the WENN dataset with semantic model estimates made using the 1,000- and 10,000-document Wikipedia corpora and the WENN Corpus (NN-NSL). Error bars are the 95% confidence limits of the correlation. These results are also presented in Appendix Table J1 in the Supporting Information. Correlations exclude Same–Same paragraph comparisons.
word overlap model. While the best-performing model on the Lee dataset was Vectorspace (0.56) using the Wikipedia 10,000-document corpus, this was not significantly different (t(1222) = 1.31, n.s.) from the word overlap model's correlation (0.53) with human judgments. As is displayed in Figs. 7 and 8, of the corpus-based models Vectorspace, LSA, and SpNMF performed the best on both datasets.

It is unclear whether using the Jensen-Shannon metric as opposed to the dot product measure with the Topic Model produced better results. On the Lee dataset, the Topic Model with the dot product (0.48) using the 1,000-document Wikipedia corpus significantly outperformed the Topic Model with the Jensen-Shannon metric (0.42) using the 10,000-document Wikipedia corpus (t(1222) = −2.08, p < .05). However, using the WENN (NN-NSL) corpus, there was not a significant difference between the two Topic Model similarity measures (t(250) = 0.53, n.s.) on the WENN dataset.

Latent semantic analysis performed well using both the WENN (NN-NSL) and Wikipedia-based Lee corpora. Given that LSA is built on Vectorspace, it is encouraging to see that in the case of the WENN dataset dimensionality reduction improved LSA's performance (0.48) when compared to Vectorspace (0.41). However, this improvement was not found consistently, as indicated by the higher correlation with human judgments on the Lee dataset achieved by Vectorspace using either the 1,000- or 10,000-document Wikipedia-based corpora (see Fig. 8).

Using the WENN (NN-NSL) corpus as a knowledge base allowed the semantic models to produce better estimates of human similarity judgments than could be obtained using either
Fig. 8. Correlations between human judgments of paragraph similarity on the Lee dataset with semantic model estimates made using the 1,000- and 10,000-document Wikipedia corpora and the Toronto Star (NN-NSL) corpus. Error bars are the 95% confidence limits of the correlation. These results are also presented in Table J2 in the Supporting Information. Correlations exclude Same–Same paragraph comparisons.
1,000- or 10,000-document Wikipedia-based corpora on the WENN dataset. In contrast, corpora retrieved from Wikipedia allowed the models to perform much better when making estimates of paragraph similarity on the Lee document set (see Fig. 8). For corpus-based models, the 10,000-document Wikipedia corpus was found to produce the highest correlation with human ratings on the Lee document set (Vectorspace 0.56); however, in the majority of cases the 1,000-document Wikipedia corpora were associated with better model performance at this task. All results presented thus far have consistently shown that the Toronto Star provided a poor knowledge base on which to assess the paragraphs contained in the Lee dataset. These results indicate that when domain-chosen corpora are not a good fit to the knowledge required to make accurate estimates of similarity on paragraphs, using corpora drawn from Wikipedia can improve model performance.
6. Study Four: Corpora that include the dataset paragraphs

In the empirical studies we have reported, subjects were presented with document pairs to be rated. Documents were repeated in different pairs, so for the majority of ratings subjects had already been exposed to all of the test documents. In the previous studies, paragraphs contained in the WENN dataset were included in the WENN corpora, but not for the corpora used by models on the Lee dataset. Consequently, the models were at a
disadvantage relative to participants. This inclusion of dataset paragraphs is potentially important for models like the Topic Model, where context can select for the appropriate meaning of a word. To evaluate the efficacy of including stimulus paragraphs in the semantic models' knowledge base as a method of corpus improvement, the 50 Lee dataset paragraphs were added to the most effective corpora, with the most effective preprocessing, found in the previous studies for the Lee dataset.

For this study, the 50 Lee paragraphs were prepended to both the 1,000- and 10,000-document Wikipedia corpora. These revised corpora were preprocessed using the same techniques described in Study Three for the Wikipedia subcorpora. While the 50 Lee paragraphs were not truncated at 100 words, preprocessing was used to remove punctuation, stop-list words, words that only appear once in the document set, numbers, and single letters. After preprocessing, the smaller corpus contained 1,050 documents with 8,674 unique words and 100,107 tokens, and the larger corpus held 10,050 documents comprising 37,989 unique words from a total of 942,696 tokens.

Adding the 50 Lee paragraphs to the Wikipedia 1,000 corpora significantly improved correlations between model estimates and human judgments of similarity in nearly all cases (see Appendix Table K1 in the Supporting Information). While the Topics model improved by one point using the dot product measure, there was not a significant improvement using the Jensen-Shannon metric. The greatest improvement in model performance was displayed by Vectorspace, which increased from 0.55 to 0.67, and LSA, which rose from 0.51 to 0.60 (see Fig. 9). Both significant performance increases and decreases were produced for all models by prepending the 50 Lee paragraphs to the 10,000-document Wikipedia corpora (see Appendix Table K2 in the Supporting Information). While all differences were significant when compared to the nonaugmented Wikipedia subcorpora, the actual differences in performance were small for most models. In general, these differences ranged between 0.001 and 0.02, with the exception of the Topics model using the Jensen-Shannon metric, which went up from 0.42 to 0.49 when the 50 Lee paragraphs were added to the Wikipedia 10,000 corpus (see Fig. 10).
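The corpus augmentation itself amounts to prepending the rated paragraphs to the background corpus before the semantic space is compiled. The following is a minimal sketch under that reading; the names are illustrative and the subsequent preprocessing and space compilation follow the earlier studies.

```python
def augment_corpus(dataset_paragraphs, background_corpus):
    """Study Four (sketch): prepend the rated dataset paragraphs to the
    background corpus so that the models, like the human raters, have seen
    the test paragraphs before similarities are computed. The augmented
    corpus is then preprocessed and compiled as in Study Three."""
    return list(dataset_paragraphs) + list(background_corpus)

# e.g. wikipedia_1050 = augment_corpus(lee_paragraphs, wikipedia_1000)
```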
7. Overall summary

In Study One, moderate correlations were found between the word overlap model (WENN: 0.43, Lee: 0.48) and human judgments of similarity on both datasets. However, weaker performance was displayed by all of the more complex models when similarity estimates were compared on both the WENN (highest: CSM, 0.26) and Lee (highest: CSM, 0.15) datasets. It was postulated that the semantic models' performance may have been constrained by factors such as corpus preprocessing and a poorly represented knowledge domain (in the case of the Toronto Star corpus and the Lee dataset).

In Study Two, the importance of corpus preprocessing was highlighted; removing the numbers and single letters from corpora improved correlations with human judgment on both datasets for all models with the exception of CSM. After the removal of these characters, the best
Fig. 9. Correlations between human and model estimates of paragraph similarity on the Lee dataset using the standard Wikipedia 1,000 corpora (Wikipedia 1,000) and Wikipedia 1,000 corpora including the 50 Lee documents (Wikipedia 1,050). The overlap model has also been included in this bar graph to allow the reader another point of comparison. Error bars are the 95% confidence limits of the correlation. Correlations exclude Same–Same paragraph comparisons.
performing of the more complex models were LSA (0.48) on the WENN dataset and Vectorspace (0.20) on the Lee dataset. However, the corpus-based models still failed to outperform the word overlap model, which also improved with the removal of numbers and single letters on both datasets (WENN: 0.62, Lee: 0.53). In some ways it is unsurprising that the models' performance in this study was better on the WENN dataset than the Lee dataset, because the paragraphs used in similarity judgments were drawn from the greater set of documents contained in the WENN corpus. That is, in the case of the WENN set there was a better match between the paragraphs that were compared (WENN dataset) and the models' knowledge base (the WENN corpus). Conversely, the Toronto Star articles did not provide a good approximation of the knowledge required to make reliable inferences regarding the similarity of paragraphs contained in the Lee dataset. While the Toronto Star corpus contains extracts of current affairs, these articles (published in 2005) must vary substantially from the précis published in 2001 that are contained in the ABC news mail service that was used by Lee et al. (2005).

In an attempt to obtain a better representation of the knowledge base required to make accurate paragraph similarity comparisons, in Study Three Wikipedia subcorpora were generated for use on each dataset. The Wikipedia documents were found to be much longer and more like short essays than the summary or abstract length documents found in the WENN and Toronto Star corpora. Guided by the research findings of Lee and Corlett (2003), it was
Fig. 10. Correlations between human and model estimates of paragraph similarity on the Lee dataset using the standard Wikipedia 10,000 corpora (Wikipedia 10,000) and Wikipedia 10,000 corpora including the 50 Lee documents (Wikipedia 10,050). The overlap model has also been included in this bar graph to allow the reader another point of comparison. Error bars are the 95% confidence limits of the correlation. Correlations exclude Same–Same paragraph comparisons.
found that Wikipedia documents truncated at 100 words provided better corpora for LSA at 300 dimensions than when using all of the words contained in these documents. LSA's performance was also better using the smaller 1,000-document Wikipedia subcorpora. The decision to use 300 dimensions was in part based on the results of Study One and Study Two, which indicated that increased dimensionality often led to significant performance gains when model estimates of paragraph similarity were compared to human ratings. Based on these findings, spaces were compiled for the models using Wikipedia corpora that contained documents truncated at 100 words.

The semantic models' performance on the WENN dataset did not improve using these Wikipedia subcorpora when compared to the results achieved by models using the WENN corpus. However, there was a substantial improvement by nearly all models (except CSM) when similarity estimates were compared on the Lee dataset. Using the Wikipedia subcorpora, the best performing of the more complex models on the Lee dataset were Vectorspace using both 1,000 documents (0.55) and 10,000 documents (0.56) and SpNMF using 1,000 documents (0.53), all of which approach the inter-rater reliability (0.6) recorded for Lee and colleagues' participants (Lee et al., 2005). The decrement in performance seen using the Wikipedia subcorpora, when compared to the WENN corpus on the WENN dataset, is again somewhat expected given that the documents in the WENN dataset were selected from the WENN corpus. When the results on both the WENN and Lee datasets are considered, Vectorspace, LSA, and SpNMF were the best
performing of the corpus-based models. That said, even using corpora that allowed models to perform on a comparable level with the inter-rater reliability found in the WENN dataset, and approaching that calculated for the Lee dataset, these models still could not significantly outperform the simple word overlap model when estimating the similarity of paragraphs in comparison to human performance at this task.

The final study explored what effect including the dataset paragraphs in a corpus would have on the models' performance. This assessment was only undertaken for the Wikipedia corpora used on the Lee dataset, as the WENN documents were already included in the WENN corpora in previous studies. In particular, the Topic Model's performance was expected to increase; however, this improvement in performance was only observed for the Topic Model using the Dot Product measure of similarity. Generally, performance increases associated with the inclusion of the 50 Lee paragraphs were greater on the smaller 1,050-document Wikipedia corpus when compared to those observed on the 10,050-document Wikipedia corpus. It is possible that any benefit to a model's performance produced by adding these 50 paragraphs is negated by the volume of terms contained in the larger corpus. Overall, the best performance was observed for Vectorspace (0.67) and LSA (0.60) using the 1,050-document Wikipedia corpus containing the 50 Lee paragraphs. It was interesting to note that LSA's performance using the smaller Wikipedia corpus and the 50 Lee paragraphs was almost exactly the same as the inter-rater reliability calculated for the Lee dataset. Furthermore, using this augmented 1,050-document Wikipedia corpus, both LSA (t(1222) = 3.20, p < .05) and Vectorspace (t(1222) = 7.81, p < .05) significantly outperformed the overlap model (0.53) when estimates of paragraph similarity were compared to the human judgments contained in the Lee dataset.

Fig. 11 displays scatterplots from the two best-performing models on the WENN and Lee datasets. It was surprising that on both datasets, the simple word overlap model was among the two best-performing models. As is illustrated by Fig. 11B and D, the word overlap model is generally capturing human responses that have been made on paragraphs which have low or no similarity. It is also interesting to note that on the WENN dataset, LSA using the WENN corpus (NN-NSL) has in all cases estimated some similarity between the paragraph pairs (see Fig. 11A). This may indicate that greater dimensionality is needed by LSA to adequately delineate the themes presented in the WENN corpus documents. At another level, because the WENN paragraphs all focus on ‘‘celebrity gossip,’’ to some extent they may all be considered related. Alternatively, on the Lee dataset, Vectorspace appears to have provided a relatively good match to the average human estimates of paragraph similarity (see Fig. 11C).
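Throughout these comparisons, model performance is judged against both the average human rating and the inter-rater reliability of the human judgments. The sketch below illustrates how the two quantities differ, assuming a complete raters-by-pairs matrix of ratings as in the WENN design; the function names are ours.

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

def inter_rater_reliability(ratings):
    """Mean pairwise Pearson correlation across raters.
    `ratings` is a raters x pairs array of similarity judgments."""
    rs = [pearsonr(ratings[i], ratings[j])[0]
          for i, j in combinations(range(ratings.shape[0]), 2)]
    return float(np.mean(rs))

def model_human_correlation(model_scores, ratings):
    """Correlation between model similarities and the average human rating,
    which is the comparison used throughout this paper. Averaging removes
    rater noise, so a good model can exceed the inter-rater reliability."""
    return pearsonr(model_scores, ratings.mean(axis=0))[0]
```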
8. Discussion

Quite surprisingly, the simplest models (Vectorspace and word overlap) were the best-performing models examined in this research, both exceeding the inter-rater reliability calculated for human judgments. While exceeding the inter-rater reliability is an important milestone, it is possible for a model to perform better. The model is compared against the average rating of the subjects, which eliminates a significant amount of variance in the
Fig. 11. Scatterplots of the two best similarity estimates calculated for both the WENN and Lee datasets compared to the average similarity estimates made by humans for each pair of paragraphs. On the WENN dataset, (A) LSA using the WENN corpus (NN-NSL), and (B) the Overlap model. On the Lee dataset, (C) Vectorspace using the Wikipedia 1,050 (including Lee documents), and (D) the Overlap model. Note, on the Lee dataset, average human ratings have been normalized [0,1].
estimates of the paragraph similarities, whereas the inter-rater reliability is the average of the pairwise correlations between subjects. On the WENN dataset, the overlap model (0.62) exceeded the inter-rater reliability (0.47). Similarly, the Vectorspace model (0.67) using a corpus containing both truncated Wikipedia documents and the 50 Lee paragraphs also exceeded the inter-rater reliability found for the Lee dataset (0.605).

The Vectorspace model's performance on the Lee dataset using the smaller Wikipedia corpus that included the 50 Lee paragraphs was particularly encouraging. While the overlap
model's good performance at these tasks can largely be accounted for by its ability to capture human ratings on paragraph pairs with low or no similarity, the Vectorspace model appeared to provide good estimates of both the similarity and dissimilarity of the Lee paragraphs when compared to human ratings. That said, the Vectorspace model did not perform as well on the WENN dataset when compared to estimates produced by either the overlap model or LSA. However, the finding that no model performed as well as the overlap model on this dataset might indicate that even though the best results for the WENN dataset were found for most corpus-based models using the WENN corpus (NN-NSL), this corpus still did not provide an adequate term representation for the models. Furthermore, it is possible that a better match to the background knowledge needed by models for the WENN paragraphs may have been accomplished had a more complex Lucene query been used to retrieve relevant Wikipedia documents.

One possible explanation for the success of the overlap and Vectorspace models in these studies may be found in the framework of the experiments. In each experiment, participants made pairwise comparisons of paragraphs displayed on a computer monitor. The side-by-side positioning of these paragraph pairs may have encouraged keyword-matching (or discrimination) between the paragraphs by the participants. That is, it is possible that the participants were skimming paragraphs for keywords with which they could make similarity judgments. Another related strategy, which could produce a similar outcome, would be to read one paragraph thoroughly and then to skim the comparison paragraph for salient words presented in the first text. Masson (1982) indicates that when skimming, readers miss important details in newswire texts, and that visually unique features of text such as place names may increase the efficiency of skimming as a reading strategy. Given that names of people and places were certainly present in all paragraphs presented to participants in this research, commonalities between participants' similarity estimates (and also those of the overlap and Vectorspace models) may also be influenced by these proper nouns. In future research, eye-tracking technology could be employed to elucidate the possibility of skimming strategies in this type of experimental task. Alternatively, paragraphs could be presented in a serial sequence, rather than concurrently, and time spent reading each paragraph might act as an indicator of reading strategy.

In the introduction, we categorized the materials used in this class of research as having four types of textual unit: words, sentences, single paragraphs, and chapters or whole documents. Past research has indicated that the Topic Model performs better at word association tasks than LSA. Moreover, researchers have shown that the Topic Model's ability to accommodate homographs is superior to other models at the single word textual unit level (Griffiths et al., 2007). While the ability to discriminate the intended meaning of ambiguous words is certainly desirable, it is possible that this attribute is not a prerequisite for successful model performance with larger textual units such as paragraphs. This may be because textual units such as sentences and paragraphs allow models access to a range of nonambiguous words whose informativeness may compensate for a model's inability to capture the meaning of more ambiguous words (Choueka & Lusignan, 1985; Landauer & Dumais, 1997).
In the studies reported above, four models (word overlap, Vectorspace, LSA, and SpNMF) that do not capture this type of word ambiguity all
outperformed the Topic Model when compared to human ratings at the task of estimating similarity between paragraphs.

Besides a model's ability to make good approximations of human similarity judgments, another factor that must be considered when evaluating the usefulness of these semantic models is the ability to produce interpretable dimensions. For example, one of the criticisms of LSA is that the dimensions it creates are not always interpretable (Griffiths et al., 2007). Similarly, word overlap, Vectorspace, and CSM do not employ any dimensionality reduction, and thus provide word vectors that are difficult to interpret. In contrast, both SpNMF and the Topic Model return interpretable dimensions. To illustrate this point, Table J3 in the Supporting Information file displays a sample of the dimensions created for the 10,000-document Lee-based Wikipedia corpus. As is made clear in this table, it would be easy to meaningfully label any of these dimensions. Given its generally good approximations of human judgment and ability to provide interpretable dimensions, SpNMF could be regarded as one of the best models examined in this article. However, its slow compilation of spaces would certainly need to be addressed for it to be generally useful in either a research or an applied setting. In comparison to the Vectorspace model, which takes 24 s to compile a space of the King James Bible using a 2.4 GHz CPU, the SpNMF model is very computationally expensive, taking just under 8 h. Future research may be able to utilize parallel programming techniques18 to sequence SpNMF calculations over multiple processing units to reduce the time needed to compile SpNMF spaces, and thus make SpNMF a more feasible model to use in tasks of this kind.

In several ways, CSM was the worst-performing model employed in this research. All models performed better than CSM when using either the Wikipedia subcorpora or the WENN corpus (NN-NSL). In addition, the matrices contained within CSM spaces can be over two orders of magnitude larger than those compiled by other models. For example, the space compiled by CSM for the 10,000-document Wikipedia-based corpus with documents truncated at 100 words for the Lee dataset was 12 GB in size. In stark contrast, the same corpus compiled by Vectorspace used 84 MB of disk space. While files of this size are not unusable, accessing the dense vectors contained in CSM spaces is slower than retrieving vectors for comparisons using the other models.

One of the key strengths of the simple overlap model that performed so well in this research is that it is not reliant on a knowledge base, only extracting information from the surface structure of the textual stimuli. This paper has provided examples of the difficulties researchers face when attempting to create a suitable knowledge base for semantic models, not to mention the labor-intensive process undertaken to collect and format a large corpus. Furthermore, the simple overlap model is not without theoretical underpinning. Max Louwerse, in this issue of topiCS, has suggested that ‘‘support for language comprehension and language production is vested neither in the brain of the language user, its computational processes, nor embodied representations, but outside the brain, in language itself.’’ In arguing his claim, Louwerse provides examples of how first-order co-occurrences of terms can produce similar results to LSA on tasks of categorization.
Similarly, the good performance of the overlap model in the studies presented in this paper arguably lends some support to Louwerse's argument.
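To make concrete what a surface-level measure of this kind involves, the following Python sketch computes a simple word-overlap similarity between two paragraphs using only the words they contain, with no external knowledge base. It is an illustration rather than the exact overlap measure used in this research; the stop-list and the Jaccard-style normalization are assumptions made for the example.

```python
import re

# Illustrative stop-list; the study used a standard stop-list (see Note 1).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was"}

def tokenize(paragraph):
    """Lower-case the text, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", paragraph.lower())
    return {w for w in words if w not in STOP_WORDS}

def word_overlap(paragraph_a, paragraph_b):
    """Similarity as the proportion of shared word types (Jaccard-style)."""
    a, b = tokenize(paragraph_a), tokenize(paragraph_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Two paragraphs about the same event share more surface vocabulary
# than unrelated paragraphs, so they receive a higher score.
print(word_overlap("The river flooded the town overnight.",
                   "Overnight flooding hit the town near the river."))
```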
Overall, dimensionality reduction did not appear to improve the models' estimates of paragraph similarity when compared to results produced by the Vectorspace and overlap models. However, LSA's consistent performance, mimicking of human inter-rater reliability, and better performance on the WENN dataset when compared to Vectorspace all indicate that there is further research that must be done in this area. One aspect of this research that we intend to explore more fully is the possibility that subsets of participants' judgment variance can be accommodated by different models. For example, it is possible that participants who are skimming the paragraphs may produce results more similar to either Vectorspace or the overlap model. In contrast, other participants who are reading carefully and not skimming over the text may produce results that are more similar to those calculated with LSA. While it is not possible to draw these conclusions with any certainty based on our current datasets, eye-tracking technology will be employed in future research to explore these possibilities.

The findings presented in this paper indicate that corpus preprocessing, document length, and content are all important factors that determine a semantic model's ability to estimate human similarity judgments on paragraphs. The online, community-driven Wikipedia encyclopedia also proved to be a valuable resource from which corpora could be derived when a more suitable domain-specific corpus is not available. In many applications the hand construction of corpora for a particular domain is not feasible, and so the ability to show a good match between human similarity judgments and machine evaluations is a result of applied significance.
Notes

1. A standard stop-list was applied, and only words appearing 10 times or more were included in the final corpus.
2. http://www.ets.org/
3. Dividing by the entropy reduces the impact of high-frequency words that appear in many documents in a corpus.
4. http://en.wikipedia.org/
5. The reader is directed to Martin and Berry (2007) for an example of how to create a term-by-document matrix for both the Vectorspace model and LSA.
6. Participants actually compared 25 paragraphs; however, a technical fault made the human comparisons of two paragraphs to the rest of the paragraphs in the set unusable.
7. http://www.imdb.com
8. Two-tailed significance tests (α = 0.05) between nonindependent correlations were performed with Williams' formula (T2), which is recommended by Steiger (1980).
9. These techniques were also used in the Lee, Pincombe, and Welsh (2005) study.
10. Both corpora had already been preprocessed with standard methods: stop words and punctuation were removed, as were words that appear in only one document.
11. With the exception of the Topic Model using the Jensen-Shannon metric, all models that incorporate dimensionality reduction performed better at 300 dimensions. Topics-JS at 100 topics was 0.29 compared to 0.28 with 300 topics.
12. http://download.wikimedia.org/enwiki/latest/
13. PyLucene is a Python extension that allows access to the Java version of Lucene: http://pylucene.osafoundation.org/
14. These titles were not included with the WENN paragraphs when similarity comparisons were made by either humans or the semantic models.
15. On average, four keywords were chosen per paragraph to form the Lee-based query.
16. In Web usability research and broad-sheet newspaper media terms, this positioning is often referred to as being "above the fold."
17. In Study Two, LSA similarity estimates correlated 0.48 with human judgments of similarity on the WENN document set.
18. CUDA, the nVidia graphics processing unit technology, presents as an architecture on which these parallel processing gains might be achieved while efficiently using sparse matrices (Bell & Garland, 2008).
Acknowledgments

The research reported in this article was supported by Defence Research & Development Canada (grant number W7711-067985). We would also like to thank Michael Lee and his colleagues for access to their paragraph similarity data. Finally, we wish to thank the reviewers, whose helpful comments have greatly improved this paper.
References

Bell, N., & Garland, M. (2008). Efficient sparse matrix-vector multiplication on CUDA (NVIDIA Technical Report No. NVR-2008-004). NVIDIA Corporation. Available at: http://www.nvidia.com/object/nvidia_research_pub_001.html. Accessed on April 10, 2009.
Berry, M. W., & Browne, M. (2005). Email surveillance using non-negative matrix factorization. Computational & Mathematical Organization Theory, 11(3), 249–264.
Blei, D., Ng, A. Y., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(4–5), 993–1022.
Bullinaria, J. A., & Levy, J. P. (2006). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3), 510–526.
Choueka, Y., & Lusignan, S. (1985). Disambiguation by short contexts. Computers and the Humanities, 19(3), 147–157.
Foltz, P. W. (2007). Discourse coherence and LSA. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 167–184). Mahwah, NJ: Lawrence Erlbaum Associates.
Foltz, P. W., Laham, D., & Landauer, T. K. (1999). The intelligent essay assessor: Applications to educational technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1(2). Available at: http://imej.wfu.edu/articles/1999/2/04/index.asp. Accessed on April 2, 2008.
Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In M. M. Veloso (Ed.), Proceedings of the 20th international joint conference on artificial intelligence (pp. 1606–1611). Menlo Park, CA: AAAI Press.
Giles, J. (2005). Internet encyclopaedias go head to head. Nature, 438(7070), 900–901.
Griffiths, T. L., & Steyvers, M. (2002). A probabilistic approach to semantic representation. In W. D. Gray & C. D. Schunn (Eds.), Proceedings of the 24th annual conference of the Cognitive Science Society (pp. 381–386). Mahwah, NJ: Lawrence Erlbaum Associates.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1), 5228–5235.
Griffiths, T. L., Tenenbaum, J. B., & Steyvers, M. (2007). Topics in semantic representation. Psychological Review, 114(2), 211–244.
Kireyev, K. (2008). Beyond words: Semantic representation of text in distributional models of language. In M. Baroni, S. Evert, & A. Lenci (Eds.), Proceedings of the ESSLLI workshop on distributional lexical semantics: Bridging the gap between semantic theory and computational simulations (pp. 25–33). Hamburg, Germany: ESSLLI.
Kwantes, P. J. (2005). Using context to build semantics. Psychonomic Bulletin and Review, 12(4), 703–710.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211–240.
Landauer, T. K., McNamara, D. S., Dennis, S., & Kintsch, W. (2007). Handbook of latent semantic analysis. Mahwah, NJ: Lawrence Erlbaum Associates.
Lee, M. D., & Corlett, E. Y. (2003). Sequential sampling models of human text classification. Cognitive Science, 27(2), 159–193.
Lee, M. D., Pincombe, B. M., & Welsh, M. B. (2005). An empirical evaluation of models of text document similarity. In B. G. Bara, L. W. Barsalou, & M. Bucciarelli (Eds.), Proceedings of the 27th annual conference of the Cognitive Science Society (pp. 1254–1259). Mahwah, NJ: Lawrence Erlbaum Associates.
Martin, D. I., & Berry, M. W. (2007). Mathematical foundations behind Latent Semantic Analysis. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 35–55). Mahwah, NJ: Lawrence Erlbaum Associates.
Martin, M. J., & Foltz, P. W. (2004). Automated team discourse annotation and performance prediction using LSA. In S. T. Dumais, D. Marcu, & S. Roukos (Eds.), HLT-NAACL 2004: Short papers (pp. 97–100). Boston, MA: Association for Computational Linguistics.
Masson, M. E. J. (1982). Cognitive processes in skimming stories. Journal of Experimental Psychology: Learning, Memory, and Cognition, 8(5), 400–417.
McNamara, D. S., Cai, Z., & Louwerse, M. M. (2007). Optimizing LSA measures of cohesion. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 379–400). Mahwah, NJ: Lawrence Erlbaum Associates.
Moed, H. F., Glänzel, W., & Schmoch, U. (2004). Handbook of quantitative science and technology research: The use of publication and patent statistics in studies of S&T systems. Secaucus, NJ: Springer-Verlag New York.
Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (1998). The University of South Florida word association, rhyme, and word fragment norms. Available at: http://w3.usf.edu/FreeAssociation/. Accessed February 2, 2009.
Pincombe, B. M. (2004). Comparison of human and LSA judgements of pairwise document similarities for a news corpus (Tech. Rep. No. DSTO-RR-0278). Adelaide, Australia: Australian Defence Science and Technology Organisation (DSTO), Intelligence, Surveillance and Reconnaissance Division. Available at: http://hdl.handle.net/1947/3334. Accessed on April 15, 2008.
Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Shashua, A., & Hazan, T. (2005). Non-negative tensor factorization with applications to statistics and computer vision. In L. De Raedt & S. Wrobel (Eds.), Proceedings of the 22nd international conference on machine learning (pp. 792–799). New York: ACM Press.
Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2), 245–251.
Toffler, A. (1973). Future shock. London: Pan.
Xu, W., Liu, X., & Gong, Y. (2003). Document clustering based on non-negative matrix factorization. In J. Callan, D. Hawking, A. Smeaton, & C. Clarke (Eds.), Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR '03) (pp. 267–273). New York: ACM Press.
Zelikovitz, S., & Kogan, M. (2006). Using web searches on important words to create background sets for LSI classification. In G. Sutcliffe & R. Goebel (Eds.), Proceedings of the 19th international FLAIRS conference (pp. 598–603). Menlo Park, CA: AAAI Press.
Supporting Information

Additional Supporting Information may be found in the online version of this article on Wiley Online Library:
Comparing Methods for Single Paragraph Similarity Analysis
Please note: Wiley-Blackwell is not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.
Topics in Cognitive Science 3 (2011) 123–139 Copyright © 2011 Cognitive Science Society, Inc. All rights reserved. ISSN: 1756-8757 print / 1756-8765 online DOI: 10.1111/j.1756-8765.2010.01125.x
Identifying Optimum Performance Trade-Offs Using a Cognitively Bounded Rational Analysis Model of Discretionary Task Interleaving

Christian P. Janssen,a Duncan P. Brumby,a John Dowell,a Nick Chater,b Andrew Howesc

aInteraction Centre, University College London
bBehavioural Science Group, Warwick Business School
cManchester Business School, University of Manchester
Received 17 September 2010; received in revised form 15 October 2010; accepted 17 October 2010
Abstract

We report the results of a dual-task study in which participants performed a tracking and typing task under various experimental conditions. An objective payoff function was used to provide explicit feedback on how participants should trade off performance between the tasks. Results show that participants' dual-task interleaving strategy was sensitive to changes in the difficulty of the tracking task and resulted in differences in overall task performance. To test the hypothesis that people select strategies that maximize payoff, a Cognitively Bounded Rational Analysis model was developed. This analysis evaluated a variety of dual-task interleaving strategies to identify the optimal strategy for maximizing payoff in each condition. The model predicts that the region of optimum performance is different between experimental conditions. The correspondence between human data and the prediction of the optimal strategy is found to be remarkably high across a number of performance measures. This suggests that participants were honing their behavior to maximize payoff. Limitations are discussed.

Keywords: Multitasking; Performance Operating Characteristic; Cognitively Bounded Rational Analysis; Cognitive modeling; Performance trade-offs
Correspondence should be sent to Christian P. Janssen, UCL Interaction Centre, University College London, Gower Street, London, WC1E 6BT, UK. E-mail: [email protected]

1. Introduction

Multitasking often requires trade-offs to be made in terms of how well each task is performed (e.g., task time, number of errors made). Such performance trade-offs can be
described by plotting Performance Operating Characteristics, which show how the performance of separate tasks varies together systematically (Navon & Gopher, 1979; Norman & Bobrow, 1975). There is, however, a large range of strategies that might be deployed to manage task performance in a given multitasking situation (e.g., Brumby, Howes, & Salvucci, 2007). Why time is allocated differentially to each task, and why particular patterns of task interleaving are adopted, must reference the relative success of those different strategies for allocating attention between tasks (see also Payne, Duggan, & Neth, 2007). Such consideration of the strategic choices made by people in multitasking settings naturally supposes that an optimal performance trade-off might be defined. Given specific feedback about optimal performance, the question naturally becomes: Can people multitask optimally?

In line with this adaptive view, previous research has shown that people can adapt their behavior to prioritize one task over another in dual-task settings, and in this way take up different points on the Performance Operating Characteristic (e.g., Brumby, Salvucci, & Howes, 2009; Gopher, 1993; Horrey, Wickens, & Consalus, 2006; Janssen & Brumby, 2010; Levy & Pashler, 2008). However, in these studies the verbal instructions given to participants to prioritize one task over another might have been open to differences in subjective interpretation. Moreover, there is no formal method for identifying the optimal point in the performance trade-off curve. For this to be done, a quantitative payoff function must be defined against which different strategies can be evaluated.

Quantitative payoff functions have been used in experimental psychology to provide explicit instructions to participants on how the required tasks should be performed. For example, a payoff function might be used to inform participants how they should trade responding quickly to the appearance of stimuli against the risk of making a response error (e.g., Schumacher et al., 1999). Howes, Lewis, Vera, and colleagues (Howes, Lewis, & Vera, 2009; Howes, Vera, Lewis, & McCurdy, 2004; Lewis, Vera, & Howes, 2004; Vera, Howes, McCurdy, & Lewis, 2004) have taken the use of payoff functions one step further by putting forward the hypothesis that skilled human performance can be understood as a utility maximization problem that is constrained by cognitive architecture, knowledge, and experience. In other words, the idea is to assume that people are boundedly optimal. A payoff function can be used to identify this optimum solution, as was done successfully in the Psychological Refractory Period (PRP) paradigm (see Howes et al., 2009, for details).

However, in some respects the PRP task is simple: Stimuli appear at their own pace and single responses need to be made. Slightly more complex are dynamic discretionary task interleaving scenarios, where participants need to decide themselves when to switch attention from one dynamic task to another. In these scenarios, the use of payoff functions has been limited. For example, payoff functions have been used to motivate participants to perform to a certain criterion (e.g., Hornof, Zhang, & Halverson, 2010) or to demonstrate that participants use payoff as an incentive to spend more time on one task over another (e.g., Wang, Proctor, & Pick, 2007).

In this paper, we also use a payoff function in a dynamic discretionary task interleaving paradigm. However, we will follow the methodology of Howes et al. (2009) and use a
payoff function to investigate whether participants adopt the optimum strategy for maximizing payoff. This is in line with the original intention of work that inspired research on Performance Operating Characteristics, Signal Detection Theory, and Receiver Operating Characteristics to identify strategies that maximize utility (Swets, Tanner, & Birdsall, 1961). In our task environment, participants had to keep a randomly moving cursor inside a circular area and type a string of digits, but they could only see and control one task at a time. Participants’ performance was captured in a single payoff score, which reflected the payment the participant received at the end of the study. Tracking tasks have been used in several multitasking studies (e.g., Gopher, 1993; Hornof et al., 2010; Kieras, Meyer, Ballas, & Lauber, 2000; Lallement & John, 1998; Salvucci & Taatgen, 2008). The work presented here builds on and extends this work by showing how a payoff function enables us to bind normative cognitive models with experimental observations of multitasking behavior, and specifically, to show how strategy choice in dynamic discretionary task interleaving paradigms can be better understood by comparing observed performance to a prediction of optimal performance for maximizing payoff.
2. Experiment

2.1. Method

2.1.1. Participants
Eight participants (four female) between 20 and 35 years of age (M = 23 years) from the subject pool at UCL participated for monetary compensation. Payment was based on performance (details are provided in the Materials section). The total payment achieved by participants ranged between £7.13 and £11.45 (M = £9.14).

2.1.2. Materials
The dual-task setup required participants to perform a continuous tracking task and a discrete typing task, presented on a single 19-inch monitor with a resolution of 1280 × 1024 pixels. Fig. 1 shows the layout of the tasks on the display. The typing task was presented on the left side and the tracking task on the right. Each task was presented within a 450 × 450 pixels area, with a vertical separation of 127 pixels between the tasks.

The tracking task required participants to keep a square cursor that drifted about the display in a random fashion inside a target circle (see Fig. 1). The cursor was 10 × 10 pixels and the target had a radius of either 80 (small target) or 120 pixels (large target). A random walk function was used to vary the position of the cursor in the display. The rate at which the cursor drifted about the display was varied between different experimental conditions. In a low noise condition, the random walk had a mean of zero and standard deviation of three pixels per update, while in a high noise condition the random walk had a mean of zero and standard deviation of five pixels per update. Updates occurred approximately once every 25 ms. To control the position of the cursor in the tracking display, participants used
Fig. 1. Position of the two tasks in the interface.
a Logitech Extreme 3D Pro joystick with their right hand. The drift function of the cursor was suspended whenever the joystick angle was greater than ±0.08 (the maximum angle was ±1). The speed at which the cursor could be moved was scaled by the angle, with a maximum of five pixels per 25 ms.

The typing task required participants to enter a string of 20 digits using a numeric keypad with their left hand. The string was made up of the digits 1–3, where each digit occurred at least six times in a given sequence. Digits were presented in a random order with the constraint that no single digit was presented more than three times in a row in the sequence (e.g., "31213321231322231123" as in Fig. 1). When a digit was entered correctly it was removed from the to-be-entered sequence. In this way, the left-most digit on the display was always the next one to be entered. When an incorrect digit was typed, the string would not progress. No additional signal was given to indicate this error.

The study used a forced interleaving paradigm, in which only one of the two tasks was visible and could be worked on at any moment in time. By default the typing task was visible and the tracking task was covered by a gray square. Holding down the trigger of the joystick made the tracking task visible and covered the typing task. Releasing the trigger covered the tracking task and made the typing task visible once more. Input was only for the visible task and any input for the covered task was ignored (recall that the tracking task only received input from the joystick while the typing task only received input from the keyboard).

2.1.3. Design
The study manipulated aspects of the tracking task using a 2 (cursor noise: low vs. high) × 2 (target size: small vs. large) within-subjects design. The main dependent variables were the time required to complete the typing task, the maximum distance of the cursor from the center of the target circle, and the total time the cursor was outside of the target circle.

Participants were remunerated based on their performance using an objective payoff function. The payoff function was designed to encourage fast completion of the typing task
while also encouraging participants to keep the cursor inside the target. The payoff (in pounds) received following a given trial was defined as:

Payoff = Gain + Tracking Penalty + Digit Penalty    (1)

The minimum payoff for a given trial was limited to −0.20 pounds (i.e., a loss). The gain component was based on the total time required to complete a dual-task trial (in seconds):

Gain = 0.15 × e^(−1 × TotalTrialTimeInSec / 20 + 0.25)    (2)

This function was determined using pilot studies, to make sure participants mostly gained money. To encourage participants to keep the cursor inside the target circle of the tracking task, a tracking penalty was applied in trials where the cursor crossed the target boundary:

Tracking Penalty = −0.1 × e^(SecOutside × 1.1090 − 0.6931)    (3)
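To make the payoff structure concrete, the following Python sketch implements Eqs. 1–3 as described here, including the −£0.20 floor and the £0.01-per-error digit penalty mentioned in the surrounding text. The function and variable names are our own illustrative choices, and the assumption that no tracking penalty applies when the cursor never crosses the boundary follows the wording above rather than any published code.

```python
import math

def gain(total_trial_time_sec):
    """Eq. 2: gain decays exponentially with total trial time (in seconds)."""
    return 0.15 * math.exp(-1 * total_trial_time_sec / 20 + 0.25)

def tracking_penalty(sec_outside):
    """Eq. 3: penalty grows exponentially with time the cursor spends outside the target."""
    if sec_outside <= 0:
        return 0.0  # assumed: no penalty if the cursor never crossed the boundary
    return -0.1 * math.exp(sec_outside * 1.1090 - 0.6931)

def trial_payoff(total_trial_time_sec, sec_outside, typing_errors):
    """Eq. 1, with each typing error deducting 0.01 pounds and a floor of -0.20 per trial."""
    payoff = gain(total_trial_time_sec) + tracking_penalty(sec_outside) - 0.01 * typing_errors
    return max(payoff, -0.20)

# Sanity checks against the values quoted in the text:
# 0.625 s outside the target costs about 0.10 pounds, 1.25 s about 0.20 pounds.
print(round(tracking_penalty(0.625), 3))    # -> -0.1
print(round(tracking_penalty(1.25), 3))     # -> -0.2
print(round(trial_payoff(7.5, 0.0, 0), 3))  # a fast, clean trial earns roughly 0.13 pounds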
With this penalty, £0.10 was lost when the cursor was outside of the radius for 0.625 s, and £0.20 was lost when it was outside of the radius for 1.25 s. To encourage accurate typing, a digit penalty deducted £0.01 from the total payoff whenever an incorrect digit was entered. In the remainder of this paper, we will not look at the effect of digit penalty on payoff, as the total number of errors was relatively low, and in most trials no errors were made (see Results). We leave a further investigation of errors to future work and refer the interested reader to Smith, Lewis, Howes, Chu, and Green (2008) for a model that investigates the impact of errors on performance.

2.1.4. Procedure
Participants were informed that they would be required to perform a series of dual-task trials and that they would be paid based on their performance. A participant's payment was based on the cumulative payoff over the course of the study, in addition to a base payment of £3. Participants were told that they would gain more points by completing the typing task as quickly as possible, but that they would lose points if they made a typing error or if the cursor drifted outside of the target area in the tracking task. We chose not to give participants a formal description of the payoff function, but instead provided explicit feedback after every dual-task trial with the payoff score achieved. After explaining how to perform each of the tasks, participants performed two single-task training trials for each task and two dual-task practice trials. Participants were instructed that for dual-task trials only one of the two tasks would be visible and controllable at any moment in time, and they were instructed how to switch between tasks using the trigger button on the joystick.

Participants then completed four blocks of experimental trials (one for each experimental condition). In the first two blocks, participants experienced a single noise level, either low or high noise. The noise level was randomly assigned to participants and balanced across participants. On the first block a radius size (small or large) was also randomly assigned,
and on the second block the other radius level was assigned. For the third and fourth block this order was repeated, but with another level for noise. For each block, participants completed five single-task tracking trials, five single-task typing trials, and twenty dual-task trials. The dual-task trials were further grouped into sets of five trials, with a short pause between each set. The total procedure took about 1 h to complete.

2.2. Results

Across all keystrokes in single-task typing trials, participants on average typed 2.5% (range 0.5–5.2%) of their keystrokes incorrectly (81 out of 3,281 keystrokes). At the trial level, 0, 1, 2, or more errors were made on, respectively, 61.9%, 29.4%, 5%, and 3.8% of the trials. In the dual-task trials, the number of typing errors was also low. Participants on average typed 3.6% (range 1.2–5.4%) of their keystrokes incorrectly (481 out of 13,281 keystrokes). At the trial level, 0, 1, 2, or more errors were made on, respectively, 52.5%, 29.4%, 12.0%, and 6.1% of the trials. We regard the occurrence of errors as interesting, but their occurrence is too low to draw any conclusions from (i.e., on most trials there are no errors). We therefore do not look at the effect that errors had on performance and leave this for future work. Also, we focus on performance during the last five dual-task trials of each experimental condition, as these reflect a period during which the participant had had time to adapt their behavior to the payoff function based on the feedback received. A 2 (cursor noise) × 2 (target size) analysis of variance (ANOVA) was used for all statistical analysis with a significance level of .05.

2.2.1. Overall performance
We first consider the effect of varying aspects of the tracking task on the time required to complete the typing task, the maximum distance of the cursor from the center of the target circle in the tracking task, and the mean time the cursor was outside the target area. It was found that trial time was significantly longer when there was greater noise in the tracking task (M = 11.17 s, SD = 4.32 s) than when there was a lower level of noise in the tracking task (M = 7.51 s, SD = 2.00 s), F(1, 7) = 15.07, p < .01. Trials were also longer when the target in the tracking task was smaller (M = 10.59 s, SD = 4.01 s) than when it was larger (M = 8.09 s, SD = 3.22 s), F(1, 7) = 11.84, p = .01. There was no significant interaction, F(1, 7) < 1.

In the tracking task, we consider the maximum distance of the cursor from the center of the target over the course of a trial. It was found that the cursor drifted more when there was a higher level of noise (M = 95 pixels, SD = 15 pixels) than when there was a lower level of noise (M = 61 pixels, SD = 8 pixels), F(1, 7) = 33.42, p < .001. There was no effect of target size on the maximum distance that the cursor drifted over a trial, F(1, 7) = 1.19, p = .31, nor was the interaction effect significant, F(1, 7) < 1.

Another measure of performance in the tracking task is the average time the cursor was outside of the target area per trial. Participants let the cursor remain outside of the target
area for longer when there was high noise (M = 0.36 s, SD = 0.45 s), compared to when there was low noise (M = 0.04 s, SD = 0.10 s), F(1, 7) = 7.28, p = .03. The cursor also spent more time outside of the target area when the target area was small (M = 0.34 s, SD = 0.05 s), compared to when it was large (M = 0.05 s, SD = 0.11 s), F(1, 7) = 13.26, p < .01. The interaction was not significant, but there was evidence of a trend, F(1, 7) = 4.58, p = .07. This trend reflects that in the low noise, large target condition the cursor never crossed the target area, whereas in the high noise, small target condition the cursor crossed the target area for over half a second.

These differences in overall task performance between conditions are somewhat expected and unsurprising because they partly reflect differences in the difficulty of the tracking task. We were far more interested in how this performance was achieved. We next consider the dual-task interleaving strategy that was adopted in each condition.

2.2.2. Strategies
Two aspects determine a strategy: (a) the number of digits typed during each visit to the typing window and (b) the amount of time spent in the tracking window per visit to this window. Fig. 2 shows these two basic strategy dimensions for each of the four conditions. For the number of digits typed we only considered correct digits. It can be seen that for each experimental condition there is a unique point in this strategy space: strategies differ between conditions. The number of digits entered per visit increased with an increase in target size, F(1, 7) = 17.4, p < .01, and it also increased with a decrease in cursor noise. That is, more digits were typed when it took longer for the cursor to cross the boundary, F(1, 7) = 15.18, p < .01. There was no significant interaction, F(1, 7) = 3.24, p = .12.
Fig. 2. Plot of the mean number of digits typed and time spent tracking, both per visit. Error bars depict standard errors.
It can also be seen in Fig. 2 that the time spent in the tracking window per visit increased with an increase in the noise associated with the cursor’s movement, F(1, 7) = 14.98, p = .01. An interaction effect was present as visit time was particularly short in the low noise, large radius condition, F(1, 7) = 11.55, p = .01. There was no significant effect of radius, F(1, 7) < 1.
3. A CBRA model of tracking-while-typing

The results show that participants adapted their dual-task behavior to changes in the difficulty of the tracking task by varying the amount of time that was given to each task before switching to the other task. However, what these results do not show is whether participants were adopting a strategy that is optimal in terms of maximizing the expected payoff that could be achieved in each condition, both for the individual task (tracking and typing) and the combination of tasks. To answer this question we developed a Cognitively Bounded Rational Analysis model (Howes et al., 2009) of aggregate human performance. This framework is particularly useful for comparing the performance of alternative strategies, allowing strategies to be discriminated based on the payoff achieved.

The model developed here is inspired by our previous work in developing models of a dialing-while-driving dual-task setup (e.g., Brumby, Salvucci, & Howes, 2007; Brumby et al., 2009; Janssen & Brumby, 2010). Both dual-task environments share core characteristics, but the current work differs in that it incorporates an explicit payoff function against which various dual-task interleaving strategies can be evaluated. In the next section, we describe the model that was used to determine whether people were selecting strategies that would maximize the financial payout that could be achieved in each condition.

3.1. Model development

3.1.1. Tracking model
What did people do when visiting the tracking window? The crucial question for developing a model of the tracking task is to understand how people set the angle of the joystick based on the position of the cursor in the display. Fig. 3 shows the mean values for discrete bins of five pixels for the horizontal axes (vertical data are similar). We fitted a linear function (shown as a dotted line):

Angle = −0.01 × current distance from target    (4)

The joystick had a maximum angle of ±1. This shows that participants' behavior in the tracking task can be captured by a simple linear function that sets the angle of the joystick based on the position of the target within the display. To implement this model, as in the experiment, the speed of the cursor is calculated by multiplying the angle of the joystick with a value of five pixels. Speed is calculated once every 250 ms of tracking, and the cursor position is updated every 25 ms based on this speed value. As in the experiment, the cursor can only be controlled when the tracking window is open. The total time spent tracking in dual-task is varied as part of the strategy (see below).
Fig. 3. Plot of the angle of the joystick as a function of distance from the target. The dashed line shows a fitted function.
3.1.2. Typing model
To model the typing task we fitted model performance to human single-task typing performance data. To get a measure of how long it took participants to enter a digit in the typing task, we take the mean single-task inter-keypress interval, which was 260 ms. This value was calculated by taking the mean value of participants' total typing time and dividing this by the number of to-be-entered digits (20). In this way, errors were taken only indirectly into account. We use this time estimate to model the time it takes to enter a single digit in the typing task.

3.1.3. Dual-task model
The dual-task model works as follows. The model starts off with typing a series of digits (the length of which is varied as a strategy). The time to type each digit is taken from our single-task model (260 ms). For switching between typing and tracking a switch cost of 250 ms is incurred, based on experimental data (time between last key press and pressing the trigger on the joystick: 247 ms). The model then tracks the cursor for a designated amount of time (varied between runs as a strategy aspect). When it switches back to typing, a switch cost of 180 ms is incurred (time between releasing the trigger and pressing the first key, corrected for the single-task typing time: 185 ms).
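The following Python sketch pulls together the pieces described above (the Eq. 4 joystick controller, the 260 ms typing time, and the 250/180 ms switch costs) into a simulation of a single dual-task trial under a given interleaving strategy, scored with the payoff function of Eqs. 1–3. It is a minimal reconstruction under stated assumptions, not the authors' original code: applying the random walk and the controller independently per axis, starting the cursor at the target centre, letting drift continue while the typing window is open, and ending the trial at the final keypress are all our reading of the description above, and every name is an illustrative choice.

```python
import math
import random

# Constants taken from the text; remaining details are assumptions.
DIGIT_TIME = 0.260       # s per typed digit (single-task inter-keypress interval)
SWITCH_TO_TRACK = 0.250  # s, last keypress -> trigger press
SWITCH_TO_TYPE = 0.180   # s, trigger release -> first keypress
TICK = 0.025             # s between cursor position updates

def simulate_trial(digits_per_visit, track_time_per_visit, noise_sd, radius,
                   n_digits=20, rng=None):
    """Simulate one 'type k digits, track for t seconds' trial.
    Returns (total_trial_time_sec, sec_outside_target)."""
    rng = rng or random.Random()
    x = y = 0.0            # assumed: cursor starts at the target centre
    sec_outside = trial_time = 0.0
    typed = 0

    def advance(duration, tracking):
        nonlocal x, y, sec_outside, trial_time
        angle_x = angle_y = 0.0
        for i in range(int(round(duration / TICK))):
            if tracking and i % 10 == 0:               # controller update every 250 ms
                angle_x = max(-1.0, min(1.0, -0.01 * x))  # Eq. 4, clamped to +/-1
                angle_y = max(-1.0, min(1.0, -0.01 * y))
            x += rng.gauss(0, noise_sd)                # random-walk drift per 25 ms
            y += rng.gauss(0, noise_sd)                # (per-axis form is an assumption)
            if tracking:
                x += 5.0 * angle_x                     # up to 5 px per 25 ms, scaled by angle
                y += 5.0 * angle_y
            if math.hypot(x, y) > radius:
                sec_outside += TICK
            trial_time += TICK

    while typed < n_digits:
        k = min(digits_per_visit, n_digits - typed)
        advance(k * DIGIT_TIME, tracking=False)        # typing visit
        typed += k
        if typed < n_digits:                           # interleave a tracking visit
            advance(SWITCH_TO_TRACK, tracking=False)
            advance(track_time_per_visit, tracking=True)
            advance(SWITCH_TO_TYPE, tracking=False)
    return trial_time, sec_outside

def payoff(trial_time, sec_outside):
    """Eqs. 1-3, without the digit penalty (the model makes no typing errors)."""
    gain = 0.15 * math.exp(-1 * trial_time / 20 + 0.25)
    penalty = -0.1 * math.exp(sec_outside * 1.1090 - 0.6931) if sec_outside > 0 else 0.0
    return max(gain + penalty, -0.20)

# Average payoff of one candidate strategy (5 digits per visit, 0.5 s of tracking),
# averaged over repeated runs in the spirit of the paper's 100-run evaluation.
runs = [simulate_trial(5, 0.5, noise_sd=3, radius=120) for _ in range(100)]
print(sum(payoff(t, s) for t, s in runs) / len(runs))
```

Sweeping `digits_per_visit` over 1–20 and `track_time_per_visit` over 250–3,000 ms, as described in the next subsection, and keeping the highest-scoring combinations is the sense in which such a simulation can be used to bracket the payoff-maximizing strategy.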
3.1.4. Strategies
We used the basic model described above to explore performance of a variety of different dual-task strategies. A strategy is determined by the number of digits that are typed in sequence during a visit to the target window. We consider only a subset of 20 simple strategies that placed a consistent number of digits during each visit to the typing task (between 1 and 20 digits), with the exception of the last visit during which the remaining digits were placed (e.g., strategy 6-track-6-track-6-track-2). In addition, for each visit to the tracking task, more or less time can be spent on tracking. We systematically explored performance for models that spent between 250 and 3,000 ms on tracking during each visit to the tracking window, using steps of 250 ms (i.e., 12 alternatives). This gave a total of 229 (20 × 12 − 11) strategy alternatives (see also Brumby, Salvucci, et al., 2007; Brumby et al., 2009; Janssen & Brumby, 2010, for a similar approach to modeling driver distraction). The objective function for rating performance is similar to that used in the experiment (see Eqs. 1–3), with the exception that the model does not make typing errors. For each strategy alternative 100 runs were performed. Mean performance is reported.

3.2. Model results

The first question of interest was whether the model would fit the experimental data. To do this, we hardcoded a strategy for each condition that typed the same number of correct digits per visit and spent about the same amount of time tracking as participants did. We took these values within one standard error of the human means, as our model's strategy alternatives were more discrete than the human data (e.g., the model's tracking time was explored in discrete steps of 250 ms). With these values set, we asked whether the model's performance fitted the total trial time, maximum deviation, and time outside the target area in each experimental condition observed in the human data. This is important so as to know that we have a reasonable calibration of the model's performance relative to the human data. Model performance was within two standard errors of the human data for these variables.

Given that we can be confident that the model is reasonably calibrated to the observed strategies, we can now use the model to evaluate the payoff achieved by different (unobserved) dual-task interleaving strategies. Fig. 4 shows a plot of the maximum number of digits typed per visit to the typing window versus payoff. In this figure (and Figs. 5 and 6), the performance predictions of the model for each strategy alternative are represented by colored circles. The color of the circle reflects the average payoff that the model gained over 100 simulations when this strategy was applied. The warmer the color, the higher the score (i.e., the higher on the vertical axis in Fig. 4). The maximum score is £0.16, and each change in color reflects a change in payoff of £0.02. The horizontal axis of Fig. 4 shows the maximum number of digits that the model typed per visit to the target window. Each of our 20 simple strategies takes a unique value on this axis (e.g., strategy 6-track-6-track-6-track-2 is plotted at value 6 on the horizontal axis). For each of these simple strategies, we explored multiple strategy alternatives based on how much time the model spent on tracking. This resulted in multiple points for each value on the horizontal axis.

In three out of the four conditions, the human data (black point, with standard error bars) lie within the region of the highest payoff. That is, on average participants typed the optimum number of digits per visit to the target window so as to achieve the highest payoff.
Note that in the small radius, large noise condition participants did not
Fig. 4. Plot of the mean number of digits typed per visit to the typing window versus predicted payoff per trial for the modeled strategies per condition. Color represents the average payoff achieved by the model using that strategy. Human results are shown as black points with standard error.
Fig. 5. POCs of trial time versus maximum deviation for the modeled strategy alternatives per condition. Color represents the average payoff achieved by the model using that strategy. Human results are shown as black points with standard error. The dashed line shows the target boundary.
achieve the highest score; they should have typed fewer digits per visit to the typing window. The analysis above suggests that participants selected appropriate strategies in each condition. To investigate whether this strategy also resulted in good overall performance,
Fig. 6. POCs of trial time versus time the cursor was outside of the target area for the modeled strategy alternatives per condition. Color represents the average payoff achieved by the model using that strategy. Human results are shown as black points with standard error.
we plotted Performance Operating Characteristics (POCs). Recall that POCs display performance on one task against performance on the other task (Navon & Gopher, 1979; Norman & Bobrow, 1975). We included two types of POCs. In Fig. 5, the POCs are plotted for the total trial time and the maximum deviation of the cursor from the center of the target. In Fig. 6, the POCs are plotted for the total trial time and the total time the cursor spent outside of the target area per trial. Again the color of the model data represents the average payoff achieved using this strategy. We highlight some general observations of these POCs:

1. For each condition the shape of the POC differs.
2. The scores that can be achieved differ between conditions, as indicated by different color ranges in each condition.
3. The best performing strategies (i.e., the regions with the warmest colors) tend to cluster on the outer edge (left side and bottom side) of the strategy space: the trade-off curve. That is, the best strategies make an optimal trade-off for performance on the combination of the two tasks.
4. In addition to point 3, for the current payoff function the optimum region is at different sections of the trade-off curve for some of the conditions. The biggest contrast is in Fig. 5 between the low noise, large target condition and the high noise, small target condition. In the former, the best score is achieved by letting the cursor drift completely (i.e., the best performance is at the top left), whereas in the latter condition the optimum is at the inflection point (i.e., the middle of the curve, on the outside, where
it crosses the dashed line). The model is essential for this assessment, as traditional POCs cannot predict optimal regions by themselves. Inherently, the exact location of the optimum region can also shift with a change in the payoff function. Leaning on this fourth point, our analysis helps to bracket optimal performance. For each of the measures of total trial time, maximum deviation of the cursor (see Fig. 5), and time spent outside of the target area (see Fig. 6), the model predicts that the optimal region lies in a different range of values. This is consistent with our finding in the human data for these measures that showed main effects of radius, or noise, or significant interactions of these factors. Note that this way of bracketing differs from bracketing methodologies that identify the fastest and slowest strategies for performance based on performance time (e.g., Kieras & Meyer, 2000). We can bracket performance for the best strategy alternatives (and others if necessary), based on the predicted payoff of those strategies. Similar to, for example, the work by Kieras and Meyer (2000), performance of these strategies can then be expressed in multiple dimensions of performance (e.g., in our case trial time, maximum deviation of the cursor, and time spent outside of the target area). Fig. 5 suggests that for the different performance measures, human performance is around or at the optimum. In all four conditions, human performance overlaps with the optimum range of values for total trial time and maximum deviation of the cursor from the center of the target (see Fig. 5). Fig. 6 suggests that for three conditions, human performance also overlaps with the optimum range of values for the total time the cursor was outside of the target area. It does not overlap in the high noise, small radius condition (see the bottom left graph in Fig. 6). To close, the correspondence between performance predictions of the best model strategy alternatives and human data can also be assessed using mean performance. Fig. 7 shows bar plots of mean human performance and mean model performance for three measures of performance. Model data are the mean performance for strategy alternatives that fall in the
Fig. 7. Bar plot of human and model performance in each condition. Model performance is the mean performance for strategy alternatives that fall in the highest scoring region (i.e., in the region with the warmest color in Figs. 4–6). Error bars show standard error. The three plots show (A) total trial time (in seconds), (B) maximum deviation of the cursor from the center of the target (in pixels), and (C) total time the cursor was outside of the target area (in seconds).
highest scoring region (i.e., in the region with the warmest color in Figs. 4–6). The correspondence between the human and model bars is surprisingly high: error bars between model and human data overlap in most instances. This is also reflected in the R² and root mean squared error (RMSE) values. The fit values are as follows for trial time (R² = 0.88, RMSE = 1.63 s), maximum deviation (R² = 0.95, RMSE = 5.27 pixels), and time spent outside of the target area (R² = 0.98, RMSE = 0.21 s). These values can be considered to be high, given that the model was not fitted to optimize R² or RMSE values. Rather, the prediction of the model was based on the set of strategy alternatives that would achieve the highest payoff in each condition.
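For readers who want to reproduce this kind of fit assessment, the sketch below shows one standard way of computing R² (as the squared Pearson correlation) and RMSE between mean model predictions and mean human observations across conditions. Whether the authors used the squared correlation or a 1 − SSres/SStot formulation is not stated in this excerpt, and the numbers in the example call are placeholders rather than the values behind Fig. 7.

```python
import math

def fit_statistics(human_means, model_means):
    """R-squared (squared Pearson correlation) and RMSE over paired condition means."""
    n = len(human_means)
    mean_h = sum(human_means) / n
    mean_m = sum(model_means) / n
    cov = sum((h - mean_h) * (m - mean_m) for h, m in zip(human_means, model_means))
    var_h = sum((h - mean_h) ** 2 for h in human_means)
    var_m = sum((m - mean_m) ** 2 for m in model_means)
    r_squared = (cov ** 2) / (var_h * var_m)
    rmse = math.sqrt(sum((h - m) ** 2 for h, m in zip(human_means, model_means)) / n)
    return r_squared, rmse

# Placeholder trial-time means (s) for four conditions, purely for illustration.
print(fit_statistics([7.0, 8.1, 10.5, 11.2], [6.5, 8.3, 9.8, 12.0]))
```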
4. General discussion

In this paper, we have presented an experiment and a model of a tracking-while-typing dual-task setup. A good feature of the task environment, in which participants need to track a cursor and type in digits, is that it translates performance on both tasks into a single performance score. This allowed us to move beyond observations that participants trade off performance in tasks, as done in classical dual-task research (Navon & Gopher, 1979; Norman & Bobrow, 1975) and, for example, in research on dual-task driving behavior (e.g., Janssen & Brumby, 2010). Following Howes et al. (2009), we were able to bracket optimum performance and to demonstrate situations where participants made performance trade-offs in an optimal manner, so as to maximize payoff. This is in line with the original objectives of work that inspired work on Performance Operating Characteristics (Navon & Gopher, 1979; Norman & Bobrow, 1975), namely work on Signal Detection Theory and Receiver Operating Characteristics (Swets et al., 1961).

The goal of this paper is not to argue that objective functions are the most prevalent aspect of performance in the real world. However, they make it possible to quantify how good performance is. This contrasts with previous work on discretionary task interleaving where verbal instructions on how to trade performance on each task are given (e.g., Brumby et al., 2009; Gopher, 1993; Horrey et al., 2006; Janssen & Brumby, 2010; Levy & Pashler, 2008), where payoff functions were used to motivate participants to perform to a certain criterion (e.g., Hornof et al., 2010), or where payoff functions were used to demonstrate that performance is sensitive to a change in payment (e.g., Wang et al., 2007). In contrast, we can define optimal performance on the combined tasks in terms of maximizing payoff.

In our task environment, participants selected strategies that had the potential to achieve the maximum payoff in three out of four conditions (see Fig. 4). In most conditions, participants optimized each of the individual performance measures (total trial time, maximum deviation of the cursor from center, and time spent outside of the target area). Their performance overlaps with the bracketed optimum performance of the model (see Figs. 5 and 6), and the measures of fit between mean human and mean model performance of the best scoring strategy alternatives are high (see Fig. 7).

In the low noise condition, participants also made the optimal trade-off on overall combined task performance, as their performance overlaps with the model region of
maximum payoff (see Fig. 5). In the high noise condition, human performance seems not to overlap with the region of best combined performance. This assessment is dependent on the cutoff point for the best region. Given our cutoff point, one explanation why performance might not lie in the optimum region in this condition is that it is hard for participants to assess how well they performed on the tracking task; this task is covered up part of the time. Participants can get a sense of the maximum deviation when they open the tracking window, but they cannot tell how long the cursor has been outside the target area. In addition, the model performance is based on average performance over 100 samples (or trials) per strategy, whereas human performance is based on performance over only five trials for each participant. As the position of the cursor in both the model and the experiment is manipulated by a noise function, it might be that the sample of noise values that the human participants experienced differs from the averaged sample that the model received (see also Fig. 7C—the variance in the human data is very high in the high noise, small target condition). This is of particular influence in the high noise condition, as the extremes are further apart (due to the higher noise value). One way of working around this problem is by having the model experience the same drift values that the participants experienced. Alternatively, the variance of performance could be taken into account in the assessment of the payoff of a strategy alternative. The model was developed with a minimal set of assumptions. This was already enough to demonstrate that people can adapt performance to an objective function in some situations. Further research can investigate how people adapt their behavior to different payoff functions, which, for instance, give greater weight to performance on one of the two tasks. Experimentally it might be investigated whether there are any strategy transfer effects between the different payoff conditions. We tried to minimize this in our analysis, by only looking at data after participants experienced a condition for 15 trials. However, approaches such as the soft constraints hypothesis predict that strategy transfer effects might still remain (Gray, Sims, Fu, & Schoelles, 2006). The model of the typing task might be refined for example to predict the effect of the different times needed to type repeating digits versus nonrepeating digits (cf. Janssen, Brumby, & Garnett, 2010). The model also does not yet give an account of typing errors, whereas typing errors do influence the payoff that is achieved by participants. More experimental data would be required to identify the nature of these errors, for example, whether they are related to speed-accuracy trade-offs (cf. Wobbrock, Cutrell, Harada, & MacKenzie, 2008). Our assumption not to include typing errors in the model is a similar assumption as was made in our dialing-while-driving models (e.g., Brumby, Salvucci, et al., 2007; Brumby et al., 2009; Janssen & Brumby, 2010), but it differs from other Cognitively Bounded Rational Analysis models (e.g., Smith et al., 2008). At a different level of analysis, the role of eye movements can be considered to explore a wider variety of strategies (cf. Hornof et al., 2010), such as strategies in which some visits to the typing task window are only spent on studying, and not typing digits. 
Finally, our model gives an account of participants’ ability to interleave the two tasks of tracking and typing in an optimal way. However, the model does not explain how
participants learn to make this optimal trade-off. In order to do this, theories of learning need to be incorporated (e.g., Erev & Gopher, 1999).
Acknowledgments

This work was supported by EPSRC grant EP/G043507/1. The paper is an extension of a paper we presented at the 10th International Conference on Cognitive Modeling (Janssen, Brumby, Dowell, & Chater, 2010). We thank Julian Marewski and two anonymous reviewers for their valuable comments on that paper. A preliminary version of this work was also presented at the ICCM Doctoral Consortium. We thank attendees for their comments, in particular John Laird for his suggestion to let the model experience the exact noise sample that the participants experienced rather than values of a noise function. We thank Richard Young for the suggestion to use colored pictures to display payoff. Finally, we thank Wayne Gray for his review of this paper.
References

Brumby, D. P., Howes, A., & Salvucci, D. D. (2007). A cognitive constraint model of dual-task trade-offs in a highly dynamic driving task. In B. Begole, S. Payne, E. Churchill, R. St. Amant, D. Gilmore, & M. B. Rosson (Eds.), Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 233–242). New York: ACM Press.
Brumby, D. P., Salvucci, D. D., & Howes, A. (2007). Dialing while driving? A bounded rational analysis of concurrent multi-task behavior. In R. L. Lewis, T. A. Polk, & J. E. Laird (Eds.), Proceedings of the 8th International Conference on Cognitive Modeling (pp. 121–126). Ann Arbor, MI: University of Michigan.
Brumby, D. P., Salvucci, D. D., & Howes, A. (2009). Focus on driving: How cognitive constraints shape the adaptation of strategy when dialing while driving. In S. Greenberg, S. E. Hudson, K. Hinckley, M. R. Morris, & D. R. Olsen Jr (Eds.), Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1629–1638). New York: ACM Press.
Erev, I., & Gopher, D. (1999). A cognitive game-theoretic analysis of attention strategies, ability, and incentives. In D. Gopher & A. Koriat (Eds.), Attention and performance XVII: Cognitive regulation of performance: Interaction of theory and application (pp. 343–371). Cambridge, MA: MIT Press.
Gopher, D. (1993). The skill of attention control: Acquisition and execution of attention strategies. In D. E. Meyer & S. Kornblum (Eds.), Attention and performance XIV: Synergies in experimental psychology, artificial intelligence, and cognitive neuroscience (pp. 299–322). Cambridge, MA: MIT Press.
Gray, W. D., Sims, C. R., Fu, W. T., & Schoelles, M. J. (2006). The soft constraints hypothesis: A rational analysis approach to resource allocation for interactive behavior. Psychological Review, 113, 461–482.
Hornof, A. J., Zhang, Y., & Halverson, T. (2010). Knowing where and when to look in a time-critical multimodal dual task. In S. E. Hudson, G. Fitzpatrick, W. K. Edwards, T. Rodden, & E. Mynatt (Eds.), Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 2103–2112). New York: ACM Press.
Horrey, W. J., Wickens, C. D., & Consalus, K. P. (2006). Modeling drivers' visual attention allocation while interacting with in-vehicle technologies. Journal of Experimental Psychology: Applied, 12, 67–78.
Howes, A., Lewis, R. L., & Vera, A. (2009). Rational adaptation under task and processing constraints: Implications for testing theories of cognition and action. Psychological Review, 116, 717–751.
Howes, A., Vera, A., Lewis, R. L., & McCurdy, M. (2004). Cognitive constraint modeling: A formal approach to supporting reasoning about behavior. In K. Forbus, D. Gentner, & T. Regier (Eds.), Proceedings of the
26th Annual Meeting of the Cognitive Science Society (pp. 595–600). Mahwah, NJ: Lawrence Erlbaum Associates.
Janssen, C. P., & Brumby, D. P. (2010). Strategic adaptation to performance objectives in a dual-task setting. Cognitive Science, 34, 1548–1560.
Janssen, C. P., Brumby, D. P., Dowell, J., & Chater, N. (2010). A cognitively bounded rational analysis model of dual-task performance trade-offs. In D. D. Salvucci & G. Gunzelmann (Eds.), Proceedings of the 10th International Conference on Cognitive Modeling (pp. 103–108). Philadelphia, PA: Drexel University.
Janssen, C. P., Brumby, D. P., & Garnett, R. (2010). Natural break points: Utilizing motor cues when multitasking. In Proceedings of the 54th Annual Meeting of the Human Factors and Ergonomics Society (pp. 482–486). San Francisco, CA: Human Factors and Ergonomics Society.
Kieras, D. E., & Meyer, D. E. (2000). The role of cognitive task analysis in the application of predictive models of human performance. In J. M. Schraagen, S. F. Chipman, & V. L. Shalin (Eds.), Cognitive task analysis (pp. 237–260). Mahwah, NJ: Erlbaum.
Kieras, D. E., Meyer, D. E., Ballas, J. A., & Lauber, E. J. (2000). Modern computational perspectives on executive mental processes and cognitive control: Where to from here? In S. Monsell & J. Driver (Eds.), Attention and performance XVIII: Control of cognitive processes (pp. 681–712). Cambridge, MA: MIT Press.
Lallement, Y., & John, B. (1998). Cognitive architecture and modeling idiom: An examination of three models of the Wickens task. In M. A. Gernsbacher & S. J. Derry (Eds.), Proceedings of the Twentieth Annual Conference of the Cognitive Science Society (pp. 597–602). Mahwah, NJ: Lawrence Erlbaum Associates.
Levy, J., & Pashler, H. (2008). Task prioritisation in multitasking during driving: Opportunity to abort a concurrent task does not insulate braking responses from dual-task slowing. Applied Cognitive Psychology, 22, 507–525.
Lewis, R., Vera, A., & Howes, A. (2004). A constraint-based approach to understanding the composition of skill. In M. Lovett, C. Schunn, C. Lebiere, & P. Munro (Eds.), Proceedings of the Sixth International Conference on Cognitive Modeling (pp. 148–153). Mahwah, NJ: Lawrence Erlbaum.
Navon, D., & Gopher, D. (1979). On the economy of the human-processing system. Psychological Review, 86, 214–255.
Norman, D. A., & Bobrow, D. G. (1975). On data-limited and resource-limited processes. Cognitive Psychology, 7, 44–64.
Payne, S. J., Duggan, G. B., & Neth, H. (2007). Discretionary task interleaving: Heuristics for time allocation in cognitive foraging. Journal of Experimental Psychology: General, 136, 370–388.
Salvucci, D. D., & Taatgen, N. A. (2008). Threaded cognition: An integrated theory of concurrent multitasking. Psychological Review, 115, 101–130.
Schumacher, E., Lauber, E., Glass, J., Zurbriggen, E., Gmeindl, L., Kieras, D. E., & Meyer, D. E. (1999). Concurrent response-selection processes in dual-task performance: Evidence for adaptive executive control of task scheduling. Journal of Experimental Psychology: Human Perception and Performance, 25, 791–814.
Smith, M. R., Lewis, R. L., Howes, A., Chu, A., & Green, C. (2008). More than 8,192 ways to skin a cat: Modeling behavior in multidimensional strategy spaces. In B. C. Love, K. McRae, & V. M. Sloutsky (Eds.), Proceedings of the 30th Annual Conference of the Cognitive Science Society (pp. 1441–1446). Austin, TX: Cognitive Science Society.
Swets, J. A., Tanner, W., & Birdsall, T. (1961). Decision processes in perception. Psychological Review, 68, 301–340.
Vera, A., Howes, A., McCurdy, M., & Lewis, R. (2004). A constraint satisfaction approach to predicting skilled interactive cognition. In E. Dykstra-Erickson & M. Tscheligi (Eds.), Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 121–128). New York: ACM Press.
Wang, D. D., Proctor, R. W., & Pick, D. F. (2007). Acquisition and transfer of attention allocation strategies in a multiple-task work environment. Human Factors, 49, 995–1004.
Wobbrock, J., Cutrell, E., Harada, S., & MacKenzie, I. (2008). An error model for pointing based on Fitts' law. In M. Burnett, M. F. Costabile, T. Catarci, B. De Ruyter, D. Tan, M. Czerwinski, & A. Lund (Eds.), Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1613–1622). New York: ACM Press.
Topics in Cognitive Science 3 (2011) 140–153 Copyright © 2011 Cognitive Science Society, Inc. All rights reserved. ISSN: 1756-8757 print / 1756-8765 online DOI: 10.1111/j.1756-8765.2010.01127.x
A Neural Model of Rule Generation in Inductive Reasoning Daniel Rasmussen, Chris Eliasmith Centre for Theoretical Neuroscience, University of Waterloo Received 10 September 2010; received in revised form 26 October 2010; accepted 1 November 2010
Abstract Inductive reasoning is a fundamental and complex aspect of human intelligence. In particular, how do subjects, given a set of particular examples, generate general descriptions of the rules governing that set? We present a biologically plausible method for accomplishing this task and implement it in a spiking neuron model. We demonstrate the success of this model by applying it to the problem domain of Raven’s Progressive Matrices, a widely used tool in the field of intelligence testing. The model is able to generate the rules necessary to correctly solve Raven’s items, as well as recreate many of the experimental effects observed in human subjects. Keywords: Inductive reasoning; Neural Engineering Framework; Raven’s Progressive Matrices; Vector Symbolic Architectures; Cognitive modeling; Rule generation; Realistic neural modeling; Fluid intelligence
1. Introduction Inductive reasoning is the process of using a set of examples to infer a general rule that both describes the relationships shared by those examples and allows us to predict future items in the set. For example, if a person were watching objects in a river and saw a stick, a rowboat, and a fencepost float past, he or she might induce the rule that ‘‘wooden things float.’’ This rule both describes the relationship which linked those items (being wooden) and allows the person to predict future items which would also float (a wooden bookcase). Given even more examples—some non-wooden floating objects—he or she might infer the general rule that objects float when they displace a volume of water equal to their weight. Correspondence should be sent to Daniel Rasmussen, Centre for Theoretical Neuroscience, University of Waterloo, Waterloo, ON, Canada N2J 3G1. E-mail:
[email protected]
This type of reasoning is fundamental to our ability to make sense of the world, and it represents a key facet of human intelligence. It underlies our ability to be presented with a novel situation or problem and extract meaning from it. As such, it is a process that has been made central to many tests of general intelligence. One of the most widely used and well-respected tools in this field is the Raven’s Progressive Matrices (RPM) test (Raven, 1962). In the RPM, subjects are presented with a 3 · 3 matrix, in which each cell in the matrix contains various geometrical figures with the exception of the final cell, which is blank (Fig. 1). The subject’s task is to determine which one of eight possible answers belongs in the blank cell. They accomplish this by examining the other rows and columns and inducing rules that govern the features in those cells. They can then apply those rules to the last row/column to determine which answer belongs in the blank cell. Although there has been much experimental and theoretical effort put into understanding the mental processes involved in performing RPM-like tasks, to our knowledge there have been no cognitive models of the inductive process of rule generation. In this article, we present a method of rule generation and implement it in a neural model using simulated spiking neurons. This model can induce the rules necessary to solve Raven’s matrices and also displays many of the most interesting cognitive effects observed in humans: improved accuracy in rule generation over multiple trials, variable performance in repeated trials, and both quantitative and qualitative changes in individual performance.
2. Background 2.1. Raven’s Progressive Matrices There are several variations of the RPM; the Standard and Colored versions are generally used to test children or lower performing adults, whereas the Advanced is used to differentiate average/above-average subjects. In our work, we focus on the Advanced version.
Fig. 1. A simple Raven’s-style matrix.
Fig. 1 depicts an example of a simple Raven's-style matrix.1 The matrix is shown at the top with one blank cell, and the eight possible answers for that blank cell are given below. In order to solve this matrix, the subject needs to generate three rules: (a) the number of triangles increases by one across the row, (b) the orientation of the triangles is constant across the row, (c) each cell in a row contains one background shape from the set {circle, square, diamond}. Subjects can then determine which element belongs in the blank cell by applying the rules to the third row (i.e., there should be 2 + 1 = 3 triangles, they should be pointing towards the left, and the background shape should be a circle, since square and diamond are already taken). Once they have generated their hypothesis as to what the blank cell should look like, they can check for a match among the eight possible answers. Not all subjects will explicitly generate these exact rules, and their route to the answer may be more roundabout, but they do need to extract equivalent information if they are to correctly solve the problem.

Despite the test's broad use, there have been few computational models of this task. The model of Carpenter, Just, and Shell (1990) accurately recreates high-level human data (e.g., error rates), but it does not reflect the flexibility and variability of individual human performance nor take into account neurologic data. In addition, Carpenter et al.'s model has no ability to generate new rules; the rules are all specified beforehand by the modelers. This limitation of their model reflects a general lack of explanation in the literature as to how this inductive process is performed. More recently, models have been developed by Lovett, Forbus, and Usher (2010) and McGreggor, Kunda, and Goel (2010). The latter employs interesting new techniques based on image processing, but it is not intended to closely reflect human reasoning and is limited to RPM problems that can be solved using visual transformations. The Lovett et al. (2010) model takes an approach more similar to our own and has the advantage of more automated visual processing, but like the Carpenter et al. model it is targeted only at high-level human data and relies on applying rules defined by the modelers.

Previous assumptions regarding the origin of subjects' rules in the RPM are that people are either (a) born with, or (b) learn earlier in life, a library of rules. During the RPM, these preexisting rules are then applied to the current inductive problem. Hunt described this theory as early as 1973 and also pointed out the necessary conclusion of this explanation: If RPM performance is dependent on a library of known rules, then the RPM is testing our crystallized intelligence (our ability to acquire and use knowledge or experience) rather than fluid intelligence (our novel problem-solving ability). In other words, the RPM would be a similar task to acquiring a large vocabulary and using it to communicate well. However, this is in direct contradiction to the experimental evidence, which shows the RPM strongly and consistently correlating with other measures of fluid intelligence (Marshalek, Lohman, & Snow, 1983), and psychometric/neuroimaging practice, which uses the RPM as an index of subjects' fluid reasoning ability (Gray, Chabris, & Braver, 2003; Perfetti et al., 2009; Prabhakaran, Smith, Desmond, Glover, & Gabrieli, 1997).
A large amount of work has been informed by the assumption that the RPM measures fluid intelligence, yet the problem raised by Hunt has been largely ignored. Consequently, there is a need for a better explanation of rule induction; by providing a technique to dynamically generate rules, we remove the dependence on a past library and thereby resolve the problem.
In contrast to the paucity of theoretical results, there has been an abundance of experimental work on the RPM. This has brought to light a number of important aspects of human performance on the test that need to be accounted for by any potential model. First, there are a number of learning effects: Subjects improve with practice if given the RPM multiple times (Bors, 2003) and also show learning within the span of a single test (Verguts & De Boeck, 2002). Second, there are both qualitative and quantitative differences in individuals' ability; they exhibit the expected variability in ''processing power'' (variously attributed to working memory, attention, learning ability, or executive functions) and also consistent differences in high-level problem-solving strategy between low-scoring and high-scoring individuals (Vigneau, Caissie, & Bors, 2006). Third, a given subject's performance is far from deterministic; given the same test multiple times, subjects will get previously correct answers wrong and vice versa (Bors, 2003). This is not an exhaustive list, but it represents some of the features that best define human performance. In the Results section, we demonstrate how each of these observations is accounted for by our model.

2.2. Vector encoding

In order to represent a Raven's matrix in neurons and work on it computationally, we need to translate the visual information into a symbolic form. Vector Symbolic Architectures (VSAs; Gayler, 2003) are one set of proposals for how to construct such representations. VSAs represent information as vectors and implement mathematical operations to combine those vectors in meaningful ways. To implement a VSA, it is necessary to define a binding operation (which ties two vectors together) and a superposition operation (which combines vectors into a set). We use circular convolution for binding and vector addition for superposition (Plate, 2003). Circular convolution is defined as

$$C = A \otimes B, \quad \text{where } c_j = \sum_{k=0}^{n-1} a_k\, b_{(j-k) \bmod n}. \qquad (1)$$

Along with this, we employ the idea of a transformation vector T between two vectors A and B, defined as

$$A \otimes T = B \quad \text{or} \quad T = A' \otimes B, \qquad (2)$$

where A′ denotes the approximate inverse of A. With these elements, we can create a vector representation of the information in any Raven's matrix. The first step is to define a vocabulary, the elemental vectors that will be used as building blocks. For example, we might use the vector [0.1, −0.35, 0.17, …] as the representation for circle. These vectors are randomly generated, and the number of vectors that can be held in a vocabulary and still be distinguishable as unique ''words'' is determined by the dimensionality of those vectors (the more words in the vocabulary, the higher the dimension of the vectors needed to represent them).

Once the vocabulary has been generated it is possible to encode the structural information in a cell. A simple method to do this is by using a set of attribute ⊗ value pairs: shape ⊗ circle + number ⊗ three + color ⊗ black + orientation ⊗ horizontal + shading ⊗ solid, and so on, allowing us to encode arbitrary amounts of information. As descriptions become more detailed it is necessary to use more complex encoding; however, ultimately it does not matter to the inductive system how the VSA descriptions are implemented, as long as they encode the necessary information. Thus, these descriptions can be made as simple or as complex as desired without impacting the overall model.

VSAs have a number of other advantages: They require fewer neural resources to represent than explicit image data, they are easier to manipulate mathematically, and perhaps most importantly the logical operation of the inductive system is not dependent on the details of the visual system. All that our neural model requires is that the Raven's matrices are represented in some structured vector form; the visual processing that accomplishes this, although a very difficult and interesting problem in itself (see Meo, Roberts, & Marucci, 2007 for an example of the complexities involved), is beyond the scope of the current model. This helps preserve the generality of the inductive system: The techniques presented here will apply to any problem that can be represented in VSAs, not only problems sharing the visual structure of the RPM.
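To make these operations concrete, the following is a minimal NumPy sketch of circular-convolution binding, the approximate inverse, and the attribute ⊗ value encoding of a single cell. It is illustrative only: the dimensionality, vocabulary, and random seed are arbitrary choices, not parameters of the model described here.

```python
import numpy as np

def cconv(a, b):
    """Circular convolution: the binding operation of Eq. 1, computed via FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def approx_inverse(a):
    """Approximate inverse (A' in Eq. 2): reverse all elements but the first."""
    return np.concatenate(([a[0]], a[1:][::-1]))

D = 512                                   # illustrative dimensionality
rng = np.random.default_rng(0)

def make_vector():
    """Random unit vector serving as an elemental vocabulary item."""
    v = rng.normal(0.0, 1.0 / np.sqrt(D), D)
    return v / np.linalg.norm(v)

vocab = {w: make_vector() for w in
         ["shape", "circle", "number", "one", "two", "orientation", "horizontal"]}

# Encode one cell as a superposition of bound attribute-value pairs:
# shape (x) circle + number (x) two + orientation (x) horizontal
cell = (cconv(vocab["shape"], vocab["circle"])
        + cconv(vocab["number"], vocab["two"])
        + cconv(vocab["orientation"], vocab["horizontal"]))

# Unbinding with the approximate inverse of "number" yields a noisy "two",
# which can be identified by comparing against the vocabulary.
noisy_value = cconv(cell, approx_inverse(vocab["number"]))
print(max(vocab, key=lambda w: np.dot(noisy_value, vocab[w])))   # -> "two"
```

Note that unbinding with the approximate inverse returns only a noisy version of the stored value, which is one reason the model benefits from the cleanup memory described in Section 3.2.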
2.3. Neural encoding

Having described a method to represent the high-level problem in structured vectors, we now define how to represent those vectors and carry out the VSA operations in networks of simulated spiking neurons. There are several important reasons to consider a neural model. First, by tying the model to the biology, we are better able to relate the results of the model to the experimental human data, both at the low level (e.g., fMRI or PET) and at the high level (e.g., nondeterministic performance and individual differences). Second, our goal is to model human inductive processes, so it is essential to determine whether a proposed solution can be realized in a neural implementation. Neuroscience has provided us with an abundance of data from the neural level that we can use to provide constraints on the system. This ensures that the end result is indeed a model of the human inductive system, not a theoretical construct with infinite capacity or power.

We use the techniques of the Neural Engineering Framework (Eliasmith & Anderson, 2003) to represent vectors and carry out the necessary mathematical operations in spiking neurons. Refer to Fig. 2 throughout this discussion for a visual depiction of the various operations.

Fig. 2. A demonstration of the elements of the Neural Engineering Framework. (a) to (b) encoding from an input signal to spiking activity in a neural population (Eq. 3). (b) to (c) transforming, in this case doubling, the represented value (Eq. 5). (c) to (d) decoding the spiking activity back into a value (Eq. 4). (Note: (b) and (c) are spike rasters; each row displays one neuron in the population, and each dot in the row represents a spike from that neuron.)

To encode a vector x into the spike train of neuron a_i we define

$$a_i(x) = G_i\left[\alpha_i\, \tilde{\phi}_i \cdot x + J_i^{bias}\right]. \qquad (3)$$
Here G_i is a function representing the nonlinear neuron characteristics. It takes a current as input (the value within the brackets) and uses a model of neuron behavior to output spikes. In our model we use Leaky Integrate and Fire neurons, but the advantage of this formulation is that any neuron model can be substituted for G_i without changing the overall framework. α_i, J_i^{bias}, and φ̃_i are the parameters of neuron a_i. α_i is a gain on the input; it does not directly play a role in the encoding of information, but rather is used to provide variety in the firing characteristics of the neurons within a population. J_i^{bias} is a constant current arising from intrinsic processes of the cell or background activity in the rest of the nervous system; it plays a similar role to α_i, providing variability in firing characteristics. φ̃_i represents the neuron's preferred stimulus, that is, which inputs will make it fire more strongly. This is the most important factor in the neuron's firing, as it is what truly differentiates how a neuron will respond to a given input. In summary, the activity of neuron a_i is a result of its unique response (determined by its preferred stimulus) to the input x, passed through a nonlinear neuron model in order to generate spikes.

We can then define the decoding from spike train to vector as

$$\hat{x} = \sum_i h * a_i(x)\, \phi_i, \qquad (4)$$

where * denotes standard (not circular) convolution. This is modeling the current that will be induced in the postsynaptic cell by the spikes coming out of a_i. a_i(x) are the spikes generated in Eq. 3. h is a model of the postsynaptic current generated by each spike; by convolving that with a_i(x), we get the total current generated by the spikes from a_i. φ_i are the optimal linear decoders, which are calculated analytically so as to provide the best linear representation of the original input x; they are essentially a weight on the postsynaptic current generated by each neuron.

We have defined how to transform a vector into neural activity and how to turn that neural activity back into a vector, but we also need to be able to carry out the VSA operations (binding and superposition) on those representations.
One of the primary advantages of the NEF is that we can calculate the synaptic weights for arbitrary transformations analytically, rather than learning them. If we want to calculate a transformation of the form z = C_1 x + C_2 y (C_1 and C_2 are any matrix), and x and y are represented in the a and b neural populations (we can add or remove these terms as necessary to perform operations on different numbers of variables), respectively, then we describe the activity in the output population as

$$c_k(C_1 x + C_2 y) = G_k\left[\sum_i \omega_{ki}\, a_i(x) + \sum_j \omega_{kj}\, b_j(y) + J_k^{bias}\right], \qquad (5)$$

where c_k, a_i, and b_j describe the activity of the kth, ith, and jth neuron in their respective populations. The ω are our synaptic weights: $\omega_{ki} = \alpha_k\, \tilde{\phi}_k\, C_1 \phi_i^x$ and $\omega_{kj} = \alpha_k\, \tilde{\phi}_k\, C_2 \phi_j^y$. Referring back to our descriptions of the variables in Eqs. 3 and 4, this means that the connection weight between neuron a_i and c_k is determined by the preferred stimulus of c_k, multiplied by the desired transformation and the decoders for a_i. To calculate different transformations, all we need to do is modify the C matrices in the weight calculations, allowing us to carry out all the linear computations necessary in this model. For a more detailed description of this process, and a demonstration of implementing the nonlinear circular convolution (Eq. 1), see Eliasmith (2005).
3. The model and results

3.1. Rule generation

The key to our model is the idea of the transformation vector (Eq. 2). As we have our Raven's matrix items encoded as vectors, we can represent rules as transformations on those vectors. For example, if A is the vector representation of one square, and B is the vector representation of two squares, then the transformation vector T = A′ ⊗ B will be analogous to the rule ''number of squares increases by one.'' However, we do not just want to calculate individual transformations, we want general rules for the whole matrix. To accomplish this, we treat all adjacent pairs of cells as a set of A and B vectors and extract a general transformation from that set of examples. Neumann (2001) has shown that we can accomplish this by calculating

$$T = \frac{1}{n}\sum_{i=0}^{n} A_i' \otimes B_i.$$

In order to perform this operation in neurons (where we cannot instantly sum over a set of examples), we translate it into the equivalent learning rule, where each pair of A and B vectors is presented sequentially:

$$T_{i+1} = T_i - w_i\left(T_i - A_i' \otimes B_i\right).$$
Fig. 3. A simple Raven’s-style matrix.
In other words, we calculate an individual transformation for the given pair of cells, and then use the difference between that value and the overall transformation to update the overall transformation for the next pair of cells. We implement this by combining a neural integrator (to maintain the overall value of T) with a network that calculates the transformation for the current pair of examples. We present the examples in a top-down row-wise fashion, as that is the general scanning strategy employed by humans as revealed by eye-tracking studies (Carpenter et al., 1990; Vigneau et al., 2006).

Let us take Fig. 3 as an example and examine how the model induces one of the rules necessary to solve the matrix: ''number of objects increases by one.''2 A_0 is the vector representation of one circle, and B_0 is the vector representation of two circles. The network calculates T_1 = A_0′ ⊗ B_0, which is something like the rule ''number of circles increases by one,'' and that value is stored in the neural integrator. In the next step A_1 is two circles and B_1 is three circles, and the transformation (A_1′ ⊗ B_1) is again ''number of circles increases by one.'' However, in the next step, A_2 is one square, B_2 is two squares, and the transformation is ''number of squares increases by one.'' When this new transformation is added to the neural integrator, ''number of objects increases by one'' is reinforced (as it is present in all three rules), whereas the specific information (shape) is not. This process continues with the next two rows. Thus, we begin with a specific rule, but over time relations that are particular to individual A and B pairs are drowned out by the relation which all the pairs have in common: ''number of objects increases by one.''3

Once this process is complete we have the overall T vector, representing a general rule for the problem. Thus, we have accomplished our primary goal: to provide an explanation as to how subjects can inductively generate the rules governing a set of examples. We use these rules (T) by applying them to the second-last cell of the Raven's matrix (A) giving us A ⊗ T = B, where B is a vector representing what our rules tell us should be in the blank cell.

Fig. 4. Schematic diagram of the rule generation section with cleanup memory, displaying the approximate number of neurons used in each submodule. The inputs (A_i and B_i) represent two adjacent cells in the matrix. The ''Input Inverse'' module calculates A_i′, whereas ''Input'' simply leaves B_i unchanged. The ''Circular Convolution'' module calculates A_i′ ⊗ B_i (the rule for that particular pair of cells). ''Integrator'' is storing the calculated rule so far (based on previous pairs of adjacent cells), which is combined with the current calculation. The output of ''Integrator'' is the overall rule, which is passed through a cleanup memory, potentially resulting in a less noisy version of that rule. Finally, ''Solution Generator'' generates a prediction of what should be in the blank cell by convolving the second-last cell with the calculated rule, and then ''Solution Checker'' calculates the similarity between that hypothesis and each of the eight possible answers given in the problem.
We then compare this hypothesis to the eight possible answers and take the most similar (determined by the dot product between the two vectors) as our final answer (see Fig. 4).

The key to induction is that the rules that are generated apply beyond the examples from which they were learned. When examining objects floating in a river, a rule that said ''sticks, rowboats, and fenceposts float in rivers'' would not be very interesting; an inductive rule (e.g., ''wooden things float'') is useful because it applies to new situations—it tells us that a log should float in a lake, although we did not see any logs or lakes when coming up with the rule. We can demonstrate the generality of the rules induced by this system in the same way, by applying them in novel circumstances. The process described above results in a transformation vector T (the rule). Instead of convolving this vector with the second-last cell of the matrix in order to generate a prediction for the blank cell, we can convolve the rule with different vectors and see which answer the system predicts as the next in the sequence. Fig. 5 shows the result of taking the rule generated from the matrix in Fig. 3 (which, when employed in the standard way, predicts the correct answer of three triangles) and applying it instead to the vector for four squares in order to generate a prediction for the blank cell. Note that the vectors for four squares and five squares were not in the examples from which the rule was learned, and yet it still predicts the correct answer. This demonstrates that the rule is not specific to shape (e.g., ''number of triangles increases by one'') or number (e.g., ''two objects becomes three objects''). The system has correctly extracted a general rule (''number of objects increases by one'') based only on the specific information contained in the matrix.
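The induction and solution steps described in this section can be summarized, at the level of the vector operations rather than the spiking implementation, by the following sketch. The helper names and the running-mean weight schedule (w_i = 1/(i+1)) are illustrative assumptions; the actual model performs these computations with the neural integrator and convolution networks shown in Fig. 4.

```python
import numpy as np

def cconv(a, b):
    """Circular convolution (the binding operation, Eq. 1)."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def approx_inverse(a):
    """Approximate inverse used for unbinding (A' in Eq. 2)."""
    return np.concatenate(([a[0]], a[1:][::-1]))

def induce_rule(cell_pairs):
    """Combine pairwise transformations into one general rule T.

    Implements T_{i+1} = T_i - w_i * (T_i - A_i' (x) B_i); with w_i = 1/(i+1)
    this is a running mean of the pairwise rules, standing in for the
    neural integrator.
    """
    T = np.zeros_like(cell_pairs[0][0])
    for i, (A, B) in enumerate(cell_pairs):
        pair_rule = cconv(approx_inverse(A), B)
        T = T - (1.0 / (i + 1)) * (T - pair_rule)
    return T

def pick_answer(T, second_last_cell, answer_vectors):
    """Apply the rule to the second-last cell and choose the most similar answer."""
    prediction = cconv(second_last_cell, T)
    similarities = [np.dot(prediction, ans) for ans in answer_vectors]
    return int(np.argmax(similarities))

# Example usage (assuming `cells` is a 3x3 list of cell vectors with the last
# cell unknown, and `answers` the eight candidate vectors): adjacent pairs are
# read row-wise, as in the model's scanning strategy.
#   pairs = [(cells[r][c], cells[r][c + 1])
#            for r in range(3) for c in range(2) if not (r == 2 and c == 1)]
#   T = induce_rule(pairs)
#   choice = pick_answer(T, cells[2][1], answers)
```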
Fig. 5. Result of taking the rule generated from Fig. 3, convolving it with the vector for four squares, and comparing the resulting vector to the eight answers in the matrix. The most similar answer (i.e., the system's prediction of which item comes next in the sequence after four squares) is number eight (five squares).

3.2. Cleanup memory

In addition to being able to generate the rules to solve a matrix, the model should improve at this process given practice. We accomplish this by adding a cleanup memory, a system which stores certain values and, when given a noisy version of those values as input, outputs the clean version stored in memory. A cleanup memory can be implemented in neurons by creating a network that contains neural populations tuned to respond only to certain inputs and output the clean version of those values (Stewart, Tang, & Eliasmith, 2009).

We incorporate a cleanup memory in this model by storing the past rules the system has induced. The current rule generated by the network, which will be perturbed by neural noise and the details of the particular Raven's matrix, is passed through this cleanup memory, and if the cleanup memory contains a similar rule, then that clean version of the rule is output (see Fig. 4). The cleanup memory is improved over time by two mechanisms. First, if the cleanup memory receives an input that it does not recognize, it adds that input to its memory so that it will be recognized in the future. Second, if the cleanup memory receives an input that it does recognize, it uses that input to refine the value stored in memory, so that the stored value becomes increasingly accurate. Thus, as the system encounters rules it has calculated before, it will be able to draw on its past efforts to provide more accurate output. See Fig. 6 for a demonstration of how this improvement in cleanup memory can lead to improved inductive performance.
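Algorithmically, the two mechanisms just described amount to something like the following sketch, in which a similarity threshold stands in for the tuned neural populations of Stewart et al. (2009); the threshold value and the averaging scheme are illustrative assumptions, not parameters reported for the model.

```python
import numpy as np

class CleanupMemory:
    """Stores previously induced rules; cleans up noisy inputs that it recognizes."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold   # illustrative recognition threshold
        self.rules = []              # stored (clean) rule vectors
        self.counts = []             # how many times each rule has been seen

    def cleanup(self, noisy_rule):
        noisy_rule = noisy_rule / np.linalg.norm(noisy_rule)
        if self.rules:
            sims = [np.dot(noisy_rule, r) for r in self.rules]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                # Recognized: refine the stored rule with the new observation
                # and return the (cleaner) stored version.
                n = self.counts[best]
                updated = (n * self.rules[best] + noisy_rule) / (n + 1)
                self.rules[best] = updated / np.linalg.norm(updated)
                self.counts[best] = n + 1
                return self.rules[best]
        # Not recognized: store the input so it can be recognized in the future.
        self.rules.append(noisy_rule)
        self.counts.append(1)
        return noisy_rule
```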
Fig. 6. An example of the model's ability to learn over time. The model was presented with a series of matrices that appeared different but required the same underlying rules to solve; as we can see, the model is able to more quickly and definitively pick out the correct answer on later matrices (the eight lines in each graph represent the system's confidence in the eight possible answers).

The cleanup memory is useful in that it improves the accuracy of the system and accounts for observed learning effects, but it also serves an important theoretical purpose: It bridges the gap between this model of dynamic rule generation and previous theories of a library of known rules. Rather than contradicting previous theories, we are improving on them by explaining where that past knowledge comes from. We now have an explanation as to why the use of that knowledge is a dynamic, fluid process rather than crystallized. The important aspect of a cleanup memory is that it depends upon its input. The cleanup memory can be used to improve the accuracy of rules, but it cannot generate them on demand; the subject needs to first generate an answer that is accurate enough to be recognized. Thus, subjects can still benefit from their past knowledge of rules, but the critical aspect of performance will be their fluid rule generation ability. This resolves Hunt's dilemma and means that the reliance of previous models on a library of known rules does not render their insights useless; instead, they can simply be reinterpreted from this new, biologically plausible perspective.

3.3. Higher level processes

In addition to the inductive process of rule generation, there are high-level problem-solving effects (what we might call the subject's ''strategy'') that will have a significant impact on performance. For example, how does the subject decide when and where to apply the rule generation system? When there are multiple rules to be found, how does the subject differentiate them, and how does the subject decide he or she has found all the rules? How does the subject decide whether his or her hypothesis is good enough to settle on as a final answer? In summary, what does the subject do with the rules once they have been generated? These are important questions, but they are dependent on the particular problem the subject is solving. We have implemented such a strategy system for the RPM (although not at the neural level) in order to collect aggregate test results and explore individual differences.4

Fig. 7 shows an example of these results, demonstrating the model's ability to recreate differences caused by both low-level neural processing power and high-level strategy. The low-level variable is the dimensionality of the vectors, higher dimension vectors requiring more neurons to represent. The high-level variable is how willing the model is to decide it has found a correct rule: The lower line represents a subject who has less stringent standards and is willing to accept rules that may not be completely correct, whereas the top line represents a subject employing a more conservative strategy. These variables parallel the common types of explanations for individual differences in human subjects: on one hand, neurophysiologic differences such as gray matter density (Gray et al., 2003), and on the other hand, cognitive, strategic differences (Vigneau et al., 2006). These results demonstrate how the model can be used to investigate how these factors interact to give rise to the full spectrum of individual differences.

Fig. 7. A demonstration of both low-level (vector dimension/neuron number) and high-level (strategy) influences on accuracy.

Fig. 7 also reveals that although the overall performance trends are clear there is significant variability (average r = 0.13) in any given trial. In other words, the same model run repeatedly will get different problems right or wrong, but on average its performance will be stable. Unlike previous, deterministic models, this is an accurate reflection of the observed patterns of performance in human subjects (Bors, 2003). There are many such interesting avenues of exploration, but we will not go into the details of the strategy system here; the primary contribution of this research is the general rule-induction system described above, which is not dependent on the higher level framework within which it is used.
4. Conclusion We have presented a novel, neurally based model of inductive rule generation, and we have applied this system to the particular problem of Raven’s Progressive Matrices. The success of the system is demonstrated in its ability to correctly find general rules that enable it to solve these matrices, as well as in the model’s ability to recreate the interesting effects observed in human subjects, such as learning over time, nondeterministic performance, and both quantitative and qualitative variability in individual differences. These results demonstrate the potential for gaining a deeper understanding of human induction by adopting a neurally plausible approach to modeling cognitive systems.
Notes

1. For copyright reasons we have created a modified matrix to present here; the model works with the true Raven's matrices.
2. Note that Fig. 3 differs from Fig. 1 primarily in that the rule involving items chosen from a set has been removed. The model can generate these kinds of rules, but we have not included that description in the current discussion for purposes of brevity.

3. This same process will help eliminate the noise added at the neural level.

4. The strategy system has three main responsibilities: automating the input to the neural modules, evaluating the success of the rules returned by the neural modules, and selecting an overall response when the neural modules find multiple rules (as in Fig. 1). In the current system these are simply programmed solutions, but the model presented in Stewart, Choo, and Eliasmith (2010) is an interesting demonstration of how such processes could be implemented in a realistic neural model.
Acknowledgments This work was supported by the Natural Sciences and Engineering Research Council of Canada, CFI/OIT, Canada Research Chairs, and the Ontario Ministry of Training, Colleges, and Universities. References Bors, D. (2003). The effect of practice on Raven’s Advanced Progressive Matrices. Learning and Individual Differences, 13(4), 291–312. Carpenter, P., Just, M., & Shell, P. (1990). What one intelligence test measures: A theoretical account of the processing in the Raven Progressive Matrices Test. Psychological Review, 97(3), 404–431. Eliasmith, C. (2005). A unified approach to building and controlling spiking attractor networks. Neural Computation, 17(6), 1276–1314. Eliasmith, C., & Anderson, C. (2003). Neural engineering: Computation, representation, and dynamics in neurobiological systems. Cambridge, MA: MIT Press. Gayler, R. (2003). Vector Symbolic Architectures answer Jackendoff’s challenges for cognitive neuroscience. In P. Slezak (Ed.), ICCS/ASCS international conference on Cognitive Science (pp. 133–138). Sydney: University of New South Wales. Gray, J. R., Chabris, C. F., & Braver, T. S. (2003). Neural mechanisms of general fluid intelligence. Nature Neuroscience, 6(3), 316–322. Hunt, E. (1973). Quote the Raven? Nevermore! In L. Gregg (Ed.), Knowledge and cognition (pp. 129–157). Potomac, NJ: Lawrence Erlbaum Associates. Lovett, A., Forbus, K., & Usher, J. (2010). A structure-mapping model of Raven’s Progressive Matrices. In S. Ohlsson & R. Catrambone (Eds.), Proceedings of the 32nd annual conference of the Cognitive Science Society (pp. 2761–2766). Austin, TX: Cognitive Science Society. Marshalek, B., Lohman, D., & Snow, R. (1983). The complexity continuum in the radex and hierarchical models of intelligence. Intelligence, 7(2), 107–127. McGreggor, K., Kunda, M., & Goel, A. (2010). A fractal analogy approach to the Raven’s test of intelligence. In AAAI workshops at the 24th AAAI conference on Artificial Intelligence (pp. 69–75). Atlanta: Association for the Advancement of Artificial Intelligence. Meo, M., Roberts, M., & Marucci, F. (2007). Element salience as a predictor of item difficulty for Raven’s Progressive Matrices. Intelligence, 35(4), 359–368.
Neumann, J. (2001). Holistic processing of hierarchical structures in connectionist networks, Unpublished doctoral thesis, University of Edinburgh. Perfetti, B., Saggino, A., Ferretti, A., Caulo, M., Romani, G. L., & Onofrj, M. (2009). Differential patterns of cortical activation as a function of fluid reasoning complexity. Human Brain Mapping, 30(2), 497–510. Plate, T. (2003). Holographic reduced representations. Stanford, CA: CLSI Publications. Prabhakaran, V., Smith, J., Desmond, J., Glover, G., & Gabrieli, J. D. E. (1997). Neural substrates of fluid reasoning: An fMRI study of neocortical activation during performance of the Raven’s Progressive Matrices Test. Cognitive Psychology, 33, 43–63. Raven, J. (1962). Advanced progressive matrices (Sets I and II). London: Lewis. Stewart, T. C., Tang, Y., & Eliasmith, C. (2009). A biologically realistic cleanup memory: Autoassociation in spiking neurons. In A. Howes, D. Peebles, & R. Cooper (Eds.), 9th international conference on cognitive modelling (pp. 128–133). Manchester, England: ICCM2009. Stewart, T. C., Choo, X., & Eliasmith, C. (2010). Dynamic behaviour of a spiking model of action selection in the basal ganglia. In S. Ohlsson & R. Catrambone (Eds.), Proceedings of the 32nd annual conference of the Cognitive Science Society (pp. 235–240). Austin: Cognitive Science Society. Verguts, T., & De Boeck, P. (2002). The induction of solution rules in Ravens Progressive Matrices Test. European Journal of Cognitive Psychology, 14, 521–547. Vigneau, F., Caissie, A., & Bors, D. (2006). Eye-movement analysis demonstrates strategic influences on intelligence. Intelligence, 34(3), 261–272.
Topics in Cognitive Science 3 (2011) 154–165 Copyright © 2011 Cognitive Science Society, Inc. All rights reserved. ISSN: 1756-8757 print / 1756-8765 online DOI: 10.1111/j.1756-8765.2010.01128.x
The Evolution of a Goal-Directed Exploration Model: Effects of Information Scent and GoBack Utility on Successful Exploration Leonghwee Teo, Bonnie E. John Human-Computer Interaction Institute, Carnegie Mellon University Received 4 October 2010; received in revised form 26 October 2010; accepted 1 November 2010
Abstract We explore the match of a computational information foraging model to participant data on multipage web search tasks and find its correlation on several important metrics to be too low to be used with confidence in the evaluation of user-interface designs. We examine the points of mismatch to inspire changes to the model in how it calculates information scent scores and how it assesses the utility of backing up from a lower-level page to a higher-level page. The outcome is a new model that qualitatively matches participant behavior better than the original model, has utility equations more appealing to ‘‘common sense’’ than the original equations, and significantly improves the correlation between model and participant data on our metrics. Keywords: ACT-R; CogTool-Explorer; Computational model; Human–computer interaction; Information foraging
1. Introduction Predicting human performance to aid in the design of interactive systems is an important practical use of computational cognitive modeling. Models like SNIF-ACT 2.0 (Fu & Pirolli, 2007) and AutoCWW (Blackmon, Kitajima, & Polson, 2005) focus on predicting user exploration of websites. These models use the common concepts of label-following and information scent (infoscent). That is, they posit that the user’s choice is partly determined by the semantic similarity between the user’s goal and the options presented in the user-interface (UI). Budiu and Pirolli (2007) and Teo and John (2008) began to consider the 2D spatial layout of the UI when predicting exploration behavior. Budiu and Pirolli (2007) Correspondence should be sent to Leonghwee Teo, Human-Computer Interaction Institute, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213. E-mail:
[email protected]
reported a correlation between data and model of R2 = .56 for the number of clicks to success and R2 = .59 for search times in a degree-of-interest (DOI) tree. Teo and John (2008) did not report correlations, but their model successfully predicted the effect of target position in 22 search tasks in a two-column format. This paper furthers this work by considering a multipage layout of links in a website where previous information is hidden as exploration progresses. We first describe our metrics and why they are important. We then present the tasks and the operation of a baseline model. After presenting the quantitative performance of the baseline model, we delve into some details of the model’s performance to find inspiration as to how to improve the model. Finally, we present the best model found to date and discuss directions for future work.
2. The metrics Ultimately, a UI designer would want a model to predict the range of human behavior that would be observed in the real world when using the interactive system, on metrics such as number of errors and where they occur, performance time, learning time and what was learned, effects of fatigue, environmental factors, or emotion on performance, and even levels of satisfaction or joy when using the system. No computational model is up to that task at this writing, and more modest metrics are used in current work. For SNIF-ACT 2.0, Fu and Pirolli (2007) reported the correlation between model and participants on number of clicks on each link (R2 = .69 and .91 for two different websites), the correlation for number of go-back actions for all tasks (R2 = .73 and .80), and a table of percent of model runs that succeeded on each task juxtaposed with the percent of participants who succeeded on each task (R2 = .98 and .94, calculated from Fu & Pirolli, 2007, figure 13). The first two metrics were for models run under the model-tracing paradigm; that is, at each step the model was allowed to choose its action but was reset to the participant’s action if it did not choose what the participant chose; the last metric was for free-running models. For their free-running model, DOI-ACT, Budiu and Pirolli (2007) did not report percent success because their experiment participants completed all tasks (and the model could run to success on all but 2 of the 16 tasks), but instead reported the correlation between the model and participants for number of clicks to accomplish each task (R2 = .56) and total time for each task (R2 = .59). We will report similar metrics that are both indicative of model goodness-of-fit and important to UI designers. 1. Correlation between model and participants on the percent of trials succeeding on each task (R2%Success). Percent success is common in user testing to inform UI designers about how successful their users will be with their design, so a high correlation between model and data will allow modeling to provide similar information. 2. Correlation between model and participants on the number of clicks on links to accomplish each task (R2ClicksToSuccess). We eliminated unsuccessful trials because some
participants would click two or three links and then do nothing until time ran out, whereas others continued to click (as did the model). AutoCWW (Blackmon et al., 2005) also uses this metric. 3. Correlation between model and participants on the percent of trials succeeding without error on each trial (R2%ErrorFreeSuccess). This measure indicates the model’s power to predict which tasks need no improvement and therefore no further design effort.
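Each of these metrics can be computed as a squared Pearson correlation over per-task summaries of the model runs and the participant trials. A minimal sketch (with placeholder array names) is:

```python
import numpy as np

def r_squared(model_per_task, data_per_task):
    """Squared Pearson correlation between per-task model and participant values."""
    r = np.corrcoef(model_per_task, data_per_task)[0, 1]
    return r ** 2

# Placeholder per-task arrays (one value per task):
#   r2_success    = r_squared(model_pct_success, data_pct_success)
#   r2_clicks     = r_squared(model_clicks_to_success, data_clicks_to_success)
#                   # computed over successful trials only
#   r2_error_free = r_squared(model_pct_error_free, data_pct_error_free)
```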
3. The tasks To test and improve our model, we chose a multipage layout used in AutoCWW experiments (Toldy, 2009, experiment 1), shown in Fig. 1; Dr. Marilyn Blackmon generously provided the participant log files from 36 exploration tasks performed on this layout. The participants were given a search goal (at the top of each page) and had 130 s to complete each task. There were 44–46 valid participant trials recorded for each task.
Fig. 1. Multipage layout from Toldy (2009). Participants start in the top-level page (leftmost) and on selecting a link, transit to 2nd-level pages. Participants may go back to the top-level page, or they may select a link to go to its 3rd-level page. In a 3rd-level page, participants can check if they have succeeded in the task, and, if not, go back to the 2nd-level page and continue exploration.
4. CogTool-Explorer: Mechanisms and parameters

We start our exploration with CogTool-Explorer, a model of goal-directed user exploration implemented in the ACT-R cognitive architecture (Anderson et al., 2004) first developed to account for the effects of two-column layout on link choice in web search tasks (Teo & John, 2008). CogTool-Explorer added ACT-R's simulated eyes and hands to SNIF-ACT 2.0 and interacts with a spatially accurate ACT-R device model generated by CogTool (John, Prevas, Salvucci, & Koedinger, 2004). Fig. 2 shows the structure of CogTool-Explorer and its relationship to CogTool. Using CogTool, an interactive system designer creates a storyboard of a graphic user interface (GUI) either by hand or automatically from web pages (bottom left of Fig. 2), represented as frames with interactive widgets like links, buttons, or menus, and transitions between those frames that represent user actions like clicking on a link. CogTool translates this storyboard into an ACT-R device model (bottom center of Fig. 2). CogTool-Explorer's model of the user (center of Fig. 2) interacts with this device model to predict novice exploration behavior. In more detail, CogTool-Explorer uses ACT-R's ''eye'' as described in Anderson et al. (2004) with Salvucci's EMMA model of visual preparation, execution, and encoding (Salvucci, 2001), a long-standing implementation within CogTool. A visual search strategy adapted from the Minimal Model of Visual Search (Halverson & Hornof, 2007) guides
Fig. 2. The structure of CogTool-Explorer.
where to move the eye. The strategy starts in the upper-left corner and proceeds to look at the link closest to the model’s current point of visual attention, moderated by its noise function. This strategy will not look at a link more than once on each visit to the web page. Other noise parameters and strategies are possible (e.g., see Budiu & Pirolli, 2007), but as the strategy and noise setting from Halverson and Hornof (2007) produced good results in the two-column tasks in Teo and John (2008), the models in this paper will not vary any aspects of visual processing. Likewise, CogTool-Explorer uses ACT-R’s standard ‘‘hand,’’ used in many CogTool models, and will retain that mechanism through this paper’s exploration. CogTool-Explorer’s estimation of information scent has used latent semantic analysis (LSA; Landauer, McNamara, Dennis, & Kintsch, 2007) to calculate the semantic relatedness of the search goal to links on the screen. We will continue using LSA throughout this paper, although other estimation procedures are possible (e.g., Fu and Pirolli [2007] and Budiu and Pirolli [2007] used pointwise mutual information). A noise function moderated the infoscent values to reflect the variability a person might display when assessing relatedness (baseline noise = ACT-R default = 1), and a scaling factor of 50 (set by Teo & John, 2008) transforms the infoscent values provided by LSA to the range of values expected by SNIF-ACT 2.0. CogTool-Explorer uses the same equations as SNIF-ACT 2.0 to decide which action to take based on what has been seen and evaluated so far, equations which also achieved good results in Teo and John (2008). These equations include two parameters, k, a ‘‘readiness to satisfice’’ factor, and the GoBackCost. Both of these were set to 5 in Fu and Pirolli (2007), but Teo and John’s tasks required a k-value of 600 to fit the data well, which we will continue to use here. The baseline GoBackCost parameter is set to Fu and Pirolli’s value of 5. Finally, when SNIF-ACT 2.0 went back to a page already seen, the link associated with the page backed-up from was marked as having been selected, and SNIF-ACT 2.0 would not select it again (not reported in Fu & Pirolli, 2007, but extracted from the SNIF-ACT 2.0 code). Presumably, as Fu and Pirolli’s data come from naturalistic tasks, the link color changed when a link had been selected and thus this ‘‘perfect memory’’ was ‘‘in the world.’’ This mechanism is also in CogTool-Explorer’s baseline model.
5. Performance of the baseline CogTool-Explorer model

We ran the baseline CogTool-Explorer model until the model runs converged. That is, we ran a set of 44–46 runs of each of the 36 tasks (equal to the number of valid participant trials on each task, for a total of 1,649 runs in each set) and calculated the %Success for each task. We then ran an additional set, combined it with the previous set to form a new combined set and compared its values of %Success per task to the previous set's values. If all values were within 1% of each other, we considered the model converged and stopped. If any of the tasks had a %Success value greater than 1% from its counterpart in the previous set, we ran an additional set, combined it with the previous combined set to form a new combined set and compared its values of %Success per task to the previous combined set's values.
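The convergence procedure can be summarized by the following sketch, which assumes a run_one_set() function (a placeholder, not part of CogTool-Explorer's actual code) returning one set of per-task success indicators; for simplicity it also assumes the same number of runs for every task.

```python
import numpy as np

def run_until_converged(run_one_set, tolerance=0.01):
    """Keep adding sets of model runs until per-task %Success stabilizes.

    `run_one_set` is a placeholder for a function that runs one full set and
    returns a (n_tasks, runs_per_task) array of 0/1 success indicators.
    """
    combined = run_one_set()                          # first set
    n_sets = 1
    while True:
        previous_pct = combined.mean(axis=1)          # %Success per task so far
        combined = np.concatenate([combined, run_one_set()], axis=1)
        n_sets += 1
        new_pct = combined.mean(axis=1)
        if np.all(np.abs(new_pct - previous_pct) <= tolerance):
            return new_pct, n_sets                    # converged values, sets used
```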
The baseline model converged after 12 sets (~20,000 runs), with the following calculated values for our metrics and their 95% confidence intervals:

R2%Success = .28 (0.21, 0.35)
R2ClicksToSuccess = .36 (0.29, 0.43)
R2%ErrorFreeSuccess = .44 (0.37, 0.51)

These values are disappointing for UI design because design practice requires far higher confidence in a model's predictions to be a useful alternative to user testing. These values are also substantially lower than the comparable values reported by other SNIF-ACT derivatives; SNIF-ACT 2.0's R2%Success was .98 and .94 for the two websites modeled (Fu & Pirolli, 2007) and DOI-ACT's R2ClicksToSuccess was .56 (Budiu & Pirolli, 2007). As the baseline CogTool-Explorer model used the same utility equations and most of the same parameters as SNIF-ACT 2.0, it is necessary to understand why the R2%Success results are so different.

Our first hypothesis is that different data collection processes are to blame. Fu and Pirolli's (2007) data were from participants doing eight tasks on each of two websites, at their leisure, on their own computers. Their participants could abandon the task at will, whereas Toldy's tasks were collected in the lab and participants had 130 s to complete each task (Toldy, 2009). Allowing the participants to abandon tasks probably eliminated the most difficult tasks with their higher variability. Not compelled to continue until success, not a single participant in Fu and Pirolli's data succeeded on 4 of their 16 tasks, in contrast to the range seen in Toldy's tasks (average %Success = 71%, min = 13%, max = 100%). As SNIF-ACT 2.0 also failed on these tasks, these four points provided a strong anchor at the origin for their R2%Success value.

Another major difference that might have led to better performance is that SNIF-ACT 2.0 used infoscent scores calculated with reference to only the website in the task (E. Chi, personal communication, June 18, 2010), whereas our infoscent scores were calculated with reference to the college-level TASA corpus (from Touchstone Applied Science Associates, Inc.). A corpus comprised of the task website might have produced infoscent scores with less noise (from word sense ambiguity, etc.) than the more general college-level corpus. Finally, simply switching tasks can illuminate deficiencies in any model, which will be the focus of the rest of this paper.
6. Improving the model Two glaring deficiencies in the behavior of the baseline model, relative to that of participants, inspired changes in the model. The first is that participants reselect links that they had clicked before (13% of their actions) and the model never does. This means that the mechanism in SNIF-ACT 2.0 that perfectly remembers which links have been clicked and never reselects them must be changed to allow the possibility of matching the behavior in these data. We cannot tell from the data whether a reselection is a deliberate decision to click on the link a second time or that the participant forgot that link had been clicked (the links in this experiment did not change color when clicked); we chose to model the latter with the following mechanism in our baseline model. Each link is represented as a visual object that has a ‘‘status’’ attribute whose value is set to ‘‘chosen’’ when the link is clicked on by the
model and then stored in declarative memory. ACT-R's decay mechanism governs whether the fact that the link had been chosen will be retrieved when it is next seen and evaluated by this model. We set ACT-R's base level learning activation parameter, :bll, to 0.5 as recommended in the ACT-R 6.0 tutorial, n.d. (section 4.3), the retrieval activation threshold to −0.5 as shown in section 4.2, and both the permanent noise, :pas, and the instantaneous noise, :ans, to nil (section 4.5).

The second deficiency in the baseline model is that 22% of the participants' actions involve going back from a page and only 7% of the models' actions do. This behavior is comparable to Fu and Pirolli's 5% go-back actions, which, we believe, matched their data because they allowed their participants to abandon tasks instead of going to completion. This calls into question the SNIF-ACT 2.0 mechanisms that govern go-back behavior, that is, both the GoBack utility equation and the GoBackCost parameter. We will lower the GoBackCost from 5 to 1 to get the exploration started and examine the GoBack utility equation through a more detailed examination of the model behavior.

After making the two fundamental changes motivated by global behavior of the baseline model (call this model baseline++), we guided our investigation by examining tasks where participants were least likely to be exploring in a random fashion, that is, on tasks where participants were most successful. We sorted the 36 tasks by highest %ErrorFreeSuccess and then focused on the top four tasks. The third task in this list, to search for information about pigeons (correct top-level link = ''Life Sciences,'' correct 2nd-level link = ''Birds''), had infoscent scores that were all very low and not widely distributed for the top-level headings. Budiu and Pirolli (2007) discuss this problem as well; misleading and/or nondiscriminating infoscent scores will plague any model and we did not consider this task further for inspiration about what to change. However, the other three tasks inspired three ways to change the baseline++ model.

6.1. Refinement of infoscent values for top-level links

The topmost task was to search for information about ferns, and its correct top-level link was ''Life Sciences.'' The 46 participants selected other top-level links only 8% of the time but went back from those 2nd-level pages to select ''Life Science'' and then ''Plants'' (in all but two cases) to complete the task. In contrast, the baseline++ model selected other top-level links about 70% of the time before selecting ''Life Sciences,'' and on some model runs it never selected ''Life Sciences'' and failed the task. One possible explanation for the model behavior was that it did not look at ''Life Science'' before deciding to select a link on the top-level page. When we examined the details of the model runs, this was not the case, as the model runs did see ''Life Science'' before selecting a link in over 95% of first-visits to the top-level page. A second possible explanation was that the model looked at too many links and saw other higher infoscent links before selecting a link on the top-level page. This also was not the case because, in all model runs up to the point where it finished looking at ''Life Science,'' if we forced the model to choose the best link so far, it would have selected ''Life Science'' in over 60% of the runs. A third possible explanation lies in the infoscent values used by the model.
Given a particular goal, the baseline models followed AutoCWW (Blackmon et al., 2005) by using LSA to compute an infoscent value for each link, based on the cosine value between two vectors, one representing the words in the goal description and the other the words in the link text. To approximate how a reader elaborates and comprehends the link text in relation to his or her background knowledge, AutoCWW adds all the terms from the LSA corpus that have a minimum cosine of 0.5 with the raw text and a minimum word frequency of 50 to the raw link text before using LSA. Kitajima, Blackmon, and Polson (2005) explained that ''elaborated link labels generally produce more accurate estimates of semantic similarity (LSA cosine values).'' Our baseline model used the same method; thus, for the link ''Life Science,'' the words ''science sciences biology scientific geology physics life biologist physicists'' were added and then submitted to LSA to compute the infoscent value.

AutoCWW uses a further elaboration method motivated by UI layouts with links grouped into regions labeled with a heading. Kitajima et al. (2005) explained that ''readers scan headings and subheadings to grasp the top-level organization or general structure of the text.'' To represent a region, AutoCWW first elaborates the heading text as described in the previous paragraph, and then adds all the text and their elaborations from links in the same region. The baseline model did not use this elaboration method for top-level links because their subordinate links appeared on 2nd-level pages, different from Kitajima et al.'s assumption. However, participants did practice trials on the same multipage layout as the actual trials and performed all 36 test trials on the same layout. Therefore, we would expect that this experience would influence how participants assessed infoscent of the top-level link. This reasoning motivated our first refinement to the baseline++ model to better represent these participants: For the infoscent of a top-level link, we elaborate the top-level link and then add the text from all links in the corresponding 2nd-level page. While this refinement is similar to AutoCWW's procedure, the justifications are different. This refinement is also in line with Budiu and Pirolli's (2007) use of category-based scent, but it approximates their human-generated categories with an automated process.
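The following sketch shows roughly how this refinement could be computed. The lsa_cosine function and the corpus_terms list are placeholders for whatever LSA backend and corpus are used, and the scaling and noise that the model applies to infoscent values are omitted here.

```python
def elaborate(text, corpus_terms, lsa_cosine, min_cosine=0.5, min_freq=50):
    """AutoCWW-style elaboration: append related, frequent corpus terms.

    `corpus_terms` (a list of (term, frequency) pairs) and `lsa_cosine`
    (a function returning the LSA cosine between two strings) are placeholders
    for the actual LSA backend and corpus.
    """
    related = [term for term, freq in corpus_terms
               if freq >= min_freq and lsa_cosine(term, text) >= min_cosine]
    return text + " " + " ".join(related)

def top_level_infoscent(goal, link_text, second_level_link_texts,
                        corpus_terms, lsa_cosine):
    """Refined top-level scent: elaborate the link, then append the text of all
    links on its 2nd-level page before comparing with the goal."""
    region_text = (elaborate(link_text, corpus_terms, lsa_cosine)
                   + " " + " ".join(second_level_link_texts))
    return lsa_cosine(goal, region_text)
```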
Utility_GoBack = MIS(links assessed on previous page)
                 − MIS(links assessed on current page) − GoBackCost        (1)
where MIS is Mean Information Scent. The infoscent values for the nine top-level links are sensible: The correct link, ‘‘Geography,’’ has the highest LSA value by an order of magnitude. After selecting the top-level link with the highest infoscent and visiting the corresponding 2nd-level page, Eq. 1 includes ‘‘Geography’s’’ high scent in its first operand, which attracted the model back to the top-level page. This behavior violates common sense; as the model had just selected the best top-level link to visit its 2nd-level page, it should not be pulled back to the previous page by the infoscent of the selected link. This reasoning inspired another refinement to the baseline++ model, changing Eq. 1 to Eq. 2:

Utility_GoBack = MIS(links assessed on previous page excluding the selected link)
                 − MIS(links assessed on current page) − GoBackCost        (2)
where MIS is Mean Information Scent.

6.3. Refinement of mean infoscent of current page

The last task on our list of four was to find information about the Hubble Space Telescope. While both participants and model in this task selected the correct link ‘‘Physical Science & Technology’’ on the top-level page, the model went back from the corresponding 2nd-level page 50% of the time, but participants never did. Inspection of the model runs in the Hubble task revealed a different problem from that in the Niagara River task, however. After selecting the link with the highest infoscent and visiting the corresponding 2nd-level page, if the first link the model saw on that page had very low infoscent, the GoBack utility would be high because the value of the second operand would be low. This behavior also violates common sense; as the model had just selected the best link on the top-level page because it looked promising, the model should carry that confidence into the next page and should not immediately go back just because the first link it saw on the 2nd-level page did not relate to the task goal. This reasoning inspired our last refinement to the baseline++ model, changing Eq. 2 to Eq. 3:

Utility_GoBack = MIS(links assessed on previous page excluding the selected link)
                 − MIS(links assessed on current page including the selected link) − GoBackCost        (3)

where MIS is Mean Information Scent.
This change has a nice symmetry with the previous change, carrying along the ‘‘confidence’’ inspired by the high infoscent top-level link. If the selected link’s infoscent score is very high compared to the other top-level links, those other top-level links alone will not exert much pull to go back. If the selected link’s infoscent score is high relative to the first few links it sees on the 2nd-level page, the model will not go back until it ‘‘loses confidence’’ by seeing several low infoscent links, thereby diluting the effect of the high infoscent link that led the model to this page. We ran one set of many preliminary models to get a feel for the contributions of these changes. The combination of all changes described here seemed to be the best model.
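To make the progression from Eq. 1 to Eq. 3 concrete, the sketch below implements the three GoBack utility formulations described in this section. It is a minimal illustration in Python rather than CogTool-Explorer's actual ACT-R production code; the function names and the example infoscent values are hypothetical.

```python
# Minimal sketch of the three GoBack utility formulations (Eqs. 1-3).
# Not CogTool-Explorer's ACT-R implementation; function names and the
# example infoscent values below are made up for illustration.

def mean_scent(scents):
    """Mean Information Scent (MIS) over the links assessed so far."""
    return sum(scents) / len(scents) if scents else 0.0

def goback_eq1(prev, curr, cost):
    # Eq. 1 (SNIF-ACT 2.0): the selected link's scent stays in the previous page's MIS.
    return mean_scent(prev) - mean_scent(curr) - cost

def goback_eq2(prev, selected, curr, cost):
    # Eq. 2: exclude the selected link from the previous page's MIS.
    prev_rest = list(prev)
    prev_rest.remove(selected)
    return mean_scent(prev_rest) - mean_scent(curr) - cost

def goback_eq3(prev, selected, curr, cost):
    # Eq. 3: also carry the selected link's scent into the current page's MIS,
    # so one low-scent link on the new page does not immediately trigger a go-back.
    prev_rest = list(prev)
    prev_rest.remove(selected)
    return mean_scent(prev_rest) - mean_scent(curr + [selected]) - cost

if __name__ == "__main__":
    prev = [30.0, 3.0, 2.0, 1.0]   # scents assessed on the top-level page; 30.0 is the selected link
    curr = [1.0]                   # first (low-scent) link assessed on the 2nd-level page
    cost = 1.0                     # GoBackCost of the best model so far (the baseline used 5)
    print("Eq. 1:", goback_eq1(prev, curr, cost))        # positive: pulled back by the selected link's own scent
    print("Eq. 2:", goback_eq2(prev, 30.0, curr, cost))  # weaker pull
    print("Eq. 3:", goback_eq3(prev, 30.0, curr, cost))  # negative: "confidence" carried onto the new page
```

With made-up values like these, Eq. 3 keeps the model on the 2nd-level page in exactly the situation that caused the spurious go-backs in the Niagara River and Hubble tasks.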
7. Performance of the best model so far

With all the changes described above combined, we ran the model to convergence (10 sets, a total of 16,490 runs) and attained the following calculated values for our metrics and their 95% confidence intervals (Table 1):

R²%Success = .72 (0.66, 0.76)
R²ClicksToSuccess = .66 (0.60, 0.71)
R²%ErrorFreeSuccess = .82 (0.79, 0.85)
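Each R² above is the squared correlation between the participants' and the model's values of a metric, computed task by task. A minimal sketch of that computation follows; the per-task numbers are invented placeholders, not data from this study.

```python
# Sketch: task-level R-squared between observed (participant) and model-predicted
# values of one metric, e.g., %Success per task. Numbers are placeholders.

def r_squared(observed, predicted):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(observed)
    mo = sum(observed) / n
    mp = sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)

participant_pct_success = [0.95, 0.80, 0.40, 0.65, 0.90]   # hypothetical values, one per task
model_pct_success       = [0.90, 0.70, 0.35, 0.75, 0.85]
print(round(r_squared(participant_pct_success, model_pct_success), 2))
```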
8. Discussion and future work

The improved model presented above made large and significant improvements on all our metrics over the baseline model coming into this investigation. R²%Success more than doubled and the other two metrics increased by more than 50%. Although there is room for improvement, these values are in the range where UI designers could use them to identify the tasks at the extremes. That is, this analysis identifies which tasks are sufficiently supported by the interface that effort can be diverted to other areas and which tasks are in most need of attention.

Future work will take several paths. One path involves systematically exploring the benefits of the model mechanisms and parameters described in this paper. We have presented only the conjunction of these elements, with a single set of parameters, but we will examine the mechanisms’ individual and pairwise effects on model performance and explore the parameter space before moving on to other UI layouts and tasks.

Second, we should reconsider the metrics and how to use them. Although we believe the metrics presented here are both meaningful for goodness of fit and useful for UI design, other metrics should be considered. For example, Fu and Pirolli (2007) reported the correlation between the number of go-back actions by the model and participants; how might this help inform model improvements or design? As a second example, consider root mean square error (RMS error), a standard metric for quantifying the difference between the values estimated by a model and what is observed in empirical trials.
Table 1
Summary of results

Visual processes: Baseline = ACT-R + Salvucci (2001) + Halverson and Hornof (2007)^a; Best model so far = No change
Manual processes: Baseline = ACT-R^a; Best model so far = No change
Information scent process
  Heading-level input: Baseline = Link labels; Best model so far = Link labels + lower link labels
  Link-level input: Baseline = Link labels; Best model so far = No change
Decision process
  Click best link utility equation: Baseline = SNIF-ACT 2.0^b; Best model so far = No change
  k (readiness to satisfice): Baseline = 600^a; Best model so far = No change
  Read next link utility equation: Baseline = SNIF-ACT 2.0^b; Best model so far = No change
  GoBack utility equation: Baseline = SNIF-ACT 2.0, Eq. 1^b; Best model so far = Improved here, Eq. 3
  GoBackCost: Baseline = 5^b; Best model so far = 1
  Memory of selected links: Baseline = Perfect^b; Best model so far = Imperfect (:bll = 0.5, :rt = −0.5, :ans = nil, :pas = nil)
Metrics
  R²%Success (95% confidence interval): Baseline = .28 (0.21, 0.35); Best model so far = .72 (0.66, 0.76)
  R²ClicksToSuccess (95% confidence interval): Baseline = .36 (0.29, 0.43); Best model so far = .66 (0.60, 0.71)
  R²%ErrorFreeSuccess (95% confidence interval): Baseline = .44 (0.37, 0.51); Best model so far = .82 (0.79, 0.85)

Note. Entries marked ‘‘No change’’ are mechanisms and parameters that did not change (indicated by gray shading in the original table). ^a From Teo and John (2008). ^b From Fu and Pirolli (2007).
UI designers often need to know absolute quantities when making decisions about design and development effort and cost trade-offs. Thus, a low RMS error would be as valuable as a high correlation (the RMS error did decrease for each metric with our improved model, but the errors are not yet below 20%, which is desirable for UI design practice). In addition, we need to understand how to combine or trade off metrics against one another, as it is unlikely that model exploration will produce the most desirable levels of all metrics at once.

Third, we must validate the model by extending to other UI layouts and tasks. Although this paper reports improvements to several measures of fit, these improvements were made with reference to a single set of tasks on a single UI layout. It is possible that the changes we made to the parameters and to the GoBack utility equation, sensible as they sound, may simply be tuning the values and parameters to this data set. We plan to explore both different tasks with the same multipage layout and the same tasks on different layouts. In the meantime, AutoCWW has shown it could be used to improve the design of website links with only 54% of the variance explained for ClicksToSuccess (Blackmon et al., 2005)
and this improved version of CogTool-Explorer exceeds that level. If these results can be shown to extend beyond simple web search tasks, to other layouts, types of interfaces, and tasks, CogTool-Explorer will be well on its way to being a useful tool for design.
Acknowledgments

The authors thank the anonymous reviewers whose probing questions improved the science reported in this paper and Dr. Marilyn Blackmon for sharing the experiment data. This research was supported in part by funds from IBM, NASA, Boeing, NEC, PARC, DSO, and ONR, N00014-03-1-0086. The views and conclusions in this paper are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of IBM, NASA, Boeing, NEC, PARC, DSO, ONR, or the U.S. Government.
References

ACT-R 6.0 tutorial. (n.d.). Available at: http://act-r.psy.cmu.edu/actr6/units.zip. Accessed on June 13, 2010.
Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., & Qin, Y. (2004). An integrated theory of the mind. Psychological Review, 111(4), 1036–1060.
Blackmon, M. H., Kitajima, M., & Polson, P. G. (2005). Tool for accurately predicting website navigation problems, non-problems, problem severity, and effectiveness of repairs. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 31–40). New York: ACM.
Budiu, R., & Pirolli, P. (2007). Modeling navigation in degree-of-interest trees. In N. A. Taatgen & H. van Rijn (Eds.), Proceedings of the 31st Annual Conference of the Cognitive Science Society (pp. 845–850). Austin, TX: Cognitive Science Society.
Fu, W.-T., & Pirolli, P. (2007). SNIF-ACT: A cognitive model of user navigation on the World Wide Web. Human-Computer Interaction, 22, 355–412.
Halverson, T., & Hornof, A. J. (2007). A minimal model for predicting visual search in human-computer interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 431–434). New York: ACM.
John, B. E., Prevas, K., Salvucci, D. D., & Koedinger, K. (2004). Predictive human performance modeling made easy. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 455–462). New York: ACM.
Kitajima, M., Blackmon, M. H., & Polson, P. G. (2005). Cognitive architecture for website design and usability evaluation: Comprehension and information scent in performing by exploration. In Proceedings of HCI International 2005 (pp. 343–373). Mahwah, NJ: Erlbaum.
Landauer, T. K., McNamara, D. S., Dennis, S., & Kintsch, W. (Eds.) (2007). Handbook of latent semantic analysis. Mahwah, NJ: Lawrence Erlbaum Associates.
Salvucci, D. D. (2001). An integrated model of eye movements and visual encoding. Cognitive Systems Research, 1(4), 201–220.
Teo, L., & John, B. E. (2008). Towards a tool for predicting goal-directed exploratory behavior. In Proceedings of the Human Factors and Ergonomics Society 52nd Annual Meeting (pp. 950–954). Santa Monica, CA: Human Factors and Ergonomics Society.
Toldy, M. E. (2009). The impact of working memory limitations and distributed cognition on solving search problems on complex informational websites. Unpublished doctoral dissertation, University of Colorado – Boulder, Department of Psychology.
Topics in Cognitive Science 3 (2011) 166–186
Copyright © 2011 Cognitive Science Society, Inc. All rights reserved.
ISSN: 1756-8757 print / 1756-8765 online
DOI: 10.1111/j.1756-8765.2010.01126.x
Risk Attitude in Decision Making: In Search of Trait-Like Constructs

Eldad Yechiam,^a Eyal Ert^b

^a Behavioral Sciences, Technion – Israel Institute of Technology
^b Department of Agricultural Economics and Management, The Hebrew University
Received 12 September 2010; received in revised form 3 November 2010; accepted 4 November 2010
Abstract

We evaluate the consistency of different constructs affecting risk attitude in individuals’ decisions across different levels of risk. Specifically, we contrast views suggesting that risk attitude is a single primitive construct with those suggesting it consists of multiple latent components. Additionally, we evaluate such constructs as sensitivity to losses, diminishing sensitivity to increases in payoff, sensitivity to variance, and risk acceptance (the willingness to accept probable outcomes over certainty). In search of trait-like constructs, the paper reviews experimental results focusing on the consistency of these constructs in different tasks as well as their temporal consistency. Overall, the findings show that the most consistent factor is risk acceptance, and they also demonstrate its potential boundaries. These results are modeled with a simple quantitative index of subjective risk. A survey of decisions under risk further reveals that participants exhibit almost no consistency across different tasks in this setting, highlighting the advantage of experiential tasks for studying individual differences.

Keywords: Risk taking; Individual differences; Cognitive style; Experience

Correspondence should be sent to Eldad Yechiam, Faculty of Industrial Engineering and Management, Technion, Haifa 32000, Israel. E-mail: [email protected]
1. Introduction

An important debate in psychology concerns the issue of whether people’s risk attitude (or their sensitivity to risk) is consistent in different settings or whether it is heavily influenced by environmental cues (see review in Schoemaker, 1993). Among those contending that risk attitude is consistent there are major differences in the conceptualization of risk, which have seldom been contrasted in the context of individual risk preferences and their consistency within the individual. The current paper contrasts three major views
concerning the nature of the consistent psychological constructs underlying people’s risk attitude. Perhaps the earliest view on this issue is the classical economic approach that addresses risk attitude as sensitivity to differences in payoff variances, referring to the average distance between each of the payoffs and the mean (e.g., Markowitz, 1952; Pratt, 1964). We will refer to this view as the ‘‘risk as variance’’ approach. The extension of this approach to individual differences (Preuschoff, Bossaerts, & Quartz, 2006) suggests that the major risk attitude dimension in which people are consistent is the sensitivity to variance. For example, consider the choice between zero and a gamble offering an equal chance to win or lose $100 with equal likelihood (e.g., by a flip of a coin). Under the risk as variance approach, some people would consistently choose the low variance option (zero) and some people would consistently avoid it. A second more recent view is ‘‘risk acceptance,’’ the idea that the tendency of people to prefer (or avoid) risk over certainty is the consistent construct in people’s risk attitude (e.g., Brachinger & Weber, 1997). There are different formulations of the risk acceptance approach, which constitutes a revision of the classic economic approach. For parsimony, we chose to focus on a simplified interpretation, referring to risk acceptance as the individual’s sensitivity to certain versus probable outcomes. Thus, the risk acceptance approach implies that differences in variance between choice alternatives constitute a necessary but insufficient condition for the sensitivity to risk. The other necessary condition for risk sensitivity is the existence of clear differences in the level of (un)certainty, such as when choosing between fixed and probabilistic payoffs. Consistent risk attitude is only exhibited under this condition.1 As an example distinguishing the risk acceptance and the risk as variance approaches, consider the choice between two gambles: one offering an equal chance to win or lose $10 with equal likelihood and another offering an equal chance to win or lose $110 with equal likelihood. Under the risk as variance approach people would make this choice according to their sensitivity to variance (which constitutes their risk attitude), while under the risk acceptance approach this situation is not relevant to the risk attitude construct altogether because it does not contrast certain and uncertain outcomes. In other words, the risk acceptance approach implies that the consistent risk attitude construct only reflects the preference of certain (i.e., fixed) outcomes (by some people) or uncertain outcomes (by others). This suggests that differences in variance do not lead to consistent behavior in the absence of certainty. A different and highly dominant view of risk attitude suggests that it is in fact made up of different latent components that are not directly related to the sensitivity to variance. This view is represented by Prospect theory (Kahneman & Tversky, 1979), which explains people’s risk attitude by two main regularities of subjective values: (a) loss aversion—the idea that the perceived magnitude of losses is larger than the perceived magnitude of equivalent gains, and (b) diminishing sensitivity—an implication of Stevens (1957) law asserting that the subjective impact of a change in the absolute payoff decreases with the distance from zero. 
This idea is captured by prospect theory with a ‘‘value function’’ that describes how objective quantities are translated into subjective values. The reference point to which
values are compared is at zero, and the function is concave for gains and convex for losses. Thus, diminishing sensitivity implies that large amounts (either gains or losses) are discounted as a function of the distance from zero.2

Recent cognitive models of individual choice behavior have adopted this view by implementing these factors as constructs that are thought to be consistent at the individual level: (a) loss sensitivity—the assumption that individuals weigh gains and losses in a consistent fashion (e.g., Busemeyer & Stout, 2002; Worthy, Maddox, & Markman, 2007), and (b) diminishing sensitivity—the assertion that people are consistent in discounting (large) outcomes as a function of their distance from zero (e.g., Ahn, Busemeyer, Wagenmakers, & Stout, 2008). We will refer to this view as the ‘‘prospect theory constructs’’ approach because it uses the concepts of prospect theory but further suggests that they are consistent within the individual.

Note that the two constructs of loss sensitivity and diminishing sensitivity can be easily mapped to a person’s sensitivity to variance. Say, for example, there is a choice between $30 for sure and a gamble producing $10 or $50 with equal probability (e.g., by a flip of a coin). High diminishing sensitivity implies discounting of the larger amount (of $50), thereby making the higher variance option less attractive. In a gamble with same-sized gains and losses, loss sensitivity has a similar role. For example, if $30 is deducted from all outcomes in the example above, this gives a sure outcome of 0 and a gamble producing −$20 or $20 with the same probability. A person sensitive to losses (compared to gains) would overweight the loss over the gain outcome and would therefore avoid the higher-variance gamble and opt for the safer sure amount. Nevertheless, as demonstrated below, the prospect theory constructs lead to predictions that differ from those of the risk as variance approach.

Section 2 of the current paper highlights the conflicting predictions implied by the approaches outlined above. Section 3 reviews experiments focusing on people’s consistency across different tasks and different experimental sessions, which allow the examination of these predictions. Given the nature of the predictions, all of the reviewed experiments use a within-subject design. Section 4 proposes a simple quantitative index for the emergence of consistency in risk-taking behavior.
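The worked example above ($30 for sure versus an even chance of $10 or $50, and its shifted mixed-gamble version) can be made concrete with a parametric value function of the kind used in these cognitive models. The power-law form and the parameter values below are common textbook choices, not estimates from the studies reviewed here.

```python
# Sketch of a prospect-theory-style value function with diminishing sensitivity
# (exponent < 1) and loss sensitivity (lambda > 1). Parameter values are
# illustrative textbook choices, not estimates from this paper.

ALPHA = 0.65   # diminishing sensitivity: curvature for gains and losses
LAMB = 2.0     # loss sensitivity: losses weighted more than equivalent gains

def value(x):
    return x ** ALPHA if x >= 0 else -LAMB * ((-x) ** ALPHA)

def subjective_value(gamble):          # gamble = [(probability, outcome), ...]
    return sum(p * value(x) for p, x in gamble)

# $30 for sure vs. an equal chance of $10 or $50:
safe, risky = [(1.0, 30)], [(0.5, 10), (0.5, 50)]
print(subjective_value(safe) > subjective_value(risky))   # True: diminishing sensitivity favors the safe option

# The same problem shifted down by $30: 0 for sure vs. an equal chance of -$20 or +$20:
safe, risky = [(1.0, 0)], [(0.5, -20), (0.5, 20)]
print(subjective_value(safe) > subjective_value(risky))   # True: loss sensitivity favors the sure zero
```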
2. The contrasting predictions

A trait is defined as a habitual pattern of behavior, thought, or emotion (Kassin, 2003). Hence, if a construct is trait-like, then it is predicted to affect behavior across a range of situations, thereby leading to consistency in the individual’s behavior when the same task is performed in different conditions. In this section, some specific conditions that enable one to differentiate the predictions of the three aforementioned approaches are outlined. The first such condition involves the consistency between risk-taking propensities in choice between gambles that include only nonnegative payoffs (the gain domain) and choice between gambles that involve only nonpositive payoffs (the loss domain). Under the prospect theory constructs approach, supposing that indeed diminishing sensitivity is consistent, then a negative association is expected between risk taking in the gain and loss domains. For example, say you have two choice problems:
Problem 1 (Gain domain): Choose between getting $50 for sure and a prospect providing 50% to win $100 and 50% to get 0. Problem 2 (Loss domain): Choose between losing $50 for sure and a prospect providing 50% to lose $100 and 50% to get 0. In each problem, there are 100 repeated choices between the safe option and the riskier prospect. Let us imagine two individuals (consistently) differing in their diminishing sensitivity. One individual has no diminishing sensitivity. This means that the subjective value of $100 is perceived as about twice as valuable as the value of $50 in both problems. This individual is therefore expected to be ‘‘risk neutral’’ in both problems and so equally likely to select the safe and risky options, as their expected values are identical. Now, our second individual has high diminishing sensitivity, meaning that he discounts large amounts. Discounting the large amount in the gain domain (e.g., in Problem 1) results in risk aversion because the large amount is the better part of the risky prospect and is not found very attractive. Discounting the large amount in the loss domain (e.g., in Problem 2) results in risk seeking because the large amount is the worst part of the risky prospect and it is not found very negative. Hence, for the individual with high diminishing sensitivity we would expect risk aversion in the gain domain and risk seeking in the loss domain. Kahneman and Tversky (1979) found that, on average, people behave like our second individual—showing risk aversion in the gain domain and risk seeking in the loss domain, and termed this observation the reflection effect. The prospect theory constructs approach further assumes that these constructs are consistent within different individuals. This implies that those that are risk averse in the gain domain would be risk seeking in the loss domain. Hence, this approach predicts a negative correlation between risk attitudes in the gain and loss domains at the individual level. In contrast, models based on the sensitivity to variance, as well as models of risk acceptance, would predict a positive correlation between risky choices in the gain and loss domains, as individuals would either seek or avoid certainty or variance in both domains. However, the risk acceptance approach will have this prediction only when the choice alternatives contrast certain and uncertain outcomes (as in Problems 1 and 2 above). The second prediction involves the consistency of the weighting of gains and losses. Under the prospect theory construct of loss sensitivity, a positive correlation should appear between choice problems differing in the magnitudes of gains and losses regardless of factors like variance or certainty. Again, let us use an example to clarify. This example involves problems that include gambles with both gains and losses, hence referred to as ‘‘mixed gambles’’ (we sometimes refer to this as the mixed domain): Problem 3 (Mixed-low outcomes): Choose between getting 0 for sure and a prospect providing 50% to win $5 and 50% to lose $5. Problem 4 (Mixed-high outcomes): Choose between getting 0 for sure and a prospect providing 50% to win $50 and 50% to lose $50. Under the prospect theory construct of loss sensitivity, if a person gives more weight to losses than gains, he or she should be risk averse in both of these problems. In contrast, if a
person gives more weight to gains than to losses, he or she should be risk seeking in both problems. Therefore, given some individual differences in loss sensitivity, a positive correlation is expected across the two problems (as some people are more risk averse in both problems and some are more risk seeking in both). In addition, the idea of loss sensitivity implies that people will not be consistent in their risk attitude across Problem 1 (Gain) and either Problem 3 (Mixed-low) or Problem 4 (Mixed-high) simply because there are no losses involved in Problem 1.3 In contrast, the sensitivity to variance model predicts that the largest consistencies would appear between problems where the alternatives have similar differences between levels of variance. Therefore, in the examples above it predicts positive correlation between choices across Problems 1 and 4, even though Problem 1 involves only gains, and Problem 4 involves losses as well. The reason is that these problems have the same differences in variance between their prospects. Moreover, this approach also predicts that people would exhibit much lower consistency in their choices across Problems 3 and 4, even though both problems involve both gains and losses, because the difference between the alternatives’ variance in Problem 4 is much higher than in Problem 3. The risk acceptance approach predicts choice consistency mostly when there are discernible differences in levels of certainty (e.g., in the choice between fixed and probabilistic outcomes). Accordingly, it also predicts positive consistency between Problems 3 and 4, which contrast a certain outcome with uncertain outcomes. However, it does not predict consistency in conditions that do not involve this contrast. To illustrate, let us present another choice problem: Problem 5 (Mixed-unavoidable uncertainty): Choose between a prospect providing 50% to win $25 and 50% to lose $25 and a prospect providing 50% to win $75 and 50% to lose $75. In this problem, neither of the alternatives provides certain outcomes. The hypothesized correlations between risk attitudes in Problem 4 and Problem 5 can distinguish between the risk acceptance approach and the sensitivity to variance approach. The latter predicts high consistency in risky choices between the two problems since both problems involve the same difference in the alternatives’ level of variance. However, according to the risk acceptance approach people will not exhibit consistent risk attitude across these problems since Problem 4 involves choices between certain and uncertain outcomes, whereas Problem 5 involves choices only between uncertain outcomes.
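The distinctions among Problems 1–5 come down to two observable quantities per problem: the gap between the options' payoff variability and whether the problem contrasts a certain outcome with probable ones. The sketch below computes both for Problems 1–5; the gamble encoding is an illustrative assumption, not the representation used in the studies reviewed later.

```python
# Sketch: the two quantities that drive the contrasting predictions for Problems 1-5.
# Gambles are encoded as [(probability, outcome), ...]; this encoding is illustrative.
import math

def sd(gamble):
    mean = sum(p * x for p, x in gamble)
    return math.sqrt(sum(p * (x - mean) ** 2 for p, x in gamble))

def is_certain(gamble):
    return len(gamble) == 1

problems = {
    "Problem 1 (gain)":         ([(1.0, 50)],              [(0.5, 100), (0.5, 0)]),
    "Problem 2 (loss)":         ([(1.0, -50)],             [(0.5, -100), (0.5, 0)]),
    "Problem 3 (mixed low)":    ([(1.0, 0)],               [(0.5, 5),   (0.5, -5)]),
    "Problem 4 (mixed high)":   ([(1.0, 0)],               [(0.5, 50),  (0.5, -50)]),
    "Problem 5 (no certainty)": ([(0.5, 25), (0.5, -25)],  [(0.5, 75),  (0.5, -75)]),
}

for name, (option_a, option_b) in problems.items():
    variability_gap = sd(option_b) - sd(option_a)                        # what the risk-as-variance view keys on
    certain_vs_probable = is_certain(option_a) != is_certain(option_b)   # what risk acceptance keys on
    print(f"{name}: variability gap = {variability_gap:.0f}, "
          f"certain vs. probable = {certain_vs_probable}")
```

Running this shows that Problems 1, 4, and 5 share the same variability gap, so the risk-as-variance view predicts consistency among all three, whereas only Problems 1–4 offer the certain-versus-probable contrast that the risk acceptance view requires.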
3. Experimental evidence: Consistency across tasks and time in risk-taking behavior

In a recent paper, we (Ert & Yechiam, 2010) examined the contrasting predictions of the aforementioned approaches to the consistency of people’s behavior across different experiential risk-taking tasks. In such decisions, individuals do not get explicit information about the payoff distributions associated with the alternatives they face (e.g., the probabilities and
payoff sizes) and learn the relevant distributions from their experience (Hertwig, Barron, Weber, & Erev, 2004). We start the current analysis with a brief summary of these findings and continue with additional experimental evidence in two directions: First, we report new data about consistency across decision problems in description-based tasks, where individuals get written information concerning the characteristics of the decision problem. Second, we review recent studies (e.g., Levin, Hart, Weller, & Harshman, 2007; Yechiam, 2010) addressing similar predictions using a longitudinal design and evaluating the consistency of choices over time.

3.1. Consistency across the gain and loss domains: Diminishing sensitivity or risk acceptance?

An important implication of the diminishing sensitivity construct for the consistency in people’s behavior involves the prediction of a negative consistency across the gain and loss domains. As indicated above, examining the consistency across these domains enables contrasting the ‘‘diminishing sensitivity’’ assertion (which is a part of the prospect theory constructs approach) with the ‘‘sensitivity to variance’’ and the ‘‘risk acceptance’’ assertions.

In Experiment 1 of Ert and Yechiam (2010), each participant was presented with four repeated choice tasks, as described in Table 1. Each task included two alternatives, one (referred to as ‘‘L’’) being always associated with lower variance payoffs than the other (‘‘H’’). The main within-subject manipulation pertained to the presentation of the outcomes as gains or losses. In the Gain condition, choice alternatives yielded positive outcomes, whereas in the Loss condition outcomes were negative. In order to differentiate between predictions of the sensitivity to variance and risk acceptance approaches, the tasks were further distinguished with respect to the difference in the levels of uncertainty. In two of the tasks, selecting the safer option eliminated probabilistic outcomes. We refer to these tasks as the ‘‘Avoidable Uncertainty’’ condition. In the other two tasks, uncertainty could not be avoided since both alternatives included probable outcomes. These tasks are referred to as the ‘‘Unavoidable Uncertainty’’ condition.

The diminishing sensitivity assertion predicts the emergence of a negative association between risk taking in the gain and loss domains in both the Avoidable and Unavoidable uncertainty conditions because high diminishing sensitivity leads to risk seeking in the loss domain and risk aversion in the gain domain. This assertion also predicts positive correlations between the two gain problems, and between the two loss problems. In contrast, the risk acceptance assertion predicts the emergence of a positive association between the gain and loss domains in the two Avoidable Uncertainty conditions, and no association between the two Unavoidable Uncertainty conditions. In the avoidable uncertainty problems, there are clearer differences in uncertainty level, which supposedly trigger risk acceptance tendencies. Finally, the sensitivity to variance model predicts positive associations across all four choice problems due to one option being higher in variance than the other, even in the Unavoidable Uncertainty condition.
Table 1
The payoff schemes of the four conditions of Experiment 1 of Ert and Yechiam (2010) and the average proportion of selections

Domain   Condition                 Alternative: Payoff                                                        P(H)
Gain     Avoidable Uncertainty     L: Win 600; H: 50% to win 1200, 50% to win 0                               0.26
Gain     Unavoidable Uncertainty   L: 50% to win 500, 50% to win 400; H: 50% to win 890, 50% to win 10        0.31
Loss     Avoidable Uncertainty     L: Lose 600; H: 50% to lose 1200, 50% to lose 0                            0.45
Loss     Unavoidable Uncertainty   L: 50% to lose 500, 50% to lose 400; H: 50% to lose 890, 50% to lose 10    0.49

Note. L = Low variance option; H = High variance option; P(H) = The average proportion of H choices across individuals.

The participants were informed that they would be playing different games in which they would operate ‘‘computerized money machines’’ with two unmarked buttons, and that
their final payoffs would be sampled from one of the ‘‘machines.’’ They received no prior information about the payoff distributions or the number of trials. Their task was to select one of the machine’s two unmarked buttons in each trial. The payoffs in each task were contingent upon the button chosen and were randomly drawn from the relevant distributions described in Table 1. Two types of feedback immediately followed each choice: (a) the basic payoff for the choice, which appeared on the selected button for 2 s, and (b) an accumulating payoff counter, which was displayed constantly. Final take-home amounts were determined according to the accumulating score in one choice problem that was randomly selected at the end of the experiment (the performance score was converted into cash money at a rate of 0.01 NIS per 100 points). The measure used in each task was simply the proportion of choices of H across trials. There are therefore four variables in this study (and subsequent ones) conforming to the rate of H choices in each of the four choice problems. The choice proportions under the different conditions are summarized in the rightmost column of Table 1.4 The correlations across tasks appear in Fig. 1 (all correlations are between proportions of H selections). The results showed that in the Avoidable Uncertainty condition there was a positive association between the gain and loss domains (r = .45, p < .01), which stands in contrast to the diminishing sensitivity hypothesis and supports the risk acceptance assertion. In the Unavoidable Uncertainty condition, there was no association between the loss and gain domains (r = .03, NS), which further supports the risk acceptance assertion, since in this condition the probabilistic outcome could not be avoided (or accepted). In addition, participants were consistent between the two Gain problems (r = .63, p < .0001) and between the two Loss problems (r = .32, p < .02), suggesting that individuals exhibit reliable diminishing sensitivity to a certain degree. To summarize, participants exhibited a consistent preference between a constant outcome and a probable outcome in the gain and loss domains, rather than consistent diminishing sensitivity. This suggests that risk acceptance, rather than diminishing sensitivity, modulates the consistency across the gain and loss domains. Additionally, the argument that the consistent sensitivity to risk is due to mere variance differences cannot account for the null correlations between gain and loss domain problems in the Unavoidable Uncertainty condition.5
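The individual-level analysis reduces each participant to one proportion of H choices per task and then correlates those proportions across tasks. The following sketch reproduces the shape of that analysis on simulated data (it assumes NumPy and SciPy are available); the real values are those reported in Fig. 1.

```python
# Sketch of the cross-task consistency analysis: one P(H) value per participant per
# task, compared across tasks with Spearman correlations. Data are simulated here;
# the actual correlations are those reported in Fig. 1.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_participants, n_trials = 80, 100

# Simulate a participant-level risk-acceptance trait that expresses itself only
# when the safer option removes uncertainty (the Avoidable Uncertainty tasks).
trait = rng.normal(0.0, 1.0, n_participants)

def p_h(trait_loading):
    logits = trait_loading * trait + rng.normal(0.0, 1.0, n_participants)
    prob_h = 1.0 / (1.0 + np.exp(-logits))
    return rng.binomial(n_trials, prob_h) / n_trials     # proportion of H choices

tasks = {
    "Gain-AU": p_h(1.0), "Loss-AU": p_h(1.0),   # trait expressed
    "Gain-UU": p_h(0.0), "Loss-UU": p_h(0.0),   # trait not expressed
}

names = list(tasks)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        rho, p = spearmanr(tasks[a], tasks[b])
        print(f"{a} vs {b}: rho = {rho:+.2f} (p = {p:.3f})")
```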
Fig. 1. The results of Experiment 1 of Ert and Yechiam (2010): scatter plots, linear regression lines, and Spearman correlations. Each dot in the scatter plot shows the proportion of choices from the High variance option of an individual decision maker. The column header denotes the abscissa, and the row header denotes the ordinate (AU = Avoidable Uncertainty; UU = Unavoidable Uncertainty; Gain = Gain domain; Loss = Loss domain).
3.2. Consistency with losses and gains—Is it the product of a weighting parameter?

The second line of predictions that distinguishes the different approaches involves the question of whether consistent weighting of gains and losses (loss sensitivity) can modulate risk taking in problems involving gains and losses, or whether its effects are due to risk acceptance (or sensitivity to variance) as well. In Ert and Yechiam (2010), this was examined by comparing two conditions: a condition where there is a choice between zero and a gamble involving gains and losses, and a condition where there are two uncertain alternatives (i.e., a choice between two gambles differing in the magnitude of gains and losses).
Under the prospect theory constructs approach the loss-sensitivity construct predicts that individuals would consistently avoid the uncertain alternative with the largest losses. Accordingly, consistency is predicted to be maintained even in the choice between two uncertain options. Similarly, under the sensitivity to variance approach a positive correlation is expected to be maintained for alternatives having the same difference in variance. However, under the risk acceptance approach consistency is only expected to emerge in the condition where there are substantial differences in the level of uncertainty (i.e., between zero and a gamble).

The method replicated Experiment 1 only with new choice problems (see Table 2). In two of the tasks, referred to as the ‘‘Avoidable Uncertainty’’ condition, selecting the safer option eliminated the probability of losing. In the other two tasks (‘‘Unavoidable Uncertainty’’ condition), uncertainty differences between alternatives were smaller and both alternatives included possible losses occurring with the same frequency (but differing in magnitude). A second within-subject manipulation pertained to the payoff size. In condition ‘‘High Payoff,’’ the size of all payoffs was multiplied by five, compared to the ‘‘Low Payoff’’ condition. Consequently, in the Low-Payoff condition alternative H was associated with a standard deviation five times smaller than in the High-Payoff condition (SD = 100 and 500, respectively).

The choice proportions under the different conditions are summarized in the rightmost column of Table 2 and the correlations across tasks appear in Fig. 2. At the aggregate level in both conditions participants did not tend to avoid the riskier alternative and therefore did not exhibit loss aversion (consistent with previous findings in experience-based tasks; e.g., Erev, Ert, & Yechiam, 2008). At the individual level the results reveal that despite showing no loss aversion on average, participants were highly consistent between the Avoidable Uncertainty problems, in which risks could be avoided (r = .54, p < .01) yet not in the Unavoidable Uncertainty problems, where risks could not be avoided (r = .13, NS). Also, the participants did not show consistency across the two High-Payoff and Low-Payoff tasks, inconsistent with the implication of the risk as variance approach.

Table 2
The payoff schemes of the four conditions in Experiment 2 of Ert and Yechiam (2010) and the average proportion of selections
Condition                 Payoff Magnitude   Alternative: Payoff                                                      P(H)
Avoidable Uncertainty     Low payoff         L: Win 0; H: 50% to win 100, 50% to lose 100                             0.64
Avoidable Uncertainty     High payoff        L: Win 0; H: 50% to win 500, 50% to lose 500                             0.61
Unavoidable Uncertainty   Low payoff         L: 50% to win 50, 50% to lose 50; H: 50% to win 150, 50% to lose 150     0.52
Unavoidable Uncertainty   High payoff        L: 50% to win 250, 50% to lose 250; H: 50% to win 750, 50% to lose 750   0.51

Note. L = Low variance option; H = High variance option; P(H) = The average proportion of H choices across individuals.
Fig. 2. The results of Experiment 2 of Ert and Yechiam (2010): scatter plots, linear regression lines, and Spearman correlations. Each dot in the scatter plot shows the proportion of choices from the High variance option of an individual decision maker. The column header denotes the abscissa, and the row header denotes the ordinate (AU = Avoidable Uncertainty; UU = Unavoidable Uncertainty).
The correlations within each of the two pairs of High and Low payoff tasks were small and insignificant, even though the two problems in each of these pairs have the same level of variance. This pattern suggests that the consistency in risk taking with losses is not driven by the mere sensitivity to losses or by the sensitivity to variance. As opposed to the prediction of both approaches, the participants were only consistent when choosing between a risky alternative involving uncertain losses and gains and a safe alternative producing a fixed outcome. This indicates that the consistent construct in the mixed domain involves risk acceptance, as the consistency in risk taking only emerges where the available alternatives are clearly distinguished in their level of uncertainty.
3.3. A single construct of risk acceptance?

In the previous sections, we have reviewed findings showing that the construct of risk acceptance is useful for predicting individual-level consistencies, yet it may not be a single primitive construct. In the third experiment described in Ert and Yechiam (2010), we examined whether risk acceptance is a single psychological construct or whether it implicates a second construct when the outcomes involve frequently appearing gains and losses. This was evaluated by comparing the consistency of risk taking across Gain and Mixed domain conditions (as shown in Table 3). A second within-subject manipulation pertained to the level of risk. In condition ‘‘High Payoff,’’ all payoffs were twice as high as in the ‘‘Low Payoff’’ condition.

The choice proportions under the different conditions are summarized in the rightmost column of Table 3 and correlations across tasks appear in Fig. 3. The results show that on average, people took more risk in the Mixed condition than in the Gain condition, even though in the Mixed condition risk taking led to losses—Low Payoff: t(49) = 4.71, p < .01; High Payoff: t(49) = 2.93, p < .05. Thus, the participants in these tasks do not exhibit loss aversion, as previously shown in other experience-based tasks (e.g., Erev et al., 2008; Koritzky & Yechiam, 2010). Participants were highly consistent between the two Mixed problems (r = .57, p < .01) and between the two Gain problems (r = .55, p < .01). However, participants were not consistent across the Gain and Mixed problems: The correlations across these problems were small (average r = .11) and insignificant.

These results suggest two separate constructs, one for gains and losses of similar magnitudes, and another for gains only. Given the large positive association found between risk acceptance in the gain and loss domains, this suggests that the latter construct is relevant to the loss domain as well. Another interpretation rests on the special case of a constant outcome of zero. It might be that risk attitude in the Mixed problems was independent from that exhibited in the Gain problems because participants have a special psychological tendency to respond to the absolute zero.

Table 3
The payoff schemes of the four conditions of Experiment 3 of Ert and Yechiam (2010) and the average proportion of selections
Condition   Payoff Magnitude   Alternative: Payoff                                    P(H)
Mixed       Low payoff         L: Win 0; H: 50% to win 1000, 50% to lose 1000         0.55
Mixed       High payoff        L: Win 0; H: 50% to win 2000, 50% to lose 2000         0.56
Gain        Low payoff         L: Win 1000; H: 50% to win 2000, 50% to win 0          0.28
Gain        High payoff        L: Win 2000; H: 50% to win 4000, 50% to win 0          0.30

Note. L = Low variance option; H = High variance option; P(H) = The average proportion of H choices across individuals.
Fig. 3. The results of Experiment 3 of Ert and Yechiam (2010): scatter plots, linear regression lines, and Spearman correlations. Each dot in the scatter plot shows the proportion of choices from the High variance option of an individual decision maker. The column header denotes the abscissa, and the row header denotes the ordinate (Mixed = Mixed domain with both gains and losses; Gain = Gain domain).
3.4. Experience- versus description-based studies

The Ert and Yechiam (2010) study focused on experience-based decisions. A single previous study by Schoemaker (1990) examined the consistency across the gain and loss domains using description-based decisions, where the participants get a verbal account of the probabilities and outcomes. The results showed a positive correlation similar to that obtained in Ert and Yechiam (2010), but it was smaller and not statistically significant. Accordingly, it is not quite clear whether the findings described above are robust to description-based decisions, which are the most commonly used format in current decision science.
To examine this issue, we conducted a survey where 139 students, participating in an introductory psychology course, voluntarily completed a questionnaire in which they were asked to choose among gambles conforming to the eight decision problems reported in Tables 2 and 3 above. We focused on these problems because they cover all the payoff domains and additionally allow the comparison between avoidable risks and unavoidable risks. The prospects’ outcomes and probabilities were described on a paper sheet (in the same format used by Kahneman & Tversky, 1979) and respondents were asked to mark their preferred choice in each of the hypothetical decision problems. Nine versions of this survey were administered, and in each of them the problems were randomly ordered.

The results of this survey are presented in Table 4. For conciseness, we have averaged across the high and low payoff problems. The aggregate proportions of risk taking show two main observations: (1) risk aversion in the gain domain and risk seeking in the loss domain, consistent with the reflection effect; (2) no loss aversion in any of the mixed gambles (in line with recent studies of low stake decisions under risk, e.g., Birnbaum & Bahra, 2007; Ert & Erev, 2008, 2010; Koritzky & Yechiam, 2010). Surprisingly, participants showed only moderate degrees of consistency between the different tasks. The only consistencies were observed between the two gain domain problems (r = .22, p < .05) and the two loss domain problems (r = .31, p < .01), suggesting some support for the idea of consistent diminishing sensitivity within the gain and loss domains. Nevertheless, consistent with the results of Ert and Yechiam (2010: Experiment 1) and Schoemaker (1990), no negative association was observed across domains, in contrast to the diminishing sensitivity assertion. Additionally, in most cases the tendency to select the risky option in mixed gambles was independent from that exhibited in the gain and loss domains. However, in one case this was breached, as in the Unavoidable Uncertainty condition risk taking in the gain domain was slightly but significantly associated with risk taking in the mixed domain (r = .21, p < .05). The results show no relationship between the mixed problems and no relationship between the avoidable risk problems. Thus, no support was found for either the loss sensitivity or the risk acceptance construct in decisions from description.

Table 4
Average proportions of selections and Spearman correlations between risk taking in the different items of the description-based survey
            P(H)    AU Mixed   UU Mixed   AU Gain   UU Gain   AU Loss   UU Loss
AU Mixed    0.57    1.00
UU Mixed    0.53     .14       1.00
AU Gain     0.38    -.11       -.02       1.00
UU Gain     0.38    -.14        .21*       .22*     1.00
AU Loss     0.46    -.08        .05        .08       .03      1.00
UU Loss     0.59    -.13       -.08       -.08       .04       .31*     1.00

Note. AU = Avoidable Uncertainty; UU = Unavoidable Uncertainty; P(H) = The average proportion of choices from the High variance option across individuals.
*p < .05.
3.5. Consistency across time in experiential decisions

Consistency across tasks is an indicator that is often used for assessing whether a construct is trait-like (e.g., Ashton & Vernon, 1995). Another indicator for traits is the consistency across time. Its advantages are that it diffuses the effect of any situation-specific variable that leads to consistency in a given experimental session (Deinzer et al., 1995). The current section addresses the temporal consistency of the constructs modulating cross-domain risk taking. Perhaps the most surprising regularity in the studies reviewed above is the positive consistency across the gain and loss domains. It suggests that the trait-like construct implicated in experiential behavior involves risk acceptance rather than diminishing sensitivity (as the latter construct predicts negative consistency across domains). If risk acceptance for different domains is indeed a trait, then it is expected to also be consistent across different experimental sessions.

In an impressive study, Levin et al. (2007) studied the consistency of risk-taking behavior in an experiential task, known as the cups task, across a 3-year period. This was assessed for a sample of adolescents as well as for their parents. The cups task was administered in two versions: a gain and a loss domain. Using the results as indices of risk sensitivity, risk acceptance and diminishing sensitivity can be approximated by aggregating risk-taking tendencies across domains. Under the risk acceptance construct the consistent construct is said to be the risk-taking level in the loss domain plus the risk-taking level in the gain domain (denoting the fact that both are manifestations of the same construct). Under the diminishing sensitivity argument the consistent construct is the risk-taking level in the loss domain minus the risk-taking level in the gain domain. This reflects the posited reflection effect at the individual level.

The cups task is somewhat more complex than the previously reviewed tasks as it contains both descriptive and experiential information concerning the payoffs. Participants are shown the outcomes hidden behind a cup and choose between obtaining sure outcomes or guessing the location of the ‘‘outcome cup’’ among several identical cups. Following each choice, the participants receive feedback. The Levin et al. (2007) study used a 3 × 2 × 2 design with this task. First, the value of the risky option was advantageous, disadvantageous, or neutral compared to the safe option. Second, the probability of winning (or losing) in the risky option was .5 or .2. Finally, the outcomes were framed as either gains or losses. For conciseness, we pool the results of the first two conditions and focus only on the comparison of the gain and loss domains.

The temporal consistency results of Levin et al. (2007) appear in Table 5. As can be seen, the consistency across sessions was higher for the ‘‘risk acceptance’’ factor (Loss + Gain) than for the ‘‘reflection’’ factor (Loss − Gain). This was observed for both parents and children. These results suggest that the more prominent latent construct modulating consistent risk taking across the gain and loss domains is risk acceptance.6 Similar results were obtained in a study by Yechiam (2010). In this study, the participants performed variants of the Avoidable Uncertainty Gain and Loss problems described in Table 1. The tasks were performed in two sessions that were conducted about 3 months apart.
The correlation across sessions for the ‘‘risk acceptance’’ factor was .21 (p < .05), while the correlation for the ‘‘reflection’’ factor was .14 and not statistically significant.
Table 5
Temporal consistency in the Levin et al. (2007) study: Correlations between the aggregated risk-taking in the Loss and Gain domain tasks (‘‘risk acceptance’’) and the difference between risk levels in these domains (‘‘reflection’’)

            Loss + Gain (Risk Acceptance)   Loss − Gain (‘‘Reflection’’)
Parents     .29*                            .20
Children    .38*                            .30*
Note. The results are shown for a sample of parents and their children (n = 62). *p < .05.
Taken together, these results point to the fact that the ‘‘risk acceptance’’ factor possesses superior temporal consistency, suggesting that it may be reasonable to treat it as a behavioral trait.
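Computationally, the two temporal factors are simple aggregates of the domain-level risk scores: their sum for ‘‘risk acceptance’’ and their difference for ‘‘reflection.’’ The sketch below shows how their test–retest correlations would be obtained from two sessions of simulated data (NumPy and SciPy assumed); the empirical values are those in Table 5 and in Yechiam (2010).

```python
# Sketch: "risk acceptance" (Loss + Gain) and "reflection" (Loss - Gain) factors
# and their test-retest correlations across two sessions. Data are simulated for
# illustration; empirical values appear in Table 5 and Yechiam (2010).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n = 62   # sample size in the Levin et al. (2007) analysis

# Risk-taking level (e.g., P(H)) in the Gain and Loss tasks at sessions 1 and 2.
gain1, loss1 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
gain2 = np.clip(gain1 + rng.normal(0, 0.2, n), 0, 1)
loss2 = np.clip(loss1 + rng.normal(0, 0.2, n), 0, 1)

acceptance1, acceptance2 = loss1 + gain1, loss2 + gain2   # "risk acceptance" factor
reflection1, reflection2 = loss1 - gain1, loss2 - gain2   # "reflection" factor

print("risk acceptance test-retest rho:", round(spearmanr(acceptance1, acceptance2)[0], 2))
print("reflection test-retest rho:     ", round(spearmanr(reflection1, reflection2)[0], 2))
```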
4. Quantitative summary

4.1. A quantitative index of subjective risk

The results of the reviewed studies support the ‘‘risk acceptance’’ approach, though suggesting that the psychological construct of risk acceptance could be different in a domain with both gains and losses. Perhaps a more challenging goal is to use these findings in an attempt to develop a quantitative index for what makes people respond consistently to risk. Individual differences studies indicate that a trait should be measured in a situation where it is relevant (Tett & Guterman, 2000), which therefore involves a decision between a nontrivial amount of risk and a very low amount of risk. Therefore, the subjective difference in the risk of the alternatives is expected to lead to increased behavioral consistency in risk-taking levels.

We evaluated two quantitative indices for the emergence of consistency based on such subjective differences. A simple index was based on the idea that variance differences lead to consistency. According to this idea, the larger the differences in variance, the better a person differentiates between alternatives, thus leading to more consistency in his or her risk-taking behavior. An alternative account involves the assumption that differences in subjective risk level (and therefore individual consistency) increase as a function of differences in variance but also decrease as a function of the distance of outcomes from a certain outcome with the gamble’s expected value. This leads to having the largest subjective distance between certain and uncertain outcomes (consistent with the risk acceptance assertion). This account can be formalized by the following index for subjective risk differences:

S = S_diff / Σ |p_i (x_i − x̄)|        (1)
where S is the Risk-Difference Signal (RDS), S_diff is the difference between the standard deviations of the two distributions, p_i is the probability for each outcome i and x_i is its size,
and x̄ is the expected value of the gamble.7 Note that this index is different from the coefficient of variation (Hendricks & Robey, 1936; Weber, Shafir, & Blais, 2004), which has only the expected value in the denominator. First of all, the coefficient of variation does not imply that the certainty of outcomes affects perceived differences in risk level. Second, the coefficient of variation is not applicable for gambles with an expected value that is near zero or negative. Under both accounts, the risk differences in a problem pair are assumed to aggregate as follows:

C = S_1 × S_2        (2)
This yields a parameter-free index C (of predicted consistency). The problems reviewed in Section 3 (from Ert & Yechiam, 2010) were rearranged into 18 pairs (representing all possible pairs within each study), and the risk difference in each pair was determined according to the two alternative indices. Then, the predictive ability of the two indices was determined by calculating the correlations between the predicted consistency of each pair and its actual consistency in risk level. The variance-based index produced a correlation of .21, while the RDS index produced a correlation of .37 when predicting the consistency across all 18 comparisons. A post-hoc version of the RDS, which differentiates nonmixed (gain or loss domain) from mixed (gain and loss) problems and is otherwise identical to the original index, was also examined. It yields an average correlation (between predicted and actual consistency) of .80 for 14 relevant pairs: r = .68 for nonmixed problems (n = 7) and .91 for mixed problems (n = 7). For the variance-based index the correlations are only .47 and .63, respectively. Thus, the results of the task consistency experiments cannot be interpreted by a parsimonious model resting just on variance differences. Rather, two additional assumptions must be made: (a) subjective risk differences decrease with the distance from the certain outcome having the same expected value as the gamble, and (b) two constructs of risk acceptance should be assumed: one for gain or loss domain problems and a unique construct for mixed gambles.

4.2. A similarity-based index

An alternative model for the results of the current experiments involves the postulated effect of the similarity of payoffs on behavioral consistency (Altman, Bercovici-Boden, & Tennenholtz, 2006; Michalski, Carbonell, & Mitchell, 1986). For example, the gambles in Table 1 are highly similar across the gain and loss domains: They use the same payoff magnitudes and are differentiated only by their payoff sign. One could argue that this might have led to the behavioral consistency across domains in Ert and Yechiam (2010). To examine whether these experimental results are driven simply by the similarity of the payoffs in each condition, we examined the predictions of a model assuming that people are consistent whenever outcomes are similar (following the approach of Michalski et al., 1986).
The quantitative predictions of two versions of a similarity-based model were tested. Under one variant of the model it is assumed that the participants can identify the safe and risky alternative, and mere similarity in payoffs between the two alternatives drives the consistency results (this will be referred to as the pure-similarity model); formally:

C = −Σ (X_i − X_j)        (3)
where C is the similarity-based consistency prediction, and X denotes the outcome vectors for choice problems i and j (in each problem, the outcomes were ordered from the safe to the risky alternative and from the highest to the lowest payoff, and the similarity of each pair of outcomes was calculated). A second variant of the model was examined under which there is insensitivity to the payoff sign. Under this model:

C = −Σ (|X_i| − |X_j|)        (4)
Namely, whenever the same outcome is presented in terms of gain or loss it is considered identical (this will be referred to as the absolute-similarity model). As in the models described above, the problems presented in Tables 1–3 were rearranged into 18 pairs, and the fit of the two similarity-based models was determined by calculating the correlations between the similarity of each pair and its consistency in risk level. For the pure-similarity model, a correlation of only .17 was found. For the absolute-similarity model, assuming that rewards and penalties of equal magnitudes are treated alike, the results were even worse, indicating a correlation of .02 between similarity and consistency. These correlations were not significantly improved by dividing the problems into pure and mixed gains and losses. Thus, the results suggest that what drives the consistency findings is not mere similarity of payoffs. Rather, as noted above, the differences in uncertainty and the distance from certainty provide a better index for individual consistency.
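For completeness, the sketch below implements the indices compared in this section: the RDS of Eqs. 1 and 2 and the pure- and absolute-similarity measures of Eqs. 3 and 4, applied to one illustrative problem pair. The gamble encoding, the choice of which option's outcomes enter the denominator of Eq. 1, and the example pair are assumptions made for illustration.

```python
# Sketch of the indices compared in Section 4: the Risk-Difference Signal
# (Eqs. 1-2) and the pure- and absolute-similarity measures (Eqs. 3-4).
# Gambles are [(probability, outcome), ...]; the example pair is illustrative.
import math

def sd(gamble):
    mean = sum(p * x for p, x in gamble)
    return math.sqrt(sum(p * (x - mean) ** 2 for p, x in gamble))

def scaled_deviation(gamble):
    """Denominator of Eq. 1: sum over outcomes of |p_i (x_i - mean)|."""
    mean = sum(p * x for p, x in gamble)
    return sum(abs(p * (x - mean)) for p, x in gamble)

def rds(problem):
    # Eq. 1, reading the denominator as computed over the risky option's outcomes.
    safe, risky = problem
    return (sd(risky) - sd(safe)) / scaled_deviation(risky)

def rds_consistency(problem1, problem2):
    return rds(problem1) * rds(problem2)                      # Eq. 2: C = S1 * S2

def ordered_outcomes(problem):
    safe, risky = problem
    return (sorted((x for _, x in safe), reverse=True)
            + sorted((x for _, x in risky), reverse=True))    # safe first, high to low

def pure_similarity(problem1, problem2):                      # Eq. 3
    return -sum(a - b for a, b in zip(ordered_outcomes(problem1), ordered_outcomes(problem2)))

def absolute_similarity(problem1, problem2):                  # Eq. 4
    return -sum(abs(a) - abs(b) for a, b in zip(ordered_outcomes(problem1), ordered_outcomes(problem2)))

gain_au = ([(1.0, 600)],  [(0.5, 1200), (0.5, 0)])    # Gain, Avoidable Uncertainty (Table 1)
loss_au = ([(1.0, -600)], [(0.5, -1200), (0.5, 0)])   # Loss, Avoidable Uncertainty (Table 1)

print("RDS-predicted consistency:", round(rds_consistency(gain_au, loss_au), 2))
print("pure similarity:", pure_similarity(gain_au, loss_au))
print("absolute similarity:", absolute_similarity(gain_au, loss_au))
```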
5. Discussion

In his 1993 study, Schoemaker made the observation that ''Behavioral decision theory research on risk-taking has largely abandoned the trait approach in favor of situational and information processing models (e.g., Goldstein & Einhorn, 1987; Payne, 1973; Tversky & Kahneman, 1974; Tversky, Sattath, & Slovic, 1988). Risk-taking is considered to be mostly a function of the task, people's decision frames, and their information processing strategies, rather than a function of individual predispositions.'' In other areas of psychology, by contrast, ''the pendulum appears to be swinging back'' (Schoemaker, 1993), and the trait approach has again become popular with the introduction of frameworks such as the Five Factor Model (Goldberg, 1993; McCrae & Costa, 1987) and the Three Factor Model (Eysenck & Eysenck, 1985), which combine individual traits into composites or aggregates and exhibit correlations of up to .7 with composite measures of behavior across situations (see, e.g., Kenrick & Funder, 1988).
The current review and analysis suggests two reasons for the failure of the trait approach in decision making. The first reason is the use of only a single approach for examining consistency across situations. As we have seen, the predictions of prospect theory (Kahneman & Tversky, 1979) alone did not produce constructs that are consistent (e.g., diminishing sensitivity and the sensitivity to losses vs. gains). The predictions of the risk acceptance model (Brachinger & Weber, 1997) alone were also not sufficient to explain the entire set of data. The results showed that only an integrated approach, using two risk acceptance constructs for pure gain ⁄ loss domains or mixed domains was adequate for explaining the emergence of consistency in risk-taking behavior. The results of our survey of decisions under risk suggest another reason for the failure of previous studies in decision making to identify reliable behavioral traits, related to the way that information concerning choice outcomes is conveyed. The most prominent paradigm in decision-making science is the description-based paradigm (Weber et al., 2004). Yet, in our survey, as in Schoemaker (1990), individuals exhibited very little consistency between description-based problems presented in different domains. In the reviewed studies, consistency was only exhibited for the experience-based paradigm. These results suggest that previous attempts to find consistency in risk taking at the individual level using behavioral paradigms (Schoemaker, 1990) or models (Keller, 1985; Schneider & Lopes, 1986) might have failed because of their choice of the description-based decision paradigm. The reasons for the effect of task type on consistency are yet to be clarified. One possible explanation for these differences is that when risk is more explicit (such as while choosing between described gambles) people exhibit more social desirability, making individual data harder to evaluate (Koritzky & Yechiam, 2010). Another explanation is that experience-based decisions trigger ‘‘hot’’ decision-making processes that more closely approximate decision makers’ behavior in everyday situations, and their behavioral traits as well (Figner, Mackinlay, Wilkening, & Weber, 2009; Weller, Levin, Shiv, & Bechara, 2007). The most important finding reviewed here in experience-based decisions is that people exhibit positive consistency between the gain and loss domains when making a choice between a constant outcome and a probabilistic one. We view this finding as an example of a more general factor modulating individual consistencies in risk attitude, involving the sensitivity to differences in certainty, with the case of certainty versus uncertainty being an extreme contrast along this axis. The suggestion that risk acceptance across the gain and loss domains is consistent was also useful for predicting the emergence of consistency across different sessions (Levin et al., 2007; Yechiam, 2010). The theoretical importance of the positive consistency is that its existence contradicts the prediction based on diminishing sensitivity, which implies a negative correlation across domains (due to the reflection effect, as explained above). Future studies should elaborate the conditions for perceiving differences in risk level that lead to consistent behavior. The current modeling results show that the emergence of such differences cannot be summarized under a simple model based on differences in the variance of the relevant gambles. 
The (Avoidable and Unavoidable) Loss conditions in Experiment 2 had the same difference in variance, but the consistency was substantially different when one outcome was zero and another was not. To account for this finding, it must be
assumed that the consistency decreases as a function of the distance from certainty. This finding has implications for the definition of risk and variance, and it suggests, as previously argued (Weber et al., 2004), that variance alone does not drive the subjective feeling of risk. The assumption that risk level is also affected by the distance of the payoffs from certainty was embedded in the Risk Difference Signal (RDS) index and enabled it to predict the emergence of consistency across different pairs of choice problems. Future tests of this model can examine the consistency of risk taking in situations involving unavoidable risk with larger differences in risk level. The RDS index predicts the emergence of consistency in this situation as well. The current review thus suggests that the search for trait-like constructs in behavioral decision making may be productive, though it leads to different theoretical conclusions from those based on the study of decision making at the population level. For example, we replicated the reflection effect at the aggregate level, but we showed that at the individual-consistency level it does not exist. We believe that for research into the traits of decision making to continue productively, such dualities must be accepted rather than seen as counterexamples. It appears that there is a substantial gap between the variables affecting decision making at the population level and the consistent traits of decision making.
Notes

1. Accordingly, in the problem presented above (choice between zero and a gamble offering an equal chance to win or lose $100) the risk acceptance approach predicts that people would behave consistently because it involves a dilemma between certain and uncertain outcomes. This particular choice problem therefore does not distinguish the predictions of the risk-as-variance and risk acceptance approaches.
2. Of course, additional components proposed in prospect theory reflect biases in the perception of probabilities. In the current paper, we keep the objective probability constant across the studies and focus on the perception of values as a starting point for assessing the psychological constructs implicated in risk sensitivity.
3. Diminishing sensitivity also does not imply consistency across problems because in mixed gambles, which involve both gains and losses, the outcomes are of the same size and thus are equally discounted irrespective of whether there is high or low discounting of large payoffs.
4. The findings at the aggregate level show that people took more risk in the loss domain than in the gain domain—t(39) = 3.98, p < .001. Thus, interestingly, at the population level the reflection effect was substantiated. There were no significant differences in risk taking between the Avoidable Uncertainty and the Unavoidable Uncertainty conditions—t(39) = 1.41, NS.
5. The experiment described in Section 3.2 replicated this finding with problems having the exact same difference in variance.
6. Still, for the children at least, there was significant consistency also in the diminishing sensitivity construct. Another relevant finding is that a reanalysis of the correlation
between the Gain and Loss domain conditions of this study reveals a positive correlation as in Ert and Yechiam (2010).
7. In our simple examples, the probabilities are identical for the different outcomes and therefore the expected value equals the average value of the different outcomes produced by a given gamble.
References Ahn, W. Y., Busemeyer, J. R., Wagenmakers, E. J., & Stout, J. C. (2008). Comparison of decision learning models using the generalization criterion method. Cognitive Science, 32, 1376–1402. Altman, A., Bercovici-Boden, A., & Tennenholtz, M. (2006). Learning in one shot strategic form games. In J. Fu¨rnkranz, T. Scheffer, & M. Spiliopoulou (Eds.), Machine learning: EMCL 2006 (pp. 6–17). Berlin: Springer-Verlag. Ashton, M. C., & Vernon, P. A. (1995). Verbal and spatial abilities are uncorrelated when g is controlled. Personality and Individual Differences, 19, 399–401. Birnbaum, M. H., & Bahra, J. P. (2007). Gain-loss separability and coalescing in risky decision making. Management Science, 53, 1016–1028. Brachinger, H. W., & Weber, M. (1997). Risk as primitive: A survey of measures of perceived risk. OR Spectrum, 19, 235–250. Busemeyer, J. R., & Stout, J. C. (2002). A contribution of cognitive decision models to clinical assessment: Decomposing performance on the Bechara gambling task. Psychological Assessment, 14, 253–262. Deinzer, R., Steyer, R., Eid, M., Notz, P., Schwenkmezger, P., Ostendorf, F., & Neubauer, A. (1995). Situational effects in trait assessment: The FPI, NEOFFI, and EPI questionnaires. European Journal of Personality, 9, 1–23. Erev, I., Ert, E., & Yechiam, E. (2008). Loss aversion, diminishing sensitivity, and the effect of experience on repeated decisions. Journal of Behavioral Decision Making, 21, 575–597. Ert, E., & Erev, I. (2008). The rejection of attractive gambles, loss aversion, and the lemon avoidance heuristic. Journal of Economic Psychology, 29, 715–723. Ert, E., & Erev, I. (2010). On the descriptive value of loss aversion in decisions under risk. Harvard Business School working paper 10-056. Ert, E., & Yechiam, E. (2010). Consistent constructs in individuals’ risk taking in decisions from experience. Acta Psychologica, 134, 225–232. Eysenck, H. J., & Eysenck, M. W. (1985). Personality and individual differences: A natural science approach. New York: Plenum. Figner, B., Mackinlay, R. J., Wilkening, F., & Weber, E. U. (2009). Hot and cold cognition in risky decision making: accounting for age and gender differences in risk taking. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35, 709–730. Goldberg, L. R. (1993). The structure of phenotypic personality traits. American Psychologist, 48, 26–34. Goldstein, W. M., & Einhorn, H. (1987). A theory of preference reversals. Psychological Review, 94, 236–254. Hendricks, W. A., & Robey, K. W. (1936). The sampling distribution of the coefficient of variation. The Annals of Mathematical Statistics, 7, 129–132. Hertwig, R., Barron, G., Weber, E. U., & Erev, I. (2004). Decisions from experience and the effect of rare events in risky choice. Psychological Science, 15, 534–539. Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47, 263–291. Kassin, S. (2003). Psychology. Englewood Cliffs, NJ: Prentice-Hall, Inc. Keller, R. L. (1985). An Empirical Investigation of relative risk aversion. IEEE Transaction on Systems, Man, and Cybernetics, 15, 475–482.
Kenrick, T., & Funder, D. C. (1988). Profiting from controversy: Lessons from the person-situation debate. American Psychologist, 43, 23–34. Koritzky, G., & Yechiam, E. (2010). On the robustness of decision tasks to response distortion. Journal of Behavioral Decision Making, 23, 83–99. Levin, I. P., Hart, S. S., Weller, J. A., & Harshman, L. A. (2007). Stability of choices in a risky decision making task: A 3-year longitudinal study. Journal of Behavioral Decision Making, 20, 241–252. Markowitz, H. M. (1952). Portfolio selection. Journal of Finance, 7, 77–91. McCrae, R. R., & Costa, P. T. Jr (1987). Validation of the five-factor model of personality across instruments and observers. Journal of Personality and Social Psychology, 52, 81–90. Michalski, R., Carbonell, J., & Mitchell, T. (1986). Machine learning, Vol II. Los Altos, CA: Morgan Kauffman. Payne, J. W. (1973). Alternative approaches to decision making under risk: Moments vs. risk dimensions. Psychological Bulletin, 80, 493–553. Pratt, J. W. (1964). Risk aversion in the small and in the large. Econometrica, 32, 122–136. Preuschoff, K., Bossaerts, P., & Quartz, S. R. (2006). Neural differentiation of expected reward and risk in human subcortical structures. Neuron, 51, 381–390. Schneider, S. L., & Lopes, L. L. (1986). Reflection in preferences under risk: Who and when may suggest why. Journal of Experimental Psychology: Human Perception and Performance, 12, 535–548. Schoemaker, P. J. H. (1990). Are risk-attitudes related across domains and response modes? Management Science, 36, 1451–1463. Schoemaker, P. J. H. (1993). Determinants of risk-taking: Behavioral and economic views. Journal of Risk and Uncertainty, 6, 49–73. Stevens, S. S. (1957). On the psychophysical law. Psychological Review, 64, 153–181. Tett, R. P., & Guterman, H. A. (2000). Situation trait relevance, trait expression, and cross-situational consistency: Testing a principle of trait activation. Journal of Research in Personality, 34, 397–423. Tversky, A., & Kahneman, D. (1974). Judgments under uncertainty: Heuristics and biases. Science, 185, 1124– 1131. Tversky, A., Sattath, S., & Slovic, E. (1988). Contingent weighting in judgment and choice. Psychological Review, 95, 371–384. Weber, E. U., Shafir, S., & Blais, A.-R. (2004). Predicting risk-sensitivity in humans and lower animals: Risk as variance or coefficient of variation. Psychological Review, 111, 430–445. Weller, J. A., Levin, I. P., Shiv, B., & Bechara, A. (2007). Neural correlates of adaptive decision making for risky gains and losses. Psychological Science, 18, 958–964. Worthy, D. A., Maddox, W. T., & Markman, A. B. (2007). Regulatory fit effects in a choice task. Psychonomic Bulletin and Review, 14, 1125–1132. Yechiam, E. (2010). Losses induce consistency in risk taking even without loss aversion. Technion working paper.
Topics in Cognitive Science 3 (2011) 187–196 Copyright © 2011 Cognitive Science Society, Inc. All rights reserved. ISSN: 1756-8757 print / 1756-8765 online DOI: 10.1111/j.1756-8765.2010.01123.x
Homo heuristicus Outnumbered: Comment on Gigerenzer and Brighton (2009)
Benjamin E. Hilbig,a Tobias Richterb
a University of Mannheim and Max-Planck Institute for Research on Collective Goods
b Department of Psychology, University of Kassel
Received 8 January 2010; received in revised form 19 April 2010; accepted 16 May 2010
Abstract Gigerenzer and Brighton (2009) have argued for a ‘‘Homo heuristicus’’ view of judgment and decision making, claiming that there is evidence for a majority of individuals using fast and frugal heuristics. In this vein, they criticize previous studies that tested the descriptive adequacy of some of these heuristics. In addition, they provide a reanalysis of experimental data on the recognition heuristic that allegedly supports Gigerenzer and Brighton’s view of pervasive reliance on heuristics. However, their arguments and reanalyses are both conceptually and methodologically problematic. We provide counterarguments and a reanalysis of the data considered by Gigerenzer and Brighton. Results clearly replicate previous findings, which are at odds with the claim that simple heuristics provide a general description of inferences for a majority of decision makers. Keywords: Fast and frugal heuristics; Adaptive toolbox; Recognition heuristic; Formal modeling; Multinomial processing tree model
Correspondence should be sent to Benjamin E. Hilbig, Psychology III, University of Mannheim, Schloss Ehrenhof Ost, 68131 Mannheim, Germany. E-mail: [email protected]

1. Introduction

In their review of work on the adaptive toolbox of fast and frugal heuristics, Gigerenzer and Brighton (2009) provided a critical discussion of empirical evidence and the methodology that has been used to investigate the assumed noncompensatory nature of these heuristics. One cornerstone of their discussion is a reanalysis of data from an experiment by Richter and Späth (2006, Experiment 3) on the use of recognition and further knowledge in comparative judgments. In this experiment, in which German students were presented with pairs of names of U.S.-American cities with the task of choosing the more populous city,
recognition and task-relevant knowledge were varied. In line with the predictions of the recognition heuristic (Goldstein & Gigerenzer, 2002), participants mostly chose recognized cities over unrecognized ones. However, these recognition effects were partly compensated by task-relevant knowledge, a result that conflicts with the claim of noncompensatory reliance on recognition. Whereas Richter and Späth concluded from their results that recognition information is not generally used in a noncompensatory fashion but is integrated with further knowledge (for similar conclusions, see Bröder & Eichler, 2006; Newell & Fernandez, 2006; Pohl, 2006), Gigerenzer and Brighton arrive at a contrary interpretation of the data. They argue that, when analyzed appropriately at the individual level, the data show ''that a majority of participants consistently followed the recognition heuristic in the presence of conflicting cues'' (p. 134). We believe this interpretation to be conceptually and methodologically flawed. Given the centrality of the recognition heuristic for the adaptive toolbox approach and the attention that the claim of its noncompensatory nature has attracted in the field, we feel that Gigerenzer and Brighton's claims should not be left undisputed. In this comment, we will focus on a conceptual ambiguity and, more importantly, a methodological flaw. With respect to the latter, we will present a reanalysis of the data from Richter and Späth based on a formal measurement model specifically developed to estimate the degree to which decision makers rely on recognition in a noncompensatory fashion.
2. Conceptual problem: How can use of the recognition heuristic depend on the recognition validity?

Across heterogeneous judgment domains (population of animal species, safety of air carriers, population of American cities), the three experiments reported by Richter and Späth (2006) consistently suggested that cues beyond recognition are considered. Yet Gigerenzer and Brighton (2009) dismissed the evidence from two of the three experiments as irrelevant for their theory by arguing that in these experiments, ''the recognition validity was unknown or low'' (p. 133). However, the original theory of the recognition heuristic (Goldstein & Gigerenzer, 2002) does not entail the explicit assumption that the recognition heuristic is (only) used if the recognition validity in a given domain is high. Of course, Goldstein and Gigerenzer (2002) provide the normative observation that the recognition heuristic is useful only if this is the case. However, such a normative fact does not necessarily imply the descriptive claim that the recognition heuristic is applied if and only if the recognition validity is high. This ambiguity notwithstanding, it has been shown empirically that decision makers will indeed refrain from relying on the recognition cue when it is invalid (Hilbig, Erdfelder, & Pohl, 2010; Pohl, 2006). At the same time, it is unclear how such an adaptive reliance on recognition is actually achieved by decision makers (an instance of the ''strategy selection problem,'' cf. Glöckner & Betsch, 2010). At least, it appears that more complex processes beyond the simple search, stopping, and decision rules of the recognition heuristic would be necessary. Stated differently, the much-acclaimed simplicity and precision of the
recognition heuristic wane as long as there is no specification of how exactly decision makers are expected to know the recognition validity (of any possible domain) and, thereby, when to rely on the recognition heuristic. Given these open questions, it seems somewhat harsh to entirely dismiss experiments that entail low recognition validity.
3. Methodological problem: Choosing the recognized alternative is not equivalent to using the recognition heuristic

In reanalyzing the data from Richter and Späth (2006), Gigerenzer and Brighton (2009) focused on critical trials in which one alternative was recognized and the other was not. They presented the number of trials in which the recognized alternative was chosen by each participant (a measure sometimes called the adherence rate), broken down by whether or not further knowledge argued against the recognized alternative. Based on these data, they concluded that ''even in the critical condition […], the majority of participants consistently followed the recognition heuristic'' (p. 134). To the extent that this is meant to imply ''a majority of participants consistently used the recognition heuristic,'' this conclusion represents a logical fallacy (petitio principii) because it presupposes the very assumption that is in dispute, viz. that the recognition heuristic is in fact operative whenever participants choose the recognized over the unrecognized alternative. In general, it is easy to empirically demonstrate high adherence rates to any heuristic that relies on a cue with above-chance validity (Hilbig, in press). This is due to the confound between the cue in question and other pieces of information pointing in the same direction: In the experiment by Richter and Späth (2006), participants recognizing a city (e.g., Boston) were also likely to know cues that argue for the size of this city (e.g., Boston has well-known universities, Boston is located in the relatively more populated Northeast of the United States, etc.), thus potentially resulting in choice of the recognized city. As a consequence, the adherence rate is not a valid indicator of the extent to which people use the recognition heuristic (RH-use) because it severely overestimates use of noncompensatory heuristics in general (Hilbig, 2008b). For this reason, our own reanalysis of the data relies on the estimate of RH-use provided by the r-model (Hilbig et al., 2010). This model and the results of our reanalysis are described in the remainder of this comment.
4. A model-based reanalysis of the data from Richter and Späth (2006)

A measurement model of the recognition heuristic, named the r-model, was recently developed by Hilbig et al. (2010) and is depicted in Fig. 1. Based on observed categorical data (i.e., choices), this multinomial processing tree model (Erdfelder et al., 2009) provides an estimate of one parameter representing single-cue reliance on recognition alone (probability r), as proposed by the recognition heuristic. In addition, other parameters, which stand for the recognition validity a, the knowledge validity b, and the probability of valid guesses g, are estimated.
Fig. 1. The r-model in the form of processing trees. Three cases are distinguished: (A) both objects are recognized, (B) neither is recognized, or (C) exactly one is recognized. The parameter a represents the recognition validity (probability of the recognized object representing the correct choice), b stands for the knowledge validity (probability of valid knowledge), g is the probability of a correct guess and, most importantly, r denotes the probability of using the recognition heuristic (following the recognition cue while ignoring any knowledge beyond recognition).
The basic idea of the r-model—and its main advantage over measures such as adherence rates—is to disentangle the use of recognition and additional knowledge in comparative judgments concerning pairs where one object is recognized and the other one is not (Fig. 1, case C). To this end, the knowledge validity b (i.e., the probability of valid knowledge) is also estimated from judgments concerning pairs where both objects are recognized (Fig. 1, case A) and where, therefore, the recognition heuristic cannot be applied. The logic of the r-model is simple. Consider a participant who has to make a comparative judgment between two alternatives of which she recognizes only one (Fig. 1, case C). In this situation, she can either use the recognition heuristic, which will occur with probability r, or she can consider additional knowledge or information, which will happen with probability 1 − r. If the participant uses the recognition heuristic and thus chooses the recognized object, her judgment will be correct with probability a, that is, the recognition validity. If she considers additional knowledge, her judgment will be correct with probability b.
In that case, valid knowledge will lead to a correct choice, which can, in fact, mean choosing either the recognized or the unrecognized of the two objects—depending on which represents a correct judgment in the current pair. Within the r-model, the recognition heuristic can be implemented as a submodel by fixing the r parameter to 1. The r-model has been shown to fit empirical data well, and the psychological meaning of the central model parameter r has been validated experimentally (Hilbig et al., 2010). Moreover, simulations revealed that the r-model provides the best estimate of RH-use currently available (Hilbig, 2010). Consequently, we used it for a reanalysis of the data from Richter and Späth (2006) at both the aggregate and the individual level.
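To make the structure of the model concrete, the following sketch (ours, not code from Hilbig et al., 2010 or the multiTree tool) writes out the outcome-category probabilities implied by the processing trees in Fig. 1; the parameter names r, a, b, and g follow the text, and the function name is our own.

```python
def r_model_probabilities(r, a, b, g):
    """Outcome-category probabilities implied by the r-model (cf. Fig. 1).

    r: probability of using the recognition heuristic (relevant in case C only)
    a: recognition validity; b: knowledge validity; g: probability of a valid guess
    """
    return {
        # Case A: both objects recognized -- knowledge decides.
        "A_correct": b,
        "A_false": 1 - b,
        # Case B: neither object recognized -- guessing.
        "B_correct": g,
        "B_false": 1 - g,
        # Case C: exactly one object recognized.
        "C_correct_recognized": r * a + (1 - r) * b * a,
        "C_correct_unrecognized": (1 - r) * b * (1 - a),
        "C_false_unrecognized": (1 - r) * (1 - b) * a,
        "C_false_recognized": r * (1 - a) + (1 - r) * (1 - b) * (1 - a),
    }

# With the aggregate estimates reported in Section 6.1 (r = .80, a = .88,
# b = .59, g = .49), the implied probability of choosing the recognized object
# in case C is about .91, illustrating how a high adherence rate can coexist
# with r < 1.
p = r_model_probabilities(r=0.80, a=0.88, b=0.59, g=0.49)
adherence_case_c = p["C_correct_recognized"] + p["C_false_recognized"]  # ~0.91
```

This is the sense in which the adherence rate overestimates RH-use: choices of the recognized object also arise on the knowledge branch of the tree whenever that knowledge points toward the recognized object.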
5. Method

In Richter and Späth's data, we determined for each pair of cities presented to participants which of the two cities a participant had reported to recognize. Each pair could thus be sorted into one of the three trees of the r-model (cases A, B, or C, respectively). Next, it was determined for each pair which of the two cities represented the factually correct option with respect to the judgment criterion, that is, city population. Thereby, each choice could be classified as correct or false. Finally, in cases in which only one city was reported to be recognized, we determined whether the recognized (or the unrecognized) of the two cities had been chosen, that is, judged to be more populous. With these three steps, every choice in Richter and Späth's data was sorted into one of the eight possible outcome categories shown in the terminal branches of Fig. 1. As is typical in multinomial modeling, parameter estimates were obtained by minimizing the asymptotically χ²-distributed log-likelihood ratio statistic G² through the EM algorithm (Hu & Batchelder, 1994). In a nutshell, this maximum-likelihood procedure searches the parameter space for the set of parameters that minimizes the distance between observed and expected category frequencies (in the current case, the eight choice categories described above). Parameter estimates and model fit statistics for the r-model were obtained using the multiTree software tool (Moshagen, 2010). Model fits were tested by means of the goodness-of-fit statistic G², and differences between nested models were tested with the corresponding χ²-difference test for changes in model fit (ΔG²). Nonnested models were compared using the Bayesian Information Criterion (BIC; e.g., Myung, 2000), from which the Bayesian posterior probability of model superiority (given the data) can be estimated, assuming equal priors (Wagenmakers, 2007).
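The following sketch illustrates the fit and comparison statistics just described; it is not a reimplementation of the EM-based estimation performed by multiTree, and the function names are ours.

```python
import numpy as np

def g_squared(observed, expected):
    # G^2 = 2 * sum(O * ln(O / E)); categories with zero observed counts
    # contribute nothing to the sum.
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    m = observed > 0
    return 2.0 * float(np.sum(observed[m] * np.log(observed[m] / expected[m])))

def bic_from_g2(g2, n_params, n_observations):
    # A G^2-based BIC; terms shared by models fitted to the same data cancel
    # when two BIC values are compared.
    return g2 + n_params * np.log(n_observations)

def posterior_prob_model1(bic1, bic2):
    # Posterior probability of model 1 under equal priors, using the BIC
    # approximation to the Bayes factor (Wagenmakers, 2007).
    return 1.0 / (1.0 + np.exp(-0.5 * (bic2 - bic1)))
```

For example, a BIC difference of 10 points (as in the aggregate comparison reported below, 7119 vs. 7129) corresponds to a posterior probability of about .99 for the better-fitting model.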
6. Results and discussion

6.1. Aggregate analyses

On the aggregate level (across all choices and participants), the r-model accounted for the data well, as shown by a satisfactory fit of G²(1) = 1.6, p = .20.
The obtained parameter estimates were a = .88 (SE = 0.01), b = .59 (SE = 0.01), g = .49 (SE = 0.02) and, most importantly, r = .80 (SE = 0.01). As such, the estimated probability of true RH-use was substantial, though significantly smaller than implied by the adherence rate (M = 0.91), ΔG²(1) = 109.5, p < .001. Once again, this finding confirms that RH-use is overestimated by adherence rates. As the previous findings imply, a strict and deterministic version of the recognition heuristic (fixing r = 1) also failed to account for the data and produced severe misfit (p < .001). However, such a deterministic understanding of the recognition heuristic may be seen as unfair, as strategy execution errors must be expected (e.g., Bröder & Schiffer, 2003). As a consequence, we next implemented the recognition heuristic in a probabilistic rather than a deterministic way (cf. Hilbig et al., 2010): First, we added an error parameter f to each terminal branch of the original r-model, thus implementing a naïve error theory (Rieskamp, 2008). This extended r-model was then compared to a submodel with r fixed at 1, which represents a probabilistic version of the recognition heuristic. Comparing these models revealed that the probabilistic recognition heuristic submodel needed an average error of f = 0.09 (SE = 0.01) to account for the data. Nevertheless, it still fit the data worse than the extended r-model, ΔG²(1) = 10.1, p = .001. As such, even a probabilistic version of the recognition heuristic could not account for the data as well as a model implying that the recognition cue is only sometimes considered in isolation (r < 1). Finally, we compared the original r-model (without any error parameter) to the probabilistic recognition heuristic submodel (as before, fixing r = 1 and adding an error parameter f). As these are nonnested models, we based the model comparison on the BIC, which was 7119 for the original r-model and 7129 for the probabilistic recognition heuristic submodel. As such, the r-model was superior, while both comprise the exact same number of free parameters; the Bayesian posterior probability (given the data, assuming equal priors) of the r-model as compared to the probabilistic implementation of the recognition heuristic was .99, which can be understood as very strong evidence against the latter (Wagenmakers, 2007). In sum, single-cue reliance on recognition did occur in a substantial proportion of cases, which is plausible given the extremely high recognition validity. However, the model-based aggregate analyses revealed that the data could not be adequately explained by the recognition heuristic, not even when implemented probabilistically. A model in which the recognition cue is only sometimes considered in isolation generally fit the data better. This model is in line with compensatory theories, which propose that the recognition cue is integrated with other cues (if available). The findings thus mirror previous investigations with other data sets (Hilbig et al., 2010) and confirm Richter and Späth's (2006) original conclusion that even though recognition is indubitably a very prominent cue, it is ''used as one cue among others.''

6.2. Individual analyses

In accordance with the arguments put forward by Gigerenzer and Brighton (2009), we next analyzed the choice data of each individual separately, again using the r-model.
The results are displayed in Fig. 2, in which the gray bar indicates the corresponding individual estimate of r as compared to the individual adherence rate (white bar) for each participant. As can be seen, RH-use was less likely than implied by the adherence rate for practically every participant. Stated differently, only a small number of participants consistently relied on the recognition cue in isolation. Most participants, by contrast, refrained from doing so in a nontrivial proportion of cases, so that their estimated RH-use was clearly lower than implied by their individual adherence rates—the measure on which Gigerenzer and Brighton (2009) based their conclusions. To test, on the individual level, how many participants might be classified as users of the recognition heuristic, we first used the procedure described in Hilbig et al. (2010): Taking the average parameter estimates as the alternative hypothesis (H1), we performed a power analysis (Erdfelder, Faul, & Buchner, 2005) to determine the value of r0 which implied a power of 1 − β = .95 for testing the original r-model against the recognition heuristic submodel with r0 fixed accordingly. The resulting parameter value of r0 was .96. So, for each participant, we compared the fit of the original r-model (with no constraint on r) to a submodel fixing r = .96, which represents the recognition heuristic. Comparing the models showed that for 20 out of the 28 participants (71%, see Fig. 2), the submodel of the recognition heuristic fit significantly (p < .05) worse than the r-model.
Fig. 2. Individual probability of RH-use as estimated by the r-parameter (gray bar including one standard error of the parameter estimate, cf. Moshagen, 2010) and by the individual adherence rate to the predictions of the recognition heuristic (white bar).
For these participants, the probability of considering recognition only was significantly smaller than the critical value of .96. Stated differently, this majority of decision makers used the recognition heuristic too rarely to be classified as consistent RH-users. However, one may once more argue that r = .96 (without any strategy execution error) is too strict an implementation of the recognition heuristic. Therefore, mirroring the aggregate analyses reported above, we compared the original r-model (no constraint on r, no error parameter) to the probabilistic recognition heuristic submodel (fixing r = 1 and adding an error parameter f) on the individual level. Again, these are nonnested models and were consequently compared using the BIC. We found that for nine participants (32%) the probabilistic recognition heuristic was the superior model (yielding the smaller BIC value), for seven (25%) neither model performed better, and for 12 (43%) the r-model was to be preferred. So, even when implementing a probabilistic version of the recognition heuristic and comparing models at the individual level, more participants were classified as RH-nonusers than as RH-users. Finally, additional evidence was obtained from a model-free analysis using the individual discrimination index (DI; Hilbig & Pohl, 2008), which is defined as the difference in adherence rates depending on whether using recognition implies a correct versus a false inference. Any participant reliably discriminating such cases cannot have relied on recognition alone. Analyses revealed that 11 participants (39%) had a DI score within the 95% confidence interval of zero, thus qualifying as potential users of the recognition heuristic. Conversely, the remaining 17 participants consistently discriminated whether recognition led to a correct inference, which is incompatible with the assumption of one-reason decision making as implied by the recognition heuristic (Hilbig, Pohl, & Bröder, 2009).
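As an illustration of this model-free measure, the sketch below computes the DI for a single participant's critical trials; the function and the boolean input arrays are ours and hypothetical, not part of the original analysis.

```python
import numpy as np

def discrimination_index(chose_recognized, recognition_correct):
    # DI (Hilbig & Pohl, 2008): adherence rate in trials where choosing the
    # recognized object is correct minus adherence rate in trials where it
    # is false. Inputs are boolean arrays over one participant's critical
    # trials (exactly one object recognized).
    chose = np.asarray(chose_recognized, dtype=bool)
    correct = np.asarray(recognition_correct, dtype=bool)
    return float(chose[correct].mean() - chose[~correct].mean())
```

A decision maker relying on recognition alone cannot tell these two kinds of trials apart, so the DI should lie near zero; a reliably positive DI indicates that knowledge beyond recognition informed the choices.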
7. Conclusions

In our comment, we focused on a conceptual and a methodological drawback inherent in Gigerenzer and Brighton's (2009) critique of Richter and Späth (2006). We argued that the recognition heuristic cannot simply be taken to depend on the recognition validity without further specification, and that adherence to recognition is not a valid measure of RH-use. In order to obtain a valid and unbiased estimate of RH-use, we applied the r-model (Hilbig et al., 2010) in a reanalysis of Richter and Späth's (Experiment 3) data. Both aggregate- and individual-level results showed that the recognition heuristic cannot adequately account for the choice data—which held for the majority of participants. These findings are in line with previous experiments, all of which cast doubt on the recognition heuristic and other heuristics as general accounts of judgment and decision making (for an overview, see Hilbig, in press). This is especially noteworthy given that Gigerenzer and Brighton (2009) consider these data ''the perfect test for the heuristic'' (p. 133). Indeed, with a recognition validity of .88 in the current data set, it is hard to imagine how any further cues should override the recognition cue particularly often. Given the large recognition validity and (by comparison) low knowledge validity of .59, most alternative models (e.g., Glöckner & Betsch, 2008a; Newell & Lee, in press) must necessarily predict choices that frequently resemble RH-use
(Glöckner, 2009). Even so, the assumption of consistent isolated reliance on the recognition cue was rejected for a majority of participants. Importantly, such consistency (i.e., an estimate of the r parameter close to 1) is a necessary precondition for the much-acclaimed less-is-more effect (Hilbig et al., 2010), the occurrence of which, in turn, is a cornerstone of Gigerenzer and Brighton's (2009) general argument. Of course, the results also show that RH-use does occur in a substantial proportion of cases. Likewise, there are individuals who seem more prone to apply this strategy (cf. Hilbig, 2008a). As a consequence, further research is needed to uncover its situational and individual determinants. Such a research agenda has been fruitfully pursued in the past for other heuristics (Bröder, in press; Bröder & Newell, 2008). To conclude, advertising pervasive use of fast and frugal heuristics is simply not warranted given the empirical data (Hilbig, in press). Reanalyses of selected experimental conditions from single studies using biased measures are unlikely to change this fact. As alternatives to the adaptive toolbox approach, several process models have been suggested (for an overview, see Glöckner & Witteman, 2010) and successfully tested (e.g., Glöckner & Betsch, 2008b; Hilbig & Pohl, 2009; Newell & Lee, in press). Thus, we are wary of accepting a ''homo heuristicus'' view of human decision making, given that fast and frugal heuristics are only used consistently by a minority of decision makers.
References Bro¨der, A. (in press). The quest for Take the Best: Insights and outlooks from experimental research. In P. Todd, G. Gigerenzer, & the ABC Research Group (Eds.), Ecological rationality: Intelligence in the world. New York: Oxford University Press. Bro¨der, A., & Eichler, A. (2006). The use of recognition information and additional cues in inferences from memory. Acta Psychologica, 121, 275–284. Bro¨der, A., & Newell, B. R. (2008). Challenging some common beliefs: Empirical work within the adaptive toolbox metaphor. Judgment and Decision Making, 3, 205–214. Bro¨der, A., & Schiffer, S. (2003). Bayesian strategy assessment in multi-attribute decision making. Journal of Behavioral Decision Making, 16, 193–213. Erdfelder, E., Auer, T.-S., Hilbig, B. E., Aßfalg, A., Moshagen, M., & Nadarevic, L. (2009). Multinomial processing tree models: A review of the literature. Zeitschrift fu¨r Psychologie – Journal of Psychology, 217, 108–124. Erdfelder, E., Faul, F., & Buchner, A. (2005). Power analysis for categorical methods. In B. S. Everitt & D. C. Howell (Eds.), Encyclopedia of statistics in behavioral science (Vol. 3, pp. 1565–1570). Chichester, UK: Wiley. Gigerenzer, G., & Brighton, H. (2009). Homo heuristicus: Why biased minds make better inferences. Topics in Cognitive Science, 1, 107–143. Glo¨ckner, A. (2009). Investigating intuitive and deliberate processes statistically: The multiple-measure maximum likelihood strategy classification method. Judgment and Decision Making, 4, 186–199. Glo¨ckner, A., & Betsch, T. (2008a). Modeling option and strategy choices with connectionist networks: Towards an integrative model of automatic and deliberate decision making. Judgment and Decision Making, 3, 215– 228. Glo¨ckner, A., & Betsch, T. (2008b). Multiple-reason decision making based on automatic processing. Journal of Experimental Psychology: Learning, Memory, & Cognition, 34, 1055–1075.
Glo¨ckner, A., & Betsch, T. (2010). Accounting for critical evidence while being precise and avoiding the strategy selection problem in a parallel constraint satisfaction approach – a reply to Marewski. Journal of Behavioral Decision Making, 23, 468–472. Glo¨ckner, A., & Witteman, C. (2010). Beyond dual-process models: A categorization of processes underlying intuitive judgment and decision making. Thinking & Reasoning, 16, 1–25. Goldstein, D. G., & Gigerenzer, G. (2002). Models of ecological rationality: The recognition heuristic. Psychological Review, 109, 75–90. Hilbig, B. E. (2008a). Individual differences in fast-and-frugal decision making: Neuroticism and the recognition heuristic. Journal of Research in Personality, 42, 1641–1645. Hilbig, B. E. (2008b). One-reason decision making in risky choice? A closer look at the priority heuristic. Judgment and Decision Making, 3, 457–462. Hilbig, B. E. (2010). Precise models deserve precise measures: A methodological dissection. Judgment and Decision Making, 5, 272–284. Hilbig, B. E. (in press). Reconsidering ‘evidence’ for fast and frugal heuristics. Psychonomic Bulletin & Review. Hilbig, B. E., Erdfelder, E., & Pohl, R. F. (2010). One-reason decision-making unveiled: A measurement model of the recognition heuristic. Journal of Experimental Psychology: Learning, Memory, & Cognition, 36, 123– 134. Hilbig, B. E., & Pohl, R. F. (2008). Recognizing users of the recognition heuristic. Experimental Psychology, 55, 394–401. Hilbig, B. E., & Pohl, R. F. (2009). Ignorance- versus evidence-based decision making: A decision time analysis of the recognition heuristic. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35, 1296–1305. Hilbig, B. E., Pohl, R. F., & Bro¨der, A. (2009). Criterion knowledge: A moderator of using the recognition heuristic? Journal of Behavioral Decision Making, 22, 510–522. Hu, X., & Batchelder, W. H. (1994). The statistical analysis of engineering processing tree models with the EM algorithm. Psychometrika, 59, 21–47. Moshagen, M. (2010). multiTree: A computer program for the analysis of multinomial processing tree models. Behavior Research Methods, 42, 42–54. Myung, I. J. (2000). The importance of complexity in model selection. Journal of Mathematical Psychology, 44, 190–204. Newell, B. R., & Fernandez, D. (2006). On the binary quality of recognition and the inconsequentially of further knowledge: Two critical tests of the recognition heuristic. Journal of Behavioral Decision Making, 19, 333– 346. Newell, B. R., & Lee, M. D. (in press). The right tool for the job? Comparing an evidence accumulation and a naive atrategy selection model of decision making. Journal of Behavioral Decision Making. Pohl, R. F. (2006). Empirical tests of the recognition heuristic. Journal of Behavioral Decision Making, 19, 251– 271. Richter, T., & Spa¨th, P. (2006). Recognition is used as one cue among others in judgment and decision making. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32, 150–162. Rieskamp, J. (2008). The probabilistic nature of preferential choice. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34, 1446–1465. Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804.
Topics in Cognitive Science 3 (2011) 197–205 Copyright © 2011 Cognitive Science Society, Inc. All rights reserved. ISSN: 1756-8757 print / 1756-8765 online DOI: 10.1111/j.1756-8765.2010.01124.x
Towards Competitive Instead of Biased Testing of Heuristics: A Reply to Hilbig and Richter (2011) Henry Brighton, Gerd Gigerenzer Center for Adaptive Behavior and Cognition, Max Planck Institute for Human Development Received 8 September 2010; received in revised form 4 November 2010; accepted 4 November 2010
Abstract Our programmatic article on Homo heuristicus (Gigerenzer & Brighton, 2009) included a methodological section specifying three minimum criteria for testing heuristics: competitive tests, individual-level tests, and tests of adaptive selection of heuristics. Using Richter and Spa¨th’s (2006) study on the recognition heuristic, we illustrated how violations of these criteria can lead to unsupported conclusions. In their comment, Hilbig and Richter conduct a reanalysis, but again without competitive testing. They neither test nor specify the compensatory model of inference they argue for. Instead, they test whether participants use the recognition heuristic in an unrealistic 100% (or 96%) of cases, report that only some people exhibit this level of consistency, and conclude that most people would follow a compensatory strategy. We know of no model of judgment that predicts 96% correctly. The curious methodological practice of adopting an unrealistic measure of success to argue against a competing model, and to interpret such a finding as a triumph for a preferred but unspecified model, can only hinder progress. Marewski, Gaissmaier, Schooler, Goldstein, and Gigerenzer (2010), in contrast, specified five compensatory models, compared them with the recognition heuristic, and found that the recognition heuristic predicted inferences most accurately. Keywords: Simple heuristics; Recognition heuristic; Homo heuristicus; Biased testing
Correspondence should be sent to Henry Brighton, Center for Adaptive Behavior and Cognition, Max Planck Institute for Human Development, Lentzeallee 94, 14195 Berlin, Germany. E-mail: hbrighton@mpib-berlin.mpg.de

1. Introduction

Cognition rests on an ability to make accurate inferences from limited observations of an uncertain and potentially changing environment. Developing theories capable of explaining how the cognitive system functions so effectively despite this uncertainty is a key step
toward understanding cognition. The abilities of machines, for example, pale in comparison. These issues drive our research, and the notion of Homo heuristicus characterizes a particular relationship between cognition and the structure of the environment, one that hypothesizes how an organism can make accurate inferences about an uncertain world (Gigerenzer & Brighton, 2009). Rather than attempting to minimize, maximize, or optimize during the process of problem solving, Homo heuristicus relies on heuristic, resource-frugal, and robust solutions that ignore information. This does not mean that heuristics are functionally inferior to processes that integrate all information or optimize. Optimization is not always possible or desirable in the complex and uncertain environments in which we find ourselves. In fact, a mind that relies on simple heuristics can make both faster and more accurate inferences than one that relies on, for example, multiple regression (Czerlinski, Gigerenzer, & Goldstein, 1999) or neural network models (Brighton, 2006; Chater, Oaksford, Nakisa, & Redington, 2003). Our article examined less-is-more effects and used the statistical problem of the bias-variance dilemma to further understand how, when, and why heuristics make accurate inferences. More specifically, the study of Homo heuristicus proceeds by (a) proposing heuristics, expressed as precise and testable computational models, their building blocks, and the core cognitive capacities they exploit; (b) analyzing the functional–ecological implications of these heuristics, which means understanding why and when they work; (c) examining how heuristics are selected from what we refer to as an adaptive toolbox, a metaphor used to conceptualize the stock of heuristics available to the organism. Consequently, empirical tests of heuristic use should be guided by knowledge of their functional–ecological implications, because the hypothesis is that people will attempt to select a specific heuristic when it is adaptive to do so. By examining questions like these, we aim to lay firm foundations for understanding the broader issue of strategy selection. The problem of strategy selection is not specific to the study of heuristics. It should be a concern for anyone who accepts that cognition, and therefore decision making, relies on more than one form of processing (e.g., Einhorn, 1970, 1971; Ginossar & Trope, 1987; Payne, 1976; Payne, Bettman, & Johnson, 1988; Payne, Bettman, & Johnson, 1993; Rapoport & Wallsten, 1972; Rieskamp & Otto, 2006; Svenson, 1979). Hilbig and Richter remark that the simplicity of heuristics wanes once the task of strategy selection has been taken into account. This criticism assumes that the problem of strategy selection demands complex processes, although no supporting evidence for this assumption was given. It also implies that the strategy selection problem represents the Achilles heel of research into heuristics, while other approaches can, somehow, safely disregard the problem. This strikes us as short-sighted. Hilbig and Richter's viewpoint requires a commitment to believing that a single strategy, or a single pattern of information processing, underlies the problem of inductive inference. This line of reasoning places the onus firmly on those adopting such a view to explain how a single process could adequately respond to the diversity of statistical patterns found in the environment (e.g., Geman, Bienenstock, & Doursat, 1992; Schaffer, 1993).
Moreover, for those who assume that the mind has only one tool in its adaptive toolbox, such as a weighted-linear strategy or a neural network, the strategy selection problem
translates into the question of how the mind selects a new and adequate parameter set for every new class of problem.
2. Three methodological principles for testing strategies Hilbig and Richter are largely mute on these theoretical issues. Instead, they respond to our critique of an experiment by Richter and Spa¨th (2006) on the recognition heuristic. In Section 5, on Methodology and Empirical Evidence, we put forward three methodological principles for testing heuristics (Gigerenzer & Brighton, 2009, p. 132). We then used the Richter and Spa¨th study to illustrate how violating these principles can lead to unwarranted conclusions. The three principles are: (i) competitive tests, (ii) individual-level tests, and (iii) tests of adaptive selection of heuristics. Richter and Spa¨th’s study violated all three principles. First, the authors tested only the recognition heuristic but concluded that a compensatory strategy they had neither tested nor specified would explain participants’ inferences more accurately. Second, this conclusion was based on means only; no individual data were analyzed. Note that in the presence of systematic individual differences, one should not draw conclusions about individual processes from group means. In the extreme, the mean will not represent a single individual. Third, the authors tested if subjects used the recognition heuristic without first establishing if the statistical properties of the task—such as the presence of substantial recognition validity—made it functional to do so. In Experiment 1, the recognition validity was not reported and likely at chance. In Experiment 2, an adjusted small correlation was reported instead of the recognition validity. Only in Experiment 3 was the recognition validity substantial. The adaptive selection of heuristics would predict that accordance rates are high in Experiment 3, but low or at chance level in the others. In contrast, Richter and Spa¨th appear to have missed the point of the adaptive selection of heuristics, and the study of ecological rationality (Gigerenzer et al., 1999), which we spent nine pages discussing in our article (Gigerenzer & Brighton, 2009, p. 116–125). Richter and Spa¨th (2006, p. 160) went as far to suggest that the recognition heuristic is ‘‘universally applied’’ or that people ‘‘rely on recognition blindly.’’ Hilbig and Richter (p. 4) perpetuate this misunderstanding of heuristics as general-purpose rules, and even attribute it to Goldstein and Gigerenzer (2002), despite these authors explicitly stating that the recognition heuristic is not a general purpose strategy. The ecological rationality of the recognition heuristic is defined by two characteristics: Some objects must be unrecognized and the recognition validity must be substantial (pp. 76–78, 87). We fully accept that the details of how the cognitive system shifts strategies adaptively in response to recognition validity and other factors has not been fully set out. We do not accept that one should refrain from using knowledge of the functional–ecological
implications of a model to inform the process of experimental design and analysis. While Hilbig and Richter consider it ''somewhat harsh'' (p. 5) to doubt the insights of a study for which the functional–ecological match between the task and the recognition heuristic is weak or unknown, we consider it absolutely central to conducting solid empirical work.

2.1. Richter and Späth's (2006) study: No competitive testing, no individual analyses

To avoid any further misunderstanding, let us first define the adaptive use of the recognition heuristic. Relying on the heuristic is ecologically rational in an environment R where the recognition of objects a, b ∈ R is strong and positively correlated with their criterion values. For two objects, the heuristic is: If one of two objects is recognized and the other is not, then infer that the recognized object has the higher value with respect to the criterion. This heuristic is noncompensatory in the sense that the recall of further cues about the recognized object cannot compensate for (i.e., overturn) recognition information. In the original work (Gigerenzer & Goldstein, 1996, pp. 651–652; Goldstein & Gigerenzer, 2002, pp. 76–78), the recognition heuristic was assumed to model human inferences when three conditions hold: (i) the recognition validity is substantial, (ii) inferences are made from memory, rather than from tables of information, and (iii) recognition stems from natural environments, as opposed to artificial manipulation. In Experiment 3 of Richter and Späth (2006), these conditions were fulfilled. The authors then asked whether the recognition heuristic predicts people's inferences in the presence of a strong, contradicting cue. German participants were taught whether certain recognized American cities have international airports or not. The airport cue was chosen as being the most valid (mean subjective validity = .82) among six cues tested in a pilot study. Moreover, the biserial rank correlation between population rank and airport was larger than that between population rank and recognition, .71 versus −.56. There were three memory states for recognized cities: positive cue (with international airport), no cue (unknown), and negative cue (no international airport). Richter and Späth reported that in these three states, 98%, 95%, and 82% of all inferences were in accordance with the recognition heuristic, respectively. Their conclusion, though, was remarkable: ''no evidence was found in favor of a noncompensatory use of recognition'' (p. 159). We presented an analysis of Richter and Späth's data at the individual level, which showed that even in the presence of a strong contradicting cue, the majority of participants (17 out of 28) chose the recognized objects all the time (32 out of 32 judgments) or nearly all the time (31 out of 32), while the others appeared to guess or follow some other strategy (Gigerenzer & Brighton, 2009, figure 7). This pattern showed a
degree of intraindividual consistency rarely obtained in judgment and decision-making research. Richter and Späth (2006, p. 159), in contrast, had concluded that there was "clear evidence" for the compensatory strategies they favor, without having formulated and tested such a model.
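To make these definitions concrete, the following is a minimal illustrative sketch in Python; it is not code from any of the studies discussed, and the function names and data structures are hypothetical. It spells out the recognition heuristic for paired comparisons, the recognition validity of an environment, and the individual-level adherence (accordance) rate used in the analysis above.

def recognition_heuristic(recognized_a, recognized_b):
    """Return 'a' or 'b' if exactly one object is recognized; otherwise None,
    because the heuristic makes no prediction when both or neither are recognized."""
    if recognized_a and not recognized_b:
        return "a"
    if recognized_b and not recognized_a:
        return "b"
    return None


def recognition_validity(pairs):
    """Recognition validity: the proportion of pairs, among those in which exactly
    one object is recognized, where the recognized object has the higher criterion
    value. Each pair is (criterion_a, recognized_a, criterion_b, recognized_b)."""
    applicable, correct = 0, 0
    for crit_a, rec_a, crit_b, rec_b in pairs:
        prediction = recognition_heuristic(rec_a, rec_b)
        if prediction is None:
            continue
        applicable += 1
        higher = "a" if crit_a > crit_b else "b"
        correct += (prediction == higher)
    return correct / applicable if applicable else float("nan")


def adherence_rate(choices, predictions):
    """Individual-level accordance rate: the share of one participant's choices
    that match the heuristic's prediction, counted only over pairs where the
    heuristic applies (exactly one object recognized)."""
    applicable = [(c, p) for c, p in zip(choices, predictions) if p is not None]
    return sum(c == p for c, p in applicable) / len(applicable) if applicable else float("nan")

On this reading, an environment is ecologically rational for the heuristic only when the recognition validity is substantially above chance (.5); testing for adaptive selection then amounts to comparing individual adherence rates across environments that differ in this validity.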
3. Hilbig and Richter's reanalysis: Still no competitive testing

Before we comment on Hilbig and Richter's reanalysis, we would like to correct three errors repeated throughout their article.

1. Not all heuristics are noncompensatory processes. Hilbig and Richter begin their comment by stating, "Gigerenzer and Brighton (2009) provided a critical discussion of empirical evidence and the methodology that has been used to investigate the assumed noncompensatory nature of these heuristics" (p. 3). The claim that all heuristics are noncompensatory processes is incorrect. In fact, the first example of a heuristic we discussed at length was tallying, a compensatory heuristic. This oversight allows the authors to make a second erroneous claim.

2. If a person does not rely on the recognition heuristic, it does not follow that he or she relies on a nonheuristic compensatory strategy. Hilbig and Richter interpret the finding that not all participants follow the recognition heuristic as evidence that they follow a "nonheuristic" compensatory strategy. This conclusion is invalid for two reasons. First, as mentioned before, participants may rely on a compensatory heuristic such as tallying. Second, participants may rely on a different noncompensatory heuristic, such as a one-reason heuristic that considers only information about international airports. The point is that an argument for an alternative explanation that has not been formalized as a model and tested will necessarily be an argument based on speculation.

3. To generalize from one heuristic to all heuristics is logically unfounded. Hilbig and Richter claim in their abstract and conclusions that "fast and frugal heuristics are only used consistently by a minority of decision makers" (p. 14). It should be clear, though, that such statements about heuristics in general cannot be justified after analyzing one heuristic, as is the case in their comment.

In their reanalysis of Richter and Späth's data, Hilbig and Richter use an individual-level analysis, but no competitive testing. Moreover, the lack of competitive testing is combined with biased testing of the recognition heuristic, which is our next point.

3.1. The 100% (96%) threshold

Hilbig and Richter reanalyzed the data of Richter and Späth (2006) using the r-model. This model attempts to estimate the proportion of people who "use" the recognition heuristic, an estimate they propose as more accurate than the adherence rate. The adherence rate, the r-model, or any other criterion one chooses to
use, can only ever provide an uncertain indicator of the "use" of a cognitive model. Given this, we are nonplussed by Hilbig and Richter's accusation that we make a logical error when using the adherence rate to test the predictive accuracy of the recognition heuristic. The issue is not, and never will be, one of logic. Science, fortunately, is clear on how to resolve the matter: Competing explanations should be judged on their ability to explain the data. The problem is that the r-model is a model of behavior and in no way specifies, beyond the recognition heuristic, how information is processed when making decisions. It fails to offer a valid competing explanation in what should be a competition, played on a level playing field, between process models, that is, models that attempt to describe the data-generating machinery.

Rather than delve into the specifics of their analysis and quibble over the ability of a behavioral model to offer solid insights into cognitive processing, we will instead focus on the criterion used by Hilbig and Richter to classify a person as a user of the recognition heuristic. We use the term "biased testing" when someone evaluates two or more process models competitively but applies different evaluation criteria to each. In the absence of competitive testing, biased testing is the practice of assessing the model one disfavors against an unrealistic standard. The first test Hilbig and Richter conduct with the r-model uses a 100% threshold (r = 1); that is, they test the hypothesis that every person always relies on the recognition heuristic. They then relax 100% to 96% and classify only a minority (9 out of 28 participants in Richter and Späth's Experiment 3) as "users" of the recognition heuristic. (These are likely to be the same nine participants who followed the predictions of the recognition heuristic in 32 out of 32 cases, according to our reanalysis in Gigerenzer & Brighton, 2009, figure 7, lower panel.) It should be evident that by choosing any number, say, 100%, 90%, or 75% of subjects' responses as a threshold, or by augmenting the r-model with an error theory, one can obtain more or less favorable results for the recognition heuristic. The particular threshold values Hilbig and Richter chose are not met by any model of cognitive processes we are aware of in the entire field of judgment and decision making. Prospect theory, a Nobel Prize-winning model, for instance, typically predicts 75%–80% of judgments in two-alternative choice tasks. Other models do no better, and often worse (Brandstätter, Gigerenzer, & Hertwig, 2006). In other work, where Hilbig, Scholl, and Pohl (2010) estimate the r-value rather than fix it at an unrealistic level, they conclude that the majority of judgments, ranging from 63% to 77%, resulted from "use" of the recognition heuristic.

To summarize, the r-model analysis of Hilbig and Richter does not specify an alternative process model or test it competitively with the recognition heuristic. They attempt to estimate the "true" proportion of recognition heuristic users. The resulting estimated proportion depends on an arbitrary threshold, which the authors set unrealistically high, at 100% or 96%. Without carrying out the additional work of specifying a competing model, it is impossible to know how another process model would fare when measured against this same criterion.
Competitive model testing renders arbitrary decision thresholds such as these irrelevant and provides clear answers to questions concerning the relative ability of models to explain behavior.
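How strongly the estimated "proportion of users" depends on the chosen cutoff can be illustrated with a short, purely hypothetical sketch in Python. The adherence rates below are invented for illustration only and are not Richter and Späth's data, and classification by raw adherence is of course simpler than the r-model; the point is only that the same sample yields very different "user" counts under different thresholds.

def proportion_classified_as_users(adherence_rates, threshold):
    """Share of participants whose individual adherence rate meets the cutoff."""
    return sum(r >= threshold for r in adherence_rates) / len(adherence_rates)


# Invented adherence rates for 20 hypothetical participants.
rates = [1.00] * 5 + [0.97] * 6 + [0.88] * 4 + [0.75] * 3 + [0.50] * 2

for threshold in (1.00, 0.96, 0.90, 0.75):
    share = proportion_classified_as_users(rates, threshold)
    print(f"threshold {threshold:.2f}: {share:.0%} classified as users")
# -> 25%, 55%, 55%, and 90% of the same participants counted as "users",
#    depending only on the cutoff one happens to choose.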
4. How to resolve the debate: Competitive testing

Hilbig and Richter clearly harbor intuitions about a superior explanation for human decision making. By abstaining from the challenge of formalizing their intuitions and putting them to the test, they are free to enjoy the luxury of speculation. We welcome competing proposals and see them as essential to progress, but unless these intuitions are formalized to the extent that they can be compared with existing models and judged on the same footing, they will remain intuitions. They should not be mistaken for a serious competing explanation, and they should certainly not be used as a means to arrive at evidence against existing models that have been formally specified and can undergo empirical testing.

How can this debate be resolved? The answer is simple. Specify an alternative model, and then assess the relative ability of competing models to explain the observations. This is why we have stressed the competitive testing of process models in the title of this response, and in the abstract of the original article. Such formal models exist. Already in the first article on the recognition heuristic, Gigerenzer and Goldstein (1996) formulated and analyzed the predictive accuracy of several compensatory models, including variations on tallying that use recognition information as "just another cue," as Hilbig and Richter argue. Of particular relevance to this debate is the first experimental study that competitively tests various cognitive process models that assume compensatory processing of recognition information: Marewski et al. (2010) formulated five alternative models that integrate recognition with further cues for the recognized object. All five competing models can be thought of as weighted linear additive models with two classes of predictors, recognition and cues, or recognition and retrieval time, respectively. These models have free parameters that allow them to mimic the recognition heuristic or predict the opposite pattern, depending on the parameter values; that is, they include the recognition heuristic as a special case. None of the five compensatory models could predict judgments with greater accuracy than the recognition heuristic, which performed best overall. The study shows that although the recognition heuristic cannot predict with 100% accuracy, particularly in the presence of contradicting cues, this finding by itself does not imply that compensatory models predict with greater accuracy.
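As an illustration of what such a competitive test involves, the following sketch in Python describes a weighted additive model that treats recognition as "just another cue." It is not one of Marewski et al.'s (2010) actual models; the weights, cue coding, and example data are hypothetical. With a recognition weight exceeding the sum of the cue weights, the model reproduces the recognition heuristic's predictions (the heuristic as a special case); with smaller weights, a contradicting cue can overturn recognition, and the two parameterizations can be compared on how accurately they predict observed choices.

def weighted_additive_score(recognized, cues, w_recognition, cue_weights):
    """Score one object from recognition (0 or 1) and cue values (-1, 0, or +1)."""
    return w_recognition * recognized + sum(w * c for w, c in zip(cue_weights, cues))


def predict(obj_a, obj_b, w_recognition, cue_weights):
    """Predict which of two objects has the higher criterion value ('a' or 'b')."""
    score_a = weighted_additive_score(*obj_a, w_recognition, cue_weights)
    score_b = weighted_additive_score(*obj_b, w_recognition, cue_weights)
    return "a" if score_a >= score_b else "b"


def predictive_accuracy(model, test_pairs):
    """Competitive test: the proportion of observed choices a model predicts
    correctly on pairs not used to fit its parameters."""
    return sum(model(a, b) == choice for a, b, choice in test_pairs) / len(test_pairs)


# Hypothetical pair: a recognized city with a negative airport cue (object a)
# versus an unrecognized city (object b). Each object is (recognized, [airport cue]).
pair = ((1, [-1]), (0, [0]))

# Recognition weight larger than the cue weight: recognition cannot be compensated,
# so the model mimics the recognition heuristic and chooses the recognized city.
print(predict(*pair, w_recognition=5.0, cue_weights=[1.0]))   # -> 'a'

# Smaller recognition weight: the contradicting cue overturns recognition.
print(predict(*pair, w_recognition=0.5, cue_weights=[1.0]))   # -> 'b'

Fitting the weights on one part of the data and computing predictive accuracy on the remainder then puts a compensatory model and the recognition heuristic on the same footing, which is what competitive testing requires.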
5. Conclusion

We introduced the notion of Homo heuristicus to characterize how the cognitive system might make accurate inferences in uncertain, complex, and potentially changing environments. Our hypothesis is that by ignoring information, such as cues and dependencies between cues (a form of bias), organisms can simultaneously achieve robust, functional, and tractable responses to environmental uncertainty. Heuristics are process models used to formalize and test this hypothesis. In contexts where all events, actions, and probabilities are known, the process of optimization is all well and good, so long as it is computationally tractable to implement. In contexts where optimization is not tractable, or when full knowledge of the problem is unavailable, heuristics can offer superior responses to uncertainty.
In their commentary, Hilbig and Richter have little to say on these functional–ecological issues, issues that play a critical role in constraining cognitive theorizing. Instead, they focus on a specific issue, the proportion of participants who relied on the recognition heuristic in a study by Richter and Späth (2006). Whereas Richter and Späth had concluded that "no evidence was found" (p. 159) in favor of the recognition heuristic, the r-model reanalysis by Hilbig and Richter now classifies 29% of participants as users of the recognition heuristic. This classification is, however, based on an arbitrary and absolutely unrealistic 96% threshold. Based on this threshold, they argue, inaccurately, that some compensatory and therefore "nonheuristic" process provides a superior explanation of human behavior, yet they fail to specify this alternative and test it. The basic practice of competitive model testing renders this suspect methodological practice unnecessary and allows competing explanations to vie on a level playing field. To peddle putative evidence against one process model, based on an arbitrary decision criterion, as evidence for an unspecified alternative strikes us as a particularly weak form of argument. To then extend this pattern of reasoning in an attempt to dismiss an entire class of models, such as heuristics in general, strikes us as wholly unconvincing.
References

Brandstätter, E., Gigerenzer, G., & Hertwig, R. (2006). The priority heuristic: Making choices without tradeoffs. Psychological Review, 113, 409–432.
Brighton, H. (2006). Robust inference with simple cognitive models. In C. Lebiere & R. Wray (Eds.), AAAI Spring Symposium: Cognitive Science Principles Meet AI-Hard Problems (pp. 17–22). Menlo Park, CA: American Association for Artificial Intelligence.
Chater, N., Oaksford, M., Nakisa, R., & Redington, M. (2003). Fast, frugal and rational: How rational norms explain behavior. Organizational Behavior and Human Decision Processes, 90, 63–86.
Czerlinski, J., Gigerenzer, G., & Goldstein, D. G. (1999). How good are simple heuristics? In G. Gigerenzer, P. M. Todd, & the ABC Research Group, Simple heuristics that make us smart (pp. 97–118). New York: Oxford University Press.
Einhorn, H. J. (1970). The use of nonlinear, noncompensatory models in decision making. Psychological Bulletin, 73, 221–230.
Einhorn, H. J. (1971). Use of nonlinear, noncompensatory models as a function of task and amount of information. Organizational Behavior & Human Performance, 6, 1–27.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
Gigerenzer, G., & Brighton, H. (2009). Homo heuristicus: Why biased minds make better inferences. Topics in Cognitive Science, 1, 107–143.
Gigerenzer, G., & Goldstein, D. G. (1996). Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review, 103, 650–669.
Gigerenzer, G., Todd, P. M., & the ABC Research Group. (1999). Simple heuristics that make us smart. New York: Oxford University Press.
Ginossar, Z., & Trope, Y. (1987). Problem solving in judgment under uncertainty. Journal of Personality & Social Psychology, 52, 464–474.
Goldstein, D. G., & Gigerenzer, G. (2002). Models of ecological rationality: The recognition heuristic. Psychological Review, 109, 75–90.
Hilbig, B. E., Scholl, S. G., & Pohl, R. F. (2010). Think or blink — is the recognition heuristic an "intuitive" strategy? Judgment and Decision Making, 5, 300–309.
Marewski, J. N., Gaissmaier, W., Schooler, L. J., Goldstein, D. G., & Gigerenzer, G. (2010). From recognition to decisions: Extending and testing recognition-based models for multi-alternative inference. Psychonomic Bulletin & Review, 17, 287–309.
Payne, J. W. (1976). Task complexity and contingent processing in decision making: An information search and protocol analysis. Organizational Behavior & Human Performance, 16, 366–387.
Payne, J. W., Bettman, J. R., & Johnson, E. J. (1988). Adaptive strategy selection in decision making. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 534–552.
Payne, J. W., Bettman, J. R., & Johnson, E. J. (1993). The adaptive decision maker. Cambridge, England: Cambridge University Press.
Rapoport, A., & Wallsten, T. S. (1972). Individual decision behavior. Annual Review of Psychology, 23, 131–176.
Richter, T., & Späth, P. (2006). Recognition is used as one cue among others in judgment and decision making. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32, 150–162.
Rieskamp, J., & Otto, P. E. (2006). SSL: A theory of how people learn to select strategies. Journal of Experimental Psychology: General, 135, 207–236.
Schaffer, C. (1993). Overfitting avoidance as bias. Machine Learning, 10, 153–178.
Svenson, O. (1979). Process descriptions of decision making. Organizational Behavior & Human Performance, 23, 86–112.