Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
5427
Konstantin Avrachenkov, Debora Donato, Nelly Litvak (Eds.)

Algorithms and Models for the Web-Graph
6th International Workshop, WAW 2009
Barcelona, Spain, February 12-13, 2009
Proceedings
Volume Editors

Konstantin Avrachenkov
INRIA Sophia Antipolis
2004 Route des Lucioles, 06902 Sophia Antipolis, France
E-mail: [email protected]

Debora Donato
Yahoo! Research, Barcelona
Ocata 1, 1st floor, 08003 Barcelona, Spain
E-mail: [email protected]

Nelly Litvak
University of Twente
Faculty of Electrical Engineering, Mathematics and Computer Science
P.O. Box 217, 7500 AE Enschede, The Netherlands
E-mail: [email protected]
Library of Congress Control Number: 2008943853
CR Subject Classification (1998): F.2, G.2, H.4, H.3, C.2, H.2.8, E.1
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-540-95994-7 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-95994-6 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12611911 06/3180 543210
Preface
This volume constitutes the refereed proceedings of the 6th Workshop on Algorithms and Models for the Web Graph, WAW 2009, held in Barcelona in February 2009.

The World Wide Web has become part of our everyday life, and information retrieval and data mining on the Web are now of enormous practical interest. The algorithms supporting these activities combine the view of the Web as a text repository and as a graph, induced in various ways by links among pages, links among hosts, or other similar networks. We also witness an increasing role of the second-generation Web-based applications Web 2.0, such as social networking sites and wiki sites.

The workshop program consisted of 14 regular papers and two invited talks. The invited talks were given by Ravi Kumar (Yahoo! Research, USA) and José Fernando Mendes (University of Aveiro, Portugal). The regular papers went through a thorough review process. The workshop papers were naturally clustered in three sections: "Graph Models for Complex Networks," "PageRank and Web Graph" and "Social Networks and Search." The first section lays a foundation for theoretical and empirical analysis of the Web graph and Web 2.0 graphs. The second section analyzes random walks on the Web and Web 2.0 graphs and their applications. It is interesting to observe that the PageRank algorithm finds new exciting applications beyond now classical Web page ranking. Nowadays social networks are among the most popular applications on the Web. The third section of the workshop program is devoted to the design and performance evaluation of the algorithms for social networks. We hope that the papers presented at WAW 2009 will stimulate further development of Web 2.0 applications and deepen our understanding of the World Wide Web evolution.

We would like to thank the General Chair, Andrei Broder, and the Program Committee members for their time and effort which resulted in a very high quality program. We would like to thank the organizers of WAW 2007, Fan Chung and Anthony Bonato, for their valuable advice. We also would like to thank the Springer LNCS editorial team, particularly Alfred Hofmann, Ursula Barth and Anna Kramer, for their advice and prompt help. We thank Yana Volkovich very much for the design and continuing support of the WAW 2009 website. We would like to thank cordially our industrial sponsors Yahoo! Inc. and Google Inc. and our technical sponsors INRIA, University of Twente, Pompeu Fabra University, and Barcelona Media - Innovation Centre. Last but not least, we would like to extend our thanks to all the authors for their high-quality scientific contribution.

February 2009
Konstantin Avrachenkov Debora Donato Nelly Litvak
Organization
Executive Committee
Conference Chair: Andrei Broder (Yahoo! Research, USA)
Program Committee Co-chair: Konstantin Avrachenkov (INRIA, France)
Program Committee Co-chair: Debora Donato (Yahoo! Research, Spain)
Program Committee Co-chair: Nelly Litvak (University of Twente, The Netherlands)
Organizing Committee
Ricardo Baeza-Yates, Yahoo! Research, Spain
Debora Donato, Yahoo! Research, Spain
Mari Carmen Marcos, Pompeu Fabra University, Spain
Yana Volkovich, University of Twente, The Netherlands
Program Committee
Paolo Boldi, University of Milan, Italy
Anthony Bonato, Ryerson University, Canada
Guido Caldarelli, Centre Statistical Mechanics and Complexity CNR-INFM, Italy
Fan Chung Graham, University of California, San Diego, USA
Vladimir Dobrynin, St. Petersburg State University, Russia
Jeannette Janssen, Dalhousie University, Canada
Ravi Kumar, Yahoo! Research, USA
Amy N. Langville, College of Charleston, USA
Stefano Leonardi, Sapienza University of Rome, Italy
David Liben-Nowell, Carleton College, USA
Mark Manasse, Microsoft Research, USA
Kevin McCurley, Google Inc., USA
Igor Nekrestyanov, St. Petersburg State University, Russia
Remco van der Hofstad, TU Eindhoven, The Netherlands
Laurent Viennot, INRIA, France
Sebastiano Vigna, University of Milan, Italy
Dorothea Wagner, Karlsruhe University, Germany
Walter Willinger, AT&T Research, USA
Alexander Zelikovsky, Georgia State University, USA
Sponsoring Institutions
Pompeu Fabra University
Google Inc.
Yahoo! Research
Table of Contents
Graph Models for Complex Networks

Information Theoretic Comparison of Stochastic Graph Models: Some Experiments . . . . . 1
    Kevin J. Lang

Approximating the Number of Network Motifs . . . . . 13
    Mira Gonen and Yuval Shavitt

Finding Dense Subgraphs with Size Bounds . . . . . 25
    Reid Andersen and Kumar Chellapilla

The Giant Component in a Random Subgraph of a Given Graph . . . . . 38
    Fan Chung, Paul Horn, and Linyuan Lu

Quantifying the Impact of Information Aggregation on Complex Networks: A Temporal Perspective . . . . . 50
    Fernando Mourão, Leonardo Rocha, Lucas Miranda, Virgílio Almeida, and Wagner Meira Jr.

PageRank and Web Graph

A Local Graph Partitioning Algorithm Using Heat Kernel Pagerank . . . . . 62
    Fan Chung

Choose the Damping, Choose the Ranking? . . . . . 76
    Marco Bressan and Enoch Peserico

Characterization of Tail Dependence for In-Degree and PageRank . . . . . 90
    Nelly Litvak, Werner Scheinhardt, Yana Volkovich, and Bert Zwart

Web Page Rank Prediction with PCA and EM Clustering . . . . . 104
    Polyxeni Zacharouli, Michalis Titsias, and Michalis Vazirgiannis

Permuting Web Graphs . . . . . 116
    Paolo Boldi, Massimo Santini, and Sebastiano Vigna

Social Networks and Search

A Dynamic Model for On-Line Social Networks . . . . . 127
    Anthony Bonato, Noor Hadi, Paul Horn, Pawel Pralat, and Changping Wang

TC-SocialRank: Ranking the Social Web . . . . . 143
    Antonio Gulli, Stefano Cataudella, and Luca Foschini

Exploiting Positive and Negative Graded Relevance Assessments for Content Recommendation . . . . . 155
    Maarten Clements, Arjen P. de Vries, and Marcel J.T. Reinders

Cluster Based Personalized Search . . . . . 167
    Hyun Chul Lee and Allan Borodin

Author Index . . . . . 185
Information Theoretic Comparison of Stochastic Graph Models: Some Experiments

Kevin J. Lang
Yahoo Research, 2821 Mission College Blvd, Santa Clara, CA 95054
[email protected]
Abstract. The Modularity-Q measure of community structure is known to falsely ascribe community structure to random graphs, at least when it is naively applied. Although Q is motivated by a simple kind of comparison of stochastic graph models, it has been suggested that a more careful comparison in an information-theoretic framework might avoid problems like this one. Most earlier papers exploring this idea have ignored the issue of skewed degree distributions and have only done experiments on a few small graphs. By means of a large-scale experiment on over 100 large complex networks, we have found that modeling the degree distribution is essential. Once this is done, the resulting information-theoretic clustering measure does indeed avoid Q’s bad property of seeing cluster structure in random graphs.
1 Introduction and Discussion
This is an experimental paper that addresses several topics of current interest by using a large ensemble of graphs to evaluate several objective functions based on stochastic graph models incorporating degree sequences and partitionings of the nodes. The setup for these experiments requires many ingredients, some of which we will now mention.

Stochastic Graph Models: We use Gp to denote the classic Erdos-Renyi Gnp model that generates n-node graphs in which each of the n(n − 1)/2 possible edges occurs with probability p. Gm will denote the related model in which all graphs with n nodes and m edges are equiprobable. Gp and Gm are essentially the same model but with soft and hard constraints respectively on the number of edges. Gw is a generalization of Gp in which the expected degrees of the nodes are specified by the vector w. Gd is a generalization of Gm in which the degrees of the nodes are exactly specified by the vector d. These are all standard graph models which have been thoroughly analyzed by graph theorists, and are discussed at length in the book [1].

Maximum Likelihood Comparison of Models: The recent paper [2] makes a convincing case that a useful quantitative method for evaluating stochastic graph models is to take their generative nature seriously and ask which model is most likely to have generated the given graph. That paper compared four well-known models, including Gp and Gd, as explanations for several instances
of one graph, namely autonomous systems; Gd was second best, and Gp was worst. Our results include comparisons between Gm and Gd for more than 100 graphs, with results tabulated in Figures 3 and 4. We note that negative log likelihoods can be expressed in bits (with smaller bit-counts corresponding to higher likelihoods) and there is known to be a one-to-one correspondence between stochastic models and optimal compression schemes. Although these two viewpoints are equivalent, our discussion will mostly use the compression viewpoint, drawing special attention to the MDL interpretation of the compression schemes.

Minimum Description Length: The MDL framework [3] emphasizes the tradeoff between the bit-count in the header of a message specifying the object, and the bit-count in the body of the message, with the former bit count interpreted as a cost that is balanced against reductions in the latter bit count. The message header usually contains structural information that constrains the object, thus reducing the remaining entropy and shortening the message body, whose job is to eliminate all remaining uncertainty so that the decoder can output a specific object. We note that the comparison in [2] between Gp and Gd can be interpreted in the MDL framework (the question then is whether including the degree sequence in the message header sufficiently reduces the number of possible arrangements of edges so that the overall message is shortened). Similarly the papers [4] and [5] essentially use the MDL framework to evaluate partitioned Gp and Gm models respectively, with the partitionings being specified in the message header.

The MDL approach has long been popular for model selection because of its intrinsic resistance to over-fitting due to the header vs body tradeoff that forces us to pay for detailed structural descriptions that reduce entropy. In particular, the MDL approach has often been recommended in the context of data clustering as a method for choosing the right value of k, the number of clusters. Our opinion is that while the idea of using an MDL scheme to choose k makes sense in principle, its practical utility has been oversold in many papers, since altering the details of the compression scheme can change the answer in the same way that choosing a different prior can change the answer in Bayesian statistics. Moreover, it is typically a computationally intractable problem to optimize the relevant objective function, so arbitrary factors including luck can determine which particular value of k ends up with the best score. Because of those caveats, and despite the fact that our main data tables do contain other values of k (in the spirit of "data exploration"), we will mostly be focusing on the slightly more limited question of whether a given k-choosing method outputs k = 1 or k > 1. [Strictly speaking, a k > 1 conclusion is rigorous but a k = 1 conclusion is not, in both cases regardless of what partitioning algorithm was used, because the former only requires a sufficiently good partitioning to be exhibited, while the latter requires a proof of the non-existence of a sufficiently good partitioning, which is very hard to obtain given the intractability of graph partitioning].
Clusters in Real and Random Graphs: While real graphs might or might not contain cluster structure, most people agree that random graphs (generated by non-partitioned models like our four base models) should not. Therefore any k-choosing method that outputs k > 1 for a random graph has a flaw that needs to be investigated and fixed. One famous example is the modularity-Q metric [6]. Calculations in [7] showed that, simply due to random fluctuations, random graphs typically can be partitioned in a way that results in positive modularity, which might be interpreted as implying that k > 1. Our experiments on many graphs show that this is indeed a pervasive problem. The paper [7] goes on to suggest that Q scores should not be interpreted alone but rather in comparison with Q scores of randomized graphs.

Given the MDL method's alleged resistance to overfitting, one might suppose that an MDL-style comparison between partitioned and non-partitioned stochastic graph models (like those done in [4,8,5]) would successfully output k > 1 for many real graphs but k = 1 for random graphs. However, our experiments show that this approach does not work when one uses a base model like Gm that does not model the degree distribution. In that case we get the bad answer of k > 1 for many random graphs. This seems to especially happen when the graph has a highly skewed degree distribution. We also note that some of the partitionings which can cause the partitioned model to win are not at all cluster-like, such as the partitioning obtained by segregating the nodes by degree. However, this problem empirically seems to go away when we instead do an MDL-style comparison between partitioned and non-partitioned versions of the Gd model, which explicitly models the degree distribution. When that method is used, we still have k > 1 for many real graphs, but for all of the random graphs that we have tested, the conclusion is k = 1. While k = 1 is the non-rigorous answer (because it is hard to rule out the existence of any partitioning that would cause the partitioned model to have a better score), the consistency of this result over the entire ensemble of graphs is fairly impressive.

In conclusion, while we are not necessarily recommending the exact objective functions in this paper, we suggest that the way forward in information-theoretic objective functions for graph clustering is more likely to involve the comparison of models like Gd, which model the degree sequence, than models like Gm, which do not.
2 MDL-Style Encoding Schemes for Graphs
Encoding the Base Models: We are encoding an undirected graph g, with N nodes and M edges, whose edges are specified by indicator variables xij that are 1 if edge i ∼ j is present, and 0 otherwise. All logarithms are base-2. tri(n) = n(n − 1)/2. sb() and vb() are the encoding costs of hypothetical efficient blackbox encoders for scalars and vectors. The body part of the message that actually encodes the edges is marked with big curly braces.
\mathrm{bits}(g \mid G_p) = sb(N) + sb(p) + \Big\{ \sum_{i<j}^{N} \big[ -x_{ij} \log p - (1 - x_{ij}) \log(1 - p) \big] \Big\}

\mathrm{bits}(g \mid G_m) = sb(N) + sb(M) + \log \binom{\mathrm{tri}(N)}{M}

\mathrm{bits}(g \mid G_w) = sb(N) + vb(w) + \Big\{ \sum_{i<j}^{N} \big[ -x_{ij} \log p_{ij} - (1 - x_{ij}) \log(1 - p_{ij}) \big] \Big\}
where pij ∝ (wi wj ). For the Gd model we have adapted the scheme that was used by [2] to encode directed graphs. Our first step is therefore to orient the input graph’s edges; for specificity we orient them to point away from higher-degree nodes, flipping a coin for ties. The newly oriented edges define indegree and outdegree vectors id and od that sum elementwise to the original degree vector d. The mysterious-looking parenthesized expression in the following is explained in [2].
\mathrm{bits}(g \mid G_d) = sb\{N, M\} + vb(id) + vb(od) + \Big( \log(M!) - \sum_{i}^{N} \log(id_i!) - \sum_{i}^{N} \log(od_i!) \Big)
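For concreteness, the cost terms above can be evaluated with log-factorials. The following sketch is ours (not the paper's code); it computes only the edge-encoding bodies, omitting the sb/vb header terms, and assumes in_deg and out_deg are the in- and out-degree vectors obtained after orienting the edges as just described.

    from math import lgamma, log

    def log2_factorial(n):
        # log2(n!) via the log-gamma function
        return lgamma(n + 1) / log(2)

    def log2_binom(n, k):
        return log2_factorial(n) - log2_factorial(k) - log2_factorial(n - k)

    def body_bits_Gm(N, M):
        # body of bits(g|Gm): choose which M of the tri(N) node pairs carry edges
        return log2_binom(N * (N - 1) // 2, M)

    def body_bits_Gd(in_deg, out_deg):
        # the parenthesized term of bits(g|Gd):
        # log(M!) - sum_i log(id_i!) - sum_i log(od_i!), in bits
        M = sum(in_deg)
        return (log2_factorial(M)
                - sum(log2_factorial(d) for d in in_deg)
                - sum(log2_factorial(d) for d in out_deg))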
Partitioned Models: For each of the above base models X it is possible to define a partitioned model that views the graph g as K smaller graphs, each generated by model X, that have been somehow pasted together. In this paper we use two different "pasting methods", which can be more precisely described as methods for converting a partitioning of the graph's nodes into a block decomposition of the graph's adjacency matrix.

Let Y be a partitioning of the graph's nodes into K disjoint subsets Yj. Let nj = |Yj| be the number of nodes in piece j. For 0 ≤ j < K, let mj be the number of "internal" edges linking pairs of nodes both in subset Yj. For 0 ≤ i < j < K, let mij be the number of edges linking nodes in pieces i and j. Next consider the graph's adjacency matrix after the rows and columns have been conceptually rearranged to bring together the nodes in each piece Yj. The node partitioning induces K + tri(K) matrix blocks above the diagonal (which is all we care about since the graph is undirected, so the matrix is symmetric). There are K triangular blocks touching the diagonal, each containing tri(nj) matrix elements, which are all zero except for mj ones. There are tri(K) rectangular blocks off of the diagonal, each containing (ni nj) matrix elements, which are all zero except for mij ones.

Now, in the "full" matrix decomposition F we retain and separately encode all of these K + tri(K) blocks, but in the "combined" matrix decomposition C we combine the tri(K) off-diagonal blocks into one undifferentiated superblock containing $\mathrm{tri}(N) - \sum_{j}^{K} \mathrm{tri}(n_j)$ matrix elements, which are all zero except for $\breve{M} = M - \sum_{j}^{K} m_j$ ones.

Clearly the four base models can be combined with these two matrix decompositions to yield eight different models of partitioned graphs. We have implemented
most of those combinations, but in this paper we will only discuss three, namely Gm with F, and Gd with C, for which we present experimental log likelihood measurements, and Gw with C, because that is the model in which the modularity-Q score was developed [6].

Encoding the Partitioned Models: First we show how to encode a partitioned Gm model using matrix decomposition F. Let y be a length-N indicator vector for the K-way partitioning Y. Let m be a length-(K + tri(K)) vector collecting together the within-piece edge counts mi and the between-piece edge counts mij. Then, noting that the ni values can be computed from y, we have

\mathrm{bits}(g \mid G_m, F, K, Y) = sb\{N, K\} + vb(y) + vb(m) + \Big\{ \sum_{i}^{K} \log \binom{\mathrm{tri}(n_i)}{m_i} + \sum_{i<j}^{K} \log \binom{n_i n_j}{m_{ij}} \Big\}
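The body term of this partitioned Gm model can be accumulated block by block. The sketch below is ours; adj is assumed to map each node to its set of neighbors, piece maps each node to its piece index, and log2_binom is the helper from the earlier sketch (repeated here so the snippet stands alone).

    from math import lgamma, log
    from collections import Counter

    def log2_binom(n, k):
        lf = lambda m: lgamma(m + 1) / log(2)      # log2(m!)
        return lf(n) - lf(k) - lf(n - k)

    def body_bits_partitioned_Gm(adj, piece):
        # one log-binomial per block of the rearranged adjacency matrix:
        # diagonal block j has tri(n_j) slots and m_j edges,
        # off-diagonal block (i,j) has n_i * n_j slots and m_ij edges
        sizes = Counter(piece.values())
        m = Counter()
        for v, nbrs in adj.items():
            for u in nbrs:
                if v < u:                          # count each undirected edge once
                    m[tuple(sorted((piece[v], piece[u])))] += 1
        ids = sorted(sizes)
        bits = sum(log2_binom(sizes[a] * (sizes[a] - 1) // 2, m.get((a, a), 0))
                   for a in ids)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                bits += log2_binom(sizes[a] * sizes[b], m.get((a, b), 0))
        return bits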
Next we describe a method for encoding a partitioned Gd model using matrix decomposition C. As with the base Gd model, we start by orienting the edges; then we consult the partitioning to create four new degree vectors: $\hat{id}$ and $\hat{od}$ for the in- and out-degree resulting from within-piece edges, and $\breve{id}$ and $\breve{od}$ for the in- and out-degree resulting from between-piece edges. Then, noting that $\breve{M}$ and the $m_j$ can be computed from the four degree distributions:

\mathrm{bits}(g \mid G_d, C, K, Y) = sb\{N, K\} + vb(y) + vb(\hat{id}) + vb(\hat{od}) + vb(\breve{id}) + vb(\breve{od}) + \Big\{ \sum_{j}^{K} \log(m_j!) - \sum_{i}^{N} \log(\hat{id}_i!) - \sum_{i}^{N} \log(\hat{od}_i!) + \log(\breve{M}!) - \sum_{i}^{N} \log(\breve{id}_i!) - \sum_{i}^{N} \log(\breve{od}_i!) \Big\}
3 Discussion of Scalar and Vector Encoders
Because we are using the sb and vb encoders in the header section rather than the body section of these MDL-style schemes, their implementations function much like priors in Bayesian statistics; different choices can lead to different results, including different estimates of the best value of k. Actually, the choice of scalar encoder sb has a very small effect on the answer except when the graph is tiny, so we arbitrarily use the Elias gamma coder. The design of the vector encoder vb can have a much bigger influence on the answer. Because our main interest is in the question of whether we should be modeling partitions or degree sequences at all, not on what is precisely the best model for those degrees or partitions, we have attempted to engineer an encoder that achieves good compression while making fairly weak assumptions about the input distribution. Our solution was a composite encoder that uses the best of the following three schemes: Because it is traditional in the complex networks field to consider power-law distributions, our first encoder P is a power-law encoder just like the one in [2],
except that we remember to charge for encoding the power so we don't use overly specific powers for small vectors. In cases where the power-law distribution is a bad fit, we can instead use an "agnostic" encoder A that doesn't directly assume a specific distribution for the values, but instead makes the weaker assumption that they sum to a given total. A similar scheme is used in [5]. Among other things, Figure 1 shows for every graph whether P or A is a better model for the graph's degree distribution. We find that A does win for some graphs (those graphs where "A" precedes "P" in the three-letter string in the "model degs" column). This can be considered a small addition to the extensive literature on whether or not power-law distributions are good for modeling actual degree distributions. [However, our use of the P model for encoding requires the entire distribution to be well modeled by a power law, whereas the usual power-law hypothesis excludes one or both ends of the distribution.] Our third encoder H encodes using the exact empirical histogram, which must itself be paid for (i.e. encoded), and the scheme for doing that can be thought of as imposing a hyper-prior that can be overcome with a sufficiently large amount of data. In Figure 1 we see (because "H" is the first letter in the three-letter string in the "model degs" column) that scheme H is nearly always the best of our three sub-schemes for encoding the degree sequences of medium-to-large graphs.

Now for some actual details. Our overall encoder vb uses the best of three sub-schemes H, P, and A, charging log 3 bits for the choice between them. In the power-law scheme P we find a good power $\beta = \pm b/2^{c}$ by binary search, then encode the input vector using $\sum_{i}^{n} -\log(v_i^{\beta}/Z)$ bits, where $Z = \sum_{j=1}^{\max_i v_i} j^{\beta}$. The overhead of scheme P is roughly $1 + sb\{b, c, n, \max_i v_i\}$ bits. In the agnostic scheme A we note that the vector's length is n, its entries $v_i$ are non-negative, and they sum to a particular total t. There are $B = \binom{t+n-1}{n-1}$ different vectors matching that description; the choice between them can be encoded using $\log B$ bits. The overhead of this method is sb(t) + sb(n). Finally we sketch out the recursive, histogram-based scheme H. Given the input vector's symbol count histogram we optimally encode the input vector using $\log\big((\sum_s c(s))!\big) - \sum_s \log(c(s)!)$ bits (this is the logarithm of the number of different vectors with that same histogram). The recursion in this scheme arises from the need to also encode the histogram, which is represented by a vector of symbol values and a parallel vector of symbol counts. Each of these two vectors is encoded by a recursive call to the overall encoder vb. We also perform a few simple optimizations in scheme H, such as sorting the value-count pairs and then differencing the values before making the two recursive calls to the encoder.
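The body costs of schemes A and H reduce to a couple of log-factorial sums. The sketch below is ours and deliberately omits the headers (sb(t), sb(n), the log 3 scheme-selection bits, and the recursive encoding of the histogram itself):

    from math import lgamma, log
    from collections import Counter

    def _lf(n):
        # log2(n!)
        return lgamma(n + 1) / log(2)

    def bits_agnostic_body(v):
        # scheme A: n non-negative integers with known sum t form one of
        # binom(t+n-1, n-1) possible vectors
        n, t = len(v), sum(v)
        return _lf(t + n - 1) - _lf(n - 1) - _lf(t)

    def bits_histogram_body(v):
        # scheme H: log2 of the number of distinct vectors sharing v's
        # symbol-count histogram: log((sum_s c(s))!) - sum_s log(c(s)!)
        counts = Counter(v).values()
        return _lf(sum(counts)) - sum(_lf(c) for c in counts)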
4 Modularity-Q
The Q measure [6] can be understood by considering a K-way partitioning of the graph's nodes that allows the adjacency matrix to be re-arranged so that there are K blocks on the diagonal containing all within-piece edges. The leftover between-piece edges are all contained in a single off-diagonal superblock. The intuition which motivates Q is that a graph with good cluster structure should have a preponderance of within-piece edges. In fact, for each piece j the value mj should be
Information Theoretic Comparison of Stochastic Graph Models Graph Name monksLk3 karate dolphins lesmis pol-books football netscience mani-facesDeg10 mani-facesDeg3 mani-facesK10 mani-facesK3 mani-swissDeg10 mani-swissDeg3 mani-swissK10 mani-swissK3 powergrid-watts road-USA road-ca road-pa road-tx delicious epinions flickr LinkedIn LiveJournal01 LiveJournal-larsWcc messengerSocial4 airports answers answers-2 answers-4 answers-6 as-caidaRel071112 as-newman as-oregon-all AuthToPap-astro-ph AuthToPap-cond-mat AuthToPap-dblp AuthToPap-gr-qc AuthToPap-hep-ph AuthToPap-hep-th
N M model nodes edges degs G 18 41 APH m 34 78 APH m 62 159 PAH m 77 254 APH m 105 441 AHP m 115 613 HPA m 379 914 AHP m 696 6979 HAP m 551 1981 HAP m 698 6935 AHP m 698 2091 AHP m 20000 200000 HPA m 19042 57747 HAP m 20000 199955 AHP m 20000 59997 AHP m 4941 6594 HAP m 126146 161950 HAP m 1957027 2760388 HAP m 1087562 1541514 HAP m 1351137 1879201 HAP m 147567 301921 HPA D 75877 405739 PHA D 404733 2110078 HPA D 6946668 30507070 HPA D 3766521 30629297 HPA D 4843953 42845684 HPA D 1878736 4079161 HPA D 500 2980 PHA D 488484 1240189 HPA D 25431 65551 HPA D 93971 266199 HPA D 290351 613237 HPA D 26389 52861 HPA D 22963 48436 HPA D 13579 37448 HPA D 54498 131123 HAP D 57552 104179 HAP D 615678 944456 HAP D 14832 22266 HAP D 47832 86434 HPA D 39986 64154 HPA D
7
Graph N M model Name nodes edges degs G gnutella-25 22663 54693 HPA D gnutella-30 36646 88303 HPA D gnutella-31 62561 147878 HPA D random-deg4 100000 200000 AHP m random-deg7 100000 350000 AHP m amazon 2003 06 01 asin01 403364 2443311 HAP D amazon 2003 all 473315 3505519 HAP D amazon-simProd-All 524371 1491793 HAP D bio-proteinsVespignani 4626 14801 HAP D bio-proteinYeastBarabasi 1458 1948 PHA D bio-yeastZivP0001 353 1517 PAH D bio-yeastZivP001 1266 8511 PHA D Blog-nat05-6m 29150 182212 HPA D Blog-nat06all 32384 315713 HAP D Cit-hep-ph 34401 420784 HAP D Cit-hep-th 27400 352021 HAP D clickstream-UsrToUrl 199308 951649 HPA D CoAuth-astro-ph 17903 196972 HAP D CoAuth-cond-mat 21363 91286 HAP D CoAuth-gr-qc 4158 13422 HAP D CoAuth-hep-ph 11204 117619 HPA D CoAuth-hep-th 8638 24806 HAP D dblp-larsWcc 317080 1049866 HAP D email-all 234352 383111 HPA D email-all-inOut 37803 114199 HPA D email-enron 33696 180811 HPA D email-ijs 72216 91393 HPA D imdbDec07-ActToAct 821810 27394903 HAP D imdbDec07-ActToMov 1966620 5771671 HPA D imdb-actress-movie 601481 1320616 HPA D imdb-a-m-30countries 198430 566756 HPA D imdb-a-m-USA 241360 530494 HPA D Patents 3764105 16511682 HAP D pol-blogs 1222 16714 PAH D Post-nat05-6m 238305 297338 HPA D Post-nat06all 437305 565072 HPA D protein dip 4626 14801 HAP D web-BerkStan 319717 1542940 HPA D web-google 855802 4291352 HAP D web-notredame 325729 1090108 HPA D web-wt10g-trec 1458316 6225033 HPA D
Fig. 1. Basic information about many of our 100 test graphs. The rightmost column tells whether the gm or gd base model assigns a higher probability to the graph.
larger than the number of edges that would be expected for that piece if the overall graph had actually been generated by a single-piece base model, specifically Gw. The Q score is then the sum over the K pieces of the excess within-piece edges, suitably normalized. The definition of Q can be stated in the following simple and cheaply computable form: $Q = \sum_{j}^{K} \big[ (m_j/M) - (\mathrm{vol}(Y_j)/2M)^2 \big]$, where vol(Yj) is the sum of the degrees of the nodes in piece j. Q's bad property of imputing cluster structure to random graphs [7] is easy to explain by the fact that even in a random graph, some cuts with a given balance contain fewer edges than others, simply due to random fluctuations. If the graph is partitioned using the smallest of these sufficiently balanced cuts, the resulting deficit in between-piece edges causes an excess of within-piece edges that is detected by Q.
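This form of Q is easy to compute directly from a partitioning. The sketch below is ours, assuming adj maps each node to its set of neighbors and piece maps each node to its piece index:

    def modularity_Q(adj, piece):
        # Q = sum_j [ m_j/M - (vol(Y_j)/(2M))^2 ] for an undirected graph
        M = sum(len(nbrs) for nbrs in adj.values()) // 2   # total number of edges
        m, vol = {}, {}
        for v, nbrs in adj.items():
            j = piece[v]
            vol[j] = vol.get(j, 0) + len(nbrs)             # vol(Y_j): sum of degrees
            for u in nbrs:
                if piece[u] == j and v < u:                # each internal edge once
                    m[j] = m.get(j, 0) + 1
        return sum(m.get(j, 0) / M - (vol[j] / (2.0 * M)) ** 2 for j in vol)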
5 Dataset, Methodology, and Results
Graph Collection: During his studies of the properties of large complex networks, Jure Leskovec has assembled a very useful collection of more than 100
medium-to-large real-world undirected graphs, with sizes ranging up to millions of nodes. These graphs come from a wide range of real-world sources, and have consequently diverse properties. Some, but not all, of these graphs have highly skewed degree distributions. We obtained this collection from Jure, but he plans to make it publicly available soon. We used all of the graphs, but to save space, only a subset of them are listed in Figure 1.

Highlights: Road-* and mani-* are essentially low-dimensional meshes. gnutella-* are expander-like peer-to-peer networks, and random-deg* are actual expanders. delicious, flickr, LinkedIn and LiveJournal* are large social networks. The AuthToPap*, CoAuthor*, and Cit* graphs encode collaboration and citation patterns in scientific papers. The Imdb-* graphs encode which actors appeared in which movies. The "model degs" column in Figure 1 shows the ranking from highest to lowest likelihood of the H, P, and A models for the graph's degree sequence (see Section 3). The "model G" column of Figure 1 tells whether the Gm or Gd base model assigns a higher probability to the graph. We see that Gd is a better model for most of the large complex networks, but Gm wins for the meshlike graphs, and ironically for the tiny networks like karate and dolphins which many previous complex networks papers have used to test ideas about community structure, including information-theoretic measures like those studied in this paper.

Procedure: Step 1) We randomly permuted the node ids of every graph to erase any node ordering information that we might otherwise accidentally exploit. If our goal was to actually compress a graph, we would be happy to use any kind of structure that would help for that graph. A well-known example is sorted-URL node ordering for web graphs [9]. However our goal is to understand the interaction of node degrees and cluster structure, so we are intentionally excluding other kinds of structure from our models. Step 2) For each graph, we generated a random graph with exactly the same degree sequence. Step 3) For each real graph g and its corresponding random graph, and for each K ∈ {2, 4, 8, . . . , 1024}, we used Graclus [10] to generate a candidate partitioning Y(g, K). Graclus is a very fast Metis-based program that attempts to produce pieces bounded by cuts with low normalized cut score, while also obeying some balance constraints. Step 4) For each Y(g, K) we calculated Q(g, K, Y(g, K)), bits(g|Gm, F, K, Y(g, K)), and bits(g|Gd, C, K, Y(g, K)).

Note on Generating Random Graphs: Generating a random graph with a given degree sequence is a notoriously tricky problem [11]. There are two main approaches. We avoided the edge-swapping approach because it is unclear how much swapping is enough. Instead we used the generate-from-scratch approach, which is also known to be tricky. The literature does not currently contain any algorithm that 1) is fast and simple, 2) avoids creating self loops and parallel edges, 3) doesn't get stuck, and 4) provably avoids introducing subtle 2nd-order biases such as connecting high-degree nodes to low-degree nodes too often or not often enough. We use the following procedure, which has properties 1-3 but not necessarily property 4: Process the nodes in decreasing order of requested degree. For each node, supply its remaining unsupplied edges by connecting the node to randomly chosen nodes, with the probability of choosing a given neighbor proportional to its residual requested degree, all the while avoiding the creation of parallel edges and self edges. We note that processing the nodes in the opposite order (lowest requested degree first) would cause the algorithm to get stuck on some of our graphs.
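A minimal sketch of this wiring rule (ours, not the implementation used for the experiments; it is quadratic and has no backtracking, so it only illustrates the procedure). Here degrees is the requested degree sequence, indexed by node id:

    import random
    from collections import defaultdict

    def sample_graph_with_degrees(degrees, rng=random.Random(0)):
        # process nodes in decreasing order of requested degree; wire each
        # remaining stub to a neighbor chosen with probability proportional to
        # its residual requested degree, skipping self loops and parallel edges
        n = len(degrees)
        residual = list(degrees)
        adj = defaultdict(set)
        for v in sorted(range(n), key=lambda x: -degrees[x]):
            while residual[v] > 0:
                cands = [u for u in range(n)
                         if u != v and residual[u] > 0 and u not in adj[v]]
                if not cands:
                    raise RuntimeError("got stuck: no admissible neighbor left")
                u = rng.choices(cands, weights=[residual[c] for c in cands], k=1)[0]
                adj[v].add(u); adj[u].add(v)
                residual[v] -= 1; residual[u] -= 1
        return adj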
Partition Source    |  Modu-Q        |  bits(Gm)/M    |  bits(Gd)/M
                    |  orig    rand  |  orig    rand  |  orig    rand
None (Base Model)   |                |  18.51   18.51 |  17.11   17.53
None (K=1)          |  0.000   0.000 |  18.51   18.51 |  17.11   17.53
Graclus(K=4)        |  0.711   0.408 |  17.70   18.87 |  16.32   18.12
Graclus(K=16)       |  0.848   0.462 |  16.94   19.08 |  15.65   18.22
Graclus(K=64)       |  0.824   0.412 |  16.49   19.25 |  15.50   18.56
Graclus(K=256)      |  0.735   0.392 |  16.45   19.27 |  15.89   18.77
Graclus(K=1024)     |  0.654   0.373 |  16.66   19.54 |  16.41   19.01
Segregate Degrees   |          0.000 |          17.88 |          19.00

Fig. 2. These results for the imdb-actress-movie graph are explained in Section 5
Note on Graclus: It might seem puzzling that we only used one graph partitioning algorithm for these experiments, given the numerous candidate algorithms which appear in the graph partitioning and community finding literatures. As mentioned in the earlier discussion section, we are mainly interested in the k = 1 versus k > 1 question, and no matter what currently available algorithm is used, the conclusion k > 1 is rigorous and the conclusion k = 1 is not. Nevertheless, we have found that Graclus is both fast and reasonably effective at finding partitions that score well according to these information-theoretic objective functions. In fact it often works much better than hill-climbing algorithms that have been recommended for this purpose in other papers. However, we are not claiming to have ever found the best scoring partition for a given value of k.

Results: First we describe the results for one graph in some detail. The graph, "imdb-actress-movie", is a bipartite graph derived from IMDB data, encoding which actresses have appeared in which movies. If the adjacency matrix is reordered based on a recursive spectral decomposition of the graph, noisy but clearly visible block structure appears, which turns out to roughly correspond to countries. The results of the current experiment are summarized in Figure 2. The other columns tabulate the scores that were measured in step 4 above for both the actress-movie graph (orig) and a random graph with the same degree sequence (rand).

The Modularity Q scores for the various Graclus clusterings peak at about 16 pieces for both the original and random graphs, suggesting that both of them contain real cluster structure. Recalling that likelihoods are expressed as negative log probabilities, in units of bits per edge, with smaller numbers being better, the likelihoods in the columns for the partitioned Gm model can be interpreted as showing that the original graph contains clusters but the random graph does not, which makes more sense than the results for Q. However, consider the numbers in the bottom row, which are for a partitioning of the nodes into sets based on their degree. This method of splitting up the
nodes does not correspond to intuitively good "communities" or "modules", but technically it is a valid partitioning. This degree-based fake clustering causes the partitioned Gm model to beat the unpartitioned Gm base model on the random graph. This kind of spurious outcome explains why the Q measure is based on the Gw model, which takes degrees into account. It also provided our motivation to define an information-theoretic clustering measure based on the closely related Gd model. The rightmost two columns in Figure 2 show that the resulting log likelihood scores for the Graclus clusterings are consistent with the imdb graph having cluster structure, while the random graph does not. Moreover, the measure is not fooled by the degree-based fake clustering.

Due to space limitations, our results for the dozens of other graphs are presented in a more compact form in Figure 3 and Figure 4. For each original graph and corresponding random graph, and for each of the three objective functions (Q, and log likelihoods for partitioned Gm and partitioned Gd), we state which K value resulted in the Graclus clustering that was best according to that measure.
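For reference, one simple way (ours; the paper does not spell out the exact bucketing) to build the degree-segregated partitioning used in the bottom row of Figure 2:

    def segregate_by_degree(adj, num_pieces=16):
        # ignore the edge structure entirely: sort nodes by degree and cut the
        # sorted order into num_pieces roughly equal-size buckets
        order = sorted(adj, key=lambda v: len(adj[v]))
        return {v: (rank * num_pieces) // len(order) for rank, v in enumerate(order)}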
[Table for Fig. 3: for each graph in the first group (monksLk3 through AuthToPap-hep-th), the best Graclus clustering under each of three measures, for the original graph and for a random graph with the same degree sequence: Modularity-Q (best Q and the K achieving it), − log prob(g|Gm)/M (the K = 1 baseline in bits per edge, then the best bits per edge and its K), and − log prob(g|Gd)/M (likewise).]

Fig. 3. See the caption of Figure 4

[Table for Fig. 4: the same columns for the remaining graphs (gnutella-25 through web-wt10g-trec).]
Fig. 4. This table shows the K value of the best Graclus clustering according to three different objective functions, for numerous actual graphs and random graphs with the same degree sequences. The Q scores suggest that all of the random graphs contain clusters, while the partitioned Gd scores suggest that none of them do, which is arguably the correct answer.
The overall conclusion from these tables1 is that all three measures indicate that the actual graphs contain cluster structure. However, for the random graphs with the same degree sequences, which intuitively should not have any cluster structure, Q always thinks that it sees cluster structure, partitioned Gm sometimes does, and partitioned Gd never does.

Acknowledgements. We would like to thank Reid Andersen, Anirban Dasgupta, and Michael Mahoney for useful discussions, and Jure Leskovec for assembling this collection of graphs.
1 Following a reviewer's suggestion, for nearly all of these graphs we have redone this experiment, obtaining new Q-scores and bit-counts by averaging 10 runs of the 4-step randomized procedure described in Section 5. Lower order digits of many scores changed, but the new tables (not included in this manuscript) are qualitatively similar and support the same high-level conclusions as these single-run tables.
References

1. Chung, F., Lu, L.: Complex Graphs and Networks (CBMS Regional Conference Series in Mathematics). American Mathematical Society (August 2006)
2. Bezáková, I., Kalai, A., Santhanam, R.: Graph model selection using maximum likelihood. In: ICML 2006: Proceedings of the 23rd International Conference on Machine Learning, pp. 105–112. ACM, New York (2006)
3. Rissanen, J.: Modelling by the shortest data description. Automatica 14, 465–471 (1978)
4. Chakrabarti, D., Papadimitriou, S., Modha, D.S., Faloutsos, C.: Fully automatic cross-associations. In: KDD 2004: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 79–88. ACM, New York (2004)
5. Rosvall, M., Bergstrom, C.T.: An information-theoretic framework for resolving community structure in complex networks. Proc. Natl. Acad. Sci. USA 104(18), 7327–7331 (2007)
6. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004)
7. Guimera, R., Sales-Pardo, M., Amaral, L.A.N.: Modularity from fluctuations in random graphs and complex networks. Physical Review E 70, 025101 (2004)
8. Hofman, J.M., Wiggins, C.H.: A Bayesian approach to network modularity. Physical Review Letters 100, 258701 (2008)
9. Boldi, P., Vigna, S.: The WebGraph framework I: compression techniques. In: WWW 2004: Proceedings of the 13th International Conference on World Wide Web, pp. 595–602. ACM, New York (2004)
10. Dhillon, I.S., Guan, Y., Kulis, B.: Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Trans. Pattern Anal. Mach. Intell. 29(11) (2007)
11. Blitzstein, J., Diaconis, P.: A sequential importance sampling algorithm for generating random graphs with prescribed degrees. Technical report, Stanford (2005)
Approximating the Number of Network Motifs

Mira Gonen and Yuval Shavitt
Tel-Aviv University, Ramat Aviv, Israel
[email protected], [email protected]
Abstract. World Wide Web, the Internet, coupled biological and chemical systems, neural networks, and social interacting species, are only a few examples of systems composed by a large number of highly interconnected dynamical units. These networks contain characteristic patterns, termed network motifs, which occur far more often than in randomized networks with the same degree sequence. Several algorithms have been suggested for counting or detecting the number of induced or non-induced occurrences of network motifs in the form of trees and bounded treewidth subgraphs of size O(log n), and of size at most 7 for some motifs. In addition, counting the number of motifs a node is part of was recently suggested as a method to classify nodes in the network. The promise is that the distribution of motifs a node participates in is an indication of its function in the network. Therefore, counting the number of network motifs a node is part of provides a major challenge. However, no such practical algorithm exists. We present several algorithms with time complexity O(e^{2k} k · n · |E| · log(1/δ) / ǫ^2) that, for the first time, approximate for every vertex the number of non-induced occurrences of the motif the vertex is part of, for k-length cycles, k-length cycles with a chord, and (k − 1)-length paths, where k = O(log n), and for all motifs of size at most four. In addition, we show algorithms that approximate the total number of non-induced occurrences of these network motifs, when no efficient algorithm exists. Some of our algorithms use the color coding technique.
1 Introduction

1.1 Background and Motivation
World Wide Web, the Internet, coupled biological and chemical systems, neural networks, and social interacting species, are only a few examples of systems composed by a large number of highly interconnected dynamical units. The first approach to capture the global properties of such systems is to model them as graphs whose nodes represent the dynamical units, and whose links stand for the interactions between them. Such networks have been extensively studied by exploring their global topological features such as power-law degree distribution, the existence of dense-core and small diameter. (For references see the full version of the paper [7]). However, two networks which have similar global features
can have significant differences in structure, which can be captured by examining local structures they include: e.g., one of them may include a specific subgraph many more times than the other. Therefore these small subgraphs, termed network motifs, were suggested to be elementary building blocks that carry out key functions in the network. Milo et al. [12] found motifs in networks from biochemistry, neurobiology, ecology, and engineering. Specifically, they found motifs in the World Wide Web. Moreover, Hales and Arteconi [9] presented results from a motif analysis of networks produced by peer-to-peer protocols. They showed that the motif profiles of such networks closely match protein structure networks. Thus efficiently detecting and counting the number of network motifs provides a major challenge. As a result novel computational tools have been developed for counting subgraphs in a network and discovering network motifs. Most existing work deals with induced motifs, while there is also work that focuses on non-induced motifs.1 The motivation for considering non-induced subgraphs is that the processes of obtaining large networks (such as the AS graph) are far from complete and error free; they lack existing edges. Thus, an occurrence of a specific network motif in one network may include additional edges in its occurrence in another network and vice versa. Several existing algorithms for counting and detecting non-induced motifs [6,4,2,1,16] used the color coding technique of Alon et al. [3]. Color coding is an innovative combinatorial approach that was introduced by Alon et al. [3] to detect simple paths, trees and bounded treewidth subgraphs in unlabeled graphs. Color coding is based on assigning random colors to the vertices of an input graph. It considers only those subgraphs where each vertex has a unique color. Such colorful subgraphs can then be detected through efficient use of dynamic programming, in time polynomial with n, the size of the input graph. If the above procedure is repeated sufficiently many times (polynomial with n, provided that the subgraph we are looking for is of size O(log n)), it is guaranteed that a specific occurrence of the query subgraph will be detected with high probability. The color coding technique is a building block in some of the algorithms presented in this paper. Przulj et al. [14] described how to count all induced subgraphs with up to 5 vertices in a PPI (Protein-Protein Interaction) network. Faster techniques that count induced subgraphs of size up to 6 were developed by Hormozdiari et al. [10], and for size up to 7 were shown by Grochow and Kellis [8]. The running times of these techniques all increase exponentially with the size of the motif. Kashtan et al. [11] showed an algorithm for detecting induced network motifs that samples the network. This algorithm detects induced occurrences of small motifs (motifs with k ≤ 7 vertices). Wernicke et al. [17] claim that Kashtan et al.'s algorithm suffers from a sampling bias and scales poorly with increasing subgraph size. Thus, Wernicke [17] presented an improved algorithm for network
1 G0 is an induced subgraph of a graph G if and only if for each pair of vertices v0 and w0 in G0 and their corresponding vertices v and w in G there is an edge between v0 and w0 in G0 if and only if there is an edge between v and w in G.
motif detection which overcomes these drawbacks. Scott et al. [15] focused on the subgraph detection problem. Dost et al. [6] showed how to solve the subgraph detection problem for subgraphs of size O(log n), provided that the query subgraph is either a simple path, a tree, or a bounded treewidth subgraph. Arvind and Raman [4] counted the number of subgraphs in a given graph G which are isomorphic to a bounded treewidth graph H. They gave a randomized approximate counting algorithm with a running time of k^{O(k)} · n^{b+O(1)}, where n and k are the number of vertices in G and H, respectively, and b is the treewidth of H. Alon and Gutner [2] combined the color coding technique with a construction of Balanced Families of Perfect Hash Functions to obtain a deterministic algorithm to count the number of simple paths or cycles of size k in an input graph G. Alon et al. [1] improved the algorithm of Alon and Gutner. They presented a polynomial time algorithm for approximating the number of non-induced occurrences of trees and bounded treewidth subgraphs with k = O(log n) vertices with a running time of 2^{O(k log log k)} · n^{O(1)}.

A new systematic measure of a network's local topology was recently suggested by Przulj [13]. They term this measure the "graphlet distribution" of a vertex. Namely, they count for each vertex the number of all motifs of size at most five that are adjacent to the vertex. The promise is that the distribution of motifs a node participates in is an indication of its function in the network, thus nodes can be divided into functional classes. In addition, Becchetti et al. [5] have recently shown that the distribution of the local number of triangles and the related clustering coefficient can be used to detect the presence of spamming activity in large scale Web graphs, as well as to provide useful features for the analysis of biochemical networks or the assessment of content quality in social networks. Therefore, counting the number of network motifs a node is part of also provides a major challenge. However, no practical algorithm for counting the number of network motifs a node is part of exists.

1.2 Our Contributions
We present several algorithms with time complexity O(e^{2k} k · n · |E| · log(1/δ) / ǫ^2) that, for the first time, approximate for every vertex the number of non-induced occurrences of the motif the vertex is part of, for k-length cycles, k-length cycles with a chord, and (k − 1)-length paths, where k = O(log n). We also provide algorithms with time complexity O(n · |E| · log(1/δ) / ǫ^2 + |E|^2 + |E| · n log n) that, for the first time, approximate for every vertex the number of non-induced occurrences of the motif the vertex is part of, for all motifs of size at most four. In addition, we show an O(e^k k · n · |E| · log(1/δ) / ǫ^2) algorithm that, for the first time, approximates the total number of non-induced occurrences of O(log n)-length cycles with a chord. Moreover, we improve the time complexity of approximating the total number of non-induced occurrences of "tailed" triangles and 4-cliques upon existing algorithms. Some of our algorithms use the color coding technique of Alon et al. [3].

Organization: In Section 2 we give notations and definitions. In Section 3 we introduce motif counting approximation algorithms for O(log n)-size motifs. In
Section 4 we present motif counting algorithms for all four-size motifs. We summarize our conclusions in Section 5.
2
Preliminaries
Let G = (V, E) be an undirected graph with n vertices. We assume that G is represented by adjacency lists. For a vertex v, let N(v) denote the set of neighbors of v and let deg(v) denote the degree of v. A motif H is said to be isomorphic to a subgraph H′ in G if there is a bijection between the vertices of H and the vertices of H′ such that for every edge between two vertices v and u of H there is an edge between the vertices v′ and u′ in H′ that correspond to v and u, respectively. Such a subgraph H′ is considered to be a non-induced occurrence of H in G. For a vertex v we say that v is adjacent to H if v is a vertex of H. Denote by [k] the set {1, . . . , k}. Denote by col(v) the color of vertex v. Let H be a motif with k vertices, and let G = (V, E) be a graph where |V| = n. Assign a color to each vertex of V from the color set [k]. The colors are assigned to each vertex independently and uniformly at random. A copy of H in G is said to be colorful if each vertex on it is colored by a distinct color. For a problem f, let #f denote the number of distinct solutions of f. Definition 1. ((ε, δ)-approximation) An algorithm A for a counting problem f is an (ε, δ)-approximation if it takes an input instance and two real values ε, δ and produces an output y such that Pr[(1 − ε) · #f ≤ y ≤ (1 + ε) · #f] ≥ 1 − 2δ.
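All of the approximation algorithms below obtain the (ε, δ)-guarantee of Definition 1 by the same outer device: average the counts from s independent random colorings, repeat this t = log(1/δ) times, take the median of the t averages, and rescale by k^k/k! at the end. A minimal Python sketch of that outer loop is given below; estimate_one_coloring is a hypothetical placeholder for the per-coloring count (for example P_i(v, [k]) for a fixed vertex v), and all names are ours, not the paper's.

```python
import random
from statistics import median

def color_vertices(vertices, k):
    # Assign each vertex a color from [k] independently and uniformly at random.
    return {v: random.randrange(k) for v in vertices}

def median_of_means(vertices, k, s, t, estimate_one_coloring):
    # estimate_one_coloring(coloring) should return the number of colorful
    # copies found under one random coloring (e.g., P_i(v, [k]) for a fixed v).
    estimates = []
    for _ in range(t):
        total = 0.0
        for _ in range(s):
            coloring = color_vertices(vertices, k)
            total += estimate_one_coloring(coloring)
        estimates.append(total / s)
    z = median(estimates)
    # Rescale: a fixed set of k vertices is colorful with probability k!/k^k.
    kk_over_kfact = 1.0
    for i in range(1, k + 1):
        kk_over_kfact *= k / i
    return z * kk_over_kfact
```

The specific motif counters in Sections 3 and 4 differ only in what they compute inside a single coloring.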
3
Algorithms for Counting Motifs of Size O(log n)
Given a graph G = (V, E) and a vertex v, we describe how to approximately count for every vertex v the number of non-induced occurrences of (k − 1)-length paths, k-length cycles, and k-length cycles with a chord that are adjacent to v, for k = O(log n). In addition, for each such motif H we present an algorithm for approximating the number of non-induced subgraphs of G that are isomorphic to H when no efficient algorithm exists. Most of our approximation algorithms apply the color coding technique of Alon et al. [3]. Note that we allow overlaps between the motifs we count, i.e., two occurrences of H, namely H′ and H″, may share vertices; in fact the vertex sets of H′ and H″ may be identical. We consider H′ and H″ distinct occurrences of H provided that the edge sets of H′ and H″ are not identical.
3.1 Counting Paths
In this section assume that H is a simple path of length k − 1. We present an algorithm to approximately count, for every vertex v, the number of subgraphs of G which are isomorphic to H and adjacent to v. (Note that Alon et al. [1] only count the total number of paths in the graph.)
Let t = log(1/δ), and let s = 4k^k/(ε² k!). Assume that we have a k-coloring of G, i.e., each vertex is randomly and independently colored with a color in [k]. For each vertex v and each subset S of the color set [k], let P_i(v, S) be the number of colorful paths adjacent to v using colors in S at the ith coloring, and let C_i(v, S) be the number of colorful paths having v as an endpoint, using colors in S, at the ith coloring. Consider the following algorithm. The algorithm takes as input: a graph G = (V, E), a vertex v ∈ V, the requested path length k − 1, a fault-tolerance ε, and an error probability δ.
Algorithm 1. (An (ε, δ)-approximation algorithm for counting simple paths of length k − 1 adjacent to a vertex v)
1. For j = 1 to t
   (a) For i = 1 to s
       i. Color each vertex of G independently and uniformly at random with one of the k colors.
       ii. For all u ∈ V: C_i(u, ∅) = 1.
       iii. For all ℓ ∈ [k]: C_i(v, {ℓ}) = 1 if col(v) = ℓ, and 0 otherwise.
       iv. For all S ⊆ [k] s.t. |S| > 1: C_i(v, S) = Σ_{u∈N(v)} C_i(u, S \ {col(v)}).
       v. P_i(v, [k]) = Σ_{ℓ=1}^{k} Σ_{u∈N(v)} Σ_{(S1,S2)∈A_{ℓ,v}} C_i(v, S1) · C_i(u, S2), where A_{ℓ,v} = {(S_i, S_j) | S_i ⊆ [k], S_j ⊆ [k], S_i ∩ S_j = ∅, |S_i| = ℓ, |S_j| = k − ℓ}.
       vi. Let X_i^v = P_i(v, [k]).
   (b) Let Y_j^v = (Σ_{i=1}^{s} X_i^v)/s.
2. Let Z^v be the median of Y_1^v, ..., Y_t^v.
3. Return Z^v · k^k/k!.
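To make the recursion of steps ii through iv concrete, the following Python sketch computes, for one fixed coloring, the table C(u, S) of colorful paths ending at u whose vertices use exactly the colors in S, together with the total number of colorful paths on k vertices. It is a simplified illustration of the color-coding dynamic program, not a line-by-line rendering of Algorithm 1 (in particular it does not reproduce the per-vertex splitting of step v); all names are our own.

```python
from itertools import combinations

def colorful_paths(adj, coloring, k):
    # adj: dict mapping each vertex to an iterable of neighbours (undirected, simple).
    # coloring: dict mapping each vertex to a color in range(k).  Assumes k >= 2.
    # C[(u, S)] = number of colorful paths on |S| vertices that end at u and use
    # exactly the colors in frozenset S.
    C = {}
    for u in adj:
        C[(u, frozenset([coloring[u]]))] = 1
    for size in range(2, k + 1):
        for colors in combinations(range(k), size):
            S = frozenset(colors)
            for u in adj:
                if coloring[u] not in S:
                    continue
                rest = S - {coloring[u]}
                C[(u, S)] = sum(C.get((w, rest), 0) for w in adj[u])
    full = frozenset(range(k))
    # Each undirected colorful k-path is counted once from each endpoint.
    total = sum(C.get((u, full), 0) for u in adj) // 2
    return C, total
```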
Our main theorem is the following: Theorem 1. Let G = (V, E) be an undirected graph, and let H be a simple path of length k − 1. Then for all v ∈ V, Algorithm 1 is an (ε, δ)-approximation for the number of copies of H in G that are adjacent to v, with time complexity O(e^k |E| log(1/δ)/ε²). For proving Theorem 1 we first prove the following lemma:
Lemma 1. For all v ∈ V, P_i(v, [k]) can be computed in O(2^k |E|) time. Proof: According to Alon et al. [1], the time complexity for computing C_i(v, S) for all v and S is Σ_{v∈V} Σ_{u∈N(v)} O(2^k) + Σ_{v∈V} O(k) = O(Σ_{v∈V} deg(v) · 2^k) = O(2^k |E|). A vertex v is adjacent to a colorful path of length k − 1 if and only if it is an endpoint of a colorful path of length ℓ − 1, one of its neighbors is an endpoint of a colorful path of length k − 1 − ℓ, and the subsets of colors of both paths are disjoint. Thus, for each vertex v, the number of colorful paths of length k − 1 that are adjacent to v is
P(v, [k]) = Σ_{ℓ=1}^{k} Σ_{u∈N(v)} Σ_{(S1,S2)∈A_{ℓ,v}} C(v, S1) · C(u, S2),
where A_{ℓ,v} = {(S_i, S_j) | S_i ⊆ [k], S_j ⊆ [k], S_i ∩ S_j = ∅, |S_i| = ℓ, |S_j| = k − ℓ}. (We define C(u, ∅) = 1 for every vertex u.) Therefore the running time for computing P(v, [k]) for all v, assuming that C(u, S) is known for any vertex u and any color set S, is Σ_{v∈V} Σ_{ℓ=1}^{k} \binom{k}{ℓ} deg(v) = (2^k − 1) · 2 · |E|.
Thus the total running time for computing P (v, [k]) for all v is O(2k |E|).
⊓ ⊔
The proof of Theorem 1 is based on Lemma 1 and the approximation technique of Alon et al. [1]. The details of the proof of Theorem 1 appear in the full version of the paper [7].
3.2 Counting Cycles
In this section assume that H is a simple cycle of length k. We present an algorithm to approximately count for every vertex v the number of subgraphs of G which are isomorphic to H and adjacent to v. Let t = log(1/δ), and let s = 4k^k/(ε² k!). Assume that we have a k-coloring of G, i.e., each vertex is randomly and independently colored with a color in [k]. For each pair of vertices v, x and each color set S of the color set [k], let C_i(v, x, S) be the number of colorful paths between v and x using colors in S at the ith coloring, and let CY_i(v, S) be the number of colorful cycles adjacent to v using colors in S at the ith coloring. Consider the following algorithm. The algorithm takes as input: a graph G = (V, E), a vertex v ∈ V, the requested cycle length k, a fault-tolerance ε, and an error probability δ. The algorithm uses a procedure to compute the number of colorful paths between v and any other vertex.
Algorithm 2. (An (ε, δ)-approximation algorithm for counting simple cycles of length k adjacent to a vertex v)
1. For j = 1 to t
   (a) For i = 1 to s
       i. Color each vertex of G independently and uniformly at random with one of the k colors.
       ii. For all x ∈ V: C_i(v, x, [k]) = count-path(v, x, k).
       iii. Let CY_i(v, [k]) = (1/2) Σ_{u∈N(v)} C_i(v, u, [k]).
       iv. Let X_i^v = CY_i(v, [k]).
   (b) Let Y_j^v = (Σ_{i=1}^{s} X_i^v)/s.
2. Let Z^v be the median of Y_1^v, ..., Y_t^v.
3. Return Z^v · k^k/k!.
Algorithm 3. count-path(v, x, k) (counting simple paths of length k − 1 between v and x)
1. For all S ⊆ [k] s.t. S = {ℓ}: C_i(v, x, S) = 1 if col_i(v) = col_i(x) = ℓ, and 0 otherwise.
2. For q = 2 to k, for all S ⊆ [k] s.t. |S| = q: C_i(v, x, S) = Σ_{u∈N(v)} C_i(u, x, S \ {col_i(v)}).
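A compact Python sketch of the dynamic program behind count-path, for one fixed coloring and a fixed source vertex v, is given below; it peels colors off the endpoint x rather than off v, which is an equivalent formulation of the recursion, and then applies the identity of step iii of Algorithm 2 to count colorful k-cycles through v. The names and the k ≥ 3 assumption are ours.

```python
from itertools import combinations

def count_path_table(adj, coloring, v, k):
    # D[(x, S)] = number of colorful paths from v to x whose vertices use exactly
    # the colors in frozenset S (so the path has |S| vertices).  Nonzero entries
    # always contain coloring[v] in S, so the paths are simple.
    D = {(v, frozenset([coloring[v]])): 1}
    for size in range(2, k + 1):
        for colors in combinations(range(k), size):
            S = frozenset(colors)
            for x in adj:
                if x == v or coloring[x] not in S:
                    continue
                rest = S - {coloring[x]}
                val = sum(D.get((w, rest), 0) for w in adj[x])
                if val:
                    D[(x, S)] = val
    return D

def colorful_cycles_through(adj, coloring, v, k):
    # Number of colorful k-cycles containing v under this coloring (k >= 3).
    # Each cycle is seen once from each of v's two neighbours on it.
    D = count_path_table(adj, coloring, v, k)
    full = frozenset(range(k))
    return sum(D.get((u, full), 0) for u in adj[v]) // 2
```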
Our main theorem is the following: Theorem 2. Let G = (V, E) be an undirected graph, and let H be a simple cycle of length k. Then for every vertex v, Algorithm 2 is an (ε, δ)-approximation for the number of copies of H in G that are adjacent to v, with time complexity O(e^k · k · n · |E| log(1/δ)/ε²). For proving Theorem 2 we first prove the following lemma:
Lemma 2. For all v ∈ V, CY_i(v, [k]) can be computed in O(2^k · k · n · |E|) time. Proof: A vertex v is adjacent to a colorful cycle of length k if and only if it is an endpoint of a colorful path of length k − 1 which has one of v's neighbors as its other endpoint. Therefore, we first compute for every two vertices u, v the number of colorful paths of length k − 1 between u and v. If u is a neighbor of v, then we get a cycle of length k. The running time for computing CY(v, [k]) for all v is then 2^k (Σ_{v,x∈V} 1 + (k − 2) Σ_{v,x∈V} 1 + Σ_{v∈V} deg(v)) = O(2^k · k · n · |E|).
⊓ ⊔
Proof of Theorem 2. The correctness of the approximation returned by Algorithm 2 is proved in the same manner as in the proof of Theorem 1. Lemma 2 implies the correctness of the computation of CY_i(v, [k]). The time complexity of Algorithm 2 is O(e^k · k · n · |E| log(1/δ)/ε²) by Lemma 2, and by showing that the number of colorings used by the algorithm is O(e^k log(1/δ)/ε²). (This is proved in the same manner as in the proof of Theorem 1.) This completes the proof. ⊓⊔
3.3 Counting k-Length Cycles with a Chord
In this section assume that H is a simple cycle of length k with a chord, such that the distance on the cycle between the endpoints of the chord is min{ℓ, k − ℓ}, for some given 2 ≤ ℓ ≤ k − 2. We present an algorithm to approximately compute the number of subgraphs of G which are isomorphic to H, and, for every vertex v, the number of subgraphs of G which are isomorphic to H and adjacent to v. The approximation of the number of colorful subgraphs of G which are isomorphic to H appears in the full version of the paper [7]. We now approximate for every v ∈ V the number of colorful subgraphs of G which are isomorphic to H and are adjacent to v. Let t = log(1/δ), and let s = 4k^k/(ε² k!). Assume that we have a k-coloring of G, i.e., each vertex is randomly and independently colored with a color in [k]. Let P_i(v, u, w, S) be the number of
colorful paths from u to w that are adjacent to v in the ith coloring, using the colors in S. Recall that C_i(v, u, S) is the number of colorful paths from v to u in the ith coloring, using the colors in S. Let A^{z,b}_{V′}(S) = {(S1, S2) | S1, S2 ⊆ [k], |S1| = z + 1, |S2| = b − z + 1, S1 ∪ S2 = S, (S1 \ {col(u) | u ∈ V′}) ∩ (S2 \ {col(u) | u ∈ V′}) = ∅}. Consider the following algorithm. The algorithm takes as input: a graph G = (V, E), a vertex v, a fault-tolerance ε, and an error probability δ.
Algorithm 4. (An (ε, δ)-approximation algorithm for counting simple cycles of length k with a chord that are adjacent to v)
1. For j = 1 to t
   (a) For i = 1 to s
       i. Color each vertex of G independently and uniformly at random with one of the k colors.
       ii. X_i^v = 0.
       iii. For every edge (u, w) ∈ E:
       iv. For all S ⊆ [k] s.t. |S| = ℓ + 1: P_i(v, u, w, S) = Σ_{z=1}^{ℓ−1} Σ_{(S1,S2)∈A^{z,ℓ}_v(S)} C_i(v, w, S1) · C_i(v, u, S2).
       v. Let X_i^v = X_i^v + Σ_{(S3,S4)∈A^{ℓ,k}_{uw}([k])} P_i(v, u, w, S3) · C_i(u, w, S4) + Σ_{(S3,S4)∈A^{k−ℓ,k}_{uw}([k])} P_i(v, u, w, S3) · C_i(u, w, S4) + Σ_{(S3,S4)∈A^{ℓ,k}_{uv}([k])} C_i(v, u, S3) · C_i(v, u, S4).
   (b) Let Y_j^v = (Σ_{i=1}^{s} X_i^v)/s.
2. Let Z^v be the median of Y_1^v, ..., Y_t^v.
3. Return Z^v · k^k/k!.
Our main theorem is the following: Theorem 3. Let G = (V, E) be an undirected graph, and let H be a simple cycle of length k with a chord. Then, for every v ∈ V, Algorithm 4 is an (ε, δ)-approximation for the number of copies of H in G that are adjacent to v, with time complexity O(|E| · n · e^{2k} · k log(1/δ)/ε²). For proving Theorem 3 we first prove the following lemma:
Lemma 3. X_i^v can be computed with time complexity O(|E| · n · 2^{2k} · k). Proof: Let (u, w) be the chord, and let ℓ be the distance on the cycle between u and w. The number of copies of H that are adjacent to v depends on the position of v. There are three cases: one in which v is adjacent to a path of length ℓ between u and w, one in which v is adjacent to a path of length k − ℓ between u and w, and one in which v is an endpoint of the chord. In the first case we first count all the colorful paths of length z between v and w and multiply it by the number of colorful paths of length ℓ − z between v and u, where 1 ≤ z ≤ ℓ − 1.
(W.l.o.g. assume that ℓ ≠ k − ℓ.) Thus for all S ⊆ [k] s.t. |S| = ℓ + 1, P_i(v, u, w, S) = Σ_{z=1}^{ℓ−1} Σ_{(S1,S2)∈A^{z,ℓ}_v(S)} C_i(v, w, S1) · C_i(v, u, S2). Therefore the total number of copies of H that are adjacent to v in the first case is the number of ℓ-length colorful paths between u and w that are adjacent to v, multiplied by the number of (k − ℓ)-length colorful paths between u and w, with disjoint sets of colors (except for the colors of u and w): Σ_{(S3,S4)∈A^{ℓ,k}_{uw}([k])} P_i(v, u, w, S3) · C_i(u, w, S4). The second case is computed in the same manner. The third case is computed as follows. We count the number of ℓ-length colorful paths between u and v and multiply it by the number of (k − ℓ)-length colorful paths between u and v, using disjoint sets of colors besides the colors of u and v. Computing the running time: According to the proof of Lemma 2, the time complexity for computing C_i(v, w, S) for every color set S and every pair of vertices v, w is O(2^k · k · n · |E|). The running time of computing P_i(v, u, w, S) for all vertices v, u, w and every color set S of size ℓ + 1 is O(Σ_{z=1}^{ℓ−1} \binom{ℓ}{z} · \binom{k}{ℓ+1} · |E| · n) = O(2^ℓ · \binom{k}{ℓ} · |E| · n). Therefore the time complexity of computing the first case is O(Σ_{v∈V} Σ_{(u,w)∈E} \binom{k}{ℓ+1} + 2^ℓ · \binom{k}{ℓ} · |E| · n) + O(2^k · k · n · |E|) = O(\binom{k}{ℓ} · |E| · n + 2^ℓ · \binom{k}{ℓ} · |E| · n) + O(2^k · k · n · |E|) = O(2^k · k · n · |E| · 2^ℓ). In the same manner the time complexity of case two is O(2^k · k · n · |E| · 2^{k−ℓ}). The time complexity of the third case (besides computing C_i(v, w, S)) is O(Σ_{(u,v)∈E} \binom{k}{ℓ+1}) = O(|E| · \binom{k}{ℓ}). Thus the total time complexity is O(|E| · n · 2^{2k} · k).
⊓ ⊔
Proof of Theorem 3. The correctness of the approximation returned by Algorithm 4 is proved in the same manner as in the proof of Theorem 1. Lemma 3 implies the correctness of the computation of X_i^v. The time complexity of Algorithm 4 is O(|E| · n · k · e^{2k} log(1/δ)/ε²) by Lemma 3 and by showing that the number of colorings used by the algorithm is O(e^k log(1/δ)/ε²). (This is proved in the same manner as in the proof of Theorem 1.) This completes the proof. ⊓⊔
4
Algorithms for Counting All Four-Size Motifs
Given a graph G = (V, E) and a vertex v, we describe how to approximately count for every vertex v the number of non-induced occurrences of each possible motif H that are adjacent to v. In addition, for each such motif H we present an algorithm for approximating the number of non-induced subgraphs of G that are isomorphic to H when no efficient algorithm exists. Note that we allow overlaps between the motifs, as in the previous section.
4.1 Counting “Tailed Triangles”
In this section assume that H is a triangle with a “tail” of length one. We present an algorithm that approximates the number of subgraphs of G which are
isomorphic to H, and, for every vertex v, approximates the number of subgraphs of G which are isomorphic to H and adjacent to v. We first approximate for every v ∈ V the number of subgraphs of G which are isomorphic to H and are adjacent to v. Let TR_G(v) be the approximation of the total number of triangles in G that are adjacent to v, according to Algorithm 2. Let G_v = (V_v, E_v), where V_v = V \ {v}, and E_v is the induced set of edges obtained by removing all edges adjacent to v. Consider the following algorithm. The algorithm takes as input: a graph G = (V, E), a vertex v, a fault-tolerance ε, and an error probability δ.
Algorithm 5. (An (ε, δ)-approximation algorithm for counting simple “tailed triangles” adjacent to v)
1. TLG(v) = 0.
2. TR_G(v) = result of Algorithm 2 (G, v, k = 3, ε, δ).
3. TLG(v) = TLG(v) + TR_G(v) · (|N(v)| − 2).
4. Compute G_v by going over the whole adjacency list and removing v any time it appears in the list.
5. For all u ∈ N(v): TR_{G_v}(u) = result of Algorithm 2 (G_v, u, k = 3, ε, δ).
6. TLG(v) = TLG(v) + Σ_{u∈N(v)} TR_{G_v}(u).
7. a_v = 0.
8. For all u ∈ N(v), go over u's adjacency list, and for each vertex w in u's adjacency list check if w ∈ N(v) by going over v's adjacency list. If w ∈ N(v) then a_v = a_v + deg w − 2 + deg u − 2.
9. Return TLG(v) + a_v.
Theorem 4. Let G = (V, E) be an undirected graph, and let H be a triangle with a “tail” of length one. Then, for every vertex v, the number of subgraphs of G that are isomorphic to H and adjacent to v can be (ε, δ)-approximated with time complexity O(|E|² + n · |E| log(1/δ)/ε²).
The proof of Theorem 4 and counting the number of subgraphs of G which are isomorphic to H appear in the full version of the paper [7].
4.2 Counting 4-Cliques
In this section assume that H is a clique of size four. We present an algorithm that exactly computes the number of subgraphs of G which are isomorphic to H, and, for every vertex v, the number of subgraphs of G which are isomorphic to H and adjacent to v. We first compute, for every vertex v, the number of subgraphs of G which are isomorphic to H and adjacent to v. We run the following algorithm. Let Cl(v) be the number of four-cliques in the graph that are adjacent to v. The algorithm takes as input: a graph G = (V, E) and a vertex v.
Algorithm 6. (Algorithm for counting 4-cliques that are adjacent to v)
1. Cl(v) = 0.
2. For every vertex u ∈ N(v):
   (a) Compute N(v) ∩ N(u):
       i. Go over all the vertices in the adjacency list of v and the adjacency list of u, and add each vertex to a list. (Thus a vertex can appear several times in the list.)
       ii. Sort the vertices in the list (which is a multiset) according to the names of the vertices.
       iii. For each vertex in the list count the number of times it appears in the list. If it appears twice then add the vertex to a list ℓ(u, v).
       iv. Sort the list ℓ(u, v) according to the names of the vertices.
   (b) For all w ∈ ℓ(u, v) go over the adjacency list of w and for each vertex t ≠ v, u in this adjacency list check if t ∈ ℓ(u, v). If t ∈ ℓ(u, v) then Cl(v) := Cl(v) + 1.
3. Return Cl(v)/6.
Theorem 5. Let G = (V, E) be an undirected graph, and let H be a clique of size four. Then for all v ∈ V, Algorithm 6 counts the number of copies of H in G that are adjacent to v, with time complexity O(|E| · n log n + |E|²). The proof of Theorem 5 and counting the number of subgraphs of G which are isomorphic to H appear in the full version of the paper [7]. Counting Small Trees. Let H be a tree of size four that consists of a vertex and three of its neighbors. Computing the number of subgraphs of G which are isomorphic to H, and, for every vertex v, the number of subgraphs of G which are isomorphic to H and adjacent to v, appear in the full version of the paper [7].
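A direct Python rendering of Algorithm 6 above might look as follows, with Python set intersections standing in for the sorted-list computation of ℓ(u, v) in steps 2(a) and 2(b); the graph is assumed simple (no self-loops or parallel edges), and the names are ours.

```python
def four_cliques_at(adj, v):
    # Exact count of 4-cliques containing v.
    nv = set(adj[v])
    count = 0
    for u in adj[v]:
        common = nv & set(adj[u])          # l(u, v) = N(v) ∩ N(u)
        for w in common:
            for t in adj[w]:
                if t != v and t != u and t in common:
                    count += 1             # ordered triple (u, w, t) found
    return count // 6                      # each 4-clique is counted 3! = 6 times
```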
5
Conclusions
We presented algorithms with time complexity O(e^{2k} k · n · |E| · log(1/δ)/ε²) that, for the first time, approximate the number of non-induced occurrences of the motif a vertex is part of, for k-length cycles, k-length cycles with a chord, and (k − 1)-length paths, where k = O(log n), and for all motifs of size at most four. In addition, we showed algorithms that approximate the total number of non-induced occurrences of these network motifs, when no efficient algorithm exists. Approximating the number of non-induced occurrences of the motif a vertex is part of, for other motifs of size O(log n), is left for future work. Acknowledgment. We thank Dana Ron for many hours of fruitful discussions.
References
1. Alon, N., Dao, P., Hajirasouliha, I., Hormozdiari, F., Sahinalp, S.C.: Biomolecular network motif counting and discovery by color coding. Bioinformatics 1, 1–9 (2008)
2. Alon, N., Gutner, S.: Balanced families of perfect hash functions and their applications. In: Arge, L., Cachin, C., Jurdziński, T., Tarlecki, A. (eds.) ICALP 2007. LNCS, vol. 4596, pp. 435–446. Springer, Heidelberg (2007)
3. Alon, N., Yuster, R., Zwick, U.: Color-coding. Journal of the ACM 42(4), 844 (1995)
4. Arvind, V., Raman, V.: Approximation algorithms for some parameterized counting problems. In: Bose, P., Morin, P. (eds.) ISAAC 2002. LNCS, vol. 2518, pp. 453–464. Springer, Heidelberg (2002)
5. Becchetti, L., Boldi, P., Castillo, C., Gionis, A.: Efficient semi-streaming algorithms for local triangle counting in massive graphs. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 16–24 (2008)
6. Dost, B., Shlomi, T., Gupta, N., Ruppin, E., Bafna, V., Sharan, R.: QNet: A tool for querying protein interaction networks. In: Speed, T., Huang, H. (eds.) RECOMB 2007. LNCS (LNBI), vol. 4453, pp. 1–15. Springer, Heidelberg (2007)
7. Gonen, M., Shavitt, Y.: Approximating the number of network motifs. Technical report, School of Electrical Engineering, Tel Aviv University (2008)
8. Grochow, J.A., Kellis, M.: Network motif discovery using subgraph enumeration and symmetry-breaking. In: Speed, T., Huang, H. (eds.) RECOMB 2007. LNCS (LNBI), vol. 4453, pp. 92–106. Springer, Heidelberg (2007)
9. Hales, D., Arteconi, S.: Motifs in evolving cooperative networks look like protein structure networks. Special Issue of ECCS 2007 in The Journal of Networks and Heterogeneous Media 3(2), 239–249 (2008)
10. Hormozdiari, F., Berenbrink, P., Przulj, N., Sahinalp, S.C.: Not all scale-free networks are born equal: The role of the seed graph in PPI network evolution. PLoS Computational Biology 3(7), e118 (2007)
11. Kashtan, N., Itzkovitz, S., Milo, R., Alon, U.: Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11), 1746 (2004)
12. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network motifs: Simple building blocks of complex networks. Science 298, 824–827 (2002)
13. Przulj, N.: Biological network comparison using graphlet degree distribution. Bioinformatics 23(2), e177–e183 (2007)
14. Przulj, N., Corneil, D.G., Jurisica, I.: Modeling interactome: Scale-free or geometric? Bioinformatics 150, 216–231 (2005)
15. Scott, J., Ideker, T., Karp, R.M., Sharan, R.: Efficient algorithms for detecting signaling pathways in protein interaction networks. In: Miyano, S., Mesirov, J., Kasif, S., Istrail, S., Pevzner, P.A., Waterman, M. (eds.) RECOMB 2005. LNCS (LNBI), vol. 3500, pp. 1–13. Springer, Heidelberg (2005)
16. Shlomi, T., Segal, D., Ruppin, E.: QPath: a method for querying pathways in a protein-protein interaction network. Bioinformatics 7, 199 (2006)
17. Wernicke, S.: Efficient detection of network motifs. IEEE/ACM Transactions on Computational Biology and Bioinformatics 3(4), 347–359 (2006)
Finding Dense Subgraphs with Size Bounds Reid Andersen and Kumar Chellapilla Microsoft Live Labs, Redmond WA 98052, USA {reidan,kumarc}@microsoft.com
Abstract. We consider the problem of finding dense subgraphs with specified upper or lower bounds on the number of vertices. We introduce two optimization problems: the densest at-least-k-subgraph problem (dalks), which is to find an induced subgraph of highest average degree among all subgraphs with at least k vertices, and the densest at-most-k-subgraph problem (damks), which is defined similarly. These problems are relaxed versions of the well-known densest k-subgraph problem (dks), which is to find the densest subgraph with exactly k vertices. Our main result is that dalks can be approximated efficiently, even for web-scale graphs. We give a (1/3)-approximation algorithm for dalks that is based on the core decomposition of a graph, and that runs in time O(m + n), where n is the number of nodes and m is the number of edges. In contrast, we show that damks is nearly as hard to approximate as the densest k-subgraph problem, for which no good approximation algorithm is known. In particular, we show that if there exists a polynomial time approximation algorithm for damks with approximation ratio γ, then there is a polynomial time approximation algorithm for dks with approximation ratio γ 2 /8. In the experimental section, we test the algorithm for dalks on large publicly available web graphs. We observe that, in addition to producing near-optimal solutions for dalks, the algorithm also produces near-optimal solutions for dks for nearly all values of k.
1
Introduction
The density of an induced subgraph is the number of edges contained in the subgraph, divided by the number of vertices. Identifying subgraphs with high density is a useful primitive, which has been applied to find web communities, produce compressed representations of graphs, and identify link spam [9,14,20,8]. Effective heuristics have been developed to identify various kinds of dense subgraphs. Kumar et al. gave an algorithm for finding bipartite cliques [20]. Dourisboure et al. gave a scalable heuristic for finding small dense communities in web graphs [9]. The algorithm of Gibson et al. [14] finds dense communities using two-level min-hashing, with the goal of identifying link spam. Generally speaking, these algorithms are designed to find collections of small dense subgraphs that are isolated from each other, which are often viewed as the dense centers of communities in the graph. This is quite different from the task of finding a single large dense subgraph that contains a significant fraction of the graph, which is often used as a method of preprocessing or subsampling the graph.
The complexity of identifying dense subgraphs can vary greatly when additional constraints on the size of the subgraph are introduced. Finding the densest subgraph with an arbitrary number of vertices is known as the densest subgraph problem ds, and can be solved exactly in polynomial time by solving a sequence of maximum flow problems [15,13]. The algorithm of Kortsarz and Peleg [18] produces a (1/2)-approximation of the densest subgraph in linear time, which is useful for graphs where the time required to compute maximum flows is prohibitively large. In contrast, no efficient algorithm is known for the problem of finding the densest subgraph with exactly k vertices, where k is specified as part of the input. This is the densest k-subgraph problem, or dks. The dks problem is NP-complete, and the best polynomial time algorithm known for dks (due to Feige, Peleg, and Kortsarz) has the approximation ratio n^{−(1/3)+δ}, for a small constant δ [11]. We want to control the size of the dense subgraphs we find, but we need to avoid the difficult task of finding dense subgraphs of a specified size. In this paper, we address this problem by introducing two variations of the densest subgraph problem: finding the densest subgraph with at least k vertices, and finding the densest subgraph with at most k vertices. We refer to these as the densest at-least-k-subgraph problem (dalks), and the densest at-most-k-subgraph problem (damks). These two relaxed versions of the densest k-subgraph problem roughly correspond to the two types of applications for dense subgraphs described earlier; for finding communities one would want an algorithm to solve damks, and for preprocessing a graph one would want an algorithm for dalks. Our main result is to show that dalks can be solved efficiently, while damks is nearly as hard to approximate as dks. In Section 3, we introduce a (1/3)-approximation algorithm for dalks that runs in time O(m + n) in an unweighted graph. In Section 4, we prove a reduction that shows any polynomial time γ-approximation algorithm for damks can be used to design a polynomial time (γ²/8)-approximation algorithm for dks. Our algorithm for dalks is based on the core decomposition, and it can be viewed as a generalization of Kortsarz and Peleg's (1/2)-approximation algorithm for the densest subgraph problem [18]. The core decomposition was first introduced as a tool for social network analysis [19]. It has been used in several applications, including graph drawing [2] and the analysis of biological networks [21]. In Section 6, we present experimental results for dalks on publicly available webgraphs. We demonstrate that the algorithm finds subgraphs with nearly optimal density while providing considerable control over the subgraph size. Surprisingly, we observe that on typical web graphs, the algorithm also produces a good approximation of the densest subgraph on exactly k vertices, for nearly all values of k. In Section 5, we describe theoretical results that help to explain this observation. We introduce two graph parameters based on the density of the graph's cores. Given these parameters, we prove bounds on the range of k for which our algorithm produces a good approximation for the densest k-subgraph problem.
1.1
Related Work
Here we survey results on the complexity of the densest k-subgraph problem. The best approximation algorithm known for the general problem (when k is specified as part of the input) is the algorithm of Feige, Peleg, and Kortsarz [11], which has approximation ratio O(n^{−(1/3)+δ}), for a small constant δ > 0. For any particular value of k, the greedy algorithm of Asahiro et al. [6] gives the ratio O(k/n). Algorithms based on linear programming and semidefinite programming have produced approximation ratios better than O(k/n) for certain values of k, but do not improve the approximation ratio for the general case [12,10]. Feige and Seltser [12] showed that dks is NP-complete when restricted to bipartite graphs of maximum degree 3, by a reduction from max-clique. This reduction does not produce a hardness of approximation result for dks. In fact, they showed that if a graph contains a k-clique, then a subgraph with k vertices and (1 − ε)\binom{k}{2} edges can be found in subexponential time. Khot [17] proved there can be no PTAS (polynomial time approximation scheme) for the densest k-subgraph problem, under a reasonable complexity assumption. Arora, Karger, and Karpinski [4] gave a PTAS for the special case k = Ω(n) and m = Ω(n²). Asahiro, Hassin, and Iwama [5] showed that the problem is still NP-complete for very sparse graphs. Kannan and Vinay [16] introduced a different objective function for density, which is defined for a pair of vertex subsets S and T rather than a single subgraph. They gave an O(log n)-approximation for this objective function using spectral techniques. Charikar [7] later gave a linear time (1/2)-approximation algorithm for this objective function, based on the core decomposition, and showed that the problem can be solved exactly by linear programming. A local algorithm for finding small subgraphs with high density according to the Kannan-Vinay objective function was described in [3].
2
Definitions
Let G = (V, E) be an undirected graph with a weight function w : E → R⁺ that assigns a positive weight to each edge. The weighted degree w(v, G) is the sum of the weights of the edges in G incident with v. The total weight W(G) is the sum of the weights of the edges in G. Definition 1. For any induced subgraph H of G, the density d(H) of H is d(H) := W(H)/|H|.
Definition 2. For an undirected graph G, we define the following quantities. Dal(G, k) := the maximum density of any induced subgraph of G with at least k vertices. Dam(G, k) := the maximum density of any induced subgraph of G with at most k vertices.
Deq(G, k) := the maximum density of any induced subgraph of G with exactly k vertices. Dmax(G) := the maximum density of any induced subgraph of G. The densest at-least-k-subgraph problem (dalks) is to find an induced subgraph with at least k vertices achieving density Dal(G, k). Similarly, the densest at-most-k-subgraph problem (damks) is to find an induced subgraph with at most k vertices achieving density Dam(G, k). The densest k-subgraph problem (dks) is to find an induced subgraph with exactly k vertices achieving Deq(G, k), and the densest subgraph problem (ds) is to find an induced subgraph of any size achieving Dmax(G). We now define what it means to be an approximation algorithm for dalks. Approximation algorithms for damks, dks, and ds are defined similarly. Definition 3. We say an algorithm A(G, k) is a γ-approximation algorithm for the densest at-least-k-subgraph problem if, for any graph G and integer k, it returns an induced subgraph H ⊆ G with at least k vertices and density d(H) ≥ γ Dal(G, k).
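For very small graphs, Dal(G, k) can be computed exactly by brute force, which is convenient for sanity-checking the approximation algorithm of Section 3. A Python sketch (exponential time, illustration only, unweighted case; the names are ours):

```python
from itertools import combinations

def exact_dal(adj, k):
    # Brute-force D_al(G, k) over all vertex subsets of size >= k.
    vertices = list(adj)
    best = 0.0
    for size in range(k, len(vertices) + 1):
        for subset in combinations(vertices, size):
            vs = set(subset)
            edges = sum(1 for v in vs for u in adj[v] if u in vs) // 2
            best = max(best, edges / size)
    return best
```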
3
Finding Dense Subgraphs with at Least k Vertices
In this section, we describe an algorithm FindLargeDenseSubgraph that is a (1/3)-approximation algorithm for the densest at-least-k-subgraph problem and that runs in time O(m + n) in an unweighted graph. The algorithm is described in Figure 1. The main step of the algorithm computes the core decomposition of the graph using a well-known greedy procedure (see [18,7,2]). This produces an ordering (v_1, . . . , v_n) of the vertices of the graph, after which the algorithm outputs a subgraph of the form {v_1, . . . , v_j}. Kortsarz and Peleg [18] used the core decomposition to give a (1/2)-approximation algorithm for ds. Theorem 1 extends their result to show that the core decomposition can be used to approximate dalks. Theorem 1. FindLargeDenseSubgraph(G, k) is a (1/3)-approximation algorithm for the densest at-least-k-subgraph problem. The proof of Theorem 1 is in Section 3.1. The core decomposition procedure, which dominates the running time of FindLargeDenseSubgraph, can be implemented to run in time O(m + n) in an unweighted graph and O(m + n log n) in a weighted graph. For a proof, we refer the reader to [18]. This implies the following proposition. Proposition 1. The running time of FindLargeDenseSubgraph(G, k) is O(m + n) in an unweighted graph, and O(m + n log n) in a weighted graph.
FindLargeDenseSubgraph(G, k):
Input: a graph G with n vertices, and an integer k.
Output: an induced subgraph of G with at least k vertices.
1. Compute the core decomposition of G: Let H_n = G and repeat the following for i = n, . . . , 1.
   (a) Let r_i be the minimum weighted degree of any vertex in H_i.
   (b) Let v_i be a vertex of minimum weighted degree, where w(v_i, H_i) = r_i.
   (c) Remove v_i from H_i to form the induced subgraph H_{i−1}.
   (d) Update the values of W(H_i) and d(H_i) as follows: W(H_{i−1}) = W(H_i) − 2r_i, d(H_{i−1}) = W(H_{i−1})/(i − 1).
   Note that part 1 produces an ordering of the vertices v_1, . . . , v_n, where v_1 is the last vertex removed and v_n is the first. The set H_i consists of the vertices {v_1, . . . , v_i}.
2. Output the subgraph H_i with the largest density d(H_i) over all i ≥ k.
Fig. 1. Description of FindLargeDenseSubgraph
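A compact Python sketch of Fig. 1 for the unweighted case is given below. It uses a binary heap for the minimum-degree extraction, so it runs in O(m log n) rather than the O(m + n) achievable with bucketed degree lists, and it measures density as edges divided by vertices, as in Definition 1; all names are ours, and k ≤ n is assumed.

```python
import heapq

def find_large_dense_subgraph(adj, k):
    # adj maps each vertex to the set of its neighbours (comparable, hashable labels).
    deg = {v: len(adj[v]) for v in adj}
    n = len(adj)
    weight = sum(deg.values()) // 2          # W(H_n) = |E|
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed = set()
    removal_order = []                       # v_n is removed first, v_1 last
    best_density, best_i = -1.0, None
    i = n                                    # current graph is H_i
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue                         # stale heap entry
        if i >= k and weight / i > best_density:
            best_density, best_i = weight / i, i
        removed.add(v)
        removal_order.append(v)
        weight -= deg[v]                     # r_i edges leave with v
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
        i -= 1
    # H_i consists of the last best_i vertices removed.
    return set(removal_order[n - best_i:])
```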
3.1
Analysis of the Algorithm
To analyze FindLargeDenseSubgraph, we consider the relationship between induced subgraphs of G with high average degree (dense subgraphs) and induced subgraphs of G with high minimum degree (w-cores). Definition 4. Given a graph G and a weight w ∈ R, the w-core C_w(G) is the unique largest induced subgraph of G with minimum weighted degree at least w. Here is an outline of how we will proceed. We first show that the FindLargeDenseSubgraph algorithm computes all the w-cores of G (Lemma 1). We then show that for any induced subgraph H of G with density d, the (2d/3)-core of G has total weight at least W(H)/3 (Lemma 2). We prove Theorem 1 using these two lemmas. Lemma 1. Let {H_1, . . . , H_n} and {r_1, . . . , r_n} be the induced subgraphs and weighted degrees determined by the algorithm FindLargeDenseSubgraph on the input graph G. For any w ∈ R, let I(w) be the largest index such that r_{I(w)} ≥ w. Then H_{I(w)} = C_w(G). In other words, every w-core of G is equal to one of the subgraphs H_i. Proof. Fix a value of w. It is easy to see (by induction) that none of the vertices v_n, . . . , v_{I(w)+1} that were removed before v_{I(w)} can be contained in an induced subgraph with minimum degree at least w. That implies C_w(G) ⊆ H_{I(w)}. On the other hand, the minimum degree of H_{I(w)} is at least w, so H_{I(w)} ⊆ C_w(G). Therefore, H_{I(w)} = C_w(G). ⊓⊔
Lemma 2. For any graph G with n nodes, total weight W, and density d = W/n, the d-core of G is nonempty. Furthermore, for any α ∈ [0, 1], the total weight of the (αd)-core of G is strictly greater than (1 − α)W. Proof. Let {H_1, . . . , H_n} be the induced subgraphs determined by FindLargeDenseSubgraph on the input graph G. Fix a value of w, and let I(w) be the largest index such that r_{I(w)} ≥ w. Recall that H_{I(w)} = C_w(G) by Lemma 1. Since each edge in G is removed once during the course of the algorithm,
W = Σ_{i=1}^{n} r_i = Σ_{i=1}^{I(w)} r_i + Σ_{i=I(w)+1}^{n} r_i < W(H_{I(w)}) + w · (n − I(w)) ≤ W(C_w(G)) + w · n.
Therefore, W(C_w(G)) > W − w · n. Taking w = d = W/n in the equation above, we learn that W(C_d(G)) > 0. Taking w = αd = αW/n, we learn that W(C_{αd}(G)) > (1 − α)W. ⊓⊔
Proof (Theorem 1). Let {H_1, . . . , H_n} be the induced subgraphs computed by FindLargeDenseSubgraph on the input graph G. It suffices to show that for any k, there is an integer I ∈ [k, n] satisfying d(H_I) ≥ Dal(G, k)/3. Let H∗ be an induced subgraph of G with at least k vertices and with density d∗ = W(H∗)/|H∗| = Dal(G, k). We apply Lemma 2 to H∗ with α = 2/3 to show that C_{(2d∗/3)}(H∗) has total weight at least W(H∗)/3. This implies that C_{(2d∗/3)}(G) has total weight at least W(H∗)/3. The core C_{(2d∗/3)}(G) has minimum degree at least 2d∗/3, so its density is at least d∗/3. Lemma 1 shows C_{(2d∗/3)}(G) = H_I, for I = |C_{(2d∗/3)}(G)|. If I ≥ k, then H_I satisfies the requirements of the theorem. If I < k, then C_{(2d∗/3)}(G) = H_I is contained in H_k, and the following calculation shows that H_k satisfies the requirements of the theorem.
d(H_k) = W(H_k)/k ≥ W(C_{(2d∗/3)}(G))/k ≥ (W(H∗)/3)/k ≥ d∗/3.
⊓ ⊔
We remark that our analysis of FindLargeDenseSubgraph is a generalization of the result of Kortsarz-Peleg [18]. Their result shows that FindLargeDenseSubgraph(G, 1) is a (1/2)-approximation algorithm for ds. This follows from the fact that if w = Dmax (G), then the w-core of G is nonempty, which is a special case of Lemma 2.
4
Finding Dense Subgraphs with at Most k Vertices
The densest at-most-k-subgraph problem is NP-complete by a reduction from the max-clique problem, since a subgraph of size at most k has density at least (k − 1)/2 if and only if it is a k-clique. Feige and Seltser [12] proved that the densest k-subgraph problem is NP-complete even when restricted to graphs with maximum degree 3, and their proof implies that the densest at-most-k-subgraph problem is NP-complete when restricted to the same class of graphs. In this section, we show that damks is nearly as hard to approximate as dks. We show that if there exists a polynomial time pseudo-approximation algorithm for damks, which outputs a set of at most βk vertices with density at least γ times the density of the densest subgraph with at most k vertices, then there exists a polynomial time approximation algorithm for dks with ratio γ min(γ, β^{−1})/8. As an immediate consequence, a polynomial time γ-approximation algorithm for damks would imply a polynomial time (γ²/8)-approximation algorithm for dks. Definition 5. An algorithm A(G, k) is a (β, γ)-algorithm for the densest at-most-k-subgraph problem if for any input graph G and integer k, it returns an induced subgraph of G with at most βk vertices and density at least γ Dam(G, k). Theorem 2. If there is a polynomial time (β, γ)-algorithm for the densest at-most-k-subgraph problem (where β ≥ 1 and γ ≤ 1), then there is a polynomial time (γ min(γ, β^{−1})/8)-approximation algorithm for the densest k-subgraph problem. Proof. Assume there exists a polynomial time algorithm A(G, k) that is a (β, γ)-algorithm for damks. We will now describe a polynomial time approximation algorithm for dks. Given as input a graph G and integer k, let G_1 = G, let i = 1, and repeat the following procedure. Let H_i = A(G_i, k) be an induced subgraph of G_i with at most βk vertices and with density at least γ Dam(G_i, k). Remove all the edges in H_i from G_i to form a new graph G_{i+1} on the same vertex set as G. Repeat this procedure until all edges have been removed from G. Let n_i be the number of vertices in H_i, let W_i = W(H_i), and let d_i = d(H_i) = W_i/n_i. Let H∗ be an induced subgraph of G with exactly k vertices and density d∗ = Deq(G, k). Notice that if (W_1 + · · · + W_{t−1}) ≤ W(H∗)/2, then d_t ≥ γ d∗/2. This is true because d_t is at least γ times the density of the induced subgraph of G_t on the vertex set of H∗, which is at least
(W(H∗) − (W_1 + · · · + W_{t−1}))/k ≥ W(H∗)/(2k) = d∗/2.
Let T be the smallest integer such that (W_1 + · · · + W_T) ≥ W(H∗)/2, and let U_T be the induced subgraph on the union of the vertex sets of H_1, . . . , H_T. The total weight W(U_T) is at least W(H∗)/2. The density of U_T is
d(U_T) = W(U_T)/|U_T| ≥ (W_1 + · · · + W_T)/(n_1 + · · · + n_T) ≥ min_{1≤t≤T} W_t/n_t ≥ γ d∗/2.
To bound the number of vertices in U_T, notice that (n_1 + · · · + n_{T−1}) ≤ γ^{−1} k, because
(d∗/2) k = W(H∗)/2 ≥ Σ_{i=1}^{T−1} W_i = Σ_{i=1}^{T−1} n_i d_i ≥ γ (d∗/2) Σ_{i=1}^{T−1} n_i.
Since n_T is at most βk, we have |U_T| ≤ (n_1 + · · · + n_T) ≤ (γ^{−1} + β)k. There are now two cases to consider. If |U_T| ≤ k, then we pad U_T with arbitrary vertices to form a set U_T′ of size exactly k. The set U_T′ is still sufficiently dense: d(U_T′) ≥ (W(H∗)/2)/k = d∗/2. If |U_T| > k, then we employ a simple greedy procedure to reduce the number of vertices. We begin with the induced subgraph U_T, greedily remove the vertex with smallest degree to obtain a smaller subgraph, and repeat until exactly k vertices remain. The resulting subgraph U_T″ has density at least d(U_T)(k/2|U_T|) by the method of conditional expectations (this technique was also used in [11]). The set U_T″ is sufficiently dense:
d(U_T″) ≥ d(U_T) · k/(2|U_T|) ≥ γ (d∗/2) · k/(2(γ^{−1} + β)k) = d∗ γ/(4(γ^{−1} + β)) ≥ d∗ γ/(8 max(γ^{−1}, β)) = d∗ γ min(γ, β^{−1})/8.
⊓ ⊔
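The reduction in the proof of Theorem 2 is constructive, and the following Python sketch spells it out. Here damks_oracle is a hypothetical (β, γ)-algorithm for damks supplied by the caller; since the optimal set H∗ is unknown, the sketch simply builds a candidate after every oracle call (padding or greedily trimming to exactly k vertices) and returns the densest one, which includes the candidate U_T analyzed in the proof. It assumes an unweighted graph with at least one edge and 2 ≤ k ≤ n; all names are ours.

```python
def density(vertices, adj):
    vs = set(vertices)
    edges = sum(1 for v in vs for u in adj[v] if u in vs) // 2
    return edges / max(len(vs), 1)

def greedy_trim_to_k(vertices, adj, k):
    # Repeatedly drop a minimum-degree vertex of the induced subgraph until
    # exactly k vertices remain (the conditional-expectation step of the proof).
    vs = set(vertices)
    while len(vs) > k:
        v = min(vs, key=lambda x: sum(1 for u in adj[x] if u in vs))
        vs.remove(v)
    return vs

def dks_from_damks(adj, k, damks_oracle):
    # damks_oracle(adj, k) is assumed to return a vertex set of at most beta*k
    # vertices whose density is at least gamma * Dam(adj, k).
    remaining = {v: set(adj[v]) for v in adj}        # G_i; edges get removed
    pieces, candidates = [], []
    while any(remaining[v] for v in remaining):
        H = set(damks_oracle(remaining, k))
        pieces.append(H)
        for v in H:                                   # delete the edges of H_i from G_i
            remaining[v] -= H
        union = set().union(*pieces)
        if len(union) <= k:
            pad = [v for v in adj if v not in union][: k - len(union)]
            candidates.append(union | set(pad))
        else:
            candidates.append(greedy_trim_to_k(union, adj, k))
    return max(candidates, key=lambda s: density(s, adj))
```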
5
Finding Dense Subgraphs of Specified Size
The previous section shows that the densest at-least-k subgraph problem is easy to approximate within a constant factor for any graph and any value of k. The densest k-subgraph problem seems hard to approximate well in the worst case, but we may still be able to find near-optimal solutions for specific instances. In this section we describe a method for identifying a range of k-values for which we can obtain a good approximation of the densest k-subgraph. Here is an outline of our approach. We first define a graph parameter k∗ (G) ∈ [1, n]. We then prove that for all k ≥ k∗ , the algorithm FindLargeDenseSubgraph can be used to find a (1/3)-approximation of the densest subgraph with exactly k vertices. In Section 6, we observe empirically that for several example web graphs, the value of k∗ is only a small fraction of n. Definition 6. For a given graph G, let w∗ be the smallest value such that the average degree of the core C(w∗ ) is less than 2w∗ . Let k∗ (G) = |C(w∗ )| be the number of vertices in that core. Roughly speaking, k∗ describes how small a core of the graph must be before it can be nearly degree-regular. The following theorem shows that for every k ≥ k∗ , the set Hk produced by FindLargeDenseSubgraph has density at least 1/3 of the densest k-subgraph.
Theorem 3. Let {v1 , . . . , vn } be the ordering of the vertices produced by FindLargeDenseSubgraph, and let Hk = {v1 , . . . , vk }. Then, for any k ≥ k∗ we have d(Hk ) ≥ (1/3)Deq (G, k). Proof. We will first show the following: d(Hk+1 ) ≤ d(Hk ) for all k ≥ k∗ .
(1)
Once we show this, then for any k ≥ k∗ we have
d(H_k) = max_{j≥k} d(H_j) ≥ (1/3) Dal(k) ≥ (1/3) Deq(k).
The middle step follows from the approximation guarantee proved in Theorem 1. To prove (1), it suffices to take an arbitrary value of w for which |C_w| > k∗, and show that d(H_{j−1}) ≥ d(H_j) for all j in the interval (|C_{w+1}|, |C_w|]. We prove this by induction, first assuming d(H_j) ≥ d(C_w) and then proving d(H_{j−1}) ≥ d(C_w). Recall that r(j) is the degree of v_j in H_j. Then,
d(H_{j−1}) = (j · d(H_j) − 2r(j))/(j − 1) ≥ (j · d(H_j) − d(H_j))/(j − 1) = d(H_j).
Here we used the fact that 2r(j) ≤ d(H_j). This is true because r(j) ≤ w, our assumption that |C_w| ≥ k∗ implies w ≤ d(C_w)/2, and our induction assumption implies d(C_w)/2 ≤ d(H_j)/2. ⊓⊔
When k < k∗ the previous theorem doesn't apply, but we can still bound the ratio between d(H_k) and the optimal density Deq(k). The following bound holds for any k ∈ [1, n], and can be computed easily by observing the densities of the sets H_1, . . . , H_n.
Lemma 3. Let R_k = d(H_k)/max_{j≥k} d(H_j). For any value of k, we have d(H_k) ≥ (R_k/3) Deq(G, k).
Proof. For any value of k, we have
d(H_k) = R_k · max_{j≥k} d(H_j) ≥ (R_k/3) Dal(k) ≥ (R_k/3) Deq(k).
The middle step follows from the approximation guarantee proved in Theorem 1. ⊓ ⊔ We remark that to prove Theorem 3, we showed that Rk = 1 for all k ≥ k∗ . In the next section we will compute the values of k∗ and Rk for several example graphs.
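Both k∗ and the ratios R_k can be read directly off the output of the core decomposition. The following Python sketch (naive, for illustration only) takes the prefix densities d(H_1), . . . , d(H_n) and the removal degrees r_1, . . . , r_n and computes R_k by a suffix scan and k∗ from Definition 6; it assumes an unweighted graph with at least one edge, and the names are ours.

```python
def k_star_and_ratios(densities, min_degrees):
    # densities[i-1] = d(H_i); min_degrees[i-1] = r_i, the degree of v_i at removal.
    n = len(densities)
    # R_k = d(H_k) / max_{j >= k} d(H_j), computed by scanning suffixes.
    R = [0.0] * (n + 1)
    suffix_max = 0.0
    for k in range(n, 0, -1):
        suffix_max = max(suffix_max, densities[k - 1])
        R[k] = densities[k - 1] / suffix_max
    # The w-core is H_{I(w)} with I(w) the largest i such that r_i >= w; w* is the
    # smallest w whose core has average degree below 2w, i.e. d(C_w) < w.
    max_w = int(max(min_degrees))
    k_star = n
    for w in range(1, max_w + 1):
        I_w = max((i for i in range(1, n + 1) if min_degrees[i - 1] >= w), default=0)
        if I_w == 0:
            break
        if densities[I_w - 1] < w:
            k_star = I_w
            break
    return k_star, R
```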
6
Experiments
In this section we present experimental results on four example graphs. The graphs and their sizes are listed in Table 1. Three of these graphs are publicly
Table 1. Graph size, running time, and the observed value of k∗

graph          num nodes (n)   total degree (2m)   running time (sec)   k∗
domain-2006    55,554,153      1,067,392,106       263.81               9,445
webbase-2001   118,142,156     1,985,689,782       204.573              48,190
uk-2005        39,459,926      1,842,690,156       92.271               368,741
cnr-2000       325,558         6,257,420           0.359                13,237
Table 2. Attributes of the densest core, highest core, and w∗ core in the four test graphs

graph          core            core number w   nodes in core |C_w|   density d(C_w)   worst R_k (over k ≥ |C_w|)
domain-2006    w∗ core         1099            9445                  2196.32          1
               densest core    1203            4737                  2275.96          .9694
               highest core    1298            2502                  2072.42          .9104
webbase-2001   w∗ core         548             48190                 1089.42          1
               densest core    2281            1219                  2436             .8547
               highest core    2281            1219                  2436             .8547
uk-2005        w∗ core         258             368741                515.851          1
               densest core    1002            587                   1171.98          .8871
               highest core    1002            587                   1171.98          .8871
cnr-2000       w∗ core         38              13237                 75.1145          1
               densest core    116             82                    161.976          .9138
               highest core    116             82                    161.976          .9138
available webgraphs from the Laboratory for Web Algorithmics (http://law.dsi.unimi.it/) at the Università degli Studi di Milano. The graph webbase-2001 was obtained from the 2001 crawl performed by the WebBase crawler. The graph uk-2005 was obtained from a 2005 crawl of the .uk domain, performed by UbiCrawler. The graph cnr-2000 was obtained from a small crawl of the Italian CNR domain. These graphs were chosen because they are fairly large, easy to obtain, and have been used in previous research papers. The remaining graph domain-2006 is a snapshot of the domain graph in September 2006, from Microsoft. These were originally directed graphs, but we have treated them as undirected graphs in the following way. We consider a directed link from a vertex u to a vertex v as an undirected link between u and v. We remark that there will be a link with multiplicity 2 between u and v in this undirected graph if both (u, v) and (v, u) appeared in the original directed graph. For this reason, the average degree of a subgraph on k vertices may be as large as 2(k − 1). In addition, we
[Figure 2: four log-log panels (webbase-2001, cnr-2000, domain-2006, uk-2005), each plotting two curves, the core number and the average degree × (1/2) of each core, against the number of vertices in the core.]
Fig. 2. Plots of core size versus core number and core density in four webgraphs
removed all self-loops from the graphs. The total degrees reported in Table 1 were computed after these modifications were made. In Table 1, we report running times for our implementation of FindLargeDenseSubgraph. We implemented the algorithm in C++, and ran our experiments on a single machine with 64GB of RAM and a 3.0Ghz quad-core Intel Xeon processor. Only one of the processor cores was used by the algorithm. The time we report is the time required to compute the core decomposition, which produces an ordering of all vertices in the graph. The running time does not include the time required to load the graph from disk into memory. We also report in Table 1 the values of k∗ for each of these graphs. We observe that k∗ is small compared to the number of vertices in the graph, which is good because our algorithm produces a good approximation of the densest k-subgraph for all k larger than k∗. In Table 2 we report statistics for three special cores in each of the example graphs. We report the w∗-core (see Definition 6), which is
the core that determines the value of k∗ . We report the core that has the highest density (densest core), and we also report the highest value of w for which the w-core is nonempty (highest core). Note that these last two are not the same in general, but they end up being the same for three of our example graphs. For each of these special cores we report the w-value of the core, the number of vertices in the core, and the density of the core. For each of the cores in Table 2 we also report a statistic regarding the quantity Rk described in Lemma 3. For each core Cw we report “worst Rk ”, which we define to be the smallest value of Rk over all values of k ≥ |Cw |. The table indicates that “worst Rk ” is close to 1 for the highest core, which means we can approximate dks well for all values of k above the size of the highest core. For example, in the graph domain-2006, the highest core contains 2502 nodes and has a value of .9104 for worst Rk . That means for all values of k ≥ 2502, the set Hk produced by FindLargeDenseSubgraph on domain-2006 is within a factor of .9104 ∗ 1/3 of the densest subgraph on exactly k vertices, by Lemma 3. Figure 2 contains a plot for each of the four webgraphs that shows the size, core number, and density of all of the graph’s cores. Each of the plots in the figure has two curves. Each point on the curve represents a w-core. One curve shows the core number, the other shows the density of the core, and both are plotted against the number of vertices in the core. The value of k∗ can be seen from these plots; it is the x-coordinate of the first point (from right to left) at which these two curves intersect.
References
1. Abello, J., Resende, M.G.C., Sudarsky, R.: Massive quasi-clique detection. In: Rajsbaum, S. (ed.) LATIN 2002. LNCS, vol. 2286, pp. 598–612. Springer, Heidelberg (2002)
2. Alvarez-Hamelin, J.I., Dall'Asta, L., Barrat, A., Vespignani, A.: Large scale networks fingerprinting and visualization using the k-core decomposition. Advances in Neural Information Processing Systems 18, 41–50 (2006)
3. Andersen, R.: A local algorithm for finding dense subgraphs. In: Proc. 19th ACM-SIAM Symposium on Discrete Algorithms (SODA 2008), pp. 1003–1009 (2008)
4. Arora, S., Karger, D., Karpinski, M.: Polynomial time approximation schemes for dense instances of NP-hard problems. In: Proc. 27th ACM Symposium on Theory of Computing (STOC 1995), pp. 284–293 (1995)
5. Asahiro, Y., Hassin, R., Iwama, K.: Complexity of finding dense subgraphs. Discrete Appl. Math. 121(1-3), 15–26 (2002)
6. Asahiro, Y., Iwama, K., Tamaki, H., Tokuyama, T.: Greedily finding a dense subgraph. J. Algorithms 34(2), 203–221 (2000)
7. Charikar, M.: Greedy approximation algorithms for finding dense components in a graph. In: Jansen, K., Khuller, S. (eds.) APPROX 2000. LNCS, vol. 1913, pp. 84–95. Springer, Heidelberg (2000)
8. Buehrer, G., Chellapilla, K.: A scalable pattern mining approach to web graph compression with communities. In: WSDM 2008: Proceedings of the International Conference on Web Search and Web Data Mining, pp. 95–106 (2008)
9. Dourisboure, Y., Geraci, F., Pellegrini, M.: Extraction and classification of dense communities in the web. In: WWW 2007: Proceedings of the 16th International Conference on World Wide Web, pp. 461–470 (2007)
10. Feige, U., Langberg, M.: Approximation algorithms for maximization problems arising in graph partitioning. J. Algorithms 41(2), 174–211 (2001)
11. Feige, U., Peleg, D., Kortsarz, G.: The dense k-subgraph problem. Algorithmica 29(3), 410–421 (2001)
12. Feige, U., Seltser, M.: On the densest k-subgraph problem. Technical report, Department of Applied Mathematics and Computer Science, The Weizmann Institute, Rehovot (1997)
13. Gallo, G., Grigoriadis, M., Tarjan, R.: A fast parametric maximum flow algorithm and applications. SIAM J. Comput. 18(1), 30–55 (1989)
14. Gibson, D., Kumar, R., Tomkins, A.: Discovering large dense subgraphs in massive graphs. In: Proc. 31st VLDB Conference (2005)
15. Goldberg, A.: Finding a maximum density subgraph. Technical Report UCB/CSD 84/171, Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA (1984)
16. Kannan, R., Vinay, V.: Analyzing the structure of large graphs (manuscript) (1999)
17. Khot, S.: Ruling out PTAS for graph min-bisection, dense k-subgraph, and bipartite clique. SIAM Journal on Computing 36(4), 1025–1071 (2006)
18. Kortsarz, G., Peleg, D.: Generating sparse 2-spanners. J. Algorithms 17(2), 222–236 (1994)
19. Seidman, S.B.: Network structure and minimum degree. Social Networks 5, 269–287 (1983)
20. Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: Trawling the Web for emerging cyber-communities. In: Proc. 8th WWW Conference (WWW 1999) (1999)
21. Wuchty, S., Almaas, E.: Peeling the yeast protein network. Proteomics 5, 444 (2005)
The Giant Component in a Random Subgraph of a Given Graph
Fan Chung¹,⋆, Paul Horn¹, and Linyuan Lu²,⋆⋆
¹ University of California, San Diego
² University of South Carolina
Abstract. We consider a random subgraph Gp of a host graph G formed by retaining each edge of G with probability p. We address the question of determining the critical value p (as a function of G) for which a giant component emerges. Suppose G satisfies some (mild) conditions depending on its spectral gap and higher moments of its degree sequence. We define the second order average degree d̃ to be d̃ = Σ_v d_v²/(Σ_v d_v), where d_v denotes the degree of v. We prove that for any ε > 0, if p > (1 + ε)/d̃ then asymptotically almost surely the percolated subgraph Gp has a giant component. In the other direction, if p < (1 − ε)/d̃ then almost surely the percolated subgraph Gp contains no giant component.
1
Introduction
Almost all information networks that we observe are subgraphs of some host graphs that often have sizes prohibitively large or with incomplete information. A natural question is to deduce the properties that a random subgraph of a given graph must have. We are interested in random subgraphs Gp of a graph G, obtained as follows: for each edge of G we independently decide to retain the edge with probability p, and discard the edge with probability 1 − p. A natural special case of this process is the Erdős–Rényi graph model G(n, p), which is the special case where the host graph is K_n. Other examples are the percolation problems that have long been studied [13,14] in theoretical physics, mainly with the host graph being the lattice graph Z^k. In this paper, we consider a general host graph, an example of which being a contact graph, consisting of edges formed by pairs of people with possible contact, which is of special interest in the study of the spread of infectious diseases or the identification of community in various social networks. A fundamental question is to ask for the critical value of p such that Gp has a giant connected component, that is, a component whose volume is a positive fraction of the total volume of the graph. For the spread of disease on contact networks, the answer to this question corresponds to the problem of finding the epidemic threshold for the disease under consideration, for instance.
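The percolation process itself is easy to simulate, which is often a useful sanity check on threshold predictions. A small Python sketch, with names of our own choosing, generates Gp and reports the largest component's volume (measured, as in this paper, by degrees in the host graph G) as a fraction of vol(G); vertex labels are assumed comparable so that each undirected edge is visited once.

```python
import random

def percolate(adj, p, rng=random):
    # Keep each edge of G independently with probability p.
    gp = {v: set() for v in adj}
    for v in adj:
        for u in adj[v]:
            if v < u and rng.random() < p:   # visit each undirected edge once
                gp[v].add(u)
                gp[u].add(v)
    return gp

def largest_component_volume_fraction(adj, gp):
    # Volume of a vertex set = sum of its degrees in the host graph G.
    total = sum(len(adj[v]) for v in adj)
    seen, best = set(), 0
    for s in gp:
        if s in seen:
            continue
        stack, comp_vol = [s], 0
        seen.add(s)
        while stack:
            v = stack.pop()
            comp_vol += len(adj[v])
            for u in gp[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        best = max(best, comp_vol)
    return best / total
```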
⋆ This author was supported in part by NSF grant ITR 0426858 and ONR MURI 2008-2013.
⋆⋆ This author was supported in part by NSF grant DMS 0701111.
For the case of K_n, Erdős and Rényi answered this in their seminal paper [11]: if p = c/n for c < 1, then almost surely G contains no giant connected component and all components are of size at most O(log n), and if c > 1 then, indeed, there is a giant component of size εn. For general host graphs, the answer has been more elusive. Results have been obtained either for very dense graphs or bounded degree graphs. Bollobás, Borgs, Chayes and Riordan [4] showed that for dense graphs (where the degrees are of order Θ(n)), the giant component threshold is 1/ρ where ρ is the largest eigenvalue of the adjacency matrix. Frieze, Krivelevich and Martin [12] consider the case where the host graph is d-regular with adjacency eigenvalue λ and they show that the critical probability is close to 1/d, strengthening earlier results on hypercubes [2,3] and Cayley graphs [15]. For expander graphs with degrees bounded by d, Alon, Benjamini and Stacey [1] proved that the percolation threshold is greater than or equal to 1/(2d). There are several recent papers, mainly in studying percolation on special classes of graphs, which have gone further. Their results nail down the precise critical window during which component sizes grow from log(n) vertices to a positive proportion of the graph. In [5,6], Borgs et al. find the order of this critical window for transitive graphs, and cubes. Nachmias [16] looks at a similar situation to that of Frieze, Krivelevich and Martin [12] and uses random walk techniques to study percolation within the critical window for quasi-random transitive graphs. Percolation within the critical window on random regular graphs is also studied by Nachmias and Peres in [17]. Our results differ from these in that we study percolation on graphs with a much more general degree sequence. The greater preciseness of these results, however, is quite desirable. It is an interesting open question to describe the precise scaling window for percolation for the more general graphs studied here. Here, we are interested in percolation on graphs which are not necessarily regular, and can be relatively sparse (i.e., o(n²) edges). Compared with earlier results, the main advantage of our results is the ability to handle general degree sequences. To state our results, we give a few definitions here. For a subset S of vertices, the volume of S, denoted by vol(S), is the sum of degrees of vertices in S. The kth order volume of S is the kth moment of the degree sequence, i.e., vol_k(S) = Σ_{v∈S} d_v^k. We write vol_1(S) = vol(S) and vol_k(G) = vol_k(V(G)), where V(G) is the vertex set of G. We denote by d̃ = vol_2(G)/vol(G) the second order degree of G, and by σ the spectral gap of the normalized Laplacian, which we fully define in Section 2. Further, recall that f(n) is O(g(n)) if lim sup_{n→∞} |f(n)|/|g(n)| < ∞, and f(n) is o(g(n)) if lim_{n→∞} |f(n)|/|g(n)| = 0. We will prove the following
Theorem 1. Suppose G has maximum degree Δ satisfying Δ = o(d̃/σ). For p ≤ (1 − c)/d̃, a.a.s. every connected component in Gp has volume at most O(√(vol_2(G)) g(n)), where g(n) is any slowly growing function as n → ∞.
Here, an event occurring a.a.s. indicates that it occurs with probability tending to one as n tends to infinity. In order to prove the emergence of a giant component when p ≥ (1 + c)/d̃, we need to consider some additional conditions. Suppose there is a set U satisfying (i) vol_2(U) ≥ (1 − ε)vol_2(G) and (ii) vol_3(U) ≤ M d vol_2(G), where ε and M are constants independent of n. In this case, we say G is (ε, M)-admissible and U is an (ε, M)-admissible set. We note that admissibility measures the skewness of the degree sequence. For example, all regular graphs are (ε, 1)-admissible for any ε, but a graph need not be regular to be admissible. We also note that in the case that vol_3(G) ≤ M d vol_2(G), G is (ε, M)-admissible for any ε.

Theorem 2. Suppose p ≥ (1 + c)/d̃ for some c ≤ 1/20. Suppose G satisfies Δ = o(d̃/σ), Δ = √(dn)/o(log n), and σ = o(n^{−κ}) for some κ > 0, and G is (cκ/10, M)-admissible. Then a.a.s. there is a unique giant connected component in Gp with volume Θ(vol(G)), and no other component has volume more than max(2d log n, ω(σ vol(G))).
Here, recall that f(n) = Θ(g(n)) if f(n) = O(g(n)) and g(n) = O(f(n)); in this case, we say that f and g are of the same order. Also, f(n) = ω(g(n)) if g(n) = o(f(n)).
We note that under the assumption that the maximum degree Δ of G satisfies Δ = o(d̃/σ), it can be shown that the spectral norm of the adjacency matrix satisfies ‖A‖ = ρ = (1 + o(1))d̃. Under the assumptions in Theorem 2, we observe that the percolation threshold of G is 1/d̃. To examine when the conditions of Theorems 1 and 2 are satisfied, we note that admissibility implies that d̃ = Θ(d), which essentially says that while there can be some vertices with degree much higher than d, there cannot be too many. Chung, Lu and Vu [8] show that for random graphs with a given expected degree sequence, σ = O(1/√d); hence for graphs with average degree n^ε the spectral condition of Theorem 2 easily holds. The results here can be viewed as a generalization of the result of Frieze, Krivelevich and Martin [12] to general degree sequences, and also as a strengthening of the original results of Erdős and Rényi to general host graphs.
The paper is organized as follows. In Section 2 we introduce the notation and some basic facts. In Section 3, we examine several spectral lemmas which allow us to control the expansion. In Section 4, we prove Theorem 1, and in Section 5, we complete the proof of Theorem 2.
2 Preliminaries
Suppose G is a connected graph on vertex set V . Throughout the paper, Gp denotes a random subgraph of G obtained by retaining each edge of G independently with probability p.
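As a concrete illustration of this process, here is a minimal simulation sketch (not from the paper): it samples Gp from a host graph, computes the second-order average degree d̃ = vol_2(G)/vol(G), and reports the volume fraction of the largest component for p on either side of 1/d̃. The helper names, the networkx dependency, and the choice of host graph are our own assumptions.

```python
# Minimal sketch of edge percolation G_p on a host graph G (illustrative only).
import random
import networkx as nx

def percolate(G, p, rng=random.Random(0)):
    """Return the random subgraph G_p: each edge of G kept independently with probability p."""
    Gp = nx.Graph()
    Gp.add_nodes_from(G.nodes())
    Gp.add_edges_from(e for e in G.edges() if rng.random() < p)
    return Gp

def volume(G, S):
    """vol(S): sum of degrees (taken in the host graph G) over the vertices in S."""
    return sum(d for _, d in G.degree(S))

if __name__ == "__main__":
    G = nx.barabasi_albert_graph(2000, 3)        # an arbitrary, non-regular host graph
    vol2 = sum(d * d for _, d in G.degree())     # vol_2(G)
    d_tilde = vol2 / (2 * G.number_of_edges())   # second-order average degree vol_2(G)/vol(G)
    for p in (0.5 / d_tilde, 2.0 / d_tilde):     # below and above the predicted threshold 1/d~
        Gp = percolate(G, p)
        giant = max(nx.connected_components(Gp), key=lambda S: volume(G, S))
        frac = volume(G, giant) / (2 * G.number_of_edges())
        print(f"p*d~ = {p * d_tilde:.2f}: largest component volume fraction {frac:.3f}")
```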
Let A = (a_uv) denote the adjacency matrix of G, defined by a_uv = 1 if {u, v} is an edge and a_uv = 0 otherwise. We let d_v = Σ_u a_uv denote the degree of vertex v. Let Δ = max_v d_v denote the maximum degree of G and δ = min_v d_v denote the minimum degree. For each vertex set S and a positive integer k, we define the k-th volume of S to be

  vol_k(S) = Σ_{v∈S} d_v^k.

The volume vol(G) is simply the sum of all degrees, i.e. vol(G) = vol_1(G). We define the average degree d = (1/n) vol(G) = vol_1(G)/vol_0(G) and the second-order average degree d̃ = vol_2(G)/vol_1(G).

Let D = diag(d_{v_1}, d_{v_2}, . . . , d_{v_n}) denote the diagonal degree matrix. Let 1 denote the column vector with all entries 1 and let d = D1 be the column vector of degrees. The normalized Laplacian of G is defined as

  L = I − D^{−1/2} A D^{−1/2}.

The spectrum of the Laplacian is the set of eigenvalues of L sorted in increasing order, 0 = λ_0 ≤ λ_1 ≤ · · · ≤ λ_{n−1}. Many properties of the λ_i's can be found in [7]. For example, the least eigenvalue λ_0 is always equal to 0; we have λ_1 > 0 if G is connected, and λ_{n−1} ≤ 2 with equality holding only if G has a bipartite component. Let σ = max{1 − λ_1, λ_{n−1} − 1}. Then σ < 1 if G is connected and non-bipartite. For random graphs with a given expected degree sequence [8], σ = O(1/√d), and in general for regular graphs it is easy to write σ in terms of the second largest eigenvalue of the adjacency matrix. Furthermore, σ is closely related to the mixing rate of random walks on G, see e.g. [7]. The following lemma measures the difference between the adjacency eigenvalue and d̃ in terms of σ.

Lemma 1. The largest eigenvalue of the adjacency matrix of G, ρ, satisfies |ρ − d̃| ≤ σΔ.

Proof: Recall that φ = (1/√vol(G)) D^{1/2} 1 is the unit eigenvector of L corresponding to eigenvalue 0. We have

  ‖I − L − φφ^*‖ ≤ σ.

Then,

  |ρ − d̃| = | ‖A‖ − ‖dd^*/vol(G)‖ | ≤ ‖A − dd^*/vol(G)‖ = ‖D^{1/2}(I − L − φφ^*)D^{1/2}‖ ≤ ‖D^{1/2}‖ · ‖I − L − φφ^*‖ · ‖D^{1/2}‖ ≤ σΔ.
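The following small numerical check (our own, not part of the paper) computes ρ, d̃, σ, and Δ for a random host graph and verifies the inequality of Lemma 1 on that instance; the choice of graph and library calls are assumptions.

```python
# Numerical sanity check of Lemma 1: |rho - d~| <= sigma * Delta.
import numpy as np
import networkx as nx

G = nx.erdos_renyi_graph(300, 0.05, seed=1)
A = nx.to_numpy_array(G)
deg = A.sum(axis=1)
assert deg.min() > 0, "this check assumes no isolated vertices"

D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(len(deg)) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
lam = np.sort(np.linalg.eigvalsh(L))                 # 0 = lam[0] <= ... <= lam[-1]

sigma = max(1 - lam[1], lam[-1] - 1)
rho = np.linalg.eigvalsh(A).max()                    # largest adjacency eigenvalue
d_tilde = (deg ** 2).sum() / deg.sum()               # vol_2(G)/vol(G)
Delta = deg.max()

print(f"|rho - d~| = {abs(rho - d_tilde):.4f}  <=  sigma*Delta = {sigma * Delta:.4f}")
```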
For any subset S of the vertices, we let S̄ denote the complement of S. The vertex boundary of S in G, denoted by Γ_G(S), is defined as follows:

  Γ_G(S) = {u ∉ S | ∃v ∈ S such that {u, v} ∈ E(G)}.

When S consists of one vertex v, we simply write Γ_G(v) for Γ_G({v}). We also write Γ(S) = Γ_G(S) if there is no confusion. Similarly, we define Γ_{Gp}(S) to be the set of neighbors of S in our percolated subgraph Gp.
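For readers experimenting with these definitions, the following illustrative helpers (assumed names, not from the paper) compute the vertex boundary and the k-th volumes.

```python
# Illustrative helpers for Gamma_G(S) and vol_k(S); names are our own.
import networkx as nx

def vertex_boundary(G, S):
    """Gamma_G(S): vertices outside S with at least one neighbor in S."""
    S = set(S)
    return {u for v in S for u in G.neighbors(v)} - S

def vol_k(G, S, k=1):
    """vol_k(S) = sum of d_v^k over v in S, degrees taken in the host graph G."""
    return sum(d ** k for _, d in G.degree(S))

# Example on the path 0-1-2-3-4:
G = nx.path_graph(5)
S = {1, 2}
print(vertex_boundary(G, S))   # {0, 3}
print(vol_k(G, S, 2))          # 2^2 + 2^2 = 8
```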
3 Several Spectral Lemmas
We begin by proving two lemmas, the first relating expansion in G to the spectrum of G, the second giving a probabilistic bound on the expansion in Gp.

Lemma 2. For two disjoint sets S and T, we have

  | Σ_{v∈T} d_v |Γ(v) ∩ S| − vol(S)vol_2(T)/vol(G) | ≤ σ √(vol(S) vol_3(T)),

  | Σ_{v∈T} d_v |Γ(v) ∩ S|² − vol(S)² vol_3(T)/vol(G)² | ≤ σ² vol(S) max_{v∈T}{d_v²} + 2σ √(vol(S)³ vol_5(T)) / vol(G).

Proof: Let 1_S (resp. 1_T) be the indicator column vector of the set S (resp. T). Note that

  Σ_{v∈T} d_v |Γ(v) ∩ S| = 1_S^* A D 1_T,   vol(S) = 1_S^* d,   vol_2(T) = d^* D 1_T.

Here 1_S^* denotes the transpose of 1_S as a row vector. We have

  | Σ_{v∈T} d_v |Γ(v) ∩ S| − vol(S)vol_2(T)/vol(G) | = | 1_S^* A D 1_T − (1/vol(G)) 1_S^* d d^* D 1_T |
    = | 1_S^* D^{1/2} ( D^{−1/2} A D^{−1/2} − (1/vol(G)) D^{1/2} 1 1^* D^{1/2} ) D^{3/2} 1_T |.

Let φ = (1/√vol(G)) D^{1/2} 1 denote the eigenvector of I − L for the eigenvalue 1. The matrix I − L − φφ^*, which is the projection of I − L onto the hyperplane φ^⊥, has L_2-norm at most σ. We have

  | Σ_{v∈T} d_v |Γ(v) ∩ S| − vol(S)vol_2(T)/vol(G) | = | 1_S^* D^{1/2} (I − L − φφ^*) D^{3/2} 1_T |
    ≤ σ ‖D^{1/2} 1_S‖ · ‖D^{3/2} 1_T‖ ≤ σ √(vol(S) vol_3(T)).

Let e_v be the column vector with v-th coordinate 1 and 0 elsewhere. Then |Γ_G(v) ∩ S| = 1_S^* A e_v. We have

  Σ_{v∈T} d_v |Γ_G(v) ∩ S|² = Σ_{v∈T} d_v 1_S^* A e_v e_v^* A 1_S = 1_S^* A D_T A 1_S.

Here D_T = Σ_{v∈T} d_v e_v e_v^* is the diagonal matrix with the degree entry at each vertex of T and 0 elsewhere. We have

  | Σ_{v∈T} d_v |Γ_G(v) ∩ S|² − vol(S)² vol_3(T)/vol(G)² |
    = | 1_S^* A D_T A 1_S − (1/vol(G)²) 1_S^* d d^* D_T d d^* 1_S |
    ≤ | 1_S^* A D_T A 1_S − (1/vol(G)) 1_S^* d d^* D_T A 1_S | + | (1/vol(G)) 1_S^* d d^* D_T A 1_S − (1/vol(G)²) 1_S^* d d^* D_T d d^* 1_S |
    = | 1_S^* D^{1/2} (I − L − φφ^*) D^{1/2} D_T A 1_S | + | (1/vol(G)) 1_S^* d d^* D_T D^{1/2} (I − L − φφ^*) D^{1/2} 1_S |
    ≤ | 1_S^* D^{1/2} (I − L − φφ^*) D^{1/2} D_T D^{1/2} (I − L − φφ^*) D^{1/2} 1_S | + 2 | (1/vol(G)) 1_S^* d d^* D_T D^{1/2} (I − L − φφ^*) D^{1/2} 1_S |
    ≤ σ² vol(S) max_{v∈T}{d_v²} + 2σ √(vol(S)³ vol_5(T)) / vol(G).

Lemma 3. Suppose that two disjoint sets S and T satisfy

  vol_2(T) ≥ (5p/(2δ)) σ² max_{v∈T}{d_v²} vol(G),   (1)
  25σ² vol_3(T) vol(G)² / (δ² vol_2(T)²) ≤ vol(S) ≤ 2δ vol_2(T) vol(G) / (5p vol_3(T)),   (2)
  vol(S) ≤ δ² vol_2(T)² / (25p² σ² vol_5(T)).   (3)

Then we have that

  vol(Γ_{Gp}(S) ∩ T) > (1 − δ) p (vol_2(T)/vol(G)) vol(S)

with probability at least 1 − exp( −δ(1 − δ) p vol_2(T) vol(S) / (10Δ vol(G)) ).

Proof: For any v ∈ T, let X_v be the indicator random variable for the event v ∈ Γ_{Gp}(S). We have

  P(X_v = 1) = 1 − (1 − p)^{|Γ_G(v) ∩ S|}.

Let X = vol(Γ_{Gp}(S) ∩ T). Then X is a weighted sum of independent random variables,

  X = Σ_{v∈T} d_v X_v.

Note that

  E(X) = Σ_{v∈T} d_v E(X_v) = Σ_{v∈T} d_v (1 − (1 − p)^{|Γ_G(v) ∩ S|})
       ≥ Σ_{v∈T} d_v ( p|Γ_G(v) ∩ S| − (p²/2)|Γ_G(v) ∩ S|² )
       ≥ p ( vol(S)vol_2(T)/vol(G) − σ √(vol(S)vol_3(T)) )
         − (p²/2) ( vol(S)² vol_3(T)/vol(G)² + σ² vol(S) max_{v∈T}{d_v²} + 2σ √(vol(S)³ vol_5(T)) / vol(G) )
       > (1 − (4/5)δ) p (vol_2(T)/vol(G)) vol(S),

by using Lemma 2 and the assumptions on S and T. We apply the following Chernoff inequality, see e.g. [10]:

  P(X ≤ E(X) − a) ≤ exp( −a² / (2 Σ_{v∈T} d_v² E(X_v)) ) ≤ exp( −a² / (2Δ E(X)) ).

We set a = αE(X), with α chosen so that (1 − α)(1 − (4/5)δ) = (1 − δ). Then

  P( X ≤ (1 − δ) p (vol_2(T)/vol(G)) vol(S) ) < P( X ≤ (1 − α)E(X) ) < exp( −α² E(X)/(2Δ) ) ≤ exp( −α(1 − δ) p vol_2(T) vol(S) / (2Δ vol(G)) ).

To complete the proof, note that α > δ/5.
4 The Range of p with No Giant Component
In this section, we prove Theorem 1.

Proof of Theorem 1: It suffices to prove the following claim. (Note that since Δ = o(d̃/σ), Lemma 1 gives ρ = (1 + o(1))d̃, so p ≤ (1 − c)/d̃ implies pρ < 1 for n sufficiently large.)

Claim A: If pρ < 1, where ρ is the largest eigenvalue of the adjacency matrix, then with probability at least 1 − 1/(C²(1 − pρ)), all components have volume at most C √(vol_2(G)).

Proof of Claim A: Let x be the probability that there is a component of Gp having volume greater than C √(vol_2(G)). Now we choose two random vertices, each with probability of being chosen proportional to its degree in G. Under the condition that there is a component with volume greater than C √(vol_2(G)), the probability that a chosen vertex lies in this component is at least C √(vol_2(G)) / vol(G). Therefore, the probability that the random pair of vertices lie in the same component is at least

  x ( C √(vol_2(G)) / vol(G) )² = C² x d̃ / vol(G).   (4)

On the other hand, for any fixed pair of vertices u and v and any fixed path P of length k in G, the probability that u and v are connected by this path in Gp is exactly p^k. The number of k-paths from u to v is at most 1_u^* A^k 1_v. Since the probabilities of u and v being selected are d_u/vol(G) and d_v/vol(G) respectively, the probability that the random pair of vertices lie in the same connected component is at most

  Σ_{u,v} Σ_{k=0}^{n} (d_u/vol(G)) p^k 1_u^* A^k 1_v (d_v/vol(G)) = Σ_{k=0}^{n} p^k d^* A^k d / vol(G)².

We have

  Σ_{k=0}^{n} p^k d^* A^k d / vol(G)² ≤ Σ_{k=0}^{∞} p^k ρ^k vol_2(G) / vol(G)² ≤ d̃ / ((1 − pρ) vol(G)).

Combining with (4), we have C² x d̃ / vol(G) ≤ d̃ / ((1 − pρ) vol(G)), which implies x ≤ 1/(C²(1 − pρ)). Claim A is proved, and the theorem follows by taking C to be g(n).
5 The Emergence of the Giant Component
Lemma 4. Suppose G contains an (ε, M)-admissible set U. Then we have:
1. d̃ ≤ M d / (1 − ε)².
2. For any U′ ⊂ U with vol_2(U′) > η vol_2(U), we have vol(U′) ≥ (η²(1 − ε)d̃ / (M d)) vol(G).
Proof: Since G is (ε, M)-admissible, we have a set U satisfying (i) vol_2(U) ≥ (1 − ε)vol_2(G) and (ii) vol_3(U) ≤ M d vol_2(G). We have

  d̃ = vol_2(G)/vol(G) ≤ (1/(1 − ε)) vol_2(U)/vol(U) ≤ (1/(1 − ε)) vol_3(U)/vol_2(U) ≤ (M/(1 − ε)²) d.

For any U′ ⊂ U with vol_2(U′) > η vol_2(U), we have

  vol(U′) ≥ vol_2(U′)²/vol_3(U′) ≥ η² vol_2(U)²/vol_3(U) ≥ η²(1 − ε) vol_2(G)/(M d) = (η²(1 − ε)d̃/(M d)) vol(G).

Proof of Theorem 2: It suffices to assume p = (1 + c)/d̃ for some c < 1/20. Let ε = cκ/10, a small constant, and let U be an (ε, M)-admissible set in G. Define U′ to be the subset of U containing all vertices with degree at least √ε d. We have

  vol_2(U′) ≥ vol_2(U) − Σ_{d_v < √ε d} d_v² ≥ (1 − ε)vol_2(G) − εnd² ≥ (1 − 2ε)vol_2(G).

Hence, U′ is a (2ε, M)-admissible set. We will concentrate on the neighborhood expansion within U′.

Let δ = c/2 and C = 25M/(δ²(1 − 4ε)²). Take an initial set S_0 ⊂ U′ with max(Cσ²vol(G), Δ ln n) ≤ vol(S_0) ≤ max(Cσ²vol(G), Δ ln n) + Δ. Let T_0 = U′ \ S_0. For i ≥ 1, we recursively define S_i = Γ_{Gp}(S_{i−1}) ∩ U′ and T_i = U′ \ ∪_{j=0}^{i} S_j, until vol_2(T_i) ≤ (1 − 3ε)vol_2(G) or vol(S_i) ≥ 2δ vol_2(T_i) vol(G)/(5p vol_3(T_i)).

Condition 1 in Lemma 3 is always satisfied:

  (5p/(2δ)) σ² max_{v∈T_i}{d_v²} vol(G) ≤ (5(1 + c)/(2d̃δ)) σ²Δ² vol(G) = (5(1 + c)/(2δ)) (σΔ/d̃)² vol_2(G) = o(vol_2(G)) ≤ vol_2(T_i).

Condition 3 in Lemma 3 is also trivial because

  δ² vol_2(T_i)²/(25p²σ²vol_5(T_i)) ≥ δ² vol_2(T_i)²/(25p²σ²Δ²vol_3(T_i)) ≥ δ²(1 − 3ε)vol_2(G)/(25p²σ²Δ²M d) ≥ (d̃/(σΔ))² (δ²(1 − 3ε)/(25(1 + c)²M)) vol(G) = ω(vol(G)).

Now we verify condition 2. We have

  vol(S_0) > Cσ²vol(G) = (25M/(δ²(1 − 4ε)²)) σ²vol(G) ≥ 25σ² vol_3(T_0) vol(G)²/(δ² vol_2(T_0)²).
The conditions of Lemma 3 are thus all satisfied, so

  vol(Γ_{Gp}(S_0) ∩ T_0) > (1 − δ) p (vol_2(T_0)/vol(G)) vol(S_0)

with probability at least 1 − exp( −δ(1 − δ) p vol_2(T_0) vol(S_0)/(10Δ vol(G)) ).

Since (1 − δ) p vol_2(T_i)/vol(G) ≥ (1 − δ)(1 − 3ε)(1 + c) = β > 1 by our assumption that c is small (noting that ε and δ are functions of c), the neighborhood of S_i grows exponentially, allowing condition 2 of Lemma 3 to continue to hold and allowing us to continue the process. We stop when one of the following two events happens:

– vol(S_i) ≥ 2δ vol_2(T_i) vol(G)/(5p vol_3(T_i));
– vol_2(T_i) ≤ (1 − 3ε)vol_2(G).

Let us denote the time at which this happens by t. If the first, but not the second, case occurs we have

  vol(S_t) ≥ 2δ vol_2(T_t) vol(G)/(5p vol_3(T_t)) ≥ (2δ(1 − 3ε)/(5M(1 + c))) vol(G).

In the second case, we have

  vol_2(∪_{j=0}^{t} S_j) = vol_2(U′) − vol_2(T_t) ≥ ε vol_2(G) ≥ ε vol_2(U′).

By Lemma 4 with η = ε, we have vol(∪_{j=0}^{t} S_j) ≥ (ε²(1 − 2ε)d̃/(M d)) vol(G). On the other hand, note that since vol(S_i) ≥ β vol(S_{i−1}), we have vol(S_i) ≤ β^{i−t} vol(S_t), and hence vol(∪_{j=0}^{t} S_j) ≤ Σ_{j=0}^{t} β^{−j} vol(S_t), so

  vol(S_t) ≥ (ε²(1 − 2ε)d̃(β − 1)/(M d β)) vol(G).

In either case we have vol(S_t) = Θ(vol(G)).

For the moment, we restrict ourselves to the case where Cσ²n > Δ ln n. Each vertex in S_t is in the same component as some vertex in S_0, which has size at most vol(S_0)/(√ε d) ≤ C′σ²n. We now combine the k_1 largest components to form a set W^(1) with vol(W^(1)) > Cσ²vol(G), such that k_1 is minimal. If k_1 ≥ 2, then vol(W^(1)) ≤ 2Cσ²vol(G). Note that since the average size of a component is at least vol(S_t)/|S_0| ≥ C_1 vol(G)/(σ²n), we have k_1 ≤ C_1′ σ⁴ n.

We grow as before: let W_0^(1) = W^(1) and Q_0^(1) = T_{t−1} \ W_0^(1). Note that the conditions for Lemma 3 are satisfied by W_0^(1) and Q_0^(1). We run the process as before, setting W_i^(1) = Γ_{Gp}(W_{i−1}^(1)) ∩ Q_{i−1}^(1) and Q_i^(1) = Q_{i−1}^(1) \ W_i^(1), stopping when either vol_2(Q_i^(1)) < (1 − 4ε)vol_2(G) or vol(W_i^(1)) > 2δ vol_2(Q_i^(1)) vol(G)/(5p vol_3(Q_i^(1))) ≥ (2δ(1 − 4ε)/(5M(1 + c))) vol(G). As before, in either case vol(W_t^(1)) = Θ(vol(G)) at the stopping time t. Note that if k_1 = 1, we are now done, as all vertices in W_t^(1) lie in the same component of Gp.
Now we iterate. Each of the vertices in W_t^(1) lies in one of the k_1 components of W_0^(1). We combine the largest k_2 components to form a set W^(2) of volume > Cσ²vol(G). If k_2 = 1, then one more growth phase finishes us; otherwise vol(W^(2)) < 2Cσ²vol(G), the average size of the components is at least C_2 vol(G)/(σ⁴n), and hence k_2 ≤ C_2′ σ⁶ n.

We iterate, growing W^(m) until either vol_2(Q_t^(m)) < (1 − (m + 3)ε)vol_2(G) or vol(W_t^(m)) > 2δ vol_2(Q_t^(m)) vol(G)/(5p vol_3(Q_t^(m))), so that W_t^(m) has volume Θ(vol(G)), and then creating W^(m+1) by combining the largest k_{m+1} components to form a W^(m+1) with volume at least Cσ²vol(G). Once k_m = 1 for some m, all vertices in W^(m) are in the same component, and one more growth round finishes the process, resulting in a giant component in Gp. Note that the average size of a component in W^(m) is at least C_m vol(G)/(σ^{2(m+1)} n) (that is, components must grow by a factor of at least 1/σ² each iteration), and if k_m > 1, we must have k_m ≤ C_m′ σ^{2(m+1)} n. If m = ⌈1/(2κ)⌉ − 1, this would imply that k_m = o(1) by our condition σ = o(n^{−κ}), so after at most ⌈1/(2κ)⌉ − 1 rounds we must have k_m = 1, and the process halts with a giant connected component.

In the case where Δ ln n > Cσ²n, we note that |S_0| ≤ vol(S_0)/(√ε d) ≤ C′ Δ ln n/(√ε d), and the average volume of the components in S_t is at least C′′ vol(G) d/(Δ ln n) = ω(Δ ln n), so we can form W^(1) by taking just one component for n large enough, and the proof goes as above.

We note that throughout, whenever we try to expand, we have

  vol_2(Q_t^(m)) > (1 − (m + 3)ε)vol_2(G) > (1 − (1/(2κ) + 4)ε) vol_2(G) > (1 − 9c/20) vol_2(G).

By our choice of c being sufficiently small, (1 − (m + 3)ε)(1 − δ)(1 + c) > 1 at all times, so throughout, noting that vol(S_i) and vol(W_i^(m)) are at least Δ ln n, we are guaranteed our exponential growth by Lemma 3 with an error probability bounded by

  exp( −δ(1 − δ) p vol_2(T_i) vol(S_i)/(10Δ vol(G)) ) ≤ exp( −δ(1 − δ)(1 − 9c/20) vol(S_i)/(2Δ) ) ≤ n^{−K}.

We run for a constant number of phases, and for at most a logarithmic number of steps in each growth phase, as the sets grow exponentially. Thus, the probability of failure is at most C′′ log(n) n^{−K} = o(1) for some constant C′′, completing our argument that Gp contains a giant component with high probability.

Finally, we prove the uniqueness assertion. With probability 1 − C′′ log(n) n^{−K} there is a giant component X. Let u be chosen at random; we estimate the probability that u is in a component of volume at least max(2d log n, ω(σ vol(G))). Let Y be the component of u. Theorem 5.1 of [7] asserts that if vol(Y) ≥ max(2d log n, ω(σ vol(G))), then

  e(X, Y) ≥ vol(X)vol(Y)/vol(G) − σ √(vol(X)vol(Y)) ≥ 1.5 d log n.
Note that the probability that Y is not connected to X, given that vol(Y) = ω(σ vol(G)), is (1 − p)^{e(X,Y)} = o(n^{−1}), so with probability 1 − o(1) no vertex lies in such a component, proving the uniqueness of large components.
References
1. Alon, N., Benjamini, I., Stacey, A.: Percolation on finite graphs and isoperimetric inequalities. Annals of Probability 32(3), 1727–1745 (2004)
2. Ajtai, M., Komlós, J., Szemerédi, E.: Largest random component of a k-cube. Combinatorica 2, 1–7 (1982)
3. Bollobás, B., Kohayakawa, Y., Łuczak, T.: The evolution of random subgraphs of the cube. Random Structures and Algorithms 3(1), 55–90 (1992)
4. Bollobás, B., Borgs, C., Chayes, J., Riordan, O.: Percolation on dense graph sequences (preprint)
5. Borgs, C., Chayes, J., van der Hofstad, R., Slade, G., Spencer, J.: Random subgraphs of finite graphs. I. The scaling window under the triangle condition. Random Structures and Algorithms 27(2), 137–184 (2005)
6. Borgs, C., Chayes, J., van der Hofstad, R., Slade, G., Spencer, J.: Random subgraphs of finite graphs. III. The scaling window under the triangle condition. Combinatorica 26(4), 395–410 (2006)
7. Chung, F.: Spectral Graph Theory. AMS Publications (1997)
8. Chung, F., Lu, L., Vu, V.: The spectra of random graphs with given expected degrees. Internet Mathematics 1, 257–275 (2004)
9. Chung, F., Lu, L.: Connected components in random graphs with given expected degree sequences. Annals of Combinatorics 6, 125–145 (2002)
10. Chung, F., Lu, L.: Complex Graphs and Networks. AMS Publications (2006)
11. Erdős, P., Rényi, A.: On Random Graphs I. Publ. Math. Debrecen 6, 290–297 (1959)
12. Frieze, A., Krivelevich, M., Martin, R.: The emergence of a giant component of pseudo-random graphs. Random Structures and Algorithms 24, 42–50 (2004)
13. Grimmett, G.: Percolation. Springer, New York (1989)
14. Kesten, H.: Percolation Theory for Mathematicians. Progress in Probability and Statistics, vol. 2. Birkhäuser, Boston (1982)
15. Malon, C., Pak, I.: Percolation on finite Cayley graphs. In: Rolim, J.D.P., Vadhan, S.P. (eds.) RANDOM 2002. LNCS, vol. 2483, pp. 91–104. Springer, Heidelberg (2002)
16. Nachmias, A.: Mean-field conditions for percolation in finite graphs (preprint, 2007)
17. Nachmias, A., Peres, Y.: Critical percolation on random regular graphs (preprint, 2007)
Quantifying the Impact of Information Aggregation on Complex Networks: A Temporal Perspective⋆
Fernando Mourão, Leonardo Rocha, Lucas Miranda, Virgílio Almeida, and Wagner Meira Jr.
Federal University of Minas Gerais, Department of Computer Science, Belo Horizonte MG, Brazil
{fhmourao,lcrocha,lucmir,virgilio,meira}@dcc.ufmg.br
⋆ Partially supported by CNPq, Finep, Fapemig, and CNPq/CT-Info/InfoWeb.
Abstract. Complex networks are a popular and frequent tool for modeling a variety of entities and their relationships. Understanding these relationships and selecting which data will be used in their analysis is key to a proper characterization. Most of the current approaches consider all available information for analysis, aggregating it over time. In this work, we studied the impact of such aggregation while characterizing complex networks. We model four real complex networks using an extended graph model that enables us to quantify the impact of the information aggregation over time. We conclude that data aggregation may distort the characteristics of the underlying real-world network and must be performed carefully.
1 Introduction

It is not surprising that the intensive process of digitization in all areas is resulting in an increasing amount of data being stored in computer systems. It is also remarkable that the information associated with such data is getting more complex, both in terms of its nature and of the relationships it represents. Such a scenario imposes a continuous demand for scalable and robust modeling tools. A popular and successful strategy that has been used in this context is modeling the information interaction using complex networks [1,2]. There are several practical problems that may be modeled by a set of objects belonging to a domain and the relationships between them. Some fields such as sports, economy, biology and others are modeling their data using complex networks in order to understand the hidden behavior inherent to the relationships among their objects (e.g., people, animals, molecules). Further, the understanding of such relationships is key to the characterization of the objects in the network [3,4] and may benefit several applications such as service personalization, information retrieval, and automatic document classification, among others.

A major challenge for performing good characterizations of complex networks is the selection of the relevant data for the sake of relationship analysis. Current approaches usually employ all available information regarding the objects and their relationships, aggregating it over time. For example, to identify the emergence of communities in a scientific collaboration network, we may cluster frequently related people (i.e., who
publish papers together) over time. Intuitively, the larger the observation period, the more information we should have for a good clustering; but the relationships may change over time, and the resulting aggregated information becomes inconsistent or contradictory. In this work we investigate whether the aggregation of information over time may result in confusing networks. There may be several definitions of confusion; ours stems from the entropy concept, that is, the uncertainty or randomness level associated with the meaning of the relationships between network objects. Such confusion makes it more difficult to extract truthful descriptions of object behaviors and their interactions. For example, two individuals may have common publications in the area of Data Mining in a given time period. Later they publish together again, but in another field such as Bioinformatics, leading to the erroneous conclusion that these two areas are closely related.

In order to pursue the proposed investigation, we propose a methodology that models complex networks using a temporal-aware graph model, which extends traditional models [5], and compares the measurements of several criteria at various time scales. These criteria comprise network statistics as well as shrinking, densification, and confusion-related measures. In particular, we want to investigate whether the shrinking and densification phenomena [6] are correlated with the increase in confusion.

Although there have been several characterization studies of complex networks, few of them have focused on the analysis of the individuals and their relations. Rather, there have been significant efforts in the analysis of statistical properties that summarize the networks and in the development of new models that explain the network behavior as a whole. The works that focus on individuals usually characterize them without taking into account the temporal evolution of the networks. Unlike those works, we focus on the analysis of the individuals and their relationships in the light of the temporal evolution of these networks.
2 Related Work

The increasing availability of a large volume of real data, as well as the availability of powerful computers to process these data, has turned complex networks into an area of growing interest [1]. Due to their capability to model a wide range of applications, complex networks have been applied to fields such as economics, sports, and medicine. Wilson [7] stated that “The greatest challenge today, not just in cell biology and ecology, but in all of science, is the accurate and complete description of complex systems.” We can divide the characterization efforts into two main groups, explained next.

In the first group, research has focused on finding ways to summarize and explain network behaviors [5,8]. In order to provide a better understanding of the networks, several topological metrics have been analyzed [3,2]. The aim of these works is to identify well-defined phenomena, which are intriguing and sometimes common to various networks. Furthermore, some studies look for mathematical models that explain these phenomena. Based on these studies, models such as Random Graphs [9], Scale Free [10], and Small Worlds [11] were proposed for real networks. In the second group, the studies have analyzed the individuals in the network, and the relationships between them, in order to use this information to improve tasks such as prediction, recommendation systems, and marketing campaigns [12,13]. These studies
focus, mainly, on investigating three basic assumptions: (1) the individual's behavior tends to be consistent over time; (2) the behavior of a group may explain individual behaviors; and (3) similar individuals tend to behave similarly. Although several studies have been performed, few of them consider the temporal evolution of the networks. Most of them focus on the analysis of a static “network snapshot”. However, these networks may evolve across time, invalidating partially or completely earlier static models. Some studies analyzed such evolution, identifying behavioral trends over time [6,14]. Others targeted the weakness of current models w.r.t. the temporal evolution, incorporating evolutive aspects into these models [15,16] and into their premises [17,18]. Our work differs from others since we propose a methodology that quantifies the actual impact of temporal evolution on complex networks. Although it is common sense that such an impact exists, we did not find in the literature other studies that measure this impact, or that check the validity of models that involve time as an important factor.
3 Modeling Complex Networks from a Temporal Perspective

In order to account for the temporal evolution of complex networks, we extended the traditional graph model by adding temporal information to the relationships between objects, so that it supports the analysis of the modeled networks at various time scales. Our proposal is similar to others [19,20], but these were targeted at different scenarios. Our model, as illustrated by Figure 1, is an undirected, multivalued multi-graph. We associate two kinds of information with the edges: Type and Moment. The Type comprises any information inherent to the relationship represented by the edge, such as a class. The Moment is the time when the relationship was established, or another significant temporal event, and may be represented at different time scales according to the network being modeled. For example, in some networks a daily analysis is necessary, while in others a monthly or even annual analysis works better.

Fig. 1. Example of temporal-aware complex network model

Notice that our model is generic enough to be applicable to a wide range of application domains such as social, biological, technological, and information networks [1], enabling a study of these networks from a temporal perspective in their respective application domain. For example, we can build a crime network by modeling the criminals as vertices and the co-authorship in a crime as edges between criminals. To these edges we may associate information such as the type of the crime (i.e., assault, murder) and when the crime happened. A second example is social networks, where people may be connected by several types of relations (e.g., Type may be friendship and family relations, among others).
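The sketch below (our own illustration, not the authors' implementation) shows one way to realize this model with a networkx MultiGraph whose edges carry Type and Moment attributes; the helper name and the toy crime-network values are assumptions.

```python
# Temporal-aware model sketch: an undirected multigraph with Type and Moment on each edge.
import networkx as nx

net = nx.MultiGraph()

def add_relationship(g, u, v, rel_type, moment):
    """One relationship = one edge annotated with its Type and Moment."""
    g.add_edge(u, v, Type=rel_type, Moment=moment)

# Toy crime network: criminals as vertices, co-authorship in a crime as edges.
add_relationship(net, "alice", "bob", rel_type="assault", moment=2004)
add_relationship(net, "alice", "bob", rel_type="robbery", moment=2006)
add_relationship(net, "bob", "carol", rel_type="assault", moment=2006)

# All relationships between a pair, with their temporal information preserved:
print(net.get_edge_data("alice", "bob"))
# {0: {'Type': 'assault', 'Moment': 2004}, 1: {'Type': 'robbery', 'Moment': 2006}}
```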
4 Quantifying the Impact of Temporal Evolution

In this section we present a methodology to quantify the impact of information aggregation over time, which is applicable to any network modeled as just described. As mentioned, our analysis is based on comparing different temporal aggregation levels. For example, we can build a network by merging all relationships between two individuals in a given year into just one relationship, or by merging all relationships in the entire database into a single one. We then evaluate topological and statistical measures of complex networks and compare the resulting measures. The methodology is divided into five distinct steps that are applied at each time scale: (1) Basic Data Characterization; (2) Analysis of the Densification Effect; (3) Analysis of the Shrinking Effect; (4) Analysis of Prediction Confidence; and (5) Analysis of Prediction Quality.

The first step is to perform a basic characterization of each network that we want to analyze. The goal of this characterization is to acquire inputs for the analysis and to explain the results found in the other steps of the methodology. We consider basic information such as the degree distribution and the frequency of the relationships between the objects. This information is very important to identify the most significant forms of relationship between individuals in the network.

The second step is to analyze the densification phenomenon [6], which is characterized by the number of edges growing at a rate greater than the number of vertices, with this rate following a power law. This observation shows that the number of relationships increases faster than the number of individuals, making the network denser in terms of relationships. The impact of this phenomenon in our context is that a denser network may increase the probability of different types of relationships, as well as the diversity of the relationships' semantics over time. Thus, aggregating the data may induce accidental co-occurrence of relationship types that never happened simultaneously or temporally close. On the other hand, the relationship densification may strengthen other types, which will be further amplified by aggregation, helping the analysis and understanding.

The third step in our methodology analyzes the network shrinking effect [6], which is characterized by a decrease of the network diameter across time. In our context, it means that objects in the network that are semantically distinct become increasingly close. This effect may make it harder to cluster data or to distinguish individuals. In addition, the relationships in this scenario become less discriminative, impairing the ability to predict them, since completely different individuals can be connected as if they were similar. In order to measure this effect, the authors of [6] used the diameter of the network, based on the cumulative distance distribution between pairs of nodes. This metric is widely used, since it is more robust than a simple average network distance. In our work, however, we extend the use of this metric by analyzing the frequency distribution of each distance value across time, in order to find hidden behavior that may occur. Such a procedure supports safer conclusions about the general behavior of the network.
The fourth step assesses the impact on the ability to predict the occurrence of relationships based on historical data. Such a measure is directly related to the quality of applications such as automatic classifiers, clustering, and service personalization, and is therefore a good way to assess the quality of a characterization. The metric chosen is the confidence, or degree of certainty, associated with the predictions performed at different time scales. Intuitively, the smaller the time scale, the higher the confidence. On the other hand, relationships, and the confidence associated with them, may become stronger through time. For example, in a crime network, the relationships may represent co-authorship in a crime, and a given crime that relates two individuals may always be the same across time. In order to analyze the confidence of the relationships, we adopt the concept of Dominance [21]. The Dominance measures the degree of exclusivity of a relationship Type between two objects in the network during a given period. For the sake of prediction, in our case, the type of relationship associated with the highest Dominance is the one most likely to occur again. We then analyze how the Dominance values vary at different time scales.

Finally, the last step of our methodology verifies the quality of the predictions at different time scales, analyzing the accuracy of these predictions. Considering the crime example, although the prediction confidence for a given crime as the next relationship between two individuals may be 100%, a different crime may occur as a consequence of the evolution of the relationship between those individuals. In order to evaluate the prediction quality, we perform experiments in which we try to predict the relationships, following a procedure similar to those applied to verify the quality of automatic document classifiers, based on the 10-fold Cross-Validation [22] strategy.
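A rough sketch of the Dominance computation, based only on the description above (our own illustration, not the code of [21]); the data layout and function name are assumptions.

```python
# Dominance of a relationship Type between two objects within a period:
# the fraction of their relationships in that period carrying the most frequent Type.
from collections import Counter

def dominance(relationships, u, v, period):
    """relationships: iterable of (u, v, rel_type, moment); period: (start, end)."""
    types = [t for (a, b, t, m) in relationships
             if {a, b} == {u, v} and period[0] <= m <= period[1]]
    if not types:
        return None, 0.0
    rel_type, count = Counter(types).most_common(1)[0]
    return rel_type, count / len(types)   # predicted Type and its confidence

rels = [("a", "b", "data mining", 2001), ("a", "b", "data mining", 2002),
        ("a", "b", "bioinformatics", 2006)]
print(dominance(rels, "a", "b", (2000, 2002)))   # ('data mining', 1.0)
print(dominance(rels, "a", "b", (2000, 2006)))   # ('data mining', 0.666...)
```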
5 Case Study

5.1 Workload

For the sake of the evaluation presented in the next section, we use two information networks and two social networks from distinct areas, described next. The first information network (ACM T) consists of terms that appear in Computer Science articles from the ACM digital library. The database is composed of 30,000 articles from 1980 to 2001. In this collection, classes are assigned to documents by their own authors. The information network is built as follows: each node is a term that appears in the title or abstract of an article, and two terms have a relationship if they co-occur in the same document. The relationship Type is the class assigned to the document in which the terms co-occur. The Moment is the year when the document was published. The second information network (MD T) corresponds to terms in Medicine articles from the MedLine digital library. This collection comprises 7,251,507 articles from 1960 to 2006, also classified by the authors. The construction of this information network followed the same procedure as the first one. Due to the huge size of the network generated from the entire database, we chose to use only the documents from 1970 to 1990.

The first social network (MD CA) analyzed represents a network of co-authorships in the field of Medicine and was built using the same MedLine database described above, however using all years. In this network each author is a node, and there is a relationship between two authors if they published together. The relationship Type
Fig. 2. Basic Network Information. (a) Non-Aggregation network size:

  DataBase   Number of Vertices   Number of Edges
  ACM T      56,449               19,861,357
  MD T       351,430              620,799,505
  MD CA      4,141,726            53,983,502
  Wiki       170,499              114,100,310

(b) Relation summarization (percentage of edges remaining in the aggregated versions).
is the published document's class, and the Moment is the year when the document was published. The second social network (Wiki) was built from the edition log pages of the English Wikipedia. The database is a set of 7,761,201 revisions, from 13,928,917 distinct pages, from 01/01/2005 to 12/31/2005, discretized on a daily basis. In this network, each user is a vertex, and there is a relationship between two users if they edited, at some moment, the same page. We ignore anonymous edits, since these are recorded only by IP address and may combine the activities of many people. We consider the Type as the edited page and the Moment as the day on which the page was edited.

In order to measure the impact inherent to information aggregation, we analyze three versions of each network, each employing a distinct aggregation level. The first one (Non-Aggregation) is the network modeled as we propose, in which each relationship is considered separately, without aggregation. The second version (Local-Aggregation) is a network in which all relationships between two nodes that occurred in the same year are merged into a single relationship. Finally, we analyze a version (Global-Aggregation) in which all relationships between two nodes, regardless of time, are transformed into a single relationship, providing a global, or historical, aggregation of relationships. Note that these aggregations summarize the networks in terms of the total number of relationships. Considering the number of edges found for each network using the Non-Aggregation version, presented in Figure 2(a), we present in Figure 2(b) the percentage of edges that remain in the other versions. As we can see, the value is small for most of the networks, mainly in the Global-Aggregation version. This high summarization rate results from a greater amount of information being condensed into a single relationship, which may distort the networks considerably, as we will see below.

5.2 Experimental Evaluation

For the sake of evaluation, we applied the proposed methodology to each aggregation level and compared the results found in each one. In order to make the understanding of the results easier, we group the discussion by steps.

Basic Data Characterization. As mentioned, the first step is to analyze some basic information about the networks for a better understanding of them. Figure 3 shows the degree distribution of each network, considering the Non-Aggregation version. As we can see, all of them present the smallest degrees as the most frequent, which is explained by the fact that most objects
Fig. 3. Degree Distribution: (a) ACM T, (b) MD T, (c) MD CA, (d) Wiki.
occur only once in all databases, and such a characteristic has different meanings for each network. For the ACM T and MD T networks, it means that a term occurs in only one document and is consequently connected just to the terms that appear in that document. As the average number of terms per document is 8 when it has just a title, or 80 when it has a title and an abstract, for both collections the distribution exhibits a bimodal shape. Similarly, for the MD CA network, an author occurring in a single article is connected to the average number of authors per article, which is between 5 and 7. For the Wiki network, the higher probabilities in the range of 5 to 10% correspond to the number of users with whom a user has, at some moment, a relationship (i.e., reviewed the same article). The shape of the degree distribution observed in Figure 3, for the four networks, was the same in all versions. However, considering the maximum degree found, while the Local-Aggregation version presented values similar to the Non-Aggregation one, the Global-Aggregation version is smaller by one order of magnitude, shifting the distribution to the left of the graph. Thus, in terms of vertex connectivity, the information aggregation, mainly locally, does not cause significant distortions.

We also analyze the distribution of the duration of the relationships between pairs of individuals in the networks across the years (see Figure 4). As we can see, most of the relationships last for just one year, which is consistent with the previous finding that most of the individuals also appear only once. Moreover, we can see that, for all networks, the duration of the relationships between individuals follows a power law with exponential cut-off. Therefore, we can conclude that there is a considerable number of relationships that are maintained for relatively long periods. In this scenario, aggregating information may be beneficial if the semantics of these relationships is the same throughout the period. Otherwise, aggregating information may bring confusion when analyzing these networks.

Another characteristic evaluated was the frequency with which pairs of vertices are related in the Non-Aggregation network (Figure 5). We can see that, in all networks, the behavior is explained by a power law, showing that most of the relationships occur rarely, influenced by the number of occurrences of vertices in the collections. Comparing these graphs to those of the Local-Aggregation and Global-Aggregation versions, we conclude that the Local-Aggregation version behaves similarly, with a similar power-law exponent in all networks. However, the exponent of the Global-Aggregation version is half of the former, presenting a much slower decay and, consequently, higher relationship frequencies. Thus, a historical aggregation distorts the networks in terms of the relationship frequency.
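The following sketch (assumed edge-list representation, not the authors' code) illustrates the three aggregation levels described in Section 5.1 and the resulting edge summarization rate reported in Figure 2(b).

```python
# Three aggregation levels over an edge list of (u, v, Type, Moment) tuples.
def non_aggregation(edges):
    return list(edges)                                    # every relationship kept

def local_aggregation(edges):
    # merge all relationships between u and v that share the same year
    return list({(frozenset((u, v)), moment) for u, v, _, moment in edges})

def global_aggregation(edges):
    # merge all relationships between u and v regardless of time
    return list({frozenset((u, v)) for u, v, _, _ in edges})

edges = [("a", "b", "KDD", 2001), ("a", "b", "KDD", 2001),
         ("a", "b", "BIO", 2005), ("b", "c", "KDD", 2001)]
for name, agg in [("Non", non_aggregation), ("Local", local_aggregation),
                  ("Global", global_aggregation)]:
    kept = len(agg(edges))
    print(f"{name}-Aggregation: {kept} edges ({100.0 * kept / len(edges):.0f}% of original)")
```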
Fig. 4. Relationship Duration: (a) ACM T, (b) MD T, (c) MD CA, (d) Wiki; probability of occurrence vs. duration (in years; in days for Wiki), with fitted power laws with exponential cut-off (ACM T ∝ x^{−3.191}e^{0.1666x}, MD T ∝ x^{−2.357}e^{0.2242x}, MD CA ∝ x^{−1.575}e^{−0.2576x}, Wiki ∝ x^{−2.193}e^{−4.064x}).

Fig. 5. Frequency Distribution of Relationships between pairs of Vertices: (a) ACM T, (b) MD T, (c) MD CA, (d) Wiki; CCDF vs. frequency of relationship, with power-law fits (e.g., ACM T ∝ x^{−3.35}, MD CA ∝ x^{−3.478}, Wiki ∝ x^{−2.448}).
Fig. 6. Densification Law: number of edges vs. number of vertices over time (log-log) for each network, (a) ACM T, (b) MD T, (c) MD CA, (d) Wiki, with power-law fits annotated per panel (ACM T: ∝ x^{2.215}; MD T: ∝ x^{1.92}, ∝ x^{0.6}; MD CA: ∝ x^{1.254}, ∝ x^{1.924}, ∝ x^{1.032}; Wiki: ∝ x^{1.574}, ∝ x^{2.041}).
Analysis of Densification Effect. In order to analyze the densification effect, we plot, for each network, the number of edges against the number of vertices, as shown in Figure 6, where each point refers to a distinct moment. As we can observe, the number of edges per vertex increases following a power law in all cases; however, we identified a double power law in most networks. Most interestingly, each power law identified refers exactly to a different period (decade). That is, over time, though the relationship between vertices and edges follows a power law, this relationship changes, justifying a more detailed analysis than in [6]. More specifically, there is an evolution even in the process of network growth, perhaps as a consequence of several issues external to the networks. A hypothesis to be considered in the future is the emergence of a greater interaction between the various sciences over time, amplified by the Internet in the 90s, affecting the formation process of several complex networks.
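A hedged sketch of this densification fit: regress log(edges) on log(vertices) over yearly snapshots. The snapshot values below are made up for illustration; fitting separate lines per decade would expose the double power law discussed above.

```python
# Densification-law fit: e(t) ~ n(t)^alpha via a log-log linear regression.
import numpy as np

snapshots = {1980: (3_000, 40_000), 1985: (8_000, 170_000),
             1990: (20_000, 700_000), 1995: (45_000, 2_400_000),
             2000: (90_000, 7_500_000)}          # year -> (n_vertices, n_edges), illustrative
n = np.array([v for v, _ in snapshots.values()], dtype=float)
e = np.array([m for _, m in snapshots.values()], dtype=float)
alpha, log_c = np.polyfit(np.log(n), np.log(e), 1)
print(f"densification exponent ~ {alpha:.2f}  (edges proportional to vertices^{alpha:.2f})")
```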
Analysis of Shrinking Effect. We then analyzed whether the networks shrink over time, as shown in [6]. We use the frequency distribution of the network distances for the evaluation. Figure 7 shows the evolution of this distribution over time in our networks; we compute the distance-value intervals with the highest frequency for each network. As we can see, the expected shrinking behavior was not observed during the whole period in the networks analyzed. For most of them there are periods of both shrinking and growing. One exception is the ACM T network, which only grew during the analyzed period.
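A small sketch (our own, with an assumed data layout) of the Old/New Relations split used in this analysis: for each moment, the fraction of new relationships whose endpoint pair was already related at some earlier moment.

```python
# Per-moment probability that a new relationship reinforces an existing pair ("Old Relation").
from collections import defaultdict

def old_vs_new(edges):
    """edges: iterable of (u, v, moment); returns {moment: P(Old Relation)}."""
    by_moment = defaultdict(list)
    for u, v, m in edges:
        by_moment[m].append(frozenset((u, v)))
    seen, result = set(), {}
    for m in sorted(by_moment):
        pairs = by_moment[m]
        old = sum(1 for p in pairs if p in seen)
        result[m] = old / len(pairs)
        seen.update(pairs)
    return result

print(old_vs_new([("a", "b", 1), ("b", "c", 1), ("a", "b", 2), ("c", "d", 2)]))
# {1: 0.0, 2: 0.5}
```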
Fig. 7. Evolution of the relationships between pairs of vertices: top, distance distribution over time for ACM T, MD T, MD CA, and Wiki (fraction of pairs at distance 1, 2, 3, > 3; for Wiki, distance < 30, > 30, and unconnected nodes); bottom, relation types evolution, i.e., the fraction of Old Relations vs. New Relations over time for each network.
We explain this phenomenon by examining the network topology. As explained in the first step, many of the new vertices that appear in the network across time are linked to just a few others, which may increase the network distances. For the sake of the analysis, we divide the novel relationships according to the prior occurrence (Old Relations) or not (New Relations) of a relationship between the same two individuals. Thus, for each moment we compute the probability that a new relationship reinforces, or not, an existing one. In the case of reinforcement, that is, Old Relations, the relationships tend to remain the same, as does the diameter. Otherwise, the diameter tends to decrease, since vertices that were not related before become connected. The purpose of this analysis is to verify whether the relationships are created between vertices that were already related in the past or not. If so, relations tend to be the same across time, not contributing to a reduction of the network diameter; otherwise, the diameter of the network tends to decline because nodes that were not related before become connected. As we can see in Figure 7, there is an increasing probability of Old Relations for most of the networks. Only the MD CA network presents an increasing probability of New Relations, resulting in network shrinking. An interesting point to note is that the periods of network growth match the periods of increasing probability of Old Relations for all networks.

Analysis of Prediction Confidence. The next step is to analyze the level of prediction confidence. We started our analysis by observing the relationship dispersion, in terms of the number of different relationship
Fig. 8. Number of Distinct Relationship Types between Pairs of Nodes: (a) ACM T, (b) MD T, (c) MD CA, (d) Wiki; probability vs. number of distinct types, with fitted curves (ACM T ∝ e^{−2.733x} + e^{−0.4954x}, MD T ∝ e^{−3.012x} + e^{−0.649x}, MD CA ∝ x^{10.08}e^{−5.784x}, Wiki ∝ x^{−2.942}).
Fig. 9. Dominance Over Time: for each network (ACM T, MD T, MD CA, Wiki), the probability P(X ≥ x) (CCDF) of Dominance values 50%, 90%, and 100% over time in the Local-Aggregation version, and the CCDF of the Dominance in the Global-Aggregation version.
types between vertices in the Global-Aggregation version. Figure 8 shows that most relationships encompass just a single type, but a non-negligible fraction of them presents a large number of different types in all networks. Thus, there is a considerable number of vertices that are related across time, possibly at distinct moments, through different types of relationships. This will certainly interfere with the quality of the characterization when we aggregate information across time.

To show that the confusion caused by the variation of these relationship types across time degrades the quality of the predictions, we plot a Complementary Cumulative Distribution Function (ccdf) of the Dominance across time, for each network. We compare the Dominance found in the Global-Aggregation versions with those verified at each distinct moment in the Local-Aggregation versions, as shown in Figure 9. For the Local-Aggregation versions we generate the ccdf for three different Dominance values: 50%, 90%, and 100%.

Considering the Local-Aggregation version, we can observe that the probability of the Dominance being greater than or equal to 50%, at any moment, is very high in all databases (always above 90%). Another interesting observation is that the probability of the Dominance being greater than or equal to 90% is almost identical to that associated with 100% in all datasets. Thus, there is not a significant difference, in terms of probability of occurrence, between these two Dominances at any considered time. Finally, when comparing the values of Dominance found in the Local-Aggregation versions with the ones found in the Global-Aggregation versions, we notice significant differences for the ACM T and Wiki networks. While the Local-Aggregation versions of these networks, even for Dominance 100%, present a high probability of occurrence (greater than 90%), in the Global-Aggregation version such probability is lower than 79%. That is, the global information aggregation results in a considerable reduction of the prediction confidence (almost 13%). This shows that considering a smaller aggregation scale is better for the sake of prediction. In the MD T and MD CA networks, such reduction is much smaller, but still exists, showing that a local aggregation would still be interesting, due to the accuracy levels achieved using just a small fraction of the historical data.

Analysis of Prediction Quality. Finally, we check the quality of the predictions at each aggregation level. In order to verify it, we predict the relationship types based on the existing relationships in our database, performing a 10-fold cross validation. The average prediction hit rates for each network are shown in Table 1. As we can see, the quality achieved by the Local-Aggregation version is better than, or statistically equivalent to, that found for the Global-Aggregation version for all networks. For the ACM T and Wiki networks, the quality is around 20% higher, showing that aggregation with a historical perspective, considering all information available to the nodes, is not a good strategy.

Table 1. Average Prediction Hit Rate (%)

  Network                      ACM T   MD T    MD CA   Wiki
  Local-Aggregation Version    75.19   70.63   79.30   74.17
  Global-Aggregation Version   60.85   63.28   78.67   48.57
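A simplified sketch (our own, not the authors' experimental code) of this evaluation: 10-fold cross validation where a held-out relationship's Type is predicted as the most frequent Type observed for that pair in the training folds.

```python
# 10-fold cross-validation hit rate for majority-Type prediction per pair.
import random
from collections import Counter, defaultdict

def hit_rate(edges, folds=10, rng=random.Random(0)):
    """edges: list of (u, v, rel_type, moment); returns the average prediction hit rate."""
    edges = list(edges)
    rng.shuffle(edges)
    hits = total = 0
    for k in range(folds):
        test = edges[k::folds]
        train = [e for i, e in enumerate(edges) if i % folds != k]
        history = defaultdict(Counter)
        for u, v, t, _ in train:
            history[frozenset((u, v))][t] += 1
        for u, v, t, _ in test:
            seen = history.get(frozenset((u, v)))
            if seen:                      # only pairs with some training history are predictable
                hits += seen.most_common(1)[0][0] == t
                total += 1
    return hits / total if total else 0.0

rels = [("a", "b", "KDD", 2001)] * 6 + [("a", "b", "BIO", 2005)] * 2 + [("b", "c", "KDD", 2003)] * 4
print(f"average hit rate: {hit_rate(rels):.2f}")
```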
6 Conclusion and Ongoing Work

In this work we presented a study of the effects of information aggregation across time in complex networks. We used a network model to investigate these effects in several domains. We also proposed, and applied to real databases, a methodology for quantifying such effects, which is applicable to any network that fits our model. We evaluated two information networks and two social networks from distinct areas. Following the proposed methodology, we found that analyzing the aggregated information does not benefit the characterization of the networks examined, degrading the confidence and the quality of the predictions performed based on the historical information of the individuals in the networks. These results support our hypothesis that aggregating information across time increases confusion, since it distorts the characteristics of the underlying real-world network, making it more difficult to draw meaningful characterizations that describe the behavior of objects and their interactions.

Our analyses are relevant for improving the characterization of complex network objects, providing guidelines for the process of data selection prior to the relationship analysis of such objects. This study differs from others in the literature because we investigate the real impacts caused by the temporal evolution of complex
networks. Several studies of large-scale networks may benefit from it, as it provides a strategy to quantify the impact of each type of aggregation and to define the most appropriate level for the characterization to be performed. As an example, we can cite improvements in the building of predictive models, making them more effective by considering appropriate time periods. As future work, we will investigate more effective ways to select the data for analysis and characterization, so that we are able to extract information that is consistent from a temporal perspective.
References
1. Newman, M.E.J.: The structure and function of complex networks. SIAM Review 45(2), 167–256 (2003)
2. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Reviews of Modern Physics 74, 47 (2002)
3. Albert, R., Jeong, H., Barabási, A.L.: The diameter of the world wide web. Nature 401, 130 (1999)
4. Elmacioglu, E., Lee, D.: On six degrees of separation in dblp-db and more. SIGMOD Rec. 34(2), 33–40 (2005)
5. Dorogovtsev, S., Mendes, J.: Evolution of networks. Advances in Physics 51, 1079 (2002)
6. Leskovec, J., Kleinberg, J., Faloutsos, C.: Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proc. of the 11th ACM SIGKDD, pp. 177–187. ACM, New York (2005)
7. Wilson, E.O.: Consilience: The Unity of Knowledge. Knopf (1998)
8. Archdeacon, D.: Topological graph theory: A survey. Cong. Num. 115, 115–5 (1996)
9. Erdos, P., Renyi, A.: On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. 5, 17–61 (1960)
10. Barabási, A.L., Bonabeau, E.: Scale-free networks. Scientific American 288, 60–69 (2003)
11. Watts, D.J.: Small worlds: the dynamics of networks between order and randomness. Princeton University Press, Princeton (1999)
12. Du, N., Wu, B., Pei, X., Wang, B., Xu, L.: Community detection in large-scale social networks. In: Proc. of the 9th WebKDD and 1st SNA-KDD, NY, USA, pp. 16–25. ACM, New York (2007)
13. Said, Y.H., Wegman, E.J., Sharabati, W.K., Rigsby, J.T.: Social networks of author-coauthor relationships. Comput. Stat. Data Anal. 52(4), 2177–2184 (2008)
14. Barabási, A.L., Jeong, H., Neda, Z., Ravasz, E., Schubert, A., Vicsek, T.: Evolution of the social network of scientific collaborations. Physica A 311, 3 (2002)
15. Leskovec, J., Backstrom, L., Kumar, R., Tomkins, A.: Microscopic evolution of social networks. In: Proc. of the 11th ACM SIGKDD. ACM, New York (2008)
16. Kossinets, G., Kleinberg, J., Watts, D.: The structure of information pathways in a social communication network (June 2008)
17. Crandall, D., Cosley, D., Huttenlocher, D., Kleinberg, J., Suri, S.: Feedback effects between similarity and social influence in online communities. In: Proc. of ACM SIGKDD (2008)
18. Sharan, U., Neville, J.: Exploiting time-varying relationships in statistical relational models. In: Proc. of the 9th WebKDD and 1st SNA-KDD, pp. 9–15. ACM, New York (2007)
19. Liben-Nowell, D., Kleinberg, J.: The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology 58(7), 1019 (2007)
20. Kossinets, G., Watts, D.: Empirical Analysis of an Evolving Social Network (2006)
21. Rocha, L., Mourão, F., Pereira, A., Gonçalves, M., Meira, W.: Exploiting temporal contexts in text classification. In: Proc. of ACM CIKM, Napa Valley, CA, USA. ACM, New York (2008)
22. Brieman, L., Spector, P.: Submodel selection and evaluation in regression: The x-random case. International Statistical Review 60, 291–319 (1992)
A Local Graph Partitioning Algorithm Using Heat Kernel Pagerank
Fan Chung
University of California at San Diego, La Jolla, CA 92093, USA
[email protected]
http://www.math.ucsd.edu/~fan/
Abstract. We give an improved local partitioning algorithm using heat kernel pagerank, a modified version of PageRank. For a subset S with Cheeger ratio (or conductance) h, we show that there are at least a quarter of the vertices in S that can serve as seeds for heat kernel pagerank which lead to local cuts with Cheeger ratio at most O(√h), improving the previous bound by a factor of log |S|.
1 Introduction
With the emergence of massive information networks, many previous algorithms are often no longer feasible. A basic setup for a generic algorithm usually includes a graph as a part of its input. This, however, is no longer possible when dealing with massive graphs of prohibitively large size. Instead, the (host) graph, such as the WWW-graph or various social networks, is usually meticulously crawled, organized and stored in some appropriate database. The local algorithms that we study here involve only “local access” of the database of the host graph. For example, getting a neighbor of a specified vertex is considered to be a type of local access. Of course, it is desirable to minimize the number of local accesses needed, hopefully independent of n, the number of vertices in the host graph (which may as well be regarded as “infinity”). In this paper, we consider a local algorithm that improves the performance bound of previous local partitioning algorithms.
Graph partitioning problems have long been studied and used for a wide range of applications, typically along the line of divide-and-conquer approaches. Since the exact solution for graph partitioning is known to be NP-complete [12], various approximation algorithms have been utilized. One of the best known partitioning algorithms is the spectral algorithm. The vertices are ordered by using an eigenvector and only cuts which are initial segments in such an ordering are considered. The advantage of such a “one-sweep” algorithm is to reduce the number of cuts under consideration from an exponential number in n to a linear number. Still, there is a performance guarantee within a quadratic order by using a relation between eigenvalues and the Cheeger constant, called the Cheeger inequality. However, for a large graph (say, with a hundred million vertices), the task of computing an eigenvector is often too costly and not competitive.
A local partitioning algorithm finds a partition that separates a subset of nodes of specified size near specified seeds. In addition, the running time of a local algorithm is required to be proportional to the size of the separated part but independent of the total size of the graph. In [19], Spielman and Teng first gave a local partitioning algorithm using random walks. The analysis of their algorithm is based on a mixing result of Lovász and Simonovits in their work on approximating the volume of a convex body. The same mixing result was also proved earlier and independently by Mihail [17].
In a previous paper [1], a local partitioning algorithm was given using PageRank, a concept first introduced by Brin and Page [3] in 1998 which has been widely used for Web search algorithms. PageRank is a quantitative ordering of vertices based on random walks on the Webgraph. The notion of PageRank, which can be carried out for any graph, is basically an efficient way of organizing random walks in a graph. As seen in the detailed definition given later, PageRank can be expressed as a geometric sum of random walks starting from the seed (or an initial probability distribution), with its speed of propagation controlled by a jumping constant. The usual question in random walks is to determine how many steps are required to get close to the stationary distribution. In the use of PageRank, the problem is reduced to specifying the range for the jumping constant to achieve the desired mixing. The advantage of using PageRank as in [1] is to reduce the computational complexity by a factor of log n.
In this paper, we consider a modified version of PageRank called heat kernel pagerank. Like PageRank, the heat kernel pagerank has two parameters, a seed and a jumping constant. The heat kernel pagerank can be expressed as an exponential sum of random walks from the seed, scaled by the jumping constant. In addition, the heat kernel pagerank satisfies a heat equation which dictates the rate of diffusion. We will examine several useful properties of the heat kernel pagerank. In particular, for a given subset S of vertices, we consider eigenvalues of the induced subgraph on S satisfying the Dirichlet boundary condition (details to be given in the next section). We will show that for a subset S with Cheeger ratio h, there are many vertices in S (whose volume is at least a quarter of the volume of S) such that the one-sweep algorithm using heat kernel pagerank with such vertices as seeds will find local cuts with Cheeger ratio O(√h). This improves the previous bound of O(√(h log s)) in a similar theorem using PageRank [1,2]. Here s denotes the volume of S and a local cut has volume at most s.
2 Preliminaries
In a graph G, the transition probability matrix W of a typical random walk on a graph G = (V, E) is a matrix with columns and rows indexed by V and is defined by

    W(u, v) = 1/d_u  if {u, v} ∈ E,  and  W(u, v) = 0  otherwise,

where the degree of v, denoted by d_v, is the number of vertices that v is adjacent to. We can write W = D^{-1} A, where A denotes the adjacency matrix of G and
D is the diagonal degree matrix. A random walk on a graph G has a stationary distribution π if G is connected and non-bipartite. The stationary distribution π, if it exists, satisfies π(u) = d_u / \sum_v d_v.
The PageRank we consider here is also called personalized PageRank, which generalizes the version first introduced by Brin and Page [3]. PageRank has two parameters, a preference vector f (i.e., the probability distribution of the seed(s)) and a jumping constant α. Here, the function f : V → R is taken to be a row vector so that W can act on f from the right by matrix multiplication. The PageRank pr_{α,f}, with the scale parameter α and the preference vector f, satisfies the following recurrence relation:

    pr_{α,f} = α f + (1 − α) pr_{α,f} W.     (1)

An equivalent definition for PageRank is the following:

    pr_{α,f} = α \sum_{k=0}^{∞} (1 − α)^k f W^k.     (2)
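The recurrence (1) translates directly into a fixed-point iteration. The sketch below is our own illustration rather than code from the paper; the toy graph, the function names and the iteration count are assumptions made only for the example.

```python
# Illustrative sketch only: personalized PageRank pr_{alpha,f} obtained by
# iterating the recurrence pr = alpha*f + (1 - alpha)*pr*W, where
# W(u, v) = 1/d_u for every edge {u, v}.  Graph and names are hypothetical.

def pagerank(adj, alpha, f, iters=100):
    """adj: dict vertex -> list of neighbours; f: seed distribution as a dict."""
    nodes = list(adj)
    pr = {v: f.get(v, 0.0) for v in nodes}      # start from the seed distribution
    for _ in range(iters):
        nxt = {v: alpha * f.get(v, 0.0) for v in nodes}
        for u in nodes:
            deg = len(adj[u])
            if deg == 0:
                continue                        # isolated vertices pass no mass on
            share = (1.0 - alpha) * pr[u] / deg
            for v in adj[u]:
                nxt[v] += share                 # one random-walk step, damped by 1-alpha
        pr = nxt
    return pr

if __name__ == "__main__":
    graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}    # toy undirected graph
    print(pagerank(graph, alpha=0.15, f={0: 1.0}))          # seed f = chi_0
```

With f = χ_u the iteration approximates pr_{α,χ_u}; a larger jumping constant α keeps more probability mass near the seed, while a smaller α lets the walk spread further (α = 0.15 corresponds to the popular damping factor 0.85 in the usual PageRank convention).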
For example, if we have one starting seed denoted by vertex u, then f can be written as the (0, 1)-indicator function χ_u of u. Another example is to take f to be the constant function with value 1/n at every vertex, as in the original definition of PageRank in Brin and Page [3].
The heat kernel pagerank also has two parameters, with (the temperature) t ≥ 0 and a preference vector f, defined as follows:

    ρ_{t,f} = e^{−t} \sum_{k=0}^{∞} (t^k / k!) f W^k.     (3)
We see that (3) is just an exponential sum, whereas (2) is a geometric sum. For many combinatorial problems, exponential generating functions play a useful role. The heat kernel pagerank as defined in (3) satisfies the following heat equation:

    ∂/∂t ρ_{t,f} = −ρ_{t,f}(I − W).     (4)

Let us define L = I − W. Then the definition of the heat kernel pagerank in (3) can be rewritten as follows: ρ_{t,f} = f H_t, where H_t is defined by

    H_t = e^{−t} \sum_{k=0}^{∞} (t^k / k!) W^k = e^{−t(I−W)} = e^{−tL} = \sum_{k=0}^{∞} ((−t)^k / k!) L^k.
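As an illustration of (3), the exponential sum can be truncated after K terms and evaluated with K random-walk steps. The sketch below is ours and not an algorithm from the paper; the truncation level K and the toy graph are assumptions.

```python
# Illustrative sketch only: heat kernel pagerank
# rho_{t,f} = e^{-t} * sum_{k>=0} (t^k / k!) * f W^k, truncated at K terms.
# The random-walk step f -> fW reuses the transition rule W(u, v) = 1/d_u.

import math

def walk_step(adj, f):
    """One step of the random walk: returns f W as a dict."""
    out = {v: 0.0 for v in adj}
    for u, mass in f.items():
        deg = len(adj[u])
        if deg:
            for v in adj[u]:
                out[v] += mass / deg
    return out

def heat_kernel_pagerank(adj, t, f, K=100):
    rho = {v: 0.0 for v in adj}
    fk, coeff = dict(f), math.exp(-t)            # k = 0 term: e^{-t} * f
    for k in range(K + 1):
        for v, mass in fk.items():
            rho[v] += coeff * mass
        fk = walk_step(adj, fk)                  # f W^{k+1}
        coeff *= t / (k + 1)                     # e^{-t} t^{k+1} / (k+1)!
    return rho

if __name__ == "__main__":
    graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    print(heat_kernel_pagerank(graph, t=3.0, f={0: 1.0}))
```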
From the above definition, we have the following facts for ρ_{t,f} which will be useful later.

Lemma 1. For a graph G, its heat kernel pagerank ρ satisfies the following:
(i) ρ_{0,f} = f.
(ii) ρ_{t,π} = π.
(iii) ρ_{t,f} 1^* = f 1^* = 1 if f satisfies \sum_v f(v) = 1, where 1 denotes the all 1's function (as a row vector) and 1^* denotes the transpose of 1.
(iv) W H_t = H_t W.
(v) D H_t = H_t^* D and H_t = H_{t/2} H_{t/2} = H_{t/2} D^{−1} H_{t/2}^* D.

Proof. Since H_0 = I, (i) follows. (ii) and (iii) can be easily checked. (iv) follows from the fact that H_t is a polynomial of W. (v) is a consequence of the fact that W = D^{−1} A.
3 Dirichlet Eigenvalues and the Restricted Heat Kernel
For a subset S of V(G), there are two types of boundary of S — the vertex boundary δ(S) and the edge boundary ∂(S). The vertex boundary δ(S) is defined as follows:

    δ(S) = {u ∈ V(G) \ S : u ∼ v for some v ∈ S}.

For a single vertex v, the degree of v, denoted by d_v, is equal to |δ(v)| (which is short for |δ({v})|). The closure of S, denoted by S^*, is the union of S and δ(S). For a function f : S^* → R, we say f satisfies the Dirichlet boundary condition if f(u) = 0 for all u ∈ δ(S). We use the notation f ∈ D_S^* to denote that f satisfies the Dirichlet boundary condition, and we require that f ≠ 0. For f ∈ D_S^*, we define a Dirichlet Rayleigh quotient:

    R_S(f) = \sum_{x∼y} (f(x) − f(y))^2 / \sum_{x∈S} f(x)^2 d_x,     (5)

where the sum in the numerator is taken over all unordered pairs of vertices x, y ∈ S^* such that x ∼ y. The Dirichlet eigenvalue of the induced subgraph on S of a graph G can be defined as follows:

    λ_S = \inf_{f ∈ D_S^*} R_S(f) = \inf_{f ∈ D_S^*} ⟨f, (D_S − A_S) f⟩ / ⟨f, D_S f⟩ = \inf_{g ∈ D_S^*} ⟨g, \mathcal{L}_S g⟩ / ⟨g, g⟩,
where X_S denotes the submatrix of a matrix X with rows and columns restricted to those indexed by vertices in S, and the Laplacian \mathcal{L} and \mathcal{L}_S are defined by \mathcal{L} = D^{−1/2}(D − A)D^{−1/2} and \mathcal{L}_S = D_S^{−1/2}(D_S − A_S)D_S^{−1/2}, respectively. Here we call f the combinatorial Dirichlet eigenfunction if R_S(f) = λ_S. The Dirichlet eigenfunctions are the eigenfunctions of the matrix \mathcal{L}_S. (The detailed proof of these statements can be found in [5].) The Dirichlet eigenvalues of S are the eigenvalues of \mathcal{L}_S, denoted by λ_{S,1} ≤ λ_{S,2} ≤ · · · ≤ λ_{S,s}, where s = |S|. The smallest Dirichlet eigenvalue λ_{S,1} is also denoted by λ_S. If the induced subgraph on S is connected, then the eigenvector of \mathcal{L}_S associated with λ_S is all positive (using the Perron–Frobenius Theorem [18] on I − \mathcal{L}_S). The reader is referred to [5] for various properties of Dirichlet eigenvalues.
For a subset S of vertices in G, the edge separator (or the edge boundary) whose removal separates S consists of all edges leaving S. Namely,

    ∂S = {{u, v} ∈ E : u ∈ S and v ∉ S}.

How good is the edge separator? The answer depends both on the size of the edge separator and on the volume of S, denoted by vol(S) = \sum_{u∈S} d_u. The Cheeger ratio (sometimes called conductance) of S, denoted by h_S, is defined by

    h_S = |∂S| / min{vol(S), vol(S̄)},

where S̄ = V \ S denotes the complement of S. The Cheeger constant of a graph G is h_G = min_{S⊆V} h_S. The volume of a graph G, denoted by vol(G), is \sum_u d_u. For a given subset S, we define the local Cheeger ratio, denoted by h_S^*, as follows:

    h_S^* = \inf_{T ⊂ S} h_T.
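For concreteness, these quantities can be computed directly on a small graph. The numpy-based sketch below is our own illustration (assuming an undirected graph given by its adjacency matrix and a subset S given as vertex indices); it is not code from the paper.

```python
# Illustrative sketch: Cheeger ratio h_S and the smallest Dirichlet eigenvalue
# lambda_S of L_S = D_S^{-1/2}(D_S - A_S)D_S^{-1/2}, restricted to S.

import numpy as np

def cheeger_ratio(A, S):
    S = set(S)
    deg = A.sum(axis=1)
    vol_S = sum(deg[v] for v in S)
    vol_rest = deg.sum() - vol_S
    boundary = sum(A[u, v] for u in S for v in range(len(A)) if v not in S)
    return boundary / min(vol_S, vol_rest)

def dirichlet_lambda(A, S):
    S = sorted(S)
    deg = A.sum(axis=1)
    A_S = A[np.ix_(S, S)]
    D_S = np.diag([deg[v] for v in S])
    D_inv_sqrt = np.diag([1.0 / np.sqrt(deg[v]) for v in S])
    L_S = D_inv_sqrt @ (D_S - A_S) @ D_inv_sqrt
    return np.linalg.eigvalsh(L_S).min()

if __name__ == "__main__":
    # a 6-cycle; S = three consecutive vertices
    n = 6
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
    S = [0, 1, 2]
    print("h_S =", cheeger_ratio(A, S), " lambda_S =", dirichlet_lambda(A, S))
```

On this small example the output is numerically consistent with the local Cheeger inequality stated next.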
Note that in general h_S is not necessarily equal to h_S^*. The Dirichlet eigenvalue λ_S and the local Cheeger ratio h_S^* are related by the following local Cheeger inequality:

    h_S^* ≥ λ_S ≥ (h_S^*)^2 / 2,

whose proof can be found in [6]. In [8], the following weighted Rayleigh quotient was considered:

    R_φ(f) = \sup_c \sum_{x∼y} (f(x) − f(y))^2 φ(x)φ(y) / \sum_{x∈S} (f(x) − c)^2 φ^2(x) d_x,     (6)

where φ is the combinatorial Dirichlet eigenfunction which achieves λ_S. Then we can define

    λ_φ = \inf_{f ≠ 0} R_φ(f).
Here we state several useful facts concerning λS and λφ (see [8]).
Lemma 2. For an induced subgraph S of a graph G, the Dirichlet eigenvalues of S satisfy λ_{S,2} − λ_S = λ_φ ≥ λ_S.

Theorem 1. For an induced subgraph S of G, the combinatorial Dirichlet eigenfunction φ which achieves λ_S satisfies

    (\sum_{x∈S} φ(x) d_x)^2 ≥ (1/2) vol(S) \sum_{x∈S} φ^2(x) d_x.

Proof. We note that

    λ_{S,2} = \inf_f \sum_{x∼y} (f(x) − f(y))^2 / \sum_{x∈S} f(x)^2 d_x,

where f ranges over all functions satisfying \sum_{x∈S} f(x)φ(x) = 0. Now we use the fact that

    λ_S = \sum_{x∼y} (φ(x) − φ(y))^2 / \sum_{x∈S} φ(x)^2 d_x.

We then have

    λ_{S,2} ≤ \sum_{x∼y} (φ(x) − φ(y))^2 / \sum_{x∈S} (φ(x) − c)^2 d_x = λ_S \sum_{x∈S} φ^2(x) d_x / (\sum_{x∈S} φ^2(x) d_x − c^2 vol(S)),

where

    c = \sum_{x∈S} φ(x) d_x / vol(S).

This implies

    (\sum_{x∈S} φ(x) d_x)^2 / (vol(S) \sum_{x∈S} φ^2(x) d_x) ≥ (λ_{S,2} − λ_S) / λ_{S,2} = λ_φ / (λ_φ + λ_S) ≥ 1/2

by using Lemma 2.
4 A Lower Bound for the Restricted Heat Kernel Pagerank
For a given set S, we consider the distribution f_S which chooses the vertex u with probability f_S(u) = d_u / vol(S) if u ∈ S, and 0 otherwise. Note that f_S
can be written as (1/vol(S)) χ_S D, where χ_S is the indicator function for S. For any function g : V → R, we define g(S) = \sum_{v∈S} g(v). In this section, we wish to establish a lower bound for the expected value of the heat kernel pagerank ρ_{t,u} = ρ_{t,χ_u} over u in S. We note that

    E(ρ_{t,u}) = \sum_{u∈S} (d_u / vol(S)) ρ_{t,u} = f_S H_t(S).

We consider the restricted heat kernel H_t' for a fixed subset S, defined as follows: ρ'_{t,f} = f H_t', where H_t' is defined by

    H_t' = e^{−t} \sum_{k=0}^{∞} (t^k / k!) W_S^k = e^{−t(I_S − W_S)} = e^{−t L_S} = \sum_{k=0}^{∞} ((−t)^k / k!) L_S^k,

writing L_S = I_S − W_S.
Also, H_t' satisfies the following heat equation:

    ∂/∂t H_t' = −L_S H_t'.     (7)

From the above definition, we immediately have the following:

Lemma 3. H_t'(x, y) ≤ H_t(x, y) for all vertices x and y. In particular, for a non-negative function f : V → R, we have f H_t' ≤ f H_t.

Therefore, it suffices to establish the desired lower bound for the expected value of the restricted heat kernel pagerank. Following the notation in Section 3, we consider the Dirichlet eigenvalues λ_{S,i} of S and their associated combinatorial Dirichlet eigenfunctions φ_i with R(φ_i) = λ_{S,i}, for i = 1, . . . , |S|. Clearly, the φ_i D_S^{1/2} are orthogonal eigenfunctions of \mathcal{L}_S and form a basis for functions defined on S. Here we assume that \sum_{u∈S} φ_i(u)^2 d_u = 1. To simplify the notation, in this proof we write λ_i' = λ_{S,i} and λ_1' = λ_S.
Let f_S denote f_S = χ_S D / vol(S). We express f = \sqrt{vol(S)} f_S in terms of the φ_i as follows:

    f D^{−1/2} = \sum_i a_i φ_i D_S^{1/2},

where

    a_i = \sum_{u∈S} φ_i(u) d_u / \sqrt{vol(S)}.

Since \|f D^{−1/2}\|_2 = 1, we have \sum_i a_i^2 = 1. From the above definitions we have

    f_S H_t'(S) = \|f \tilde{H}'_{t/2}\|^2,

where \tilde{H}_t' = D_S^{1/2} H_t' D_S^{−1/2}. From Theorem 1, we know the fact that

    a_1^2 = (\sum_{u∈S} φ_1(u) d_u / \sqrt{vol(S)})^2 ≥ 1/2,

and hence \sum_{j≠1} a_j^2 = 1 − a_1^2 ≤ 1/2. Since

    f_S H_t'(S) = \sum_i a_i^2 e^{−λ_i' t},

we have

    f_S H_t'(S) ≥ a_1^2 e^{−λ_S t} ≥ (1/2) e^{−λ_S t}.

We have proved the following:

Theorem 2. For a subset S, the Dirichlet heat kernel H_t' satisfies

    f_S H_t'(S) ≥ (1/2) e^{−λ_S t}.

As an immediate consequence of Theorem 1, Lemmas 2 and 3 and Theorem 2, we have the following:

Corollary 1. In a graph G, for a subset S of vertices, the heat kernel pagerank ρ_{t,f_S} satisfies

    E(ρ_{t,u}(S)) = ρ_{t,f_S}(S) ≥ (1/2) e^{−λ_S t} ≥ (1/2) e^{−h_S^* t},

where u is chosen according to f_S.
5 A Local Lower Bound for Heat Kernel Pagerank
Corollary 1 states that the expected value of ρ_{t,u}(S) is at least (1/2) e^{−h_S^* t}. Thus, there exists a vertex u in S such that ρ_{t,u}(S) is at least (1/2) e^{−h_S^* t}. However, in order to have an efficient local partitioning algorithm, we need to show that there are many vertices v satisfying ρ_{t,v}(S) ≥ c ρ_{t,f_S}(S) for some absolute constant c. To do so, we will prove the following:

Theorem 3. In a graph G with a given subset S of vertices, the subset T = {u ∈ S : ρ_{t,u}(S) ≥ (1/4) e^{−tλ_S}} satisfies

    vol(T) ≥ (1/4) vol(S),

if t ≥ 1/λ_S.

From Lemma 3, Theorem 3 follows directly from Theorem 4:

Theorem 4. In a graph G with a given subset S of vertices, the subset T = {u ∈ S : ρ'_{t,u}(S) ≥ (1/4) e^{−tλ_S}} satisfies

    vol(T) ≥ (1/4) vol(S),

if t ≥ 1/λ_S.

To prove Theorem 4, we first prove the following lemma along the lines of the second-moment method:

Lemma 4. If t ≥ 1/λ_S, then

    \sum_{u∈S} (d_u / vol(S)) (ρ'_{t,u}(S) − ρ'_{t,f_S}(S))^2 ≤ (5/4) ρ'_{t,f_S}(S)^2.
Proof. We note that

    \sum_{u∈S} (d_u / vol(S)) (ρ'_{t,u}(S) − ρ'_{t,f_S}(S))^2 = \sum_{u∈S} (d_u / vol(S)) ρ'_{t,u}(S)^2 − (ρ'_{t,f_S}(S))^2 = f_S H'_{2t}(S) − (f_S H_t'(S))^2.

It suffices to show that

    f_S H'_{2t}(S) ≤ (9/4) (f_S H_t'(S))^2.

We consider the Dirichlet eigenvalues λ_{S,i} of S and their associated combinatorial Dirichlet eigenfunctions φ_i with R(φ_i) = λ_{S,i} = λ_i', for i = 1, . . . , |S|. Here, the φ_i D_S^{1/2} are orthonormal eigenfunctions of \mathcal{L}_S with \sum_{u∈S} φ_i(u)^2 d_u = 1.
We can write f = χ_S D_S vol(S)^{−1/2} = f_S \sqrt{vol(S)} by:

    f D_S^{−1/2} = \sum_i a_i φ_i D_S^{1/2}.

We have \sum_i a_i^2 = 1, since \|f D_S^{−1/2}\|_2 = 1. From Theorem 1, we know the fact that a_1^2 ≥ 1/2, and hence \sum_{j≠1} a_j^2 ≤ 1/2. Also we have

    f_S H'_{2t}(S) = \|f \tilde{H}_t'\|^2 = \sum_i a_i^2 e^{−2λ_i' t},

where \tilde{H}_t' = D_S^{1/2} H_t' D_S^{−1/2}. Therefore we have, for t ≥ 1/λ_1',

    f_S H'_{2t}(S) = \sum_i a_i^2 e^{−2λ_i' t}
                   ≤ a_1^2 e^{−2λ_1' t} + (1 − a_1^2) e^{−2λ_2' t}
                   ≤ a_1^2 e^{−2λ_1' t} + (1 − a_1^2) e^{−4λ_1' t}
                   ≤ (9/8) a_1^2 e^{−2λ_1' t},

since λ_2' ≥ 2λ_1' by using Lemma 2. Therefore we have

    f_S H'_{2t}(S) ≤ (9/8) a_1^2 e^{−2λ_S t} ≤ (9/4) a_1^4 e^{−2λ_S t} ≤ (9/4) (\sum_i a_i^2 e^{−λ_i' t})^2,
as desired. Now, we are ready to proceed to prove Theorem 4 which then implies Theorem 3. Proof of Theorem 4: Suppose vol(T ) ≤ vol(S)/4. We wish to show that this leads to a contradiction. Let T ′ = S \ T . We consider
    \sum_{u∈S} (d_u / vol(S)) (ρ'_{t,u}(S) − ρ'_{t,f_S}(S))^2
        = \sum_{u∈T} (d_u / vol(S)) (ρ'_{t,u}(S) − ρ'_{t,f_S}(S))^2 + \sum_{u∈T'} (d_u / vol(S)) (ρ'_{t,u}(S) − ρ'_{t,f_S}(S))^2
        ≥ (\sum_{u∈T} (d_u / vol(S)) (ρ'_{t,u}(S) − ρ'_{t,f_S}(S)))^2 / (vol(T)/vol(S)) + (\sum_{u∈T'} (d_u / vol(S)) (ρ'_{t,u}(S) − ρ'_{t,f_S}(S)))^2 / (vol(T')/vol(S))
        ≥ (1/(vol(T) vol(S)) + 1/(vol(T') vol(S))) (\sum_{u∈T'} d_u (ρ'_{t,u}(S) − ρ'_{t,f_S}(S)))^2,

since \sum_{u∈S} (d_u / vol(S)) (ρ'_{t,u} − ρ'_{t,f_S}) = 0. Therefore we have

    \sum_{u∈S} (d_u / vol(S)) (ρ'_{t,u}(S) − ρ'_{t,f_S}(S))^2 ≥ (1/(vol(T) vol(S)) + 1/(vol(T') vol(S))) ((3/4) vol(T') ρ'_{t,f_S}(S))^2
        ≥ (4 (3/4)^4 + (3/4)^3) ρ'_{t,f_S}(S)^2
        ≥ (27/16) ρ'_{t,f_S}(S)^2
        > (5/4) ρ'_{t,f_S}(S)^2,

which is a contradiction to Lemma 4. Theorem 4 is proved.
An Upper Bound for Heat Kernel Pagerank
We define a s-local Cheeger ratio of a sweep f , denoted by hf,s to be the minimum Cheeger ratio of the segment Si with 0 ≤ vol(Si ) ≤ 2s. If no such segment exists, then we set hf,s to be 0. We will establish the following upper bound for the heat kernel pagerank in terms of s-local Cheeger ratios. The proof is similar but simpler than that in [7]. Theorem 5. In a graph G with a subset S with volume s ≤ vol(G)/4, for any vertex u in G, we have
s −tκ2t,u,s /4 ρt,u (S) − π(S) ≤ e du where κt,u,s denote the minimum s-local Cheeger ratio over a sweep of ρt,u . Proof. For a function f : V → R, we define f (u, v) = f (u)/du if v is adjacent to u and 0 otherwise. For an integer x, 0 ≤ x ≤ vol(G)/2, we define f (x) = max f (u, v). T ⊆V ×V,|T |=x
(u,v)∈T
We can extend f to all real x = k + r, with 0 ≤ r < 1 by defining f (x) = (1 − r)f (k) + rf (k + 1). If x = vol(Si ) where Si consists of vertices with the i highest values of f (u)/du , then it follows from the definition that f (x) = u∈Si f (u). Also f (x) is concave in x.
We consider the lazy walk \mathcal{W} = (I + W)/2. Then

    f\mathcal{W}(S) = (1/2) (f(S) + \sum_{u∼v, v∈S} f(u, v))
                    = (1/2) (\sum_{u∼v, u or v ∈ S} f(u, v) + \sum_{u∼v, u and v ∈ S} f(u, v))
                    ≤ (1/2) (f(vol(S) + |∂S|) + f(vol(S) − |∂S|))
                    = (1/2) (f(vol(S)(1 + h_S)) + f(vol(S)(1 − h_S))).

This can be straightforwardly extended to real x with 0 ≤ x ≤ vol(G)/2. In particular, we focus on x satisfying 0 ≤ x ≤ 2s ≤ vol(G)/2 and we choose f_t = ρ_{t,u} − π. Then

    f_t\mathcal{W}(x) ≤ (1/2) (f_t(x(1 + κ_{t,u,s})) + f_t(x(1 − κ_{t,u,s}))).

We now consider, for x ∈ [0, 2s],

    ∂/∂t f_t(x) = −ρ_{t,u}(I − W)(x)
                = −2ρ_{t,u}(I − \mathcal{W})(x)
                = −2f_t(x) + 2f_t\mathcal{W}(x)
                ≤ −2f_t(x) + f_t(x(1 + κ_{t,u,s})) + f_t(x(1 − κ_{t,u,s}))     (8)
                ≤ 0,

by the concavity of f_t. Suppose g_t(x) is a solution of the equation in (8) satisfying f_0(x) ≤ g_0(x), f_t(0) = g_t(0) and ∂/∂t f_t(x)|_{t=0} ≤ ∂/∂t g_t(x)|_{t=0}; then we have f_t(x) ≤ g_t(x). It is easy to check that g_t(x) ≤ \sqrt{x/d_u} e^{−t κ_{t,u,s}^2 / 4}, using −2 + \sqrt{1 + x} + \sqrt{1 − x} ≤ −x^2/4. Thus,

    ρ_{t,u}(S) − π(S) ≤ ρ_{t,u}(s) − π(s) ≤ \sqrt{s/d_u} e^{−t κ_{t,u,s}^2 / 4},

as desired.
7 A Local Cheeger Inequality and a Local Partitioning Algorithm
Let h_s denote the minimum Cheeger ratio h_S with 0 ≤ vol(S) ≤ 2s. Also let κ_{t,2s} denote the minimum of κ_{t,u,2s} over all u. Combining Theorem 3 and Theorem 5, we have that the set of u satisfying

    (1/2) e^{−t h_S^*} ≤ (1/2) e^{−t λ_S} ≤ ρ_{t,f_S}(s) − π(s) ≤ \sqrt{s} e^{−t κ_{t,u,2s}^2 / 4}

has volume at least vol(S)/4, provided t ≥ 1/h_S^2 ≥ 1/λ_S.
As an immediate consequence, we have the following local Cheeger inequality:

Theorem 6. For a subset S of a graph G with vol(S) = s ≤ vol(G)/4 and t ≥ log s/(h_S^*)^2, with probability at least 1/4 a vertex u in S satisfies

    h_S^* ≥ λ_S ≥ κ_{t,u}^2 / 4 − 2 log s / t,

where the Cheeger ratio κ_{t,u} is determined by the heat kernel pagerank with seed u.

Corollary 2. For s ≤ vol(G)/4 and t ≥ 4 log s/(h_S^*)^2, and a set S of volume s, the Cheeger ratio κ_{t,u}, determined by the heat kernel pagerank with a random seed u in S, satisfies

    h_S^* ≥ λ_S ≥ κ_{t,u}^2 / 8

with probability at least 1/4.

The above local Cheeger inequalities are closely associated with local partitioning algorithms. A local partitioning algorithm has inputs including a vertex as the seed, the volume s of the target set, and a target value φ for the Cheeger ratio of the target set. The local Cheeger inequality in Theorem 5 suggests the following local partitioning algorithm: in order to find the set achieving the minimum s-local Cheeger ratio, one can simply consider a sweep of heat kernel pagerank, restricted to the cuts whose smaller part has volume between 0 and 2s. Theorem 6 implies the following.

Theorem 7. In a graph G, for a set S with volume s ≤ vol(G)/4 and Cheeger ratio h_S ≤ φ^2, there is a subset S' ⊆ S with vol(S') ≥ vol(S)/4 such that for any u ∈ S', the sweep using the heat kernel pagerank ρ_{t,u}, with t = 2φ^{−2} log s, will find a set T with s-local Cheeger ratio at most 2φ.

We note that the performance bound for the Cheeger ratio improves the earlier result in [1] by a factor of log s. In fact, the inequality in Theorem 6 suggests a whole range of trade-offs. If we choose t to be t = 2φ^{−2} instead, then the guaranteed Cheeger ratio in the above statement becomes 2φ log s, and we obtain the same approximation results as in [1]. We remark that the computational complexity of the above partitioning algorithm is the same as that of computing the heat kernel pagerank. However, the algorithmic design for heat kernel pagerank has not been as extensively studied as that of PageRank. More research is needed in this direction.
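To make the sweep procedure of this section concrete, the following sketch (ours, not the paper's implementation) orders vertices by ρ_{t,u}(v)/d_v and scans prefixes of volume at most 2s, returning the smallest Cheeger ratio found; the vector ρ can be produced, for example, by the truncated exponential-series sketch shown earlier.

```python
# Illustrative sketch: sweep over a heat kernel pagerank vector rho (dict),
# restricted to prefixes of volume at most 2s, for an undirected adjacency list.

def sweep_cut(adj, rho, s):
    deg = {v: len(adj[v]) for v in adj}
    total_vol = sum(deg.values())
    order = sorted((v for v in adj if rho.get(v, 0.0) > 0),
                   key=lambda v: rho[v] / max(deg[v], 1), reverse=True)
    prefix, vol, cut = set(), 0, 0
    best = (float("inf"), set())
    for v in order:
        # moving v inside the prefix flips the cut status of its incident edges
        cut += sum(-1 if w in prefix else 1 for w in adj[v])
        prefix.add(v)
        vol += deg[v]
        if 0 < vol <= 2 * s and vol < total_vol:
            ratio = cut / min(vol, total_vol - vol)
            if ratio < best[0]:
                best = (ratio, set(prefix))
    return best  # (s-local Cheeger ratio of the sweep, the corresponding cut)
```

Per Theorem 7, a natural choice of the temperature is t = 2φ^{−2} log s when aiming for a target Cheeger ratio φ.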
References
1. Andersen, R., Chung, F., Lang, K.: Local graph partitioning using pagerank vectors. In: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), pp. 475–486 (2006)
2. Andersen, R., Chung, F.: Detecting sharp drops in pageRank and a simplified local partitioning algorithm. In: Cai, J.-Y., Cooper, S.B., Zhu, H. (eds.) TAMC 2007. LNCS, vol. 4484, pp. 1–12. Springer, Heidelberg (2007)
3. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30(1-7), 107–117 (1998)
4. Cheeger, J.: A lower bound for the smallest eigenvalue of the Laplacian. In: Gunning, R.C. (ed.) Problems in Analysis, pp. 195–199. Princeton Univ. Press, Princeton (1970)
5. Chung, F.: Spectral Graph Theory. AMS Publications (1997)
6. Chung, F.: Random walks and local cuts in graphs. LAA 423, 22–32 (2007)
7. Chung, F.: The heat kernel as the pagerank of a graph. PNAS 105(50), 19735–19740 (2007)
8. Chung, F., Oden, K.: Weighted graph Laplacians and isoperimetric inequalities. Pacific Journal of Mathematics 192, 257–273 (2000)
9. Chung, F., Yau, S.-T.: Coverings, heat kernels and spanning trees. Electronic Journal of Combinatorics 6, R12 (1999)
10. Chung, F., Lu, L.: Complex Graphs and Networks. CBMS Regional Conference Series in Mathematics, vol. 107, pp. viii+264. AMS Publications, RI (2006)
11. Coppersmith, D., Winograd, S.: Matrix multiplication via arithmetic progressions. J. Symbolic Comput. 9, 251–280 (1990)
12. Garey, M.R., Johnson, D.S.: Computers and Intractability. A Guide to the Theory of NP-completeness, pp. x+338. W.H. Freeman, San Francisco (1979)
13. Jerrum, M., Sinclair, A.J.: Approximating the permanent. SIAM J. Computing 18, 1149–1178 (1989)
14. Kannan, R., Vempala, S., Vetta, A.: On clusterings: Good, bad and spectral. JACM 51, 497–515 (2004)
15. Lovász, L., Simonovits, M.: The mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. In: 31st IEEE Annual Symposium on Foundations of Computer Science, pp. 346–354 (1990)
16. Lovász, L., Simonovits, M.: Random walks in a convex body and an improved volume algorithm. Random Structures and Algorithms 4, 359–412 (1993)
17. Mihail, M.: Conductance and convergence of Markov chains: A combinatorial treatment of expanders. In: FOCS, pp. 526–531 (1989)
18. Perron, O.: Theorie der algebraischen Gleichungen, II (zweite Auflage). de Gruyter, Berlin (1933)
19. Spielman, D., Teng, S.-H.: Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pp. 81–90 (2004)
20. Schoen, R.M., Yau, S.T.: Differential Geometry. International Press, Cambridge (1994)
Choose the Damping, Choose the Ranking?⋆
Marco Bressan and Enoch Peserico
Dipartimento di Ingegneria dell'Informazione, Università di Padova, Italy
{bressanm,enoch}@dei.unipd.it
Abstract. To what extent can changes in PageRank’s damping factor affect node ranking? We prove that, at least on some graphs, the top k nodes assume all possible k! orderings as the damping factor varies, even if it varies within an arbitrarily small interval (e.g. [0.84999, 0.85001]). Thus, the rank of a node for a given (finite set of discrete) damping factor(s) provides very little information about the rank of that node as the damping factor varies over a continuous interval. We bypass this problem introducing lineage analysis and proving that there is a simple condition, with a “natural” interpretation independent of PageRank, that allows one to verify “in one shot” if a node outperforms another simultaneously for all damping factors and all damping variables (informally, time variant damping factors). The novel notions of strong rank and weak rank of a node provide a measure of the fuzziness of the rank of that node, of the objective orderability of a graph’s nodes, and of the quality of results returned by different ranking algorithms based on the random surfer model. We deploy our analytical tools on a 41M node snapshot of the .it Web domain and on a 0.7M node snapshot of the CiteSeer citation graph. Among other findings, we show that rank is indeed relatively stable in both graphs; that “classic” PageRank (d = 0.85) marginally outperforms Weighted In-degree (d → 0), mainly due to its ability to ferret out “niche” items; and that, for both the Web and CiteSeer, the ideal damping factor appears to be 0.8 − 0.9 to obtain those items of high importance to at least one (model of randomly surfing) user, but only 0.5 − 0.6 to obtain those items important to every (model of randomly surfing) user.
1 Introduction
This paper addresses the fundamental question of how the ranking induced by PageRank can be affected by variations of the damping factor. This introduction briefly reviews the PageRank algorithm (Subsection 1.1) and the crucial difference between score and rank (Subsection 1.2) before presenting an overview of our results and the organization of the rest of the paper (Subsection 1.3).
⋆ This work was supported in part by MIUR under PRIN Mainstream and by EU under Integr. Proj. AEOLUS (IP-FP6-015964).
1.1 PageRank and the Damping Factor
PageRank [19] is probably the best known of link analysis algorithms, ranking nodes of a generic graph in order of “importance”. PageRank was first used in search engines to rank the nodes of the Web graph; since then, it has been adapted to many other application domains, such as credit and reputation systems [14], crawling [8], automatic construction of Web directories [23], word stemming [3], automatic synonym extraction in a dictionary [5], word sense disambiguation [24], item selection [26] and text summarization [9].
In its original form, PageRank is based on a model of a Web surfer that probabilistically browses the Web graph, starting at a node chosen at random according to the distribution given by a personalization vector e > 0 whose ith component e_i is the probability of starting at the ith node. At each step, if the current node has outgoing links to m > 0 other nodes v_1, . . . , v_m, the surfer next browses with probability d one of those nodes (chosen uniformly at random), and with probability 1 − d a node chosen at random according to e. If the current node is a sink with no outgoing links, the surfer automatically chooses the next node at random according to e. If the damping factor d is less than 1 (for the Web graph a popular value is ≈ 0.85 [19], [16]), the probability P_i(t) of visiting the node v_i at time t converges for t → ∞ to a stationary probability P_i. This is the score PageRank assigns to v_i, ranking all nodes in order of non-increasing score.
Intuitively, the more ancestors (both immediate and far removed) a node has, and the fewer descendants those ancestors have, the higher the score of that node. Note that both the score of each node and its rank may change as the damping factor changes [22]. Lower damping factors decrease the likelihood of following long paths of links without taking a random jump — and thus increase the contribution to a node's score and rank of its more immediate ancestors.

1.2 Score vs. Rank
PageRank assigns to the n nodes of a graph G a score vector that is the dominant (right) eigenvector of the stochastic n × n matrix d M + (1 − d)[e · 1^T], where M is obtained from the transposed adjacency matrix of G by normalizing each non-zero column, and replacing each zero column (corresponding to a sink) with the personalization vector e. Typically, the score vector is calculated using the Power Method [16]. Thus, one can analyze the impact of the damping factor on the score vector using the highly developed toolset of linear algebra [16]. This is no longer true when dealing with rank: loss of linearity and continuity in the mathematical model make analysis considerably more difficult and require different tools.
Unfortunately, in many cases, (variation in) rank is far more important than (variation in) score. Although the PageRank score is often combined with other scores (e.g. obtained from textual analysis in the context of Web pages) to provide a ranking, and thus large variations in PageRank score would seem more important than large variations in rank, this is often not the case. Some search engines may simply filter items ranked by PageRank through a Boolean model of
relevance to the query at hand - in these cases the ordering induced by PageRank is strictly maintained [3]. And many other applications simply use the unmodified PageRank score to rank items in order of precedence, whether for efficiency or for lack of other valid alternatives: e.g. Web crawlers [8] and reputation systems [14]. For a more thorough discussion of why, in many cases, rank is more important than score and thus why it is crucial to analyze the perturbation in rank induced by variations of the damping factor, see [11,13,17,18,21].

1.3 Our Results
This paper addresses the fundamental question of how variations in PageRank’s damping factor can affect the ranking of nodes; it is organized as follows. Section 2 shows that for any k, at least on some graphs, arbitrarily small variations in the damping factor can completely reverse the ranking of the top k nodes, or indeed make them assume all possible k! orderings. It is natural to ask whether this can happen in “real” graphs; for example, the Web graph. However, verifying rank stability for discrete variations in the damping factor (e.g. 0.01 increments) is not sufficient to conclude that rank is stable as the damping factor varies over a whole continuous interval - just like sampling the function f (x) = sin(100πx) for x = 0.01, 0.02, . . . is not enough to conclude that f (x) = 0 ∀x ∈ R. Section 3 provides the analytical tools to address this issue. We show a simple, “natural” condition that is both necessary and sufficient to guarantee that a node outranks another simultaneously for all damping factors and all damping variables (informally, time-variant damping factors). This condition, based on the concept of lineage of a node, can be checked efficiently and has an intuitive justification totally independent of PageRank. Section 4 leverages lineage analysis to introduce the novel concepts of strong rank and weak rank. These provide for each node a measure of the “fuzziness” of its ranking that is subtly but profoundly different from pure rank variation; they also provide a measure of the effectiveness of link analysis on a generic graph; finally, they allow an objective evaluation of the performance of “classic” (d = 0.85) PageRank and some of its variations (e.g. d → 0). Section 5 brings the analytical machinery of Section 3 to bear on two real graphs - a snapshot of the .it domain, and the CiteSeer citation graph [1]. Section 6 summarizes our results, analyzes their significance, and reviews a few problems this paper leaves open, before concluding with the bibliography. Note that space limitations force us to omit all theorem proofs. The interested reader can find them in our online Technical Report [20].
2 The Damping Makes the Ranking
This section presents two Theorems showing that, at least on some graphs, a minuscule variation of the damping factor can dramatically change the ranking of the top k nodes. More formally, we prove (see our Tech Report [20]):
Theorem 1. For every even k > 1 and every d satisfying 1/k < d < 1 − 1/k, there is a graph G of 2k^2 + 4k − 2 vertices such that PageRank's top k nodes are, in order, v_1, . . . , v_k if the damping factor is d, and v_k, . . . , v_1 if it is d + 1/k.

Theorem 2. Consider an arbitrary set Π of orderings of k nodes v_1, . . . , v_k (|Π| ≤ k!), and an arbitrary open interval I ⊂ [0, 1], however small. Then there is a graph G such that PageRank's top k nodes are always v_1, . . . , v_k but appear in every order in Π as the damping factor varies within I.

By Theorem 1, there are graphs of ≈ 2M nodes (a size comparable to that of a citation archive or of a small first level domain) where a variation of the damping factor as small as 0.001 (e.g. from 0.850 to 0.851) can cause a complete reversal of the ranking of the top 1000 items; and the sensitivity to the damping factor can grow even higher for larger graphs. Theorem 2 is even more general: for any arbitrarily small interval of variation of the damping factor, there are graphs in which the top items assume all possible permutations (but providing bounds on the size of these graphs is beyond the scope of this paper).
3 Lineage Analysis
It is natural to ask to what extent PageRank rankings of “real” graphs like the Web graph are affected by variations in the damping factor, particularly in the light of the results of Section 2. Unfortunately, verifying that a node always outranks another as the damping factor varies by discrete increments (e.g. 0.01) suggests, but can not conclusively prove that the first node always outranks the second over a whole continuous interval of variation of the damping factor: by virtue of Theorems 1 and 2 the ranking could drastically change between those isolated sampling points. For the same reason, while previous experimental evidence obtained over a set of isolated values of the damping factor (e.g. [18]) suggests that PageRank is indeed stable even for large variations, this evidence can not be taken as conclusive.
This section shows how to bypass the problem. The core result is a simple condition that can be checked efficiently and yet is both necessary and sufficient to guarantee that a node outranks another for all damping factors as well as for all damping variables - informally, time-variant damping factors introduced in Subsection 3.1 and related to the damping functions of [4]. Subsection 3.2 presents this “dominance” condition, which is based on the concept of lineage of a node, and turns out to be a very “natural” concept with a strong intuitive justification totally independent of PageRank.

3.1 Damping Variables
It is natural to extend the stochastic model of PageRank to one where the probability of following one of the current node’s outgoing links is not a constant damping factor d, but is instead a damping variable d(τ ) that is a function of the number of steps τ taken since the last random jump. E.g. a Web surfer might have a high probability of following a chain of links up to a depth of e.g. 3, and
then of “hitting the reset button” with a random jump. In this case d(τ) would be close to 1 for τ ≤ 3, but close to 0 for τ > 3 (we formally define d(0) = 1). The score assigned by PageRank to the node v_i using a damping variable d(τ), P_{v_i}(d(·)) — i.e. the stationary probability of being on v_i according to the model of Subsection 1.1 — can be seen as the sum, over all ℓ ≥ 0, of the probability of taking, ℓ timesteps in the past, the last random jump to a node v_{i_ℓ} at distance ℓ from v, and following any path of length ℓ from v_{i_ℓ} to v.
More formally, denote by p →^ℓ v the fact that p is a path of ℓ + 1 vertices v_{i_ℓ}, . . . , v_{i_1}, v, from some vertex v_{i_ℓ} to v. Let branching(p) be the inverse of the product of the out-degrees ω_{i_ℓ}, . . . , ω_{i_1} of v_{i_ℓ}, . . . , v_{i_1}, i.e. branching(p) = 1/(ω_{i_ℓ} · · · ω_{i_1}) (informally, the probability of following the whole path if one starts on the first node v_{i_ℓ} and no random jumps are taken). We can then prove:

Theorem 3. If d(τ) ≤ (1 − ǫ) for some ǫ > 0 and for infinitely many τ, the score assigned by PageRank to the node v_i is:

    P_i(d(·)) = J_{d(·)} \sum_{ℓ=0}^{+∞} (\prod_{j=0}^{ℓ} d(j)) \sum_{p →^ℓ v} branching(p) · e_{i_ℓ}     (1)
where e_{i_ℓ} is the entry of e associated with the first node v_{i_ℓ} of the path p, and J_{d(·)} is the limit for t → +∞ of the probability J_{d(·)}(t) that a random jump occurs at time t. Note that for d(·) ≡ d we have J_{d(·)} ≡ 1 − d, which yields the classic PageRank formulation. Again, the proof is in our Tech Report [20].
One can verify that the term \sum_{p →^ℓ v} branching(p) · e_{i_ℓ}, the branching contribution at level ℓ, equals the sum of the elements on the row of M^ℓ associated to v, each weighted by the corresponding component of e. This term is completely independent of the damping variable; whereas the damping function [4] J_{d(·)} \prod_{j=0}^{ℓ} d(j) denotes the probability, in the stationary state, of having taken the last random jump ℓ steps in the past, and does not depend on the structure of the graph, but only on the damping variable. The score vector is then:

    P(d(·)) = J_{d(·)} \sum_{ℓ=0}^{+∞} c_ℓ M^ℓ · e     (2)

where c_ℓ = \prod_{j=0}^{ℓ} d(j) — again, for d(·) ≡ d, we have the classic PageRank formulation as a power series that can be found e.g. in [4]. While the meaning of Equation 1 is quite intuitive, proving that it holds requires some subtlety; indeed, if the condition d(τ) ≤ (1 − ǫ) is not satisfied (e.g. if one only requires d(τ) < 1) then the probability of being at node v at time t might not become stationary as t → ∞. In this case, one might still formally define the score of a node through Equation 1, but relating it to a stochastic surfing model becomes considerably harder; and, perhaps more importantly, computing the score of a node becomes considerably more difficult, since the classic iterative algorithms based on the Power Method can fail to converge.
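Equation 2 suggests a direct way to evaluate scores for an arbitrary damping variable by truncating the series. The sketch below is our own illustration; it treats J_{d(·)} as the constant that makes the truncated coefficients sum to one (which reduces to 1 − d when d(·) ≡ d), an assumption of the sketch rather than a statement taken from the paper.

```python
# Illustrative sketch of Equation 2: P(d(.)) ~ J * sum_l c_l M^l e,
# with c_l = prod_{j<=l} d(j), truncated at L terms.

import numpy as np

def damping_variable_scores(M, e, d, L=128):
    """M: column-stochastic matrix as in Subsection 1.2; e: personalization
    vector; d: function tau -> damping value in [0,1), with d(0) treated as 1."""
    coeffs, c = [], 1.0
    for ell in range(L + 1):
        c *= 1.0 if ell == 0 else d(ell)     # c_l = d(1) * ... * d(l), d(0) = 1
        coeffs.append(c)
    coeffs = np.array(coeffs)
    coeffs /= coeffs.sum()                   # plays the role of J_{d(.)} (see note above)
    score, v = np.zeros(len(e)), np.array(e, dtype=float)
    for c_l in coeffs:
        score += c_l * v                     # add c_l * M^l e
        v = M @ v
    return score
```

For instance, damping_variable_scores(M, e, lambda tau: 0.85) approximates classic PageRank with d = 0.85, while lambda tau: 0.9 if tau <= 3 else 0.1 models the depth-3 surfer sketched above.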
3.2 Lineages
This subsection provides a simple condition both necessary and sufficient to guarantee that, for all damping variables d(·), the score P_v(d(·)) of a node v is always at least as high as the score P_w(d(·)) of a node w.
Recall the term \sum_{p →^ℓ v} branching(p) · e_{i_ℓ} in Equation 1 - the level-ℓ branching contribution to the score of v, weighted by the personalization vector e. Informally, the mth generation of the lineage of v is equal to the sum of the branching contributions of all levels ℓ ≤ m, each weighted by the respective component of e. More formally:

Definition 1. The mth generation of the lineage of v is

    L_v(m) = \sum_{ℓ=0}^{m} \sum_{p →^ℓ v} branching(p) · e_{i_ℓ}.
We say that the lineage L_v(·) of a node v is greater than or equal to the lineage L_w(·) of a node w if it is greater than or equal to it at every generation (in which case we write simply L_v ≥ L_w). There is a simple, intuitive interpretation of dominance between lineages. If we imagine that the authority/reputation/trust of a node is divided evenly among the nodes it points to, having a greater lineage at the mth generation means receiving more authority from all nodes within at most m hops. Nodes that are further away simply convey no authority at all, modeling the fact that, after all, authority/reputation/trust may be inherited, but only to a point. If one is uncertain to which point, one can be sure that a node has authority at least as high as that of another node only if its lineage is at least as high at every generation. And, indeed, we show that having a lineage that is at least as high at every generation is strictly equivalent to receiving an equal or higher score by PageRank for every damping variable. More formally, our Tech Report [20] proves:
⇔
∀ d(·) ∈ (0, 1), Pv d(·) ≥ Pw d(·)
Note that in practice we do not need to check lineage dominance at every generation; if we restrict ourselves to damping variables that are 0 for τ > t, we only need to check at most t generations. This is equivalent to considering PageRank computations truncated after at most t iterations (effectively truncating the series in Equations 1 and 2 to the τ th term), which is always true in practice and justifies further the introduction of damping variables.
4
Damping-Independent Ranking
Lineages and Theorem 4 provide powerful tools to compare two nodes over the spectrum of all damping variables. This section leverages them to introduce the concepts of strong rank and weak rank (Subsection 4.1), and the related
82
M. Bressan and E. Peserico
ranking algorithms StrongRank and WeakRank (Subsection 4.2). These provide interesting measures of the “fuzziness” of rankings, of the “orderability” of a graph, and the performance of different ranking algorithms based on the random surfer model. 4.1
Strong and Weak Rank
Given a node v, and assuming for simplicity all ties in lineage score are broken (e.g. arbitrarily), all other nodes fall into three sets: the set S(v) of nodes stronger than v (with a greater lineage), the set W (v) of nodes weaker than v (with a lesser lineage), and the set I(v) of nodes incomparable with v (with a lineage greater at some generations and lesser at others). The cardinalities of these sets define the weak and strong rank of v: Definition 2. The weak rank ρw (v) and the strong rank ρs (v) of a node v are, respectively, |S(v)| + 1 and |S(v) ∪ I(v)| + 1. Note that ρw (v) − 1 is the number of those nodes that outperform v for all damping variables, whereas ρs (v) − 1 is the number of those nodes which outperform v for at least one damping variable. Thus ρw (v) is a lower bound to the minimum (i.e. best) rank achievable by v, while ρs (v) is an upper bound to the maximum (i.e. worse) rank achievable by v; and ρs (v) − ρw (v) is then an upper bound to the maximum variation in v’s rank. Note that any or all of these three bounds might hold strictly, and there is a subtle but profound difference between ρs (v) − ρw (v) and the maximum variation of v’s rank that makes the former more descriptive of the “fuzziness” of v’s performance. E.g. suppose that v holds 10th rank for all damping variables, but 999 other nodes fill the top 9 positions in turn for different damping variables. In this case ρs (v) = 1000, since v does not fare definitively better than any of those 999 nodes; and ρw (v) = 1, since none of those 999 nodes fares definitively better than v. Strong and weak rank also provide a measure of the global rank fuzziness in a generic graph - that is, of the extent to which the graph can be ordered satisfying simultaneously every type of user (with different users described by different damping variables). For a graph G, consider the number sk (G) of nodes with strong rank k or less. If sk (G) ≈ k, then every user’s top k set has a high density of nodes also interesting to every other user. Conversely, if sk (G) ≪ k, then few nodes will be universally interesting. Similarly, consider the number wk (G) of nodes with weak rank k or less. If wk (G) ≈ k, then relatively few nodes are sufficient to cover the interests of every user. Conversely, if wk (G) ≫ k, then the interests of different users are sufficiently diverse that, in order to ensure that no user misses the items most interesting to him, any algorithm must return a very large collection of items. The ratio wk (G)/sk (G) can then be seen as a measure of the inevitable price of obliviousness to the user’s model: the smaller it is, the more well-orderable G is. In the ideal case, sk (G) = k = wk (G), all users have exactly the same preferences and every damping variable yields the same ordering.
Choose the Damping, Choose the Ranking?
4.2
83
StrongRank and WeakRank
Strong rank and weak rank automatically induce two new ranking algorithms, StrongRank and WeakRank, that rank nodes respectively in order of strong and weak rank. Neither necessarily corresponds to PageRank for some damping variable. StrongRank tends to return items that are each of at least moderate interest to every user (with different users described by different damping variables). WeakRank tends to return, for every user, at least a few items that are of high interest to that user. As a consequence, they can provide benchmarks to evaluate other ranking algorithms. If for every k many of the top k items returned by a ranking algorithm are also among the top k items returned by StrongRank (the “intersection metric” of [10]), one can reasonably deduce that a large fraction of items returned by that algorithm is of at least moderate interest to every user. Similarly, a large intersection with WeakRank points to an algorithm that returns, for every user, a large fraction of that user’s top choices. A complete analysis of the complexity of StrongRank and WeakRank would require taking into account the properties of the target graph (and thus of the application domain) as well as caching and parallelizability issues. This is beyond the scope of this paper; however, we provide a few basic results (our Tech Report [20] contains a more detailed analysis and the proofs of the following Theorems 5 and 6). The worst case complexity of PageRank (for one value of the damping factor) on a graph of n nodes equals that of StrongRank and WeakRank: Theorem 5. The worst case complexity of computing StrongRank and WeakRank up to lineage ℓ on an n node graph is O(ℓn2 ), equal to that of computing the first ℓ iterations of PageRank. StrongRank and WeakRank can then be computed for any graph of up to millions of nodes on a PC in a few days; and very few application domains of PageRank entail larger graphs (the World Wide Web being one notable exception). Graphs of much larger size n are manageable by PageRank itself only if of low (average) degree g ≪ n; and in such graphs, one is often interested only in the top k ranking nodes, with k ≪ n. When these graphs are ∆−well orderable, i.e. there are at most k∆ nodes with WeakRank less than k and at least k/∆ with StrongRank less than k (this seems to hold with 2 ≤ ∆ ≤ 4 for k > 100 in social networks like the Web, see Section 5) we can refine the analysis of Theorem 5: Theorem 6. The worst case complexity of computing the top k ranks of StrongRank and WeakRank up to lineage ℓ on a ∆−well orderable graph of average + ∆(k+ℓ)(lg(∆)+k) )) and degree g and n nodes is, respectively, O(nℓg(1 + lg(k) g ng 2
3
k ∆ +∆(k+ℓ)(lg(∆)+k) O(nℓg(1 + lg(k) )) vs. a complexity of O(nℓg) for the top g + ng k ranks and the first ℓ iterations of PageRank. √ For max(lg(n), g, ℓ) ≤ k ≤ n the complexities of StrongRank and WeakRank 3 simplify respectively into O(nℓg(1 + lg(k)+∆ )) and O(nℓg(1 + lg(k)+∆ )). Thus, g g even in the case of the (indexed) Web graph it still appears possible to compute the top ≈ 105 ranks of StrongRank and WeakRank in time comparable to that of PageRank (within an order of magnitude - several hours on a single PC).
84
M. Bressan and E. Peserico
5
Experimental Results
This section brings the analytical machinery of Section 3 to bear on “real” graphs - a 2004 snapshot of the .it domain Web graph, and a 2007 snapshot of the CiteSeer citation graph. Subsection 5.1 briefly presents the data and the experimental setup. Subsection 5.2 evaluates the extent to which the nodes of those graphs can be ordered in a fashion satisfying simultaneously every user (with different users described by different damping variables). Finally, Subsection 5.3 analyzes the performance of PageRank as d varies from nearly 0 to nearly 1 by evaluating the intersection of its top k set with that of StrongRank and WeakRank. 5.1
Data and Experimental Setup
We analyze the 40M node, 1G link snapshot of the 2004 .it domain published by [25], and the 0.7M node, 1.7M link snapshot of the late 2007 CiteSeer citation graph [1] (nodes are articles and links are citations). We use the WebGraph package [7,25] for their manipulation. .it, as a national first level domain, provides a large and (unlike e.g. .com or .gov) “well-rounded” portion of the Web graph with a size still within reach of our computational resources. Furthermore, the primary language of .it is not shared by any other country (unlike e.g. .uk or .fr), minimizing distortions in ranking introduced by the inevitable cut of links with the rest of the Web.
We perform two pre-processing steps on the .it snapshot. First we remove intradomain links (which constitute strongly biased conferral of authority, see [15] and [12]). Second, we merge all dynamic pages generated from the same base page, dealing with pages (like the homepage of a forum's dynamic language) automatically linked by a huge collection of other template-generated pages. Both steps appear to markedly improve the human-perceived quality of the results.
Then, we compute the lineage of each node up to the 128th generation for both graphs. Note that, in theory, lineages should be compared for all generations. Stopping at the 128th is equivalent to considering damping variables d(τ) that are 0 for τ > 128, and (typical) PageRank implementations that compute the score vector using at most 128 iterations and disregarding authority propagation over paths longer than 128 hops. This appears reasonable because for all damping factors/variables except those extremely close to 1, the branching contribution of levels beyond the 128th is dwarfed by rounding errors of finite precision computing machinery; and because, empirically, the set of the top k items returned by StrongRank and WeakRank when considering only the first ℓ generations appears to almost completely stabilize as ℓ grows larger than 60−100 (Figs. 1 and 2 show the normalized intersection between the top k item sets returned by StrongRank and WeakRank when considering 128 lineages, and the top k item sets returned when considering ℓ lineages for ℓ = 1, . . . , 128).

5.2 Choose the Damping, Choose the Ranking?
Fig. 1. .it domain graph: convergence of top k-strong (left) and top k-weak (right) sets as a function of lineage level, for different values of k
Fig. 2. CiteSeer citation graph: convergence of top k-strong (left) and top k-weak (right) sets as a function of lineage level, for different values of k
Fig. 3. Number of k-strongly and k-weakly ranked nodes in the .it Web graph, as a function of k
Fig. 4. Number of k-strongly and k-weakly ranked nodes in the CiteSeer citation graph, as a function of k

Figures 3 and 4 show the number of k-strongly ranked nodes and the number of k-weakly ranked nodes for the .it domain graph and the CiteSeer citation
graph as k varies from 1 to 16000. The two graphs exhibit a strikingly similar behavior, with a few differences. Both in the .it and in the CiteSeer graph the ratio wk /k between the number wk of k−weakly ranked nodes and k is never higher than 6, and converges relatively quickly to a value between 1.5 and 2.5 (it is never larger than 4 for k ≥ 7 on the .it graph, and for k ≥ 127 on the CiteSeer Graph). This ratio represents the inevitable cost of obliviousness to the user model: to include all those items that are outranked by less than k other items for some damping variable, a set of at least wk items is needed. The ratio sk /k between the number sk of k−strongly ranked nodes and k is considerably smaller than 1 in both graphs - particularly in the CiteSeer graph. This ratio represents the fraction of items that are robustly in the top k, for all damping factors and variables. For the CiteSeer graph, the ratio is ≈ 0.05 for k < 200, converging to a value between 0.5 and 0.6 for k > 10000. For the .it graph, it is slightly larger: between 0.1 and 0.2 for k < 200, converging to a value between 0.4 and 0.5 for k > 10000. Thus, ranking sensitivity to the damping factor in “real” graphs appears not nearly as high as that of the synthetic graphs of Section 2, but still considerable - particularly for the top 10 − 100 items. To return all items that would appear among the top k for some damping factor or variable, even the “best” ranking algorithm might have to return from 2 to 4 times as many items; and although
any choice of the damping factor will guarantee among the top k a non-negligible core of items that would also be returned among the top k for every other choice, this core appears relatively small, between 5% and 40% of the total.

Fig. 5. .it graph: intersection metric for PageRank vs. StrongRank as d varies
Fig. 6. .it graph: intersection metric for PageRank vs. WeakRank as d varies
Fig. 7. CiteSeer citation graph: intersection metric for PageRank vs. StrongRank as d varies
Fig. 8. CiteSeer citation graph: intersection metric for PageRank vs. WeakRank as d varies

5.3 The Best Damping Factor
In this subsection we deploy StrongRank and WeakRank as a testbed to evaluate the performance of PageRank for different damping factors (see Subsection 3.2) on the .it graph and on the CiteSeer graph. Figures 5,6,7 and 8 show the fraction of the top k items returned by PageRank that are also among the top k items returned respectively by StrongRank and WeakRank, for k = 16, 64, 256 and 1024, as the damping factor varies between 0.01 and 0.99 in steps of 0.01. Again, the behavior of PageRank is strikingly similar on the two graphs. For both of them and for all four values of k, the size of the intersection set is between 0.38k and 0.93k for all damping factors sampled (but note that, in theory, anything could happen between those discrete sampling points - see Section 2!). Thus, for all damping factors, including ones very close to 0 that make the computation of the score vector particularly efficient [6], PageRank always returns a set of results in common with both StrongRank and WeakRank of reasonable size. For both graphs, the largest intersection with WeakRank is achieved for damping factors in the 0.8 − 0.9 range; whereas the largest intersection with
StrongRank is achieved for damping factors in the 0.5 − 0.6 range. Which is more desirable? Ultimately, it depends on the nature of the application. For example, a Web search engine typically returns to a user many more “leads” than that user will follow. In this context a false negative (not returning a page of high interest) is far more damaging than a false positive (returning a page of little interest). Thus, a large intersection with WeakRank is more desirable, being indicative of an algorithm that returns, for every user, a large fraction of that user’s top choices. It is then perhaps not surprising that the typical value assigned to the damping factor in search engines is indeed 0.85! On the other hand, in a trust/reputation system a false positive (returning as highly trusted an item the user would not trust) is far more damaging than a false negative (not returning as trusted an item the user would actually trust). A large intersection with StrongRank is then more desirable, being indicative of an algorithm that returns only items that are at least moderately trusted by every user model; and a damping factor closer to 0.5 - as suggested in [2] - might be a better choice.
6
Conclusions
We show that for any k, at least on some graphs, arbitrarily small variations in the damping factor can completely reverse the ranking of the top k nodes, or indeed make them assume all possible k! orderings. This is deeply unsatisfying should one be aiming at an “objective” ranking, since an item may be ranked higher or lower than another depending on a tiny variation in the value of a parameter in the user model (the damping factor) likely to differ between different users and, in any case, hard to assess precisely. Previous experimental evidence suggested this was not the case in “real” graphs like the Web graph. However, verifying rank stability for discrete variations in the damping factor (e.g. 0.01 increments) is not sufficient to conclude that rank is indeed stable as the damping factor varies over a whole continuous interval - just like a set of discrete samples of a continuous signal is not, in general, sufficient to reconstruct it. Yet, just like Shannon’s Sampling Theorem allows the reconstruction of a continuous signal from a discrete set of samples as long as it is limited in bandwidth, lineage analysis allows us to compare the rank of two nodes “in one shot” for all (even “time variant”) damping factors using just a discrete set of lineage measurements, as long as PageRank scores can be computed or sufficiently approximated with a limited number of iterations of the Power Method (which is the case for all practical applications). Lineage analysis not only allows us to efficiently verify that, indeed, in “real” graphs like Web graphs or citation graphs PageRank is not excessively ranking sensitive (but not insensitive either!) to variations of the damping factor. It also provides a simple, very “natural” interpretation of rank dominance for all damping factors that is completely independent of PageRank. And it allows one to introduce the notions of strong rank and weak rank of a node, related to, but subtly different from, those of best and worse rank - and more descriptive, since
they capture the level of churn “around” a node whose rank is relatively stable overall yet highly unstable compared to its individual competitors. Weak and strong rank also induce two new ranking algorithms, StrongRank and WeakRank. StrongRank tends to return results that are at least moderately useful under all user models (i.e. for all damping variables). WeakRank tends to return results that are highly interesting to at least a niche of users. As such, they can provide useful benchmarks to compare different link analysis algorithms in terms of their ability to, respectively, return “universally useful” results, and to ferret out results of high interest to niches of users. A more thorough evaluation of StrongRank and WeakRank is certainly a promising direction for future research. StrongRank and WeakRank can also be used to evaluate different values of the damping factor — validating the “folk lore” result that 0.85 is ideal, at least for search engines and other applications where it is far more important to avoid false negatives then false positives. For applications where the reverse is true, our results suggest that a value closer to 0.5 might be more effective, reflecting the results of [2]. It is surprising that, in this regard, the .it and CiteSeer graphs exhibit exactly the same characteristics - it would be interesting to investigate if (and perhaps why) this is the case in other application domains of PageRank.
References 1. CiteSeer metadata, http://citeseer.ist.psu.edu/oai.html 2. Avrachenkov, K., Litvak, N., Son Pham, K.: A singular perturbation approach for choosing PageRank damping factor. ArXiv Mathematics e-prints (2006) 3. Bacchin, M., Ferro, N., Melucci, M.: The effectiveness of a graph-based algorithm for stemming. In: Lim, E.-p., Foo, S.S.-B., Khoo, C., Chen, H., Fox, E., Urs, S.R., Costantino, T. (eds.) ICADL 2002. LNCS, vol. 2555, pp. 117–128. Springer, Heidelberg (2002) 4. Baeza-Yates, R., Boldi, P., Castillo, C.: Generalizing PageRank: Damping functions for link-based ranking algorithms. In: Proc. ACM SIGIR 2006 (2006) 5. Berry, M.W.: Survey of Text Mining. Springer, Heidelberg (2003) 6. Boldi, P., Santini, M., Vigna, S.: PageRank as a function of the damping factor. In: Proc. ACM WWW 2005 (2005) 7. Boldi, P., Vigna, S.: The WebGraph framework I: Compression techniques. In: Proc. of the Thirteenth International World Wide Web Conference (WWW 2004), Manhattan, USA, pp. 595–601. ACM Press, New York (2004) 8. Cho, J., Garc´ıa-Molina, H., Page, L.: Efficient crawling through URL ordering. Computer Networks and ISDN Systems 30(1-7), 161–172 (1998) 9. Erkan, G., Radev, D.R.: Lexrank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22, 457–479 (2004) 10. Fagin, R., Kumar, R., Sivakumar, D.: Comparing top k lists. In: Proc. ACM SODA (2003) 11. Haveliwala, T.H.: Efficient computation of pagerank. Technical report (1999) 12. Jiang, X.M., Xue, G.R., Zeng, H.J., Chen, Z., Song, W.-G., Ma, W.-Y.: Exploiting pageRank at different block level. In: Zhou, X., Su, S., Papazoglou, M.P., Orlowska, M.E., Jeffery, K. (eds.) WISE 2004. LNCS, vol. 3306, pp. 241–252. Springer, Heidelberg (2004)
13. Kamvar, S.D., Haveliwala, T.H., Manning, C.D., Golub, G.H.: Extrapolation methods for accelerating pagerank computations. In: Proceedings of WWW, pp. 261– 270. ACM, New York (2003) 14. Kamvar, S.D., Schlosser, M.T., Garcia-Molina, H.: The eigentrust algorithm for reputation management in p2p networks. In: Proc. of ACM WWW 2003(2003) 15. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632 (1999) 16. Langville, A.N., Meyer, C.D.: Deeper inside PageRank. Internet Math. 1(3), 335– 380 (2004) 17. Langville, A.N., Meyer, C.D.: Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton (2006) 18. Melucci, M., Pretto, L.: PageRank: When order changes. In: Amati, G., Carpineto, C., Romano, G. (eds.) ECiR 2007. LNCS, vol. 4425, pp. 581–588. Springer, Heidelberg (2007) 19. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Dig. Libr. Tech. Proj. (1998) 20. Peserico, E., Bressan, M.: Choose the Damping, Choose the Ranking? Technical report, Univ. Padova (2008), http://www.dei.unipd.it/~ enoch/papers/damprank.pdf 21. Peserico, E., Pretto, L.: What does it mean to converge in rank? In: Proc. ICTIR 2007 (2007) 22. Pretto, L.: A theoretical analysis of google’s PageRank. In: Laender, A.H.F., Oliveira, A.L. (eds.) SPIRE 2002. LNCS, vol. 2476, pp. 131–144. Springer, Heidelberg (2002) 23. Chakrabarti, S., Dom, B.E., Gibson, D., Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: Experiments in topic distillation. In: Proc. ACM SIGIR Workshop on Hypertext IR on the Web (1998) 24. Tarau, P., Mihalcea, R., Figa, E.: Semantic document engineering with wordnet and PageRank. In: Proc. ACM SAC 2005 (2005) 25. University of Milan. Laboratory of Web Algorithmics, http://law.dsi.unimi.it/ 26. Wangk, K.W.: Item selection by ’hub-authority’ profit ranking. In: Proc. ACM SIGKDD 2002 (2002)
Characterization of Tail Dependence for In-Degree and PageRank⋆

Nelly Litvak¹⋆⋆, Werner Scheinhardt¹, Yana Volkovich¹, and Bert Zwart²

¹ University of Twente, Dept. of Applied Mathematics, P.O. Box 217, 7500 AE, Enschede, The Netherlands
{n.litvak,w.r.w.scheinhardt,y.volkovich}@ewi.utwente.nl
² CWI, Science Park Amsterdam, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands
[email protected]
Abstract. The dependencies between power law parameters, such as in-degree and PageRank, can be characterized by the so-called angular measure, a notion used in extreme value theory to describe the dependency between very large values of coordinates of a random vector. Based on an analytical stochastic model, we argue that the angular measure for in-degree and personalized PageRank is concentrated in two points. This corresponds to the two main factors for high ranking: large in-degree and a high rank of one of the ancestors. Furthermore, we can formally establish the relative importance of these two factors.

Keywords: Power law graphs, PageRank, Regular variation, Multivariate extremes.
1
Introduction
Large self-organizing networks, such as the Internet, the World Wide Web, social and biological networks, often exhibit power laws. In simple words, a random variable X has a power law distribution with exponent α > 0 if its tail probability P(X > x) is roughly proportional to x^{-α} for large enough x. Power law distributions are heavy-tailed since the tail probability decreases much more slowly than a negative exponential, and thus one can sometimes observe extremely large values of X. Statistical analysis of complex networks characterized by power laws has received massive attention in the recent literature, see e.g. [1,2,3] for excellent surveys. Nevertheless, we are still far from a complete understanding of the structure of such networks. In particular, the question of measuring dependencies between network parameters remains an open and complex issue [1].

⋆ Part of this research has been funded by the Dutch BSIK/BRICKS project. This article is also the result of joint research in the 3TU Centre of Competence NIRICT (Netherlands Institute for Research on ICT) within the Federation of Three Universities of Technology in The Netherlands.
⋆⋆ This author is supported by NWO Meervoud grant no. 632.002.401.
A common example of two related power law characteristics is that of in-degree and PageRank of a Web page [4,5,6]. The (personalized) PageRank is defined in [7] as follows:

PR(k) = c Σ_{i=1}^{IN(k)} PR(k_i)/OUT(k_i) + (1 − c) PREF(k),  k = 1, ..., n,   (1)
where PR(k) is the PageRank of page k, n is the number of nodes in the network, IN(k) is the in-degree of k, the sum is taken over all pages k_i that link to page k, OUT(k_i) is the number of outgoing links of page k_i, PREF(k) is the preference of the user for page k, with Σ_{k=1}^{n} PREF(k) = n, and c ∈ (0, 1) is a damping factor. If there is no outgoing link from a page then we say that the page is dangling and assume that it links to all nodes in the network. The PageRank in (1) is uniquely defined, and the PageRanks of all pages sum up to n. We note that in the literature the PageRank and the user preference vectors are often viewed as probability vectors, normalized to sum up to one. Clearly, the PageRank is influenced largely by in-degree. However, there is still no agreement in the literature on the dependence between these two quantities. In particular, the values of the correlation coefficient vary considerably in different studies [4,8]. This only confirms that the correlation coefficient is an uninformative dependence measure in heavy-tailed (power law) data [9,1,10]. In fact, the correlation coefficient is a 'crude summary' of dependencies that is most informative for jointly normal random variables. It is a common and simple technique but it is not subtle enough to distinguish between the dependencies in large and in small values. This becomes a problem if we want to measure the dependence between two heavy-tailed network parameters, because in that case we are mainly interested in the dependence between extremely large values. We propose to solve the problem of evaluating the dependencies between network parameters using the theory of multivariate extremes. This theory operates with the notion of tail dependence for a random vector (X, Y), that is, the dependence between extremely large values of X and Y. Such tail dependence is characterized by an angular measure on [0, 1] (see Section 4 for a formal definition). Informally, a concentration of the angular measure around 0 and/or 1 signals independence, while concentration around some other number a ∈ (0, 1) suggests that a certain fraction of large values of Y comes together with large values of X. In [11,12] a first attempt was made to compute the angular measure between in-degree and PageRank, and completely different dependence structures were discovered in Wikipedia (independence), preferential attachment networks (complete dependence) and the Web (intermediate case). In this paper the goal is to compute the angular measure analytically, based on the stochastic model proposed in [13,6,14]. The resulting angular measure is concentrated in points 0 and a ∈ (1/2, 1), and the mass distribution depends on the network parameters. Such an angular measure is a formalization of the common understanding that there are two main sources for high ranking: high in-degree and a high rank of one of the ancestors. Furthermore, the fraction of the measure mass in 0 stands for the
proportion of highly ranked nodes that have a low in-degree. Thus, we obtain a description of the dependence structure, that is more informative and relates better to reality than the correlation coefficient. In order to derive the tail dependencies, we employ the theory of regular variation, that provides a natural mathematical formalism for analyzing power laws [10]. By definition, the random variable X is regularly varying with index α, if P(X > u) = u−α L(u), u > 0, where L(u) is a slowly varying function, that is, for x > 0, L(ux)/L(u) → 1 as u → ∞, for instance, L(u) may be equal to a constant or log(u). In Section 2 we describe the model where power law network parameters are represented by regularly varying random variables. Basing on this model, the results on tail dependence are derived in Sections 3 and 4, while some of the proofs are deferred to the Appendix. In Section 5 we discuss the results and compare our findings to the graph data. The derived two-point measure is only a first-order approximation of the complex angular measure observed on the data, since the realistic situation is way more complex than our simplified model. Further modifications of the model are needed in order to adequately describe the dependencies in real-life networks.
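As a concrete reference for definition (1), the following is a minimal power-iteration sketch in Python; the function and variable names are ours, dangling pages are treated as linking to all nodes, and the scores are kept normalized to sum up to n as in the text.

```python
def personalized_pagerank(in_links, out_degree, pref, c=0.85, iters=100):
    """Iterate PR(k) = c * sum_i PR(k_i)/OUT(k_i) + (1-c) * PREF(k), as in (1).

    in_links[k]  : pages linking to k (only pages with out_degree > 0 should appear here)
    out_degree[k]: number of outgoing links of k (0 means the page is dangling)
    pref         : user preference vector, assumed to sum to n
    """
    n = len(pref)
    pr = [1.0] * n  # PageRanks sum up to n throughout the iteration
    for _ in range(iters):
        # dangling pages are assumed to link to all n pages
        dangling = sum(pr[k] for k in range(n) if out_degree[k] == 0) / n
        pr = [c * (sum(pr[i] / out_degree[i] for i in in_links[k]) + dangling)
              + (1 - c) * pref[k]
              for k in range(n)]
    return pr
```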
2
Model and Preliminaries
Choose a random node in the graph, let N and R denote its in-degree and PageRank, respectively, and let D_i denote the out-degree of its ith ancestor, where i = 1, ..., N. As in [6,13,14] we assume that N and R are random variables that satisfy

R =ᵈ c Σ_{i=1}^{N} (1/D_i) R_i + (1 − c) T.   (2)

Here N, the R_i's, the D_i's and T are independent; the R_i's are distributed as R with ER = 1; a =ᵈ b means that a and b have the same probability distribution, and c ∈ (0, 1) is the damping factor. The equation above clearly corresponds to the definition of personalized PageRank (1). We note that compared to our previous work [6,13], here we account for personalization by setting T to be random. In this paper we neglect the presence of dangling nodes but they can be easily included in the model (see e.g. [6]). For convenience we prefer to work with the following, slightly more general, representation of (2):

R =ᵈ Σ_{i=1}^{N} A_i R_i + B,   (3)
where the A_i's are independent and distributed as some random variable A < 1, and B > 0 is independent of the A_i's. Next, we define

F̄_1(u) := P(N > u)  and  F̄_2(u) := P(R > u),  u > 0,

and assume that F̄_1(u) is regularly varying with index α > 1. We also assume that B in (3) has a lighter tail than N, that is, P(B > u) = o(P(N > u)) as
u → ∞. As a result, F̄_2(u) is also regularly varying. In fact, the next proposition was proved in [6,13]; a more general case is presented in [14]. For technical reasons, in [6,13,14] it was assumed that the index α is non-integer.

Proposition 1. Under the assumptions above,

F̄_2(u) ∼ K F̄_1(u)  as  u → ∞,

where a ∼ b means that a/b → 1. The value of K depends on the precise assumptions on the A_i's and B; if EN = d, A = c/d and B = 1 − c as in [13], we have

K = c^α / (d^α − d c^α).   (4)

In the sequel we will only use the specific form (4) in Corollary 1 and Section 5. We also note that within the same model (3), we could assume that the distribution of the R_i's is different from the one of R. In this case, if the tail of the R_i's is not heavier than the one of N, Proposition 1 still holds, only K will depend on the behavior of P(R > u) as u → ∞ (see Lemma 3.7 in [15]).

We need to deal with a minor complication because F̄_1 is not strictly decreasing, and we will in the sequel need to consider the behavior of its inverse function for small arguments. Instead of working with the generalized inverse F̄_1^{-1}(v) = inf{u > 0 : F̄_1(u) ≤ v}, which would make the proofs more involved, we prefer to simply work with some function that is strictly decreasing and asymptotically equivalent to F̄_1(u). Such a function can e.g. be defined as f_1(u) := (1 + e^{-u}) F̄_1(u), for which the inverse function is well-defined. Thus, we arrive at the following:

F̄_1(u) := P(N > u) ∼ f_1(u)  and  F̄_2(u) := P(R > u) ∼ f_2(u)  as u → ∞,   (5)

where f_1(u) = u^{-α} L(u) and f_2(u) = K u^{-α} L(u) = K f_1(u), for some slowly varying function L(·).
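A small Monte Carlo sketch can illustrate Proposition 1. The truncation of the recursion depth, the Pareto-type in-degree law and the parameter values below are our own illustrative assumptions; at these sample sizes the empirical tail ratio P(R > u)/P(N > u) only roughly approaches K.

```python
import random

ALPHA, C, D = 1.5, 0.5, 4.0                 # assumed power-law index, damping, mean in-degree
K = C**ALPHA / (D**ALPHA - D * C**ALPHA)    # constant of Proposition 1 / Eq. (4)

def sample_N():
    # integer, regularly varying in-degree with tail index ALPHA and mean about D (crude choice)
    return round(random.paretovariate(ALPHA) * D * (ALPHA - 1) / ALPHA)

def sample_R(depth=3):
    """One draw of R from recursion (3) with A_i = c/d and B = 1 - c, truncated in depth."""
    if depth == 0:
        return 1.0                          # ER = 1, so the mean is used as a crude cut-off
    return sum((C / D) * sample_R(depth - 1) for _ in range(sample_N())) + (1 - C)

if __name__ == "__main__":
    random.seed(1)
    ns = [sample_N() for _ in range(50000)]
    rs = [sample_R() for _ in range(50000)]
    for u in (10, 20, 40):
        pn = sum(x > u for x in ns) / len(ns)
        pr = sum(x > u for x in rs) / len(rs)
        print(u, round(pr / pn, 3), "vs K =", round(K, 3))
```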
3
Tail Dependence
Let us introduce two functions that are defined on R_+^2, namely the stable tail dependence function [9],

ℓ(x, y) = lim_{t↓0} t^{-1} P(F̄_1(N) ≤ tx or F̄_2(R) ≤ ty),

and the function

r(x, y) := lim_{t↓0} t^{-1} P(F̄_1(N) ≤ tx, F̄_2(R) ≤ ty).   (6)
Provided that the limit in (6) exists, these are closely related. In fact, adding them gives

ℓ(x, y) + r(x, y) = lim_{t↓0} t^{-1} (P(F̄_1(N) ≤ tx) + P(F̄_2(R) ≤ ty)),
which would yield x + y if F̄_1 and F̄_2 were strictly decreasing, because then F̄_1(N) and F̄_2(R) would be uniform random variables on (0, 1). The following lemma shows that this result holds anyway.

Lemma 1. The functions ℓ and r satisfy ℓ(x, y) + r(x, y) = x + y.

Proof. We use the function f_1 to show that lim_{t↓0} t^{-1} P(F̄_1(N) ≤ tx) = x, as follows (the corresponding result for P(F̄_2(R) ≤ ty) is proven analogously). Since F̄_1(u) → 0 and |F̄_1(u) − f_1(u)| = o(F̄_1(u)) as u → ∞, for any small ε > 0 we can choose t_1 small enough so that for any t ≤ t_1 and u > 0 that satisfy F̄_1(u) ≤ tx we also have |F̄_1(u) − f_1(u)| ≤ ε|F̄_1(u)|, and hence |F̄_1(u) − f_1(u)| ≤ εtx. If we now fix some small ε > 0, the above implies for any t ≤ t_1 that

P(F̄_1(N) ≤ tx) = P(f_1(N) ≤ (f_1(N) − F̄_1(N)) + tx) ≤ P(f_1(N) ≤ (1 + ε)tx)
  = P(N ≥ f_1^{-1}((1 + ε)tx)) = F̄_1(f_1^{-1}((1 + ε)tx)) ∼ f_1(f_1^{-1}((1 + ε)tx)) = (1 + ε)tx.

So we obtain

lim sup_{t→0} t^{-1} P(F̄_1(N) ≤ tx) ≤ (1 + ε)x,

and similarly,

lim inf_{t→0} t^{-1} P(F̄_1(N) ≤ tx) ≥ (1 − ε)x.
The result now follows by letting ε go to 0.

The main result of this section gives the stable tail dependence function for N and R:

Theorem 1. The function r(x, y) for N and R is given by

r(x, y) = min{x, y(EA)^α/K}.   (7)

Consequently, ℓ(x, y) = max{y, x + y(1 − (EA)^α/K)}.

In the remainder of the paper we will mainly work with r(x, y) rather than ℓ(x, y), since its derivation is more appealing. To prove Theorem 1 we need to use the following lemma.

Lemma 2. As u → ∞, the following asymptotic relation holds for any constant C > 0:

P(N > u, R > Cu) ∼ min{f_1(u), (EA/C)^α f_1(u)}.
We refer to the Appendix for the proof of this lemma, but the intuition behind it is clear. It follows from (3) and the strong law of large numbers that when N is large, we have R ≈ EA · N. Therefore, when EA > C, the event {R > Cu} is already 'implied' by {N > u}, so the joint probability behaves as P(N > u). When EA < C, N needs to be larger for R > Cu to hold, and the joint probability behaves like P(N > uC/EA). In order to understand Theorem 1 we fix x, y > 0 throughout this section and rewrite the joint probability in a form that enables application of Lemma 2. The schematic derivation is as follows, where the superscripts over the relation signs denote three issues to be resolved:

P(F̄_1(N) ≤ tx, F̄_2(R) ≤ ty)
  ∼^1 P(f_1(N) ≤ tx, f_2(R) ≤ ty) = P(N ≥ f_1^{-1}(tx), R ≥ f_2^{-1}(ty))
  =^2 P(N ≥ f_1^{-1}(tx), R ≥ [ (y/(Kx)) · L(f_1^{-1}(tx)) / L(f_2^{-1}(ty)) ]^{-1/α} f_1^{-1}(tx))
  ∼^{3,1} P(N ≥ f_1^{-1}(tx), R ≥ (y/(Kx))^{-1/α} f_1^{-1}(tx))   (8)
The statement of Theorem 1 now follows from Lemma 2 since obviously f_1(f_1^{-1}(tx)) = tx, provided that each of the three steps indicated in (8) is justified. We resolve these issues as follows:

1. We deduce the asymptotic equivalence of the two probabilities from the asymptotic equivalence of the functions inside the probabilities. This step is intuitively clear but not mathematically rigorous. In the proof of Theorem 1 we will make the argument precise, see the Appendix.
2. This step is fairly straightforward. Indeed, v = f_1(u) = u^{-α}L(u) implies u = (v/L(u))^{-1/α}, so f_1^{-1}(v) = (v/L(f_1^{-1}(v)))^{-1/α}. Also, since f_2(u) = Kf_1(u) we have f_2^{-1}(v) = f_1^{-1}(v/K) = (v/(K L(f_2^{-1}(v))))^{-1/α}. Hence,

f_2^{-1}(ty) = f_1^{-1}(tx) [ (y/(Kx)) · L(f_1^{-1}(tx)) / L(f_2^{-1}(ty)) ]^{-1/α}.   (9)

3. This is a consequence of the following statement (the proof of which can be found in the Appendix), combined with issue 1.

Lemma 3. For all x, y > 0 we have L(f_1^{-1}(tx)) ∼ L(f_2^{-1}(ty)) as t ↓ 0.

Now, in order to prove Theorem 1 we only need to resolve issue 1 twice in the derivation in (8). The formal proof of this can be found in the Appendix.
4
Angular Measure
In this section we find the angular measure that corresponds to the function r(x, y) we found, but first we will give some preliminaries. In extreme value
theory (see [9]), it has been shown that a unique (nonnegative) measure H(·) exists on the set Ξ = {ω ∈ R_+^2 | ||ω|| = 1}, such that the stable tail dependence function ℓ can be expressed as

ℓ(x, y) = ∫_Ξ max(ω_1 x, ω_2 y) H(dω).   (10)

Here || · || is a norm that may be chosen freely, but for (10) to hold, the measure has to be normalized in such a way that

∫_Ξ ω_1 H(dω) = ∫_Ξ ω_2 H(dω) = 1,

so that we have ℓ(x, 0) = x and ℓ(0, y) = y, as it should be. In this work we choose the || · ||_1 norm, for which ||ω||_1 = |ω_1| + |ω_2|, since that is easiest to work with. Then (10) can be rewritten as

ℓ(x, y) = ∫_0^1 max{wx, (1 − w)y} H(dw),

and the normalization becomes

∫_0^1 w H(dw) = ∫_0^1 (1 − w) H(dw) = 1.   (11)

Here we let w = ω_1, and we identify the measures on Ξ and [0, 1]. By (11) it follows that the function r(x, y) can be written as

r(x, y) = ∫_0^1 wx H(dw) + ∫_0^1 (1 − w)y H(dw) − ∫_0^1 max{wx, (1 − w)y} H(dw)
        = ∫_0^1 min{wx, (1 − w)y} H(dw).   (12)
We will now derive the function r(x, y) in the case when the angular measure has masses in 0 and a only, as we suspect to be the case for in-degree and PageRank. First of all, the normalization (11) boils down to aH(a) = H(0) + (1 − a)H(a) = 1, which is easily solved to give

H(0) = 2 − 1/a  and  H(a) = 1/a.   (13)

Note that H has total measure 2 (as also follows for the general case by summing both integrals in (11)), and that H(0) > 0 implies a > 1/2. Combining (12) and (13), the function r(x, y) can now be written as r(x, y) = min{x, (1/a − 1)y}. This is a form very similar to the one we found earlier in (7), and it is not difficult to see that the expressions are equal for a = K/(K + (EA)^α). Since the angular measure is uniquely determined by the stable tail dependence function ℓ, see [9], and hence by the function r, we have shown that the angular measure of N and R is indeed a two-point measure. After using (13) we arrive at
Theorem 2. The angular measure with respect to the || · ||_1 norm of N and R is a two-point measure, with masses

H(0) = 1 − (EA)^α/K   in 0,
H(a) = 1 + (EA)^α/K   in a = K/(K + (EA)^α).

Corollary 1. If, as in [13], K is given by (4) with EN = d and EA = c/d, then the angular measure of N and R is a two-point measure, with masses

H(0) = c^α d^{1−α}        in 0,
H(a) = 2 − c^α d^{1−α}    in a = (2 − c^α d^{1−α})^{-1}.   (14)

5
Examples and Discussion
We compare the above results to the measurements on two different network structures: Web and Growing Network data sets. For the Web sample we choose the EU-2005 data set with 862.664 nodes and 19.235.140 links. This set was collected by The Laboratory for Web Algorithmics (LAW) of the Universit` a degli studi di Milano [16], and is available at http://law.dsi.unimi.it/. In this data set in-degree and PageRank exhibit well known power law behavior with exponent α = 1.1. For the evaluation of the exponent we refer to [12]. In Figure 1(a) we present log-log plots for in-degree and PageRanks with c = 0.85 and c = 0.5 (the straight lines are fitted). We also simulate a Growing Network of 10.000 nodes with constant out-degree d = 8. We start with d initial nodes, and at each step we add a new node that links to already existing nodes. A new link points to a randomly chosen page with probability q = 0.1, and with probability (1 − q) it follows the preferential attachment selection rule [17]. We present log-log plots for this Growing Network set in Figure 1(b). Following [9, p.328] we define an estimator of the angular measure. We are interested in the dependencies between two regularly varying characteristics of a node, namely the in-degree N and the PageRank R. Let (Nj , Rj ) be observations 0
[Figure 1 shows cumulative log-log plots of the fraction of pages against in-degree and PageRank, with fitted straight lines of slope −1.1: (a) Web data set, (b) Growing Network data set.]
Fig. 1. Cumulative log-log plots for in-degree and PageRanks
of (N, R) for the corresponding node j. Then we use the rank transformation of (N, R), leading to {(r_j^N, r_j^R), 1 ≤ j ≤ n}, where r_j^N is the descending rank of N_j in (N_1, ..., N_n) and r_j^R is the descending rank of R_j in (R_1, ..., R_n). Next we apply a coordinate transform (r_j^N, r_j^R) → (r_j, Θ_j), given by

(r_j, Θ_j) = TRANS(1/r_j^N, 1/r_j^R),

where we set TRANS(x, y) := (x + y, x/(x + y)) since all results of this paper are proven for the || · ||_1 norm. Alternatively, we
could use the polar coordinate transformation as in [11,12]: TRANS(x, y) := x2 + y 2 , arctan (y/x) . However, in this case we need to transform the angular measure in Theorem 2 to the corresponding measure w.r.t. the || · ||2 norm using formula (8.38) in [9]. Now we need to consider k points {Θj : rj ≥ r(k) }, where r(k) is the kth largest in (r1 , . . . , rn ), and make a plot for the cumulative distribution function of Θ, which gives the estimation of the probability measure H(·)/2. The question how to choose the right k can be solved by employing the Starica plot (see [10,12]). From (14) we can calculate the predicted angular measure concentrated in 0 and a. For the Web data sample with average in-degree d = 22.2974, taking c = 0.5 and c = 0.85, we obtain that a0.5 = 0.6031, H(a0.5 )/2 = 0.8290, and a0.85 = 0.7210, H(a0.85 )/2 = 0.6934, respectively. Recall that the values of H(a)/2 estimate the fraction of highly ranked pages whose large PageRank is explained by large in-degree. Observe that according to the model, this fraction becomes larger if c decreases. In Figure 2 (a,b) we plot the theoretical angular measures together with the empirical ones. The comparison between the graphs shows that there is only a very rough similarity to be seen, in the sense that the value of H(0)/2 is a reasonable estimate for the fraction of pages with high PageRank and small in-degree (corresponding to the ‘turn’ around 0.8). However, the ‘point mass’ at a seems to be spread out in an almost uniform manner. To understand this, we should realize that the theoretical two-point measure we found is only a formalization of the idea that each large PageRank value has to be either due to a large in-degree, or due to a large contributing PageRank. In the data (representing ‘reality’), such a strict division is not reasonable; for instance there will surely be pages with high PageRank due to a high in-degree and a high contributing PageRank, or due to more than one high contributing PageRanks. Thus we see that although our model roughly captures the idea of different causes for large PageRank values, it is not subtle enough to properly represent the angular measure as found from a realistic data set. In particular, the assumption of the branching structure of the Web in (2) is probably not justified. Future work could try to investigate how to improve the model in that respect, mainly by studying the dependencies amongst the Ri in (2), or between the Ri on the one hand and N on the other. Finally, we perform experiments on the Growing Network. It was proved in [18] that the PageRank in such models follows a power law with the same exponent as the in-degree. However, in our model based on stochastic equation (2) we
[Figure 2 shows the empirical angular measure (AngMeasure) and the theoretically predicted angular measure (Theor.AngMeasure), plotted as the fraction of pages against w: (a) Web data set: c=0.5, k=100,000; (b) Web data set: c=0.85, k=100,000; (c) Growing Network data set: c=0.85, k=100,000.]
Fig. 2. Angular measure and theoretically predicted angular measure
cannot assume anymore that R is distributed as the Ri ’s since Ri ’s are the ranks of ‘younger’ nodes, and presumably, the Ri will have lighter tails than R itself. Assuming that P(Ri > u) = o(P(N > u)) as u → ∞, from Lemma 3.7 in [15] we obtain that for this simple model the value of K is just K = (c/d)α . Substituting this into (14) gives us a = 1/2, H(a) = 2, and H(0) = 0, i.e. the measure is concentrated in one point a = 1/2. In Figure 2 (c) we again plot the empirical and theoretical measures, which match perfectly. We see that in synthetic graphs constructed by the preferential attachment rule, large PageRank is always due to large in-degree, and this can be easily captured by our stochastic model. In further research, it will be interesting to consider other graph models of the Web, for instance, a configuration model, where the degree of each node is chosen independently according to a pre-defined power law distribution [19,20]. The configuration model is not as centered as the preferential attachment network, and it is known to be close to the tree structure. Thus, one may expect that equation (2) provides an accurate description of the dependencies between indegree and PageRank in such a model. Finally, we would like to note that by measuring and comparison of tail dependencies in synthetic graphs and experimental data one can easily reveal whether a specific graph model adequately reflects the dependence structure observed in the experiments. From this point of view, the analysis of tail dependencies contributes towards better modelling and understanding of real-life networks.
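The following sketch (ours, not the authors' code) evaluates the predicted two-point measure of Corollary 1 and shows one way to code the rank-transform estimator of the angular measure described above; with d = 22.2974 and α = 1.1 it reproduces the values a_0.5 = 0.6031, H(a_0.5)/2 = 0.8290 and a_0.85 = 0.7210, H(a_0.85)/2 = 0.6934 quoted for the Web data set.

```python
def two_point_measure(c, d, alpha):
    """Predicted angular measure of Corollary 1 / Eq. (14): masses H(0), H(a) and the atom a."""
    x = c**alpha * d**(1.0 - alpha)          # c^alpha * d^(1-alpha)
    return x, 2.0 - x, 1.0 / (2.0 - x)       # H(0), H(a), a

def angular_sample(pairs, k):
    """Rank-transform estimator of the angular measure (w.r.t. the ||.||_1 norm).

    pairs: list of (in-degree, PageRank) observations; returns the k angles Theta_j
    with the largest radii r_j, whose empirical c.d.f. estimates H(.)/2.
    """
    n = len(pairs)
    rN = {j: r + 1 for r, (j, _) in enumerate(sorted(enumerate(pairs), key=lambda t: -t[1][0]))}
    rR = {j: r + 1 for r, (j, _) in enumerate(sorted(enumerate(pairs), key=lambda t: -t[1][1]))}
    polar = []
    for j in range(n):
        x, y = 1.0 / rN[j], 1.0 / rR[j]
        polar.append((x + y, x / (x + y)))   # TRANS(x, y) = (x + y, x/(x + y))
    polar.sort(reverse=True)                 # keep the k observations with largest radius
    return sorted(theta for _, theta in polar[:k])

if __name__ == "__main__":
    for c in (0.5, 0.85):
        h0, ha, a = two_point_measure(c, d=22.2974, alpha=1.1)
        print(f"c={c}: a={a:.4f}, H(a)/2={ha/2:.4f}, H(0)/2={h0/2:.4f}")
```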
References 1. Chakrabarti, D., Faloutsos, C.: Graph mining: Laws, generators, and algorithms. ACM Comput. Surv. 38(1), 2 (2006) 2. Mitzenmacher, M.: A brief history of generative models for power law and lognormal distributions. Internet Math. 1(2), 226–251 (2004) 3. Newman, M.E.J.: The structure and function of complex networks. SIAM Rev. 45(2), 167–256 (2003) 4. Donato, D., Laura, L., Leonardi, S., Millozi, S.: Large scale properties of the Webgraph. Eur. Phys. J. 38, 239–243 (2004) 5. Pandurangan, G., Raghavan, P., Upfal, E.: Using PageRank to characterize Web structure. In: Ibarra, O.H., Zhang, L. (eds.) COCOON 2002. LNCS, vol. 2387, p. 330. Springer, Heidelberg (2002)
6. Volkovich, Y., Litvak, N., Donato, D.: Determining factors behind the PageRank log-log plot. In: Bonato, A., Chung, F.R.K. (eds.) WAW 2007. LNCS, vol. 4863, pp. 108–123. Springer, Heidelberg (2007) 7. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Comput. Networks 33, 107–117 (1998) 8. Fortunato, S., Bogu˜ na ´, M., Flammini, A., Menczer, F.: Approximating PageRank from in-degree. In: Aiello, W., Broder, A., Janssen, J., Milios, E.E. (eds.) WAW 2006. LNCS, vol. 4936, pp. 59–71. Springer, Heidelberg (2008) 9. Beirlant, J., Goegebeur, Y., Segers, J., Teugels, J.: Statistics of Extremes: Theory and Applications. Wiley, Chichester (2004) 10. Resnick, S.I.: Heavy-tail Phenomena. Springer, New York (2007) 11. Volkovich, Y., Litvak, N., Zwart, B.: Measuring extremal dependencies in Web graphs. In: WWW 2008: Proceedings of the 17th international conference on World Wide Web, pp. 1113–1114. ACM Press, New York (2008) 12. Volkovich, Y., Litvak, N., Zwart, B.: A framework for evaluating statistical dependencies and rank correlations in power law graphs. Memorandum 1868, Enschede (2008) 13. Litvak, N., Scheinhardt, W.R.W., Volkovich, Y.: Probabilistic relation between indegree and PageRank. In: Aiello, W., Broder, A., Janssen, J., Milios, E.E. (eds.) WAW 2006. LNCS, vol. 4936, pp. 72–83. Springer, Heidelberg (2008) 14. Volkovich, Y., Litvak, N.: Asymptotic analysis for personalized Web search. Memorandum 1884, Enschede (2008) 15. Jessen, A.H., Mikosch, T.: Regularly varying functions. Publications de l’institut mathematique, Nouvelle s´erie 79(93) (2006) 16. Boldi, P., Vigna, S.: The WebGraph framework I: Compression techniques. In: WWW 2004: Proceedings of the 13th International Conference on World Wide Web, pp. 595–601. ACM Press, New York (2004) 17. Albert, R., Barab´ asi, A.L.: Emergence of scaling in random networks. Science 286, 509–512 (1999) 18. Avrachenkov, K., Lebedev, D.: PageRank of scale-free growing networks. Internet Math. 3(2), 207–231 (2006) 19. van der Hofstad, R., Hooghiemstra, G., van Mieghem, P.: Distances in random graphs with finite variance degrees. Random Structures Algorithms 27(1), 76–123 (2005) 20. van der Hofstad, R., Hooghiemstra, G., Znamenski, D.: Distances in random graphs with finite mean and infinite variance degrees. Electron. J. Probab. 12(25), 703–766 (2007)
A
Proofs
Proof (of Lemma 2). The proof is based on the strong law of large numbers. Informally, we use the fact that if N is large, then (3) implies R ≈ EA · N. Assume first that C < EA. Then we write

P(N > u, R > Cu) = P(N > u) P(R > Cu | N > u),   (15)

and we further obtain

P(R > Cu | N > u) ≥ P(Σ_{i=1}^{⌊u⌋} A_i R_i + B > Cu) ≥ P(Σ_{i=1}^{⌊u⌋} A_i R_i > Cu)
  = P(C^{-1} u^{-1} Σ_{i=1}^{⌊u⌋} A_i R_i > 1) → 1 as u → ∞,
where the convergence holds by the strong law of large numbers for any C < EA. Hence when C < EA the result follows directly from (5) and (15). Now assume that C > EA. We would like to show that

lim_{u→∞} P(N > u, R > Cu) / f_1([C/EA]u) = 1.   (16)
Then the result of the lemma will follow since L(u) ∼ L([C/EA]u) as u → ∞. For the proof, we choose a sufficiently small δ so that we can break the joint probability into three terms:

P(N > u, R > Cu) = P(N > [C/EA + δ]u, R > Cu)
  + P([C/EA − δ]u < N ≤ [C/EA + δ]u, R > Cu)
  + P(u < N ≤ [C/EA − δ]u, R > Cu).   (17)

Exactly as in the case C < EA, using (5), we have

lim_{u→∞} P(N > [C/EA + δ]u, R > Cu) / f_1([C/EA]u) = lim_{u→∞} P(N > [C/EA + δ]u) / f_1([C/EA]u) = 1 + O(δ).   (18)
Moreover, applying the argument as in the case when C < EA, from the law of large numbers we obtain that P(R > Cu | u < N ≤ [C/EA − δ]u) → 0 as u → ∞, and thus

0 ≤ lim_{u→∞} P(u < N ≤ [C/EA − δ]u, R > Cu) / f_1([C/EA]u)
  ≤ lim_{u→∞} P(N > u) P(R > Cu | u < N ≤ [C/EA − δ]u) / f_1([C/EA]u) = 0.   (19)
Finally, we get

0 ≤ lim_{u→∞} P([C/EA − δ]u < N ≤ [C/EA + δ]u, R > Cu) / P(N > [C/EA]u)
  ≤ lim_{u→∞} P([C/EA − δ]u < N ≤ [C/EA + δ]u) / P(N > [C/EA]u)
  = lim_{u→∞} (f_1([C/EA − δ]u) − f_1([C/EA + δ]u)) / f_1([C/EA]u) = O(δ).   (20)
The result (16) now follows from (17)–(20) by letting δ ↓ 0. In the case C = EA the argument is similar, only we write P(N > u, R > EAu) = P(N > [C/EA + δ]u, R > Cu) + P(u < N ≤ [C/EA + δ]u, R > Cu). This completes the proof of the lemma.

Proof (of Lemma 3). It will be convenient to use the functions

g_1(t) := f_1^{-1}(tx)  and  g_2(t) := f_2^{-1}(ty) = f_1^{-1}(ty/K) = g_1(ty/(Kx)),   (21)

which, for fixed x, y > 0, are well-defined for all t > 0, due to the monotonicity of f_1, and hence also f_2. Applying the Potter bounds, see Resnick [10, p. 32], we obtain that for all A > 1, δ > 0 one can choose t sufficiently small such that

A^{-1} max{g_1(t)/g_2(t), g_2(t)/g_1(t)}^{-δ} ≤ L(g_1(t))/L(g_2(t)) ≤ A max{g_1(t)/g_2(t), g_2(t)/g_1(t)}^{δ},

which by (9) is the same as

A^{-1} max{g_1(t)/g_2(t), g_2(t)/g_1(t)}^{-δ} ≤ (Kx/y) (g_1(t)/g_2(t))^α ≤ A max{g_1(t)/g_2(t), g_2(t)/g_1(t)}^{δ}.
From the first inequality above we get

lim inf_{t↓0} A^{1/α} max{g_1(t)/g_2(t), g_2(t)/g_1(t)}^{δ/α} (Kx/y)^{1/α} g_1(t)/g_2(t) ≥ 1

for all A > 1, δ > 0. Taking A → 1 and δ ↓ 0 we obtain that

lim inf_{t↓0} (Kx/y)^{1/α} g_1(t)/g_2(t) ≥ 1.

Analogously, we can show that

lim sup_{t↓0} (Kx/y)^{1/α} g_1(t)/g_2(t) ≤ 1,
so that the limit of the left-hand side is 1. This implies the result, again by (9).
Proof (of Theorem 1). Since F̄_i(u) → 0 and |F̄_i(u) − f_i(u)| = o(F̄_i(u)), i = 1, 2, as u → ∞, for any small ε > 0 we can choose t_1 small enough so that for any t ≤ t_1 and u > 0 that satisfy F̄_1(u) ≤ tx we also have |F̄_1(u) − f_1(u)| ≤ ε|F̄_1(u)|, and hence |F̄_1(u) − f_1(u)| ≤ εtx. Moreover, we can choose t_2 ≤ t_1 small enough such that F̄_2(u) ≤ ty implies |F̄_2(u) − f_2(u)| ≤ εty for all t ≤ t_2. Also, for any small δ > 0 it follows from Lemma 3 that there exists a positive number t_3 ≤ t_2 such that for all t ≤ t_3,

1 − δ ≤ L(f_1^{-1}((1 + ε)tx)) / L(f_2^{-1}((1 + ε)ty)) ≤ 1 + δ.

If we now fix some small ε > 0 and δ > 0, the above implies for any t ≤ t_3 that

P(F̄_1(N) ≤ tx, F̄_2(R) ≤ ty)
  = P(f_1(N) ≤ (f_1(N) − F̄_1(N)) + tx, f_2(R) ≤ (f_2(R) − F̄_2(R)) + ty)
  ≤ P(f_1(N) ≤ (1 + ε)tx, f_2(R) ≤ (1 + ε)ty)
  = P(N ≥ f_1^{-1}((1 + ε)tx), R ≥ f_2^{-1}((1 + ε)ty))
  = P(N ≥ f_1^{-1}((1 + ε)tx), R ≥ [ (y/(Kx)) · L(f_1^{-1}((1 + ε)tx)) / L(f_2^{-1}((1 + ε)ty)) ]^{-1/α} f_1^{-1}((1 + ε)tx))
  ≤ P(N ≥ f_1^{-1}((1 + ε)tx), R ≥ [ (y/(Kx)) (1 + δ) ]^{-1/α} f_1^{-1}((1 + ε)tx)).

Note that the above closely follows the derivation in (8), with ∼ signs replaced by inequalities; in particular the fifth line follows immediately from (9) upon replacing t by (1 + ε)t. Noting that f_1(f_1^{-1}((1 + ε)tx)) = (1 + ε)tx, we can now apply Lemma 2 to the above and then let t → 0, to obtain

lim sup_{t→0} t^{-1} P(F̄_1(N) ≤ tx, F̄_2(R) ≤ ty) ≤ (1 + ε) min{x, (1 + δ)y(EA)^α/K}.

Similarly we can obtain

lim inf_{t→0} t^{-1} P(F̄_1(N) ≤ tx, F̄_2(R) ≤ ty) ≥ (1 − ε) min{x, (1 − δ)y(EA)^α/K},
so that the statement of the theorem follows by letting ε and δ go to 0.
Web Page Rank Prediction with PCA and EM Clustering

Polyxeni Zacharouli¹, Michalis Titsias², and Michalis Vazirgiannis¹

¹ Univ. of Economics and Business, Athens, Greece
{zaharouli06,mvazirg}@aueb.gr
² School of Computer Science, University of Manchester, UK
[email protected]
Abstract. In this paper we describe learning algorithms for Web page rank prediction. We consider linear regression models and combinations of regression with probabilistic clustering and Principal Components Analysis (PCA). These models are learned from time-series data sets and can predict the ranking of a set of Web pages in some future time. The first algorithm uses separate linear regression models. This is further extended by applying probabilistic clustering based on the EM algorithm. Clustering allows for the Web pages to be grouped together by fitting a mixture of regression models. A different method combines linear regression with PCA so as dependencies between different web pages can be exploited. All the methods are evaluated using real data sets obtained from Internet Archive, Wikipedia and Yahoo! ranking lists. We also study the temporal robustness of the prediction framework. Overall the system constitutes a set of tools for high accuracy pagerank prediction which can be used for efficient resource management by search engines.
1 Introduction

The ranking of query results in a Web search engine is an important problem and has attracted significant attention in the research community. In order to maintain timely and accurate Web page rankings, a huge crawling and indexing infrastructure is needed. On the other hand, continuous crawling of the Web is almost impossible due to its dynamic nature, the large number of Web pages and bandwidth constraints. The same holds for the indexing and page ranking process, since the computations needed are very expensive. Thus the need for rank prediction arises. In this paper we describe learning algorithms for Web page rank prediction. We consider linear regression models and combinations of regression with probabilistic clustering and Principal Components Analysis. These models are learned from time-series data sets and can predict the ranking of a set of Web pages at some future time. Each training data set corresponds to preprocessed rank values of the Web pages observed at previous time points. A first algorithm uses separate linear regression models, so that the ranking evolution process of each Web page is explained independently from the corresponding processes of other Web pages. This simple technique is then combined with probabilistic clustering based on the EM algorithm. Clustering allows for the Web pages to be grouped together by fitting a mixture of regression models. A third
method combines linear regression with dimensionality reduction by applying PCA. This PCA-based method can exploit dependencies between different Web pages, so that the predicted rank values of all Web pages are correlated with each other. All of the above mentioned methods are evaluated using real data sets obtained from Wikipedia, Internet Archive and Yahoo!. The remainder of the paper is organized as follows. Section 2 presents related work, while Section 3 discusses the data preprocessing of the rank values by using normalized ranking. Section 4 describes learning algorithms for predicting the Web page ranking, while Sections 5 and 6 present the evaluation measures and the experimental study. The paper concludes with a discussion in Section 7.
2 Related Work The problem of predicting PageRank is partly addressed in [9]. This work focuses on Web page classification based on URL features. Based on the proposed framework, the authors perform experiments trying to make PageRank predictions using the extracted features. For this purpose, they use linear regression; however, the complexity of this approach grows linearly in proportion to the number of features used. The experimental results show that PageRank prediction based on URL features does not perform very well, probably because even though these features correlate very well with the subject of pages, they do not influence the authority of the page in the same way.A recent approach towards page ranking prediction is presented in [1], generating Markov Models from historical ranked lists and using them for predictions. An approach that aims at approximating PageRank values without the need of performing the computations over the entire graph is that of Chien et al. [10]. The authors propose an algorithm to incrementally compute approximations to PageRank, based on the evolution of the link structure of the Web graph. Given a set of link changes, they identify a small portion of the Web graph in the vicinity of these changes, and model the rest of the Web as a single node in this small graph. Then they compute a version of PageRank on the reduced graph and transfer these results to the original graph. Their experiments demonstrate that the algorithm performs well both in speed and quality and is robust to various types of link modifications. This approach, however, requires the continuous monitoring of the Web graph in order to track any link modifications. There has also been work in adaptive computation of PageRank [11,14] or even estimation of PageRank scores [12]. In [16], Yang et al. propose a method called predictive ranking, aiming at estimating the Web structure based on the intuition that the crawling and consequently the ranking results are inaccurate (due to inadequate data and dangling pages). In this work, the authors do not make future rank predictions. Instead, they estimate the missing data in order to achieve more accurate rankings.
3 Normalized Ranking

In order to predict future rankings of Web pages, we need to define a measure that effectively expresses the trends of Web pages among different snapshots of the Web graph. We have adopted a measure (nrank), introduced in [2], suitable for measuring
page rank dynamics. We briefly present its design. Let G_{t_i} be the snapshot of the Web graph created by a crawl at time t_i and let n_{t_i} = |G_{t_i}| be the number of Web pages at time t_i. We define rank(p, t_i) as a function providing the ranking at time t_i of a Web page p ∈ G_{t_i} according to some criterion, for example PageRank values. Intuitively, an appropriate measure for Web page trends is the rank change rate between two snapshots, but as the size of the Web graph constantly increases, the trend measure should be comparable across different graph sizes. A way to deal with this problem is to consider the normalized rank (nrank) of a Web page. We impose that the nranks of all pages in a ranked list sum up to 1. If N is the normalizing factor and n are all the Web pages for which we have crawled the Web graph, then:

Σ_{i=1}^{n} nrank(p_i, t_i) = 1  ⇒  Σ_{i=1}^{n} i/N = 1  ⇒  n_{t_i}(n_{t_i} + 1)/(2N) = 1.

Thus, the nrank of a page p at rank(p, t_i) is:

nrank(p, t_i) = 2 · rank(p, t_i) / n_{t_i}^2.   (1)

(We assume here n_{t_i} ≫ 1 and thus do not distinguish between n_{t_i} and n_{t_i} + 1.) The nrank ranges between 2n_{t_i}^{-2} and 2n_{t_i}^{-1}.
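A one-line implementation of (1) makes the normalization explicit; the function name is ours.

```python
def nrank(rank, n):
    """Normalized rank of Eq. (1): nrank(p, t_i) = 2 * rank(p, t_i) / n_{t_i}^2."""
    return 2.0 * rank / n**2

# In a snapshot of 1000 crawled pages the top page gets 2e-6 and the last one 2e-3,
# so nranks from snapshots of different sizes are directly comparable.
print(nrank(1, 1000), nrank(1000, 1000))
```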
4 Methods In this section we describe algorithms for Web page rank prediction. Section 4.1 discusses an approach that uses a separate linear regression model for each web page. Section 4.2 combines linear regression with clustering based on an EM probabilistic framework. Section 4.3 considers a combination of PCA and linear regression. 4.1 Separate Linear Regression Models Assume a set of n Web pages for which we observe the nrank values at m times steps. Let xi = (xi1 , ..., xim ) be the nrank values for the ith Web page at the time points t = (t1 , ..., tm ). Further, we assume that the n × m design matrix X stores all the observed nrank values so as each row corresponds to a Web page and each column to a time point. Given these observations we wish to predict the nrank value xi∗for each Web page i at some time t∗. t∗ will typically correspond to a future time point, i.e. t∗> ti , with i = 1, ..., m. Next we discuss a simple prediction method based on linear regression where the input variable correspond to time and the response variable is the nrank value. For a certain Web page i we assume a linear regression model having the following form (2) xik = ai tk + bi + ǫ, k = 1, . . . , m, where ǫ denotes a zero-mean Gaussian noise. Note that the parameters (ai , bi ) are Webpage-specific. In other words, the above formulation defines a separate linear regression model for each Web page and thus Web pages are treated independently. The
specification of the values (ai , bi ) is achieved by solving a linear system using least squares. Prediction of the nrank value xi∗ at some future time t∗ is carried by evaluating Equation (2) and ignoring the ǫ noise term. The linear model can be easily extended to account for simple non-linearities. For instance, Equation (2) can be replaced by a quadratic function over tk or a higher order polynomial function. In the experiments we investigate a linear and a quadratic regression model. The complexity of the algorithm per Webpage-prediction is a multiplication of a 2×2 matrix with a 2-dimensional vector. If the time steps (t1 , . . . , tm ) are common for all Webpages, this matrix needs to be estimated once since it is common for all predictions. The above framework treats each Web page independently from the remaining Web pages by learning a separate linear model. This can be restrictive since any global similarities and dependencies that might exist between different Web pages are not taken into account. In the next two sections we generalize the above baseline method in two ways. In section 4.2, we cluster the linear regression models using the EM algorithm in order to capture similarities between Web pages, while in section 4.3 we apply PCA in order to model inter-dependencies among the different Web pages. 4.2 Clustering Using EM We assume that the nrank values of each Web page follow one of J different types or clusters. The number of clusters J can be much smaller than the number n of Web pages. Clustering can be viewed as training a mixture probability model. To generate the nrank values xi for the ith JWeb page, we first select the cluster type j with probability πj (where πj ≥ 0 and j=1 πj = 1) and then produce the values xi according to a linear regression model: xik = aj tk + bj + ǫk , k = 1, . . . , m
(3)
where ε_k is independent Gaussian noise with zero mean and variance σ_j^2. The above formulation implies that, given the cluster type j, the nrank values are drawn from the following product of Gaussians: p(x_i|j) = Π_{k=1}^{m} N(x_{ik} | a_j t_k + b_j, σ_j^2). The cluster type that generated the nrank values of a certain Web page is an unobserved variable and thus after marginalization we obtain a mixture unconditional density for the observation vector x_i: p(x_i) = Σ_{j=1}^{J} π_j p(x_i|j). To train the mixture model and estimate the parameters θ = (π_j, σ_j^2, a_j, b_j)_{j=1}^{J}, we can maximize the log likelihood of the data L(θ) = log Π_{n=1}^{N} p(x_n) by using the EM algorithm [4]. Given an initial state for the parameters, EM optimizes over θ by iterating between E and M steps. Once we have obtained suitable values for the parameters, we can use the mixture model for prediction. Particularly, to predict the nrank value x_{i*} of the ith Web page at time t_* given the observed values x_i = (x_{i1}, ..., x_{im}) at previous times, we express the posterior distribution p(x_{i*}|x_i) using the Bayes rule:

p(x_{i*}|x_i) = Σ_{j=1}^{J} R_{ji} N(x_{i*} | a_j t_* + b_j, σ_j^2),   (4)
where R_{ji} is computed according to

R_{ji} = π_j p(x_i|j) / Σ_{ρ=1}^{J} π_ρ p(x_i|ρ)   (5)
for j = 1, ..., J and i = 1, ..., N .To obtain a specific predictive Jvalue for xi∗ we can use the mean value of the above posterior distribution xi∗ = j=1 Rji (aj t∗ + bj ) or the median estimate xi∗ = aj t∗ + bj where j = arg maxρ Rρj that considers a hard assignment of the Web page into one of the J clusters. 4.3 PCA and Regression Although the EM approach models global similarities between different Web pages by grouping them into clusters, the dependencies among different Web pages are not well modeled. To see this, note that the likelihood that generates the data factorizes across all Web pages. Dependencies between Web pages can naturally exist in the time-evolution ranking process. This is because the popularity of a certain Web page can affect the popularity of other Web pages. In this section we discuss a prediction method that takes into account such dependencies by applying dimensionality reduction. The n × m design matrix X of the observed nrank values can be viewed as a timesseries instantiation of the large column vector of n measurements corresponding to all Web pages. Based on this, we can consider the m columns of X = [X1 , ..., Xm ], where each Xk is a n-dimensional vector, as the observed data vectors and apply a dimensionality reduction technique. The promise of this formulation is that it allows us to model the correlation structure in the elements of Xk that reflects the possible interdependencies among Web pages. Note that this modeling view is rather different to that used in section 4.2 (and 4.1) where we considered a probability model to generate each row of X. Next we use PCA as the dimensionality reduction method and combine it with linear regression. PCA is a well-established technique for data processing [5]. Although there are many ways to motivate the PCA framework, here we focus on the probabilistic view that is widely used in statistical machine learning [3,6]. Each n-dimensional vector Xk of nrank values at time tk is generated according to a latent variable model so as: Xk = Azk + μ + ǫ,
(6)
where the n × q matrix A relates the q-dimensional latent vector zk with the observed high dimensional vector Xk and μ is the mean of Xk s. The noise vector ǫ is drawn from a zero-mean isotropic Gaussian, i.e. ǫ ∼ N (0, σ 2 In ), while the latent vector zk is drawn from N (0, Iq ). From Equation (6), we can easily marginalize out the latent variable zk and obtain the unconditional probability distribution of Xk : p(Xk ) = N (μ, AAT + σ 2 I). The maximization of the likelihood with respect to the transformation matrix A gives rise to PCA [3], so as the columns of A will be equal to the q largest eigenvectors of the empirical covariance matrix. We now discuss how we adapt the PCA framework to Web page rank prediction. The probability distribution in the previous equationis a Gaussian with full covariance matrix AAT + σ 2 I which implies that the correlation of different Web pages in the nrank
vector X_k is modeled. Our objective is to learn how to predict X_* at future time t_*. We follow a two-stage learning process. First, we use the matrix X to find the q principal components A and we form the low dimensional projection of the data according to z_k = A^T(X_k − μ), for k = 1, ..., m. Then, we train q independent linear regression models in the low dimensional space similarly to section 4.1. More specifically, assuming that Z is a q × m matrix that collects together all the latent vectors, each linear regression model is constructed for any row of Z:

z_{jk} = a_j t_k + b_j + ε,  k = 1, ..., m,   (7)
and j = 1, ..., q. The parameters (aj , bj ) for each j are obtained by solving a linear system using least squares. To predict the nrank values X∗ , we firstly estimate the latent vector z∗ at time t∗ by using Equation (7) and then we predict X∗ according to the expression X∗ = Az∗ + μ. Similarly to the separate regression models described in section 4.1 we can slightly modify the above method by using a higher order polynomial function in Equation (7) instead of the linear model. The linear and the quadratic functions have been used in the experiments.
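The two prediction routes of Sections 4.1 and 4.3 can be sketched compactly with ordinary least squares and an SVD-based PCA, as below; the synthetic data, array shapes and helper names are our own illustrative assumptions rather than the authors' implementation, and the EM-clustered variant of Section 4.2 is omitted for brevity.

```python
import numpy as np

def fit_line(t, y):
    """Least-squares fit y ≈ a*t + b; returns (a, b).  (Section 4.1, Eq. (2).)"""
    A = np.vstack([t, np.ones_like(t)]).T
    a, b = np.linalg.lstsq(A, y, rcond=None)[0]
    return a, b

def predict_separate(X, t, t_star):
    """One linear regression per Web page (row of X); predict all nranks at t_star."""
    return np.array([a * t_star + b for a, b in (fit_line(t, row) for row in X)])

def predict_pca(X, t, t_star, q=2):
    """PCA route of Section 4.3: project the m column vectors X_k on q principal
    components, fit one regression per latent coordinate, reconstruct X_*."""
    mu = X.mean(axis=1, keepdims=True)           # mean of the X_k's
    Xc = X - mu
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    A = U[:, :q]                                 # q leading principal directions
    Z = A.T @ Xc                                 # q x m latent trajectories
    z_star = np.array([a * t_star + b for a, b in (fit_line(t, zrow) for zrow in Z)])
    return A @ z_star + mu.ravel()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(10.0)                          # m = 10 observed snapshots
    X = np.abs(rng.normal(1e-4, 2e-5, (50, 1)) + rng.normal(0, 1e-6, (50, 10)).cumsum(axis=1))
    print(predict_separate(X, t, 11.0)[:3])
    print(predict_pca(X, t, 11.0, q=3)[:3])
```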
5 Top-k Lists Similarity Measures

In order to evaluate the quality of predictions we need to measure the similarity of the predicted to the actual top-k rankings. For this purpose, we employ measures commonly used for comparing rankings. The first one, denoted as OSim(A, B) [13], indicates the degree of overlap between the top-k elements of two sets A and B (each one of size k):

OSim(A, B) = |A ∩ B| / k   (8)
The second, KSim(A, B) [13], is based on Kendall's distance measure [15] and indicates the degree to which the relative orderings of two top-k lists are in agreement:

KSim(A, B) = |{(u, v) : A′, B′ agree on order}| / (|A ∪ B|(|A ∪ B| − 1))   (9)
where A′ is an extension of A resulting from appending at its tail the elements x ∈ (A ∪ B) − A. The added elements are shuffled such that they do not preserve their original relative ordering in list B (B′ is defined analogously). In other words, KSim(A, B) is the probability that A′ and B′ agree on the relative ordering of a randomly selected pair of distinct nodes (u, v) ∈ (A ∪ B) × (A ∪ B). OSim indicates the concurrence of predicted pages with the actual visited ones, but does not take into consideration the ranks of the pages in the final rankings. KSim takes into consideration the common items of the two lists, and is proportional to the number of pairs that have the same relative ordering in both lists.
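A direct implementation of (8) and (9) is straightforward; the only delicate point is the order in which the missing elements are appended to A′ and B′, and the reversed-order choice below is one arbitrary way of not preserving their original ordering.

```python
def osim(A, B):
    """OSim (Eq. (8)): overlap of two top-k lists of equal size k."""
    return len(set(A) & set(B)) / len(A)

def ksim(A, B):
    """KSim (Eq. (9)): fraction of ordered pairs on which the extended lists agree."""
    A_ext = list(A) + [x for x in reversed(B) if x not in set(A)]
    B_ext = list(B) + [x for x in reversed(A) if x not in set(B)]
    pos_a = {x: i for i, x in enumerate(A_ext)}
    pos_b = {x: i for i, x in enumerate(B_ext)}
    union = list(set(A) | set(B))
    agree = 0
    for u in union:
        for v in union:
            if u != v and (pos_a[u] < pos_a[v]) == (pos_b[u] < pos_b[v]):
                agree += 1
    return agree / (len(union) * (len(union) - 1))

# Identical lists give OSim = KSim = 1; disjoint lists give OSim = 0.
print(osim([1, 2, 3, 4], [2, 1, 3, 5]), round(ksim([1, 2, 3, 4], [2, 1, 3, 5]), 3))
```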
6 Experiments In order to evaluate the effectiveness of our approach, we performed experiments on three different real-world datasets. The first is a subset of the European Archive1, consisting of 22 Web graph snapshots of UK government Websites. The second one is the Web graph of the English version of the Wikipedia encyclopedia2. The third one is a collection of top-k ranked lists for 22 queries over a period of 37 days, as they result from Yahoo! search engine3 . In our experiments, we evaluate the prediction quality in terms of similarities between the predicted top-k ranked lists and the actual ones using the OSim and KSim similarity measures. All the experiments were run on low-end commodity hardware (3 GHz Pentium PCs with 2 GB of memory and local IDE disk, running either Windows XP or Linux). 6.1 Datasets Description and Preprocessing For each dataset (Internet Archive, Wikipedia and Yahoo!), a wealth of snapshots were available, ensuring that we have enough evolution to test our approach. A concise description of each dataset follows. We also present in detail our query-based approach on Wikipedia and Yahoo!. The Internet Archive dataset comprises of approximately 500, 000 pages, referring to weekly collections of eleven UK government Websites. We obtained 22 graph snapshots evenly distributed in time between March 2004 and January 2006. The Wikipedia dataset consists of 55 consecutive monthly snapshots of the English version of Wikipedia, ranging from January 2002 to July 2006. It represents a dynamic large-scale collection (grew from 4,593 to 1,467,982 articles between the first and last snapshots), frequently used as a benchmark dataset 4 . The snapshots were extracted from a database dump containing the entire history of the encyclopedia, which can be found at http://download.wikipedia.org/. We use the Wikipedia dataset and its evolution features to verify the validity of our prediction framework on query-based topk results. This is an intuitive choice as this is a natural way of producing top-k results on the Web. Thus we maintain for each selected query (selection criteria follow) the topk results, for each of the 55 monthly snapshots. For some of the queries we have less top-k lists, some query terms do not yield any results for some of the initial snapshots. The Yahoo! dataset consists of 37 consecutive daily top-1000 ranked lists as they are computed using the Yahoo! Search Web Services5. We have collected such top-k rankings for 22 queries. There is an overlap between this query set and the Wikipedia one. Ranking Function for Querying Wikipedia. Regarding the ranking function in the Wikipedia top-k lists, we take into account two features: a) the PageRank value of the page containing the term and b) the tf-idf value of the term for the specific page. Thus, given a term t, the score of document d is computed as follows: 1 2 3 4
1 http://www.europarchive.org/ukgov.php
2 http://en.wikipedia.org/
3 http://search.yahoo.com/
4 http://en.wikipedia.org/wiki/Wikipedia:Wikipedia in academic studies is a regularly updated list of relevant works
5 http://developer.yahoo.com/search/
score_t(d) = (tf-idf(t, d) · title(t, d))^{1.5} · pr(d)   (10)
with title(t, d) = 2 if t appears in the title of d, 1 otherwise, and pr(d) the graph-based PageRank score for the page d. The product of tf-idf and title is raised to the power 1.5 to increase the importance of high tf-idf values and decrease the importance of low tf-idf values, thus implicitly applying a smooth thresholding strategy. The choice of the aforementioned score function and parameter values was based on both an anecdotal evaluation of the resulting top-k lists and a comparison to the respective top-k lists produced by querying Wikipedia with Google, thus implicitly assuming Google's ranking as a reference. Query results were reasonably comparable to Google's ranking, and it is not the intent of this work to devise a very elaborate scoring function.

Query Selection for Wikipedia and Yahoo! In the case of Wikipedia, we classify query terms (and therefore the respective top-k lists) based on their frequency but also on their dynamism. Frequency is measured by the proportion of the documents containing a given term. Another property we are interested in is the dynamism of the query, in terms of the rate of increase (or decrease) in the number of documents that contain the query term as time progresses. We define these features as follows. Let W = {w_i} be the set of Wikipedia snapshots, ordered in time (i.e., w_{i+1} is the immediate successor of w_i). Let TF_{w_i}(t_j) be the term frequency of term t_j in snapshot w_i (i.e., the number of documents t_j appears in, in snapshot w_i). For each w_i in W and each t_j in w_i we compute TF_{w_i}(t_j). Then we define the dynamism rate of the term for the snapshot w_i as:

SD_{w_i}(t_j) = (TF_{w_i}(t_j) − TF_{w_{i−1}}(t_j)) / TF_{w_i}(t_j)   (11)

and its aggregate value over the snapshot series: TSD(t_j) = Σ_i SD_{w_i}(t_j). Then we compute for each term its average dynamism: D(t_j) = TSD(t_j)/(|W| − 1), where |W| represents the number of snapshots the term is present in.
In the case of the Yahoo! dataset, since we have no access to the same information as for Wikipedia, we picked a) random queries from the Wikipedia dataset, b) popular queries that appeared in Google Trends (http://www.google.com/trends) and c) popular current queries (such as "euro 2008" or "olympic games 2008"). All queries used for Wikipedia and Yahoo! can be found at: www.db-net.aueb.gr/michalis/WAW2009/queries.rar

6.2 Experimental Methodology

We compare all predictions to a baseline scheme, called Static, which just returns the top-k list of the previous snapshot (i.e., we consider that the pages remained at the same rank). The steps we followed for all three datasets are described below. We first computed PageRank scores for each snapshot of our datasets. In order to obtain the global top-k rankings in the case of the Internet Archive, we ordered the pages by PageRank score for each snapshot in descending order. In a similar way, we obtained the Wikipedia top-k rankings using the scoring function mentioned before. After computing the PageRank scores, we calculated the nrank values for each pair of consecutive graph snapshots and stored them in an nrank × time matrix (where rows represent the different Web pages
and columns all available snapshots). Assuming an m-path of consecutive snapshots, we predict state m + 1. Then for each page p we predict a ranking which is compared to its actual ranking using a 10-fold cross-validation process: we use 90% of the dataset for training, and the learned models are tested on the remaining 10%. This process is repeated 10 times, each time with a different 10% test fold. In the case of the EM approach, we tested the quality of the clustering results for cluster cardinalities between 2 and 10 for each query, and chose the one that maximized the overall quality of the clustering. The overall quality of the clustering (or score function) was defined as a monotone combination of the within-cluster variation and the between-cluster variation. A simple measure of within-cluster variation is the sum of squares of distances from each point to the center of the cluster it belongs to:

wc = Σ_{k=1}^{J} Σ_{x∈cluster_k} d(x, r_k)   (12)
where r_k is the center of cluster k, J is the total number of clusters (where 2 ≤ J ≤ 10) and d(x, r_k) is the simple Euclidean distance. Between-cluster variation can be measured by the distance between cluster centers:

bc = Σ_{1≤j<k≤J} d(r_j, r_k)   (13)
The score function of the clustering was then defined as the ratio bc/wc. In the case of the PCA method combined with regression, we projected the data onto the space formed by the k principal eigenvectors in such a way as to capture 85% of the variance in the original data. The variance of the projected data can be expressed as Σ_{j=1}^{k} λ_j, where λ_j is the j-th eigenvalue. Equivalently, the squared error in approximating the true data matrix using only the first k eigenvectors (where the total number of eigenvectors is p) can be expressed as Σ_{j=k+1}^{p} λ_j / Σ_{l=1}^{p} λ_l.
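The component-selection step described above can be sketched as follows. This is an illustrative Python snippet under our own assumptions (synthetic data, scikit-learn's PCA and LinearRegression as stand-ins), not the authors' implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((500, 20))   # hypothetical nrank-by-time training matrix (pages x snapshots)
y = rng.random(500)         # hypothetical target: nrank at the next snapshot

explained = np.cumsum(PCA().fit(X).explained_variance_ratio_)
k = int(np.searchsorted(explained, 0.85)) + 1   # smallest k capturing >= 85% of the variance

Z = PCA(n_components=k).fit_transform(X)        # projection onto the k principal eigenvectors
model = LinearRegression().fit(Z, y)            # regression in the reduced space
print(k, model.score(Z, y))
```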
6.3 Experimental Results

In this section we evaluate the prediction performance of our framework on all three datasets: Internet Archive, Wikipedia and Yahoo!. Assuming that we know the Web pages' nrank values for n time steps, we use the first n − 3 time steps for training the prediction models and the static baseline, while the last three actual nrank values are used for comparison with the predicted ones. The results of our experiments are described in more detail in the following subsections.

Internet Archive. In Fig. 1 we report results for the Internet Archive dataset. The EM approach achieves the best performance for small values of k (k < 30), achieving a prediction accuracy of 0.65−0.70. Then its performance degrades dramatically as the top-k list's size grows. The static approach performs very well for small k values, then it degrades and increases again to a 0.6 accuracy figure. The static baseline is a good choice for this dataset due to its relative lack of dynamism (governmental sites change less frequently than commercial ones). The best performing method for small k values (i.e., for the most meaningful rankings) is the 2nd-order regression, which achieves a 0.7 to 0.8 prediction accuracy for OSim and for k in [20, 50]. Then it slightly degrades and increases again to reach an outstanding 0.86 accuracy for larger values of k. The plain regression technique performs really badly for small k values, while its performance increases rapidly with k and surpasses that of all other approaches for k ≥ 80. On the other hand, the PCA combined with regression methods perform consistently worse than all the others for most k values. This performance pattern persists for the KSim measure, although the accuracy figures are somewhat smaller.

Fig. 1. Prediction accuracy vs top-k list length: Internet Archive dataset ((a) OSim, (b) KSim)
Fig. 2. Prediction accuracy vs top-k list length: Wikipedia dataset ((a) OSim, (b) KSim)
Wikipedia. We conducted a large series of experiments for the English Wikipedia dataset, using a one-month time interval between the 55 successive snapshots. We show in Fig. 2 the performance of the prediction schemes for different values of k, for all employed similarity measures (OSim, KSim). The query load we used consists of 112 queries. The accuracy appearing in the figure is the average value over all 112 queries, using 10-fold cross-validation, thus ensuring robustness and reliability of the results. All approaches perform badly for small values of k (k < 50). Interestingly, the static approach has the best (although objectively bad) performance for k ≤ 20. For larger k's all approaches improve drastically and outperform the static one, with linear and 2nd-order regression being the best approaches for this dataset, reaching a prediction accuracy of 0.75. The PCA and EM approaches perform clearly worse than the plain regression ones.
Yahoo! In Fig. 3 we report the prediction results for the Yahoo! dataset. We consider these results as the most valuable, since the data represent real-world rankings for popular queries. Even though we do not have access to the ranking algorithm, our task is to learn predictors for top-k lists. Also, we have to stress the top-k lists' time interval, which is very short for Yahoo!, as data are collected daily. The results are very encouraging with regard to the predictions. The EM approach excels for small values of k (i.e., the most interesting ones for real users), achieving a prediction accuracy of 0.8−0.85 depending on the similarity measure. The performance slightly degrades as the top-k list's size grows. On the other hand, the PCA combined with regression approach performs very well for small k values (although a bit worse than EM), while its performance increases slightly and outperforms the EM approach for larger k values. The plain regression techniques perform really badly for small k values, while their performance increases rapidly with k and surpasses that of all other approaches for k ≥ 50 (100) for the OSim (KSim) similarity measure, reaching an outstanding 90% prediction accuracy.
Fig. 3. Prediction accuracy vs top-k list length: Yahoo! dataset ((a) OSim, (b) KSim)
7 Conclusions

The dynamics of the Web provide a fascinating domain of study for researchers from both academic and commercial fields. Searching the Web inherently involves the ranking issue. In this paper, we proposed learning algorithms for Web page rank prediction based on combinations of linear regression, clustering and PCA. We capitalized on previous Web page rankings and built a family of predictors. We conducted extensive experiments on large-scale real Web data (Internet Archive), as well as on query-based data from the English Wikipedia and Yahoo! query results. We studied the prediction performance for various query types and query loads; the results are statistically robust as they come from a 10-fold cross-validation process. The results are overall encouraging, achieving high prediction accuracy for all similarity measures across all datasets used.

Acknowledgements. We warmly thank Klaus Berberich, Pierre Senellart and Thimios Kostakis for their suggestions and help at previous stages of this work.
Permuting Web Graphs∗ Paolo Boldi, Massimo Santini, and Sebastiano Vigna Dipartimento di Scienze dell’Informazione, Università degli Studi di Milano, Italy
Abstract. Since the first investigations on web graph compression, it has been clear that the ordering of the nodes of the graph has a fundamental influence on the compression rate (usually expressed as the number of bits per link). The authors of the LINK database [1], for instance, investigated three different approaches: an extrinsic ordering (URL ordering) and two intrinsic (or coordinate-free) orderings based on the rows of the adjacency matrix (lexicographic and Gray code); they concluded that URL ordering has many advantages in spite of a small penalty in compression. In this paper we approach this issue in a more systematic way, testing some old orderings and proposing some new ones. Our experiments are made in the WebGraph framework [2], and show that the compression technique and the structure of the graph can produce significantly different results. In particular, we show that for the transposed web graph URL ordering is significantly less effective, and that some new orderings combining host information and Gray/lexicographic orderings outperform all previous methods. In particular, in some large transposed graphs they yield the quite incredible compression rate of 1 bit per link.
1 Introduction

The web graph [3] is a directed graph whose nodes correspond to URLs, with an arc from x to y whenever the page denoted by x contains a hyperlink toward the page denoted by y; more loosely, the same term is sometimes used for the undirected version of the graph, when arc direction is not relevant. Web graphs are a huge source of information, and contain precious data that find applications in ranking, community discovery, and more. In many cases, results obtained and techniques applied for the web graph are also appropriate for the larger realm of social networks, of which the web graph is only a special case. One nontrivial practical issue when dealing with web graphs is their size: a typical web graph contains millions, sometimes billions, of nodes and, although sparse, its adjacency matrix is way too big to fit in memory, even on large computers. To overcome this technical difficulty, one can access the graph in an offline fashion, which however requires designing special offline algorithms even for the most basic problems (e.g., finding connected components or computing shortest paths); alternatively, one can try to compress the adjacency matrix so that it can be loaded into memory and still be directly
∗ This work is partially supported by the EC Project DELIS, by MIUR PRIN Project "Automi e linguaggi formali: aspetti matematici e applicativi", and by MIUR PRIN Project "Web Ram: web retrieval and mining".
accessed without decompressing it (or, decompressing it only partially, on-demand, and efficiently). The latter approach, which can be referred to as web graph compression, can be traced back to the Connectivity Server [4] and to the LINK database [1]; more recently, it led to the development of the WebGraph framework [2], which still provides the best practical compression techniques. Most web graph compression algorithms strongly rely on properties that are satisfied by typical web graphs; in particular, the key properties that are exploited to compress the graph adjacency structure are locality and similarity. Locality means that most links from page x point to pages of the same host as x (and often share with x a long path prefix); similarity means that pages from the same host tend to have many links in common (this property tends to become more and more frequent with the widespread use of templates and generated content). The fact that most compression algorithms make use of these (and similar) properties explains why such algorithms are so sensitive to the way nodes are ordered. A technique that works incredibly well, and was adopted already in the Connectivity Server [4], consists in sorting nodes lexicographically by URL (node 0 is the one that corresponds to the lexicographically first URL, and so on). In this way, by locality, successor lists contain URLs that are assigned close numbers, and gap encoding^1, a standard technique borrowed from inverted-index construction, makes it possible to store each successor using a small number of bits. This solution is usually considered good enough for all practical purposes, and has the extra advantage that even the URL list can be compressed very efficiently via prefix omission. Analogous techniques, which use additional information beside the web graph itself, are called extrinsic. It is natural to wonder if there is an alternative way of finding a "good ordering" of the nodes that allows one to obtain the same (or maybe better) compression rate without having to rely on URLs; this is especially urgent for social network graphs, where nodes do not themselves correspond to URLs.^2 A general way to approach this problem is the following: take the graph with some ordering of its n nodes, and let A be the corresponding adjacency matrix; now, based on A, find some permutation π of its rows and columns that, when applied to A, produces a new matrix A′ in which two rows are similar (i.e., they contain 1's more or less in the same positions) iff they are close to each other (i.e., appear consecutively, or almost consecutively). Finding a good permutation π is an interesting problem in its own right; in [1], the authors propose to choose the permutation π that would sort the rows of A in lexicographic ordering. This is an instance of a more general approach: fix some total ordering ≺ on the set of n-bit vectors (e.g., the lexicographic ordering), and let π be the permutation^3 that sorts the rows of A according to ≺. Observe that the rows of A′ are not ≺-ordered, because the permutation π is applied to both rows and columns.
1 Instead of storing x_0, x_1, x_2, ..., we store, using a variable-length bit encoding, x_0, x_1 − x_0, x_2 − x_1, ....
2 We note that the same approach has been shown to be fruitful in the compression of inverted indices; see, for instance, [5,6,7,8].
3 In this description we are ignoring the problem that π is not unique if A contains the same row many times.
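As a small illustration of the gap encoding recalled in footnote 1 (a hedged Python sketch, not WebGraph's actual code), a sorted successor list is stored as its first element followed by the differences between consecutive elements, which locality keeps small and therefore cheap to represent with a variable-length code:

```python
def to_gaps(successors):
    # sorted successor list -> first element followed by consecutive differences
    return [successors[0]] + [b - a for a, b in zip(successors, successors[1:])]

def from_gaps(gaps):
    out, acc = [], 0
    for g in gaps:
        acc += g
        out.append(acc)
    return out

succ = [1000, 1002, 1003, 1050, 1051]
print(to_gaps(succ))                      # [1000, 2, 1, 47, 1]
assert from_gaps(to_gaps(succ)) == succ
```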
These approaches are called intrinsic, or coordinate-free, as they do not rely on external information (such as the URLs of each node).4 Another possible solution to the same problem, already briefly mentioned in [1], consists in letting ≺ be a Gray ordering, i.e., an ordering where adjacent vectors differ by exactly one bit. Although this solution may sound promising, one should carefully determine an efficient algorithm that finds the sorting permutation π with respect to (some) Gray ordering. In this paper we explore experimentally, using the WebGraph framework, the improvements in compression due to permutations. Besides the classical permutations described above, we propose two new permutations based on the Gray ordering: however, we restrict the permutation to rows of the same host. Moreover, for the first time we provide experimental data on the transposed graph, showing that coordinate-free permutations provide a dramatic increase in compression, contrarily to what happens in the standard case.
2 Notation and Gray Code Basics

Von Neumann's notation. In the following we will adopt von Neumann's definition of natural numbers, x = {0, 1, ..., x − 1}, which leads to a simple notation for sets of integers. We allow some ambiguity when writing exponentials: 2^n denotes the vectors of n bits or, equivalently, the power set of n (interpreting the vectors as characteristic functions n → 2). Since 2^n ambiguously denotes also the set X = {0, 1, ..., 2^n − 1}, we assume the natural correspondence between the latter set in increasing order and the strings of n bits in lexicographic ordering. The mapping from strings to X is obviously given by base-2 evaluation. In the following, if x ∈ 2^n, we use x_0, x_1, ... to denote the bits of its binary expansion (x_0 being its least-significant bit). In other words, interpreting x as a characteristic function n → 2, we let x_k = x(k).

Gray codes and Gray orderings. An n-bit Gray code is an arrangement (i.e., a total ordering) of 2^n such that any two successive vectors^5 differ by exactly one bit; Gray codes, named after the physicist Frank Gray, find countless applications in computer science, physics and mathematics (we refer the interested reader to [9] for more information on this topic). The ordering imposed by a Gray code on 2^n is called a Gray ordering. Even though there are many Gray codes, and thus many Gray orderings, one that is very simple to describe is the following one.
4 Of course, it is possible to devise coordinate-free methods that do not necessarily depend on some ordering; see, for instance, [5].
5 We say that a′ is the successor of a with respect to the ordering < iff a < a′ and a ≤ b ≤ a′ implies either a = b or b = a′.
Lemma 1. For every x ∈ 2^n, let x̄ ∈ 2^n be defined recursively by
x̄_{n−1} = x_{n−1},
x̄_k = x_{k+1} ⊕ x_k,
where ⊕ is the exclusive or. Then x → x̄ is a bijection, and the ordering ≺ defined by x̄ ≺ ȳ iff x < y is a Gray ordering (the natural Gray ordering).

Table 1. The natural Gray code on 3 bits: in the right-hand column, any two successive vectors differ in exactly one bit

x     x̄
000   000
001   001
010   011
011   010
100   110
101   111
110   101
111   100
Table 1 shows the natural Gray code on 3 bits. Now, let us write x → x̂ for the inverse of x → x̄; starting from the definition given in Lemma 1 one can easily see that x̂ can be recursively defined as follows:
x̂_{n−1} = x_{n−1},
x̂_k = x_k ⊕ x̂_{k+1}.
This observation gives a nice and simple way to compute x̂ from x: indeed, let x↓k be the number of 1's in x preceding position k (i.e., the number of 1's in bits that are not less significant than the k-th). Then
x̂_k = x↓k mod 2.   (1)
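A minimal illustrative sketch in Python (not from the paper), with integers standing for the bit vectors of 2^n and bit 0 as the least-significant bit: it implements the map x → x̄ of Lemma 1, its inverse via the prefix-parity formula (1), and reproduces Table 1 for n = 3.

```python
def to_gray(x: int) -> int:
    # bar{x}_k = x_{k+1} XOR x_k, i.e. x XOR (x >> 1)
    return x ^ (x >> 1)

def from_gray(xbar: int) -> int:
    # hat{x}_k = (number of 1's of xbar in positions >= k) mod 2, cf. formula (1)
    x = 0
    while xbar:
        x ^= xbar
        xbar >>= 1
    return x

for x in range(8):                 # reproduce Table 1 and check the two maps are inverse
    g = to_gray(x)
    assert from_gray(g) == x
    print(f"{x:03b} -> {g:03b}")
```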
3 On Gray Ordering and Graphs

An obvious application of Gray ordering to graphs is that of permuting node labels so that the resulting adjacency matrix changes "slowly" from row to row. Indeed, intuitively, if we permute the rows of the adjacency matrix following a Gray ordering, rows with a small number of changes should appear nearby. (This intuition is only partially justified: e.g., in Table 1 the first and last word of the second column differ by just one bit, yet they are as far apart as possible.) Of course, to maintain the graph correctly we also need to permute columns in the same way: but this process will not change the number of differences between adjacent rows. Indeed, already some of the first investigations of web graph compression experimented with Gray orderings [1]^6. However, the authors reported a very small improvement in compression with respect to URL ordering; this fact, coupled with the obvious
6 Note that in the paper the codes are incorrectly spelt as "Grey".
advantages of the latter, pushed the authors to discard Gray ordering altogether. The main question we try to answer in this paper is whether this small difference is an absolute property or an artifact strongly depending on the compression algorithms used, and whether it also applies to transposed graphs. To this purpose, we first develop a very simple algorithm that makes it possible to decide the Gray code ordering by inspecting the successor lists in parallel. When manipulating web graphs using the WebGraph framework, successors are returned under the form of an iterator providing an increasing sequence of integers. This makes it possible to compare the position in the Gray ordering of two rows of the adjacency matrix by iterating in parallel over the adjacency lists of two nodes. While the lists coincide, we skip, and keep a variable recording the parity of the number of arcs seen so far (note that the value of formula (1) depends only on the parity of x↓k). As soon as the lists differ, we can use formula (1) to compute the first different bit of the ranks of the adjacency rows in the Gray ordering: assuming the first list returns j and the second list returns k (for the sake of simplicity, we assume that the end-of-list is marked by ∞), we have the following scenario:
– if the parity is odd, the order of the lists is the order of j and k;
– if the parity is even, the order of the lists is the order of j and k reversed.
This can be easily seen as j < k implies that the first difference in the rows of the adjacency matrix is at position j, where the first list has a one whereas the second list has a zero. If the parity is odd, this means that the rank of the first list has a zero in position j, whereas the rank of the second list has a one. The situation is reversed if the parity is even. Algorithm 1 describes this process formally. Once this simple consideration is made, it is trivial to implement Gray code (or lexicographic) graph permutation using WebGraph's facilities. The idea is that of using a standard exchange-based sorting algorithm with a lazy comparator based on the considerations above.

Algorithm 1. The procedure for deciding the Gray natural order of two rows of the adjacency matrix, represented by means of iterators i and j that return the position of the next nonzero element. Note that end-of-list is denoted by ∞ and that we use Iverson's notation: [a < b] has value one if a < b, zero otherwise. The meaning of the return value is the same as that of the C standard function strcmp.

p ← false;
forever begin
    a ← next(i);
    b ← next(j);
    if a = ∞ and b = ∞ then return 0;
    if a ≠ b then begin
        if p ⊕ [a < b] then return 1 else return −1
    end;
    p ← ¬p
end;
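The same lazy comparator can be plugged directly into a standard sort. The following is an illustrative Python transcription (the paper's actual implementation is in Java on top of WebGraph; the toy graph and names below are our own assumptions).

```python
from functools import cmp_to_key

INF = float("inf")

def gray_compare(row_a, row_b):
    """Compare two sorted successor lists in natural Gray order, as in Algorithm 1."""
    it_a, it_b = iter(row_a), iter(row_b)
    parity = False                      # parity of the number of common arcs seen so far
    while True:
        a = next(it_a, INF)
        b = next(it_b, INF)
        if a == INF and b == INF:
            return 0
        if a != b:
            return 1 if parity ^ (a < b) else -1   # [a < b] is Iverson's bracket
        parity = not parity

# Toy graph: node -> sorted successor list.
graph = {0: [1, 2], 1: [0, 2], 2: [2], 3: []}
perm = sorted(graph, key=cmp_to_key(lambda u, v: gray_compare(graph[u], graph[v])))
print(perm)   # node identifiers listed in natural Gray order of their adjacency rows
```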
As a result, we can compute the Gray permutation of the uk graph (see Table 2) in about one hour on an Opteron at 2.8 GHz. Note that from a complexity viewpoint this approach is far from optimal. Indeed, a simple way to permute words in Gray code ordering is to apply a modified radix sort in which, at each recursive call, we have a parity bit that tells us whether 0 ≺ 1 or 1 ≺ 0. We apply a standard radix sort algorithm, dividing words into two blocks depending on the first bit, and then recurse on each block: however, when we recurse on the block of words starting with one, we invert the parity bit. Now, this approach is theoretically optimal if measured against the space occupancy of the adjacency matrix. However, the adjacency matrix of a web graph is very sparse, and never represented explicitly. Alternatively, we could develop a radix sort that lazily picks up successors from successor lists (for all nodes) and deduces implicitly the zeroes and the ones of the adjacency matrix. Albeit in principle such an algorithm would iterate optimally (i.e., it would extract from each iterator the minimum number of elements that are necessary to compute the ordering), it would require building the iterators for all nodes at the same time, a task that would require a preposterous amount of core memory.
4 Experiments

We ran a number of experiments on web graphs of different sizes (ranging from 300K nodes up to almost 120M nodes) and collected at different times, and on their transposed versions (i.e., the graphs obtained by reversing the direction of all arcs). The graphs used are described in Table 2, and they are all publicly available, as well as the code used in the experiments.

Table 2. Basic properties of graphs used as dataset

Name      Year  Nodes        Edges
cnr       2000  325 557      3 216 152
webbase   2001  118 142 155  1 019 903 190
it        2004  41 291 594   1 150 725 436
eu        2005  862 664      19 235 140
uk        2007  105 896 555  3 738 733 648
We started with the standard URL ordering of nodes and permuted the nodes in different ways, taking note of the number of bits/link occupied if the graph is compressed in WebGraph format. Six node orderings were considered:
– URL: URL ordering (to avoid confusion, we use "URL ordering" instead of "URL lexicographic ordering");
– lex: lexicographic row ordering;
– Gray: Gray ordering;
– lhbhGray: loose host-by-host Gray ordering (i.e., keep URLs from the same host adjacent, and order them using Gray ordering);
– shbhGray: strict host-by-host Gray ordering (like before, but Gray ordering is applied considering only local links, i.e., links to URLs of the same host);
– random: a random ordering of the nodes, included only as a baseline for comparison.
We remark that we devised the latter two orderings trying to combine external and internal information. Figure 1 shows how the adjacency matrix changes when re-ordering is applied. To obtain this figure, we divided the matrix into smaller 100 × 100-square submatrices, computed the fraction of 1's found in each submatrix, and plotted a square using a grayscale that maps 0 to white and that becomes exponentially darker (1 is black).
WebGraph compression depends not only on the graph to be compressed and the ordering of its nodes, but also on a number of other parameters that control its behaviour. Two parameters that happen to be important for our experiments are:
– Window size: when compressing a certain matrix row, WebGraph compares it with a number of previous rows, and tries to compress it differentially with respect to (i.e., as a difference from) each such row, choosing at the end the row that gave the best compression (or none, if representing the row non-differentially gives better compression); the number of rows considered in this process is called the window size. Of course, larger window sizes produce slower compression, but usually guarantee better compression: the default window size is 7.
– Maximum reference count: compressing a row x differentially with respect to some previous row y < x makes it necessary, at access time, to decompress row y before decompressing row x; if also y is compressed differentially with respect to some other row z < x, z must be decompressed first, and so on, producing what we call a reference chain. This recursive process must be somehow limited, or access becomes extremely inefficient (and may even overflow the recursion stack). The (maximum) reference count is the maximum length of a reference chain (we simply avoid compressing a row differentially with respect to one that already has a maximum-length reference chain); the default reference count is 3, but this value may be pushed up to ∞ (meaning that we do not care about creating overly long chains: this makes sense if we plan to access the graph only sequentially, in which case we just need to keep the last w uncompressed rows, where w is the window size used for compression).
Tables 3 and 4 show the number of bits/link occupied by the various graphs using ∞ and 3, respectively, as reference count (the window size was fixed at 8). Not surprisingly, the number of bits/link for the random permutation, which we present here only for comparison, is very large. A more thorough, and visually understandable, presentation of the experimental results is shown in Figure 2.
5 Discussion Our experiments highlight a number of very interesting issues. – The effectiveness of intrinsic methods depends on the graph. As we discussed, the fact that lex or Gray ordering should improve compression is intuitive, but has little formal justification. Indeed, on some graphs we have a visible decrease in compression.
Fig. 1. The picture shows the adjacency matrix of the cnr graph (325 557 nodes), when nodes are ordered as follows (left-to-right, top-to-bottom): lexicographically by URL, lexicographically by row, Gray ordering, loose host-by-host Gray ordering, strict host-by-host Gray ordering, randomly
Table 3. Compression rate summary, window size set to 8, and (maximum) reference count set to ∞

Graph                 URL    lex    Gray   shbhGray  lhbhGray  Random
cnr                   2.823  2.981  2.983  2.845     2.843     17.986
cnr (transpose)       2.654  2.185  2.192  2.176     2.177     15.084
webbase               3.059  3.410  3.416  2.907     2.895     30.937
webbase (transpose)   2.876  2.753  2.740  2.589     2.598     28.236
it                    1.969  1.733  1.723  1.541     1.545     26.430
it (transpose)        1.737  1.206  1.207  1.206     1.209     21.717
eu                    4.331  3.944  3.938  3.600     3.715     19.859
eu (transpose)        3.903  2.832  2.833  2.761     2.795     16.445
uk                    1.906  1.513  1.509  1.332     1.367     27.576
uk (transpose)        1.662  1.042  1.040  1.007     1.014     21.682
Table 4. Compression rate summary, window size set to 8, and (maximum) reference count set to 3

Graph                 URL    lex    Gray   shbhGray  lhbhGray  Random
cnr                   3.551  3.833  3.844  3.654     3.659     18.008
cnr (transpose)       2.839  2.489  2.495  2.472     2.474     15.084
webbase               3.732  4.404  4.409  3.680     3.688     30.937
webbase (transpose)   3.092  3.132  3.105  2.902     2.916     28.236
it                    2.763  2.626  2.618  2.334     2.353     26.430
it (transpose)        1.852  1.407  1.393  1.378     1.379     21.717
eu                    5.130  4.935  4.927  4.438     4.599     19.872
eu (transpose)        4.002  3.063  3.072  2.993     3.019     16.445
uk                    2.659  2.394  2.396  2.101     2.163     27.576
uk (transpose)        1.761  1.222  1.205  1.162     1.170     21.682

Fig. 2. The picture shows the bits/link compression rate for various node orderings and compression parameters. Each row of two boxes corresponds (top to bottom) to one of the graphs in the dataset (see Table 2), the box on the left relating to the graph itself and the one on the right to its transpose. In every box, the abscissa corresponds to different orderings and the ordinate to the number of bits per link; the considered orderings are (left to right): URL, lex, Gray, lhbhGray, and shbhGray. Note that the random ordering is not shown, since its compression rate is an order of magnitude larger than that of the other orderings. Each group of five lines corresponds to a different value of the (maximum) reference count: the upper (continuous) group has this parameter set to 3, whereas the lower (dashed) one has it set to ∞. Within each group of five lines, each line corresponds to a different value of the window size parameter; from top to bottom, the values are 1, 2, 4, 8, 16.

– Intrinsic methods are extremely effective on transposed graphs. Our data for standard web graphs confirm what has been previously reported [1], but our new data for transposed graphs show that here the situation is reversed: intrinsic methods improve very significantly the compression of transposed web graphs. With infinite reference chains, the transpose of uk requires just one bit per link. This is a phenomenon that clearly needs a combinatorial explanation. We modified WebGraph so that we could get access to detailed statistics about which compression technique is responsible for the increase in compression. Moving from URL to Gray ordering, the number of copied arcs of the uk graph (arcs that are not recorded explicitly, but rather represented differentially) jumps from ≈ 2 G to ≈ 3 G. Thus, more than 80% of the arcs of the graph are not represented explicitly. Another visible effect is that of shifting the distribution of gaps: very small values (say, below 10) are much more frequent, which also increases compression. We believe that such a major improvement in the transpose depends on the repetition in patterns of predecessors being much more frequent than in patterns
of successors. For instance, all pages of level k are often pointed to by pages of level k + 1 (e.g., general topics of a site). This makes the predecessor list of level-k pages large and very similar. The same phenomenon does not happen for successors because usually at level k the pointers at level k + 1 are distinct. Moreover, URL
ordering does not gather level k pages nearby—rather, it sorts URLs so that a level k page is followed by all its subpages. On the contrary, manual inspection of the URL permutation induced by strict host-by-host Gray ordering shows that pages on the same level of the hierarchy are grouped together. – Mixed methods work better. Essentially in all cases, our new orderings outperform old methods.
6 Conclusions

We have presented some experiments about the effect of permuting nodes on web graph compression. Although our results are clearly preliminary, they highlight a number of issues that have not been tackled in the literature. First of all, we have provided two new permutations that outperform all previous methods. Second, we have shown that transposed graphs behave in a radically different manner when permuted with our techniques, giving rise to extreme compression rates.
References
1. Randall, K., Stata, R., Wickremesinghe, R., Wiener, J.L.: The LINK database: Fast access to graphs of the Web. Research Report 175, Compaq Systems Research Center, Palo Alto, CA (2001)
2. Boldi, P., Vigna, S.: The WebGraph framework I: Compression techniques. In: Proc. of the Thirteenth International World Wide Web Conference, Manhattan, USA, pp. 595–601. ACM Press, New York (2004)
3. Kumar, R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tompkins, A., Upfal, E.: The Web as a graph. In: PODS 2000: Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 1–10. ACM Press, New York (2000)
4. Bharat, K., Broder, A., Henzinger, M., Kumar, P., Venkatasubramanian, S.: The Connectivity Server: fast access to linkage information on the Web. Computer Networks and ISDN Systems 30(1-7), 469–477 (1998)
5. Blandford, D.K., Blelloch, G.E.: Index compression through document reordering. In: Data Compression Conference, pp. 342–351. IEEE Computer Society, Los Alamitos (2002)
6. Shieh, W.Y., Chen, T.F., Shann, J.J.J., Chung, C.P.: Inverted file compression through document identifier reassignment. Inf. Process. Manage. 39(1), 117–131 (2003)
7. Silvestri, F.: Sorting out the document identifier assignment problem. In: Amati, G., Carpineto, C., Romano, G. (eds.) ECIR 2007. LNCS, vol. 4425, pp. 101–112. Springer, Heidelberg (2007)
8. Blanco, R., Barreiro, Á.: Document identifier reassignment through dimensionality reduction. In: Losada, D.E., Fernández-Luna, J.M. (eds.) ECIR 2005. LNCS, vol. 3408, pp. 375–387. Springer, Heidelberg (2005)
9. Knuth, D.E.: The Art of Computer Programming, vol. 4, Fascicle 2: Generating All Tuples and Permutations. Addison-Wesley Professional, Reading (2005)
A Dynamic Model for On-Line Social Networks

Anthony Bonato^1, Noor Hadi^2, Paul Horn^3, Pawel Pralat^4,⋆, and Changping Wang^1

1 Ryerson University, Toronto, Canada
[email protected], [email protected]
2 Wilfrid Laurier University, Waterloo, Canada
[email protected]
3 University of California, San Diego, USA
[email protected]
4 Dalhousie University, Halifax, Canada
[email protected]
Abstract. We present a deterministic model for on-line social networks based on transitivity and local knowledge in social interactions. In the Iterated Local Transitivity (ILT) model, at each time-step and for every existing node x , a new node appears which joins to the closed neighbour set of x . The ILT model provably satisfies a number of both local and global properties that were observed in real-world on-line social and other complex networks, such as a densification power law, decreasing average distance, and higher clustering than in random graphs with the same average degree. Experimental studies of social networks demonstrate poor expansion properties as a consequence of the existence of communities with low number of inter-community links. A spectral gap for both the adjacency and normalized Laplacian matrices is proved for graphs arising from the ILT model, thereby simulating such bad expansion properties.
1 Introduction
On-line social networks such as Facebook, MySpace, and Flickr have become increasingly popular in recent years. In such networks, nodes represent people on-line, and edges correspond to a friendship relation between them. In these complex real-world networks, with sometimes millions of nodes and edges, new nodes and edges dynamically appear over time. Parallel with their popularity among the general public is an increasing interest in the mathematical and general scientific community in the properties of on-line social networks, both in gathering data and statistics about the networks, and in finding models simulating their evolution. Data about social interactions in on-line networks is more readily accessible and measurable than in off-line social networks, which suggests a need for rigorous models capturing their evolutionary properties. The small world property of social networks, introduced by Watts and Strogatz [29], is a central notion in the study of complex networks, and has roots in the work of Milgram [25] on short paths of friends connecting strangers. The
⋆ The authors gratefully acknowledge support from NSERC and MITACS grants.
small world property posits low average distance (or diameter) and high clustering, and has been observed in a wide variety of complex networks. An increasing number of studies have focused on the small world and other complex network properties in on-line social networks. Adamic et al. [1] provided an early study of an on-line social network at Stanford University, and found that the network has the small world property. Correlation between friendship and geographic location was found by Liben-Nowell et al. [24] using data from LiveJournal. Kumar et al. [21] studied the evolution of the on-line networks Flickr and Yahoo!360. They found (among other things) that the average distance between users actually decreases over time, and that these networks exhibit power-law degree distributions. Golder et al. [19] analyzed the Facebook network by studying the messaging pattern between friends with a sample of 4.2 million users. They also found a power law degree distribution and the small world property. Similar results were found in [2], which studied Cyworld, MySpace, and Orkut, and in [26], which examined data collected from four on-line social networks: Flickr, YouTube, LiveJournal, and Orkut. For further background on complex networks and their models, see the books [5,7,10,13]. Recent work by Leskovec et al. [22] underscores the importance of two additional properties of complex networks above and beyond more traditionally studied phenomena such as the small world property. A graph G with e_t edges and n_t nodes satisfies a densification power law if there is a constant a ∈ (1, 2) such that e_t is proportional to n_t^a. In particular, the average degree grows to infinity with the order of the network (in contrast to, say, the preferential attachment model, which generates graphs with constant average degree). In [22], densification power laws were reported in several real-world networks such as a physics citation graph and the internet graph at the level of autonomous systems. Another striking property found in such networks (and also in on-line social networks; see [21]) is that distances in the networks (measured by either diameter or average distance) decrease with time. The usual models such as preferential attachment or copying models have logarithmically or sublogarithmically growing diameters and average distances with time. Various models (such as the Forest Fire [22] and Kronecker multiplication [23] models) were proposed simulating power law degree distribution, densification power laws, and decreasing distances. We present a new model, called Iterated Local Transitivity (ILT), for on-line social and other complex networks, which dynamically simulates many of their properties. Although modelling has been done extensively for other complex networks such as the web graph (see [5]), models of on-line social networks have only recently been introduced (such as those in [12,21,24]). The central idea behind the ILT model is what sociologists call transitivity: if u is a friend of v, and v is a friend of w, then u is a friend of w (see, for example, [16,28,30]). In its simplest form, transitivity gives rise to the notion of cloning, where u is joined to all of the neighbours of v. In the ILT model, given some initial graph as a starting point, nodes are repeatedly added over time which clone each node, so that the new nodes form an independent set. The ILT model not only
incorporates transitivity, but uses only local knowledge in its evolution, in that a new node only joins to neighbours of an existing node. Local knowledge is an important feature of social and complex networks, where nodes have only limited influence on the network topology. We stress that our approach is mathematical rather than empirical; indeed, the ILT model (apart from its potential use by computer and social scientists as a simplified model for on-line social networks) should be of theoretical interest in its own right. Variants of cloning were considered earlier in duplication models for protein-protein interactions [3,4,9,27], and in copying models for the web graph [6,20]. There are several differences between the duplication and copying models and the ILT model. For one, duplication models are difficult to analyze due to their rich dependence structure. While the ILT model displays a dependency structure, determinism makes it more amenable to analysis. The ILT model may be viewed as a simplified snapshot of the duplication model, where all nodes are cloned in a given time-step, rather than duplicating nodes one-by-one over time. Cloning all nodes at each time-step as in the ILT model leads to densification and high clustering, along with bad expansion properties (as we describe in the next paragraph). We prove that the model exhibits a densification power law with exponent a = log 3/log 2; see Theorem 2. We study the average distances and clustering coefficient of the model as time tends to infinity. In particular, we show that the average distance of the model at time t converges to a function dependent on the Wiener index of the initial graph; see Theorem 2. For many initial graphs, the average distance decreases, and the diameter does not change over time. In Theorem 3, the clustering coefficient of the graph at time t is estimated and shown to tend to 0 more slowly than in a G(n, p) random graph with the same average degree. Experimental studies of social networks (see Estrada [15]) demonstrate smaller expansion properties than in other complex networks as a consequence of the existence of communities with a low number of inter-community links. Interestingly, this phenomenon is found in the ILT model, where a smaller spectral gap than in random graphs is found for both the normalized Laplacian (see Theorem 5) and adjacency (see Theorem 7) matrices.
2 The ILT Model
We first give a precise formulation of the model. The ILT model generates simple, undirected graphs (G_t : t ≥ 0) over a countably infinite sequence of discrete time-steps. The only parameter of the model is the initial graph G_0, which is a fixed finite connected graph. Assume that for a fixed t ≥ 0, the graph G_t has been constructed. To form G_{t+1}, for each node x ∈ V(G_t), add its clone x′, such that x′ is joined to x and all of its neighbours at time t. Note that the set of new nodes at time t + 1 forms an independent set of cardinality |V(G_t)|. We write deg_t(x) for the degree of a node at time t, n_t for the order of G_t, and e_t for its number of edges. It is straightforward to see that n_t = 2^t n_0. Given a node x at time t, let x′ be its clone. The simple but important recurrences governing the degrees of nodes are given as
deg_{t+1}(x) = 2 deg_t(x) + 1,   (1)
deg_{t+1}(x′) = deg_t(x) + 1.   (2)
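As a quick illustration of the construction and of recurrences (1) and (2), here is a small simulation sketch in Python using networkx; it is only an illustrative check under our own naming conventions, not code from the paper.

```python
import networkx as nx

def ilt_step(g):
    new = g.copy()
    for x in list(g.nodes):
        clone = (x, "clone")                                       # fresh label for x'
        new.add_node(clone)
        new.add_edge(clone, x)                                     # x' is joined to x ...
        new.add_edges_from((clone, y) for y in g.neighbors(x))     # ... and to N_t(x)
    return new

g0 = nx.path_graph(4)
g1 = ilt_step(g0)
assert g1.number_of_nodes() == 2 * g0.number_of_nodes()            # n_{t+1} = 2 n_t
for x in g0.nodes:
    assert g1.degree(x) == 2 * g0.degree(x) + 1                    # recurrence (1)
    assert g1.degree((x, "clone")) == g0.degree(x) + 1             # recurrence (2)
```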
2.1 Average Degree and Densification

We now consider the number of edges and average degree of G_t, and prove the following densification power law for the ILT model. Define the volume of G_t by
vol(G_t) = Σ_{x∈V(G_t)} deg_t(x) = 2e_t.

Theorem 1. For t > 0, the average degree of G_t equals
(3/2)^t (vol(G_0)/n_0 + 2) − 2.

Note that Theorem 1 supplies a densification power law with exponent a = log 3/log 2 ≈ 1.58. We think that the densification power law makes the ILT model realistic, especially in light of real-world data mined from complex networks (see [22]). Theorem 1 follows immediately from Lemma 1, since the average degree of G_t is vol(G_t)/n_t.

Lemma 1. For t > 0, vol(G_t) = 3^t vol(G_0) + 2n_0(3^t − 2^t). In particular, e_t = 3^t(e_0 + n_0) − n_t.

Proof. By (1) and (2) we have that
vol(G_{t+1}) = Σ_{x∈V(G_t)} deg_{t+1}(x) + Σ_{x′∈V(G_{t+1})\V(G_t)} deg_{t+1}(x′)
            = Σ_{x∈V(G_t)} (2 deg_t(x) + 1) + Σ_{x∈V(G_t)} (deg_t(x) + 1)
            = 3 vol(G_t) + n_{t+1}.   (3)

Hence by (3), for t > 0,
vol(G_t) = 3 vol(G_{t−1}) + n_t
         = 3^t vol(G_0) + n_0 Σ_{i=0}^{t−1} 3^i 2^{t−i}
         = 3^t vol(G_0) + 2n_0(3^t − 2^t),
where the third equality follows by summing a geometric series.
2.2 Average Distance
Define the Wiener index of G_t as
W(G_t) = Σ_{x,y∈V(G_t)} d_t(x, y).
The Wiener index arises in applications of graph theory to chemistry, and may be used to define the average distance of G_t as
L(G_t) = W(G_t) / (n_t choose 2).
We will compute the average distance by first deriving the Wiener index. Define the ultimate average distance of G_0 as
UL(G_0) = lim_{t→∞} L(G_t),
assuming the limit exists. We provide an exact value for L(G_t) and compute the ultimate average distance for any initial graph G_0.

Theorem 2
1. For t > 0,
   W(G_t) = 4^t (W(G_0) + (e_0 + n_0)(1 − (3/4)^t)).
2. For t > 0,
   L(G_t) = 2 · 4^t (W(G_0) + (e_0 + n_0)(1 − (3/4)^t)) / (4^t n_0^2 − 2^t n_0).
3. For all graphs G_0,
   UL(G_0) = 2(W(G_0) + e_0 + n_0) / n_0^2.
   Further, UL(G_0) ≤ L(G_0) if and only if W(G_0) ≥ (n_0 − 1)(e_0 + n_0).
Note that the average distance of G_t is bounded above by diam(G_0) + 1 (in fact, by diam(G_0) in all cases except cliques). Further, the condition in (3) for UL(G_0) < L(G_0) holds for large cycles and paths. Hence, for many initial graphs G_0, the average distance decreases, a property observed in on-line social and other networks (see [21,22]). When computing distances in the model, the following lemma is helpful. As its proof is elementary, we omit it.

Lemma 2. Let x and y be nodes in G_t with t > 0. Then
d_{t+1}(x′, y) = d_{t+1}(x, y′) = d_{t+1}(x, y) = d_t(x, y),
and
d_{t+1}(x′, y′) = d_t(x, y) if xy ∉ E(G_t), and d_{t+1}(x′, y′) = d_t(x, y) + 1 = 2 if xy ∈ E(G_t).
Proof of Theorem 2. We only prove item (1), noting that items (2) and (3) follow from (1) by computation. We derive a recurrence for W(G_t) as follows. To compute W(G_{t+1}), there are five cases to consider: distances within G_t, and distances of the forms d_{t+1}(x, y′), d_{t+1}(x′, y), d_{t+1}(x, x′), and d_{t+1}(x′, y′). The first three cases contribute 3W(G_t) by Lemma 2. The 4th case contributes n_t. The final case contributes W(G_t) + e_t (the term e_t comes from the fact that each edge xy contributes d_t(x, y) + 1). Thus,
W(G_{t+1}) = 4W(G_t) + e_t + n_t = 4W(G_t) + 3^t(e_0 + n_0).
Hence,
W(G_t) = 4^t W(G_0) + Σ_{i=0}^{t−1} 4^i 3^{t−1−i} (e_0 + n_0)
       = 4^t (W(G_0) + (e_0 + n_0)(1 − (3/4)^t)).   ⊓⊔
Diameters are constant in the ILT model. We record this as a strong indication of the (ultra) small world property in the model.

Lemma 3. For all graphs G_0 different from a clique, diam(G_t) = diam(G_0), and diam(G_t) = diam(G_0) + 1 = 2 when G_0 is a clique.

2.3 The Clustering Coefficient and Degrees
The purpose of this subsection is to estimate the clustering coefficient of G_t. Let N_t(x) be the neighbour set of x at time t, and let e(x, t) be the number of edges in the subgraph of G_t induced by N_t(x). For a node x ∈ V(G_t) with degree at least 2 define
c_t(x) = e(x, t) / (deg_t(x) choose 2).
By convention c_t(x) = 0 if the degree of x is at most 1. The clustering coefficient of G_t is
C(G_t) = (Σ_{x∈V(G_t)} c_t(x)) / n_t.
Our main result is the following.

Theorem 3.
Ω((7/8)^t t^{−2}) = C(G_t) = O((7/8)^t t^2).
Observe that C(G_t) tends to 0 as t → ∞. If we let n_t = n (so t ∼ log_2 n), then this gives that
C(G_t) = n^{log_2(7/8)+o(1)}.   (4)
In contrast, for a random graph G(n, p) with comparable average degree pn = Θ((3/2)^{log_2 n}) = Θ(n^{log_2(3/2)}) as G_t, the clustering coefficient is p = Θ(n^{log_2(3/4)}), which tends to zero much faster than C(G_t). We introduce the following dependency structure that will help us classify the degrees of nodes. Given a node x ∈ V(G_0), we define its descendant tree at time t, written T(x, t), to be a rooted binary tree with root x, and whose leaves are all of the nodes at time t. To define the (k + 1)th row of T(x, t), let y be a node in the kth row (y corresponds to a node in G_k). Then y has exactly two descendants on row k + 1: y itself and y′. In this way, we may identify the nodes of G_t with a length-t binary sequence corresponding to the descendants of x, using the convention that a clone is labelled 1. We refer to such a sequence as the binary sequence for x at time t. We need the following technical lemma, whose proof is omitted.

Lemma 4. Let S(x, k, t) be the nodes of T(x, t) with exactly k many 0's in their binary sequence at time t. Then for all y ∈ S(x, k, t),
2^k(deg_0(x) + 1) + t − k − 1 ≤ deg_t(y) ≤ 2^k(deg_0(x) + t − k + 1) − 1.

It follows from Lemma 4 that the number of nodes of degree at least k at time t, denoted by N_{(≥k)}, satisfies
Σ_{i=log_2 k}^{t} (t choose i) ≤ N_{(≥k)} ≤ Σ_{i=log_2 k − log_2 t}^{t} (t choose i).
In particular, N_{(≥k)} = Θ(n_t) for k ≤ √n_t, and therefore the degree distribution of G_t does not follow a power law. Since (t choose k) nodes have degree approximately 2^k, the degree distribution has 'binomial type' behaviour. We now prove the following lemma.

Lemma 5. For all x ∈ V(G_t) with k 0's in their binary sequence, we have that Ω(3^k) = e(x, t) = O(3^k t^2).

Proof. For x ∈ V(G_t) we have that
e(x, t + 1) = e(x, t) + deg_t(x) + Σ_{i=1}^{deg_t(x)} (1 + deg_{G_t↾N_t(x)}(x_i)) = 3e(x, t) + 2 deg_t(x),
where G_t↾N_t(x) is the subgraph induced by N_t(x) in G_t and x_1, ..., x_{deg_t(x)} are the neighbours of x. For x′, we have that e(x′, t + 1) = e(x, t) + deg_t(x).
Since there are k many 0's and e(x, 2) is always positive for all initial graphs G_0, e(x, t) ≥ 3^{k−2} e(x, 2) = Ω(3^k), and the lower bound follows. For the upper bound, a general binary sequence corresponding to x is of the form
(1, ..., 1, 0, 1, ..., 1, 0, 1, ..., 1, 0, 1, ..., 1, 0, 1, ..., 1)
with the 0's in positions i_j (1 ≤ j ≤ k). Consider a path in the descendant tree from the root of the tree to node x. By Lemma 4, the node on the path in the i-th row (i < i_j) has (at time i) degree O(2^{j−1} t). Hence, the number of edges we estimate is O(t^2) until the (i_1 − 1)th row, increases to 3·O(t^2) + O(2^1 t) in the next row, and increases to 3·O(t^2) + O(2^1 t^2) in the (i_2 − 1)th row. By induction, we have that
e(x, t) = 3(···(3(3·O(t^2) + O(2^1 t^2)) + O(2^2 t^2))···) + O(2^k t^2)
        = O(t^2) 3^k Σ_{j=0}^{k} (2/3)^j
        = O(3^k t^2).   ⊓⊔
We now prove our result on clustering coefficients.

Proof of Theorem 3. For x ∈ V(G_t) with k many 0's in its binary sequence, by Lemmas 4 and 5 we have that
c(x) = Ω(3^k / (2^k t)^2) = Ω((3/4)^k t^{−2}),
and
c(x) = O(3^k t^2 / (2^k)^2) = O((3/4)^k t^2).
Hence, since we have n_0 (t choose k) nodes with k many 0's in their binary sequence,
C(G_t) = (Σ_{k=0}^{t} n_0 (t choose k) Ω((3/4)^k t^{−2})) / n_t = Ω(t^{−2} (1 + 3/4)^t / 2^t) = Ω((7/8)^t t^{−2}).
In a similar fashion, it follows that
C(G_t) = (Σ_{k=0}^{t} n_0 (t choose k) O((3/4)^k t^2)) / n_t = O((7/8)^t t^2).   ⊓⊔

3 Spectral Properties of the ILT Model
Social networks often organize into separate clusters in which the intra-cluster links are significantly higher than the number of inter-cluster links. In particular, social networks contain communities (characteristic of social organization),
A Dynamic Model for On-Line Social Networks
135
where tightly knit groups correspond to the clusters [17]. As a result, social networks possess bad expansion properties realized by small gaps between their first and second eigenvalues [15]. In this section, we find that the ILT model has such bad expansion properties for both its normalized Laplacian and adjacency matrices. 3.1
The Spectral Gap of the Normalized Laplacian
For regular graphs, the eigenvalues of the adjacency matrix are related to several important graph properties, such as in the expander mixing lemma. The normalized Laplacian of a graph, introduced by Chung [8], relates to important graph properties even in the case where the underlying graph is not regular (as is the case in the ILT model). Let A denote the adjacency matrix and D denote the diagonal adjacency matrix of a graph G. Then the normalized Laplacian of G is L = I − D−1/2 AD−1/2 . Let 0 = λ0 ≤ λ1 ≤ · · · ≤ λn−1 ≤ 2 denote the eigenvalues of L. The spectral gap of the normalized Laplacian is λ = max{|λ1 − 1|, |λn−1 − 1|}. Chung, Lu, and Vu [11] observe that, for random power law graphs with some parameters (effectively in the case that dmin ≫ log2 n), that λ ≤ (1 + o(1)) √4d , where d is the average degree. For the graphs Gt studied here, we observe that the spectra behaves quite differently and, in fact, the spectral gap has a constant order. The following theorem suggests a significant spectral difference between graphs generated by the ILT model and random graphs. Define λ(t) to be the spectral gap of the normalized Laplacian of Gt . Theorem 4. For t ≥ 1, λ(t) > 21 . Theorem 4 represents a drastic departure from the good expansion found in random graphs, where λ = o(1) [8,11], and from the preferential attachment model [18]. We use the expander mixing lemma for the normalized Laplacian (see [8]). For sets of vertices X and Y we use the notation vol(X) for the volume of the subgraph induced by X, and e(X, Y ) for the number of edges with one end in each of X and Y. Lemma 6. For all sets X ⊆ G, 2 ¯ e(X, X) − (vol(X)) ≤ λ vol(X)vol(X) . vol(G) vol(G)
Proof of Theorem 4. We observe that Gt contains an independent set (that is, a set of vertices with no edges) with volume vol(Gt−1 ) + nt−1 . Let X denote this set, that is, the new nodes added at time t. Then by (3) it follows that ¯ = vol(Gt ) − vol(X) = 2vol(Gt−1 ) + nt−1 . vol(X)
136
A. Bonato et al.
Since X is independent, Lemma 6 implies that λ(t) ≥
vol(X) 1 vol(Gt−1 ) + nt−1 > . = ¯ 2vol(Gt−1 ) + nt−1 2 vol(X)
⊓ ⊔
If G0 has bad expansion properties, and has λ1 < 1/2 (and thus, λ > 1/2) then, in fact, this trend of bad expansion continues as shown by the following theorem. Theorem 5. Suppose G0 has at least two nodes, and for t > 0 let λ1 (t) be the second eigenvalue of Gt . Then we have that λ1 (t) < λ1 (0). Note that Theorem 5 implies that λ1 (1) < λ1 (0) and this implies that the sequence {λ1 (t) : t ≥ 0} is strictly decreasing. This follows since Gt is constructed from Gt−1 in the same manner as G1 is constructed from G0 . If G0 is K1 , then there is no second eigenvalue, but G1 is K2 . Hence, in this case, the theorem implies that {λ1 (t) : t ≥ 1} is strictly decreasing. Before we proceed with the proof of Theorem 5, we begin by stating some ˜ ∈ V (G0 ) denote notation and a lemma. For a given node u ∈ V (Gt ), we let u the node in G0 that u is a descendant of. Given uv ∈ E(G0 ), we define Auv (t) = {xy ∈ E(Gt ) : x ˜ = u, y˜ = v}, and for v ∈ E(G0 ), we set Av (t) = {xy ∈ E(Gt ) : x˜ = y˜ = v}. We use the following lemma, for which the proof of items (1) and (2) follow from Lemma 1. The final item contains a standard form of the Raleigh quotient characterization of the second eigenvalue; see [8]. Lemma 7 1. For uv ∈ E(G0 ),
|Auv (t)| = 3t .
2. For v ∈ V (G0 ),
|Av (t)| = 3t − 2t .
3. Define d¯ =
vol(Gt )
Then λ1 (t) =
inf
f (v) degt (v)
v∈V (Gt )
uv∈E(Gt )
f :V (Gt )→R, f =0 v
.
(f (u) − f (v))2
. f 2 (v) degt (v) − d¯2 vol(Gt )
Note that in item (3), d¯ is a function of f.
(5)
A Dynamic Model for On-Line Social Networks
137
Proof of Theorem 5. Let g : V (G0 ) → R be the harmonic eigenvector for λ1 (0) so that g(v) deg0 (v) = 0, v∈V (G 0)
and
λ1 (0) =
uv∈E(G0 )
(g(u) − g(v))2 g 2 (v) deg0 (v)
v∈V (G0 )
.
Furthermore, we choose g scaled so that v∈V (G0 ) g 2 (v) deg0 (v) = 1. This is the standard version of the Raleigh quotient for the normalized Laplacian from [8], so such a g exists so long as G0 has at least two eigenvalues, which it does by our assumption that G0 ≇ K1 . Our strategy in proving the theorem is to show that lifting g to G1 provides an effective bound on the second eigenvalue of G1 using the form of the Raleigh quotient given in (5). x). Then note that Define f : Gt → R by f (x) = g(˜
xy∈E(Gt )
(f (x) − f (y))2 =
xy∈E(Gt ), x ˜=˜ y
=
(f (x) − f (y))2 +
uv∈E(G0 ) xy∈Auv
= 3t
uv∈E(G0 )
xy∈E(Gt ) x ˜ =y˜
(f (x) − f (y))2
(g(u) − g(v))2
(g(u) − g(v))2 .
By Lemma 7 (1) and (2) it follows that f 2 (x) degt (x) =
f 2 (x)
x∈V (Gt ) xy∈E(Gt )
x∈V (Gt )
=
g 2 (u)
u∈V (G0 ) xy∈E(Gt ), x ˜=u
=
u∈V (G0 )
= 3t
⎛
g 2 (u) ⎝
vu∈E(G0 ) xy∈Auv
u∈V (G0 ) t
t
⎞
1 + 2|Au |⎠
g 2 (u) deg0 (u) + 2(3t − 2t ) t
= 3 + 2(3 − 2 )
g 2 (u)
u∈V (G0 )
2
g (u).
u∈G0
By Lemma 1 and proceeding as above, noting that we have that
v∈V (G0 )
g(v) deg0 (v) = 0,
138
A. Bonato et al.
d¯2 vol(Gt ) =
=
=
≤
2
f (x) degt (x)
x∈V (Gt )
vol(Gt )
t
t
2(3 − 2 )
2
g(u)
u∈V (G0 )
vol(Gt ) 2 t 2 2t 1− 3 4·3
2
g(u)
u∈V (G0 )
t 3t vol(G0 ) + 2n0 1 − 32 t 2 4 · 3t 1 − 32 g 2 (u) u∈V (G )
0 2 t ¯ D+2 1− 3
,
¯ is the average degree of G0 , and the last inequality follows from the where D Cauchy-Schwarz inequality. By (5) we have that (f (x) − f (y))2 λ1 (t) ≤
xy∈E(Gt )
f 2 (x) degt (x) + d¯2 vol(Gt )
x∈V (Gt )
≤
=
3t
uv∈E(G0 )
(g(u) − g(v))2
t 2 3t + 2 · 3t 1 − 32 u∈V (G0 ) g (u) −
t 2 4·3t 1−( 23 )
λ1 (0)
t 1 + 2 1 − 32
< λ1 (0),
u∈V (G0 )
g 2 (u)
1−
g2 (u)
u∈ V (G0 ) t ¯ D+2 1− 23
( )
t 2 1−( 23 ) t ¯ D+2 1−( 32 )
¯ ≥ 1 since G0 is connected where the strict inequality follows from the fact that D and G0 ≇ K1 . ⊓ ⊔ 3.2
The Spectral Gap of the Adjacency Matrix
Let ρ0 (t) ≥ |ρ1 (t)| ≥ . . . denote the eigenvalues of the adjacency matrix Gt . As in the Laplacian case, we can show that there is a small spectral gap of the adjacency matrix. If A is the adjacency matrix of Gt , then the adjacency matrix of Gt+1 is A A+I M= , A+I 0
A Dynamic Model for On-Line Social Networks
139
where is I is the identity matrix of order nt . We note the following recurrence for the eigenvalues of the adjacency matrix of Gt , whose proof is omitted. Theorem 6. If ρ is an eigenvalue of the adjacency matrix of Gt , then ρ ± ρ2 + 4(ρ + 1)2 , 2 are eigenvalues of the adjacency matrix of Gt+1 . Indeed, one can check that the eigenvectors of Gt can be written in terms of the eigenvalues of Gt−1 . We prove the following theorem. Theorem 7. Let ρ0 (t) ≥ |ρ1 (t)| ≥ · · · ≥ |ρn (t)| denote the eigenvalues of the adjacency matrix of Gt . Then ρ0 (t) = Θ(1). |ρ1 (t)| That is, ρ0 (t) ≤ c|ρ1 (t)| for some constant c. Theorem 7 is in contrast to fact that in G(n, p) random graphs, |ρ0 | = o(ρ1 ). Proof of Theorem 7. Without loss of generality, we assume that G0 ≇ K1 ; otherwise, G1 is K2 , and we may start from there. Thus, in particular, we can assume ρ0 (0) ≥ 1. We first observe that by Theorem 6 ρ0 (t) ≥
√ t 1+ 5 ρ0 (0). 2
(6)
By Theorem 6 and by taking a branch of descendants from the largest eigenvalue it follows that √ √ t 2( 5 − 1) 1 + 5 √ |ρ1 (t)| ≥ ρ0 (0). 2 (1 + 5)2 Hence, to prove the theorem, it suffices to show that ρ0 (t) ≤ c
√ t 1+ 5 ρ0 (0). 2
Observe that, also by Theorem 6 and taking the largest branch of descendants from the largest eigenvalues, ⎛ ⎛ ⎞ ⎞ 8 4 6 t−1 t−1 1 + 5 + ρ0 (i) + ρ20 (i) 1 + 5 + ρ0 (i) ⎠ ≤ ρ0 (0) ⎠. ⎝ ⎝ ρ0 (t) = ρ0 (0) 2 2 i=0 i=0
140
A. Bonato et al.
Thus, 5 + ρ06(i) 2 ρ0 (t) √ √ ≤ ρ0 (0) (1 + 5)t 1+ 5 i=0 √ t−1 6 5 √ ≤ ρ0 (0) 1+ 1 + 5 5ρ0 (i) i=0 √ t−1 6 5 √ ≤ ρ0 (0) exp ρ0 (i)−1 5(1 + 5) i=0 √ −i ∞ 2 6 5 √ √ ≤ ρ0 (0) exp = ρ0 (0)c. 5(1 + 5)ρ0 (0) i=0 1 + 5 t
t−1
1+
In all we have proved that for constants c and d that √ t √ t 1+ 5 1+ 5 ρ0 (0) ≥ ρ0 (t) ≥ |ρ1 (t)| ≥ d ρ0 (t). c 2 2
4
⊓ ⊔
Conclusion and Further Work
We introduced the ILT model for on-line social and other complex networks, where the network is cloned at each time-step. We proved that the ILT model generates graphs with a densification power law, in many cases decreasing average distance (and in all cases, the diameter is bounded above by a constant independent of time), have higher clustering than random graphs with the same average degree, and have smaller spectral gaps for both their normalized Laplacian and adjacency matrices than in random graphs. Much more can be said about the ILT model than space permits here; for example, many graph properties at time t are strongly related to properties from time 0. For example, the cop and domination numbers of the graphs Gt equal those of G0 (see [5] for definitions of these parameters). In addition, the automorphism group (endomorphism monoid) of G0 embeds as a subgroup (submonoid) in the automorphism group (endomorphism monoid) of Gt . A discussion of these and other properties of the ILT model will appear in the full version of this paper. In the duplication and copying models and in the model [14] of social networks, transitivity is modelled so that neighbours are copied with some fixed probability. The ILT model may be randomized, so that x′ clones x with a fixed probability. We will study this randomized ILT model in future work. As we noted after the statement of Lemma 4 the ILT model does not generate graphs with a power law degree distribution. An interesting problem that we will address in the full version of this paper is to design and analyze a randomized version of the ILT model satisfying the properties displayed in the deterministic ILT model (and make them tuneable; for example, the densification power law exponent should vary with the choice of parameters) as well as generating power law graphs. Such
A Dynamic Model for On-Line Social Networks
141
a randomized ILT model should with high probability generate power law graphs with topological properties similar to graphs from the deterministic ILT model.
References 1. Adamic, L.A., Buyukkokten, O., Adar, E.: A social network caught in the web. First Monday 8 (2003) 2. Ahn, Y., Han, S., Kwak, H., Moon, S., Jeong, H.: Analysis of topological characteristics of huge on-line social networking services. In: Proceedings of the 16th International Conference on World Wide Web (2007) 3. Bebek, G., Berenbrink, P., Cooper, C., Friedetzky, T., Nadeau, J., Sahinalp, S.C.: The degree distribution of the generalized duplication model. Theoretical Computer Science 369, 234–249 (2006) 4. Bhan, A., Galas, D.J., Dewey, T.G.: A duplication growth model of gene expression networks. Bioinformatics 18, 1486–1493 (2002) 5. Bonato, A.: A Course on the Web Graph. American Mathematical Society Graduate Studies Series in Mathematics, Providence, Rhode Island (2008) 6. Bonato, A., Janssen, J.: Infinite limits and adjacency properties of a generalized copying model. Internet Mathematics (accepted) 7. Caldarelli, G.: Scale-Free Networks. Oxford University Press, Oxford (2007) 8. Chung, F.: Spectral Graph Theory. American Mathematical Society, Providence, Rhode Island (1997) 9. Chung, F., Lu, L., Dewey, T., Galas, D.: Duplication models for biological networks. Journal of Computational Biology 10, 677–687 (2003) 10. Chung, F., Lu, L.: Complex graphs and networks. American Mathematical Society, Providence, Rhode Island (2006) 11. Chung, F., Lu, L., Vu, V.: The spectra of random graph with given expected degrees. Internet Mathematics 1, 257–275 (2004) 12. Crandall, D., Cosley, D., Huttenlocher, D., Kleinberg, J., Suri, S.: Feedback effects between similarity and social influence in on-line communities. In: Proceedings of the 14th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (2008) 13. Durrett, R.: Random Graph Dynamics. Cambridge University Press, New York (2006) 14. Ebel, H., Davidsen, J., Bornholdt, S.: Dynamics of social networks. Complexity 8, 24–27 (2003) 15. Estrada, E.: Spectral scaling and good expansion properties in complex networks. Europhys. Lett. 73, 649–655 (2006) 16. Frank, O.: Transitivity in stochastic graphs and digraphs. Journal of Mathematical Sociology 7, 199–213 (1980) 17. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99, 7821–7826 (2002) 18. Gkantsidis, C., Mihail, M., Saberi, A.: Throughput and congestion in power-law graphs. In: Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement Modeling of Computer Systems (2003) 19. Golder, S., Wilkinson, D., Huberman, B.: Rhythms of social interaction: messaging within a massive on-line network. In: 3rd International Conference on Communities and Technologies (2007)
142
A. Bonato et al.
20. Kumar, R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A., Upfal, E.: Stochastic models for the web graph. In: Proceedings of the 41th IEEE Symposium on Foundations of Computer Science (2000) 21. Kumar, R., Novak, J., Tomkins, A.: Structure and evolution of on-line social networks. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006) 22. Leskovec, J., Kleinberg, J., Faloutsos, C.: Graphs over time: densification Laws, shrinking diameters and possible explanations. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2005) 23. Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C.: Realistic, mathematically tractable graph generation and evolution, using kronecker multiplication. In: Jorge, A.M., Torgo, L., Brazdil, P.B., Camacho, R., Gama, J. (eds.) PKDD 2005. LNCS, vol. 3721, pp. 133–145. Springer, Heidelberg (2005) 24. Liben-Nowell, D., Novak, J., Kumar, R., Raghavan, P., Tomkins, A.: Geographic routing in social networks. Proceedings of the National Academy of Sciences 102, 11623–11628 (2005) 25. Milgram, S.: The small world problem. Psychology Today 2, 60–67 (1967) 26. Mislove, A., Marcon, M., Gummadi, K., Druschel, P., Bhattacharjee, B.: Measurement and analysis of on-line social networks. In: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (2007) 27. Pastor-Satorras, R., Smith, E., Sole, R.V.: Evolving protein interaction networks through gene duplication. J. Theor. Biol. 222, 199–210 (2003) 28. Scott, J.P.: Social Network Analysis: A Handbook. Sage Publications Ltd., London (2000) 29. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393, 440–442 (1998) 30. White, H., Harrison, S., Breiger, R.: Social structure from multiple networks, I: Blockmodels of roles and positions. American Journal of Sociology 81, 730–780 (1976)
TC-SocialRank: Ranking the Social Web Antonio Gulli1 , Stefano Cataudella1 , and Luca Foschini1, 2 2
1 Ask.com University of California, Santa Barbara
Abstract. Web search is extensively adopted for accessing information on the web, and recent development in personalized search, social bookmarking and folksonomy systems has tremendously eased the user’s path towards the desired information. In this paper, we discuss TC-SocialRank, a novel link-based algorithm which extends state-of-the-art algorithms for folksonomy systems. The algorithm leverages the importance of users in the social community, the importance of the bookmarks/resource they share, and additional temporal information and clicks information. Temporal information has a primary importance in social bookmarking search since users continuously post new and fresh information. The importance of this information may decay after a while, if it is no longer tagged or clicked. As a case study for testing the effectiveness of TC-SocialRank, we discuss JammingSearch a novel folksonomy system that unifies web search and social bookmarking by transparently leveraging a Wiki-based collaborative editing system. When an interesting search result is found, a user can share it with the JammingSearch community by simply clicking a button. This information is implicitly tagged with the query submitted to any commodity search engine. Later on, additional tags can be added by the user community. Currently, our system interacts with Ask.com, Google, Microsoft Live, Yahoo!, and AOL.
1
Introduction
Nowadays, search is undoubtedly the primary way for users to access the information on the web. According to Nielsen [19], Google, Microsoft and Yahoo! are the three most visited U.S. properties as of April 2008. Interactive Corp, Ask.com’s parent company, occupies the 7th place. Different search engines drive users to the desired information in different ways since they index distinct, albeit overlapping, portions of the web [9] and might provide different results for the same query [14]. Users take advantages of this diversity. The 48% of searchers regularly use two or three search engines, only 7% use more than three and 44% use just one, as shown in a recent study [6]. Although search plays an important role in user’s navigation, the user activity on the web cannot be reduced to only search-related routines. Once the desired information is found, it has to be exploited first and, secondly, possibly stored for the user future needs. This mechanism, called bookmarking, has been recently extended to become more and more social. With the advent of social K. Avrachenkov, D. Donato, and N. Litvak (Eds.): WAW 2009, LNCS 5427, pp. 143–154, 2009. c Springer-Verlag Berlin Heidelberg 2009
144
A. Gulli, S. Cataudella, and L. Foschini
bookmarking, bookmarks can be shared among different users and can be organized in different categories, annotated with tags and metadata, and searched chronologically or by tag. Those type of integrated systems are called folksonomy search systems [23]. Folksonomy is a neologism denoting a combination of “folks” and “taxonomy”. We note that the functionalities of search engines and social bookmarking sites are both aimed at providing the user with high-quality information. The main advantage of social bookmarking is that discovering, tagging, and organizing the information are carried out by human beings. This means that users can potentially organize the content by leveraging their past experiences and tastes. Therefore, they can obtain a more accurate classification, when compared to automatic techniques yielded by machine learning, data mining or information retrieval. The fact that social systems are built by people is undoubtedly their greatest feature but, it also brings up several issues. A first problem can be identified in the fact that the folksonomies grow with no terminological control. If users can freely choose tags, instead of selecting them from a given vocabulary, synonyms (different tags for the same concept), polysemy and homonymy (same tag used or associated with different meanings) can arise. Another problem is that users can deliberately spam a social repository with the sole goal of poisoning the common knowledge. For addressing those issues, we introduce a Web 2.0 graph which accounts both the users and the resources that they share. We discuss TC-SocialRank a novel linkbased algorithm which extends state-of-the-art algorithms for folksonomy systems, such as FolkRank [12], and SocialPageRank [1]. TC-SocialRank leverages the importance of the users in the social community, the importance of the resource they submit, and additional temporal information and clicks related to them. This way, the model can validate the instantaneous importance of users and resources. We claim that the model can reduce the malicious and spam content submitted to our system and can help to identify premium users in social community. Note that TC-SocialRank stands for “Time and Click SocialRank”. In addition, we address the problem of ambiguities by proposing a simple algorithm for suggesting related tags during search. This is somehow similar to traditional query suggestion approaches [2]. As a case study for testing the effectiveness of TC-SocialRank, we discuss JammingSearch a novel folksonomy system that unifies web search, social bookmarking and user generated tagging by transparently leveraging a Wikibased collaborative editing system. We validate the ranking model with a set of experiments run on two different data sets. One was collected on JammingSearch and another one was built from a crawl of Del.icio.us [5]. In the rest of this paper we will use equivalently the terms social bookmarking and resource sharing.
2
Related Work
In this Section, we review the academic works and industrial proposals that leverage social bookmarks and folksonomy systems for improving web search. Academic Works: In the last two years, the academic research has produced a number of interesting works about social bookmarking and web search. A recent
TC-SocialRank: Ranking the Social Web
145
paper [10] considers the process of social search (bookmarking and tagging) as the next big thing that can substantially improve the quality of web search. The authors observe that social bookmarking URLs are full of new and fresh information and can improve the web search adding unindexed pages annotated with relevant and objective tags. For these reasons, they argue that social bookmarking can have a significative impact on web search, if the user contribution to social bookmarks will increase in the next years. Many authors propose to extend traditional web pages graph and link-based ranking algorithm with information extracted by social bookmarking and folksonomy systems. Krause et al. [18] compare search in social bookmarking systems with traditional Web search. They show that a graph-based ranking approach on folksonomies yields results that are close to the rankings of the commercial search engines. In [25], Yanbe et al. propose to use data from social bookmarking systems to enhance web searches. They suggest to combine a link-based ranking metric with the one derived using social bookmarking data. Another interesting proposal is FolkRank [12], which ranks users, tags, and resources. All of them are seen as nodes in an un-directed weighted tripartite graph. The nodes are ranked using a variant of Personalized PageRank biased towards a specific topic preference. In a similar spirit, SocialPageRank [1] ranks only the resources available in the social engine using a variant of PageRank. Different from the above works, our ranking algorithm, TC-SocialRank, uses a 2.0 Web graph which clearly differentiates two types of nodes: user nodes and resource nodes. The algorithm computes two separate ranks for them. Besides, TC-SocialRank leverages other information, such as temporal information and the number of clicks associated to the resource. This way, TC-SocialRank is able to capture the instantaneous change of importance of users or shared resources. Another interesting stream of research tries to reduce ambiguities faced in social tagging. Different synthetical tags can express the same semantical information. This condition may create ambiguities during web search. Yeung et al. [26] propose to extract semantic information from the collaborative tagging contained in Del.icio.us. This information is then used to disambiguate search results returned by Del.icio.us and Google. SocialSimRank[1] computes the similarity between two tags of a folksonomy and allows to reduces ambiguities in social tagging. We address the problem of ambiguities by proposing a simple algorithm for suggesting related tags during search, in a way somehow similar to traditional query suggestion approaches [2]. Industrial Proposals: The concept of shared online bookmark was pioneered by iList, Backflip, Clip2, ClickMarks, HotLinks, while the term Social Bookmarks was coined by Del.icio.us in 2004. Since then, many other social bookmarking services have been proposed, such as Furl, StumbleUpon, Simpy, and Diigo. Only recently, the model of social bookmarking was exported to more specific fields, such as, online newspapers and academic publications. In this spirit, Digg, Reddit, and Newsvine were born to allow users to bookmark and vote news items found in different news sources on the web. CiteULike and BibSonomy [11] represent
146
A. Gulli, S. Cataudella, and L. Foschini
instead an application of the social bookmarking concept applied to the field of academic research. In the last few years, major search engines have incorporated customized solutions for social bookmarking. Google allows users to bookmark Web pages contained in Google’s Web History, a service that tracks queries and search results for users who are logged on Google Account. Ask.com’s MyStuff and Yahoo! Bookmarks offer a similar service, also allowing users to specify public folders of bookmarks which can be searched and shared by others. These services are useful but cannot be extended to work with different search engines. A user searching on Google cannot share his bookmarks with others using Yahoo!, Ask.com, Microsoft Live, and vice versa. In Section 4, we describe JammingSearch our model of Web social search allows users to seamlessly leverage the search results returned by different search engines. Wikia Search [24] is the first attempt to integrate the traditional web search with the Wiki collaborative model. The system allows editing of any search result crawled and indexed by the engine. We note that Wikia Search requires users to change their habits by leaving their favorite search engine in favor of its lesser known and still under development service. On the other hand, JammingSearch does not require any switching cost, since users are encouraged to continue searching their favorite search engine. Recently, Google added an experimental feature [8] which allows the users “to influence search experience by adding, moving, and removing search results”. We note that those modifications are local to each single user. On the contrary, JammingSearch promotes a true collaboration among the community and all the changes performed by a user can be seen by others, if desired.
3
TC-SocialRank for Social Web
In this Section, we introduce TC-SocialRank a ranking scheme used to rank both the users of the community, and the resources they share. We note that TC-SocialRank extends current state-of-art social ranking algorithms, such as FolkRank [12], and SocialPageRank [1]. Our algorithm (1) clearly separates user ranks from resource ranks and (2) leverages additional temporal information and user clicks. The algorithm extends the paradigm introduced by one of the authors in [3] for ranking a stream of news. We validate our ranking algorithm on two data sets. One data set has been collected by observing the usage of JammingSearch, the other data set has been obtained by crawling some specific topics on Del.icio.us for a period of one week. Ranking Resources and Users: According to [13] a folksonomy can be defined as a quadruple F = (U, T, R, Y ) where: – U , T , R are finite sets of instances of users, tags and resources; – Y defines a relation of tag assignment between these sets, Y ⊆ U × T × R. We extend this definition by considering two functions, ψ : (U, T, R) → t, which accounts for the time t when the user, the tag or the resource has been
TC-SocialRank: Ranking the Social Web
147
inserted in our systems, and φ : (U, T, R) → nc, which accounts for the number of clicks, nc, received by a tag or a resource inserted in our system. Therefore, our model of folksonomy is a sextuple F = (U, T, R, Y, ψ, φ). We believe that having fresh search results could be particularly important in folksonomy systems because users are continuously sharing new resources and adding fresh tags. In our model there are two different types of subjects that can be ranked: resources and users. The ranking is computed on the basis of votes assigned to the subjects, more precisely: – A resource/bookmark item receives a positive vote in two situations: (a) each time it is submitted by a user or (b) every time it is clicked by a user during browsing or searching activities carried out on the social community; – A user receives a positive vote every time that other authoritative users click on any of his bookmark items, while searching the social community; The key intuition is that user’s clicks and submissions reflect the social activities on our community. If a page is posted or clicked by many authoritative users, it gains a higher position in our bookmark ranking. Conversely, if a user produces content that other users like, he is ranked higher in the community. The idea is similar to the one used by the HITS algorithm [17], in identifying hubs and authorities in communities, and EigenRumor [7], in identifying a separate rank for agents and objects. Therefore, we outline only the main distinctive features of our ranking algorithm with respect to the well-known solutions. Adding Temporal Information: In our model, user’s clicks and bookmarks/ resources are annotated with temporal information. Our ranking scheme can be formalized by adopting a model similar to the one proposed in [3] for ranking a stream of news items. The intuition behind the analogy is the following. Bookmarks submitted by users can be paralleled to articles produces by a news source in the model of [3]. Conversely, the news sources of [3] are represented by the users of the community. Let Gw = (V, E) be a graph, where V = U ∪ K. U are the nodes representing the users and K are the nodes representing the bookmarks/resources seen in a time window ω, an input parameter of our model. Analogously, the set of edges E is partitioned in two disjoint sets E1 and E2 . E1 is the set of undirected edges between U and K. Here, we assume that anonymous clicks or bookmark/resource submission are generated by a special node in U . E2 is the set of undirected edges with both the endpoints in K and represents the bookmarks/resources saved under the same tag, according to the model described in Section 4.1. Analogously to what we have proposed in [3], the adjacency matrix of the user-bookmarks graph is given by: 0 B . A= BT Σ The matrix is obtained by assigning an identifier to the nodes G so that all the identifiers assigned to nodes in U are smaller than identifiers assigned to nodes
148
A. Gulli, S. Cataudella, and L. Foschini
in K. The submatrix B refers to edges from users to bookmarks/resources, and bij = 1 iff the user i created a bookmark/resource j or clicked on it. The submatrix Σ refers to edges from bookmarks to bookmarks, and σi,j = 1 iff the bookmarks/resources i and j are tagged by the same query. In [3] two nontime aware and three time aware algorithms for ranking the nodes in G were discussed. We now describe only one time aware ranking scheme and invite the interested reader to review our previous work for a detailed description of the other schemes. Let R(bi , t) be the rank of bookmark item bi at time t, and analogously, let R(uk , t) be the rank of user uk at time t, where k ∈ {1 . . . n} and i ∈ {1 . . . m} (being n and m cp the number of users and bookmark items respectively observed during the time window ω). Moreover, by S(bi ) = uk we mean that bi has been posted by user uk and by C(bi ) the set of users that clicked the bookmark bi . Decay Rule: We adopt the following exponential decay rule for the rank of bi which has been submitted at time ti : R(bi , t) = e−α( t−ti ) R(bi , ti ),
t > ti .
The value α is obtained from the half-life decay time ρ. ρ is an input parameter denoting the time required by the rank to halve its value, with the relation e−αρ = 12 and α, ρ > 0. In addition, we consider another parameter β, 0 < β < 1, which gives us the amount of user’ s rank we want to transfer to each bookmark item. Similar to [3], our ranking scheme is defined by the following equations: β R(bi , ti ) =
lim R (S(bi ), ti −τ )
τ →0 +
+
uj ∈C( bi )
R(uk , t) =
S( bi ) = uk
+
(1)
β lim R (u
τ →0 +
j , ti
e−α( t−ti ) R(bi , t) +
− τ)
+
e−α( ti −tj ) σij R(bj , tj )β
tj
S( bi ) = uk
e−α( t−ti )
σij R(bj , tj )β (2)
tj > ti S(bi ) = uk
The intuition behind Equation 1 is that the rank of a bookmark item bi at time ti depends on three factors: (a) the rank of the user that posted the bookmark item, (b) the rank of the users who clicked on it, and (c) the rank of other bookmark items previously posted under the same query tag. The importance of those previous bookmark items is damped by a negative exponential factor. Note that the limit used in the equation captures the idea of a user rank computed “a little before” the time ti . The intuition behind Equation 2 is that the rank of a user uk at time t depends on two factors: (d) the rank of the bookmark items that the user posted and (e) the rank of bookmark items under the same query tag of the bookmark posted by the user. This last factor is a “bonus” to those users who start wiki pages
TC-SocialRank: Ranking the Social Web
149
which, later on, becomes popular among other wiki contributors. Note that the parameter β is similar to the magic ε accounting for the random jump in Google’s Pagerank [20] and guarantees that the fixed point equation involving the user, has a non-zero solution. Recall that in both equations, the time tj is the posting time of bookmark bj . The interested reader can refer to our previous work for a discussion about the existence of a fixed point solution for the system reported in Equations 1 and 2. The Equations 1 and 2 defines our TC-SocialRank algorithm. We point out that it boosts those bookmark items posted or clicked by important users. This is a desired property which can help demote spam and malicious content posted on the social community. Ranking at Query Time: TC-SocialRank assigns a static rank to bookmark items. In a real search engine Equation 2 can be combined with a query based textual score as in the Equation R(bi , ti , q) = γ textRank(bi, q) + (1 − γ) R(bi , ti ), where textRank(bi,q) is a TF x IDF textual ranking function and 0 < γ < 1 is a constant that tunes the importance of the traditional textual rank. In the future we plan to use SVM [4] (with linear kernel) to learn the optimal weight γ.
4
JammingSearch
In this Section we present JammingSearch, an automatic prototype system that is able to seamlessly integrate social bookmarking, wiki, user tagging and web search in a unified view. The system has been tested to work with Google, Yahoo!, Microsoft Live Search, Ask.com and AOL. 4.1
A Query-Based Tagging Model
One key idea behind JammingSearch is the following. A search query can be seen as a summary of each result retrieved by a search engine. This description becomes a very effective tag if one of those results is clicked by the user. For instance, if a user searches the query “ madonna songs” and selects the http://www.likeavirgin.com, that web page can effectively be tagged with the keywords contained in the query. This intuition is very simple but defines a clear and uniform tagging model. A similar idea has been already suggested by WebWatcher [15] and I-Spy [21], two web recommending systems. Here, we leverage this idea in the context of social tagging and claim that the resources receive the query as a primary tag. This primary tagging induces a natural separation of resources into user generated groups. Later on, each resource can be further edited or annotated with secondary tags provided by the user community. As it happens for many other folksonomy systems, JammingSearch may suffer of ambiguity issues since different syntactical tags can express the same semantical information. Another issue is that a query used by a user as a primary tag, may not clearly capture the search needs of the rest of the community. For instance, if a user searches for “Britney Spears”, he might miss the result regarding the tag “Britney”, or any abbreviated or misspelled version of it. JammingSearch addresses this issue in a simple way. Each primary tag tagip has a
150
A. Gulli, S. Cataudella, and L. Foschini
relation with the set of most common secondary tags associated to the resources grouped under the tag tagip . When the user search JammingSearch, we present a list of relevant shared resources and a list of the most common tags associated with them. The approach used is similar to the one adopted for query suggestion in traditional web search and described in [2]. This way, the user can explore the folksonomy space by leveraging previous user tag contributions. 4.2
Working with JammingSearch
In this section we briefly describe the features available in JammingSearch. First we describing the client and servers side operations, then we focus our attention on the tagging and social editing features available for the users. Client Side Operations: A user submits a query q by using his preferred search engine. If the user likes the web content, he can bookmark it by clicking on a button installed on the browser. This button is a bookmarklet, a little snippet of javascript code that is executed when it is clicked. It captures: 1. The query q for which the user selected the resource r; 2. The URL and the title of the selected resource r; 3. A snippet of text associated to r. (Optionally, the user can highlight a part of text by simply using the mouse over the resource); 4. A UserID, if the user is logged in our system. Our bookmarklet works with all the major browsers such as Mozilla, Internet Explorer, Opera, Safari and many others. The bookmarklet can be sent via email or embedded in a web page as a hyperlink. The user who wants to install JammingSearch simply needs to bookmark this hyperlink on his browser and show it on the bookmark toolbar. We would like to point out that users do not need to interrupt their normal activity of searching and browsing, in order to interact with our system. The benefit for the user is to have a network place where to store those search results considered as useful. The benefit for the community is to leverage the contribution of each different user. We are not creating yet another search engine aiming at attracting users, but instead we encourage users to continue using their preferred Web search engines. If they want to start a social activity on web search results, they can transparently use our system as an auxiliary tool. Server Side Operations: JammingSearch’s bookmarklet sends the information selected by the user by means of an HTTP connection. Our central server stores this content in a structured format. We decided to adopt the MediaWiki markup language to store this data, since it is simple and naturally fits our model. Therefore, we will refer to JammingSearch server side part as JammingWiki. Our choice allows us to leverage of all the infrastructure optimizations currently adopted in Wikipedia.
TC-SocialRank: Ranking the Social Web
5
151
Experimental Results
In this section we evaluate our TC-SocialRank algorithm. We consider the impact of the temporal information, and the precision achieved in ranking users and resources. We do not directly compare TC-SocialRank with FolkRank [12] and SocialPageRank [1] since our ranking model is more general that theirs. Our experimental study is carried out on two data-sets. We denote the first one as data set as D1. D1 has been collected on JammingSearch and consists of 7845 bookmark items submitted by 13 users during a period of 3 months. We denote the second one data set as D2. D2 consists of 887 random tags crawled by Del.cio.us. Folksonomy Issues: We crawled 50 random queries from Del.icio.us and their associated secondary tags, for a total of 887 tags. For each primary tag, we sorted the secondary tags by number of occurrences, cutting the list at the integer value t. Figure 1 gives the F1 curves for different values of t and for different positions in the lists. It appears that the best choice is t = 5.
F i g . 1 . F1 for different values of t
Users Rank Stability: TC-SocialRank has some tuning parameters: ρ denotes the time required by the rank to halve its value, while β gives us the amount of user’ s rank we want to transfer to each resource (see Section 3). A first group of experiments addressed the sensitivity at changes of the parameters ρ and β. This set of experiments were run on data-set D1. As a measure of concordance between the user ranks produced with different values of the parameters, we adopted the well known Spearman [22] and Kendall-Tau [16] correlations. i , where We report the ranks computed with our scheme, for values of βi = 10 i = 1, 2, . . . , 9 and for ρ = 15 days, 1 month and 3 months. In Figure 2, for a fixed ρ the abscissa βi represents the correlation between the ranks obtained with values βi and βi−1 . As in [3], we observe that the ranking scheme is not very sensitive to changing in the parameters involved. The rank attributed to the users of our community is relatively stable, if we adopt a value of β less than 0.80. This observation is confirmed for different value of temporal decay parameter ρ.
152
A. Gulli, S. Cataudella, and L. Foschini
Fig. 2. Correlations between ranks of users obtained with two consecutive values of β
Fig. 3. Average P@N for different values of γ
Fig. 4. Average precision and recall with and without clicks
Evaluating Resource Ranking: For each query, we evaluated the precision @N at the first N items generated. Precision at top N is defined as: P @N = M N where M @N is the number of search results that have been manually tagged relevant among the N top-level labels computed by JammingSearch. We believe that P @N reflects the natural user behavior of considering the top-level results, since lazy users will refrain from browsing many pages of results. Therefore we decided to limit our evaluation to the first page of search results and, in particular to, P @3, P @5, P @7 and P @10. In Figure 3 we report the average P@N for different values of γ. The best results are obtained for γ = 0.6. This value yields a good compromise between textual ranking and our novel ranking based on authoritativeness of users, temporal information and user clicks.
TC-SocialRank: Ranking the Social Web
153
Evaluating User Clicks: Now, we consider the importance of past clicks for our TC-SocialRank algorithm. Figure 4 reports P@N, when we consider and when we do not consider the set of user, C(bi ), that clicked the bookmark item bi in Equation 1. We note that using click feedbacks increase the performance over the baseline.
6
Conclusion and Future Work
In this paper, we proposed a novel ranking algorithm for social web search. TC-SocialRank extends current state-of-art social ranking algorithms, such as FolkRank [12], and SocialPageRank [1]. We do not compared TC-SocialRank with previous algorithms, since our algorithm (1) clearly separates user ranks from resource ranks and (2) leverages additional temporal information and user clicks. We noted that temporal information has a primary importance in social bookmarking search since users continuously post new and fresh information. The importance of this information may decay after a while, if it is no longer tagged or clicked. We run a set of preliminary experiments for assessing the effectiveness of TC-SocialRank. Initially, we evaluated the sensitivity at changes of the temporal parameters used by the TC-SocialRank. We observed that the rank attributed to the users of our community is relatively stable, and this is a desirable property of our algorithm. Then, we evaluated the precision achieved in ranking shared resources and the influence of clicks in ranking the users of the community. The results are quite promising. We addressed the problem of polisemy, homonymy and synonymy. Our experiment, show that tags suggestion may help users in their social search activities. For the future, we think that TC-SocialRank can be improved on various aspects. An interesting open research topic that still needs to be addressed is to extend our ranking scheme towards personalization, by adopting a personalized time-aware variant of HITS.
References 1. Bao, S., Xue, G., Wu, X., Yu, Y., Fei, B., Su, Z.: Optimizing web search using social annotations. In: Proceedings of the 16th international conference on World Wide Web, pp. 501–510 (2007) 2. Beeferman, D., Berger, A.: Agglomerative clustering of a search engine query log. In: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 407–416 (2000) 3. Corso, G.M.D., Gull´ı, A., Romani, F.: Ranking a stream of news. In: WWW 2005: Proceedings of the 14th international conference on World Wide Web, pp. 97–106. ACM, New York (2005) 4. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning. Methods Cambridge University Press, Cambridge (2000) 5. http://del.icio.us/ (2008)
154
A. Gulli, S. Cataudella, and L. Foschini
6. Fallows, D.: Search engine users (2004), http://www.pewinternet.org/pdfs/PIP Searchengine users.pdf 7. Fujimura, K., Tanimoto, N.: The eigenRumor algorithm for calculating contributions in cyberspace communities. In: Falcone, R., Barber, S., Sabater-Mir, J., Singh, M.P. (eds.) Trusting Agents for Trusting Electronic Societies. LNCS, vol. 3577, pp. 59–74. Springer, Heidelberg (2005) 8. http://www.google.com/support/faqs/?editresults (2008) 9. Gulli, A., Signorini, A.: The indexable web is more than 11.5 billion pages. In: Proceedings of 14th International World Wide Web Conference, Chiba, Japan, pp. 902–903 (2005) 10. Heymann, P., Koutrika, G., Garcia-Molina, H.: Can social bookmarking improve web search?. In: Proceedings of the international conference on Web search and web data mining, pp. 195–206 (2008) 11. Hotho, A., J¨ aschke, R., Schmitz, C., Stumme, G.: BibSonomy: A Social Bookmark and Publication Sharing System. In: Proceedings of the Conceptual Structures Tool Interoperability Workshop at the 14th International Conference on Conceptual Structures, pp. 87–102 (2006) 12. Hotho, A., Jaschke, R., Schmitz, C., Stumme, G.: Folkrank: A ranking algorithm for folksonomies. In: Proc. FGIR 2006 (2006) 13. Hotho, A., J¨ aschke, R., Schmitz, C., Stumme, G.: Information Retrieval in Folksonomies: Search and Ranking. In: Sure, Y., Domingue, J. (eds.) ESWC 2006. LNCS, vol. 4011, pp. 411–426. Springer, Heidelberg (2006) 14. Different engines, different results (April 2007), http://www.infospaceinc.com/ onlineprod/Overlap-DifferentEnginesDifferentResults.pdf 15. Joachims, T., Freitag, D., Mitchell, T.: WebWatcher: A Tour Guide for the World Wide Web. In: Proceedings of IJCAI 1997, vol. 8 (1997) 16. Kendall, M.G.: A new measure of rank correlation. Biometrika 430, 81–93 (1938) 17. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. Journal of the ACM 46(5), 604–632 (1999) 18. Krause, B., Hotho, A., Stumme, G.: A comparison of social bookmarking with traditional search. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 101–113. Springer, Heidelberg (2008) 19. Nielsen - NetRatings. Nielsen online reports topline U.S. data (April 2008), http://www.nielsen-netratings.com/pr/pr 080515.pdf 20. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: Bringing order to the web (1998) 21. Smyth, B., Freyne, J., Coyle, M., Briggs, P., Balfe, E., Building, T.: I-spy: anonymous, community-based personalization by collaborative web search. In: 23rd SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence (2003) 22. Spearman, C.: The proof and measurement of association between two things. American Journal of Psychology 15(1), 72–101 (1904) 23. Vanderwal, T.: Off the top: Folksonomy entries (2005) 24. http://search.wikia.com/wiki/Search Wikia (2008) 25. Yanbe, Y., Jatowt, A., Nakamura, S., Tanaka, K.: Can social bookmarking enhance search in the web? In: JCDL 2007: Proceedings of the 7th ACM/IEEE joint conference on Digital libraries, pp. 107–116. ACM, New York (2007) 26. Yeung, C.M.A., Gibbins, N., Shadbolt, N.: Web search disambiguation by collaborative tagging. In: Workshop on Exploring Semantic Annotations in Information Retrieval at ECIR 2008 (2008)
Exploiting Positive and Negative Graded Relevance Assessments for Content Recommendation Maarten Clements1 , Arjen P. de Vries1,2 , and Marcel J.T. Reinders1 1
2
Delft University of Technology, The Netherlands
[email protected] National Research Institute for Mathematics and Computer Science (CWI), Amsterdam, The Netherlands
Abstract. Social media allow users to give their opinion about the available content by assigning a rating. Collaborative filtering approaches to predict recommendations based on these graded relevance assessments are hampered by the sparseness of the data. This sparseness problem can be overcome with graph-based models, but current methods are not able to deal with negative relevance assessments. We propose a new graph-based model that exploits both positive and negative preference data. Hereto, we combine in a single content ranking the results from two graphs, one based on positive and the other based on negative preference information. The resulting ranking contains less false positives than a ranking based on positive information alone. Low ratings however appear to have a predictive value for relevant content. Discounting the negative information therefore does not only remove the irrelevant content from the top of the ranking, but also reduces the recall of relevant documents.
1
Graded Relevance Assessments
The popularity of online social media encourages users to manage their online identity by active participation in the annotation of the available content. Most of these systems allow their users to assign a graded relevance assessment by giving a rating for a specific content element. Many people use these ratings in order to convey their opinion to the other network users, or to organize their own content to gain easy access to their favorite files. Collaborative filtering methods use the created rating profiles to establish a similarity between users or items. This similarity is often based on the Pearson correlation which has proven to be an effective measure to incorporate positive and negative feedback while compensating for differences in offset or rating variance between users [1,2,3,4]. Because these similarity functions derive the similarity based on the overlapping part of the users’ rating profiles these methods perform poorly in sparse data spaces [5,6]. Graph-based methods have shown to effectively deal with extremely sparse data sets by using the entire network structure in the predicted ranking. Most of K. Avrachenkov, D. Donato, and N. Litvak (Eds.): WAW 2009, LNCS 5427, pp. 155–166, 2009. c Springer-Verlag Berlin Heidelberg 2009
156
M. Clements, A.P. de Vries, and M.J.T. Reinders
these methods have been developed to estimate a global popularity ranking in graphs with a single entity type, like websites [7,8]. These methods are generally not adapted to negative relevance information and do not provide personalized rankings for each network user. Using a personalized random walk over two graphs we separately compute a ranking based on positive and negative preference information. We combine these two rankings and compare the result on two real data sets. We discuss the positive and negative effects of the proposed method compared to recently proposed graph-based ranking models.
2
Graph Combination Model
We define two bipartite graphs G+ = V, E + and G− = V, E − where the set of vertices consists of all users and items V = U ∪ I (U is the set of users uk ∈ U (with k ∈ {1, . . . , K}) and I is the set of items il ∈ I (with l ∈ {1, . . . , L})). The set of edges (E + /E − ) consists of all user-item pairs {uk , il }. The weight of the edges is determined by the value of the rating, which will be discussed in Section 3. We propose to use a random walk model to obtain a ranking of the content in both graphs. A random walk can be described by a stochastic process in which the initial condition (Sn ) is known and the next state (Sn+1 ) is given by a certain probability distribution. This distribution is represented by a transition matrix A, where ai,j contains the probability of going from node i (at time n) to j (at time n + 1): ai,j = P (Sn+1 = j|Sn = i) (1) The initial state of all network nodes can now be represented as a vector v0 (with i v0 (i) = 1), in which the starting probabilities can be assigned. By multiplying the state vector with the transition matrix, we can find the state probabilities after one step in the graph (v1 ). Multi step probabilities can be found by repeating the multiplication vn+1 = vn A, or using the n-step transition matrix vn = v0 An . The number of steps taken in the random walk determines the influence of the initial state on the current state probabilities. The random walk is a Markov model of order 1 (or Markov chain), because the next state of the walk only depends on the current state and not on any previous states, which is known as the Markov property: P (Sn+1 = x|Sn = xn , . . . , S1 = x1 ) = P (Sn+1 = x|Sn = xn )
(2)
If A is stochastic, irreducible and aperiodic, v will become stable, so that v∞ = v∞ A [9]. These limiting state probabilities represent the prior probability of all nodes in the network determined by the volume of connected paths [10]. In order to make these conditions true, we ensure that all rows of A add up to 1 by normalizing the rows, that A is fully connected, and that A is not bipartite. We include self-transitions that allow the walk to stay in place, which increases the influence of the initial state. The self-transitions are represented by the
Exploiting Positive and Negative Graded Relevance Assessments
157
identity matrix S = I, so that the weight of the self-transitions is equal for all nodes. We distinguish the transition matrix based on positive and negative ratings (T+ and T− ). The random walk over the positive graph estimates the ranking of relevant documents, while the walk over the negative graph estimates the ranking of most irrelevant documents. We create the positive transition matrix as follows: αSK (1 − α)R+ + T = T (1 − α)R+ αSL where R+ contains the positive preference information (See Section 3). To make sure the graph is fully connected we add an edge with weight ǫ between all node pairs, which allows the walk to teleport to a random node at each step. The 1KL , where 1KL final transition matrix is now given by: A+ = (1 − ǫ)T+ + ǫ K+L represents the ones matrix of size K + L. The teleport probability ǫ is set to 0.01 in all experiments. The negative transition matrix is constructed similarly. We now create the initial state vector with a zero array of length K + L and set the index corresponding to the target user to one v0 (uk ) = 1. Multiplying the state vector with one of the transition matrices gives either the estimation of relevant (v1+ = v0 A+ ) or irrelevant (v1− = v0 A− ) content for the target user. The first step gives the content annotated by the user himself, while subsequent − ) give an estimate of the most similar users and content. steps (vn+ ;vm Both random walks produce a state probability vector which indicates either the positive or negative information that we have about each node. To obtain a single content ranking, we combine the parts of the state vectors that correspond to item nodes (v+ (K + 1, ..., K + L) and v− (K + 1, ..., K + L)). The combined content ranking is obtained by simply subtracting the negative state probabilities − from the positive state probabilities (vn+ −vm ). Intuitively, this subtraction ranks the content according to the difference in positive and negative information in the neighborhood of the target user. 2.1
Self-transition (α) and Walk Length (n)
The number of steps in the random walk (n) determines how strongly the final ranking depends on the starting point (target user uk ). The speed of convergence is determined by the self-transition probability α. Because all nodes have equal self-transition probability, the total number of non-self steps (Q) after n steps through the graph is a binomial random variable with probability mass function (PMF): n q n−q q = 0, . . . , n q α (1 − α) (3) PQ (q) = 0 otherwise Where PQ (q) is the probability of q non-self steps (Q = q). If a large value is chosen for α, most of the probability mass will stay close to the starting point. A small value of α results in a walk that quickly converges to the stable state probability distribution. Based on earlier experiments we fix the self-transition probability (α) to 0.8 [11,12].
3 Data
3.1 LibraryThing (LT)
LibraryThing1 is a social online book catalog that allows its users to indicate their opinion about their books by giving a rating. Based on these preference indications LT gives suggestions about interesting books to read and about people with similar taste. The popularity of the system has resulted in a database that contains over 3 million unique works, collaboratively added by more than 400,000 users. We have collected a part of the LibraryThing network, containing 25,295 active users2. After pruning this data set we retain 7,279 users that have each rated at least 20 books. We remove books that occur in fewer than 5 user profiles, resulting in 37,232 unique works. The user interface of LibraryThing allows users to assign ratings on a scale from a half to five; the distribution of the ratings in our LT data sample is shown in Figure 1a.

3.2 MovieLens (ML)
To validate the reproducibility of our results, we also use the data set from MovieLens3, which is a well-known benchmark data set for collaborative filtering algorithms. ML consists of 100,000 ratings for 1682 movies given by 943 users. In this data, ratings have been given on a scale of 1 to 5, 1 being dreadful and 5 excellent. Figure 1b shows the rating distribution in the ML data.
Fig. 1. Rating distribution of a) LibraryThing and b) MovieLens
1 http://www.librarything.com
2 Crawled in July 2007.
3 http://www.grouplens.org/
Fig. 2. Conversion of ratings to edge weights
3.3 Rating to Edge Weight
Figure 2 gives the edge weights we assign to various ratings. The positive rating matrix R+ integrates the ratings 3–5 and the negative rating matrix R− contains the ratings ½–2½, where the largest weight is assigned to the lowest rating. Although a 3 is the average rating that can be given in most user interfaces (by clicking a number of stars), it is generally regarded as slightly positive, because more than half of the stars are filled when a user has clicked the third star. The LT data consists of a total of 749,401 ratings; using the split between positive and negative ratings as indicated in Figure 2, R+ will have a density of 2.53 · 10^−3 and R− a density of 2.34 · 10^−4. The MovieLens data has a much lower positive bias and the indicated data split results in R+ with a density of 5.20 · 10^−2 and R− with a density of 1.10 · 10^−2. The difference in positive bias can be explained by the fact that watching a movie is a social experience while reading is not. Users will therefore watch more movies they do not like (because their friends want to see them) than read books they do not like. The resulting graphs (G+, G−) have a clear power-law structure, which is common to socially organized data [13]. However, the long tail is reduced due to the pruning step in which the users and items with few connections were removed.
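A small helper consistent with the mapping of Figure 2 (ratings 3–5 map to positive edge weights 1–5, ratings ½–2½ map to negative weights 5–1); the function and its name are ours.

```python
def rating_to_edge(rating):
    """Map a graded rating to (positive_weight, negative_weight) as in Figure 2."""
    if rating >= 3.0:                         # 3, 3.5, ..., 5 -> R+ weights 1..5
        return 2 * (rating - 3.0) + 1, 0.0
    return 0.0, 2 * (2.5 - rating) + 1        # 0.5, 1, ..., 2.5 -> R- weights 5..1

assert rating_to_edge(5.0) == (5.0, 0.0)
assert rating_to_edge(0.5) == (0.0, 5.0)
```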
4 Experimental Setup
4.1 Data Split
To obtain a fair comparison without overfitting the model to the data we split the data sets into two equal parts, see Figure 3. First we use the training users to estimate the optimal model parameters, by finding the optimal value for our evaluation criterion. We remove 1/5 of the ratings of 1/5 of the training users (validation set) and use the rest of the data to predict the missing values. We average the results over 5 different independent splits. Using the optimal model parameters we evaluate the different models on the test set. We again remove 1/5 of the user profiles and use 5-fold cross-validation to obtain stable results.
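A sketch of the validation split used for parameter tuning (1/5 of the ratings of 1/5 of the training users held out); the data layout (a dict from user to rating list) and all names are assumptions of ours.

```python
import random

def make_validation_split(profiles, seed=0):
    """Hold out 1/5 of the ratings of 1/5 of the users; repeat with different seeds for 5 splits."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(profiles), len(profiles) // 5)
    held_out = {u: set(rng.sample(sorted(profiles[u]), len(profiles[u]) // 5)) for u in chosen}
    train = {u: [r for r in profiles[u] if r not in held_out.get(u, set())] for u in profiles}
    return train, held_out
```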
4.2 Evaluation
NDCG. To evaluate the predicted content ranking we use the Normalized Discounted Cumulative Gain (NDCG) proposed by Järvelin and Kekäläinen [14].
Fig. 3. Splitting the data in a train and test set
We first create a gain vector G with length L (all items) of zeros. In this gain vector the predicted rank positions of the held-out validation items are assigned a value equal to the edge weights in the training graph (see Figure 2), called the gain. In order to progressively reduce the gain of lower ranked test items, each position in the gain vector is discounted by the log2 of its index i (where we first add 1 to the index, to ensure discounting for all rank positions > 0). The Discounted Cumulative Gain (DCG) now accumulates the values of the discounted gain vector:

DCG[i] = DCG[i − 1] + G[i] / log2(i + 1)   (4)

The DCG vector can now be normalized to the optimal DCG vector. This optimal DCG is computed using a gain vector where all test ratings are placed at the top of the vector in descending order. Component-by-component division now gives us the NDCG vector in which each position contains a value in the range [0, 1] indicating how close the ranking so far is to the optimal ranking. We use the area below the NDCG curve as the score to evaluate our rank prediction. We want to evaluate the prediction of relevant content with respect to the prediction of irrelevant content. We separately compute the predictive value for positive test items (NDCG+) and negative test items (NDCG−) and use the fraction of the two NDCG measures as evaluation method: NDCG+/NDCG−. This measure will be optimal if the predicted ranking contains the positive test items at the top (in descending rating order) and the negative test items at the bottom.

PPV@20. The positive predictive value indicates the fraction of recommended relevant documents (true positives, TP) with respect to incorrectly recommended irrelevant documents (false positives, FP):

PPV = TP / (TP + FP)   (5)
We assume that the system will give 20 recommendations and therefore compute the PPV on the top 20 of the ranked list (PPV@20).
Compared to precision, which is defined as TP/n (where n is the number of recommended documents), PPV does not regard the unassessed recommended items as incorrect recommendations. Because we are interested in evaluating the number of relevant compared to negatively assessed documents we use PPV as evaluation method.

Recall@20. Recall indicates the number of true positives with respect to all relevant documents in the database and is defined as:

Recall = TP / (TP + FN)   (6)

where FN indicates the number of unrecommended relevant documents (false negatives).
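A compact sketch of the three evaluation measures described above (area under the NDCG curve, PPV@20 and Recall@20); the binary relevance sets and all names are ours, not the authors'.

```python
import numpy as np

def ndcg_area(gains):
    """Area below the NDCG curve for a gain vector ordered by predicted rank."""
    discounts = np.log2(np.arange(len(gains)) + 2)            # position i discounted by log2(i + 1)
    dcg = np.cumsum(np.asarray(gains, dtype=float) / discounts)
    ideal = np.cumsum(np.sort(gains)[::-1] / discounts)       # optimal DCG: gains sorted descending
    ndcg = np.divide(dcg, ideal, out=np.zeros_like(dcg), where=ideal > 0)
    return float(ndcg.mean())

def ppv_at_20(ranked, positives, negatives):
    top = set(ranked[:20])
    tp, fp = len(top & positives), len(top & negatives)
    return tp / (tp + fp) if tp + fp else 0.0                  # unassessed items are ignored

def recall_at_20(ranked, positives):
    return len(set(ranked[:20]) & positives) / len(positives)
```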
5 Experiments
5.1 Optimizing Relevance Ranking
We separately optimize the ranking of positive and negative content on the training set. We first look at the NDCG+ for increasing walk length on the positive graph using v+ to obtain the content ranking. Figure 4a shows that for both LT and ML the optimal ranking is achieved after only 5 steps through the graph. The NDCG+ quickly converges to a stable value when the state vector reaches the global content popularity. The absolute difference between the performance on the two data sets can be explained by two factors. The ML users on average rated a larger fraction of the available items. Therefore, the probability of finding a relevant test item at the top of the ranking is higher, independent of the method used. Also, the more extensive user profiles result in denser social graphs, allowing the model to make more accurate predictions.
Fig. 4. a) Optimization of the walk length over the positive rating graph (max. at n = 5 for both data sets). b) Optimization of the prediction of negative test items (max. at m = 81 for LT and m = 23 for ML).
Table 1. Test results for both data sets. NDCG is abbreviated with N.

Data  Method        N+     N+/N−   PPV@20  Recall@20
LT    v5+           0.310  5.279   0.973   0.474
LT    v5+ − v81−    0.195  7.466   0.976   0.362
LT    v5            0.318  4.909   0.967   0.496
ML    v5+           0.491  3.538   0.944   0.596
ML    v5+ − v23−    0.167  6.165   0.971   0.278
ML    v5            0.508  3.246   0.925   0.627
5.2 Optimizing Irrelevance Ranking
Figure 4b shows the NDCG− (prediction of irrelevant content) optimized on the negative rating graph (ranking based on vm−). It is clear that a longer walk over the graph is needed to obtain an optimal prediction. Also, the optimal negative ranking is reached for a smaller number of steps on the ML data than on the LT data. This is expected because the ML data contains more negative ratings; in other words, the negative graph of ML is much denser than the LT negative graph. On a dense graph the state probability vector of the random walk will converge more quickly to the stable distribution (i.e., the graph has a shorter mixing time). Because the random teleport probability is very low (ǫ = 0.01) it will only slightly decrease the mixing time.
5.3 Test Results
Table 1 summarizes the evaluation results on the test set for recommendation using different model settings. We first compare our proposed model using the difference of state probabilities as ranking function (vn+ − vm−) to the ranking based on positive information alone (vn+). We now use the optimal settings of the walk length parameters m and n derived from the individual optimizations on the training set. If we can correctly predict both positive and negative content, subtracting the probability of reaching a node in the negative graph from the state probability in the positive graph will give a ranking with good content at the top and bad content at the bottom. Our proposed combination model outperforms the ranking based on positive information if we use the fraction NDCG+/NDCG− or PPV@20 as evaluation measure. This means that the top of the ranking contains relatively more positive test items than negative test items. However, we also observe a large drop in recall (and NDCG+), meaning that our method finds a lower absolute number of relevant test items. Apparently the use of the negative graph not only removes irrelevant content from the top of the ranking, but also penalizes some of the relevant content.
Fig. 5. a) LT: Aggregated ranking using only the positive graph after 5 steps (v5+). b) LT: Aggregated ranking for the combined method (v5+ − v81−). c) ML: Aggregated ranking using only the positive graph after 5 steps (v5+). d) ML: Aggregated ranking for the combined method (v5+ − v23−).
The alternative model vn is obtained using all ratings as positive evidence in the transition matrix (ratings ½ . . . 5 mapped to edge weights 1 . . . 10 in R). The test results show that this method has a higher PPV@20 and NDCG+ than the ranking based on vn+. This shows that the negative training items even have a small predictive value for relevant content.
5.4 Understanding the Test Results
Figure 5 shows the position of the positive (rating ≥ 3) and negative (rating < 3) test items in the predicted content ranking, aggregated over all test users. The ranked list is split into bins of 100 items and the graphs plot the number of test items that fall into a certain bin. The gap between positive and negative ratings in the top part of the ranking is clearly larger in the combined model, based on vn+ − vm− (Figure 5b,d), than in the purely positive model, based on vn+ (Figure 5a,c). This finding corresponds to the increase in NDCG+/NDCG−.
As expected by previously discussed results, we observe a peak at the bottom of the ranking in Figure 5b and 5d, both in negative and positive test items. This confirms the effect that the negative graph also penalizes some of the relevant content, meaning that some of the relevant content has more connections to the target user in the negative graph than in the positive graph.
6 Discussion
6.1 Graph-Based Ranking Models
Graph-based algorithms have proven to be very effective for finding a global ranking of hyperlinked documents in the web graph [7,8]. These methods have also proven to be useful ranking mechanisms in other domains. Gyöngyi et al. adapted traditional PageRank in order to reduce the rank position of spam web sites [15]. Analogously to our approach, they try to find an optimal document ranking in a graph with many unjudged documents. The small subset of documents that has received a relevance judgement is used as the seed of the random walk. Besides the difference in domain, this method mostly differs from our model in the fact that the authors assume a global binary opinion on the content quality, while our approach is based on the individual preference annotations. Gori and Pucci used the graph of referenced scientific research papers to obtain a paper ranking, based on a user's history [16]. User-based recommendation algorithms were described as a graph-theoretic model by Mirza et al. [17]. In their model the graph is represented by hammocks (the set of 2-step connections between 2 nodes), based on rating commonality between users. By taking multiple steps over the user similarity graph their algorithm finds latent relations between users. These models are based on graphs with a single node type (items or users, respectively); an extra step needs to be taken to relate the content to the target user. Different methods have been proposed that represent both users and items in a single graph. Huang et al. applied spreading activation algorithms on the user-item graph to explore transitive associations among consumers through their past transactions and feedback [6]. Fouss et al. [18] removed the diffusion parameter and used the average commute time between nodes in the graph-based representation of the MovieLens database to derive similarities between the entities in the data. These algorithms proved to be very effective on binary relevance data. They ignored the numeric value of the rating provided by the user and only used the fact that a user did or did not see/buy the content (binary relevance assessments). The random walk model with self-transitions has been applied to the graph constructed from graded relevance information based on queries and clicks on images [12]. In this work Craswell and Szummer explained the soft clustering effect that is obtained with a medium-length random walk. This effect is clearly visible in our results on the negative rating graph, where an average-length walk finds
the cluster of irrelevant content and therefore outperforms direct relations or the popularity ranking. To the best of our knowledge, we are the first to use a graph-based ranking model on the positive and negative user-item graphs. Because of the different graph statistics we used the walk length parameter to individually optimize the prediction of relevant and irrelevant content. The combined model showed that the information in negative relevance assessments can be used to improve the positive predictive value of the content ranking, by pushing some documents to the bottom of the ranking.
6.2 Selective Assessment Explains Positive Predictive Value
We have shown that negative preference indications not only predict irrelevant content, but also have a predictive value for positively rated test items. This can be explained by the fact that users in a social content systems do not randomly select the content to assess. People carefully select the content to read/view based on prior knowledge about theme, author etc. Based on this prior knowledge the user assumes that he will like the content (otherwise he would not view it). Although the user gives a low rating to the selected content, this content can still be related to other documents the user does like, because of features corresponding to prior knowledge. In those cases, negative items are connected to books that the user would give a high rating. In the negative graph, these items incorrectly drag some of the relevant content with them to the bottom of the ranking. Perhaps, modeling more aspects of the content will separate the relevant and irrelevant recommendations. Acknowledgments. Arjen P. de Vries has been supported by the European Union via the European Commission project VITALAS (contract no. 045389).
References
1. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI 1998), pp. 43–52. Morgan Kaufmann, San Francisco (1998)
2. Herlocker, J.L., Konstan, J.A.: Content-independent task-focused recommendation. IEEE Internet Computing 5(6), 40–47 (2001)
3. Sarwar, B., Karypis, G., Konstan, J., Reidl, J.: Item-based collaborative filtering recommendation algorithms. In: WWW 2001: Proceedings of the 10th international conference on World Wide Web, pp. 285–295. ACM Press, New York (2001)
4. Wang, J., de Vries, A.P., Reinders, M.J.T.: Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In: SIGIR 2006: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 501–508. ACM Press, New York (2006)
5. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Application of dimensionality reduction in recommender systems – a case study. In: ACM WebKDD Workshop (2000)
6. Huang, Z., Chen, H., Zeng, D.: Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Trans. Inf. Syst. 22(1), 116–142 (2004)
7. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project (1998)
8. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM) 46(5), 604–632 (1999)
9. Yates, R.D., Goodman, D.J.: Probability and Stochastic Processes. John Wiley & Sons, Inc., New York (1999)
10. Szummer, M., Jaakkola, T.: Partially labeled classification with Markov random walks. In: Advances in Neural Information Processing Systems (NIPS), vol. 14, pp. 945–952. MIT Press, Cambridge (2001)
11. Clements, M., de Vries, A.P., Reinders, M.J.T.: Optimizing single term queries using a personalized Markov random walk over the social graph. In: Workshop on Exploiting Semantic Annotations in Information Retrieval, ESAIR (March 2008)
12. Craswell, N., Szummer, M.: Random walks on the click graph. In: SIGIR 2007: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 239–246. ACM Press, New York (2007)
13. Newman, M.E.J.: Power laws, Pareto distributions and Zipf's law. Contemporary Physics 46(5), 323–351 (2005)
14. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20(4), 422–446 (2002)
15. Gyöngyi, Z., Garcia-Molina, H., Pedersen, J.: Combating web spam with TrustRank. In: VLDB 2004: Proceedings of the Thirtieth international conference on Very large data bases, pp. 576–587. VLDB Endowment (2004)
16. Gori, M., Pucci, A.: Research paper recommender systems: A random-walk based approach. In: WI 2006: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pp. 778–781. IEEE Computer Society, Washington (2006)
17. Mirza, B.J., Keller, B.J., Ramakrishnan, N.: Studying recommendation algorithms by graph analysis. Journal of Intelligent Information Systems 20(2), 131–160 (2003)
18. Fouss, F., Pirotte, A., Renders, J.M., Saerens, M.: Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering 19(3), 355–369 (2007)
Cluster Based Personalized Search

Hyun Chul Lee1 and Allan Borodin2
1 Thoora.com, Toronto, ON, M4W 0A1
[email protected]
2 DCS, University of Toronto, Toronto, ON, M5S 3G4
[email protected]
Abstract. We study personalized web ranking algorithms based on the existence of document clusterings. Motivated by the topic-sensitive page ranking of Haveliwala [20], we develop and implement an efficient "local-cluster" algorithm by extending the web search algorithm of Achlioptas et al. [10]. We propose some formal criteria for evaluating such personalized ranking algorithms and provide some preliminary experiments in support of our analysis. Both theoretically and experimentally, our algorithm differs significantly from Topic-Sensitive PageRank.
1 Introduction
Due to the size of the current Web and the diversity of user groups using it, the current algorithmic search engines are not completely ideal for dealing with queries generated by a large number of users with different interests and preferences. For instance, it is possible that some users might input the query “Star Wars” with their main topic of interest being “movie” and therefore expecting pages about the popular movie as results of their query. On the other hand, others might input the query “Star Wars” with their main topic of interest being “politics” and therefore expecting pages about proposals for deployment of a missile defense system. (Of course, in this example, the user could easily disambiguate the query by adding say “movie” or “missile” to the query terms.) To both expedite simple searches as well as to try to accommodate more complex searches, web search personalization has recently gained significant attention for handling queries produced by diverse users with very different search intentions. The goal of web search personalization is to allow the user to expedite web search according to ones personal search preference or context. There is no general consensus on exactly what web search personalization means, and moreover, there has been no general criteria for evaluating personalized search algorithms. The goal of this paper is to propose a framework, which is general enough to cover many real application scenarios, and yet is amenable to analysis with respect to correctness in the spirit of Achlioptas et al [10] and with respect to stability properties in the spirit of Ng et al. [26] and Lee and Borodin [24] (see also [12,17]). We achieve this goal by assuming that the targeted web service has an underlying cluster structure. Given a set of clusters over the intended documents in which we want to perform personalized search, our K. Avrachenkov, D. Donato, and N. Litvak (Eds.): WAW 2009, LNCS 5427, pp. 167–183, 2009. c Springer-Verlag Berlin Heidelberg 2009
framework assumes that a user’s preference is represented as a preference vector over these clusters. A user’s preference over clusters can be collected either online or off-line using various techniques [27,15,29,19]. We do not address how to collect the user’s search preference but we simply assume that the user’s search preference (possibly with respect to various search features) is already available and can be translated into his/her search preference(s) over given cluster structures of targeted documents. We define a class of personalized search algorithms called “local-cluster” algorithms that compute each page’s ranking with respect to each cluster containing the page rather than with respect to every cluster. We propose a specific local-cluster algorithm by extending the approach taken by Achlioptas et al. [10]. Our proposed local-cluster algorithm considers linkage structure and content generation of cluster structures to produce a ranking of the underlying clusters with respect to a user’s given search query and preference. The rank of each document is then obtained through the relation of the given document with respect to its relevant clusters and the respective preference of these clusters. Our algorithm is particularly suitable for equipping already existing web services with a personalized search capability without affecting the original ranking system. Our framework allows us to propose a set of evaluation criteria for personalized search algorithms. We observe that Topic-Sensitive PageRank [20], which is probably the best known personalized search algorithm in the literature, is not a local-cluster algorithm and does not satisfy some of the criteria that we propose. In contrast, we show that our local-cluster algorithm satisfies the suggested properties. Our main contributions are the following. – We define a personalized search algorithm which provides a more practical implementation of the web search model and algorithm proposed by Achlioptas et al [10]. – We propose some formal criteria for evaluating personalized search algorithms and then compare our proposed algorithm and the Topic-Sensitive PageRank algorithm based on such formal criteria. – We experimentally evaluate the performance of our proposed algorithm against that of the Topic-Sensitive PageRank algorithm.
2 Motivation
We believe that our assumption that the web service to be personalized admits cluster structures is well justified. For example, we mention:
• Human-generated web directories: In web sites like Yahoo [7] and the Open Directory Project [5], web pages are classified into human-edited categories (possibly machine generated as well) and then organized in a taxonomy. In order to personalize such systems, we can simply take the leaf nodes in any pruning of the taxonomy tree as our clusters.
• Geographically sensitive search engines: Sites like Yahoo Local [8], Google Local [3] and Citysearch [2] classify reviews, web pages and business
information of local businesses into different categories and locations (e.g., city level). Therefore, in this particular case, a cluster would correspond to a set of data items or web pages related to the specific geographic location (e.g. web pages about restaurants in Houston, TX). We note that the same corpus can admit several cluster structures using different features. For instance, web documents can be clustered according to features such as topic [7,5,6,4,1], whether commercial or educational oriented [9], domain type, language, etc. Our framework allows incorporating various search features into web search personalization as it works at the abstract level of clustering structures.
3 Preliminaries
Let GN (or simply G) be a web page collection (with content and hyperlinks) of node size N , and let q denote a query string represented as a term-vector. Let C = C(G) = {C1 , . . . , Cm } be a clustering (not necessarily a partition) for G (i.e. each x ∈ G is in Ci1 ∩ . . . ∩ Cir for some i1 , . . ., ir ). For simplicity we will assume there is a single clustering of the data but there are a number of ways that we can extend the development when there are several clusterings. We define a clustersensitive page ranking algorithm µ as a function with values in [0, 1] where µ(Cj , x, q) will denote the ranking value of page x relative to1 cluster Cj with respect to query q. We define a user’s preference as a [0, 1] valued function P where P (Cj , q) denotes the preference of the user for cluster Cj (with respect to query q). We call (G, C, µ, P, q) an instance of personalized search; that is, a personalized search scenario where there exist a user having a search preference function P over a clustering C(G), a query q, and a cluster-sensitive page ranking function µ. Note that either µ or P can be query-independent. Definition 1. Let (G, C, µ, P, q) be an instance of personalized search. A personalized search ranking P SR is a function that maps GN to an N dimensional real vector by composing µ and P through a function F ; that is, P SR(x) = F (µ(C1 , x, q), . . . , µ(Cm , x, q), P (C1 , q), . . . , P (Cm , q)). For instance, F might be defined as a weighted sum of µ and P values.
4 Previous Algorithms
4.1 Modifying the PageRank Algorithm
Due to the popularity of the PageRank algorithm [14], the first generation of personalized web search algorithms are based on the original PageRank algorithm by manipulating the teleportation factor of the PageRank algorithm. In the PageRank algorithm, the rank of a page is determined by the stationary
1 Our definition allows and even assumes a ranking value for a page x relative to Cj even if x ∉ Cj. Most content based ranking algorithms provide such a ranking and if not, we can then assume x has rank value 0.
distribution of a modified uniform random walk on the web graph. Namely, with some small probability ǫ > 0, a user at page i jumps uniformly to a random page, and otherwise with probability (1 − ǫ) jumps uniformly to one of its neighboring pages.2 That is, the transition probability matrix is PǫA = ǫ · U + (1 − ǫ) · A, where U = e v^T is the teleportation factor matrix, with e = (1, 1, . . . , 1) and v the uniform probability vector defined by v_i = 1/N, and A = (a_ij) with a_ij = 1/outdeg(i) if (i, j) is an edge and 0 otherwise. The first generation of personalized web search algorithms introduce some bias, reflecting the user's search preference, by using non-uniform probabilities on the teleportation factor (i.e. controlling v). Among these, we have Topic-Sensitive PageRank [20], Modular PageRank [21] and BlockRank [22]. In this paper, we restrict our analysis to Topic-Sensitive PageRank.

Topic-Sensitive PageRank. One of the first proposed personalized search ranking algorithms is Topic-Sensitive PageRank [20]. It computes a topic-sensitive ranking (i.e. cluster-sensitive in our terminology) by constraining the uniform jumping factor of a random surfer to each cluster. More precisely, let Tj be the set of pages in a cluster Cj. Then, when computing the PageRank vector with respect to cluster Cj, we use the personalization vector v^j where v^j_i = 1/|Tj| if i ∈ Tj and v^j_i = 0 if i ∉ Tj. The PageRank vector with respect to cluster Cj is then computed as the solution to TR(Cj) = (1 − ǫ) · A^T · TR(Cj) + ǫ · v^j. We note that if there exists y ∈ Cj with a link to x, then TR(x, Cj) ≠ 0 whether or not x ∈ Cj. During query time, the cluster-sensitive ranking is combined with a user's search preference. Given query q, using (for example) a multinomial naive-Bayes classifier we compute the class probabilities for each of the clusters, conditioned on q. Let qi be the i-th term in the query q. Then, given the query q, we compute the following for each Cj: Pr(Cj | q) = Pr(Cj) · Pr(q | Cj) / Pr(q) ∝ Pr(Cj) · Π_i Pr(qi | Cj). Pr(qi | Cj) is easily computed from the class term-vector Dj. The quantity Pr(Cj) is not as straightforward. In the original Topic-Sensitive PageRank, Pr(Cj) is chosen to be uniform. Certainly, more advanced techniques can be used to better estimate Pr(Cj). To compute the final rank, we retrieve all documents containing all of the query terms using a text index. The final query-sensitive ranking of each of these pages is given as follows. For page x, we compute the final importance score TSPR(x, q) as TSPR(x, q) = Σ_{Cj ∈ C} Pr(Cj | q) · TR(x, Cj). The Topic-Sensitive PageRank algorithm is then a personalized search ranking algorithm with µ(Cj, x, q) = TR(x, Cj) and P(Cj, q) = Pr(Cj | q).
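A minimal sketch of the Topic-Sensitive PageRank computation just described; the dense-matrix representation, the fixed iteration count and all names are ours, and ǫ = 0.25 follows the value used in the experiments of Section 7.

```python
import numpy as np

def topic_sensitive_pagerank(A, cluster_pages, eps=0.25, iters=100):
    """Compute TR(C_j): PageRank whose teleport vector is uniform over the pages of one cluster."""
    n = A.shape[0]
    v = np.zeros(n)
    v[cluster_pages] = 1.0 / len(cluster_pages)      # personalization vector v^j
    tr = np.full(n, 1.0 / n)
    for _ in range(iters):
        tr = (1 - eps) * A.T @ tr + eps * v          # TR = (1 - eps) A^T TR + eps v^j
    return tr

def tspr_score(x, tr_per_cluster, p_cluster_given_q):
    """TSPR(x, q) = sum_j Pr(C_j | q) * TR(x, C_j)."""
    return sum(p * tr[x] for p, tr in zip(p_cluster_given_q, tr_per_cluster))
```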
4.2 Other Personalized Systems
Aktas et al. [11] employ the Topic-Sensitive PageRank algorithm at the level of URL features such as Internet domain names. Chirita et al. [15] extend the
2 When page i has no hyperlinks (i.e. outdeg(i) = 0) it is customary to let ǫ = 1.
Modular PageRank algorithm [21]. In [15], rather than using the arduous process for collecting the user profile as in Modular PageRank[21], the user’s bookmarks are used to derive the user profile. They augment the pages obtained in this way by finding their related pages using Modified PageRank and the HITS algorithms. Most content based web search personalization methods are based on the idea of re-ranking the returned pages in the collection using the content of pages (represented as snippet, title, full content, etc) with respect to the user profile. Some content analysis based personalization methods consider how to collect user profiles as part of its personalization framework. Liu et al.[25] propose a technique to map a user query to a set of categories, which represent the user’s search intention for the web search personalization. A user profile and a general profile are learned from the user’s search history and a category hierarchy respectively. Later, these two profiles are combined to map a user query into a set of categories. Chirita et al. [15] propose a way of performing web search using the ODP (open directory project) metadata. First, the user has to specify a search preference by selecting a set of topics (hierarchical) from the ODP taxonomy. Then, at run-time, the web pages returned by the ordinary search engine can be re-sorted according to the distance between the URL of a page and the user profile. Sun et al. [28] proposed an approach called CubeSVD (motivated by HOSVD, High-Order Singular Value Decomposition) which focuses on utilizing the click-through data to personalize the web search. Note that the click-through data is highly sparse data containing relations among user, query, and clicked web page.
5 Our Algorithm
We propose a personalized search algorithm for computing cluster-sensitive page ranking based on a linear model capturing correlations between cluster content, cluster linkage, and user preference. Our model borrows heavily from the Latent Semantic Analysis (LSA) of Deerwester et al. [16], which captures term-usage information based on a (low-dimensional) linear model, and the SP algorithm of Achlioptas et al. [10], which captures correlations between 3 components (i.e. links, page content, user query) of web search in terms of proximity in a shared latent semantic space. For a given clustering C, let CS(x) = {Cj ∈ C | x ∈ Cj}. Given an instance (G, C, µ, P, q) of personalized search, a local-cluster algorithm is a personalized search ranking such that F is given by F(µ(C1, x, q), . . . , µ(Cm, x, q), P(C1, q), . . . , P(Cm, q)) = Σ_{Cj ∈ CS(x)} P(Cj, q) · µ(Cj, x, q). That is, only preferences for clusters containing a site x will affect the ranking of x. Our algorithm personalizes existing web services utilizing existing ranking algorithms. Our model assumes that there is a generic page ranking R(x, q) for ranking page x given query q. Using an algorithm to compute the ranking for clusters (described in the next section), we compute the cluster-sensitive ranking µ(Ci, x, q) as
µ(Ci, x, q) = R(x, q) · CR(Ci, q) if x ∈ Ci, and 0 otherwise,
where CR(Ci, q) refers to the ranking of cluster Ci with respect to query q. Finally, PSR(x, q) will be computed as PSR(x, q) = Σ_{Cj ∈ CS(x)} P(Cj, q) · µ(Cj, x, q). We call our algorithm PSP (for Personalized SP algorithm) and note that it is a local-cluster algorithm.
5.1 Ranking Clusters
The algorithm for ranking clusters is the direct analogy of the SP algorithm [10] where now clusters play the role of pages. That is, we will be interested in the aggregation of links between clusters and the term content of clusters. We also modify the generative model of [10], so as to apply to clusters. This generative model motivates the algorithm and also allows us to formulate a correctness result for the PSP algorithm analogous to the correctness result of [10]. We note that like the SP algorithm, PSP is defined without any reference to the generative model. Let {C1, . . . , Cm} be a clustering for the targeted corpus. Now following [16] and [10], we assume that there exists a set of k unknown (latent) basic concepts whose combinations represent every topic of the web. Given such a set of k concepts, a topic is a k-dimensional vector λ, describing the contribution of each of the basic concepts to this topic.

Authority and Hub Values for Clusters. We first review the notion of a page's authority and hub values as introduced in Kleinberg [23] and utilized in [10]. Two vectors are associated with a web page x:
• There is a k-tuple A(x) ∈ [0, 1]^k reflecting the topic on which x is an authority. The i-th entry in A(x) expresses the degree to which x concerns the concept associated with the i-th entry in A(x). This topic vector captures the content on which this page is an authority.
• The second vector associated with x is a k-tuple H(x) ∈ [0, 1]^k reflecting the topic on which x is a hub. This vector is defined by the set of links from x to other pages.
Based on this notion of page hub and authority values, we introduce the concept of cluster hub and authority values. With each cluster Cj ∈ C, we associate two vectors:
• The first vector associated with Cj is a k-tuple Ã(j), which represents the expected authority value that is accumulated in cluster Cj with respect to each concept. We define Ã(j) as Ã(j)(c) = Σ_{x∈Cj} A(x, c), where A(x, c) is document x's authority value with respect to the concept c.
• The second vector associated with Cj is a k-tuple H̃(j), representing the expected hub value accumulated in cluster Cj with respect to each concept. We define H̃(j) as H̃(j)(c) = Σ_{x∈Cj} H(x, c), where H(x, c) is document x's hub value with respect to concept c.
Link Generation over Clusters. In what follows, we assume all random variables have bounded range. Given clusters Cp and Cr ∈ C, our model assumes that the total number of links from pages in Cp to pages in Cr is a random variable with expected value equal to ⟨H̃(p), Ã(r)⟩; that is, the matrix of expected inter-cluster link counts is W̃ = H̃Ã^T, and the observed m × m matrix W of link counts between clusters is an instantiation of W̃.
Similarly, the expected number of occurrences of a term u within the documents of cluster Cp is ⟨Ã(p), S̃_A(u)⟩ + ⟨H̃(p), S̃_H(u)⟩. We describe the term generation model of clusters with an m by l matrix S̃, where again m is the number of underlying clusters and l is the total number of possible terms, and S̃ = H̃ S̃_H^T + Ã S̃_A^T. The (j, i) entry in S̃ represents the expected number of occurrences of term i within all documents in cluster j. Let S′ be the actual term-document matrix of all documents in the targeted corpus. Analogous to the previous link generation model of clusters, we assume that S = Z^T S′ (Z denotes the page–cluster membership matrix) is an instantiation of the term generation model of clusters described by S̃.
User Query. The user has in mind some topic on which he wants to find the most authoritative cluster of documents when he performs the search. The terms that the user presents to the search engine should be the terms that a perfect hub on this topic would use, and these terms would then potentially lead to the discovery of the most authoritative cluster of documents on the set of topics closely related to these terms. The query generation process in our model is given as follows:
• The user chooses the k-tuple ṽ describing the topic he wishes to search for in terms of the underlying k concepts.
• The user computes the vector q̃^T = ṽ^T S̃_H^T, where the u-th entry of q̃ is the expected number of occurrences of the term u in a cluster.
• The user then decides whether or not to include term u among his search terms by sampling from a distribution with expectation q̃[u]. We denote the instantiation of the random process by q[u].
The input to the search engine consists of the terms with non-zero coordinates in the vector q.

Algorithm Description. Given this generative model that incorporates link structure, content generation, user preference, and query, we can rank clusters of documents using a spectral method. While the basic idea and analysis for our algorithm follow from [10], our PSP algorithm is different from the original SP algorithm in one substantial aspect. In contrast to the original SP algorithm, which works at the document level, our algorithm works at the cluster level, making our algorithm computationally more attractive and consequently more practical.3 For our algorithm, in addition to the SVD computation of the M and W matrices, the SVD computation of S is also required. This additional computation is not very expensive because of the size of matrix S. We need some additional notation. For two matrices A and B with an equal number of rows, let [A|B] denote the matrix whose rows are the concatenations of the rows of A and B. Let σi(A) denote the i-th largest singular value of a matrix A and let ri(A) = σ1(A)/σi(A) ≥ 1 denote the ratio between the primary singular value and the i-th singular value. Using standard notation for the singular value decomposition (SVD) of matrix B ∈ ℜ^{n×m}, B = UΣV^T, where U is a matrix of dimensions n × rank(B) whose columns are orthonormal, Σ is a diagonal matrix of dimensions rank(B) × rank(B), and V^T is a matrix of dimensions rank(B) × m whose rows are orthonormal. The (i, i) entry of Σ is σi(B). The cluster ranking algorithm pre-processes the entire corpus of documents independent of the query.
3 To the best of our knowledge, the SP algorithm was never implemented. Ignoring any personalization aspects (i.e. setting the preference P to be a constant function), the cluster framework provides a significant computational benefit.
Pre-processing Step
1. Let M = [W|S]. Recall that M ∈ ℜ^{m×(m+l)} (m is the number of clusters and l is the number of terms). Compute the SVD of the matrix, denoted by M* = U_M Σ_M V_M^T.
2. Choose the largest index r such that the difference |σ_r(M*) − σ_{r+1}(M*)| is sufficiently large (we require ω(√(m + l))). Let M*_r = (U_M)_r (Σ_M)_r (V_M^T)_r be the rank-r SVD approximation to M*.
3. Compute the SVD of the matrix W, denoted by W* = U_W Σ_W V_W^T.
4. Choose the largest index t such that the difference |σ_t(W*) − σ_{t+1}(W*)| is sufficiently large (we require ω(√t)). Let W*_t = (U_W)_t (Σ_W)_t (V_W^T)_t be the rank-t SVD approximation to W*.
5. Compute the SVD of the matrix S, denoted by S* = U_S Σ_S V_S^T.
6. Choose the largest index o such that the difference |σ_o(S*) − σ_{o+1}(S*)| is sufficiently large (we require ω(√o)). Let S*_o = (U_S)_o (Σ_S)_o (V_S^T)_o be the rank-o SVD approximation to S*.

Query Step
Once a query vector q^T ∈ ℜ^l is presented, let q′^T = [0^m | q^T] ∈ ℜ^{m+l}. Then we compute the cluster authority vector a^T = q′^T (M*_r)^{−1} W*_t, where (M*_r)^{−1} = (V_M)_r (Σ_M)_r^{−1} (U_M)_r^T is the pseudo-inverse of M*_r.
Once we have computed the ranking for clusters, we proceed with the actual computation of cluster-sensitive page ranking. Let a(Cj , q) denote the authority value of cluster Cj for query q as computed in the previous section. The clustersensitive page rank for page x with respect to cluster Cj is computed as μ(x, Cj , q) =
R(x, q) · a(Cj , q) if x ∈ Cj 0 Otherwise
where again R(x, q) is the generic rank of page x with respect to query q. As discussed in Section 1, we assume that the user provides his search preference having in mind certain clusters (types of documents that he/she is interested). If the user exactly knows what the given clusters are, then he might directly express his search preference over these clusters. However, such explicit preferences will not generally be available. Instead, we consider a more general scenario in which the user expresses his search interests through a set of keywords (terms). More precisely, our user search preference is given by: • The user expresses his search preference by providing a vector p over terms whose i-th entry indicates his/her degree of preference over the term i. T • Given the vector p, the preference vector over clusters is obtained as pT · S .
176
H.C. Lee and A. Borodin
Let P (Cj ) = p˜T ST (j) denote this preference for cluster Cj . The final person alized rank for page x is computed as P SP (x, q) = Ci ∈CS(x) R(x, q) · a(Ci , q) · P (Ci ) The next theorem formalizes the correctness of our PSP algorithm with respect to the generative model. Theorem 1. Assume that the link structure for clusters, term content for clusters and search query are generated as described in our model: W is an instantiaST + H ST , q is an instantiation = H A T , S is an instantiation of S = A tion of W A H T T T of q˜ = v SH , the user’s preference is provided by p , and R(q) is a vector whose entries correspond to the generic ranks of pages (i.e. R(x,q) corresponds to the generic rank of page x with respect to query q). Additionally, we have 2 1. q has ω(k · rk (W )2 r2k (M )√ rk (GT )) terms. √ 2. σk (W ) ∈ ω(r2k (M )rk (GT ) m) and σ2k (M ) ∈ ω(rk (W ) r2k (M )rk (GT ) m), T T T 3. W , HS A and S H are rank k, M = [W |S] is rank 2k, l = O(m), and m = O(k).
then the PSP algorithm computes a vector of personalized ranks that is very close to the correct ranking. More precisely, we have T
1
∗ −
||Zq′ M
r
∗
∗ T
W t ·pT ·S o R(q)−Zv T AT pT S T R(q)||2 ||Zv T AT pT S T R(q)||2
is similar to that of Achlioptas et al. [10].
6
∈ O(1). The proof of this theorem
Personalized Search Criteria
We present a series of results comparing Topic-Sensitive PageRank algorithm and our PSP algorithm with respect to a set of personalized search algorithm criteria that we propose. (Some proofs can be found in the Appendix of the paper located at the web site http://www.cs.toronto.edu/∼leehyun/cbps experiment.html). Our criteria are all of the form “small changes in the input imply small changes in the computed ranking”. We believe such criteria have immediate practical relevance as well as being of theoretical interest. Since our ranking of documents produces real authority values in [0, 1], one natural approach is to study the effect of small continuous changes in the input information as in the rank stability studies of [12,17,24,26]. One basic property shared by both Topic-Sensitive PageRank and our PSP algorithm is continuity. Theorem 2. Both TSPR and our PSP ranking algorithms are continuous; i.e. small changes in any μ value or preference value will result in a small change in the ranking value of all pages. Our first distinguishing criteria is a rather minimal monotonicity property that we claim any personalized search should satisfy. Namely, since a (cluster based) personalized ranking function depends on the ranking of pages within their relevant clusters as well as the preference of clusters, when these rankings for a page and cluster preferences are increased, we expect the personalized rating can only improve. More precisely, we have the following definition:
Definition 2. Let (G, C, µ, P, q) and (G, C, µ, P̃, q) be two instances of personalized search. Let χ and ψ be the sets of ranked pages produced by (G, C, µ, P, q) and (G, C, µ, P̃, q) respectively. Suppose that x ∈ χ, y ∈ ψ share the same set of clusters (i.e. CS(x) = CS(y)), and suppose that µ(Cj, x, q) ≤ µ(Cj, y, q) and P(Cj, q) ≤ P̃(Cj, q) hold for every Cj that they share. We say that a personalized ranking algorithm is monotone if PSR(x) ≤ P̃SR(y) for every such x ∈ χ and y ∈ ψ.
We now introduce the idea of "locality". The idea behind locality is that (small) discrete changes in the cluster preferences should have only a minimal impact on the ranking of pages. The notion of locality justifies our use of the terminology "local-cluster algorithm". A perturbation ∂α of size α changes a cluster preference vector P to a new preference vector P̃ = ∂α(P) such that P and P̃ differ in at most α components. Let P̃SR denote the new personalized ranking vector produced under the new search preference vector P̃.
Definition 3. Let (G, C, µ, P, q) and (G, C, µ, P̃, q) be the original personalized search instance and its perturbed personalized search instance respectively. Let AC(∂α), the active clusters, be the set of clusters that are affected by the perturbation ∂α (i.e., P(Cj, q) ≠ P̃(Cj, q) for every cluster Cj in AC(∂α)). We say that a personalized ranking algorithm is local if for every x, y ∉ AC(∂α), PSR(x, q) ≤ PSR(y, q) ⇔ P̃SR(x, q) ≤ P̃SR(y, q), where PSR refers to the original personalized ranking vector while P̃SR refers to the personalized ranking vector after the perturbation.
Theorem 3. The Topic-Sensitive PageRank algorithm is not monotone and not local.
In contrast we show that our PSP algorithm does enjoy the monotone and local properties.
Theorem 4. Any linear local-cluster algorithm (and hence PSP) is monotone and local.
We next consider a notion of stability (with respect to cluster movement) in the spirit of [26,24]. Our definition reflects the extent to which small changes in the clustering can change the resulting rankings. We consider the following page movement changes to the clusters:
– A migration migr(x, Ci, Cj) moves page x from cluster Ci to cluster Cj.
– A replication repl(x, Ci, Cj) adds page x to cluster Cj (assuming x was not already in Cj) while keeping x in Ci.
– A deletion del(x, Cj) is the deletion of page x from cluster Cj (assuming there exists a cluster Ci in which x is still present).
We define the size of these three page movement operations to be µ(Ci, x, q) + µ(Cj, x, q) for migration/replication, and µ(Cj, x, q) for deletion. We measure the size of a collection M of page movements to be the sum of the individual page movement costs. Our definition of stability then is that the resulting ranking
Table 1. Sample queries and the preferred categories for search used in our experiments

Query Used          Categories
middle east         Society/Issues; News/Current Events; Recreation/Travel
northern lights     Science/Astronomy; Kids and Teens/School Time; Science/Software
popular blog        Arts/Weblogs; Arts/Chats and Forums; News/Weblogs
jaguar              Recreation/Autos; Sports/Football; Science/Biology
planning            Home/Personal Finance; Shopping/Weddings; Recreation/Parties
star wars           Arts/Movies; Games/Video Games; Recreation/Models
common tricks       Home/Do It Yourself; Arts/Writers Resources; Games/Video Games
technique           Science/Methods and Techniques; Arts/Visual Arts; Shopping/Crafts
integration         Computers/Software; Health/Alternative; Society/Issues
strong man          Sports/Strength Sports; World/Deutsch; Recreation/Drugs
chaos               Science/Math; Society/Religion and Spirituality; Games/Video Games
vision              Health/Senses; Computers/Artificial Intelligence; Business/Consumer Goods
proverb             Society/Folklore; Reference/Quotations; Home/Homemaking
conservative        Society/Politics; Society/Religion and Spirituality; News/Analysis and Opinion
english             Arts/Education; Kids and Teens/School Time; Society/Ethnicity
graphic design      Business/Publishing and Printing; Computers/Graphics; Arts/Graphic Design
fishing expedition  Recreation/Camps; Recreation/Outdoors; Sports/Adventure Racing
liberal             Society/Politics; Society/Religion and Spirituality; News/Analysis and Opinion
war                 Society/History; Games/Board Games; Reference/Museums
environment         Business/Energy and Environment; Science/Environment; Arts/Genres
does not change significantly when the clustering is changed by page movements of small size. We recall that each cluster is a set of pages and its induced subgraph, induced from the graph on all pages. We will assume that the µ ranking algorithm is a stable algorithm in the sense of [26,24]. Roughly speaking, stability of a µ ranking algorithm means that there will be a relatively small change in the ranking vector if we add or delete links to a web graph. Namely, the change in the ranking vector will be proportional to the ranking values of the pages adjacent to the new or removed edges.
Definition 4. Let (G, C, µ, P, q) and (G, C, µ, P̃, q) be a personalized search instance. A personalized ranking function PSR is cluster movement stable if for every set of page movements M there is a β, independent of G, such that ||PSR − P̃SR||_2 ≤ β · size(M), where PSR refers to the original personalized ranking vector while P̃SR refers to the personalized ranking vector produced when the set of page movements M has been applied to a given personalized search instance.
Theorem 5. The Topic-Sensitive PageRank algorithm is not cluster movement stable.
Theorem 6. The PSP algorithm is cluster movement stable.
7 Experiments
As a proof of concept, we implemented the PSP algorithm and the Topic-Sensitive PageRank algorithm for comparison. In Section 7.1, we consider the retrieval effectiveness of our PSP algorithm versus that of the Topic-Sensitive
Table 2. Top 10 precision scores for PSP and Topic-Sensitive PageRank

Query               PSP    TSPR
middle east         0.76   0.8
planning            0.96   0.56
integration         0.6    0.16
proverb             0.9    0.83
fishing expedition  0.86   0.66
northern lights     0.7    0.8
star wars           0.6    0.66
strong man          0.9    0.86
conservative        0.86   0.76
liberal             0.76   0.73
popular blog        0.93   0.7
common tricks       0.66   0.9
chaos               0.56   0.56
english             0.8    0.26
war                 0.83   0.16
jaguar              0.96   0.46
technique           0.96   0.7
vision              0.43   0
graphic design      1      0.73
environment         0.93   0.5
Average             0.80   0.59
PageRank algorithm. In Section 7.2, we briefly discuss experiments regarding monotonicity and locality. A more complete reporting of experimental results can be found at http://www.cs.toronto.edu/∼leehyun/cbps experiment.html. As a source of data, we used the Open Directory Project (ODP)4 data, which is the largest and most comprehensive human-edited directory on the Web. We first obtained a list of pages and their respective categories from the ODP site. Next, we fetched all pages in the list, and parsed each downloaded page to extract its pure text and links (without nepotistic links). We treat the set of categories in the ODP that are at distance two from the root category (i.e. the "Top" category) as the cluster set for our algorithms. In this way, we constructed 549 categories (or clusters) in total. The categorization of pages using these categories did not constitute a partition as some pages (5.4% of ODP data) belong to more than one category.
7.1 Comparison of Algorithms
To produce rankings, we first retrieved all the pages that contained all terms in a query, and then computed rankings taking into account the specified categories (as explained below). The PSP algorithm assumes that there is already an underlying page ranking for the given web service. Since we were not aware of the ranking used by the ODP search, we simply used the pure PageRank as the generic page ranking for our PSP algorithm. The Topic-Sensitive PageRank was implemented as described in Section 4.1. We used the same α = 0.25 value used in [20]. We devised 20 sample queries and their respective search preferences (in terms of categories) as shown in Table 1. The “preferred” categories were chosen as follows: for each query in Table 1, we chose a random subset of the categories given for the top ranked pages returned by the ODP search. For the TopicSensitive PageRank algorithm, we did not use the approach for automatically discovering the search preference (See Eq. 4.1) from a given query since we found that the most probable categories discovered in this way were heavily biased toward “News” related categories. Instead, we computed both Topic-Sensitive PageRank and PSP rankings by equally weighting all categories listed in Table 1. The evaluation of ranking results was done by three individuals: two having CS degrees and one with an engineering degree, all with extensive web 4
http://www.dmoz.com
search experience. We used the precision over the top-10 (p@10) as the evaluation measure using the methodology employed by Tsaparas [30]. That is, for each query we merged the top 10 results returned by both algorithms into a single list. Without any prior knowledge about what algorithm was used to produce the corresponding result, each person was asked to carefully evaluate each page from the list as “relevant” if in their judgment the corresponding page should be treated as a relevant page with respect to the given query and one of the specified categories, or non-relevant otherwise. In Table 2, we summarize the evaluation results where the presented precision value is the average of all 3 precision values. These evaluation results suggest that our PSP algorithm outperforms the Topic-Sensitive PageRank algorithm. We also report on the actual produced results in experimental result 1 of our web page (http://www.cs.toronto.edu/∼leehyun/cbps experiment.html). To gain further insight, we analyzed the distribution of categories associated with each produced ranking. An ideal personalized search algorithm should retrieve pages in clusters representing the user’s specified categories as the top ranked pages. Therefore, in the list of top 100 pages associated with each query, we computed how many pages were associated with those categories specified in each search preference. Each page p in the list of top 100 pages was counted as 1/|nc(p)| where nc(p) is the total number of categories associated with page p. We report on these results in experimental result 2 of our web page(http://www.cs.toronto.edu/∼leehyun/cbps experiment.html). The results here exclude four queries ( strong man popular blog, common tricks, and vision) which did not retrieve a sufficient number of relevant pages in their lists of top 100 pages. Note that using the 1/|nc(p)| scoring, the total sum of all three preferred categories for each query was always less than 100 since several pages pertain to more than one category. For several queries in our web page, one can observe that each algorithm’s favored category is substantially different. For instance, for the query “star wars”, the PSP algorithm prefers “Games/Video Games” category while the Topic-Sensitive PageRank prefers “Recreation/Models” category. Furthermore, for the queries “liberal”, “conservative”, “technique”, “english” and “planning” the PSP algorithm and the Topic-Sensitive PageRank algorithm share a very different view on what the most important context associated with “liberal”, “conservative”, “technique”, “english” and “planning” is. One should also observe that when there is a highly dominant query context (e.g. “Society/ Relationships” category for “Society/Issues” category for “integration, and “Arts/Graphic Design” for “graphic design”) over other query contexts, then for both algorithms the rankings are dominated by this strongly dominant category with PSP being somewhat more focused on the dominant category. Finally, averaging over all queries, 86.38% of pages in the PSP list of top 100 pages were found to be in the specified preferred categories while for Topic-Sensitive PageRank, 69.05% of pages in the list of top 100 pages were found to be in the specified preferred categories. We personally considered a number of queries altering the preferred categories. For “integration”, we considered the single category “Science/Math” and
the precision over the top 20 was 0.5 for PSP and 0.35 for TSPR. (Both algorithms suffered from uses of the term integration not applying to the calculus meaning.) For the "star wars" query, we added the category "Society/Politics" to our preferred categories. We note that the ODP search does not return any pages within this category. Looking at the top 100 pages returned, PSP returned 3 pages in "Society/Politics" not relevant to our query, while TSPR returned 33 non-relevant pages in this category. We also considered the query "middle east" (using the single category "Recreation/Travel"), the query "conservative" (using the single category "News/Religion and Spirituality"), and the query "jaguar" (using the single category "Sports/Football") with regard to precision over the top 10, and observe that PSP performed qualitatively much better than TSPR. We report these results in experimental result 3 of our web page (http://www.cs.toronto.edu/∼leehyun/cbps experiment.html).

We further compared the PSP and TSPR rankings using a variant of the Kendall-Tau similarity measure [20,18]. Consider two partially ordered rankings σ1 and σ2, each of length n, and let U be the union of the elements in σ1 and σ2. Let σ1′ be the extension of σ1, where σ1′ contains the pages in σ2 − σ1 appearing after all the URLs in σ1. We do the analogous σ2′ extension of σ2. Using the measure
$$\mathrm{KTSim}(\sigma_1, \sigma_2) = \frac{|\{(u,v) : \sigma_1' \text{ and } \sigma_2' \text{ agree on the order of } (u,v),\ u \neq v\}|}{|U|\,(|U|-1)},$$
we computed the pairwise similarity between the PSP and TSPR rankings with respect to each query. Averaging over all queries, the KTSim value for the top 100 pages is 0.58, while the average KTSim value for the top 20 pages is 0.43, indicating a substantial difference in the rankings.
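To make the comparison concrete, the following is a minimal sketch of this KTSim variant, assuming each ranking is given as a list of page identifiers ordered from best to worst; the function name and list representation are illustrative assumptions, not the implementation used for the reported numbers. Normalizing over unordered pairs is equivalent to the ordered-pair form above, since agreement is symmetric.

```python
from itertools import combinations

def kt_sim(sigma1, sigma2):
    """Kendall-Tau similarity of two (partial) top-k rankings.

    Each ranking is a list of page ids ordered from best to worst.
    Each ranking is extended with the pages it is missing (appended
    after all of its own pages), and the score is the fraction of
    pairs from the union U on which the extended rankings agree.
    """
    ext1 = sigma1 + [p for p in sigma2 if p not in sigma1]
    ext2 = sigma2 + [p for p in sigma1 if p not in sigma2]
    pos1 = {p: i for i, p in enumerate(ext1)}
    pos2 = {p: i for i, p in enumerate(ext2)}

    universe = list(pos1)                       # union of both rankings
    agree = 0
    for u, v in combinations(universe, 2):
        # The pair agrees if both extended rankings order u and v the same way.
        if (pos1[u] < pos1[v]) == (pos2[u] < pos2[v]):
            agree += 1
    total = len(universe) * (len(universe) - 1) // 2
    return agree / total if total else 1.0

# Example: two top-3 lists sharing two pages; 4 of the 6 pairs agree.
print(kt_sim(["a", "b", "c"], ["b", "a", "d"]))   # 0.666...
```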
7.2 Monotonicity and Locality
How is the absence of monotonicity (as shown in Theorem 3) reflected in the ODP data? We searched our ODP dataset and randomly selected 19,640,761 pairs of sites (x, y) that share precisely one common cluster CI(x,y); i.e., CS(x) ∩ CS(y) = {CI(x,y)} and TR(x, CI(x,y)) < TR(y, CI(x,y)). We computed the final TSPR(x) and TSPR(y) by uniformly weighting all 549 categories (or clusters). We found that 431,116 (approximately 2%) of these pairs violated monotonicity; that is, the ranking in CI(x,y) was opposite to the ranking produced by TSPR without favoring any particular category (or cluster). This would lead to a violation of monotonicity in the sense of Theorem 3 if, for example, we generated a query using the intersection of common terms in x and y. We report the distribution of pairs violating monotonicity in experimental result 4 of our web page.

We also conducted a study on how sensitive the algorithms are to changes in the search preferences. We argued that such sensitivity is theoretically captured by the notion of locality in Section 6, and showed that the PSP algorithm is robust to changes in the search preferences while Topic-Sensitive PageRank is not. Our experimental evidence indicates that the Topic-Sensitive PageRank algorithm is indeed somewhat more sensitive to changes in the search preferences.
The pairs were selected in such a way that they were reasonably distributed with respect to the common cluster.
For each query, we randomly chose 7 equally weighted categories so as to define a fixed preference vector. Let Δ^N_α refer to the class of perturbations induced by deleting some set of α categories. To compare the personalized ranking vectors produced under different perturbations, we again use the above KTSim measure [20,18]. In particular, we varied α over 1, 3, and 5, and for each fixed α and for 5 random perturbations δi ∈ Δ^N_α, we computed the resulting rankings and then all $\binom{5}{2} = 10$ pairwise KTSim values considering the top 100 pages. We report the average pairwise similarity across all queries for each fixed α in experimental result 5 of our web page. (Fagin et al. [18] note that this KTSim variant has a mild personalization factor for items not common to both orderings, hence the rather large values.)
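For reference, the sketch below shows one way to compute a topic-sensitive PageRank vector by power iteration, with the teleport distribution spread uniformly over the pages of the preferred categories and the teleport probability α = 0.25 used in Section 7.1. The adjacency-list representation, function name, toy graph, and the handling of dangling pages are illustrative assumptions rather than the code used for these experiments.

```python
import numpy as np

def topic_sensitive_pagerank(out_links, preferred_pages, alpha=0.25,
                             n_iter=100, tol=1e-10):
    """Power iteration for a topic-sensitive PageRank vector.

    out_links:       list of lists; out_links[i] holds the (duplicate-free)
                     pages that page i links to.
    preferred_pages: indices of pages belonging to the preferred categories;
                     the teleport vector is uniform over them.
    alpha:           teleport probability (0.25, as in the experiments).
    """
    n = len(out_links)
    v = np.zeros(n)
    v[list(preferred_pages)] = 1.0 / len(preferred_pages)

    rank = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        new = np.zeros(n)
        for i, targets in enumerate(out_links):
            if targets:                   # distribute rank mass over out-links
                new[targets] += rank[i] / len(targets)
            else:                         # dangling page: redistribute via v
                new += rank[i] * v
        new = (1 - alpha) * new + alpha * v
        converged = np.abs(new - rank).sum() < tol
        rank = new
        if converged:
            break
    return rank

# Toy graph with 4 pages; pages 0 and 1 form the preferred categories.
graph = [[1, 2], [2], [0], [0, 1]]
print(topic_sensitive_pagerank(graph, preferred_pages=[0, 1]))
```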
8 Online Considerations
Given the dynamic nature of the web, it is important to consider personalized search engines in an online scenario where pages are incrementally added, deleted or updated. As a consequence, clusters are updated, and this may in turn result in a desired change of the clustering (i.e., existing clusters are merged or split). Online considerations have not received much attention in this context.

The preprocessing phase of the PSP algorithm relies on the SVD computation of M, representing the linkage and semantic relations between clusters. The online addition, deletion or update of pages would then correspond to the addition, deletion or update of fragments of rows and columns in M and the consequent online updating of the SVD. There is a rich literature concerning online SVD updating. Recently, Brand [13] proposed a family of sequential update rules for adding data to a "thin" SVD data model, revising or removing data already incorporated into the model, and adjusting the model when the data-generating process exhibits non-stationarity. Moreover, he experimentally tested the practicability of the proposed approach in an interactive graphical movie recommender that predicts and displays ratings/rankings of thousands of movie titles in real time as a user adjusts ratings of a small arbitrary set of movies. By applying such methods, the relevant aspects of the preprocessing phase become an online computation.

In the full paper, we adapt the correctness and stability properties of the PSP algorithm established in Section 6 to provide corresponding properties for the online version. In particular, as new pages arrive and are placed into their relevant clusters, the PSP ranking will only change gradually.
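To illustrate the kind of incremental computation involved, the sketch below appends a single new column to an existing thin SVD without refactoring the whole matrix. It is a simplified rank-increasing update in the spirit of the online SVD literature, not Brand's full method (which also supports revising and removing data and truncating the model); the function name and the small test matrix are assumptions made for the example.

```python
import numpy as np

def svd_append_column(U, s, Vt, c):
    """Append one column c to a matrix with known thin SVD U @ diag(s) @ Vt.

    Returns the thin SVD of [M  c]. A production system would also truncate
    small singular values and handle c lying (almost) in the span of U.
    """
    r = s.size
    m = U.T @ c                      # component of c inside the current subspace
    p = c - U @ m                    # residual orthogonal to the subspace
    p_norm = np.linalg.norm(p)
    P = p / p_norm if p_norm > 1e-12 else np.zeros_like(p)

    # Small (r+1) x (r+1) core matrix whose SVD yields the updated factors.
    K = np.zeros((r + 1, r + 1))
    K[:r, :r] = np.diag(s)
    K[:r, r] = m
    K[r, r] = p_norm
    Uk, sk, Vkt = np.linalg.svd(K)

    U_new = np.hstack([U, P[:, None]]) @ Uk
    V = Vt.T
    V_ext = np.zeros((V.shape[0] + 1, r + 1))
    V_ext[:V.shape[0], :r] = V       # old right singular vectors
    V_ext[-1, r] = 1.0               # the new column's own coordinate
    Vt_new = (V_ext @ Vkt.T).T
    return U_new, sk, Vt_new

# Sanity check on a small random matrix: the update reproduces [M  c].
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))
c = rng.standard_normal(6)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U2, s2, Vt2 = svd_append_column(U, s, Vt, c)
print(np.allclose(U2 @ np.diag(s2) @ Vt2, np.column_stack([M, c])))
```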
References

1. About.com, http://www.about.com
2. Citysearch, http://www.citysearch.com
3. Google local, http://local.google.com
4. Google news, http://news.google.com
5. Open directory project, http://www.dmoz.org
6. Topix, http://www.topix.net
7. Yahoo, http://www.yahoo.com
8. Yahoo local, http://local.yahoo.com
9. Yahoo! mindset, http://mindset.research.yahoo.com
10. Achlioptas, D., Fiat, A., Karlin, A.R., McSherry, F.: Web search via hub synthesis. In: FOCS, pp. 500–509 (2001)
11. Aktas, M., Nacar, M., Menczer, F.: Personalizing pagerank based on domain profiles. In: WebKDD (2004)
12. Borodin, A., Roberts, G.O., Rosenthal, J.S., Tsaparas, P.: Link analysis ranking: algorithms, theory, and experiments. ACM Trans. Internet Techn. 5(1), 231–297 (2005)
13. Brand, M.: Fast online SVD revisions for lightweight recommender systems. In: SDM (2003)
14. Brin, S., Page, L.: The anatomy of a large-scale hypertextual search engine. Computer Networks, 107–117 (1998)
15. Chirita, P.A., Nejdl, W., Paiu, R., Kohlschuetter, C.: Using ODP metadata to personalize search. In: SIGIR (2005)
16. Deerwester, S., Dumais, S., Furnas, G., Landauer, T., Harshman, R.: Indexing by latent semantic analysis. Journal of the Society for Information Science 41(6), 391–407 (1990)
17. Donato, D., Leonardi, S., Tsaparas, P.: Stability and similarity of link analysis ranking algorithms. In: Caires, L., Italiano, G.F., Monteiro, L., Palamidessi, C., Yung, M. (eds.) ICALP 2005. LNCS, vol. 3580, pp. 717–729. Springer, Heidelberg (2005)
18. Fagin, R., Kumar, R., Sivakumar, D.: Comparing top k lists. In: SODA, pp. 28–36 (2003)
19. Ferragina, P., Gulli, A.: A personalized search engine based on web-snippet hierarchical clustering. In: WWW (Special interest tracks and posters), pp. 801–810 (2005)
20. Haveliwala, T.H.: Topic-sensitive pagerank: A context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering 15(4), 784–796 (2003)
21. Jeh, G., Widom, J.: Scaling personalized web search. In: WWW, pp. 271–279 (2003)
22. Kamvar, S., Haveliwala, T., Manning, C., Golub, G.: Exploiting the block structure of the web for computing pagerank. Technical report, Stanford University (March 2003)
23. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632 (1999)
24. Lee, H.C., Borodin, A.: Perturbation of the hyper-linked environment. In: Warnow, T.J., Zhu, B. (eds.) COCOON 2003. LNCS, vol. 2697, pp. 272–283. Springer, Heidelberg (2003)
25. Liu, F., Yu, C.T., Meng, W.: Personalized web search by mapping user queries to categories. In: CIKM, pp. 558–565 (2002)
26. Ng, A.Y., Zheng, A.X., Jordan, M.I.: Link analysis, eigenvectors and stability. In: IJCAI, pp. 903–910 (2001)
27. Qiu, F., Cho, J.: Automatic identification of user interest for personalized search. In: WWW (2006)
28. Sun, J.-T., Zeng, H.-J., Liu, H., Lu, Y., Chen, Z.: CubeSVD: a novel approach to personalized web search. In: WWW, pp. 382–390 (2005)
29. Teevan, J., Dumais, S.T., Horvitz, E.: Personalizing search via automated analysis of interests and activities. In: SIGIR, pp. 449–456 (2005)
30. Tsaparas, P.: Application of non-linear dynamical systems to web searching and ranking. In: PODS, pp. 59–70 (2004)
Author Index
Almeida, Virgílio 50
Andersen, Reid 25
Boldi, Paolo 116
Bonato, Anthony 127
Borodin, Allan 167
Bressan, Marco 76
Cataudella, Stefano 143
Chellapilla, Kumar 25
Chung, Fan 38, 62
Clements, Maarten 155
de Vries, Arjen P. 155
Foschini, Luca 143
Gonen, Mira 13
Gulli, Antonio 143
Hadi, Noor 127
Horn, Paul 38, 127
Lang, Kevin J. 1
Lee, Hyun Chul 167
Litvak, Nelly 90
Lu, Linyuan 38
Meira Jr., Wagner 50
Miranda, Lucas 50
Mourão, Fernando 50
Peserico, Enoch 76
Prałat, Paweł 127
Reinders, Marcel J.T. 155
Rocha, Leonardo 50
Santini, Massimo 116
Scheinhardt, Werner 90
Shavitt, Yuval 13
Titsias, Michalis 104
Vazirgiannis, Michalis 104
Vigna, Sebastiano 116
Volkovich, Yana 90
Wang, Changping 127
Zacharouli, Polyxeni 104
Zwart, Bert 90