Modeling and Simulation in Science, Engineering and Technology Series Editor Nicola Bellomo Politecnico di Torino Italy Advisory Editorial Board M. Avellaneda (Modeling in Economics) Courant Institute of Mathematical Sciences New York University 251 Mercer Street New York, NY 10012, USA
[email protected]
H.G. Othmer (Mathematical Biology) Department of Mathematics University of Minnesota 270A Vincent Hall Minneapolis, MN 55455, USA
[email protected]
K.J. Bathe (Solid Mechanics) Department of Mechanical Engineering Massachusetts Institute of Technology Cambridge, MA 02139, USA
[email protected]
L. Preziosi (Industrial Mathematics) Dipartimento di Matematica Politecnico di Torino Corso Duca degli Abruzzi 24 10129 Torino, Italy
[email protected]
P. Degond (Semiconductor and Transport Modeling) Mathématiques pour l’Industrie et la Physique Université P. Sabatier Toulouse 3 118 Route de Narbonne 31062 Toulouse Cedex, France
[email protected] A. Deutsch (Complex Systems in the Life Sciences) Center for Information Services and High Performance Computing Technische Universität Dresden 01062 Dresden, Germany
[email protected] M.A. Herrero Garcia (Mathematical Methods) Departamento de Matematica Aplicada Universidad Complutense de Madrid Avenida Complutense s/n 28040 Madrid, Spain
[email protected] W. Kliemann (Stochastic Modeling) Department of Mathematics Iowa State University 400 Carver Hall Ames, IA 50011, USA
[email protected]
V. Protopopescu (Competitive Systems, Epidemiology) CSMD Oak Ridge National Laboratory Oak Ridge, TN 37831-6363, USA
[email protected] K.R. Rajagopal (Multiphase Flows) Department of Mechanical Engineering Texas A&M University College Station, TX 77843, USA
[email protected] Y. Sone (Fluid Dynamics in Engineering Sciences) Professor Emeritus Kyoto University 230-133 Iwakura-Nagatani-cho Sakyo-ku Kyoto 606-0026, Japan
[email protected]
Dynamics On and Of Complex Networks Applications to Biology, Computer Science, and the Social Sciences
Niloy Ganguly Andreas Deutsch Animesh Mukherjee Editors
Birkhäuser Boston • Basel • Berlin
Editors Niloy Ganguly Indian Institute of Technology Department of Computer Science and Engineering Kharagpur 721302 India
[email protected]
Andreas Deutsch Center for Information Services and High Performance Computing Technische Universität Dresden 01062 Dresden Germany
[email protected]
Animesh Mukherjee Indian Institute of Technology Department of Computer Science and Engineering Kharagpur 721302 India
[email protected]
ISBN: 978-0-8176-4750-6 DOI: 10.1007/978-0-8176-4751-3
e-ISBN: 978-0-8176-4751-3
Library of Congress Control Number: 2009921285 Mathematics Subject Classification (2000): 05C85, 68M10, 82B43, 90B15, 90B18, 90B40, 90C35, 91D30, 92D30, 94C15 © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Birkhäuser Boston, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed on acid-free paper Birkhäuser Boston is part of Springer Science+Business Media (www.birkhauser.com)
Preface
In the context of network theory, complex networks can be defined as collections of nodes connected by edges that represent various complex interactions among the nodes. Almost any large-scale system, be it natural or man-made, can be viewed as a complex network of interacting entities that evolves dynamically over time. Naturally occurring networks include biological, ecological and social networks (e.g., metabolic networks, gene regulatory networks, protein interaction networks, signaling networks, epidemic networks, food webs, scientific collaboration networks and acquaintance networks), whereas man-made networks include communication networks and transportation infrastructures (e.g., the Internet, the World Wide Web, peer-to-peer networks, power grids and airline networks).

This edited volume is a sequel to the workshop Dynamics on and of Complex Networks (http://www.cel.iitkgp.ernet.in/~eccs07/) held as a satellite event of the fourth European Conference on Complex Systems in Dresden, Germany from October 1–5, 2007. The primary aim of this workshop was to systematically explore the statistical dynamics “on” and “of” complex networks that prevail across a large number of scientific disciplines. Dynamics on networks refers to the different types of processes, for instance, proliferation and diffusion, that take place on networks. The functionality and efficiency of these processes are strongly tied to the underlying topology as well as the dynamic behavior of the network. On the other hand, dynamics of networks mainly refers to the phenomena of self-organization, which in turn lead to the emergence of the complex structure of the network. Another important motivation of the workshop was to create a forum for researchers applying the theories of complex networks to various domains as well as across several disciplines such as computer science, statistical physics, nonlinear dynamics, econometrics, biology, sociology and linguistics.
The workshop received a large number of quality submissions from authors pursuing research in multiple disciplines, thus making the forum truly interdisciplinary. The total number of participants who attended the workshop
was approximately 40. There were around 20 speakers, including both senior researchers and young scientists, who spoke about the dynamics on and of different systems exhibiting a complex network structure. The theme of this edited volume is identical to that of the workshop. Its primary aim is to show how the theories of complex networks are being successfully used by researchers to tackle numerous difficult problems in various domains. Towards this aim, it presents extended versions of some of the very high quality submissions received at the workshop together with new invited contributions, which can play an extremely important role in the understanding as well as advancement of the field. Since the target audience of this book is expected to be largely cross-disciplinary, the chapters have been made as readable as possible, explaining all the intricate technicalities in sufficient detail wherever necessary. The uniqueness of this volume lies in the fact that it presents an equal mix of (a) very relevant reviews (eight chapters) of important works in the field, which give the reader an up-to-date picture of the state of the art, and (b) independent research reports (eight chapters) providing a clear conception of how complex networks can be extremely useful in tackling even the hardest problems of a particular discipline. The editors feel that research in this area has reached a stage where there is an urgent need to have a comprehensive knowledge of the past and the present before the future can be planned. The blend of reviews and contributory chapters presented in this volume strives to achieve this objective and, thereby, to set the platform for “Phase II” research in complex networks.

The volume consists of three parts. The contributions in Part I center around the application of complex networks in the understanding of biological problems. This part consists of five chapters.
The first chapter is From Network Structure to Dynamics and Back Again: Relating Dynamical Stability and Connection Topology in Biological Complex Systems, in which Sitabhra Sinha presents a study of how the topology of a biological network influences the nature of its dynamics, and conversely, how dynamical considerations put constraints on the network structure. The next chapter deals with Regulation of Apoptosis via the NFκB Pathway: Modeling and Analysis, in which Madalena Chaves et al. model and analyze, in the framework of complex networks, the interaction of the nuclear factor κB with the apoptosis signaling pathway. In the third chapter, Network-Based Models in Molecular Biology, Andreas Beyer presents a survey on the extensive literature that employs complex networks to understand numerous intricate phenomena in biology. The fourth chapter, Ecological Networks: Structure, Interaction Strength, and Stability, by Samit Bhattacharyya and Somdatta Sinha, presents a detailed survey of the various studies conducted on ecological networks and especially on food webs. In the last chapter, Signaling and Feedback in Biological Networks, Sandeep Krishna et al. review some important studies on the signaling and feedback mechanisms that are observed in different biological networks.
Part II is also spread over five chapters and focuses on social networks. This part begins with a chapter on Topographic Spreading Analysis of an Empirical Sex Workers’ Network, by Johannes Bjelland et al., where the authors present a “topographic” analysis of spreading (of HIV) on an empirical network of female sex workers. The authors find that, if perfect condom protection is made possible, the HIV graph breaks into small components, thereby reducing the spreading. The next chapter, Spectral Characterization of Network Structures and Dynamics, by Anirban Banerjee and Jürgen Jost, centers around the investigation of the spectral properties of complex networks with a special thrust on social networks. The third chapter, Dynamics of Social Complex Networks: Some Insights into Recent Research, is authored by Sergi Lozano and presents a comprehensive review of how complex network theory has been instrumental in explaining the structure and the dynamics of a society. The last two chapters show how complex networks can be applied to explain the dynamics of human languages. The first one, titled The Structure and Dynamics of Linguistic Networks, by Monojit Choudhury and Animesh Mukherjee, is a review of the current literature on linguistic networks. The second one, Networks Generated from Natural Language Text, by Chris Biemann and Uwe Quasthoff, presents a survey focusing on how corpus linguistics (i.e., the study of language as expressed in corpora) can be studied within the framework of complex networks.

Part III presents a comprehensive overview of the networks that are prevalent in information sciences. This part is laid out in six chapters. The first chapter in this part, Efficiency of Navigation in Indexed Networks, by Petter Holme, explores the efficiency of navigation of data packets on “indexed” graphs.
The second chapter, Evolution of Apache Open Source Software, by Haoran Wen et al., attempts to explain the evolution of the Apache open source software through the analysis of its call graphs. The next chapter, Some New Applications of Network Growth Models, by Gourab Ghoshal, presents new models of growth for peer-to-peer file-sharing networks. The fourth chapter, The Big Friendly Giant: The Giant Component in Clustered Random Graphs, by Yakir Berchenko et al., is a theoretical study of the properties of the giant component in a special kind of random graph, which is relevant for various information networks. The fifth chapter, Technological Networks, by Bivas Mitra, presents a detailed review of the large number of studies that have been conducted on information networks, especially the World Wide Web and peer-to-peer networks. The last chapter, Advances in the Theory of Complex Networks, by Fernando Peruani, presents a survey of some of the theoretical advances that have taken place and helps provide a better understanding of the structure and dynamics of information networks.

These contributions collectively demonstrate that complex networks indeed provide an elegant research framework relevant to a variety of scientific disciplines. The chapters are designed to serve as the state of the art not only for students and newcomers who intend to pursue research in this field but
also for the experts. All the chapters have been carefully peer reviewed for their scientific content as well as readability and self-consistency. We would like to thank the authors for their contributions, constructive co-operation and gracious acceptance of the editorial comments. We are also indebted to Ranjita Bhagwan, Chris Biemann, Lutz Brusch, Geoffrey Canright, Michael Gamon, Gourab Ghoshal, Petter Holme, A. Kumaran, Abyayananda Maiti, Pabitra Mitra, Luis Morelli, Gautam Mukherjee, Romit Roy Choudhury, Gustavo Sibona and Biplab K. Sikdar for their constructive criticisms, comments and suggestions, which have significantly improved the quality of the chapters. In addition, we would like to extend our gratitude to Rishabh Singh for his painstaking effort in helping to prepare the Glossary of Essential Terms. Finally, we are grateful to Tom Grasso and the Birkhäuser team for all their help and support towards the publication of this volume.

Niloy Ganguly, Kharagpur, India
Andreas Deutsch, Dresden, Germany
Animesh Mukherjee, Kharagpur, India
Contents
Preface
List of Contributors

Part I Biological Sciences

From Network Structure to Dynamics and Back Again: Relating Dynamical Stability and Connection Topology in Biological Complex Systems
Sitabhra Sinha

Regulation of Apoptosis via the NFκB Pathway: Modeling and Analysis
Madalena Chaves, Thomas Eissing, and Frank Allgöwer

Network-Based Models in Molecular Biology
Andreas Beyer

Ecological Networks: Structure, Interaction Strength, and Stability
Samit Bhattacharyya and Somdatta Sinha

Signaling and Feedback in Biological Networks
Sandeep Krishna, Mogens H. Jensen, and Kim Sneppen

Part II Social Sciences

Topographic Spreading Analysis of an Empirical Sex Workers’ Network
Johannes Bjelland, Geoffrey Canright, Kenth Engø-Monsen, and Valencia P. Remple

Spectral Characterization of Network Structures and Dynamics
Anirban Banerjee and Jürgen Jost

Dynamics of Social Complex Networks: Some Insights into Recent Research
Sergi Lozano

The Structure and Dynamics of Linguistic Networks
Monojit Choudhury and Animesh Mukherjee

Networks Generated from Natural Language Text
Chris Biemann and Uwe Quasthoff

Part III Information Sciences

Efficiency of Navigation in Indexed Networks
Petter Holme

Evolution of Apache Open Source Software
Haoran Wen, Raissa M. D’Souza, Zachary M. Saul, and Vladimir Filkov

Some New Applications of Network Growth Models
Gourab Ghoshal

The Big Friendly Giant: The Giant Component in Clustered Random Graphs
Yakir Berchenko, Yael Artzy-Randrup, Mina Teicher, and Lewi Stone

Technological Networks
Bivas Mitra

Advances in the Theory of Complex Networks
Fernando Peruani

Glossary of Essential Terms
Index
List of Contributors
Frank Allgöwer Institute for Systems Theory and Automatic Control University of Stuttgart Pfaffenwaldring 9 70550 Stuttgart Germany
[email protected] Yael Artzy-Randrup Biomathematics Unit Faculty of Life Sciences Tel Aviv University Ramat Aviv 69978 Israel
[email protected] Anirban Banerjee Max Planck Institute for Molecular Genetics Ihnestr. 63–73 14195 Berlin Germany
[email protected] Yakir Berchenko Interdisciplinary Brain Research Center Bar Ilan University Ramat Gan 52900 Israel
[email protected]
Andreas Beyer Biotechnology Center Technische Universität Dresden 01062 Dresden Germany andreas.beyer@biotec.tu-dresden.de Samit Bhattacharyya Mathematical Modelling and Computational Biology Group Centre for Cellular and Molecular Biology, CSIR Hyderabad 500007 India
[email protected] Chris Biemann Institute for Computer Science NLP Department University of Leipzig Johannisgasse 26 04103 Leipzig Germany
[email protected] Johannes Bjelland Telenor R&I 1331 Fornebu Norway
[email protected]
Geoffrey Canright Telenor R&I 1331 Fornebu Norway
[email protected] Madalena Chaves COMORE, INRIA 2004 Route des Lucioles, BP 93 06902 Sophia-Antipolis France
[email protected] Monojit Choudhury Microsoft Research India Sadashivnagar Bangalore 560080 India
[email protected] Raissa M. D’Souza Department of Mechanical and Aeronautical Engineering Center for Computational Science and Engineering University of California Davis, CA 95616 USA
[email protected] Thomas Eissing Bayer Technologies Services GmbH PT-AS Systems Biology 51368 Leverkusen Germany thomas.eissing@bayertechnology.com Kenth Engø-Monsen Telenor R&I 1331 Fornebu Norway
[email protected]
Vladimir Filkov Department of Computer Science University of California Davis, CA 95616 USA
[email protected]
Gourab Ghoshal Department of Physics, and Michigan Center for Theoretical Physics University of Michigan Ann Arbor MI, 48109 USA
[email protected]
Petter Holme Department of Physics Umeå University 90187 Umeå Sweden
[email protected]
Mogens H. Jensen Center for Models of Life Niels Bohr Institute Blegdamsvej 17 2100 Copenhagen Denmark
[email protected]
Jürgen Jost Max Planck Institute for Mathematics in the Sciences Inselstr. 22 04103 Leipzig Germany Santa Fe Institute Santa Fe, NM 87501 USA
[email protected]
Sandeep Krishna Center for Models of Life Niels Bohr Institute Blegdamsvej 17 2100 Copenhagen Denmark
[email protected] Sergi Lozano ETH Zürich Swiss Federal Institute of Technology UNO D11 Universitätstr. 41 8092 Zürich Switzerland
[email protected] Bivas Mitra Department of Computer Science and Engineering Indian Institute of Technology Kharagpur 721302 India
[email protected]
Uwe Quasthoff Institute for Computer Science NLP Department University of Leipzig Johannisgasse 26 04103 Leipzig Germany quasthoff@informatik.uni-leipzig.de Valencia P. Remple BC Centre for Disease Control Epidemiology University of British Columbia Vancouver, BC V5Z 4R4 Canada
[email protected] Zachary M. Saul Department of Computer Science University of California Davis, CA 95616 USA
[email protected]
Animesh Mukherjee Department of Computer Science and Engineering Indian Institute of Technology Kharagpur 721302 India
[email protected]
Sitabhra Sinha The Institute of Mathematical Sciences CIT Campus Taramani Chennai 600113 India
[email protected]
Fernando Peruani Service de Physique de l’Etat Condensé (SPEC/CEA) and Complex System Institute Paris Ile-de-France (ISC-PIF) F-75005, Paris France
[email protected]
Somdatta Sinha Mathematical Modelling and Computational Biology Group Centre for Cellular and Molecular Biology, CSIR Hyderabad 500007 India
[email protected]
Kim Sneppen Center for Models of Life Niels Bohr Institute Blegdamsvej 17 2100 Copenhagen Denmark
[email protected]
Mina Teicher Interdisciplinary Brain Research Center Bar Ilan University Ramat Gan 52900 Israel
[email protected]
Lewi Stone Biomathematics Unit Faculty of Life Sciences Tel Aviv University Ramat Aviv 69978 Israel
[email protected]
Haoran Wen Department of Mechanical and Aeronautical Engineering Center for Computational Science and Engineering University of California Davis, CA 95616, USA
[email protected]
From Network Structure to Dynamics and Back Again: Relating Dynamical Stability and Connection Topology in Biological Complex Systems Sitabhra Sinha The Institute of Mathematical Sciences, CIT Campus, Taramani, Chennai 600113, India;
[email protected]
1 Introduction

To see a world in a grain of sand,
And a heaven in a wild flower,
Hold infinity in the palm of your hand,
And eternity in an hour.
– William Blake, Auguries of Innocence

Like Blake, physicists look for universal principles that are valid across many different systems, often spanning several length or time scales. While the domain of physical systems has often offered examples of such widely applicable “laws,” biological phenomena tended to be, until quite recently, less fertile in terms of generating similar universalities, with the notable exception of allometric scaling relations [20]. However, this situation has changed with the study of complex networks emerging into prominence. Such systems comprise a large number of nodes (or elements) linked with each other according to specific connection topologies, and are seen to occur widely across the biological, social and technological worlds [4, 9, 16]. Examples range from the intra-cellular signaling system, which consists of different kinds of molecules affecting each other via enzymatic reactions, to the internet, composed of servers around the world which exchange enormous quantities of information packets regularly, and food webs, which link, via trophic relations, large numbers of inter-dependent species. While the existence of complex networks in various domains had been known for some time, the recent excitement among physicists working on such systems has to do with the discovery of certain universal principles among systems which had hitherto been considered very different from each other.

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_1, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
Reflecting the development of the modern theory of critical phenomena, the rise of the physics of complex networks has been driven by the simultaneous occurrence of detailed empirical studies of extremely large networks, made possible by the advent of affordable high-power computing, and the development of statistical mechanics tools to analyze the new network models. Prior to these developments, the networks that were studied by physicists belonged to either the class of (i) regular networks, defined on geometrical lattices, where each node interacted with all the neighboring nodes belonging to a specified neighborhood, or (ii) random networks, where any pair of nodes had a fixed probability of being linked, i.e., interacting with each other. The first work that focused public attention on the new network approach presented a class of network models that were neither regular nor random, but exhibited properties of both [28]. Such small-world networks, as they were referred to, exhibited high clustering (with nodes sharing a common neighbor having a higher probability of being connected to each other than to other nodes) and a very low average path length (where the path length between any two nodes is defined as the smallest number of connected nodes one has to go through in order to reach one node starting from the other). As the former property characterized a regular network, while the latter was typical for a random network, this new class of networks was intermediate between the extremes of the two well-known network models, which was manifest in their construction procedure (Fig. 1). Several networks occurring in reality, in particular, the power grid, the actor collaboration network and the neural connection patterns of the C. elegans worm, were shown to have the small-world property. Later, other examples were added to this list, including the network of co-active functional brain areas [1] and the Indian railway system [21].
Very soon afterwards, it was discovered that the frequency distribution of node degree (i.e., the number of links a node has) exhibits a power-law scaling form for a large variety of systems, including the world wide web [3].
Fig. 1. Constructing a small-world network on a 2-dimensional square lattice substrate. Starting from a regular network (left) where each node is connected to its nearest and next-nearest neighbors, a fraction p of the links are rewired among randomly chosen pairs of nodes. When all the links are rewired, i.e., p = 1, the system is identical to a random network (right). For small p, the resulting network (center) still retains the local properties of the regular network (e.g., high clustering), while exhibiting global properties of a random network (e.g., short average path length).
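The construction in Fig. 1 and the two quantities it trades off can be sketched in a few lines. The following is only an illustration under simplifying assumptions: it uses a one-dimensional ring rather than the two-dimensional lattice of the figure, it rewires one end of each selected link (in the style of the Watts–Strogatz procedure), and all function names are hypothetical. Path length is counted here in links along the shortest path.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Regular ring: every node is linked to its k nearest neighbours on each side."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def rewire(adj, p, rng):
    """With probability p, detach a link from one end and reattach it to a random node."""
    nodes = list(adj)
    edges = [(i, j) for i in nodes for j in adj[i] if i < j]
    for i, j in edges:
        if rng.random() < p:
            # Avoid self-loops and duplicate links.
            candidates = [m for m in nodes if m != i and m not in adj[i]]
            if candidates:
                new = rng.choice(candidates)
                adj[i].remove(j)
                adj[j].remove(i)
                adj[i].add(new)
                adj[new].add(i)
    return adj


def average_clustering(adj):
    """Mean fraction of a node's neighbour pairs that are themselves linked."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute zero
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length (in links) over all reachable ordered pairs."""
    total = 0
    pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


rng = random.Random(0)
regular = ring_lattice(100, 2)
small_world = rewire(ring_lattice(100, 2), 0.1, rng)
random_like = rewire(ring_lattice(100, 2), 1.0, rng)

for name, graph in [("regular", regular), ("small world", small_world), ("random", random_like)]:
    print(name, round(average_clustering(graph), 3), round(average_path_length(graph), 3))
```

For the regular ring used here the clustering is exactly 0.5. A modest amount of rewiring is expected to leave clustering fairly high while shortening the average path length noticeably, which is the small-world regime described in the caption.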
This further underlined the fact that most networks occurring in reality are neither regular (in which case the degree distribution would be close to a delta function) nor random (which has a Poisson degree distribution), as for both cases the probability of having a node with large degree (i.e., a hub) would be significantly smaller than that indicated by the power-law tail of empirically obtained degree distributions. In addition, it was observed that there exist non-trivial degree correlations among linked pairs of nodes. For example, a network where nodes with high degree tend to preferentially connect with other high degree nodes is said to show assortative mixing [15]. On the other hand, in a disassortative network, nodes with a large number of links prefer to connect with nodes having low degree. Empirical studies indicate that most biological and technological networks are disassortative, while social networks tend to be assortative [16]. As assortative mixing promotes percolation and makes a network more robust to vertex removal, it may be hard to understand why natural evolution in the biological world has favored disassortativity. However, in a recent study, we have shown that when one considers the stability of dynamical states of a network, disassortative networks would tend to be more robust, and this may be one of the reasons why they are preferred [6]. This brings us to the thrust of recent work in the area of complex networks which has shifted from the initial focus on purely structural aspects of the connection topology to the role such features play in determining the dynamical processes defined on a network [27]. Over the past few years, much effort has been made to understand not only how structure affects dynamics, and hence function, in a network, but also the reverse problem of how functional criteria, such as the need for dynamical stability, can constrain the topological properties of a network. 
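The notion of assortative and disassortative mixing used above can be made concrete with a small calculation. The sketch below assumes the common definition of the degree assortativity coefficient as the Pearson correlation between the degrees found at the two ends of a link; the function name and the two toy graphs are illustrative choices, not taken from the studies cited.

```python
def degree_assortativity(edges):
    """Pearson correlation of the degrees at either end of each undirected link."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # Count every link in both directions so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs.extend([degree[u], degree[v]])
        ys.extend([degree[v], degree[u]])

    n = len(xs)
    mean = sum(xs) / n
    covariance = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    variance = sum((x - mean) ** 2 for x in xs) / n
    # Undefined (division by zero) when all nodes have the same degree.
    return covariance / variance


# A star: the hub connects only to degree-1 leaves (disassortative).
star = [(0, leaf) for leaf in range(1, 6)]

# A single link plus a separate triangle: like connects with like (assortative).
like_with_like = [(0, 1), (2, 3), (3, 4), (2, 4)]

r_star = degree_assortativity(star)
r_like = degree_assortativity(like_with_like)
print(r_star, r_like)
```

A negative coefficient indicates that high-degree nodes tend to link to low-degree nodes, and a positive one that nodes link to others of similar degree. The two toy graphs sit at the extremes of that range.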
In this chapter, some of the principal results obtained by our group will be briefly described. The goal of our research program is to understand the evolution of robust yet complex biological structures, viz., networks occurring in reality that are stable against perturbations and, yet, which can adapt to a changing environment.
2 Biological Networks: Some Examples Across Length Scales

Before describing our results, which are applicable to a wide range of networks, we provide motivation for our general approach by briefly discussing in this section a few examples of biological networks. Although they span an enormous range of length scales, from $\sim 10^{-8}$ m in the case of protein contact networks to $\sim 10^{5}$ m in the case of ecological interaction networks, they are often subject to similar constraints and may share common structural and dynamical properties. Questions about networks in one domain may often have answers and ramifications in another domain.
Molecular scale: protein contact network. Protein structure, viewed as a network of non-covalent connections between the constituent amino acids, is one of the smallest length scale networks in the natural world. Its nodes are the Cα atoms of each amino acid, and their interaction strength is determined by their proximity to each other. Two nodes are considered to be linked if the Euclidean distance between them (in 3-dimensional space) is less than a cutoff value $d_c$, usually between 8 and 14 Å, which is the relevant distance for non-covalent interactions. Figure 2 shows the KirBac1.1 protein, which belongs to the family of potassium ion channels involved in transmission of inward rectifying current across a cellular membrane [13]. The protein consists of four identical subunits spanning the membrane and intra-cellular regions. The corresponding protein contact network (PCN) manifests the existence of the identical subunits in the approximately block diagonal structure of the adjacency matrix. In addition, each of these four blocks can be divided into two modules, corresponding approximately to the membrane and intra-cellular regions. It is easy to see that the PCN shares the features of a small-world network, with the majority of connections between spatially neighboring nodes, although there are a few long-range connections. This small-world property of PCNs for different protein molecules has indeed been noted several times in the literature (see, e.g., Ref. [2]). This is probably not very surprising, given that it is also true for a randomly folded polymer. However, in addition, the PCN adjacency matrix shows a modular structure, with a majority of connections occurring between nodes belonging to the same module. This is a feature not seen in conventional models of small-world networks (e.g., the Watts–Strogatz model [28]). It is all the more intriguing, as we have recently
[Figure 2: protein structure (left) and PCN adjacency matrix (right). The matrix axes are ticked from 200 to 1000; the blocks are labeled subunit I to subunit IV, with the membrane domain and intra-cellular domain marked.]

Fig. 2. Structure of the KirBac1.1 protein (left), which comprises four identical subunits spanning the membrane and intra-cellular regions [13]. The PCN is constructed by considering a cutoff distance of $d_c$ = 12 Å, and its adjacency matrix is shown for the entire network (right). Each of the four blocks corresponding to a subunit shows a clear partition into membrane and intra-cellular compartments, indicating a modular structure.
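The cutoff rule used to build a protein contact network is simple enough to sketch directly. The snippet below is a minimal illustration, assuming that the Cα coordinates are already available as (x, y, z) tuples in Å; the function name and the toy coordinates (six points on a straight line, 3.8 Å apart) are hypothetical and do not come from the KirBac1.1 structure.

```python
import math


def contact_network(coords, cutoff):
    """Adjacency matrix linking every pair of points closer than the cutoff distance."""
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                adj[i][j] = adj[j][i] = 1
    return adj


# Toy "chain" of six nodes on a line (illustrative coordinates only).
coords = [(3.8 * i, 0.0, 0.0) for i in range(6)]
adj = contact_network(coords, cutoff=12.0)
degrees = [sum(row) for row in adj]
print(degrees)
```

With real coordinates, the same matrix could be inspected for the block structure described in the caption. Note that the double loop costs O(n²) distance evaluations, which is acceptable for about a thousand nodes but would call for a spatial index on much larger structures.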
shown that modular networks (whatever the connection topology of individual modules) exhibit the small-world properties of high clustering and low average path length [18]. To identify whether the existence of modules indeed has a significant effect on protein dynamics (e.g., during folding), we look at the spectral properties of the Laplacian matrix¹ $L$, defined as $L_{ii} = k_i$, where $k_i$ is the degree of node $i$, and, for $i \neq j$, $L_{ij} = -1$ if nodes $i$ and $j$ are connected and $0$ otherwise. The eigenvector for the smallest eigenvalue (= 0), $c^{(1)}$, corresponds to the time-invariant properties of the system and has uniform contribution from all components. The next few smallest eigenvalues dominate the time-dependent behavior of the protein and show a relatively large spectral gap with the bulk of the eigenvalue spectrum. This indicates the existence of very distinct time scales in the protein dynamics, which approximately correspond to the inter- and intra-modular modes of motion. As we shall see, the occurrence of modular structures in complex networks and their effect on dynamics is not just confined to PCNs but appears in many other biological networks.

Intra-cellular scale: signaling network. Signal transduction pathways, through which a cell responds appropriately to a signal or stimulus, involve ordered sequences of biochemical reactions carried out by enzymes inside the cell. One of the most commonly observed classes of enzymes in intra-cellular signaling is that of kinases, which activate target molecules (usually proteins) by transferring phosphate groups from energy donor molecules such as ATP to the targets. This process of phosphorylation is mirrored by the reverse process of deactivation by phosphatases through dephosphorylation.
Such reaction cascades are activated by second messengers (e.g., cyclic AMP or calcium ions) and may last for a few minutes, with the number of kinase proteins and other molecules involved in the process increasing with every reaction step away from the initial stimulation. Thus, such a signaling cascade can result in a large response for a relatively low-amplitude signal. Research over the past decade has, however, shown that the classical picture of almost isolated cascades linking a unique signal to a specific response does not explain many experimental results. The adaptability of intra-cellular signaling is now thought to be a result of multiple signaling pathways interacting with one another to form complex networks. In this picture, complexity arises from the large number of components, many of which have partially overlapping functions, from the large number of links (through enzymatic reactions) among components and from the spatial relationship between the components [29]. Figure 3 shows a small fraction of the signaling network downstream of the B-cell antigen receptor (BCR) involved in immune response. As the breakdown of communication in this network can lead to disease (a fact that may be utilized by infectious agents for proliferation), it is of obvious importance to understand the mechanisms by which the network allows the cell response to be sensitive to different stimuli and yet to be robust in the presence of intra-cellular noise. With this in mind, the time evolution
1 The Laplacian matrix is also referred to as the Kirchhoff matrix (e.g., see Ref. [10]).
S. Sinha
[Figure 3: network diagram. Node labels: IgG receptor, Igα, Igβ, Syk, Pyk2, PI3K, Lyn, PIP2, PIP3, Shc, Btk, BLNK, Grb2, SOS, Rac, DAG, MEKK, Raf-1, MEK 1/2, Erk 1/2, PDK1, PLCg2, Vav, MKK 4/7, MKK 3/4/6, Jnk 1/2, p38, IP3, PKC, IKK, IkB, Akt, Ca2+, CaMK2, NFAT, Bad, Bcl2.]
Fig. 3. A subset of the signal transduction network of the BCR [12]. The kinases are represented by squares, while other molecules (such as second messengers and adapters) are depicted as circles.
of the activity (i.e., phosphorylation) of about 20 signaling molecules in this network was recorded in a recent experiment by Kumar et al. [12]. Apart from observing the activation profiles under normal conditions, the network was also subjected to a series of perturbations by serially blocking each of these molecules from activating any of the other molecules in the network. The resulting experimental data, capturing the behavior of these molecules under 21 different conditions, enabled the detection of correlations between the activity of these molecules. This showed that the existing picture of interactions (Fig. 3) is grossly inadequate in explaining these correlations, e.g., the fact that p38 kinase seems to influence the activation of a majority of the other molecules, although it occurs at the end of a particular pathway. The results suggest that the signaling network is, in fact, a far more densely connected system than had been previously suspected. This also raises the question of how certain signals can elicit very specific responses, without significant risk of cross-talk between interacting pathways. This brings us to the issue of whether functional modules can exist in networks, such that by using positive and negative interactions one can channel information from the stimulus to the response along specific subnetworks only.

Inter-cellular scale: neuronal network. The previous question is of importance not only for information processing within a cell, but also between cells. The most important example of the latter process is, of course, the networks of neurons occurring in the brain. As the nervous system of the nematode C. elegans comprising 302 neurons has been completely mapped out (in terms of the positions of the neurons, as well as all their interconnections), it provides a model system for studying these issues.
We have recently analyzed the connection topology of the non-pharyngeal portion of the nervous system to which the majority of the neurons (280) belong [7]. One of the striking
observations is that many of the sensory neurons belonging to different modalities, viz., chemosensation, mechanosensation, etc., send signals to the same set of densely connected interneurons which forms the innermost core of the nervous system. Subsequently, signals are sent from these interneurons to specific motor neurons which generate appropriate muscle response, e.g., moving along a chemical gradient, egg laying, etc. It is vital that the signals coming from different sensory neurons to the same interneurons should not interfere with each other, as it may result in activation of the incorrect motor response. A preliminary investigation of a dynamical model for the neuronal network shows that a complex set of excitatory and inhibitory links between the interneurons manages to achieve segregation of the different functional circuits. This means that, e.g., a mechanical tap signal will not elicit egg laying, even if the tap withdrawal circuit shares many common interneurons with the egg-laying circuit. Even more interesting is the fact that such functional modules do not need the existence of structural modules in the underlying networks. It underscores the importance of looking at the nature of the interactions, which can create complicated control mechanisms to prevent cross-talk and enable robust response in the presence of environmental noise.

Inter-organism scale: epidemic propagation network. At the scale of individual organisms, such as human beings, one of the most widely studied networks is that which leads to propagation of epidemics. The ubiquity of small-world networks in nature implies that some of the classic theories of epidemiological transmission, based on assumptions of random connections, may need to be reviewed. In particular, the global spread of diseases like SARS shows that even a few long-range links can drastically enhance the propagation of epidemics [8].
This has led to a series of studies of different disease propagation models on Watts–Strogatz or related network models (e.g., see Ref. [19]). However, as mentioned above, all the structural features of such networks are also shared by modular networks, although modular networks have very different dynamical properties. We have recently shown that while Watts–Strogatz networks have a continuous range of time scales, modular networks exhibit very distinct time scales that are related to intra- and inter-modular events [18]. Thus, an effective strategy to counter the spread of epidemics must take into account a detailed knowledge of such structures in the social network of contagious and susceptible individuals.

Inter-species scale: food webs. Possibly the largest (in terms of length scale) biological networks on earth are those of interactions between different species in an ecosystem. While general ecological networks consist of all possible links, such as cooperation and competition, food webs describe the trophic relations, i.e., between predator and prey. A food web is a directed network where the nodes are the various species, with prey connected by arrows to predators, the direction of the arrow indicating the flow of biomass. The links are usually weighted to represent the amount of energy that is transferred.
It is in the context of these networks that questions first arose on the connection between the structural properties of a network and the stability of its dynamical behavior (see Section 4). Indeed, one not only asks what kind of structures allow complex networks to be stable against ever-present perturbations, but also how the requirement to be robust constrains the kind of structures such networks can evolve. To stress the universality of the questions asked by physicists about networks, we note that, like many other networks, food webs also have been shown to have a modular structure, with species in each module interacting between themselves strongly and only weakly with other species [11]. As in the other systems discussed earlier, the role that modularity plays in stabilizing the dynamics of ecosystems can be seen as a specific instance of a much more general question. Having discussed a few instances of how universal principles about networks can appear by investigating very different systems in the biological world, we now describe certain results of our studies on general network models. However, we stress that each of these results has relevance to problems appearing in the context of specific biological systems.
3 From Structure to Dynamics
The role that the connection topology of a network plays in the nature of its dynamics has been extensively investigated for spin models occurring in physics. In fact, such systems had been explored for a long time prior to the recent interest in complex networks, and many results are known regarding ordering transitions in both regular and random structures. More recently, it has been shown that, for partial random rewiring in a system of sufficiently large size, any finite value of p (the rewiring probability) causes a transition to the small-world regime, with the Ising model defined on such a network exhibiting a finite temperature ferromagnetic phase transition [5]. However, spin models are extremely restricted in their dynamical repertoire; therefore, researchers have looked at the effect of introducing other kinds of node dynamics in such network structures, e.g., oscillators. Motivated by recent observations that the brain may have a connection structure with small-world properties (see e.g., Ref. [1]), we have examined the effect of long-range connections (i.e., non-local diffusion) over an otherwise regular network of nodes with links between nearest neighbors on a square lattice [25]. The dynamics considered is that of the excitable type, with the variable having a single stable state and a threshold. If a perturbation causes the system variable to exceed the threshold, we see a rapid transition to a metastable excited state followed by a slow recovery phase when the system gradually converges to the stable state. As a result of coupling the dynamics of individual nodes through diffusive coupling, various spatial patterns (which may be temporally varying) are observed. Such a dynamics is commonly observed in a large variety of biological
[Figure 4: schematic with panels "Spatial Patterns", "Temporal Patterns" (activity vs. time) and "Burn-out" (activity vs. time), arranged along an axis of rewiring probability p from 0 to 1 on which the critical values $p_c^l$ and $p_c^u$ are marked.]
Fig. 4. Schematic diagram indicating the different dynamical regimes in a 2-dimensional small-world excitable medium as a function of the rewiring probability, p. For low p, the system exhibits spatial patterns characterized by single or multiple spirals. At $p = p_c^l$, there is a transition to a state dominated by temporally periodic patterns that are spatially relatively homogeneous. Above $p = p_c^u$, all activity ceases after a brief transient.
cells such as neurons and cardiac myocytes, as well as in non-linear chemical systems such as the Belousov–Zhabotinsky reaction. In our simulations, by varying the probability of long-range connections, p, we have observed three categories of patterns. For $0 < p < p_c^l$, after an initial transient period where multiple coexisting circular waves are observed, the system is eventually spanned by a single or multiple rotating spiral waves whose temporal behavior is characterized by a flat power spectral density. At $p = p_c^l$, the system undergoes a transition from a regime with temporally irregular, spatial patterns to one with spatially homogeneous, temporally periodic patterns (Fig. 4). The latter behavior occurs over the range $p_c^l < p < p_c^u$ as a result of the increased number of long-range connections, whereby a large fraction of the system is synchronously active and subsequently goes into the recovery phase. Beyond the upper critical value $p_c^u$, there is no longer any self-sustained activity in the system, as all nodes converge to the stable state. The patterns in each regime were found to be extremely robust against even large perturbations or disorder in the system. Our model explains several hitherto unexplained observations in experimental systems where non-local diffusion had been implemented [26]. In addition, by identifying the long-range connections with those made by neurons and the regular network with that formed by the glial cells in the brain, our results provide a possible explanation of why evolution may have preferred to increase the number of glial cells over neurons (with a ratio of more than 10:1 for certain parts of the human brain) in order to maintain robust dynamical patterns as brain size increased. It also points towards a possible functional role of the small-world brain topology in the occurrence of dynamical diseases such as epileptic seizures and bursts. More generally, our work shows
how non-standard network topologies can influence system dynamics by generating different kinds of spatio-temporal patterns depending on the extent of non-local diffusion.
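The ingredients of such a simulation (excitable nodes with a rest state, an excited state and a recovery phase, local coupling, and a tunable density of long-range links) can be sketched with a far simpler model than the one studied in [25]. The following is an illustrative toy under stated assumptions: a three-state cellular automaton of the Greenberg–Hastings type replaces the continuous excitable dynamics, a one-dimensional ring replaces the square lattice, and shortcuts are added with probability p per node rather than rewired. All function names are hypothetical.

```python
import numpy as np

REST, EXCITED, REFRACTORY = 0, 1, 2

def ring_with_shortcuts(n_nodes, p, rng):
    """Nearest-neighbour ring plus, with probability p per node, one extra
    link to a uniformly chosen node (an assumption of this sketch)."""
    neighbours = [{(i - 1) % n_nodes, (i + 1) % n_nodes} for i in range(n_nodes)]
    for i in range(n_nodes):
        if rng.random() < p:
            j = int(rng.integers(n_nodes))
            if j != i:
                neighbours[i].add(j)
                neighbours[j].add(i)
    return neighbours

def step(state, neighbours):
    """One synchronous update: rest -> excited if any neighbour is excited,
    excited -> refractory, refractory -> rest."""
    new_state = np.empty_like(state)
    for i, s in enumerate(state):
        if s == REST:
            excited_nbr = any(state[j] == EXCITED for j in neighbours[i])
            new_state[i] = EXCITED if excited_nbr else REST
        elif s == EXCITED:
            new_state[i] = REFRACTORY
        else:
            new_state[i] = REST
    return new_state

def activity_trace(n_nodes, p, n_steps, seed=0):
    """Fraction of excited nodes at each time step, and the final state."""
    rng = np.random.default_rng(seed)
    neighbours = ring_with_shortcuts(n_nodes, p, rng)
    state = np.zeros(n_nodes, dtype=int)
    # An excited node with a refractory node behind it launches a one-way wave.
    state[0], state[-1] = EXCITED, REFRACTORY
    trace = []
    for _ in range(n_steps):
        trace.append(np.mean(state == EXCITED))
        state = step(state, neighbours)
    return np.array(trace), state

for p in (0.0, 0.05, 0.5):
    trace, _ = activity_trace(n_nodes=200, p=p, n_steps=400)
    print(f"p = {p}: mean activity over the last 100 steps = {trace[-100:].mean():.3f}")
```

With p = 0 the wave simply circulates around the ring. How the activity changes as shortcuts are added depends on the details of this toy rule, so the sketch should not be read as reproducing the critical values in Fig. 4.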
4 From Dynamics to Structure
An important functional criterion for most networks occurring in nature and society is the stability of their dynamical states. While earlier studies have concentrated on the robustness of the network when subjected to structural perturbations (e.g., removal of nodes or links), we have looked at the effect of perturbations on the steady states of network dynamics. In particular, the question we ask is whether networks become more susceptible to small perturbations as their size (i.e., number of nodes N) increases, the connections between the nodes become denser (i.e., increased connection probability C) and the average strength of interaction (s) increases. This is related to a decades-old controversy, often referred to as the stability-complexity debate. In the early 1970s, May [14] had shown that for a model ecological network, where species are assumed to interact with a randomly chosen subset of all other species, an arbitrarily chosen equilibrium state of the system becomes unstable if any of the parameters determining the network's complexity (e.g., N, C or s) is increased. In fact, by using certain results of random matrix theory, the critical condition for the stability of the network was shown to be $NCs^2 < 1$ (May–Wigner theorem) [14]. This flew against common wisdom, gleaned from a large number of empirical studies as well as naive reasoning, which dictated that increased diversity and/or stronger interactions between species result in more robust ecosystems. Thus, ever since the publication of these results, there have been attempts to understand the reason behind the apparent paradox, especially as this result relates not only to ecological systems but extends to all dynamical networks for which the stability of equilibria has functional significance, e.g., in intra-cellular biochemical networks where the concentrations of different molecules need to be maintained within physiological levels.
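The May–Wigner condition can be checked numerically with a short sketch. It assumes the usual random-matrix construction: a diagonal of −1 (self-regulation) and off-diagonal entries that are nonzero with probability C and drawn from a zero-mean normal distribution with standard deviation s. The function name `max_growth_rate` and the parameter values are illustrative. The equilibrium is linearly stable when the largest real part of the eigenvalues is negative.

```python
import numpy as np

def max_growth_rate(n, c, s, rng):
    """Largest real part of the eigenvalues of a random interaction matrix.

    Off-diagonal entries are nonzero with probability c and drawn from a
    normal distribution with zero mean and standard deviation s; the
    diagonal is set to -1.
    """
    weights = rng.normal(0.0, s, size=(n, n))
    mask = rng.random((n, n)) < c
    matrix = weights * mask
    np.fill_diagonal(matrix, -1.0)
    return np.linalg.eigvals(matrix).real.max()

rng = np.random.default_rng(1)
n, c = 200, 0.1
for s in (0.1, 0.5):
    complexity = n * c * s**2  # the quantity N C s^2 in the stability condition
    rate = max_growth_rate(n, c, s, rng)
    verdict = "stable" if rate < 0 else "unstable"
    print(f"N C s^2 = {complexity:.2f}: max Re(lambda) = {rate:+.2f} ({verdict})")
```

The two parameter sets lie well on either side of $NCs^2 = 1$. Close to the threshold, finite-size fluctuations blur the transition for a single random matrix.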
Two of the common charges leveled against the theoretical model of May are that (i) it assumes the interaction network to be random, whereas naturally occurring networks may have certain kinds of structures, and (ii) the linear stability analysis assumes the existence of simple steady states (viz., fixed point attractors), which may not be the case for real systems that may either be oscillating or in a chaotic state. In our work on dynamical systems defined on networks, we have tried to address both of these lines of criticism (see Ref. [31] for a recent discussion of our results from the perspective of ecosystem robustness). For example, focusing on the question of the inadequacy of linear stability analysis, we have considered networks with non-trivial dynamics at the nodes, spanning the range from simple steady states to periodic oscillation and fully developed chaos, and measured the robustness of the dynamics with respect to variations in N, C and s [23, 24].
Each node in our model network has a dynamical variable associated with it, which evolves according to a well-known class of difference equations commonly used for modeling population dynamics. By varying a non-linear parameter, the nature of the dynamics (i.e., whether it converges to a steady state or undergoes chaotic fluctuations) at each node can be controlled. However, in the absence of coupling, each node will always have a finite, positive value for its dynamical variable. When coupled in a network (initially in a random fashion) with links that can have either positive or negative weights, it is possible that as a result of dynamical fluctuations, the variable for some nodes can become negative or zero. As this implies the absence of any activity, the corresponding node is considered to be “extinct” and thus isolated from the network. This procedure may create further fluctuations and cause more nodes to become “extinct,” resulting in gradual reduction of the size of the network (Fig. 5). The final asymptotic size of the network, relative to its initial size, is a measure of its robustness: the more robust network is one with a higher fraction of nodes having persistent activity. Analysis showed that the network robustness (as measured by the above global criterion) not only decreased with N, C and s, as expected from local stability analysis, but actually matched the May–Wigner theorem quantitatively [23]. In addition, the asymptotic network exhibited robust macroscopic features: (a) the number of persistently active nodes was independent of the initial network size, and (b) the asymptotic number of links between these persistently active nodes was independent of both the initial size and connectivity [24]. This is all the more surprising, as the removal of nodes (and hence, links) is not guided by any explicit fitness criterion, but rather emerges naturally from the nodal dynamics through fluctuations of individual node properties.
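The extinction procedure described above can be sketched as follows. The text does not state the update rule, so the form used here, $x_i(n+1) = f\big(x_i(n)\,[1 + \sum_j J_{ij} x_j(n)]\big)$ with the exponential logistic (Ricker) map $f(x) = x\,e^{r(1-x)}$, is an assumption chosen for illustration. The function names and parameter values are likewise hypothetical.

```python
import numpy as np

def ricker(x, r=4.0):
    """Exponential logistic (Ricker) map, a standard population model."""
    return x * np.exp(r * (1.0 - x))

def surviving_fraction(n, c, s, n_steps=500, seed=0):
    """Fraction of nodes that remain active after n_steps iterations."""
    rng = np.random.default_rng(seed)
    # Random signed couplings: present with probability c, strength ~ N(0, s^2).
    coupling = rng.normal(0.0, s, (n, n)) * (rng.random((n, n)) < c)
    np.fill_diagonal(coupling, 0.0)
    x = rng.uniform(0.1, 1.0, n)
    alive = np.ones(n, dtype=bool)
    for _ in range(n_steps):
        x = ricker(x * (1.0 + coupling @ x))
        # A node whose variable becomes zero or negative is "extinct".
        extinct = alive & (x <= 0.0)
        alive &= ~extinct
        x[~alive] = 0.0  # extinct nodes no longer influence the rest
    return alive.mean()

print("uncoupled network:", surviving_fraction(n=50, c=0.0, s=0.1))
print("coupled network:  ", surviving_fraction(n=50, c=0.2, s=0.5))
```

Setting an extinct node's variable to zero removes its contribution to every sum $\sum_j J_{ij} x_j$, which is one simple way of isolating it from the network.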
Our results imply that asymptotically
Fig. 5. Evolution of a network with non-trivial dynamics at the nodes. The initial (left) and final asymptotic (right) networks are shown. Only nodes having persistent activity are connected to the network. The figures were drawn using Pajek software.
active networks are non-extensive: when two networks of size N are coupled to each other (with the same connectance as the individual networks), although the resulting network initially has a size 2N, the ensuing dynamical fluctuations will reduce its size to N. This implies that simply increasing the number of redundant elements is not a good strategy for designing robust systems. We have also looked at the effect of empirically reported structures, such as small-world connection topology and scale-free degree distribution, on the dynamical stability of networks. Our results indicate that, in general, introducing such structural features does not alter the outcome expected from the May–Wigner theorem [6, 22]. However, these details can indeed affect the nature of the stability-instability transition; for example, the transition exhibits a cross-over from being very sharp (resembling a first-order phase transition) for a random network to a more gradual change as the network becomes more regular in the small-world regime [22].
5 Evolution of Robust Networks
This brings us to the issue of how complex networks can be stable at all, given that the May–Wigner theorem seems to hold even for networks that have structures similar to those seen in reality and where non-trivial dynamical situations have also been considered. The solution to this apparent paradox lies in the observation that most networks that we see around us did not occur fully formed but emerged through a process of gradual evolution, where stability with respect to dynamical fluctuations is likely to be one of the key criteria for survival. In earlier work, we have shown that a simple model, where nodes are gradually added to or removed from a network according to whether this results in a dynamically stable network or not, leads to a non-equilibrium steady state in which the network is extremely robust [30]. The robustness is manifested by increased resistance and resilience, as well as decreased probability of large extinction cascades, when the network size (i.e., the system diversity) is increased. Thus, our results reconcile the apparently contradictory conclusions of the May–Wigner theorem and a large number of empirical studies. More recently, we have shown that model networks can evolve many of the observed structural features seen among networks in the natural world, by taking into account the fact that the majority of such systems must optimize between several (often conflicting) constraints, which may be structural as well as dynamical in nature. In particular, most networks need to have high communication efficiency (i.e., low average path length) and low connectivity (to reduce the resource cost involved in maintaining many links) while being stable with respect to dynamical perturbations. If a network satisfied only the first two constraints, the optimal structure would have been that of a star (Fig. 6).
Even if the resource cost constraint is somewhat relaxed, so that the network can have more links than the minimum necessary to make it
Fig. 6. Networks with (I) star and (II) clustered star connection topologies can form the fundamental building blocks of different types of modular networks. Network configurations with clustered star modules can be constructed by (A) connecting different modules by single undirected links among the hub nodes, or (B) connecting nodes of a module to another module only through the hub node of the latter, or (C) connecting nodes of a module randomly to any node of another module.
connected, the resulting optimal configuration is slightly modified to that of a “clustered” star. However, we note that the dynamical equilibria in such systems would be extremely unstable with respect to small perturbations. This happens because the rate of growth of small perturbations is related to the maximum degree of the network, which, in the case of a star or a clustered star, is almost identical to the system size. It is easy to see that dividing the network into multiple stars, connected to each other, will reduce the maximum degree and hence increase the stability. Indeed, our results show that simultaneous optimization of all three constraints results in networks with modular structure, i.e., subnetworks with a high density of connections within themselves compared to between distinct subnetworks, where each module possesses a prominent hub [17] (see Fig. 6 for possible configurations of such modular networks). As these evolved systems also exhibit heterogeneous degree distribution, our findings have implications for a wide range of systems in the biological and technological worlds where such features have been observed.
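The effect of splitting a star into several connected stars can be illustrated numerically. The sketch below uses the largest eigenvalue of the unweighted adjacency matrix as a proxy for the growth rate of perturbations. That choice is an assumption of this example, since the text only states that the growth rate is related to the maximum degree. For a single star the two are directly linked: the largest eigenvalue equals the square root of the hub degree. The helper names and network sizes are illustrative.

```python
import numpy as np

def star(n):
    """Adjacency matrix of a star: node 0 is the hub of n - 1 leaves."""
    adj = np.zeros((n, n))
    adj[0, 1:] = adj[1:, 0] = 1.0
    return adj

def ring_of_stars(n_modules, module_size):
    """Star-shaped modules whose hubs are linked to each other in a ring."""
    n = n_modules * module_size
    adj = np.zeros((n, n))
    hubs = [m * module_size for m in range(n_modules)]
    for hub in hubs:
        adj[hub:hub + module_size, hub:hub + module_size] = star(module_size)
    for a, b in zip(hubs, hubs[1:] + hubs[:1]):
        adj[a, b] = adj[b, a] = 1.0
    return adj

def largest_eigenvalue(adj):
    return np.linalg.eigvalsh(adj).max()

single = star(64)
modular = ring_of_stars(n_modules=4, module_size=16)
for name, adj in (("single star", single), ("four linked stars", modular)):
    print(f"{name}: max degree = {int(adj.sum(axis=1).max())}, "
          f"largest eigenvalue = {largest_eigenvalue(adj):.3f}")
```

Both networks have 64 nodes, but the modular arrangement has a much smaller maximum degree and a correspondingly smaller leading eigenvalue.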
Acknowledgments
I would like to thank my collaborators with whom the work described here has been carried out, in particular, R. K. Pan, S. Sinha, N. Chatterjee, M. Brede, C. C. Wilmers, J. Saramäki and K. Kaski, as well as S. Vemparala, D. Kumar, K. V. S. Rao and B. Saha for helpful discussions.
References
1. Achard, S., Salvador, R., Whitcher, B., Suckling, J., Bullmore, E.: A resilient, low-frequency, small-world human brain functional network with highly connected association cortical hubs. J. Neurosci., 26, 63–72 (2006)
2. Aftabuddin, M., Kundu, S.: Hydrophobic, hydrophilic and charged amino acid networks within protein. Biophys. J., 93, 225–231 (2007)
3. Albert, R., Barabási, A.L.: Emergence of scaling in random networks. Science, 286, 509–512 (1999)
4. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Rev. Mod. Phys., 74, 47–97 (2002)
5. Barrat, A., Weigt, M.: On the properties of small-world network models. Eur. Phys. J. B, 13, 547–560 (2000)
6. Brede, M., Sinha, S.: Assortative mixing by degree makes a network more unstable. Arxiv preprint, cond-mat/0507710 (2005)
7. Chatterjee, N., Sinha, S.: Understanding the mind of a worm: Hierarchical network structure underlying nervous system function in C. elegans. Prog. Brain Res., 168, 145–153 (2007)
8. Deem, M.W.: Mathematical adventures in biology. Physics Today, 60(1), 42–47 (2007)
9. Dorogovtsev, S.N., Mendes, J.F.F.: Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford Univ. Press, Oxford (2003)
10. Haliloglu, T., Bahar, I., Erman, B.: Gaussian dynamics of folded proteins. Phys. Rev. Lett., 79, 3090–3093 (1997)
11. Krause, A.E., Frank, K.A., Mason, D.M., Ulanowicz, R.U., Taylor, W.W.: Compartments revealed in food-web structure. Nature, 426, 282–284 (2003)
12. Kumar, D., Srikanth, R., Ahlfors, H., Lahesmaa, R., Rao, K.V.S.: Capturing cell-fate decisions from the molecular signatures of a receptor-dependent signaling response. Molecular Systems Biology, 3, 150 (2007)
13. Kuo, A., Gulbis, J.M., Antcliff, J.F., Rahman, T., Lowe, E.D., Zimmer, J., Cuthbertson, J., Ashcroft, F.M., Ezaki, T., Doyle, D.A.: Crystal structure of the potassium channel KirBac1.1 in the closed state. Science, 300, 1922–1926 (2003)
14. May, R.M.: Stability and Complexity in Model Ecosystems. Princeton Univ. Press, Princeton (1973)
15. Newman, M.E.J.: Assortative mixing in networks. Phys. Rev. Lett., 89, 208701 (2002)
16. Newman, M.E.J.: The structure and function of complex networks. SIAM Review, 45, 167–256 (2003)
17. Pan, R.K., Sinha, S.: Modular networks emerge from multiconstraint optimization. Phys. Rev. E, 76, 045103(R) (2007)
18. Pan, R.K., Sinha, S.: The small world of modular networks. Arxiv preprint, arXiv:0802.3671 (2008)
19. Saramäki, J., Kaski, K.: Modelling development of epidemics with dynamic small-world networks. J. Theor. Biol., 234, 413–421 (2005)
20. Schmidt-Nielsen, K.: Scaling: Why is Animal Size So Important? Cambridge Univ. Press, Cambridge (1984)
21. Sen, P., Dasgupta, S., Chatterjee, A., Sreeram, P.A., Mukherjee, G., Manna, S.S.: Small-world properties of the Indian railway network. Phys. Rev. E, 67, 036106 (2003)
22. Sinha, S.: Complexity vs. stability in small-world networks. Physica A, 346, 147–153 (2005)
23. Sinha, S., Sinha, S.: Evidence of universality for the May-Wigner stability theorem for random networks with local dynamics. Phys. Rev. E, 71, 020902(R) (2005)
24. Sinha, S., Sinha, S.: Robust emergent activity in dynamical networks. Phys. Rev. E, 74, 066117 (2006)
25. Sinha, S., Saramäki, J., Kaski, K.: Emergence of self-sustained patterns in small-world excitable media. Phys. Rev. E, 76, 015101(R) (2007)
26. Steele, A.J., Tinsley, M., Showalter, K.: Spatiotemporal dynamics of networks of excitable nodes. Chaos, 16, 015110 (2006)
27. Strogatz, S.H.: Exploring complex networks. Nature, 410, 268–276 (2001)
28. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature, 393, 440–442 (1998)
29. Weng, G., Bhalla, U.S., Iyengar, R.: Complexity in biological signaling systems. Science, 284, 92–96 (1999)
30. Wilmers, C.C., Sinha, S., Brede, M.: Examining the effects of species richness on community stability: An assembly model approach. Oikos, 99, 363–367 (2002)
31. Wilmers, C.C.: Understanding ecosystem robustness. Trends Ecol. Evol., 22, 504–506 (2007)
Regulation of Apoptosis via the NFκB Pathway: Modeling and Analysis
Madalena Chaves,1 Thomas Eissing,2 and Frank Allgöwer3
1 COMORE, INRIA, 2004 Route des Lucioles, BP 93, 06902 Sophia-Antipolis, France; [email protected]
2 Bayer Technologies Services GmbH, PT-AS Systems Biology, Germany; [email protected]
3 Institute for Systems Theory and Automatic Control, University of Stuttgart, Pfaffenwaldring 9, 70550 Stuttgart, Germany; [email protected]
1 Introduction
Programmed cell death (or apoptosis) has an essential biological function, enabling successful embryonic development, as well as maintenance of a healthy living organism [6]. Apoptosis is a physiological process which enables an organism to remove unwanted or damaged cells. Malfunctioning apoptotic pathways can lead to many diseases, including cancer and inflammatory or immune system related problems. A family of proteins called caspases are primarily responsible for execution of the apoptotic process: basically, in response to appropriate stimuli, initiator caspases (for instance, caspases 8, 9) activate effector caspases (for instance, caspases 3, 7), which will then cleave various cellular substrates to accomplish the cell death process [22].

Nuclear factor κB (NFκB) is a transcription factor for a large group of genes which are involved in several different pathways. For instance, NFκB activates its own inhibitor (IκB) [14] as well as groups of pro-apoptotic and anti-apoptotic genes [21]. Among the latter, NFκB activates transcription of a gene encoding the inhibitor of apoptosis protein (IAP). This protein in turn contributes to downregulating the activity of the caspase cascade which forms the core of the apoptotic pathway [6, 8].

The canonical NFκB pathway is induced, among other stimuli, by the cytokine tumor necrosis factor α (TNFα) [21]. Binding of TNFα to death receptor TNFR1 forms a first complex which eventually activates NFκB. A second complex is later formed, which will activate the initiator caspase 8 [6], and hence activate the apoptotic process. The same signal (TNFα stimulation) thus triggers two parallel but contrary pathways: the pro-apoptotic caspase cascade and the anti-apoptotic NFκB-IκB-IAP pathway. These two pathways, together with the interactions among their components, form a
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_2, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
M. Chaves et al.
complex network which shapes the decision on cell survival or initiation of programmed cell death. To contribute to a better understanding of the role of NFκB in the regulation of apoptosis, we propose a qualitative study of this system and its dynamics, based on a discrete (Boolean) model of the complex network. This discrete model closely follows a continuous one, recently developed and studied in [23, 24]. The model integrates the well-known model for the NFκB pathway [17] and the caspase cascade [8]. Boolean models provide a convenient formalism to describe protein and gene networks [25]. The states of the network components (e.g., proteins or messenger RNAs) are characterized as “expressed” or “not expressed” and are represented by logical variables (with values 0 or 1). The interactions among the various components are classified as “inhibition” or “activation” links (these can generally be deduced from gene/protein expression data). Boolean models thus describe the network structure of a system without involving any kinetic details. The qualitative behaviour of a system can be seen as an emergent property of this structure. Boolean models are especially useful in the case of large networks [1, 9], for which kinetic parameters are often unknown, but qualitative properties such as generation of specific gene expression patterns, stability or multistability, and oscillatory modes can be studied. Several methods have been developed for analysis of discrete and qualitative models [2, 5, 7, 13, 26]. Using an approach which combines discrete rules with continuous degradation rates, our model reproduces many of the known properties of the system, notably the oscillatory dynamics that can be induced by the NFκB-IκB negative feedback loop [14, 15, 19]. We explore different configurations for the network structure and predict its effects on the decision between cell survival or apoptosis.
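As a hedged illustration of the Boolean formalism described above (and not the chapter's model, which combines discrete rules with continuous degradation rates), the following toy network contains a single negative feedback loop. A hypothetical `activator` switches on an `inhibitor`, which in turn switches the activator off, loosely in the spirit of the NFκB-IκB loop. Synchronous updating is an assumption of this sketch.

```python
from itertools import product

def update(state):
    """Synchronous update of a two-node negative feedback loop.

    state = (activator, inhibitor); the activator switches on the
    inhibitor, and the inhibitor switches off the activator.
    """
    activator, inhibitor = state
    return (int(not inhibitor), activator)

def trajectory(state, n_steps):
    """List of states visited, starting from `state`."""
    states = [state]
    for _ in range(n_steps):
        state = update(state)
        states.append(state)
    return states

def attractor_period(state, max_steps=16):
    """Length of the cycle eventually reached from `state`."""
    seen = {}
    for t in range(max_steps):
        if state in seen:
            return t - seen[state]
        seen[state] = t
        state = update(state)
    return None

for initial in product((0, 1), repeat=2):
    print(initial, "-> period", attractor_period(initial), trajectory(initial, 4))
```

Even this two-node rule set has no fixed point: every initial condition ends up on an oscillation. This shows how a qualitative property such as an oscillatory mode can follow from network structure alone, without kinetic parameters.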
2 The Model

The network of interactions among the NFκB pathway and the apoptosis signaling cascade to be studied here is shown in Fig. 1. The various components of the network (here messenger RNAs, proteins, or protein complexes) form the set of variables or nodes (Xi, i = 1, ..., n) of the Boolean model. The system will evolve according to a set of logical rules which are deduced from the interactions or links depicted in the schematic diagram of Fig. 1. The interactions among nodes can be classified as “activation” or “inhibition” links: a directed arrow Xi → Xj means that a high concentration of component Xi activates component Xj, while the symbol Xi ⊣ Xj means that a high concentration of component Xi inhibits Xj. The components in our model and the activation or inhibition links among them are based on existing literature data. For general aspects, the reviews [6, 21] were used. However, some pathways of regulation among the NFκB pathway and the caspase cascade are not yet clear, and more work is needed to understand how these two signaling pathways are interconnected. In this chapter, we aim to investigate and test several possible hypotheses for
Regulation of Apoptosis via the NFκB Pathway: Modeling and Analysis
Fig. 1. Schematic diagram of the NFκB pathway and the caspase cascade (light shaded regions). The oval dark grey shaded region represents the cellular nucleus. Both pathways are activated by binding of TNFα to death receptor TNFR1 (the resulting complex is represented simply by the rectangle TNF). Messenger RNAs are represented by ellipses, while transcription factors, caspases, and other proteins are represented by squares. To study the interconnections between the two pathways, four network variants, based on different combinations of the links A, L, and C, will be analysed and compared (see Table 2).
the combined network structure. We will consider four model variants and try to discriminate between them by comparing our numerical analysis with experimental data from the literature. The four network variants (see Table 2) are based on different combinations of three links (A, L, C in Fig. 1) which have been suggested but are not fully established in the apoptosis literature. The NFκB pathway follows very closely the model presented in [17]. Stimulation of death receptors with TNFα leads (see for instance [6]), first, to the formation of a complex I (T1 in Fig. 1) which will recruit and activate inhibitor of IκB kinases (IKK). Inhibitor of NFκB, or IκB, acts by binding to NFκB molecules and preventing their transcriptional function. Active IKK (IKKa) phosphorylates IκB, which releases NFκB, thus enabling its translocation to the nucleus and transcription of NFκB-dependent genes, including genes for inhibitor of apoptosis protein (iap), inhibitor of NFκB (iκB), a protein associated with inhibition of complex T2 (flip), and a protein regulating IKK activity (a20) [21]. Transcription of IκB mRNA generates a negative feedback
loop in the NFκB pathway [14, 20], which may lead to oscillatory behaviour in NFκB and IκB concentrations [19]. In a second step, after dissociation of components of complex I from the death receptor, a second complex is formed (T2 in Fig. 1) which will recruit and activate initiator caspase 8 (C8a). As a result of the signaling cascade [8, 22], effector caspase 3 is also activated (C3a). Thus, complex T1 activates the anti-apoptotic pathway and, after a certain delay, complex T2 activates the pro-apoptotic pathway. Two well-documented points of regulation of the apoptotic pathway by NFκB are inhibition of C3a by IAP and regulation of complex T2 by FLIP [6]. Active caspase 8 was found to be negatively regulated by caspase-8 and caspase-10-associated RING proteins (CARPs) [18], which seem to play a role analogous to that of IAPs, but are less well studied. It was found that CARPs are overexpressed in tumors, and that their suppression leads to restoration of the apoptotic pathway, with the CARP being rapidly cleaved. In addition, it was observed that inhibitors of caspase 3 block CARP cleavage. In our model, we introduced CARP and a pre-complex CARP0, which is inhibited by C3a. Inhibition by C3a is, however, not sufficient to control CARP, and there are probably other regulators. Since CARP plays a role with respect to caspases 8 and 10 similar to that of IAP with respect to caspases 3 and 9 (and in the absence of further details), we assume that the pre-complex CARP0 is also regulated by a product of the NFκB pathway. The points where the caspase cascade influences the NFκB pathway are less well documented. We will use our model to test different hypotheses by studying and comparing the network dynamics for the following cases (see also Table 2): inhibition of IKKa (link L) and/or NFκB (link A) by C3a, or neither of these links present. To obtain the logical rules shown in Table 1, some simplifications of the biological processes were inevitably introduced.
For instance, the bound complex NFκB−IκB (either in the cytoplasm or in the nucleus) was not explicitly considered in the system, but was simply treated as an inhibition effect: the rule for NFκB says that it vanishes whenever IκB is expressed. Thus, any state with NFκB = 0 and IκB = 1 represents in fact a high concentration of bound complex NFκB − IκB, while any state with NFκB = 1 and IκB = 0 represents a high concentration of free NFκB and low concentration of free IκB. To translate our diagram into a set of logical rules, the convergence of two or more arrows (either activation or inhibition) at the same node was always treated as a logical AND, except in three cases: IκB, IAP, and CARP0 . For these proteins, the overall effect was treated as an AND in the presence of TNF stimulation, but treated as an OR in the absence of TNF. These three proteins represent inhibitors whose levels should be stable in the absence of any stimulus [8]: IAP and CARP0 (or CARP) should be effective inhibitors of the caspases, and IκB should be at approximately constant levels to control NFκB transcriptional activity. In contrast, with TNF stimulation, the degradation rates of these proteins can vary and lead to rapid changes in their concentrations (different degradation rates in the presence or absence of TNF
have been observed, notably for bound IκB [20]). For instance, under TNF treatment, the rule for inhibition of NFκB is simplified to IκB+ = [iκB and not IKKa]. Suppose that IKK becomes activated at time t1 , that is IKKa(t1 ) = 1. Then, in the next iteration of the model, the IκB rule implies that IκB will degrade very fast, with IκB(t1 +Δ) = 0. In contrast, in the absence of the TNF stimulus, the rule is IκB+ = [iκB or not IKKa]. If IKK becomes active at time t1 , one has IκB(t1 + Δ) = iκB(t1 ), meaning that IκB is only rapidly degraded if no more of its messenger RNA is available. A similar reasoning justifies the rules for IAP and CARP0 . The rules for these three proteins with inhibiting roles reflect the fact that their degradation rates, and hence turnover, can be much faster in response to TNF stimulation.
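The switch between AND and OR described above can be made concrete with a small sketch. It only illustrates the IκB rule as stated in the text; the function name `ikb_next` and its argument names are assumptions introduced here, not part of the original model.

```python
def ikb_next(tnf: bool, ikb_mrna: bool, ikka: bool) -> bool:
    """Next value of IκB given TNF, its mRNA (iκB) and active IKK (IKKa)."""
    if tnf:
        # With TNF: both conditions are required (logical AND).
        return ikb_mrna and not ikka
    # Without TNF: either condition suffices (logical OR).
    return ikb_mrna or not ikka


# Case discussed in the text: IKK has just become active (IKKa = 1).
for tnf in (True, False):
    for ikb_mrna in (True, False):
        print(tnf, ikb_mrna, ikb_next(tnf, ikb_mrna, True))
```

With IKKa = 1, the TNF branch returns 0 whatever the mRNA level, while the branch without TNF returns the current value of iκB, which is the behaviour IκB(t1 + Δ) = iκB(t1) described above.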
3 Analysis of Boolean Models

Boolean networks are a representation of a system, consisting of a set of n variables or nodes X = (X1, ..., Xn), together with a set of logical rules (Fi(X), i = 1, ..., n) describing the evolution of the system from the current state (Xi at time t) to the next state (Xi at time t + Δ). The variables or nodes take values in the discrete set {0, 1}, where 1 (resp., 0) denotes the “expressed” (resp., “not expressed”) state of the node. The associated rules are typically a composition of logical OR and AND functions, which can be determined from gene/protein expression patterns (from Western blots or microarray data, for instance). The set of rules Fi given in Table 1 for the NFκB pathway and the caspase cascade is a translation of the diagram shown in Fig. 1. The temporal evolution of the system, X(t), t ∈ (0, ∞), is determined by successively iterating the logical rules Fi, for which several algorithms are available. Synchronous algorithms assume that all nodes are simultaneously updated:

\[ X_i^+ = F_i(X_1, \dots, X_n), \qquad i = 1, \dots, n, \tag{1} \]
where Xi ∈ {0, 1}, X = (X1, ..., Xn) denotes the state of the system at time t, and X+ = (X1+, ..., Xn+) denotes the next state (at t + Δ). Alternatively, with asynchronous algorithms, at each iteration the nodes are sequentially updated, according to a given order (which can be prespecified or randomly chosen). Discrete models focus on the structure of the network (links), thus offering a more qualitative description of the system’s dynamics. Continuous models may offer more detailed descriptions of a system, but they also have the disadvantage of involving a large set of kinetic parameters, many of which are unknown. A method for analysis of Boolean models was introduced in [12, 13], which provides a bridge between discrete and continuous approaches. In this method, each node Xi of the network is represented by one continuous variable (xi) and one discrete variable (Xi, as before). The continuous variables are
Table 1. Boolean rules for the model of regulation of apoptosis via the NFκB pathway. TNF is a constant input. Identification of the nodes is given in the text. The letter “a” juxtaposed to a variable name denotes the active form of a molecule. The subscript “nuc” denotes the given component in the cellular nucleus. Alternative rules are given for the presence/absence of links A, C, L.

Node        Boolean rule
T1+         TNF
T2+         T1 and not FLIP
IKKa+       {L} T1 and not A20a and not C3a
            {no L} T1 and not A20a
NFκB+       {A} not IκB and not C3a
            {no A} not IκB
NFκBnuc+    NFκB and not IκBnuc
iκB+        NFκBnuc
IκB+        [T1 and (iκB and not IKKa)] or [not T1 and (iκB or not IKKa)]
IκBnuc+     IκB
a20+        NFκBnuc
A20+        a20
A20a+       T1 and A20
iap+        NFκBnuc
IAP+        [T1 and (iap and not C3a)] or [not T1 and (iap or not C3a)]
flip+       NFκBnuc
FLIP+       flip
C3a+        not IAP and C8a
C8a+        {C} not CARP and (C3a or T2)
            {no C} C3a or T2
CARP0+      [T1 and (NFκBnuc and not C3a)] or [not T1 and (NFκBnuc or not C3a)]
CARP+       CARP0
governed by ordinary differential equations, which combine a synthesis rate (based on its Boolean rule) and a linear degradation rate:

\[ \frac{dx_i}{dt} = -a_i x_i + b_i F_i(X_1, X_2, \dots, X_n), \qquad i = 1, \dots, n. \tag{2} \]
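A short worked step, not spelled out in the original text, may help to see how equation (2) behaves between switching events. It assumes that the Boolean state X stays constant on an interval [t0, t], so that Fi(X) is a fixed value in {0, 1}; under that assumption, (2) is a scalar linear equation with an explicit solution:

```latex
x_i(t) = \frac{b_i}{a_i} F_i
       + \left( x_i(t_0) - \frac{b_i}{a_i} F_i \right) e^{-a_i (t - t_0)},
\qquad F_i \in \{0, 1\} \text{ constant on } [t_0, t].
```

Each xi therefore relaxes monotonically towards either 0 or bi/ai at rate ai. As an illustration, starting from xi(t0) = 0 with Fi = 1, the level xi = 0.5 bi/ai is reached after a time ln 2 / ai, so nodes with larger degradation rates respond faster.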
At each instant t, the discrete variable Xi is defined as a function of the continuous variable according to a threshold value of its maximal concentration:

\[ X_i(t) = \begin{cases} 0, & x_i(t) \le \theta_i \dfrac{b_i}{a_i}, \\[4pt] 1, & x_i(t) > \theta_i \dfrac{b_i}{a_i}, \end{cases} \tag{3} \]

where θi ∈ (0, 1) represents the fraction of maximal concentration which is necessary for component Xi to become “active” and perform its biological functions. Initial conditions are equal for discrete and continuous variables: Xi(0) = xi(0). It is easy to see that the hypercube [0, b1/a1] × · · · × [0, bn/an] is an invariant set for system (2). The continuous variables denote concentrations of molecules; they are translated into a Boolean 0/1 response according to θi. The discrete variables Xi represent expression (1) or non-expression (0)
of species i, according to whether its continuous concentration xi is above or below the threshold θi bi/ai. Letting the parameters ai, bi, and θi be specific for each node i allows us to study different time scales for different biological processes (for instance, transcription, translation, or post-translational processes, as in [5]), or investigate the relative turnover rates of two molecules. Similar piecewise linear systems have also been studied in [7, 26].

3.1 Steady States

The steady states of a Boolean model are given by all the possible solutions X* of the equations:

\[ X_i^* = F_i(X_1^*, \dots, X_n^*), \qquad i = 1, \dots, n. \]
It is easy to see that any steady state of the Boolean model yields a steady state of the piecewise linear equations (2), since

\[ \frac{dx_i}{dt} = 0 \iff x_i = \frac{b_i}{a_i} F_i(X_1, X_2, \dots, X_n), \qquad i = 1, \dots, n, \]

independently of θi. Because the right-hand side of this equation is discontinuous, it is difficult to provide general results on the existence and uniqueness of solutions for system (2) (see for instance [3] and [11]). In view of this difficulty, in the present study we will assume that trajectories are well defined and analyze their dynamical behavior. For the model of Table 1, the steady states depend on the value of TNF (see Table 2). It is not difficult to check that (both with and without link A) there are exactly two distinct steady states when TNF = 0, characterized by the presence or absence of caspases 3 and 8, and hence corresponding to the survival or apoptotic responses (nodes not indicated below are zero):

\[ \begin{aligned} (\mathrm{Ap}_0)\quad & T_1 = T_2 = 0,\ \mathrm{C3a} = \mathrm{C8a} = 1,\ \mathrm{I\kappa B} = \mathrm{I\kappa B}_{\mathrm{nuc}} = 1, \\ (\mathrm{Lf}_0)\quad & T_1 = T_2 = 0,\ \mathrm{I\kappa B} = \mathrm{I\kappa B}_{\mathrm{nuc}} = 1,\ \mathrm{CARP}_0 = \mathrm{CARP} = \mathrm{IAP} = 1. \end{aligned} \tag{4} \]

Table 2. Steady states of the Boolean model, for each model variant, in the presence and absence of TNF.

Model   Links            TNF = 0     TNF = 1   Oscillations?
I       A, C, no L       Ap0, Lf0    Ap1       Yes
II      L, C, no A       Ap0, Lf0    none      Yes
III     C, no A, no L    Ap0, Lf0    none      Yes
IV      L, no A, no C    Ap0, Lf0    none      Yes

This is in agreement with the idea that, under typical conditions, the cell should be capable of stably maintaining either an apoptotic or a survival
state [8, 4]. If TNF = 1, there is only one possible steady state for models with link A:

\[ (\mathrm{Ap}_1)\quad T_1 = T_2 = 1,\ \mathrm{C3a} = \mathrm{C8a} = 1. \tag{5} \]
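The steady-state statements of this section can be checked by brute force. The sketch below is one possible transcription of the rules of Table 1 into Python, not the authors' code: the ASCII node names, the 0/1 integer encoding, and the helper names `step` and `steady_states` are all assumptions made for illustration. It enumerates every Boolean state and keeps those that are mapped to themselves.

```python
from itertools import product

# Node order used for state tuples; ASCII names stand in for the Greek ones
# (e.g. "ikb" is the mRNA iκB, "IkB" the protein). These names are assumptions.
NODES = ("T1", "T2", "IKKa", "NFkB", "NFkBnuc", "ikb", "IkB", "IkBnuc",
         "a20", "A20", "A20a", "iap", "IAP", "flip", "FLIP",
         "C3a", "C8a", "CARP0", "CARP")


def step(s, tnf, A, L, C):
    """One synchronous update of the Table 1 rules; all values are ints 0/1."""
    (T1, T2, IKKa, NFkB, NFkBnuc, ikb, IkB, IkBnuc, a20, A20, A20a,
     iap, IAP, flip, FLIP, C3a, C8a, CARP0, CARP) = s

    def and_or(x, y):
        # AND in the presence of T1, OR in its absence (IκB, IAP, CARP0).
        return (x & y) if T1 else (x | y)

    return (
        tnf,                                        # T1
        T1 & (1 - FLIP),                            # T2
        T1 & (1 - A20a) & ((1 - C3a) if L else 1),  # IKKa, link L optional
        (1 - IkB) & ((1 - C3a) if A else 1),        # NFkB, link A optional
        NFkB & (1 - IkBnuc),                        # NFkBnuc
        NFkBnuc,                                    # ikb (mRNA)
        and_or(ikb, 1 - IKKa),                      # IkB
        IkB,                                        # IkBnuc
        NFkBnuc,                                    # a20 (mRNA)
        a20,                                        # A20
        T1 & A20,                                   # A20a
        NFkBnuc,                                    # iap (mRNA)
        and_or(iap, 1 - C3a),                       # IAP
        NFkBnuc,                                    # flip (mRNA)
        flip,                                       # FLIP
        (1 - IAP) & C8a,                            # C3a
        ((1 - CARP) if C else 1) & (C3a | T2),      # C8a, link C optional
        and_or(NFkBnuc, 1 - C3a),                   # CARP0
        CARP0,                                      # CARP
    )


def steady_states(tnf, A, L, C):
    """Return every fixed point as the set of nodes that are equal to 1."""
    found = []
    # A fixed point must satisfy T1 = TNF, so only the other nodes are enumerated.
    for rest in product((0, 1), repeat=len(NODES) - 1):
        s = (tnf,) + rest
        if step(s, tnf, A, L, C) == s:
            found.append({name for name, v in zip(NODES, s) if v})
    return found
```

For example, `steady_states(0, A=True, L=False, C=True)` lists the fixed points of variant I without TNF, and `steady_states(1, A=False, L=True, C=True)` those of variant II under TNF. Each call scans 2^18 states, which is slow but feasible. Under this transcription, the single TNF = 1 fixed point of variant I contains the nodes of (5) and additionally has IKKa = 1, because the “no L” rule for IKKa does not depend on C3a.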
For models with no link A, there is no possible steady state when TNF = 1, and there are only periodic orbits of period higher than 1. Therefore, during TNF treatment, models with link A may at any time make a decision towards the apoptotic pathway, while models with no link A will exhibit oscillatory behaviour and can only make a decision when TNF treatment ceases. Upon removal of TNF stimulation, trajectories of system (2) may be expected to converge to either the apoptotic or survival state. The choice of one or the other state will depend on the initial condition and the set of parameters ai, bi, and θi. Since these parameters are very likely to vary from cell to cell, it is reasonable to consider several (randomly chosen) sets of parameters and then compute the probability of convergence to each steady state. To examine the dynamics of system (2), and its dependence on parameters and the structure of the network of interactions, several numerical studies were performed, as described next.

3.2 Numerical Experiments

To test the model and analyse the effects of links A and L (Fig. 1), system (2) was simulated several times, with randomly chosen sets of parameters. For simplicity, the synthesis rates and threshold constants were fixed (bi = 1 and θi = 0.5 for all i), and only parameters ai were allowed to vary, chosen from a uniform distribution in the interval [1/3, 3] (h⁻¹). This seems reasonable, as the degradation rates used in [17] are roughly between 0.5 and 4 h⁻¹. Observe that ai plays a double role: it represents a degradation rate, but also defines the 0/1 threshold concentration (0.5/ai). Hence, high degradation rates also imply that a lower concentration is needed to achieve the 0/1 transition. Different durations of TNF stimulation were considered, namely: 2, 6, 11, 16, and 21 hours. For these simulations, one initial condition was chosen: IκB(0) = 1 and all other nodes set to zero.
This is based on a natural physiological starting point of the system: previous to stimulation, IKK is in its inactive form, while IκB is bound to NFκB, preventing transcriptional activity. Caspases reside in the cytosol in dormant forms [22]. To understand the importance of the links A, C, and L (the least well documented), four variants of the model depicted in Fig. 1 are compared: (I) links A and C present, (II) links L and C present, (III) only link C present, and (IV) only link L present (as listed in Table 2). The first three variants aim at comparing the effects of links A and L, and the last aims at evaluating the effect of link C. Other alternatives gave similar results (for example, a model with all three links gave results very similar to I) and thus are not detailed here. For each variant, the response of the system to each of the five TNF
durations was simulated 500 times. Since different sets of parameters {ai} introduce different time scales, variations in the dynamics from one simulation to another are expected. These variations may also be interpreted as a result of natural variability in biological systems. The average response over the 500 simulations will then yield the probability of the system converging towards each of the steady states. Other open questions that may be studied with our model include competition between the pro- and anti-apoptotic pathways and the point of irreversibility of the apoptotic decision. For instance, how long after caspase activation is recovery from the apoptotic pathway still possible [22]? To address these questions, numerical experiments were conducted by letting NFκB(0) = 1, setting all others to zero, and maintaining C3a(t) = 1 for durations of 10, 30, 60, and 360 minutes. For analysis of the numerical results, a “peak” in the trajectory of node Xj will be defined as a time interval [T0, T1], during which Xj(t) = 1, and such that Xj(T0 − Δ) = Xj(T1 + Δ) = 0. The period of oscillations is calculated as the average time interval between the onset of two consecutive peaks, i.e.,

\[ \text{Period} = \frac{1}{N_p - 1} \sum_{i=2}^{N_p} \left( T_{0,i} - T_{0,i-1} \right), \]

where Np is the number of peaks observed during the simulation time.
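The simulation and peak analysis described in this section can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the procedure actually used for the results: it integrates system (2) with a forward-Euler scheme (the original integration method is not specified), uses a hypothetical 3-node negative feedback ring instead of the full network of Table 1, and introduces the helper names `simulate`, `peaks`, `mean_period`, and `toy_loop`.

```python
import random


def simulate(a, F, x0, t_end, dt=0.01, b=1.0, theta=0.5):
    """Forward-Euler integration of dx_i/dt = -a_i*x_i + b*F_i(X), eqs. (2)-(3).

    Returns the sample times and the Boolean state X at each sample.
    """
    n = len(a)
    x = list(x0)
    times, states = [], []
    for k in range(int(round(t_end / dt)) + 1):
        # Eq. (3): node i is "on" when x_i exceeds the fraction theta of b/a_i.
        X = [1 if x[i] > theta * b / a[i] else 0 for i in range(n)]
        times.append(k * dt)
        states.append(X)
        f = F(X)
        x = [x[i] + dt * (-a[i] * x[i] + b * f[i]) for i in range(n)]
    return times, states


def peaks(times, series):
    """Completed peaks (T0, T1): runs of 1 preceded and followed by 0."""
    found, start = [], None
    for k in range(1, len(series)):
        if series[k - 1] == 0 and series[k] == 1:
            start = times[k]
        elif series[k - 1] == 1 and series[k] == 0 and start is not None:
            found.append((start, times[k - 1]))
            start = None
    return found


def mean_period(found):
    """Average interval between onsets of consecutive peaks; None if < 2 peaks."""
    onsets = [t0 for t0, _ in found]
    if len(onsets) < 2:
        return None
    gaps = [onsets[i] - onsets[i - 1] for i in range(1, len(onsets))]
    return sum(gaps) / len(gaps)


def toy_loop(X):
    # Hypothetical 3-node negative feedback ring: node 2 inhibits node 0.
    return [1 - X[2], X[0], X[1]]


random.seed(0)
a = [random.uniform(1 / 3, 3) for _ in range(3)]  # degradation rates (1/h)
times, states = simulate(a, toy_loop, x0=[0.0, 0.0, 1.0], t_end=20.0)
print("period (h):", mean_period(peaks(times, [X[2] for X in states])))
```

Repeating the last three lines for many random draws of `a`, and recording which steady state each run reaches after the stimulus is removed, would give the kind of frequency estimate described above; that outer loop is omitted here because it requires the full rule set.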
4 Results and Discussion

In the numerical simulations, it is observed that, once TNF stimulation ceases, a steady state pattern is always achieved, corresponding to either the apoptosis or survival states (4), (5). In the former case IκB is bound to NFκB, so that mRNAs and proteins downstream of NFκB are not expressed, and the cell has chosen the apoptotic pathway. The latter case represents survival of the cell, with IAP stably expressed preventing C3a activation, and CARP preventing C8a activation (see Fig. 2). In the presence of TNF stimulation, IκB, NFκB, and its dependent mRNAs/proteins may exhibit oscillatory dynamics, as observed experimentally in [14, 19]. In fact, computation of steady states shows that the models with no link A have no alternative but to exhibit oscillatory behaviour in the presence of TNF, since no possible steady states exist (except possible special solutions of the associated differential inclusion). The oscillatory behaviour (see analysis below) is in very good agreement with the experimental data reported in [19]. Qualitatively, all model variants respond in a similar fashion to TNF stimulation. As the stimulus duration increases, more cells choose the apoptotic pathway. Testing the four model variants shows that link A is very strong: not surprisingly, models with link A favour the apoptotic pathway, with 80% of cells reaching the apoptotic state, as opposed to around 50% or 40% in
[Figure 2: two columns of time-course panels for TNF, IKK, IkBn, NFkBn, IAP, C8a, and C3a; horizontal axis: Time (hours), 0 to 20; vertical axis: 0 to 1.]
Fig. 2. Example of network dynamics with the hybrid model (variant II), corresponding to cell survival (left) or apoptosis (right) solution. Numbers indicate the degradation rates for these numerical experiments. Solid lines represent normalized continuous variables (xi) and dashed lines represent discrete variables (Xi).

[Figure 3: Survival rate (%) versus TNF duration (hours), one curve for each model variant I, II, III, and IV.]
Fig. 3. Percentage of surviving cells for the four model variants.
models II and IV, or 30% in the model with only link C (which favours the anti-apoptotic pathway) (Fig. 3). These values appear to be in agreement with experimental data: Rehm et al. [22] report that, for 8 hour treatments with
[Figure 4: top row, Average period (hours) versus TNF duration (hours) for model variants I, II, and III, with separate curves for survival and apoptosis; bottom row, TPeak i − TPeak i−1 (hours) versus Peak i for model variants I, II, and III.]
Fig. 4. Top row: Average period of nuclear IκB oscillations for apoptotic or surviving cells, as a function of TNF stimulus duration. Vertical lines represent standard deviation over the 500 numerical experiments. Bottom row: Relative timing of successive peaks in IκB oscillations, for apoptotic (grey) or surviving (black) cells. The “+” signs mark the experimental peak timing in [19].
high and low concentrations of TNFα, the percentage of cells undergoing activation of effector caspases was, respectively, 86% and 24%. The numerical experiments with our model capture the response to high (or significant) concentrations of TNFα, so variants I (followed by II and IV) are closer to the real system. Quantitative analysis of the oscillatory behaviour reveals some interesting facts (Fig. 4). To characterize the oscillatory dynamics, the following quantities were computed for nuclear IκB: period of oscillations (approximated), number of peaks, and relative timing between peaks. First, in all cells oscillations cease when TNF stimulation ceases, in agreement with observations. Second, the timing of successive peaks is also in remarkable quantitative agreement with experimental data [19], see Fig. 4 (bottom row). The first peak in nuclear IκB concentration was observed about 72 minutes from the start of TNF stimulation, and the second peak appears about 4 hours later, very close to the 75 minutes and 4.5 hours reported in [19]. It is striking that the time span of the first peak is typically longer than that of the following peaks, and that the time lapse between consecutive peaks decreases (see Figs. 2, 4). Third, the average period of oscillations is fairly constant, but “depends” on the apoptosis/survival decision. Statistical analysis of the period of oscillations
(calculated as indicated in Section 3.2) in nuclear IκB indicates that there is a natural period (for TNF treatment longer than 3 hours) for cells that eventually survived. This period is about 3.5 ± 1 hours for models I, II, and IV, and slightly higher at 4 ± 1 hours for model III. In contrast, for cells that chose the apoptotic pathway, the period of oscillations can be much smaller. For models with link A, essentially no oscillations are observed in apoptotic cells (Fig. 4, top, left): this is because cell death is decided very early on, with link A immediately preventing any further NFκB activity. For model II (links C and L only), oscillations are observed in apoptotic cells with a natural period which is lower (about 3 ± 1 hours) than that for surviving cells (Fig. 4, top, middle). Results for model IV (not shown) are quite similar to those of model II. For model variant III, there is no difference between observed periods (Fig. 4, top, right). These results provide indications for discriminating between the four model variants and also suggest that the period of oscillations may play a role in the survival/apoptosis decision: lower periods/higher frequencies would lead towards the apoptotic pathway. A similar result has been reported, for instance, in the p53-Mdm2 system [16], where more peaks (higher frequency) were detected in response to higher (and more damaging) γ-irradiation doses. The p53-Mdm2 system also contains a negative feedback loop similar to the NFκB-IκB loop. To address the question of irreversibility of the apoptotic decision, we checked the capacity of the network to recover from overexpression of active caspase 3. Fixing node C3a at its maximal value for intervals of 10, 30, 60, and 360 minutes (that is setting discrete C3a(t) = 1, for t <10, 30, 60, or 360), we calculated the percentage of surviving cells. 
With model I there are no surviving cells after 1 hour of C3a overexpression but, with model II or IV, this percentage drops very fast from 45% to 30% survival at 1 hour overexpression and remains at this value for continued C3a overexpression (see Fig. 5). This suggests that a significant percentage of cells can still invert the apoptotic
[Figure 5: Survival rate (%) versus C3a overexpression interval (mins.), one curve for each model variant I, II, III, and IV.]
Fig. 5. Percentage of surviving cells under increasing intervals of C3a overexpression for the four model variants and TNF treatment for 16 hours.
decision, while for the largest part (70% of all cells) the apoptotic pathway is chosen early on, within an hour of TNF stimulation. Not surprisingly, examination of the relative values of the parameters ai shows that two thirds of cells that were able to recover from the apoptotic pathway had degradation rates for C3a higher than those for NFκB or IκB. Based on our study of regulation of apoptosis and the NFκB pathway, it seems clear that the links A and L play quite important roles, and at least one of these should definitely be included for faithful modeling of apoptosis via TNF receptors. This eliminates model III. Both links contribute to the same physiological function: downregulation of NFκB transcriptional activity. However, link A (direct inhibition of NFκB by C3a) achieves this objective in a much faster way than link L (“indirect” inhibition of NFκB by C3a, through complex IKK). The essential difference between models I and II is thus the length of the pathway representing inhibition of NFκB by C3a. The shorter path (model I, with link A) leads to much higher apoptosis rates than the longer path (models II or IV, with link L). The shorter path also renders recovery from the apoptosis pathway practically impossible, with apoptosis rates higher than 95% after only half an hour with C3a overexpression (Fig. 5). The longer path allows a higher recovery rate from the apoptotic pathway, although the probability of apoptosis does not increase above 70%, even after 6 hours of C3a overexpression. Recent experimental evidence [10] points to the existence of a link L, that is, caspases are responsible for cleavage or degradation of (parts of) complex IKK. To further discriminate between a short or long pathway for the influence of caspases on the NFκB pathway, the results shown in Fig. 4 suggest the following experiment. First, measure the period of oscillations during TNF stimulation and then monitor cells for some time after TNF removal. 
Next, compare the frequency of oscillations in cells that survive and in cells that eventually go through the apoptotic program. If the frequency of oscillations is similar for both groups of cells, or slightly higher in apoptotic cells, then model II (longer pathway) provides a better description of the system. If oscillations stopped after a short time interval (as compared to TNF duration) in apoptotic cells, then model I (shorter pathway) should be chosen.
5 Conclusion

The present study illustrates the usefulness of Boolean and piecewise linear models in the analysis of large complex networks. The qualitative dynamics that emerges from the network structure was studied, leading to predictions on the response to increasing duration of stimulation, response to overexpression of a given protein, or indication of which links/interactions play crucial roles in the regulation of apoptosis. Some quantitative aspects were also analyzed, such as the probabilities of survival or apoptosis and the frequency/period of oscillations, and were shown to be in remarkable agreement with experimental
data. Many other questions can be examined in this hybrid framework: for instance, extending the set of parameters (degradation and synthesis rates, threshold concentrations) and varying the relative strengths of anti- and pro-apoptotic links will lead to more refined models, capturing a wider range of kinetic variability. Although writing the logical rules requires some simplifications of the biological processes, discrete and hybrid models retain the essential qualitative properties of the network. The effect of the network structure on the qualitative dynamics of the system can be easily studied, even when kinetic details are not well known. This class of models can thus be a powerful method to generate predictions and test new hypotheses for complex biological networks.

Acknowledgments

The authors thank Peter Scheurich and Monica Schliemann for their many interesting and fruitful discussions.
References

1. R. Albert and H.G. Othmer. The topology of the regulatory interactions predicts the expression pattern of the drosophila segment polarity genes. J. Theor. Biol., 223:1–18, 2003.
2. G. Bernot, J.-P. Comet, A. Richard, and J. Guespin. Application of formal methods to biological regulatory networks: extending Thomas’ asynchronous logical approach with temporal logic. J. Theor. Biol., 229:339–347, 2004.
3. R. Casey, H. de Jong, and J.L. Gouzé. Piecewise-linear models of genetic regulatory networks: equilibria and their stability. J. Math. Biol., 52:27–56, 2006.
4. M. Chaves, T. Eissing, and F. Allgöwer. Bistable biological systems: a characterization through local compact input-to-state stability. IEEE Trans. Automat. Control, 53:87–100, 2008.
5. M. Chaves, E.D. Sontag, and R. Albert. Methods of robustness analysis for boolean models of gene control networks. IEE Proc. Syst. Biol., 153:154–167, 2006.
6. N.N. Danial and S.J. Korsmeyer. Cell death: critical control points. Cell, 116:205–216, 2004.
7. H. de Jong, J.L. Gouzé, C. Hernandez, M. Page, T. Sari, and J. Geiselmann. Qualitative simulation of genetic regulatory networks using piecewise linear models. Bull. Math. Biol., 66:301–340, 2004.
8. T. Eissing, H. Conzelmann, E.D. Gilles, F. Allgöwer, E. Bullinger, and P. Scheurich. Bistability analysis of a caspase activation model for receptor-induced apoptosis. J. Biol. Chem., 279:36892–36897, 2004.
9. A. Fauré, A. Naldi, C. Chaouiya, and D. Thieffry. Dynamical analysis of a generic boolean model for the control of the mammalian cell cycle. Bioinformatics, 22(14):e124–e131, 2006.
10. C. Frelin, V. Imbert, V. Bottero, N. Gonthier, A.K. Samraj, K. Schulze-Osthoff, P. Auberger, G. Courtois, and J.F. Peyron. Inhibition of the NF-κB survival pathway via caspase-dependent cleavage of the IKK complex scaffold protein and NFκB essential modulator NEMO. Cell Death Differ., 15:152–160, 2008.
11. T. Gedeon. Attractors in continuous-time switching networks. Communications on Pure and Applied Analysis, 2:187–209, 2003.
12. L. Glass. Classification of biological networks by their qualitative dynamics. J. Theor. Biol., 54:85–107, 1975.
13. L. Glass and S.A. Kauffman. The logical analysis of continuous, nonlinear biochemical control networks. J. Theor. Biol., 39:103–129, 1973.
14. A. Hoffmann, A. Levchenko, M.L. Scott, and D. Baltimore. The IκB-NFκB signaling module: temporal control and selective gene activation. Science, 298:1241–1245, 2002.
15. A.E.C. Ihekwaba, D. Broomhead, R. Grimley, N. Benson, and D.B. Kell. Sensitivity analysis of parameters controlling oscillatory signalling in the NF-κB pathway: the roles of IKK and IκBα. IEE Syst. Biol., 1:93–103, 2004.
16. G. Lahav, N. Rosenfeld, A. Sigal, N. Geva-Zatorsky, A.J. Levine, M. Elowitz, and U. Alon. Dynamics of the p53-Mdm2 feedback loop in individual cells. Nat. Genetics, 36:147–150, 2004.
17. T. Lipniacki, P. Paszek, A.R. Brasier, B. Luxon, and M. Kimmel. Mathematical model of NFκB regulatory module. J. Theor. Biol., 228:195–215, 2004.
18. E.R. McDonald and W.S. El-Deiry. Suppression of caspase-8 and -10-associated RING proteins results in sensitization to death ligands and inhibition of tumor cell growth. Proc. Natl. Acad. Sci. USA, 101:6170–6175, 2004.
19. D.E. Nelson, A.E.C. Ihekwaba, M. Elliott, J.R. Johnson, C.A. Gibney, B.E. Foreman, G. Nelson, V. See, C.A. Horton, D.G. Spiller, S.W. Edwards, H.P. McDowell, J.F. Unitt, E. Sullivan, R. Grimley, N. Benson, D. Broomhead, D.B. Kell, and M.R.H. White. Oscillations in NF-κB signaling control the dynamics of gene expression. Science, 306:704–708, 2004.
20. E.L. O’Dea, D. Barken, R.Q. Peralta, K.T. Tran, S.L. Werner, J.D. Kearns, A. Levchenko, and A. Hoffmann. A homeostatic model of IκB metabolism to control constitutive NFκB activity. Mol. Syst. Biol., 3:111, 2007.
21. N.D. Perkins. Integrating cell-signalling pathways with NF-κB and IKK function. Nat. Rev. Mol. Cell Biol., 8:49–62, 2007.
22. M. Rehm, H. Düßmann, R.U. Jänicke, J.M. Tavaré, D. Kögel, and J.H.M. Prehn. Single-cell fluorescence resonance energy transfer analysis demonstrates that caspase activation during apoptosis is a rapid process. J. Biol. Chem., 277:24506–24514, 2002.
23. M. Schliemann. Modelling and experimental validation of TNFα induced pro- and antiapoptotic signalling. Master’s thesis, University of Stuttgart, Germany, 2006.
24. M. Schliemann, T. Eissing, P. Scheurich, and E. Bullinger. Mathematical modelling of TNF-α induced apoptotic and anti-apoptotic signalling pathways in mammalian cells based on dynamic and quantitative experiments. In Proc. 2nd Int. Conf. Foundations Systems Biology in Engineering (FOSBE), Stuttgart, Germany, pages 213–218, 2007.
25. R. Thomas. Boolean formalization of genetic control circuits. J. Theor. Biol., 42:563–585, 1973.
26. R. Thomas, D. Thieffry, and M. Kaufman. Dynamical behaviour of biological regulatory networks - I. Biological role of feedback loops and practical use of the concept of the loop-characteristic state. Bull. Math. Biol., 57:247–276, 1995.
Network-Based Models in Molecular Biology
Andreas Beyer
Biotechnology Center, Technische Universität Dresden, 01062 Dresden, Germany
[email protected]
1 Introduction
Biological systems are characterized by a large number of diverse interactions. Interaction maps have been used to abstract those interactions at all biological scales, ranging from food webs at the ecosystem level down to protein interaction networks at the molecular scale. Organisms consist of thousands of cells with hundreds of different types. Cells in turn contain millions of molecules comprising thousands of different chemical species. Our genome contains about 23,000 protein-coding genes [32], and the estimated number of chemically different proteins (considering splice variants and posttranslational modifications) is at least an order of magnitude larger. It is difficult to estimate the true number of different proteins, because there are no reliable methods yet for predicting splice variants. For example, the NCBI database (www.ncbi.nlm.nih.gov) currently lists about 440,000 protein entries, many of which may, however, be redundant. In addition, our cells contain many other molecules with catalytic or regulatory functions, such as ribosomal RNA, tRNA, and small interfering RNA (siRNA). Further, the cells contain thousands of different lipid species and other small molecules serving as structural components of the cell or as substrates for the biochemical reactions executed by the metabolic program. Hence, our body coordinates the activity and reactions of hundreds of thousands if not millions of different chemical species [3].
Even a single cell is a prototypical example of a complex system [27]. Although biological systems follow all basic physical and chemical principles, they cannot be modeled adequately using standard methods from those two disciplines. Typical physical models describe a system as either a small number of different entities (e.g. mechanics) or a large number of very similar or even identical elements (e.g. thermodynamics).
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_3, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
Likewise, chemical reaction systems can only be described appropriately if the number of reacting species is small. However, the behavior and fate of organisms cannot
be described appropriately without considering the fact that they consist of a large number of very different interacting elements. Networks or interaction graphs are one way to formalize the complex interactions of heterogeneous entities within and between cells. In their most simple form those interaction graphs only indicate the possibility of an interaction between two genes or proteins (Fig. 1, top). At the other end of the scale are detailed models based on ordinary or partial differential equations (Fig. 1, bottom). Those latter models are used for small, well-studied subsystems and they provide significantly more detailed insight into the dynamics than simple interaction maps. Many formal methods have evolved that cover the intermediate space between these two extremes [2, 11, 23, 30, 38, 49, 66, 72, 76]. The best models utilize the available data in an optimal way to provide as much insight into the biological system as possible. So, what is the distinct advantage of network-based methods in comparison to other approaches? As opposed to many alternative methods (such as ordinary or partial differential equations), network models establish a description of the system including all the different entities and their properties, even if the detailed knowledge about the properties and behavior of individual molecules is still very sparse. Network models are a popular way to formalize all the available knowledge about cellular systems in a consistent framework. For example, protein and gene interaction maps have been created for many important model species and for humans, covering thousands of genes and tens of thousands of interactions [45, 59, 60, 67]. Although these interaction maps are still incomplete and even though they only represent a static picture of possible interactions, they constitute the most complete picture of a cell that we have today (i.e. covering the largest number of genes or biomolecules). 
It is this ability to integrate and formalize knowledge and data about very diverse entities of the system (each gene is different from all other genes) in a consistent way that distinguishes this concept from basically any other existing modeling approach [7]. One should be aware though that “network-based approaches” actually cover a very broad and heterogeneous group of methods (Fig. 1), ranging from the above-mentioned static interaction graphs to systems of coupled differential equations (used e.g. for modeling metabolic networks). During the previous decades network-based modeling approaches have been used for describing cellular regulation and cellular metabolism. Networks have helped to structure and formalize existing knowledge, to summarize and integrate large measurements, and to predict system behavior. Important applications include a better understanding of diseases with the ultimate goal of developing new therapies. Even conceptually simple static protein interaction maps have been used to gain significant insights into how certain genes are associated with specific diseases. The study of Lage and co-workers [42] presents an excellent example: many genetic loci (i.e. regions on the genome) are statistically associated with the occurrence of certain diseases. However, usually this information is insufficient to mechanistically explain the relationship between the genetic locus and the disease. First of all, many genes may be
[Figure 1: schematic of five model types, arranged from top to bottom by increasing level of detail and decreasing coverage. Static interaction maps: show potential of interaction. Conditional interactions: need condition-specific protein abundance or protein activation data. Causal (directed) interactions: can predict who affects whom, need regulatory information. Logical networks: consider type of effect (repressive/activating) and potentially also Boolean rules (such as “need A AND B to activate C”). Quantitative models: kinetic rate constants (k1, k2, k3) are known or derived from the data, can predict dynamics of system response.]
Fig. 1. A hierarchy of network models. Depending on the available data and the research question, different network modeling approaches are chosen. The figure shows model types with increasing levels of detail and increasing data demand (top – down). Less detailed models tend to cover a larger number of genes/biomolecules. Note that only the most detailed category of models “quantitative models” allows for true simulation of network dynamics, e.g. using ordinary differential equations. Causal networks and logical networks allow at most for the simulation of the sequence of events or the order in which proteins/genes are activated, but quantifying the speed of processes is impossible with these simplified approaches.
located in the respective region of the genome, and it is mostly unknown which of the genes is causal for the disease. Second, even if the causal gene is known, the molecular mechanisms linking the gene to the disease are usually elusive. Lage et al. addressed these problems by mapping all genes located in disease-related loci onto a protein-protein interaction network. They hypothesized that truly causal genes would cluster together in common protein complexes of the network. Indeed, the authors found protein complexes significantly enriched with candidate genes. Often, these complexes also had a molecular relationship with disease phenotypes. Hence, the investigators not only identified potentially causal genes, but also identified protein complexes that could aid in understanding the molecular mechanisms by which mutations alter disease susceptibility. This example demonstrates that even comparatively simple networks can yield new insight into diseases.
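The notion of a protein complex being "significantly enriched" with candidate genes can be made concrete with a hypergeometric tail probability. The following is a minimal sketch of that general idea, not the scoring scheme used by Lage et al.; the function name and all numbers are hypothetical and chosen only for illustration.

```python
from math import comb


def enrichment_p_value(n_genes, n_candidates, complex_size, n_hits):
    """Probability of seeing at least n_hits candidate genes in a complex.

    Assumes the complex_size members of the complex are drawn at random,
    without replacement, from n_genes genes of which n_candidates are
    candidates from disease-associated loci (hypergeometric null model).
    """
    max_hits = min(complex_size, n_candidates)
    total = comb(n_genes, complex_size)
    tail = sum(
        comb(n_candidates, k) * comb(n_genes - n_candidates, complex_size - k)
        for k in range(n_hits, max_hits + 1)
    )
    return tail / total


# Hypothetical example: 6000 genes, 60 candidates, a complex of 8 proteins
# containing 3 candidates.
p = enrichment_p_value(n_genes=6000, n_candidates=60, complex_size=8, n_hits=3)
print(f"enrichment p-value: {p:.2e}")
```

A small p-value indicates that the complex contains more candidate genes than expected by chance under this simple null model; a real analysis would also need to correct for testing many complexes.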
2 Molecular Biological Networks
Currently, three types of molecular networks are receiving the most attention in the scientific literature: metabolic networks, protein interaction networks, and gene interaction networks. Other popular network types are transcriptional regulatory cascades, which are derived from genome-wide expression data (Appendix 1). The availability of large datasets for these most popular networks has enabled extensive theoretical and computational analysis of the networks. Importantly, this preference does not always reflect biological significance. As discussed below, posttranscriptional regulatory networks are probably as important as transcriptional networks. However, methods for measuring the relevant interactions (e.g. protein-RNA binding) on a large scale are either not yet available or not as established as other methods (Appendix 1).
Metabolic networks. These networks describe systems of biochemical reactions catalyzed by enzymes. Depending on the available data and the research question, different formalizations can be used [28, 63]: (i) enzymes can be the nodes (vertices) of the graph and two enzymes are connected if they catalyze subsequent steps in a reaction chain (i.e. the product of the first enzyme is a substrate for the second); (ii) metabolites can be nodes in the graph and metabolites are connected if they participate in the same reaction; (iii) stoichiometry can be considered if known, e.g. in metabolic flux analysis; and (iv) if kinetic parameters are known one can create systems of differential equations describing the dynamics of the system. Many other types of modeling schemes are used for depicting metabolic networks, including for example stochastic processes [62].
Protein-protein networks. Likewise, many different types of protein-protein interaction networks have been used, utilizing the data in different ways; e.g. static protein interaction maps summarize either measured or predicted protein interactions [59, 60, 67]. Here, the edges often have weights quantifying the probability of a true physical binding between the two proteins [15]. Other types of protein networks are regulatory networks, such as kinase-substrate cascades [55], protein complexes (often representing molecular machines such as the ribosome) [24, 40], or more detailed structural models of protein interactions [4].
Gene-gene networks. Gene-gene networks are not molecular networks, but in fact logical networks: they describe functional relationships between genes. For example, two genes may be linked if their products participate in the same process or pathway. Such functional networks have been created by integrating diverse evidence for common functions of genes [45, 59]. For example, the fact that two genes appear together in many species (common phylogenetic profile) indicates that the genes participate in a common process. Also, common expression patterns across a wide range of different conditions suggest similar functions of two genes. By integrating such evidence quantitatively using machine learning approaches it has been possible to create relatively
large maps of functionally related genes. Such a “functional network” has for example been used to better predict substrates of kinases [47]. This study nicely demonstrated the value of such data integration, since previous methods relying exclusively on kinase binding motifs suffered from a large number of false positives.
Geneticists define a genetic interaction based on the phenotypes observed when the genes are knocked out: if the knock-out of one gene “masks” the phenotype of the other knock-out, the two genes are said to be linked [7, 9]. A prototypical example is the synthetic lethal interaction. In this case the knock-out of any single gene has no or only very little effect on viability, whereas the double knock-out of both genes creates a lethal phenotype. Such a synthetic lethal phenotype can be explained by redundant functions of the two genes, e.g. in two independent pathways that can compensate for each other [35]. Hence, the fact that two genes create a synthetic lethal phenotype indicates that they participate in distinct pathways. Underlying the functional or genetic relationships are sequences of physical or biochemical interactions “connecting” the two genes. Thus, genetic interactions provide important functional information that can be used for inferring molecular pathways [7].
Other network types. Many other types of molecular interactions have systematically been studied using network approaches. For example, protein-DNA interactions are important for understanding transcriptional regulation, and they have been studied on a large scale for almost a decade [58, 77]. On the other hand, protein-RNA networks are substantially less researched, although the relevance of alternative splicing is immediately apparent and it is known that translation is also heavily regulated via RNA binding proteins [5]. Yet another prominent example is the use of logical networks for describing transcriptional regulatory cascades [19, 57]. These networks are similar to the above-mentioned transcription factor-DNA networks; however, such logical networks may not always explicitly model the molecular mechanism underlying the regulatory relationship.
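The first two metabolic network formalizations listed earlier in this section, (i) enzymes as nodes and (ii) metabolites as nodes, can be derived from the same reaction list. The sketch below illustrates this under simplifying assumptions: each enzyme catalyzes exactly one reaction, and the reaction table, enzyme names, and metabolite names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy pathway: enzyme -> (substrates, products)
reactions = {
    "E1": ({"A"}, {"B"}),
    "E2": ({"B"}, {"C"}),
    "E3": ({"C", "D"}, {"E"}),
}


def enzyme_graph(reactions):
    """Formalization (i): directed edge (e1, e2) if a product of e1
    is a substrate of e2, i.e. the enzymes catalyze subsequent steps."""
    edges = set()
    for e1, (_, products) in reactions.items():
        for e2, (substrates, _) in reactions.items():
            if e1 != e2 and products & substrates:
                edges.add((e1, e2))
    return edges


def metabolite_graph(reactions):
    """Formalization (ii): undirected edge between two metabolites
    that participate in the same reaction."""
    edges = set()
    for substrates, products in reactions.values():
        for a, b in combinations(sorted(substrates | products), 2):
            edges.add((a, b))
    return edges


print(sorted(enzyme_graph(reactions)))
print(sorted(metabolite_graph(reactions)))
```

Formalizations (iii) and (iv) would additionally require stoichiometric coefficients and kinetic parameters, which this toy table deliberately omits.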
3 Identifying Molecular Biological Networks
In recent years several new technologies have been developed for measuring all kinds of physical and genetic interactions on a large scale (Fig. 2, Appendix 1). For example, protein-protein interactions can be measured with yeast two-hybrid (Y2H) or tandem affinity purification coupled with mass-spectrometry-based protein identification (TAP-MS). Protein-DNA interactions can be measured with chromatin immunoprecipitation, where DNA microarrays (Appendix 1) can be used to identify the bound DNA fragments (ChIP-Chip). Likewise, techniques for the large-scale measurement of gene-gene interactions have been developed. These are just some examples to
[Figure 2: schematic of four interaction types and the methods used to infer them. Kinase-substrate interactions: protein chips, binding motifs + other binding evidence, eQTL + other binding evidence. Protein-protein interactions: yeast two-hybrid (Y2H), TAP-MS, known interacting protein domains. Transcription factor-DNA interactions: TF binding motifs, ChIP-Chip, ChIP-Seq, knock-out + expression change ('buffering'), overexpression + expression change. Transcriptional cascades (TF-TF interactions): all methods of the above section, time course analysis. Legend: protein, target gene, undirected interaction, directed interaction.]
Fig. 2. Inferring regulatory networks with high-throughput methods. The four types of interactions can be inferred using experimental high-throughput methods, computational methods, or combinations of the two. The methods listed are not comprehensive. Refer to Appendix 1 for details about the various methods. Advanced computational methods combine different evidences for each type of interaction. For example, TF motif information can be used in combination with ChIP-Chip experiments [6]. Likewise, time course data have been combined with TF binding motifs to infer regulatory cascades [57].
demonstrate the fact that today a wide range of interactions can be measured practically at a genomic scale. However, all of these methods are subject to considerable noise, and often results from different techniques only agree to a small extent [73]. Therefore, numerous bioinformatic approaches are under development for physical and genetic network quality assessment, integration, assembly, and annotation. Although all large-scale studies are subject to noise, the rationale for data integration is that observations of true interactions will reinforce or complement one another when combined across different studies and/or experimental techniques. For example, the independent observation of a protein-protein interaction by both Y2H and TAP-MS methods, or by two independent TAP-MS studies, renders this interaction more likely to be true [73].
Such evidence for an interaction can be further supported by including other types of data that are not necessarily measurements of direct physical contact. For example, if the two genes have correlated expression profiles or similar patterns of occurrence across several conditions, these findings lend further support to the raw interaction measurement [53, 69]. Beyer et al. [7] review methods for integrating genetic interactions with physical binding data to further support the various types of interactions. Modern methods for interaction data integration use machine learning approaches or other statistical means for combining heterogeneous types of data in a consistent manner. Importantly, the methods assign different weights to different input data, acknowledging that not all types of evidence are equally predictive. These methods rely heavily on a set of “gold-standard”, or highly accurate, interactions which are used to evaluate the predictive utility of different types of evidence [50, 73]. The result is a statistical measure quantifying the likelihood that any given pair of biomolecules interacts (e.g. two proteins or a protein-DNA pair) [6, 33, 59, 64, 70]. Strictly speaking, these quantitative confidence scores describe the probability or reproducibility of the interaction, not the interaction strength. Nonetheless, there is some evidence that stronger interactions should be more reproducible, leading to higher scores [20]. The scores resulting from such analyses can be used to filter for high-confidence interactions and thereby remove many of the false positive interactions contained in individual high-throughput measurements. Yet, the reverse problem of false negatives is equally pressing. Because most large-scale screens are conducted under only one or a few conditions, it is not possible to fully capture the space of all possible interactions.
Here again, data integration can help to mitigate false negatives, as interactions missing from one study can be detected using high-confidence interactions from another. Note that integrating more information can simultaneously reduce both the false-negative and false-positive rates: the more ways there are of detecting an interaction, the higher the chance that it is found by several of these methods, and the lower the chance that it is missed altogether or found by only one. The same notion applies to basically any other type of molecular or genetic interaction. For instance, another important problem is the identification of transcription factor (TF) target genes. Various approaches have been used to infer those interactions (Fig. 2, Appendix 1); however, no single method is perfect. Combining clues for TF-target interactions from different independent sources increases confidence and coverage [6]. For example, Lähdesmäki et al. [44] combined evidence from DNA binding motifs of TFs with other data such as nucleosome occupancy to infer a transcriptional regulatory network for the mouse. Likewise, Beyer et al. [6] combined experimental evidence from ChIP-Chip with TF motifs, phylogenetic information, expression data, and even physical protein binding data to infer a high-coverage and high-confidence transcriptional network for yeast.
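The weighting of evidence types against a gold standard, as described in this section, can be sketched with a simple log-likelihood-ratio scheme. This is only one possible illustration and not the method of any of the cited studies; it assumes the evidence types are independent (a naive Bayes assumption), scores only positive observations, and uses hypothetical protein names, dataset names, and a pseudocount chosen for illustration.

```python
from math import log


def pair(a, b):
    """Order-independent key for an undirected interaction."""
    return frozenset((a, b))


def evidence_weights(gold_pos, gold_neg, evidence, pseudocount=1.0):
    """Log-likelihood ratio of each evidence type, estimated on
    gold-standard positive and negative pairs."""
    weights = {}
    for name, pairs in evidence.items():
        tp = len(pairs & gold_pos) + pseudocount
        fp = len(pairs & gold_neg) + pseudocount
        p_pos = tp / (len(gold_pos) + 2 * pseudocount)
        p_neg = fp / (len(gold_neg) + 2 * pseudocount)
        weights[name] = log(p_pos / p_neg)
    return weights


def score_pair(candidate, evidence, weights):
    """Sum the weights of all evidence types that report the pair."""
    return sum(w for name, w in weights.items() if candidate in evidence[name])


# Hypothetical gold standard and two hypothetical screens
gold_pos = {pair("A", "B"), pair("A", "C"), pair("B", "C"), pair("D", "E")}
gold_neg = {pair("A", "X"), pair("B", "Y"), pair("C", "Z"), pair("D", "W")}
evidence = {
    "y2h": {pair("A", "B"), pair("A", "C"), pair("A", "X"), pair("F", "G")},
    "tap_ms": {pair("A", "B"), pair("B", "C"), pair("D", "E"),
               pair("F", "G"), pair("H", "I")},
}

weights = evidence_weights(gold_pos, gold_neg, evidence)
for candidate in (pair("F", "G"), pair("H", "I")):
    print(sorted(candidate), round(score_pair(candidate, evidence, weights), 3))
```

In this toy setting the pair reported by both screens receives a higher score than the pair reported by only one, which mirrors the argument that independent observations reinforce each other.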
4 Dynamics of Molecular Biological Networks
Many aspects of biological networks are time dependent or condition specific. Although all of the above-mentioned networks have been valuable in the past for better understanding biological processes, most of them ignore some important features of biological networks. Biological interactions are dynamic and condition specific. Only a small subset of all interactions (be it physical protein binding or logical genetic interactions) is constitutively active. Protein expression and activation depend on the cell type, cell state, environmental signals, and the history of the cell (previous states). Further, molecular interactions depend on the genomic sequence of the specific individual, since mutations may alter interactions. Hence, the presence of the interactors as well as their ability to interact is highly variable. Even genetic interactions can be condition specific: the same double knock-out may be viable under one condition but lethal under another [9].
The reasons why current models ignore these aspects are certainly manifold, yet the availability of data is one of the most important. For example, protein-protein interactions are usually measured for only a single condition, sometimes not even in the original organism (e.g. in the case of Y2H, see Appendix 1). It is thus difficult to consider condition specificity of interactions in mathematical models. One way to address this problem is to take mRNA expression data into account, which are now routinely measured genome-wide using DNA microarrays (Appendix 1). Using these data it is possible to predict the presence or absence of proteins under specific conditions. Since a physical interaction is only possible if both partners are present, this also allows for predicting conditional interactions [16]. However, this approach still does not consider protein localization and activation.
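The idea of deriving conditional interactions from a static map can be sketched as a simple filter: an interaction is retained for a condition only if both partners are considered present. The example below is a minimal illustration under the assumption that an mRNA level above a fixed threshold is an acceptable proxy for protein presence; the protein names, condition names, expression values, and threshold are hypothetical.

```python
def conditional_network(interactions, expression, threshold):
    """Keep an interaction only if both partners are expressed
    at or above the threshold in the given condition."""
    present = {gene for gene, level in expression.items() if level >= threshold}
    return {(a, b) for a, b in interactions if a in present and b in present}


# Hypothetical static interaction map
static_map = {("P1", "P2"), ("P2", "P3"), ("P3", "P4")}

# Hypothetical expression levels (arbitrary units) under two conditions
expression_by_condition = {
    "heat_shock": {"P1": 8.2, "P2": 7.5, "P3": 1.1, "P4": 6.0},
    "control": {"P1": 0.4, "P2": 7.9, "P3": 6.3, "P4": 5.8},
}

networks = {
    condition: conditional_network(static_map, expression, threshold=5.0)
    for condition, expression in expression_by_condition.items()
}
for condition, edges in networks.items():
    print(condition, sorted(edges))
```

The same static map thus yields different networks under different conditions, although the filter says nothing about whether the retained partners actually meet in the cell.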
Even if two proteins are co-expressed, they may not be localized to the same subcellular compartment, impeding an in vivo interaction.
The best-studied aspect of molecular network dynamics is transcriptional adaptation. The broad availability of DNA microarrays allows for measuring genome-wide transcriptional profiles relatively easily and at low cost. Today it is a routine technology for measuring mRNA concentrations, and thousands of studies have been conducted during the last decade measuring mRNA expression changes in prokaryotes, in many eukaryotic model species, and in human samples. The two main databases of publicly available microarray data are ArrayExpress (www.ebi.ac.uk/arrayexpress/) and the Gene Expression Omnibus (www.ncbi.nlm.nih.gov/geo/), currently listing 3874 and 8408 experiments, respectively (as of April 2008). mRNA changes have been measured in response to external stimuli, following differentiation, between different tissue types, in diseased versus healthy tissue, and for numerous other applications. The transcript signatures have been used to identify not just individual genes responding to the given signal, but also entire pathways or subnetworks activated or deactivated under specific conditions. DNA microarrays are the most widely used tool in molecular biology for measuring the cellular response in an unbiased way. These studies are termed “unbiased” or “systematic” because
there is no need for a priori assumptions about which genes/proteins are likely to respond during the experiment. Kiesel et al. [36] used microarray time course data to study the transcriptional change during osteoclastogenesis (i.e. during the transition from precursor cells to mature osteoclasts). The differentiation of precursor cells into mature osteoclasts involves dramatic changes of the transcriptional program, thereby affecting the topology of the interaction network at the protein level. The authors identified co-expression networks associated with early and late response to the differentiation stimulus. A co-expression network is a graph linking two genes if the two genes are similarly expressed either during the specific experiment or at a range of different conditions. For the Kiesel study it was necessary to create two distinct networks to fully capture the complexity of the dynamical changes. One network described the early, and the second one the late response during differentiation. Accordingly, the two networks contained different pathways that are known to be associated with osteoclastogenesis. These findings emphasize the importance of considering the dynamics of transcriptional changes—often one may lose important details when looking at only two time points (before and after treatment, before and after differentiation, etc.). Whereas analyzing microarray time course data can in itself reveal important insights into the dynamics of transcriptional networks, combining those data with other interaction data is significantly more powerful. Expression data can be combined with transcription factor binding data, adding the dimension of protein-DNA interactions [23]. Thereby it becomes possible to infer the molecular mechanisms by which transcriptional networks change their state. For example, Ramsey et al. 
[57] combined time course expression data with transcription factor binding data to assess the regulatory program responding to macrophage activation. Putative regulatory relationships were identified by employing a novel method for identifying time-lagged correlation between transcription factors and potential target genes. Those interactions were later corroborated by additionally taking the binding affinity of transcription factors in upstream regulatory regions into account. Subsequent experiments confirmed that this combined analysis of expression and binding data significantly improved the quality of the inferred regulatory network. In a similar approach, Ernst et al. [19] analyzed yeast TF-DNA binding data in combination with respective expression data under various different stress conditions. They identified bifurcation points in the time course expression data indicating regulatory events. Along with the TF binding data they were able to identify TFs that were likely regulators of those bifurcations, i.e. they were regulating a specific subset of the genes. Alternatively, time course data of expression changes can be combined with physical protein interaction networks in order to identify pathways or pathway components that are differentially expressed [11, 30]. Ideker et al. [31] combined physical interaction networks with expression data and devised a method based on simulated annealing for identifying relevant subnetworks
(modules) of the physical network. In this case, the network is not a coexpression network, but a network of proteins binding to other proteins or to DNA. Expression changes are mapped onto the physical network, i.e. they become the nodes’ attributes. The algorithm’s task is to identify the most significant subnetwork enriched for differentially expressed genes. The result will depend on the topology of the physical network and the strength (extent) of differential regulation of the individual genes. Several variants of this idea have been published since then [11, 13, 56, 61]. Most important however is the central idea: combining dynamic expression data with independently derived interaction networks significantly improves the statistical power of the analysis and provides much more insight into the underlying molecular mechanisms [7]. It is well established that protein concentrations are not only regulated at the transcriptional level, but also at the level of translation and protein turnover [5, 10]. These posttranscriptional processes affect the topology of interaction networks just as much as transcriptional changes. For example, Brockmann et al. [10] have shown that proteins responding early in signaling cascades and transcription factors in particular are subject to “translation on demand.” The coding mRNA of such proteins is constitutively expressed, but translation is blocked until the protein is actually needed. This allows for a much faster response compared to transcriptional regulation. Hence, the presence/absence of these network components is highly dependent on posttranscriptional processes, which are often neglected in studies assessing the dynamics of protein expression. It is a very important finding that the regulatory network components themselves are often not regulated at the transcriptional level and are therefore missed by studies only applying DNA microarrays for measuring transcriptional changes [8, 74]. 
The main bottleneck for studying posttranscriptional network changes more systematically is the experimental limitation of protein detection and quantification. Current state-of-the-art techniques employ mass spectrometry for identifying, characterizing, and quantifying proteins [1]. The current limits of this technology are high costs, relatively complicated protocols, data processing, a limited number of detectable proteins (in the range of a few hundred), and limited reproducibility [14, 48]. Recently, significant progress has been made because of much more sensitive instruments, improved protocols, and better data analysis tools [14, 43, 48]. This progress suggests that in the near future posttranscriptional network dynamics can also be studied at a level of detail and scope comparable to that of mRNA changes [14, 22].
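The co-expression networks discussed earlier in this section link two genes whose expression profiles are similar. A minimal sketch of that definition is given below. It assumes Pearson correlation as the similarity measure and a fixed cut-off, which is only one of several possible choices and is not claimed to be the procedure used in the studies cited above; the gene names, profiles, and threshold are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long profiles.
    Profiles must not be constant, otherwise the denominator is zero."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_network(profiles, min_r=0.9):
    """Link two genes if their profiles correlate with r >= min_r."""
    edges = {}
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= min_r:
            edges[(g1, g2)] = r
    return edges


# Hypothetical time course with four measurements per gene
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 1.0, 3.0],
}

edges = coexpression_network(profiles)
for (g1, g2), r in edges.items():
    print(g1, g2, round(r, 3))
```

Building one such network from early time points and another from late time points would give two distinct graphs, in the spirit of the early and late response networks described above.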
5 Dynamics on Molecular Biological Networks
The previous section focused on the dynamic adaptation of the network topology, e.g. the presence or absence of network components. Here, we will address state changes of the nodes themselves, i.e. alterations or activities of biomolecules in response to external or internal stimuli.
Prototypic examples for such networks are signaling networks, in which proteins transmit information (signals) from one to the other via protein state changes. Most intra-cellular signals are transmitted through covalent protein modifications, e.g. via phosphorylation of specific residues. Kinase cascades “send” signals from membrane receptors or from intra-cellular receptors to proteins and other molecules, which ultimately changes the phenotype of the cell or the entire organism. State changes of regulatory proteins such as G-proteins, kinases, transcription factors or histones can be detected with antibodies specific for the respective protein changes. The alternative method of measuring those changes via mass spectrometry holds greater promise for unbiased large-scale studies, because it does not require a specific antibody for every possible protein modification. So far, those mass spectrometry based techniques are subject to the same limitations as the protein concentration measurements discussed above [17]. Fortunately, the same improved technologies that are developed for protein quantification can also be used to study dynamic changes of protein modifications and localization [41, 54, 75]. However, existing studies trying to elucidate the dynamics of cell signaling largely had to rely on the tedious measurement of isolated proteins. Only recently have new methods been developed for systematically identifying targets of kinases either experimentally [55] or using computational predictions [47] (see also Appendix 1). Despite being important for determining the topology of regulatory networks, those techniques still do not provide data on the dynamics or kinetics of network changes on a larger scale. Therefore, kinetic modeling of signaling cascades or protein transport has been restricted to either well-studied pathways such as the cell cycle or relatively simple pathways such as the osmotic shock response of yeast [38, 39, 72]. 
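To indicate what kinetic modeling of a signaling cascade involves, the following sketch integrates a two-step phosphorylation cascade with forward Euler steps. It is a toy model and not taken from any of the cited studies: the rate laws, the rate constants `k1`, `k2`, `d1`, `d2`, and the step size are assumptions made purely for illustration. Here `x1` and `x2` denote the phosphorylated fractions of two hypothetical kinases, where the first is activated by an external signal and the second by the first.

```python
def simulate_cascade(signal, k1=1.0, k2=1.0, d1=0.5, d2=0.5, dt=0.01, t_end=20.0):
    """Forward-Euler integration of a toy two-kinase cascade.

    dx1/dt = k1 * signal * (1 - x1) - d1 * x1
    dx2/dt = k2 * x1     * (1 - x2) - d2 * x2
    All parameter values are hypothetical.
    """
    x1 = x2 = 0.0
    trajectory = [(0.0, x1, x2)]
    steps = int(round(t_end / dt))
    for i in range(1, steps + 1):
        dx1 = k1 * signal * (1.0 - x1) - d1 * x1
        dx2 = k2 * x1 * (1.0 - x2) - d2 * x2
        x1 += dt * dx1
        x2 += dt * dx2
        trajectory.append((i * dt, x1, x2))
    return trajectory


trajectory = simulate_cascade(signal=1.0)
t_final, x1_final, x2_final = trajectory[-1]
print(f"t={t_final:.1f}  x1={x1_final:.3f}  x2={x2_final:.3f}")
```

Even this small model needs four rate constants, which illustrates why kinetic approaches depend so strongly on the availability of measured parameters.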
Due to the lack of sufficient kinetic parameters, researchers have begun to develop new methods for simulating information processing in biological signaling networks [25]. Most of these approaches formalize the logical information processing rather than quantifying the dynamics. For example, information such as “Gene A is activated if gene B is deactivated” can be formalized in logical models such as Boolean networks [2, 21, 29]. Various alternative methods have been used to infer the logical relationships in regulatory networks, including Petri nets [26, 65], Bayesian networks [51], and factor graphs [76]. A new method for the inference of regulatory pathways that combines physical with functional data was recently introduced by Suthram et al. [71]. The authors used expression quantitative trait loci (eQTL) data in combination with a physical protein network to infer molecular regulatory pathways in yeast. eQTL are statistical relationships between positions in the genome (loci) and the expression of a target gene. A strong correlation indicates that the respective locus contains a regulator of the target gene. Suthram et al. simulated the flow of information between the locus and the target gene as an electric current. The physical protein network serves in this case as the wiring diagram on which the information “flows.” The strength of current on any
A. Beyer
edge (interaction) is indicative of the importance of that interaction for the regulation of the target gene. By applying their method to yeast eQTL data, the authors could infer several known and new regulatory relationships, and they were able to predict the directionality of information flow for hundreds of protein-protein interactions. The method developed by Suthram et al. enables the inference of causal relationships, but it gives no insight into the effect of the interactions. For example, the currents do not predict whether the regulator increases or represses the activity of the target. Workman et al. [74] went further in this respect. Using ChIP-Chip data and knock-out expression measurements under DNA-damaging conditions, they could infer a causal network for the DNA damage response. They used the method of factor graphs, which is a generalization of Bayesian networks. Factor graphs are minimal graph models explaining the observed (expression) data. Importantly, the method predicts whether any given interaction is activating or repressing. The downside is that this method requires significantly more comprehensive data than simpler approaches. These logical network models cannot fully capture the kinetics of signaling. However, at least some of them predict state changes in response to different inputs, provide insights into the sequence of events, and allow for analyzing the stability of the regulatory system and for finding “weak spots.” A weak spot is a gene in the signaling network whose knock-out would maximally alter the output. Such genes could be interesting drug targets, e.g. when looking for new targets in pathogens or when attacking tumor cells. They could also be causal for diseases, for example if they are mutated in patients carrying a certain heritable disease. Metabolic networks are another important application of dynamic network modeling.
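As a rough illustration of the logical models and the “weak spot” analysis discussed above, the following minimal sketch simulates a small Boolean network and scores each gene by how often its knock-out changes the steady-state output. The wiring, the gene names, and the scoring rule (counting the inputs for which the output changes) are assumptions made for this example; they are not taken from any of the cited studies.

```python
# Hypothetical toy signaling network (wiring assumed for illustration only).
# Each rule maps the current state to the next value of one gene.
RULES = {
    "A": lambda s: s["signal"],         # A follows the external signal
    "B": lambda s: s["A"],              # B is activated by A
    "C": lambda s: s["A"],              # C is activated by A (redundant branch)
    "out": lambda s: s["B"] or s["C"],  # output is on if B or C is on
}


def steady_state(signal, knockout=None, max_steps=20):
    """Synchronously update all genes until the state stops changing."""
    state = {gene: False for gene in RULES}
    state["signal"] = signal
    for _ in range(max_steps):
        # A knocked-out gene is forced to stay inactive.
        new = {g: (False if g == knockout else bool(rule(state)))
               for g, rule in RULES.items()}
        new["signal"] = signal
        if new == state:
            return state
        state = new
    raise RuntimeError("no fixed point reached (the network may oscillate)")


def knockout_impact(gene):
    """Count the inputs for which the knock-out changes the steady-state output."""
    return sum(
        steady_state(sig)["out"] != steady_state(sig, knockout=gene)["out"]
        for sig in (False, True)
    )


impact = {g: knockout_impact(g) for g in ("A", "B", "C")}
weak_spots = [g for g, n in impact.items() if n == max(impact.values())]
print(impact, weak_spots)
```

In this toy network B and C back each other up, so only the knock-out of A alters the output. Real analyses would use networks inferred from data and typically many more input conditions.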
Like regulatory networks, metabolic networks are highly dynamic, and fully capturing their kinetics would allow for developing new drugs and for optimizing yields in biochemical reactors. However, modelers face problems similar to those in regulatory networks: although the kinetic properties of enzymes have been measured for decades, we are still far from completely covering all relevant enzymes in any multicellular eukaryote [49]. In addition, enzymes may behave completely differently in vitro than in vivo, where pH, temperature, and many other important parameters may differ [68]. Hence, complete dynamic modeling using differential equations is possible only for a relatively small set of well-studied subsystems. Fortunately, methods have been developed that do not require kinetic constants for the analysis of metabolic networks. For instance, Petri nets have also been used for analyzing metabolic networks [12]. One of the most mature methods is flux balance analysis (FBA) [34, 52]. FBA simulates a metabolic network assuming steady state (input balancing output), which greatly reduces the data requirements. For example, elementary modes represent a minimal set of reactions necessary to produce a given product at steady state [66] (Fig. 3). These elementary modes can be deduced from the stoichiometric matrix alone. Hence, one only has to know the possible reactions in the system along with their substrates and products to
Fig. 3. Elementary flux modes. (a) A simple metabolic network consuming substrate S1 and producing products P1 and P2 via the reactions R1 through R5. (b–d) Elementary modes (highlighted) are minimum sets of reactions creating the products P1 (b, c) or P2 (d). Note that removing R1 affects all elementary modes, i.e. synthesis of all products. Removal of R4 disables synthesis of P1 only. {R1}, {R4}, and {R2, R3} are minimal cut sets with respect to P1.
predict all possible chemical fluxes that do not lead to the accumulation of products under the steady state assumption. Depending on the metabolic network there might be many elementary modes leading from certain substrate(s) to specific product(s). Such a network would be redundant. However, even if there are many elementary modes, all of them might require one specific enzyme, thus this enzyme would be essential for synthesizing the respective product (e.g. the enzyme catalyzing reaction R1 in Fig. 3). The concept of elementary modes has been used to make a range of important predictions: for example Stelling et al. [66] were able to predict lethal genes in Escherichia coli by searching for enzymes whose knock-out would remove all possible elementary modes leading to essential products. Klamt [37] extended this idea to the concept of minimal cut sets: whereas Stelling and co-authors were looking for single genes whose knock-out would be detrimental to the organisms, the cut sets define the minimum set of genes required to turn off the synthesis of a given product (Fig. 3). This analysis could be instrumental for developing combinatorial antibiotics targeting different enzymes in bacteria such that their synergistic interaction would be lethal to the pathogens. In summary, the “reduced modeling approaches” that are currently popular do not strictly simulate the dynamics on (or of) the networks, but they simulate dynamic networks in a way that still leads to important conclusions. In most cases it would be impossible to derive those insights without
these computational tools, given the complexity of regulatory or metabolic networks. Also, these less quantitative approaches have a higher chance of truly reaching a genome-wide scale and thus actually achieving a system-wide perspective. One of the main driving forces for progress in computational methods is the development of new experimental techniques. New types of data open up new possibilities for network simulation. For example, relatively cheap deep sequencing methods will aid the identification of all transcripts (protein coding and non-coding) in time courses, which will require a new dimension in transcriptional network modeling. Those future models will be able to incorporate the regulatory effect of micro-RNAs and any other type of non-coding RNA. Likewise, these technologies will generate detailed data on alternative splicing, since every transcript will be known in its entire sequence. Thus, new computational methods capturing the regulation of alternative splicing at a genomic scale will emerge. Another example is the above-mentioned progress in proteomics. It is hoped that it will lead to the creation of the first comprehensive maps of posttranscriptional regulation.
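The relationship between elementary modes and minimal cut sets described earlier in this section can be sketched in a few lines. The example below assumes two hypothetical elementary modes that produce P1, chosen only so that the result agrees with the cut sets named in the caption of Fig. 3; the actual modes in the figure may differ. The brute-force enumeration is an illustrative choice and is not the algorithm of Klamt and Gilles [37].

```python
from itertools import combinations

# Hypothetical elementary modes producing P1, written as sets of reactions.
# They are assumed for illustration; the exact modes of Fig. 3 may differ.
MODES_P1 = [
    frozenset({"R1", "R2", "R4"}),
    frozenset({"R1", "R3", "R4"}),
]


def minimal_cut_sets(modes):
    """Return all minimal reaction sets that intersect every mode."""
    reactions = sorted(set().union(*modes))
    cut_sets = []
    # Enumerate candidates by increasing size so that supersets of an
    # already accepted cut set can be skipped as non-minimal.
    for size in range(1, len(reactions) + 1):
        for candidate in combinations(reactions, size):
            cand = frozenset(candidate)
            if any(found <= cand for found in cut_sets):
                continue  # contains a smaller cut set, hence not minimal
            if all(cand & mode for mode in modes):
                cut_sets.append(cand)  # disables every mode
    return cut_sets


cuts = minimal_cut_sets(MODES_P1)
# Cut sets of size one correspond to reactions essential for the product.
essential = [sorted(c)[0] for c in cuts if len(c) == 1]
print([sorted(c) for c in cuts], essential)
```

The enumeration grows exponentially with the number of reactions, so it is only practical for very small networks such as this one.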
Acknowledgments
I wish to thank Angela Simeone, Jacob Michaelson, and Antigoni Elefsinioti for critically reading the manuscript. This work has been funded by the Klaus Tschira Foundation.
Appendix 1: Large-Scale Detection of Interaction Networks
Microarrays are used to measure the expression of all genes of an organism in a single experiment. By measuring time course samples or samples from different tissues, conditions, etc., it is possible to reveal transcriptional changes in response to stimuli or under disease conditions. Algorithms have been devised to infer regulatory dependences between genes (transcriptional regulatory networks) from those data. Protein chips are made to measure protein-protein interactions on a large scale. Here, selected proteins are fixed to a glass surface and interactions with unknown proteins in a sample can be measured, e.g. via fluorescence. If the “probe proteins” are antibodies for proteins of interest, the chips can be used to quantify protein amounts in the sample. Ptacek et al. [55] detected kinase substrates by fixing 4400 proteins onto a protein array. They incubated arrays with kinases (two arrays per kinase) and subsequently identified proteins that were phosphorylated.
Kinase binding motifs plus other binding evidence. Prediction of kinase substrates from the protein sequence alone generates many false positives because short kinase binding motifs are not specific enough. However, provided that a putative substrate contains a binding motif, actual binding can be corroborated if there is additional independent evidence that the two proteins bind directly or that they are at least involved in the same biological process. eQTL plus other binding evidence. Here again, weak evidence from expression quantitative trait loci (eQTL) is combined with other independent evidence for physical binding of the two proteins. Yeast two-hybrid (Y2H). Two potentially interacting proteins are genetically fused to the two separated parts of a transcription factor, one to its DNA-binding domain and the other to its transcriptional activation domain. If the two proteins bind each other in the nucleus of the yeast cell, a functional transcription factor is reconstituted, which activates a reporter gene (e.g. GFP). Genes from other species (e.g. mouse or human) have to be transferred into yeast for this method. Interactions between proteins that cannot interact in the yeast nucleus but would bind in their native environment cannot be detected with this method. TAP-MS. “Bait proteins” are purified from a sample using tandem affinity purification (TAP). Other proteins associated with the bait (“prey proteins”) are identified with subsequent mass spectrometry (MS). Although TAP-MS measures the native in vivo situation, it cannot distinguish whether binding of a prey to the bait is direct or indirect (i.e. mediated via another intermediate prey protein). Also, the method cannot detect transiently binding proteins (unstable binding). Physical interactions measured with Y2H or TAP-MS are subject to artifacts due to gene tagging, which can affect the function of the tagged protein [18, 46]. Known interacting domains. This computational method searches for known protein-protein interaction domains in the sequences of candidate genes.
The domains may be taken from crystal structures of interacting proteins. If the same two domains are found in other proteins with high sequence similarity, this indicates potential physical interactions. This method is applicable genome-wide. However, it is limited by the available crystal structures and it does not take the protein 3D structure into account. TF binding motifs are short DNA sequences that are targets of a specific transcription factor. They can be inferred, e.g. from a set of known binding regions/promoter regions of known target genes. The presence of a binding motif in a promoter of a potential target gene is usually not sufficient for clearly identifying the gene as a target. Therefore, binding motif information is usually supplemented with additional evidence, e.g. whether the motif is conserved upstream of orthologous genes in other species or whether the putative target is co-expressed with another known target gene of the same transcription factor.
ChIP-Chip. Transcription factors (TFs) are cross-linked (“fixed”) to DNA, and after fragmenting the DNA the TF-DNA complexes are purified via immunoprecipitation (i.e. with antibodies). Cross-linking is then reversed and the DNA fragments are identified by hybridizing them to a DNA microarray. In this way it is possible to identify all binding sites of a TF for a given condition genome-wide in a single experiment. A related method (ChIP-Seq) replaces the final step of DNA identification with high-throughput deep sequencing. Both methods only measure binding under the specific condition tested, i.e. DNA targets bound under different conditions are missed. Knock-out and expression change. Here one knocks out a regulator gene of interest and measures the expression difference between the wild type and the knock-out. Genes that are differentially expressed are likely to be targets of the regulator. The method can only detect target genes if the transcription factor is active under the conditions tested, and it cannot distinguish direct from indirect targets. Also, the knock-out itself will trigger a range of indirect responses that are not directly related to the function of the TF, because the cell tries to compensate for the knock-out. Over-expression and expression change. This approach is complementary to the preceding method, in that the transcription factor of interest is constitutively expressed at high levels. One then compares the expression of genes under normal and high expression of the TF. Genes that change their expression are likely to be targets. This method suffers from several of the above-mentioned problems as well. For example, it does not distinguish direct from indirect targets. However, it does not require that the TF is normally active under the condition tested. Time course analysis can be used to infer transcriptional regulatory cascades. The underlying hypothesis is that the activity of a transcription factor can to some degree be predicted from its mRNA level.
One measures the expression levels of all genes with DNA microarrays for several time points. Using appropriate statistical methods one can then infer likely target genes from the fact that they are expressed after a certain TF is upregulated (or downregulated, depending on whether the TF is an activator or repressor).
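The time course idea above can be illustrated with a small sketch that correlates a TF's expression profile with a candidate target's profile shifted by a few time points. The text does not name a specific statistical method, so the lagged Pearson correlation used here is only one simple assumption, and the expression values are invented for the example.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def lagged_correlation(tf, gene, lag):
    """Correlate the TF level at time t with the gene level at time t + lag."""
    if lag == 0:
        return pearson(tf, gene)
    return pearson(tf[:-lag], gene[lag:])


# Hypothetical expression time courses (arbitrary units), invented for
# illustration: the target repeats the TF profile one time point later.
tf_profile = [0, 1, 4, 9, 4, 1, 0, 0]
target_profile = [0, 0, 1, 4, 9, 4, 1, 0]

# The lag with the highest correlation suggests how long after the TF
# the candidate target responds.
best_lag = max(range(0, 3),
               key=lambda k: lagged_correlation(tf_profile, target_profile, k))
print(best_lag, lagged_correlation(tf_profile, target_profile, best_lag))
```

For a repressor one would look for a strongly negative lagged correlation instead. With real data, significance would have to be assessed across all genes and corrected for multiple testing.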
Appendix 2: Some Important Definitions
Alternative splicing, splice variant. Alternative splicing is a mechanism used by cells to generate different protein sequences from the same gene. All genes are transcribed into RNA, and usually only a part of the transcript is used for synthesizing proteins. Some parts of the transcript (called introns) are “spliced out” (i.e. removed) before translation. Many genes splice different parts of their transcript depending on cell type or external conditions. This process of conditional splicing is called alternative splicing, and the resulting transcripts are called splice variants.
Binding motif. A short DNA or RNA sequence that is recognized by a binding protein. For example transcription factors recognize the specific site on the DNA to which they should bind based on a specific sequence of nucleotides. DNA hybridization. The binding of single stranded DNA to its complement (according to Watson–Crick base pairing). DNA hybridization is utilized to specifically bind sample DNA to probe DNA with a known sequence (e.g. on DNA microarrays). DNA microarrays are used to measure RNA concentrations as well as to identify DNA sequences in biological samples, and are also used for SNP detection, for re-sequencing, and for a range of other applications. Such arrays consist of glass slides to which short DNA sequences (“probes”) are fixed. The DNA probes are either synthesized oligonucleotides or amplified DNA fragments. DNA concentrations in a given sample are measured by hybridizing the labeled DNA from the sample to the complementary probes on the array. More DNA hybridizing to a given probe will be indicated by a stronger signal. Hence, the signal intensity is a measure of the abundance of the respective DNA sequence in the sample. RNA first has to be transformed into cDNA using reverse transcriptases. Genetic fusion. A variant of genetic manipulation; adding genes or gene fragments to another gene. Genotype. An individual’s specific genome sequence. Many genes have different variants (alleles). The pattern of alleles that someone inherited is the individual’s genotype. Kinase (protein kinase). A signaling protein adding phosphate groups onto substrate proteins. The substrate is thereby activated (it acquired a higher energy level), which may for example alter the structure of the substrate. The substrate itself can also be a kinase, in which case the substrate in turn can activate its substrates. Such chains of kinases are called kinase cascades. Macrophage activation. 
Macrophages are immune cells responsible for killing pathogens such as bacteria. The immune response of macrophages is triggered by pathogen-specific molecules such as bacterial lipids. Upon such signals, macrophages undergo a range of morphological and other changes to prepare for attacking pathogens and for “warning” the immune system. Mass spectrometry (MS). Used for identifying chemical molecules and for measuring their concentrations in a sample. MS separates fragments of molecules and measures the mass-to-charge ratio in different types of detectors. By computationally assembling the information about individual fragments it is possible to deduce the nature of the input molecules in the sample. Small molecules are measured directly without prior fragmentation. Osteoclasts are bone cells responsible for the resorption (destruction) of bone. Their counterparts are osteoblasts, which generate new bone material. Bone is constantly degraded and newly formed by these two types of cells. An excess of osteoclasts leads to osteoporosis (brittle bones). Phenotype. The expression of a genotype. Individuals may have different physiological or molecular characteristics based on their genotype. For example, eye and hair color are phenotypes determined by the respective gene variants (genotype). A phenotype is generally determined by both environmental and genetic factors. Biologists often refer to “the phenotype of a gene” as the physiological change in response to knocking out the respective gene. Phylogenetic profile. Describes the occurrence pattern of a gene in different species. Two genes that occur in the same set of species are said to have similar phylogenetic profiles. Simulated annealing is an optimization technique for finding global maxima (or minima) in complex fitness landscapes with many local optima. Simulated annealing starts searching for an optimum from some (random) parameter configuration. After a number of iterations the current parameters are randomized to some extent in order to overcome boundaries between local maxima/minima (“heating” of parameters). This procedure is repeated until convergence, while reducing the level of parameter randomization each time (“annealing”). Stoichiometry describes the type and number of molecules consumed and the type and number of molecules produced by a chemical reaction. Substrate. A molecule chemically changed/consumed by an (enzymatic) reaction. For example, proteins that are phosphorylated by kinases are called substrates of the kinases. Transcription. The process of copying a gene’s sequence into RNA. Polymerases are protein machines “reading” the sequence of a gene and producing the complementary RNA. Transcription factor (TF). A regulatory protein controlling the transcription of genes.
TFs bind directly or indirectly (bridged via other proteins) to DNA and change the 3D structure of DNA, attract or block transcriptional machinery at the site, or alter other proteins in the vicinity (e.g. histones) to manipulate the transcription rate of the target gene. Translation. The process of synthesizing a protein from the respective messenger RNA (mRNA). Ribosomes are molecular machines (consisting of RNA and proteins) reading an mRNA sequence and translating it into the corresponding amino acid sequence.
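To make the definition of stoichiometry given in this appendix more concrete, and to connect it to the steady-state assumption used by flux balance analysis in the main text, the following sketch writes a stoichiometric matrix for a small hypothetical network and checks whether a flux vector leaves the internal metabolites balanced. The reactions, coefficients, and flux values are assumptions for illustration only.

```python
# Hypothetical reactions (assumed for illustration):
#   R1: (uptake) -> M1
#   R2: M1 -> M2
#   R3: M1 -> M2          (alternative route)
#   R4: 2 M2 -> (export)
METABOLITES = ["M1", "M2"]
REACTIONS = ["R1", "R2", "R3", "R4"]

# Stoichiometric matrix: one row per internal metabolite, one column per
# reaction; negative entries are consumed, positive entries are produced.
S = [
    [1, -1, -1, 0],   # M1
    [0, 1, 1, -2],    # M2
]


def net_production(S, fluxes):
    """Net production rate of each metabolite, i.e. the product S * v."""
    return [sum(coef * v for coef, v in zip(row, fluxes)) for row in S]


def is_steady_state(S, fluxes, tol=1e-9):
    """True if no internal metabolite accumulates or is depleted."""
    return all(abs(rate) <= tol for rate in net_production(S, fluxes))


balanced = [2.0, 1.0, 1.0, 1.0]     # uptake is split evenly over R2 and R3
unbalanced = [2.0, 1.0, 1.0, 2.0]   # R4 drains M2 faster than it is made

print(net_production(S, balanced), is_steady_state(S, balanced))
print(net_production(S, unbalanced), is_steady_state(S, unbalanced))
```

The coefficient 2 in R4 shows how the number of molecules consumed enters the balance: one unit of flux through R4 removes two units of M2.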
References
1. Aebersold R, Mann M. (2003) Mass spectrometry-based proteomics. Nature. 422(6928):198–207. 2. Albert R, Othmer HG. (2003) The topology of the regulatory interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster. J Theor Biol. 223(1):1–18.
3. Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P. (2002) Molecular biology of the cell. Garland Science, New York. 4. Aloy et al. (2004) Structure-based assembly of protein complexes in yeast. Science. 303(5666):2026–9. 5. Beyer A, Hollunder J, Nasheuer HP, Wilhelm T. (2004) Post-transcriptional expression regulation in the yeast Saccharomyces cerevisiae on a genomic scale. Mol Cell Proteomics. 3(11 ):1083–92. 6. Beyer A et al. (2006) Integrated assessment and prediction of transcription factor binding. PLoS Comput Biol. 2:e70. 7. Beyer A, Bandyopadhyay S, Ideker T. (2007) Integrating physical and genetic maps: from genomes to interaction networks. Nature Rev Genet. 8:699–710. 8. Birrell GW, Brown JA, Wu HI, Giaever G, Chu AM, Davis RW, Brown JM. (2002) Transcriptional response of Saccharomyces cerevisiae to DNA-damaging agents does not identify the genes that protect against these agents. Proc Natl Acad Sci USA. 99(13 ):8778–83. 9. Boone C, Bussey H, Andrews BJ. (2007) Exploring genetic interactions and networks with yeast. Nat Rev Genet. 8(6):437–49. 10. Brockmann R, Beyer A, Heinisch JJ, Wilhelm T. (2007) Posttranscriptional expression regulation: what determines translation rates? PLoS Comput Biol. 3(3 ):e57. 11. Calvano SE et al. (2005) A network-based analysis of systemic inflammation in humans. Nature. 437(7061 ):1032–7. 12. Chen M, Hofestaedt R. (2003) Quantitative Petri net model of gene regulated metabolic networks in the cell. In Silico Biol . 3:347–365. 13. Chuang HY, Lee E, Liu YT, Lee D, Ideker T. (2007) Network-based classification of breast cancer metastasis. Mol Syst Biol. 3:140. 14. Collins SR et al. (2007) Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics. 6(3):439–50. 15. Cox J, Mann M. (2007) Is proteomics the new genomics? Cell . 130(3 ):395–8. 16. de Lichtenberg U, Jensen LJ, Brunak S, Bork P. (2005) Dynamic complex formation during the yeast cell cycle. Science. 
307(5710):724–7. 17. Domon B, Aebersold R. (2006) Mass spectrometry and protein analysis. Science. 312(5771):212–7. 18. Downard KM. (2006) Ions of the interactome: the role of MS in the study of protein interactions in proteomics and structural biology. Proteomics. 6:5374–5384. 19. Ernst J, Vainas O, Harbison CT, Simon I, Bar-Joseph Z. (2007) Reconstructing dynamic regulatory maps. Mol Syst Biol. 3:74. 20. Estojak J, Brent R, Golemis EA. (1995) Correlation of two-hybrid affinity data with in vitro measurements. Mol Cell Biol. 15:5820–5829. 21. Fauré A, Naldi A, Chaouiya C, Thieffry D. (2006) Dynamical analysis of a generic Boolean model for the control of the mammalian cell cycle. Bioinformatics. 22(14):e124–31. 22. Foss EJ, Radulovic D, Shaffer SA, Ruderfer DM, Bedalov A, Goodlett DR, Kruglyak L. (2007) Genetic basis of proteome variation in yeast. Nat Genet. 39(11):1369–75. 23. Gao F, Foat BC, Bussemaker HJ. (2004) Defining transcriptional networks through integrative modeling of mRNA expression and transcription factor binding data. BMC Bioinformatics. 5:31.
24. Gavin et al. (2006) Proteome survey reveals modularity of the yeast cell machinery. Nature. 440(7084):631–6. 25. Gilbert D, Fuss H, Gu X, Orton R, Robinson S, Vyshemirsky V, Kurth MJ, Downes CS, Dubitzky W. (2006) Computational methodologies for modelling, analysis and simulation of signalling networks. Brief Bioinform. 7(4 ):339–53. 26. Goss PJ, Peccoud J. (1998) Quantitative modeling of stochastic systems in molecular biology by using stochastic Petri nets. Proc Natl Acad Sci USA. 95(12 ):6750–5. 27. Han JD. (2008) Understanding biological functions through molecular networks. Cell Res. 18(2):224–37. 28. Heinrich R, Schuster S. (1998) The modelling of metabolic systems. Structure, control and optimality. Biosystems. 47(1–2):61–77. 29. Helikar T, Konvalina J, Heidel J, Rogers JA. (2008) Emergent decision-making in biological signal transduction networks. Proc Natl Acad Sci USA. 105(6 ):1913–8. 30. Ideker T et al. (2001) Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science. 292:929–934. 31. Ideker T, Ozier O, Schwikowski B, Siegel AF. (2002) Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics. 18 Suppl 1 :S233–40. 32. International Human Genome Sequencing Consortium (2004). Finishing the euchromatic sequence of the human genome. Nature. 431:931−945. 33. Jansen RC. (2003) Studying complex biological systems using multifactorial perturbation. Nature Rev Genet. 4:145–151. 34. Joyce AR, Palsson BO. (2008) Predicting gene essentiality using genome-scale in silico models. Methods Mol Biol. 416:433–57. 35. Kelley R, Ideker T. (2005) Systematic interpretation of genetic interactions using protein networks. Nature Biotechnol. 23:561–566. 36. Kiesel J, Miller C, Abu-Amer Y, Aurora R. (2007) Systems level analysis of osteoclastogenesis reveals intrinsic and extrinsic regulatory interactions. Dev Dyn. 236(8 ):2181–97. 37. Klamt S, Gilles ED. 
(2004) Minimal cut sets in biochemical reaction networks. Bioinformatics. 20(2):226–34. 38. Klipp E, Nordlander B, Kruger R, Gennemark P, Hohmann S. (2005) Integrative model of the response of yeast to osmotic shock. Nature Biotechnol. 23:975–982. 39. Klipp E. (2007) Modelling dynamic processes in yeast. Yeast. 24(11):943–59. 40. Krogan et al. (2006) Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 440(7084):637–43. 41. Krüger M, Kratchmarova I, Blagoev B, Tseng YH, Kahn CR, Mann M. (2008) Dissection of the insulin signaling pathway via quantitative phosphoproteomics. Proc Natl Acad Sci USA. 105(7):2451–6. 42. Lage K et al. (2007) A human phenome-interactome network of protein complexes implicated in genetic disorders. Nature Biotechnol. 25:309–316. 43. Lange V et al. (2008) Targeted quantitative analysis of Streptococcus pyogenes virulence factors by multiple reaction monitoring. Mol Cell Proteomics. [Epub ahead of print] 44. Lähdesmäki H, Rust AG, Shmulevich I. (2008) Probabilistic inference of transcription factor binding from multiple data sources. PLoS ONE. 3(3):e1820. 45. Lee I, Date SV, Adai AT, Marcotte EM. (2004) A probabilistic functional network of yeast genes. Science. 306:1555–1558. 46. Legrain P, Wojcik J, Gauthier JM. (2001) Protein–protein interaction maps: a lead towards cellular functions. Trends Genet. 17:346–352.
47. Linding R et al. (2007) Systematic discovery of in vivo phosphorylation networks. Cell. 129(7):1415–26. 48. Malmström J, Lee H, Aebersold R. (2007) Advances in proteomic workflows for systems biology. Curr Opin Biotechnol. 18(4):378–84. 49. Mo ML, Jamshidi N, Palsson BØ. (2007) A genome-scale, constraint-based approach to systems biology of human metabolism. Mol Biosyst. 3(9):598–603. 50. Myers CL, Barrett DR, Hibbs MA, Huttenhower C, Troyanskaya OG. (2006) Finding function: evaluation methods for functional genomic data. BMC Genomics. 7:187. 51. Needham CJ, Bradford JR, Bulpitt AJ, Westhead DR. (2007) A primer on learning in Bayesian networks for computational biology. PLoS Comput Biol. 3(8):e129. 52. Papin JA, Stelling J, Price ND, Klamt S, Schuster S, Palsson BØ. (2004) Comparison of network-based pathway analysis methods. Trends Biotechnol. 22(8):400–5. 53. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA. 96:4285–4288. 54. Pflieger D, Jünger MA, Müller M, Rinner O, Lee H, Gehrig PM, Gstaiger M, Aebersold R. (2008) Quantitative proteomic analysis of protein complexes: concurrent identification of interactors and their state of phosphorylation. Mol Cell Proteomics. 7(2):326–46. 55. Ptacek J et al. (2005) Global analysis of protein phosphorylation in yeast. Nature. 438:679–684. 56. Rajagopalan D, Agarwal P. (2005) Inferring pathways from gene lists using a literature-derived network of biological relationships. Bioinformatics. 21(6):788–93. 57. Ramsey SA et al. (2008) Uncovering a macrophage transcriptional program by integrating evidence from motif scanning and expression dynamics. PLoS Comput Biol. 4(3):e1000021. 58. Ren B et al. (2000) Genome-wide location and function of DNA binding proteins. Science. 290(5500):2306–9. 59. Rhodes DR et al.
(2005) Probabilistic model of the human protein–protein interaction network. Nature Biotechnol. 23:951–959. 60. Rual JF et al. (2005) Towards a proteome-scale map of the human protein–protein interaction network. Nature. 437:1173–1178. 61. Samoilov M, Plyasunov S, Arkin AP. (2005) Stochastic amplification and signaling in enzymatic futile cycles through noise-induced bistability with oscillations. Proc Natl Acad Sci USA. 102(7):2310–5. 62. Schilling CH, Letscher D, Palsson BØ. (2000) Theory for the systemic definition of metabolic pathways and their use in interpreting metabolic function from a pathway-oriented perspective. J Theor Biol. 203(3):229–48. 63. Scott MS, Perkins T, Bunnell S, Pepin F, Thomas DY, Hallett M. (2005) Identifying regulatory subnetworks for a set of genes. Mol Cell Proteomics. 4(5 ): 683–92. 64. Sprinzak E, Altuvia Y, Margalit H. (2006) Characterization and prediction of protein–protein interactions within and between complexes. Proc Natl Acad Sci USA. 103:14718–14723. 65. Steggles LJ, Banks R, Shaw O, Wipat A. (2007) Qualitatively modelling and analysing genetic regulatory networks: a Petri net approach. Bioinformatics. 23(3 ):336–43.
66. Stelling J, Klamt S, Bettenbrock K, Schuster S, Gilles ED. (2002) Metabolic network structure determines key aspects of functionality and regulation. Nature. 420(6912 ):190–3. 67. Stelzl U et al. (2005) A human protein–protein interaction network: a resource for annotating the proteome. Cell. 122:957–968. 68. Stryer L. (1995) Biochemistry. Freeman & Co, New York. 69. Stuart JM, Segal E, Koller D, Kim SK. (2003) A gene coexpression network for global discovery of conserved genetic modules. Science. 302:249–255. 70. Suthram S, Shlomi T, Ruppin E, Sharan R, Ideker T. (2006) A direct comparison of protein interaction confidence assignment schemes. BMC Bioinformatics. 7:360. 71. Suthram S, Beyer A, Karp RM, Eldar Y, Ideker T. (2008) eQED: an efficient method for interpreting eQTL associations using protein networks. Molec Syst Biol. 4:162. 72. Tyson JJ. (1991) Modeling the cell division cycle: cdc2 and cyclin interactions. Proc Natl Acad Sci USA. 88(16 ):7328–32. 73. von Mering C et al. (2002) Comparative assessment of largescale data sets of protein–protein interactions. Nature. 417:399–403. 74. Workman CT et al. (2006) A systems approach to mapping DNA damage response pathways. Science. 312:1054–1059. 75. Yan W, Hwang D, Aebersold R. (2008) Quantitative proteomic analysis to profile dynamic changes in the spatial distribution of cellular proteins. Methods Mol Biol . 432:389–401. 76. Yeang CH, Mak HC, McCuine S, Workman C, Jaakkola T, Ideker T. (2005) Validation and refinement of gene-regulatory pathways on a network of physical interactions. Genome Biol. 6(7 ):R62. 77. Zhu J, Zhang MQ. (1999) SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics. 15:607–611.
Ecological Networks: Structure, Interaction Strength, and Stability

Samit Bhattacharyya and Somdatta Sinha

Mathematical Modelling and Computational Biology Group, Centre for Cellular and Molecular Biology, CSIR, Hyderabad 500007, India;
[email protected],
[email protected]
1 Introduction

Food webs, the assemblages of species linked through various interconnections, are the fundamental building blocks of any ecosystem and provide a central concept in ecology. The study of a food web allows abstractions of the complexity and interconnectedness of natural communities that transcend the specific details of the underlying systems. For example, Fig. 1 shows a typical food web, where the species are connected through their feeding relationships. The top predator, Heliaster (starfish), feeds on many gastropods such as Hexaplex, Morula, and Cantharus, some of which prey on each other [52]. Interactions between species in a food web can be of many types, such as predation, competition, mutualism, commensalism, and ammensalism (see Section 1.1, Fig. 2). Mathematical ecologists have used dynamic models to explore how the size and connectivity of food webs determine the stability and long-term persistence of a community under fluctuations in density [41], invasion of new species [11], or nonlinear population dynamics [24].

There are two different approaches for modeling a food web: the static model and the dynamic model. Static models describe the food web by a graph whose vertices are species and whose links are the interactions/relations between them. These models are primarily concerned with the robustness of the food web structure against modifications (i.e., removal and addition) of vertices and links. Based on the hierarchical position of the species in a food web, there exist two types of static models: the cascade model and the niche model. The dynamic models, on the other hand, account for the stability of food webs, and are represented by coupled ordinary differential equations, where different functional forms describe the type of interactions between the species.
However, neither the static nor the dynamic models are useful for making long-term predictions of the changes in structural organization of food webs due to extinction or invasion of new species in the community. Other models of food webs, the assembly model and the evolutionary model, mainly focus on this aspect. One basic assumption for construction of these models is that, instead of being random as in earlier models, the existence and strengths of links between species here are based on their web history [17] (see Section 1.2).

Fig. 1. Feeding relationship between a predator Heliaster and marine snails in the northern Gulf of California (adapted from [52]).

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_4, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
1.1 Some Basic Definitions

A community in ecology comprises all species populations interacting in an area. An example of a community is a coral reef, where numerous populations of fishes, crustaceans, and corals coexist and interact.

A food web in an ecosystem is an assemblage of various organisms that are interconnected with each other through their different life history processes, such as feeding and shelter.

A trophic level in a food web consists of all the species that prey on the same species and are also preyed upon by the same species.

Ecological interactions are the relationships between two species in an ecosystem. Based on either effects or mechanisms, these relationships can be categorized into many different classes of interactions, as described below and shown in Fig. 2. Here, the arrow signifies the flow of resources in the network, and the sign represents the effect of one species on the other. These interactions vary greatly with respect to their duration and strength. In many cases, the interactions of two species may have different impacts under different conditions. This is particularly true in, but not limited to, cases where species have multiple and drastically different life stages.

Fig. 2. Types of ecological interactions: (i) predation, (ii) competition, (iii) mutualism, (iv) commensalism, (v) ammensalism.

Predation is a biological interaction in which one species feeds on another. Most of the interactions in a food web are predatory. Figure 2(i) shows the network for this interaction, where species 2 preys on species 1. This interaction enhances the fitness of predators (indicated by “+”), but reduces the fitness of the prey species (shown by “−”). Example: Predation of carnivores on herbivores is common in a grazing food web.

Parasitism is similar to predation by mechanism, as it enhances the fitness of the parasite, but impairs the host. Example: The mite Varroa jacobsoni is a parasite of the honeybee. Although not severely, it disrupts the honeybee’s colony formation.

Competition between two species occurs when they share a limited resource and each tends to prevent the other from accessing it. This reduces the fitness of one or both species, as is shown by “−” in Fig. 2(ii).

In mutualism or symbiosis, two species provide resources or services to each other. This enhances the fitness of both species (shown by “+” in Fig. 2(iii)). Example: Pollination is an example of mutualism, which enhances the fitness of both the plant and the pollinator.

Commensalism is an interaction where one species receives a benefit from another species. This enhances the fitness of one species without any effect on fitness of the other species (shown by “0” in Fig. 2(iv)). In the marine aquatic ecosystem, the clown fish and sea anemone share such a relationship.

In ammensalism, one species impedes or restricts the success of the other without being affected positively or negatively by its presence (shown by “−” and “0” respectively in Fig. 2(v)). Example: The black walnut tree (Juglans nigra) secretes a chemical (juglone) from its roots that harms or kills some species of neighboring plants.
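The sign conventions of Fig. 2 can be written down as a small lookup table. The sketch below is only illustrative: the function name `classify_interaction`, the encoding of effects as +1, −1, and 0, and the label returned for a (0, 0) pair are assumptions made here, not notation from the chapter. It also simplifies competition to the case where both species are harmed.

```python
# Hypothetical encoding of the Fig. 2 sign conventions.
# Each key is a sorted pair of fitness effects, one per species:
# +1 = fitness enhanced, -1 = fitness reduced, 0 = no effect.
INTERACTION_TYPES = {
    (-1, 1): "predation or parasitism",
    (-1, -1): "competition",
    (1, 1): "mutualism",
    (0, 1): "commensalism",
    (-1, 0): "ammensalism",
}


def classify_interaction(effect_on_a, effect_on_b):
    """Return the interaction class for a pair of fitness effects."""
    # Sorting makes the lookup independent of which species is listed first.
    key = tuple(sorted((effect_on_a, effect_on_b)))
    # A (0, 0) pair is not covered by Fig. 2; the label is an assumption.
    return INTERACTION_TYPES.get(key, "no direct interaction")
```

For instance, `classify_interaction(1, -1)` and `classify_interaction(-1, 1)` both map to the predation/parasitism class, since the two differ in mechanism rather than in sign.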
1.2 Ecological Network Models

Ecological networks (i.e., food chains and food webs) have been mathematically described by different types of models, which focus on different aspects of their structure. We give simple descriptions of a few as follows:
Cascade model. These models are built on the hierarchical positions of the species in the food chain and use the role of top-down forces (predator to prey) in shaping the ecological communities. In cascade models, high-ranked species prey on lower-ranked species in the food chain, and the probability of consumption depends on the number of connections with these other species. It has been shown that the mean number of species in the cascade model grows linearly with the number of the species present in the food chain [13].

Niche model. Like the cascade models, the niche models are also structural in nature. But unlike the cascade model, the rigid hierarchical effect is relaxed in the niche model by allowing looping and cannibalism. The “niche model” takes its name from the premise that each trophic species (i.e., a group of species that share the same predators and prey) belongs to a specific niche based on what it eats and, in turn, on what eats it. Recent work [74] has shown that many diverse real food webs can be better described by the niche model than by the cascade model, in particular with respect to features such as cycles and species similarities.

Assembly model and evolutionary model. In this class of models, as the names suggest, the species composition and structure of the food web can change with time due to the ongoing introduction of new species and by species extinctions. As a consequence, these models focus mainly on the features of the food web after a sufficiently long time, when the size of the food web and its other properties stabilize. The primary concern of these models relates to the stability of the food web due to the introduction of new species into the system by immigration or speciation (assembly) or by altering one or a few individuals of existing species in the system (evolutionary), though the time scales for evolutionary models are longer than those for assembly ones [60]. A few species from a “pool” are added one by one to the existing network. The new species may add stability to the web, become extinct immediately after introduction because of its poor adaptation, or cause one or more other species to become extinct due to competition for resources. Studies of these models usually measure various properties of the underlying network and compare them with data from real food webs to determine the robustness in a given time period [9].

Keystone species. This concept provides an important representation for understanding the organizing forces in ecological communities. A keystone species is one whose presence contributes critically to the diversity of life in the food web, and whose removal has a strong adverse impact on the community structure, even though the species may occupy only a small part of the ecosystem in terms of biomass or productivity. Keystone species play an important role in conservation biology [53, 55].

Community matrix. In an ecological network, which describes interactions between multiple species at different trophic levels, the community matrix represents the per capita direct effect of one species on other species in the community [35]. In simple words, the community matrix is a spreadsheet, where the rows and columns are species and other elements of their environment, and the entries quantify the interactions among them. The matrix can be used to derive the stability and sensitivity to change in ecological networks. Alternative but related definitions are common in ecological literature [41, 78].

Interaction strength is a critical descriptor of the magnitude of effect of one species on the other in an ecological network. There are several interpretations with the common theme of being a measure of the effect of one species on another or on all others in the network [38, 40]. In this work, we have used “interaction strength” to represent the per capita direct effects of a species on another (e.g., the per capita rate of consumption of the prey species by the predator, “α” in the first equation of Model I in Section 2, which can also be modulated by prey preference).
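The ranked, loop-free structure that distinguishes the cascade model from the niche model can be illustrated with a short sketch. It is a minimal reading of the verbal description above, not the formulation of [13]: the function names and the single free parameter `link_prob` (the chance that a higher-ranked species eats a given lower-ranked one) are hypothetical choices made for illustration.

```python
import random


def cascade_web(n_species, link_prob, seed=None):
    """Return a matrix web where web[i][j] == 1 means species j eats species i.

    Species are indexed by rank, from 0 (lowest) to n_species - 1 (highest).
    Only a higher-ranked species may eat a lower-ranked one, so the matrix is
    strictly upper triangular: no feeding loops and no cannibalism.
    """
    rng = random.Random(seed)
    web = [[0] * n_species for _ in range(n_species)]
    for prey in range(n_species):
        for predator in range(prey + 1, n_species):
            # link_prob is an assumed, constant probability for every pair.
            if rng.random() < link_prob:
                web[prey][predator] = 1
    return web


def count_links(web):
    """Total number of feeding links in the web."""
    return sum(sum(row) for row in web)
```

Relaxing the `predator > prey` restriction in the inner loop is what would admit the looping and cannibalism that the niche model allows.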
The complexity of a large ecological network or a whole food web can be described by many indicators: number of trophic levels, number of species and their connections, density of interactions, etc. Each trophic module embedded in the entire food web may also have a similar complexity in its structure and interactions. There are two mutually nonexclusive aspects that underscore a long-standing problem in food web theory: Do “the details of population dynamics in one or a few modules change the structure of the whole system over time” [64]? The first aspect is structural, and involves the distributive pattern of trophic structures or motifs which modulates the real food web system [3, 10, 46]. The other aspect is related to the study of specific trophic modules in order to infer properties of the entire ecosystem dynamics [4, 23]. The first one essentially refers to the robustness of the system against removal of nodes in the functional integrity of the ecological network. In particular, the study involves how removing or replacing native species with exotic invaders can alter the food web structure, which could be measured by the number of secondary extinctions, and by the breakup of the network into smaller components. For instance, conceptualizing food webs as energy-flow networks, Allesina and Bodini [2] have indicated that it is the “dominating nodes” (removal of which would make a number of species disappear from the network) that act as an energy bottleneck for resources flowing to the other members of the food web. However, it has also been shown that removal of species with low connectivity sometimes may have a large effect on the persistence of a community, which reinforces the notion of keystone species in food web theory [18, 31]. 
The second aspect that is particularly important for understanding real ecological network dynamics is related to the distribution of interaction strengths, which determines how strongly the links in the trophic cascade are coupled. For example in a consumer-resource interaction, the strength is a function of the metabolic rate, ingestion rate, and preference of the consumer species for the resource species [43]. The theoretical finding that interaction strength is one of the key properties promoting persistence in nonlinear models of food webs has attracted considerable attention [19, 43, 63, 67]. It is indeed challenging to ecologists to quantify the strengths of species
interactions, identify the patterns that occur between species, and determine the mechanisms that cause interactions to vary across space and time in natural ecosystems. These topics are also important for several other reasons. First, ecosystems, on the whole, provide hospitable conditions for life. Understanding how the provision of this hospitability is affected by extinctions and alien introductions is important, because without knowledge of the strengths of species interactions, our predictions on the consequences of environmental impacts become indeterminate for any ecosystem with a reasonable degree of complexity [6]. Second, development of a general understanding of how ecological communities are structured can benefit from an analysis of the general properties of multispecies population dynamics models [41, 58]. Knowledge of the pattern of interaction strengths in natural ecosystems can help to guide the development of appropriate multispecies models. A number of recent studies consider the influence of interaction strength on the stability of food webs for real communities [6, 7, 49, 51] and in food web models with a large number of species [12, 25, 28, 61]. Martinez et al. [39] studied the effect of the variation of the interaction strength of omnivore links on the stability of large food webs. General patterns of food web structure also appear to be an emergent property of dynamical constraints on species interactions [5, 17]. However, in a more realistic study, the food web dynamics has been considered by introducing time-varying interactions among the adaptive foragers [20, 21, 30, 56, 71, 72]. Adaptation modifies predator-prey interaction strengths, and thus, acts on the topology of the network by eventually removing certain links with zero strength. As a consequence, the complexity of the food web is molded by how adaptive predators design their foraging strategy. 
Recently, it has also been shown that adaptation of foraging behavior and stability of food webs can lead to a rise in basal species richness and link density, which in turn increases the emergent complexity of the food web [21, 30]. However, there still remains a significant gap in predictions on food web stability and the role of the distribution of interaction strengths in the food web. This key problem in ecology arises from the basic difference between empiricists and theoreticians in understanding the concepts of stability and interaction strength [6, 32]. Many theoretical investigations have focused on the definition of stability in model communities [11, 25, 36, 41, 42, 44]. Multiple definitions of stability have been proposed, some of them designed to have closer ties to empirical data [14, 15, 33, 37]. This is important because most of the earlier analytical studies evaluate linear stability of a community at equilibrium in the face of small perturbations, whereas empirical investigations focus on community changes (with no assumed equilibrium) in response to comparatively large perturbations, such as species removals, species additions, and physical disturbances [75]. The measurement of interaction strengths, on the other hand, actually centers around different concepts in the system, and the only consistent aspect is the use of the words “interaction strength”. Laska and Wootton [32] have clearly mentioned the difference in understanding the
interaction strength for the theoretician and empiricist communities. However, even within theoretical or empirical investigations, there exists a diversity of indices that measure the link weight, or interaction strength. This diversity creates a critical problem in predicting the relative effects of strong or weak interactions in a community. For example, strong consumption intensity by a predator, or large energy flow from prey to predator, is not necessarily a good predictor of large dynamical effects on prey abundance [6, 50, 54, 62], nor is it necessarily a good predictor of strong interaction coefficients in the community matrix [16]. Similarly, strong interaction coefficients in the community matrix, which are defined by small perturbations, may not necessarily predict strong effects of a large perturbation such as species addition or removal [1, 79]. This gap between experiment and theory in understanding the concepts has remained a long-standing problem in ecological research.

In spite of all the recent and ongoing research on networks in general, and food webs in particular, a new synthesis of the relationship between structure and dynamics of complex networks for ecological systems remains elusive. Ecological networks are unlike other networks (including other biological networks) [47], as the process of various interactions (predation, competition, mutualism, etc.) between organisms at different trophic levels ties up the entire organization in a unique way, which is an inherent and distinct feature of the ecological systems. Here, we propose, with simple examples, to explain how the dynamic interplay between the ecological network structure and interaction strengths regulates the structure of the network and the dynamics of the species in it. This may provide some interesting insights about the importance of interaction strengths in ecological network research.
2 Food Web Structure, Interaction Strength, and Stability

Food web models are extensions of bioenergetic consumer-resource models, which by definition focus exclusively on trophic interactions. In a recent study, it was shown that predation is the most important process determining the community structure and dynamics [51]. There are a few important related factors that regulate the strength of this process, such as metabolic efficiencies, handling times, foraging strategies, and frequencies of encounters [21]. In the following section, we consider two simple ecological networks of the prey-predator interaction and discuss how the emergence of new functional components that alter interaction strengths can regulate the stability in food web dynamics.

2.1 The Models

Model I. Figure 3(i) shows a prey-predator system where the predator species is commensal on the prey species. In this simple model, in the absence of the predator (Y), the prey species (X) follows a density-dependent logistic growth with r as its intrinsic growth rate, and K as the carrying capacity of the environment. However, in the presence of the predator, the growth of the prey is reduced due to predation of Y on X. This interaction follows a hyperbolic function with γ denoting the half saturation coefficient of predation and α deciding the strength of interaction, i.e., the per capita consumption rate (see [43] for an actual measure of the interaction strength). In the absence of prey, the predator species dies out exponentially at a rate d. On predation, the rate at which this food adds to the growth of the predator population is given by the conversion rate β. The rates of change of the prey (dX/dt) and predator (dY/dt) populations with time are governed by the following equations:

\[
\begin{aligned}
\frac{dX}{dt} &= rX\left(1 - \frac{X}{K}\right) - \frac{\alpha X Y}{\gamma + X}, \\
\frac{dY}{dt} &= Y\left(-d + \frac{\beta \alpha X}{\gamma + X}\right).
\end{aligned}
\]

For this study, the parameter values are taken as r = 0.5, K = 5, d = 3, γ = 0.8, and β = 0.7. The main parameter, α, which indicates the interaction strength of predation, is varied from 5.5 to 6.5 to describe the change in dynamics in this model network.

Fig. 3. Food web configurations of (i) Model I and (ii) Model II.

Model II. Here, we consider the addition of a new structural component in the food web of Fig. 3(i), which is shown in Fig. 3(ii). It is assumed that the introduction of a virus species (V), which can infect the prey species (X) in Model I, divides the prey population into two compartments: Susceptible (S) and Infected (I) with S + I = X, where the susceptible class follows the same growth laws as X in Model I. The flux from the susceptible to the infected compartment depends on the strength with which the virus attacks the prey, following a simple law of mass action. However, this compartmentalization of the prey species also affects the predation strength; although the predator (Y) can consume both the susceptible and the infected prey, it may have a higher preference towards the uninfected prey (S), and a concomitant lower one towards the infected prey (I). Thus, the earlier strong interaction α (in Model I) is now modified into two interactions with variable strength: one strong interaction augmented with one weak interaction (Fig. 3(ii)). The rate of change of the virus population (dV/dt) depends on the infected prey population (I), as every dead and lysed infected prey releases virus into the environment, initiating a new infection cycle. The temporal evolution of this ecological network is given as follows:

\[
\begin{aligned}
\frac{dS}{dt} &= rS\left(1 - \frac{S}{K}\right) - \lambda S V - \frac{\xi \alpha S Y}{\gamma + (S + I)}, \\
\frac{dI}{dt} &= \lambda S V - \frac{(1 - \xi)\alpha I Y}{\gamma + (S + I)} - \eta I, \\
\frac{dY}{dt} &= Y\left(-d + \frac{\beta \alpha \left(\xi S + (1 - \xi) I\right)}{\gamma + (S + I)}\right), \\
\frac{dV}{dt} &= -\mu V + \kappa \eta I,
\end{aligned}
\]

where V is the virus density and λ defines the strength of viral infection on the susceptible class of prey, and represents the “effective per host contact rate with viruses.” Parameter η denotes the death rate of the infected prey and μ is the death rate for the virus. κ denotes the “virus replication parameter,” i.e., the number of virus particles produced per infected individual due to lysis. The other parameter that regulates the interaction strength of predation, indicating the prey preference of the predator, is ξ (ξ ∈ (0, 1)). The exact choice of these parameter values is arbitrary, but they are kept within the same range as in [8, 22].
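To connect Model I with the notion of local stability used in Section 1.2, the following sketch computes its coexistence equilibrium and evaluates the Jacobian there, a matrix closely related to the community matrix. The closed forms are obtained here by setting both right-hand sides of Model I to zero; the function names are hypothetical, and the defaults are the parameter values quoted for Model I. This is a sketch under those assumptions, not the authors' own analysis.

```python
def model1_equilibrium(alpha, r=0.5, K=5.0, d=3.0, gamma=0.8, beta=0.7):
    """Coexistence equilibrium (X*, Y*) of Model I, or None if there is none."""
    if beta * alpha <= d:
        return None  # predator cannot balance its death rate at any prey level
    x_star = d * gamma / (beta * alpha - d)  # from dY/dt = 0 with Y > 0
    if x_star >= K:
        return None  # required prey density exceeds the carrying capacity
    y_star = (r / alpha) * (1.0 - x_star / K) * (gamma + x_star)  # from dX/dt = 0
    return x_star, y_star


def model1_jacobian(x, y, alpha, r=0.5, K=5.0, d=3.0, gamma=0.8, beta=0.7):
    """Jacobian of Model I at the point (x, y), as a 2x2 nested list."""
    denom = (gamma + x) ** 2
    return [
        [r * (1.0 - 2.0 * x / K) - alpha * gamma * y / denom,
         -alpha * x / (gamma + x)],
        [beta * alpha * gamma * y / denom,
         -d + beta * alpha * x / (gamma + x)],
    ]


def is_locally_stable(jac):
    """Routh-Hurwitz test for a 2x2 matrix: trace < 0 and determinant > 0."""
    trace = jac[0][0] + jac[1][1]
    det = jac[0][0] * jac[1][1] - jac[0][1] * jac[1][0]
    return trace < 0 and det > 0
```

Evaluating `is_locally_stable` at the two ends of the range studied, α = 5.5 and α = 6.5, gives a stable and an unstable equilibrium respectively, so under these parameter values the loss of stability falls inside the interval over which α is varied.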
3 Results

In Model I, the predation strength α is an important determinant of the dynamics of the prey and predator populations. As seen in the bifurcation diagrams of prey and predator populations (Fig. 4), both exhibit equilibrium dynamics for α < 5.6, but the steady state loses its stability and bifurcates to limit cycle oscillations with increasing amplitude for α > 5.6.

We now show the effect of modifying the structure of this simple two-species network through the addition of the new link via the virus, which not only separates the prey species into two compartments, but also modifies the predation strength (Model II). For simulation of the Model II network, the new parameter values are chosen as λ = 0.002, η = 0.7, μ = 0.05, and κ = 13. The introduction of the new node (V) and links to the existing module (Model I) has interesting effects on the population dynamics of the species that
[Figure 4 plot: Prey (vertical axis 0 to 6) and Predator (vertical axis 0 to 0.6), each against α from 5.5 to 6.5.]
Fig. 4. Bifurcation diagram of the prey and predator in Model I with increasing predation strength α. At α = 5.6 (approx.), the steady state loses stability and the system undergoes a Hopf bifurcation to limit cycle oscillations.
[Figure 5 plot: panels (a) and (b), each showing the Susceptible, Infected, Predator, and Virus populations against α from 5.5 to 6.5.]
Fig. 5. Bifurcation diagrams of all four populations (susceptible prey, infected prey, predator, and virus) in Model II as a function of the interaction strength parameter α, for different prey preferences: (A) ξ = 0.5, (B) ξ = 0.99.
depend on the interaction strength. As ξ regulates the predation strength by changing the prey preference of the predator, we analyzed Model II for two different values of this parameter: ξ = 0.99, indicating high preference for the susceptible prey and very low preference for the infected prey; and ξ = 0.5, where the predator has no preference for one over the other. Figure 5 shows the bifurcation diagrams of Model II for the two cases, ξ = 0.5 and 0.99. Figure 5(A) shows that, at ξ = 0.5, there are two important changes that occur in the same range of predation strength, i.e., 5.5 < α < 6.5. First, the network reduces to only a “prey (S and I) and virus (V)” system
with the predator population going to zero. This happens because, in the absence of predation, all of S is available for inducing strong viral infection, which converts the susceptible prey class to the infected one, and the predator does not have enough prey to survive through predation. Second, the dynamics of this prey-virus system remains stable with a large virus population and low prey populations. When the predator has a strong preference for the S population, i.e., at ξ = 0.99, this situation continues for low predation strength (until α = 6.2), and the reduced prey-virus system remains stable (Fig. 5(B)). However, at higher predation strength (α > 6.2), the predator succeeds in surviving on predation and reduces the population of I strongly enough to reduce the production of V, which in turn reduces infection, thereby increasing S, which is then available for predation. This kind of delayed feedback on S eventually induces oscillations in all four populations, albeit at higher α compared to Model I. This interesting phenomenon essentially underscores the fact that the distribution of the type (+ or −) and the strength of interactions can play a significant role in food web structure and dynamics. It can change the structure of the network by inducing a species to go extinct, and also promote stability in an otherwise oscillatory system.
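The predator extinction at ξ = 0.5 can also be read directly from the Model II equations. With ξ = 0.5 the predator's per capita growth rate is −d + 0.5βα(S + I)/(γ + S + I), which is always below −d + 0.5βα; for d = 3 and β = 0.7 this bound is negative for every α in the range 5.5 to 6.5. The sketch below checks this numerically with a hand-written fourth-order Runge-Kutta integrator. The initial state, step size, and function names are assumptions chosen for illustration and are not taken from the chapter.

```python
def model2_rhs(state, alpha, xi, r=0.5, K=5.0, d=3.0, gamma=0.8, beta=0.7,
               lam=0.002, eta=0.7, mu=0.05, kappa=13.0):
    """Right-hand side of Model II for state = (S, I, Y, V)."""
    S, I, Y, V = state
    denom = gamma + S + I
    dS = r * S * (1.0 - S / K) - lam * S * V - xi * alpha * S * Y / denom
    dI = lam * S * V - (1.0 - xi) * alpha * I * Y / denom - eta * I
    dY = Y * (-d + beta * alpha * (xi * S + (1.0 - xi) * I) / denom)
    dV = -mu * V + kappa * eta * I
    return (dS, dI, dY, dV)


def rk4_step(f, state, dt):
    """Advance state by one classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt * (a + 2.0 * b + 2.0 * c + e) / 6.0
                 for s, a, b, c, e in zip(state, k1, k2, k3, k4))


def simulate_model2(alpha, xi, state0=(2.0, 0.5, 0.5, 10.0),
                    t_end=50.0, dt=0.01):
    """Integrate Model II from an assumed initial state; return the final state."""
    state = state0
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(lambda s: model2_rhs(s, alpha, xi), state, dt)
    return state
```

Calling `simulate_model2(6.0, 0.5)` returns a state whose predator component is numerically negligible. The same routine can be run with `xi=0.99` to explore the regime in which the text reports that the predator persists at high α, although much longer integration times would be needed to judge the long-term behaviour.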
4 Discussion and Conclusion

Community stability in ecology is primarily decided by the topological and functional architecture of the entire organization. Some studies have indicated that weak interactions are among the most important threads holding natural communities together [50, 77], which is also reasserted by our simple models. Weak interactions have been proposed as the “glue” that binds large networks together [43], with ramifications for biodiversity. In particular, this has important implications for those species whose low abundance and weak per capita consumption rates might otherwise be taken as evidence of a negligible role [42]. Large network simulations have shown that the distribution of interaction strengths is strongly skewed towards weak interactions [29, 61]. Although the experimental quantification of interaction strength in field studies is difficult, preliminary contributions on the nature of distributions of interaction strengths within real food webs are slowly emerging [16, 76]. Similarly, the importance of weak interactions for dynamic stability and species coexistence has been suggested from matrix analyses of soil food webs, numerical simulations of small and large webs, and experimental manipulations [6, 32, 48, 59].

Our study, with two very simple yet realistic ecological networks, points towards some intriguing features. One point of interest is that the introduction of another species in a two-species prey-predator interaction network compartmentalizes the single prey species into two subgroups, leading to additional diversity in the network. Such a node can modify the network structure by pushing the predator species to extinction simply based on the interaction strength and its preference level. At higher values of both these interaction
parameters, the full network structure persists. These features, i.e., the interaction strength and network structure, also regulate the population dynamics of the species. A combination of type and strength of interactions determines the dynamical stability of the species in the network. One natural extension of our study would be to introduce yet another class of prey species, Recovered, which represents the population of individuals that recover from the infection after a time, and either return to the susceptible class, or may be immune to further infections. This would, obviously, increase the complexity of the network by adding new nodes and interactions among them. However, this would contribute towards understanding the concept “diversity leads to stability” on large-scale food web processes. Most of the recent research on food web theory in ecology centers around the local dynamics of a community, but the evolution of food web dynamics across different spatial scales has also received considerable attention [26, 27, 45, 57, 73]. “Habitat fragmentation and its impact on life” is one of the most important issues of present research [66]. The destruction of habitat occurs due to a variety of environmental threats, such as habitat removal, invading alien species, or hunting, each of which may have different effects on food web structure. Given that they often act concomitantly, these may also interact with each other in unpredictable ways. Introduction of alien species poses a significant threat to global biodiversity by altering ecosystem processes, such as nutrient cycling, or disturbance regimes in a community [65], which, in turn, also affect the strength of the links. If the performance of interacting species is habitat dependent, then interaction strength may change with scale. 
Certain approaches such as hierarchical communities of competitors [69, 70] and neutral and quasi-neutral communities [68] have been adapted to show that community organization is relevant in determining the effects of habitat loss and spatial patterning. Such research on ecological network theory in the future would involve rigorous modeling approaches, both analytical and through simulations, in combination with field and laboratory experimental studies, to resolve the crucial questions in conservation and restoration ecology.

Acknowledgments

The authors are thankful to the anonymous referees for constructive, critical comments, and to the Department of Science and Technology, India, for financial support.
References

1. Abrams, P. et al. The role of indirect effects in food webs. In Food Webs: Integration of Patterns and Dynamics (eds G.A. Polis & K.O. Winemiller), 371–395, Chapman & Hall, New York (1996) 2. Allesina, S. and Bodini, A. Who dominates whom in the ecosystem? Energy flow and bottlenecks and cascading extinctions. J. Theor. Biol., 230, 351–358 (2004)
3. Bascompte, J. and Melian, C. J. Simple trophic modules for complex food webs. Ecology, 86, 2868–2873 (2005) 4. Bascompte, J. et al. Interaction strength combinations and the overfishing of a marine food web. Proc. Natl Acad. Sci. USA, 102, 5443–5447 (2005) 5. Bastolla, U., Lassig, M., Manrubia, S. C. and Valleriani, A. Diversity patterns from ecological models at dynamical equilibrium. J. Theor. Biol., 212, 11-34 (2001) 6. Berlow, E. L. et al. Interaction strengths in food webs: issues and opportunities. J. Anim. Ecol., 73, 585–598 (2004) 7. Berlow, E. L., Brose U., and Martinez, N. D. The “Goldilocks factor” in food webs. Proc. Natl. Acad. Sci. USA, 105, 4079–4080 (2008) 8. Bhattacharyya, S. and Bhattacharya, D. K. Pest control through viral diseases: mathematical modeling and analysis. J. Theor. Biol., 238, 177–197 (2006) 9. Caldarelli, G., Higgs, P. G. and McKane, A. J. Modelling coevolution in multispecies communities, J. Theor. Biol., 193, 345–358 (1998) 10. Camacho, J. et al. Quantitative analysis of the local structure of food webs. J. Theor. Biol., 246, 260–268 (2007) 11. Case, T. J. Invasion resistance arises in strongly interacting species-rich model competition communities. Proc. Natl. Acad. Sci. USA, 87, 9610–9614 (1990) 12. Chen, X. and Cohen, J. E. Global stability, local stability and permanence in model food webs. J. Theor. Biol., 212, 223–305 (2001) 13. Cohen, J. E., Briand, F. and Newman, C. M. Community food webs. Biomathematics, 20, Springer-Verlag, Berlin (1990) 14. Dambacher, J. M. et al. Relevance of community structure in assessing indeterminacy of ecological predictions. Ecology, 83, 1372–1385 (2002) 15. Dambacher, J. M. et al. Qualitative stability and ambiguity in model ecosystems. Am. Nat., 161, 876–888 (2003) 16. De Ruiter, P., Neutel, A. M. and Moore, J. C. Energetics, patterns of interaction strengths, and stability in real ecosystems. Science, 269, 1257–1260 (1995) 17. Drossel, B. and McKane, A. J. Modelling food webs. 
In Handbook of Graphs and Networks (eds S. Bornholdt & H. G. Schuster), 218–247, Wiley-VCH, Berlin (2003) 18. Dunne, J. A. et al. Network structure and biodiversity loss in food webs: robustness increases with connectance. Ecol. Lett., 5, 558-567 (2002) 19. Emmerson, M. C. and Raffaelli, D. Predator-prey body size, interaction strength and the stability of a real food web. J. Anim. Ecol., 73, 399–409 (2004) 20. Garcia-Domingo, J. L. and Saldana, J. Food-web complexity emerging from ecological dynamics on adaptive networks. J. Theor. Biol., 247, 819–826 (2007) 21. Garcia-Domingo, J. L. and Saldana, J. Effects of heterogeneous interaction strengths on food web complexity. Oikos, 117, 336–343 (2008) 22. Ghosh, S., Bhattacharyya, S. and Bhattacharya, D. K. Role of viral infection in pest control: a mathematical study. Bull. Math. Biol., 69, 2649–2691 (2007) 23. Gross, T. et al. Long food chains are in general chaotic. Oikos, 109, 135–144 (2005) 24. Hastings, A. and Powell, T. Chaos in a 3-species food-chain. Ecology, 72, 896–903 (1991) 25. Jansen, V. A. A. and Kokkoris, G. D. Complexity and stability revisited, Ecol. Lett., 6, 498–502 (2003) 26. Keitt, T. H. Network theory: an evolving approach to landscape conservation. Ecological and Modeling for Resource Managers, Springer Berlin, 125–134, (2003)
70
S. Bhattacharyya and S. Sinha
27. Keitt, T. H. and Economo, E. P. Species diversity in neutral metacommunities: a network approach. Ecol. Lett., 11(1), 52–62, (2008) 28. Kokkoris, G. D. et al. Variability in interaction strength and implications for biodiversity. J. Anim. Ecol., 71, 362–371 (2002) 29. Kokkoris, G. D., Jansen, V. A. A., Loreau, M. and Troumbis, A. Y. Variability in interaction strength and implications for biodiversity. J. Anim. Ecol., 71, 362–371 (2002) 30. Kondoh, M. Does foraging adaptation create the positive complexity-stability relationship in realistic food-web structure? J. Theor. Biol., 238, 646–651 (2006) 31. Krause, A. E. et al. Compartments revealed in food-web structure. Nature, 426, 282–285 (2003) 32. Laska, M. S. and Wootton, J. T. Theoretical concepts and empirical approaches for measuring interaction strength. Ecology, 79, 461–476 (1998) 33. Law, R. and Morton, R.D. Permanence and the assembly of ecological communities. Ecology, 77, 762–775 (1996) 34. Lawton, J. H. Food webs. In Ecological Concepts: the Contribution of Ecology to an Understanding of the Natural World (ed. J. Cherret), 43-78, Blackwell, Boston (1990) 35. Levines, R. Evolution in Changing Environments: Some Theoretical Explanations. Princeton University Press, Princeton, NJ, USA (1968) 36. Logofet, D. O. Stronger-than-Lyapunov notions of matrix stability, or how ‘flowers’ help solving problems in mathematical ecology. Linear Algebra and Its Applications, 398, 75–100 (2005) 37. Loreau, M. et al. A new look at the relationship between diversity and stability. In Biodiversity and Ecosystem Functioning: Synthesis and Perspectives (eds M. Loreau, S. Naeem and P. Inchausti), 79–91, Oxford University Press, Oxford (2002) 38. MacArthur, R. H. and Levines, R. Strong, or weak interactions? Tansactions of the Connecticut Academy of Arts and Sciences, 44, 177–188 (1972) 39. Martinez, N. D. et al. Diversity, complexity, and persistence in large model ecosystems. 
In Ecological Networks, Linking Structure to Dynamics in Food Webs (eds Pascual, M. and Dunne, J. A.) Santa Fe Inst., Studies in the sciences of complexity. Oxford Univ. Press, 163–185 (2006) 40. May, R. M. Will a large complex system be stable? Nature, 238, 413–414 (1972) 41. May, R. M. Stability and Complexity in Model Ecosystems, Princeton University Press, Princeton, NJ, USA(1973) 42. McCann, K. S. The diversity–stability debate. Nature, 405, 228–233 (2000) 43. McCann, K. et al. Weak trophic interactions and the balance of nature. Nature, 395, 794–798 (1998) 44. McCann, K. and Hastings, A. Re-evaluating the omnivory–stability relationship in food-webs. Proc. Roy. Soc. of London, Series B, 264, 1249–1254 (1998) 45. Memmott, J. et al. Biodiversity loss and ecological network structure. In Ecological Networks: Linking Structure to Dynamics in Food Webs (eds. M. Pascual and J.A. Dunne), Oxford University Press, Oxford (2006) 46. Milo, R. et al. Network motifs: simple building blocks of complex networks. Science, 298, 824–827 (2002) 47. Montoya, J. M., Pimm, S. L. and Sole, R. V. Ecological networks and their fragility. Nature, 442, 259–264 (2006) 48. Montoya, J. M. and Sole, R.V. Topological properties of food webs: from real data to community assembly models. Oikos, 102, 614–622 (2003)
Ecological Networks: Structure, Interaction Strength, and Stability
71
49. Navarrete, S. A. and Berlow, E. L. Variable interaction strengths stabilize marine community patterns. Ecol. Lett., 9, 526–536 (2006) 50. Navarrete, S. A. and Castilla, J. C. Experimental determination of predation intensity in an intertidal predator guild: dominant versus subordinate prey. Oikos, 100, 251-262 (2003) 51. Otto, S. B., Berlow, E. L., Rand, N. E., Smiley, J. and Brose, U. Predator diversity and identity drive interaction strength and trophic cascades in a food web. Ecology, 89, 134–144 (2008) 52. Paine, R. T. Food web complexity and species diversity. Am. Nat., 100, 65–75 (1966) 53. Paine, R. T. A note on trophic complexity and community stability. Am. Nat., 103(929), 91–93 (1969) 54. Paine, R. T. Food webs - road maps of interactions or grist for theoretical development. Ecology, 69, 1648–1654 (1988) 55. Paine, R. T. A. Conversation on refining the concept of keystone species. Conservation Biology, 9(4), 962–964 (1995) 56. Petchey, O. L., Beckerman, A. P, Riede, J. O. and Warren, P. H. Size, foraging, and food web structure. Proc. Natl. Acad. Sci. USA, 105, 4191–4196 (2008) 57. Peterson, E. E., Theobald, D. M. and Ver Hoef, J. M. Geostatistical modeling on stream networks: developing valid covariance matrices based on hydrologic distance and stream flow. Freshwater Biology, 52, 267–279 (2007) 58. Pimm, S. L. The complexity and stability of ecosystems. Nature, 307, 321-326 (1984) 59. Polis, G. A. Stability is woven by complex webs. Nature, 395, 744-745 (1998) 60. Post, W. M. and Pimm, S. L. Community assembly and food web stability, Math. Biosci., 64, 169–192 (1983) 61. Quince, C. et al. Topological structure and interaction strengths in model food webs. Ecol. Model., 187, 389–412 (2005) 62. Raffaelli, D. G. Trends in research on shallow water food webs. Journal of Experimntal Marine Biology and Ecology, 250, 223–232 (2000) 63. Rooney, N. et al. Structural asymmetry and the stability of diverse food webs. Nature, 442, 265–269 (2006) 64. Sabo, J. 
L. et al. Population dynamics and food web structure - predicting measurable food web properties with minimal detail and resolution. In Dynamic Food Webs, Multispecies Assemblages, Ecosystem Development and Environmental Change (eds. de Ruiter, P. C. et al.) Theor. Ecol. Ser., Academic Press, 437– 452 (2005) 65. Schmitz, D. C. and Simberlo, D. Biological invasions: a growing threat. Issues in Sci. & Tech. 13, 33–40 (1997) 66. Singh, B. K., Subba Rao, J., Ramaswamy, R. and Sinha, S. The role of heterogeneity on the spatiotemporal dynamics of hostparasite metapopulation. Ecol. Model., 180, 435–443 (2004) 67. Singh, B. K., Chattopadhyay, J. and Sinha, S. The role of virus infection in a simple phytoplankton zooplankton system. J. Theor. Biol., 231, 153–166 (2004) 68. Sole, R. V., Alonso, D. and McKane, A. self-organized instability in complex ecosystems. Phil. Trans. Roy. Soc. Lond. Ser., B-Biol. Sci. 357, 667–681 (2002) 69. Stone, L. Biodiversity and habitat destruction - a comparative study of model forest and coral-reef ecosystems. Proc. Natl. Acad. Sci. USA, 261, 381-388 (1995) 70. Tilman, D. et al. Habitat destruction and the extinction debt. Nature, 371, 6566 (1994).
72
S. Bhattacharyya and S. Sinha
71. Uchida, S. and Drossel, B. Relation between complexity and stability in food webs with adaptive behavior. J. Theor. Biol., 247, 713–722 (2007) 72. Uchida, S., Drossel, B. and Brose, U. The structure of food webs with adaptive behaviour. Ecol. Model., 206, 263–276 (2007) 73. Urban, D. L., Goslee, S., Pierce K. B. and Lookingbill, T.R. Extending community ecology to landscapes. Ecoscience, 9, 200–212 (2002) 74. Williams, R. J. and Martinez, N. D. Simple rules yield complex food webs. Nature, 404, 180–183 (2000) 75. Woodward, G. and Hildrew, A. G. Body-size constraints on niche overlap and intraguild predation in a complex food web. J. Anim. Ecol., 71, 1063–1074 (2002) 76. Wootton, J. T. Estimates and tests of per-capita interaction strength: diet, abundance, and impact of intertidally-foraging birds. Ecological Monographs, 67, 45– 64 (1997) 77. Wootton, J. T. and Emmerson M. Measurement of interaction strength in nature. Annu. Rev. Ecol. Evol. Syst., 36, 419–444 (2005) 78. Yodzis, P. The indeterminacy of ecological interactions as perceived through perturbation experiments. Ecology, 69, 508–515 (1988) 79. Yodzis, P. and Innes, S. Body-size and consumer-resource dynamics. Am. Nat., 139, 1151–1175 (1992)
Signaling and Feedback in Biological Networks
Sandeep Krishna, Mogens H. Jensen, and Kim Sneppen
Center for Models of Life, Niels Bohr Institute, Blegdamsvej 17, 2100 Copenhagen, Denmark
1 Introduction

Cellular processes operate on a wide range of time and length scales to produce complex and intricate dynamics. It is a great challenge to understand both how these dynamical patterns are produced and why they are produced; that is, what functional or evolutionary role they play. This is one of the most fruitful areas in which to apply the ideas of complex networks.

Living cells have all the prerequisites for a useful representation as networks. First, cellular systems contain numerous non-identical active components: genes, proteins, RNA, etc. These are the nodes of the network. Second, there are many interactions between these components, which form the links between the nodes. Not every pair of components interacts, so the resulting network is not fully connected, nor is it a tree or other simple topology. Thus, cellular networks provide plenty of scope for analysing their structure and graph-theoretic properties, and numerous studies have taken advantage of this (see [1] for reviews and [2–9] for some examples).

Network representations of cellular systems can easily be augmented to address dynamical issues. Each node can be associated with a dynamical variable which could represent, for example, the concentration of that protein or the level of expression of that gene. Equations or rules governing the temporal dynamics of these variables can then be written, where the network structure determines which variables interact with each other. This usually requires encoding more information about the interactions into the network representation. For instance, apart from knowing that one node links to another, one needs to know the sign and strength of the interaction. However, in a network picture it is sometimes difficult to encode more detailed molecular information, such as whether the binding of a protein to DNA is accompanied by DNA looping, or whether a small molecule that binds to a protein can also bind equally well when that protein is bound to DNA.

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_5, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
The question then is: What kind of physiologically useful processes can be illuminated by the kind of information that is easily represented in a network picture of a cell? One broad class of such processes is signal propagation. Signals need to be sent in response to environmental conditions in order to trigger the appropriate functional proteins, and need to be sent between proteins in order to perform necessary computations. For example, the presence of food metabolites in the surroundings triggers signals to proteins involved in transport and metabolism of those molecules; or a sudden change in the temperature triggers signals to proteins which buffer the cell against the shock. Network representations of cellular systems are particularly suited to the study of signal propagation because they precisely delineate the paths along which signals could travel.

The next level of complication occurs when a signal loops back onto itself. Such feedback loops are at the core of every non-trivial computation performed by a cell [10–16]. Feedback loops are necessary for much non-trivial dynamical behaviour, in particular oscillations and multistability, both of which are important for proper cellular function in different organisms.

Our review will therefore introduce biological networks specifically with the intention of investigating signal propagation and feedback. We will describe simple measures for examining signal propagation on networks. We will use the organism-wide cellular network of E. coli to discuss whether the network structure has any particular properties which would affect the cost and specificity of signal propagation. The review will then continue by discussing feedback in sub-networks of mammalian and yeast cells. We will take one example each of biological settings in which negative and positive feedback, respectively, play a crucial role in the dynamical behaviour of the system. Finally, we will conclude by looking at combinations of feedback loops. We show that two entangled feedback loops, which are common in bacterial cells, have dynamical properties that are quite different from those of their individual loops.
2 Signaling

An organism-wide protein network of the bacterium E. coli can be extracted from the database EcoCyc [17] and represented as a directed, bipartite graph with 2846 protein nodes and 2774 reaction nodes [18]. The reaction nodes include all kinds of cellular reactions between proteins: transcription reactions, complex formations, protein modifications and metabolic reactions. Figure 1A shows the giant weakly connected component of this graph, consisting of 1938 reactions (of which 812 are transcription reactions, squares) and 1897 proteins (circles). Figure 1A also illustrates that the E. coli graph is composed of a large number of relatively small strong components (a strong component is a sub-graph where there is a directed path between every pair of nodes). Figure 1B compares this with the strong component structure of a randomised network with exactly the same number of nodes and links, as well as the same in- and out-degree (number of in- and out-links) of each node. The E. coli
Fig. 1. E. coli protein reaction network. (A, Left) The graph is the largest weak component of a bipartite network, consisting of proteins (circles) and reaction nodes (promoters (squares), complex formations and modifications (black squares)). The two largest hubs, σ70 and CRP, and their links, have been removed for ease of visualisation. (A, bottom left) Illustration of the procedure of making the strong component graph. (A, Right) The resulting strong component graph of the E. coli network. An arrow in the strong component graph indicates that there is a path connecting the two strong components in the original graph; nodes correspond to strong components of minimum size two. (B) The strong component graph for a randomised version of the E. coli network. The randomisation preserves the total number of nodes, total number of links and the number of in- and out-links of each node [18].
protein network is much more modular than the randomised network, an overall feature of regulation/signaling that was first suggested in [19]. In such a network, what we call “signals” are perturbations in the dynamical variables associated with the nodes. For instance, if all the nodes were proteins, then a perturbation in the concentration of one protein would alter the concentration of all the proteins downstream of it. The simplest aspect of the structure of the network that influences signaling is the number of nodes that are downstream of any given starting node (note that this is a quantity that can be sensibly studied only with a directed graph representation of the network; in any connected undirected graph all nodes are downstream of each other). The possible signals emanating from the starting node are
Fig. 2. The cumulative distribution of number of downstream targets s for nodes of the E. coli network (lower curve) and the randomised network (upper curve) [18].
obviously limited to these nodes. The strong component graphs in Fig. 1 show particularly clearly how the network structure affects signaling possibilities. Within each strong component, every node can, in principle, send a signal to every other node. But between strong components the possibilities are hugely reduced. Thus, the E. coli network structure already seems to be set up to allow plentiful signaling on short length scales, but to allow only very specific paths on longer length scales. In the random network, however, most nodes can send signals to almost the entire network (because most of the nodes are part of one giant strong component). A percolating structure like this is not conducive to specific signaling because every node has almost the entire network downstream of it. Figure 2 bolsters this conclusion, showing that in the E. coli network proteins have a much smaller number of downstream targets than in the randomised network.

2.1 Cost of Signaling

Signaling is not just about reaching a downstream target. As a signal propagates, it needs other molecules to help it pass the message across consecutive reactions. Consider, for example, a signal initiated by an increase in the concentration of a given transcription factor. The promoter it influences may depend on other transcription factors, for example, in an or-gate construction. If that is the case, and the other transcription factor is already abundant, the promoter activity will not be influenced and thus the signal will not be transmitted. More generally, for each additional reactant along a reaction pathway, signal propagation becomes increasingly coupled to the overall state of the
Fig. 3. (A) Schematic showing how the “cost” of a signaling path, A → F , is measured. In this case proteins B and D are necessary, giving a cost C = 2. (B) Cost of a signaling path as a function of its length for the real (solid) and randomised (dashed) E. coli networks [18].
molecules in the cell. The more reactions in the path, and the more reactants in each reaction, the more conditions that must be met for propagation of the signal. We quantify this cost C = C(path) for an arbitrary path from a starting protein to a target protein by simply counting the number of reactants along the entire path (not counting the protein nodes which are part of the path), as described schematically in Fig. 3A. If the same reactant is used several times, it is only counted once. Notice that the propagation of a signal does not necessarily mean an increased level of the proteins involved. The key point is that a change in input state should be transmitted to a changed output state of the end product. Our cost function is a simple measure of the complexity of handling such a signal and it could, in principle, be calculated between any pair of proteins where a path exists in the directed network. Figure 3B shows the average cost of signals propagating from one protein to another along the shortest path connecting them, as a function of the length l of that path. Each data point is the average over all pairs which are at the given distance. Except for paths of length two, the average cost for signals
Fig. 4. The six largest strong components of the E. coli network (A–F), along with plots of the average cost, C(l) as a function of signaling distance. The grey areas show the range spanned by C(l) for 100 randomised versions of the subgraphs [18].
is significantly smaller for the real E. coli network than for a randomised network (error bars are smaller than the symbol size). Figure 4 repeats this analysis for each of the six largest strong components in the network. These strong components capture distinct functional units associated, respectively, with (A) predominantly fatty acid metabolism, (B) the transcription network around σ factors, (C) PTS-sugar transport, (D) ABC transporters, (E) the FeII and FeIII transport system and (F) the chemotaxis module. Overall, we see that the cost within each module is fairly similar to the random expectation.

2.2 Conclusions About Signaling

We have shown that the molecular network of E. coli is designed in a way which facilitates local signaling. On longer distances, signal transmission is a priori nearly impossible, but we find statistical evidence for signal pathways in terms of a lower signaling “cost” when we measure this by the number of co-factors needed to transmit a given signal. The fact that the E. coli network has a lower than randomly expected cost of signaling for paths longer than two steps shows that it contains many linear chains which have few incoming branches. That is, the real network is “stringy,” while the randomised network is more “bushy,” having relatively many more branched pathways. Topologically, a low cost is equivalent to less cross talk, which is indeed desirable [3, 19]. This picture of a stringy network of long linear chains applies to the large scale: the place where the real network optimizes specific signaling is between
strong component modules, rather than within them. A final intriguing point is that at small scales, within modules, the network has widely different design features, as seen from Fig. 4. Some modules (C,F) are dominated by complex formation reactions, and others (D,E) by linear pathways, while the remaining (A,B) are densely interconnected. Obviously, signaling is not only limited by the topology of the network, but also by the type of chemical reactions that facilitate the signals. For example, in pure protein-protein interaction networks, Refs. [20, 21] show that proteins with high concentrations propagate signals to proteins at low concentrations, but not vice versa. Further, when most of a protein is present in an unbound form, rather than in a complex with other proteins, it inhibits propagation of signals through that node of the network. Thus, the overall picture of signaling in biological networks is that one needs careful engineering of both topology and protein binding chemistry in order to facilitate signal propagation over more than one or two reactions.
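The two structural measures used in this section, the set of downstream targets of a node and the cost of a signaling path, can be made concrete on a toy example. The Python sketch below is only an illustration under stated assumptions: the three-reaction network, the node names and the helper functions (`build_graph`, `shortest_path`, `downstream_proteins`, `path_cost`) are hypothetical and are not taken from EcoCyc or from [18]. The toy path from A to F needs the two extra reactants B and D, so its cost is 2, the same value as in the example of Fig. 3A.

```python
from collections import deque

# Hypothetical toy network (not EcoCyc data): each reaction node maps to
# (set of input proteins, set of output proteins).
REACTIONS = {
    "R1": ({"A", "B"}, {"C"}),
    "R2": ({"C", "D"}, {"F"}),
    "R3": ({"F"}, {"G"}),
}


def build_graph(reactions):
    """Directed bipartite graph with links protein -> reaction -> protein."""
    graph = {}
    for rxn, (inputs, outputs) in reactions.items():
        graph.setdefault(rxn, set()).update(outputs)
        for protein in inputs:
            graph.setdefault(protein, set()).add(rxn)
        for protein in outputs:
            graph.setdefault(protein, set())
    return graph


def shortest_path(graph, source, target):
    """Breadth-first search; returns one shortest path as a node list, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(graph[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def downstream_proteins(graph, reactions, source):
    """All proteins that can be reached from source along directed links."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return {n for n in seen if n not in reactions and n != source}


def path_cost(path, reactions):
    """Distinct reactants required along the path, excluding proteins on the path."""
    on_path = set(path)
    extra = set()
    for node in path:
        if node in reactions:
            extra |= reactions[node][0] - on_path  # a set, so repeats count once
    return len(extra)


graph = build_graph(REACTIONS)
path = shortest_path(graph, "A", "F")
print("path:", path, "cost:", path_cost(path, REACTIONS))
print("downstream of A:", sorted(downstream_proteins(graph, REACTIONS, "A")))
```

Collecting the extra reactants in a set mirrors the rule that a reactant used several times is counted only once. Averaging `path_cost` over all pairs of proteins at a given shortest-path distance would give a curve analogous to the one in Fig. 3B.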
3 Feedback

Figure 5 shows a number of feedback loops. Each node in each loop receives signals (perturbations) from the previous node and sends them on to the next node in the cycle. When the signal travels all the way around the loop, it will
[Figure 5 panels. Negative feedback loops: (a) Hes1; (b) p53, Mdm2; (c) lactose, β-galactosidase, LacI; (d) IκBα, IκBα mRNA, NF-κB. Positive feedback loops: (e) cI; (f) cI, Cro; (g) lactose, lactose transporter, LacI.]
Fig. 5. Examples of positive and negative feedback loops. An ordinary arrow indicates activation, a barred arrow indicates inhibition. (a)–(d) Negative feedback loops involving proteins important for, respectively, development [29], apoptosis [30], lactose consumption [31, 32] and the immune system [33]. (e)–(g) Positive feedback loops involving proteins important for, respectively, λ phage lysis-lysogeny decision and induction [34], and import of extracellular lactose [31, 32].
act to either dampen (negative feedback) or enhance (positive feedback) the original perturbation. Whether the feedback is positive or negative depends on how the nodes interact. In Fig. 5 we use an ordinary arrow to indicate that a node activates the next node, and a barred arrow if a node inhibits the next node. Then, clearly, all loops with an odd number of repressors are negative feedback loops, while those with an even number of repressors are positive feedback loops.

In cellular networks such feedback loops are quite common. Previous studies which searched for small “motifs” in cellular networks found very few feedback loops and an overabundance of feedforward loops [7]. However, these studies looked only at transcription factor networks. As soon as one includes metabolism, it quickly becomes apparent that feedback loops are by far the most common motif, especially at the interface between the metabolic and regulatory networks of the cell [22]. This interface is quite extensive, as evidenced by the fact that around half of all transcription factors in E. coli have a binding site for small metabolic molecules. The sugar lactose is one such example, being involved in both a negative (Fig. 5c) and a positive feedback loop (Fig. 5g). Figure 5 also shows some other examples of negative and positive feedback loops without small molecules.

Positive feedback loops are closely related to the existence of multiple stable states of the system, while negative feedback loops are associated with oscillations. In fact, for a very general class of systems, it has been shown that the existence of at least one negative feedback loop is necessary (but not sufficient) for oscillations, and a similar result holds for positive feedback and multistability [23–25]. References [26–28] study, both theoretically and through the construction of synthetic gene circuits, multistability in positive feedback networks. Ref. [35] further explores the connection between oscillations and negative feedback, showing how the structure of the underlying loop can be extracted from oscillating time series.

3.1 Negative Feedback and Oscillations in Mammalian Immune Response

The simplest negative feedback loop is, of course, a protein which represses itself (Fig. 5a). There are many examples of such proteins: the main regulator of the E. coli response to UV damage, LexA, represses its own production [36]; Hes1, involved in development in mammalian cells, also represses transcription of its own gene [29]. A well-known synthetic negative feedback loop is the repressilator, which consists of three proteins, each repressing the next in a cycle [37] (the same structure as Fig. 5c). Here we will concentrate on the negative feedback loop shown in Fig. 5d containing the transcription factor NF-κB, which is one of the central regulators of the immune system in mammalian cells. The NF-κB family of proteins is one of the most studied, being involved in a variety of cellular processes including immune response, inflammation and development. NF-κB can be activated by a number of external stimuli
including bacteria, viruses and various stresses and proteins. In response to these signals it controls, directly and indirectly, over 150 genes including many chemokines, immunoreceptors, stress response genes and acute phase inflammation response proteins [33]. Nuclear NF-κB is known to activate production of IκBα, an inhibitor protein which inhibits nuclear import of NF-κB by sequestering it in the cytoplasm, thus forming a negative feedback loop. Experimentally, when the NF-κB system is suitably excited, the concentration of NF-κB in the nucleus begins to oscillate [10, 38].

How does the negative feedback loop of NF-κB produce oscillations? Physically, what is required for instability of the fixed point, and hence oscillations, is a time delay, i.e., a sufficient slowing down of the signal going around the loop. (If a perturbation in the concentration of one variable instantaneously affects the concentration of the next one, and so on, then for a negative feedback loop, any perturbation will be immediately cancelled and the steady state will be stable.) In cellular systems many processes could produce time delays: (i) a process that takes a finite minimum time, (ii) many intermediate steps, (iii) a sharp response by some of the variables, (iv) saturated degradation, or (v) autocatalysis (see Ref. [39] for more details).

In the NF-κB system it is, in fact, saturated degradation of IκB that is behind the oscillations. NF-κB forms a complex with its inhibitor protein IκBα. This complex has the curious property that the external stimulus (a protein kinase called IKK) leads to a degradation of IκBα only when it is bound in the complex, and not when it is unbound. As a result, the degradation rate of IκBα has an upper limit, i.e., is saturated, due to the limited amount of NF-κB present and hence the limited amount of complex that can form.
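Why a delay is needed can be illustrated with a linearised toy loop. The sketch below is not the NF-κB model: it assumes n variables arranged in a ring, each decaying at unit rate, with the product of the link strengths around the loop equal to −g, where g > 0 is the loop gain (the names `max_growth_rate`, `n_steps` and `gain` are hypothetical). Under these assumptions the eigenvalues λ of the linearised dynamics satisfy (λ + 1)^n = −g, and the steady state becomes unstable only if some eigenvalue has a positive real part.

```python
import cmath
import math


def max_growth_rate(n_steps, gain):
    """Largest real part among the eigenvalues of the toy negative feedback loop.

    Toy assumption: unit decay rate for every variable and loop gain -gain,
    so the eigenvalues are the n_steps roots of (lam + 1)**n_steps = -gain.
    """
    radius = gain ** (1.0 / n_steps)
    roots = [
        -1 + radius * cmath.exp(1j * math.pi * (2 * k + 1) / n_steps)
        for k in range(n_steps)
    ]
    return max(root.real for root in roots)


# Scan a few loop lengths and gains; a positive value signals instability.
for n in (1, 2, 3, 6):
    rates = [round(max_growth_rate(n, g), 3) for g in (1.0, 4.0, 16.0)]
    print(f"n = {n}: max Re(lambda) for g = 1, 4, 16 ->", rates)
```

In this toy setting loops with one or two steps remain stable for any gain, whereas a three-step loop loses stability once the gain exceeds 8, since the largest real part is −1 + g^(1/n) cos(π/n). This is in line with mechanism (ii) above: more intermediate steps slow the signal going around the loop and make oscillations possible.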
Mathematically, it is possible to describe all the essential features of the NF-κB system using a very simple model consisting of only three variables [11], nuclear NF-κB ($N_n$), cytoplasmic IκB ($I$) and IκB mRNA ($I_m$):

\begin{align}
\frac{dN_n}{dt} &= A\,\frac{1 - N_n}{\epsilon + I} - B\,\frac{I N_n}{\delta + N_n}, \tag{1}\\
\frac{dI_m}{dt} &= N_n^2 - I_m, \tag{2}\\
\frac{dI}{dt} &= I_m - C\,\frac{(1 - N_n)\,I}{\epsilon + I}. \tag{3}
\end{align}
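As a small numerical illustration of what “saturated” means here, the following Python sketch evaluates the right-hand sides of Eqs. (1)–(3) and the IκB degradation term of Eq. (3). It is a sketch under assumptions, not the authors' code: the function names `nfkb_rhs` and `ikb_degradation` are hypothetical, the parameter values are those quoted in the caption of Fig. 6, and `EPS` stands for the small constant in the denominators (the one with value 2 × 10⁻⁵ in that caption).

```python
# Parameter values quoted in the caption of Fig. 6.
A, B, C = 0.007, 954.5, 0.035
DELTA, EPS = 0.029, 2e-5


def nfkb_rhs(nn, im, i):
    """Right-hand sides of Eqs. (1)-(3) for the state (N_n, I_m, I)."""
    complex_factor = (1.0 - nn) / (EPS + i)
    d_nn = A * complex_factor - B * i * nn / (DELTA + nn)
    d_im = nn ** 2 - im
    d_i = im - C * complex_factor * i
    return d_nn, d_im, d_i


def ikb_degradation(nn, i):
    """Second term of Eq. (3): loss of IkB through the complex with NF-kB."""
    return C * (1.0 - nn) * i / (EPS + i)


# The degradation rate rises with I but levels off at C * (1 - N_n).
nn_fixed = 0.1
for i in (1e-6, 1e-5, 1e-4, 1e-2, 1.0, 100.0):
    print(f"I = {i:g}: degradation = {ikb_degradation(nn_fixed, i):.6f}")
print("upper limit C * (1 - N_n) =", C * (1.0 - nn_fixed))
```

The printed values approach the upper limit once I is much larger than the small constant in the denominator, so adding more IκB no longer speeds up its removal. Integrating the full model is left out of the sketch; because the terms differ by several orders of magnitude, a small time step or an adaptive solver would probably be needed.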
The saturated degradation is the second term in the last equation. Other terms in the equations model processes like nuclear import and export of NF-κB, production of IκB, etc. (see [11] for more details). An obvious question is why the cell requires oscillations in NF-κB in response to inflammation. This is a subject of much debate currently, and there is no clear answer. However, our model of NF-κB provides a possible clue: One property of the oscillations of nuclear NF-κB (in Fig. 6) that stands out is that they are extremely spiky. The spikiness is extremely robust to changes
Fig. 6. (Left) Oscillations of nuclear NF-κB (Nn) (black curve) and cytoplasmic IκB (grey curve) for simulations of the model with A = 0.007, B = 954.5, C = 0.035, δ = 0.029 and ε = 2 × 10⁻⁵ (these parameter values are derived from the ones used in Ref. [10], see [11]). In order to facilitate comparison with the experimental plot (right, obtained from Ref. [38]), the x-axis has been limited to 600 minutes, but the oscillations are sustained.
Fig. 7. Sensitivity to IKK. (Left) Spike duration, the fraction of time Nn spends above its mean value, as a function of IKK concentration. (Right) Spike peak, the maximum concentration of nuclear NF-κB, as a function of IKK concentration. In both plots, the black dot shows the IKK value used in Fig. 6, which separates regions of spiky and soft oscillations [11].
in parameter values. In general, the existence and spikiness of the oscillations is very robust to changes in most of the parameters of the model [11]. However, the system shows a very sensitive response to change in one parameter: the external stimulus, IKK. Figure 7 shows that both the spike height (or peak level), as well as the spike duration, can change by large amounts in response to small changes in the IKK level. Notice that this sensitivity is particularly high in IKK ranges which are near the transition from spiky to soft oscillations. It can be shown that this sensitivity can be transmitted to genes that are affected by NF-κB, producing a gene response sensitivity
that is much larger than that obtained by other typical mechanisms which do not involve oscillations [40, 41]. Thus, oscillations could be a by-product of designing the system to have a very high sensitivity to small changes in the external stimulus.

3.2 Positive Feedback and Bistability in Yeast Epigenetics

Cells carry information handed down from their ancestors and are able to pass on information to their descendants. In many cases this “memory” is epigenetic (not stored in the DNA sequence), allowing cells with identical DNA to maintain distinct properties. Epigenetic cell memory implies alternative states that are stable over time and are inherited through cell division. One proposed mechanism for epigenetic cell memory invokes positive feedback loops in nucleosome modification [42].

Nucleosomes are protein complexes that package eukaryotic DNA, with a density of about one nucleosome per 200 base pairs (bp). The core nucleosome is composed of two molecules each of four core histone proteins. Nucleosomes may carry various chemical modifications (e.g. acetylation and methylation) at different amino acid positions on the different histones, conferring a large potential information capacity on each nucleosome. Specific additions and removals of these nucleosome modifications are carried out by classes of enzymes, including histone acetyltransferases (HATs), histone methylases (HMTs), histone deacetylases (HDACs) and histone demethylases (HDMs). At least some of these modifications affect the activity of nearby genes, in part because the modifications can alter the binding of regulatory proteins to the DNA.

Positive feedbacks are present in this system because nucleosomes that carry a particular modification may recruit (directly or indirectly) the enzymes that catalyse similar modification of neighbouring nucleosomes. Thus, a cluster of nucleosomes may be able to maintain itself stably in a particular modification state.
These states can be inherited through DNA replication because nucleosomes on the parental DNA strand are distributed to both daughter strands [43], and the enzymes recruited by these parental nucleosomes may then establish the parental modification pattern on the newly deposited nucleosomes. A specific case in which positive feedbacks in nucleosome modification result in multiple stable states occurs in the mating-type system of the eukaryote S. pombe (fission yeast) [44]. A ∼20 kbp region of S. pombe DNA containing two mating-type cassettes is normally in a stable “silenced” state, with the mating-type genes not expressed. In certain mutants where part of the silenced region is modified, the system is bistable, flipping between states where the ura4 gene is either expressed (active) or not (silenced). Each state is stable and heritable, with transitions occurring at roughly equal frequencies of about 5 × 10⁻⁴ per cell division [44]. Switching appears to be stochastic and is determined by factors associated with the region itself. In the silenced state, but not the active state, the region is dominated by nucleosomes that are
S. Krishna, M.H. Jensen, and K. Sneppen
Fig. 8. Illustration of basic ingredients of the model: Each oval represents a nucleosome that can be methylated (M), unmodified (U) or acetylated (A). Enzymatic transitions (solid arrows) between the three states are in part random (controlled by a noise level 1 − α), and in part autoregulated by recruitment (dotted lines) of enzymes (open symbols) by nucleosomes in the M or A state [45].
methylated at a particular site. An HMT that can catalyse this modification and certain HDAC proteins are known to be important for silencing. One can construct a simple network model [45] (schematically shown in Fig. 8) of the nucleosome modification system that exhibits all this behaviour, based on three simplifying assumptions. (1) There are only three relevant kinds of nucleosomes: unmodified, methylated and acetylated; methylation and acetylation are mutually exclusive. (2) The nucleosomes are enzymatically interconverted as shown in Fig. 8, by HMT, HDAC, HDM and HAT enzyme(s). (3) The HDAC and HMT enzyme(s) are recruited by methylated nucleosomes; the HDM and HAT enzymes are recruited by acetylated nucleosomes. This is what makes the feedback positive. To model S. pombe we take a system consisting of a fixed number of N = 60 nucleosomes, arranged on a 1-dimensional (1D) string. The region is isolated from neighbouring DNA by boundary elements [46], which we assume to be inert. Each nucleosome may be methylated (M), unmodified (U) or acetylated (A). At each time step one selects a random nucleosome n1 and attempts one of two changes: (a) With probability α one attempts a change associated to enzymatic activity of an enzyme recruited by another nucleosome in the modeled region. That is, one selects another random nucleosome n2 and if this is in either an
M or A state, the nucleosome n1 is changed one step toward this state. For example, when nucleosome n2 is an M: if n1 is an A, then it is changed to U, and if n1 is a U it is changed to M. If nucleosomes n1 and n2 are in the same state, or if n2 is a U, then no changes are made. (b) With probability 1 − α one attempts a change of the selected nucleosome n1: a U is changed to an M with probability 1/3, or to an A with probability 1/3, whereas an A or an M is changed to U with probability 1/3. One may view process (a) as occurring due to the action of enzymes recruited by nucleosomes in the region within the isolating boundaries, whereas (b) reflects extrinsic noise caused by unrecruited enzymes. Thus, a lower α value indicates a higher noise level.
In Fig. 9 we illustrate the dynamics of the model. One observes a fluctuating number of the three kinds of nucleosomes. In the upper panel α is small (noise is high) and the system has only one stable state, in which the nucleosome modifications are distributed randomly along the chain. In the lower panel, with a higher α, the system exists either in a state dominated by methylated nucleosomes or a state dominated by acetylated nucleosomes, with occasional switches between the two states. As α is increased further (i.e., noise is reduced) the states become more stable, and the switching occurs less often. However, the fact that the epigenetic states in the mutant S. pombe have a finite stability demonstrates that noise in the form of disordered methylation-acetylation events plays a crucial role.
Fig. 9. Time development of the standard model [45] for a system consisting of N = 60 nucleosomes with respectively α = 0.40 (upper figure) and α = 0.64 (lower figure). The light grey curve shows the number of methylated, dark grey the number of acetylated and black the number of unmodified nucleosomes. Time t is measured in number of attempted nucleosome updates per nucleosome.
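The update rule of the model can be made concrete with a short Python sketch. This is a minimal illustration, not the implementation of [45]: the function names (`recruited_state`, `noisy_state`, `simulate`), the random initial condition, and the choice to draw n2 from the other N − 1 nucleosomes are assumptions made here.

```python
import random

STATES = "MUA"  # methylated, unmodified, acetylated; U lies between M and A


def recruited_state(s1, s2):
    """Process (a): move s1 one step toward s2 when s2 is M or A."""
    if s2 == "U" or s1 == s2:
        return s1
    i, target = STATES.index(s1), STATES.index(s2)
    return STATES[i + (1 if target > i else -1)]


def noisy_state(s1, rng):
    """Process (b): each possible noisy conversion has probability 1/3."""
    r = rng.random()
    if s1 == "U":
        return "M" if r < 1 / 3 else ("A" if r < 2 / 3 else "U")
    return "U" if r < 1 / 3 else s1


def simulate(n=60, alpha=0.64, sweeps=200, seed=0):
    """Return per-sweep counts (M, U, A); one sweep is n attempted updates."""
    rng = random.Random(seed)
    state = [rng.choice(STATES) for _ in range(n)]  # assumed initial condition
    history = []
    for _ in range(sweeps):
        for _ in range(n):
            n1 = rng.randrange(n)
            if rng.random() < alpha:
                # Assumption: n2 is drawn uniformly from the other n - 1 sites.
                n2 = rng.randrange(n - 1)
                if n2 >= n1:
                    n2 += 1
                state[n1] = recruited_state(state[n1], state[n2])
            else:
                state[n1] = noisy_state(state[n1], rng)
        history.append(tuple(state.count(s) for s in STATES))
    return history
```

Calling `simulate(alpha=0.40)` and `simulate(alpha=0.64)` uses the parameter values quoted for Fig. 9. The sketch is meant to make the rules explicit rather than to reproduce the figure quantitatively.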
This simplified model of epigenetic inheritance in eukaryotes provides some unexpected insights. First, it is very important that nucleosomes are modified by enzymes recruited by non-neighbouring nucleosomes. A “1D” variant of the model where nucleosomes can recruit enzymes to modify only one of their neighbours along the string does not produce bistability [45]. The difficulty of obtaining a clear two-state behaviour in 1D arises for reasons similar to those preventing spontaneous magnetization in the 1D Ising model, or the helix-coil transition in polymer models [47, 48]. Second, it is also very important that the transition from, say, an M state to an A state requires two consecutive acetylation recruitments by nucleosomes in the A state, and therefore effectively has a rate ∝ A², where A here denotes the abundance of nucleosomes in the A state. Bistability is lost in variants where this two-step process is replaced by a single step [45]. The non-linearity produced by this kind of “cooperative” two-step modification appears to be essential for bistability. Most importantly, however, at low α, where the modification-demodification events are completely random (and hence there is no feedback), there is only one state where the nucleosome modifications are distributed completely randomly along the string. Thus, we can conclude that positive feedback is essential for bistability.
4 Combining Multiple Feedback Loops In the previous sections we investigated the basic properties of single negative and positive feedback loops. In cellular networks, however, there are multiple entangled feedback loops. This can already be seen in Fig. 5, where some of the proteins are present in more than one example (LacI in Fig. 5c and g; cI in Fig. 5e and f). In an effort to understand how feedback loops interact and the range of dynamical behaviour possible, we begin by examining two interacting feedback loops. Such two-loop network motifs are seen in a large class of cellular response systems designed to regulate the flux and concentration of small molecules. These systems control, via two feedback loops, the transport and metabolism pathways. Typically, these two loops are connected by a common transcriptional regulator that senses the concentration of the small molecule. For instance, in the arabinose utilization system in E. coli, when intracellular arabinose binds to the regulator AraC it alters its binding to DNA such that RNA polymerase and the protein CRP can bind and initiate expression of genes that increase import of extracellular arabinose as well as its metabolic consumption [49]. This is schematically shown in Fig. 10. Here, the transport is controlled by a positive feedback loop, while the metabolism is a negative feedback loop. This is, of course, not the only logical combination of feedback loops possible. Figure 11 (left column) shows four logically distinct combinations of entangled transport and metabolism feedback loops. In each case, the two feedback loops are connected by a transcriptional regulator (R) that senses the concentration of a particular small molecule (s). One loop regulates transcription of
Fig. 10. Schematic illustration of molecular processes in a two-loop motif. This motif is found in the regulation of uptake and metabolism of, for example, maltose and arabinose [50, 49]. σ, s denote, respectively, extracellular and intracellular concentrations of the small molecule. The molecule binds to the regulator, R, forming the complex {Rs} which activates production of transport proteins, T , and metabolic enzymes, E. γ is a parameter controlling the metabolic rate per enzyme [13].
the transport proteins (T) facilitating the influx of the small molecule, while the other controls transcription of enzymes (E) responsible for the metabolism of s. The signs show the logic of each feedback loop: positive (+) or negative (–). Each motif can then be described by a notation of two signs, e.g. (+ –), which means that the transport loop is positive and the metabolism loop negative. Thus, there are four logical structures: the socialist (– –), the consumer (+ –), the fashion (– +) and the collector (+ +) [13]. Each can, in turn, be implemented in two distinct but logically equivalent ways, depending on whether s inhibits or activates R. This we denote using the notation (+ – i) or (+ – a), where the i (respectively, a) indicates inhibition (activation) of R by s. The i- and a-motifs with the same logic behave very similarly, so here we will concentrate on only the a-motifs.
The socialist motif. We call the (– –) motif the socialist because at low levels of extracellular s (low σ) it increases transport and reduces the metabolism, while at high levels of extracellular s, it does the opposite. Thus, the two negative feedback loops help maintain s robustly within a small concentration range. Such behaviour would be ideal for a system responsible for maintaining homeostasis. And indeed, a regulatory system with this logic is found in the iron homeostasis system in bacteria [51]: iron activates the ferric uptake regulator (Fur), which represses transcription initiation of iron uptake genes, and enhances production of iron-using proteins. For most organisms iron is essential for several proteins, but is poisonous at high concentrations. There,
Fig. 11. Behaviour of four entangled feedback loop motifs. Plots show the steady state values of s (middle column) and influx (σT = γEs + s, right column) as a function of σ. In all plots, the black curve shows the behaviour for the two-loop motif. The two other curves show the behaviour when only the transport loop is active (E = 1) and when only the metabolism loop is active (T = 1) [13].
the (– –) motif maintains the loosely bound iron within a narrow concentration range, and at the same time allows a high consumption of iron molecules by certain proteins that bind iron strongly.
The consumer motif. The (+ –) motif we term the consumer, because any amount of extracellular small molecule results in the increase of both transport and metabolism. Thus, it is ideal for food molecules. This logic is in fact typical for sugar transport and metabolism in prokaryotes. The gal [52] and lac [31, 32] operons in E. coli are the best studied of such systems. They both use the sugar molecule to inhibit the transcription factor
regulating transport and metabolism, the (+ – i) motif. In contrast, maltose [50] and arabinose [49] work by activating the regulation of transport and metabolism, the (+ – a) motif. In natural systems, transport and metabolic genes can be part of a single operon, as in lac [31], or separate operons, as in gal [52]. The latter arrangement allows non-coordinated regulation of transport and metabolism and therefore can be engineered to become bistable. This was also demonstrated by experiments on modified lactose and arabinose systems [53, 54], where the accompanying negative feedback loop was eliminated by inactivating E or using a non-metabolisable analogue of s, in agreement with our predictions from a similar cutting of the metabolic loop in Fig. 11. The fashion motif. As the fashion motif (– +) is indeed the opposite of the consumer motif, both logically and functionally, it is not surprising that we have not found any simple example of it in the regulation of small molecules in living cells. However, its behaviour (and the reason we call it the fashion motif) can be illustrated in terms of a market model for a product which is desirable in small amounts. In such a scenario, the resource, s, is analogous to a fashion product, E to the consumers, and T to the producers. R can be considered the value of the product, measured in terms of how much people desire it. When there is plenty of the product s in the market, its value R decreases, which in turn decreases its consumption (a positive metabolism feedback loop) as well as the desire amongst producers to make more of it (a negative transport feedback loop), making it a (– +) motif. The non-monotonicity of the flux of the fashion motif translates in this analogy to a saturation of the market when a fashion product becomes too abundant: Fashion products are most profitable when their availability is below a certain threshold. 
When the fashion motif is supplemented with a positive feedback of R to itself, the collapse of fashion goods can occur with a remarkably small change in external supply, which is reminiscent of fashion “bubbles” in society [55]. Although the fashion motif does not make much rational sense for small molecule response systems, it may be seen as a mechanism for coherent behaviour in social organization. The collector motif. The collector motif (+ +) is the logical opposite of the (– –) motif. Functionally it allows accumulation of a large amount of s, and is thus also functionally opposite to the socialist motif. Accumulation could be important for short periods of time, for instance, when an animal is preparing for hibernation. However, in such cases the (+ +) motif should eventually be overridden by another system which starts the consumption of the molecule. Such double positive feedback loops may be found in transcription regulatory networks and circuits involved in development and cell differentiation, but we failed to find any examples of them in small molecule regulation. Turning to a human analogy, the collector motif can be illustrated by making an analogy between s and the weight of a person. Then this weight increases with the intake of food (the analogue of transport), and is consumed by exercise (the analogue of metabolism). In this analogy R represents the internal “state” of the person, his or her mindset. An increase in a person’s weight, s, increases, via this internal state, their likelihood to eat more (positive transport feedback
loop) and also decreases their chance to exercise (positive metabolism feedback loop), thus forming a collector motif. The bistable behaviour of the collector motif would then contribute to a broadening of the weight distribution in human populations [60]. 4.1 Two-Loop Motifs are More Than the Sum of Their Single Loops Figure 11 also shows the behaviour of individual loops in these motifs, obtained by keeping either E or T fixed, thereby cutting feedback in one of the loops. The near constant value of s in (– –) comes from the metabolic loop’s ability to constrain s for low σ, and the transport loop’s ability to constrain s at high σ. Thus, the functionality of (– –) is dominated by the sub-motif that best prevents large variation of s and flux. The (+ –) obtains a steady increase in s and a step-like increase in flux with σ by using the negative metabolic loop’s ability to “smooth out” the bistability associated to the positive transport loop. The (– +) motif exhibits a remarkable non-monotonic behaviour of flux, which cannot be obtained from any of the sub-motifs. The (+ +) motif maximizes bistability, by extending it to the extreme of the two bistable regions of its sub-motifs. Overall, we can conclude that whole two-loop motifs are more than a simple sum of their parts. 4.2 Going Beyond Two Loops Our analysis of two entangled feedback loops creates a framework for analysing small molecule regulatory circuits composed of multiple entangled feedback loops. For instance, the regulation of iron in E. coli, while being dominated by interactions that form a socialist motif [56, 57], also contains a positive feedback on the metabolism side involving usage of iron in FeS clusters [58]. An investigation of this three-loop motif suggests that two metabolism loops, connected like this in “parallel” (as opposed to the “series” connection between a transport and metabolism loop), are additive in behaviour [13, 59]. Due to this additiveness, iron regulation in E. 
coli is able to minimise variation of both the concentration of iron (a property of the socialist part) and the flux (a property of the fashion part) [56]. This suggests that an interesting extension of these ideas would be to formulate “design principles” for combinations of parallel and serially connected feedback loops.
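The flux balance quoted in the caption of Fig. 11, σT = γEs + s, fixes the steady state of s once T and E are specified as functions of s. The Python sketch below solves this balance by bisection for any of the four sign combinations. It is only an illustration: the Hill-type regulation functions, the parameter values (`K`, `h`, `base`, `gamma`) and all function names are assumptions made here, and the regulator R is folded into the dependence of T and E on s. The functional forms actually used for Fig. 11 are those of [13].

```python
def hill_up(s, K=1.0, h=2, base=0.1):
    """Assumed activating response: rises from base to base + 1 with s."""
    return base + s**h / (K**h + s**h)


def hill_down(s, K=1.0, h=2, base=0.1):
    """Assumed repressing response: falls from base + 1 to base with s."""
    return base + K**h / (K**h + s**h)


def net_flux(s, sigma, motif, gamma=1.0):
    """sigma*T - gamma*E*s - s for a motif written as 'transport, metabolism'.

    Motifs are given with ASCII signs, e.g. "+-" for the consumer.
    """
    transport, metabolism = motif
    # Positive transport loop: more s gives more T.
    T = hill_up(s) if transport == "+" else hill_down(s)
    # Negative metabolism loop: more s gives more E, which removes s faster.
    E = hill_up(s) if metabolism == "-" else hill_down(s)
    return sigma * T - gamma * E * s - s


def steady_state(sigma, motif, gamma=1.0, tol=1e-10):
    """One root of net_flux by bisection (bistable motifs may have several)."""
    # net_flux is positive at s = 0 and negative at the upper bracket,
    # because T never exceeds base + 1 = 1.1 with the defaults above.
    lo, hi = 0.0, 1.1 * sigma + 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if net_flux(mid, sigma, motif, gamma) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Scanning `steady_state(sigma, "--")` over a range of `sigma` gives a curve in the spirit of the middle column of Fig. 11. For the motifs with a positive loop, bisection returns only one of the possible steady states, so a full treatment of bistability would need all roots.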
5 Concluding Remarks
To extract a useful network representation to describe a particular cellular system, it is necessary to ascertain the sensible level of coarse-graining for that system: is it the whole-cell network, individual proteins/genes or something
in between? There is, of course, no one answer to this question. In the examples above we have looked at a wide range of scales, from the entire E. coli network, to three or four component sub-networks, down to nucleosomes on DNA. On all these scales the dynamical behaviour is, however, constrained first by the available communication channels, and second by the logical properties of feedback loops in the network. To summarise, we extract the following main “lessons” from our case studies:
• The E. coli protein network is highly modular.
• The real E. coli network is more “stringy” than the randomised version, and this reduces constraints on signal propagation.
• Most feedback loops go through small molecules; there are very few in the transcription network.
• Biological function is coupled to the logic (positive/negative) of the feedback.
• Entangled feedback loops are “more” than a simple sum of their parts.
Acknowledgments
We thank our collaborators, with whom much of the work described here was done: J. Axelsen, I. Dodd, M. Micheelsen, S. Pigolotti, S. Semsey, G. Thon and G. Tiana. We acknowledge support from The Danish National Research Foundation and the Villum Kann Rasmussen Foundation.
References
1. S. Bornholdt and H.G. Schuster, eds., Handbook of Graphs and Networks: From the Genome to the Internet, Wiley-VCH, Weinheim (2002).
2. E. Ravasz, A.L. Somera, D.A. Mongru, Z.N. Oltvai and A.-L. Barabasi, Science, 297, 1551–1555 (2002).
3. S. Maslov and K. Sneppen, Science, 296, 910–913 (2002).
4. K. Sneppen, A. Trusina and M. Rosvall, Europhys. Lett., 69, 853 (2005).
5. A. Trusina, S. Maslov, P. Minnhagen and K. Sneppen, Phys. Rev. Lett., 92, 178702 (2004).
6. J.B. Axelsen, S. Bernhardsson and K. Sneppen, BMC Systems Biology, 2, 25 (2008).
7. S.S. Shen-Orr, R. Milo, S. Mangan and U. Alon, Nat. Genetics, 31, 64–68 (2002).
8. A. Samal, S. Singh, V. Giri, S. Krishna, N. Raghuram and S. Jain, BMC Bioinformatics, 7, 118 (2006).
9. S. Singh, A. Samal, V. Giri, S. Krishna, N. Raghuram and S. Jain, Eur. Phys. J. B, 57, 75–80 (2007).
10. A. Hoffmann, A. Levchenko, M.L. Scott and D. Baltimore, Science, 298, 1241–1245 (2002).
11. S. Krishna, M.H. Jensen and K. Sneppen, Proc. Natl. Acad. Sci. USA, 103, 10840–10845 (2006).
12. E. Aurell, S. Brown, J. Johansen and K. Sneppen, Phys. Rev. E, 65, 51914 (2002).
13. S. Krishna, S. Semsey and K. Sneppen, Proc. Natl. Acad. Sci. USA, 104, 20815–20819 (2007).
14. K.B. Arnvig, S. Pedersen and K. Sneppen, Phys. Rev. Lett., 84, 3005 (2000).
15. G. Tiana, M.H. Jensen and K. Sneppen, Eur. Phys. J. B, 29, 135 (2002).
16. M.H. Jensen, G. Tiana and K. Sneppen, FEBS Letters, 541, 176 (2003).
17. P.D. Karp et al., Nucl. Acids Res., 35, 7577–7590 (2007).
18. J.B. Axelsen, S. Krishna and K. Sneppen, J. Stat. Mech., P01018 (2008).
19. L.H. Hartwell, J.J. Hopfield, S. Leibler and A.W. Murray, Nature, 402(6761), C47–52 (1999).
20. S. Maslov, K. Sneppen and I. Ispolatov, New J. Phys., 9, 273 (2007).
21. S. Maslov and I. Ispolatov, Proc. Natl. Acad. Sci. USA, 104, 13655–13660 (2007).
22. S. Krishna, A.M.C. Andersson, S. Semsey and K. Sneppen, Nucl. Acids Res., 34, 2455 (2006).
23. R. Thomas, Quantum noise, Springer Series in Synergetics 9, Ed. Gardiner, Springer, Berlin, pp. 180–193 (1981).
24. E.H. Snoussi, J. Biol. Syst., 6, 3–9 (1998).
25. J.L. Gouzé, J. Biol. Syst., 6, 11–15 (1998).
26. J.E. Ferrell Jr., Curr. Opin. Cell Biol., 14, 140–148 (2002).
27. D. Angeli, J.E. Ferrell and E.D. Sontag, Proc. Natl. Acad. Sci. USA, 101, 1822–1827 (2004).
28. F.J. Isaacs, J. Hasty, C.R. Cantor and J.J. Collins, Proc. Natl. Acad. Sci. USA, 100, 7714–7719 (2003).
29. H. Hirata, S. Yoshiura, T. Ohtsuka, Y. Bessho, T. Harada, K. Yoshikawa and R. Kageyama, Science, 298, 840–843 (2002).
30. S.L. Harris and A.J. Levine, Oncogene, 24, 2899–2908 (2005).
31. F. Jacob and J. Monod, J. Mol. Biol., 3, 318–356 (1961).
32. P. Wong, S. Gladney and J.D. Keasling, Biotechnol. Prog., 13, 132–143 (1997).
33. H.L. Pahl, Oncogene, 18, 6853–6866 (1999).
34. M. Ptashne, A Genetic Switch: Phage Lambda Revisited, Cold Spring Harbor Laboratory Press, Cold Spring Harbor (2004).
35. S. Pigolotti, S. Krishna and M.H. Jensen, Proc. Natl. Acad. Sci. USA, 104, 6533–6537 (2007).
36. M. Schnarr et al., Biochimie, 73, 423–431 (1991).
37. M.B. Elowitz and S. Leibler, Nature, 403, 335–338 (2000).
38. D.E. Nelson, A.E.C. Ihekwaba, M. Elliott, J.R. Johnson, C.A. Gibney, B.E. Foreman, G. Nelson, V. See, C.A. Horton, D.G. Spiller et al., Science, 306, 704–708 (2004).
39. G. Tiana, S. Krishna, S. Pigolotti, M.H. Jensen and K. Sneppen, Phys. Biol., 4, R1 (2007).
40. C.Y. Huang and J.E. Ferrell Jr., Proc. Natl. Acad. Sci. USA, 93, 10078–10083 (1996).
41. A. Goldbeter and D.E. Koshland, Proc. Natl. Acad. Sci. USA, 78, 6840–6844 (1981).
42. G. Felsenfeld and M. Groudine, Nature, 421, 448 (2003).
43. A.T. Annunziato, J. Biol. Chem., 280, 12065 (2005).
44. G. Thon and T. Friis, Genetics, 145, 685 (1997).
45. I.B. Dodd, M.A. Micheelsen, K. Sneppen and G. Thon, Cell, 129, 813–822 (2007).
46. G. Thon, P. Bjerling, C.M. Brunner and J. Verhein-Hansen, Genetics, 161, 611 (2002).
47. B.H. Zimm, Proc. Natl. Acad. Sci. USA, 45, 1601 (1959).
48. H.A. Scheraga, Pure and Applied Chemistry, 36, 1 (1972).
49. R. Schleif, Trends Genet., 16, 559–565 (2000).
50. E. Richet and O. Raibaud, EMBO J., 8, 981–987 (1989).
51. E. Massé and M. Arguin, Trends Biochem. Sci., 30, 462–468 (2005).
52. M.J. Weickert and S. Adhya, Mol. Microbiol., 10, 245–251 (1993).
53. E.M. Ozbudak, M. Thattai, H.N. Lim, B.I. Shraiman and A. van Oudenaarden, Nature, 427, 737–740 (2004).
54. W.P. Smits, O.P. Kuipers and J.W. Veening, Nat. Rev. Microbiol., 4, 259–271 (2006).
55. R. Donangelo and K. Sneppen, Physica A, 316, 581–591 (2002).
56. S. Semsey, A.M.C. Andersson, S. Krishna, M.H. Jensen, E. Massé and K. Sneppen, Nucl. Acids Res., 34, 4960–4967 (2006).
57. N. Mitarai, A.M.C. Andersson, S. Krishna, S. Semsey and K. Sneppen, Phys. Biol., 4, 164–171 (2007).
58. F.W. Outten, O. Djaman and G. Storz, Mol. Microbiol., 52, 861–872 (2004).
59. M. Werner, S. Semsey, K. Sneppen and S. Krishna, preprint (2008).
60. U.S. EPA Exposure Factors Handbook, 1997, http://www.epa.gov/ncea/efh/
Topographic Spreading Analysis of an Empirical Sex Workers’ Network
Johannes Bjelland¹, Geoffrey Canright¹, Kenth Engø-Monsen¹, and Valencia P. Remple²
¹ Telenor R&I, 1331 Fornebu, Norway
² BC Centre for Disease Control Epidemiology, University of British Columbia, Vancouver, BC, Canada
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_6, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
1 Introduction
The problem of epidemic spreading over networks has received considerable attention in recent years, due both to its intrinsic intellectual challenge and to its practical importance. A good recent summary of such work may be found in Newman [8], while [9] gives an outstanding example of a non-trivial prediction which is obtained from explicitly modeling the network in the epidemic spreading. In the language of mathematicians and computer scientists, a network of nodes connected by edges is called a graph. Most work on epidemic spreading over networks focuses on whole-graph properties, such as the percentage of infected nodes at long time. Two of us have, in contrast, focused on understanding the spread of an infection over time and space (the network) [1, 2, 3]. This work involves decomposing any given network into subgraphs called regions [1]. Regions are precisely defined as disjoint subgraphs which may be viewed as coarse-grained units of infection, in that once one node in a region is infected, the progress of the infection over the remainder of the region is relatively fast and predictable [3]. We note that this approach is based on the ‘Susceptible-Infected’ (SI) model of infection, in which nodes, once infected, are never cured. This model is reasonable for some infections, such as HIV, which is one of the diseases studied here. We also study gonorrhea and chlamydia, for which a more appropriate model is Susceptible-Infected-Susceptible (SIS) [7] (since nodes can be cured); we discuss the limitations of our approach for these cases below. In this paper we apply the “topographic” regions-analysis approach to an empirical sex network, built from interviews with female sex workers (FSWs) in Vancouver, Canada. (See [3] for a detailed discussion of the “topographic”
J. Bjelland et al.
approach.) The network consists of the FSWs themselves, plus their sex partners (paid and unpaid), as well as any partners of these partners which were known to the FSW. This method, beginning with 49 interviewed FSWs, gave a highly connected network of 553 nodes [10]. Furthermore, STI (sexually transmitted infection) status was obtained for many of these nodes. In particular, two of the nodes were identified as being HIV-positive, while 11 other nodes have either gonorrhea, chlamydia, or both. From the collected network data we build an adjacency matrix, where element aij = 1 if i has a link to j, and is zero elsewhere. (In the case of a weighted graph, element aij equals the strength of the link from node i to node j.) The principal eigenvector of the adjacency matrix is a measure of a node’s centrality in the graph and is called the eigenvector centrality or EVC. The EVC scores for the nodes in the (weighted or unweighted) network give the starting point for our approach: they are used for assigning the nodes to regions, and for predicting the spreading of disease within and between regions. The aims of this work are several. One goal is to extend our earlier topographic approach to a graph with weighted links. As we will see, this seemingly small change can have very large effects; but we will also see that the validity of our approach is confirmed, in spite of these large effects. This is because the modified approach (presented here for the first time) is consistent: we use the link weights to modify the graph’s adjacency matrix, and hence the nodes’ EVC values; and we use them again when we define the regions via the steepest-ascent graph (SAG). A second aim of this work is to try to exploit the insights gained from the topographic analysis, in order to find novel suggestions for preventive actions to hinder the spread of the disease in question. We find that our progress towards this second goal is considerably more modest than that towards the first goal. 
We will show “thought experiments,” based on the empirical graph topology and link strengths, for which our analysis is extremely useful. However, we will not find practical suggestions which are immediately promising for the given Vancouver FSW graph. There are several reasons for this. First, the HIV graph is so thoroughly protected by condom use that we find little to add in terms of ideas for preventive measures. Second, the graphs for gonorrhea and for chlamydia are so thoroughly well connected, and also so well infected, that we do not find small topological changes which can make a large difference. We note that our approach treats the network as static; hence any effects of network dynamics are not taken into account. We believe, however, that our qualitative results are fairly robust to the likely dynamics of this network, since its overall structure is thought to be fairly stable over time. Also, our analysis (once the network is mapped out—which can be time consuming!) is not computationally demanding, and so may be performed in essentially zero time compared to the time scale of epidemic spreading. Hence any suggestions resulting from the analysis may be implemented in something approaching real time.
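The eigenvector centrality that serves as the starting point of the analysis can be computed by power iteration on the adjacency matrix. The sketch below is one way to do this in plain Python, assuming an undirected graph stored as a dictionary of link weights; that storage format and the function name `eigenvector_centrality` are assumptions for illustration. Because plain power iteration can oscillate on bipartite graphs, the sketch iterates with A + I, which has the same eigenvectors as A.

```python
import math


def eigenvector_centrality(adj, iters=1000, tol=1e-12):
    """Principal eigenvector of a symmetric, weighted adjacency structure.

    adj[i][j] is the strength of the link between i and j (1 if unweighted).
    The result is normalised to unit Euclidean length.
    """
    x = {node: 1.0 for node in adj}
    for _ in range(iters):
        # Multiply by (A + I): the identity shift leaves the eigenvectors
        # unchanged but avoids oscillation on bipartite graphs.
        y = {i: x[i] + sum(w * x[j] for j, w in adj[i].items()) for i in adj}
        norm = math.sqrt(sum(v * v for v in y.values()))
        y = {i: v / norm for i, v in y.items()}
        converged = max(abs(y[i] - x[i]) for i in adj) < tol
        x = y
        if converged:
            break
    return x
```

For an unweighted graph every entry is 1; replacing the entries with link strengths gives the weighted EVC without any other change to the code.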
Topographic Spreading Analysis
2 Uniform Transmission Model First we study the FSW graph without taking into account the link weights. That is, each sexual contact is given strength “1” in the adjacency matrix. This is logically equivalent to giving each link the same probability of transmission per unit time. Our purpose in doing this analysis is to be able to compare with the analysis done using non-uniform link strengths (transmission probabilities). As we will see, the differences are large and important. 2.1 Visualization and Bipartiteness Our topographic analysis includes a novel approach to graph visualization: we group the nodes into their respective regions, and lay out the whole graph according to the SAG [4]. We present the basic ideas here, and refer the reader to earlier papers [1, 3, 4] for details. We view the EVC of a node as a measure of its “well-connectedness” and hence of its “spreading power.” Then we single out local maxima of the EVC as being particularly important in spreading; we call these nodes Centers. Also, since EVC (being recursively defined) is “smooth,” we can speak of “neighborhoods” in the graph as having a typical EVC; and we conclude that spreading is fast in neighborhoods of high EVC, and slow in “lower” neighborhoods. We then define regions of the graph— one for each Center. Each node finds its region (mountain) by following a steepest-ascent path until it terminates at a local maximum (mountaintop, or Center). The set of steepest-ascent paths then forms a directed hierarchical tree graph (the SAG), which is useful both for visualizing the graph and for predicting the likely paths of fastest epidemic spreading. In a tree graph, any two nodes are connected by exactly one path, and there are no cycles (closed loops of links). The SAG for the unweighted FSW graph is shown in Fig. 1. We note several interesting points from this visualization. (i) there are many regions (17). (ii) All the Centers (most central nodes in each region) are men. 
(iii) Many regions are small, i.e., 1–3 nodes, while (iv) the bulk of the nodes (517/553) lie in one of the three largest regions (red—marked R in the figure, blue (B), dark grey (G)). (v) Every region is well connected to the largest, red, region. Hence the red region is expected to play a dominant role in any epidemic spreading. (vi) One HIV-positive node is in the red region, and the other is (while in its own region) well connected to the central part of the red region. Now we comment on these points. We believe that points (i)–(iii) derive from the fact that the graph is nearly bipartite. A bipartite graph consists of two sets of nodes, such that all links are made between the two sets, and there are no links between nodes in the same set. Now we suppose (which is almost true) that the FSW graph is a strictly bipartite graph composed of M and F nodes. If we further assume that an M node is a Center (local maximum of centrality), then all of its neighbors are (a) female, (b) highly central,
J. Bjelland et al.
Fig. 1. Regions visualization of the FSW network, with all links set to equal strength. Only the links in the SAG are shown here for visual clarity. The most central node in each region is enlarged. The three largest regions, which will be discussed further in the text, are labeled R (red), G (grey), and B (blue). Node symbols distinguish male and female nodes.
and (c) automatically excluded from being a Center. Thus bipartiteness will tend to favor one gender over another. By the same token, highly central M nodes are never neighbors of other M Centers, and so are candidate Centers themselves. Hence there may be a tendency for more, and smaller, regions. Points (iv)–(vi) tell us that this network is highly prone to infection: the many regions are not well isolated from one another, because of their common connection to the dense, infectious red region. Also, the two start nodes are in or near the central part of the red region, where spreading is fast. 2.2 Infectious Spreading on the Unweighted Graph We have simulated spreading on the uniform FSW network, by giving each link the same probability per unit time for spreading. The value used is thus arbitrary, as is the unit of time. We typically use a value of a few percent, since much larger values give a very unsmooth time evolution (equivalent to a poor time resolution). We report the results here because they are illustrative of the strengths and weaknesses of our method, for the case of multiple regions. (For reasons given below, these are the only multi-region simulations that we can perform with this graph.)
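The uniform spreading rule described above can be sketched in a few lines. This is only a minimal illustration, assuming synchronous updates and an undirected edge list; the function name `simulate_si`, the toy edge list, and the value p = 0.05 are our own illustrative choices and are not taken from the FSW data.

```python
import random

def simulate_si(edges, start_nodes, p, steps, seed=0):
    """Run one SI epidemic with the same transmission probability p on every link.

    Returns the number of infected nodes after each time step (index 0 = start).
    """
    rng = random.Random(seed)
    infected = set(start_nodes)
    history = [len(infected)]
    for _ in range(steps):
        newly = set()
        for u, v in edges:
            # A link can transmit only if exactly one of its endpoints is infected.
            if (u in infected) != (v in infected) and rng.random() < p:
                newly.add(v if u in infected else u)
        infected |= newly  # synchronous update: new infections act from the next step
        history.append(len(infected))
    return history

# Hypothetical toy graph (not the FSW network).
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 3)]
toy_history = simulate_si(toy_edges, start_nodes=[0], p=0.05, steps=500)
```

With a per-step probability of a few percent, many steps pass between successive infections on any one link, which is why the resulting growth curves are smooth.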
Topographic Spreading Analysis
Fig. 2. Same visualization as Fig. 1, except that all links are shown. The arrows mark the known HIV-positive nodes.
Taking the start (infected at t = 0) nodes as shown in Fig. 2 above, we find, as expected, that the regions as we define them here are again valid coarse units of infection. We also find that it is difficult to stop or even retard the infection, because of the topology of the graph. The upper part of Fig. 3 shows a typical epidemic progression, with the growth in the red, blue, and grey regions resolved. All three “take off” at about the same time, and the infection spreads rapidly. Measures to retard spreading in the red region— without resorting to large topological change—are not found to be effective. We find however that protecting one node—the Center of the grey region— drastically weakens the red/grey connection. We see in the bottom part of Fig. 3 the results when this is done: the red and blue regions take off as before, but the grey region’s takeoff is greatly retarded. This is an example of the kind of benefit that we believe can be obtained from our analysis. We also considered the more promising problem of an infection starting in the grey region—again motivated by the observed red ⇐⇒ grey bottleneck in the topology. The top of Fig. 4 shows that takeoff is retarded by a factor of about 3, compared to the former case (top of Fig. 3). It is retarded even further (about 7 times as slow) if we in addition protect the grey Center (bottom of Fig. 4).
Fig. 3. HIV spreading simulation without (top) and with (bottom) measures to isolate the grey region from the red region. In each plot, there are four growth curves, showing the total growth of the infection (‘Sum’), and the growth for the red (R), grey (G), and blue (B) regions (the largest regions in the network).
Fig. 4. Same simulation as Fig. 3, except that the infection starts from a peripheral node in the grey region.
3 Links Weighted with Transmission Probabilities In this section we add an important further element of realism by weighting the links of our FSW graph with transmission probabilities. We are forced in many cases to use rather crude approximations. Nevertheless, we feel that the resulting model is considerably closer to reality than the uniform model. Also (as we will see) it is strikingly different—in particular, each disease will have its own graph. That is, while the basic topology is the same as that in Fig. 2, the set of link weights depends on the disease—because these weights represent transmission rates (probability/time). In fact, for the HIV case, the topology itself is changed, since we set some link strengths to exactly zero. In practice, incorporating the link strengths into the analysis involves (1) building a weighted adjacency matrix W using the link strengths, (2) finding the corrected EVC as the dominant eigenvector of this matrix W , and (3) redefining “steepest ascent” to take account of the varying link strengths. The first two steps are clear; and we describe step (3) in Section 3.2. Of course, before doing any of this, we must find the link strengths. We describe our procedure for doing so in the next section. 3.1 Estimating the Probabilities For each link we want a single weight (number) which gives the probability per unit time of transmission from an infected node to an uninfected node. This probability is based on a number of factors which must be estimated from limited data. We list these factors schematically as follows: Transmission probability/unit time = [(unprotected probability/contact)(non-condom use prevalence) × (contacts/time)] + [(protected probability/contact)(condom use prevalence) ×(contacts/time)] Now we discuss each factor in turn. For each disease (HIV, gonorrhea or ‘NG’, and chlamydia or ‘CT’) we estimate (unprotected probability/contact) from Ref. [6]. See Table 1. 
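The schematic formula above translates directly into a small function. The sketch below assumes that (non-condom use prevalence) equals one minus (condom use prevalence); the function name, the condom-use value, and the contact rate are hypothetical. Only the two per-contact probabilities for NG are taken from Table 1.

```python
def transmission_rate(p_unprotected, p_protected, condom_use, contacts_per_time):
    """Transmission probability per unit time for one link.

    condom_use is the fraction of contacts on this link that are protected;
    the unprotected fraction is assumed to be 1 - condom_use.
    """
    if not 0.0 <= condom_use <= 1.0:
        raise ValueError("condom_use must lie in [0, 1]")
    unprotected_term = p_unprotected * (1.0 - condom_use) * contacts_per_time
    protected_term = p_protected * condom_use * contacts_per_time
    return unprotected_term + protected_term

# NG per-contact values from Table 1; 80% condom use and 0.1 contacts/day are made up.
ng_rate = transmission_rate(0.43, 0.16, condom_use=0.8, contacts_per_time=0.1)
```

The sum is a good approximation to a probability only when it is small, which is the case for the link weights reported later in the chapter.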
To correct for condom use, we must know the frequency of condom use for each link (condom use prevalence). For 256 links (about 17% of them) we have an estimate for (condom use prevalence) from survey data [10]. We know very little about the remaining links, except for
Table 1. Transmission probabilities/contact for NG (gonorrhea), CT (chlamydia), and HIV.
Unprotected Protected
NG 0.43 0.16
CT HIV 0.10 0.05 0.074 0
whether they are a “client” relationship or a “non-client” relationship. We explain below how we generate link weights for the links for which we have no survey data. Estimates for (contacts/time) were available (again) for those links for which we obtained survey information; however, here we have yet another source of uncertainty. That is, each interviewed FSW reported contacts with “regulars” and also contacts with new or “non-regular” customers. We take the reported estimates of (contacts/time) for regulars as given. For the non-regulars, we assume that either (i) they will become regular in the future, or (ii) they will be replaced by other non-regular customers who play essentially the same role in the network. In short: we ignore the distinction between cases (i) and (ii). We still need a reasonable estimate of contacts/time for non-regulars. We proceed as follows: for each FSW, we define T to be the total number of contacts per unit time (summed over all neighbors). Also we let P be the percentage of contacts from regulars (expressed as a fraction), and let C be the number of contacts/time from regulars. Then clearly C = PT; and since we can estimate both P and C from the survey data, we get an estimate of T (= C/P). We then estimate the total contacts/time N for non-regulars to be N = T − C. Finally, we take, from the survey data, the expected number of non-regular neighbors (still for each FSW), and call this number K. We then get the expected contacts/time for each non-regular as N/K. Our model is clearly very crude, treating each non-regular in a very average way; but it enables us to move (as we will see) well beyond the equal-transmission-probability model, and so, we believe, much closer to reality. Now we come to the term due to protected sex. We estimate (protected probability/contact) by correcting the (unprotected probability/contact) data, using data for the correction due to condom use from [5].
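The arithmetic for the non-regular contacts (T = C/P, N = T − C, then N/K) can be written out as a short function. This is a sketch only: the function and argument names are ours, P is assumed to be given as a fraction between 0 and 1, and the numbers in the example are invented.

```python
def nonregular_contact_rate(regular_contacts, regular_fraction, n_nonregulars):
    """Expected contacts/time for each non-regular neighbor of one FSW.

    regular_contacts : C, contacts/time with regulars
    regular_fraction : P, fraction of all contacts that are with regulars
    n_nonregulars    : K, expected number of non-regular neighbors
    """
    if not 0.0 < regular_fraction <= 1.0:
        raise ValueError("regular_fraction must lie in (0, 1]")
    if n_nonregulars <= 0:
        raise ValueError("n_nonregulars must be positive")
    total = regular_contacts / regular_fraction      # T = C / P
    nonregular_total = total - regular_contacts      # N = T - C
    return nonregular_total / n_nonregulars          # N / K

# Invented example: C = 6, P = 0.75, K = 4 gives T = 8, N = 2, N/K = 0.5.
example_rate = nonregular_contact_rate(6.0, 0.75, 4)
```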
We note here that we set (protected probability/contact) for HIV to be exactly zero. Not surprisingly, this will have dramatic effects on the spreading behavior—as we will see in Section 3.3. This completes our prescription for estimating link weights for those links for which we have survey data. We then used a very simple approach—which we find appropriate to the high degree of uncertainty in our data—to estimate the remaining link weights (transmission probabilities/time). Our solution here is to first divide all links (surveyed and not surveyed) into two groups: client and non-client. Then, for each group, we simply reproduced the distribution over the “surveyed” links so as to also assign transmission probabilities to all of the “non-surveyed” links. Since the survey data is discrete, the link-weight distribution obtained is never smooth. Hence we reproduced these discrete distributions by simply repeating (sampling) each value in the discrete distribution with a probability equal to its frequency in the distribution. That is: we do not attempt to create distributions for each parameter in the link-weight estimate; instead we simply copy the discrete link weight values obtained from the survey data onto the unknown links, with appropriate probabilities.
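Copying the surveyed link weights onto the unsurveyed links amounts to sampling with replacement from the empirical weights within each group. The sketch below assumes a simple dictionary layout for the links; the function name and the toy values are hypothetical.

```python
import random

def assign_unsurveyed_weights(surveyed, unsurveyed, seed=0):
    """Give each unsurveyed link a weight drawn from the surveyed links of its group.

    surveyed   : dict mapping link -> (group, weight), group in {"client", "non-client"}
    unsurveyed : dict mapping link -> group
    """
    rng = random.Random(seed)
    pools = {}
    for group, weight in surveyed.values():
        pools.setdefault(group, []).append(weight)
    # Drawing uniformly from the list of observed weights (repeats included)
    # selects each distinct value with probability equal to its frequency.
    return {link: rng.choice(pools[group]) for link, group in unsurveyed.items()}

# Invented toy data, not the survey values.
toy_surveyed = {
    ("a", "b"): ("client", 0.01),
    ("a", "c"): ("client", 0.03),
    ("d", "e"): ("non-client", 0.2),
}
toy_unsurveyed = {("f", "g"): "client", ("h", "i"): "non-client"}
toy_assigned = assign_unsurveyed_weights(toy_surveyed, toy_unsurveyed)
```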
3.2 SAG∗ Now we address another complication arising from the use of weighted links: we must reconsider the definition of the steepest-ascent graph (SAG), which is used both for assigning region membership and for visualization purposes. Our point here is simple, namely that the definition of steepest ascent should take account of the link strength. This rather obvious point has not been addressed in our earlier use of the SAG [1, 2, 3], because these earlier studies were applied to unweighted graphs. Hence we offer a brief account here of the modification used for weighted links. We recall that region membership is assigned by in essence asking each node to find the steepest path to the “top,” i.e., to the “nearest” local maximum of the EVC. The notion of local maximum is independent of link strength. Suppose, however, that a node N has two local maxima (Centers, C1 and C2) as neighbors: which region do we place N in? Since we want steepest-ascent paths to represent most likely spreading, it seems reasonable that a neighbor C1 with a very weak link to N should not be assigned the steepest-ascent path, even if it is somewhat higher (in EVC) than C2. In other words, if we retain the notion that steepest ascent gives the right answer, then we clearly want to define the slope as being
\[ \text{slope} = \frac{\Delta y}{\Delta x}, \tag{1} \]
with $\Delta x$ (‘distance’) decreasing with increasing link strength. Clearly, $\Delta y$ is the EVC difference, as in earlier (unweighted) work; hence we simply need some reasonable definition for the “distance” $\Delta x$. We take here the simple heuristic $\Delta x(i, j) = 1/W(i, j)$, with $W(i, j)$ the link strength (transmission probability) between nodes i and j. Our point here is then that node N may find that it is not simply in the region of its highest neighbor: instead, it will be placed in the same region as the neighbor $N^*$ with the highest slope $\Delta y/\Delta x = [\mathrm{EVC}(N^*) - \mathrm{EVC}(N)]\, W(N, N^*)$. In short, if its link to the highest neighbor is very weak, then (reasonably) it will be placed instead in the region of a neighbor with a stronger link. We believe this is consistent with our aim for defining regions, namely, that a region is a coarse-grained unit of infection, such that infection within a region is relatively fast and predictable. We call the resulting steepest-ascent graph SAG∗ (to distinguish it from the SAG, which does not take link strengths into account). We will see below that our spreading simulations can only give a limited test of our SAG∗ definition, since in one case (HIV) the weighted network breaks down, while in the other two (NG and CT) we only obtain a single region. Hence, while we retain a belief that our definition is promising, a thorough test will have to await application to a weighted graph which (i) has several regions, but yet (ii) is better connected than our HIV graph of the following section.
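The weighted steepest-ascent rule can be illustrated with a short sketch. We assume a symmetric weight matrix W held as a dense NumPy array, and we use a dense eigensolver for the EVC purely for convenience; the function names and the toy numbers are ours, and nothing here should be read as the implementation used for the figures.

```python
import numpy as np

def evc(W):
    """Eigenvector centrality: dominant eigenvector of a symmetric weight matrix W."""
    vals, vecs = np.linalg.eigh(W)   # eigh returns eigenvalues in ascending order
    v = np.abs(vecs[:, -1])          # for a connected graph this vector is one-signed
    return v / v.sum()

def sag_star_parents(W, c):
    """For each node, the neighbor with the largest positive slope (c[j] - c[i]) * W[i, j].

    Nodes with no higher neighbor are Centers and point to themselves.
    """
    n = W.shape[0]
    parent = np.arange(n)
    for i in range(n):
        best = 0.0
        for j in np.nonzero(W[i])[0]:
            slope = (c[j] - c[i]) * W[i, j]   # dy/dx with dx = 1 / W(i, j)
            if slope > best:
                best, parent[i] = slope, j
    return parent

def region_of(parent):
    """Follow parent pointers uphill until a Center (a fixed point) is reached."""
    root = np.empty_like(parent)
    for i in range(len(parent)):
        r = i
        while parent[r] != r:
            r = parent[r]
        root[i] = r
    return root

# Toy case: node 0 has a weak link to the higher node 1 and a strong link to node 2.
W_toy = np.array([[0.0, 0.01, 0.5],
                  [0.01, 0.0, 0.0],
                  [0.5, 0.0, 0.0]])
c_toy = np.array([0.1, 0.9, 0.5])   # hand-picked heights, not the EVC of W_toy
toy_regions = region_of(sag_star_parents(W_toy, c_toy))
```

In the toy case the slope towards node 1 is 0.8 × 0.01 = 0.008, while the slope towards node 2 is 0.4 × 0.5 = 0.2, so node 0 joins the region of node 2 even though node 1 is higher. With all weights set to 1 the same code reduces to the unweighted SAG rule.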
3.3 HIV Graph The SAG∗ for our weighted HIV graph is shown in Fig. 5. We see immediately that the contrast with Fig. 1 is enormous. In particular, the 17 regions of Fig. 1 have multiplied many times. In addition (which is not so easily seen in the figure) some nodes are completely disconnected due to the zero-weight links, and hence do not appear in the figure at all. The apparently isolated nodes in the corner of the figure are one-node regions; such regions occur typically on the periphery of a graph, where all EVC values are small. What is even more striking is that adding all non-zero links to the SAG∗ picture of Fig. 5 makes very little change; that is, there are only six non-zero links which are not shown in the figure (four connecting the one-node regions to one other node each, and two other inter-region links). Hence we do not show the full graph: it is essentially that of Fig. 5. This means in turn that HIV spreading—while seemingly unstoppable in the picture obtained from Fig. 2—is in fact not a problem for this FSW network. In particular, the
Fig. 5. Regions analysis for the HIV graph, corrected with the transmission probability on each link. Note that the graph breaks into very many small regions, due to the (assumed) zero transmission probability for reliable condom use. The two enlarged nodes are known to be HIV-infected; the four nodes in the upper left corner are single-node regions in the weighted graph.
two HIV-positive (male) nodes (marked with large squares in Fig. 5) are each confined to an effective two-node network, consisting of themselves and their non-client partner. Hence our expected picture of condom use for this empirical network implies that HIV spreading will be limited to the non-client partner relationships of the two infected nodes, and so has effectively zero probability of reaching the rest of this dense sexual network. Because the effective graph is so fragmented, and also because the HIV-infected nodes are effectively isolated, we have not performed spreading simulations on the weighted HIV graph. We note that the largest region in Fig. 5 has 24 nodes, with an FSW as the most central node in the region. In fact the strongly bipartite picture obtained from the unweighted graph (Fig. 1) has also broken down here: both male and female Centers of the many regions are found. This is however not so surprising, given the fragmented nature of the effective graph. 3.4 Gonorrhea Figure 6 shows the steepest-ascent (SAG∗) graph when we use link strengths appropriate to gonorrhea. Since 100% condom use does not give 100% protection [5], the effective gonorrhea graph has all the same links as were present
Fig. 6. Region (SAG∗ ) visualization for the gonorrhea network NG. The enlarged nodes are known to be STI-infected.
in Fig. 2; but they are reweighted. We see that the reweighting has still had a dramatic effect. In particular, the 17 regions found for the unweighted graph are now a single region for the weighted graph. Also, the Center of this one region (and so of the entire graph) is an FSW. An interesting aspect of the gonorrhea SAG∗ is that one of the few existing homosexual (FSW ⇐⇒ FSW) links plays a very central role in the graph: the link between the Center and the head of the large red subregion is homosexual. This means that the two women involved are highly central in the weighted graph, and also that the link strength between them (transmission probability for gonorrhea) is not too small. One might then propose to remove this link— which (as it is certainly requested and paid for by a male customer) should be possible. However as we will see below, removal of this link—or any single link—has little or no beneficial effect. (This conclusion is perhaps intuitively grasped from the fully linked visualization of Fig. 7 below.) SAGs of either type are strict hierarchical structures—that is, they are directed trees, with links pointing strictly towards the root (Center). This means that, for any given region, one can readily define subregions in terms of branches of the tree. We have picked out the five largest branches of the gonorrhea SAG∗ and color coded them. We see that it is visually meaningful to think in terms of subregions for this region.
Fig. 7. Same layout as in Fig. 6, but with all non-zero links displayed.
Figure 7 shows the NG-graph again, but with all links displayed. We note that presently infected nodes are enlarged and marked yellow (lighter grey in printed version) in Fig. 6 and in Fig. 7. From Fig. 6 we see two infected nodes lying at the heads of their (large) respective subregions, and hence only one hop from the Center. Also we see that every major subregion is already infected. This immediately suggests that preventing the further spreading of gonorrhea on this graph will be quite difficult. This pessimistic prognosis is also supported by the visualization of Fig. 7. Here we see that all the major subregions are well connected to one another, with infected nodes lying in the heart of a dense cloud of links. We will test (and confirm) this pessimistic prediction via stochastic simulations; see Section 4. 3.5 Chlamydia In Fig. 8 we show the SAG∗ visualization of the chlamydia graph. Qualitatively we see much the same picture as for the NG graph: a single region, with an FSW at the Center of the region. In fact, the homosexual dyad that we found lying centrally in the NG graph is also central here, with the one difference that here the two FSWs have exchanged roles (Center and subregion head). Our SAG∗ visualizations suggest that the CT graph is perhaps even better connected than the NG graph, in that there are very few subregions,
Fig. 8. Region (SAG∗ ) visualization for the chlamydia network CT. Enlarged nodes are known STI-infected nodes.
and they are very large. And since (again) every major subregion is infected, we arrive at the same qualitative prognosis for this graph: it will be difficult to hinder the further spreading of the disease. We have also plotted the analog of Fig. 7 for chlamydia—that is, the full graph with all non-zero links. The result is again qualitatively like that of Fig. 7; hence we do not show it here.
4 Spreading on the Gonorrhea Graph For reasons already given, we have not run spreading simulations on all three disease graphs. The HIV graph is so heavily disconnected by the many condom-use-induced zero links that we see no point in running simulations on it. Of course, these links, involving as they do real sexual contact, do not have exactly zero probability for infectious spreading, even with 100% condom use. Also the reported rates of 100% condom use are most likely overstated in many cases. Hence it would be of interest to set the strength of these “zero HIV links” to some small but positive value, and to examine the resulting graph. We reserve this idea for future work. The remaining two graphs (NG and CT) are qualitatively very similar. Hence we have chosen to focus on one of them: the NG (gonorrhea) graph. We must emphasize immediately however that our simulations, being based on SI dynamics [8], do not accurately model the long-time dynamics of diseases such as gonorrhea and chlamydia. A more appropriate model would be the SIS model [7] in which Infected nodes become again Susceptible after a variable time period. We expect the SI model to give qualitatively correct results in the early stage of any infectious process, when few nodes are infected, and they have not had time to recover. Beyond this early stage the SI model can only overestimate the degree of spreading. Hence we present simulation results in this section, based on the SI model, with two principal caveats:
• Takeoff of the disease will likely occur later for the more realistic SIS model than what we show here.
• The long-time infected fraction will not approach 100%, but rather a lower value.
With these caveats clearly in mind, we present some simulations on the gonorrhea graph. Our aim is to see what insights we can gain from our SAG∗ picture. We will focus principally on when the infection takes off. Because we simply compare different scenarios (and their takeoff times) with one another, we feel that our (comparative) conclusions are not greatly weakened by the caveats given above. Our procedure for simulation is the same as before: at each time step, each link ij has a probability pij = W (i, j) of transmitting the infection if exactly one of the pair ij is already infected. Our link strength data, when the unit
of time is one day, have values which vary from a few percent down to about $10^{-4}$. With these small values we can increment the simulator with a time step of one day, and get smooth results. Our simulations differ from one another in three ways: (i) the choice of “start” nodes which are infected at t = 0; (ii) the choice of a set of “immune” nodes which cannot be infected; and (iii) sometimes, the choice of links which are to be blocked from transmission (removed). Choices (ii) and (iii) allow us to test various strategies for hindering spreading. In the real world of human sexual behavior, accomplishing either of these effects may be quite difficult; but we test them here simply to see what can be achieved. First we simulate the reference case, in which those nodes which are known to be infected are the start nodes (see again Figs. 6 and 7), and we immunize no nodes or links. We find (Fig. 9) that the infection takes off very fast, as anticipated in Section 3.4. Specifically, we see that the takeoff time is very short: just a few days. This is consistent with the fact that the infection has already reached three very central (as defined by EVC) nodes. This latter fact is consistent with two interpretations: either (i) the infection has recently come to this dense network, and it is on the verge of taking off, or (ii) the infection has been present for a long time, and has reached an equilibrium (and rather low) level.
[Plot for Fig. 9: infected nodes (0 to 600) vs. time (0 to 400); curves: As-is, Center, Within 1 hop from center, STI red region + head of region, 50 random.]
Fig. 9. Spreading simulations for gonorrhea, based on the SI model, and using various prevention strategies. “As-is” = known infected start nodes and no strategy; the other scenarios involve immunizing various nodes, as described in the text. The unit of time is one day.
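The simulation procedure with start nodes, immune nodes, and blocked links can be sketched as follows. This is an illustrative reconstruction under our own assumptions (synchronous daily updates, links stored as a dictionary of pairs); the function name `time_to_reach` and the toy chain are hypothetical.

```python
import random

def time_to_reach(weights, start, target, immune=(), blocked=(), max_steps=10_000, seed=0):
    """Weighted SI spreading; returns the first time step with at least `target` infected.

    weights : dict mapping (i, j) -> transmission probability per time step
    immune  : nodes that can never be infected (removed from `start` if present)
    blocked : links that are removed from transmission
    Returns None if the target is not reached within max_steps.
    """
    rng = random.Random(seed)
    immune = set(immune)
    blocked = {frozenset(link) for link in blocked}
    infected = set(start) - immune
    for t in range(max_steps + 1):
        if len(infected) >= target:
            return t
        newly = set()
        for (i, j), p in weights.items():
            if frozenset((i, j)) in blocked:
                continue
            if (i in infected) == (j in infected):
                continue  # transmission needs exactly one infected endpoint
            susceptible = j if i in infected else i
            if susceptible not in immune and rng.random() < p:
                newly.add(susceptible)
        infected |= newly
    return None

# Hypothetical chain 0 - 1 - 2 - 3 with certain transmission on every link.
chain = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}
```

Comparing the returned times across scenarios (different start nodes, immune sets, or blocked links) gives the kind of relative takeoff comparison discussed in this section.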
We do not have sufficient empirical information to favor one of these interpretations over the other. If the first one is correct, it implies that one can expect a strong growth of infection rate in a relatively short time. If the second is correct, then our model is likely inadequate, not only in the SI aspect but probably in other aspects as well. We remind the reader that our topographic analysis is most useful in understanding the spreading of new infections over fairly static networks; hence it may be useful in case (i), but has little to say about case (ii). Now, in order to test our ideas further, we assume case (i). Based on our SAG∗ picture, we formulate various immunization strategies and test them via simulation. We have tried (a) immunizing the Center node; (b) immunizing the Center and all nodes within one hop of the Center (subregion heads); (c) immunizing the two infected nodes in the large red subregion, plus that subregion’s head node; and (d) immunizing 50 nodes chosen at random. Results for all of these cases are shown in Fig. 9. A simple conclusion is starkly obvious: none of these immunization strategies is able to retard the takeoff. In fact, the only clear difference is the trivial and useless one: that the long-time infected fraction is reduced by the number of immunized nodes [for example, by 14 for scenario (b), and by 50 for scenario (d)]. In short: as strongly suggested by Fig. 7, the NG network is sufficiently well connected, and sufficiently well infected, so that we find no simple strategy which is at all effective in retarding the takeoff. In order to investigate a different kind of test of the utility of our method of analysis, we next “cure” all infected nodes, and explore scenarios in which we can choose the start nodes freely. Our principal aim is to test the following hypothesis: that time to takeoff is strongly determined by distance from the Center of the SAG∗ . Some simple tests of this hypothesis are shown in Fig. 10. 
Here we show the progression of infection for three scenarios: (e) the Center is the only infected start node; (f) a node roughly halfway between the Center and the periphery is the start node; and (g) a very peripheral node is the start node. The results of Fig. 10 strongly support our hypothesis. Takeoff times vary from a few days to about 50 days to almost 150 days, as we move the start node outward in the SAG∗ . We also see, in the bottom half of the figure, that our earlier picture [3, 2] of the movement of the infection “front” over the topography is confirmed here: the infection [assuming it doesn’t start at the top as in (e)] moves slowly at first, until it begins to reach more central nodes, at which point it speeds up, while moving “uphill” (towards the Center); subsequently it moves “downhill,” slowing down all the while. While we have seen this dynamic pattern many times before, this is the first time we have tested it on a graph with weighted links (and with the EVC appropriately corrected via the weighted adjacency matrix). While Fig. 10 offers anecdotal evidence for our hypothesis, we also have statistical data. We have in fact run one-start-node simulations for each node on the graph, 10 times for each node, and recorded the average time needed to
[Plot for Fig. 10, top panel: infected nodes (0 to 600) vs. time (0 to 400); curves: Central node, Medium central, Not central. Bottom panel: mean EVC of infected nodes (0 to 0.2) vs. time (0 to 400).]
Fig. 10. Three spreading simulations, based on three chosen scenarios, each with a single start node. We see that distance from the Center node (in a metric defined by the SAG∗ ) correlates strongly with time to takeoff. The lower part of the figure shows the average EVC of the newly infected nodes.
reach an infection number of 300 nodes (about 60%). To measure “distance” from the Center, we define the dual notion of “closeness”: a node’s closeness to the Center is simply the product of the link strengths over the (unique) path to the Center in SAG∗ . Thus many weak links give low closeness, while few strong links give high closeness; and both the number of hops and the link strengths of the hops affect the result. Figure 11 gives a scatter plot for average infection time vs closeness, for all nodes in the graph except the Center node. We see a strong decreasing relationship: closer nodes need less time to infect the graph. Thus we find from these results further strong support for our hypothesis.
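The closeness measure defined above is a product of link strengths along the unique SAG∗ path to the Center. A minimal sketch, assuming the SAG∗ is stored as a parent map and the link strengths as a dictionary keyed by unordered pairs (both layouts, and all names and numbers, are our own):

```python
import math

def closeness_to_center(node, parent, weights):
    """Product of link strengths along the SAG* path from `node` to its Center.

    parent  : dict mapping node -> its SAG* parent (a Center maps to itself)
    weights : dict mapping frozenset({i, j}) -> link strength
    """
    closeness = 1.0
    while parent[node] != node:
        closeness *= weights[frozenset((node, parent[node]))]
        node = parent[node]
    return closeness

# Invented toy tree: b -> a -> c, with c the Center.
toy_parent = {"c": "c", "a": "c", "b": "a"}
toy_weights = {frozenset(("a", "c")): 0.02, frozenset(("b", "a")): 0.001}
toy_closeness = closeness_to_center("b", toy_parent, toy_weights)
```

Here node b is two hops from the Center, and its closeness is 0.001 × 0.02. Because each factor is a small probability, closeness falls off quickly with the number of hops, so a logarithmic axis is a natural choice when plotting it.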
5 Summary and Discussion In this chapter we have extended the topographic approach to the problem of epidemic spreading over networks to a problem involving two new features. First, the network is real: it is an empirical sex network, with some nodes known to be infected with the STIs HIV, gonorrhea, and chlamydia. Second, we have data which allow us to assign non-uniform link strengths (transmission probabilities), and we have generalized the topographic approach to incorporate these link strengths.
[Plot for Fig. 11: scatter plot on logarithmic axes; vertical axis: time ($10^{1}$ to $10^{3}$); horizontal axis: closeness ($10^{-12}$ to $10^{0}$).]
Fig. 11. Time needed for a single start node to infect 300 nodes, as a function of that start node’s “closeness” to the graph’s Center (averaged over 10 experiments for each start node). Closeness is measured entirely in terms of the modified steepest-ascent graph SAG∗ . We see a thorough statistical corroboration of the results of Fig. 10.
To help in illuminating the effects of incorporating link strengths, we first performed the analysis by ignoring these weights. We visualized the resulting unweighted FSW network, and simulated the progress of HIV on this network (using uniform transmission probabilities). We found some interesting effects from the almost-bipartite nature of the unweighted network. We also found that the network is very highly connected—with the two HIV-infected nodes very close to the network’s Center—so that retarding the spread of HIV was difficult. Nevertheless we were able to show significant benefits to be obtained from our analysis, for some hypothetical cases involving start nodes placed elsewhere. Incorporation of empirically obtained link strengths had large consequences. Each disease yielded a distinct weighted graph, by affecting the transmission probabilities. We found (using our assumption that perfect condom protection was possible) that the HIV graph broke down into many small components. While our visualization may still have some value, we saw no value in running simulations on these small components.
Simulations on the gonorrhea graph gave results much like those on the unweighted FSW graph: the graph was very well connected, and the already-infected nodes had rather central positions. The result was that we were unable to find simple topological fixes, inspired by our analysis, which could significantly retard spreading. However, we were able to find strong evidence confirming the basic applicability of our analysis to spreading. Specifically, we showed that our own notion of a node’s distance from the Center of the graph correlated strongly with the time needed for that node to infect the graph. We emphasize that this is the first application of the topographic approach to a weighted graph. Performing this analysis has required generalizing our earlier definition [3] of steepest ascent. The results we obtain here, based on this new, generalized definition, are very promising. Hence, even as we fail to come up with promising, concrete suggestions for hindering the spread of STIs in the Vancouver sex network, we feel that our results confirm the applicability of our approach to understanding spreading in the real-world case of a network with non-uniformly weighted links. We see a clear need for two obvious extensions of this work. First, it would be useful to reconnect the HIV graph, by assigning small but non-zero probabilities to the 100%-condom-use links. This would allow for a more meaningful regions analysis and the accompanying testing by simulations (perhaps over a long time scale). Second, our approach is most simply understood and applied for diseases for which SI spreading is appropriate (such as HIV). The application to gonorrhea or chlamydia would be greatly strengthened if one could generalize the method to the SIS and/or SIR case. This is an interesting challenge for future work. The data used derive from self-reported infection status [10].
To validate our model, empirically collected retrospective data on actual prevalence and incidence of the infections could be obtained. This is also recommended for future work. Finally, we remind the reader of the motivation for this work. We believe that the topographic analysis, based on EVC, is extremely useful for understanding epidemic spreading on a coarse scale. The analysis itself is not computationally demanding; hence it can be performed in essentially real time. Thus, we hope that our approach can be useful for disease prevention, in those cases for which the network can be mapped in reasonably short time—that is, short compared to both the time scale for infectious spreading, and the time scale for significant topology changes. The results presented here do not offer any immediate solution to the problem of STIs in the Vancouver FSW network, but they do add further support to our belief that this approach may be useful for this problem, and for others.
J. Bjelland et al.
Acknowledgments
GC and KEM acknowledge partial support from the Future and Emerging Technologies unit of the European Commission through Project DELIS (IST-2002-001907). VPR acknowledges the financial and in-kind support, respectively, of the BC Medical Services Fdn and HIV/STI Prevention and Control, BC Centre for Disease Control.
References
1. G. Canright and K. Engø-Monsen. Roles in networks. Science of Computer Programming, pages 195–214, 2004.
2. G. Canright and K. Engø-Monsen. Epidemic spreading over networks: a view from neighbourhoods. Telektronikk, 101:65–85, 2005.
3. G. Canright and K. Engø-Monsen. Spreading on networks: a topographic view. In Proceedings, European Conference on Complex Systems, 2005.
4. G. S. Canright and K. Engø-Monsen. Some relevant aspects of network analysis and graph theory. In J. Bergstra and M. Burgess, editors, Handbook of Network and Systems Administration. Elsevier, Amsterdam, 2007.
5. K. Holmes, R. Levine, and M. Weaver. Effectiveness of condoms in preventing sexually transmitted infections. Bull World Health Organ, 82:454–461, 2004.
6. A. M. Jolly, M. E. Moffatt, M. V. Fast, and R. C. Brunham. Sexually transmitted disease thresholds in Manitoba, Canada. Ann Epidemiol, 15:781–788, 2005.
7. M. Kretzschmar, Y. T. P. H. van Duynhoven, and A. J. Severijnen. Modeling prevention strategies for gonorrhea and chlamydia using stochastic network simulations. American Journal of Epidemiology, 144:306–317, 1996.
8. M. Newman. The structure and function of complex networks. SIAM Review, 45:167–256, 2003.
9. R. Pastor-Satorras and A. Vespignani. Epidemic spreading in scale-free networks. Phys Rev Lett, 86:3200–3203, 2001.
10. V. P. Remple, D. M. Patrick, C. Johnston, M. W. Tyndall, and A. Jolly. Clients of indoor commercial sex workers: Heterogeneity in patronage patterns and implications for HIV and STI propagation through sexual networks. Sexually Transmitted Diseases, May 2007.
Spectral Characterization of Network Structures and Dynamics
Anirban Banerjee1 and Jürgen Jost2
1 Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany; [email protected]
2 Max Planck Institute for Mathematics in the Sciences, Inselstr. 22, 04103 Leipzig, Germany, and Santa Fe Institute, Santa Fe, NM 87501, USA; [email protected]
1 Introduction
Mathematically, graphs defy a systematic and complete classification, and empirically, the graphs representing networks come in a bewildering multitude. We have developed some tools [8, 9, 10] that at least allow for a rough classification of graphs that reflects the difference in the empirical domains from which network data are produced and that does not depend on sophisticated visualization tools. As such, a graph is a rather simple formal structure. It consists of nodes or vertices that are connected by edges or links. These nodes then represent the elements of a network (and we shall often not distinguish between the network and its underlying graph), and the edges represent relations between them. These could be chemical interactions as in intracellular networks of genes, proteins, or metabolites, synaptic connections between neurons, physical links in infrastructural networks, links between Internet pages, co-occurrences between words in sentences or on text pages, email contacts between people, co-authorships between scientists, and so on. This structure then can be expected to be somehow adapted to the function of the network, by evolution, self-organization, or design. In turn, any dynamics supported by the network will be constrained by this underlying structure. Our approach is based on associating certain mathematical objects (which ultimately just yield some numbers) to a graph which reflect its structural properties and which in particular encode the constraints on the dynamics that it can support. The mathematical objects will be an operator, the graph Laplacian (a discrete analogue of the Laplace operator in real analysis), and its eigenfunctions, and the numbers alluded to will be the eigenvalues of that operator.
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_7, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
2 Growing Networks
Empirical networks usually do not spring into existence, but rather grow to their present or final state from smaller beginnings. Naturally, such a growth process involves the sequential addition of nodes and links (connections). Usually, nodes are added at random, but their link formation with other nodes (already present in the network) is often not entirely random. This link formation will follow some rule that typically is still stochastic but also involves properties of those nodes that are candidates for receiving a link. When that rule is such that nodes that already have many connections have a higher chance of receiving links than those with fewer connections, we have some form of preferential attachment. Such a rule is known to lead to a scale-free degree distribution of the nodes in the network; that is, the number of nodes in the final network that have $k$ links behaves like some power $k^{-\alpha}$, for some positive exponent $\alpha$. The first such rule was proposed by Simon [44], and it directly stipulated that those nodes that have more connections also have a higher chance of receiving additional ones (“the-rich-get-richer” principle). This rule and the effects resulting from it were then systematically investigated by Barabási–Albert [2, 11], and subsequently, many empirical networks were found to exhibit such a power-law degree distribution. It would be, however, premature to draw systematic consequences about other network properties from such a power-law degree distribution. In fact, there are many rules for network growth that are plausible in many areas of application that indirectly lead to such a kind of preferential attachment, but can lead to networks with properties that are otherwise rather different from those of the schemes of Simon and Barabási–Albert.
For instance, Jost–Joy [28] investigated the “make-friends-with-the-friends-of-your-friends” rule where a new node first forms one link with a randomly selected node in the network and then preferentially makes further links with neighbors of that node. Since the chance of a node being a neighbor of some randomly chosen node depends on its degree, these subsequent links then also constitute some preferential attachment, and the resulting degree distribution will follow a power law. However, other properties of that network are rather different from those obtained by the direct preferential attachment scheme. In particular, because of the preference for local connections, the network diameter will be typically much larger. Even the opposite scheme, where a node preferentially forms additional links with nodes from which it has a large distance, does not lead to a network with a very small diameter. For creating a network with a small diameter, it is rather more efficient that nodes directly use preferential attachment, that is, preferentially form links with other nodes that have a high degree and are therefore well connected in the network. Of course, the most efficient way to achieve a small diameter in a sparse network is to connect every node to one single central node. Another crucial difference between a “make-friends-with-the-friends-of-your-friends” network and a “the-rich-get-richer” network is that the first
eigenvalue of the make-friends network will be much smaller, implying for instance that dynamics on such a network are much more difficult to synchronize, as will be explained below. In fact, spectral properties like the behavior of the first eigenvalue of scale-free networks were analyzed in [3, 4], and it was pointed out that the scaling exponent and the first eigenvalue are essentially independent parameters for a network. Of course, when networks are produced by a certain stochastic scheme or drawn from some probability distribution on the space of networks, then that scheme or distribution will also lead to some typical spectral behavior, as systematically investigated in [29]. However, when we only know whether a network is scale free, we should be careful about inferring other network properties. It might be a wiser strategy to find out more about the underlying network evolution rule, like the above “make-friends-with-the-friends-of-your-friends” principle, the Cameo principle of Blanchard–Krüger [12], or whatever is plausible in the given empirical domain. One important class of rules for which there is much evidence in various domains is the one of node duplications. That means that instead of randomly attaching a new external node, we take some node i already present in the network and double it in the sense that we create a new node i′ that forms links with all or some of the neighbors of i. It may or may not also form a link with i itself. Again, since the chance of another node j of being a neighbor of the randomly chosen node i and therefore receiving new connections from i′ depends on the degree of j, we do get a preferential attachment scheme. Again, however, as we shall see below, such a node duplication leads to some specific spectral properties that are not shared by networks arising from different schemes. There also exist other distinctions within the class of scale-free networks.
An important one is whether the nodes of high degree are assortative, i.e., prefer connections with other high degree nodes, or disassortative, i.e., avoid connections with high degree nodes and rather form links with low degree nodes.
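The “make-friends-with-the-friends-of-your-friends” rule described above can be sketched in a few lines. This is only an illustrative simplification, not the model of [28]: the function name `grow_friends_of_friends`, the seed graph (a single edge), and the parameter `extra_links` are assumptions made here, and the additional links are always taken from the neighbors of the first contact rather than with a tunable preference.

```python
import random

def grow_friends_of_friends(n_nodes, extra_links=2, seed=0):
    """Grow a graph by a simplified friends-of-friends rule.

    Each new node links to one uniformly chosen existing node and then to
    up to `extra_links` randomly chosen neighbors of that node.
    """
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}                  # seed graph: a single edge
    for new in range(2, n_nodes):
        first = rng.randrange(new)          # uniform choice among existing nodes
        friends = sorted(adj[first])        # neighbors of the chosen node
        rng.shuffle(friends)
        targets = [first] + friends[:extra_links]
        adj[new] = set()
        for t in targets:                   # add undirected links
            adj[new].add(t)
            adj[t].add(new)
    return adj

adj = grow_friends_of_friends(2000)
degrees = sorted((len(nbrs) for nbrs in adj.values()), reverse=True)
print("largest degrees:", degrees[:5])
print("median degree:", degrees[len(degrees) // 2])
```

Comparing the largest degrees with the median gives a first impression of how heavy-tailed the resulting degree sequence is; a proper check of the power law would require a histogram over many runs.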
3 Graph Operators and their Spectral Properties
We have already seen several important network parameters or properties, like the diameter, the synchronizability, the degree sequence (counting the number of nodes of degree k in the network as a function of k ∈ N), and the assortativity. Of course, there are many others, like the clustering coefficient, which expresses the relative frequency of triangles, that is, triples of nodes that are pairwise connected. The clustering coefficient is defined as
\[ C := \frac{3 \times \text{number of triangles}}{\text{number of connected triples of nodes}}. \tag{1} \]
The normalization is chosen such that C becomes one for a fully connected graph.
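As a small worked example of definition (1), the following sketch counts connected triples and triangles directly from an adjacency dictionary. The function name `clustering_coefficient` and the dictionary-of-sets representation are choices made for this illustration only.

```python
from itertools import combinations

def clustering_coefficient(adj):
    """Clustering coefficient (1) for an undirected graph given as {node: set_of_neighbors}."""
    closed = 0    # connected triples whose two end nodes are also linked
    triples = 0   # all connected triples, counted by their center node
    for center, nbrs in adj.items():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj[a]:
                closed += 1
    # every triangle is seen once from each of its three corners,
    # so `closed` equals 3 x (number of triangles)
    return closed / triples if triples else 0.0

complete4 = {i: {j for j in range(4) if j != i} for i in range(4)}
cycle4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(clustering_coefficient(complete4), clustering_coefficient(cycle4))
```

The two test graphs illustrate the extreme cases: a fully connected graph, where every connected triple is closed, and a 4-cycle, which contains no triangle at all.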
Certain properties characterize specific classes of graphs. Complete graphs are those where every vertex is connected with all others. Of course, for large graphs, this is an unrealistic situation, as they are typically sparse, in the sense that the average vertex has connections to only a small fraction of the vertices present in the graph. A graph is bipartite when it consists of two classes inside each of which there are no connections. A graph is bipartite iff it has no closed paths of odd length. In particular, for a bipartite graph, the clustering coefficient C vanishes. A complete bipartite graph is one where each member of one class is connected with all members of the other class. Trees are special bipartite graphs. They have the minimal number of edges, N − 1, that is needed to make a graph of N vertices connected. One may also consider more general structural properties, like cohesion, or functional aspects, like robustness against the destruction of links or the elimination of nodes. Clearly, no such list of parameters and properties can be exhaustive. Also, it may not be easy to understand the relations, if any, between those parameters and properties. In this situation, we have developed the spectral approach to the description of networks. As we shall explain, this means the analysis of the density of eigenvalues of a natural operator associated to a network, the graph Laplacian. While these eigenvalues do not always fully determine a graph, they nevertheless capture all important geometric properties, in a more or less explicit form. Plotting the density of eigenvalues also yields a representation of a graph that can be readily visually inspected. (In contrast, explicit presentation of the nodes and links becomes rather opaque once the graph exceeds some moderate size of, say 1–200 nodes.) Moreover, such an explicit presentation can easily be manipulated by moving the nodes around in a plane. We now formally introduce the graph Laplacian and its spectrum.
We represent our network structurally as a graph $\Gamma$ which we assume to be finite and connected; let it have $N$ vertices. Vertices $i, j \in \Gamma$ connected by an edge of $\Gamma$ are called neighbors, $i \sim j$. The number of neighbors of a vertex $i \in \Gamma$ is called its degree $n_i$. For functions $v$ from the vertices of $\Gamma$ to $\mathbb{R}$, we define the normalized Laplacian (henceforth simply called the Laplacian) as
\[ \Delta v(i) := \frac{1}{n_i} \sum_{j, j \sim i} v(j) - v(i). \tag{2} \]
This operator is different from the algebraic graph Laplacian $L v(i) := n_i v(i) - \sum_{j, j \sim i} v(j)$; see, e.g., [13, 14, 20, 32, 35]. In particular, the spectrum of $\Delta$ is different from that of $L$; $\Delta$, however, has the same spectrum as the Laplacian investigated in [15] (in fact, the two operators are equivalent, differing only by a multiplier). The normalized Laplacian is the operator underlying random walks and conservative diffusion processes on graphs. Therefore, it seems to be the more natural operator from a geometric or physical perspective. However, the algebraic Laplacian does possess certain nice algebraic properties that are not shared by the normalized Laplacian, like a trace formula, see [22].
Nevertheless, in our empirical studies, we have found that the Laplacian considered here seems to be a better tool for distinguishing different classes of graphs by spectral properties. We now recall some elementary properties, see, e.g., [15, 26]. The Laplacian is symmetric for the product
\[ (u, v) := \sum_{i \in V} n_i u(i) v(i) \tag{3} \]
for real-valued functions $u, v$ on the vertices of $\Gamma$ (and because of this symmetry, we need not consider complex-valued functions). The eigenvalues of $\Delta$ therefore are real. $\Delta$ is nonpositive in the sense that $(\Delta u, u) \le 0$ for all $u$. With the following convention, the eigenvalues $\lambda$ then are nonnegative:
\[ \Delta u + \lambda u = 0. \tag{4} \]
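The symmetry with respect to the product (3) and the nonpositivity of $\Delta$ can be checked exactly on a small example. The sketch below uses rational arithmetic so that no rounding is involved; the helper names `laplacian` and `product`, the sample graph, and the random integer test functions are assumptions of this illustration.

```python
from fractions import Fraction
import random

def laplacian(adj, v):
    """Normalized Laplacian (2): mean of v over the neighbors of i, minus v(i)."""
    return {i: sum(v[j] for j in adj[i]) / Fraction(len(adj[i])) - v[i] for i in adj}

def product(adj, u, v):
    """Degree-weighted product (3)."""
    return sum(len(adj[i]) * u[i] * v[i] for i in adj)

# a small connected graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
rng = random.Random(0)
u = {i: Fraction(rng.randint(-5, 5)) for i in adj}
v = {i: Fraction(rng.randint(-5, 5)) for i in adj}

symmetric = product(adj, laplacian(adj, u), v) == product(adj, u, laplacian(adj, v))
nonpositive = product(adj, laplacian(adj, u), u) <= 0
constant = {i: Fraction(1) for i in adj}
kernel = all(x == 0 for x in laplacian(adj, constant).values())
print(symmetric, nonpositive, kernel)
```

The last check confirms that constant functions are annihilated by $\Delta$, which is the reason why 0 always occurs as an eigenvalue under convention (4).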
A nonzero solution $u$ is called an eigenfunction for the eigenvalue $\lambda$. Since $\Gamma$ has $N$ vertices, $\Delta$ has $N$ eigenvalues, not necessarily distinct, as some of them might occur with higher multiplicity. The smallest eigenvalue is $\lambda_0 = 0$, with a constant eigenfunction. This eigenvalue is simple because we assume that $\Gamma$ is connected; in general, the multiplicity of the eigenvalue 0 equals the number of connected components, with the corresponding eigenfunctions being $\equiv 1$ on one and $\equiv 0$ on all other components. Returning to our case of a connected graph $\Gamma$, we then have
\[ \lambda_k > 0 \tag{5} \]
for $k > 0$, where we order the eigenvalues as $\lambda_0 = 0 < \lambda_1 \le \dots \le \lambda_{N-1}$. For the largest eigenvalue, we have
\[ \lambda_{N-1} \le 2. \tag{6} \]
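For small graphs, the ordered spectrum can be computed numerically and the bounds (5) and (6) inspected directly. The sketch below is one possible way to do this with the standard library only, and rests on two choices that are not taken from the text: it works with the symmetric matrix $I - D^{-1/2} A D^{-1/2}$ (the form used in [15], which has the same eigenvalues as those defined by (4)), and it uses a plain cyclic Jacobi iteration, which is adequate only for small matrices. The function names and the two example graphs are hypothetical.

```python
import math

def normalized_laplacian_matrix(adj):
    """Symmetric matrix I - D^{-1/2} A D^{-1/2}; its eigenvalues are the lambda of (4)."""
    nodes = sorted(adj)
    index = {v: k for k, v in enumerate(nodes)}
    m = [[0.0] * len(nodes) for _ in nodes]
    for v in nodes:
        m[index[v]][index[v]] = 1.0
        for w in adj[v]:
            m[index[v]][index[w]] = -1.0 / math.sqrt(len(adj[v]) * len(adj[w]))
    return m

def jacobi_eigenvalues(matrix, sweeps=100, tol=1e-22):
    """Eigenvalues of a small real symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[p][q] ** 2 for p in range(n) for q in range(p + 1, n))
        if off < tol:                      # off-diagonal part is negligible
            break
        for p in range(n):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):         # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):         # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))

complete5 = {i: {j for j in range(5) if j != i} for i in range(5)}
path4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
spectra = {}
for name, graph in (("complete graph on 5 vertices", complete5), ("path on 4 vertices", path4)):
    spectra[name] = jacobi_eigenvalues(normalized_laplacian_matrix(graph))
    print(name, [round(x, 6) for x in spectra[name]])
```

For larger networks, a dedicated numerical linear algebra library would be the natural replacement for the hand-written iteration.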
In particular, the spectrum of $\Delta$ is always confined to the interval $[0, 2]$, regardless of the size of the graph. This is not true for the algebraic graph Laplacian $L$, and this property of $\Delta$ allows for an easy comparison of the spectra of graphs irrespective of their sizes. We have equality in (6) iff the graph is bipartite. Thus, a single eigenvalue determines the global property of bipartiteness. More generally, a graph is bipartite iff whenever $\lambda$ is an eigenvalue, then so is $2 - \lambda$. Thus, the characteristic spectral property of a bipartite graph is that its spectrum is symmetric about 1. For instance, for a complete graph of $N$ vertices,
\[ \lambda_1 = \dots = \lambda_{N-1} = \frac{N}{N-1}, \tag{7} \]
that is, there is only one nontrivial eigenvalue, $\frac{N}{N-1}$, occurring with multiplicity $N - 1$. Among all graphs with $N$ vertices, this is the largest possible value for $\lambda_1$ and the smallest possible value for $\lambda_{N-1}$. Thus, the characteristic spectral property of complete graphs is that there is this eigenvalue with the highest possible multiplicity. Many qualitative properties of graphs can be characterized by inequalities or other relationships between their eigenvalues. For instance, Monasson [36] carried out a systematic investigation of the spectrum of a small-world graph as the superposition of a regular ring and a random graph. Also, [23] develops a method for (re)constructing a graph from its spectrum. We should point out, however, that in general it is not possible to uniquely determine a graph from its spectrum. In fact, there exist isospectral graphs, that is, different graphs with the same eigenvalues. For instance, all complete bipartite graphs with the same number $N$ of vertices have the same eigenvalues. Actually, they possess the eigenvalues 0 and 2 with multiplicity 1 and the eigenvalue 1 with multiplicity $N - 2$. Any graph with that spectrum is a complete bipartite graph, but among complete bipartite graphs of $N$ vertices, the two classes may have different sizes $N_1, N_2$, as long as $N_1 + N_2 = N$, of course. We now rewrite the eigenvalue equation (4) as
\[ \frac{1}{n_i} \sum_{j \sim i} u(j) = (1 - \lambda) u(i) \quad \text{for all } i. \tag{8} \]
We observe that when the eigenfunction $u$ vanishes at $i$, then also
\[ \sum_{j \sim i} u(j) = 0. \tag{9} \]
The converse also holds, except for the case $\lambda = 1$ when (9) holds at all points regardless of whether the eigenfunction vanishes there or not. We now consider motifs, that is, small subgraphs of $\Gamma$ of a particular type, and analyze what happens to the spectrum when performing some natural operations with motifs. As our motif, we take some graph $\Lambda$. We start with motif joining: Here, the motif $\Lambda$ is a graph that is independent of $\Gamma$. Let $j_0$ be a vertex of $\Lambda$. We assume that $\Lambda$ has eigenvalue $\lambda$ and an eigenfunction $u_\lambda$ that vanishes at $j_0$, i.e., $u_\lambda(j_0) = 0$. We then form a graph $\bar{\Gamma}$ by identifying the vertex $j_0$ with an arbitrary vertex $i$ of $\Gamma$. The new graph then also possesses the eigenvalue $\lambda$, with an eigenfunction that agrees with $u_\lambda$ on $\Lambda$ and vanishes at the other vertices, that is, those coming from $\Gamma$. Thus, a motif $\Lambda$ can be joined to an existing graph with a preserved eigenvalue and a localized eigenfunction when the joining occurs at one (or several) vertices where that eigenfunction vanishes. We next consider motif duplication: Here, the motif $\Lambda$ is a subgraph of $\Gamma$, with vertices $j_1, \dots, j_m$. Let the function $u$ on the vertex set of $\Lambda$ satisfy
\[ \frac{1}{n_i} \sum_{j \in \Lambda, j \sim i} u(j) = (1 - \lambda) u(i) \quad \text{for all } i \in \Lambda \text{ and some } \lambda, \tag{10} \]
where $n_i$ is the degree of the vertex $i$ in $\Gamma$. Let $\bar{\Gamma}$ be obtained from $\Gamma$ by doubling the motif $\Lambda$, that is, by adding vertices $i_1, \dots, i_m$ and their connections as in $\Lambda$ and connecting each $i_\alpha$ with all $i \notin \Lambda$ that are neighbors of $j_\alpha$. Then the graph $\bar{\Gamma}$ possesses the eigenvalue $\lambda$ with an eigenfunction $u_\lambda$ that can be nonzero only at the vertices of $\Lambda$ and its double; it agrees with $u$ on $\Lambda$ and with $-u$ on the double of $\Lambda$. Thus, the eigenvalue $\lambda$ is produced by motif duplication with symmetric eigenfunction balancing. We point out that for this effect it is essential that there be no connections between a node $j_\alpha$ and its double $i_\alpha$. The simplest motif is a single vertex, and the corresponding motif duplication is the doubling of a single vertex $j_0 \in \Gamma$. According to the general scheme, we add a new vertex $i_0$ and connect $i_0$ with all neighbors of $j_0$. This generates an eigenvalue 1, with an eigenfunction $u_1$ that is nonzero only at $j_0$ and $i_0$, with $u_1(j_0) = 1$, $u_1(i_0) = -1$. In the analysis of empirical networks, we often find that the spectral plot has a high peak at the eigenvalue 1. In such a situation, a natural hypothesis is that this network evolved via a sequence of vertex doublings. In fact, vertex duplication with subsequent random edge deletion has been proposed in different application fields as a mechanism for network growth that can reproduce qualitative properties of empirical networks, e.g., for the Internet [30], for protein-interaction networks [6, 45, 46, 47], or for citation networks [31], although the precise rules can differ between those investigations, for instance, whether the duplicated node and its copy are connected or not. The next simplest motif consists of two connected vertices. Thus, we consider an edge in $\Gamma$ connecting two vertices $j_1, j_2$. Equation (10) then becomes
\[ \frac{1}{n_{j_1}} u(j_2) = (1 - \lambda) u(j_1), \qquad \frac{1}{n_{j_2}} u(j_1) = (1 - \lambda) u(j_2), \tag{11} \]
with the solutions
\[ \lambda_\pm = 1 \pm \frac{1}{\sqrt{n_{j_1} n_{j_2}}}. \tag{12} \]
The duplication of an edge thus yields the eigenvalues $\lambda_\pm$ which are symmetric about 1. Also, when the degree of $j_1$ or $j_2$ is large, $\lambda_\pm$ are close to 1. The next motifs consist of three vertices. When we have a chain of vertices $j_1, j_2, j_3$ for which $j_2$ is connected to both $j_1$ and $j_3$, but without a connection between $j_1$ and $j_3$ (that is, the motif is not a triangle), we obtain the eigenvalues
\[ \lambda = 1, \quad 1 \pm \sqrt{\frac{1}{n_{j_2}} \left( \frac{1}{n_{j_1}} + \frac{1}{n_{j_3}} \right)}. \tag{13} \]
The other motif with three vertices is a triangle, with vertices $j_1, j_2, j_3$. In this case, from (10), we obtain the cubic equation
\[ (1 - \lambda)^3 n_{j_1} n_{j_2} n_{j_3} - (1 - \lambda)(n_{j_1} + n_{j_2} + n_{j_3}) - 2 = 0 \tag{14} \]
for $\lambda$.
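The statements about vertex doubling and edge duplication can be verified numerically on a small example. The following sketch is an illustration under stated assumptions: the helper names `duplicate_motif` and `residual`, the labeling of the doubled vertices as `("copy", j)`, and the six-edge sample graph `gamma` are all hypothetical. It builds the doubled graph as described above, constructs the localized function ($u$ on the motif, $-u$ on its double, zero elsewhere), and evaluates how well it satisfies (4).

```python
import math

def duplicate_motif(adj, motif):
    """Double the vertex set `motif`: copies are linked among themselves as in the
    motif and to the outside neighbors of their originals, but not to the originals."""
    new_adj = {i: set(nbrs) for i, nbrs in adj.items()}
    double = {j: ("copy", j) for j in motif}
    for j in motif:
        new_adj[double[j]] = set()
    for j in motif:
        for k in adj[j]:
            target = double[k] if k in motif else k
            new_adj[double[j]].add(target)
            new_adj[target].add(double[j])
    return new_adj, double

def residual(adj, u, lam):
    """Largest |Delta u(i) + lam * u(i)| over all vertices; zero for an eigenpair of (4)."""
    return max(abs(sum(u[j] for j in adj[i]) / len(adj[i]) - u[i] + lam * u[i]) for i in adj)

# sample graph with edges 0-1, 0-3, 1-2, 1-3, 2-3, 3-4
gamma = {0: {1, 3}, 1: {0, 2, 3}, 2: {1, 3}, 3: {0, 1, 2, 4}, 4: {3}}

# doubling the single vertex 4 should produce the eigenvalue 1
g1, d1 = duplicate_motif(gamma, {4})
u = {i: 0.0 for i in g1}
u[4], u[d1[4]] = 1.0, -1.0
vertex_residual = residual(g1, u, 1.0)

# duplicating the edge 1-2 should produce the two eigenvalues of (12)
g2, d2 = duplicate_motif(gamma, {1, 2})
n1, n2 = len(gamma[1]), len(gamma[2])
edge_residuals = []
for sign in (1, -1):
    lam = 1 + sign / math.sqrt(n1 * n2)
    w = {i: 0.0 for i in g2}
    w[1], w[2] = math.sqrt(n2), -sign * math.sqrt(n1)   # a solution of (11)
    w[d2[1]], w[d2[2]] = -w[1], -w[2]                   # opposite sign on the double
    edge_residuals.append(residual(g2, w, lam))
print(vertex_residual, edge_residuals)
```

All residuals should vanish up to rounding, which reflects that the contributions of a motif vertex and its double cancel at every common outside neighbor.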
4 Functional and Dynamical Aspects Determined by the First Eigenvalue
In this section, we shall argue that the first nontrivial eigenvalue $\lambda_1$ plays a special role for understanding important network properties. $\lambda_1$ is also called the spectral gap, because it is equal to the difference $\lambda_1 - \lambda_0$ as $\lambda_0 = 0$. $\lambda_1$ admits the variational characterization
\[ \lambda_1 = \min_v \left\{ \frac{\sum_{j \sim i} (v(i) - v(j))^2}{\sum_i n_i v(i)^2} : \sum_i n_i v(i) = 0 \right\}. \tag{15} \]
A function $v$ attaining this minimum then is an eigenfunction for $\lambda_1$. Since the numerator in (15) only takes pairs of neighboring vertices into account, $\lambda_1$ can become quite small when the graph consists of two large subgraphs that are connected by few edges. In (15), we can then achieve a small value by taking some function that equals a positive constant on one of those subgraphs and a negative constant on the other, where the two constants are adjusted so that the normalization $\sum_i n_i v(i) = 0$ is satisfied. Therefore, it is intuitively clear that $\lambda_1$ can be estimated against the Polya–Cheeger constant $h(\Gamma)$ of our graph $\Gamma$, which is defined as follows. Letting $|E|$ denote the number of edges contained in an edge set $E$, we define
\[ h(\Gamma) := \inf \left\{ \frac{|E_0|}{\min\left( \sum_{i \in V_1} n_i, \sum_{i \in V_2} n_i \right)} \right\}, \tag{16} \]
where removing $E_0$ disconnects $\Gamma$ into the components $V_1, V_2$. We then have the estimates (see [15] for proofs)
\[ \frac{1}{2} h(\Gamma)^2 \le \lambda_1 \le 2 h(\Gamma). \tag{17} \]
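The estimates (17) can be illustrated on a cycle, for which everything is explicit. The sketch below makes several assumptions that go beyond the text: $h(\Gamma)$ is computed by brute force over all two-sided vertex partitions (feasible only for very small graphs), the quotient in (15) is evaluated with each edge counted once, and $\lambda_1 = 1 - \cos(2\pi/n)$ is used as the known first eigenvalue of a cycle with $n$ vertices (a standard fact for regular graphs, not derived here). The function names are hypothetical.

```python
import math
from itertools import combinations

def cheeger_constant(adj):
    """Brute-force evaluation of (16) over all partitions of the vertex set."""
    nodes = sorted(adj)
    total_volume = sum(len(adj[i]) for i in nodes)
    best = math.inf
    for size in range(1, len(nodes)):
        for subset in combinations(nodes, size):
            v1 = set(subset)
            cut = sum(1 for i in v1 for j in adj[i] if j not in v1)   # |E_0|
            vol1 = sum(len(adj[i]) for i in v1)
            best = min(best, cut / min(vol1, total_volume - vol1))
    return best

def rayleigh_quotient(adj, v):
    """Quotient from (15), with every edge counted once."""
    num = sum((v[i] - v[j]) ** 2 for i in adj for j in adj[i] if i < j)
    den = sum(len(adj[i]) * v[i] ** 2 for i in adj)
    return num / den

n = 8
cycle = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
h = cheeger_constant(cycle)
lam1 = 1 - math.cos(2 * math.pi / n)

# test function: +1 on one half of the cycle, -1 on the other, so that sum_i n_i v(i) = 0
v = {i: 1.0 if i < n // 2 else -1.0 for i in range(n)}
print(h, lam1, rayleigh_quotient(cycle, v))
```

The two-valued test function is exactly the kind of function described after (15); its quotient is an upper bound for $\lambda_1$, which is how the right-hand inequality in (17) arises.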
Incidentally, this implies the inequality
\[ h(\Gamma) \le 4 \tag{18} \]
for any connected graph. Turning to dynamical aspects, we consider a dynamical system with coupling structure given by $\Gamma$. More specifically, we consider the coupled equation for a function $u$ depending on the nodes $i \in \Gamma$ and evolving in discrete time $n \in \mathbb{N}$,
\[ u(i, n+1) = f(u(i, n)) + \frac{\epsilon}{n_i} \sum_{j, j \sim i} \big( f(u(j, n)) - f(u(i, n)) \big). \tag{19} \]
Here, $f : [0, 1] \to [0, 1]$ is some function; the functions we have in mind are those whose iteration generates some chaotic dynamics, like the logistic map
\[ f(x) = 4x(1 - x). \tag{20} \]
What is important about $f$ is its Lyapunov exponent along an orbit $\bar{u}(n)$ of $f$,
\[ \mu_0 = \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} \log |f'(\bar{u}(n))|. \]
The Lyapunov exponent $\mu_0$ is positive for chaotic dynamics $f$. $\epsilon$ is a coupling parameter, usually in the range $0 \le \epsilon \le 1$. The specific question we wish to ask is whether, or better, under what circumstances, the solution $u$ of (19) synchronizes, that is, asymptotically,
\[ \lim_{n \to \infty} (u(i, n) - u(j, n)) = 0 \quad \text{for all nodes } i, j. \tag{21} \]
This question can be understood as asking about the stability of a synchronized solution
\[ u(i, n) = \bar{u}(n) \tag{22} \]
that solves
\[ \bar{u}(n+1) = f(\bar{u}(n)). \tag{23} \]
Systematic studies of synchronization can be found in [42, 41]. It was then found in [27, 43] that a sufficient condition for such stability is
\[ \frac{1 - e^{-\mu_0}}{\lambda_1} < \epsilon < \frac{1 + e^{-\mu_0}}{\lambda_{N-1}}. \tag{24} \]
In practice, the left inequality, the one involving λ1 , is the crucial one here. In particular, when the eigenvalues satisfy appropriate conditions, we can have a stable synchronized solution that is chaotic (μ0 > 0). Note that the first eigenvalue even determines the synchronization of dynamics with transmission delays between the nodes, see [5].
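The coupled map (19) and the window (24) can be explored with a short simulation. The following sketch is an illustration under several assumptions: it uses the complete graph on five vertices, for which $\lambda_1 = \lambda_{N-1} = N/(N-1)$ by (7); it takes $\mu_0 = \log 2$ as the Lyapunov exponent of the logistic map (20), a standard value that is not derived in the text; and the function names, the number of steps, the random initial values, and the clamping of values to $[0, 1]$ against rounding are choices made here.

```python
import math
import random

def logistic(x):
    """Logistic map (20)."""
    return 4.0 * x * (1.0 - x)

def step(adj, u, eps):
    """One iteration of the coupled dynamics (19)."""
    fu = {i: logistic(x) for i, x in u.items()}
    new = {}
    for i in adj:
        x = fu[i] + eps / len(adj[i]) * sum(fu[j] - fu[i] for j in adj[i])
        new[i] = min(1.0, max(0.0, x))     # guard against rounding slightly outside [0, 1]
    return new

def spread_after(adj, eps, steps=200, seed=1):
    """Largest difference between node states after `steps` iterations."""
    rng = random.Random(seed)
    u = {i: rng.random() for i in adj}
    for _ in range(steps):
        u = step(adj, u, eps)
    return max(u.values()) - min(u.values())

N = 5
complete = {i: {j for j in range(N) if j != i} for i in range(N)}
lam = N / (N - 1)                          # the only nontrivial eigenvalue, see (7)
mu0 = math.log(2.0)                        # assumed Lyapunov exponent of the logistic map
lower = (1 - math.exp(-mu0)) / lam
upper = (1 + math.exp(-mu0)) / lam
print("window (24):", lower, upper)
print("spread for eps = 0.7:", spread_after(complete, 0.7))
print("spread for eps = 0.1:", spread_after(complete, 0.1))
```

With a coupling strength inside the window, the differences between the nodes die out even though each node continues to follow a chaotic orbit; for a coupling below the window, no such collapse is to be expected.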
5 Spectral Plots and What They May Tell Us
In this final section, we describe how (a smoothed version of) the density plot for the eigenvalues of the Laplacian of a network yields a good heuristic clustering scheme for networks from different empirical domains. More precisely, we shall see that the spectral plots of different networks from the same domain typically look rather similar to each other, but different from those for networks from different domains. Also, these spectral plots often suggest suitable hypotheses about the dominant evolution mechanisms of the underlying networks. Let us give some examples that summarize some of the discussion in the preceding sections.
• A high peak at the eigenvalue 1 may indicate many successive node duplications. This is readily visible in many of our spectral plots.
• Likewise, as analyzed above, see (12), (13), (14), duplications of small motifs leave characteristic traces in the spectrum.
[Fig. 1: spectral plots, panels (a)–(d); horizontal axis: eigenvalue, from 0 to 2; vertical axis: density of eigenvalues.]
Fig. 1. (a) Protein-protein interaction network of Helicobacter pylori. Network size = 710. Data collected from http://www.cosinproject.org [Download date: 25 Sept. 2005]. (b) Metabolic network of Helicobacter pylori. Size of the network = 940. Nodes represent substrates, enzymes, and intermediate complexes. Data used in [24]. Data source: http://www.nd.edu/~networks/resources.htm. [Download date: 22 Nov. 2004]. (c) Autonomous Systems (AS) topology of the Internet. Every vertex represents an AS, and two vertices are connected if there is at least one physical link between the two corresponding ASs. AS graph of 1998/04/02. Network size = 3522. Data collected from http://www.cosinproject.org and data used in [18] [Download date: 23 September 2005]. Main source: BGP routing data collected by University of Oregon Route Views Project, then processed and made available in various formats at the Global ISP interconnectivity by AS number page of NLANR (National Laboratory of Applied Network Research). (d) Word-adjacency networks of a text in Spanish language. Size of the network = 11558. Data downloaded from http://www.weizmann.ac.il/mcb/UriAlon [Download date 3rd Feb. 2005]. Data used in [34].
[Fig. 2: spectral plots, panels (a)–(d); horizontal axis: eigenvalue, from 0 to 2; vertical axis: density of eigenvalues.]
Fig. 2. (a) Foodweb network from “Florida bay in wet season”. Data downloaded from http://vlado.fmf.uni-lj.si/pub/networks/data (main data resource: Chesapeake Biological Laboratory. Web link: http://www.cbl.umces.edu/). [Download date 21 Dec. 2006]. Network size 128. (b) Foodweb network from “Ythan estuary”. Data downloaded from http://www.cosinproject.org. [Download Date 21 Dec. 2006]. Network size 135. (c) The network of hyperlinks between weblogs on US politics, recorded in 2005 by Adamic and Glance [1]. Network size 1222. Data downloaded from http://www-personal.umich.edu/~mejn/netdata [Download date: 23 April 2007]. (d) Neuronal connectivity of Caenorhabditis elegans. Network size 297. Data used in [49, 50]. Data Source: http://cdg.columbia.edu/cdg/datasets [Download date: 18 Dec. 2006]. (e) E-mail interchanges between members of the University Rovira i Virgili (Tarragona) [21]. Network size 1133. Data downloaded from http://deim.urv.cat/~aarenas/data/welcome.htm [Download date: 21 March, 2007].
[Fig. 2 (continued): spectral plot, panel (e); horizontal axis: eigenvalue, from 0 to 2; vertical axis: density of eigenvalues.]
• As follows from Section 4, the presence of many small eigenvalues indicates that the graph consists of many components that, while possibly connected densely inside, are only very loosely connected to each other. That is, the graph consists of many different “communities.” As indicated, this has important dynamical implications for the synchronizability of the graph.
• When the highest eigenvalue equals 2, or, more generally, when the spectrum is symmetric about 1, the graph is bipartite; see the discussion after (6). Thus, an approximate such symmetry, or an eigenvalue very close to 2, will indicate that the graph is close to being bipartite (we hope to present more precise estimates elsewhere). Also, a bipartite graph can readily support period 2 oscillations of coupled dynamics, so again, there are direct dynamical implications here. Also, when a graph is bipartite, a random walk on it need not converge to a stationary distribution. More generally, such convergence properties are related to the small and large (close to 2) eigenvalues. Thus, these eigenvalues will affect the properties of random search schemes on the underlying graph.
In Figs. 1 through 4, we can clearly see that networks from the same empirical domain yield similar spectral plots. Also, we can distinguish different classes of spectral plots with specific characteristic features. A more detailed analysis of those classes can be found in [10]. The investigation of the graph properties that can be detected from spectral plots has just begun, and we expect significant advances in the detailed understanding of classes of empirical graphs from systematic investigations of their spectra.
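A smoothed spectral plot of the kind discussed in this section can be produced from a list of eigenvalues by placing a narrow kernel on each of them. The sketch below is only one possible smoothing, not necessarily the one used for the figures: the Gaussian kernel, its width `sigma`, the grid resolution, and the function name `spectral_density` are assumptions. As input it uses the spectrum of a complete bipartite graph on $N$ vertices stated in Section 3 (eigenvalues 0 and 2 once each, and the eigenvalue 1 with multiplicity $N - 2$).

```python
import math

def spectral_density(eigenvalues, sigma=0.03, points=201):
    """Gaussian-smoothed eigenvalue density evaluated on a regular grid over [0, 2]."""
    grid = [2.0 * k / (points - 1) for k in range(points)]
    norm = len(eigenvalues) * sigma * math.sqrt(2.0 * math.pi)
    density = [
        sum(math.exp(-((x - lam) ** 2) / (2.0 * sigma ** 2)) for lam in eigenvalues) / norm
        for x in grid
    ]
    return grid, density

n = 20
eigs = [0.0] + [1.0] * (n - 2) + [2.0]     # complete bipartite graph on n vertices
grid, density = spectral_density(eigs)
peak_x = grid[density.index(max(density))]
print("position of the highest peak:", peak_x)
```

The resulting curve has a dominant peak at 1 and two small bumps at the ends of the interval, and it is symmetric about 1, as expected for a bipartite graph.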
[Fig. 3: spectral plots, panels (a)–(d); horizontal axis: eigenvalue, from 0 to 2; vertical axis: density of eigenvalues.]
Fig. 3. (a) Topology of the Western states power grid of the United States [49]. Network size 4941. Data downloaded from http://cdg.columbia.edu/cdg/datasets [Download date: 1 March 2007]. (b) Jazz band network. Nodes represent jazz bands. Two bands are connected if the same musician played in both bands. Network size 198. Data downloaded from http://deim.urv.cat/~aarenas/data/welcome.htm [Download date: 17 March 2008]. Data used in [19]. (c) Co-authorships between scientists posting preprints on the High-Energy Theory E-Print Archive, http://arxiv.org/archive/hepth between 1 Jan. 1995 and 31st Dec. 1999 [37]. Network size 5835. (d) Co-authorships of scientists working on network theory and experiment [38]. Network size 379. (c,d) Data downloaded from http://www-personal.umich.edu/~mejn/netdata [Download date: 23 April 2007].
[Fig. 4: spectral plots, panels (a)–(c); horizontal axis: eigenvalue, from 0 to 2; vertical axis: density of eigenvalues.]
Fig. 4. Electronic circuits. (a) With size = 122. (b) With size = 252. (c) With size = 512. Data downloaded from http://www.weizmann.ac.il/mcb/UriAlon [Download date: 15 March 2005]. Data used in [33].
References
1. L.A. Adamic and N. Glance, The political blogosphere and the 2004 US election: Divided they blog, in Proceedings of the WWW-2005 Workshop on the Weblogging Ecosystem (2005)
2. R. Albert, A.-L. Barabási, Statistical mechanics of complex networks, Reviews of Modern Physics 74, 2002, 47–97
3. F.M. Atay, T. Bıyıkoğlu, J. Jost, Synchronization of networks with prescribed degree distributions, IEEE Trans. Circuits and Systems I 53(1), 2006, 92–98
4. F.M. Atay, T. Bıyıkoğlu, J. Jost, Network synchronization: Spectral versus statistical properties, Phys. D 224, 2006, 35–41
5. F.M. Atay, J. Jost, A. Wende, Delays, connection topology, and synchronization of coupled chaotic maps, Phys. Rev. Lett. 92(14), 2004, 144101
6. A. Banerjee, J. Jost, Laplacian spectrum and protein-protein interaction networks, preprint
7. A. Banerjee, J. Jost, On the spectrum of the normalized graph Laplacian, Lin. Alg. Appl. 428, 2008, 3015–3022
8. A. Banerjee, J. Jost, Graph spectra as a systematic tool in computational biology, Discr. Appl. Math., to appear
9. A. Banerjee, J. Jost, Spectral plots and the representation and interpretation of biological data, Theory Biosc. 126, 2007, 15–21
10. A. Banerjee, J. Jost, Spectral plot properties: Towards a qualitative classification of networks, NHM 3, 2008, 395–411
11. A.-L. Barabási, R.A. Albert, Emergence of scaling in random networks, Science 286, 1999, 509–512
12. P. Blanchard, T. Krüger, The “Cameo” principle and the origin of scale-free graphs in social networks, J. Stat. Phys. 114, 1399–1416, 2004
13. T. Bıyıkoğlu, J. Leydold, P. Stadler, Laplacian Eigenvectors of Graphs, Springer Berlin, 2007
14. B. Bollobás, Modern Graph Theory, Springer, Berlin, 1998
15. F. Chung, Spectral Graph Theory, AMS, Providence, RI, 1997
16. F. Chung, L.Y. Lu, Complex Graphs and Networks, AMS, Providence, RI, 2006
17. S.N. Dorogovtsev, J.F.F. Mendes, Evolution of Networks, Oxford University Press, Oxford, 2003.
18. M. Faloutsos et al., On power-law relationships of the Internet topology, SIGCOMM, 1999.
19. P.M. Gleiser, L. Danon, Community structure in Jazz, Advances in Complex Systems (ACS) 6(4), 2003, 565–573
20. C. Godsil, G. Royle, Algebraic Graph Theory, Springer, Berlin, 2001
21. R. Guimera et al., Self-similar community structure in a network of human interactions, Physical Review E 68, 2003, 065103(R)
22. M. Horton, H. Stark, A. Terras, What are zeta functions of graphs and what are they good for? In Quantum graphs and their applications, Contemp. Math., Amer. Math. Soc., Providence, RI, 415, 2006, 173–189
23. M. Ipsen, A.S. Mikhailov, Evolutionary reconstruction of networks, Phys. Rev. E 66(4), 046109, 2002
24. H. Jeong et al., The large-scale organization of metabolic networks, Nature 407, 2000, 651–654
25. J. Jost, Mathematical methods in biology and neurobiology, monograph, to appear
26. J. Jost, in: J.F. Feng, J. Jost, M.P. Qian (eds.), Networks: From Biology to Theory, 35–62, Springer, Berlin, 2007
27. J. Jost, M.P. Joy, Spectral properties and synchronization in coupled map lattices, Phys. Rev. E 65(1), 2002, 016201
28. J. Jost, M.P. Joy, Evolving networks with distance preferences, Phys. Rev. E 66, 2002, 36126–36132
29. D.H. Kim, A. Motter, Ensemble averageability in network spectra, Phys. Rev. Lett. 98, 2007, 248701
30. J. Kleinberg et al., The Web as a Graph: Measurements, Models, and Methods, LNCS 1627, 1999, 1–17
31. P. Krapivsky, S. Redner, Network growth by copying, Phys. Rev. E 71, 2005, 036118
32. R. Merris, Laplacian matrices of graphs – A survey, Lin. Alg. Appl. 198, 1994, 143–176
33. R. Milo et al., Network motifs: Simple building blocks of complex networks, Science 298, 2002, 824–827
34. R. Milo et al., Superfamilies of evolved and designed networks, Science 303, 2004, 1538–1542
35. B. Mohar, Some applications of Laplace eigenvalues of graphs, in: G. Hahn, G. Sabidussi (eds.), Graph Symmetry: Algebraic Methods and Applications, 227–277, Springer, Berlin, 1997
36. R. Monasson, Diffusion, localization and dispersion relations on “small-world” lattices, Europ. Phys. J. B 12, 1999, 555–567
37. M.E.J. Newman, The structure of scientific collaboration networks, Proc. Natl. Acad. Sci. USA 98, 2001, 404–409
38. M.E.J. Newman, Finding community structure in networks using the eigenvectors of matrices, Phys. Rev. E 74, 2006, 036104
39. M. Newman, The structure and function of complex networks, SIAM Review 45, 2003, 167–256
40. S. Ohno, Evolution by Gene Duplication, Springer, Berlin, 1970
41. L.M. Pecora, T.L. Carroll, Synchronization in chaotic systems, Phys. Rev. Lett. 64, 1990, 821–824
42. A. Pikovsky, M. Rosenblum, J. Kurths, Synchronization – A Universal Concept in Nonlinear Science, Cambridge University Press, Cambridge, 2001
43. G. Rangarajan, M.Z. Ding, Stability of synchronized chaos in coupled dynamical systems, Phys. Lett. A 296, 2002, 204–212
44. H. Simon, On a class of skew distribution functions, Biometrika 42, 1955, 425–440
45. R. Solé et al., A model of large scale proteome evolution, Adv. Compl. Syst. 5, 2002, 43–54
46. A. Vazquez et al., Modelling of protein interaction networks, ComPlexUs 1, 2003, 38–44
47. A. Wagner, How the global structure of protein interaction networks evolves, Proc. Roy. Soc. B 270, 2003, 457–466
48. A. Wagner, Evolution of gene networks by gene duplications – A mathematical model and its implications on genome organization, Proc. Nat. Acad. Sciences USA 91(10), 1994, 4387–4391
49. D.J. Watts, S.H. Strogatz, Collective dynamics of ‘small-world’ networks, Nature 393, 1998, 440–442
50. J.G. White et al., The structure of the nervous system of the nematode Caenorhabditis elegans, Phil. Trans. Royal Soc. of London Series B-Bio. Sc. 314, 1986, 1–340
51. P. Zhu, R.C. Wilson, A study of graph spectra for comparing graphs. In Proc. of British Machine Vision Conf. (BMVC), Sep 2005
52. K.H. Wolfe, D.C. Shields, Molecular evidence for an ancient duplication of the entire yeast genome, Nature 387(6634), 1997, 708–713
Dynamics of Social Complex Networks: Some Insights into Recent Research

Sergi Lozano
ETH Zurich, Swiss Federal Institute of Technology, UNO D11, Universitätstr. 41, 8092 Zurich, Switzerland;
[email protected]
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_8, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

1 Introduction: Social Networks as Complex Networks

Social network analysis (that is, the study of interactions among social actors from a structural viewpoint) has a long tradition covering several decades [1, 2, 3]. This sort of study has usually been performed on small social networks, and this limitation of size has restricted the visibility of their complexity [4, 5]. However, the situation has changed significantly in recent times for two main reasons. First, larger social datasets are increasingly available (obtained in most cases from information and communication technologies). Second, a large number of physicists and other scholars from complexity science have started to take an active interest in the field. These ‘newcomers’ have provided new perspectives and tools which, in combination with the expertise and knowledge accumulated by ‘classical’ social network analysts, have formed the basis of a multidisciplinary field suitably termed the science of networks [6, 7]. This research has led to a formal characterization of the complexity exhibited by social networks, which can be assessed against the following simple ‘check list’ [5].

1. The network must consist of a large number of nodes showing substantial heterogeneity. Here we understand heterogeneity to mean diversity of degree.
2. Its structure has to present an ‘intricate architecture’, that is, a topology that cannot be expressed in terms of simple patterns (like ‘regular’ or ‘completely random’) but must include several degrees of freedom.
3. This topological complexity is translated into the global system behavior in the form of ‘emergent phenomena’, i.e. even simple local interaction rules lead to a performance of the whole system that is richer than the sum of local effects.
4. This influence of local feedbacks over the macroscopic behavior can be manifested, in particular, as nonlinearities in the operation of the processes that shape the network itself (i.e. specific structural features emerge suddenly when an external parameter exceeds a certain threshold value).

Regarding the fulfillment of this list of requirements by social networks, Vega-Redondo [5] refers to the results of previous studies of social structure to confirm that social networks satisfy the first two. Following the same reasoning, we notice that the other two requirements (covering dynamic aspects) are repeatedly recognized in social phenomena, for instance, in collective behavior and social mobilization [8, 9] (third point), or in the emergence of hierarchical social structures from interactions at the individual level [10, 11] (fourth point).

Having confirmed that social networks are indeed complex networks, in this chapter we focus on the dynamic aspects of this complexity (the last two points in the check list above). More concretely, we overview some of the recent research that addresses dynamics on and of social networks from the perspective of complex systems.

The rest of the chapter is structured as follows. The second section is devoted to works dealing, as separate topics, with the analysis of social phenomena over static social networks and with the time evolution of the social structure. The third section focuses on the coevolution of social structure and phenomena, stressing the importance of this interplay from the complexity viewpoint. Finally, the last section summarizes the whole chapter and points out some ideas about the future evolution of the field.
2 Approaching the Dynamics on and of Social Networks Separately

The majority of recent studies on social networks from a complexity perspective treat dynamics on and of social networks as different lines of research. In the first case, each node (social actor) is considered to be a dynamical system whose state evolves, in part, as a function of the topological features of the underlying static social substrate. Given the ‘intricate architecture’ (to use the expression from the Introduction) that characterizes social networks, this scenario results in the nonlinear global behaviors already mentioned. In the second case, the whole network is considered to be a dynamical system whose topological state evolves according to local rules. Investigations along this line have found that certain social rules at the local (individual) interaction level can forge some of these ‘intricate structural patterns’. In accordance with this scheme, we address these two research lines separately.

2.1 Dynamics on Social Networks

Topology is an important aspect that is always present in social dynamics [12]. Accordingly, social network analysis has placed great importance on studying
the influence of social networks and the individual’s role in the evolution of different social phenomena. A good example of this can be found in the research devoted to diffusion of innovations [13, 14]. This perspective has resulted in an in-depth knowledge of the most important structural characteristics of social networks and their influence on the behavior of the social actors, as has been recognized by scholars recently entering the field from complexity science [6, 7] (although some ‘traditional’ social network analysts argue that this recognition by the ‘newcomers’ has been insufficient [1]). The incorporation of these ‘newcomers’ has not changed this orientation, but has reinforced it by contributing new analysis and modeling methodologies.

The works ensuing from this combination of tools and perspectives have produced highly relevant results. Some of them, for example, have related the emergence and resilience of cooperation in social groups to certain structural features of the underlying social network, such as degree heterogeneity [15, 16, 17] or community structure [18]. Others have shown that scale-freeness and the small-world phenomenon can influence the consensus time of opinions in a population [19, 20] and even force scenarios with coexisting domains of opposite opinions [21]. For a fuller account of the tools and perspectives developed for explaining and modeling social networks, the reader can consult the exhaustive recent reviews on game theory [22], opinion dynamics [12, 23], language dynamics [12] and spreading phenomena [5].

Finally, as a sample of work addressing social dynamics on networks, the first chapter of Part in this book presents a study of epidemic spreading [24]. In this work, the authors apply a mesoscopic (neither individual nor global, but intermediate) structural approach to predict and understand the spreading of an incurable disease (like HIV) over an empirical static network. First, they study the division into subnetworks or regions of a real social network of sexual contacts obtained by means of interviews, and deduce qualitative predictions about infection spreading from the observed topological features. Second, they use a computational model to test these predictions numerically and to design possible protection strategies suitable for this particular case. This work represents an important contribution to the literature on disease spreading, since it highlights the analysis and visualization possibilities of mesoscopic approximations.

2.2 Dynamics of Social Networks

The second separate approach that we consider in this section is based on the study of network processes, that is, “series of events that create, sustain and dissolve social structures” [25]. Logically, this sort of study requires time to be taken into account in addition to the structural description. However, in the past, social network analysis mainly focused on the study of static social networks and their influence on individual and collective behavior [6]. Borgatti [26] argues that one of the reasons behind such an orientation was the
difficulty of obtaining longitudinal empirical data. As pointed out in the Introduction, this scenario has changed lately with the increasing availability of large social datasets obtained from different information and communication technologies (email traffic, mobile phone calls, activities within peer-to-peer systems, social media and social networking websites, etc.). Taking advantage of this new availability, scholars have developed different methodologies to understand the evolution of social networks using these data as input [25, 27, 28].

The (generally) large size of these datasets has given rise to especially interesting applications from a complex network perspective. On one side, we find works that try to deduce the basic mechanisms ruling social network processes. To do that, the authors analyze the evolution of these social datasets from a statistical point of view (macroscopic level) [4], focus on their modular structure (mesoscopic or intermediate level) [29, 30, 31], or address key individual properties such as centrality (microscopic level) [32].

On the other side, following the example of the seminal works by Watts and Strogatz [33] and Barabási and Albert [34], datasets are also used to validate simple models based on single mechanisms that forge complex social-like features. In these works, empirical data are compared with the models’ simulations in terms of structural parameters at different topological scales. For example, some of these works present extensions of Barabási’s preferential attachment models and focus on the degree distribution [35, 36]. Others present variants of the seceder model (where the mechanism conditioning topological evolution is based on each agent’s efforts to differentiate itself from the crowd) [37]. Finally, in Ref. [38] the authors propose a model where each agent is assigned a set of social values (representing different social attributes), and ties are established as a function of the social distances among agents (differences between their social attributes) and of α, a parameter quantifying the homophily in the system (the individuals’ preference to establish and maintain links with other individuals they feel similar to). Interestingly, for different values of α the resulting social network presents different modular structures, while preserving general topological features of social networks (such as assortativity or high clustering).
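As a rough illustration of how a homophily parameter α can shape tie formation, the following minimal Python sketch places agents on a single social attribute and links pairs with a probability that decays with their social distance. The one-dimensional attribute, the specific kernel 1 / (1 + (d / scale)^α), the `scale` parameter and all function names are assumptions made for this example; they are not taken from Ref. [38].

```python
import random


def connection_probability(distance, alpha, scale=0.1):
    """Assumed kernel: equals 1 at zero distance and decays with distance.

    Larger alpha makes the decay around `scale` sharper (stronger homophily).
    Requires alpha > 0 and scale > 0.
    """
    return 1.0 / (1.0 + (distance / scale) ** alpha)


def social_distance_network(n_agents, alpha, scale=0.1, seed=0):
    """Draw one attribute per agent in [0, 1) and link pairs at random.

    Returns the attribute list and a set of undirected edges (i, j), i < j.
    """
    rng = random.Random(seed)
    attributes = [rng.random() for _ in range(n_agents)]
    edges = set()
    for i in range(n_agents):
        for j in range(i + 1, n_agents):
            distance = abs(attributes[i] - attributes[j])
            if rng.random() < connection_probability(distance, alpha, scale):
                edges.add((i, j))
    return attributes, edges


def mean_tie_distance(attributes, edges):
    """Average social distance spanned by the existing ties."""
    if not edges:
        return 0.0
    total = sum(abs(attributes[i] - attributes[j]) for i, j in edges)
    return total / len(edges)


if __name__ == "__main__":
    for alpha in (1.0, 3.0, 6.0):
        attrs, edges = social_distance_network(200, alpha)
        print(alpha, len(edges), round(mean_tie_distance(attrs, edges), 3))
```

Under these assumptions, raising α concentrates ties among agents with similar attributes, which is the kind of mechanism that can produce group-like structure in such models.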
3 Coevolution: Social Networks and Phenomena

The separation into two different lines of research presented in the previous section has been the common approach to social complex networks until recently. However, real-life observations suggest that there is normally a certain interdependency between the evolution of the social structure and the behavior of each of the social actors [39]. Consider a friendship network as an example. On one side, friendship relationships (network links) are the paths used, for instance, to cooperate, inform or imitate behaviors. Thus, the structure conditions different social processes related to
these actions (like cooperation and diffusion of habits, for example). On the other side, the stronger the friendship between two people, the more probable it is that they introduce each other to new friends, modifying their mutual ‘friendship local neighborhood’ and, consequently, the whole structure of the network. In general, networks exhibiting such a feedback loop are called coevolutionary or adaptive networks [40].

This interdependency has clear implications from a complexity point of view. If structural patterns of social networks can induce nonlinearity in the social phenomena evolving over them and, likewise, social network processes forge the emergence of complex structural features, a coevolutive scheme necessarily has to lead to scenarios exhibiting extremely rich behaviors. In their recent review on adaptive networks, Gross and Blasius [40] support this assertion by reporting a list of four ‘hallmarks’ typically presented by adaptive networks in general (and social networks in particular):

• Self-organization towards a dynamical critical state.
• Emergence of ‘specialized’ roles from an initially homogeneous population.
• Formation of complex global topologies (even from very simple local rules).
• Highly complex macroscopic dynamics due to the interaction of local states and topological complexity.
In the following, we review some recent works that have addressed interesting sociological topics from an adaptive network perspective. We also identify some of the preceding hallmarks in the examples discussed.

3.1 Cooperation in Coevolutive Models

In Ref. [41], Skyrms and Pemantle claim to “(..) create models that are more true to life (..)” by incorporating coevolution of structure and strategies into evolutionary game theory models. Since then, some authors have proposed models where players’ strategies depend on the structure but, at the same time, players can modify the connectivity in their local neighborhood in order to maximize the payoff of a certain strategy (modifying, as an aggregate effect, the whole topology at the macroscopic level).

Cooperation among individuals and, more concretely, the evolution of one-shot versions of the Prisoner’s Dilemma played over adaptive networks, have been intensively studied. In the results of these works, we can find some of the four ‘complexity hallmarks’ of coevolving networks listed previously. For example, in some cases the authors identify the formation of scale-free topologies (which present a power-law degree distribution) [42, 43] and the emergence of differentiated roles and hierarchies [42, 44, 45]. Moreover, regarding system dynamics, Ebel and Bornholdt [46], Eguiluz and co-workers [42] and Zimmermann and Eguiluz [44] report large avalanches of strategy changes when the system approaches the final state, identifying a sort of self-organized critical behavior.

As a particular case, Ref. [47] analyzes a scenario where topological changes occur much faster than changes of individuals’ strategies. The authors find
that the evolution of individual strategies in this situation no longer corresponds to the Prisoner’s Dilemma, but to a sort of coordination game, leading to a situation more favorable to cooperation. This result highlights the effect of separating the time scales of structural and individual dynamics. Notice that the scenarios considered in the previous section (with static networks or nonevolving nodes) can be seen as extreme cases of coevolving networks with completely separated time scales (i.e. one of the two time scales is so long compared with the other that the slower process is treated as static). In accordance with the importance of the relation between the two time scales in coevolutive scenarios, we find (as we will see in the following subsections) several works that analyze this influence and that consider the cases with one aspect (network or individuals) static as bounding cases.

3.2 Communication and Diffusion of Information in Social Networks

The interplay between communication within a population of socioeconomic agents and its underlying social structure is an interesting social topic that deserves further study [48]. Taking business relationships as an example, an agent would presumably like to occupy a network position that is as strategic as possible in terms of information reception and processing (close to the other agents in terms of average distance, or with a high betweenness, for instance). Moreover, since the socioeconomic environment is usually volatile (it keeps changing), actors need to be continuously looking for better contacts and ‘fresh opportunities’ [49, 50]. In such a dynamical scenario, where “who communicates with whom” and the social structure are strongly entangled, this issue is especially suitable for study from a coevolutive viewpoint.

Following this perspective, we can find recent works focused on an individual’s movements across the social structure to reach strategic positions while minimizing linking costs [51, 52], or works targeting key positioned individuals [48]. Other authors investigate the impact of communication on social structure both quantitatively (more or less communication) and qualitatively (different communication strategies) [53]. In general, the models employed in these works generate social structures that present complex patterns like modular structures [48]. Furthermore, some of these works report interesting behaviors of the modeled system, like self-organization to states close to the transition between fragmented and ordered states [51], sharp phase transitions and resilience of the structure [49, 50].

3.3 Opinion and Cultural Dynamics

Opinion and cultural dynamics are other important social topics which have been addressed from a coevolutive viewpoint.
Centola and co-workers [54] presented a coevolutionary version of Axelrod’s model of the dissemination of culture [55]. As in the seminal model by Axelrod, they represent cultural traits and features by numerical values that are transmitted (copied) among individuals in contact, with the difference that the topology of interactions among individuals also evolves. More concretely, agents can erase links to neighbors with whom they have no common social trait (i.e. the affinity between them is 0) and rewire them. The model presents a complex relationship between heterogeneity and cultural diversity, in which a high diversity can reduce cultural group formation while simultaneously increasing social connectedness.

The coevolutive approach has also been used in several recent works addressing opinion formation processes. In Refs. [56, 57, 58, 59], the authors propose coevolutive versions of the two-state voter model [60] to study consensus in a population’s opinions. In this kind of model, interactions between agents are enhanced or penalized (or even broken) according to whether they succeed in reaching an agreement or not. From a complex network point of view, these models are used to explore the transition between different states, with a special interest in the emergence and duration of metastable states reached before consensus.

Another model of opinion formation based on a coevolutive approach that has received considerable attention is proposed in [61]. This model is especially interesting regarding time-scale separation. In each time step, either a rewiring (structural change) or an opinion imitation (evolution of a local state) occurs, with probabilities φ and 1 − φ, respectively. Therefore, by tuning φ the authors can easily recover one of the two extreme single-evolving cases (static network or nonevolving nodes) or move through different intermediate scenarios.
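The update rule just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of Ref. [61]: the choice of which link is rewired, the restriction of new partners to like-minded non-neighbors, and the names `random_network` and `coevolving_opinion_step` are all hypothetical.

```python
import random


def random_network(n_nodes, n_edges, rng):
    """Undirected random graph stored as a list of neighbor sets."""
    neighbors = [set() for _ in range(n_nodes)]
    added = 0
    while added < n_edges:
        a, b = rng.sample(range(n_nodes), 2)
        if b not in neighbors[a]:
            neighbors[a].add(b)
            neighbors[b].add(a)
            added += 1
    return neighbors


def coevolving_opinion_step(neighbors, opinion, phi, rng):
    """One update: rewire with probability phi, otherwise imitate a neighbor."""
    node = rng.randrange(len(opinion))
    if not neighbors[node]:
        return  # isolated nodes can neither rewire nor imitate
    partner = rng.choice(sorted(neighbors[node]))
    if rng.random() < phi:
        # Structural change (assumed rule): replace the link to `partner`
        # by a link to a like-minded node that is not yet a neighbor.
        candidates = [v for v in range(len(opinion))
                      if v != node
                      and v not in neighbors[node]
                      and opinion[v] == opinion[node]]
        if candidates:
            new_partner = rng.choice(candidates)
            neighbors[node].discard(partner)
            neighbors[partner].discard(node)
            neighbors[node].add(new_partner)
            neighbors[new_partner].add(node)
    else:
        # Local state change: copy the opinion of the selected neighbor.
        opinion[node] = opinion[partner]


if __name__ == "__main__":
    rng = random.Random(1)
    net = random_network(60, 120, rng)
    opinions = [rng.randrange(6) for _ in range(60)]
    for _ in range(5000):
        coevolving_opinion_step(net, opinions, phi=0.5, rng=rng)
    print(len(set(opinions)), "distinct opinions remain")
```

In this sketch, φ = 0 leaves the topology untouched (opinion dynamics on a static network), while φ = 1 leaves the opinions untouched (pure network evolution), matching the two limiting cases mentioned above.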
By studying the whole range of possible situations, the authors find that the model undergoes a continuous phase transition as φ is varied, from a regime in which opinions are highly diverse to one in which most individuals hold the same opinion.

3.4 Spreading Phenomena

Last but not least, there is a growing literature on spreading phenomena (epidemics, diseases, infections) from an adaptive network perspective. An example is the series of works proposing and analyzing an adaptive version of the SIS (Susceptible-Infected-Susceptible) model, where susceptible individuals try to avoid infection by erasing their links with the infected population [62, 63, 64]. These works analyze how different levels of rewiring modify the dynamics of the adaptive SIS model (note that this implies, once again, studying the effect of having separated time scales). One common observation is that high levels of rewiring lead to the self-organization of the susceptible population into a single, densely connected cluster. If an individual in the cluster eventually becomes infected, this sort of organization favors a rapid spreading of the disease, which is seen as an avalanche of state changes from a macroscopic viewpoint.
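The adaptive SIS mechanism described above can be sketched as follows. This is a minimal Python illustration under stated assumptions rather than the model of Refs. [62, 63, 64]: the synchronous update, the per-link probabilities `p_infect` and `p_rewire`, the per-node probability `p_recover`, the rule that a susceptible node reconnects to a randomly chosen susceptible non-neighbor, and the helper `ring_lattice` are all hypothetical choices.

```python
import random


def ring_lattice(n_nodes, k_each_side):
    """Ring where every node is linked to its k nearest nodes on each side."""
    neighbors = [set() for _ in range(n_nodes)]
    for v in range(n_nodes):
        for step in range(1, k_each_side + 1):
            w = (v + step) % n_nodes
            neighbors[v].add(w)
            neighbors[w].add(v)
    return neighbors


def adaptive_sis_step(neighbors, infected, p_infect, p_recover, p_rewire, rng):
    """One synchronous step; `neighbors` and `infected` are modified in place."""
    n_nodes = len(neighbors)
    # Recoveries are drawn first but applied at the end of the step, so a node
    # that recovers now can still transmit during this step.
    recovered = {v for v in sorted(infected) if rng.random() < p_recover}
    newly_infected = set()
    # Snapshot of all susceptible-infected links at the start of the step.
    si_links = [(s, i) for i in sorted(infected)
                for s in sorted(neighbors[i]) if s not in infected]
    for s, i in si_links:
        if rng.random() < p_rewire:
            # The susceptible node drops its infected contact and links to a
            # susceptible node that is not yet one of its neighbors.
            candidates = [v for v in range(n_nodes)
                          if v != s and v not in infected
                          and v not in neighbors[s]]
            if candidates:
                target = rng.choice(candidates)
                neighbors[s].discard(i)
                neighbors[i].discard(s)
                neighbors[s].add(target)
                neighbors[target].add(s)
        elif rng.random() < p_infect:
            # Transmission is only attempted along links that were not rewired.
            newly_infected.add(s)
    infected -= recovered
    infected |= newly_infected


if __name__ == "__main__":
    rng = random.Random(0)
    net = ring_lattice(200, 2)
    sick = set(range(10))
    for _ in range(100):
        adaptive_sis_step(net, sick, p_infect=0.2, p_recover=0.05,
                          p_rewire=0.3, rng=rng)
    print(len(sick), "infected nodes after 100 steps")
```

Varying `p_rewire` relative to `p_infect` and `p_recover` is one simple way to explore, in such a toy setting, how the separation between structural and epidemic time scales affects the outcome.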
We also mention the work presented in [65], where the authors propose an innovative coevolutionary model of HIV infection spreading through the use of dynamic complex networks. On the one hand, the state of each individual (her health situation) is determined by means of a Markov process that takes into account both topological data (such as the number of infected neighbors) and information regarding the HIV infection (probability of infection and progression from HIV to AIDS, for instance). On the other hand, the social structure of the population is defined at each time step as a function of certain statistical features and the state of the nodes (nodes with AIDS are removed from the network). The authors find a good correspondence between simulation results and real historical demographic and epidemiological data from the United States. Moreover, this epidemiological prediction model could be integrated into related decision support systems (regarding anti-drug policy, for instance).
4 Conclusions

In summary, the analysis of social network dynamics has proved to be an outstanding application of complex network theory, as demonstrated by the large (and increasing) amount of work developed in the field during recent years. Two factors have contributed decisively to this success: the availability of large longitudinal social datasets obtained from communication technologies, and the large-scale influx of scientists from complexity science (especially physicists) into social network analysis.

In this chapter, we have provided a general view of recent research in this area. Following the evolution of the literature in the field, we have first referred to works treating dynamics on and of social networks separately, and have later addressed a more recent approach integrating both sorts of dynamics in a coevolutive scheme. In both cases, but especially in the latter, we have also echoed the results reported by authors regarding some of the points of Vega-Redondo’s check list given in the Introduction (emergence of nontrivial structural patterns, nonlinear macroscopic behaviors induced by local processes, etc.). When discussing coevolutive models, we have also stressed the effect of having more or less separated time scales for dynamics on and of social networks.

Finally, research on the dynamics of social complex networks is expected to keep growing, as the availability of datasets is increasing and the field continues to attract scholars. Nevertheless, to ensure this growth, issues like the ethical implications of social data collection and analysis [66, 67], and the integration of different disciplines and perspectives within the aforementioned science of networks, should be seriously addressed.
References

1. Freeman, L.C.: The Development of Social Network Analysis: A Study in the Sociology of Science. Empirical Press, Vancouver (BC Canada) (2004).
2. Scott, J.: Social Network Analysis: A Handbook. SAGE Publications, London (2000).
3. Wasserman, S., Faust, K.: Social Networks Analysis: Methods and Applications. Cambridge University Press, New York (1994).
4. Holme, P., Edling, C.R., Liljeros, F.: Structure and time evolution of an Internet dating community. Social Networks 26, 155–174 (2004).
5. Vega-Redondo, F.: Complex Social Networks. Cambridge University Press, New York (2007).
6. Watts, D.J.: Six Degrees: The Science of a Connected Age. W. W. Norton & Company Inc., New York (2003).
7. Barabási, A.-L.: Linked: The New Science of Networks. Perseus Publishing, Cambridge (USA) (2002).
8. Coleman, J.: Foundations of Social Theory. Harvard University Press, Cambridge, MA (1990).
9. Gould, R.V.: Collective action and network structure. American Sociological Review 58 (2), 182–196 (1993).
10. Gould, R.V.: The origins of status hierarchies: A formal theory and empirical test. American Journal of Sociology 107 (5), 1143–78 (2002).
11. Epstein, J.M.: Generating classes without conquest. In: Generative Social Science: Studies in Agent-Based Computational Modeling. Princeton University Press, Princeton, NJ (2007).
12. Castellano, C., Fortunato, S., Loreto, V.: Statistical physics of social dynamics. Reviews of Modern Physics (Accepted) 348 (2008).
13. Rogers, E.M.: Diffusion of Innovations (5th ed.). Free Press, New York (2003).
14. Valente, T.W.: Models and methods for innovation diffusion. In: Carrington, P., Scott, J., Wasserman, S. (ed) Models and Methods in Social Network Analysis. Cambridge University Press, New York (2005).
15. Abramson, G., Kuperman, M.: Social games in a social network. Phys. Rev. E 63, 030901 (2001).
16. Duran, O., Mulet, R.: Evolutionary prisoners dilemma in random graphs. Physica D 208 (3–4), 257–265 (2005).
17. Santos, F.C., Pacheco, J.M., Lenaerts, T.: Evolutionary dynamics of social dilemmas in structured heterogeneous populations. Proc. Natl. Acad. Sci. 103, 3490–3494 (2006).
18. Lozano, S., Arenas, A., Sanchez, A.: Mesoscopic structure conditions the emergence of cooperation on social networks. PLoS ONE 3(4): e1892 doi: 10.1371/journal.pone.0001892 (2008).
19. Castellano, C., Loreto, V., Barrat, A., Cecconi, F., Parisi, D.: Comparison of voter and Glauber ordering dynamics on networks. Phys. Rev. E 71 (6), 066107 (2005).
20. Sood, V., Redner, S.: Voter model on heterogeneous graphs. Phys. Rev. Lett. 94 (17), 178701 (2005).
21. Castellano, C., Vilone, D., Vespignani, A.: Incomplete ordering of the voter model on small-world networks. Europhys. Lett. 63 (1), 153–158 (2003).
22. Szabó, G., Fáth, G.: Evolutionary games on graphs. Phys. Rep. 446 (4–6), 97–216 (2007).
23. Stauffer, D.: Sociophysics Simulations II: Opinion Dynamics. arXiv:physics/0503115v1 [physics.soc-ph] (2005).
24. Bjelland, J., Canright, G., Engø-Monsen, K., Remple, V.P.: Topographic spreading analysis of an empirical sex workers network. In: (ed). Springer, Berlin (2008).
25. Doreian, P., Stokman, F.N. (ed): Evolution of Social Networks. Routledge, London (1997).
26. Borgatti, S.P.: The State of Organizational Social Network Research Today. Dept. of Organization Studies. Boston College, Boston, MA (2003).
27. Snijders, T.A.B.: Models for longitudinal network data. In: Carrington, P., Scott, J., Wasserman, S. (ed) Models and Methods in Social Network Analysis. Cambridge University Press, New York (2005).
28. Dorogovtsev, S.N., Mendes, J.F.F.: Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford University Press, Oxford (2003).
29. Palla, G., Barabási, A.-L., Vicsek, T.: Quantifying social group evolution. Nature 446 (5), 664–667 (2007).
30. Eckmann, J.-P., Moses, E., Sergi, D.: Entropy of dialogues creates coherent structures in e-mail traffic. PNAS 101 (40), 14333–14337 (2004).
31. Onnela, J.-P., Saramäki, J., Hyvönen, J., Szabó, G., Lazer, D., Kaski, K., Kertész, J., Barabási, A.-L.: Structure and tie strengths in mobile communication networks. PNAS 104 (18), 7332–7336 (2007).
32. Braha, D., Bar-Yam, Y.: From centrality to temporary fame: Dynamic centrality in complex networks. Complexity 12 (2), 59–63 (2006).
33. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393, 440–442 (1998).
34. Barabási, A.-L., Albert, R.: Emergence of scaling in random networks. Science 286, 509–512 (1999).
35. Jin, E.M., Girvan, M., Newman, M.E.J.: Structure of growing social networks. Phys. Rev. E 64, 046132 (2001).
36. Roth, C.: Generalized Preferential Attachment: Towards Realistic Social Network Models. ISWC 4th Intl Semantic Web Conference. (2005).
37. Grönlund, A., Holme, P.: Networking the seceder model: Group formation in social and economic systems. Phys. Rev. E 70, 036108 (2004).
38. Boguñá, M., Pastor-Satorras, R., Díaz-Guilera, A., Arenas, A.: Models of social networks based on social distance attachment. Phys. Rev. E 70, 056122 (2004).
39. Lazer, D.: The co-evolution of individual and network. J. Math. Sociol. 25, 69–108 (2001).
40. Gross, T., Blasius, B.: Adaptive coevolutionary networks: A review. J. R. Soc. Interface 5 (20), 259–271 (2007).
41. Skyrms, B., Pemantle, R.: A dynamic model of social network formation. Proc. Nat. Acad. Sci. 97 (16), 9340–9346 (2000).
42. Eguiluz, V.M., Zimmermann, M.G., Cela-Conde, C.J., San Miguel, M.: Cooperation and the emergence of role differentiation in the dynamics of social networks. AJS 110 (4), 977–1008 (2005).
43. Biely, C., Dragosits, K., Thurner, S.: The prisoners dilemma on co-evolving networks under perfect rationality. Physica D 228, 40–48 (2007).
44. Zimmermann, M.G., Eguíluz, V.M.: Cooperation, social networks, and the emergence of leadership in a prisoners dilemma with adaptive local interactions. Phys. Rev. E 72, 056118 (2005).
45. Zimmermann, M.G., Eguíluz, V.M., San Miguel, M.: Coevolution of dynamical states and interactions in dynamic networks. Phys. Rev. E 69, 065102(R) (2004).
Dynamics of Social Complex Networks
143
46. Ebel, H., Bornholdt, S.: Coevolutionary games on networks. Phys. Rev. E 66, 056118 (2002). 47. Pacheco, J.M., Traulsen, A., Nowak, M.A.: Coevolution of strategy and structure in complex networks with dynamical linking. Phys. Rev. Lett. 97, 258103 (2006). 48. Rosvall, M., Sneppen, K.: Dynamics of opinions and social structures. arXiv:0708.0368v2 [physics.soc-ph] (2007). 49. Marsili, M., Vega-Redondo, F., Slanina, F.: The rise and fall of a networked society: A formal model. Proc. Nat. Acad. Sci. 101, 1439–1442 (2004). 50. Ehrhardt, G.C.M.A, Marsili, M., Vega-Redondo, F.: Phenomenological models of socioeconomic network dynamics. Phys. Rev. E 74, 036106 (2006). 51. Holme, P., Ghoshal, G.: Dynamics of networking agents competing for high centrality and low degree. Phys. Rev. Lett. 96, 098701 (2006). 52. K¨ onig, M.D, Battiston, S., Napoletano, M., Schweitzer, F.: On algebraic graph theory and the dynamics of innovation networks. Networks and Heterogeneous Media 3 (2) 201–220 (2007). 53. Rosvall, M., Sneppen, K.: Modeling self-organization of communication and topology in social networks. Phys. Rev. E 74, 016108 (2006). 54. Centola, D., Gonz´ alez-Avella, J.C., Egui´ıluz, V.M., San Miguel, M.: Homophily, cultural drift, and the co-evolution of cultural groups. J. of Conflict Resolution 51 (6), 905–929 (2007). 55. Axelrod, R.: The dissemination of culture: A model with local convergence and global polarization. The Journal of Conflict Resolution 41 (2), 203–226 (1997). 56. Benczik, I.J., Benczik, S.Z., Schmittmann, B., Zia, V.: Lack of consensus in social systems. EPL 82, 48006 (2007). 57. V´ azquez, F., Egu´ıluz, V.M., San Miguel, M.: Generic absorbing transition in coevolution dynamics. Phys. Rev. Lett. 100, 108702 (2007). 58. Zanette, D.H., Gil, S.: Opinion spreading and agent segregation on evolving networks. Phys. D 224, 156–165 (2006). 59. Gil, S., Zanette, D.H.: Coevolution of agents and networks: Opinion spreading and community disconnection. Phys. Lett. 
A 356, 89–95 (2006). 60. Liggett, T.M.: Interacting Particle Systems. Springer, New York (1985). 61. Holme, P., Newman, M.E.J.: Nonequilibrium phase transition in the coevolution of networks and opinions. Phys. Rev. E 74, 056108 (2006). 62. Gross, T., D’Lima, C.J.D., Blasius, B.: Epidemic dynamics on an adaptive network. Phys. Rev. Lett. 96, 208701 (2006). 63. Gross, T., Kevrekidis, I.G.: Coarse-graining adaptive coevolutionary network dynamics via automated moment closure. arXiv:nlin/0702047v1 [nlin.AO] (2007). 64. Zanette, D.: Coevolution of agents and networks in an epidemiological model. arXiv:0707.1249v2 [physics.soc-ph] (2007). 65. Sloot, P.M.A., Ivanov, S.V., Boukhanovsky, A.V., Vijver, D., Boucher, C.A.: Stochastic simulation of HIV population dynamics through complex network modeling, Int. J. of Computer Mathematics 85 (8), 1175–1187 (2008). 66. Borgatti, S.P., Molina, J.L.: Toward ethical guidelines for network research in organizations. Social Networks. 27 (2), 107–117 (2005). 67. Birnbaum, M.H.: Methodological and ethical issues in conducting social psychology research via the Internet. In: Sansone, C., Morf, C.C., Panter, A.T. (ed) Handbook of Methods in Social Psychology. Sage, Thousand Oaks, CA (2004).
The Structure and Dynamics of Linguistic Networks
Monojit Choudhury1 and Animesh Mukherjee2
1 Microsoft Research India, Sadashivnagar, Bangalore, India – 560080, [email protected]
2 Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India – 721302, [email protected]
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_9, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

1 Introduction
Human beings are unique in the biological world: they are the only organisms known to be capable of thinking, communicating and preserving a potentially infinite number of ideas, which form the pillars of modern civilization. This ability is a consequence of the complex and powerful human languages, characterized by their recursive syntax and compositional semantics [40]. It has been argued that language is a dynamic complex adaptive system that has evolved through the process of self-organization to serve the needs of human communication [80]. The complexity of human languages has always attracted the attention of physicists, who have tried to explain several linguistic phenomena through models of physical systems (see, e.g., [32, 42]).

Like any physical system, a linguistic system (i.e., a language) can be viewed from three different perspectives [52]. At one extreme, a language is a collection of utterances that are produced by the speakers of a linguistic community during the course of their interactions with other speakers of the same community. This is analogous to the microscopic view of a thermodynamic system, where every utterance and its corresponding context contribute to the identity of the language, i.e., the grammar. At the other extreme, a language can be characterized by a set of grammar rules and a vocabulary. This is analogous to a macroscopic view. Between these two extremes, one can also conceive of a mesoscopic view of language, where linguistic entities such as letters, words or phrases are the basic units and the grammar is an emergent property of the interactions among them.

Complex networks provide a suitable framework to model and study the structure and dynamics of linguistic systems from a mesoscopic perspective. Although multi-agent simulation is the preferred modeling paradigm for microscopic studies in linguistics (see, e.g., [15, 80]), there have been some works where networks are also involved. For instance, in [67], the interaction patterns between the agents are modeled as a social network, and the diffusion of linguistic innovations (which are key to language change) is studied on various network topologies. This survey is confined to works on linguistic networks at the mesoscopic level.

There has been a plethora of works on linguistic networks with various motivations and at various levels of linguistic structure. On the basis of the primary goal of the research, the work in this area can be broadly classified into two categories: (1) those which investigate the structural properties of language from the perspective of language evolution and thereby explain the emergence of certain universal characteristics of languages, and (2) those which try to exploit network-based representations to develop useful practical systems such as machine translation, information retrieval and summarization systems. This article focuses on the former, but a brief overview of the latter is also presented in Section 5.

The survey is organized from the perspective of linguistic structure. Section 2 describes lexical networks, where the nodes are words and the edges represent a lexical relationship between two words, such as phonetic or semantic similarity. In Section 3 we present an overview of various networks where the nodes are again words but, unlike in lexical networks, the edges represent their co-occurrence in similar contexts. These networks are representations of the interactions among words as governed by the grammar rules of a language. Section 4 describes phonological networks, where the nodes are sub-lexical units such as phonemes or syllables. Applications of linguistic networks in natural language processing (NLP) and information retrieval (IR) are discussed in Section 5. Section 6 concludes the survey by enumerating some open problems in the area of linguistic networks.
2 Lexical Networks
The phrase “mental lexicon” (ML) usually refers to the repository of word forms that is assumed to reside in the human brain. The average size of the receptive vocabulary of a normal high school student has been found to be more than 100,000 words [63]. Quite surprisingly, speakers are capable of navigating this huge lexicon very efficiently; judging whether a word form is legitimate takes less than 100 milliseconds. Consequently, two important questions are associated with ML: (a) how the words are stored in long-term memory, i.e., how ML is organized, and (b) how these words are retrieved from ML. Note that these questions are highly interrelated: to predict the organization one can investigate how words are retrieved from ML, and vice versa.

One of the earliest attempts to model the organization of ML was made in [13]. In this work, the authors propose a hierarchical structure of ML, where the concepts are arranged in the form of a tree and the attributes of a particular concept in this tree can be inherited by all its child concepts. Figure 1 shows a representative example formed from the concepts “animal”, “mammal” and “fish”.

Fig. 1. The hierarchical structure of ML.

While early studies like [13] focused mainly on representing the local structure of ML, its global structure remained largely unexplored. Recently, researchers have also started to investigate the global structure of ML, primarily within the framework of complex systems and, more specifically, complex networks (see [36, 45, 77, 83, 86]). In all of these studies ML is modeled as a web of interconnected nodes, where each node corresponds to a word form and the interconnections may be based on one or more of the following:
• Phonological similarity (e.g., the words banana, bear and bean may be connected since they start with the same phoneme),
• Semantic similarity (e.g., the words banana, apple and pear may be connected since all of them are names of fruits),
• Frequency of usage,
• Age at which the word forms are acquired,
• Parts of speech, and
• Orthographic properties.
In the rest of this section we review one representative study (referring, wherever applicable, to other relevant ones) for each of the complex networks constructed on the basis of (a) phonological, (b) semantic, and (c) orthographic similarities of the word forms. Syntactic similarity-based networks will be discussed in detail in the next section.

2.1 Phonological Similarity-Based Networks
Phonological similarity among word forms has been extensively studied in the past to infer the structure of ML and, consequently, the nature of a linguistic system [4, 35, 71, 81]. This large-scale phonological ML has also been studied in the framework of complex networks, in which the word forms are the nodes and two nodes (i.e., words) are connected by an edge if they differ only by the addition, deletion or substitution of one or more phonemes [36, 45, 83, 86]. One of the most popular studies is reported in [45], where the author constructs a phonological neighborhood network (PNN) in order to uncover the organizing principles of ML. In PNN there is an edge (u, v) connecting the nodes u and v iff at least two-thirds of the phonemes that occur in the word represented by u also occur in the word represented by v. For instance, if the word is 6 phonemes long, then one can derive all its neighbors by changing at most two phonemes through insertions, deletions, and substitutions. The author uses the Hoosier Mental Lexicon database [68] and builds the network from the phonologically transcribed forms of each word present in the database. More specifically, he constructs a directed network, in which a long word can have a short word as its neighbor without the short word having the long word as its neighbor. For instance, if the number of segments in which two words, say w1 and w2, differ is at most 1/3 of the length of w1, then there is a directed edge from the node corresponding to w1 to the node corresponding to w2. The fraction 1/3 is chosen because it has been useful in earlier experiments for predicting reaction times and familiarity ratings (see [53]).

The author shows that PNN is characterized by a very high clustering coefficient (0.235) but at the same time exhibits a long average path length (6.06) and a large diameter (20). This indicates that, like a small-world network, the lexicon has many densely interconnected neighborhoods. However, unlike in a small-world network, links between nodes from different neighborhoods are harder to find. Low mean path lengths are necessary in networks that are to be traversed quickly, the purpose of traversal in most cases being search. However, in the case of ML, the search should not inhibit the neighbors of the stimulus's neighbors that are non-neighbors of the stimulus itself and are, therefore, not similar to the stimulus.
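The directed edge rule described above can be sketched as follows. This is a minimal illustration, not the construction used in [45]: it assumes that "the number of segments in which two words differ" is the Levenshtein distance over phoneme symbols, and it reads the threshold as "at most one-third of the length of w1" (matching the six-phoneme example). The toy lexicon and its transcriptions are hypothetical.

```python
from itertools import permutations

def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete pa
                            curr[j - 1] + 1,      # insert pb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def build_pnn(lexicon):
    """Directed edges (w1, w2) such that distance(w1, w2) <= len(w1) / 3.

    The comparison uses integer arithmetic (3 * d <= length) to avoid
    floating-point rounding. The threshold reading is an assumption.
    """
    edges = set()
    for w1, w2 in permutations(lexicon, 2):
        d = edit_distance(lexicon[w1], lexicon[w2])
        if 3 * d <= len(lexicon[w1]):
            edges.add((w1, w2))
    return edges

# Hypothetical toy lexicon: word -> phoneme sequence.
TOY_LEXICON = {
    "cat": ["k", "ae", "t"],
    "cats": ["k", "ae", "t", "s"],
    "at": ["ae", "t"],
    "dog": ["d", "ao", "g"],
}
pnn_edges = build_pnn(TOY_LEXICON)
```

In this toy lexicon the edge from "cat" to "at" exists while the reverse edge does not, because one change is within a third of the length of "cat" (three phonemes) but not of "at" (two phonemes). This is the asymmetry that makes the network directed.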
Hence, it can be conjectured that searching PNN usually does not require traversal of links between distant nodes. Instead, the search involves activation of the structured neighborhoods that share a single sub-lexical chunk, which could be acoustically related during word recognition [55].

Further, the author shows that the degree distribution of the nodes in PNN is exponential rather than scale free. Thus, one can posit that the structure of ML is not consistent with “growth via preferential attachment”, at least for the neighborhood density metrics used in this study. The reason is that, in the standard preferential attachment model, the emergent degree distribution of the network is known to be scale free [5]. The cause of the exponential degree distribution of PNN is not yet well understood and remains an open area for further research.

2.2 Semantic Similarity-Based Networks
One of the classic examples of semantic similarity-based networks is WordNet [20]. In this network, concepts (known as synsets) are the nodes, and semantic relationships between them are represented by the edges. In [77] the authors analyze the structure of the nouns in the English WordNet database (version 1.6). The semantic relationships between the nouns are primarily of four types: (i) hypernymy/hyponymy (e.g., animal/cat), (ii) antonymy (e.g., day/night), (iii) meronymy/holonymy (e.g., trunk/tree) and (iv) polysemy (e.g., the concepts “the main stem of a tree”, “the body excluding the head and neck and limbs”, “a long flexible snout as of an elephant” and “luggage consisting of a large strong case used when traveling or for storage” are connected to each other through the polysemous word “trunk”, which can mean all of these). Some of the important findings of this work are as follows.
• Semantic relationships are scale invariant.
• The hypernymy tree forms the skeleton of the network.
• Inclusion of polysemy reorganizes the network into a small world.
• The nodes with the most traffic (i.e., nodes with the maximum number of paths passing through them) correspond to those concepts which are expressed by the most polysemous words. They are also found to have very high clustering coefficients.
• In the presence of polysemous edges, the distance between two nodes across the network does not correspond to the depth at which they are found in the hypernymy tree.
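The effect of polysemous edges on network distance can be illustrated with a small sketch. The concept names and the tiny hypernymy tree below are hypothetical and are not taken from WordNet; the code only shows how a single polysemy link can collapse a long tree path between two senses of "trunk".

```python
from collections import deque

def undirected(edges):
    """Build an adjacency map from a list of undirected edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def shortest_path_length(adj, source, target):
    """Breadth-first search; returns None if target is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return None

# Hypothetical hypernymy tree (child, parent).
HYPERNYMY = [
    ("plant_part", "entity"), ("container", "entity"),
    ("trunk_of_tree", "plant_part"), ("trunk_luggage", "container"),
]
# Polysemy edge: both synsets are expressed by the word "trunk".
POLYSEMY = [("trunk_of_tree", "trunk_luggage")]

tree_only = undirected(HYPERNYMY)
with_polysemy = undirected(HYPERNYMY + POLYSEMY)
d_tree = shortest_path_length(tree_only, "trunk_of_tree", "trunk_luggage")
d_poly = shortest_path_length(with_polysemy, "trunk_of_tree", "trunk_luggage")
```

Both senses sit at the same depth in the toy tree, yet their distance drops from four hops to one once the polysemy edge is added, so depth in the tree no longer predicts distance in the network.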
Further references to studies on such semantic relationship-based networks can be found in [1, 82]. Although several works attempt to analyze the structure of the semantic network of words, one hardly finds any study explaining the emergence of these topological properties through models of network synthesis. It would be very interesting to study how the parameters of such models correlate with semantic acquisition and symbol grounding.

2.3 Orthographic Similarity-Based Networks
As with phonological similarity networks, one can also construct networks based on orthographic similarity, where the nodes are the words and the edit distance between two words defines the weight of the edge between the corresponding nodes. Such networks have been studied in order to investigate the difficulties involved in spelling error detection and correction [11]. In this work the authors construct such networks (SpellNet) for three different languages (Bengali, Hindi and English) and analyze them to show the following.
• For a particular language, the probability of real word errors can be equated to the average weighted degree of SpellNet.
• The difficulty of non-word error correction correlates with the average clustering coefficient for a language.
• The basic topological properties are invariant across the languages; for instance, the authors find that the SpellNet for all three languages is characterized by an exponential degree distribution, a high clustering coefficient and a positive correlation between the degree and the clustering coefficient of the nodes.
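A SpellNet-like network can be sketched as below. This is an illustration under stated assumptions rather than the construction of [11]: the cut-off `max_distance` (edges are kept only for small edit distances), the use of the raw edit distance as the edge weight, and the definition of weighted degree as the sum of incident weights are all assumptions, and the word list is a toy example.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between two spellings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def build_spellnet(words, max_distance=2):
    """Map each word pair within max_distance to its edit distance."""
    weights = {}
    for w1, w2 in combinations(sorted(words), 2):
        d = edit_distance(w1, w2)
        if d <= max_distance:  # assumed cut-off, not from the source
            weights[(w1, w2)] = d
    return weights

def average_weighted_degree(words, weights):
    """Mean over nodes of the summed weights of incident edges."""
    degree = {w: 0 for w in words}
    for (w1, w2), d in weights.items():
        degree[w1] += d
        degree[w2] += d
    return sum(degree.values()) / len(degree)

WORDS = ["cat", "cot", "coat", "dog"]
spellnet = build_spellnet(WORDS)
avg_wdeg = average_weighted_degree(WORDS, spellnet)
```

On a real lexicon the pairwise loop is quadratic in the vocabulary size, so a practical implementation would need some form of candidate pruning; that is outside the scope of this sketch.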
3 Word Co-Occurrence Networks
In this section, we review the work on word co-occurrence networks, where the nodes are the words and an edge between two words indicates that the words have co-occurred in the language in certain context(s). Depending on the definition of the context, various networks can be defined. We describe in detail two such networks: the collocation network and the syntactic dependency network. As an application, we discuss the work in [79], where the collocation network has been used for unsupervised induction of the grammatical structure of a language.

3.1 Collocation Network
One of the most basic and well-studied types of co-occurrence network is the word collocation network, where two words are linked if they are neighbors, that is, if they collocate, in a sentence [24]. In this work, two types of collocation networks, unrestricted and restricted, were constructed for English from the British National Corpus. In an unrestricted network, all the collocation edges are preserved, whereas in a restricted one only those edges are preserved whose probability of occurrence is higher than would be expected if the two words occurred independently. All these networks are undirected and unweighted, even though in language the order of words (“ticket book” is different from “book ticket”) as well as the frequency of the collocations have obvious significance.

The authors found that both networks exhibit small-world properties. The average path length between any two nodes is small (around 2 to 3), and the clustering coefficients are high (0.69 for the unrestricted and 0.44 for the restricted network). However, the most striking observation regarding these networks is that the degree distributions follow a two-regime power law. The degree distribution of the 5000 most connected words follows a power law with an exponent of −3.07, which is surprisingly close to that of the Barabási-Albert growth model [5].

These findings led the authors to argue that word usage in human languages is preferential in nature, where the frequency of a word determines how easily it is comprehended and produced. Thus, the higher the usage frequency of a word, the higher the probability that speakers will be able to produce it easily and listeners will comprehend it quickly. This is known as the recency effect in linguistics [3]. The small-world property of the collocation network, on the other hand, makes it easier to search the mental lexicon (ML). In essence, the authors conclude that the evolution of language has resulted in an optimal structure of word interactions that facilitates easier and faster production, perception and navigation of the words.
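The difference between the unrestricted and the restricted network can be sketched as follows. The estimators are assumptions made for illustration: the probability of a collocation is taken as its share of all adjacent word pairs, the probability of a word as its share of all tokens, and an edge is kept in the restricted network when the former exceeds the product of the two word probabilities. The corpus is an artificial toy with placeholder tokens.

```python
from collections import Counter

def collocation_networks(sentences):
    """Return (unrestricted, restricted) edge sets of undirected collocations."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        word_counts.update(tokens)
        for left, right in zip(tokens, tokens[1:]):
            if left != right:  # ignore immediate repetitions of a word
                pair_counts[frozenset((left, right))] += 1
    n_tokens = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    unrestricted = set(pair_counts)
    restricted = set()
    for pair, c_ij in pair_counts.items():
        w1, w2 = tuple(pair)
        # p_ij > p_i * p_j, cross-multiplied to stay in integer arithmetic
        if c_ij * n_tokens * n_tokens > word_counts[w1] * word_counts[w2] * n_pairs:
            restricted.add(pair)
    return unrestricted, restricted

# Toy corpus: "a" and "b" are frequent but co-occur only once.
CORPUS = ["a x a x a x a", "b y b y b y b", "a b"]
unrestricted, restricted = collocation_networks(CORPUS)
```

Here the pair of "a" and "b" appears in the unrestricted network but is dropped from the restricted one, because two frequent words that meet only once co-occur less often than independence would predict.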
It does not follow from the collocation networks, however, that a word with high degree is indeed a word with high usage frequency (unless the word co-occurrences are completely independent, which is essentially not the case). In a separate study, Cancho and Solé [25] have shown that the rank-degree distribution of the words in a very large corpus also follows a two-regime power law, supporting their claim regarding the presence of a core lexicon of about 5000 words.

In order to explain the two-regime power law in word collocation networks, Dorogovtsev and Mendes [18] proposed a preferential attachment-based growth model. At every time step t, a new word (i.e., a node) enters the language (i.e., the network) and connects itself preferentially to one of the pre-existing nodes. Simultaneously, ct new edges (where c is a positive constant) are grown between pairs of old nodes that are chosen preferentially. Through mathematical analysis and simulations, the authors establish that this model gives rise to a two-regime power law with exponents very close to those observed in [24].

There have been studies on the properties of collocation networks for languages other than English, including Russian [46] and many others [41]. The basic topological properties of the networks (e.g., scale-free, small-world, assortative) are similar across languages, which suggests that, like Zipf’s law, these characteristics are linguistic universals and call for a non-trivial psycholinguistic account of their emergence and existence.

3.2 Syntactic Dependency Network
Although collocation networks are easier to construct, they do not necessarily capture the syntactic and semantic relationships between the words, because syntactic and semantic relations often extend beyond the local neighborhood of a word. Syntactic relations between the words of a language are governed by the underlying grammar.
There are various formalisms, such as phrase structure grammar, tree-adjoining grammar and dependency grammar, to capture these relationships. In the dependency grammar formalism, a relationship, often shown as a directed edge, connects two words: the head and the dependent. The dependent word modifies the head word in a certain way. For example, nouns are the heads of the adjectives that modify them. Similarly, verbs are the heads of their subjects, objects and other arguments. Thus, in the dependency formalism, every sentence is represented as a directed acyclic graph or a dependency tree, as illustrated in Fig. 2. Usually, the finite verb is the head of the whole sentence and is not dependent on any other word.

Fig. 2. Example of a dependency tree. The arrows are labeled by the type of dependency relation and run from the dependent to the head words.

Cancho and his co-authors [21, 26] defined the syntactic dependency network (SDN), where the words are the nodes and there is a directed edge between two words if in any of the sentences of a given corpus there is a directed dependency relation between these words. The direction of the dependencies in their construction is from the dependent word to the head word. In order to construct the SDN, one needs to know the dependency relations between the words of a sentence. Fortunately, there are large dependency treebanks for some languages, consisting of human-annotated dependency trees for several thousand sentences. The authors studied the SDN for three languages (Czech, German and Romanian) and observed strikingly similar characteristics. All the networks exhibit power-law degree distributions and small-world structures. Some of the very interesting topological properties observed are the following.
• Disassortative mixing. This shows that words that are used for linking other words (such as prepositions) and, therefore, have high degree in the networks tend not to be linked to one another.
• Hierarchical organization. This implies that there is a top-down hierarchy, which is the basis of the phrase structure formalism.
• Small-world structure. This is necessary for recursion and fast navigation of the mental lexicon.
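The construction of an SDN from a treebank, as described above, can be sketched as follows. The input format (each sentence given as a list of (dependent, head) pairs) and the two annotated toy sentences are assumptions made for illustration; real treebanks use richer formats with labeled relations.

```python
def build_sdn(treebank):
    """Collect directed edges dependent -> head over all sentences."""
    edges = set()
    for sentence in treebank:
        for dependent, head in sentence:
            edges.add((dependent, head))
    return edges

def total_degrees(edges):
    """Total degree (in-degree plus out-degree) of every word."""
    degree = {}
    for dependent, head in edges:
        degree[dependent] = degree.get(dependent, 0) + 1
        degree[head] = degree.get(head, 0) + 1
    return degree

# Hypothetical annotations for "She loves red apples" and "He eats red apples".
TREEBANK = [
    [("she", "loves"), ("apples", "loves"), ("red", "apples")],
    [("he", "eats"), ("apples", "eats"), ("red", "apples")],
]
sdn = build_sdn(TREEBANK)
degree = total_degrees(sdn)
```

Because edges are stored in a set, a dependency that recurs across sentences (here "red" modifying "apples") contributes a single edge, in line with the "if in any of the sentences" definition.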
It is a well-known fact that syntactic dependency links usually do not intersect in any of the world’s languages. In [22], the author conjectured that this phenomenon is an outcome of minimization of the Euclidean distance between the syntactically related words of a sentence, where the Euclidean distance between two words is given by the number of words separating them.¹ Later on, Cancho et al. [23] showed that spectral clustering of SDN places words belonging to the same syntactic category in the same cluster. As we shall see in Section 5, quite similar techniques are being used in the field of NLP for unsupervised induction of syntactic categories.

¹ While it is true that syntactic dependencies have a tendency to avoid crossing, there are systematic exceptions to that generalization in languages with relatively free constituent order. In German, for example, about one-third of all relative clauses are extraposed, thus creating crossing dependencies.

3.3 Unsupervised Grammar Induction
One of the fascinating applications of word collocation networks, illustrated in [79], is related to unsupervised induction of grammar. Explaining the process of language acquisition is one of the greatest challenges to modern science. Children learn the languages that they are exposed to quite accurately and effortlessly. This is one of the strongest pieces of evidence in support of our instinctive capacity for language [70], which Noam Chomsky dubbed universal grammar [10].

In [79], the authors proposed a very simple algorithm for learning hierarchical structures from the collocation graph of a raw text corpus. The algorithm, ADIOS, works as follows. A directed collocation graph is constructed from the corpus, where the words are the nodes, and an edge is drawn from word w to word v if v follows w in a sentence. In fact, each sentence is represented as a separate path in the graph. The algorithm then iteratively searches for motifs that are shared by different sentences. A linguistic motif is defined as a sequence of words that tends to occur quite frequently in the language and also serves some special function. For example, “that the X is Y” is a very commonly occurring motif in English, where X and Y can be substituted by a large number of words and the whole pattern can be embedded in various parts of a sentence. Solan et al. [79] define the probability of a particular structure being a motif in terms of network flows. After finding the motifs, the algorithm proceeds to identify interchangeable motifs and merge them into a single node. Thus, at every step the network becomes smaller and a hierarchical structure emerges. This structure can then be presented as a set of phrase structure grammar rules. ADIOS has a high precision (≈70%), but low recall (≈40%).

Through a comparative analysis of the induced grammars, the authors were able to construct a dendrogram of the six languages that have been studied. Quite surprisingly, the dendrogram reflects the phylogenetic relations between these six languages. There are other graph-based methods for unsupervised induction of syntactic structures, but unlike ADIOS, these algorithms are based on standard probability theory and Bayesian models.
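The first step described above, the directed collocation graph in which every sentence is a path, can be sketched as follows. This is only the graph construction; the flow-based motif criterion of [79] is not implemented. The `shared_edges` helper is a naive stand-in introduced here for illustration: it merely lists edges traversed by several sentence paths, which is where a motif search would start looking.

```python
from collections import defaultdict

def build_path_graph(sentences):
    """Map each directed edge (w, v) to the set of sentence indices using it."""
    edge_paths = defaultdict(set)
    for index, sentence in enumerate(sentences):
        tokens = sentence.split()
        for w, v in zip(tokens, tokens[1:]):
            edge_paths[(w, v)].add(index)  # v follows w in sentence `index`
    return edge_paths

def shared_edges(edge_paths, min_sentences=2):
    """Edges traversed by at least min_sentences distinct sentence paths."""
    return {edge for edge, paths in edge_paths.items()
            if len(paths) >= min_sentences}

SENTENCES = ["that the cat is happy", "that the dog is happy"]
graph = build_path_graph(SENTENCES)
candidates = shared_edges(graph)
```

In the toy input the two paths coincide on the edges around the slot filled by "cat" or "dog", which is the kind of overlap that the pattern "that the X is Y" would produce.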
4 Phonological Networks
In the earlier sections, we have seen how complex networks can be used to study the different types of interactions (phonological, syntactic and semantic) between the words of a language. In this section, we review some of the works where the networks are constructed from linguistic units that are smaller than words, e.g., phonemes and syllables.

4.1 Network of Human Speech Sounds
The most basic units of human languages are the speech sounds. The repertoire of sounds that makes up the sound inventory of a language is not chosen arbitrarily, even though speakers are capable of perceiving and producing a plethora of them. On the contrary, the inventories show exceptionally regular patterns across the languages of the world, which is arguably an outcome of the self-organization that shapes their structure. In fact, numerous computational models have been proposed in the literature in order to explain the self-organization of the vowel inventories [15, 47, 51, 76]. A few attempts have also been made in linguistics to explain the observed patterns across the consonant inventories. Most of these works confine themselves to explaining certain individual principles rather than formulating a general theory describing the emergence of the patterns. However, complex networks have recently been used quite successfully to explain the self-organization of the consonant inventories.

In [65] the authors construct a bipartite network called PlaNet, or the Phoneme-Language Network, in which one of the partitions consists of nodes representing the languages while the other partition consists of nodes representing the consonants. There is an edge between a language node and a consonant node if the consonant occurs in the language. The authors further construct PhoNet (Phoneme-Phoneme Network), which is the one-mode projection of PlaNet onto the consonant nodes, i.e., a network of consonants in which two nodes are linked as many times as they have co-occurred across the language inventories. The data used for constructing the above networks are drawn from the UCLA Phonological Segment Inventory Database (UPSID) [54], which consists of 317 languages and the 541 consonants that are found across these languages. Several important observations made from the study of PlaNet and PhoNet are noted below.

From the study of PlaNet [65]
• The degree distribution of the consonant nodes in PlaNet roughly follows a power law with an exponential cut-off towards the tail.
• A synthesis model based on preferential attachment (a language node attaches itself to a consonant node depending on the current degree k of the consonant node) can explain the emergence of the degree distribution of PlaNet. The results match the empirical data more accurately if the attachment kernel is super-linear (i.e., the attachment probability is proportional to k^α, where α > 1).

From the study of PhoNet [64, 65]
• The degree distribution of the consonant nodes in PhoNet also roughly indicates a power-law behavior with an exponential cut-off.
• The clustering coefficient of PhoNet (0.89) is significantly higher than that of a random graph with the same number of nodes and edges (0.08).
• Community structure analysis of PhoNet can capture the strong patterns of co-occurrence of consonants that are prevalent across the languages of the world.
• The driving force that leads to the emergence of these communities is feature economy, which states that languages tend to use a small number of distinctive features and maximize their combinatorial possibilities to generate a large number of consonants.
• The emergence of the degree distribution and the clustering coefficient of PhoNet can be explained through a synthesis model that is based on both preferential attachment and triad (i.e., fully connected triplet) formation. While the preferential part of the model reproduces the degree distribution of the network, the triad formation part imposes a large number of triangles onto the generated network, thereby increasing the clustering coefficient.
• The emergence of feature economy can be explained by a synthesis model that is a linear combination of two different parts, one driven by the usual degree-dependent preference and the other by a factor that favors the choice of those consonants that share many features with the already chosen ones.
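The relation between PlaNet and PhoNet can be sketched as follows. The three inventories are hypothetical toy data, not UPSID entries. PlaNet is represented implicitly as a mapping from each language to its set of consonants, and PhoNet is obtained as its one-mode projection, with the edge weight counting the languages in which two consonants co-occur.

```python
from collections import Counter
from itertools import combinations

def consonant_degrees(inventories):
    """Degree of each consonant node in PlaNet (number of languages using it)."""
    degree = Counter()
    for consonants in inventories.values():
        degree.update(consonants)
    return degree

def project_to_phonet(inventories):
    """One-mode projection: weight = number of inventories sharing both consonants."""
    weights = Counter()
    for consonants in inventories.values():
        for c1, c2 in combinations(sorted(consonants), 2):
            weights[(c1, c2)] += 1
    return weights

# Hypothetical inventories: language -> set of consonants.
INVENTORIES = {
    "lang_A": {"p", "t", "k"},
    "lang_B": {"p", "t", "m"},
    "lang_C": {"t", "k"},
}
planet_degree = consonant_degrees(INVENTORIES)
phonet = project_to_phonet(INVENTORIES)
```

Sorting each inventory before pairing gives every consonant pair a single canonical key, so the same pair found in different languages accumulates into one weighted edge.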
The authors postulate that the physical significance of the synthesis models is grounded in the process of language change. Language change is a collective phenomenon that functions at the level of a population of speakers [80]. They also conjecture that it is possible to explain the significance of the models at the level of an individual, primarily in terms of the process of language acquisition. Further, they argue that two orthogonal preferences are instrumental in the acquisition of the inventories: (a) the occurrence frequency of a consonant, and (b) the feature-dependent preference (which increases the ease of learning). The synthesis model is essentially a linear combination of these two mutually orthogonal factors.
4.2 Network of Syllables
The syllable inventory of each language can also be modeled and analyzed in the framework of a complex network. Each node in this network is a syllable, and a link is established between two syllables each time they occur together in a word. In [78] the authors report the study of the network of Portuguese syllables from two different sources: a Portuguese dictionary (DIC) and the complete works of a very popular Brazilian writer, Machado de Assis (MA). The authors show that
• the networks have a low average shortest path (DIC: 2.44, MA: 2.61),
• the networks have a high clustering coefficient (DIC: 0.65, MA: 0.50),
• both networks show power-law behavior.
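The construction and the first two measurements listed above can be sketched as follows. This is a hedged illustration, not the procedure of [78]: the five toy words and their syllabification are assumptions chosen for the example, and the helper names are hypothetical.

```python
from collections import deque
from itertools import combinations

# Toy input (assumption): words already split into syllables.
WORDS = [
    ["ca", "sa"],        # casa
    ["ca", "ma"],        # cama
    ["ma", "la"],        # mala
    ["sa", "la"],        # sala
    ["ca", "sa", "co"],  # casaco
]


def build_syllable_network(syllabified_words):
    """Link two syllables each time they occur together in a word."""
    weights = {}
    for syllables in syllabified_words:
        for a, b in combinations(sorted(set(syllables)), 2):
            weights[(a, b)] = weights.get((a, b), 0) + 1
    adj = {}
    for a, b in weights:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    # Syllables that only occur in one-syllable words get no node here.
    return adj, weights


def average_shortest_path(adj):
    """Mean BFS distance over all ordered pairs of connected nodes."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def average_clustering(adj):
    """Mean local clustering coefficient (nodes of degree < 2 count as 0)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


adj, weights = build_syllable_network(WORDS)
print(average_shortest_path(adj), average_clustering(adj))
```

On a real dictionary or corpus the same two functions would be applied to the largest connected component, and the degree sequence `[len(n) for n in adj.values()]` would be inspected for power-law behavior.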
Since in Portuguese the syllables are close to the basic phonetic units, unlike the case in English, the authors argue that the properties of the English syllabic network should differ from those of the Portuguese network. The authors further conjecture that, since Italian has a strong parallelism between its structure and syllable hyphenation, the Italian syllabic network may have properties close to those of the Portuguese network, pointing to certain universal characteristics of language.
5 Applications in NLP and IR
Graph-based approaches are quite common in the areas of natural language processing (NLP) and information retrieval (IR). Interestingly, although there are no obvious technical differences between the scope of graph theory in these areas and in complex networks, the terminologies used and the objectives are
M. Choudhury and A. Mukherjee
often quite different. The works on linguistic networks discussed in the last three sections were primarily targeted at the statistical physics community, and their objective was to uncover the structure of languages and their dynamics. In this section, we survey some equally interesting and significant works that use the same set of mathematical tools but aim to develop practical applications concerning languages.
5.1 Induction of Syntactic and Semantic Categories
One of the earliest and most recurrent applications of networks in NLP has been the automatic induction of syntactic and semantic categories based on the distributional hypothesis [39], which states that words of a similar syntactic (semantic) category are found in similar contexts. To illustrate this concept, consider two unknown words X and Y that occur in the following sentences:
(1) The red X is very beautiful.
(2) If you Y then I shall punish you.
Even though we do not know what X and Y are, it is easy to infer that the former is a noun and the latter is a verb. We can draw such inferences about the syntactic categories (in this case the parts of speech) of words based on our knowledge that nouns, but not verbs, can be preceded by articles (the) and adjectives (red). The distributional hypothesis is equally relevant for semantic categories: words belonging to the same domain tend to occur together. Thus, the word student is expected to be in the vicinity of the word school rather than market. The extent to which two words appear in similar contexts defines their similarity [62]. The general methodology [12, 27, 31, 72, 74, 75] for inducing word class information can be outlined as follows.
1. Define the context of a word as a vector. It could be just the set of words that occur in the same sentence, or only the immediate neighbors of the word. For syntactic class induction, the word order is usually preserved during construction of the vectors, and the context vectors are defined only in terms of the function words (such as is, of, the and a).
2. Collect global context vectors for the words by summing up the local contexts.
3. Construct a weighted network, where the nodes are the words and the weight of the edge between two words is the distance (or similarity) between their context vectors. There are several ways to compare the vectors; common measures include Euclidean distance, cosine similarity and correlation coefficients.
4. Apply a clustering algorithm to this network to obtain the word classes.
In the syntactic category induction literature, the 150–250 words with the highest frequency are considered as function words, and the context vectors
are defined based on them. Some authors employ a much larger number of features and reduce the dimensions of the resulting matrix using singular value decomposition [72, 74]. [27] uses the Spearman rank correlation coefficient and hierarchical clustering, [74, 75] use the cosine of the angle between vectors and Buckshot clustering, [31] uses cosine on mutual information vectors for hierarchical agglomerative clustering, and [12] applies Kullback–Leibler divergence in the CDC algorithm. [28] does not sum up the contexts of each word in a context vector, but uses the most frequent instances of four-word windows in a co-clustering algorithm [16], in which rows and columns (here words and contexts) are clustered simultaneously. Two-step clustering is undertaken by [74]: clusters from the first step are used as features in the second step. More recently, Biemann [6] proposed the Chinese Whispers algorithm for clustering, which is fast and does not require any parameters to be specified. [7] reports the application of Chinese Whispers to part-of-speech (POS) induction in English, Finnish and German; the algorithm has also been applied very recently to Bengali [66]. In the latter work, the authors also investigate the topological properties of the word networks so constructed and report a scale-free degree distribution, a high clustering coefficient and a power-law cluster size distribution. Widdows and Dorow [87] propose an unsupervised incremental cluster-building approach for the acquisition of semantic classes. There are also graph-based algorithms to infer semantic classes (sets of synonyms, to be specific) from lexicons (see, e.g., [17, 43]).
Identification of syntactic or semantic classes is of great importance to NLP and IR. For instance, POS tagging is the first step towards parsing. However, supervised machine learning techniques for POS tagging demand a large amount of human-annotated data, which is expensive to create and unavailable for most languages.
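The four-step methodology outlined in this subsection can be sketched end to end on a toy corpus. This is a hedged illustration rather than any of the cited systems: the corpus, the function-word list, the similarity threshold and all function names are assumptions, cosine similarity is used as the edge weight, and the clustering step is a simplified label-propagation routine in the spirit of Chinese Whispers [6].

```python
import math
import random
from collections import Counter, defaultdict

# Toy corpus and function-word list: both are illustrative assumptions.
SENTENCES = [
    "the dog is big", "the cat is big",
    "a dog will run", "a cat will sleep",
    "to run is fun", "to sleep is fun",
]
FUNCTION_WORDS = {"the", "a", "is", "to", "will"}


def context_vectors(sentences, function_words):
    """Steps 1-2: sum order-preserving function-word contexts per word."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            if word in function_words:
                continue
            if i > 0 and tokens[i - 1] in function_words:
                vectors[word][("left", tokens[i - 1])] += 1
            if i + 1 < len(tokens) and tokens[i + 1] in function_words:
                vectors[word][("right", tokens[i + 1])] += 1
    return vectors


def cosine(u, v):
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def similarity_graph(vectors, threshold=0.5):
    """Step 3: keep an edge when the cosine similarity reaches threshold."""
    words = sorted(vectors)
    edges = {}
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            sim = cosine(vectors[a], vectors[b])
            if sim >= threshold:
                edges[(a, b)] = sim
    return words, edges


def chinese_whispers(nodes, edges, iterations=10, seed=0):
    """Step 4: each node repeatedly adopts the strongest neighbouring label."""
    rng = random.Random(seed)
    nbrs = defaultdict(dict)
    for (a, b), w in edges.items():
        nbrs[a][b] = w
        nbrs[b][a] = w
    label = {n: n for n in nodes}      # every node starts in its own class
    order = list(nodes)
    for _ in range(iterations):
        rng.shuffle(order)
        for n in order:
            if not nbrs[n]:
                continue               # isolated nodes keep their label
            score = Counter()
            for m, w in nbrs[n].items():
                score[label[m]] += w
            best = max(score.values())
            # Ties between equally strong classes are broken at random.
            label[n] = rng.choice(sorted(k for k, s in score.items()
                                         if s == best))
    return label


vectors = context_vectors(SENTENCES, FUNCTION_WORDS)
words, edges = similarity_graph(vectors)
print(chinese_whispers(words, edges))
```

On this toy input the noun-like, verb-like and adjective-like words fall into three separate classes. On real data the threshold and the size of the function-word list would need tuning.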
Since automatic induction of POS tags through graph clustering does not require annotated data, it might turn out to be a very useful technique in NLP for resource-poor languages. Similarly, semantic clustering of words is useful for search and IR.
5.2 Word Sense Disambiguation
Word sense disambiguation (WSD) refers to the task of assigning the appropriate sense or meaning to a word in a given context (i.e., sentence or paragraph) out of several possibilities. For example, the English word bank has two different meanings as a noun: (1) a river bank, and (2) a financial institution. However, as shown in the following sentences, in a given context only one of the senses is appropriate.
(1) They were walking down the bank enjoying the cool river breeze.
(2) She went to the bank to cash her check.
There are several ways in which graph-based techniques have been applied to WSD. Examples include lexical chaining [29], semantic relatedness
Fig. 3. Example of HyperLex: (a) the network of words for disambiguation of the word “light”; (b) the minimal spanning tree obtained after introduction of the word “light”. The hubs are shown in bold font.
measures based on path lengths and random walks on semantic networks [57, 61] and lexicon graphs [50]. Owing to space constraints, we discuss in detail only one of these approaches, HyperLex [85], which relies on word co-occurrence graphs. Consider the problem of automatically identifying and disambiguating the various senses of the word light. The HyperLex algorithm works as follows. A sub-corpus consisting of all the paragraphs featuring at least one occurrence of the word light is extracted from a raw text corpus. A word co-occurrence graph is constructed from this sub-corpus, where the nodes are the content words except for the word light. Two words are connected by an edge if they co-occur in a paragraph more than a preset number of times. The weight of an edge decreases as the number of times the words co-occur increases, so that strongly associated words are close to each other in the graph. It has been found that word co-occurrence graphs built in this manner exhibit small-world properties. In this co-occurrence network, nodes with very high degree are identified as hubs. The word light, for which we want to build the disambiguator, is then introduced into the network and connected to the hubs. A minimal spanning tree is constructed from the co-occurrence graph, where light is the root node and the first level consists of the hubs. Figure 3 illustrates this process. Each node in the spanning tree can be thought of as a sense. Thus, the hubs denote the basic senses and, as we move further down the tree, we have more refined senses of the word. This tree can then be used for disambiguating the sense of the target word (here light) in a particular context.
5.3 Information Retrieval
The central problem of IR is to rank a given collection of documents by their similarity to a query. Queries are usually very short and the collection of
documents huge. In a typical IR setup, the whole web, consisting of billions of webpages, represents the collection of documents to be ranked, and the query is only one or two words long. One of the challenges of IR is to utilize the network structure of the web to compute the ranks of the documents. The web can be conceptualized as a directed graph where the nodes are the webpages and a hyperlink from webpage A to webpage B represents a directed edge between the nodes corresponding to A and B. The very popular PageRank [9] is one of the first ranking algorithms reported to be used by the Google search engine. The basic idea behind the PageRank algorithm is that the rank (or popularity) of a node is a function of the ranks of the nodes that link to it. In other words, a page that has a hyperlink from a popular page is also popular. An alternative view of the PageRank algorithm involves a random walker (here a random surfer). A random walker starts from a random node and follows the edges of the graph randomly to reach other nodes. The PageRank of a page is proportional to the probability that a random surfer reaches that page by following random hyperlinks on the web. Yet another way to define PageRank is as the components of the principal eigenvector of the matrix that describes this random walk. Thus, PageRank is closely related to what is known as eigenvector centrality in the complex network literature. PageRank considers only the incoming edges of a node. Kleinberg [48] proposed another ranking algorithm, called HITS, where every node has two scores, hub and authority. The authority scores are similar to PageRank, whereas the hub scores are based on the outgoing links, but computed in the same way. The final rank of a node is the combination of its hub and authority scores. 
Kleinberg and co-authors [33] also demonstrated how eigenvectors of the web structure can be used to cluster and disambiguate the pages corresponding to ambiguous words such as “Jaguar” (referring to an animal, a football team or a car).
One drawback of both PageRank and HITS is that the algorithms assume that all hyperlinks have the same importance. There are various modifications of these algorithms that use machine learning techniques to learn the weights of the different types of hyperlinks. Examples include RankNet [73], TrustRank [37] and NetRank [2]. Link analysis, as this field is popularly called, is a very active area of research in the IR community.
Some of the other emerging applications of complex networks in IR include mining social networks and blogs. The blogosphere [49], for example, can be represented as a multi-tier network, where blogs, bloggers and other webpages (typically news articles) are the nodes, and there are various types of edges representing the social network of bloggers, the links between blogs and those between the blogs and other webpages. Analysis of the blogosphere network is useful in classification and personalized suggestion of blogs, opinion and sentiment analysis, as well as in investigating the dynamics of the world of blogs.
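The random-surfer view of PageRank described earlier in this subsection can be sketched as a power iteration. This is a minimal illustration, not the production algorithm of any search engine: the damping factor of 0.85, the uniform treatment of pages without outgoing links, the toy link structure and the function name are assumptions made for the example.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration on a dict mapping each page to the pages it links to.

    With probability `damping` the surfer follows a random hyperlink;
    otherwise (or when a page has no outgoing links) the surfer jumps to
    a page chosen uniformly at random.
    """
    nodes = sorted(set(links) | {v for ts in links.values() for v in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        new = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # Dangling page: spread its rank over all pages.
                for v in nodes:
                    new[v] += damping * rank[u] / n
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Toy web graph (assumption): page -> list of pages it links to.
TOY_WEB = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(TOY_WEB)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The resulting vector is the stationary distribution of the surfer's walk, i.e., the principal eigenvector of the corresponding transition matrix. A page with no incoming links, such as D above, keeps only the share it receives from random jumps.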
5.4 Other Applications
Due to space limitations, it is impossible to do justice to all the network-based techniques in NLP and IR. Graph-based methods have been applied to a variety of NLP tasks, ranging from parsing to text summarization. In the previous three subsections we have discussed three specific problems to illustrate the various uses of such techniques. Before we wrap up this section, we list a few more example applications to demonstrate the extent and potential of graph-based techniques in these areas.
Text summarization is an important and challenging application of NLP, which has been elegantly modeled within the framework of complex networks. The problem of text summarization involves identifying a small number of sentences from a set of given documents that best summarize the content of the documents. In [19] summarization has been reformulated as the problem of computing node centrality in a network whose nodes are the sentences and whose edges represent the word-level similarity between two sentences. The most central sentences are those that cover most of the ideas present in the given documents.
Other application areas include dependency parsing [56], textual entailment [38], sentiment classification [34, 69], keyword extraction [60], novelty detection [30] and prepositional phrase disambiguation [84]. See [8, 58, 59] for further references.
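The centrality-based view of summarization mentioned in this subsection can be illustrated with a small sketch. It is a simplified stand-in for the method of [19], not a reimplementation: sentences are compared by bag-of-words cosine similarity, centrality is the sum of a sentence's similarities to the others (a weighted degree), and the toy sentences, the threshold and the function names are assumptions.

```python
import math
from collections import Counter

# Toy document (assumption): one sentence per list entry.
DOCS = [
    "networks model language structure",
    "language networks show small world structure",
    "small world networks appear in language data",
    "the weather was pleasant yesterday",
]


def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def sentence_centrality(sentences, threshold=0.1):
    """Weighted degree of each sentence in the sentence-similarity graph."""
    bags = [Counter(s.lower().split()) for s in sentences]
    scores = [0.0] * len(bags)
    for i in range(len(bags)):
        for j in range(i + 1, len(bags)):
            sim = cosine(bags[i], bags[j])
            if sim >= threshold:       # keep only sufficiently similar pairs
                scores[i] += sim
                scores[j] += sim
    return scores


def summarize(sentences, k=1):
    """Return the k most central sentences in their original order."""
    scores = sentence_centrality(sentences)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


print(summarize(DOCS, k=1))
```

The second sentence shares the most vocabulary with the others and is therefore selected, whereas the off-topic last sentence has no edges and receives a centrality of zero. An eigenvector-type centrality, such as the power iteration used for PageRank, could replace the weighted degree.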
6 Conclusion
So far we have seen that there has been a substantial amount of work to understand the structure and dynamics of languages at the mesoscopic level within the framework of complex networks. A parallel thread of research in the fields of NLP and IR tries to achieve a different goal, but uses very much the same means. Nevertheless, mesoscopic models of language as well as network-based approaches to NLP are in a nascent state, especially when compared to similar lines of research in the fields of biology, economics and other social sciences (refer to the surveys in this volume). On the other hand, there seems to be great potential for the application of complex network theory to a variety of open problems in linguistics and language engineering.
One of the fundamental problems of linguistics is the characterization and explanation of linguistic universals, i.e., properties that are common to all human languages. Differences among the languages, on the other hand, are restricted by the typologies and implicational hierarchies [14]. We have seen that, like Zipf’s law, there are many linguistic universals observable in the linguistic networks. For example, the SDNs as well as word collocation networks of all languages studied exhibit scale-free degree distributions and the small-world property. A systematic investigation of topological universals of linguistic networks can substantially improve our understanding of languages. At the same
time, there are properties for which the linguistic networks vary across languages. For example, the average degree of SpellNet for English is very different from that for Hindi or Bengali. This difference has been attributed to the different writing systems used by English (an alphabet) and the two Indo-Aryan languages (abugidas). Typological variations have also been predicted in the topological properties of syllable networks. Thus, it would be interesting to have a typological theory of languages based on the structure of the linguistic networks.
Another question of great importance for any linguistic network concerns the emergence of its structural properties. It is far from clear why the word collocation networks should display small-world and scale-free properties. Even though the Dorogovtsev and Mendes model [18] can explain the emergence of the two-regime power law observed in the collocation networks, it does not by itself explain the validity and the physical significance of a model based on preferential attachment. In other words, the phenomenon of preferential attachment at the mesoscopic level needs an independent microscopic explanation in terms of psycholinguistic factors, because words cannot voluntarily link to other words. Similar microscopic explanations are required for the non-trivial topological properties of the other linguistic networks, such as ML, SDN, PhoNet and SpellNet. This is presumably a hard problem, but any mesoscopic explanation is incomplete without a corresponding microscopic model.
In the context of NLP and IR applications, network-based models are mostly ad hoc, and this reduces their credibility and thereby their popularity compared to the more principled Bayesian approaches. A network-based language model can bridge this gap and provide us with a more systematic way of solving NLP problems within this framework.
Although there have been some initiatives in this direction [44], this area is largely unexplored and presents numerous challenging problems.
Another relatively unexplored, but potentially fertile, area of research is processes “on” linguistic networks. Navigation of the ML can be modeled as guided random walks on the ML network; similarly, typographical errors can be modeled as walks on SpellNet. The exact nature of such guided walks is still to be explored and can provide a deeper understanding of the underlying cognitive principles.
In the previous sections we have seen several ways to define networks where the nodes represent words. One can conceive of a universal word network obtained by superimposing these partial representations of a linguistic system into a multi-tier network, where the nodes are the words and two nodes can be connected by several labeled edges signifying their phonetic, collocational, syntactic, orthographic, semantic and various other kinds of similarities. Studies on such a network can reveal a holistic picture of the interaction patterns between the words, thereby providing a unified model of grammar at different levels of linguistic structure.
References
1. M. E. Adilson, A. P. S. de Moura, Y. C. Lai, and P. Dasgupta. Topology of the conceptual network of language. Physical Review E, 65(065102):1–4, 2002.
2. A. Agarwal, S. Chakrabarti, and S. Aggarwal. Learning to rank networked entities. In Proceedings of KDD, 2006.
3. A. Akmajian. Linguistics: An Introduction to Language and Communication. MIT Press, Cambridge, MA, 1995.
4. A. Albright and B. Hayes. Rules vs. analogy in English past tenses: A computational/experimental study. Cognition, 90:119–161, 2003.
5. A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.
6. C. Biemann. Chinese whispers - an efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of TextGraphs: the Second Workshop on Graph Based Methods for Natural Language Processing, pages 73–80, New York, NY, June 2006. Association for Computational Linguistics.
7. C. Biemann. Unsupervised part-of-speech tagging employing efficient graph clustering. In Proceedings of the COLING/ACL 2006 Student Research Workshop, pages 7–12, Sydney, Australia, July 2006. Association for Computational Linguistics.
8. C. Biemann, I. Matveeva, R. Mihalcea, and D. Radev, editors. Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing. Association for Computational Linguistics, Rochester, NY, 2007.
9. S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. CNIS, 30(1–7):107–117, 1998.
10. N. Chomsky. The Minimalist Program. MIT Press, Cambridge, MA, 1995.
11. M. Choudhury, M. Thomas, A. Mukherjee, A. Basu, and N. Ganguly. How difficult is it to develop a perfect spell-checker? A cross-linguistic analysis through complex network approach. In Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing, pages 81–88, Rochester, NY, 2007. Association for Computational Linguistics.
12. A. Clark. Inducing syntactic categories by context distribution clustering. In C. Cardie, W. Daelemans, C. Nédellec, and E. T. K. Sang, editors, Proceedings of the Fourth Conference on Computational Natural Language Learning and of the Second Learning Language in Logic Workshop, Lisbon, 2000, pages 91–94. Association for Computational Linguistics, Somerset, NJ, 2000.
13. A. M. Collins and M. R. Quillian. Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior, 8:240–247, 1969.
14. W. Croft. Typology and Universals. Cambridge University Press, Cambridge, MA, 1990.
15. B. de Boer. Self-organisation in vowel systems. Journal of Phonetics, 28(4):441–465, 2000.
16. I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In Proceedings of The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pages 89–98, 2003.
17. W. B. Dolan, L. Vanderwende, and S. Richardson. Automatically deriving structured knowledge base from on-line dictionaries. In Proceedings of the Pacific Association for Computational Linguistics, 1993.
18. S. N. Dorogovtsev and J. F. F. Mendes. Language as an evolving word Web. Proceedings of the Royal Society of London B, 268(1485):2603–2606, December 22, 2001.
19. G. Erkan and D. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. JAIR, 22:457–479, December 4, 2004.
20. C. Fellbaum. WordNet, an Electronic Lexical Database for English. MIT Press, Cambridge, MA, 1998.
21. R. Ferrer-i-Cancho. The structure of syntactic dependency networks: insights from recent advances in network theory. In G. Altmann, V. Levickij, and V. Perebyinis, editors, The Problems of Quantitative Linguistics, pages 60–75. Ruta, Chernivtsi, 2005.
22. R. Ferrer-i-Cancho. Why do syntactic links not cross? Europhysics Letters, 76:1228–1235, 2006.
23. R. Ferrer-i-Cancho, A. Capocci, and G. Caldarelli. Spectral methods cluster words of the same class in a syntactic dependency network. International Journal of Bifurcation and Chaos, 17(7):2453–2463, 2007.
24. R. Ferrer-i-Cancho and R. V. Solé. The small world of human language. Proceedings of The Royal Society of London. Series B, Biological Sciences, 268(1482):2261–2265, November 2001.
25. R. Ferrer-i-Cancho and R. V. Solé. Two regimes in the frequency of words and the origin of complex lexicons: Zipf’s law revisited. Journal of Quantitative Linguistics, 8:165–173, 2001.
26. R. Ferrer-i-Cancho and R. V. Solé. Patterns in syntactic dependency networks. Physical Review E, 69(051915), 2004.
27. S. Finch and N. Chater. Bootstrapping syntactic categories using statistical methods. In Background and Experiments in Machine Learning of Natural Language: Proceedings of the 1st SHOE Workshop, pages 229–235. Katholieke Universiteit, Brabant, Holland, 1992.
28. D. Freitag. Toward unsupervised whole-corpus tagging. In COLING ’04: Proceedings of the 20th International Conference on Computational Linguistics, page 357, Morristown, NJ, 2004. Association for Computational Linguistics.
29. M. Galley and K. McKeown. Improving word sense disambiguation in lexical chaining. In Proceedings of IJCAI, 2003.
30. M. Gamon. Graph-based text representation for novelty detection. In Proceedings of the Workshop on TextGraphs at HLT-NAACL, pages 17–24, 2006.
31. S. Gauch and R. Futrelle. Experiments in Automatic Word Class and Word Sense Identification for Information Retrieval. In Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 425–434, Las Vegas, NV, April 1994.
32. M. Gell-Mann. Language and complexity. In J. W. Minett and W. S.-Y. Wang, editors, Language Acquisition, Change and Emergence: Essays in Evolutionary Linguistics. City University of Hong Kong Press, July 2005.
33. D. Gibson, J. M. Kleinberg, and P. Raghavan. Inferring Web communities from link topology. In Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia, pages 225–234, 1998.
34. A. B. Goldberg and J. Zhu. Seeing stars when there aren’t many stars: Graph-based semi-supervised learning for sentiment categorization. In HLT-NAACL 2006 Workshop on Textgraphs: Graph-based Algorithms for Natural Language Processing, 2006.
35. J. H. Greenberg and J. J. Jenkins. Studies in the psychological correlates of the sound system of American English. Word, 20:157–177, 1964.
36. T. M. Gruenenfelder and D. B. Pisoni. Modeling the mental lexicon as a complex system: Some preliminary results using graph theoretic measures. In Research on Spoken Language Processing Progress Report No. 27, pages 27–47. Indiana University, Bloomington, IN, 2005.
37. Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating Web spam with TrustRank. In Proceedings of VLDB, pages 576–587, 2004.
38. A. D. Haghighi, A. Y. Ng, and C. D. Manning. Robust textual inference via graph matching. In HLT ’05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 387–394, Morristown, NJ, 2005. Association for Computational Linguistics.
39. Z. S. Harris. Mathematical Structures of Language. Wiley, New York, 1968.
40. M. D. Hauser, N. Chomsky, and W. T. Fitch. The faculty of language: What is it, who has it, and how did it evolve? Science, 298:1569–1579, 2002.
41. R. Ferrer-i-Cancho, A. Mehler, O. Pustylnikov, and A. Díaz-Guilera. Correlations in the organization of large-scale syntactic dependency networks. In TextGraphs-2: Graph-Based Algorithms for Natural Language Processing, pages 65–72. Association for Computational Linguistics, 2007.
42. Y. Itoh and S. Ueda. The Ising model for changes in word ordering rules in natural languages. Physica D: Nonlinear Phenomena, 198(3-4):333–339, 2004.
43. J. Jannink and G. Wiederhold. Thesaurus entry extraction from an on-line dictionary. In Proceedings of Fusion, 1999.
44. B. Jedynak and D. Karakos. Unigram language models using diffusion smoothing over graphs. In Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing, pages 33–36, Rochester, NY, 2007. Association for Computational Linguistics.
45. V. Kapatsinski. Sound similarity relations in the mental lexicon: Modeling the lexicon as a complex network. Speech Research Lab Progress Report, Indiana University, Bloomington, IN, 2006.
46. V. Kapustin and A. Jamsen. Vertex degree distribution for the graph of word co-occurrences in Russian. In Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing, pages 89–92, Rochester, NY, 2007. Association for Computational Linguistics.
47. J. Ke, M. Ogura, and W. S.-Y. Wang. Optimization models of sound systems using genetic algorithms. Computational Linguistics, 29(1):1–18, 2003.
48. J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46, 1999.
49. R. Kumar, J. Novak, P. Raghavan, and A. Tomkins. Structure and evolution of blogspace. Communications of the ACM, 47(12):35–39, 2004.
50. M. Lesk. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of SIGDOC, 1986.
51. J. Liljencrants and B. Lindblom. Numerical simulation of vowel quality systems: the role of perceptual contrast. Language, 48:839–862, 1972.
52. H. Liljenstrom. Micro Meso Macro: Addressing Complex Systems Couplings. World Scientific Publishing, Singapore, 2005.
53. P. A. Luce and D. B. Pisoni. Recognizing spoken words: The neighborhood activation model. Ear and Hearing, 19:1–36, 1998.
54. I. Maddieson. Patterns of Sounds. Cambridge University Press, Cambridge, 1984.
55. W. Marslen-Wilson. Activation, competition, and frequency in lexical access. In G. T. M. Altmann, editor, Cognitive Models of Speech Processing: Psycholinguistic and Computational Perspectives, pages 148–173. MIT Press, Cambridge, MA, 1990.
56. R. McDonald, F. Pereira, K. Ribarov, and J. Hajič. Non-projective dependency parsing using spanning tree algorithms. In HLT ’05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 523–530, Morristown, NJ, 2005. Association for Computational Linguistics.
57. R. Mihalcea. Graph-based ranking algorithms for large vocabulary word sense disambiguation. In Proceedings of HLT-EMNLP, 2005.
58. R. Mihalcea and D. Radev. Graph-based algorithms for information retrieval and natural language processing. Tutorial at HLT/NAACL 2006, 2006.
59. R. Mihalcea and D. Radev, editors. Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing. Association for Computational Linguistics, 2006.
60. R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Proceedings of EMNLP, 2004.
61. R. Mihalcea, P. Tarau, and E. Figa. PageRank on semantic networks with applications to word sense disambiguation. In Proceedings of COLING, 2004.
62. G. A. Miller and W. G. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28, 1991.
63. G. A. Miller and P. M. Gildea. How children learn words. Scientific American, 257(3):86–91, 1987.
64. A. Mukherjee, M. Choudhury, A. Basu, and N. Ganguly. Modeling the co-occurrence principles of the consonant inventories: A complex network approach. International Journal of Modern Physics C, 18(2):281–295, 2007.
65. A. Mukherjee, M. Choudhury, A. Basu, and N. Ganguly. Self-organization of sound inventories: Analysis and synthesis of the occurrence and co-occurrence networks of consonants. Journal of Quantitative Linguistics, http://arXiv.org/physics/0610120.
66. J. Nath, M. Choudhury, A. Mukherjee, C. Biemann, and N. Ganguly. Unsupervised parts-of-speech induction for Bengali. In Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC), 2008.
67. D. Nettle. Using social impact theory to simulate language change. Lingua, 108:95–117, 1999.
68. H. G. Nusbaum, D. B. Pisoni, and C. K. Davis. Sizing up the Hoosier mental lexicon: Measuring the familiarity of 20,000 words, Indiana University. Research on Speech Perception Progress Report No. 10, pages 357–376, 1984.
69. B. Pang and L. Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Main Volume, pages 271–278, Barcelona, Spain, July 2004.
70. S. Pinker. The Language Instinct: How the Mind Creates Language. HarperCollins, New York, 1994.
71. S. Pinker and A. Prince. On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28:195–247, 1988.
72. R. Rapp. A practical solution to the problem of automatic part-of-speech induction from text. In Conference Companion Volume of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), Ann Arbor, MI, 2005.
73. M. Richardson, A. Prakash, and E. Brill. Beyond PageRank: Machine learning for static ranking. In Proceedings of WWW, pages 707–715, 2006.
74. H. Schütze. Part-of-speech induction from scratch. In Proceedings of the 31st Annual Meeting on Association for Computational Linguistics, pages 251–258, Morristown, NJ, 1993. Association for Computational Linguistics.
75. H. Schütze. Distributional part-of-speech tagging. In Proceedings of the 7th Conference on European Chapter of the Association for Computational Linguistics, pages 141–148, San Francisco, CA, 1995. Morgan Kaufmann Publishers Inc.
76. J.-L. Schwartz, L.-J. Boë, N. Vallée, and C. Abry. The dispersion-focalization theory of vowel systems. Journal of Phonetics, 25:255–286, 1997.
77. M. Sigman and G. A. Cecchi. Global organization of the wordnet lexicon. Proceedings of the National Academy of Sciences, 99(3):1742–1747, 2002.
78. M. M. Soares, G. Corso, and L. S. Lucena. The network of syllables in Portuguese. Physica A: Statistical Mechanics and its Applications, 355(2-4):678–684, 2005.
79. Z. Solan, D. Horn, E. Ruppin, and S. Edelman. Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences, 102(33):11629–11634, 2005.
80. L. Steels. Language as a complex adaptive system. In Proceedings of PPSN VI, pages 17–26, 2000.
81. D. Steriade. Knowledge of similarity and narrow lexical override. BLS, 29:583–598, 2004.
82. M. Steyvers and J. B. Tenenbaum. The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29(1):41–78, 2005.
83. M. Tamariz. Exploring the Adaptive Structure of the Mental Lexicon. Ph.D. thesis, Department of Theoretical and Applied Linguistics, University of Edinburgh, Scotland, 2005.
84. K. Toutanova, C. D. Manning, and A. Y. Ng. Learning random walk models for inducing word dependency distributions. In ICML ’04: Proceedings of the Twenty-First International Conference on Machine Learning, page 103, New York, NY, 2004.
85. J. Véronis. HyperLex: Lexical cartography for information retrieval. Computer Speech and Language, 18(3):223–252, 2004.
86. M. S. Vitevitch. Phonological neighbors in a small world (network): What can graph theory tell us about the mental lexicon? Departmental Colloquy co-sponsored by the Linguistics and Psychology Departments, Rice University, January 27, 2006.
87. D. Widdows and B. Dorow. A graph model for unsupervised lexical acquisition. In Proceedings of COLING, 2002.
Networks Generated from Natural Language Text

Chris Biemann and Uwe Quasthoff
Institute for Computer Science, NLP Department, University of Leipzig, Johannisgasse 26, 04103 Leipzig, Germany
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_10, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

1 Introduction

The study of large-scale characteristics of graphs that arise in natural language processing is an essential step in finding structural regularities. Structure discovery processes have to be designed with an awareness of these properties. Examining and contrasting the effects of processes that generate graph structures similar to those observed in language data sheds light on the structure of language and its evolution.

In this chapter, we examine power-law distributions and small world graphs (SWGs) originating from natural language data. There are several reasons for the special interest in these structures.

1. Power laws appear in many rank-frequency statistics. Furthermore, we can construct graphs with words as nodes and use various rules to introduce edges between words. In many cases, this results in SWGs, which again often have a power-law distribution for their node degrees.
2. SWGs appear in many other kinds of real-world data, such as social networks of many kinds, the link structure of the World Wide Web and traffic networks. It is interesting to analyze all these networks in more detail to identify similarities and differences.
3. From an application-driven view, SWGs allow effective clustering strategies in nearly linear time. Because these clusters are often related to the growth process of the underlying graph, they are often meaningful. In the case of natural language, these clusters usually reflect semantic and/or syntactic structures.

After discussing several data sources that exhibit power-law distributions with respect to rank frequency in Section 2, graphs with small world properties in language data are discussed in Section 3. We shall see that these characteristics are omnipresent in language data, and we should be aware of them when designing structure discovery processes. For example, the knowledge that a few hundred words make up the bulk of the words in a text allows one to use only these words as contextual features with only a minor loss in text coverage. Knowing that word co-occurrence networks possess the scale-free small world property has implications for clustering these networks.

An interesting aspect is whether these characteristics are inherent only to real natural language data or whether they can be produced with generators of linear sequences in a much simpler way than our intuition about language complexity would suggest. In other words, we shall see how distinctive these characteristics are with respect to tests deciding whether a given sequence is natural language or not.
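One very simple generator of linear sequences, of the kind used for the comparisons in Sections 2.2 and 2.3, draws units independently according to their unigram distribution. The following is a minimal sketch of such a test, not the authors’ implementation: the function names and the toy input text are assumptions made for illustration, and a real experiment would read a corpus sample instead.

```python
import random
from collections import Counter


def ngram_counts(units, n):
    """Count all contiguous n-grams in a sequence of units (letters or words)."""
    return Counter(tuple(units[i:i + n]) for i in range(len(units) - n + 1))


def generate_from_unigrams(units, length, seed=0):
    """Draw `length` units independently according to the unigram distribution of `units`."""
    counts = Counter(units)
    rng = random.Random(seed)
    return rng.choices(list(counts.keys()), weights=list(counts.values()), k=length)


def rank_frequency(counts):
    """Frequencies in decreasing order; list index i corresponds to rank i + 1."""
    return sorted(counts.values(), reverse=True)


# Toy input (hypothetical); a real experiment would read a corpus sample instead.
text = "the cat sat on the mat and the dog sat on the log " * 50
letters = list(text)

real_bigrams = ngram_counts(letters, 2)
generated = generate_from_unigrams(letters, len(letters))
generated_bigrams = ngram_counts(generated, 2)

print("distinct real bigrams:     ", len(real_bigrams))
print("distinct generated bigrams:", len(generated_bigrams))
print("top real frequencies:      ", rank_frequency(real_bigrams)[:5])
print("top generated frequencies: ", rank_frequency(generated_bigrams)[:5])
```

Passing a list of word tokens instead of a list of letters gives the word-level variant of the same comparison.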
2 Power Laws in Rank-Frequency Distribution

G. K. Zipf [31, 32] described the following phenomenon: if all words in a corpus of natural language are arranged in decreasing order of frequency, then the relation between a word’s frequency and its rank in the list follows a power law. Since then, a significant amount of research has been devoted to the question of how this property emerges and what kinds of processes generate such Zipfian distributions. In this section, some datasets related to language are presented that exhibit a power law in their rank-frequency distribution. For this discussion, basic units of language are examined.

2.1 Word Frequency

The relation between the frequency of a word at rank r and its rank is given by $f(r) \sim r^{-z}$, where z is the exponent of the power law that corresponds to the slope of the curve in a log-log plot. The exponent z was assumed to be exactly 1 by Zipf. In natural language data, slightly differing exponents in the range of about 0.7 to 1.2 are also observed [30]. B. Mandelbrot [21] provided a formula that more closely approximates the frequency distributions in language data after noticing that Zipf’s law holds only for the medium range of ranks, whereas the curve is flatter for very frequent words and steeper for high ranks. Figure 1 displays the word rank-frequency distributions of corpora of different languages taken from the Leipzig Corpora Collection.¹ There exist several exhaustive collections of research capitalising on Zipf’s law and related distributions² ranging over a wide area of datasets; here, only findings related to natural language are reported.

¹ LCC, see http://www.corpora.uni-leipzig.de [July 7th, 2007].
² E.g. http://www.nslij-genetics.org/wli/zipf/index.html [April 1, 2007].

[Figure 1: log-log plot of frequency against rank for the corpora German 1M, English 300K, Italian 300K and Finnish 100K, with reference power laws γ = 1 and γ = 0.8.]
Fig. 1. Zipf’s law for various corpora. The numbers next to the language give the corpus size in sentences. Enlarging the corpus does not affect the slope of the curve, but merely moves it upwards in the plot. Most lines are almost parallel to the ideal power-law curve with z = 1. Finnish exhibits a lower slope of γ ≈ 0.8, consistent with its higher morphological productivity.

A related distribution is the lexical spectrum [16], which gives the probability of choosing a word from the vocabulary with a given frequency. For natural language, the lexical spectrum follows a power law with slope $\gamma = \frac{1}{z} + 1$, where z is the exponent
of the Zipfian rank-frequency distribution. For the relation between lexical spectrum, Zipf’s law and Pareto’s law, see [1].

Zipf’s law in its original form is just the tip of the iceberg of power-law distributions in a quantitative description of language. While a Zipfian distribution for word frequencies can be obtained by a simple model of generating letter sequences with space characters as word boundaries [21, 22], these models based on “intermittent silence” can neither reproduce the distributions on sentence length [26] nor explain the relations of words in sequence. Next, more power-law distributions in natural language are discussed and exemplified.

2.2 Letter N-Grams

To continue with a counterexample, letter frequencies do not obey a power law in the rank-frequency distribution. This also holds for letter N-grams (including the space character), yet for higher N, the rank-frequency plots show a large power-law regime with exponential tails for high ranks. Figure 2 shows the rank-frequency plots for letter N-grams up to N = 6 for the first 10,000 sentences of the British National Corpus (BNC,³ [10]).

³ http://www.natcorp.ox.ac.uk/ [April 1, 2007]

[Figure 2: log-log plot of frequency against rank for letter 1-grams to 6-grams, with a reference power law γ = 0.55.]
Fig. 2. Rank-frequency distributions for letter N-grams for the first 10,000 sentences in the BNC. Letter N-gram rank-frequency distributions do not exhibit power laws on the full scale, but increasing N results in a larger power-law regime for low ranks.

Still, letter frequency distributions can be used to show that single letters do not combine into letter bigrams independently, but that there are restrictions on their combination. While this intuitively seems obvious for
letter combination, the following test is proposed for quantitatively examining the effects of these restrictions: from letter unigram probabilities, a text is generated that follows the letter unigram distribution by randomly and independently drawing letters according to their distribution and concatenating them. The letter bigram frequency distribution of this generated text can be compared to the letter bigram frequency distribution of the real text from which the unigram distribution was measured. Figure 3 shows the generated plot and the real rank-frequency plot, again from the small BNC sample. The two curves clearly differ. The generated bigrams without restrictions predict a higher number of different bigrams and lower frequencies for bigrams of high ranks as compared to the real text bigram statistics. This shows that letter combination restrictions do exist, as not all bigrams predicted by the generation process were observed, resulting in higher counts for valid bigrams in the sample.

[Figure 3: log-log plot of frequency against rank for letter 2-grams generated from the letter 1-gram distribution and for real letter 2-grams.]
Fig. 3. Rank-frequency plots for letter bigrams, for a text generated from letter unigram probabilities and for the BNC sample.

2.3 Word N-Grams

For word N-grams, the relation between rank and frequency follows a power law, just as in the case of words (unigrams). Figure 4 (left) shows the rank-frequency plots up to N = 4, based on the first 1 million sentences of the BNC.

[Figure 4: two log-log plots of frequency against rank. Left: word 1-grams to 4-grams. Right: word 2-grams generated from the word 1-gram distribution and real word 2-grams.]
Fig. 4. Left: Rank-frequency distributions for word N-grams for the first one million sentences in the BNC. Word N-gram rank-frequency distributions exhibit power laws. Right: Rank-frequency plots for word bigrams, for a text generated from word unigram probabilities and for the BNC sample.

As more different word combinations are possible with increasing N,
the curves become flatter as the same total frequency is shared amongst more units, as previously observed (e.g. [27, 18]). Testing concatenation restrictions quantitatively as above for letters, it might at first seem surprising that the curve for a text generated with word unigram frequencies differs only very little from the word bigram curve, as Fig. 4 (right) shows. Small differences are only observable for low ranks: more top-rank generated bigrams reflect that words are usually not repeated in the text. More low-ranked and fewer high-ranked real bigrams indicate that word concatenation takes place not entirely without restrictions, yet is subject to much more variety than letter concatenation. This coincides with the intuition that it is, for a given word pair, almost always possible to form a correct English sentence in which these words are neighbours. Regarding quantitative (as opposed to syntactic or semantic) aspects, the frequency distribution of word bigrams can be produced by a generation process based on word unigram probabilities.

2.4 Sentence Frequency

Larger corpora that are compiled from a variety of sources contain a considerable number of duplicate sentences. In the full BNC, which serves as the data basis in this case, 7.3% of the sentences occur two or more times. The most frequent sentences are “Yeah.”, “Mm.”, “Yes.” and “No.”, which are mostly found in the section of spoken language. But longer expressions like “Our next bulletin is at 10.30 p.m.” also have a count of over 250. The sentence frequencies also follow a power law with an exponent close to 1 (see Fig. 5), indicating that Zipf’s law also holds for sentence frequencies.

[Figure 5: log-log plot of frequency against rank for sentences in the BNC, with a reference power law γ = 0.9.]
Fig. 5. Rank-frequency plot for sentence frequencies in the full BNC, following a power law with γ ≈ 0.9, but with a high fraction of sentences occurring only once.

2.5 Other Power Laws in Language Data

The preceding results strongly suggest that when counting document frequencies in large collections such as the World Wide Web, another power-law
distribution would be found, but such an analysis has not been carried out and would require access to the index of a web search engine. Further, there are more power laws in language-related areas; some are mentioned here briefly to illustrate their omnipresence.

• Web page requests follow a power law, which was employed for a caching mechanism in [17].
• Related to this, frequencies of web search queries during a fixed time span also follow a power law, as exemplified in Fig. 6 for a 7-million query log of AltaVista⁴ as used by Lempel [19].
• The number of authors of Wikipedia⁵ articles was found to follow a power law with γ ≈ 2.7 for a large regime in [29]. The same paper further discusses various other power-law relationships.

[Figure 6: log-log plot of frequency against rank for search queries, with a reference power law γ = 0.75.]
Fig. 6. Rank-frequency plot for AltaVista search queries, following a power law with γ ≈ 0.75.
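The exponents quoted in this section (for instance γ ≈ 0.9 for sentences or γ ≈ 0.75 for search queries) describe the slope of the rank-frequency curve in a log-log plot. A rough way to estimate such an exponent is an ordinary least-squares fit of log frequency against log rank, as sketched below. This is an illustrative assumption about how one might obtain the slope, not the procedure used for the figures; the function names are hypothetical, and a least-squares fit on log-log data is known to be a crude estimator, so the rank range should be restricted to the regime where the power law actually holds.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Frequencies in decreasing order; list index i corresponds to rank i + 1."""
    return sorted(Counter(tokens).values(), reverse=True)


def fit_power_law_exponent(frequencies, min_rank=1, max_rank=None):
    """Least-squares slope of log f(r) against log r.

    Returns the exponent z in f(r) ~ r^(-z), fitted on ranks min_rank..max_rank.
    """
    max_rank = max_rank or len(frequencies)
    ranks = range(min_rank, max_rank + 1)
    xs = [math.log(r) for r in ranks]
    ys = [math.log(frequencies[r - 1]) for r in ranks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


# Synthetic check: frequencies that follow f(r) = 10^6 / r exactly, i.e. z = 1.
ideal = [1e6 / r for r in range(1, 1001)]
z = fit_power_law_exponent(ideal)
print(f"estimated z = {z:.3f}")
# Exponent of the lexical spectrum according to the relation given in Section 2.1.
print(f"lexical spectrum exponent 1/z + 1 = {1 / z + 1:.3f}")
```

For real data, `rank_frequency` can be applied to a token list first, and `min_rank` and `max_rank` can be used to exclude the flatter head and the steeper tail of the curve.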
⁴ http://www.altavista.com
⁵ http://www.wikipedia.org

3 Scale-Free Small Worlds in Language Data

The previous section discussed the shape of rank-frequency distributions for natural language units. Now the properties of graphs with units represented as vertices and relations between them as edges will be the focus of interest. Internal as well as contextual features can be employed for computing similarities between language units that are represented as (possibly weighted) edges
in the graph. Some of the graphs discussed here can be classified as scale-free SWGs; others have different characteristics and represent other, but related, graph classes.

3.1 Word Co-Occurrence Graph

The notion of word co-occurrence is used to model dependencies between words. If two words X and Y occur together in some contextual unit of information (as neighbours, in a word window of 5, in a clause, in a sentence, in a paragraph), they are said to co-occur. When regarding words as vertices and edge weights as the number of times two words co-occur, the word co-occurrence graph of a corpus is given by the entirety of all word co-occurrences. In the following, two specific types of co-occurrence graphs are considered: the graph induced by neighbouring words, henceforth called the neighbour-based graph, and the graph induced by sentence-based co-occurrence, henceforth called the sentence-based graph. The neighbour-based graph can be undirected or directed, with edges going from the left to the right words as found in the corpus; the sentence-based graph is undirected.

To find out whether the co-occurrence of two specific words A and B is merely due to chance or exhibits a statistical dependency, measures are used that compute to what extent the co-occurrence of A and B is statistically significant. Many significance measures can be found in the literature; for extensive overviews consult e.g. [9] or [14]. In general, the measures compare the probability for A and B to co-occur under the assumption of their statistical independence with the actual probability of their joint co-occurrence in the corpus. In this work, the log likelihood ratio [13] is used to sort the wheat from the chaff.
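The two kinds of counts described above, and the scoring of pairs with the log likelihood ratio, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors’ implementation: all function names are hypothetical, the number of contexts n is taken to be the number of sentences, and the statistic is written in the expanded form that this section quotes from [9]. The parameters `t` and `s` stand for the frequency and significance thresholds discussed in this section.

```python
import math
from collections import Counter
from itertools import combinations


def sentence_counts(sentences):
    """Sentence-based counts: number of sentences per word and per unordered word pair."""
    word_freq, pair_freq = Counter(), Counter()
    for sentence in sentences:
        words = sorted(set(sentence))
        word_freq.update(words)
        pair_freq.update(combinations(words, 2))
    return word_freq, pair_freq


def neighbour_counts(sentences):
    """Neighbour-based counts: directed pairs (left word, right word) of adjacent words."""
    pair_freq = Counter()
    for sentence in sentences:
        pair_freq.update(zip(sentence, sentence[1:]))
    return pair_freq


def xlogx(x):
    """x * log(x), with the usual convention 0 * log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood(n, n_a, n_b, n_ab):
    """-2 log lambda in the expanded form quoted from [9]."""
    return 2 * (
        xlogx(n) - xlogx(n_a) - xlogx(n_b) + xlogx(n_ab)
        + xlogx(n - n_a - n_b + n_ab)
        + xlogx(n_a - n_ab) + xlogx(n_b - n_ab)
        - xlogx(n - n_a) - xlogx(n - n_b)
    )


def significant_graph(sentences, t=2, s=6.63):
    """Keep sentence-based pairs with count >= t and significance >= s as weighted edges."""
    word_freq, pair_freq = sentence_counts(sentences)
    n = len(sentences)
    edges = {}
    for (a, b), n_ab in pair_freq.items():
        if n_ab >= t:
            significance = log_likelihood(n, word_freq[a], word_freq[b], n_ab)
            if significance >= s:
                edges[(a, b)] = significance
    return edges


# Toy corpus (hypothetical), given as tokenised sentences.
corpus = [["jazz", "band"]] * 5 + [["coal", "mine"]] * 5 + [["rock", "band"], ["jazz", "mine"]]
for pair, significance in sorted(significant_graph(corpus, t=2, s=3.84).items()):
    print(pair, round(significance, 2))
```

Vertices that end up without any edge simply never appear in the returned dictionary, which corresponds to excluding singletons.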
This ratio is given in expanded form in [9]:

\[
-2 \log \lambda = 2 \left[
\begin{aligned}
& n \log n - n_A \log n_A - n_B \log n_B + n_{AB} \log n_{AB} \\
& + (n - n_A - n_B + n_{AB}) \log (n - n_A - n_B + n_{AB}) \\
& + (n_A - n_{AB}) \log (n_A - n_{AB}) + (n_B - n_{AB}) \log (n_B - n_{AB}) \\
& - (n - n_A) \log (n - n_A) - (n - n_B) \log (n - n_B)
\end{aligned}
\right],
\]

where $n$ is the total number of contexts, $n_A$ the frequency of A, $n_B$ the frequency of B and $n_{AB}$ the number of co-occurrences of A and B. As pointed out by Moore [23], this formula overestimates the co-occurrence significance for small $n_{AB}$. For this reason, a frequency threshold t on $n_{AB}$ (e.g. a minimum of $n_{AB} = 2$) is often applied. Further, a significance threshold s regulates the density of the graph; for the log likelihood ratio, the significance values correspond to the $\chi^2$ tail probabilities [23], which makes it possible to translate the significance value into an error rate for rejecting the independence assumption.⁶

⁶ For example, a log likelihood ratio of 3.84 corresponds to a 5% error in stating that two words do not co-occur by chance; a significance of 6.63 corresponds to a 1% error.

The operation of applying a significance test results in pruning edges
that exist due to random noise and keeping almost exclusively those edges that reflect a true association between their endpoints. Graphs that contain all significant co-occurrences of a corpus, with edge weights set to the significance value between their endpoints, are called significant co-occurrence graphs in the remainder. For convenience, no singletons are allowed in the graph, i.e. if a vertex is not contained in any edge because none of the co-occurrences for the corresponding word is significant, then the vertex is excluded from the graph.

As observed previously [15, 24], word co-occurrence graphs exhibit the scale-free small world property. This is in line with co-occurrence graphs reflecting human associations [25] and human associations in turn forming SWGs [28]. The claim is confirmed here on an exemplary basis with the graph for the Leipzig Corpora Collection’s (LCC’s) 1-million sentence corpus for German. Figure 7 gives the degree distributions and graph characteristics for various co-occurrence graphs.

[Figure 7: two log-log plots of the fraction of vertices per degree against degree interval. Left: de1M neighbour-based graphs (t = 2 and significant graph with t = 10, s = 10; indegree and outdegree), with a reference power law γ = 2. Right: de1M sentence-based graphs (t = 10 and significant graph with t = 10, s = 10), with a reference power law γ = 2.]
Fig. 7. Graph characteristics for various co-occurrence graphs of LCC’s 1-million sentence German corpus. Abbreviations: nb = neighbour-based, sb = sentence-based, sig. = significant, t = co-occurrence frequency threshold, s = co-occurrence significance threshold. While the exact shapes of the distributions are language and corpus dependent, the overall characteristics are valid for all samples of natural language of sufficient size. The slope of the distribution is invariant to changes of thresholds. Characteristic path length and a high clustering coefficient at low average degrees are characteristic for SWGs.

The shape of the distribution is dependent on the language, as Fig. 8 shows. Some languages, here English and Italian, have a hump-shaped distribution in the log-log plot where the first regime follows a power law with a lower exponent than the second regime, as observed in [15]. For the Finnish and German corpora examined here, this effect could not be found in the data.

[Figure 8: log-log plot of the fraction of vertices per degree against degree interval for the significant sentence-based graphs (t = 2, s = 6.63) of Italian 300K, English 300K and Finnish 100K, with reference power laws γ = 2.5, γ = 1.5 and γ = 2.8.]
Fig. 8. Degree distribution of significant sentence-based co-occurrence graphs of similar thresholds for Italian, English and Finnish.

This property of two power-law regimes in the degree distribution of word co-occurrence graphs motivated the Dorogovtsev-Mendes (DM)-model, see [12]. There, the
crossover point of the two power-law regimes is motivated by a kernel lexicon of about 5000 words that can be combined with all words of a language. The original experiments of [15] operated on a word co-occurrence graph with window size 2: an edge is drawn between words if they appear together at least once at a distance of one or two words in the corpus. Reproducing their experiment with the first 70 million words of the BNC and corpora of German, Icelandic and Italian of similar size reveals that the degree distribution of the English and the Italian graph is in fact approximated by two power-law regimes. In contrast to this, German and Icelandic show a single power-law distribution, just as in the experiments above; see Fig. 9.

[Figure 9: two log-log plots of the number of vertices per degree against degree, for window size 2. Left: Icelandic and German, with a reference power law γ = 2. Right: Italian and English (BNC), with reference power laws γ = 1.6 and γ = 2.6.]
Fig. 9. Degree distributions in word co-occurrence graphs for window size 2. Left: The distribution for German and Icelandic is approximated by a power law with γ = 2. Right: For English (BNC) and Italian, the distribution is approximated by two power-law regimes.

These results suggest
that two power-law regimes in word co-occurrence graphs with window size 2 are not a language universal, but only hold for some languages.

To examine the hump-shaped distributions further, Fig. 10 displays the degree distribution for the neighbour-based word co-occurrence graphs and for the word co-occurrence graphs that connect only words appearing at a distance of 2. As becomes clear from the plots, the hump-shaped distribution is mainly caused by words co-occurring at distance 2, whereas the neighbour-based graph shows only a slight deviation from a single power law. Together with the observations from sentence-based co-occurrence graphs of different languages in Figure 8, it becomes clear that a hump-shaped distribution with two power-law regimes is caused by long-distance relationships between words, if present at all.

[Figure 10: two log-log plots of the number of vertices per degree against degree for Italian and English (BNC). Left: distance 1, with reference power laws γ = 1.8 and γ = 2.2. Right: distance 2, with reference power laws γ = 1.6 and γ = 2.6.]
Fig. 10. Degree distributions in word co-occurrence graphs for distance 1 and distance 2 for English (BNC) and Italian. The hump-shaped distribution is much more distinctive for distance 2.

3.1.1 Applications of Word Co-Occurrences

Word co-occurrence statistics are an established standard and have been used in many language processing systems. The authors have used co-occurrences in practical applications like bilingual dictionary acquisition [4, 11], semantic lexicon extension [8] and visualisation of concept trails [7]. The aim of this chapter is to underpin their applications with a theoretical foundation.

3.2 Co-Occurrence Graphs of Higher Order

The significant word co-occurrence graph of a corpus represents words that are likely to appear near each other. When one is interested in words co-occurring with similar other words, it is possible to transform the above-defined (first-order) co-occurrence graph into a second-order co-occurrence graph by drawing an edge between two words A and B if they share a common
neighbour in the first-order graph. Whereas the first-order word co-occurrence graph represents the global context per word, the corresponding second-order graph contains relations between words which have similar global contexts. The edge can be weighted according to the number of common neighbours, e.g. by weight = |neigh(A) ∩ neigh(B)|. Figure 11 shows neighbourhoods of the significant sentence-based first-order word co-occurrence graph from LCC’s English web corpus⁷ for the words jazz and rock.

⁷ http://corpora.informatik.uni-leipzig.de/?dict=en [April 1, 2007]

[Figure 11: neighbourhood graphs of the words jazz and rock.]
Fig. 11. Neighbourhoods of jazz and rock in the significant sentence-based word co-occurrence graph as displayed on LCC’s English corpus website. Both neighbourhoods contain album, music, singer and band, which leads to an edge weight of 4 in the second-order graph.

Taking into account only the data depicted, jazz and rock are connected with an edge of weight 4 in the second-order graph, corresponding to their common neighbours album, music, singer and band. The fact that they share an edge in the first-order graph is ignored.

In general, a graph of order N + 1 can be obtained from the graph of order N, using the same transformation. The higher-order transformation without thresholding is equivalent to a multiplication of the unweighted adjacency matrix A with itself, followed by a zeroing of the main diagonal by subtracting the degree matrix of A. Since the average path length of scale-free SWGs is short and local clustering is high, this operation leads to an almost fully connected graph in the limit, which does not allow one to draw conclusions about the initial structure. Thus, the graph is pruned in every iteration N in the following way. For each vertex, only the $max_N$ outgoing edges with the highest weights are taken into account. Notice that this vertex degree threshold $max_N$ does not limit the maximum degree, as thresholding is asymmetric. This operation is equivalent to keeping only the $max_N$ largest entries per row in the adjacency matrix $A = (a_{ij})$ and then setting $A_t = (\mathrm{sign}(a_{ij} + a_{ji}))$, resulting in an undirected graph. To examine quantitative effects of the higher-order transformation, the sentence-based word co-occurrence graph of LCC’s 1-million German sentence corpus (s = 6.63, t = 2) underwent this operation. Figure 12 depicts the degree distributions for N = 2 and N = 3 for different $max_N$.
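The higher-order transformation and the subsequent pruning can be sketched on adjacency sets instead of matrices. The code below is a minimal illustration, not the authors’ implementation: the function names are hypothetical, the toy edge list only mimics the jazz and rock example of Fig. 11, and breaking ties between equally weighted edges alphabetically is an assumption made so that the result is deterministic.

```python
from collections import defaultdict
from itertools import combinations


def build_neighbours(edges):
    """Undirected first-order graph as a mapping vertex -> set of neighbours."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return neighbours


def second_order_weights(neighbours):
    """weight(A, B) = |neigh(A) ∩ neigh(B)|, the off-diagonal entries of A * A."""
    weights = defaultdict(lambda: defaultdict(int))
    # Every vertex contributes 1 to each pair of its own neighbours.
    for shared in neighbours.values():
        for a, b in combinations(sorted(shared), 2):
            weights[a][b] += 1
            weights[b][a] += 1
    return weights


def prune_and_symmetrise(weights, max_n):
    """Keep the max_n heaviest outgoing edges per vertex, then make the graph undirected."""
    kept = defaultdict(set)
    for a, row in weights.items():
        # Assumption: ties are broken alphabetically.
        top = sorted(row, key=lambda b: (-row[b], b))[:max_n]
        for b in top:
            kept[a].add(b)
            kept[b].add(a)
    return kept


# Toy first-order graph (hypothetical), loosely following Fig. 11.
common = ["album", "music", "singer", "band"]
edges = [("jazz", w) for w in common] + [("rock", w) for w in common]
edges += [("jazz", "rock"), ("jazz", "saxophonist"), ("rock", "coal")]

weights = second_order_weights(build_neighbours(edges))
print("weight(jazz, rock) =", weights["jazz"]["rock"])

pruned = prune_and_symmetrise(weights, max_n=1)
print("degree of album after pruning with max_n = 1:", len(pruned["album"]))
```

Because a vertex keeps every edge that any other vertex selected as one of its top entries, a vertex can end up with a degree above `max_n`, which is the asymmetry noted in the text. Applying `second_order_weights` to the pruned graph again would give a third-order graph.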
[Figure 12: two log-log plots of vertices per degree interval against degree. Left: German co-occurrence graph of order 2 (full, max 10, max 3). Right: German co-occurrence graph of order 3 built on order 2 with max 3, with reference power laws γ = 1 and γ = 4.]
Fig. 12. Degree distributions of word co-occurrence graphs of higher order. The first-order graph is the sentence-based word co-occurrence graph of LCC’s 1-million German sentence corpus (s = 6.63, t = 2). Left: N = 2 for $max_2 = 3$, $max_2 = 10$ and $max_2 = \infty$. Right: N = 3 for $t_2 = 3$, $t_3 = \infty$, using the second-order graph with $max_2 = 3$.
Applying the $max_N$ threshold causes the degree distribution to change, especially for high degrees. In the third-order graph, two power-law regimes are observable. Studying the degree distribution of higher-order word co-occurrence graphs revealed that the characteristic of being governed by power laws is invariant to the higher-order transformation, yet the power-law exponent changes. This indicates that the power-law characteristic is inherent at many levels in natural language data.

To examine what this transformation yields on graphs generated by random graph models, Figure 13 shows the degree distribution of second-order and third-order graphs as generated by the graph generation models of [3] (Barabási-Albert (BA)-model), [28] (Steyvers-Tenenbaum (ST)-model) and [12] (DM-model). The underlying first-order graphs are the undirected graphs of order 10,000 and size 50,000 (k = 10) from these three models. While the thorough interpretation of second-order graphs of random graphs might be subject to further studies, the following should be noted: the higher-order transformation reduces the power-law exponent of the BA-model graph from γ = 3 to γ = 2 in the second order and to γ ≈ 0.7 in the third order. For the ST-model, the degree distribution of the full second-order graph shows a maximum around $2M$, then decays with a power law with exponent γ ≈ 2.7. In the third-order ST-graph, the maximum moves to around $4M^2$ for sufficient $max_2$. The DM-model second-order graph shows, like the first-order DM-model graph, two power-law regimes in the full version, and a power law with γ ≈ 2 for the pruned versions. The third-order degree distribution exhibits many more vertices with high degrees than predicted by a power law.
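For readers who want to reproduce a first-order graph of comparable size, the following is a minimal sketch of a preferential-attachment generator in the spirit of the BA-model [3]. It is not the generator used for Fig. 13: the function names, the seed configuration (the first new vertex links to m isolated starter vertices) and the fixed random seed are assumptions. With n = 10,000 and m = 5 it yields (10,000 − 5) · 5 = 49,975 edges, close to the size of 50,000 mentioned above.

```python
import random
from collections import Counter


def preferential_attachment_graph(n, m, seed=0):
    """Grow an undirected graph in which each new vertex links to m distinct
    existing vertices, chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    edges = []
    endpoints = []            # each vertex appears once per incident edge
    targets = list(range(m))  # assumption: the first new vertex links to m isolated starters
    for new in range(m, n):
        edges.extend((new, target) for target in targets)
        endpoints.extend(targets)
        endpoints.extend([new] * m)
        # Sampling uniformly from the endpoint list is sampling proportional to degree.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        targets = list(chosen)
    return edges


def degree_distribution(edges):
    """Map each degree to the number of vertices that have it."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


edges = preferential_attachment_graph(n=10_000, m=5)
distribution = degree_distribution(edges)
print("edges:", len(edges))
for k in sorted(distribution)[:5]:
    print(f"degree {k}: {distribution[k]} vertices")
```

The resulting edge list can be turned into adjacency sets and passed through the higher-order transformation to study how the degree distribution changes with the order.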
[Figure 13: six log-log plots of vertices per degree interval against degree, with panels BA order 2, BA order 3, ST order 2, ST order 3, DM order 2 and DM order 3. The order 2 panels show the full graph and the graphs pruned with max 10 and max 3; the order 3 panels are built on order 2 with max 10 and max 3. Reference power laws: γ = 2 (BA order 2), γ = 0.7 (BA order 3), γ = 2.5 (ST order 2), and γ = 2, γ = 1 and γ = 4 (DM order 2).]
Fig. 13. Second- and third-order graph degree distributions for BA-model, ST-model and DM-model graphs.
In summary, all random graph models exhibit clear differences from word co-occurrence networks with respect to the higher-order transformation. The ST-model shows maxima depending on the average degree of the first-order graph. The power-law exponent of the BA-model decreases with higher orders, but the model is able to explain a degree distribution with power-law exponent 2. The full DM-model exhibits the same two power-law regimes in the second order as observed for German sentence-based word co-occurrences in the third order.

3.2.1 Applications of Co-Occurrence Graphs of Higher Orders

In [6] and [20], the utility of word co-occurrence graphs of higher orders is examined for lexical semantic acquisition. The highest potential for extracting paradigmatic semantic relations can be attributed to second- and third-order word co-occurrences. In [9], second-order graphs are evaluated against lexical semantic resources.

3.3 Sentence Similarity

Using words as internal features, the similarity of two sentences can be measured by the number of common words they share. Since the few top-frequency words are contained in most sentences as a consequence of Zipf’s law, their influence should be downweighted or they should be excluded to arrive at a useful measure for sentence similarity. Here, the sentence similarity graph of sentences sharing at least two common words is examined, with the maximum frequency of these words bounded by 100. This maximum frequency threshold was arbitrarily chosen and could be replaced by a weighting scheme that attributes more weight to less frequent words. However, a hard threshold reduces the computational cost significantly. The corpus examined here is LCC’s 3-million sentence German corpus. Figure 14 shows the component size distribution for this sentence similarity graph; Figure 15 shows the degree distributions for the entire graph and for its largest component.
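The construction of the sentence similarity graph can be sketched with an inverted index from feature words to sentences. This is an illustrative sketch under stated assumptions, not the authors’ implementation: the function name and the toy sentences are hypothetical, and word frequency is counted here as token frequency in the given sentence list. The sketch also shows why a hard threshold is cheap: a word with frequency at most `max_word_freq` can contribute only a bounded number of sentence pairs.

```python
from collections import Counter, defaultdict
from itertools import combinations


def sentence_similarity_graph(sentences, max_word_freq=100, min_common=2):
    """Connect sentences that share at least `min_common` words whose corpus
    frequency does not exceed `max_word_freq`; edge weight = number of shared words."""
    word_freq = Counter(word for sentence in sentences for word in sentence)
    # Inverted index: feature word -> ids of the sentences that contain it.
    index = defaultdict(set)
    for sentence_id, sentence in enumerate(sentences):
        for word in set(sentence):
            if word_freq[word] <= max_word_freq:
                index[word].add(sentence_id)
    # Each feature word adds 1 to every pair of sentences that contain it.
    common = Counter()
    for sentence_ids in index.values():
        common.update(combinations(sorted(sentence_ids), 2))
    return {pair: count for pair, count in common.items() if count >= min_common}


# Toy data (hypothetical); the threshold is lowered to 3 so that "the" is excluded.
sentences = [
    ["the", "jazz", "band", "played"],
    ["the", "band", "played", "rock"],
    ["the", "coal", "mine", "closed"],
    ["the", "mine", "closed", "early"],
    ["the", "jazz", "singer"],
]
print(sentence_similarity_graph(sentences, max_word_freq=3))
print(sentence_similarity_graph(sentences, max_word_freq=100))
```

In the second call the frequent word "the" counts as a feature, so additional sentence pairs become connected; this is the effect the frequency bound is meant to suppress.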
The degree distribution of the entire graph follows a power law with γ close to 1 for low degrees and decays faster for high degrees; the largest component's degree distribution plot is flatter for low degrees. This can be attributed to limited sentence length: as sentences are not arbitrarily long, they cannot be similar to an arbitrarily high number of other sentences with respect to the measure discussed here, as the number of sentences per feature word is bounded by the word frequency limit. However, the extremely high values for transitivity and clustering coefficient, the low γ values of the degree distribution for low-degree vertices, and the comparably long average shortest path lengths indicate that the sentence similarity graph belongs to a different graph class than all other graphs discussed above.
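The construction of the sentence similarity graph described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sentence_similarity_graph`, the whitespace tokenization and the lowercase normalization are all assumptions made for the example. Only words whose corpus frequency does not exceed `max_freq` serve as features, and an edge is kept when two sentences share at least `min_common` such words.

```python
from collections import Counter, defaultdict
from itertools import combinations


def sentence_similarity_graph(sentences, max_freq=100, min_common=2):
    """Return {(i, j): number of shared feature words} for connected sentence pairs."""
    # Whitespace tokenization is a simplification chosen for this sketch.
    tokenized = [s.lower().split() for s in sentences]
    freq = Counter(w for words in tokenized for w in words)

    # Inverted index: feature word -> ids of the sentences containing it.
    index = defaultdict(set)
    for sid, words in enumerate(tokenized):
        for w in set(words):
            if freq[w] <= max_freq:
                index[w].add(sid)

    # Count the shared feature words of every sentence pair that co-occurs in the index.
    shared = Counter()
    for sids in index.values():
        for i, j in combinations(sorted(sids), 2):
            shared[(i, j)] += 1

    # Keep only pairs that reach the minimum number of common feature words.
    return {pair: c for pair, c in shared.items() if c >= min_common}
```

Because every feature word occurs at most `max_freq` times, each posting list in the inverted index is short, which is one way to see why the hard threshold keeps the pair enumeration cheap.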
[Figure 14: log-log plot; x-axis: component size, y-axis: # of vertices; series: sentence similarity components, power-law gamma=2.7.]
Fig. 14. Component size distribution for the sentence similarity graph of LCC’s 3-million sentence German corpus. The component size distribution follows a power law with γ ≈ 2.7 for small components, the largest component comprises 211,447 out of 416,922 total vertices. The component size distribution complies with the theoretical results of [2].
[Figure 15: log-log plot; x-axis: degree, y-axis: # of vertices; series: de3M sentences (>1 common, freq <101), only largest component, power-law gamma=0.6, power-law gamma=1.1.]
Fig. 15. Degree distribution for the sentence similarity graph of LCC’s 3-million sentence German corpus and its largest component. An edge between two vertices representing sentences is drawn if the sentences share at least two words with corpus frequency <101; singletons are excluded.
3.3.1 Applications of the Sentence Similarity Graph
A similar measure is used in [5] for document similarity and obtains well-correlated results when evaluated against a given document classification. A precision-recall tradeoff arises when lowering the frequency threshold for feature words or increasing the minimum number of common feature words two documents must have in order to be connected in the graph: both improve precision but result in many singleton vertices, which lowers the total number of documents that are considered.
3.4 Summary of Scale-Free Small Worlds in Language Data
The preceding examples confirm the claim that graphs built on various aspects of natural language data often exhibit the scale-free small world property or similar characteristics. Experiments with generated text corpora suggest that this is mainly due to the power-law frequency distribution of language units. The slopes of the power law approximating the degree distributions often cannot be produced using the random graph generation models. Specifically, all previously discussed generation models fail to explain the properties of word co-occurrence graphs, where γ ≈ 2 was observed as the power-law exponent of the degree distribution. Of the generation models inspired by language data, the ST-model exhibits γ = 3, whereas the universality of the DM-model to capture word co-occurrence graph characteristics can be doubted after examining data from different languages.
References
1. Adamic, L. A. (2000). Zipf, power-law, Pareto – a ranking tutorial. Technical report, Information Dynamics Lab, HP Labs, Palo Alto, CA 94304.
2. Aiello, W., Chung, F., and Lu, L. (2000). A random graph model for massive graphs. In STOC ’00: Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, pages 171–180, New York, NY, USA. ACM Press.
3. Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286, 509.
4. Biemann, C. and Quasthoff, U. (2005). Dictionary acquisition using parallel text and co-occurrence statistics. In Proceedings of NODALIDA’05, Joensuu, Finland.
5. Biemann, C. and Quasthoff, U. (2007). Similarity of documents and document collections using attributes with low noise. In Proceedings of the Third International Conference on Web Information Systems and Technologies (WEBIST-07), pages 130–135, Barcelona, Spain.
6. Biemann, C., Bordag, S., and Quasthoff, U. (2004a). Automatic acquisition of paradigmatic relations using iterated co-occurrences. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC-04), Lisbon, Portugal.
7. Biemann, C., Böhm, C., Heyer, G., and Melz, R. (2004b). Automatically building concept structures and displaying concept trails for the use in brainstorming sessions and content management systems. In Proceedings of Innovative Internet Community Systems (IICS-2004), Springer LNCS, Guadalajara, Mexico.
8. Biemann, C., Shin, S.-I., and Choi, K.-S. (2004c). Semiautomatic extension of CoreNet using a bootstrapping mechanism on corpus-based co-occurrences. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-04), Morristown, NJ, USA. Association for Computational Linguistics.
9. Bordag, S. (2007). Elements of Knowledge-free and Unsupervised Lexical Acquisition. Ph.D. thesis, University of Leipzig.
10. Burnard, L. (1995). Users Reference Guide for the British National Corpus. Oxford University Computing Service, Oxford, U.K.
11. Cysouw, M., Biemann, C., and Ongyerth, M. (2007). Using Strong’s numbers in the Bible to test an automatic alignment of parallel texts. Special issue of Sprachtypologie und Universalienforschung (STUF), pages 66–79.
12. Dorogovtsev, S. N. and Mendes, J. F. F. (2001). Language as an evolving word web. Proceedings of The Royal Society of London. Series B, Biological Sciences, 268(1485), 2603–2606.
13. Dunning, T. E. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
14. Evert, S. (2004). The Statistics of Word Co-occurrences: Word Pairs and Collocations. Ph.D. thesis, University of Stuttgart.
15. Ferrer-i-Cancho, R. and Solé, R. V. (2001). The small world of human language. Proceedings of The Royal Society of London. Series B, Biological Sciences, 268(1482), 2261–2265.
16. Ferrer-i-Cancho, R. and Solé, R. V. (2002). Zipf’s law and random texts. Advances in Complex Systems, 5(1), 1–6.
17. Glassman, S. (1994). A caching relay for the world wide web. Computer Networks and ISDN Systems, 27(2), 165–173.
18. Ha, L. Q., Sicilia-Garcia, E. I., Ming, J., and Smith, F. J. (2002). Extension of Zipf’s law to words and phrases. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-02), pages 1–6, Morristown, NJ, USA. Association for Computational Linguistics.
19. Lempel, R. and Moran, S. (2003). Predictive caching and prefetching of query results in search engines. In Proceedings of the 12th International Conference on World Wide Web (WWW-03), pages 19–28, New York, NY, USA. ACM Press.
20. Mahn, M. and Biemann, C. (2005). Tuning co-occurrences of higher orders for generating ontology extension candidates. In Proceedings of the ICML-05 Workshop on Ontology Learning and Extension using Machine Learning Methods, Bonn, Germany.
21. Mandelbrot, B. B. (1953). An information theory of the statistical structure of language. In Proceedings of the Symposium on Applications of Communications Theory. Butterworths.
22. Miller, G. A. (1957). Some effects of intermittent silence. American Journal of Psychology, 70, 311–313.
23. Moore, R. C. (2004). On log-likelihood-ratios and the significance of rare events. In D. Lin and D. Wu, editors, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-04), pages 333–340, Barcelona, Spain. Association for Computational Linguistics.
24. Quasthoff, U., Richter, M., and Biemann, C. (2006). Corpus portal for search in monolingual corpora. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC-06), pages 1799–1802, Genoa, Italy.
25. Rapp, R. (1996). Die Berechnung von Assoziationen: ein korpuslinguistischer Ansatz. Olms, Hildesheim.
26. Sigurd, B., Eeg-Olofsson, M., and van de Weijer, J. (2004). Word length, sentence length and frequency – Zipf revisited. Studia Linguistica, 58(1), 37–52.
27. Smith, F. J. and Devine, K. (1985). Storing and retrieving word phrases. Inf. Process. Manage., 21(3), 215–224.
28. Steyvers, M. and Tenenbaum, J. B. (2005). The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29(1), 41–78.
29. Voss, J. (2005). Measuring Wikipedia. In P. Ingwersen and B. Larsen, editors, ISSI2005, volume 1, pages 221–231, Stockholm. International Society for Scientometrics and Informetrics.
30. Zanette, D. H. and Montemurro, M. A. (2005). Dynamics of text generation with realistic Zipf’s distribution. Journal of Quantitative Linguistics, 12(1), 29–40.
31. Zipf, G. K. (1935). The Psycho-Biology of Language. Houghton Mifflin, Boston.
32. Zipf, G. K. (1949). Human Behavior and the Principle of Least-Effort. Addison-Wesley, Cambridge, MA.
Efficiency of Navigation in Indexed Networks
Petter Holme (1, 2, 3)
1 Department of Computer Science, University of New Mexico, Albuquerque, NM 87131, USA
2 School of Computer Science and Communication, Royal Institute of Technology, 10044 Stockholm, Sweden
3 Department of Physics, Umeå University, 90187 Umeå, Sweden; [email protected]
1 Introduction
The interplay between network structure and search dynamics has emerged as a busy subfield of statistical network studies (see e.g. Refs. [1, 9, 10, 13, 14]). Consider a simple graph G = (V, E), where V is a set of n vertices and E is a set of m edges (unordered pairs of vertices). Assume information packets travel from a source vertex s to a destination t. We assume the packets are myopic agents (at a given time step they have access to information about the vertices in their neighborhood, but not more) that have memory (so they can e.g. perform a depth-first search) but no previous knowledge of the network. Let τ(p) be the time for a packet p to travel between its source and destination. One commonly studied measure of search efficiency is the expectation value of τ, $\bar{\tau}$, for randomly chosen s and t. In this chapter we attempt to find efficient ways to index V (with numbers from 1 to n) and utilize these indices for packet navigation. In other words, we try to find ways to compress the global information about the network into numbers 1, . . . , n so that the information can be used by packets to find short paths to their destinations. We propose two schemes of indexing the vertices, and corresponding methods for packet navigation. These schemes, along with two depth-first search methods (not using vertex indices for more than remembering the path), are examined on four network models. We will first present the indexing and search schemes, then the network models for testing the algorithms and finally the numerical results.
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_11, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
2 Indexing and Search Schemes
Now we turn to the schemes for assigning indices to the vertices and using them in search processes. Our two main schemes are both inspired by search trees. Packets first move towards a root vertex r, then towards the destination. Unless the network really is a tree, this approach cannot be exact: a packet is not guaranteed to find the shortest way both from s to r and from r to t. However, as we will see, one can assign indices such that the search either from s to r or from r to t is certain to be as short as possible. One of our schemes, ASU (accurate search up), will be such that the shortest upward search is guaranteed. The other, ASD (accurate search down), will have the shortest possible r to t search. On a technical note, V is a set of distinct elements and an indexing scheme is a bijection φ : V → I where I = {1, . . . , n}. In the remainder of the text we will not explicitly distinguish i ∈ V from φ(i).
2.1 ASD Indexing and Search
The numbers 1, . . . , n can be arranged into a search tree [3] such that the expected value of τ scales like log n. In Fig. 1(a) we give an example of a search tree. To go from source s to destination t a packet first moves to the root r by going to the neighbor with the lowest index value. From the root to the destination, the packet moves to the neighbor with the largest index smaller than, or equal to, t. Our strategy for the ASD indexing and search scheme is to construct a spanning tree T(G) for the network, index the tree to make it a search tree, and use the algorithm above to navigate from s to t. The problem is, however, that real networks are not trees. Imagine adding edges between vertices of the same heights and branches to the tree in Fig. 1(a): the tree will still be a spanning tree, but the packets may not take the same path from s to t any more. As we will see, with certain ways of constructing the tree and indexing the vertices, the search either from s to r or r to t will be optimal.
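The two-phase navigation rule just described can be written down in a few lines. The sketch below assumes the vertices are already labelled by their indices, that the indices form a depth-first preorder of a spanning tree rooted at index 1, and that the packet stops as soon as it stands on t. The function name `asd_search` and the ten-vertex graph are hypothetical illustrations and not a reproduction of Fig. 1.

```python
def asd_search(adj, s, t, root=1):
    """Path of a packet from s to t via the root on an index-labelled graph.

    adj maps every vertex index to the set of its neighbors' indices.
    Assumes the indices are a depth-first preorder of a spanning tree rooted at `root`.
    """
    path = [s]
    # Upward phase: always step to the neighbor with the lowest index.
    while path[-1] != root and path[-1] != t:
        path.append(min(adj[path[-1]]))
    # Downward phase: step to the neighbor with the largest index <= t.
    while path[-1] != t:
        path.append(max(v for v in adj[path[-1]] if v <= t))
    return path


# Hypothetical example: a preorder-indexed spanning tree plus three extra edges.
TREE_EDGES = [(1, 2), (2, 3), (2, 4), (2, 5), (1, 6), (1, 7), (7, 8), (7, 9), (1, 10)]
EXTRA_EDGES = [(5, 9), (8, 9), (9, 10)]
ADJ = {v: set() for v in range(1, 11)}
for a, b in TREE_EDGES + EXTRA_EDGES:
    ADJ[a].add(b)
    ADJ[b].add(a)

print(asd_search(ADJ, 9, 6))  # upward 9 -> 5 -> 2 -> 1, then downward 1 -> 6
```

In this example the upward phase takes three steps although the root is only two steps away from vertex 9 (via 7 or 10), which illustrates why only the downward phase can be guaranteed to be optimal.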
We construct T(G) in the following way.
1. Let the root r be a vertex of smallest eccentricity (maximal distance to another vertex).
2. Construct the tree such that the distances to the root are the same in T(G) and G. In other words, construct it such that all edges in T go between different neighborhoods $\Gamma_l(r) = \{i \in V : d(i, r) = l\}$ and $\Gamma_{l+1}(r)$ for some level 0 ≤ l ≤ h, where h is the height of the tree (by the choice of r, h is also the radius of the graph). Such a tree can be constructed by finding the set of followed edges in a breadth-first search [3] starting from r.
When it is not clear which vertex, or edge, to choose in the above construction, we choose one at random from all the possible candidates. When T is constructed, let the indices be a preordering of the vertices in T (i.e. the order of first occurrence of the vertex in a depth-first search of the tree) [3].
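The construction above can be sketched as follows, assuming a connected graph given as a dictionary of neighbor sets with sortable vertex labels. The names `bfs_levels` and `asd_index` are hypothetical. Picking a random parent among the neighbors one level closer to the root is one way to realize the random tie-breaking; the text only requires that the tree preserve distances to the root.

```python
import random
from collections import deque


def bfs_levels(adj, root):
    """Distance from root to every vertex of a connected graph."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def asd_index(adj, rng=None):
    """Return (root, index) where index maps each vertex to its preorder number."""
    rng = rng or random.Random()
    # Step 1: root of smallest eccentricity, ties broken at random.
    ecc = {v: max(bfs_levels(adj, v).values()) for v in adj}
    best = min(ecc.values())
    root = rng.choice(sorted(v for v in adj if ecc[v] == best))
    # Step 2: distance-preserving tree; each vertex hangs below a random
    # neighbor that is one level closer to the root.
    dist = bfs_levels(adj, root)
    children = {v: [] for v in adj}
    for v in sorted(adj):
        if v != root:
            parents = sorted(u for u in adj[v] if dist[u] == dist[v] - 1)
            children[rng.choice(parents)].append(v)
    # Preorder numbering 1..n by an iterative depth-first traversal of the tree.
    index = {}
    stack = [root]
    while stack:
        u = stack.pop()
        index[u] = len(index) + 1
        kids = children[u][:]
        rng.shuffle(kids)
        stack.extend(kids)
    return root, index
```

Computing all eccentricities by repeated breadth-first search costs on the order of n·m operations, which is acceptable for a sketch but may be too slow for very large graphs.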
[Figure 1: six panels (a)–(f) of small example graphs with vertices indexed 1–10; labels in the figure mark the root r = 1, the source s = 9, the destinations t = 6 and t = 8, and the even and odd classes.]
Fig. 1. Illustration of the ASD (panels (a)–(c)) and ASU (panels (d)–(f)) indexing and search schemes. (a) shows a search tree where a local search algorithm can find the shortest path from one vertex to another fast. (b) shows a network indexed by the ASD scheme. The tree used in the construction is identical to the one shown in (a). Panel (c) shows an ASD search from s to t (with τ = 4). On the way from s to r the packet chooses the neighbor (of the current vertex) with lowest index, which here gives a longer route than the optimal {(9, 10), (10, 1)}. (d) shows a possible partition of branches of non-root vertices into classes of as similar size as possible (as done in the ASU indexing scheme). (e) shows a possible indexing based on the partition in (d). Panel (f) displays a search from s to t with τ = 6. The shortest path from s to r is accurately found, but a detour to 6 makes the search from r to t suboptimal.
Now we prove that this indexing and the search algorithm always give the shortest paths from the root to a vertex t. Let $E_T$ be the edges of T and let $T_i$ be the maximal subtree with i as root. By construction, all vertices in $T_i$ have indices in $[i, i + |T_i|]$ (where | · | denotes the cardinality of a subgraph). Let $i'$ be the largest index in i's neighborhood in T smaller than t. Assume there is an edge $(i, j) \in E \setminus E_T$ that the search will follow, i.e. that $i' < j < t$. This means that $j \in T_{i'}$. By construction, $i'$ is the only vertex in $T_{i'}$ at a distance $d(r, i')$ (the distance from the rest of $T_{i'}$ to the root is at least $d(r, i') + 1$). Since $d(r, i') = d(r, i) + 1$, we have $d(r, j) \geq d(r, i) + 2$, which contradicts the existence of an edge $(i, j) \in E$. Thus, searches from r to t will always follow the edges of T, which also means the r–t-searches will be as short as possible.
Searching upwards, from i to r, in a graph indexed as above is harder. We know that one shortest path goes via a vertex j with smaller index than i, but there might exist suboptimal paths via indices $i'$ in the intervals $r < i' < j$ and $j < i' < i$, and there might also be paths via vertices of index larger than j that are optimal. For example, assume the search tree in Fig. 1(a) comes from a graph with the additional edges (5, 9), (8, 9) and (9, 10) (see Fig. 1(b)). Then, the shortest path from 9 to r via a vertex of lower index is {(9, 7), (7, 1)}, but there is an equally long path via a vertex of larger index, {(9, 10), (10, 1)}, and longer paths via vertices both smaller and larger than 7 but smaller than 9. Thus, there is no general way of finding the shortest way from s to r. Instead, we always choose the vertex with the smallest index in the neighborhood. By this strategy a packet will come closer to r, in index space, for every step. Furthermore, in tree-like parts of the graph, the search will follow the shortest paths. An illustration of the ASD search is shown in Fig. 1(c).
2.2 ASU Indexing and Search
Consider a tree T(G) constructed as in the previous section and an indexing such that d(i, r) < d(j, r) implies i < j (i.e., all indices of a level further from the root are larger than in levels closer to r). With such an indexing, since the neighbor of a vertex with the smallest index necessarily is one step closer to the root, a packet can always find one shortest way to the root. But once the packet is at the root the indices are not of so much help. The search from r to t has to be, essentially, a depth-first search. There are, however, a few tricks to speed up the search. First, there is no need to search deeper than t; if j > t, then $t \notin T_j$. Second, one can choose the indices $i, \ldots, i + |\Gamma_l(r)|$ of one level in the tree in a way to narrow down the search. For example, one can divide the vertices into ν classes (defined by e.g. the remainder when the index is divided by ν) and index vertices of connected regions of the graph with indices of the same class. The search can then be restricted to the same class as the destination. We will pursue this idea with ν = 2.
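The level-ordering property behind the accurate upward search can be checked with a short sketch. It assumes a connected graph given as a dictionary of neighbor sets and numbers the vertices in breadth-first order, so that a smaller distance to the root implies a smaller index. The parity classes (ν = 2) are deliberately left out, and the names `level_order_index` and `search_up` are hypothetical.

```python
from collections import deque


def level_order_index(adj, root):
    """Index vertices in breadth-first order: closer to the root means smaller index."""
    dist = {root: 0}
    order = [root]
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v not in dist:
                dist[v] = dist[u] + 1
                order.append(v)
                queue.append(v)
    index = {v: i + 1 for i, v in enumerate(order)}
    return index, dist


def search_up(adj, index, s, root):
    """Follow the neighbor with the smallest index until the root is reached."""
    path = [s]
    while path[-1] != root:
        path.append(min(adj[path[-1]], key=index.get))
    return path
```

Every non-root vertex has at least one neighbor on the previous level, and all vertices on that level carry smaller indices than vertices on the same or the next level, so each step of `search_up` reduces the distance to the root by exactly one.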
To derive the ASU indexing scheme, the first goal is to divide the vertices into classes of connected subgraphs. Furthermore, we require all classes to be connected to the root vertex. Another aim is to make the classes of as similar sizes as possible. Our first step is to make $k_r$ (the degree, or number of neighbors, of r) parallel depth-first searches.¹ Second, we group the $k_r$ search trees into ν groups with maximally similar sizes. In our case, we seek a partition of the search trees into two classes such that the sums of vertices in the respective classes are as close as possible.² Then we go through the levels, starting from the root, and assign numbers such that vertices of one partition have even indices, while those of the other have odd numbers (this assignment might not always work). To avoid systematic errors we sample the elements of levels randomly. This construction scheme is illustrated in Figs. 1(d) and (e). An illustration of the ASU search scheme is shown in Fig. 1(f).
¹ Every iteration, one step is taken in all branches. The different search branches mark the visited vertices with their indices. A search proceeds only to vertices not marked by any search. When there are no unmarked vertices, the search branch is finished.
² We do this by randomly exchanging search trees between the two classes and accept changes that improve the partition. The search is continued until their vertex sums differ by at most one or until the partition has not improved for 1000 trials.
2.3 Degree-Based and Random Search
As a reference, we also run simulations for two depth-first search methods that do not utilize indices [1]. One of them, Rnd, is a regular depth-first search where the neighbors are traversed in random order. In the other, Deg, the neighbors are chosen in order from high to low degree. Just as in the ASU and ASD methods, a packet is assumed to have knowledge about its neighborhood. If the destination is in the neighborhood of a vertex, then the search will be over at the next time step.
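The two reference methods can be sketched as a single depth-first walk in which the packet physically moves along edges. The step counting (one time step per move, including backtracking moves) and the random tie-breaking among equal-degree neighbors are assumptions made for this illustration, and the name `dfs_search` is hypothetical.

```python
import random


def dfs_search(adj, s, t, order="rnd", rng=None):
    """Number of steps a packet needs to reach t from s by depth-first search.

    order="rnd" visits unvisited neighbors in random order (Rnd);
    order="deg" prefers the unvisited neighbor of highest degree (Deg).
    """
    rng = rng or random.Random()
    visited = {s}
    stack = [s]  # path from s to the packet's current position
    steps = 0
    while stack:
        cur = stack[-1]
        if cur == t:
            return steps
        if t in adj[cur]:
            return steps + 1  # the destination is visible: one more step
        fresh = sorted(v for v in adj[cur] if v not in visited)
        if not fresh:
            stack.pop()  # dead end: walk back along the path
            if stack:
                steps += 1
            continue
        if order == "deg":
            rng.shuffle(fresh)  # random tie-breaking among equal degrees
            nxt = max(fresh, key=lambda v: len(adj[v]))
        else:
            nxt = rng.choice(fresh)
        visited.add(nxt)
        stack.append(nxt)
        steps += 1
    return None  # t is not reachable from s
```

On a small tree with one well-connected branch, the degree-based variant heads for the high-degree vertex first, while the random variant may waste steps in dead ends before it sees the destination.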
3 Network Models
The efficiency of our indexing and search schemes is more or less directly affected by the network structure. To investigate this relationship we test the search schemes on four different types of network models: modified Erdős–Rényi (ER) graphs [5], square lattices, Barabási–Albert (BA) [2] and Holme–Kim (HK) [8] networks. To facilitate comparison, we use the same average degree, four (dictated by the square grid), in all networks.
3.1 Modified ER Graphs
The ER model is the simplest model for randomly generating simple graphs with n vertices and m edges. The edges are added one by one to randomly chosen vertex pairs (the only restriction being that loops or multiple edges are not allowed). A problem for our purpose is that ER graphs are not necessarily connected (something required to measure $\bar{\tau}$). To remedy this we propose a scheme to make networks connected.
1. Detect the connected components.
2. Go through the connected components sequentially. Denote the current component $C_I$.
   a) Pick a component $C_J$ randomly.
   b) Pick a random edge (i, j) whose removal would not fragment $C_J$. If no such edge exists, go to step 2.
   c) Pick a random vertex $i'$ of $C_I$.
   d) Replace (i, j) by $(i', j)$. If the edge $(i', j)$ would exist already (an unlikely event), go to step 2a. If there is no vertex $i' \in C_I$ such that $(i', j)$ does not already exist, then go to 3.
3. If the network is still disconnected, go to step 1.
In practice, even for our largest system sizes, the preceding algorithm converges in a few iterations. The number of edges needed to be added never exceeds a few percent of m, and this addition is made with the greatest possible randomness; hence, we believe the essential network structure of the ER model is conserved.
3.2 Square Lattice
We use square lattices with periodic boundary conditions. We have n vertices spread out regularly on an L × L grid such that the vertex with coordinates (x, y), 1 ≤ x, y ≤ L, is connected to (x, y + 1), (x + 1, y), (x, y − 1), (x − 1, y) (if x = 1, we formally let x − 1 = L, if x = L we let x + 1 represent 1; and correspondingly for y).
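The wrap-around rule of the square lattice translates directly into modular arithmetic. The following sketch uses 1-based coordinates as in the text and assumes L ≥ 3, so that the four neighbors of a vertex are distinct; the function name `periodic_square_lattice` is hypothetical.

```python
def periodic_square_lattice(L):
    """Adjacency sets of an L x L grid with periodic boundary conditions."""
    adj = {}
    for x in range(1, L + 1):
        for y in range(1, L + 1):
            # Coordinates run from 1 to L; L + 1 wraps to 1 and 0 wraps to L.
            up, down = (y % L) + 1, ((y - 2) % L) + 1
            right, left = (x % L) + 1, ((x - 2) % L) + 1
            adj[(x, y)] = {(x, up), (x, down), (right, y), (left, y)}
    return adj
```

Every vertex then has degree four, which is the average degree that the other three models are matched to.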
3.3 BA Model
The networks with a power-law degree distribution are constructed as follows (with our parameter settings). Start with one vertex connected to two degree-one vertices. Iteratively add vertices connected to two other vertices. Let the probability of connecting the new vertex to a vertex i already present in the network be proportional to $k_i$ (preferential attachment).
3.4 HK Model
The Holme–Kim model is a modification of the BA model to give the network a higher number of triangles. When edges are added from the new vertex to already present vertices, the first edge is added to an existing node i by preferential attachment. The second edge is added to one of i's neighbors, forming a triangle.
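Both growth rules can be sketched with one generator that differs only in how the second edge is placed. The sketch follows the parameter settings stated above (seed of three vertices, two edges per new vertex). Two details are assumptions: the BA branch redraws until the second target is distinct from the first, and the HK branch picks the neighbor uniformly at random. The name `grow_network` is hypothetical.

```python
import random


def grow_network(n, triangle=False, rng=None):
    """Grow a BA network (triangle=False) or an HK-style network (triangle=True)."""
    rng = rng or random.Random()
    # Seed: vertex 0 connected to the two degree-one vertices 1 and 2.
    adj = {0: {1, 2}, 1: {0}, 2: {0}}
    # Every edge contributes both endpoints, so a uniform draw from this
    # list selects a vertex with probability proportional to its degree.
    ends = [0, 1, 0, 2]
    for new in range(3, n):
        first = rng.choice(ends)
        if triangle:
            # HK step: the second edge goes to a neighbor of the first target.
            second = rng.choice(sorted(adj[first]))
        else:
            # BA step: second target also by preferential attachment, no multi-edge.
            second = first
            while second == first:
                second = rng.choice(ends)
        adj[new] = {first, second}
        for v in (first, second):
            adj[v].add(new)
            ends.extend((new, v))
    return adj
```

With two edges per added vertex the graph has 2n − 4 edges, so the average degree approaches four for large n, matching the square lattice.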
4 Numerical Results
We study the search schemes on the four different network topologies numerically. We use 100 independent networks and 100 different s–t-pairs for every network. The network sizes range from n = 16 to n = 16,384. In Fig. 2 we display the average search times as a function of system size for our simulations. The most conspicuous feature is that the ASD scheme is always, by far, the most efficient. While ASU and Deg are close to the least efficient method (Rnd), ASD is rather close to the theoretical limit (equal to the average distance $\bar{d}$, the upper border of the shaded areas in Fig. 2). To be more precise, for ASD the ratio $\bar{\tau}/\bar{d}$ is quite constant, with $\bar{\tau}$ about two times larger than the average distance. The other search schemes (ASU, Deg and Rnd) follow faster increasing functional forms. For the square lattice, these three schemes increase approximately proportionally to n (the analytical value for two-dimensional random walks), whereas for ASD, $\bar{\tau}$ scales like distances in square grids, $n^{1/2}$. One way of interpreting this result is to say that while ASD manages to find the root as fast as it finds the destination from the root, ASU fails to find t faster than a random search. The slow downward performance of ASU is not unexpected. The r–t-search in ASU only differs from a random depth-first search in that it does not search further than the level of the destination, and that it restricts the search space to half its original size by dividing the vertices into odd and even indices.
[Figure 2: four log-log panels (modified ER, square lattice, BA, HK) showing transit time $\bar{\tau}$ versus network size N for the ASU, ASD, DEG and RND schemes.]
Fig. 2. The average search time $\bar{\tau}$ as a function of the graph sizes n. In all panels, we display data for the different indexing and search schemes. The shaded areas are unreachable (corresponding to $\bar{\tau}$ values smaller than the theoretical minimum, the average distance $\bar{d}$). The different panels correspond to the modified ER model, square grid, BA model and HK model networks, respectively. Error bars would have been smaller than the symbol sizes.
The fast upward search of ASD is more surprising. In Fig. 3 we show a network where ASD performs badly. The average time to search upwards is $(n^2 + 20n - 13)/8n \to n/8$ as $n \to \infty$. The downward search takes $3(n - 1)/2n \sim 3/2$, giving a total expected value of $\bar{\tau} \sim n/8$. This can be compared to the average distance $\bar{d} = 3 - 21/4n + 2/n^2 \sim 3$. For this example, $\bar{\tau}$ and $\bar{d}$ diverge in a way not seen in the network models.
[Figure 3: wheel graph with the root 1 at the center and the remaining vertices indexed 2, . . . , n.]
Fig. 3. A worst-case scenario for navigating from s to r with the ASD indexing and search scheme. A packet from n − 2 to 1 will travel along the perimeter to 3 and then move towards the center.
Why is the search so much faster in the model networks? One point is that the worst-case indexing seen in Fig. 3 is very unlikely. Since the spokes would be sampled randomly, the chance that a vertex at the perimeter does not find r in two steps is 1/2, the probability that it finds r in 3 steps is 1/4, and so on. Continuing this calculation, a vertex at the perimeter reaches r in $2\sum_k k\,2^{-k} + 2 \sim 6$ time steps, giving $\bar{\tau} \sim 5$, which is not too far from the observed $\bar{\tau}/\bar{d} \sim 2$. We note, however, that for the model networks many other factors that are not present in the wheel graph of Fig. 3 affect $\bar{\tau}$. For example, the high density of triangles in the HK model networks will introduce many edges between vertices of the same level in T(G), which will affect the search efficiency.
$\bar{\tau}$ is approximately linear in n for the ASU, Deg and Rnd schemes on all network models. The slopes of these curves are, however, a little different. First, the Deg method is more efficient (compared to ASU and Rnd) for BA networks than for the modified ER model. This observation (also made in Ref. [1]) can be explained by the skewed degree distribution in the BA-network: the packet reaches high-degree vertices quickly. The packet can see a large part of the network from these hubs, and is therefore more likely to see t.
More interesting, perhaps, is the observation that ASU is more efficient for the networks with a higher density of short cycles (the square lattice and HK models). A rough explanation is that the partition procedure of ASU cuts off many edges between vertices at the same distance from r. Since there are many such edges in these network models, the network will effectively be sparser (without changing G’s diameter), which results in a better performance.
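Returning to the wheel-graph estimate earlier in this section: if the expected time for a perimeter vertex to reach r is read as $2\sum_k k\,2^{-k} + 2$, with the sum running over k = 1, 2, . . . (the summation limits are an assumption, as the text does not state them), the quoted value of about 6 follows from the standard series for $\sum_k k x^k$:

```latex
\sum_{k=1}^{\infty} k x^{k} = \frac{x}{(1-x)^{2}} \quad (|x| < 1)
\;\Longrightarrow\;
\sum_{k=1}^{\infty} k\,2^{-k} = \frac{1/2}{(1/2)^{2}} = 2,
\qquad
2\sum_{k=1}^{\infty} k\,2^{-k} + 2 = 2 \cdot 2 + 2 = 6 .
```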
5 Discussion
We have investigated navigation in valued graphs, and more specifically in indexed graphs, i.e. graphs where every vertex is associated with a unique number in the interval [1, n]. These indices can be assigned to facilitate the packet navigation. The packets are assumed to have no a priori knowledge about the network, except the neighborhoods of their current positions, but memory enough to perform a depth-first search. We find that one of our investigated methods, ASD, is very efficient for four topologically very different network models. The searches with the ASD scheme are roughly twice as long as the shortest paths (scaling in the same way as the average distance).
Navigation on indexed graphs has applications in distributed information systems. If, specifically, the amount of information that can be stored at the vertices were limited, search strategies such as ours would be useful. One such system is the Autonomous System level Internet, where the information stored at each vertex (with the current protocols) increases at least as fast as the networks themselves. For most real-world applications (other examples being ad hoc networks [4] or peer-to-peer networks [6, 7, 12]) there are additional constraints so that the algorithms of this paper cannot immediately be applied. Such networks are typically changing over time, so ideally it should be possible to extend the indexing “on the fly” as vertices and edges are added to and deleted from the network.
Apart from this, a future direction for research on indexed graphs is to improve the performance of the algorithms presented in this work. There might be a fast search-tree-based algorithm that neither finds the shortest path to the root, nor finds the shortest way to the destination. For some network topologies there might be faster algorithms that are not based on constructing a spanning tree. Consider, for example, modular networks [11] (i.e. networks with tightly connected subgraphs that are only sparsely interconnected) in which the search can be divided into two stages: first find the cluster of the destination, then the destination. These two stages should be reflected in a fast navigation algorithm.
Acknowledgments
PH acknowledges financial support from the Wenner-Gren Foundations, The Swedish Foundation for Strategic Research and the National Science Foundation (grant CCR–0331580).
References
1. L. A. Adamic, R. M. Lukose, A. R. Puniyani, and B. A. Huberman. Search in power-law networks. Phys. Rev. E, 64:046135, 2001.
2. A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.
3. T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. 2nd edition. The MIT Press, Cambridge MA, 2001.
4. C. de Morais Cordeiro and D. P. Agrawal. Ad Hoc & Sensor Networks: Theory and Applications. World Scientific, Hackensack, NJ, 2006.
5. P. Erdős and A. Rényi. On random graphs I. Publ. Math. Debrecen, 6:290–297, 1959.
6. N. Ganguly, L. Brusch, and A. Deutsch. Design and analysis of a bio-inspired search algorithm for peer to peer networks. In O. Babaoglu, M. Jelasity, A. Montresor, C. Fetzer, and S. Leonardi, editors, Self-star Properties in Complex Information Systems, pages 358–372, Springer-Verlag, New York, 2007.
7. G. Ghoshal and M. E. J. Newman. Growing distributed networks with arbitrary degree distributions. European Physical Journal B, 59:75, 2007.
8. P. Holme and B. J. Kim. Growing scale-free networks with tunable clustering. Phys. Rev. E, 65:026107, 2002.
9. B. J. Kim, C. N. Yoon, S. K. Han, and H. Jeong. Path finding strategies in scale-free networks. Phys. Rev. E, 65:027103, 2002.
10. J. M. Kleinberg. Navigation in a small world. Nature, 406:845, 2000.
11. M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E, 69:026113, 2004.
12. N. Sarshar, P. O. Boykin, and V. P. Roychowdhury. Percolation search in power law networks: Making unstructured peer-to-peer networks scalable. In Proceedings of Fourth International Conference on Peer-to-Peer Computing, pages 2–9. IEEE, 2004.
13. P. Sen. A novel approach for studying realistic navigations on networks. J. Stat. Mech., page P04007, 2007.
14. H. Zhu and Z.-X. Huang. Navigation in a small world with local information. Phys. Rev. E, 70:036117, 2004.
Evolution of Apache Open Source Software
Haoran Wen, Raissa M. D’Souza, Zachary M. Saul, and Vladimir Filkov
University of California, Davis CA 95616, USA
1 Software: A General Paradigm for Network Systems?
Our modern infrastructure relies increasingly on computation and computers. Accompanying this is a rise in the prevalence and complexity of computer programs. Current software systems (composed of an interacting collection of programs, functions, classes, etc.) implement a tremendous range of functionality, from simple mathematical operations to intricate control systems. Software systems are inherently extendable and tend to gain new functionality over time. Modern computers and programming languages are Turing complete and, thus, capable of implementing any computable function no matter how complex. The interdependencies between the elements of a software system form a network, and, therefore, we believe software systems can provide useful prototypic examples of how to build complex networked systems which require minimal maintenance, are robust to bugs and yet are readily extendable. Thus we ask: What makes for good design in software systems?
We are particularly interested in open source software (OSS), that is, software with source code that is freely available for download and modification. A typical OSS project is a collaborative effort by volunteers, with no central authority assigning development tasks. Instead individuals, or self-organized teams of developers, fix bugs and maintain and extend the code. In OSS, modularity is essential [1, 2], and remarkably, the software resulting from an OSS process can rival or even surpass the quality of commercial software [3, 4].
Software systems are always evolving, responding to user demands for “bug fixes” and new features. Invariably, systems grow in size and complexity, eventually becoming difficult to parse, maintain and extend further. In response to this, developers refactor their systems [5], streamlining and restructuring the entire code base. Thus there are several strong analogies between OSS systems and biological systems. Both classes of systems are inherently modular, readily evolvable, must be robust to anomalies and experience periods of punctuated equilibrium [6]. Yet high-confidence data on the structure of OSS, unlike data on biological networks, is easily obtained at minimal cost.
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_12, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
We analyze a series of 50 monthly snapshots of the function call graph of the Apache 2.0 HTTP Server (called Apache herein). Apache is the most popular web server on the Internet, and has been since 1996 [7]. It is a mature, well-established OSS project managed by a group of volunteers worldwide; to date, hundreds of users have contributed to the code base. Apache is written in the C programming language, a procedural language. The basic elements are functions that explicitly invoke one another through function calls which express the command flow of the program/system. In object-oriented systems, in contrast, the software networks are made of edges representing abstract relationships between objects, such as inherits, invokes, etc. Motivated by advances in network science, we first analyze a collection of measures on global properties of the Apache call graphs. Certain measures behave consistently, and we quantify their baselines. Moreover, we find that punctuated changes in these global measures can signal the points at which a more detailed, fine-grained examination of code structure is required. Jumps in global properties can indicate major refactorings, but can also result from restructuring just a few functions (and radically reduce interdependencies). We then turn our focus to a bottom-up approach, studying how observable attributes of the Apache call graph interact using exponential random graph models. Ultimately, by coupling top-down and bottom-up approaches, we want to extract how code is restructured over time to achieve better design. As a mature project, Apache is more in “maintenance” than in growth mode, and the details of changes can be subtle. Yet, these changes may be especially important given that a major expense associated with software is maintenance [8]. Interest in OSS spans multiple communities, from software engineering, to network science, to economics and organizational behavior. 
Raymond’s seminal work [1] is an excellent review of the latter, contrasting the “cathedral” organization of proprietary software to the open “bazaar” nature of OSS. Perhaps the first work to consider software systems as complex networks was that by Valverde, Ferrer Cancho and Solé [9], in which they show that software collaboration graphs have “scale-free” properties which may result from optimal design. Shortly thereafter, Myers conducted a detailed investigation of software collaboration graphs [10], quantifying many features we discuss herein. Both [9] and [10] focus primarily on object-oriented software (unlike Apache, which is procedural software), looking at one time snapshot of the collaboration network between classes and objects for several different software systems. Similar to MacCormack, Rusnak and Baldwin [11], we are interested in tracking the evolution of a software system, focusing on the function call graph. In [11], their interest is in understanding the impact of managerial organization on resulting software structure (primarily the modularity).
This manuscript is organized as follows. In Section 2 our data set is defined. Section 3 presents the top-down approach, studying evolution of global measures. Section 4 presents the bottom-up approach to understanding the relative importance of measures of structure via statistical modeling. Section 5 contains the discussion and conclusions.
Fig. 1. The “5-core” of Apache on November 2005. Each node is a function, with size indicating its relative length in lines of code. Each directed edge is a function call.
2 The Apache Call Graphs
We analyze the evolution of Apache for a 50-month period using call graph snapshots taken at one month intervals from October 2001 to November 2005. Each monthly call graph was created via a two-step process (see [12] for more details). First, the source was checked out from the Apache Concurrent Versions System (CVS) repository (for that month) along with matching versions of both the compiler (and associated tools) and the libraries used by Apache (e.g., the Apache Portable Runtime). Then, the call graph was extracted using CodeSurfer [13], a proprietary source code analysis tool. The resulting call graphs are directed graphs where the nodes are functions, and each edge represents an explicit call from its source node to its target node. The CodeSurfer tools extract all explicit function calls, including those to functions in libraries.
The resulting call graphs are extremely interconnected. In November 2005, there were 2909 nodes and 8284 edges (average node degree of 5.7). The largest connected component contains all but 72 nodes, while the second largest component has only 12 nodes. Figure 1 is a subgraph showing the k-core [14, 15] at k = 5 for Apache functions (excluding library calls).
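The k-core mentioned above can be obtained by repeatedly deleting nodes with fewer than k neighbors until none remain. The following is a minimal sketch of that peeling procedure, not the authors' tooling: it assumes edge direction is ignored when counting neighbors (the text does not say how direction was handled), and the names `k_core`, `average_degree` and `toy_calls` are hypothetical. It also checks that the quoted average degree of 5.7 corresponds to 2E/N, i.e., counting both incoming and outgoing calls.

```python
from collections import defaultdict


def k_core(edges, k):
    """Return the node set of the k-core, treating every call as an undirected edge.

    Assumption: direction is ignored and self-calls are dropped before peeling.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    alive = set(neighbors)
    changed = True
    while changed:  # keep peeling until every remaining node has >= k live neighbors
        changed = False
        for node in list(alive):
            if len(neighbors[node] & alive) < k:
                alive.remove(node)
                changed = True
    return alive


def average_degree(num_nodes, num_edges):
    """Average total degree (in plus out): each edge contributes to two nodes."""
    return 2 * num_edges / num_nodes


# Hypothetical toy call graph: a triangle a-b-c with one extra callee d.
toy_calls = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(sorted(k_core(toy_calls, 2)))
print(round(average_degree(2909, 8284), 1))
```

On the toy graph the pendant function d is peeled away and the triangle survives as the 2-core; the second line reproduces the 5.7 reported for November 2005.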
3 Evolution of Apache: Global Measures
3.1 Nodes and Edges
Fig. 2. (Left) Evolution of the number of functions N (left-hand axis) and the number of function calls E (right-hand axis) during the 50 month period. (Right) E as a function of N since the first stable release of Apache 2.0 in May 2002 through Nov 2005 (months 8–50). Dots are individual data points. The line is the best fit, E ∼ N^{1.18}.
The most basic constituents of the Apache call graph network are the functions (i.e., nodes) and function calls (i.e., edges). We denote the number of functions and calls at a given time by, respectively, N(t) and E(t). Figure 2
(left) shows their evolution over the entire 50-month period. Our first evidence for a restructuring of the code is observed during the fourth and fifth months, when there is a dramatic decrease in N, of approximately 250 functions, accompanied by a much smaller decrease in E, of approximately 75 function calls. Thus the average degree (E/N) increases dramatically during this period. Investigating the Apache release history [16], we find that this period (from 2002-1-1 to 2002-2-1) marks the transition from the second to the third beta release of Apache 2.0. According to the release logs, approximately 130 changes were made to the code, with ten of these changes being the addition of new features. The bulk of the remaining changes were “bug” fixes along with a few performance improvements. The functionality of the system was enhanced during a period where the number of functions decreased. We assume redundancy in functions was eliminated, while “functionality” (perhaps more closely related to number of edges) was preserved and enhanced.
The first stable (non-beta) releases of Apache 2.0 were issued shortly thereafter, in April and May 2002. From there on, the relationship between E and N is extremely consistent as shown in Fig. 2 (right). We find that E ∼ N^{1.18}. Remarkably, Valverde and Solé find almost identical scaling, of E ∼ N^{1.17}, for a collection of 80 object-oriented systems [17], where N is the number of classes and E is the total number of edges, with each edge representing a relationship between classes. This suggests some universal trend in software systems.
3.2 Degree and Degree Distribution
The degree of a function conveys much information, and it is important to distinguish in-degree (being called) from out-degree (calling another function). In-degree is a measure of code reuse, and functions with high in-degree are information producers. Nodes of high out-degree are information consumers/brokers, consolidating information from many external sources. In the Apache call graphs the largest observed in-degree is approximately 200, while the largest out-degree is approximately 30. Due to these differences, we examine in- and out-degree independently.
Fig. 3. (Left) In-degree distribution and (right) out-degree distribution for the first month and final month. The dashed line is the best fit functional form for the final month: (left) p(k) = 0.55 · k^{−1.84}, and (right) p(k) = (2πσ^2 k^2)^{−1/2} exp[−(ln k − μ)^2 / (2σ^2)], with μ = 0.75 and σ^2 = 0.93.
One of the most investigated aspects of “complex networks” is their degree distributions, found to exhibit extreme heterogeneity, with node degrees spanning decades of range. Here too, we find such broad-scale features. Figures 3 (left) and (right) show respectively the in- and out-degree for the first and the last of the 50 months investigated, where p(k) is the fraction of nodes observed with degree k. Following [18], we assess the best fit to the data between power-law, log-normal and stretched-exponential distributions, using a weighted least squares fit. The weight given to each data point reflects inversely how much uncertainty there is in that point (more uncertainty in the tail where the values are much smaller). The quality of a fit between the set of data points {h_i}, measured value {x_i} and a function f is quantified as
$$ Q = \sum_{i=1}^{k} \frac{1}{h_i} \left[ h_i - f(x_i) \right]^2 $$
with a smaller Q being better. We find that, for in-degree, a power law provides the best fit for each of the 50 months with Q ≈ 0.04. Fitting a log-normal distribution to in-degree gives Q ≈ 0.08, and stretched-exponential gives Q ≈ 0.15. For out-degree, log-normal provides the best fit for all 50 months with Q ≈ 0.02. A stretched-exponential distribution gives the next best fit with Q ≈ 0.06, and a power-law fit is the worst, with Q ≈ 0.16.
There are small, almost indiscernible changes to the distributions over the 50 months. For in-degree we find the exponent of the best-fit power law slowly decreases from γ ≈ 1.9 to γ ≈ 1.84, reflecting that the maximum values of in-degree slowly increase with time. For out-degree, the parameter μ of the best-fit log-normal distribution (the mean of ln k) slowly increases from μ ≈ 0.64 to μ ≈ 0.75. However, the shapes of both the in- and out-degree distributions (power-law and log-normal, respectively) are global properties which are established before our data sampling begins and remain invariant throughout.
3.3 Dependencies, Visibility and Propagation Cost
A simple call graph is shown in Fig. 4 (left). The corresponding dependency (or adjacency) matrix, Fig. 4 (right), captures the complete call graph information. Matrix element M_ij = 1 if function i calls function j, and is zero otherwise. As edges are directed, M is not symmetric about the diagonal. The dependency matrix captures direct dependencies. However, indirect dependencies are also important. For instance, as shown in Fig. 4 (left), a change in function C could potentially destroy or change the functionality implemented by A. Changing function F also has indirect impact on A. Yet, it is less direct, as the shortest path between A and F is of length 3, whereas the shortest path between A and C has length 2.
We can quantify these indirect dependencies as a function of path length using the “reachability matrix” [19] and the related “visibility matrix” [20]. The reachability matrix at path length d is denoted R(d). Matrix element R(d)_ij = 1 if there is a path of exactly length d connecting function i to j. Note the convenient relationship, R(d) = M^d (with nonzero entries set to 1), where M is the direct dependency matrix. The visibility matrix at distance d, denoted V(d), is the binary sum of the reachability matrix, V(d) = R(1) ∨ R(2) ∨ · · · ∨ R(d) = M^1 ∨ M^2 ∨ · · · ∨ M^d, where the operator “∨” (logical or) is equivalent to the binary sum. V(4) for our simple call graph example is shown in Fig. 5. Matrix element V(d)_ij = 1 if there is a path of length less than or equal to d connecting function i to j. Note we assume V(d)_ii = 1, i.e., functions are visible to themselves.
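The construction of V(d) from powers of M can be sketched directly with Boolean matrix products. The code below is an illustrative sketch using plain Python lists, with hypothetical helper names `bool_matmul` and `visibility`; it is applied to the six-function example, whose dependency matrix is the one given in Fig. 4.

```python
def bool_matmul(a, b):
    """Boolean matrix product: entry (i, j) is 1 if some k has a[i][k] and b[k][j]."""
    n = len(a)
    return [
        [int(any(a[i][k] and b[k][j] for k in range(n))) for j in range(n)]
        for i in range(n)
    ]


def visibility(m, d):
    """V(d) = I or R(1) or ... or R(d), with R(1) = M and R(s+1) = R(s) M."""
    n = len(m)
    v = [[int(i == j) for j in range(n)] for i in range(n)]  # V_ii = 1 by convention
    r = [row[:] for row in m]  # R(1) = M
    for _ in range(d):
        v = [[v[i][j] | r[i][j] for j in range(n)] for i in range(n)]
        r = bool_matmul(r, m)  # advance to paths one hop longer
    return v


# Dependency matrix of the six-function example (rows and columns A..F).
M = [
    [0, 1, 0, 1, 0, 0],  # A calls B and D
    [0, 0, 1, 0, 0, 0],  # B calls C
    [0, 0, 0, 1, 0, 0],  # C calls D
    [0, 0, 0, 0, 1, 0],  # D calls E
    [0, 0, 0, 0, 0, 1],  # E calls F
    [0, 0, 0, 0, 0, 0],  # F calls nothing
]
V4 = visibility(M, 4)
for label, row in zip("ABCDEF", V4):
    print(label, *row)
```

For this example every function can see all functions downstream of it within four hops, so V(4) is upper triangular with ones on and above the diagonal. Dense Boolean products are only practical for small graphs; for a call graph with thousands of nodes a sparse or search-based approach would likely be preferable.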
Dependency Matrix
     A  B  C  D  E  F
A    0  1  0  1  0  0
B    0  0  1  0  0  0
C    0  0  0  1  0  0
D    0  0  0  0  1  0
E    0  0  0  0  0  1
F    0  0  0  0  0  0
Fig. 4. (Left) A simple call graph. (Right) The equivalent dependency matrix.
Visibility Matrix with n=4
     A  B  C  D  E  F
A    1  1  1  1  1  1
B    0  1  1  1  1  1
C    0  0  1  1  1  1
D    0  0  0  1  1  1
E    0  0  0  0  1  1
F    0  0  0  0  0  1
Fig. 5. V(4), the visibility matrix up to path length d = 4 for the simple call graph in Fig. 4 (Left).
3.3.1 Propagation Cost
The propagation cost (PC) was introduced in [11] as a scalar value to quantify the extent of indirect dependencies in a network. It is defined as the number of 1’s in V(4) divided by N^2 (the total number of 1’s possible). In other words, PC is the number of pairs of functions connected by a path of length less than or equal to 4, divided by the number of all possible pairs. We find that changes in PC (a global variable) can be useful indicators of important small-scale changes in the code base. Note that we also analyze PC for V(5), but get almost identical results.
Figure 6 shows the evolution of PC, along with that of N, for the 50 months of Apache data. The baseline behavior indicates an inverse relationship (as N increases PC decreases and vice versa). There is only one region that violates this trend, encompassing months 24 to 33. Removing these months from consideration, we see an extremely consistent relation between PC and N, as shown in Fig. 6 (right), that PC ∼ N^{−0.70}.
The first anomalous event which does not conform to this scaling relationship is month 24 (September 2003), when N decreases slightly yet PC jumps disproportionately. The second anomalous event is from months 33 to 34 (June 2004 to July 2004), when PC drops dramatically while N remains essentially constant. No other global property discussed herein shows marked changes in this time frame, not even during the second anomaly which is most dramatic. N and E are both essentially invariant (see Fig. 2). The degree distribution is invariant, and the average clustering coefficient is invariant.
We attempt to isolate what changes in the details of Apache are responsible for these two anomalous events. Motivated by findings in [10], which suggest that functions with simultaneously high in- and high out-degree are particularly problematic, we isolate functions whose in- or out-degree changed during the time frame of interest.
Functions with simultaneously high in-degree and out-degree have a tremendous number of upstream and downstream dependencies. They are simultaneously information consumers and information producers. Figure 7 (left) is a scatterplot of in-degree versus out-degree in June 2004 (open circles) and July 2004 (filled circles), including only functions with changes in these quantities.
Fig. 6. (Left) Propagation cost (left-hand axis) and N (right-hand axis) as functions of time. (Right) PC as a function of N since the first stable release of Apache 2.0, with anomalous months (23 through 34) removed. We find that PC ∼ N^{−0.70}.
Fig. 7. (Left) Scatter plot of in-degree and out-degree, using log-log scale, with only functions whose degree changed in this time period shown (legend: 2004-6-1 and 2004-7-1). (Right) Propagation cost over time. Top line is for the entire system. Bottom line is resulting PC if the two functions indicated in (left) are removed, denoted “w/o 2” in the legend.
Circled in Fig. 7 (left) are two suspicious functions. They have high in-degree (of 33 and 34) and reasonably high out-degree (of 5 and 4) in June 2004. They maintain the in-degree but drop, as indicated, to an out-degree of one in July 2004. We remove these two functions (and their edges) from the call graph for each of the 50 months and plot the resulting evolution of PC as shown in Fig. 7 (right). The top line is the same as Fig. 6 (left), PC for the entire system. The bottom line is the resulting PC with the two functions removed. We no longer see the anomalous behavior and recover the baseline behavior PC ∼ N^{−0.70} shown in Fig. 6 (right).
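The removal experiment described above can be sketched with a depth-limited breadth-first search, which avoids forming dense matrix powers. This is a hedged illustration rather than the authors' code: the function name `propagation_cost` and its `removed` parameter are hypothetical, the denominator is assumed to be the squared number of functions that remain after removal, and the demonstration uses the six-function example of Fig. 4 instead of Apache data.

```python
from collections import deque


def propagation_cost(nodes, edges, max_len=4, removed=()):
    """Fraction of ordered pairs (i, j) with a directed path of length <= max_len.

    Each node counts as visible to itself. Nodes listed in `removed` are deleted
    together with their edges before counting (assumed normalization: the number
    of remaining nodes, squared).
    """
    removed = set(removed)
    kept = [n for n in nodes if n not in removed]
    successors = {n: set() for n in kept}
    for u, v in edges:
        if u in successors and v in successors:
            successors[u].add(v)
    visible_pairs = 0
    for source in kept:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == max_len:
                continue  # do not expand beyond the path-length limit
            for w in successors[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        visible_pairs += len(dist)
    return visible_pairs / len(kept) ** 2


# Six-function example: A->B, A->D, B->C, C->D, D->E, E->F.
toy_nodes = list("ABCDEF")
toy_edges = [("A", "B"), ("A", "D"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
pc_full = propagation_cost(toy_nodes, toy_edges)
pc_without_d = propagation_cost(toy_nodes, toy_edges, removed=["D"])
print(pc_full, pc_without_d)
```

In the toy graph, deleting the single well-connected function D lowers PC from 21/36 to 9/25, which mirrors in miniature how a small number of functions can dominate a global measure.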
These functions (apr_thread_mutex_lock and apr_thread_mutex_unlock) are members of the Apache Portable Runtime layer that implements functionality related to multithreading. Investigating the detailed commit logs written by developers [21], we find that on August 7, 2003 (between months 23 and 24) attempted “bug” fixes to these two functions were made, with accompanying comments indicating a history of problems with these two functions. On June 4, 2004 (between months 33 and 34) these two “racy/broken” functions were dropped from the code entirely and replaced with lower-level system library calls.
3.4 Path Lengths, Clustering Coefficient and “Small Worlds”
A simple example call graph is given in Fig. 4 (left). There are directed paths connecting various functions. For instance, function A is connected to function F via two paths, one of length 3 and one of length 5, where length is measured by number of hops in the call graph. The path of length 3 is obviously the shortest path connecting A and F. We consider all such pairs of functions which are connected by a directed path and calculate the shortest path between them. The fraction of shortest paths of a specified length (i.e., the normalized distribution) is shown in Fig. 8 (left), for the first month (October 2001) and the final month (November 2005) of our study. Similar distributions result for all 50 months, with the typical shortest path of length between 4 and 5, and the largest shortest path (i.e., the graph diameter) of length 14.
Fig. 8. (Left) Normalized shortest paths in Apache, first month and last month. (Right) Normalized shortest paths averaged over 20 realizations of random networks with the exact in- and out-degree distributions of Apache on November 2005 (axis annotation: skewness = 0.98206). The vertical axis “frequency” means the fraction of shortest paths having that length.
We compare this distribution of shortest paths to those resulting from two different random graph growth processes. First we consider an ensemble of 20 realizations of Erdős–Rényi random graphs [22, 23] with N = 2909 nodes and E = 4142 undirected edges (equivalent to the N = 2909 nodes and E = 8284 directed edges in the November 2005 Apache call graph). Here we find the typical shortest path is of length 7 or 8, much larger than for the Apache call graphs. However, the diameter is comparable, ranging from length 14 to 16.
The degree distributions of the Apache call graphs (see Fig. 3) are much broader and more heterogeneous than the Poisson distribution which characterizes Erdős–Rényi random graphs [22, 23]. Thus we next compare the Apache graphs to random graphs constructed to match exactly the Apache degree distribution by extending the ideas in [24, 25] to directed graphs. We begin with N = 2909 nodes and map each one to a distinct node in Apache. We assign to each of these new nodes the in- and out-degree of their corresponding Apache node. We do not yet specify the connectivity, only the final degree. In other words, we assign unconnected half-edges. We next perform a random matching and pair up each in-degree half-edge with a different out-degree half-edge chosen at random. We construct an ensemble of 20 such random graphs. The resulting normalized shortest path distribution, averaged over the full ensemble, is shown in Fig. 8 (right). Note that the typical path length is much larger than for Apache, peaking at length 10, and the maximum shortest path is around 30. Matching degree distribution alone is not enough to reproduce the shortest path lengths observed for Apache.
“Small world” networks are characterized by small diameters and large clustering. We have established the small diameter above. Throughout the 50-month period the average clustering coefficient, C, fluctuates in the range 0.09 < C < 0.099. Calculating C over an ensemble of corresponding Erdős–Rényi random graphs yields C = 0.0018, and for the ensemble of random graphs with the Apache degree distribution C = 0.023.
The Apache call graphs thus have the “small world” characteristics of short average path length and relatively large clustering coefficient when compared to a comparable random graph. Note that to measure C we temporarily assume the edges are undirected. A more thorough treatment is presented in the next section, where “transitive” triads are distinguished from “cyclic” triads. (Cyclic triads are rarely seen in software, though transitive ones occur frequently.)
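The degree-preserving random graphs used for comparison in this section are built by matching half-edges at random. A minimal sketch of that matching step follows, assuming that self-loops and repeated edges produced by the matching are simply kept (the text does not say how such cases were treated). The function name `directed_configuration_model` and the toy degree sequences, taken from the six-function example of Fig. 4, are illustrative choices.

```python
import random
from collections import Counter


def directed_configuration_model(in_degree, out_degree, seed=None):
    """Randomly pair every out-going half-edge with an in-coming half-edge.

    Returns a list of directed edges (source, target) in which each node keeps
    exactly its prescribed in- and out-degree. Self-loops and duplicate edges
    are not filtered out (an assumption of this sketch).
    """
    if sum(in_degree.values()) != sum(out_degree.values()):
        raise ValueError("in- and out-degree sequences must have equal sums")
    rng = random.Random(seed)
    out_stubs = [node for node, deg in out_degree.items() for _ in range(deg)]
    in_stubs = [node for node, deg in in_degree.items() for _ in range(deg)]
    rng.shuffle(in_stubs)  # a uniformly random matching of the two stub lists
    return list(zip(out_stubs, in_stubs))


# Degree sequences of the six-function example graph.
toy_out = {"A": 2, "B": 1, "C": 1, "D": 1, "E": 1, "F": 0}
toy_in = {"A": 0, "B": 1, "C": 1, "D": 2, "E": 1, "F": 1}
random_edges = directed_configuration_model(toy_in, toy_out, seed=1)
out_counts = Counter(u for u, _ in random_edges)
in_counts = Counter(v for _, v in random_edges)
print(all(out_counts[n] == toy_out[n] for n in toy_out))
print(all(in_counts[n] == toy_in[n] for n in toy_in))
```

Averaging a statistic such as the shortest-path distribution over many seeds would then give an ensemble estimate in the spirit of the 20-graph ensembles described above.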
4 Evolution of Apache: Models of Network Structure
We have made a number of empirical observations about the Apache call graph using complex network measures, effectively obtaining a multifaceted characterization of the graph. One can ask: How do these, and possibly other, measures combine to tell the story of the whole Apache call graph? And in general, to what extent is its structure determined by any given observations?
To answer these questions, here we present the statistical modeling approach of exponential random graph models (ERGMs), developed in recent social network theory [26, 27] for understanding the relationships between a large class of local network observations and the full network structure. This bottom-up approach models the extent to which a set of specific observations (e.g., counts of transitive triads) explains the global structure of a network (e.g., the Apache call graph), and, in the process, determines which of the observations best explain its structure. More specifically, given a set of observations, or explanatory variables, an ERGM models networks as random samples from an exponential probabilistic space given by linear combinations of those explanatory variables. Thus, given a network and fitted ERGM, one can calculate the probability that the network is determined by those variables, via direct calculations. In practice, these models are very appealing, as there exist methods for both model fitting (observation available) and simulations (observations unavailable).
The advantage of the ERGM approach is that it is very general and scalable. The architecture of the graph is represented by the chosen set of explanatory variables, which can describe either local or global features of the network, and the values of the model parameters can be quite instructive, indicating the relative importance of the explanatory variables to the maximum likelihood probability density function (pdf). In addition, ERGMs have been well studied, and theoretical results exist which can offer some understanding of the model’s behavior in practice [27].
4.1 ERGM Theory
Here, we describe formally the ERGM statistical framework for modeling networks, in particular as it pertains to modeling software call graphs. Let X be a random variable representing the adjacency matrix of a software network. The pdf for this random variable, P(X = x), tells us the probability that an observed graph, x, was drawn from X. Unfortunately, the pdf of X is unknown and cannot be directly calculated. To estimate this pdf, let z(x) = (z_1(x), z_2(x), . . . , z_r(x)) be a vector of explanatory variables, where each explanatory variable can be any function of the observed data.
We postulate that there exists θ = (θ_1, θ_2, . . . , θ_r) such that
$$ \log P(X = x) \propto \theta_1 z_1(x) + \theta_2 z_2(x) + \cdots + \theta_r z_r(x) \propto \theta^{T} z(x). \tag{1} $$
If we exponentiate both sides and divide by a normalizing constant, κ(θ), ensuring that the probabilities will sum to one, we get the following model:
$$ P(X = x) = e^{\theta^{T} z(x)} / \kappa(\theta). \tag{2} $$
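To make Eq. (2) concrete, the model can be evaluated exactly on a graph small enough that every possible network can be enumerated. The sketch below does this for directed graphs on three nodes (2^6 = 64 graphs), using two explanatory variables: the edge count and the number of two-edge paths closed by a one-edge shortcut. All names (`stats`, `ergm_distribution`) and the parameter values are hypothetical, chosen only for illustration.

```python
from itertools import product
from math import exp

NODES = range(3)
DYADS = [(i, j) for i in NODES for j in NODES if i != j]  # the 6 possible directed edges


def stats(edge_set):
    """z(x) = (number of edges, number of two-edge paths i->j->k with shortcut i->k)."""
    transitive = sum(
        1
        for (i, j) in edge_set
        for (j2, k) in edge_set
        if j2 == j and k != i and (i, k) in edge_set
    )
    return (len(edge_set), transitive)


def ergm_distribution(theta):
    """Return {graph: P(X = graph)} with P proportional to exp(theta . z(graph))."""
    graphs, weights = [], []
    for bits in product([0, 1], repeat=len(DYADS)):
        g = frozenset(d for d, b in zip(DYADS, bits) if b)
        graphs.append(g)
        weights.append(exp(sum(t * z for t, z in zip(theta, stats(g)))))
    kappa = sum(weights)  # normalizing constant: a sum over all 64 graphs
    return {g: w / kappa for g, w in zip(graphs, weights)}


transitive_triad = frozenset({(0, 1), (1, 2), (0, 2)})
cyclic_triad = frozenset({(0, 1), (1, 2), (2, 0)})
dist = ergm_distribution((0.0, 1.0))  # no edge preference, positive transitivity weight
print(dist[transitive_triad] / dist[cyclic_triad])
```

With a positive coefficient on the transitivity statistic, the transitive triad is more probable than the cyclic one by a factor of e, even though both have three edges. The enumeration also shows why κ(θ) is the hard part: a directed graph on n nodes has 2^(n(n−1)) possible edge configurations, so exact summation is hopeless for a graph the size of the Apache call graph.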
Equation (2) is the standard log-linear probability model that is used in a wide range of fields from the social sciences to biology [28, 29].
Table 1. Exponential random graph models are extremely flexible. This table shows several example explanatory variables, identifying the variables by their names in the statnet package for R [30].
Variable      Description
istar(k)      The number of k-tuples of edges that point to the same node in the network.
ctriad        The number of 3-cycles in the network.
ttriad        The number of two-edge paths for which there is a one-edge shortcut in the network.
triangle      The sum of ctriad and ttriad for the network.
idegree(k)    The number of nodes with exactly k incoming edges in the network.
odegree(k)    The number of nodes with exactly k outgoing edges in the network.
gwidegree     The sum of the counts of each in-degree, weighted by the geometric sequence, (1 − e^{−θ_k})^i, where θ_k is a decay parameter.
edges         The number of edges in the graph.
To create an ERGM, a set of explanatory variables (virtually any function from the observed graph to the real numbers) is chosen by the modeler. The choice of variables is based on the pertinent features of the graph under study, or on a set of desired features, if the graphs are being simulated. An example,
non-exhaustive, set of explanatory variables is given in Table 1, most of which are important for modeling the Apache call graph. The coefficients, θ, can be interpreted as a preference of the observed network for a given explanatory variable, if its coefficient is positive, and a preference against a variable, if it is negative. Estimating θ based on an observed network is referred to as fitting the model, while using a predetermined θ to generate networks is referred to as simulating with the model. Given a set of explanatory variables, the best fit to the observed network is given by the parameter vector θ which maximizes the likelihood that the observation is drawn from the probability distribution given in Eq. 2. In this case, though, the standard maximum likelihood method to estimate the parameters is difficult because the function for the normalizing constant κ(θ) is not known a priori. Instead, one typically uses Markov chain Monte Carlo maximum likelihood estimation (MCMC MLE), a family of methods based on the Newton-Raphson MLE algorithm [26]. The maximum likelihood formula for the pdf obtained via fitting can be used with Markov chain Monte Carlo (MCMC) sampling methods to simulate networks. There are a number of software packages available for MCMC MLE fitting. These include the “statnet” package [31] for R [30] and the stand-alone SIENA software [32]. In practice, one rarely knows which explanatory variables to choose to fully describe a network using ERGMs. To compare if a particular set of explanatory variables models an observed network better than another, one can use several different approaches. For example, the modeler can use the fitted model to simulate a suite of networks and check how well the simulated networks match the observed network on any measure of interest (e.g., the degree distribution). Along this line, the “statnet” statistical package has a
built-in goodness-of-fit function which compares simulated networks to the observed network on a set of such measures. Another approach for comparing sets of explanatory variables is to use information-theoretic measures, like the Akaike information criterion (AIC), to assess how well a model fits the observed data. In addition to providing information on the goodness of fit, the AIC (which penalizes more complicated models to protect against overfitting [33]) can also be used to guide the search through the space of possible models, helping to identify the best variables to include in the model as follows. If the modeler suspects that a particular variable might be useful in modeling an observed network, the AIC can be used to test this hypothesis by toggling the variable in and out of the model, accepting the hypothesis if significant improvement in the AIC is observed.
4.2 Modeling Process and Results
As an exploratory first step to our modeling process, we fit models made from many of the possible combinations of a diverse set of explanatory variables that we expect to be important in explaining the Apache call graph. We include the counts of connected triads (ctriad and ttriad, cf. Table 1) in many of our exploratory models because these small connected graphs (graphlets) may be important architecturally in many types of larger networks [34, 35]. However, we do not expect the ctriad graphlet to be helpful in modeling software because it implies indirect recursion, an uncommon and difficult programming technique, but we include it in our modeling process as a sanity check. We also investigate various in- and out-degree counts because these counts provide a local measure of the network’s topology. Further, the success of in-degree count as an explanatory variable leads us to investigate the related in-star variables. In previous modeling efforts [36], degeneracy in the fitting algorithms was often observed for models using the variables above.
To circumvent such degeneracies it has become standard ERGM practice to include the geometrically weighted in-degree distribution and the simple edge count as variables in every model, and we do so here too. We toggle the variables described above in and out of several models, identifying variables that are important in fitting the Apache call graph to a single representative month (June 2003). The results for several representative models are given in Table 2, together with the AIC for the model. As expected, the AIC changes very little when ctriad is added to the basic edges+gwidegree model, indicating the lack of importance of ctriad, but the AIC improves significantly when the ttriad variable is added, showing us that the tendency of Apache programmers to include layer-crossing function calls is important in determining the global nature of the graph. Given these results, we further refine our search, looking at many more models that include the ttriad variable, and we find that the out-degree and the higher in-degree terms are less important than others we consider.
H. Wen et al.
Table 2. The AIC for a sample of fitted models. Note: For space and readability, the notation we use here to describe the models omits the $\theta_i$ parameter coefficient from Eq. (1). Each term (separated by +) is a separate model predictor variable with its own coefficient.

Model                                                             AIC
edges+gwidegree                                                   104090
edges+gwidegree+ctriad                                            104088
edges+gwidegree+ttriad                                            101473
edges+gwidegree+ttriad+odegree(2)                                 100065
edges+gwidegree+ttriad+istar(3)                                    97723
edges+gwidegree+ttriad+idegree(2)                                  97589
edges+gwidegree+ttriad+istar(2)                                    94383
edges+gwidegree+ttriad+idegree(2)+idegree(3)+istar(2)              91017
edges+gwidegree+ttriad+idegree(2)+idegree(3)+istar(2)+istar(3)     89491
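The toggling procedure described in the text can be illustrated with the AIC values of Table 2. The following is a minimal sketch only: the function names `aic_improvement` and `rank_terms` are hypothetical helpers introduced here, the numbers are copied from Table 2, and no threshold for a "significant" improvement is implied, since the text does not state one.

```python
# AIC values copied from Table 2 (lower AIC indicates a better fit).
TABLE2_AIC = {
    "edges+gwidegree": 104090,
    "edges+gwidegree+ctriad": 104088,
    "edges+gwidegree+ttriad": 101473,
    "edges+gwidegree+ttriad+odegree(2)": 100065,
    "edges+gwidegree+ttriad+istar(3)": 97723,
    "edges+gwidegree+ttriad+idegree(2)": 97589,
    "edges+gwidegree+ttriad+istar(2)": 94383,
}


def aic_improvement(base, term, table=TABLE2_AIC):
    """Drop in AIC when `term` is toggled into `base` (positive means better)."""
    return table[base] - table[base + "+" + term]


def rank_terms(base, terms, table=TABLE2_AIC):
    """Sort candidate terms by the AIC improvement each one gives on its own."""
    gains = {term: aic_improvement(base, term, table) for term in terms}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical usage: compare the two triad terms, then the degree terms.
triad_gains = rank_terms("edges+gwidegree", ["ctriad", "ttriad"])
degree_gains = rank_terms(
    "edges+gwidegree+ttriad",
    ["odegree(2)", "istar(3)", "idegree(2)", "istar(2)"],
)
```

On these numbers, adding ctriad lowers the AIC by only 2 while adding ttriad lowers it by 2617, which matches the qualitative statement in the text.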
Table 2 allows us to see the variables that are important to the AIC and, hence, are better at predicting the topology of the Apache call graph. For example, it is interesting that the out-degree of a function is less important to the global topology than the in-degree, indicating that the emergent structure of the call graph is more dependent on how many times each function is called than on how many dependencies it has, which is in line with the findings in Section 3. Next, we perform a longitudinal, 50-month study of the Apache call graph using a few of the best-fitting models from the one-month study. This experiment lets us see if the relative importance of explanatory variables changed throughout the Apache development process. The ranking by AIC of the models we fit remains constant across all 50 months, but the values of the parameters do not. Figure 9 shows a plot of the coefficient values over time for ttriad, idegree(2,3) and istar(2,3). These variables were chosen because they were contained in our best-fitting model (as determined by AIC) from Table 2, and we chose not to study any variables (such as odegree) from other, less well-fitting models. Our exploratory procedure eliminated the other variables that we considered because they did not contribute as large an improvement to the AIC as the variables from the final model. All of the variables that we have measured relating to in-degree (istar(2,3), idegree(2,3) and gwidegree) are generally negative in this model. On the other hand, the transitivity variable ttriad is consistently positive throughout the development cycle. This indicates that there are functions in Apache that call their callee’s callees (perhaps due to the standard library functions being included in the Apache call graph). Interestingly, over the 50-month period, idegree(2) is almost perfectly anti-correlated with idegree(3) (as seen in Fig. 9).
One explanation is that these two variables are measuring two aspects of the same phenomenon (how many functions are called approximately twice), and, hence, the importance of the two variables to the model is correlated. Similarly, edges and gwidegree (not shown) are strongly anti-correlated, perhaps because they both measure aspects of network density.

Fig. 9. Plots of several interesting coefficients across all 50 months. Top: ttriad. Middle: idegree(2,3). Bottom: istar(2,3).
5 Discussion and Conclusions

We study the evolution of the function call graph for the Apache 2.0 HTTP Server over a 50-month period. Apache is a mature OSS project, written in a procedural programming language. We characterize Apache first with several global measures: 1) nodes and edges, 2) degree distribution, 3) dependency matrices and propagation cost, 4) path length and clustering. We find that these measures have certain baseline behaviors and that deviations can indicate important structural changes in the code base. In particular, we find that propagation cost (introduced in [11]) is a sensitive measure that can signal when a detailed, fine-grained examination of the code base may be required. Using ideas proposed in [10] (that functions with simultaneously high in- and out-degrees are problematic), we are able to establish that the large changes observed in propagation cost are attributable to just two individual functions (out of approximately 2900 total functions). By examining the detailed development logs we corroborate that indeed these two functions have repeatedly troubled developers. The techniques presented herein may be useful in general for code written in procedural programming languages, as they may allow developers to identify particular functions which, when restructured, can reduce overall system dependencies. Using exponential random graph modeling, we investigate the relationships between the attributes that we empirically observe, and find that the most important attribute for predicting the global structure of the Apache call graph is ttriad, the number of transitive triads in the graph. In future work we intend to explore how the appearance of unexpected features might help to identify bugs.
Acknowledgments

We are indebted to Christian Bird for supplying the call graph data which is central to our analysis and to Premkumar Devanbu for many useful discussions. This work was funded in part by the National Science Foundation under Grant No. IIS-0613949.
References

1. E. S. Raymond. The Cathedral & the Bazaar. O’Reilly and Associates, Sebastopol, CA, 1999.
2. T. O’Reilly. Lessons from open source software development. Communications of the ACM, 42(4), 1999.
3. P. Ball. Openness makes software better sooner. Nature, June 25, 2003.
4. D. Challet and Y. Le Du. Microscopic model of software bug dynamics: Closed source versus open source. International Journal of Reliability, Quality and Safety Engineering, 12(6), 2005.
5. M. Fowler. Refactoring: Improving the Design of Existing Programs. Addison-Wesley, Reading, MA, 1999.
6. A. A. Gorshenev and Yu. M. Pis’mak. Punctuated equilibrium in software evolution. Phys. Rev. E, 70(6):067103, 2004.
7. http://httpd.apache.org.
8. Software Maintenance Costs and references therein, http://www.cs.jyu.fi/~koskinen/smcosts.htm.
9. S. Valverde, R. Ferrer Cancho, and R. V. Solé. Scale-free networks from optimal design. Europhys. Lett., 60(4):512–517, 2002.
10. C. R. Myers. Software systems as complex networks: Structure, function, and evolvability of software collaboration graphs. Phys. Rev. E, 68:046116, 2003.
11. A. MacCormack, J. Rusnak, and C. Y. Baldwin. Exploring the structure of complex software designs: An empirical study of open source and proprietary code. Management Science, 52(7), 2006.
12. Z. M. Saul, V. Filkov, P. T. Devanbu, and C. Bird. Recommending random walks. In Proceedings ESEC/SIGSOFT FSE, pages 15–24, 2007.
13. http://www.grammatech.com/products/codesurfer/overview.html.
14. S. B. Seidman. Network structure and minimum degree. Social Networks, 5:269–287, 1983.
15. B. Bollobas. The evolution of sparse graphs. In Graph Theory and Combinatorics, pages 35–57. Academic Press, New York, 1984.
16. http://www.apacheweek.com/features/ap2#rh.
17. S. Valverde and R. V. Solé. Hierarchical small worlds in software architecture. In Dynamics of Continuous Discrete and Impulsive Systems: Series B; Applications and Algorithms, volume 14, pages 1–11, 2007.
18. G. Baxter, M. Frean, J. Noble, M. Rickerby, H. Smith, M. Visser, H. Melton, and E. Tempero. Understanding the shape of Java software. In OOPSLA ’06: Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 397–412, New York, NY, USA, 2006. ACM.
19. J. N. Warfield. Binary matrices in system modeling. IEEE Transactions on Systems, Man, and Cybernetics, 3:441–449, 1973.
20. D. Sharman and A. Yassine. Characterizing complex product architectures. Systems Engineering Journal, 7(1), 2004.
21. http://svn.apache.org/viewvc/.
22. P. Erdős and A. Rényi. On random graphs. Publicationes Mathematicae, 6:290–297, 1959.
23. P. Erdős and A. Rényi. On the evolution of random graphs. Publ. Math. Inst. Hungar. Acad. Sci., 5(17), 1960.
24. M. Molloy and B. Reed. A critical point for random graphs with a given degree sequence. Random Struct. Alg., 6:161–179, 1995.
25. M. E. J. Newman, S. H. Strogatz, and D. J. Watts. Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E, 64:026118, 2001.
26. T. A. B. Snijders. Markov chain Monte Carlo estimation of exponential random graph models. Journal of Social Structure, 3(2), 2002.
27. C. J. Anderson, S. Wasserman, and B. Crouch. A p* primer: Logit models for social networks. Social Networks, 21:37–66, 1999.
28. D. Kaplan. The Sage Handbook of Quantitative Methodology for the Social Sciences. Sage Publications Inc., London, 2004.
29. C. Infante-Rivard, C. R. Weinberg, and M. Guiguet. Xenobiotic-metabolizing genes and small-for-gestational-age births: Interaction with maternal smoking. Epidemiology, 17(1):38–46, 2006.
30. The R Project for Statistical Computing, http://www.r-project.org.
31. M. S. Handcock, D. R. Hunter, C. T. Butts, S. M. Goodreau, and M. Morris. statnet: An R package for the statistical modeling of social networks, 2003. http://www.csde.washington.edu/statnet.
32. T. A. B. Snijders, P. E. Pattison, G. L. Robins, and M. S. Handcock. New specifications for exponential random graph models. Sociological Methodology, 99–153, 2006.
33. S. Konishi and G. Kitagawa. Information Criteria and Statistical Modeling. Springer, New York, 2008.
34. R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298:824–827, 2002.
35. S. Valverde and R. V. Solé. Network motifs in computational graphs: A case study in software architecture. Phys. Rev. E, 72:026107, 2005.
36. D. R. Hunter and M. S. Handcock. Inference in curved exponential family models for networks. Technical report, Penn State Department of Statistics, 2004. Available from http://www.stat.psu.edu/reports/2004/.
Some New Applications of Network Growth Models

Gourab Ghoshal

Department of Physics and Michigan Center for Theoretical Physics, University of Michigan, Ann Arbor, MI, 48109, USA;
[email protected]
1 Introduction

The study and analysis of complex networks has in recent times sparked widespread attention from the scientific community [1, 2, 3]. This interest has been spurred partly by researchers recognizing networks as useful representations of real-world complex systems, and also due to the widespread availability of computing resources, enabling them to gather and analyze data on a scale much larger than before. Studies have ranged from large-scale empirical analysis of the World Wide Web, social networks and biological systems, to the development of theoretical models and tools to explore the various properties of these systems [4, 5].

A topic that has garnered significant interest is the subject of growing networks, inspired by real-world examples such as that of the Internet, the World Wide Web and scientific citation networks [6, 7, 8]. The particular case of the World Wide Web has led to what is perhaps the best-known body of work on this topic: the preferential attachment model [9, 10], in which vertices are added to a network with edges that attach to pre-existing vertices with probabilities depending on those vertices’ degrees. When the attachment probability is precisely linear in the degree of the target vertex, the resulting degree sequence has a power-law tail, in the limit of large network size. The appearance of the power-law tail is what first led to the popularity of growth models as a method to describe network evolution, as most real-world networks appear to have degree distributions that are approximately power laws. The preferential attachment model, though a good starting point, is insufficient for describing networks such as the World Wide Web. One can imagine a variety of processes taking place in addition to the mere deposition of vertices and edges. In particular, it is a matter of common experience that web pages are sometimes permanently or temporarily removed from the web along with their links to other web pages.
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_13, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

Consequently, there is plenty of room to build on these models, which are principally growth based, and add another
level of complexity by including processes where vertices and edges are also removed from the network. It is also possible to extend the model to study general growth and deletion processes, and not just preferential attachment. Indeed, in the last couple of years, there has been some activity in this regard [11, 12, 13]. A notable example is the work done in [12], where, among other things, the authors extended the preferential attachment model to include the deletion of vertices, potentially at a rate different from that of the addition of new vertices. They demonstrated that networks still retain their power-law tail when the rate of vertex accrual outstrips vertex deletion (with an exponent dependent on the relative rate). However, when the rates are equal, the exponent diverges and the degree distribution transitions into a stretched exponential (Weibull distribution). This could have interesting consequences, for example, in the future character of the web where web pages would share a more even load than at present.

Most growth models are designed to solve for the degree distribution of the network. Although one can design a model which describes the evolution of a number of other network properties, say the clustering coefficient, those that deal with degree distributions are the most attractive, chiefly for two reasons. The first is in terms of practicality; the degree distribution is relatively straightforward to deal with mathematically, and thus one can calculate a number of properties exactly. The second reason is that the degrees of vertices typically have a strong effect on the overall behavior of the network; therefore, they are a useful guide in determining its characteristics. In fact, the traditional use of growth models has been mostly in this regard, where researchers define some evolution process and then solve for the equilibrium degree distribution of the network. However, it is certainly possible to think of alternative applications.
Instead of defining a set of processes under which a network evolves and then determining its final structure, one can turn the question around and specify a final structure, and then solve for the rules which give rise to that structure. To see when and why this is applicable, consider the following. Generally, we can divide evolving networks into three broad classes. There are those that evolve naturally, in the sense that they are driven by dynamical processes not under our control; representative examples are social, biological and information networks. This class is suited to the more traditional use of growth models. At best, we can measure the degree distribution of these networks, guess a set of rules that govern their evolution and then check our calculations against the measurements, to see if we made a reasonable choice. There is a different class, mostly infrastructure related, such as transportation and power grids and communication networks such as the telephone network and the Internet, that are designed by a centrally controlled authority. Since the rules for the evolution process defined in growth models are mostly local in nature, we are hard pressed to find a suitable application for them in this particular class. Finally, there is a relatively new class of networks which falls in between these two types, the classic example being peer-to-peer file-sharing networks. These
networks grow in a collaborative, distributed fashion so that we have no direct influence over their structure. However, we can manipulate some of the rules by which they form, giving us a limited but potentially useful influence over their properties. It is this last class that provides a fertile testbed for alternative uses for our growth model.

Two applications immediately spring to mind. Consider the case of peer-to-peer networks. Measurements [14] have shown that the degree distribution of these networks roughly follows a power law. Based on these findings, recent papers in the literature [15, 16] have proposed search strategies with costs (search time, bandwidth, etc.) scaling sublinearly or logarithmically with the size of the network. While it is certainly a practical and worthwhile approach to determine the structures of existing networks and then try to find ways to optimize their properties, it also seems natural to approach the question from the other direction. Some authors [17] have taken the approach of identifying desirable properties that a network should possess and then proposing appropriate designs to generate networks possessing these attributes. Since in a peer-to-peer network users are continually joining and leaving the network, it is well suited to be described by network growth models. The idea will be to specify a suitable structure a priori that optimizes some properties, say efficient information transfer, and solve for a set of local rules which can generate that same structure.

The other application is in the realm of network resilience. Networks typically experience a significant amount of node/edge turnover, due to possible failures of key components and resources, or intentional attacks. These factors can lead to severe disruption of the network structure and, as a result, loss of its key properties.
Thus, it is worth analyzing the effects of these failures/attacks and using our limited control to attempt to adaptively restore the original structure of these networks. Authors who work in this field have focused mostly on the effects of disruption on static networks, where they have studied the connectivity structure under the random/targeted removal of nodes and edges [18, 19, 20]. However, under the aegis of growth models we can move away from the static regime and instead focus on networks that evolve in time with sustained node and edge removals. We can allow the network to react to these disruptions by introducing new nodes and edges and attaching them in such a manner that the network is able to retain its original degree distribution. These kinds of models are conventionally referred to as reactive network models and have previously been studied in [21, 22].

In this chapter, we will move away from the more traditional uses of network growth models and focus on some new applications. The outline is as follows. In Section 2 we will first define our model. The model is based on a rate equation approach that governs the evolution of the degree distribution of the networks that we study. Based on this model, in Section 3 we will talk about applying growth models to design networks with a set of properties that may be desirable for their functioning, and illustrate our ideas with the specific example of peer-to-peer file-sharing networks. In Section 4 we will talk about
applying our growth model to the preservation of network degree distributions from attacks on, or failure of, its resources. In Section 5 we will state our conclusions. In all cases we use a combination of analytical calculations coupled with numerical simulations to come to our results.
2 The Model

In this section we will define our model for growing a network. Our approach will be based on the attachment kernel introduced in [23] in addition to a general deletion kernel. We will assume that nodes join/leave the network at intervals, and on doing so form/lose connections with other pre-existing nodes in the network. For the networks that we consider, we will make the assumption that, on the typical time scales over which nodes enter or leave the network, the size of the network $n$ does not change substantially. We will, however, not assume that our networks are uncorrelated (in terms of degree correlations) and will take this into account explicitly in our evolution process. The reasons for doing so will be clarified in Section 4. To start off, let us define $p_k$ to be the fraction of nodes in the network that at a given time have degree $k$. Alternatively, one can think of it as the probability of a randomly chosen node to have degree $k$. Then by definition, it satisfies the normalization condition

\[ \sum_{k=0}^{\infty} p_k = 1. \tag{1} \]
Let us now define the process by which a newly arriving node chooses to attach to others extant in the network and how a node is removed from the same. Let $\pi_k$ be the probability that a given edge from a new node is connected to a given node of degree $k$, multiplied by the total number of nodes $n$. Since the total number of nodes in the network with degree $k$ is $n p_k$, this implies that $\pi_k p_k$ is the probability that an edge from a new node is connected to any node of degree $k$. Similarly, let $a_k$ be the probability that a given node with degree $k$ fails or is attacked during one node removal, again multiplied by the total number of nodes $n$. Then $a_k p_k$ is the total probability to remove a node with degree $k$ during one node removal. Since each newly attached edge goes to some node with degree $k$, and each removal takes away some node with degree $k$, we have the following normalization conditions:

\[ \sum_k \pi_k p_k = 1, \tag{2} \]

\[ \sum_k a_k p_k = 1. \tag{3} \]

Finally, let us also allow ourselves to choose the number of edges of the newly joining nodes. Let $m_k$ be the distribution from which the degrees of these nodes are drawn, with the constraint $\sum_k k m_k = c$; in other words, the average degree of incoming vertices is $c$.
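As a small numerical illustration of the normalization conditions (2) and (3), the sketch below checks them for a truncated Poisson degree distribution. The kernels used here are assumptions chosen for illustration and are not prescribed by the text at this point: a uniform kernel $\pi_k = 1$, a linear preferential kernel $\pi_k = k/\langle k \rangle$, and uniformly random deletion $a_k = 1$. The helper names are hypothetical.

```python
import math


def poisson(mean, kmax):
    """Poisson probabilities p_0..p_kmax, renormalized after truncation."""
    raw = [math.exp(-mean) * mean**k / math.factorial(k) for k in range(kmax + 1)]
    total = sum(raw)
    return [x / total for x in raw]


def mean_degree(p):
    """Average degree <k> = sum_k k p_k."""
    return sum(k * pk for k, pk in enumerate(p))


def is_normalized(kernel, p, tol=1e-9):
    """Check sum_k kernel_k p_k = 1, the form shared by Eqs. (2) and (3)."""
    return abs(sum(w * pk for w, pk in zip(kernel, p)) - 1.0) < tol


c = 10.0                      # assumed average degree, for illustration only
p = poisson(c, kmax=80)

pi_uniform = [1.0] * len(p)                                   # attach uniformly at random
pi_preferential = [k / mean_degree(p) for k in range(len(p))]  # linear in degree
a_random = [1.0] * len(p)                                     # uniformly random deletion
```

All three kernels satisfy the normalization, since $\sum_k p_k = 1$ and $\sum_k k p_k / \langle k \rangle = 1$.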
2.1 Rate Equation

Armed with the given definitions, we are now in a position to write a rate equation governing the evolution of the degree distribution. For a network of $n$ nodes at a given unit of time, the total number of nodes with degree $k$ is $n p_k$. After one unit of time we add a node and take away another, so the new number with degree $k$ is now $n p'_k$, where $p'_k$ is the new value of $p_k$. Therefore, we have

\[ n p'_k = n p_k + c \pi_{k-1} p_{k-1} - c \pi_k p_k + \sum_j e_{k+1|j}\, j a_j p_j - \sum_j e_{k|j}\, j a_j p_j - a_k p_k + m_k, \tag{4} \]

where $e_{k|j}$ is the conditional probability of following an edge from a node of degree $j$ and reaching a node of degree $k$. Note that $e_{0|j}$ and $e_{j|0}$ are always zero, and for an uncorrelated network, $e_{k|j} = k p_k / \langle k \rangle$, where $\langle k \rangle = \sum_k k p_k$ is the average degree for the entire network. The $\pi_k$ terms describe the flow of nodes with degree $k-1$ to $k$ and $k$ to $k+1$ as a consequence of the new edges gained due to the addition of new nodes. The first two terms containing $a_j$ describe the flow of nodes with degree $k+1$ to $k$ and $k$ to $k-1$ as they lose edges as a result of losing neighbors. The term $-a_k p_k$ represents the direct removal of a node of degree $k$. Finally $m_k$ represents the addition of a node with degree $k$. Processes where nodes gain or lose two or more edges vanish in the limit of large $n$ and are not included in Eq. (4).

The rate equation described above presents a formidable challenge due to the appearance of $e_{k|j}$ in the terms representing edges lost through the loss of neighbors, which makes it hard to find a closed-form solution. Nevertheless, we can make progress in one of two ways. The first will be described here, and is applicable to our first application of designing networks. We will, for the moment, leave the description of the second method for a later section. Equation (4) has a particularly pleasing form if we limit ourselves to the case of uniformly random deletion, which amounts to setting $a_k = 1$. Doing so then leads to the following:

\[ \sum_j e_{k|j}\, j p_j = k p_k, \tag{5} \]

which renders Eq. (4) independent of $e_{k|j}$ and thus independent of any degree correlations. Random deletion thus closes the rate equation for $p_k$, enabling us to seek a solution for the degree distribution for a given $m_k$ and $\pi_k$. If we now assume that $p_k$ has an asymptotic form in the limit of large time, which amounts to setting $p'_k = p_k$, we get the following equation:

\[ c \pi_{k-1} p_{k-1} - c \pi_k p_k + (k+1) p_{k+1} - k p_k - p_k + m_k = 0. \tag{6} \]
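One way to see why Eq. (5) holds for any degree correlations is an edge-counting argument. The derivation below is a sketch added for illustration; it assumes only that $e_{k|j}$ is a properly normalized conditional probability, $\sum_j e_{j|k} = 1$ for $k \geq 1$.

```latex
% Count the edges joining degree-j nodes to degree-k nodes from both ends.
% From the degree-j side there are n p_j nodes with j edges each, a fraction
% e_{k|j} of which lead to degree-k nodes; likewise from the degree-k side.
\begin{align*}
  n\, j\, p_j\, e_{k|j} &= n\, k\, p_k\, e_{j|k}
  && \text{(same set of edges counted twice)} \\
  \sum_j e_{k|j}\, j\, p_j &= k\, p_k \sum_j e_{j|k} = k\, p_k
  && \text{(normalization of } e_{j|k}\text{).}
\end{align*}
% For k = 0 both sides vanish, since e_{0|j} = 0.
```

With $a_j = 1$ the two sums in Eq. (4) therefore reduce to $(k+1) p_{k+1}$ and $k p_k$, which are the terms appearing in Eq. (6).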
At this point it is convenient to define the following set of generating functions:

\[ G(z) = \sum_{k=0}^{\infty} p_k z^k, \tag{7} \]

\[ F(z) = \sum_{k=0}^{\infty} \pi_k p_k z^k, \tag{8} \]

\[ M(z) = \sum_{k=0}^{\infty} m_k z^k. \tag{9} \]

If we then multiply Eq. (6) by $z^k$ and sum over the index $k$ with the convention $p_{-1} = 0$, we find that the generating functions satisfy the following differential equation:

\[ (1 - z) \frac{dG}{dz} - G(z) - c (1 - z) F(z) + M(z) = 0. \tag{10} \]

Our task will be to solve for a set of rules that generate/preserve the degree distribution of a network that is specified beforehand. In other words, given a $G(z)$, our aim is to solve for the attachment kernel $F(z)$. We can rearrange Eq. (10) to get $F(z)$ in terms of the other two distributions,

\[ F(z) = \frac{1}{c} \left[ \frac{dG}{dz} + \frac{M(z) - G(z)}{1 - z} \right]. \tag{11} \]

It is a relatively straightforward exercise, starting from the equation above, to show that the average degree of the network $\langle k \rangle = c$. In other words, solutions to Eq. (10) require that the average degree $c$ of vertices added to the network be equal to the average degree of vertices in the network as a whole. Thus, we can write Eq. (11) as

\[ F(z) = G_1(z) + \frac{M(z) - G(z)}{c (1 - z)}, \tag{12} \]

where $G_1(z) = G'(z)/G'(1) = \sum_k q_k z^k$ is the generating function for what is called the excess degree distribution:

\[ q_k = \frac{(k+1) p_{k+1}}{\langle k \rangle}. \tag{13} \]
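The exercise mentioned above, showing that $\langle k \rangle = c$, can be sketched by taking the limit $z \to 1$ of Eq. (11). The steps below are an illustrative reconstruction, using only the normalizations $F(1) = G(1) = M(1) = 1$ from Eqs. (1) and (2) together with $G'(1) = \langle k \rangle$ and $M'(1) = c$.

```latex
\begin{align*}
  \lim_{z \to 1} \frac{M(z) - G(z)}{1 - z}
    &= \frac{M'(1) - G'(1)}{-1} = \langle k \rangle - c
    && \text{(L'H\^opital, since } M(1) = G(1)\text{)} \\
  F(1) = 1
    &= \frac{1}{c} \Bigl[ \langle k \rangle + \langle k \rangle - c \Bigr]
    && \text{(Eq. (11) at } z = 1\text{)} \\
  \Longrightarrow \quad \langle k \rangle &= c .
\end{align*}
```

Replacing $G'(z)/c$ by $G'(z)/G'(1)$ is then what turns Eq. (11) into Eq. (12).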
The excess degree refers to the number of edges attached to a vertex that is reached by following a randomly chosen edge, not counting the edge along which we arrived. The factor of $k + 1$ in Eq. (13) is present since we are now effectively sampling vertices in proportion to the number of edges extant on them. Note that the excess degree of a vertex is one less than the actual degree. We are now in a position to derive the desired attachment kernel. Noting that

\[ \frac{1}{1 - z} = \sum_{k=0}^{\infty} z^k, \tag{14} \]
we can simply read the coefficient of $z^k$ on either side of Eq. (12) to give

\[ \pi_k = \frac{1}{c p_k} \left[ (k+1) p_{k+1} + P_{k+1} - M_{k+1} \right], \tag{15} \]

where $P_k$ is the cumulative distribution of the degrees of nodes in the network, and $M_k$ is the cumulative distribution of the degrees of nodes added,

\[ P_k = \sum_{l=k}^{\infty} p_l, \qquad M_k = \sum_{l=k}^{\infty} m_l. \tag{16} \]

We have a number of options for solving Eq. (15); given (almost) any choice of the distribution $m_k$ of the degrees of added vertices, we can find the corresponding $\pi_k$ that will give the desired final degree distribution of the network. A particularly convenient choice would be to make the degree distribution of the added vertices the same as the desired degree distribution, so that $M_k = P_k$. Then,

\[ \pi_k = \frac{q_k}{p_k} = \frac{(k+1) p_{k+1}}{c p_k}. \tag{17} \]

In other words, if we have some desired degree distribution $p_k$ for our network, one way to achieve it is to add vertices with exactly that degree distribution and then arrange the attachment process so that the degree distribution remains preserved thereafter, even as vertices and edges are added to and removed from the network. Equation (17) tells us the choice of attachment kernel that will achieve this. For example, say we want to generate a Poisson network with degrees distributed according to

\[ p_k = e^{-\mu} \frac{\mu^k}{k!}, \tag{18} \]

where $\mu$ is the average degree of the network. Substituting Eq. (18) into Eq. (17) gives $\pi_k = \mu / c = 1$ for every $k$, since $\mu = \langle k \rangle = c$. Equation (17) thus tells us that all we have to do is to introduce nodes with degrees distributed according to Eq. (18) and attach them uniformly at random to the pre-existing vertices in the network. Figure 1 shows the degree distribution of a Poisson network generated using the method described above. Having built our mathematical framework, we are now free to move on to specific applications.
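The Poisson example can be checked numerically. The sketch below evaluates the attachment kernel of Eq. (15) for a truncated Poisson distribution with $m_k = p_k$ and also evaluates the left-hand side of the stationarity condition, Eq. (6). It is an illustration under stated assumptions rather than the simulation behind Fig. 1: the function names are hypothetical, the distribution is cut off at an arbitrary `kmax`, and $\mu = c = 10$ is chosen to match the figure.

```python
import math


def poisson(mu, kmax):
    """Poisson probabilities for k = 0..kmax (the tail beyond kmax is dropped)."""
    return [math.exp(-mu) * mu**k / math.factorial(k) for k in range(kmax + 1)]


def tail_sums(dist):
    """Return T with T[k] = sum_{l >= k} dist[l] and a trailing zero, as in Eq. (16)."""
    tails = [0.0] * (len(dist) + 1)
    for k in range(len(dist) - 1, -1, -1):
        tails[k] = tails[k + 1] + dist[k]
    return tails


def attachment_kernel(p, m, c):
    """Evaluate Eq. (15): pi_k = [(k+1) p_{k+1} + P_{k+1} - M_{k+1}] / (c p_k)."""
    P, M = tail_sums(p), tail_sums(m)
    return [((k + 1) * p[k + 1] + P[k + 1] - M[k + 1]) / (c * p[k])
            for k in range(len(p) - 1)]


def stationarity_residual(p, m, pi, c, k):
    """Left-hand side of Eq. (6) at degree k; it vanishes in the stationary state."""
    gain = c * pi[k - 1] * p[k - 1] if k > 0 else 0.0   # convention p_{-1} = 0
    return gain - c * pi[k] * p[k] + (k + 1) * p[k + 1] - k * p[k] - p[k] + m[k]


mu = 10.0
p = poisson(mu, kmax=60)
pi = attachment_kernel(p, p, c=mu)      # added nodes drawn from p itself
residuals = [stationarity_residual(p, p, pi, mu, k) for k in range(len(pi))]
```

Every entry of `pi` comes out equal to 1 up to rounding, which is the uniform attachment rule stated in the text, and the residuals of Eq. (6) vanish to the same precision. Passing a different `m` to `attachment_kernel` gives the kernel needed when new nodes arrive with some other degree distribution of mean $c$.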
3 Generating Networks with Desired Properties

In this section we will discuss the application of our growth model to generate networks with desired structural properties. Specifically, we will consider the case of peer-to-peer file-sharing networks. As motivation for this, consider the following. A problem which has gained a lot of attention is that of designing an efficient search strategy to find items or data stored on the vertices of a network.
[Figure 1: plot of probability $p_k$ against degree $k$, both on logarithmic axes.]

Fig. 1. The degree distribution for a network of fixed size $n = 50{,}000$ generated using the growth mechanism described in the text, with $c = 10$. The points represent the simulation results and the solid line is the distribution Eq. (18).
Interest in this has been inspired partly by the emergence of networked distributed databases such as peer-to-peer file-sharing networks. In such networks the structure of the network and the distribution of the items stored on it typically change rapidly and frequently, which means that searches must be performed in real time. In peer-to-peer networks searches typically consist of queries that are forwarded from one vertex to another until the target item is found. Real-time searches place heavy demands on computer power and bandwidth, and there is interest in finding efficient search strategies to decrease these costs. As mentioned in the Introduction, direct measurements of real peer-to-peer networks have shown that typically the degree distribution of these networks follows a power law, which has led some authors to propose search strategies that exploit this power-law form to improve efficiency. Here we describe an alternative approach to the problem: instead of tailoring our algorithm to the observed network, we instead tailor the structure of the network to optimize the performance of the search algorithm. We will start by defining our algorithm and then outline the properties of interest. We will then consider a candidate network with a structure that optimizes those properties. The ideas of this section have been discussed in detail in [24].

3.1 Definition of the Problem

Consider a distributed database consisting of a set of computers, each of which holds some data items. Copies of the same item can exist on more than one
computer, which would make searching easier, but we will not assume this to be the case. Computers are connected together in a virtual network, meaning that each computer is designated as a neighbor of some number of other computers. These connections between computers are purely notional: every computer can communicate with every other directly over the Internet or other physical network. The virtual network is used only to limit the amount of information that computers have to keep about their peers. Each computer maintains a directory of the data items held by its network neighbors, but not by any other computers in the network. Searches for items are performed by passing a request for a particular item from computer to computer until it reaches one in whose directory that item appears, meaning that one of that computer’s neighbors holds the item. The identity of the computer holding the item is then transmitted back to the origin of the search, and the origin and target computers communicate directly thereafter to negotiate the transfer of the item. This basic model is essentially the same as that used by other authors [15] as well as by many actual peer-to-peer networks in the real world. Note that it achieves efficiency by the use of relatively large directories at each node of the network, which inevitably use up memory resources on the computers. However, with standard hash-coding techniques and for databases of the typical sizes encountered in practical situations (hundreds of thousands or millions of items) the amounts of memory involved are quite modest by modern standards. The two metrics of search performance that we consider are search time and bandwidth, both of which should be low in a good algorithm. We define the search time to be the number of steps taken by a propagating search query before the desired target item is found. We define the bandwidth for a node as the average number of queries that pass through that node per unit time.
Bandwidth is a measure of the actual communications bandwidth that vertices must expend to keep the network as a whole running smoothly, but it is also a rough measure of the CPU time they must devote to searches. Since these are limited resources, it is crucial that we do not allow the bandwidth to grow too quickly as vertices are added to the network; otherwise, the size of the network will be constrained.

3.2 Search Strategy and Search Time

In order to treat the search problem quantitatively, we need to define a search strategy or algorithm. Our candidate will be a very simple one, the random walk search, which, though certainly not the most efficient strategy possible, has two significant advantages. First, it is simple enough to allow us to carry out analytic calculations of its performance. Second, as we will show, even this basic strategy can be made to work very well. Our results constitute an existence proof that good performance is achievable; searches are necessarily possible that are at least as good as those analyzed here.
G. Ghoshal
In a random walk search a node i originating a search sends a query for the item it wishes to find to one of its neighbors j, chosen at random. If that item exists in the neighbor’s directory, the identity of the computer holding the item is transmitted to the originating node and the search ends. If not, then j passes the query to one of its neighbors chosen at random, and so forth. Let $p_i$ be the probability that our random walker is at node i at a particular time. Then the probability $p_i'$ of its being at i one step later, assuming the target item has not been found, is

\[ p_i' = \sum_j \frac{A_{ij}}{k_j} \, p_j, \tag{19} \]

where $k_j$ is the degree of node j and $A_{ij}$ is an element of the adjacency matrix,

\[ A_{ij} = \begin{cases} 1 & \text{if there is an edge joining vertices } i, j, \\ 0 & \text{otherwise.} \end{cases} \tag{20} \]

In the long-time limit, the probability distribution over nodes tends to the fixed point of (19), which is at

\[ p_i = \frac{k_i}{2m}, \tag{21} \]
where m is the total number of edges in the network. That is, the random walk visits nodes with probability proportional to their degrees. When our random walker arrives at a previously unvisited node of degree $k_i$, it “learns” from that node’s directory about the items held by all of its immediate neighbors, of which there are $k_i - 1$ excluding the one we arrived from (whose items by definition we already know about). Thus, at every step the walker gathers more information about the network. The average number of nodes it learns about upon making a single step is $\sum_i p_i (k_i - 1)$, with $p_i$ given by (21), and hence the total number it learns about after τ steps is

\[ \frac{\tau}{2m} \sum_i k_i (k_i - 1) = \tau \left( \frac{\langle k^2 \rangle}{\langle k \rangle} - 1 \right), \tag{22} \]

where $\langle k \rangle$ and $\langle k^2 \rangle$ represent the mean and mean-square degrees in the network respectively and we have made use of $2m = n \langle k \rangle$.

The time taken for the walker to find the desired item, of course, depends on how many instances of the target exist in the network. In many cases of practical interest, copies of items exist on a fixed fraction of the nodes in the network, which makes for quite an easy search. Here we will consider the much harder problem in which copies of the target item exist on only a fixed number of nodes, where that number could potentially be just 1. In this case, the walker will need to learn about the contents of O(n) nodes in order to find the target, and hence the average time to find the target is given by
\[ \tau = A \, \frac{n}{\langle k^2 \rangle / \langle k \rangle - 1}, \tag{23} \]
for some constant A.

3.3 Bandwidth

Bandwidth is the mean number of queries reaching a given node per unit time. Equation (21) tells us that the probability that a particular current query reaches node i at a particular time is $k_i/2m$, and assuming as discussed above that the number of queries initiated per unit time is proportional to the total number of vertices, the bandwidth for node i is

\[ \beta_i = B n \, \frac{k_i}{2m} = B \, \frac{k_i}{\langle k \rangle}, \tag{24} \]
where B is another constant. This implies that high-degree nodes will be overloaded in comparison with low-degree ones, which means that networks with power-law or other highly right-skewed degree distributions may be undesirable, resulting in bottlenecks around the nodes of highest degree that could in principle harm the performance of the entire network. If we wish to distribute load evenly among the computers in our network, then a network with a tightly peaked degree distribution is desirable.

3.4 Candidate Network

We wish to choose a structure for our network that gives low search times and modest bandwidth demands, keeping in mind that the structure we consider must be realizable in practice. In peer-to-peer networks users continually exit the network whenever they want. Since we as designers have limited control over this aspect of the network dynamics, we will assume that nodes are effectively deleted at random. With this in mind, we are ideally placed to use our model from Section 2. A simple and attractive choice for our network is the Poisson distributed network. For a Poisson degree distribution with mean μ we have $\langle k \rangle = \mu$ and $\langle k^2 \rangle = \mu^2 + \mu$. Then, using Eq. (23), the average search time is

\[ \tau = A \, \frac{n}{\mu}. \tag{25} \]
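As a numerical sanity check of Eqs. (19) to (25), the following Python sketch builds a small random graph and evaluates the quantities above. It is illustrative only: a G(n, p) random graph stands in for the Poisson network (it is not the growth model of Section 2), the constants A and B are set to 1, and the function names are hypothetical.

```python
import numpy as np

def gnp_adjacency(n, mu, rng):
    """Adjacency matrix of a G(n, p) random graph with p = mu / (n - 1).

    Degrees are approximately Poisson with mean mu. Isolated nodes are
    dropped so that every k_j appearing in Eq. (19) is nonzero.
    """
    upper = np.triu(rng.random((n, n)) < mu / (n - 1), k=1)
    adj = (upper | upper.T).astype(float)
    keep = adj.sum(axis=0) > 0
    return adj[np.ix_(keep, keep)]

def walk_step(adj, p):
    """One application of Eq. (19): p'_i = sum_j A_ij p_j / k_j."""
    k = adj.sum(axis=0)
    return adj @ (p / k)

def search_summary(adj, a_const=1.0, b_const=1.0):
    """Evaluate Eqs. (21)-(24) for a given adjacency matrix."""
    k = adj.sum(axis=0)
    n, two_m = len(k), k.sum()
    p_star = k / two_m                        # fixed point, Eq. (21)
    learned = np.sum(p_star * (k - 1))        # nodes learned about per step
    ratio = np.mean(k**2) / np.mean(k) - 1    # <k^2>/<k> - 1, as in Eq. (22)
    tau = a_const * n / ratio                 # search time, Eq. (23)
    beta = b_const * k / np.mean(k)           # per-node bandwidth, Eq. (24)
    return p_star, learned, ratio, tau, beta

rng = np.random.default_rng(0)
adj = gnp_adjacency(400, 12.0, rng)
p_star, learned, ratio, tau, beta = search_summary(adj)
print("nodes learned per step:", learned, "estimated search time:", tau)
```

For a Poisson-like graph the quantity `ratio` should be close to the mean degree μ, which is the simplification that leads from Eq. (23) to Eq. (25).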
Now if we allow μ to grow as some power of the size of the entire network, i.e. $\mu \propto n^{\alpha}$ with 0 ≤ α ≤ 1, then $\tau \propto n^{1-\alpha}$. For smaller values of α searches will take longer, but the nodes’ degrees are lower on average, meaning that each vertex will have to devote fewer memory resources to maintaining its directory. Conversely, for larger α, searches will be completed more quickly at the expense of greater memory usage. In the limiting case α = 1, searches
are completed in constant time, independent of the network size, despite the simple nature of the random walk search algorithm. The price we pay for this good performance is that the network becomes dense, having a number of edges scaling as $n^{1+\alpha}$. However, remember that this is a virtual network, in which the edges are a purely notional construct whose creation and maintenance carry essentially zero cost. There is a cost associated with the directories maintained by nodes, which for α = 1 will contain information on the items held by a fixed fraction of all the nodes in the network. For instance, each node might be required to maintain a directory of 1% of all items in the network. Because of the nature of modern computer technology, however, we do not expect this to create a significant problem. User time (for performing searches) and CPU time and bandwidth are scarce resources that must be carefully conserved, but memory space on hard disks is cheap, and the tens or even hundreds of megabytes needed to maintain a directory is considered in most cases to be a small investment. By making the choice α = 1 we can trade cheap memory resources for essentially optimal behavior in terms of search time, and this is normally a good deal for the user.

As a test of our proposed search scheme, we have performed simulations of the procedure on Poisson networks generated using the methods described in Section 2. Figure 2 shows as a function of network size the average time τ taken by a random walker to find an item placed at a single randomly chosen node in the network. As we can see, the value of τ does indeed tend to a constant (about 100 steps in this case) as the network size becomes large.

Fig. 2. The time τ for the random walk search to find an item deposited at a random vertex, as a function of the number of vertices n. (Plot of time τ, ranging from about 90 to 170, against network size n, ranging from 0 to 20000.)

While we have described here the theoretical ideas to grow a network with a desired degree distribution, within the constraints outlined above, we have not provided a realistic way to place edges between nodes with the desired attachment kernel $\pi_k$. If each node entering the network knew the identities and degrees of all the others, this would be easy; we would simply select a degree k at random in proportion to $\pi_k p_k$, and then select a node uniformly at random with that degree. In the real world, however, and particularly in peer-to-peer networks, no node knows the identity of all others. Typically, computers only know the identities (such as IP addresses) of their immediate network neighbors. There is a way around this problem: using biased random walks to generate the network. The main purpose of this paper is to discuss ideas rather than implementation; consequently, we will not describe this here. For a detailed discussion of the practical implementation, along with other details such as data replication, we refer the interested reader to [24], and instead move on to our next application.
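The random walk search of Section 3.2 is simple enough to simulate in a few lines. The sketch below is not the simulation behind Fig. 2: it uses a G(n, p) random graph as a stand-in for a Poisson network, treats a node as knowing the target if it is the target or a neighbor of it (an assumption about how a node's own items are handled), and uses hypothetical function names and parameter values throughout.

```python
import random

def gnp_neighbors(n, mu, rng):
    """Neighbor lists of a G(n, p) random graph with p = mu / (n - 1)."""
    p = mu / (n - 1)
    nbrs = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                nbrs[i].append(j)
                nbrs[j].append(i)
    return nbrs

def random_walk_search(nbrs, origin, target, rng, max_steps=100_000):
    """Number of steps until the walker reaches a node that knows the target.

    Returns None if the walker starts on an isolated node or the step
    limit is exceeded (for example, target in another component).
    """
    current = origin
    for step in range(1, max_steps + 1):
        if not nbrs[current]:
            return None
        current = rng.choice(nbrs[current])
        if current == target or target in nbrs[current]:
            return step
    return None

def mean_search_time(n, mu, trials, rng):
    """Average search time over random origin/target pairs on one graph."""
    nbrs = gnp_neighbors(n, mu, rng)
    times = []
    for _ in range(trials):
        origin, target = rng.sample(range(n), 2)
        steps = random_walk_search(nbrs, origin, target, rng)
        if steps is not None:
            times.append(steps)
    return sum(times) / len(times) if times else float("nan")

if __name__ == "__main__":
    rng = random.Random(1)
    # alpha = 1: mean degree grows in proportion to n (here mu = 0.05 n).
    for n in (200, 400, 800):
        print(n, mean_search_time(n, mu=0.05 * n, trials=200, rng=rng))
```

With the mean degree scaled in proportion to n, the printed averages should stay roughly flat as n grows, in line with τ ∝ n/μ; the absolute values depend on the arbitrary choice μ = 0.05 n.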
4 Preserving Network Structure from Disruptions

We now turn our attention to quite a different topic: the field of network resilience. Quite a lot of work has been done in this regard, though most of it has focused on the effects of disruption on static networks. Typically, authors have studied networks where the nodes and edges are progressively removed in some fashion, and then measured the effect of these removals on the existence of a giant component. The giant component constitutes the largest set of nodes in the network, of size O(n), where n is the size of the network, that are connected to each other by at least one path. The network is considered static in that no compensatory measures, such as the (re)-introduction of new edges or nodes, are permitted.

There is indeed good reason to study the resilience of networks. In the real world, networks suffer from a variety of disruptions, stemming from failure of key components, continuous addition/removal of nodes and edges and intentional attacks such as Denial of Service, among other things. Since these disruptions affect the structure of networks and structure is directly related to performance, it is important to understand how the networks are affected. However, we can do better than that. We can try and restore some or all of the structure of the network by allowing it to react to the disruptions through the introduction of new nodes and edges. As evidenced from the previous section, considerable effort can be expended in tailoring a network to have structures that optimize properties of interest, and it is a worthy effort to try and maintain that structure in the face of varied disruptions. Note that in the context of this paper, when we talk about the structure of the network, we limit ourselves to the degree distribution.
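The static-network measurement described above, removing nodes and then checking what is left of the largest connected component, can be sketched as follows. This is a generic breadth-first-search illustration rather than code from any of the cited studies; the function names and the adjacency-list representation are assumptions.

```python
import random
from collections import deque

def largest_component_size(nbrs, alive):
    """Size of the largest connected component among nodes flagged alive."""
    seen = set()
    best = 0
    for start in range(len(nbrs)):
        if not alive[start] or start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in nbrs[u]:
                if alive[v] and v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best

def remove_random_fraction(n, fraction, rng):
    """Alive flags after deleting a uniformly random fraction of the n nodes."""
    removed = set(rng.sample(range(n), int(round(fraction * n))))
    return [i not in removed for i in range(n)]
```

Dividing the returned size by the number of surviving nodes gives the usual order parameter for the presence of a giant component; no repair step is applied here, which is exactly the limitation the rest of this section addresses.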
For the purposes of our study we assume that the designers of the network are only aware of the statistical properties of the removed nodes and have no ability to influence the existing network beyond the introduction of new nodes along with their corresponding edges. They thus have two processes under their control to compensate for the attack. The first is the degree of
the introduced vertices, and the second is the process by which a newly introduced node chooses to attach to a previously extant one on the network. Consequently, failure is compensated by adding nodes and edges chosen from an appropriate degree distribution and attaching them to the network via specially tailored schemes.

As mentioned before, a variety of models have been proposed to simulate network evolution and growth where vertices are both added and deleted, but these have concentrated on the relatively simple case of uniform deletion. We have already shown in Section 2 that, under uniform failures, the appearance of degree correlations that typically arise as a result of growth processes can be neglected. For the case of non-uniform deletion, correlations cannot be ignored. Here we will proceed by demonstrating how to preserve an initially uncorrelated network throughout the evolution process, with the introduction of an additional rate equation for the degree correlations; consequently, our focus will be on the currently neglected case of non-uniform failures. The results of this section are based on the work of [26].

4.1 Types of Disruptions

Before we move on to our method for repairing networks, we provide a brief description of the types of attacks or failures that most networks are subject to. Random failures are the most generally studied schemes in both static and evolving networks, because they lend themselves to relatively simple analysis. These types of failures may be representative, say, of disruption of power lines or transformers in a power grid owing to extraneous factors such as weather. However, the functionality of most networks often depends on the performance of higher-degree nodes; consequently, non-uniform attack schemes focus on these. For example, in a peer-to-peer network, a high-degree node could be a central user with large amounts of data.
High degree could also be indicative of the amount of load on a node during its operation, or of the public visibility of a person in a social network. It is reasonable to assume that a malicious entity such as a computer virus is more likely to strike these important nodes. We can simulate these kinds of attacks using preferential failures $a_k \propto k$, which sample nodes in proportion to their number of connections, and through an outright attack on the highest-degree nodes represented by $a_k \propto \theta(k - k_{\min})$, where θ(x) is the Heaviside step function. Our method of compensation will involve control over two processes: the first where our newly incoming/repaired node chooses a degree for itself drawn from some distribution $m_k$, and second, the process by which this node decides to attach to any other in the network, governed by the attachment kernel $\pi_k$.

4.2 Repair Method

The evolution process, specifically non-uniform removal of nodes, can, and in many cases will, introduce degree correlations into our networks. In order to
confront this issue, we proceed as follows. First we will find choices for $m_k$ and $\pi_k$ that solve the rate equation for a given $p_k$ in a network that is uncorrelated. We will then demonstrate that a special subset of those solutions for $m_k$ and $\pi_k$ is an uncorrelated fixed point of the rate equation for the degree correlations.

Our goal here is to solve for the attachment kernel $\pi_k$ that will preserve the original probability distribution $p_k$, subject to a deletion kernel $a_k$ for some choice of $m_k$. Before we move on, we need to make a slight modification to Eq. (6). In the earlier instance, we exploited the simplification that arose from uniform deletion. We will assume here that the initial network is uncorrelated; however, we will retain the general form of the deletion kernel $a_k$. Let $\langle k \rangle_a$ be the mean degree of nodes removed from the network (i.e. $\langle k \rangle_a = \sum_k k a_k p_k$), and $\langle k \rangle$ the mean degree of the original degree distribution $p_k$. Then we have

\[ c \pi_{k-1} p_{k-1} - c \pi_k p_k + (k+1) \frac{\langle k \rangle_a}{\langle k \rangle} p_{k+1} - k \frac{\langle k \rangle_a}{\langle k \rangle} p_k - a_k p_k + m_k = 0. \tag{26} \]
Once again, it can be easily shown from Eq. (26) that the average degree of nodes removed is $\langle k \rangle_a = c$. Introducing the cumulative distributions for the attacked and newly added nodes, $A_k$ and $M_k$ respectively,

\[ A_k = \sum_{l=k}^{\infty} a_l p_l, \qquad M_k = \sum_{l=k}^{\infty} m_l, \tag{27} \]
we sum Eq. (26) from $k' = k + 1$ to ∞, to get

\[ \pi_k p_k = \frac{(k+1) p_{k+1}}{\langle k \rangle} + \frac{A_{k+1} - M_{k+1}}{c}. \tag{28} \]
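The summation step can be written out explicitly. The following is a sketch of the intermediate algebra, writing $k'$ for the summation index and $\langle k \rangle_a$ for the mean degree of removed nodes, and assuming that $\pi_{k'} p_{k'}$ and $k' p_{k'}$ vanish as $k' \to \infty$ (an assumption not stated in the text). The three groups of terms in Eq. (26) sum separately:

```latex
\begin{align*}
\sum_{k'=k+1}^{\infty} c \left( \pi_{k'-1} p_{k'-1} - \pi_{k'} p_{k'} \right)
  &= c \, \pi_k p_k, \\
\frac{\langle k \rangle_a}{\langle k \rangle}
  \sum_{k'=k+1}^{\infty} \left[ (k'+1) p_{k'+1} - k' p_{k'} \right]
  &= - \frac{\langle k \rangle_a}{\langle k \rangle} (k+1) p_{k+1}, \\
\sum_{k'=k+1}^{\infty} \left( m_{k'} - a_{k'} p_{k'} \right)
  &= M_{k+1} - A_{k+1}.
\end{align*}
```

The first two sums telescope, and the third follows from the definitions in Eq. (27). Adding the three lines and setting the total to zero gives $c \pi_k p_k = (\langle k \rangle_a / \langle k \rangle)(k+1) p_{k+1} + A_{k+1} - M_{k+1}$; using $\langle k \rangle_a = c$ and dividing through by c recovers Eq. (28).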
Dividing both sides by $p_k$ then gives us an expression for the attachment kernel,

\[ \pi_k = \frac{1}{p_k} \left[ \frac{(k+1) p_{k+1}}{\langle k \rangle} + \frac{A_{k+1} - M_{k+1}}{c} \right]. \tag{29} \]
This equation represents the set of possible solutions for the attachment kernel that will lead to the desired degree distribution, given that the final network is uncorrelated. The correct choice of solution from the above set must obey the consistency condition that, when inserted into the rate equation for the degree correlations, the correlations vanish. The following ansatz chosen from the above set is such a choice:

\[ m_k = a_k p_k, \qquad \pi_k = \frac{q_k}{p_k} = \frac{(k+1) p_{k+1}}{\langle k \rangle \, p_k}. \tag{30} \]
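The ansatz of Eq. (30) can be checked numerically against the rate equation (26). The sketch below does this for an exponential degree distribution under a step-function attack. Several choices here are assumptions made for illustration: the distribution is truncated at a maximum degree K and renormalized, θ(0) is taken to be 1, the steady-state value c = ⟨k⟩_a is imposed directly, and the function names are hypothetical.

```python
import numpy as np

def repair_kernels(p, a):
    """Return (a, m, pi) following Eq. (30).

    p[k] is the degree distribution for k = 0..K and a[k] an unnormalized
    deletion kernel, rescaled here so that sum_k a_k p_k = 1.
    """
    p = np.asarray(p, dtype=float)
    a = np.asarray(a, dtype=float)
    a = a / np.sum(a * p)
    k = np.arange(len(p))
    mean_k = np.sum(k * p)
    # Excess degree distribution q_k = (k+1) p_{k+1} / <k>; p_{K+1} = 0.
    q = np.append(k[1:] * p[1:], 0.0) / mean_k
    m = a * p
    # pi_k is undefined where p_k = 0; this sketch sets it to zero there.
    pi = np.divide(q, p, out=np.zeros_like(q), where=p > 0)
    return a, m, pi

def rate_equation_residual(p, a, m, pi):
    """Left-hand side of Eq. (26) for each k, with c set equal to <k>_a."""
    p = np.asarray(p, dtype=float)
    k = np.arange(len(p))
    mean_k = np.sum(k * p)
    c = np.sum(k * a * p)
    gain = c * pi * p                                # c pi_k p_k
    gain_prev = np.concatenate(([0.0], gain[:-1]))   # c pi_{k-1} p_{k-1}
    kp = k * p
    kp_next = np.append(kp[1:], 0.0)                 # (k+1) p_{k+1}
    return gain_prev - gain + (c / mean_k) * (kp_next - kp) - a * p + m

lam, K = 0.4, 100
k = np.arange(K + 1)
p = np.exp(-lam * k)
p /= p.sum()                                         # truncated form of Eq. (32)
a, m, pi = repair_kernels(p, (k >= 5).astype(float))
residual = rate_equation_residual(p, a, m, pi)
print("largest residual of Eq. (26):", np.max(np.abs(residual)))
```

In a repair simulation, the degree of each new node could then be drawn from `m` (for instance with `rng.choice(len(m), p=m)` on a NumPy random generator), and each of its edges attached to a node of degree k chosen with probability proportional to `pi[k] * p[k]`.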
The reason behind this choice will be made clearer in the next section. Note the similarity with Eq. (17), which was derived in the context of uniform deletion. Here we see that it holds true even for non-uniform deletion, albeit with some caveats that we will see shortly. There are basically two conditions for the existence of a solution given by this equation: $a_k p_k$ must be a valid probability distribution, and $\langle k \rangle$ must be finite. These are not very stringent conditions and are typically satisfied by most degree distributions. In other words, barring some pathological cases, it is always possible to find a solution of the above form.

We are now in a position to effect our repair on the network. Given the original degree distribution $p_k$ and the form of the attack $a_k$, Eq. (30) gives us the precise recipe for recovering the degree distribution. We need to sample the degrees of the newly introduced nodes in proportion to the product of the deletion kernel and the degree distribution, and then attach these edges in proportion to the excess degree distribution of the network.

To test our repair method, we provide two examples for initially uncorrelated networks with 10,000 nodes generated using the configuration model [25]. In the configuration model, only the degrees of vertices are specified. Apart from this sole constraint the connections between vertices are made at random. We employ two types of attack kernels on our two example networks: a preferential attack represented by $a_k \propto k$, and a targeted attack only on high-degree nodes represented by $a_k \propto \theta(k - k_{\min})$. Our first network has links distributed according to a power law with an exponential cutoff,

\[ p_k = \begin{cases} C k^{-\gamma} e^{-k/\kappa} & k > 0, \\ 0 & k = 0, \end{cases} \tag{31} \]

where C is a normalization constant. Our second choice of network has an exponential degree distribution,

\[ p_k = \left( 1 - e^{-\lambda} \right) e^{-\lambda k}. \tag{32} \]

In Fig. 3 we show the resulting degree distribution for the power-law network where nodes were attacked preferentially, while Fig. 4 shows the results for the exponentially distributed network undergoing targeted attack. Both figures indicate that the initial and final networks are in excellent agreement.

Fig. 3. Log-binned degree distribution of a power-law network ($10^4$ nodes) with exponent γ = 3 and exponential cutoff κ = 100, under preferential attack $a_k \propto k$ using $\pi_k$ from Eq. (30) after setting $m_k = a_k p_k$. The data points are averaged over multiple realizations of the network, each subject to $10^5$ iterations of addition and deletion. The points along with corresponding error bars represent the final degree distribution, whereas the solid line represents the initial network. (Axes: probability $p_k$ against degree k, both logarithmic.)

Fig. 4. Degree distribution of an exponential network ($10^4$ nodes) with λ = 0.4 under targeted attack $a_k \propto \theta(k - 5)$ using $\pi_k$ from Eq. (30) after setting $m_k = a_k p_k$. (Axes: probability $p_k$ against degree k, both logarithmic.)

4.3 Neglecting Degree Correlations

To demonstrate the validity of our results, we must prove that our initially uncorrelated networks remain uncorrelated under our repair scheme. Here we give a brief sketch of the idea; for full details, see [26]. We start off by defining a rate equation for the correlations. The rate equation describes the evolution of the expected number of edges in the network with ends of degree k and l. Let the number of such edges in the network be

\[ m e_{l,k}, \tag{33} \]

where $m = n \langle k \rangle / 2$, and $e_{l,k}$ is the probability that a randomly selected edge has degree k at one end and degree l at the other. The expected number of edges after one time step where we add c and take away $\langle k \rangle_a$ edges is then

\[ \left[ m + c - \langle k \rangle_a \right] e'_{l,k} = m e_{l,k} + \Delta, \tag{34} \]

where Δ represents all other edge addition and removal processes. We have already established that in the steady state case, $\langle k \rangle_a = c$ irrespective of the degree distribution, so our goal is equivalent to showing that Δ is equal to zero for an uncorrelated network generated/repaired with our special choices of $\pi_k$ and $m_k$. As a result $e'_{k,l} = e_{k,l}$, implying that the degree correlations (if any) remain constant over time. So according to Eq. (34) there exists a set of solutions such that an initially uncorrelated network will not develop correlations as a consequence of the evolution process. The attachment kernel Eq. (30) that was employed in the network evolution process is a subset of these solutions. This allows the repair method to be employed while maintaining negligible correlations in the network.

To briefly summarize, we have demonstrated that if a network with a certain degree structure is subjected to an attack that aims to destabilize that structure, one can recover the same by manipulating the rules by which newly added/removed vertices are (re)-introduced back to the network. The rules that we employ in our repair method are dependent on the types of attacks on our networks.
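Whether a given network is uncorrelated in the sense used above can be checked empirically from its edge list. In an uncorrelated network the joint distribution $e_{k,l}$ factorizes into the product of its marginals, so the largest deviation from that product is a simple diagnostic. The sketch below is one way to compute it; the function names are hypothetical and the diagnostic itself is a choice made here, not the analysis of [26].

```python
from collections import Counter

def edge_degree_distribution(edges):
    """Return e[(k, l)], the probability that a random edge joins degrees k and l.

    Each undirected edge is counted once in each direction, so e is symmetric.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter()
    for u, v in edges:
        counts[(degree[u], degree[v])] += 1
        counts[(degree[v], degree[u])] += 1
    total = 2 * len(edges)
    return {kl: count / total for kl, count in counts.items()}

def correlation_deviation(e):
    """Largest |e_kl - f_k f_l| with f_k = sum_l e_kl; zero if uncorrelated."""
    f = Counter()
    for (k, _l), prob in e.items():
        f[k] += prob
    degrees = list(f)
    return max(abs(e.get((k, l), 0.0) - f[k] * f[l])
               for k in degrees for l in degrees)
```

On a finite network the deviation will not be exactly zero even without correlations, so in practice it would be compared against the value obtained from randomized networks with the same degree sequence.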
5 Conclusion

In this paper we have discussed some interesting alternative applications of network growth models. Traditionally these models have been used to determine the processes via which networks in the real world form. However, the mathematical framework can be adapted to other uses. Here we have provided two examples.

In the first example, we have considered the problem of designing networks by trying to manipulate the rules by which they evolve. For a certain class of networks, such as peer-to-peer networks, the limited control that this manipulation gives us over network structure may be sufficient to generate significant improvements in network performance. Using generating function methods, we have shown that it is possible to create networks with a desired degree distribution by appropriate choice of the attachment kernel that governs how newly added vertices connect to the network. We studied in detail one particularly simple case of a Poisson network that can be realized in straightforward fashion and allows us to perform decentralized searches in constant time, and makes only constant bandwidth demands per node, even in the limit where the database becomes arbitrarily large.

In the second example, we have shown how to preserve a network’s degree distribution from various forms of attack or failures by allowing it to react to
the disruptions via the introduction of new nodes and edges. Recent empirical studies [27] have suggested that node removal, for example, in the World Wide Web, is typically non-uniform in nature. Unfortunately, as we have seen, non-uniform removal leads to the creation of degree correlations in the network, which makes analysis difficult. To deal with the special case of non-uniform deletion we have introduced a rate equation for the evolution of degree correlations and have used that in combination with the equation for the degree distribution to work around this problem. The structure of many networks in the real world is crucially related to their performance, and consequently, loss of these properties can lead to severe constraints on their performance. In view of this, it is crucial for researchers to come up with effective solutions to try and manage these types of disruptions.

The ideas in this paper have been presented chiefly to demonstrate the use and versatility of network evolution models. There remains much opportunity for other applications than those discussed here, as well as for ways to execute them in the real world. We hope that this will stimulate the imagination of researchers working in the field and look forward to new and exciting developments.
Acknowledgments

The author thanks Mark Newman and Brian Karrer for illuminating discussions. This work was funded by the James S. McDonnell Foundation.
References

1. R. Albert and A.-L. Barabási, Rev. Mod. Phys. 74, 47 (2002).
2. S. N. Dorogovtsev and J. F. F. Mendes, Adv. Phys. 51, 1079 (2002).
3. M. E. J. Newman, SIAM Review 45, 167 (2003).
4. D. J. Watts and S. H. Strogatz, Nature 393, 440 (1998).
5. R. J. Williams and N. D. Martinez, Nature 404, 180 (2000).
6. R. Albert, H. Jeong and A.-L. Barabási, Nature 401, 130 (1999).
7. D. J. de S. Price, Science 149, 510 (1965).
8. S. Redner, Eur. Phys. J. B 4, 131 (1998).
9. D. J. de S. Price, J. Amer. Soc. Inform. Sci. 27, 292 (1976).
10. A.-L. Barabási and R. Albert, Science 286, 509 (1999).
11. E. Ben-Naim and P. L. Krapivsky, J. Phys. A: Math. Theor. 40, 8607 (2007).
12. C. Moore, G. Ghoshal and M. E. J. Newman, Phys. Rev. E 74, 036121 (2006).
13. J. Saldaña, Phys. Rev. E 75, 027102 (2007).
14. T. Hong, in Peer-to-Peer, Harnessing the Benefits of a Disruptive Technology, edited by Andy Oram (O’Reilly, Sebastopol, CA, 2001), Chap. 14, pp. 203–241.
15. L. A. Adamic, R. M. Lukose, A. R. Puniyani and B. A. Huberman, Phys. Rev. E 64, 046135 (2001).
16. N. Sarshar, P. O. Boykin and V. P. Roychowdhury, Fourth International Conference on Peer-to-Peer Computing, pp. 2–9, Washington, D.C. (2004).
17. G. Paul, T. Tanizawa, S. Havlin and H. E. Stanley, Eur. Phys. J. B 38, 187 (2004).
18. R. Cohen, K. Erez, D. ben-Avraham and S. Havlin, Phys. Rev. Letts. 85, 4626 (2000).
19. D. S. Callaway, M. E. J. Newman, S. H. Strogatz and D. J. Watts, Phys. Rev. Letts. 85, 5468 (2000).
20. M. E. J. Newman and G. Ghoshal, Phys. Rev. Letts. 100, 138701 (2008).
21. B. Rezai, N. Sarshar, V. Roychowdhury and P. Oscar Boykin, Physica A 381, 497 (2007).
22. A. E. Motter, Phys. Rev. Letts. 93, 098701 (2004).
23. P. L. Krapivsky and S. Redner, Phys. Rev. E 63, 066123 (2001).
24. G. Ghoshal and M. E. J. Newman, Eur. Phys. J. B 58, 175 (2007).
25. M. Molloy and B. Reed, Random Struct. Algorithms 6, 161 (1995).
26. B. Karrer and G. Ghoshal, Eur. Phys. J. B 62, 239 (2008).
27. J. S. Kong and V. P. Roychowdhury, e-print arXiv:0711.3263v2.
The Big Friendly Giant: The Giant Component in Clustered Random Graphs

Yakir Berchenko,1 Yael Artzy-Randrup,2 Mina Teicher,1 and Lewi Stone2

1 Interdisciplinary Brain Research Center, Bar Ilan University, Ramat Gan 52900, Israel; [email protected], [email protected]
2 Biomathematics Unit, Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Israel; [email protected], [email protected]
1 Introduction

Network theory is a powerful tool for describing and modeling complex systems, with applications in widely differing areas including epidemiology [16], neuroscience [34], ecology [20] and the Internet [26]. In the field’s early days, one often compared an empirically given network, whose nodes are the elements of the system and whose edges represent their interactions, with an ensemble having the same number of nodes and edges, the most popular example being the random graphs introduced by Erdős and Rényi [11]. As the field matured, it became clear that the naive model above needed to be refined, due to the observation that real-world networks often differ significantly from the Erdős–Rényi random graphs in having a highly heterogeneous non-Poisson degree distribution [5, 15] and in possessing a high level of clustering [33].

Methods for generating random networks with arbitrary degree distributions and for calculating their statistical properties are now well understood. This is usually achieved with the aid of the configuration model [6] and by employing an analysis of a certain branching process based on generating functions [24]. However, clustering, the other property that characterizes real-world networks, remains far less understood. Clustering refers to the relative number of triangles in a network, and is commonly measured by the coefficient introduced in [24] as $C = 3 N_\Delta / N_3$. Here $N_\Delta$ is the total number of triangles in the network, while $N_3$ is the number of connected triples of nodes. This definition has the advantage that C is also the probability that two nodes which connect to a mutual node are connected themselves, thereby forming a triangle whereby “a friend of a friend is also a friend.” The main difficulty when studying clustered networks is that the branching processes, which are at the heart of the generating function formalism of [24], no longer seem applicable due to the formation of short loops, namely triangles.
The lack of obvious analytical tools [16] and techniques for incorporating triangles into random graph models with an arbitrary degree distribution [21] has led researchers to pursue several different avenues. One should mention several of these attempts:

• Giving up on analytic predictions, and conducting instead descriptive studies [30], where various clustering indices are defined and measured for a given real-world network. Resorting to simulations is also quite common [33].
• Considering special cases, which are amenable to analysis. For example, constructing a one-mode projection of a bipartite graph [14, 22], or the framework of [25, 29], which generates exponential random graphs, or Markov random graphs [32], which are flexible but more difficult to analyze.
• There is yet another common but somewhat naive practice: adopting results and criteria from the unclustered case, and wrongly applying these criteria for studying clustered graphs. Relevant is the example concerning the emergence of the giant component (GC), where it was shown [24] that in the usual, unclustered, case there is a GC if the mean number of nodes at a distance two ($z_2$) is larger than the mean number of nodes at a distance one ($z_1$). This result is often (wrongly) taken as the criterion for clustered networks [22, 31], thereby initiating the quest to calculate $z_2$ in the presence of clustering [22, 23, 31].

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_14, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
Here we suggest constructing a branching process that is applicable for networks with triangles [7, 28]. This recent approach seems very promising, and we will pay attention to it, using the formalism of [7], rather than that found in [28]. The latter relies on the restrictive assumption that any two triangles in a network will never share an edge. Even in this limited setting, the results are only applicable for relatively low levels of clustering (C), and the concepts are difficult to interpret and broaden. In Section 2 we review the application of generating functions for unclustered (C = 0) random networks [24] (2.1), and describe the novel free-excess degree formalism for clustered networks [7] (2.2). In Section 3 we discuss criticality in random clustered graphs. Most of this section is devoted to the emergence of the GC, as indeed is the bulk of the literature, but we will also discuss briefly the second critical point (3.2), where the graph becomes connected, which has impact on processes such as synchronization in networks [17]. In Section 4 we show how to estimate the size of the GC as shown in [7]; then we broaden the setting to study the robustness and resilience of the GC, i.e., bond, site and joint bond+site percolation (4.2). In Section 5 we describe our simulations and compare the theory with data from real-world networks. We discuss our findings in Section 6.
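The clustering coefficient $C = 3 N_\Delta / N_3$ defined in the Introduction can be computed directly from an adjacency structure. The following sketch counts connected triples by their central node and checks which of them are closed; the function name and the dictionary-of-sets representation are assumptions made for illustration.

```python
from itertools import combinations

def clustering_coefficient(nbrs):
    """C = 3 * (number of triangles) / (number of connected triples).

    nbrs maps each node to the set of its neighbors. Every triangle closes
    three connected triples (one per central node), so the count of closed
    triples already equals 3 * N_triangles.
    """
    triples = 0
    closed = 0
    for center, around in nbrs.items():
        for u, v in combinations(sorted(around), 2):
            triples += 1
            if v in nbrs[u]:
                closed += 1
    return closed / triples if triples else 0.0
```

For example, a triangle with one pendant node attached has five connected triples, three of which are closed, giving C = 0.6.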
2 Generating Functions A generating function is a clothesline on which we hang up a sequence of numbers for display.. [35]. For an excellent introduction to generating functions the reader is referred to the book generatingfunctionology by Wilf [35]. Here we use the terminology and notation used by Newman and colleagues [24] as it has been adapted for network theory. 2.1 Unclustered Random Networks: C = 0 We begin by reviewing the application of generating functions for unclustered (C = 0) random networks [24]. Define the generating function G0 (x) =
∞
pk xk ,
(1)
k=0
where pk is the probability that a randomly chosen node on the graph has degree k. The distribution pk is assumed to be normalized, so that G0 (1) = 1. The same will be true for all generating functions considered here. Because the probability distribution is normalized and positive definite, G0 (x) is also convergent for all |x| ≤ 1, and hence has no singularities in this region. The function G0 (x), and indeed any probability generating function, has a number of properties that will prove useful in subsequent developments. Moments. The average over the probability distribution generated by a generating function, for instance, the average degree z1 of a node in the case of G0 (x), is given by z1 = k = kpk = G0 (1). (2) k
Thus, if we know the values of the coefficients of a generating function, we can calculate the mean of the probability distribution which it generates.

Powers. If the distribution of a property $k$ of an object is generated by a given generating function, then the distribution of the total of $k$ summed over $m$ independent realizations of the object is generated by the $m$-th power of that generating function. For example, if we choose $m$ vertices at random from a large graph, then the distribution of the sum of the degrees of those vertices is generated by $[G_0(x)]^m$.

Another quantity that will be important to us is the distribution of the degree of the vertices that we arrive at by following a randomly chosen edge. Such an edge arrives at a node with probability proportional to the degree of that node, and the node therefore has a probability distribution of degree proportional to $k p_k$. The correctly normalized distribution is generated by

$$\frac{\sum_k k p_k x^k}{\sum_k k p_k} = x \frac{G_0'(x)}{G_0'(1)}. \tag{3}$$
Y. Berchenko et al.
Beginning at a randomly chosen node and following one of the edges at that node, we reach a neighbor $v_1$. We are interested in the distribution of the outgoing edges of $v_1$, or its "excess degree" (i.e., the node's degree minus one, accounting for the edge we arrived along). Since the probability, $q_k$, to have $k$ outgoing edges is $q_k = (k+1) p_{k+1} / z_1$, the distribution of outgoing edges, or excess degree distribution [24], is generated by the function

$$G_1(x) := \sum_k q_k x^k = \frac{G_0'(x)}{G_0'(1)} = \frac{1}{z_1} G_0'(x), \tag{4}$$

and the average excess degree is thus

$$z_e = \sum_k k q_k = G_1'(1). \tag{5}$$
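As a minimal numerical sketch of Eqs. (1)–(5), the following Python code evaluates $G_0$, $G_1$ and the moments $z_1$ and $z_e$ for a truncated Poisson degree distribution. The Poisson example, the truncation point `kmax` and all function names are illustrative assumptions and are not taken from [24].

```python
import math

def poisson_pk(z, kmax=100):
    """Truncated Poisson degree distribution p_0..p_kmax (illustrative choice)."""
    pk = [math.exp(-z)]
    for k in range(1, kmax + 1):
        pk.append(pk[-1] * z / k)
    return pk

def G0(pk, x):
    """Eq. (1): G0(x) = sum_k p_k x^k."""
    return sum(p * x**k for k, p in enumerate(pk))

def mean_degree(pk):
    """Eq. (2): z1 = G0'(1) = sum_k k p_k."""
    return sum(k * p for k, p in enumerate(pk))

def excess_distribution(pk):
    """Excess degree distribution q_k = (k + 1) p_{k+1} / z1."""
    z1 = mean_degree(pk)
    return [(k + 1) * pk[k + 1] / z1 for k in range(len(pk) - 1)]

def G1(pk, x):
    """Eq. (4): G1(x) = sum_k q_k x^k."""
    return sum(q * x**k for k, q in enumerate(excess_distribution(pk)))

def mean_excess_degree(pk):
    """Eq. (5): ze = G1'(1) = sum_k k q_k."""
    return sum(k * q for k, q in enumerate(excess_distribution(pk)))

pk = poisson_pk(2.0)
z1 = mean_degree(pk)
ze = mean_excess_degree(pk)
```

For a Poisson distribution the two means coincide ($z_1 = z_e$), which the sketch reproduces up to truncation error.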
When the clustering coefficient, $C$, is zero (footnote 1), the probability that any of these outgoing edges connects to the original node that we started at, or to any of its other immediate neighbors, scales as $N^{-1}$ and hence can be neglected in the limit of large $N$ (footnote 2). Thus, making use of the "powers" property described above, the generating function for the probability distribution of the number of second neighbors of the original node can be written as

$$\sum_k p_k [G_1(x)]^k = G_0(G_1(x)). \tag{6}$$

Similarly, the distribution of the third-nearest neighbors is generated by $G_0(G_1(G_1(x)))$, and so on. The average number $z_2$ of second neighbors is

$$z_2 = \left[\frac{d}{dx} G_0(G_1(x))\right]_{x=1} = G_0'(1) G_1'(1) = G_0''(1) = z_1 z_e, \tag{7}$$

where we have made use of the fact that $G_1(1) = 1$.

Footnote 1: More accurately, $\forall \varepsilon > 0$, $\Pr(C > \varepsilon) \to 0$ as $N \to \infty$.

Footnote 2: In Section 2.2, when $C > 0$, we will need a similar observation; namely, that the probability of having a cycle of length four that is not composed of two triangles scales as $N^{-1}$ and hence can also be neglected for large $N$.

2.2 The Free-Excess Degree

The preceding calculations can be modified for application to clustered networks ($C > 0$) [7]. Analogous to the excess degree, beginning at a randomly chosen node $v_0$ and following one of the edges at that node, we reach a neighbor $v_1$. We are now interested in $\{e_i\}_{i=0}^{\infty}$, the distribution of the outgoing edges of $v_1$ that are not connected to a neighbor of $v_0$. Suppose we travel from node $v_0$ along an edge to node $v_1$ having degree $d(v_1) = i + 1$ (i.e., with an excess degree of $i$). The probability that it will have $k$ neighbors that are not connected back to $v_0$ (via a triangle) is

$$\binom{i}{k} (1 - C)^k C^{i-k}. \tag{8}$$
This is just the probability that, of the $i$ outgoing edges of $v_1$, $i - k$ are connected in a triangular formation that includes $v_0$, while the other $k$ edges are not. Here, as before, $C$ is just the probability of a triangular formation. When $d(v_1)$ is not known, from (8) we obtain

$$e_k := \sum_{i=0}^{\infty} \binom{i}{k} (1 - C)^k C^{i-k} q_i = \sum_{i=0}^{\infty} q_i C^i \binom{i}{k} \left(\frac{1-C}{C}\right)^k. \tag{9}$$

The generating function, $G_c(x)$, for the distribution is

$$G_c(x) := \sum_{k=0}^{\infty} e_k x^k = \sum_{k=0}^{\infty} \sum_{i=0}^{\infty} q_i C^i \binom{i}{k} \left(\frac{1-C}{C}\right)^k x^k. \tag{10}$$

The order of summation may be changed to obtain

$$G_c(x) = \sum_{i=0}^{\infty} q_i C^i \sum_{k=0}^{\infty} \binom{i}{k} \left(\frac{1-C}{C}\, x\right)^k. \tag{11}$$

Using the binomial theorem we obtain

$$G_c(x) = \sum_{i=0}^{\infty} q_i C^i \left(1 + \frac{1-C}{C}\, x\right)^i = \sum_{i=0}^{\infty} q_i \bigl(C + (1 - C)x\bigr)^i = G_1\bigl(C + (1 - C)x\bigr). \tag{12}$$

Thus, we arrive at the key relationship

$$G_c(x) = G_1\bigl(C + (1 - C)x\bigr). \tag{13}$$
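The derivation of (9)–(13) can be checked numerically. The sketch below builds the free-excess distribution $e_k$ directly from the binomial sum in (9) and compares its generating function with $G_1(C + (1 - C)x)$. The excess degree distribution `q`, the value of `C` and the function names are hypothetical choices made only for illustration.

```python
import math

def free_excess_distribution(q, C):
    """Eq. (9): e_k = sum_i binom(i, k) (1 - C)^k C^(i - k) q_i."""
    n = len(q)
    return [
        sum(math.comb(i, k) * (1 - C)**k * C**(i - k) * q[i] for i in range(k, n))
        for k in range(n)
    ]

def gen_func(coeffs, x):
    """Evaluate sum_k coeffs[k] x^k."""
    return sum(c * x**k for k, c in enumerate(coeffs))

def mean(coeffs):
    """Mean of the distribution given by coeffs."""
    return sum(k * c for k, c in enumerate(coeffs))

# Hypothetical excess degree distribution q_0..q_3 and clustering coefficient.
q = [0.1, 0.3, 0.4, 0.2]
C = 0.25

e = free_excess_distribution(q, C)
x = 0.6
lhs = gen_func(e, x)                    # Gc(x) from its definition, Eq. (10)
rhs = gen_func(q, C + (1 - C) * x)      # G1(C + (1 - C) x), Eq. (13)
```

Here `lhs` and `rhs` agree to rounding error, and `mean(e)` equals `(1 - C) * mean(q)`.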
Let us remark that in deriving (8)–(13), it is possible to use any other clustering index, such as $c(k)$, the degree-dependent clustering coefficient used in [28]. However, it might be hard, if not impossible, to obtain a solution with such a simple closed form. As an example of how (13) may be useful, it is possible to determine the mean free-excess degree:

$$\sum_i i e_i = \left.\frac{dG_c(x)}{dx}\right|_{x=1} = (1 - C) G_1'(1) = (1 - C) z_e. \tag{14}$$

Similarly, it will prove useful to calculate the mean number of edges emanating outwards from nodes at a distance one to nodes at a distance two, beginning from some arbitrary source node (note that this is not the mean number of nodes at a distance two, because there is a positive probability that two edges reach the same node at a distance two). Similarly to (6) and (7), the mean is

$$\left.\frac{dG_0(G_c(x))}{dx}\right|_{x=1} = G_0'(1) G_1'(1) \cdot (1 - C) = (1 - C) z_1 z_e. \tag{15}$$
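As a worked illustration (our own example, assuming a Poisson degree distribution, for which $G_0(x) = G_1(x) = e^{z_1(x-1)}$ [24]), relations (13)–(15) take a particularly simple form:

```latex
\begin{align*}
G_0(x) &= G_1(x) = e^{z_1 (x-1)}, \qquad z_e = z_1, \\
G_c(x) &= G_1\bigl(C + (1-C)x\bigr) = e^{z_1 (1-C)(x-1)}, \\
\sum_i i\, e_i &= (1-C)\, z_1, \qquad
\left.\frac{dG_0(G_c(x))}{dx}\right|_{x=1} = (1-C)\, z_1^2 .
\end{align*}
```

In this special case the free-excess degree is again Poisson distributed, with its mean reduced by the factor $(1 - C)$.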
This parameter was also calculated in [23] by a different technique, but as will be discussed shortly, its importance appears to have been overlooked.
3 The Critical Point

The interest in random graph theory was initiated by, and is in great debt to, a striking discovery by Erdős and Rényi [11]. They studied the following simple model of a network, referred to as $G_{N,p}$, or simply as the ER random graph: take some number $N$ of nodes and connect each pair with probability $p$ (footnote 3), thus defining a probability measure over the ensemble of all such graphs. Erdős and Rényi demonstrated what is considered to be one of the most important properties of the random graph, namely that it possesses a phase transition, from a low-$p$ state ($p(N) < (1 - \varepsilon)/N$) in which all components are small (of size $o(N)$), to a high-$p$ state ($p(N) > (1 + \varepsilon)/N$) in which an extensive fraction of all nodes (i.e., $\Theta(N)$) are joined together in a single GC. This result has been extended by Molloy and Reed [18, 19] and in [1] to graphs with an arbitrary degree distribution, thus making them more applicable for analyzing real-world networks. Here we examine the critical point, where a GC emerges, in the context of clustered networks (Section 3.1). There is yet another interesting point, though less studied, where the graph becomes connected, i.e., there is a path from each node to any other node. For the ER graph $G_{N,p}$, this occurs when $p = \ln(N)/N$ [8]. In Section 3.2 we shall briefly discuss this issue for clustered networks.

3.1 The Emergence of the GC

In their seminal paper, Molloy and Reed [18] introduced the parameter $Q := \sum_i i p_i (i - 2)$, which identifies the phase transition in random graphs, i.e., the point where a GC is born. Their procedure utilizes a method for constructing a random graph, which may be viewed as "walking through a graph" (Fig. 1a) and assessing the number of unknown nodes encountered along the way. Suppose one follows a random edge to a node $v$ having degree $k$. How does this change the number of unknown nodes? First of all, by arriving at $v$ the number of unknown nodes decreases by one.
However, because $v$ itself has degree $k$, this leads to an increase of $(k - 1)$ in the number of unknown nodes. The net effect is that the number of unknown nodes increases by $(k - 2)$. In order to calculate the expected change, the probability of arriving at $v$, which is proportional to the degree $k$, must also be factored in. This makes the expected increase in the number of unknown neighbors proportional to $Q = \sum_i i p_i (i - 2)$. If $Q$ is positive, then with each step of the walk through the graph the number of unknown nodes, and the size of the component, grows larger: the hallmark traits of the GC. If $Q$ is negative, then the number of unknown neighbors reduces to zero; therefore, we are not walking through a GC.

Footnote 3: $p$ is usually a function of $N$, $p(N)$.

Fig. 1. Graphical illustration of the exposure procedure. Choose a node at random, say $V_0$, and start diffusing from it and counting the nodes encountered on the way. a) When $C = 0$ and the network is tree-like (see footnote 1), after counting the new nodes ($a_1$–$a_4$) we pick one of them at random, say $a_1$, and count its new neighboring nodes ($b_1$–$b_3$), which are distributed according to $\{q_i\}_{i=0}^{\infty}$. In the next step, we randomly choose one of the nodes ($a_2$–$a_4$, $b_1$–$b_3$) and continue until the entire component is exposed. b) When $C > 0$, two modifications are required to deal with cycles due to triangles (the dashed edges): we use $\{e_i\}_{i=0}^{\infty}$ and diffuse depthwise. After counting $a_1$–$a_4$, when we count the neighbors of $a_1$ we avoid overcounting $a_2$ because $\{e_i\}_{i=0}^{\infty}$ governs the distribution of the solid-black edges. In the next step, if we go from $a_1$ to $b_3$ in order to count the neighbors of $b_3$, again we avoid overcounting $a_2$ (because it is connected to $a_1$). The depthwise exposure, which is a permissible scheme [18], is used to avoid dependencies.

Recalling earlier definitions, the condition $Q > 0$ may be stated as

$$z_e > 1. \tag{16}$$

Since in unclustered ($C = 0$) networks $z_e = z_2 / z_1$, Ref. [24] advocates the following equivalent criterion.

Criterion A. There is a GC in random networks if $z_2 > z_1$, i.e., the mean number of second-nearest neighbors is greater than the mean number of neighbors.

This has the intuitive epidemiological interpretation: if the mean number of infected individuals grows with distance from the source, an epidemic outbreak will occur. In [7] we have adapted Molloy and Reed's procedures in a manner that makes them applicable for clustered networks. Again, suppose we follow a random edge that begins from a source node and ends at some node $v$. Previously, if $v$ had degree $k$, the number of "unknown" neighbors would increase by $k - 2$. However, with triangles there is a possibility that some of the $k - 1$ outgoing edges will return to nodes that are already known (via dashed edges
in Fig. 1b). It is possible to avoid counting these nodes twice by counting them in a manner that considers the free-excess degree distribution $e_k$. Thus, when a node $v$ of free-excess degree $i$ is encountered, the number of "unknown" neighbors increases by $i - 1$, and the expected increase in the number of unknown neighbors is thus proportional to $Q_c = \sum_i e_i (i - 1)$. The criterion for the GC in a clustered network is just $Q_c > 0$. However, from (14), this condition becomes

$$(1 - C) z_e > 1, \tag{17}$$

which differs from (16) by the scale factor $(1 - C)$. Multiplying both sides by $z_1$, we obtain $(1 - C) z_1 z_e > z_1$. Recalling (15), this may be interpreted as the following criterion.

Criterion B. There is a GC if the mean number of edges emanating outwards from nodes at a distance one to nodes at a distance two (beginning from some arbitrary source node) is larger than the mean degree.
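The two conditions can be evaluated directly from a degree distribution. The following is a minimal sketch, assuming the distribution is given as a list `pk` indexed by degree; the function names and the degree-3 example are hypothetical and serve only to illustrate conditions (16) and (17).

```python
def molloy_reed_Q(pk):
    """Q = sum_i i p_i (i - 2); Q > 0 signals a GC when C = 0."""
    return sum(i * p * (i - 2) for i, p in enumerate(pk))

def mean_excess_degree(pk):
    """ze = sum_k k (k - 1) p_k / sum_k k p_k."""
    z1 = sum(k * p for k, p in enumerate(pk))
    return sum(k * (k - 1) * p for k, p in enumerate(pk)) / z1

def clustered_Qc(pk, C):
    """Qc = sum_i e_i (i - 1), which by Eq. (14) equals (1 - C) ze - 1."""
    return (1 - C) * mean_excess_degree(pk) - 1

def has_giant_component(pk, C=0.0):
    """Condition (17): (1 - C) ze > 1."""
    return clustered_Qc(pk, C) > 0

def critical_mean_degree_poisson(C):
    """For Poisson networks z1 = ze, so (17) gives z1* = 1 / (1 - C)."""
    return 1.0 / (1.0 - C)

# Hypothetical example: every node has degree 3, so ze = 2.
pk = [0.0, 0.0, 0.0, 1.0]
```

In this example a GC is predicted for $C = 0.4$ but not for $C = 0.6$, since $(1 - C) z_e$ drops below one in between.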
Note that in the epidemiological sense, the emphasis is on the growth in the number of outward edges or transmission routes from a typical source node to its neighbors, and then to its neighbors' neighbors (Fig. 2a). Although criterion A was previously used for clustered networks without proper justification [31, 22], Fig. 3a shows that it provides poor predictions of the critical mean degree $z_1^*$ as a function of the clustering, $C$ (predictions are made using estimates of $z_2$ in the presence of clustering as detailed in [31, 23]). The accuracy of the prediction can be assessed against simulations (Fig. 3). In contrast, criterion B is a much better predictor, as shown in Fig. 2b and Fig. 3a. The latter plots the analytic result for a Poisson degree distribution, where $z_1 = z_e$ [24] and $z_1^* = (1 - C)^{-1}$ (from (17)).
Fig. 2. The difference between the new criterion B and the conventional criterion A. a) Consider the following example: a typical node has a neighborhood similar to $V_0$, with 3 nodes at a distance one in the first layer, $L_1$, and 2 nodes at a distance two in the second layer, $L_2$, but 4 edges to the second layer (from $L_1$ to $L_2$). Criterion B predicts a GC, while criterion A fails to predict a GC. b) The size of the largest component plotted vs. $N$ for Poisson networks having mean degree $z_1 = 1.25$ and $C = 0.2$ (i.e., at the critical point according to criterion B). Indeed the size at the critical point correctly scales as $\sim N^{2/3}$, as is known for the case $z_1 = 1$, $C = 0$ (see references in [8]). Note that criterion A would wrongly predict this regime to be below the critical point (since $z_2 \approx 1.19 < z_1$) and would suggest that all components should scale as $O(\log N)$.
Fig. 3. The critical mean degree $z_1^*$ for the formation of a GC, plotted as a function of $C$. a) Poisson degree distribution. Predictions of criterion A (grey line; $z_2$ estimated as in [31]). Predictions of criterion B (black line; $z_1^* = (1 - C)^{-1}$ (see text)). Empirical estimates of $z_1^*$ (circles) were obtained through the following procedure in order to overcome finite size effects: first the value of the size of the largest component was found for networks with $C = 0$ at the known threshold $z_1^* = 1$ (b; dashed line). This value was used to identify the critical threshold in comparable networks with $C > 0$. c) SF degree distribution. Symbols as in a. Black and grey lines, which practically overlap, are based on expressions for $z_1$ and $z_e$ for SF networks [24].
Scale-free (SF) networks, where $p_k \sim k^{-\alpha}$, are usually characterized by their exponent $\alpha$. However, for the purpose of discussing criticality, when $\alpha \approx 3.45$ and the tail of the distribution is not very significant, we can also characterize them by their mean degree. Taking this approach, Fig. 3c shows that, as opposed to the Poisson case, the critical mean degree for SF networks is almost constant as a function of $C$. Its constancy results from the fact that $z_1 \ll z_e$ and that $z_e$ increases to a great extent with a small increase in $z_1$ [24]. However, criterion A, being based on the behavior of the second moment of the distribution as well, gives similar predictions (Fig. 3c) from the same considerations.

3.2 Complete Connectivity

Although the transition to complete connectivity is less well studied, the following example makes clear the need for further work in this area, particularly for clustered networks. In a recent series of papers [12, 17], the effect of clustering on a network of coupled phase oscillators was examined. These authors made the plausible assumption that, by investigating a network with a very high mean degree, their network would be connected. When they [17] found groups of oscillators, each group oscillating at a different frequency, they named them "dynamical clusters," in order to distinguish them from the topological clusters (i.e., connected components).
Fig. 4. Size of the GC vs. $C$ for Poisson networks with $z_1 = 1.5 \ln(N)$ (legend: $N = 100$ and $N = 200$).
However, from the previous section we might be tempted to guess that the second critical point, where the graph becomes connected, scales with $(1 - C)^{-1}$. Unfortunately, while simulations do not confirm our guess for a disintegration at $C^* = 1 - \frac{\ln(N)}{z_1 N}$, they do clearly demonstrate that by introducing clustering to the network, it breaks down quite early (Fig. 4). When conducting studies such as [12, 17], or considering the validity of their implications, one should be especially careful when checking complete connectivity by counting the multiplicity of the eigenvalue 0 of the graph Laplacian (as done in [17]) (footnote 4). In practice, numerical implementations will often return very small, though nonzero, eigenvalues instead of the correct ones [2].
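One way to sidestep the numerical issue is sketched below. Instead of computing floating-point eigenvalues, it obtains the multiplicity of the eigenvalue 0 of the symmetric Laplacian $L = D - A$ as $N - \mathrm{rank}(L)$, with the rank computed in exact rational arithmetic, and cross-checks the result with a breadth-first search. This is a swapped-in technique for illustration, not the method of [17]; the function names and the small example graph are assumptions.

```python
from collections import deque
from fractions import Fraction

def laplacian(n, edges):
    """L = D - A as a dense matrix of exact rationals."""
    L = [[Fraction(0)] * n for _ in range(n)]
    for a, b in edges:
        L[a][a] += 1
        L[b][b] += 1
        L[a][b] -= 1
        L[b][a] -= 1
    return L

def zero_eigenvalue_multiplicity(n, edges):
    """n - rank(L); for the symmetric L this is the multiplicity of eigenvalue 0."""
    M = laplacian(n, edges)
    rank = 0
    for col in range(n):
        pivot = next((r for r in range(rank, n) if M[r][col] != 0), None)
        if pivot is None:
            continue
        M[rank], M[pivot] = M[pivot], M[rank]
        for r in range(rank + 1, n):
            factor = M[r][col] / M[rank][col]
            if factor != 0:
                M[r] = [x - factor * y for x, y in zip(M[r], M[rank])]
        rank += 1
    return n - rank

def count_components_bfs(n, edges):
    """Number of connected components by breadth-first search."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    seen, components = [False] * n, 0
    for s in range(n):
        if seen[s]:
            continue
        components += 1
        seen[s] = True
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)
    return components

# Hypothetical graph on 6 nodes: a triangle, a single edge, and isolated node 5.
edges = [(0, 1), (1, 2), (2, 0), (3, 4)]
```

Exact elimination costs more than a floating-point eigensolver, so for large networks the breadth-first search is the more practical connectivity check.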
Footnote 4: The idea is basically as follows: find the eigenvalues of the matrix $L = D - A$, where $A$ is the graph adjacency matrix and $D$ is a diagonal matrix with the degree of node $j$ as its entry $D_{jj}$; the multiplicity of the eigenvalue 0 is the number of connected components.

4 The Size of the GC and Its Robustness

4.1 The Size of the GC

In order to find the size of the GC, Andersson [3] examined the probability of extinction in a two-phase branching process that mimics the construction of a random graph (with $C = 0$). In this branching process the source node has a number of direct descendants distributed according to $\{p_i\}_{i=0}^{\infty}$ (the first phase), while each of its descendants has a number of direct descendants distributed according to $\{q_i\}_{i=0}^{\infty}$ (the second phase). First, consider the probability $u$ that the lineage of a single branch arriving at some node, $v_1$, eventually dies out. This requires that all $k$ branches leaving $v_1$ die out, an event that occurs with probability $u^k$. Since the degree $k$ of $v_1$ is unspecified, we obtain the self-consistency condition $u = \sum_{k=0}^{\infty} q_k u^k = G_1(u)$, which can be solved to find $u$. The second step takes into consideration that the branching process begins from some arbitrary source node. Because all branches originating from the source must die out in order for the process to become extinct, the probability of extinction (which is equivalent to belonging to a small component) is equal to $G_0(u)$, while the probability of persistence (or belonging to a GC) is $S = 1 - G_0(u)$, which is also the size of the GC.

The preceding argument needs to be modified for clustered networks [7]. For the latter, the probability $u$ for the lineage of a single branch to die out no longer fulfills the condition $u = G_1(u)$, because the progeny in the second phase are no longer distributed by $\{q_i\}_{i=0}^{\infty}$. Instead, we can replace $q_i$ with $e_i$ so that the self-consistency condition is, to a close approximation, $u = G_c(u)$. The error remaining is largely due to higher order correlations between nodes in the branching process that occur with probability of the order of $C^2$ (and even smaller when triangles sharing an edge are known to be rare, as is the focus of Ref. [28]). Indeed $C^2 \ll 1$ in many real-world networks. Thus, we get the following procedure:
(a) Solve for $u$ such that $G_c(u) = u$.
(b) Calculate the GC size as $S = 1 - G_0(u)$.

4.2 The Robustness and Resilience of the GC

Another related question concerns the size of the GC in the presence of dilution, i.e., when a fraction $r$ of the nodes or edges (or a combination of nodes and edges) has been randomly removed (footnote 5). This is understood to be related to the robustness and resilience of networks against breakdowns of their units, the classic example being the World Wide Web. Although the naive identification of functionality with the existence of the GC is sometimes considered problematic (footnote 6), this formalism does have important applications as in, for example, the study of epidemic outbreaks [10]. We can take the same approach as in the previous section and ask again for the probability $u$ that the lineage of a single branch arriving at some node, $v_1$, eventually dies out. In the case of node removal, following an edge in the branching process we reach a node that is unoccupied (was removed) with probability $r_n$.
Therefore, the lineage will die out with probability $r_n$ plus $1 - r_n$ times the probability that all of the lineages of the outgoing edges from $v_1$ eventually die out (found via the self-consistency condition). Thus, step (a) becomes: Solve for $u$ such that $r_n + (1 - r_n) G_1(u) = u$. Similarly considering edge removal with probability $r_e$, replacing the probabilities $\{q_i\}_{i=0}^{\infty}$ with the free-excess probabilities $\{e_i\}_{i=0}^{\infty}$ (or $G_1$ with $G_c$), and demanding that all branches originating from the source eventually die out, we get the size of the GC in clustered networks after joint edge+node removal:
(a) Solve for $u$ such that $1 - (1 - r_n)(1 - r_e) + (1 - r_n)(1 - r_e) G_c(u) = u$.
(b) Calculate the GC size as $S = 1 - r_n - (1 - r_n) G_0(u)$.

When $C = 0$, these equations coincide with those in [9]. Indeed, we feel that our formalism, in contrast to that of [28], has the advantage of being a natural generalization of previous theory [9, 24]. This theory for the size of the GC is evaluated against simulation and real-world data in the next section, showing good agreement.

Footnote 5: Also known respectively as site, bond and joint site+bond percolation.

Footnote 6: Durrett [10] gives a nice critique of the claim that "the internet is robust.. after dilution (in a certain parameters regime) we still get a GC." In the regime referred to, "if all 6 billion people were initially connected then after the removal only 36 people can check their email."
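Steps (a) and (b) can be sketched numerically as follows. The code assumes a Poisson degree distribution, for which $G_0(x) = G_1(x) = e^{z_1(x-1)}$ [24], and solves the self-consistency condition by simple fixed-point iteration starting from $u = 0$. The solver choice, the tolerance and the function name are illustrative assumptions rather than the procedure used in [7].

```python
import math

def gc_size_poisson(z1, C=0.0, r_n=0.0, r_e=0.0, tol=1e-12, max_iter=100000):
    """GC size S for a Poisson network with clustering C and dilution r_n, r_e."""
    def G0(x):
        return math.exp(z1 * (x - 1.0))      # Poisson: G0 = G1

    def Gc(x):
        return G0(C + (1.0 - C) * x)         # Eq. (13) with G1 = G0

    occupied = (1.0 - r_n) * (1.0 - r_e)
    u = 0.0
    # Step (a): iterate u <- 1 - occupied + occupied * Gc(u) up to a fixed point.
    for _ in range(max_iter):
        u_next = 1.0 - occupied + occupied * Gc(u)
        if abs(u_next - u) < tol:
            u = u_next
            break
        u = u_next
    # Step (b): S = 1 - r_n - (1 - r_n) G0(u).
    return 1.0 - r_n - (1.0 - r_n) * G0(u)
```

For example, `gc_size_poisson(2.0)` recovers the classical unclustered result, while `gc_size_poisson(2.0, C=0.3)` and `gc_size_poisson(2.0, r_n=0.2)` both give a smaller GC. Near the critical point the iteration converges slowly, so a root finder may be preferable there.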
5 Simulations and Real Data

Clustered networks were generated by three different methods, all giving similar results, each having its own advantages in terms of efficiency. In all the methods, a degree sequence was generated by sampling from a desired distribution. In two of the methods, a network was constructed according to the generated degree sequence by using a fill algorithm [13]. In one case we then selectively switched links [4] to reach a desired degree of clustering. In the second case, we selectively reconnected links to nodes at distance two, which leads to an increase in the number of triangles. The third method was based on distributing triangles in an empty network under the restrictions of the degree sequence, and later filling in additional links using a fill algorithm [13]. In Fig. 5 we plot simulations against theory for the size of the GC for a variety of parameters. Figure 5a shows the size of the GC vs. the mean degree for different values of $C$, $r_n$ and $r_e$, where $r_n$ and $r_e$ are the fractions of nodes and edges removed, respectively. In order to isolate the effect of clustering, we have also plotted in Fig. 5b the size of the GC vs. $C$ for a fixed mean degree. The most revealing plot is that of the case $r_n = r_e = 0$ (top line in Fig. 5b), where there is good agreement at the lower values of $C$ (i.e., $C < 0.3$),
as well as for its higher values (at $C \approx 0.5$), as opposed to a deviation at intermediate values. This is explained by the fact that initially the $O(C^2)$ error in our approximation is rather small; at intermediate values it can grow (but is still $< C^2$), and towards the critical point it needs to converge back to the exact result, again producing a very small deviation. Notice as well that after dilution the deviations become smaller still (Fig. 5b). This might be explained by the sensitivity of the higher order correlations, which require many edges and are therefore quickly destroyed by dilution. We can also take data from real-world networks and compare their behavior under dilution with the prediction. When doing so, we often find, due to the skewed degree distribution that characterizes many real-world networks and their "denseness," that the network stays almost as one connected unit for a large range of dilution. It is thus not surprising that allowing for clustering does not improve the predictions. A distinct example is given in Fig. 6a, where the size of the GC of the neural network of C. elegans [34] is plotted vs. $r_n$, the fraction of nodes removed. The size of the GC decreases almost linearly with $r_n$. Nevertheless, Fig. 6b, c show two real-world networks, the yeast protein-protein interaction network and Zachary's Karate club [36], where considering the value of $C$ gives an advantage in predicting the size of the GC as a function of dilution.

Fig. 5. The size of the GC after dilution. a) As a function of the mean degree $z_1$ for networks with Poisson degree distribution. A fraction $r_n$ and $r_e$ of the nodes/edges were removed randomly, for $C = 0.1$ and $C = 0.2$ (legend: $C = 0.1$, $r_n = 0$, $r_e = 0$; $C = 0.1$, $r_n = 0.1$, $r_e = 0$; $C = 0.2$, $r_n = 0.2$, $r_e = 0$; $C = 0.2$, $r_n = 0.2$, $r_e = 0.2$). b) As a function of $C$ for networks with Poisson degree distribution and $z_1 = 2$. A fraction $r_n$ and $r_e$ of the nodes/edges were removed randomly (legend: $r_n = 0$, $r_e = 0$; $r_n = 0$, $r_e = 0.2$; $r_n = 0.2$, $r_e = 0$; $r_n = 0.2$, $r_e = 0.2$). Black lines: our prediction for each case.

Fig. 6. The size of the GC after dilution in real-world networks. Grey: simulations with bars at a width of one std; black: our predictions; broken line: the naive predictions which do not consider $C$ (i.e., $C = 0$). a) Node removal for the C. elegans neural network. $N = 453$, $C = 0.124$, $z_1 = 8.9$. b) Edge removal for the yeast protein-protein interaction network. $N = 2112$, $C = 0.055$, $z_1 = 2.1$. c) Joint node+edge removal for the network of Zachary's Karate club. $N = 34$, $C = 0.255$, $z_1 = 4.4$.
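The dilution experiments described in this section can be sketched along the following lines: remove each node with probability $r_n$ and each edge with probability $r_e$, then measure the largest remaining component with a union-find structure. This is a minimal illustration under our own assumptions (edge-list input, fixed default seed, hypothetical function name), not the simulation code behind Figs. 5 and 6.

```python
import random

def largest_component_after_dilution(n, edges, r_n=0.0, r_e=0.0, rng=None):
    """Largest component size after removing nodes w.p. r_n and edges w.p. r_e."""
    rng = rng or random.Random(0)
    alive = [rng.random() >= r_n for _ in range(n)]
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        # Keep an edge only if both endpoints survive and the edge itself survives.
        if alive[a] and alive[b] and rng.random() >= r_e:
            parent[find(a)] = find(b)

    sizes = {}
    for v in range(n):
        if alive[v]:
            root = find(v)
            sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values(), default=0)

# Hypothetical test network: a ring of 10 nodes.
ring = [(i, (i + 1) % 10) for i in range(10)]
```

Averaging the returned size over many random seeds, and dividing by `n`, gives an empirical estimate that can be compared with the predicted $S$.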
6 Discussion Perhaps the most far-reaching result presented here is our criterion B for the existence of the GC. This simple and intuitive criterion (Is the mean number of edges going to the second layer larger than the one going to the first?) is a natural generalization of the well-established Molloy–Reed condition (Is the mean number of nodes at the second layer larger than the one at the first?),
which is often misused. It might be that the Molloy–Reed condition gained much of its appeal due to the interpretation which identifies the existence of a GC with the possibility of a random walker, originating from a source node, to reach a large distance from the source (see as well the related and interesting electrostatic approach [27]). Although grossly oversimplified, we may conjecture that this is true for the general case. Indeed, when inspecting Fig. 2a, for example, we see that in order to have a positive drift away from the source we need not have an increasing number of nodes at each layer—rather an increasing number of edges between layers! We did not study the topological effects of having z2 > z1 in clustered networks. We expect still to find interesting behavior at z2 = z1 from quantities such as the diameter of the network. This is indeed a subject for future work.
Acknowledgments MT and YB are grateful for the support of the EC (project MATHfSS 15661) and DIP (project Compositionality F 1.2). LS and YAR are grateful for the support of the James S. McDonnell Foundation and the Israeli Science Foundation.
References

1. Aiello, W., Chung, F., Lu, L.: A random graph model for massive graphs, Proc. of the 32nd Annu. ACM Symposium on Theory of Computing (2000)
2. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users' Guide, 3rd edition, SIAM, Philadelphia (1999)
3. Andersson, H.: Limit theorems for a random graph epidemic model, Ann. Appl. Probab. 8, 1331–1349 (1998)
4. Artzy-Randrup, Y., Stone, L.: Generating uniformly distributed random networks, Phys. Rev. E 72 (5), 056708 (2005)
5. Barabási, A.-L., Albert, R.: Emergence of scaling in random networks, Science 286, 509–512 (1999)
6. Bender, E. A., Canfield, E. R.: The asymptotic number of labeled graphs with given degree sequences, J. Combin. Theory A 24, 296–307 (1978)
7. Berchenko, Y., Artzy-Randrup, Y., Teicher, M., Stone, L.: The emergence and the size of the giant component in clustered random graphs with a given degree distribution, submitted.
8. Bollobás, B.: Random Graphs, 2nd edition, Academic Press, New York (2001)
9. Callaway, D. S., Newman, M. E. J., Strogatz, S. H., Watts, D. J.: Network robustness and fragility: Percolation on random graphs, Phys. Rev. Lett. 85, 5468 (2000)
10. Durrett, R.: Random Graph Dynamics, Cambridge U. Press, Cambridge, UK (2006)
11. Erdős, P., Rényi, A.: On the evolution of random graphs, Publications of the Mathematical Institute of the Hungarian Academy of Sciences 5, 17–61 (1960)
12. Gómez-Gardeñes, J., Moreno, Y., Arenas, A.: Paths to synchronization on complex networks, Phys. Rev. Lett. 98 (3), 034101 (2007)
13. Gotelli, N. J., Entsminger, G. L.: Swap and fill algorithms in null model analysis: Rethinking the Knight's Tour, Oecologia 129, 281–291 (2001)
14. Guillaume, J. L., Latapy, M.: A realistic model for complex networks, cond-mat/0307095 (2003)
15. Jeong, H., Mason, S., Barabási, A.-L., Oltvai, Z. N.: Lethality and centrality in protein networks, Nature 411, 41–42 (2001)
16. Keeling, M. J.: The effects of local spatial structure on epidemiological invasion, Proc. R. Soc. London B 266, 859–867 (1999)
17. McGraw, P. N., Menzinger, M.: Analysis of nonlinear synchronization dynamics of oscillator networks by Laplacian spectral methods, Phys. Rev. E 75, 027104 (2007)
18. Molloy, M., Reed, B.: A critical point for random graphs with a given degree sequence, Random Structures and Algorithms 6, 161–179 (1995)
19. Molloy, M., Reed, B.: The size of the giant component of a random graph with a given degree sequence, Combin. Probab. Comput. 7, 295 (1998)
20. Montoya, J. M., Solé, R. V.: Small world patterns in food webs, J. Theor. Biol. 214, 405–412 (2002)
21. Newman, M. E. J.: The structure and function of complex networks, SIAM Review 45, 167 (2003)
22. Newman, M. E. J.: Properties of highly clustered networks, Phys. Rev. E 68, 026121 (2003)
23. Newman, M. E. J.: Random graphs as models of networks. In: Bornholdt, S., Schuster, H. G. (eds.) Handbook of Graphs and Networks, Wiley-VCH, Berlin (2003)
24. Newman, M. E. J., Strogatz, S. H., Watts, D. J.: Random graphs with arbitrary degree distributions and their applications, Phys. Rev. E 64 (2001)
25. Park, J., Newman, M. E. J.: Solution for the properties of a clustered network, Phys. Rev. E 72, 026136 (2005)
26. Pastor-Satorras, R., Vázquez, A., Vespignani, A.: Dynamical and correlation properties of the internet, Phys. Rev. Lett. 87, 258701 (2001)
27. Redner, S.: A Guide to First-Passage Processes, Cambridge University Press, New York (2001)
28. Serrano, M. A., Boguñá, M.: Percolation and epidemic thresholds in clustered networks, Phys. Rev. Lett. 97, 088701 (2006)
29. Strauss, D.: On a general class of models for interaction, SIAM Review 28, 513–527 (1986)
30. Vázquez, A.: Growing networks with local rules: Preferential attachment, clustering hierarchy and degree correlations, cond-mat/0211528 (2002)
31. Volz, E.: Networks with tunable degree distribution and clustering, Phys. Rev. E 70, 056115 (2003)
32. Wasserman, S., Pattison, P.: Logit models and logistic regressions for social networks: I. An introduction to Markov random graphs and p*, Psychometrika 61, 401–426 (1996)
33. Watts, D. J., Strogatz, S. H.: Collective dynamics of small-world networks, Nature 393, 440–442 (1998)
34. White, J. G., Southgate, E., Thompson, J. N., Brenner, S.: Structure of the nervous system of the nematode C. elegans, Phil. Trans. R. Soc. London 314, 1–340 (1986)
35. Wilf, H. S.: generatingfunctionology, 2nd edition, Academic Press, London (1994)
36. Zachary, W.: An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33, 452–473 (1977)
Technological Networks Bivas Mitra Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, 721302, India;
[email protected]
1 Introduction The study of networks in the form of mathematical graph theory is one of the fundamental pillars of discrete mathematics. However, recent years have witnessed a substantial new movement in network research. The focus of the research is shifting away from the analysis of small graphs and the properties of individual vertices or edges to consideration of statistical properties of large scale networks. This new approach has been driven largely by the availability of technological networks like the Internet [12], World Wide Web network [2], etc. that allow us to gather and analyze data on a scale far larger than previously possible. At the same time, technological networks have evolved as a socio-technological system, as the concepts of social systems that are based on self-organization theory have become unified in technological networks [13]. In today’s society, we have a simple and universal access to great amounts of information and services. These information services are based upon the infrastructure of the Internet and the World Wide Web. The Internet is the system composed of ‘computers’ connected by cables or some other form of physical connections. Over this physical network, it is possible to exchange e-mails, transfer files, etc. On the other hand, the World Wide Web (commonly shortened to the Web) is a system of interlinked hypertext documents accessed via the Internet where nodes represent web pages and links represent hyperlinks between the pages. Peer-to-peer (P2P) networks [26] also have recently become a popular medium through which huge amounts of data can be shared. P2P file sharing systems, where files are searched and downloaded among peers without the help of central servers, have emerged as a major component of Internet traffic. An important advantage in P2P networks is that all clients provide resources, including bandwidth, storage space, and computing power. In this chapter, we discuss these technological networks in detail. 
The review is organized as follows. Section 2 presents an introduction to the Internet and different protocols related to it. This section also specifies the socio-technological properties of the Internet, like scale invariance, the small-world property, network resilience, etc. Section 3 describes the P2P networks, their categorization, and other related issues like search, stability, etc. Section 4 concludes the chapter.
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_15, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
B. Mitra
2 The Internet
The Internet is a global network connecting millions of computers in a decentralized form. Each Internet computer, called a host, is independent, and its operators can choose any of the commercial Internet service providers (ISPs). Many computer scientists regard the Internet as a “prime example of a large-scale, highly engineered, yet highly complex system” (Fig. 1). The Internet is extremely heterogeneous in nature; for instance, data transfer rates and physical characteristics of connections vary widely. In addition, the Internet evolves through large-scale self-organization. Technically, the Internet can be defined as the network of networks working with the Transmission Control Protocol (TCP)/Internet Protocol (IP). This definition treats the Internet as a purely technological system. However, it overlooks the fact that knowledgeable human activities make the Internet work. Hence, more accurately, the Internet is a global socio-technological system that is based on a technological structure and a set of protocols [13]. Some of the important Internet-based services are e-mail, the World Wide Web, remote access, and Internet telephony.
Fig. 1. Internet as complex network.
Technological Networks
2.1 Protocols Used in the Internet
Once we have more than one computer, it is theoretically possible for them to communicate, provided that the computers ‘speak’ a common language. The Internet uses a suite of communication protocols, of which the two most important are TCP and IP [19]. These protocols have the following responsibilities. First, the protocol defines the basic unit of data transfer, called the ‘datagram’, used throughout the Internet. Thus, it specifies the exact format of all data as it passes across the Internet. Second, the TCP/IP software performs the routing function, choosing a path over which data will be sent. Third, the protocol includes a set of rules that embody the idea of reliable packet delivery over unreliable connections. In addition, these protocols introduce the IP addressing scheme, which is integral to the process of routing datagrams through the Internet to a particular destination host. Each host on a TCP/IP network is assigned a unique 32-bit IP address that is divided into two main parts: the network number and the host number (Fig. 2). The network number identifies a network and must be assigned by the Internet Network Information Center (InterNIC) if the network is to be part of the Internet. An ISP can obtain blocks of network addresses from the InterNIC and can itself assign address space as necessary. The host number identifies a host on a network and is assigned by the local network administrator. To make them easier to remember, IP addresses are normally expressed in decimal format as a ‘dotted decimal number’. The four numbers in an IP address are called octets, because they each have eight bit positions when viewed in binary form. Currently three classes of networks (A, B, C) are commonly used. These classes are distinguished by the number of octets used to identify the network, and also by the range of values taken by the first octet.
If the value of the first octet is 127, it represents the local host, regardless of what network it is really in.
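The split of a dotted decimal address into network and host numbers can be illustrated with a short sketch. The first-octet ranges used here (A: below 128, B: 128 to 191, C: 192 to 223, with 1, 2 and 3 network octets respectively) are the standard classful ranges and are an assumption of this example, since the text does not list them; the function name `classify_ipv4` is hypothetical.

```python
def classify_ipv4(address: str) -> dict:
    """Split a dotted-decimal IPv4 address into classful network/host parts."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a dotted-decimal IPv4 address: {address!r}")

    first = octets[0]
    if first == 127:
        # Reserved for the local host, regardless of the actual network.
        return {"class": "loopback", "network": None, "host": None}
    if first < 128:
        cls, net_octets = "A", 1      # assumed standard range for class A
    elif first < 192:
        cls, net_octets = "B", 2      # assumed standard range for class B
    elif first < 224:
        cls, net_octets = "C", 3      # assumed standard range for class C
    else:
        # Other classes are not discussed in the text.
        return {"class": "other", "network": None, "host": None}

    value = int.from_bytes(bytes(octets), "big")   # the 32-bit address
    host_bits = 8 * (4 - net_octets)
    return {
        "class": cls,
        "network": value >> host_bits,             # leading octets
        "host": value & ((1 << host_bits) - 1),    # remaining octets
    }


if __name__ == "__main__":
    for addr in ("10.1.2.3", "172.16.5.4", "192.168.1.20", "127.0.0.1"):
        print(addr, classify_ipv4(addr))
```

The network and host numbers are returned as integers so that the bit-level split of the 32-bit address stays visible.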
Fig. 2. IP addressing.
2.2 Scale Invariance and Small World Property of the Internet
The topology of the Internet is studied at two different levels. At the router level, the nodes are the routers, and edges are the physical connections between them. At the interdomain (or autonomous system) level, each domain, composed of hundreds of routers and computers, is represented by a single node, and an edge is drawn between two domains if there is at least one route that connects them. The topology of large-scale networks like the Internet is characterized by the degree distribution $p_k$, which is defined as the fraction of nodes in the network having degree k. In 1999, Faloutsos et al. [12] studied the Internet at both levels, concluding that in each case the degree distribution follows a power law (Fig. 3), i.e. $p_k \sim k^{-\gamma}$. The interdomain topology of the Internet, captured at three different dates between 1997 and the end of 1998, resulted in degree exponents between γ = 2.15 and γ = 2.2. The 1995 survey of the Internet topology at the router level, containing 3888 nodes, found γ = 2.48. In 2000, Govindan and Tangmunarunkit [15] mapped the connectivity of nearly 150,000 router interfaces and nearly 200,000 router adjacencies, confirming the power-law scaling with γ = 2.3. It is widely believed that the scale invariance property of the Internet is related to the self-organization property of the participating nodes. The preferential attachment tendency of
Fig. 3. The first data file holds link directions corresponding to the traceroute directions, while the second file is an undirected version of the first file. There are a total of 192,244 nodes, 636,643 directed links, and 609,066 undirected links. The average and maximum node degrees (undirected) are 6.34 and 1071 respectively, and the node degree distribution is plotted.
the nodes to join the network [42] stabilizes the degree distribution as the size of the Internet becomes very large.
Internet as small world. An accurate characterization of the emergent topological properties of the Internet and a better understanding of the underlying processes that yield these characteristics are crucial for proper evaluation of network protocols and systems. In that vein, recent works [20, 5] have shown the prevalence of small-world phenomena [24, 44] in the Internet. Small-world graphs exhibit a high degree of clustering, yet typically have short path lengths between arbitrary vertices. Yook [47] and Pastor-Satorras [32] studied the Internet at the domain/autonomous system level between 1997 and 1999 and found that its clustering coefficient ranges between 0.18 and 0.3, compared to a clustering coefficient of 0.001 for random networks with similar parameters. The average path length of the Internet at this level ranges between 3.70 and 3.77, and at the router level it is around 9, indicating its small-world character. Small-world behavior in the Internet has two possible causes: first, the high variability of the node degree distribution and, second, the preference of vertices for local connections [20]. With a highly variable node degree distribution, it is likely that two interconnected vertices, say u and v, will have a common neighbor, say w, specifically when w is a node with extremely large degree. In that case u, v, and w form a triangle. Such a pattern contributes directly to the clustering coefficients of u, v, and w (i.e. $C_u$, $C_v$, and $C_w$) and results in a larger overall average clustering coefficient C of the network. Thus, C grows with the variability of vertex degree. Also, notice that with highly variable vertex degrees, the average distance between two vertices (L) is short. This happens because the shortest path usually passes through those extremely popular vertices.
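The triangle argument can be checked on a toy graph. The sketch below is a minimal illustration, not a measurement tool: it assumes the usual definition of the clustering coefficient of a vertex (the fraction of pairs of its neighbours that are themselves linked) and computes the average path length by breadth-first search; the five-node graph and all function names are made up for the example.

```python
from collections import deque
from itertools import combinations


def clustering(adj, v):
    """Fraction of pairs of v's neighbours that are linked to each other."""
    nbrs = adj[v]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (len(nbrs) * (len(nbrs) - 1))


def average_path_length(adj):
    """Mean shortest-path length over all ordered pairs of reachable nodes."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:                      # plain breadth-first search
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Toy graph: a hub w linked to every other node, plus the edge u-v,
# so that u, v and w form a triangle.
edges = [("w", "u"), ("w", "v"), ("w", "x"), ("w", "y"), ("u", "v")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

if __name__ == "__main__":
    for node in sorted(adj):
        print(node, round(clustering(adj, node), 3))
    print("L =", average_path_length(adj))
```

In this toy graph the triangle gives u and v a clustering coefficient of 1, while every shortest path between non-adjacent nodes runs through the hub w, which keeps L small.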
Highly popular vertices thus serve as good navigators through the graph. On the other hand, a preference for local connectivity also results in small-world behavior. The reason is that, with a non-negligible probability of a local connection, if a node u is connected to v and w, then it is likely that v and w are also close to each other. As a result, there is a non-negligible probability that a triangle will form among these vertices, resulting in a higher clustering coefficient. Meanwhile, since there are still many long-range connections, it is easy to find a short path between two randomly chosen nodes. In addition, researchers from Stanford University [37] found that as networks grow very large, they become very efficient in the number of steps a data packet takes to get from one node to another. The number of steps grows logarithmically with the size of the network, which means that for 10,000 nodes we need five steps, but for 100 million the number grows only to 6.5. Such networks also exhibit a clustering property, i.e. the relationships among nodes are not randomly distributed, but are grouped. Short path lengths mean that there are some very short paths sprinkled throughout the network that
may directly link one group to another. This conforms to Watts and Strogatz’s model [44], in which a low-dimensional regular lattice is transformed into a small-world network.
2.3 Fault Tolerance of the Internet
The Internet and other communication networks display a high degree of robustness: while key components regularly malfunction, local failures rarely lead to the loss of the global information-carrying ability of the network [3]. It has been observed that network topology plays an important role in the robustness of the Internet. Consider an arbitrary connected graph of N nodes, and assume that a fraction f of the nodes has been removed. This leads to important questions such as: What is the probability that the resulting subgraph is connected, and how does it depend on the removal probability f? For a broad class of graphs there exists a threshold probability $f_c$ such that if $f < f_c$ the resulting subgraph is connected, but if $f > f_c$ the subgraph becomes disconnected (Fig. 4). Here $f_c$ is termed the percolation threshold. In the following discussion, we will call a network fault tolerant (or robust) if it contains a giant component comprising most of the nodes even after a fraction of its nodes are removed.
2.3.1 Stability Criteria
The topology of the Internet and the failure probability of nodes can be characterized by the probability distributions $p_k$ and $f_k$ respectively. Here $p_k$ is the degree distribution, i.e. the probability that a randomly chosen node has degree k. Similarly, $f_k$ is the probability that a vertex of degree k will be removed from the network. Nodes leave the Internet because they fail [8] or because of attacks mounted on important nodes [9]. Based upon these basic parameters, an analytical framework has been derived to
Fig. 4. Illustration of the effects of node removal on an initially connected network.
examine the stability of the Internet (or any kind of network) where the vertices undergo some dynamics [28]. The analytical framework can be expressed with the help of the following equation:
\[ \sum_{k=0}^{\infty} k\,p_k \bigl( k(1 - f_k) - (1 - f_k) - 1 \bigr) = 0. \tag{1} \]
Equation (1) states the critical condition for the stability of the Internet (characterized by $p_k$) undergoing any type of failure or attack (characterized by $f_k$).
Stability analysis of networks under different node disturbance schemes. The existing empirical and theoretical results indicate that complex networks can be divided into two major classes based on their degree distribution $p_k$. In the first class of networks, $p_k$ peaks at an average degree $\langle k \rangle$ and decays exponentially for large k. The most investigated examples of such exponential networks are the random graph model of Erdős and Rényi [11] and the small-world model of Watts and Strogatz [44], both leading to a fairly homogeneous network. In contrast, results on the Internet, World Wide Web, and other large networks indicate that many systems belong to a class of inhomogeneous networks, referred to as scale-free networks, for which $p_k$ decays as a power law, i.e. $p_k \sim k^{-\gamma}$ [8]. While a node with a very large number of connections ($k \gg \langle k \rangle$) is practically prohibited in exponential networks, highly connected nodes are statistically significant in scale-free networks. In this review, we concentrate on scale-free networks, as this kind of network is widely used to model the Internet. In this section, we consider two types of node removal schemes. The first scheme studies the removal of randomly selected nodes. In this case, the probability of removal of a node of degree k is $f_k = f$ (independent of k) [8]. In the second scheme, the most highly connected nodes are removed at each step. This second scheme emulates an intentional attack on the network [9]. Formally, $f_k = 0$ when $k \le k_{\max}$ and $f_k = 1$ when $k > k_{\max}$, i.e. all the nodes in the network having degree more than $k_{\max}$ are removed. Next we discuss the stability of scale-free networks in the face of failure and attack.
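As a numerical illustration of the critical condition in Eq. (1), the sketch below evaluates its left-hand side for a given pair of distributions and solves it for the random failure case. The closed form used in `random_failure_threshold` is our own rearrangement of Eq. (1) with the same removal probability f for every degree, which gives 1 − f = ⟨k⟩ / (⟨k²⟩ − ⟨k⟩); the truncated Poisson and power-law distributions, the cut-off values, and all function names are assumptions made for the example.

```python
import math


def stability_lhs(pk, fk):
    """Left-hand side of Eq. (1): sum_k k p_k (k(1 - f_k) - (1 - f_k) - 1)."""
    return sum(k * p * (k * (1 - f) - (1 - f) - 1)
               for k, (p, f) in enumerate(zip(pk, fk)))


def random_failure_threshold(pk):
    """Value of f that satisfies Eq. (1) when f_k = f for every degree k."""
    k1 = sum(k * p for k, p in enumerate(pk))        # <k>
    k2 = sum(k * k * p for k, p in enumerate(pk))    # <k^2>
    return 1.0 - k1 / (k2 - k1)


def poisson_pk(mean, kmax):
    """Exponential-type network: Poisson degrees, truncated at kmax."""
    return [math.exp(-mean) * mean ** k / math.factorial(k)
            for k in range(kmax + 1)]


def power_law_pk(gamma, kmin, kmax):
    """Scale-free network: p_k proportional to k^(-gamma) on [kmin, kmax]."""
    weights = [0.0] * kmin + [k ** -gamma for k in range(kmin, kmax + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]


if __name__ == "__main__":
    homogeneous = poisson_pk(4.0, 60)
    print("Poisson, <k> = 4:", random_failure_threshold(homogeneous))
    for kmax in (100, 1000, 10000):
        scale_free = power_law_pk(2.5, 1, kmax)
        print("power law, kmax =", kmax, random_failure_threshold(scale_free))
```

With these assumed parameters the threshold of the power-law network moves towards 1 as the degree cut-off grows, whereas the Poisson network stays at a fixed value. `stability_lhs` also accepts any other removal vector, such as the step-shaped $f_k$ of the attack scheme.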
The stability is measured by the change in the size of the giant component S and the average path length l after removal of the fraction of nodes. The maximum reduction in the size of the giant component indicates the breakdown of the network. Stability against random failure. We start by investigating the stability of scale-free network to random removal of nodes, looking at the changes in the relative size of the giant component S and the average path length l [8]. In a scale-free network, the size of the giant component S decreases slowly from S = 1 as the fraction of nodes removed f increases (see Fig. 5). In random failure, most of the removed nodes in the network have low degree; hence, they have little impact upon the size of the giant component S. Eventually,
Fig. 5. The size of the giant component S and average path length l of an initially connected network when a fraction f of the nodes are removed. Scale-free network generated by the scale-free model with N = 10,000 and $\langle k \rangle = 4$. Squares indicate random node removal, while circles correspond to preferential removal of the most connected nodes [3].
S reaches 0 at some higher f, which is denoted as the percolation threshold $f_c$. Analytical calculations indicate that $f_c \to 1$ as the size of the network increases to infinity. In simple terms, scale-free networks display an exceptional robustness against random node failures. On the other hand, the average path length l increases with the fraction of removed nodes f, as paths are disrupted in the network, and eventually l peaks at the percolation threshold $f_c$. In random failure, the average path length l increases slowly with f; hence, its peak becomes less prominent. After the network breaks into isolated components, l decreases as well, since in this regime the size of the largest component gradually decreases.
Stability against intentional attack. In the case of intentional attack, the nodes with the highest degrees are targeted for removal. Naturally, in this kind of attack, the network breaks down into components faster than in the case of random failure. The stability of scale-free networks depends mainly upon a few highly connected nodes. Removal of these key nodes during an intentional attack severely affects the stability of scale-free networks [9]. This is also evident from the behavior of the average path length l, which increases rapidly and reaches its peak at the percolation threshold $f_c$. After the network breaks into isolated components, l decreases quickly, since in this regime the size of the largest component decreases.
2.4 Spreading of Viruses in the Internet
Computer viruses and worms pose serious challenges to the network research community. In computer science jargon, ‘virus’ refers to malicious software that spreads from computer to computer and can halt or hinder operations at numerous businesses and other organizations, disrupt
cash-dispensing machines, delay airline flights, and even affect emergency call centers [41, 23, 4]. The structure of contact networks affects the rate and extent of spreading of computer viruses, just as it does for human diseases; understanding this structure is a key element in the control of infection. Thus, recent work on epidemiological models has emphasized virus spread in scale-free networks, in which the degree distribution follows a power law [16]. Various epidemic models available in the literature can be used to formalize the spread of viruses in a network [33]. In these models, the susceptible (S) individuals do not have the disease but can contract it if they come in contact with infected (I) individuals. The infected individuals may gain permanent or temporary immunity after some time period and become recovered (R). The R individuals do not take part in disease transmission. Various epidemic dynamics such as SI, SIS, SIR, and SIRS exist in the literature [35, 36]. In SI dynamics, the number of infected individuals increases until all the S individuals become infected. If the I individuals in SI dynamics become susceptible again after some time period, SIS dynamics results [34]. Computer viruses mostly fall into this category; they can be ‘cured’ by antivirus software, but without a permanent virus-checking program the computer has no way to fend off subsequent attacks by the same virus. Let us assume that any susceptible individual has a uniform probability β per unit time of being infected by any infected one, and that infected individuals recover and become immune at some stochastically constant rate γ. Then s, i, and r, the fractions of nodes in the states S, I, and R respectively, are governed by the following differential equations (r follows from s + i + r = 1):
\[ \frac{ds}{dt} = -\beta i s, \qquad \frac{di}{dt} = \beta i s - \gamma i. \tag{2} \]
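To see how the fractions in Eq. (2) evolve, the equations can be integrated numerically. The sketch below uses a plain forward Euler step and recovers r from s + i + r = 1; the step size, time horizon, and parameter values are arbitrary choices for illustration, and `simulate_sir` is a hypothetical helper name.

```python
def simulate_sir(beta, gamma, i0, dt=0.01, t_end=100.0):
    """Forward-Euler integration of Eq. (2); returns a list of (t, s, i, r)."""
    s, i = 1.0 - i0, i0
    history = [(0.0, s, i, 0.0)]
    steps = int(round(t_end / dt))
    for n in range(1, steps + 1):
        ds = -beta * i * s                 # ds/dt = -beta * i * s
        di = beta * i * s - gamma * i      # di/dt = beta * i * s - gamma * i
        s += dt * ds
        i += dt * di
        history.append((n * dt, s, i, 1.0 - s - i))   # r = 1 - s - i
    return history


if __name__ == "__main__":
    run = simulate_sir(beta=0.5, gamma=0.1, i0=0.01)
    peak_t, _, peak_i, _ = max(run, key=lambda row: row[2])
    print("peak infected fraction", round(peak_i, 3), "at t =", round(peak_t, 1))
    print("final recovered fraction", round(run[-1][3], 3))
```

With β smaller than γ the infected fraction in this sketch never rises above its initial value, since βs − γ is then negative for every s ≤ 1.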
The classical SIS model can be applied to a networked system in which the infection probability of a node is not constant but varies with its degree. The quantity βi represents the average rate at which a susceptible individual becomes infected by its neighbors. If λ is the rate of infection via contact with a single infective node and θ(λ) is the probability that a neighbor of a degree-k susceptible node is infective, then the average rate of infection of the degree-k susceptible node becomes βi = kλθ(λ). An implicit expression for θ(λ) is given in [35]:
\[ \frac{\lambda}{z} \sum_{k} \frac{k^{2} p_k}{1 + k\lambda\theta(\lambda)} = 1, \tag{3} \]
where z is the average degree and $p_k$ is the degree distribution. For particular choices of $p_k$, this equation can be solved for θ(λ) either exactly or approximately. For instance, for a power-law degree distribution, Pastor-Satorras and Vespignani [34] solve it by making an integral approximation, and hence show
that there is no non-zero epidemic threshold for the SIS model in the power-law case, i.e. the disease will always persist, regardless of the value of the infection rate parameter. They have also generalized the solution to a number of other cases, including other degree distributions, finite-sized networks, and models that include vaccination of some fraction of individuals [35, 36]. In the latter case, they tackle both random vaccination and vaccination targeted at the vertices with highest degree. The results show that the propagation of the disease is relatively robust against random vaccination, at least in networks with right-skewed degree distributions, but highly susceptible to vaccination of the highest-degree individuals.
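For a finite degree distribution, Eq. (3) can also be solved for θ(λ) numerically rather than by an integral approximation. The sketch below is one way to do this under stated assumptions: it uses bisection on θ in [0, 1], relying on the fact that the left-hand side of Eq. (3) decreases as θ grows, and it returns 0 when the left-hand side is already at most 1 for θ = 0 (no endemic state). The function name `theta` and the example distributions are hypothetical.

```python
def theta(lam, pk, tol=1e-12):
    """Solve Eq. (3), (lam/z) * sum_k k^2 p_k / (1 + k*lam*theta) = 1."""
    z = sum(k * p for k, p in enumerate(pk))          # average degree

    def g(th):
        total = sum(k * k * p / (1.0 + k * lam * th)
                    for k, p in enumerate(pk))
        return lam / z * total - 1.0

    if g(0.0) <= 0.0:
        return 0.0            # Eq. (3) has no positive solution
    lo, hi = 0.0, 1.0         # g(lo) > 0 and g(hi) < 0 bracket the root
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    regular = [0.0, 0.0, 0.0, 0.0, 1.0]               # every node has degree 4
    print("regular graph:", theta(0.5, regular), theta(0.2, regular))

    # Truncated power law with exponent 2.5 (illustrative cut-offs only).
    for kmax in (100, 10000):
        w = [0.0] + [k ** -2.5 for k in range(1, kmax + 1)]
        pk = [x / sum(w) for x in w]
        print("power law, kmax =", kmax, theta(0.05, pk))
```

Setting θ = 0 in Eq. (3) shows that a positive solution exists only when λ exceeds z/⟨k²⟩, so a broad degree distribution with large ⟨k²⟩ admits an endemic state at very small λ, in line with the vanishing threshold described above.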
3 Peer-to-Peer Networks
In a client-server architecture, each computer or process in the network is either a client or a server. A large number of clients request and receive service from the servers, and a fixed set of servers provides the service to those clients. Peer-to-peer (P2P) networks (shown in Fig. 6) provide a different paradigm of computer networks, where each workstation has equivalent capabilities and responsibilities [26, 6]. P2P networks distribute responsibility among the participants in a network and pool the bandwidth of the participants rather than using conventional centralized resources. An important advantage of this kind of network is that all clients provide resources, including bandwidth, storage space, and computing power. Thus, as nodes arrive and demand on the system increases, the total capacity of the system also increases. This is not true for a traditional client-server architecture, in which adding more clients could mean slower data transfer for all users. In addition, popular items (like songs and movies) become replicated over multiple peers due to repeated exchange of items, which increases the robustness of the shared items in the face of frequent joining and leaving of peers (termed peer churn).
Fig. 6. Client-server model and P2P model.
Overlay networks. Peers in P2P networks are typically connected via ad hoc overlay connections. If a participating peer knows the location of another peer in the network, then there is a link from the former node to the latter in the overlay network. Based on how the nodes in the overlay network are linked to each other, current P2P architectures can be classified into three types [43]: centralized, decentralized and structured, and decentralized but unstructured.
1. Centralized: All object index items are kept in a centralized server in the form of object key, node address, etc. Each arriving node needs to actively notify this server about the objects it holds. Therefore, the querying node only needs to consult the central server to obtain the address of a peer containing the searched object. In order to download the searched object, the querying node directly establishes a connection with that peer and downloads the item. This type of P2P architecture is very simple and easy to deploy. But it has the problem of a single point of failure, although several parallel servers can be used. An example of this network type is Napster [31].
2. Decentralized and structured: A structured P2P network employs a globally consistent protocol to ensure that any node can efficiently route a search query to a peer that has the desired file. Most structured P2P networks are based on the distributed hash table (DHT), in which a variant of consistent hashing is used to assign ownership of each file to a particular peer [27]. A DHT is a hash table whose table entries are distributed among different peers located in arbitrary locations. Each data item is hashed to a unique numeric key. Each node is also hashed to a unique ID in the same key space. Each node is responsible for a certain number of keys; that is, the responsible node stores the key and a pointer to the data item with that key. Keys are mapped to their responsible nodes.
The searching and routing algorithms support two basic operations: lookup(key) and put(key); lookup(k) is used to find the location of the node that is responsible for the key k, and put(k) is used to store a data item (or a pointer to the data item) with the key k in the node responsible for k. Because searches in structured systems follow well-defined neighbor links, these systems guarantee that existing data is found within a bounded number of overlay hops. However, the strict network structure imposes high overhead for handling the dynamics caused by peer churn. Some well-known DHT-based structured P2P networks are Chord, Pastry, Tapestry, CAN, and Tulip.
3. Decentralized and unstructured: An unstructured and decentralized P2P network is formed when the overlay links are established arbitrarily. As no special network structure needs to be maintained, unstructured P2P systems are extremely resilient to peer churn. Searching in unstructured networks is often based on flooding or its variations because there is no control over data storage [26]. The main disadvantage of such networks is that queries may not always be resolved. Popular content is likely to
be available at several peers, but if a peer is looking for rare data shared by only a few other peers, then it is highly unlikely that the search will be successful [10]. Since there is no correlation between a peer and the content managed by the peer, there is no guarantee that flooding will find a peer that has the desired data. However, due to the high dynamicity of peers, robustness is given the topmost priority. Most of the popular P2P networks such as Gnutella and FastTrack are unstructured in nature [14]. In addition, superpeer topologies have emerged as the most influential unstructured networks. Here some peers, called dominating nodes or superpeers, serve the search requests of other regular peers [39, 46]. Most commercial systems, such as KaZaA and Skype, have adopted superpeers in their design. In these systems, superpeer nodes with higher bandwidth and connectivity connect to each other, forming the upper level in the network hierarchy. Each superpeer node provides service to a set of regular peers, which form the lower level of the network hierarchy.
3.1 Peer-to-Peer Search Schemes
Searching is one of the most important services provided by P2P networks, where users try to locate a desired object in the network. Existing P2P systems support simple object lookup by key or identifier. Some existing P2P systems can handle more complex keyword queries, which find documents containing the keywords in the query. Searching techniques are primarily forwarding based. Starting with the requesting node, a query is forwarded or routed until the node which has the desired object is reached. To forward query messages, each node must keep information about some other nodes called neighbors. The information about these neighbors constitutes the routing table of a node.
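The key-based lookup mentioned here, in the DHT form described for structured networks, can be sketched in a few lines. This is a deliberately simplified, single-process model and not the protocol of any particular system: node names and item names are hashed into one key space, a key is assigned to the first node ID at or after it on a ring (a successor rule chosen for the example), and every node ID is assumed to be globally known, so no hop-by-hop routing is modelled. `ToyDHT`, `hash_to_key`, `get`, and `KEY_BITS` are hypothetical names; `put` takes the item name and a location, whereas the text writes put(key).

```python
import hashlib
from bisect import bisect_left

KEY_BITS = 32   # size of the shared key space (an assumption of this sketch)


def hash_to_key(name: str) -> int:
    """Hash a node name or item name into the common key space."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << KEY_BITS)


class ToyDHT:
    def __init__(self, node_names):
        # Nodes are placed on a ring, ordered by their hashed IDs.
        self.ring = sorted((hash_to_key(n), n) for n in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        self.store = {name: {} for name in node_names}   # per-node tables

    def lookup(self, key: int) -> str:
        """Node responsible for key: first node ID >= key, wrapping around."""
        index = bisect_left(self.ids, key) % len(self.ring)
        return self.ring[index][1]

    def put(self, item_name: str, location: str) -> str:
        """Store a pointer to the item at the node responsible for its key."""
        key = hash_to_key(item_name)
        owner = self.lookup(key)
        self.store[owner][key] = location
        return owner

    def get(self, item_name: str):
        """Fetch the stored pointer from the responsible node, if any."""
        key = hash_to_key(item_name)
        return self.store[self.lookup(key)].get(key)


if __name__ == "__main__":
    dht = ToyDHT(["peer-a", "peer-b", "peer-c", "peer-d"])
    owner = dht.put("song.mp3", "held by peer-17")
    print("responsible node:", owner)
    print("lookup result:", dht.get("song.mp3"))
```

Because ownership depends only on the hashed positions, a node joining or leaving changes the owner of only the keys adjacent to it on the ring, which is the property that motivates consistent hashing; the maintenance overhead under churn mentioned in the text comes from the routing state that this sketch leaves out.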
The desired features of searching algorithms in P2P systems include high-quality query results, minimal query packet overhead, high routing efficiency, load balance, resilience to node failures, and support of complex queries. The quality of query results is application dependent. Generally, it is measured by the number of results and relevance. The query packet overhead signifies the amount of packets generated in the network to satisfy a specific search query. The routing efficiency is generally measured by the number of overlay hops per query. Different searching techniques make different trade-offs between these desired characteristics. Searching in structured P2P networks follows the well-defined neighboring links to locate some specific object. This provides guarantees on finding existing data and bounds data lookup efficiency in terms of the number of overlay hops. But it shows poor performance in the dynamic condition where peers join and leave the network quite frequently. Searching in the unstructured P2P systems is more challenging, as the overlay network does not follow any structure dependent on the data storage. Searching techniques in unstructured networks can be classified as either flooding based or random walker
based. Broadly, flooding-based techniques are the fastest but incur the highest overhead, whereas random-walk-based schemes have low overhead but are the slowest. The two techniques thus lie at opposite ends of the efficiency/speed spectrum. The following sections describe flooding techniques and their variations, and also the random-walk-based techniques.
3.1.1 Flooding-Based Search Techniques
Searching in unstructured P2P networks is often based on flooding or its variations because there is no control over the location of objects. In these techniques, query packets are propagated to all neighbors within a certain radius until the desired object is found. However, the blind flooding mechanism generates large numbers of redundant query packets in the network, which wastes valuable bandwidth and makes unstructured P2P systems far from scalable. Some proposed controlled flooding-based schemes such as iterative deepening/expanding ring, informed search, dynamic query-based flooding, LightFlood, Hurricane flooding, etc. try to improve bandwidth utilization.
Iterative deepening. Yang and Garcia-Molina [45] borrowed an idea from artificial intelligence and used it in iterative deepening. As in ordinary flooding, no node has information about the location of the desired data. The querying node periodically issues a sequence of breadth-first searches (BFSs) with increasing depth limits. The search terminates when the query is satisfied or when the maximum depth limit has been reached.
LightFlood. The LightFlood technique [17] (also called the expanding ring) not only retains the merits of pure flooding, but also eliminates most of the redundant messages caused by pure flooding. Thus, LightFlood greatly enhances the scalability of Gnutella-style P2P systems.
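As a rough illustration of the redundancy problem, TTL-limited flooding and the iterative deepening scheme described earlier in this section can be sketched as follows. The model is an assumption of this example: the overlay is an adjacency dictionary, each node forwards a query to all neighbours except the one it first received it from, and every transmission is counted as one message even if the receiver has already seen the query. The names `flood` and `iterative_deepening` are hypothetical.

```python
from collections import deque


def flood(adj, source, has_object, ttl):
    """Flood a query up to ttl hops; return (holders found, messages sent)."""
    depth = {source: 0}
    parent = {source: None}
    queue = deque([source])
    hits, messages = [], 0
    while queue:
        u = queue.popleft()
        if has_object(u):
            hits.append(u)
        if depth[u] == ttl:
            continue                  # TTL exhausted: do not forward further
        for w in adj[u]:
            if w == parent[u]:
                continue              # never send the query straight back
            messages += 1             # counted even if w saw the query before
            if w not in depth:
                depth[w] = depth[u] + 1
                parent[w] = u
                queue.append(w)
    return hits, messages


def iterative_deepening(adj, source, has_object, depth_limits):
    """Repeat the flood with growing depth limits until something is found."""
    total = 0
    for ttl in depth_limits:
        hits, messages = flood(adj, source, has_object, ttl)
        total += messages
        if hits:
            return hits, ttl, total
    return [], None, total


if __name__ == "__main__":
    # Six peers on a ring; the object is held by peer 2.
    ring = {n: [(n - 1) % 6, (n + 1) % 6] for n in range(6)}
    holds = lambda n: n == 2
    print(flood(ring, 0, holds, 3))
    print(iterative_deepening(ring, 0, holds, [1, 2, 3]))
```

On the ring example a flood with TTL 3 sends six messages to reach five peers, one of which is a duplicate delivered to a peer that already has the query; iterative deepening stops at depth 2 but pays for the failed depth-1 round as well.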
The design of LightFlood is motivated by two observations: first, the majority of redundant messages are generated within high hops; second, the network coverage growth rates in low hops are much higher than those within high hops. Thus, the LightFlood scheme is divided into two stages. In the first stage, the messages are allowed on their low hops to be flooded by pure flooding (by giving a small time to live (TTL) number). Those peers reached on the last hop of pure flooding (TTL = 0) become seeds, from which the flooding is initiated for the second stage. The initial pure flooding ensures that a considerable number of seeds are dispersed across the overlay with a small number of redundant messages. The next stage of flooding ensures that most redundant messages caused by pure flooding within the rest of its hops are eliminated. The integration of these two stages retains the advantages of pure flooding: low latency, high coverage, and high reliability. Hurricane flooding. In Hurricane flooding [21], the source of a search cautiously but exponentially expands its search horizon in a spiral pattern. Like the expanding ring algorithm, Hurricane flooding increases the scope of flooding after each round. The source peer divides its neighbors into several
groups of approximately the same size. The source sends query packets to its neighbors in the first group, starting the first round of flooding. These neighbors faithfully broadcast the query packets (but not back to the source). The source also sets a limit on the scope of these broadcast query packets, e.g., by using a TTL value. The first round of flooding may have a very narrow scope with a small TTL. This round of flooding may not return the desired result. Then the source sends query packets to its neighbors in the second group, with a larger limit on the scope of the flooding. This process repeats until the source obtains the desired result. It has been shown that Hurricane flooding reduces the search cost to arbitrarily close to a lower bound for any search algorithm and bounds the search latency, which is a logarithmic function of the location of the target.
3.1.2 Random-Walk-Based Search Techniques
Random walk is a popular alternative to flooding for locating resources in P2P networks when network bandwidth is scarce. In the standard random walk algorithm [25], the querying node forwards the query message to one randomly selected neighbor with some specific TTL value T. When an intermediate node receives the random walker, it checks to see if it has the resource. If the intermediate node does not have the resource, it checks the TTL field, and if T > 0, it decrements T by 1 and forwards the query to a randomly chosen neighbor; else if T = 0 the query message is dropped. On the other hand, if the intermediate node has the resource, the query is not forwarded and a reply is sent back to the querying node. This random walk technique greatly reduces the message overhead but causes a longer search delay. In the k-walker random walk algorithm [26], k walkers are deployed by the querying node to search for the desired item. That is, the querying node forwards k copies of the query message to k randomly selected neighbors.
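The forwarding rule of the standard random walk, and its k-walker variant, can be sketched as below. Several simplifications are assumptions of this example rather than part of the cited algorithms: the overlay is an adjacency dictionary of lists, the k walkers are simulated one after another instead of in parallel, each walker picks its first neighbour independently (so two walkers may start on the same neighbour), and a walker is not told when another walker has already succeeded.

```python
import random


def random_walk(adj, source, has_object, ttl, rng):
    """One walker with TTL value ttl; returns (holder or None, hops taken)."""
    node = rng.choice(adj[source])    # querying node picks one neighbour
    hops = 1
    while not has_object(node):
        if ttl == 0:
            return None, hops         # T = 0: the query message is dropped
        ttl -= 1                      # T > 0: decrement and forward
        node = rng.choice(adj[node])
        hops += 1
    return node, hops                 # resource found: reply to the source


def k_walker(adj, source, has_object, k, ttl, rng):
    """Deploy k independent walkers; returns (holders found, total messages)."""
    results = [random_walk(adj, source, has_object, ttl, rng) for _ in range(k)]
    found = [node for node, _ in results if node is not None]
    messages = sum(hops for _, hops in results)
    return found, messages


if __name__ == "__main__":
    rng = random.Random(1)
    ring = {n: [(n - 1) % 20, (n + 1) % 20] for n in range(20)}
    holds = lambda n: n == 7
    for k in (1, 4, 16):
        found, messages = k_walker(ring, 0, holds, k, 30, rng)
        print("k =", k, "walkers that succeeded:", len(found), "messages:", messages)
```

The total message count is bounded by k(T + 1) in this model, which makes the overhead side of the trade-off between k and the TTL explicit.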
Each query message takes its own random walk, and each walker stops when it reaches the destination or when its TTL value reaches zero. In this way, the k-walker random walk algorithm attempts to reduce the routing delay by a factor of k. However, an arbitrary increase in the number of walkers results in a significant increase in redundant visits in the initial stage, which increases the message overhead. In fact, the performance of the k-walker random walk largely depends on the choice of k and TTL. Intuitively, the average number of nodes that must be probed to discover a resource is inversely proportional to the popularity of the resource. Choosing low values of k and TTL when searching for a resource with low popularity would result in a low success rate and high delays; choosing high values of k and TTL when searching for a resource with high popularity would result in excessive overhead. Thus, the parameters of the random walk must be chosen according to the popularity of the resource being searched for. The popularity of a resource may not be known a priori at the querying node. In addition, the popularity may change due to the arrival/departure of nodes,
replication/deletion/exhaustion of resources, or other random changes in the network. Thus, the parameters of the random walk must be set in an adaptive manner. The modified random BFS technique [22] is a modification of the k-walker random walk scheme that reduces the unnecessary message overhead. Here the querying node forwards the query to a randomly selected subset of its neighbors. On receiving a query message, each neighbor forwards the query to a randomly selected subset of its own neighbors, excluding the source node. This procedure continues until the query stop condition is satisfied. This approach is expected to visit more nodes and to have a higher query success rate than the k-walker random walk. Some hybrid schemes have also been developed [25] as a compromise between flooding and random walks. One of the hybrid schemes uses local flooding until exactly K (predefined) new outer nodes have been discovered. Then, each of the K nodes initiates an independent random walk.

Gradient-based search in scale-free networks. Recent measurements of Gnutella networks [7] and simulated Freenet networks [18] have shown that their topological structure follows a power-law degree distribution. Adamic et al. [1] proposed a message-passing algorithm that can be used to search efficiently in scale-free networks such as Gnutella. It has been observed that random walks in scale-free networks naturally gravitate towards the high degree nodes, but an even better coverage is achieved by intentionally choosing high degree nodes. The authors of [1] have shown analytically that if the nodes with the highest degree are visited first and the walk then proceeds down the degree sequence, a significant portion of the network can be covered very quickly. In the proposed algorithm, the walker approximately follows the degree sequence across the entire scale-free network with an exponent close to 2 (2.0 < γ < 2.3).
At each step, the random walker chooses a node with a degree higher than that of the current node, quickly finding the highest degree node. Once the highest degree node has been visited, it is avoided, and a node of approximately the second highest degree is chosen. Effectively, after a short initial climb, the walker goes down the degree sequence. Visiting the highest degree nodes in sequence is the most efficient way to perform this kind of sequential search. These algorithms are completely decentralized and exploit the power-law distribution of the node degrees. The paper demonstrates that the search algorithms work well on real Gnutella networks, scale sublinearly with the number of nodes, and may help to reduce the network search traffic that tends to cripple such networks.

3.2 Topological Dynamics and Stability of Superpeer Networks

From the point of view of topological dynamics, P2P networks exhibit behavior similar to that of the Internet. However, the special superpeer topology exhibited by many commercial P2P networks makes the outcome of the dynamics different from that of the Internet (mainly scale-free networks).
A superpeer network can be modeled by a bimodal degree distribution, where a small fraction of nodes are superpeers with high degree and a large fraction of nodes are low degree peers [28]. Formally, the degree distribution $p_k$ of a superpeer network can be specified as $p_k > 0$ if $k = k_l, k_m$ and $p_k = 0$ otherwise, where $k_l$ and $k_m$ are the degrees of peers and superpeers, respectively. Moreover, there are some differences between the dynamics of P2P networks and those of the Internet. We explain the different kinds of peer dynamics and then illustrate the outcomes in each case.

Peers in a P2P system join and leave the network randomly without any central coordination. This is termed peer churn. In addition, important peers are targeted for attack [38]. All these peer dynamics can be modeled by different kinds of node removal schemes in a random graph.

1. Random failure: Peer churn can be modeled by random removal of nodes from the graph. This is the simplest model of churn, and the probability of removal of a node is independent of its degree.
2. Degree-dependent failure: Peers having higher connectivity are more stable in the network than peers having lower connectivity, because loosely connected peers enter and leave the network quite frequently. This observation leads us to model churn in a more realistic manner, where the probability of removal of a node is inversely proportional to the degree of that node.
3. Degree-dependent attack: In case of attack, the nodes having higher degrees are more likely to be removed from the network.

Let the probability distribution $f_k$ model the different node removal techniques. In the following we consider a unified churn/attack model of the form $f_k = C k^{\gamma}$, where $\gamma$ is a parameter called the attack exponent and $C$ is a constant. The different node removal techniques can be realized from this unified model just by changing the parameter $\gamma$.

1. Random failure: For $\gamma = 0$, $f_k = C$, i.e., the probability of removal of a node is independent of the degree of the node.
2. Degree-dependent failure: For $\gamma < 0$, the probability of removal of a node having degree $k$ decreases with the degree of the node, i.e., $f_k \propto 1/k^{|\gamma|}$.
3. Degree-dependent attack: For $\gamma > 0$, the probability of removal of a node having degree $k$ increases with the degree of the node, i.e., $f_k \propto k^{\gamma}$.

3.2.1 Outcomes

Next we illustrate the impact of different peer dynamics on the stability of superpeer networks. Peer churn has been modeled by random failure and degree-dependent failure, and attack has been modeled by degree-dependent attack.
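As a small numerical illustration of the unified model $f_k = C k^{\gamma}$ described above, the following Python sketch evaluates the removal probabilities of peers and superpeers for the three regimes of the attack exponent. The degrees, the peer fraction, and the constant `C` are hypothetical values chosen only for the example, and the overall removed fraction is computed as $\sum_k p_k f_k$, which is an assumption about how the two classes combine rather than a formula quoted from the text.

```python
def removal_probability(k, gamma, c):
    """f_k = C * k**gamma, checked to be a valid probability."""
    f = c * k ** gamma
    if not 0.0 <= f <= 1.0:
        raise ValueError("C * k**gamma must lie in [0, 1]")
    return f


def removed_fraction(r, k_l, k_m, gamma, c):
    """Overall fraction removed in a bimodal network (assumed: sum_k p_k f_k)."""
    return (r * removal_probability(k_l, gamma, c)
            + (1 - r) * removal_probability(k_m, gamma, c))


# Hypothetical superpeer network: 90% peers of degree 3, 10% superpeers of degree 30.
K_L, K_M, R, C = 3, 30, 0.9, 0.1

for gamma in (-1.0, 0.0, 0.5):  # failure, random failure, attack
    f_peer = removal_probability(K_L, gamma, C)
    f_super = removal_probability(K_M, gamma, C)
    total = removed_fraction(R, K_L, K_M, gamma, C)
    print(f"gamma={gamma:+.1f}  f_peer={f_peer:.3f}  f_super={f_super:.3f}  total={total:.3f}")
```

With a negative exponent the peers are removed more often than the superpeers, with a zero exponent both classes are treated alike, and with a positive exponent the superpeers bear the brunt of the removal.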
Fig. 7. The impact of random failure upon the stability of superpeer networks. [Plot of the percolation threshold $f_r$ against the fraction of peers $r$; theoretical and simulation curves for $\langle k_{sp} \rangle = 30$ and $\langle k_{sp} \rangle = 50$.]
Random failure. The analysis done in [30] shows that superpeer networks are quite robust against churn (Fig. 7). Since churn affects peers and superpeers in proportion to their individual fractions in the network, peers are affected much more than superpeers. The removal of a significant number of low degree peers along with a few high degree superpeers has little impact upon the stability of the networks. Practical experience also confirms that superpeer networks exhibit high robustness in the face of churn. Another significant observation is that a lower fraction of superpeers in the network (specifically when it is below 5%) results in a sharp fall in the percolation threshold; that is, the vulnerability of the network drastically increases when the fraction of superpeers is below 5%.

Degree-dependent failure. Figure 8 shows that as the superpeer degree $k_m$ increases, the value of the critical attack exponent $\gamma_c$ that percolates the network decreases. This increases the fraction of superpeers that must be removed to break down the network. Since an increase of $k_m$ increases the fraction of peers $r$, the removal of most of the low degree peers along with a fraction of superpeers increases the percolation threshold $f_d$. It is also interesting to observe that the percolating $\gamma_c$ remains quite low, less than 0.1, for the entire range of $k_m$. The reason is that small values of $\gamma_c$ result in the removal of a higher fraction of superpeer nodes from the network. Since degree-dependent failure mainly removes the lower degree nodes, which are not so useful for breaking the network down, removal of a significant number of superpeers becomes necessary.

Degree-dependent attack. Reference [29] analyzes the behavior of superpeer networks against degree-dependent attack, where $k_l$ and $k_m$ are the degrees of peers and superpeers, respectively, and $r$ is the fraction of peers in the network. In [29],
Fig. 8. Change in critical attack exponent $\gamma_c$ and percolation threshold $f_d$ with respect to superpeer degree $k_m$ for superpeer networks undergoing degree-dependent failure. Here the mean degree $\langle k \rangle$ varies from 8 to 16. The x-axis represents the superpeer degree ($k_m$) and the y-axes represent the corresponding $\gamma_c$ and $f_d$. [Two panels: $\gamma_c$ versus $k_m$ for mean degrees 8, 12, and 16 with a fitted line; $f_d$ versus $k_m$ with theoretical and simulation curves.]
the authors have established the critical condition for the stability of the network against degree-dependent attack:

$$ r k_l^{\gamma+1}(k_l - 1) + (1 - r) k_m^{\gamma+1}(k_m - 1) \ge k_m^{\gamma}\left(\langle k \rangle (k_m + k_l) - k_m - 2\langle k \rangle\right). \qquad (4) $$
The inequality gives the set of solutions for the critical exponent $\gamma_c$ and subsequently the normalizing constant $C$, which determines the fraction of peers and superpeers to be attacked. The nature of the solution set $S_{\gamma_c}$ of the inequality has a profound impact upon the fraction of peers and superpeers that must be removed and upon the percolation threshold $f_c$. The breakdown of the network can be due to one of the following three situations.

Case A. Removal of all the superpeers along with a fraction of peers. Networks having a bounded solution set $S_{\gamma_c}$, where $0 \le \gamma_c \le \gamma_c^{bd}$, exhibit this kind of behavior at the maximum value of the solution, $\gamma_c = \gamma_c^{bd}$.

Case B. Removal of only a fraction of superpeers. Networks having an unbounded solution set $S_{\gamma_c}$, where $0 \le \gamma_c \le +\infty$, exhibit this kind of behavior as $\gamma_c \to \infty$.

Case C. Removal of some fraction of both superpeers and peers. An intermediate critical exponent $\gamma_c \in S_{\gamma_c}$ signifies the fractional removal of both peers and superpeers.

Figure 9 shows that the solution set $S_{\gamma_c}$ of the networks remains bounded up to a threshold superpeer fraction $sp_{th}$ ($sp_{th} = 0.19$ and $0.41$ for $k_l = 3$ and $k_l = 4$, respectively). Hence, the removal of all the superpeers along with a fraction of the peers is necessary to disintegrate the network (Fig. 9). The figure also shows some instances of case B, where only some fraction of the superpeers needs to be removed.
Fig. 9. Impact of degree-dependent attack on superpeer networks. Behavior of $\gamma_c^{bd}$ and percolation threshold due to the change of superpeer fraction is shown. [Two panels plotted against the superpeer fraction for peer degrees $k_l = 3$ and $k_l = 4$: the boundary exponent $\gamma_c^{bd}$; and the percolation threshold $f_c$ together with the removed peer fraction $f_p$ and the removed superpeer fraction $f_{sp}$.]
4 Conclusion

In this chapter, we have presented a comprehensive study of various aspects of technological networks. We have considered two different technological networks: the Internet and P2P networks. The protocols used in the Internet have been discussed briefly along with their services. An empirical study of the different topological properties of the Internet, such as scale invariance and the small-world property, has been presented. The fault tolerance of the Internet has been discussed in the light of general stability analysis. The spread of computer viruses has been modeled by network-aware epidemic models. We have also shed some light on the recent advancements and classifications of P2P networks. As search is one of the most important services provided by P2P systems, different search techniques and their comparative study have been provided. The stability of P2P networks in the face of churn and attack has also been discussed as a continuation of the Internet fault tolerance.

The advancements of the Internet have also posed some serious challenges to the network research community. One of the significant problems is modeling the widely varying Internet traffic. An appropriate model of the Internet is often useful to measure the efficiency of routing algorithms and the quality of service (QoS) of different web applications. Maintaining a specific QoS in a faulty environment is another major research issue. There is always substantial uncertainty when making network management decisions. A decision maker is limited not only because it possesses only partial information due to decentralized control, but also by the impossibility of predicting the future in terms of traffic demand and/or network topology status. Hence, managing the large-scale Internet is also a non-trivial issue.
Understanding the assortative or disassortative relation among different participating nodes and their impact upon the complex structural properties is also a major research problem.
Advancements in P2P networks also raise some issues regarding security and trust. The P2P philosophy is based upon the cooperative nature of the participating peers. However, it has been found that in Gnutella networks as many as 65% of the nodes do not contribute resources, but free-ride on other peers' resources. Hence, the problem of selfish peers and free riders is a serious threat to the performance of any P2P system. The development of low overhead trust-aware protocols to ensure trust among the peers is necessary to enhance the utility of P2P networks. Understanding the self-organizing features, evolution, and scalability of superpeer networks is also interesting and necessary.
References

1. L. A. Adamic, R. M. Lukose, A. R. Puniyani, B. A. Huberman, Search in power-law networks, Physical Review E, 64, 046135, 2001.
2. R. Albert, H. Jeong, A.-L. Barabasi, Diameter of the world wide web, Nature, 401, 130–131, 1999.
3. R. Albert, H. Jeong, A.-L. Barabasi, Error and attack tolerance of complex networks, Nature, 406, 2000.
4. N. Berger, C. Borgs, T. Chayes, A. Saberi, On the spread of viruses on the Internet, Proceedings of the 16th ACM-SIAM Symposium on Discrete Algorithms (SODA), 301–310, 2005.
5. T. Bu, D. Towsley, On distinguishing between Internet power law topology generators, Proceedings of INFOCOM, New York, NY, USA, 2002.
6. D. Clark, Face-to-face with peer-to-peer networking, IEEE Computer, 34 (1), pp. 18–21, January 2001.
7. Clip2 Company, Gnutella. http://www.clip2.com/gnutella.html.
8. R. Cohen, K. Erez, D. Avraham, S. Havlin, Resilience of the Internet to random breakdown, Physical Review Letters, 85 (21), 2000.
9. R. Cohen, K. Erez, D. Avraham, S. Havlin, Resilience of the Internet under intentional attack, Physical Review Letters, 86 (16), 2001.
10. Q. Deng, H. Lv, Analyzing unstructured peer-to-peer search networks with QIL, Proceedings of the IEEE International Conference on Services Computing, pp. 547–550, Shanghai, China, 2004.
11. P. Erdos, A. Renyi, On random graphs I, Publicationes Mathematicae Debrecen, 6, 290–297, 1959.
12. M. Faloutsos, P. Faloutsos, C. Faloutsos, On power-law relationships of the internet topology, Computer Communications Review, 29, 251–262, 1999.
13. C. Fuchs, The Internet as a self-organizing socio-technological system, Cybernetics and Human Knowing, 12 (31), pp. 37–81, 2005.
14. Gnutella: www.gnutellaforums.com.
15. R. Govindan, H. Tangmunarunkit, Heuristics for internet map discovery, Proceedings of IEEE Infocom, 2000.
16. C. Griffin, R. Brooks, A note on the spread of worms in scale-free networks, IEEE Transactions on Systems, Man, and Cybernetics, Part B, Feb. 2006.
17. L. Guo, S. Jiang, X. Zhang, H. Wang, LightFlood: Minimizing redundant messages and maximizing scope of peer-to-peer search, IEEE Transactions on Parallel and Distributed Systems (TPDS), 19 (5), pp. 601–614, May 2008.
18. T. Hong, in Peer-to-Peer: Harnessing the benefits of a disruptive technology, Andy Oram (ed.), O'Reilly, Sebastopol, CA, Chap. 14, pp. 203–241, 2001.
19. C. Hunt, TCP/IP Network Administration, Second Edition, O'Reilly Networking, December 1997.
20. S. Jin, A. Bestavros, Small-world Internet topologies: possible causes and implications on scalability of end-system multicast, Boston University, Technical Report BUCS-TR-2002-004, January 2002.
21. S. Jin, H. Jiang, Novel approaches to efficient flooding search in peer-to-peer networks, Computer Networks: The International Journal of Computer and Telecommunications Networking, 51 (10), pp. 2818–2832, July 2007.
22. V. Kalogeraki, D. Gunopulos, D. Zeinalipour-Yazti, A local search mechanism for peer to peer networks, Proc. of the 11th ACM Conference on Information and Knowledge Management (ACM CIKM02), 2002.
23. J. O. Kephart, A biologically inspired immune system for computers, Artificial Life IV: Proceedings of the Fourth International Workshop on the Synthesis and Simulation of Living Systems, Cambridge, MA, July 1994.
24. J. M. Kleinberg, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, The Web as a graph: Measurements, models and methods, in Proceedings of the International Conference on Combinatorics and Computing, Lecture Notes in Computer Science, pp. 1–18, Springer, Berlin, 1999.
25. X. Li, J. Wu, Searching techniques in peer-to-peer networks, Handbook of Theoretical and Algorithmic Aspects of Sensor, Ad Hoc Wireless and Peer-to-Peer Networks, CRC Press, Ann Arbor, MI, 2005.
26. Q. Lv, P. Cao, E. Cohen, K. Li, S. Shenker, Search and replication in unstructured peer-to-peer networks, ACM International Conference on Supercomputing, New York, USA, 2002.
27. G. Manku, Routing networks for distributed hash tables, Proceedings of the Twenty-Second Annual ACM Symposium on Principles of Distributed Computing, Boston, Massachusetts, pp. 133–142, 2003.
28. B. Mitra, F. Peruani, S. Ghose, N. Ganguly, Analyzing the vulnerability of superpeer networks against attack, 14th ACM Conference on Computer and Communications Security, Alexandria, USA, 29 Oct–2 Nov, 2007.
29. B. Mitra, Md. M. Afaque, S. Ghose, N. Ganguly, Developing analytical framework to measure robustness of peer-to-peer networks, 8th International Conference on Distributed Computing and Networking - ICDCN 2006 (formerly IWDC), December 27–30, 2006, IIT Guwahati, India.
30. B. Mitra, S. Ghose, N. Ganguly, Effect of dynamicity on peer to peer networks, 14th International Conference on High Performance Computing, Goa, India, 19–22 December 2007.
31. Napster: http://www.napster.com/.
32. R. Pastor-Satorras, A. Vázquez, A. Vespignani, Dynamical and correlation properties of the Internet, Physical Review Letters, 87, 258701, 2001.
33. R. Pastor-Satorras, A. Vespignani, Epidemics and immunization in scale-free networks, in S. Bornholdt and H. G. Schuster (eds.), Handbook of Graphs and Networks, Wiley-VCH, Berlin, 2003.
34. R. Pastor-Satorras, A. Vespignani, Epidemic dynamics in finite size scale-free networks, Physical Review E, 65, 035108, 2002.
35. R. Pastor-Satorras, A. Vespignani, Epidemic dynamics and epidemic states in complex networks, Physical Review E, 63, 066117, 2001.
36. R. Pastor-Satorras, A. Vespignani, Epidemic spreading in scale-free networks, Physical Review Letters, 86, 3200–3203, 2001.
37. K. Patch, Internet stays small world, Technology Research News, 2003.
38. B. Pretre, Attacks on peer-to-peer networks, Ph.D. thesis, Swiss Federal Institute of Technology (ETH) Zurich, 2005.
39. Y. J. Pyun, D. S. Reeves, Constructing a balanced, log(N)-diameter super-peer topology, Proceedings of the 4th International Conference on Peer-to-Peer Computing, Zurich, Switzerland, August 2004.
40. K. Singh, H. Schulzrinne, Peer-to-peer Internet telephony using SIP, Columbia University Technical Report CUCS-044-04, New York, NY, October 2004.
41. P. Szor, The art of computer virus research and defense, Symantec Press, Indianapolis, IN, 2005.
42. A. Vazquez, R. Pastor-Satorras, A. Vespignani, Large-scale topological and dynamical properties of the Internet, Physical Review E, 65, 066130, 2002.
43. C. Wang, B. Li, Peer-to-peer overlay networks: A survey, Department of Computer Science, The Hong Kong University of Science and Technology, Technical Report, 2003.
44. D. J. Watts, S. H. Strogatz, Collective dynamics of 'small-world' networks, Nature, 393, 440–442, 1998.
45. B. Yang, H. Garcia-Molina, Improving search in peer-to-peer networks, Proc. of the 22nd IEEE International Conference on Distributed Computing (IEEE ICDCS02), 2002.
46. B. Yang, H. Garcia-Molina, Designing a super-peer network, Proceedings of the International Conference on Data Engineering (ICDE), Los Alamitos, CA, March 2003.
47. S. Yook, H. Jeong, Y. Tu, A. L. Barabasi, Weighted evolving networks, Physical Review Letters, 86, 5835, 2001.
Advances in the Theory of Complex Networks

Fernando Peruani¹,²

¹ CEA-Service de Physique de l'Etat Condensé, Centre d'Etudes de Saclay, 91191 Gif-sur-Yvette, France
² Institut des Systèmes Complexes de Paris Île-de-France, 57/59, rue Lhomond, F-75005 Paris, France; [email protected]
1 Introduction

An exhaustive and comprehensive review of the theory of complex networks would nowadays be a titanic task, and it would result in a lengthy work containing plenty of technical details of arguable relevance. Instead, this chapter very briefly addresses the ABC of complex network theory, visiting only the hallmarks of its theoretical foundations, and then focuses on two of the most interesting and promising current research problems: the study of dynamical processes on transportation networks and the identification of communities in complex networks.
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_16, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

2 The ABC of Complex Networks

A network or a graph is a set of interconnected nodes (or vertices). Nodes are connected through edges; an edge represents a link between two nodes. More than one edge can run between two vertices. Alternatively, an edge can have a number associated with it denoting its importance or weight. Edges can be directed or undirected. A directed edge from a node A to a node B symbolizes that, for example, node A "speaks" to node B, while the opposite is not possible. Undirected edges, on the other hand, are completely symmetric. This review deals exclusively with undirected edges. For a comprehensive review on the theory of complex networks we refer the reader to [1, 2].

2.1 Network Characterization

2.1.1 Degree Distribution

A network can be characterized in many ways. For example, we could measure the mean degree of the network, $\langle k \rangle$. Here, $k$ stands for one of the most relevant properties of a node, its degree, which indicates the number of edges attached to it, and $\langle \ldots \rangle$ denotes the average over all nodes of the system. Though $\langle k \rangle$ is a useful and informative quantity, by itself it cannot characterize the structure of the network, and typically a good characterization also requires higher order moments, such as $\langle k^2 \rangle$, $\langle k^3 \rangle$, etc. How many moments do we need to know to unequivocally characterize the network? All the information about the moments is contained in the degree probability distribution of the network, $p_k$ (see Fig. 1). Here $p_k$ is the probability of picking a node at random and observing that its degree is $k$. The moments are computed as $\langle k^n \rangle = \sum_k k^n p_k$.

Fig. 1. The figure shows two network topologies: (a) network with exponential degree distribution and (b) network with power-law (scale-free) degree distribution. Figure taken from Ref. [12].

If the network is such that the vertices (nodes) are statistically independent, that is, the connections are made completely at random, then the degree probability distribution unequivocally determines the properties of the network. If this is not the case and there are correlations among nodes, the characterization of the network requires the use of a degree-degree probability distribution, or an even higher n-point probability distribution. Let us assume for the moment that vertices are statistically independent. There are three types of degree distributions which, due to their ubiquity and simplicity, deserve special mention: a) the Poisson distribution, defined as $p_k = e^{-\langle k \rangle} \langle k \rangle^k / k!$, which is the degree distribution of a classical random graph; b) the exponential distribution, defined as $p_k \sim e^{-k/\langle k \rangle}$ (see Fig. 1(a)); and c) the power-law distribution (see Fig. 1(b)), defined as $p_k \sim k^{-\gamma}$ with $\gamma > 0$, which has (for infinite networks) all moments of order $m > \gamma - 1$ diverging (for this reason these distributions are referred to as scale-free).
For distributions like a) and b), the first moment of the distribution, i.e., $\langle k \rangle$, unequivocally characterizes the network topology, but in general higher moments are required to unequivocally determine the network topology.
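To make the relation between a degree sequence, the distribution $p_k$, and the moments $\langle k^n \rangle = \sum_k k^n p_k$ concrete, here is a minimal Python sketch. The function names and the six-node degree sequence are illustrative assumptions, not taken from the text.

```python
from collections import Counter


def degree_distribution(degrees):
    """Empirical p_k: fraction of nodes that have degree k."""
    n = len(degrees)
    return {k: count / n for k, count in Counter(degrees).items()}


def moment(pk, n):
    """n-th moment <k^n> = sum_k k^n p_k."""
    return sum((k ** n) * p for k, p in pk.items())


# Hypothetical degree sequence of a six-node network.
degrees = [1, 1, 2, 2, 2, 4]
pk = degree_distribution(degrees)

print(pk)             # p_1 = 1/3, p_2 = 1/2, p_4 = 1/6
print(moment(pk, 1))  # mean degree <k>
print(moment(pk, 2))  # second moment <k^2>
```

For this sequence the mean degree is 12/6 = 2 and the second moment is 30/6 = 5, so the two moments already differ markedly even for a very small network.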
2.1.2 Clustering Coefficient

Another important quantity used to characterize the network topology is the clustering coefficient. The clustering coefficient measures the degree of connectivity in the environment close to a node, i.e., the degree of cliquishness of the closest environment of a node. In a more colloquial way, it is an answer to the question: Are my friends also friends of each other? If a node has degree $z$, i.e., $z$ neighbors, and all these $z$ nodes are connected to each other, there are $z(z - 1)/2$ edges linking them. The clustering coefficient is defined as the ratio between the total number $y$ of edges connecting the $z$ nearest neighbors and the total number of possible edges between the $z$ nearest neighbors,

$$ C = \frac{y}{z(z - 1)/2}. \qquad (1) $$

Logically, a network is associated with a distribution of clustering coefficients; however, typically only the average clustering coefficient is reported, which is a simple estimate of the probability that any two neighbors of a given node are also connected to each other. A simple approximation for the average clustering coefficient of Poissonian (or exponential) random networks is given by

$$ C_{rand} = \frac{\langle k \rangle}{N}. \qquad (2) $$

Another definition of the average clustering coefficient extensively used in the literature is given by

$$ C_{triangle} = \frac{3A}{B}, \qquad (3) $$

where $A$ stands for the number of triangles, i.e., triplets of nodes in which each node is connected to the other two, and $B$ is the number of connected triplets, i.e., triplets of nodes in which each node is connected to at least one of the other two. The factor 3 accounts for the fact that from each triangle three simple triplets can be formed.

2.1.3 Network Diameter

The network diameter is defined as the maximal distance between any pair of nodes.
The above definition strictly works only for connected networks; however, by redefining the diameter as the maximum distance found within any connected component (cluster) of the system, the definition becomes applicable to all kinds of networks. Assuming that the network has a tree-like structure, a rough estimate can be obtained by equating $\langle k \rangle^d$ with $N$:

$$ d \sim \frac{\ln N}{\ln \langle k \rangle}. \qquad (4) $$

It has been shown that Eq. (4) predicts the correct scaling of $d$ with $N$ and $\langle k \rangle$ for random networks. Note that when $\langle k \rangle > \ln(N)$, a random network
has a high probability of being totally connected [1]. The concept of network diameter is closely related to another important quantity, the average path length, which is the average distance between any pair of nodes.

2.1.4 Network Spectrum

The network topology can also be studied through the adjacency matrix $A$, which is an $N \times N$ symmetric matrix whose elements $A_{i,j}$ represent the connections among the nodes of the network. If nodes $i$ and $j$ are connected, then $A_{i,j} = 1$; otherwise, $A_{i,j} = 0$. The spectrum of the network is the set of eigenvalues of $A$, and since $A$ has $N$ eigenvalues, the spectral density takes the form

$$ \rho(\lambda) = \frac{1}{N} \sum_{j=1}^{N} \delta(\lambda - \lambda_j). \qquad (5) $$

In the limit $N \to \infty$, $\rho(\lambda)$ becomes a continuous function. Interestingly, the topology of the network is related to the spectral density through

$$ \int d\lambda\, \lambda^k \rho(\lambda) = \frac{1}{N} \sum_{j} (\lambda_j)^k = \frac{1}{N} \sum_{i_1, i_2, \ldots, i_k} A_{i_1,i_2} A_{i_2,i_3} \cdots A_{i_k,i_1}. \qquad (6) $$

Equation (6) counts the paths of $k$ steps that return to the same node in the network, normalized by $N$. One of the most remarkable results connected to this kind of approach is Wigner's law, which applies to infinite random networks with a connectivity $p \sim N^{-\xi}$. When $0 < \xi < 1$, Wigner's law predicts that the spectral density is a semicircular distribution, $\rho(\lambda) = \sqrt{4Np(1 - p) - \lambda^2} / (2\pi N p(1 - p))$ for $|\lambda| < 2\sqrt{Np(1 - p)}$, and vanishes for larger values of $|\lambda|$, except for the principal eigenvalue, which is isolated from the bulk and increases with network size. For $\xi > 1$ the spectral density deviates from Wigner's law and its odd moments vanish (i.e., the moments in Eq. (6) are zero for odd order $k = 2m + 1$), indicating that the only way a path can come back to the original node is by retracing the nodes previously visited, i.e., there are no closed loops [4–9, 26].

2.2 Building a Network

There are equilibrium and non-equilibrium random networks. These terms refer to the way in which the network was grown. In this subsection we briefly review how a network can be built.

2.2.1 Equilibrium Random Networks

Given a fixed number $N$ of nodes and a fixed number $M$ of edges, the network is built by taking, for each edge, a randomly selected pair of nodes and inserting an edge between them.
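The equilibrium construction just described is short enough to sketch directly in Python. The version below is an illustration under one added assumption that the text does not state: self-loops and repeated edges are rejected, so that the result is a simple graph with exactly $M$ distinct edges. The function name and the values of $N$ and $M$ are hypothetical.

```python
import random


def equilibrium_random_network(n, m, rng):
    """Place m edges between randomly selected pairs among n nodes.

    Assumption (not stated in the text): no self-loops, no repeated edges.
    """
    if m > n * (n - 1) // 2:
        raise ValueError("too many edges for a simple graph on n nodes")
    edges = set()
    while len(edges) < m:
        i, j = rng.sample(range(n), 2)     # two distinct nodes, chosen uniformly
        edges.add((min(i, j), max(i, j)))  # undirected edge, stored once
    return edges


N, M = 200, 600
edges = equilibrium_random_network(N, M, random.Random(42))

degree = [0] * N
for i, j in edges:
    degree[i] += 1
    degree[j] += 1

mean_degree = sum(degree) / N  # equals 2M/N by construction
print(len(edges), mean_degree)
```

Because every edge contributes to the degree of two nodes, the mean degree is fixed at $2M/N$ regardless of the random choices; only the individual degrees fluctuate.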
2.2.2 Non-Equilibrium Random Networks

In this case, the network is grown by simultaneously adding vertices and edges. The procedure is as follows.

a) A node is added at each time step.
b) Simultaneously, a pair (or several pairs) of randomly chosen vertices is connected by an edge.

If at some moment the addition of nodes is stopped while the addition of edges continues, the network will tend to an equilibrium. However, the network will never approach the equilibrium state given by equilibrium networks, since the growing process produces a sort of correlation by which 'old' nodes are more connected than 'young' nodes. The only way to achieve an equilibrium network configuration is by also allowing the removal of old edges.

2.2.3 Preferential Attachment

A large number of real-world networks are scale-free; i.e., they exhibit a power-law degree distribution. The Barabási–Albert model [33] was the first model that satisfactorily described a non-equilibrium network whose asymptotic degree distribution is a power law. The growth model is as follows. Starting with a small number $m_0$ of nodes, at every time step add a new node with $m$ ($m \le m_0$) edges and link the new node to $m$ different nodes already present in the system according to the following rule: choose each node to which the new node connects with probability $\Pi$ proportional to the node degree $k_i$,

$$ \Pi(k_i) = \frac{k_i}{\sum_j k_j}. \qquad (7) $$
The attachment rule described by Eq. (7) is referred to as preferential attachment. After $t$ time steps this procedure produces a network with $N = t + m_0$ nodes and $mt$ edges. Asymptotically in $t$, the degree distribution of the network approaches a power law with exponent $\gamma = 3$. This remarkable fact can be understood with the following simple continuum theory [37]. Assume that $k_i$ is a continuous variable whose growth rate is proportional to $\Pi(k_i)$; then $k_i$ evolves according to

$$\frac{\partial k_i}{\partial t} = m\,\Pi(k_i) = \frac{k_i}{2t}, \tag{8}$$

where the last equality uses $\sum_j k_j = 2mt$, since each of the $mt$ edges contributes to two degrees. The solution of this equation, with the initial condition that every node $i$ at its introduction (at time $t_i$) has $k_i(t_i) = m$, is $k_i(t) = m\,(t/t_i)^{\beta}$ with $\beta = 1/2$. Thus, the cumulative probability takes the form

$$p[k_i(t) < k] = 1 - \frac{m^{1/\beta}\, t}{k^{1/\beta}\,(t + m_0)}. \tag{9}$$

Taking the derivative of Eq. (9) with respect to $k$, we obtain the degree distribution

$$p_k = \frac{2\, m^{1/\beta}\, t}{(m_0 + t)\, k^{1/\beta + 1}}. \tag{10}$$

In the limit $t \to \infty$, $p_k \sim 2 m^{1/\beta} k^{-\gamma}$ with $\gamma = \beta^{-1} + 1 = 3$.
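The growth rule of Eq. (7) can be illustrated with a short simulation. The following is a minimal sketch, not the authors' implementation: the function name `grow_preferential` and the list-of-stubs sampling trick are choices made here for illustration, and the rule used while all degrees are still zero (uniform choice among the initial nodes) is an assumption, since the text does not specify it.

```python
import random

def grow_preferential(m0, m, t, seed=0):
    """Grow a network with the attachment rule of Eq. (7).

    Returns (degree, edges), where degree[i] is the degree of node i.
    """
    rng = random.Random(seed)
    degree = [0] * m0      # the m0 initial nodes start without edges
    edges = []
    stubs = []             # node i appears degree[i] times in this list
    for step in range(t):
        new = m0 + step
        targets = set()
        while len(targets) < m:
            if stubs:
                # a uniform draw from `stubs` selects node i with
                # probability k_i / sum_j k_j, i.e. Eq. (7)
                targets.add(rng.choice(stubs))
            else:
                # assumption: while all degrees are zero, choose uniformly
                targets.add(rng.randrange(m0))
        degree.append(m)   # the new node arrives with m edges
        for old in sorted(targets):
            degree[old] += 1
            edges.append((new, old))
            stubs.extend((new, old))
    return degree, edges

degree, edges = grow_preferential(m0=3, m=2, t=2000)
print(len(degree), len(edges))  # N = t + m0 nodes and m*t edges
```

Because every edge adds two entries to `stubs`, the sum of all degrees equals $2mt$, which is the normalization used in Eq. (8).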
F. Peruani
2.3 Network Stability: Breaking Down a Network

A finite network can be formed by many isolated clusters of various sizes, or it can be fully connected with only one giant component. For infinite networks this statement has to be rephrased in the following way. An infinite network can exhibit a giant cluster containing an infinite number of nodes, or, on the contrary, all clusters in the system can be finite. If a network exhibits a giant cluster, we say that the network is stable and highly connected. We now review the already classical results on percolation of complex networks [10–14]. Specifically, we follow the method proposed in [10, 11] and extended in [15, 16]. The goal is to find the minimum fraction of nodes that should be removed from a network in order to break down its connectivity. By definition, a network is no longer connected when the initial giant component disappears, i.e., when the biggest cluster of connected nodes in the system is much smaller than the total initial number of nodes. Let $p_k$ be the network degree distribution, i.e., the probability of finding a randomly chosen vertex with degree $k$, and let $q_k$ be the probability that a node of degree $k$ survives the failure or attack. Correspondingly, $1 - q_k$ is the probability that a node of degree $k$ is removed. In consequence, $p_k q_k$ represents the fraction of nodes that have degree $k$ and survive the failure or attack. The objective is now to characterize the cluster size distribution of surviving nodes, and to determine under which condition cluster sizes can be infinite. We make use of the generating function formalism and define $G(x)$ as the generating function of the network degree distribution $p_k$:

$$G(x) = \sum_{k=0}^{\infty} p_k x^k. \tag{11}$$

Recall that the connection between the generating function and the probability distribution it generates is given by

$$p_k = \lim_{x \to 0} \frac{1}{k!} \frac{d^k G(x)}{dx^k}. \tag{12}$$

We still need to derive the generating function $F_0(x)$ of the probability of finding a node of degree $k$ that has survived the attack. Since $p_k q_k$ is the probability of finding a surviving node of degree $k$ after the disruptive event, applying the definition of generating function, Eq. (11), we find that $F_0(x)$ takes the form

$$F_0(x) = \sum_{k=0}^{\infty} p_k q_k x^k. \tag{13}$$

Another important generating function is the one associated with the probability of finding a randomly chosen edge connected to a node of degree $k$ (after the attack):

$$A(x) = \sum_{k=0}^{\infty} \frac{k p_k q_k}{z}\, x^k = \frac{x F_0'(x)}{G'(1)}, \tag{14}$$
Advances in the Theory of Complex Networks
where $z = \langle k \rangle = \sum_{k=0}^{\infty} k p_k = G'(1)$. To obtain an expression for the cluster size distribution, we need first to find the generating function of the probability that one of the outgoing edges of the node we arrived at connects to a surviving node of degree $k$. This is simply $A(x)/x$, and the desired generating function can be expressed as

$$F_1(x) = \frac{F_0'(x)}{G'(1)} = \frac{F_0'(x)}{z}. \tag{15}$$
Now we look for the generating function $H_1(x)$ of the distribution of cluster sizes of surviving nodes that are reached by randomly choosing an edge and following it to one of its ends. If we choose an edge that leads us to a removed node, regardless of the degree of the node, we say that the cluster size we find is zero. The probability of following the randomly chosen edge and finding a surviving node of degree zero is zero, the probability of finding a surviving node of degree one is $p_1 q_1/z$, the probability of finding a surviving node of degree two is $2 p_2 q_2/z$, and so on. So, the probability of finding a surviving node, regardless of its degree, is $\sum_{k=0}^{\infty} k p_k q_k/z = F_1(1)$. In consequence, the probability of finding an edge that leads to a removed node is $1 - F_1(1)$. Clearly, this is also the probability of following a randomly chosen edge that leads to a zero size component, and so also the coefficient $s_0$ that accompanies $x^0$ in $H_1(x)$. To find the full expression of $H_1(x)$, we still have to look for the probabilities that accompany non-zero size components, i.e., $x^k$ with $k > 0$. We start with the probability $s_1$ of finding, by following a randomly chosen edge, a component of size 1. This is nothing other than the sum of the probabilities of following an edge and finding a surviving node of degree $k$ which has its other $k - 1$ edges connected to removed nodes:

$$s_1 = \sum_{k=1}^{\infty} \frac{k p_k q_k}{z} \left(1 - F_1(1)\right)^{k-1} = F_1(H_1(0)). \tag{16}$$

Similarly for $s_2$, we can obtain

$$s_2 = \sum_{k=2}^{\infty} (k-1)\, \frac{k p_k q_k}{z} \left(1 - F_1(1)\right)^{k-2} s_1 = F_1'(H_1(0))\, H_1'(0), \tag{17}$$

where $(1 - F_1(1))^{k-2} s_1$ is the probability of taking randomly $k - 1$ edges and finding that $k - 2$ edges are attached to removed nodes, and one to a size 1 component. The factor $k - 1$ indicates that there are $k - 1$ possible configurations for these edges. We observe that Eq. (17) is the derivative with respect to $x$ of Eq. (16) evaluated at $x = 0$. However, from the definition given by Eq. (11), we know that the coefficient of $x^1$ is given by a first derivative, while the coefficient of $x^2$ is associated with a second derivative and a factor $1/2$. We solve this problem by considering that the function we have to differentiate successively is $x F_1(H_1(x))$. The first derivative of this function evaluated at $x = 0$ is $F_1(H_1(0))$, while the second derivative evaluated at $x = 0$ is $2 F_1'(H_1(0)) H_1'(0)$. This suggests a self-consistent equation for $H_1(x)$ of the form

$$H_1(x) = \left(1 - F_1(1)\right) + x F_1(H_1(x)). \tag{18}$$
It can be easily verified that Eq. (18) leads to the correct expressions of $s_0, s_1, \ldots, s_n$ by applying the definition given by Eq. (12). Along similar lines, we can obtain the generating function $H_0(x)$ of the distribution of the size of the component to which a randomly chosen node belongs. The main difference is that instead of determining the probability of finding a randomly chosen edge attached to a component of size $s$, we now randomly choose a node and want to determine the probability that this node belongs to a cluster of size $s$. For this reason, instead of using the edge-based probabilities $k p_k q_k/z$ as before, we use $p_k q_k$ and its corresponding generating function $F_0(x)$. The expression for $H_0(x)$ takes the form

$$H_0(x) = \left(1 - F_0(1)\right) + x F_0(H_1(x)). \tag{19}$$

Finally, from Eq. (19), we can obtain the average size of the components:

$$H_0'(1) = \langle s \rangle = F_0(1) + \frac{F_0'(1)\, F_1(1)}{1 - F_1'(1)}. \tag{20}$$
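As a numerical check of Eq. (20), the average component size can be evaluated directly from $p_k$ and $q_k$, using $F_0(1) = \sum_k p_k q_k$, $F_0'(1) = \sum_k k p_k q_k$, $F_1(1) = F_0'(1)/z$ and $F_1'(1) = \sum_k k(k-1) p_k q_k / z$. The sketch below is illustrative only; the function names and the truncated Poisson degree distribution are assumptions made here, not part of the text.

```python
import math

def poisson_pk(z, kmax=100):
    """Poisson degree distribution with mean z, truncated at kmax (illustrative)."""
    pk = [math.exp(-z)]
    for k in range(1, kmax + 1):
        pk.append(pk[-1] * z / k)
    return pk

def mean_cluster_size(pk, qk):
    """Average component size <s> of Eq. (20); returns inf at or above threshold."""
    z = sum(k * p for k, p in enumerate(pk))                       # z = G'(1)
    F0_1 = sum(p * q for p, q in zip(pk, qk))                      # F_0(1)
    dF0_1 = sum(k * p * q for k, (p, q) in enumerate(zip(pk, qk)))  # F_0'(1)
    F1_1 = dF0_1 / z                                               # F_1(1)
    dF1_1 = sum(k * (k - 1) * p * q
                for k, (p, q) in enumerate(zip(pk, qk))) / z       # F_1'(1)
    if dF1_1 >= 1.0:
        return math.inf
    return F0_1 + dF0_1 * F1_1 / (1.0 - dF1_1)

pk = poisson_pk(0.5)
print(mean_cluster_size(pk, [1.0] * len(pk)))   # no removal, sparse random graph
```

With no removal ($q_k = 1$) and a Poisson distribution of mean $z < 1$, the expression reduces to $1/(1 - z)$, the known mean component size of a sparse random graph, which gives a simple sanity check.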
As mentioned above, we are interested in knowing the threshold at which the average cluster size becomes finite, or inversely, when it becomes infinite. Clearly, Eq. (20) diverges when $1 - F_1'(1) = 0$, and this critical condition sets the threshold between finite and infinite cluster sizes. Finally, replacing $F_1'(1)$ by its definition, obtained from Eq. (15), we obtain a critical condition for $q_k$, which was our initial goal:

$$\sum_{k=0}^{\infty} k p_k \left(k q_k - q_k - 1\right) = 0. \tag{21}$$

Equation (21) defines the critical condition for the stability of an uncorrelated infinite network under an arbitrary attack. For failure, i.e., when the removal does not depend on the degree $k$ of the node, $q_k = q$ and Eq. (21) gives the critical surviving fraction $q_c = \langle k \rangle / (\langle k^2 \rangle - \langle k \rangle)$. The classical percolation threshold for failure [13, 10] is thus retrieved: the critical fraction of removed nodes is

$$1 - q_c = 1 - \frac{\langle k \rangle}{\langle k^2 \rangle - \langle k \rangle}. \tag{22}$$
Notice that Eq. (22) defines the percolation threshold for infinite networks. The critical qc strongly depends on system size and thus Eq. (22) fails to describe the stability of finite networks [17]. Also notice that a basic assumption behind Eq. (21) is that the original network is uncorrelated. Expressions for the percolation threshold of finite and/or correlated networks are still missing.
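To make the failure case concrete, setting $q_k = q$ in Eq. (21) gives $q \sum_k k(k-1) p_k = \sum_k k p_k$, so the critical surviving fraction is $\langle k \rangle / (\langle k^2 \rangle - \langle k \rangle)$ and the critical removed fraction is one minus that value. The sketch below evaluates both the threshold and the left-hand side of Eq. (21); the helper names and the two example distributions (a truncated Poisson and a truncated power law with exponent 2.5) are illustrative assumptions.

```python
import math

def critical_surviving_fraction(pk):
    """Critical q for random failure: <k> / (<k^2> - <k>), from Eq. (21) with q_k = q."""
    k1 = sum(k * p for k, p in enumerate(pk))
    k2 = sum(k * k * p for k, p in enumerate(pk))
    return k1 / (k2 - k1)

def eq21_residual(pk, q):
    """Left-hand side of Eq. (21) for a degree-independent survival probability q."""
    return sum(k * p * (k * q - q - 1.0) for k, p in enumerate(pk))

def poisson_pk(z, kmax=200):
    """Truncated Poisson degree distribution (illustrative)."""
    pk = [math.exp(-z)]
    for k in range(1, kmax + 1):
        pk.append(pk[-1] * z / k)
    return pk

def power_law_pk(gamma, kmax):
    """Normalized p_k ~ k^(-gamma) for 1 <= k <= kmax (illustrative)."""
    weights = [0.0] + [k ** (-gamma) for k in range(1, kmax + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]

qc = critical_surviving_fraction(poisson_pk(4.0))
print(qc, 1.0 - qc)   # surviving and removed fractions at the threshold
print(critical_surviving_fraction(power_law_pk(2.5, 100)),
      critical_surviving_fraction(power_law_pk(2.5, 10000)))
```

For the power law, the threshold keeps shrinking as the degree cutoff grows, which illustrates why the infinite-network result depends so strongly on system size.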
3 Two Current Hot Problems in Complex Networks

In this section we address two current hot problems in complex networks: dynamics on transportation networks and community identification in complex networks. Part of the future advances of complex network theory will clearly be along the lines of the problems reviewed in this section. However, we warn the reader that this selection of problems just gathers a small number of timely issues on networks which are particularly attractive to the author. The number of relevant open problems in the fast-evolving area of network theory exceeds by far the small selection presented here.

3.1 Dynamics on Transportation Networks

A transportation network typically models the movement of entities across the nodes of the network (see Fig. 2). A classical example is the airline transportation network where each node denotes a city (i.e., an airport) and edges indicate direct flights between cities. If we associate to each node $i$ a number $n_i(t)$ denoting the number of individuals at node $i$ at time $t$, we can model the dynamical flow of mass (or individuals) across the network. It is not difficult to imagine a transportation network moving various types (e.g., species) of individuals or entities. This means that at a given instant of time there will be various species of individuals coexisting at each node. If in turn there is a dynamics among the various types of individuals, on top of the transport dynamics there will be an inter-species dynamics. A chemical reaction where the chemical species diffuse across the transportation network [18] would be an example of this type of dynamical process. Another example would be the spreading of a disease through the airline transportation network [19, 20, 21], as occurred in 2002 during the outbreak of the severe acute respiratory syndrome (SARS) [19]. In this case, susceptible, infected, and recovered individuals are the reacting species.
In this section we briefly review some recent results [18, 22–25] which have helped to elucidate some key aspects of the metapopulation dynamics which occurs on transport networks. Let us start by understanding the transport dynamics.

3.1.1 Transport Dynamics

For the moment we assume that there is only one species diffusing in the system. A metapopulation description of the transport process can be obtained by thinking in terms of the mean occupation number $\tilde{n}_k(t)$ of nodes of degree $k$ at time $t$, which by definition reads as

$$\tilde{n}_k(t) = \frac{1}{N_k} \sum_{i \,:\, k_i = k} n^{(i)}(t), \tag{23}$$
where the sum runs over all nodes whose degree is $k$, $N_k$ refers to the total number of nodes with degree $k$, and $n^{(i)}(t)$ denotes the occupation number (= number of individuals) at node $i$. It is assumed that there is a diffusion rate $d(k, k')$ that controls the migration of individuals from a subpopulation with degree $k$ to another of degree $k'$. In consequence, the probability per unit time $L_k$ for an individual at a node of degree $k$ of leaving the node is $L_k = k \sum_{k'} p(k'|k)\, d(k, k')$, where $p(k'|k)$ is the conditional probability that an edge departing from a node of degree $k$ points to a node of degree $k'$. Thus, the (mean-field) time evolution of $\tilde{n}_k(t)$ can be expressed as

$$\partial_t \tilde{n}_k(t) = -L_k\, \tilde{n}_k(t) + k \sum_{k'} p(k'|k)\, d(k', k)\, \tilde{n}_{k'}(t). \tag{24}$$

The reasoning behind Eq. (24) is very simple. The first term on the right-hand side accounts for the individuals that initially are in a node of degree $k$ and then leave it, while the second term accounts for the increase of individuals in $k$-degree nodes due to the migration of individuals from subpopulations of degree $k'$ to $k$. For uncorrelated networks, $p(k'|k)$ takes the form $p(k'|k) = k' p_{k'}/\langle k \rangle$ and Eq. (24) reduces to

$$\partial_t \tilde{n}_k(t) = -L_k\, \tilde{n}_k(t) + \frac{k}{\langle k \rangle} \sum_{k'} k' p_{k'}\, d(k', k)\, \tilde{n}_{k'}(t). \tag{25}$$

If in addition it is assumed that the probability for an individual to leave a given population is independent of its degree, then $L_k = L$ for all $k$, and $d(k, k') = L/k$. The stationary solution for $N_k(t)$ then reads:

$$N_k(t \to \infty) = \frac{k}{\langle k \rangle}\, N, \tag{26}$$

where $N_k$ now denotes the occupation number of a node of degree $k$ and $N = \sum_k p_k N_k$ is the mean occupation number per node, which is conserved by the dynamics.
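The relaxation toward the stationary state of Eq. (26) can be checked numerically. The following is a minimal sketch under stated assumptions: it integrates Eq. (25) with a simple explicit Euler step, uses the degree-independent exit rate $d(k', k) = L/k'$, and works on a small hypothetical set of degree classes; the function name and parameter values are illustrative.

```python
def relax_occupation(degrees, pk, n0, L=1.0, dt=0.01, steps=3000):
    """Euler integration of Eq. (25) with d(k', k) = L / k'.

    degrees: the degree classes k; pk: their probabilities p_k;
    n0: initial mean occupation number of each degree class.
    """
    mean_k = sum(k * p for k, p in zip(degrees, pk))
    n = list(n0)
    for _ in range(steps):
        # sum over k' of k' * p_k' * d(k', k) * n_k', with d(k', k) = L / k'
        inflow = sum(kp * p * (L / kp) * nk
                     for kp, p, nk in zip(degrees, pk, n))
        n = [nk + dt * (-L * nk + (k / mean_k) * inflow)
             for k, nk in zip(degrees, n)]
    return n

degrees = [1, 2, 3, 10]
pk = [0.4, 0.3, 0.2, 0.1]
n_final = relax_occupation(degrees, pk, n0=[100.0] * 4)
mean_k = sum(k * p for k, p in zip(degrees, pk))
print(n_final)
print([k / mean_k * 100.0 for k in degrees])   # prediction of Eq. (26)
```

Starting from equal occupation in every degree class, the occupation numbers become proportional to the degree, while the mean occupation per node stays at its initial value.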
A more realistic transportation process has to consider the migration of individuals to be proportional to the traffic intensity along the network edges. This can be obtained by defining a heterogeneous diffusion probability for any given individual to go from a subpopulation of degree $k$ to another one of degree $k'$ as $d(k, k') = L w_0 (k k')^{\theta} / T_k$, where $T_k$ provides the normalization that ensures that the overall outflow is still $L$, $\theta$ is a model parameter that controls the impact of the network topology, and $w_0$ is simply a constant.

3.1.2 Dynamics Among Different Species

In the following discussion we assume that there are multiple species traveling across the network which interact among themselves. We consider three interacting species: susceptible, infected, and recovered individuals which follow the classical Susceptible-Infected-Recovered (SIR) dynamics (see Fig. 2).

Fig. 2. The scheme illustrates a transportation network. Each node is a container of agents, i.e., a subpopulation. Agents are transported through the network edges, e.g., from node j to i. Inside each node, individual agents interact. The figure depicts a SIR dynamics in which susceptible agents get the disease from infected agents, which in turn become, after a characteristic time, recovered. [Figure labels: transportation network; subpopulation i; agents: susceptible, infected, recovered.]

For a single population (node), an epidemic outbreak can occur depending on the basic reproductive number $R_0$, which accounts for the number of secondary infected cases generated by a primary infected individual. The basic reproductive number is defined as

$$R_0 = \frac{\beta}{\mu}, \tag{27}$$
where $1/\beta$ is the characteristic time required by a susceptible individual to acquire the disease from any given neighbor, and $1/\mu$ is the characteristic time an individual remains infected after getting the disease. If $R_0 > 1$, new infections initially outnumber recoveries and the disease spreads. When $R_0 < 1$ the epidemic goes to extinction. Note that even if $R_0 > 1$, i.e., when the disease at the node level affects many individuals, the infection does not necessarily spread over the metapopulation system, which in turn means that a macroscopic fraction of nodes remains unaffected by the disease. For a global outbreak to happen, we still require a fast enough diffusion of individuals. In the following we review the derivation of the metapopulation disease invasion predictor $R_*$, which determines under which parameters (including $R_0$ and $d(k, k')$) a disease infects a finite fraction of the network. Let us start out by estimating the number of new infected individuals (seeds) that may appear in a connected subpopulation of degree $k'$ during the duration of an outbreak in a subpopulation of degree $k$. We denote by $\alpha N_k$ the number of infected individuals during the evolution of the epidemic in a closed subpopulation ($\alpha$ depends on the specific disease model). If each infected individual holds the disease for a characteristic time $\mu^{-1}$ during which it can travel to a neighboring subpopulation of degree $k'$ with a rate $d(k, k')$, then the number of new seeds can be expressed as

$$\lambda_{k,k'} = \frac{d(k, k')\, \alpha N_k}{\mu}. \tag{28}$$
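Equation (28) is easy to evaluate once $\alpha$ is known. The text leaves $\alpha$ model dependent; the sketch below assumes, purely for illustration, that $\alpha$ is the final epidemic size of a deterministic SIR model, which satisfies the standard final-size relation $\alpha = 1 - e^{-R_0 \alpha}$. The function names and numerical values are hypothetical.

```python
import math

def sir_attack_rate(R0, iterations=500):
    """Solve alpha = 1 - exp(-R0 * alpha) by fixed-point iteration.

    Assumption (not from the text): alpha is the SIR final epidemic size.
    """
    alpha = 1.0
    for _ in range(iterations):
        alpha = 1.0 - math.exp(-R0 * alpha)
    return alpha

def seeds(d, alpha, N_k, mu):
    """Eq. (28): number of infected individuals exported along one edge."""
    return d * alpha * N_k / mu

alpha = sir_attack_rate(2.0)
print(alpha)
print(seeds(d=1e-3, alpha=alpha, N_k=10_000, mu=0.25))
```

Below the local threshold ($R_0 < 1$) the iteration converges to $\alpha = 0$, so no seeds are exported, in line with the requirement that a local outbreak must occur before the disease can travel.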
Now we can derive a simple approximate evolution equation for the number of infected subpopulations $D^n$ at generation $n$ for a random graph in which each subpopulation has the same degree $k$,

$$D^n = D^{n-1} (k-1) \left[1 - \left(\frac{1}{R_0}\right)^{\lambda_{kk}}\right] \left[1 - \frac{D^{n-1}}{N}\right]. \tag{29}$$

The reasoning behind Eq. (29) is the following. Each of the $D^{n-1}$ infected populations at generation $n-1$ will seed during the next generation a number of subpopulations proportional to $k-1$ times the probability that the neighboring subpopulations are not infected (i.e., $1 - D^{n-1}/N$), times the probability that the new infected individuals cause a local outbreak (this probability is proportional to $1 - R_0^{-\lambda_{kk}}$ since the probability that a single individual will not transmit the disease is $R_0^{-1}$ [27]). Assuming, as before, a degree-independent exit rate, $d(k, k') = p/k$ (the mobility rate $p$ plays the role of $L$ above), then $\lambda_{kk} = p N_0 \alpha/(\mu k)$ (where $N_0 = N_k$). If in addition $R_0 \simeq 1$, such that $1 - R_0^{-\lambda_{kk}} \sim \lambda_{kk}(R_0 - 1)$, then in the early stage of the epidemic, where $D^{n-1}/N \ll 1$, Eq. (29) reduces to

$$D^n = p N_0 \alpha \mu^{-1}\, \frac{k-1}{k}\, (R_0 - 1)\, D^{n-1}. \tag{30}$$
From Eq. (30) it is easy to observe that a macroscopic outbreak can only occur if

$$R_* = p N_0 \alpha \mu^{-1}\, \frac{k-1}{k}\, (R_0 - 1) > 1. \tag{31}$$

Thus, the global invasion threshold is defined by Eq. (31). This implies that to observe global spread the mobility rate has to be such that

$$p \ge \frac{\mu k}{N_0\, \alpha\, (k-1)(R_0 - 1)}. \tag{32}$$
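The behavior on either side of the invasion threshold can be seen by iterating Eq. (29) directly. The sketch below is illustrative: the function names are hypothetical, the critical mobility is obtained by setting $R_* = 1$ in Eq. (31), and all parameter values are arbitrary.

```python
def iterate_eq29(D0, k, R0, lam, N, generations):
    """Iterate Eq. (29) for the number of infected subpopulations."""
    D = float(D0)
    history = [D]
    for _ in range(generations):
        D = D * (k - 1) * (1.0 - R0 ** (-lam)) * (1.0 - D / N)
        history.append(D)
    return history

def r_star(p, N0, alpha, mu, k, R0):
    """Invasion predictor of Eq. (31) for a homogeneous network of degree k."""
    return p * N0 * alpha / mu * (k - 1) / k * (R0 - 1.0)

def critical_mobility(N0, alpha, mu, k, R0):
    """Mobility rate at which R_* of Eq. (31) equals one."""
    return mu * k / (N0 * alpha * (k - 1) * (R0 - 1.0))

pc = critical_mobility(N0=1000, alpha=0.2, mu=0.5, k=4, R0=1.1)
print(pc, r_star(pc, 1000, 0.2, 0.5, 4, 1.1))

# few seeds per edge: the outbreak dies out; many seeds: it invades the network
print(iterate_eq29(D0=1, k=4, R0=1.1, lam=0.5, N=1000, generations=30)[-1])
print(iterate_eq29(D0=1, k=4, R0=1.1, lam=10.0, N=1000, generations=30)[-1])
```

The second run saturates at a finite fraction of the $N$ subpopulations because of the factor $1 - D^{n-1}/N$, which the linearized Eq. (30) neglects.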
In a heterogeneous metapopulation network, i.e., when the subpopulation degree varies across the network, Eq. (29) has to be replaced by

$$D_k^n = \sum_{k'} D_{k'}^{n-1} (k'-1)\, \lambda_{k'k}\, (R_0 - 1)\, p(k|k') \left[1 - \frac{D_k^{n-1}}{N_k}\right], \tag{33}$$

where again it was assumed that $R_0 \simeq 1$. Since $p(k|k')$ is the conditional probability that an edge attached to a node of degree $k'$ has its other tip connected to a node of degree $k$, $p(k|k')\, k'$ is the probability that at least one edge is connected to a node of degree $k$. In Eq. (33), $p(k|k')(k'-1)$ refers to the probability that a recently infected node with degree $k'$, discounting the edge from which the node got the disease, is linked to a node of degree $k$. As said above, when degree correlations can be neglected, $p(k|k') = k\, p(k)/\langle k \rangle$, and in the early stage of the epidemic, where $D_k^{n-1}/N_k \ll 1$, Eq. (33) can be expressed as

$$D_k^n = \frac{k\, p(k)}{\langle k \rangle}\, (R_0 - 1) \sum_{k'} D_{k'}^{n-1} (k'-1)\, \lambda_{k'k}. \tag{34}$$
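Similarly, Eq. (28) is reduced to

$$\lambda_{k'k} = \frac{p\, \alpha N_{k'}}{\mu k'}. \tag{35}$$

Consequently, using the stationary occupation $N_{k'} = k' N_0/\langle k \rangle$ given by Eq. (26), with $N_0$ the mean subpopulation size, the evolution equation for $D_k^n$ reads:

$$D_k^n = (R_0 - 1)\, \frac{k\, p(k)\, p N_0 \alpha}{\mu \langle k \rangle^2} \sum_{k'} D_{k'}^{n-1} (k'-1). \tag{36}$$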
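Multiplying both sides by $(k-1)$ and taking the sum over $k$ on both sides, Eq. (36) can be expressed as

$$\Theta^n = (R_0 - 1)\, \frac{\langle k^2 \rangle - \langle k \rangle}{\langle k \rangle^2}\, \frac{p N_0 \alpha}{\mu}\, \Theta^{n-1}, \tag{37}$$

where $\Theta^n$ is defined as $\Theta^n = \sum_k D_k^n (k-1)$. From Eq. (37) we learn that the disease spreads only if

$$R_* = (R_0 - 1)\, \frac{\langle k^2 \rangle - \langle k \rangle}{\langle k \rangle^2}\, \frac{p N_0 \alpha}{\mu} > 1. \tag{38}$$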
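The effect of degree heterogeneity on Eq. (38) can be made explicit with a short calculation. The sketch below is illustrative: the function name and the two example degree distributions (a regular network of degree 4, and a mixture of degrees 2 and 10 with the same mean) are assumptions chosen here, as are the parameter values.

```python
def r_star_heterogeneous(pk, p, N0, alpha, mu, R0):
    """Invasion predictor of Eq. (38); pk[k] is the probability of degree k."""
    k1 = sum(k * q for k, q in enumerate(pk))
    k2 = sum(k * k * q for k, q in enumerate(pk))
    return (R0 - 1.0) * (k2 - k1) / k1 ** 2 * p * N0 * alpha / mu

params = dict(p=0.01, N0=1000, alpha=0.2, mu=0.5, R0=1.1)

regular = [0.0, 0.0, 0.0, 0.0, 1.0]      # every subpopulation has degree 4
mixed = [0.0] * 11
mixed[2], mixed[10] = 0.75, 0.25          # same mean degree 4, broader distribution

print(r_star_heterogeneous(regular, **params))
print(r_star_heterogeneous(mixed, **params))
```

For the regular network the factor $(\langle k^2 \rangle - \langle k \rangle)/\langle k \rangle^2$ equals $(k-1)/k$, so Eq. (38) reduces to Eq. (31). The broader distribution with the same mean degree gives a larger $R_*$, i.e., heterogeneity lowers the mobility needed for global invasion.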
Equation (38) defines the global invasion threshold for a heterogeneous network. Though in recent years we have observed important progress related to the dynamics on transportation networks, there are still many open questions to be answered. For example, the degree of a subpopulation has been considered so far decoupled from the subpopulation size. However, we know that, in many cases, as in an airline transportation network which connects cities of different sizes, degree and subpopulation size are strongly correlated. In fact, a satisfactory network growth model for transportation networks is still lacking. Regarding the dynamics on the nodes, typically death and birth processes are ignored, even though small size nodes could experience large fluctuations which in turn could dramatically affect global flow on the network. Bottleneck effects due to limitation in the transportation channel, as well as limitation in node capacity, are important problems that deserve to be investigated.

3.2 Identifying Communities in Complex Networks

If we observe real-world networks, we notice that typically there are small sets of nodes which are highly connected to each other but with only few links to the rest of the network (see Fig. 3). These sets of highly connected nodes are typically referred to as communities or modules. To fully understand the internal topological structure of a network it is crucial to correctly detect the community structure in it.

Fig. 3. The scheme illustrates a network comprising two modules or communities. Notice the high connectivity exhibited by nodes in each community.

A general method for identification of communities in unipartite networks is the maximization of the modularity function $Q$ introduced by Newman and Girvan [28]. The function $Q$ evaluates the "goodness" of a partition of a network into communities. The basic assumption behind the modularity function is that a community or module of a network should exhibit a number of internal links greater than the number of links of a subset of a random network. For a network with $N$ nodes and $L$ links, the modularity function $Q$ is defined as follows:

$$Q = \sum_{s=1}^{m} q_s = \sum_{s=1}^{m} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right], \tag{39}$$

where the sum runs over the $m$ modules of the network, $l_s$ is the number of links inside module $s$, and $d_s$ is the total degree of the nodes in module $s$. The term $l_s/L$ denotes the fraction of links connecting pairs of nodes belonging to module $s$, while $(d_s/2L)^2$ represents the fraction of links that one would find in the module if links were placed at random in the network, under the constraint of respecting the degree distribution of the original network. If $q_s$ is such that

$$q_s = \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \ge 0, \tag{40}$$

the module is well defined, in the sense that the module presents more links than expected by random chance. The greater $q_s$, the better defined the module. The identification method implies the maximization of $Q$, which in turn involves sampling over all possible partitions of the network. Unfortunately,
the number of possible subsets grows exponentially with the network size, and the modularity optimization is an NP-complete problem [29]. Typically, the ambitious goal of finding the true optimum of the measure is not achievable. However, approximations of the maximum can be obtained by applying optimization algorithms such as simulated annealing, extremal optimization, or spectral division. Other drawbacks of the Newman–Girvan method are that it cannot scan the network below some scale, leaving small modules undetected, and that it may be affected by the time evolution of the network, i.e., by network size.

3.2.1 Identifying Communities in Bipartite Networks

Bipartite networks are a special and important class of networks in which nodes are divided into two disjoint subsets and edges link nodes of one subset with nodes in the other. The number of applications of bipartite networks is very large; however, one application in social science has become the prototypical example of a bipartite network: the movie-actor network [30–34]. This network is divided into two sets, the set of actors and the set of movies (also referred to as teams; see Fig. 4). An edge that connects an actor $a$ and a movie $m$ indicates that $a$ has participated in the movie $m$. Note that the behavior of these networks strongly depends on whether both partitions grow with time, which leads to scale-free degree distributions of actors, or on whether one of the partitions, e.g., the actor set, is fixed over time while the remaining set grows unboundedly, which results in a beta-distribution for the degree of actor nodes [35].

Fig. 4. The figure shows a scheme of a growing bipartite network. The team node D represents a new incoming node. The scheme at the bottom indicates the resulting unipartite projection of actor nodes (see text). [Figure labels: bipartite network with team nodes A–D and actor nodes 1–4; unipartite projection of actor nodes 1–4.]
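As a concrete illustration of the modularity function $Q$ of Eq. (39) introduced above for unipartite networks, the sketch below evaluates $Q$ for a given partition. It only scores a partition and does not perform the optimization; the function name, the edge-list representation, and the small test graph (two triangles joined by one link) are choices made here for illustration.

```python
def modularity(edges, module_of):
    """Modularity Q of Eq. (39) for an undirected edge list and a partition.

    edges: list of (u, v) pairs; module_of: dict mapping node -> module label.
    """
    L = len(edges)
    internal = {}     # l_s: number of links inside module s
    degree_sum = {}   # d_s: total degree of the nodes in module s
    for u, v in edges:
        for node in (u, v):
            s = module_of[node]
            degree_sum[s] = degree_sum.get(s, 0) + 1
        if module_of[u] == module_of[v]:
            s = module_of[u]
            internal[s] = internal.get(s, 0) + 1
    return sum(internal.get(s, 0) / L - (d / (2 * L)) ** 2
               for s, d in degree_sum.items())

# two triangles joined by a single link
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {node: "a" for node in range(6)}
print(modularity(edges, two_modules))
print(modularity(edges, one_module))
```

Placing all nodes in a single module gives $l_s = L$ and $d_s = 2L$, hence $Q = 0$, while the partition into the two triangles gives a positive $Q$, consistent with the criterion of Eq. (40).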
Many relevant properties of bipartite networks become evident in the unipartite projection of actor nodes. In this unipartite network, an edge running from an actor $a$ to an actor $a'$ indicates that $a$ and $a'$ have co-starred in the same movie (see Fig. 4). Notice that in consequence the actors attached to a movie $m$ in the bipartite network are part of a clique in the unipartite projection. Bipartite networks have intrinsically very strong modularity and typically exhibit complex structure. Guimerà et al. [36] have recently proposed a simple and elegant model for bipartite network growth which allows us to study different levels of modularity in bipartite networks. The model assumes that each actor and each movie has an associated color. The number of colors is a model parameter that has to be defined in advance. The next step is to assign a color to each actor. Once all this has been defined, the network is grown according to the following steps.

a) Create team $m$.
b) Select the number $\mu_m$ of actors in the team.
c) Select the color $c_m$ of the team.
d) For each of the $\mu_m$ actors in $m$ proceed as follows: with probability $p$, select the actor from the pool of actors with the team color $c_m$; otherwise, select an actor at random with equal probability.
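The steps above can be sketched as follows. This is a minimal illustration rather than the implementation of [36]: the function name is hypothetical, and several details that the text leaves open are filled in by assumption, namely a fixed team size, a uniformly random team color, a round-robin assignment of colors to actors, and distinct actors within a team.

```python
import random

def grow_teams(n_actors, n_colors, n_teams, team_size, p, seed=0):
    """Grow a bipartite actor-team network following steps a) to d).

    Returns (actor_color, teams); each team is a (color, members) pair.
    Requires team_size <= number of actors of each color.
    """
    rng = random.Random(seed)
    # assumption: colors are assigned to actors in round-robin fashion
    actor_color = [a % n_colors for a in range(n_actors)]
    pools = [[a for a in range(n_actors) if actor_color[a] == c]
             for c in range(n_colors)]
    teams = []
    for _ in range(n_teams):                 # a) create a team
        mu_m = team_size                     # b) assumption: fixed team size
        color = rng.randrange(n_colors)      # c) assumption: uniform team color
        members = set()
        while len(members) < mu_m:           # d) fill the team
            if rng.random() < p:
                members.add(rng.choice(pools[color]))   # same-color actor
            else:
                members.add(rng.randrange(n_actors))    # any actor
        teams.append((color, sorted(members)))
    return actor_color, teams

actor_color, teams = grow_teams(n_actors=12, n_colors=3, n_teams=50,
                                team_size=3, p=1.0)
print(teams[:3])
```

With $p = 1$ every team draws all of its members from a single color pool, whereas with $p = 0$ the team color plays no role in the selection.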
The parameter $p$ is called team homogeneity and quantifies how homogeneous a team is. For $p = 1$ all the actors in the team belong to the same module and modules are perfectly segregated, whereas for $p = 0$ the color of the team is irrelevant, actors are perfectly mixed, and the network does not have a modular structure. Guimerà et al. [36] have adapted the modularity criterion of Newman–Girvan, Eq. (39), to account for modularity in bipartite networks. They consider that the expected number of times a given actor $a$ belongs to a team composed of $\mu$ actors is

$$p_{a \to m} = \mu\, \frac{t_a}{\sum_k t_k}, \tag{41}$$

where $t_a$ is the total number of movies in which actor $a$ has participated, i.e., the degree of node $a$. Eq. (41) represents the probability that a given team $m$ with $\mu$ actors is connected to actor $a$. Thus, the probability that a team $m$ is connected to both $a$ and $a'$ is given by

$$p_{a,a' \to m} = \mu(\mu - 1)\, \frac{t_a t_{a'}}{\left( \sum_k t_k \right)^2}. \tag{42}$$

In consequence, the average number $n_{a,a'}$ of movies in which $a$ and $a'$ have co-starred (assuming a non-correlated random process) is

$$n_{a,a'} = \frac{\sum_m \mu_m (\mu_m - 1)}{\left( \sum_m \mu_m \right)^2}\, t_a t_{a'}, \tag{43}$$

where $\sum_m \mu_m = \sum_k t_k$. From Eq. (43) the bipartite modularity can be expressed as the cumulative deviation from the random expectation of co-starring movies:

$$Q_B = \sum_s \left[ \frac{\sum_{a \ne a' \in s} c_{aa'}}{\sum_m \mu_m (\mu_m - 1)} - \frac{\sum_{a \ne a' \in s} t_a t_{a'}}{\left( \sum_m \mu_m \right)^2} \right], \tag{44}$$

where $c_{aa'}$ is the actual number of movies in which $a$ and $a'$ have co-starred. Notice that the identification of modules through the optimization of $Q_B$ leads to the same type of problems present in the Newman–Girvan method: the method leaves small modules undetected and strongly depends on network size.

The identification of communities in complex networks is extremely important, since it can reveal functional relationships between nodes. So far the available methods for modularity identification are purely phenomenological and they cannot guarantee the correct identification of the community structure. A theoretical foundation for modularity identification is still lacking. Due to the relevance of the problem, we expect to observe important theoretical progress in this direction in the near future.
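The bipartite modularity of Eq. (44) can be evaluated directly from the list of teams. The following is a minimal sketch under stated assumptions: the function name is hypothetical, teams are given as lists of distinct actors, and both sums over $a \ne a'$ are taken over ordered pairs, so that the first term sums to one when all actors are in the same module.

```python
from itertools import permutations

def bipartite_modularity(teams, module_of):
    """Bipartite modularity Q_B of Eq. (44).

    teams: list of lists of distinct actors; module_of: dict actor -> module.
    Sums over a != a' run over ordered pairs (assumption stated in the text above).
    """
    t = {}     # t_a: number of teams in which actor a takes part
    co = {}    # c_{aa'}: number of teams shared by the ordered pair (a, a')
    for members in teams:
        for a in members:
            t[a] = t.get(a, 0) + 1
        for a, b in permutations(members, 2):
            co[(a, b)] = co.get((a, b), 0) + 1
    pair_norm = sum(len(m) * (len(m) - 1) for m in teams)   # sum_m mu_m (mu_m - 1)
    size_norm = sum(len(m) for m in teams) ** 2             # (sum_m mu_m)^2
    q = 0.0
    for (a, b), c in co.items():
        if module_of[a] == module_of[b]:
            q += c / pair_norm
    for a, b in permutations(t, 2):
        if module_of[a] == module_of[b]:
            q -= t[a] * t[b] / size_norm
    return q

teams = [[0, 1, 2], [3, 4, 5]]
split = {0: "x", 1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}
merged = {a: "x" for a in range(6)}
print(bipartite_modularity(teams, split), bipartite_modularity(teams, merged))
```

For two disjoint teams, the partition that follows the teams scores higher than the partition that merges all actors into one module, as expected from the definition.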
4 Concluding Remarks

The complex network community has been growing for years. Every day we see new articles on complex networks, and the evolution of the field seems limitless. In such a dynamical research field, any prediction about the future of complex network theory is extremely risky. The two selected hot topics in this chapter, dynamical processes on transportation networks and identification of communities in complex networks, are certainly areas that will experience important progress in the near future. Very important progress is also expected in many other areas, for example, in dynamical networks of moving agents. In the coming years we will witness substantial new progress in network research.
References

1. R. Albert and A.-L. Barabási, Rev. Mod. Phys. 74, 47 (2002).
2. S.N. Dorogovtsev and J.F.F. Mendes, Evolution of Networks: From Biological Nets to the Internet and WWW, Oxford University Press, Oxford, UK (2003).
3. F. Chung and L. Lu, Adv. Appl. Math. 26, 257 (2001).
4. E.P. Wigner, Ann. Math. 62, 548 (1955).
5. E.P. Wigner, Ann. Math. 65, 203 (1957).
6. E.P. Wigner, Ann. Math. 67, 325 (1958).
7. M.L. Mehta, Random Matrices, 2nd ed., Academic Press, New York (1991).
8. A. Crisanti, G. Paladin, and A. Vulpiani, Products of Random Matrices in Statistical Physics, Springer, Berlin (1993).
9. T. Guhr, A. Mueller-Groeling, and H.A. Weidenmueller, Phys. Rep. 299, 189 (1998).
10. D.S. Callaway, M.E.J. Newman, S.H. Strogatz, and D.J. Watts, Phys. Rev. Lett. 85, 5468 (2000).
11. M.E.J. Newman, S.H. Strogatz, and D.J. Watts, Phys. Rev. E 64, 026118 (2001).
12. R. Albert, H. Jeong, and A.-L. Barabási, Nature (London) 406, 6794 (2000); 406, 378 (2000).
13. R. Cohen, K. Erez, D. Ben-Avraham, and S. Havlin, Phys. Rev. Lett. 85, 4626 (2000).
14. R. Cohen, K. Erez, D. Ben-Avraham, and S. Havlin, Phys. Rev. Lett. 86, 3682 (2001).
15. B. Mitra, F. Peruani, S. Ghose, and N. Ganguly, in Proceedings of 14th ACM Conference on Computer and Communications Security (Association for Computing Machinery, Inc. New York, 2007).
16. B. Mitra, F. Peruani, S. Ghose, and N. Ganguly, in Proceedings of 26th Symposium on Principles of Distributed Computing (Association for Computing Machinery, Inc. New York, 2007).
17. B. Mitra, N. Ganguly, S. Ghose, and F. Peruani, Phys. Rev. E 78, 026115 (2008).
18. V. Colizza, R. Pastor-Satorras, and A. Vespignani, Nature Physics 3, 276–282 (2007).
19. L. Hufnagel, D. Brockmann, and T. Geisel, Proc. Natl. Acad. Sci. USA 101, 15124 (2004).
20. Z. Wu, L.A. Braunstein, V. Colizza, R. Cohen, S. Havlin, and H.E. Stanley, Phys. Rev. E 74, 056104 (2006).
21. V. Colizza, A. Barrat, M. Barthelemy, and A. Vespignani, Proc. Natl. Acad. Sci. USA 103, 2015–2020 (2006).
22. V. Colizza and A. Vespignani, J. Theor. Biol. 251, 450–467 (2008).
23. V. Colizza and A. Vespignani, Phys. Rev. Lett. 99, 148701 (2007).
24. V. Colizza, A. Barrat, M. Barthelemy, and A. Vespignani, Int. J. Bifurcation and Chaos 17, 2491–2500 (2007).
25. V. Colizza, A. Barrat, M. Barthelemy, and A. Vespignani, BMC Medicine 5, 34 (2007).
26. I.J. Farkas, I. Derenyi, A.-L. Barabási, and T. Vicsek, Phys. Rev. E 64, 026704 (2001).
27. N.T. Bailey, The Mathematical Theory of Infectious Diseases, 2nd edition, Hodder Arnold (1975).
28. M.E.J. Newman and M. Girvan, Phys. Rev. E 69, 026113 (2004).
29. S. Fortunato, e-print arXiv:0705.4445.
30. J.J. Ramasco, S.N. Dorogovtsev, and R. Pastor-Satorras, Phys. Rev. E 70, 036106 (2004).
31. D.J. Watts and S.H. Strogatz, Nature (London) 393, 440 (1998).
32. R. Albert and A.-L. Barabási, Phys. Rev. Lett. 85, 5234 (2000).
33. R. Albert and A.-L. Barabási, Science 286, 509 (1999).
34. L.A.N. Amaral, A. Scala, M. Barthélémy, and H.E. Stanley, Proc. Natl. Acad. Sci. 97, 11149 (2000).
35. F. Peruani, M. Choudhury, A. Mukherjee, and N. Ganguly, Europhys. Lett. 79, 28001 (2007).
36. R. Guimerà, M. Sales-Pardo, and L.A. Nunes Amaral, Phys. Rev. E 76, 036102 (2007).
37. A.-L. Barabási, H. Jeong, and R. Albert, Physica A 272, 173 (1999).
Glossary of Essential Terms
Adjacency Matrix: Let G be a graph with n vertices. The n × n matrix A, such that aij = 1 if there is an edge between vertices vi and vj and where the rest of the values are 0, is called the adjacency matrix of graph G. Assortativity: Assortativity refers to a preference for a network’s nodes to attach to others that are similar or different in some way. Assortativity Coefficient: The assortativity coefficient is the Pearson correlation coefficient r between pairs of nodes. Hence, positive values of r indicate a correlation between nodes of similar degree, while negative values indicate relationships between nodes of different degree. Automorphic Equivalence: Two vertices u and v of a labeled graph G are automorphically equivalent if all the vertices can be relabeled to form an isomorphic graph with the labels of u and v interchanged. Betweenness Centrality: Betweenness centrality of a node v is defined as the sum of ratios of the number of shortest paths between vertices s and t (s, t ∈ V ) through v to the total number of shortest paths between s and t. The betweenness centrality g(v) of v is given by g(v) = Σs=v=t
σst (v) . σst
(1)
Biological Networks: Biological networks are representations of biological systems such as metabolic networks, protein interaction networks etc. Bipartite Graphs: Bipartite graphs are graphs that contain vertices of two distinct types, with edges running only between unlike types.
296
Glossary of Essential Terms
Centrality: The centrality of a node in a network is a measure of the structural importance of the node.
Citation Networks: A citation network is a network whose nodes are articles, with a directed edge from node i to node j if article i cites article j.
Clique: A clique is a complete graph, in which every node is connected to every other node.
Closeness Centrality: The closeness centrality $C_c(v)$ of a vertex v is the reciprocal of the sum of geodesic distances to all other vertices in the graph:
$$C_c(v) = \frac{1}{\sum_{t \in V} d_G(v, t)}. \tag{2}$$
Clustering Coefficient: The clustering coefficient of a vertex v in a network is defined as the ratio of the total number of connections among the neighbors of v to the total number of possible connections between those neighbors. For a vertex i with neighbor set $N_i$ and degree $k_i$, the clustering coefficient is given by
$$C_i = \frac{|\{e_{jk}\}|}{k_i (k_i - 1)} : \; v_j, v_k \in N_i, \; e_{jk} \in E. \tag{3}$$
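Equations (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the graph is undirected, connected, and stored as a dictionary of neighbor sets; the function names are hypothetical; and $|\{e_{jk}\}|$ is read as a count of ordered neighbor pairs, so each undirected edge among the neighbors is counted in both directions.

```python
from collections import deque

def bfs_distances(adj, source):
    """Geodesic distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness(adj, v):
    """Equation (2): reciprocal of the summed distances from v."""
    dist = bfs_distances(adj, v)
    return 1.0 / sum(d for node, d in dist.items() if node != v)

def clustering(adj, i):
    """Equation (3): linked ordered neighbor pairs over k_i (k_i - 1)."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention assumed here for nodes with fewer than 2 neighbors
    links = sum(1 for j in nbrs for m in nbrs if j != m and m in adj[j])
    return links / (k * (k - 1))

# Hypothetical example: triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(closeness(adj, "d"))   # 0.2, since the distances from d are 1, 2, 2
print(clustering(adj, "a"))  # 1.0, both neighbors of a are linked
print(clustering(adj, "c"))  # one linked pair out of three, about 0.333
```

Counting ordered pairs over $k_i(k_i - 1)$ gives the same value as counting undirected edges over $k_i(k_i - 1)/2$, which is the form often quoted for undirected graphs.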
Community: A community is a subgraph, where in some reasonable sense the nodes in the subgraph have more to do with each other than with the nodes that are outside the subgraph.
Coordination Number: The coordination number of a graph is the average degree z of the nodes of the network.
Cumulative Advantage: Cumulative advantage means that the more connected a node is, the more likely it is to receive new links. Nodes with higher degree have a stronger ability to grab links added to the network. This concept is more popularly known as “preferential attachment.”
Degree Centrality: Degree centrality is defined as the number of links incident upon a node.
Degree Distribution: The degree distribution of a network gives the probability distribution of the degree of a random node in a network.
Diameter: The diameter of a graph is defined as the maximum of all the shortest distances between any two nodes in the graph.
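The coordination number, degree distribution, and diameter defined above can be computed directly from an adjacency structure. The sketch below is a minimal illustration that assumes an undirected, connected graph stored as a dictionary of neighbor sets; all function names and the example graph are hypothetical.

```python
from collections import Counter, deque

def degree_distribution(adj):
    """Return {k: fraction of nodes whose degree is k}."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    n = len(adj)
    return {k: counts[k] / n for k in sorted(counts)}

def coordination_number(adj):
    """Average degree z of the network."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def eccentricity(adj, source):
    """Largest geodesic distance from source, found by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

def diameter(adj):
    """Maximum shortest distance over all node pairs (connected graph assumed)."""
    return max(eccentricity(adj, v) for v in adj)

# Hypothetical example: triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(degree_distribution(adj))  # {1: 0.25, 2: 0.5, 3: 0.25}
print(coordination_number(adj))  # 2.0
print(diameter(adj))             # 2
```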
Dual Graphs: A dual graph of a given planar graph G is a graph that has a vertex for each plane region of G, and an edge for each edge in G joining two neighboring regions, for a certain embedding of G.
Edge Connectivity: The edge connectivity of G, $\kappa'(G)$, is the minimum size of a disconnecting set of edges.
Edge Cutset: An edge cutset is a set F ⊆ E(G) such that G − F has more than one component.
Eigenvector Centrality: Eigenvector centrality is a measure of the importance of a node in a network. It assigns relative scores to all nodes in the network based on the principle that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes. Thus, the centrality of a node is proportional to the sum of the centralities of the nodes to which it is connected, a definition that applies recursively.
Erdős–Rényi Graph: In the E-R graph model, each pair of the n vertices is connected by an edge with some probability p. The probability of a vertex having degree k is given by (with z = np)
$$p_k = \binom{n}{k} p^k (1 - p)^{n-k} \simeq \frac{z^k e^{-z}}{k!}, \tag{4}$$
where the Poisson form on the right holds approximately for large n.
Euclidean Distance: The Euclidean distance between two nodes A and B is defined as
$$ED(A, B) = \sqrt{\sum_i (A_i - B_i)^2}. \tag{5}$$
Euler’s Formula: If a connected planar graph G has exactly n vertices, e edges, and f faces, then n − e + f = 2.
Euler Tour: An Euler tour of a connected, directed graph G = (V, E) is a cycle that traverses each edge of graph G exactly once, although it may visit a vertex more than once.
Euler Walk: An Euler walk in an undirected graph is a path that uses each edge exactly once.
Geodesic Path: The geodesic path between two vertices is the shortest path between them.
Giant Component: The giant component refers to a connected subgraph that contains a majority of the entire graph’s nodes.
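The two forms of the Erdős–Rényi degree distribution in equation (4) can be compared numerically. The sketch below is only an illustration: the values n = 1000 and p = 0.004 (so z = 4) are hypothetical, and the function names are not taken from the text.

```python
from math import comb, exp, factorial

def binomial_pk(n, p, k):
    """Left-hand form of equation (4): exact binomial probability."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pk(z, k):
    """Right-hand form of equation (4): Poisson approximation with mean z."""
    return z**k * exp(-z) / factorial(k)

n, p = 1000, 0.004  # hypothetical parameters chosen for illustration
z = n * p           # mean degree in the notation of the glossary

for k in range(11):
    exact = binomial_pk(n, p, k)
    approx = poisson_pk(z, k)
    print(f"k={k:2d}  binomial={exact:.5f}  poisson={approx:.5f}")
```

For a sparse graph of this size the two columns agree closely, which is why the Poisson form is commonly used for large n at fixed z.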
Hierarchical Clustering: Hierarchical clustering builds (agglomerative) or breaks up (divisive) a hierarchy of clusters.
Hyperedges: Hyperedges are edges that join more than two nodes together.
Hypergraphs: Hypergraphs are graphs that have hyperedges.
Incidence Matrix: The incidence matrix of a graph is the (0, 1)-matrix that has a row for each vertex and a column for each edge, with entry (v, e) = 1 iff edge e is incident on vertex v.
Jaccard Coefficient: The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of the sample sets:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}. \tag{6}$$
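Equation (6) translates directly into set operations. In the minimal sketch below, the function name and the example neighbor sets are hypothetical, and returning 0.0 for two empty sets is a convention assumed here, since equation (6) is undefined in that case.

```python
def jaccard(a, b):
    """Equation (6): |A intersect B| / |A union B| for two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets have similarity 0
    return len(a & b) / len(union)

# Hypothetical example: neighbor sets of two nodes in some network.
neighbors_u = {1, 2, 3}
neighbors_v = {2, 3, 4, 5}
print(jaccard(neighbors_u, neighbors_v))  # 0.4 (2 shared out of 5 distinct)
```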
k-core: A k-core is defined as the maximal subset of nodes in which each node is connected to at least k other members of the subset.
k-connected: A connected graph G is k-connected iff every pair of vertices in G is joined by at least k non-intersecting paths and there exists at least one pair with exactly k non-intersecting paths.
k-plex: In a k-plex with n nodes, all the nodes have degree at least (n − k). A 1-plex is a clique.
Lagrange’s Matrix: If $d_i$ is the degree of node i, then Lagrange’s matrix (commonly called the graph Laplacian) is defined as follows:
$$L_{ij} = \begin{cases} d_i & \text{if } i = j, \\ -1 & \text{if } i \neq j \text{ and } i \text{ is connected to } j, \\ 0 & \text{otherwise.} \end{cases} \tag{7}$$
n-clan: An n-clan is an n-clique S such that the subgraph induced by S has a diameter (D) less than or equal to n.
n-clique: An n-clique is the maximal subset of the nodes where the distance between any two nodes u and v is less than or equal to n:
$$d(u, v) \leq n, \quad \forall u, v. \tag{8}$$
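The matrix of equation (7) can be assembled from an edge list. The sketch below is a minimal illustration for a simple undirected graph with nodes labeled 0 to n − 1; the function name and the example graph are assumptions, and self-loops and repeated edges are ignored by choice.

```python
def laplacian(n, edges):
    """Build the matrix of equation (7) for an undirected simple graph."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        if i == j or L[i][j] != 0:
            continue            # skip self-loops and repeated edges
        L[i][j] = L[j][i] = -1  # off-diagonal entry for connected i, j
        L[i][i] += 1            # diagonal accumulates the degree d_i
        L[j][j] += 1
    return L

# Hypothetical example: the path 0 - 1 - 2.
L = laplacian(3, [(0, 1), (1, 2)])
for row in L:
    print(row)
# [1, -1, 0]
# [-1, 2, -1]
# [0, -1, 1]
```

Each row sums to zero, because the diagonal degree cancels the −1 entries contributed by that node's neighbors.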
Network Motif: Network motifs are patterns that occur in different parts of a network at frequencies much higher than those found in randomized networks.
Pearson’s Correlation Coefficient: Pearson’s correlation coefficient between two nodes x and y, with sums taken over n paired observations, can be measured as
$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}}. \tag{9}$$
Percolation Theory: Percolation theory is based on adding nodes and connections to an empty graph until a giant component surfaces. A percolation process is one in which vertices or edges on a graph are randomly designated as either occupied or unoccupied and one asks about various properties of the resulting patterns of vertices.
Planar Graphs: A graph is planar if it can be drawn in the plane without edge crossings.
Power Law: A power law is any polynomial relationship that exhibits the property of scale invariance. The most common power laws relate two variables and have the form
$$f(x) = a x^k + o(x^k). \tag{10}$$
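Equation (9) for Pearson’s correlation coefficient can be evaluated term by term. The sketch below is a minimal illustration with hypothetical names and data. Its second example applies the formula to the degrees at the two ends of every edge of a star graph, in the spirit of the assortativity coefficient; treating each edge in both directions is an assumption made here.

```python
from math import sqrt

def pearson_r(x, y):
    """Equation (9), computed from the raw sums over n paired observations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    numerator = sxy - sx * sy / n
    denominator = sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))
    return numerator / denominator

# Hypothetical data: y is an exact multiple of x.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0

# Star graph with hub 0 and leaves 1, 2, 3: degrees at both ends of each
# edge, listing every edge once in each direction.
edges = [(0, 1), (0, 2), (0, 3)]
degree = {0: 3, 1: 1, 2: 1, 3: 1}
ends = edges + [(v, u) for u, v in edges]
x = [degree[u] for u, _ in ends]
y = [degree[v] for _, v in ends]
print(pearson_r(x, y))  # -1.0
```

The star graph gives r = −1 because every edge joins the highest-degree node to a lowest-degree node.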
Preferential Attachment: Preferential attachment means that the more connected a node is, the more likely it is to receive new links. Nodes with higher degree have a stronger ability to grab links added to the network.
Random Graphs: A random graph is a graph that is generated by some random process.
Reciprocity: Reciprocity is the probability that a pair of vertices in a directed network are connected to each other by directed edges in both directions.
Regular Equivalence: Two nodes are said to be regularly equivalent if they have the same profile of ties with other nodes that are also regularly equivalent.
Resilience: Resilience is the ability of a network to withstand the removal of its vertices.
Scale-Free Network: The defining characteristic of scale-free networks is that their degree distribution follows the Yule–Simon distribution, a power-law relationship defined by $p_k \sim k^{-\gamma}$.
SIR Epidemic Model: SIR (Susceptible-Infected-Recovered/Removed) is a model of disease spread where individuals are susceptible to a disease, potentially contract the disease, and then recover without becoming susceptible any further. The removed class can also include individuals who die of the disease.
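The SIR description above can be made concrete with a small simulation. The glossary gives no equations, so the sketch below assumes the standard well-mixed compartmental form (ds/dt = −βsi, di/dt = βsi − γi, dr/dt = γi) integrated with simple Euler steps. The rates β and γ, the step size, and the initial fractions are all hypothetical.

```python
def simulate_sir(s0, i0, r0, beta, gamma, dt, steps):
    """Euler integration of an assumed well-mixed SIR model (population fractions)."""
    s, i, r = s0, i0, r0
    history = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i * dt  # susceptible -> infected
        new_recoveries = gamma * i * dt     # infected -> recovered/removed
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries                 # recovered never return to s
        history.append((s, i, r))
    return history

# Hypothetical parameters: 1% initially infected, beta = 0.5, gamma = 0.1.
history = simulate_sir(0.99, 0.01, 0.0, beta=0.5, gamma=0.1, dt=0.1, steps=1000)
peak_infected = max(i for _, i, _ in history)
s_end, i_end, r_end = history[-1]
print(f"peak infected fraction: {peak_infected:.3f}")
print(f"final state: s={s_end:.3f}, i={i_end:.3f}, r={r_end:.3f}")
```

Because recovered individuals never re-enter the susceptible class, the susceptible fraction can only fall and the recovered fraction can only grow, which is the feature that distinguishes SIR from models with reinfection.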
SIS Epidemic Model: SIS (Susceptible-Infected-Susceptible) is a model of disease spread where individuals are susceptible to a disease, potentially contract the disease, and are once again susceptible as soon as they recover.
Small-World Network: A small-world network is a type of mathematical graph in which most nodes are not neighbors of one another, but most nodes can be reached from every other node by a small number of hops or steps. Such networks show a large clustering coefficient and a small average shortest path distance.
Social Network: A social network is a social structure made of nodes that are tied by one or more specific types of interdependency, such as values, visions, ideas, financial exchange, friends, kinship, dislike, conflict, trade, web links, sexual relations, disease transmission (epidemiology), or airline routes.
Strongly Connected Components: A strongly connected component of a directed graph G is a maximal set of vertices C ⊆ V such that, for every pair of vertices u and v in C, there is a directed path from u to v and a directed path from v to u.
Structural Equivalence: Two nodes are said to be structurally equivalent if they have the same relationships to all other nodes.
Structural Holes: Structural holes are nodes that separate non-redundant sources of information; that is, they act as a bridge between two networks that are not directly linked.
Technological Networks: Technological networks are man-made networks designed typically for distribution of some commodity or resource, such as electricity or information.
Vertex Connectivity: The connectivity of G, κ(G), is the minimum size of a vertex set S such that G − S is disconnected or has only one vertex.
Vertex Cutset: A vertex cutset of a graph G is a set S ⊆ V(G) such that G − S has more than one component.
Weakly Connected Components: A weakly connected component is a maximal subgraph of a directed graph such that, for every pair of vertices u, v in the subgraph, there is an undirected path between u and v, that is, a path that ignores the directions of the edges.
Zipf’s Law: Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.
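The difference between strongly and weakly connected components is easy to see on a small directed graph. The sketch below is a simple reachability-based illustration, not an efficient algorithm. The function names and the example graph are hypothetical, every node is assumed to appear as a key of the adjacency dictionary, and weak connectivity is taken to mean connectivity once edge directions are ignored.

```python
from collections import deque

def reachable(adj, source):
    """Set of nodes reachable from source by following edges in adj."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def strongly_connected_components(adj):
    """Group nodes that can reach each other along directed paths."""
    reach = {u: reachable(adj, u) for u in adj}
    components, assigned = [], set()
    for u in adj:
        if u in assigned:
            continue
        comp = {v for v in reach[u] if u in reach[v]}
        components.append(comp)
        assigned |= comp
    return components

def weakly_connected_components(adj):
    """Group nodes that are connected once edge directions are ignored."""
    undirected = {u: set() for u in adj}
    for u, nbrs in adj.items():
        for w in nbrs:
            undirected[u].add(w)
            undirected[w].add(u)
    components, assigned = [], set()
    for u in undirected:
        if u in assigned:
            continue
        comp = reachable(undirected, u)
        components.append(comp)
        assigned |= comp
    return components

# Hypothetical example: directed cycle 1 -> 2 -> 3 -> 1, an edge 4 -> 3,
# and an isolated node 5.
adj = {1: [2], 2: [3], 3: [1], 4: [3], 5: []}
print(strongly_connected_components(adj))  # [{1, 2, 3}, {4}, {5}]
print(weakly_connected_components(adj))    # [{1, 2, 3, 4}, {5}]
```

Node 4 can reach the cycle but cannot be reached from it, so it joins the cycle's weakly connected component while forming a strongly connected component of its own.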
Index
C. elegans, 4 E. coli, 74 ADIOS, 153 adjacency matrix, 98, 226 Akaike information criterion, 211 antonymy, 149 Apache, 200 apoptosis, 19, 27, 31 assembly model, 60 attachment kernel, 220, 222, 229–231, 234 attack, 258 authority, 159 B-cell antigen receptor, 7 bandwidth, 219, 224, 225, 227, 234 Barabási–Albert model, 194 Bayesian network, 45, 46 Belousov–Zhabotinsky reaction, 11 bifurcation point, 43 binding motif, 39 biochemical reaction, 38 biological systems, 35 bistability, 86 blogosphere, 159 Boolean model, 20, 23, 25, 31 network, 45 rules, 23 cascade model, 60 cells, 35 cellular system, 36, 73, 90
child concepts, 147 Chinese Whispers algorithm, 157 chlamydia, 97 clustering coefficient, 119, 207, 218, 257, 277 coevolution, 137, 139 community, 58 matrix, 60 structure, 135, 138, 287 competition, see ecological interaction complex adaptive system, 145 complexity science, 133, 135 compositional semantics, 145 computer viruses, 260 condition specific, 42 configuration model, 232 context vectors, 156 core lexicon, 151 cost, 77 function, 77 cross talk, 78 deep sequencing, 48 degradation, 81 degree, 222, 232 correlations, 220, 230–232, 235 distribution, 136, 203, 217, 234, 256, 275 cumulative, 223, 232 excess, 222, 232 exponential, 232 Poisson, 223, 227, 228, 234 power-law, 217–219, 224, 232 Weibull, 218
free-excess, 240 heterogeneity, 133, 135 deletion kernel, 220, 230, 231 dendrogram, 153 diameter, 277 differential equation, 36, 46 directed tree, 108 disassortative network, 5 disease, 36 distinctive features, 154 distributional hypothesis, 156 DNA, 73 microarray, 42 dynamic modeling, 46 dynamical, 5 eccentricity, 190 ecological interaction, 58 ammensalism, 59 commensalism, 59 competition, 59 mutualism, 59 parasitism, 59 predation, 58 symbiosis, 59 edge duplication, 123 pruning, 174 eigenfunction, 121–124 eigenvalue, 119–125, 128 eigenvector centrality, 98, 159 electric current, 45 elementary mode, 46 entangled, 86 epidemics, 9, 97, 244, 247 models, 261 epigenetic, 83 eukaryotes, 86 evolutionary model, 60 explanatory variables, 209 exponential random graph models, 200, 209 expression data, 42 factor graph, 45, 46 failures, 258 false negative, 41 positive, 41
fault tolerance, 258 feature economy, 154 feedback loops, 74 female sex workers, 97 fixed point, 81 flooding, 265 flux balance analysis, 46 food webs, 9, 58 game theory, 135, 137 gene, 35 generating function, 222, 234 genome, 35 giant component, 229, 242 glial cells, 11 global structure, 147 gonorrhea, 97 grammar dependency, 151 phrase structure, 151 tree-adjoining, 151 graph visualization, 99 bipartite, 99, 120, 121, 128 complete, 120 complete, 120 function call, 200 neighbour-based, 174 random, 122 regions, 97 second order co-occurrence, 177 sentence-based, 174 small-world, 167 steepest-ascent, 98 topographic analysis, 99 graphlet, 211 habitat fragmentation, 68 Heaviside step function, 230 hierarchical structure, 146 higher order transformation, 178 histones, 83 HITS, 159 HIV, 97 condom use, 103 Holme–Kim model, 194 holonymy, 149 homeostasis, 87 homosexual, 108
hub, 159 HyperLex, 158 hyperlink, 159 hypernymy, 149 hyponymy, 149 IκB, 81 implicational hierarchies, 160 indexing, 189–193, 195, 196 inflammation, 80 information, 189, 197 inhibitor, 81 interaction strength, 41, 61 Internet, 197, 253 intra-cellular signaling, 7 k-core, 201 kernel lexicon, 175 keystone species, 60 kinase, 8 cascade, 45 substrate cascade, 38 kinetic modeling, 45 kinetics, 46 Laplacian, 7 algebraic graph Laplacian, 120 graph Laplacian, 117 normalized Laplacian, 120 Leipzig Corpora Collection, 168 letter frequency distribution, 169 lexical spectrum, 168 linguistic systems, 145 universals, 160 local structure, 147 logical model, 45 Lyapunov exponent, 125 machine learning, 38, 41 macroscopic, 145 mass spectrometry, 44 maximum likelihood, 210 May–Wigner theorem, 12 mental lexicon, 146 meronymy, 149 mesoscopic, 135, 145 metabolic, 80 microscopic, 145
minimal cut set, 47 modeling, 135, 137 modular structure, 136 motif, 80, 122 collector, 89 consumer, 88 duplication, 122, 123 fashion, 89 joining, 122 socialist, 87 multistability, 74, 80 mutualism, see ecological interaction natural language processing, 167 negative feedback, 80 network adaptive, 137 assortative, 5 clustered, 237 co-authorship, 129 co-expression, 43 dynamic, 140 e-mail interchange, 127 electronic circuit, 130 equilibrium, 278 food web, 127 gene-gene, 38 Internet, 126 jazz band, 129 logical, 38 metabolic, 38, 46, 126 modular, 9 neuronal, 9, 127 peer-to-peer, 253 phonological neighborhood, 148 power-grid, 129 processes, 135 protein contact, 6 protein-protein interaction, 38, 126 protein-RNA, 39 random, 4 randomized, 76 reactive, 219 regular, 4 regulatory, 38 scale-free, 119, 259 sex, 97 signaling, 45 small-world, 9
social, 134–136, 138, 140 star, 14 static, 138 structured P2P, 263 superpeer, 268 syntactic dependency, 151 technological, 253 transcriptional, 43 transportation, 283 unstructured and decentralized P2P, 263 weblog, 127 word co-occurrence, 167 word collocation, 150 word-adjacency, 126 neurons, 11 NF-κB, 80 niche model, 60 node deletion, 219, 229 preferential, 230, 232 random, 219, 221, 227, 230 targeted, 230, 232 duplication, 119, 125 noise, 40 non-equilibrium network, 279 non-local diffusion, 10 nucleosome, 83 open source software, 199 opinion dynamics, 135, 138 orthographic similarity, 149 oscillations, 29, 31, 74, 80 oscillatory behaviour, 27 PageRank, 159 parasitism, see ecological interaction Pareto’s law, 168 peer churn, 268 peer-to-peer, 197, 218, 219, 223, 227, 230, 234 percolation theory, 280 Petri net, 45, 46 PhoNet Phoneme-Phoneme Network, 154 phonological similarity, 147 phylogenetic profile, 38 relations, 153
piecewise linear, 25, 31 PlaNet Phoneme-Language Network, 154 Polya–Cheeger constant, 124 polysemy, 149 population dynamics, 13 positive feedback, 74 posttranscriptional process, 44 power law, 257 distribution, 167 in language-related areas, 173 two-regime, 151 preferential attachment, 118, 119, 217, 218, 279 prey prey-predator, 63 prey-preference, 65 procedural software, 200 propagation cost, 205 protein, 35, 73 activation, 42 complex, 37, 38 localization, 42 modification, 45 protein-DNA interaction, 39, 43 turnover, 44 qualitative dynamics, 32 properties, 20 quality assessment, 40 random matrix, 12 walk, 225, 266 biased, 229 rank-degree distribution, 151 rate equation, 219, 221, 230–232, 235 recency effect, 150 recursive syntax, 145 regulatory, 80 rich-get-richer principle, 118 robustness, 13 saturated degradation, 81 scale-free, 135, 137 small-world graphs in language, 173 science of networks, 133, 140 search time, 219, 224–227, 234
search tree, 191 searching techniques, 264 self-organization, 145 semantic similarity, 147 sentence frequency, 172 signal propagation, 74 signals, 74 simulated annealing, 43 SIS model, 139 small-world, 135, 207, 257 social dynamics, 134, 135 groups, 135 media, 136 network analysis, 133, 135, 140 networking, 136 networks, 134–136, 138, 140 phenomena, 135, 137 structure, 138 socio-technological system, 253, 254 software engineering, 200 systems, 199 sound inventory, 153 spectral gap, 124 plot, 123, 125, 128 spectrum, 120–122, 125, 128, 278 SpellNet, 149 spiky, 81 spiral waves, 11 spreading, 135, 139 square lattices, 194 stability, 5 state change, 44, 46 steady state, 46 stimulus, 148 stoichiometry, 38, 46 structure discovery, 167
sub-lexical units, 146 symbiosis, see ecological interaction synchronization, 125 solution, 125 syntactic similarity, 147 synthetic lethal, 39 text summarization, 160 time course data, 43 time dependent, 42 time scales, 138 time-lagged correlation, 43 topology, 79 transcription factor, 41, 80 factor binding, 43 regulatory cascade, 39 translation, 39, 44 transmission probability, 103 tree, 120 treebanks, 151 triangle duplication, 123 trophic level, 58 typologies, 160 UCLA Phonological Segment Inventory Database, 154 unsupervised induction, 150 Watts–Strogatz model, 6 weak spot, 46 webpages, 159 word N -gram frequency, 170 word co-occurrence, 174 word sense disambiguation, 157 World Wide Web, 253 Zipf’s law, 168 Zipfian distribution, 168