BIOLOGICAL NETWORKS
Complex Systems and Interdisciplinary Science (ISSN: 1793-4540)
Series Editors: Felix Reed-Tsochas (University of Oxford, UK) Neil Johnson (University of Oxford, UK) Associate Editors: Brian Arthur Santa Fe Institute, Spain
Philip Maini University of Oxford, UK
Robert Axtell George Mason University, USA
Martin Nowak Harvard University, USA
Stefan Bornholdt University of Bremen, Germany
Ricard Solé Santa Fe Institute, Spain
Janet Efstathiou University of Oxford, UK
Dietrich Stauffer University of Cologne, Germany
Pak Ming Hui The Chinese University of Hong Kong, China
Kagan Tumer Oregon State University, USA
Published: Vol. 1
A Nasdaq Market Simulation: Insights on a Major Market from the Science of Complex Adaptive Systems by Vincent Darley & Alexander V. Outkin
Vol. 2
Large Scale Structure and Dynamics of Complex Networks edited by Alessandro Vespignani & Guido Caldarelli
Vol. 3
Biological Networks edited by François Képès
Forthcoming: Coping with Complexity: Understanding and Managing Complex Agent-Based Dynamical Networks edited by Janet Efstathiou, Neil F. Johnson & Felix Reed-Tsochas
Complex Systems and Interdisciplinary Science
Vol. 3
BIOLOGICAL NETWORKS
Editor
François Képès Genopole®, CNRS & University of Evry, France
World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI
Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
BIOLOGICAL NETWORKS Complex Systems and Interdisciplinary Science — Vol. 3 Copyright © 2007 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN-13 978-981-270-695-9 ISBN-10 981-270-695-X
Printed in Singapore.
PREFACE
Over the last few years, biologists have accumulated at an unprecedented pace huge datasets on systems at many different levels, ranging from molecules to populations. As these datasets typically consisted of a list of biological objects and of their interactions, they could naturally be captured by network representations. These uniform representations allowed any single domain of application to benefit from scientific breakthroughs originating in several disciplines, from graph theory to technological or social networks. This Book on Biological Networks testifies to the recent efficiency of this transversal approach, while anticipating that the advent of more sophisticated types of abundant data may inspire combinations of network methods with other approaches in an application-driven fashion.

Challenges

In network models, the relevant components in a system are identified as nodes. The interactions between these components are represented as links between nodes. Following this abstraction step, it becomes possible to study the topological properties of the network thus obtained. The generality and uniformity of the network representation make it possible to compare systems of very different types. At present, pure and combined network-based approaches still present fascinating challenges with respect to topological properties, and to temporal and spatial development.

Transversal challenges (Fig. 1) include tackling networks with a high number of nodes, and partitioning them and recomposing their parts in a useful way. Partitioning a network into sub-networks is of interest only if the resulting sub-networks or modules are biologically relevant and display characteristic dynamics that they retain upon recomposition into the full picture. While partitioning into relevant modules has met with some success, it is noteworthy that network recomposition is still in its infancy. Recomposition is, however, a requisite for the full success of
the modular approach in fruitfully compacting representations and in building a knowledge of Nature's own design principles. Modularity has an evolutionary counterpart. In particular, it is quite possible that modularity contributes to the evolvability of organisms. In this respect, the relation between functional modules and evolutionary modules must be questioned.

Another important transversal challenge to network-based approaches (Fig. 1) consists in tackling heterogeneity, a central feature of most empirical networks. This implies the possibility of expressing an arbitrary number of link types among a single set of nodes (layered networks), or an arbitrary number of node types, or heterogeneities both in nodes and links. Besides heterogeneities of components and links, it will also be fruitful to deal with temporal heterogeneity, i.e. connections that individually vary over time, even though they remain privileged, specific interactions.
Figure 1. Challenges of biological networks.
In any realistic model, one would ideally like to unfold networks both in time (dynamics) and in space, to escape the static view that one often associates with network topological descriptions (Fig. 1). Among the shortcomings of current purely network-based approaches, most
conspicuous is the lack of a geometrical space in which to place the biological objects in neighborhood situations with potentially important effects on the dynamics of their interactions, from cellular regulations to epidemics. Indeed, many studies in complex systems follow either a network-based or an agent-based approach. Seldom are these approaches used jointly. The reasons for disjoint use typically include the requirement for simplicity in the modeling process. However, these two approaches may rather be seen as complementary. Space and locality are at the heart of agent-based approaches, but these models fall short of allowing for perennial relations between specific agents, such as those encountered among individuals, cells and even biomolecules. Network models allow such privileged interactions between specific agents. Thus, it would be useful to harness the network-based and the agent-based approaches together for more realistic modeling that would involve both the movement of agents in space and privileged interactions between specific agents.

Finally, the search for design principles afforded by network approaches to biology will logically lead to applying these principles to build new forms of synthetic Life, for engineering purposes as well as for a better understanding of natural Life.

Outline

Part 1 of the Book addresses transversal topics, thus covering generic issues and providing the mathematical setting. The Chapters in Part 1 each survey many types of networks with a common question, although they anchor their discussion in a few well-chosen case studies for the sake of understanding. The common questions are topology in Chapter 1 by Barabási and coworkers, modularity in Chapter 2 by Solé and coworkers, and reverse engineering in Chapter 3 by d'Alché-Buc.

Part 2 of the Book addresses vertical topics, thus covering in depth various application domains at the molecular, cellular and population levels.
Each such Chapter corresponds to one application domain, describing one type of network, from the interacting partners and discovery methods, to the overall structure, dynamics and modeling. At the molecular level, transcriptional networks are discussed in Chapter 4
by Képès, protein networks in Chapter 5 by Ideker and coworker, metabolic networks in Chapter 6 by Fell, and mixed molecular networks in Chapter 7 by Schächter. At the cellular level, neuronal networks are covered in Chapter 9 by Frégnac and coworkers, and immunological networks in Chapter 10 by Callard and Stark. At the population level, Chapter 11 by Bersier offers a historical perspective and a wide introduction to ecological studies, while Chapter 12 by Martinez and coworker provides detailed innovative views on food webs. Finally, epidemiological networks are discussed in Chapter 13 by Koopman.

A biological network cannot be fully understood unless the evolutionary dimension is considered. This is why counterparts on natural genesis or artificial generation of networks are provided when sufficient ground is available, notably for molecular networks in Chapter 8 by Bornberg-Bauer and coworkers, and to a lesser extent for ecological networks inside Chapter 11 by Bersier. Evolutionary considerations are not absent from the topical chapters though.

Acknowledgements

This Book is part of a nascent Series that World Scientific devotes to "Complex Systems and Interdisciplinary Science". Without the exceptional dedication of Felix Reed-Tsochas from Oxford University, one of the Series Editors, this Book would never have seen the light of day. When Felix approached me with this idea, my response was immediately enthusiastic, as I had suffered in the past from the unavailability of a Book that would cover biological networks with both transversal spirit and in-depth insight, exactly what this Series was calling for. In preparing the Book project, his advice and encouragement have been particularly useful. It has been a pleasure to interact with Authors originating from a variety of disciplines and countries, some I knew well, some I am still hoping to meet in person some day, all of them enthusiastic scientists.
I have learnt a lot by reading their contributions, and I hope that the Reader will learn from these chapters and enjoy them as much as I did. Over the last ten years many people have influenced and supported my exploration of biological networks. In particular, the Genopole®
workgroups have been since their inception, and still are, a constant and lively source of scientific inspiration for me and my close colleagues. I am very grateful to Lizzie Bennett from Imperial College Press in London, who provided support and advice at crucial moments. Last but not least, I gratefully acknowledge the editorial assistance of Sylvie Bobelet from the Epigenomics Project in Évry. Week after week, she bravely fought with the reference mismatches and formatting issues that gave her a hard time, to finally assemble this Book.
Évry, March 2007
François Képès
CONTENTS

Preface

Contributors

Chapter 1. Scale-Free Networks in Biology
Eivind Almaas, Alexei Vázquez and Albert-László Barabási

Chapter 2. Modularity in Biological Networks
Ricard V. Solé, Sergi Valverde and Carlos Rodriguez-Caso

Chapter 3. Inference of Biological Regulatory Networks: Machine Learning Approaches
Florence d'Alché-Buc

Chapter 4. Transcriptional Networks
François Képès

Chapter 5. Protein Interaction Networks
Kai Tan and Trey Ideker

Chapter 6. Metabolic Networks
David A. Fell

Chapter 7. Heterogeneous Molecular Networks
Vincent Schächter

Chapter 8. Evolution of Regulatory Networks
Amélie Veron, Dion Whitehead and Erich Bornberg-Bauer

Chapter 9. Complexity in Neuronal Networks
Yves Frégnac, Michelle Rudolph, Andrew P. Davison and Alain Destexhe

Chapter 10. Networks of the Immune System
Robin E. Callard and Jaroslav Stark

Chapter 11. A History of the Study of Ecological Networks
Louis-Félix Bersier

Chapter 12. Dynamic Network Models of Ecological Diversity, Complexity, and Nonlinear Persistence
Richard J. Williams and Neo D. Martinez

Chapter 13. Infection Transmission through Networks
James S. Koopman

Index
CONTRIBUTORS

Eivind Almaas
Center for Network Research, Department of Physics, University of Notre Dame, USA

Albert-László Barabási
Center for Network Research, Department of Physics, University of Notre Dame, USA

Louis-Félix Bersier
Unit of Ecology & Evolution, Fribourg University, Fribourg, Switzerland

Erich Bornberg-Bauer
Division of Bioinformatics, Institute for Evolution and Biodiversity, The Westphalian Wilhelm's University of Münster, Germany

Robin E. Callard
Immunobiology Unit, Institute of Child Health and CoMPLEX, University College London, UK

Florence d'Alché-Buc
Informatique, Biologie Intégrative et Systèmes Complexes, CNRS & Epigenomics Project, Genopole®, Evry, France

Andrew P. Davison
Unité de Neurosciences Intégratives et Computationnelles (UNIC), Gif-sur-Yvette, France

Alain Destexhe
Unité de Neurosciences Intégratives et Computationnelles (UNIC), Gif-sur-Yvette, France

David A. Fell
School of Life Sciences, Oxford Brookes University, UK

Yves Frégnac
Unité de Neurosciences Intégratives et Computationnelles (UNIC), Gif-sur-Yvette, France

Trey Ideker
Department of Bioengineering, University of California at San Diego, USA

François Képès
Epigenomics Project, Genopole®, CNRS, University of Évry, France

James S. Koopman
Dept. of Epidemiology, University of Michigan, USA

Neo D. Martinez
Pacific Ecoinformatics and Computational Ecology Lab, USA

Carlos Rodriguez-Caso
Complex Systems Lab, ICREA, Universitat Pompeu Fabra, Spain, and Santa Fe Institute, New Mexico, USA

Michelle Rudolph
Unité de Neurosciences Intégratives et Computationnelles (UNIC), Gif-sur-Yvette, France

Vincent Schächter
Genoscope, CEA, CNRS, Evry, France

Ricard V. Solé
Complex Systems Lab, ICREA, Universitat Pompeu Fabra, Spain, and Santa Fe Institute, New Mexico, USA

Jaroslav Stark
CISBIC and Department of Mathematics, Imperial College London, UK

Kai Tan
Department of Bioengineering, University of California at San Diego, USA

Sergi Valverde
Complex Systems Lab, ICREA, Universitat Pompeu Fabra, Spain, and Santa Fe Institute, New Mexico, USA

Alexei Vázquez
Center for Network Research, Department of Physics, University of Notre Dame, USA

Amélie Veron
Division of Bioinformatics, Institute for Evolution and Biodiversity, The Westphalian Wilhelm's University of Münster, Germany

Dion Whitehead
Division of Bioinformatics, Institute for Evolution and Biodiversity, The Westphalian Wilhelm's University of Münster, Germany

Richard J. Williams
Microsoft Research Ltd, Cambridge, UK
CHAPTER 1 SCALE-FREE NETWORKS IN BIOLOGY
Eivind Almaas, Alexei Vázquez and Albert-László Barabási
Center for Network Research and Department of Physics, University of Notre Dame, Notre Dame, IN 46556, USA
1. Introduction

The last century brought with it unprecedented technological and scientific progress, rooted in the success of the reductionist approach. For many current scientific problems, however, it is not possible to predict the behavior of a system from an understanding of its (often identical) elementary constituents and their individual interactions. For these systems we need to develop new methods in order to gain insight into their properties and dynamics. During the last few years network approaches have shown great promise in this direction, offering new tools to analyze and understand a host of complex systems (1-7). A much-studied example concerns communication systems like the Internet and the World Wide Web, which are modeled as networks whose nodes are the routers (8) or web pages (9) and whose links are the physical wires or URLs, respectively. The network approach also lends itself to the analysis of societies, with people as nodes and the connections between the nodes representing friendships (10), collaborations (11,12), sexual contacts (13) or co-authorship of scientific papers (14,15), to name a few possibilities. It seems that the more we scrutinize the world surrounding us, the more we realize that we are hopelessly entangled in myriads of
interacting webs, and to describe them we need to understand the architecture of the various networks that nature and technology offer us. Biological systems ranging from food webs in ecology to biochemical interactions in molecular biology can benefit greatly from being analyzed as networks. In particular, within the cell the variety of interactions between genes, proteins and metabolites is well captured by network representations, especially with the availability of veritable mountains of interaction data from genomics approaches. In this Chapter we will discuss recent results and developments in the study and characterization of the structure and utilization of biological networks.

2. Characterizing Network Topology

There are by now many tools and measures available to study the structure of complex networks. In the following we will discuss three of the most fundamental quantities: the degree distribution, node clustering and hierarchy, and the issue of subgraphs and motifs. In addition, it is customary to investigate the betweenness-centrality (BC) of both nodes and links, and the network assortativity. The BC is related to the number of shortest paths going through either a node or a link, and hence a large BC value indicates that the node or link acts as a bridge by connecting different parts of the network (16). The assortativity describes the propensity of a node to be directly connected to other nodes with similar degree (17,18).

2.1. Degree Distribution

The representation of various complex systems as networks has revealed surprising similarities, many of which are intimately tied to power laws. The simplest network measure is the average number of nearest neighbors of a node, or the average degree $\langle k \rangle$. However, this is a rather crude property, and to gain further insight into the topological organization of real networks, we need to determine the variation in the number of nearest neighbors, given by the degree distribution $P(k)$.
For a surprisingly large number of networks, this degree distribution is best characterized by the power-law functional form (19) (Fig. 1a):

$$P(k) \sim k^{-\gamma}. \tag{1}$$
Important examples include the metabolic networks of 43 organisms (20), the protein interaction networks of S. cerevisiae (21), C. elegans (22) and D. melanogaster (23), and various food webs (24). If the degree distribution were instead single-peaked (e.g. Poisson or Gaussian) as in Fig. 1b, the majority of the nodes would be well described by the average degree, and we could reasonably talk about a “typical” node of the network. This is very different for networks with a power-law degree distribution: the majority of the nodes have only a few neighbors, while many nodes have hundreds and some even thousands of neighbors. Although average node degree values can be calculated for these networks since their size is finite, these values are not representative of a typical node. For this reason, these networks are often referred to as “scale-free”.

Figure 1. Characterizing degree distributions. (a) For the power-law degree distribution there exists no typical node, while (b) for single-peaked distributions most nodes are well represented by the average (typical) node with degree $\langle k \rangle$.
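As an illustration of how $P(k)$ and $\langle k \rangle$ are obtained in practice, the following minimal Python sketch tabulates the degree distribution of an undirected network given as an edge list. The function name `degree_distribution` and the toy edge list are assumptions made for this example and are not taken from any of the cited studies; nodes without links do not appear in an edge list and are therefore ignored.

```python
from collections import Counter

def degree_distribution(edges):
    """Return (P, mean_k) for an undirected network given as an edge list.

    P maps each degree k to the fraction of nodes having exactly k links.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_nodes = len(degree)
    counts = Counter(degree.values())
    P = {k: c / n_nodes for k, c in sorted(counts.items())}
    mean_k = sum(degree.values()) / n_nodes
    return P, mean_k

# Hypothetical toy network: one hub (node 0) linked to five leaves.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5)]
P, mean_k = degree_distribution(edges)
print(P)       # fraction of nodes with each degree
print(mean_k)  # average degree
```

In this toy network five of the six nodes have a single link and the hub has five, so the average degree of about 1.67 is not the degree of any node; this is a small-scale analogue of the absence of a typical node.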
2.2. Clustering Coefficient

A measure that gives insight into the local structure of a network is the so-called clustering of a node: the degree to which the neighborhood of a node resembles a complete subgraph (25). For a node $i$ with degree $k_i$ the clustering is defined as

$$C_i = \frac{2 n_i}{k_i (k_i - 1)}, \tag{2}$$

where $n_i$ is the number of links among the neighbors of node $i$, so that $C_i$ represents the ratio of the number of actual connections between the neighbors of node $i$ to the number of possible connections. For a node that is part of a fully interlinked cluster $C_i = 1$, while $C_i = 0$ for a node none of whose neighbors are interconnected. Accordingly, the overall clustering coefficient of a network with $N$ nodes is given by $C = \sum_i C_i / N$, quantifying a network's potential modularity. By studying the average clustering of nodes with a given degree $k$, information about the actual modular organization of a network can be extracted (26-29): for all metabolic networks available, the average clustering follows a power-law form as

$$C(k) \sim k^{-\alpha}, \tag{3}$$
suggesting the existence of a hierarchy of nodes with different degrees of modularity (as measured by the clustering coefficient) overlapping in an iterative manner (26). In summary, we have seen strong evidence that biological networks are both scale-free (20,21) and hierarchical (26).

2.3. Subgraphs and Motifs

A number of complex biological and non-biological networks were recently found to contain network motifs, representing elementary interaction patterns between small groups of nodes (subgraphs) that occur substantially more often than would be expected in a random network of similar size and connectivity (1,2). Theoretical and experimental evidence indicates that at least some of these recurring elementary interaction patterns carry significant information about the given network's function and overall organization (30-33). For example, transcriptional regulatory networks of cells (30,31,34,35; see Chapter 4), neural networks of C. elegans and some electronic circuits (31) are all information processing networks that contain a significant number of feed-forward loop motifs (see Chapter 2). However, in transcription-regulatory networks these motifs do not exist in isolation but meld into motif clusters (36), while other networks are devoid of feed-forward loops altogether (31).
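To make the notion of a recurring subgraph concrete, the sketch below counts feed-forward loops in a directed network. It assumes the usual definition of this pattern (X regulates Y, and both X and Y regulate Z), which is not spelled out in the text above; the function name and the toy edge list are likewise hypothetical. Deciding whether the pattern is a motif would additionally require comparing the count with that found in randomized networks of similar size and connectivity, which is not attempted here.

```python
def count_feed_forward_loops(edges):
    """Count node triples (x, y, z) with directed links x->y, x->z and y->z."""
    successors = {}
    for u, v in edges:
        if u != v:  # self-loops are ignored in this sketch
            successors.setdefault(u, set()).add(v)
    count = 0
    for x, targets in successors.items():
        for y in targets:
            # z must be a target of y and also a direct target of x
            for z in successors.get(y, ()):
                if z != x and z in targets:
                    count += 1
    return count

# Hypothetical toy regulatory network with a single feed-forward loop (X, Y, Z).
edges = [("X", "Y"), ("X", "Z"), ("Y", "Z"), ("Z", "W")]
n_ffl = count_feed_forward_loops(edges)
print(n_ffl)
```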
Figure 2. The phase diagrams organize the subgraphs based on the number of nodes (n, horizontal axis) and the number of links (m, vertical axis), each discrete point explicitly depicting the corresponding subgraph. The stepped yellow line corresponds to the predicted phase boundary separating the abundant Type I subgraphs (below the line) from the constant density Type II subgraphs (above the line). The background color is proportional to the relative subgraph count Cnm=Nnm/ΣsNns of each n-node subgraph, the color code being shown in the upper right corner. Note that some (n,m) points in the phase diagram may correspond to several topologically distinguishable subgraphs. For simplicity, we depict only one representative topology in such cases. As the yellow phase boundary depends on the γ and α exponents of the corresponding network, each phase diagram is slightly different. Yet, there is a visible similarity between the networks of the same kind: the phase diagrams of the two transcription or the two metabolic networks are almost indistinguishable.
The number $N_{nm}$ of subgraphs with $n$ nodes and $m$ interactions expected for a network of $N$ nodes can be estimated from the two key topological parameters of a network's large-scale structure: the degree exponent, $\gamma$, and the hierarchical exponent, $\alpha$. In general we find that there are two subgraph classes. Type I subgraphs are those that satisfy $(m-n+1)\alpha-(n-\gamma)<0$, their number being given by $N^{\mathrm{I}}_{nm} \sim N k_{\max}^{-[(m-n+1)\alpha-(n-\gamma)]}$, where $k_{\max}$ denotes the degree of the most connected node in the network. Type II subgraphs are those that satisfy $(m-n+1)\alpha-(n-\gamma)>0$, and their number is given by $N^{\mathrm{II}}_{nm} \sim N$. As even for finite networks $k_{\max} \gg 1$, the typical number of Type I subgraphs is significantly larger than the number of Type II subgraphs ($N^{\mathrm{I}}_{nm}/N^{\mathrm{II}}_{nm} \gg 1$). Moreover, for infinite systems ($N \to \infty$) the relative number of Type II subgraphs is vanishingly small compared to Type I subgraphs, as $N^{\mathrm{I}}_{nm}/N^{\mathrm{II}}_{nm} \to \infty$.

This subdivision into Type I and II subgraphs is shown in Fig. 2 for five cellular networks (the metabolic networks of E. coli and S. cerevisiae, the regulatory networks of E. coli and S. cerevisiae, and the protein interaction network of S. cerevisiae) and for different $(n,m)$ subgraphs. The $(m-n+1)\alpha-(n-\gamma)=0$ condition, predicted to separate the Type I and II subgraphs, appears as stepped yellow phase boundaries in the phase diagrams. For example, for the E. coli transcriptional regulatory network with $\alpha=1$ and $\gamma=2.1$ (Table 1) the phase boundary corresponds to a stepped line with approximate overall slope $1+1/\alpha=2.0$ and intercept $-1-\gamma/\alpha=-3.1$ (Fig. 2a). The Type II subgraphs are those above this boundary, and should be either absent, or present only in very low numbers in the transcriptional regulatory network. In contrast, the Type I subgraphs below the boundary are predicted to be abundant. Comparing Figs. 2a-e we find that while the stepped phase boundaries for the different cellular networks differ due to the differences in the $(\gamma,\alpha)$ exponents (Table 1), the observed densities in the real networks follow relatively closely the predicted phase boundaries. Occasional local deviations from the predictions can be attributed to the error bars of the $(\gamma,\alpha)$ exponents (Table 1), which allow for some local uncertainties for the phase boundary.
Figures 2a-e also indicate that, in agreement with the empirical findings (30-33), each cellular network is characterized by a distinct set of over-represented Type I subgraphs, raising the possibility of classifying networks based on their local structure (4). Yet, the phase diagrams demonstrate that knowledge of the two global topological parameters introduced in Sections 2.1 and 2.2 automatically uncovers the local structure of cellular networks, suggesting that a subgraph- or motif-based classification is equivalent to a classification based on the different (γ,α) exponents characterizing these networks.
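The classification rule can be written down directly. The following sketch, with hypothetical function names, evaluates the sign of (m-n+1)α-(n-γ) for a given (n,m) subgraph and returns the position of the phase boundary, obtained by solving (m-n+1)α-(n-γ)=0 for m. The exponents used in the example are those quoted above for the E. coli transcriptional regulatory network.

```python
def subgraph_type(n, m, alpha, gamma):
    """Classify an (n, m) subgraph by the sign of (m-n+1)*alpha - (n-gamma)."""
    s = (m - n + 1) * alpha - (n - gamma)
    if s < 0:
        return "I"    # predicted to be abundant
    if s > 0:
        return "II"   # predicted to be rare or absent
    return "boundary"

def boundary_links(n, alpha, gamma):
    """Value of m on the phase boundary: m = (1 + 1/alpha) * n - 1 - gamma/alpha."""
    return (1 + 1 / alpha) * n - 1 - gamma / alpha

alpha, gamma = 1.0, 2.1                        # E. coli transcription network (Table 1)
chain = subgraph_type(3, 2, alpha, gamma)      # three nodes, two links
clique = subgraph_type(4, 6, alpha, gamma)     # four nodes, fully connected
m_star = boundary_links(3, alpha, gamma)       # boundary position for n = 3
print(chain, clique, m_star)
```

With these exponents the boundary has slope 2.0 and intercept -3.1, so for n = 3 it lies just below m = 3: the sparse three-node chain falls on the Type I side and the fully connected four-node subgraph on the Type II side.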
Figure 3. Graphical representation of three network models: (a) and (d) The ER (random) model, (b) and (e) the BA (scale-free) model and (c) and (f) the hierarchical model. The random network model is constructed by starting from N nodes before the possible nodepairs are connected with probability p. Panel (a) shows a particular realization of the ER model with 10 nodes and connection probability p=0.2. In Panel (b) we show the scalefree model at time t (green links) and at time (t+1) when we have added a new node (red links) using the preferential attachment probability (see Eq. (4)). Panel (c) demonstrates the iterative construction of a hierarchical network, starting from a fully connected cluster of four nodes (blue). This cluster is then copied three times (green) while connecting the peripheral nodes of the replicas to the central node of the starting cluster. By once more repeating this replication and connection process (red nodes), we end up with a 64-node scale-free hierarchical network. In Panel (d) we display a larger version of the random network, and it is evident that most nodes have approximately the same number of links. For the scale-free model, (e) the network is clearly inhomogeneous: while the majority of nodes has one or two links, a few nodes have a large number of links. We emphasize this by coloring the five nodes with the highest number of links red and their first neighbors green. While in the random network only 27% of the nodes are reached by the five most connected nodes, we reach more than 60% of the nodes in the scale-free network, demonstrating the key role played by the hubs. Note that the networks in (d) and (e) consist of the same number of nodes and links. Panel (f) demonstrates that the standard clustering algorithms are not that successful in uncovering the modular structure of a scale-free hierarchical network.
Table 1. The γ and α exponents for five cellular networks, determined from a direct fit to the P(k) and C(k) functions.

Network               Organism         γ          α
Transcription         E. coli          2.1±0.3    1.0±0.2
Transcription         S. cerevisiae    2.0±0.2    1.0±0.2
Metabolic             E. coli          2.0±0.4    0.8±0.3
Metabolic             S. cerevisiae    2.0±0.1    0.7±0.3
Protein interaction   S. cerevisiae    2.4±0.4    1.3±0.5
3. Network Models

As we have just seen, many biological networks are dominated by a scale-free distribution of nearest neighbors. Why is this power-law behavior so pervasive? To understand the cause of the scale-free degree distribution and the hierarchical network structure, we will in the following explain three models that serve as network paradigms. These models build on very different principles and, to varying degrees, are able to explain the observed network features.
Figure 4. Properties of the three network models. (a) The ER model gives rise to a Poisson degree distribution $P(k)$ (the probability that a randomly selected node has exactly $k$ links), which is strongly peaked at the average degree $\langle k \rangle$. The degree distributions for the scale-free (b) and the hierarchical (c) network models do not have a peak; they instead decay according to $P(k) \sim k^{-\gamma}$. The average clustering coefficient for nodes with exactly $k$ neighbors, $C(k)$, is independent of $k$ for both the ER (d) and the scale-free (e) network model. (f) In contrast, $C(k) \sim k^{-1}$ for the hierarchical network model.
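The clustering quantities shown in Fig. 4d-f can be computed from an edge list along the lines of Eq. (2). The sketch below is one possible implementation, not the procedure used to produce the figure; the function names and the toy network are assumptions, and nodes with fewer than two neighbors are assigned a clustering of zero by convention, a choice that Eq. (2) itself leaves open.

```python
from collections import defaultdict
from itertools import combinations

def clustering(edges):
    """Return (C, degree): per-node clustering as in Eq. (2), and node degrees."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    C = {}
    for i, nb in neighbors.items():
        k = len(nb)
        if k < 2:
            C[i] = 0.0  # convention assumed here: no neighbor pairs to connect
            continue
        # n_i: number of links among the neighbors of node i
        n_i = sum(1 for a, b in combinations(nb, 2) if b in neighbors[a])
        C[i] = 2 * n_i / (k * (k - 1))
    degree = {i: len(nb) for i, nb in neighbors.items()}
    return C, degree

def clustering_by_degree(C, degree):
    """Average clustering C(k) over all nodes with exactly k neighbors."""
    groups = defaultdict(list)
    for i, c in C.items():
        groups[degree[i]].append(c)
    return {k: sum(v) / len(v) for k, v in sorted(groups.items())}

# Hypothetical toy network: a triangle (0, 1, 2) with a pendant node 3 attached to node 0.
edges = [(0, 1), (0, 2), (1, 2), (0, 3)]
C, degree = clustering(edges)
C_k = clustering_by_degree(C, degree)
C_mean = sum(C.values()) / len(C)  # overall clustering coefficient
print(C, C_k, C_mean)
```

Plotting `C_k` against k on logarithmic axes for a real network is the kind of analysis from which a hierarchical exponent is read off; the four-node example is far too small for that and only checks the arithmetic.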
3.1. Random Network Model

In discussing the origin of the observed power-law behavior, we first need to understand the properties of the simplest available network model. While graph theory initially focused on regular graphs, since the 1950s large networks with no apparent design principles have been described as random graphs (37), proposed as the simplest and most straightforward realization of a complex network. According to this Erdős-Rényi (ER) model of random networks (38), we start with $N$ nodes and connect every pair of nodes with probability $p$. This creates a graph with approximately $pN(N-1)/2$ randomly distributed edges (Fig. 3a,d). The distribution of nearest neighbors follows a Poisson distribution (Fig. 4a), and consequently the average degree $\langle k \rangle$ of the network describes the properties of a typical node. Furthermore, for this “democratic” network model, the clustering is independent of the node degree $k$ (Fig. 4d). The ER model, although simple and appealing, captures neither the degree distribution nor the clustering coefficient observed in biological networks.

3.2. Scale-Free Network Model

In the network model of Barabási and Albert (BA), two key mechanisms, both of which are absent from the classical random network model, are responsible for the emergence of a power-law degree distribution (19). First, networks grow through the addition of new nodes linking to nodes already present in the system. Second, there is a higher probability to link to a node with a large number of connections, a property called preferential attachment. These two principles are implemented as follows: starting from a small core graph consisting of $m_0$ nodes, a new node with $m$ links is added at each time step and connected to the already existing nodes (Fig. 3b,e). Each of the $m$ new links is then preferentially attached to a node $i$ (with $k_i$ neighbors), which is chosen according to the probability
$$\Pi_i = \frac{k_i}{\sum_j k_j}. \tag{4}$$
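The construction rules of the ER model and of the BA model can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than a reference implementation: the function names `er_graph` and `ba_graph` are hypothetical, the core graph is taken to be a fully connected set of m0 = m + 1 nodes (the text only requires a small core), and the m targets of a new node are forced to be distinct by repeated sampling, which slightly modifies the probability of Eq. (4) when m > 1.

```python
import random
from itertools import combinations

def er_graph(n, p, rng):
    """ER model: connect every pair of the n nodes independently with probability p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def ba_graph(n, m, rng):
    """BA model: growth plus preferential attachment, returned as an edge list."""
    m0 = m + 1                                    # assumed core: m0 fully connected nodes
    edges = list(combinations(range(m0), 2))
    # Each node appears in 'stubs' once per link end, so a uniform draw from
    # this list selects node i with probability k_i / sum_j k_j, as in Eq. (4).
    stubs = [node for edge in edges for node in edge]
    for new in range(m0, n):
        targets = set()
        while len(targets) < m:                   # assumption: m distinct targets
            targets.add(rng.choice(stubs))
        for t in sorted(targets):
            edges.append((new, t))
            stubs.extend((new, t))
    return edges

rng = random.Random(0)                            # fixed seed for reproducibility
ba_edges = ba_graph(200, 2, rng)
er_edges = er_graph(200, 0.02, rng)
print(len(ba_edges), len(er_edges))
```

Feeding both edge lists to a degree-counting routine should show a much broader spread of degrees in the BA network than in the ER network with a comparable number of links, although a 200-node example is too small to estimate an exponent.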
The simultaneous combination of these two network growth rules gives rise to the observed power-law degree distribution (Fig. 4b). In Fig. 3b, we illustrate the growth process of the scale-free model by displaying a network at time t (green links) and then at time (t+1), when we have added a new node (red links) using the preferential attachment probability. Compared to random networks, the probability that a node is highly connected is statistically significant in scale-free networks. Consequently, many network properties are determined by a relatively small number of highly connected nodes, often called “hubs”. To make the effect of the hubs on the network structure visible, we have colored the five nodes with the largest degrees red in Figs. 3d and 3e and their nearest neighbors green. While in the ER network only 27% of the nodes are reached by the five most connected ones, we reach more than 60% of the nodes in the scale-free network, demonstrating the key role played by the hubs. Another consequence of the hubs' dominance of the network topology is that scale-free networks are highly tolerant of random failures (perturbations) while being extremely sensitive to targeted attacks (39). Comparing the properties of the BA network model with those of the ER model, we note that the clustering of the BA network is larger; however, $C(k)$ is approximately constant (Fig. 4e), indicating the absence of a hierarchical structure.

3.3. Hierarchical Network Model

Many real networks are expected to be fundamentally modular, meaning that the network can be seamlessly partitioned into a collection of modules where each module performs an identifiable task, separable from the function(s) of other modules (40-43; see Chapter 2). Therefore, we must reconcile the scale-free property with potential modularity.
To account for the modularity reflected in the power-law behavior of C(k) together with a scale-free degree distribution, we have to assume that clusters combine in an iterative manner, generating a hierarchical network (26,29). Such a network emerges from the repeated duplication and integration of clustered nodes (26), a process that in principle can be repeated indefinitely. This process is depicted in Fig. 3c, where we start from a small cluster of four densely linked
nodes (blue). We next generate three replicas of this hypothetical initial module (green) and connect the three external nodes of the replicated clusters to the central node of the old cluster, thus obtaining a large 16-node module. Subsequently, we again generate three replicas of this 16-node module (red), and connect the 16 peripheral nodes to the central node of the old module, obtaining a new module of 64 nodes. This hierarchical network model seamlessly integrates a scale-free topology with an inherent modular structure by generating a network that has a power-law degree distribution (Fig. 4c) with degree exponent $\gamma = 1 + \ln 4 / \ln 3 \approx 2.26$ and a clustering coefficient that scales as $C(k) \sim k^{-1}$ (Fig. 4f). Note, however, that modularity does not imply clear-cut sub-networks linked in well-defined ways (26,44). In fact, the boundaries of modules are often considerably blurred (see e.g. Fig. 3f).

3.4. Bose-Einstein Condensation and Networks

In most complex systems the nodes differ in their ability to attract new links, and this ability is independent of their number of nearest neighbors. For instance, some Web pages quickly acquire a large number of links through a mixture of good content and marketing, even though they were only recently published on the World Wide Web. This competition for links can be incorporated into the scale-free model by assigning a "fitness" parameter, $\eta_i$, to each node i, describing its ability to compete for links at the expense of other nodes. For example, a Web page with good up-to-date content and a friendly interface would be expected to display a greater fitness than a low-quality page that is only updated occasionally. The probability $\Pi_i$ that a new node connects to a node with $k_i$ links is then modified from Eq. (4) to $\Pi_i = \eta_i k_i / \sum_j \eta_j k_j$ (45). The competition generated by the various fitness levels means that each node evolves differently in time from the others.
Indeed, the connectivity of each node is now given by $k_i(t) \sim t^{\beta(\eta)}$, where the exponent $\beta(\eta)$ increases with η, and t is the time since the node was added to the network (45). Consequently, fit nodes (ones with large η) can join the network at some later time and acquire many more links than less-fit nodes that have been around for longer.
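The modified attachment rule can be explored with a short simulation. This is a minimal sketch under stated assumptions: the function name, the fully connected core of m0 = m + 1 nodes, and the fitness values in the demo are hypothetical choices made here, not parameters from ref. (45). Each new node picks its m targets with weights proportional to $\eta_i k_i$.

```python
import random


def fitness_model_degrees(n, m, fitness, rng):
    """Grow a network in which node i attracts links with weight eta_i * k_i.

    Assumes at least m of the existing nodes always have positive fitness.
    """
    # Assumption: fully connected core of m0 = m + 1 nodes as a starting point.
    m0 = m + 1
    degree = [m] * m0 + [0] * (n - m0)
    for new in range(m0, n):
        # Unnormalized Pi_i = eta_i * k_i over the nodes already present.
        weights = [fitness[i] * degree[i] for i in range(new)]
        targets = set()
        while len(targets) < m:          # m distinct existing nodes
            targets.add(rng.choices(range(new), weights=weights)[0])
        for t in targets:
            degree[t] += 1
        degree[new] = m
    return degree


# Hypothetical demo: one highly fit node joins halfway through the growth.
rng = random.Random(7)
n, m = 600, 2
eta = [0.2] * n
eta[300] = 1.0
deg = fitness_model_degrees(n, m, eta, rng)
print("late fit node:", deg[300], " its contemporaries:", deg[299], deg[301])
```

Setting all fitness values equal reduces the weights to those of Eq. (4), so the same function can serve as a baseline for comparison.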
Remarkably, such competitive-fitness models appear to have close ties with Bose-Einstein condensation, currently one of the most investigated problems in atomic physics. In a normal atomic gas, the atoms are distributed among many different energy levels. In a Bose-Einstein condensate, however, all the atoms accumulate in the lowest energy state of the system and are described by the same quantum wave function. By assigning to each node in the network an energy level $\varepsilon_i$ related to its fitness through $\eta_i = \exp(-\beta \varepsilon_i)$, the fitness model maps exactly onto a Bose gas (45). According to this mapping, the nodes map to energy levels while the links are represented by atoms in these levels. Additionally, the behavior of a Bose gas is uniquely determined by the distribution g(ε) from which the random energy levels (or fitnesses) are selected. One expects that the functional form of g(ε) depends on the system. For example, the attractiveness of a router to a network engineer comes from a rather different distribution than the fitness of a dot-com company competing for customers. For a wide class of g(ε) distributions, a "fit-get-richer" phenomenon emerges (45): although the fittest node acquires more links than its less-fit counterparts, there is no clear winner. On the other hand, certain g(ε) distributions can result in a Bose-Einstein condensation, where the fittest node does emerge as a clear winner. For these distributions, the fittest node forms a condensate by acquiring a significant fraction of the links, a fraction that is independent of the size of the system. In network language this corresponds to a "winner-takes-all" phenomenon. While the precise form of the fitness distribution for the Web or the Internet is not yet known, it is likely that g(ε) could be measured in the near future.

4. Network Utilization

Despite their impressive successes, purely topological approaches have important intrinsic limitations.
For example, the activity of the various metabolic reactions or regulatory interactions differs widely: some are highly active under most growth conditions, while others are switched on only under rare environmental circumstances. Therefore, an ultimate description of cellular networks requires us to consider the intensity (i.e., strength), the direction (when applicable) and the temporal aspects of the
interactions. While we so far know little about the temporal aspects of the various cellular interactions, recent results have shed light on how the strength of the interactions is organized in metabolic and genetic-regulatory networks (46-48) and how the local network structure is correlated with these link strengths.

4.1. Flux Utilization

In metabolic networks the flux of a given metabolic reaction, representing the amount of substrate converted to product per unit time, offers the best measure of interaction strength. Recent advances in metabolic flux-balance approaches (FBA; see also Chapter 6) (49-52) allow us to calculate the flux for each reaction, and they have significantly improved our ability to generate quantitative predictions on the relative importance of the various reactions, thus leading to experimentally testable hypotheses. The FBA approaches can be described as follows. We start from a stoichiometric matrix model of an organism; the one for E. coli, for example, contains 537 metabolites and 739 reactions (49-51). The steady-state concentrations of all metabolites must satisfy
$$\frac{d[A_i]}{dt} = \sum_j S_{ij}\,\nu_j = 0, \tag{5}$$
where $S_{ij}$ is the stoichiometric coefficient of metabolite $A_i$ in reaction j and $\nu_j$ is the flux of reaction j. We use the convention that $S_{ij} < 0$ ($S_{ij} > 0$) if metabolite $A_i$ is a substrate (product) in reaction j, and we constrain all fluxes to be positive by dividing each reversible reaction into two “forward” reactions with positive fluxes. Any vector of positive fluxes $\{\nu_j\}$ that satisfies Eq. (5) corresponds to a state of the metabolic network and hence to a potential state of operation of the cell. Assuming that the cellular metabolism is in a steady state and optimized for the maximal growth rate (50,51), FBA allows us to calculate the flux for each reaction using linear optimization, providing a measure of each reaction's relative activity (46). A striking feature of the resulting flux distribution from such modeling of H. pylori, E. coli
and S. cerevisiae is its overall inhomogeneity: reactions with fluxes spanning several orders of magnitude coexist under the same conditions (Fig. 5a). This is captured by the flux distribution for E. coli, which follows a power law: the probability that a reaction has flux ν is given by $P(\nu) \sim (\nu + \nu_0)^{-\alpha}$. FBA methods predict a flux exponent of $\alpha = 1.5$ (46). In a recent experiment (53) the strengths of the various fluxes of the E. coli central metabolism were measured, revealing (46) the power-law flux dependence $P(\nu) \sim \nu^{-\alpha}$ with $\alpha \cong 1$ (Fig. 5b). This power-law behavior indicates that the vast majority of reactions have quite small fluxes and coexist with a few reactions with extremely large flux values.
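The steady-state constraint of Eq. (5) and the sign convention for reversible reactions can be made concrete on a toy network. The sketch below is an illustration only: the three-metabolite network, the reaction names and the helper functions are hypothetical, and it checks the mass-balance condition without performing the growth-rate optimization, which would additionally require a linear-programming solver.

```python
def split_reversible(column):
    """Replace one reversible reaction by a forward and a backward column."""
    return list(column), [-s for s in column]


def net_production(S, fluxes):
    """d[A_i]/dt = sum_j S_ij * nu_j for every metabolite i (Eq. 5)."""
    return [sum(s * v for s, v in zip(row, fluxes)) for row in S]


def is_steady_state(S, fluxes, tol=1e-9):
    """True if all fluxes are non-negative and every metabolite is balanced."""
    return (all(v >= 0 for v in fluxes)
            and all(abs(r) <= tol for r in net_production(S, fluxes)))


# Hypothetical toy network with metabolites A, B, C (one column per reaction;
# negative entries are substrates, positive entries are products).
uptake = [1, 0, 0]                                   # -> A
a_to_b = [-1, 1, 0]                                  # A -> B
a_to_c = [-1, 0, 1]                                  # A -> C
b_to_c, c_to_b = split_reversible([0, -1, 1])        # B <-> C, split in two
export = [0, 0, -1]                                  # C ->
columns = [uptake, a_to_b, a_to_c, b_to_c, c_to_b, export]
S = [list(row) for row in zip(*columns)]             # rows = metabolites

print(is_steady_state(S, [10, 4, 6, 4, 0, 10]))      # balanced flux vector
print(net_production(S, [10, 4, 6, 0, 0, 10]))       # B accumulates, C drains
```

Any flux vector that passes `is_steady_state` is a candidate state of the toy network; FBA would then select, among all such vectors, one that maximizes a chosen objective such as growth.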
Figure 5. Flux distribution for the metabolism of E. coli. (a) Flux distribution for optimized biomass production on succinate-rich (black) and glutamate-rich (red) uptake substrates. The solid line corresponds to the power-law fit $P(\nu) \sim (\nu + \nu_0)^{-\alpha}$ with $\nu_0 = 0.0003$ and $\alpha = 1.5$. (b) The distribution of experimentally determined fluxes (53) from the central metabolism of E. coli also displays power-law behavior, with a best fit to $P(\nu) \sim \nu^{-\alpha}$ with $\alpha = 1$.
The observed flux distribution is compatible with two quite different potential local flux structures (46). A homogeneous local organization would imply that all reactions producing (consuming) a given metabolite have comparable fluxes. On the other hand, a more delocalized “hot backbone” is expected if the local flux organization is heterogeneous, such that each metabolite has a dominant source (consuming) reaction. To distinguish between these two scenarios for each metabolite i produced (consumed) by k reactions, we define the measure (54,55)
$$Y(k,i) = \sum_{j=1}^{k} \left( \frac{\hat{\nu}_{ij}}{\sum_{l=1}^{k} \hat{\nu}_{il}} \right)^{2}, \tag{6}$$
where $\hat{\nu}_{ij}$ is the mass carried by reaction j, which produces (consumes) metabolite i. If all reactions producing (consuming) metabolite i have comparable $\hat{\nu}_{ij}$ values, $Y(k,i)$ scales as $1/k$. If, however, a single reaction's activity dominates Eq. (6), we expect $Y(k,i) \sim 1$, i.e., $Y(k,i)$ is independent of k. For the E. coli metabolism optimized for succinate and glucose uptake we find that both the in and out degrees follow the power law $Y(k,i) \sim k^{-0.27}$, representing an intermediate behavior between the two extreme cases (46). This suggests that the large-scale inhomogeneity observed in the overall flux distribution is increasingly valid at the level of the individual metabolites as well: for most metabolites, a single reaction carries the majority of the flux. Hence, the majority of the metabolic flux is carried along linear pathways, the metabolic high flux backbone (HFB) (46).

4.2. Gene Interactions

One can also investigate the strength of the various genetic regulatory interactions provided by microarray datasets. Assigning each pair of genes a correlation coefficient that captures the degree to which they are co-expressed, one finds that the distribution of these pair-wise correlation coefficients follows a power law (47,48). That is, while the majority of gene pairs have only weak correlations, a few gene pairs display a significant correlation coefficient. These highly correlated pairs likely correspond to direct regulatory and protein interactions. This hypothesis is supported by the finding that the correlations are larger along the links of the protein interaction network and between proteins occurring in the same complex than for pairs of proteins that are not known to interact directly (56-59). Taken together, these results indicate that the biochemical activity in both the metabolic and genetic networks is dominated by several ‘hot links’ that represent a few high-activity interactions embedded into a web
of less active interactions. This attribute does not seem to be unique to biological systems: hot links appear in a wide range of non-biological networks in which the activity of the links follows a wide distribution (60,61). The origin of this seemingly universal property is, again, likely rooted in the network topology. Indeed, it seems that the metabolic fluxes and the weights of the links in some non-biological systems (60,61) are uniquely determined by the scale-free nature of the network. A more general principle that could also explain the correlation distribution data is currently lacking.

5. Conclusion

Power laws are abundant in nature, affecting both the construction and the utilization of real networks. The power-law degree distribution has become the trademark of scale-free networks and can be explained by invoking the principles of network growth and preferential attachment. Many biological networks are inherently modular, a fact which at first seems to be at odds with the properties of scale-free networks. However, these two concepts can coexist in hierarchical scale-free networks. In the utilization of complex networks, most links represent disparate connection strengths or transportation thresholds. For the metabolic network of E. coli we can implement a flux-balance approach and calculate the distribution of link weights (fluxes), which (reflecting the scale-free network topology) displays a robust power law, independent of exocellular perturbations. Furthermore, this global inhomogeneity in the link strengths is also present at the local level, resulting in a connected “hot-spot” backbone of the metabolism. Similar features are also observed in the strength of various genetic regulatory interactions.
Despite the significant advances witnessed in the last few years, network biology is still in its infancy. Future advances are most notably expected from the development of theoretical tools, the development of new interactive databases and increased insight into the interplay between biological function and topology.
References

1. Albert, R. and Barabási, A.L. (2002). Statistical mechanics of complex networks. Rev Mod Phys. 74, 47-97.
2. Strogatz, S.H. (2001). Exploring complex networks. Nature. 410, 268-76.
3. Dorogovtsev, S.N. and Mendes, J.F.F. (2003). Evolution of networks: From biological nets to the Internet and WWW, Oxford University Press, Oxford.
4. Bornholdt, S. and Schuster, H.G. (2003). Handbook of graphs and networks: From the genome to the Internet, Wiley-VCH, Berlin, Germany.
5. Newman, M.E.J., Barabási, A.L. and Watts, D. (Eds.) (2005). The structure and growth of networks, Princeton Univ Press, Princeton.
6. Ben-Naim, E., Frauenfelder, H. and Toroczkai, Z. (Eds.) (2004). Complex networks, Lect. Notes Phys., Springer Verlag, Berlin.
7. Pastor-Satorras, R. and Vespignani, A. (2004). Evolution and structure of the Internet, Cambridge Univ Press.
8. Faloutsos, M., Faloutsos, P. and Faloutsos, C. (1999). On power-law relationships of the Internet topology. Comput Commun Rev. 29, 251-62.
9. Albert, R., Jeong, H. and Barabási, A.L. (1999). Diameter of the World Wide Web. Nature. 401, 130-1.
10. Milgram, S. (1967). The small-world problem. Psychology Today. 2, 60-7.
11. Kochen, M. (1989). The small-world, Ablex, Norwood, N.J.
12. Wasserman, S. and Faust, K. (1994). Social Network Analysis: Methods and Application, Cambridge University Press, Cambridge.
13. Liljeros, F., Edling, C.R., Amaral, L.A.N., Stanley, H.E. and Aberg, Y. (2001). The web of human sexual contacts. Nature. 411, 907-8.
14. Newman, M.E.J. (2001). The structure of scientific collaboration networks. Proc Natl Acad Sci. 98, 404-9.
15. Barabási, A.L., Jeong, H., Ravasz, R., Neda, Z., Vicsek, T. and Schubert, A. (2002). On the topology of the scientific collaboration networks. Physica A. 311, 590.
16. Goh, K.I., Kahng, B. and Kim, D. (2001). Universal behavior of load distribution in scale-free networks. Phys Rev Lett. 87, 278701.
17. Newman, M.E.J. (2002). Assortative mixing in networks. Phys Rev Lett. 89, 208701.
18. Pastor-Satorras, R., Vazquez, A. and Vespignani, A. (2001). Dynamical and correlation properties of the Internet. Phys Rev Lett. 87, 258701.
19. Barabási, A.L. and Albert, R. (1999). Emergence of scaling in random networks. Science. 286, 509-12.
20. Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N. and Barabási, A.L. (2000). The large-scale organization of metabolic networks. Nature. 407, 651-4.
21. Jeong, H., Mason, S.P., Barabási, A.L. and Oltvai, Z.N. (2001). Lethality and centrality in protein networks. Nature. 411, 41-2.
22. Li, S., Armstrong, C.M., Bertin, N., Ge, H., Milstein, S., et al. (2004). A map of the interactome network of the metazoan C. elegans. Science. 303, 540.
23. Giot, L., Bader, J.S., Brouwer, C., Chaudhuri, A., Kuang, B., et al. (2003). A protein interaction map of Drosophila melanogaster. Science. 302, 1727.
24. Montoya, J.M. and Sole, R.V. (2002). Small-world patterns in food webs. J Theor Biol. 214, 405-12.
25. Watts, D.J. and Strogatz, S.H. (1998). Collective dynamics of small-world networks. Nature. 393, 440-2.
26. Ravasz, E., Somera, A.L., Mongru, D.A., Oltvai, Z.N. and Barabási, A.L. (2002). Hierarchical organization of modularity in metabolic networks. Science. 297, 1551-5.
27. Ravasz, E. and Barabási, A.L. (2003). Hierarchical organization in complex networks. Phys Rev E. 67, 026112.
28. Dorogovtsev, S.N., Goltsev, A.V. and Mendes, J.F.F. (2002). Pseudofractal scale-free web. Phys Rev E. 65, 066122.
29. Vázquez, A., Pastor-Satorras, R. and Vespignani, A. (2002). Large-scale topological and dynamical properties of the Internet. Phys Rev E. 65, 066130.
30. Shen-Orr, S., Milo, R., Mangan, S. and Alon, U. (2002). Nat Genet. 31, 64-8.
31. Milo, R., Shen-Orr, S.S., Itzkovitz, S., Kashtan, N. and Alon, U. (2002). Science. 298, 824-27.
32. Mangan, S., Zaslaver, A. and Alon, U. (2003). J Mol Biol. 334, 197-204.
33. Milo, R., Itzkovitz, S., Kashtan, N., Levitt, R., Shen-Orr, S., Ayzenshtat, I., Sheffer, M. and Alon, U. (2004). Science. 303, 1538-42.
34. Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., et al. (2002). Science. 298, 799-804.
35. Hinman, V.F., Nguyen, A.T., Cameron, R.A. and Davidson, E.H. (2003). Proc Natl Acad Sci U.S.A. 100, 13356-61.
36. Dobrin, R., Beg, Q.K., Barabási, A.L. and Oltvai, Z.N. (2004). BMC Bioinformatics. 5, 10.
37. Bollobas, B. (1985). Random Graphs. Academic Press, London.
38. Erdos, P. and Renyi, A. (1960). On the evolution of random graphs. Publ Math Inst Hung Acad Sci. 5, 17-61.
39. Albert, R., Jeong, H. and Barabási, A.L. (2000). Attack and error tolerance of complex networks. Nature. 406, 378-82.
40. Hartwell, L.H., Hopfield, J.J., Leibler, S. and Murray, A.W. (1999). From molecular to modular cell biology. Nature. 402, C47-52.
41. Rao, C.V. and Arkin, A.P. (2001). Control motifs for intracellular regulatory networks. Annu Rev Biomed Eng. 3, 391.
42. Hasty, J., McMillen, D., Isaacs, F. and Collins, J.J. (2001). Computational studies of gene regulatory networks: In numero molecular biology. Nature Rev Genet. 2, 268-79.
43. Shen-Orr, S.S., Milo, R., Mangan, S. and Alon, U. (2001). Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genet. 31, 64-8.
44. Holme, P., Huss, M. and Jeong, H. (2003). Subnetwork hierarchies of biochemical pathways. Bioinformatics. 19, 532-9.
45. Bianconi, G. and Barabási, A.L. (2001). Bose-Einstein condensation in complex networks. Phys Rev Lett. 86, 5632.
46. Almaas, E., Kovacs, B., Vicsek, T., Oltvai, Z.N. and Barabási, A.L. (2004). Global organization of metabolic fluxes in the bacterium Escherichia coli. Nature. 427, 839.
47. Kuznetsov, V.A., Knott, G.D. and Bonner, R.F. (2002). General statistics of stochastic processes of gene expression in eukaryotic cells. Genetics. 161, 1321-32.
48. Farkas, I.J., Jeong, H., Vicsek, T., Barabási, A.L. and Oltvai, Z.N. (2003). The topology of the transcription regulatory network in the yeast, Saccharomyces cerevisiae. Physica A. 318, 601-12.
49. Edwards, J.S. and Palsson, B.O. (2000). The Escherichia coli MG1655 in silico metabolic genotype: its definition, characteristics, and capabilities. Proc Natl Acad Sci. 97, 5528-33.
50. Edwards, J.S., Ibarra, R.U. and Palsson, B.O. (2001). In silico predictions of Escherichia coli metabolic capabilities are consistent with experimental data. Nat Biotechnol. 19, 125-30.
51. Ibarra, R.U., Edwards, J.S. and Palsson, B.O. (2002). Escherichia coli K-12 undergoes adaptive evolution to achieve in silico predicted optimal growth. Nature. 420, 186-9.
52. Segre, D., Vitkup, D. and Church, G.M. (2002). Analysis of optimality in natural and perturbed metabolic networks. Proc Natl Acad Sci. 99, 15112-7.
53. Emmerling, M., Dauner, M., Ponti, A., Fiaux, J., Hochuli, M., Szyperski, T., Wuthrich, K., Bailey, J.E. and Sauer, U. (2002). Metabolic flux responses to pyruvate kinase knockout in Escherichia coli. J Bacteriol. 184, 152-64.
54. Barthelemy, M., Gondran, B. and Guichard, E. (2003). Spatial structure of the Internet traffic. Physica A. 319, 633-42.
55. Derrida, B. and Flyvbjerg, H. (1987). Statistical properties of randomly broken objects and of multivalley structures in disordered systems. J Phys A: Math Gen. 20, 5273-88.
56. Dezso, Z., Oltvai, Z.N. and Barabási, A.L. (2003). Bioinformatics analysis of experimentally determined protein complexes in the yeast, Saccharomyces cerevisiae. Genome Res. 13, 2450-4.
57. Grigoriev, A. (2001). A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and yeast Saccharomyces cerevisiae. Nucleic Acids Res. 29, 3513-9.
58. Jansen, R., Greenbaum, D. and Gerstein, M. (2002). Relating whole-genome expression data with protein-protein interactions. Genome Res. 12, 37-46.
59. Ge, H., Liu, Z., Church, G.M. and Vidal, M. (2001). Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nature Genet. 29, 482-6.
60. Goh, K.-I., Kahng, B. and Kim, D. (2002). Fluctuation-driven dynamics of the internet topology. Phys Rev Lett. 88, 108701.
61. de Menezes, M.A. and Barabási, A.L. (2004). Fluctuations in network dynamics. Phys Rev Lett. 92, 028701.
CHAPTER 2 MODULARITY IN BIOLOGICAL NETWORKS
Ricard V. Solé, Sergi Valverde and Carlos Rodriguez-Caso Complex Systems Lab, ICREA-Universitat Pompeu Fabra, 08003 Barcelona, Spain, and Santa Fe Institute, 1399 Hyde Park Road, 08075 Santa Fe, New Mexico, USA
1. Introduction

The intimate structure of cellular life is largely associated with the networks of interactions among different types of molecules. The structure of such cellular and molecular networks, from the genome and the proteome to protein folding graphs, is known to be very heterogeneous and often reveals a characteristic modular architecture (1,2). At one level, it has been shown that most units (amino acids, genes, proteins or metabolites) are linked to a few other units, but invariably a few units exhibit a large number of links. Such heterogeneity has also been found in a wide spectrum of other complex systems, from natural to artificial (3-6). More importantly, the topological organization of complex nets might influence their efficiency, robustness and fragility under perturbations (7). Understanding the origins and meaning of these topological maps is an important step towards understanding the role played by different mechanisms of evolution (7). Modules have been found in biological systems at multiple levels, from RNA structure (8) to the cerebral cortex (see ref. 9 and references therein). The widespread character of modular organization pervades the
functional association between compartmentalization and evolution. Modules have been variously defined as: (a) functionally buffered, (b) robust, (c) independently controlled, (d) plastic in composition and interconnectivity and (e) evolutionarily conserved. The evolutionary conservation of modules is clearly appreciated in gene networks involved in early development (10,11). The argument is that the special features of some of these modules are tightly linked to their robustness under different sources of noise. As discussed in Chapter 1, real biological networks are typically scale-free: their degree distributions fall off as a power law $P(k) \sim k^{-\gamma}$. Real networks are also known to display the so-called small world effect, defined in terms of a high clustering coefficient (a high number of triangles compared to a randomly wired web with the same distribution of links) and very short path lengths. The presence of high heterogeneity is actually tied to the small number of degrees of separation: the highly connected hubs seem to act as glue in these webs, allowing most nodes to connect to each other through a small number of jumps. The modular structure exhibited by biological webs is naturally associated with hierarchical properties (12). An example of a modular network is shown in Fig. 1a: here a very simple picture is shown, involving three coupled modules. Each module has the same number of units, and units are more connected within their module than with units outside it. The system shown here is generated as follows (12). Nodes inside each module are randomly wired with some probability p, as in so-called Erdős-Rényi (ER) graphs (13). They are also linked to nodes in other modules with a probability q
these sub-parts. These networks have been shown to be hierarchical small worlds, and their growth through design seems to be not far from biological-like mechanisms of evolution (14). A similar scenario is found in electronic circuits, where heterogeneous structures have been shown to be the rule (15). The modular character of biological networks is assumed to be a consequence of both their robustness and evolvability. In this context, modularity would evolve through a decrease of pleiotropy (16). Since modules define somewhat separated compartments, they would act as buffers against lethal mutations, perhaps facilitating variation. In a different context, it has been suggested that modularity might arise from the intrinsic structure of the non-metric mapping between genotype and phenotype (17). Although functionality must influence the selection of some modular structures, other no less important factors might shape them. Man-made designs might help in understanding this point. Technological systems are the result of planned activity: engineers have a predefined goal. Such a situation is far from the one we encounter in nature: to a large extent, new biological functions and structures are obtained through a process of tinkering (18-20). In other words, novelties are achieved by re-using and combining previously present structures. And although tinkering might seem a rather inefficient way of generating innovations, life on our planet clearly illustrates how well it works. Perhaps not surprisingly, engineers have independently converged on some designs that are largely based upon the combinatorial use of common, simple and flexible elements. This is the case of the so-called Field Programmable Gate Arrays (FPGA), widely used in the design of adaptable electronic circuits (Fig. 1c). In these systems, simple modules are assembled in a regular array, and different computations are implemented by using parts of the whole array in a flexible manner.
As stressed by Peter Schuster, assembling structures from predefined building blocks by means of appropriate rules leads to combinatorial explosions (21). Such combinatorial explosions pervade some of the major transitions in evolution. Moreover, such combinatorial tinkering is not exclusive to natural systems: it also takes place in artificial designs. In electronic circuits, for example, computer engineers developed early on a number of standard circuit modules, each one performing a specific function. These became the logical building blocks for creating their systems. This tendency started with the advent of the transistor and never stopped. It is worth noting that, in spite of the optimal design involved in circuit engineering, scale-free structures are nevertheless present, suggesting common principles of evolution (15). A similar situation is found in software architecture, where large-scale systems have been shown to exhibit topologies not far from those seen in cellular networks (14). Indeed, it has been found that the global and local organization of these webs might result from mechanisms surprisingly close to those responsible for the evolution of cells (14).

Figure 1. Modularity is a common trait exhibited by both natural and artificial structures. Two examples of modular graphs are shown. In (a), a homogeneous graph is shown, involving four modules, each of which is formed by 20 randomly connected units. The connectivity among modules is smaller than the intra-modular one. In (b), a hierarchical modular graph is shown. It corresponds to a software diagram displaying scale-free organization. Here nodes are well-defined blocks performing given tasks or defining key objects in a large program. Links indicate that such blocks are related to each other. These systems are typically modular, with modules associated with well-defined functions. Modularity is widespread in large-scale integrated circuits, such as the one shown in (c). Given the increasing complexity of the computations to be performed, engineers soon used pre-defined modules able to perform specific operations as the basic building blocks of their designs. The most flexible one in this context is provided by FPGAs (see text), where the same building block is extensively used in a flexible manner.

In this Chapter we review some basic techniques that allow defining and characterizing modular patterns. We consider two biological examples belonging to two different scales of organization. The first is
related to the internal structure of proteins and the second to the way proteins interact in networks within cells.

2. Topological Overlap

Before any interpretation or conjecture about the origins and functional meaning of modules is made, appropriate measures of network modularity are required. Not all networks display modularity, and modularity itself (as discussed before) can be organized in many different ways. In Table 1 we summarize the differences between networks generated through random wiring (ER), preferential attachment (22) and the actual proteome map. It is interesting that none of the models gives modular architecture. Protein modules, for example, result from the binding of multiple protein molecules forming stable complexes.

Table 1. Main hubs in the transcription factor network. Here their functional role and associated diseases as well as their degree (k) are indicated. Most elements involved here are associated with proliferative diseases (particularly cancer), being either oncogenes or tumor suppressor genes.

| TF | Functional class | Associated diseases | k |
|---|---|---|---|
| TBP | Basal transcription machinery initiator | Spinocerebellar ataxia | 27 |
| p53 | Tumor suppressor protein | Proliferative disease | 23 |
| P300 | Coactivator. Histone acetyltransferase | May play a role in epithelial cancer | 18 |
| RXR-α | Retinoid X-α receptor | Hepatocellular | 18 |
| pRB | Retinoblastoma suppressor protein. Tumor suppressor | Proliferative disease. Bladder cancer. Osteosarcoma | 15 |
| RelA | NF-κB pathway | Hepatocyte apoptosis and fetal death | 14 |
| c-jun | AP-1 complex (activator). Proto-oncogene | Proliferative disease | 14 |
| c-myc | Activator. Proto-oncogene | Proliferative disease | 13 |
| c-fos | AP-1 complex (activator). Proto-oncogene | Proliferative disease | 12 |

Although there is no agreed measure of overall modularity for graphs, the intuitive notion that a module is a group of nodes that have higher connectivity to the inside than to the outside seems sensible, and some simple heuristic algorithms can reveal this straightforwardly (12). To do so, the topological overlap (TO) of two nodes has to be defined. The degree of overlap of two vertices provides a normalized estimate of the likelihood that the two vertices belong to the same “module”. If they do, they will probably have many neighbors in common. Its definition is

$$O(v_i, v_j) = O(v_j, v_i) = \frac{J(v_i, v_j)}{\min(k_i, k_j)}, \tag{1}$$
where $J(v_i, v_j)$ denotes the number of nodes to which both $v_i$ and $v_j$ are linked (plus one if a direct link between them is present). The topological overlap provides a measure of the proximity of two vertices in terms of modules. Given the $O(v_i, v_j)$ matrix, we can apply the general algorithm for hierarchical clustering, which tries to group together those vertices in the system that have a high overlap (12). This algorithm starts with the matrix of original values and repeatedly finds the highest value, groups the corresponding row and column (the two vertices) into one new aggregated vertex, and computes new overlap values for the aggregated vertex, so that the same procedure can be applied to the resulting matrix until it collapses into a single value.

A clear example of modular organization is offered by proteins. Proteins act as the nanomachines driving cellular organization and dynamics. Each protein is itself composed of a folded chain of residues, and the folding process provides the mapping between the sequence and the function. Once folded into a given three-dimensional arrangement, the protein shape interacts with other molecules, including other proteins. The modular structure of proteins is well known: most proteins are made up by combining a finite set of so-called protein
domains. They can be defined as protein substructures that fold independently into a compact structure. Furthermore, protein domains are often associated with functions such as binding to membranes, DNA or other biomolecules, enzymatic activity, or serving as targets for post-translational modifications (23,24). Their modular organization can be made explicit by using the previous technique. In this case, the basic units are the amino acid residues. A protein fold results from different types of interactions between these residues. So-called peptide bonds link amino acids into a linear chain (the protein sequence). Disulfide and hydrogen bonds link distant points in the chain. Other interactions, such as van der Waals and electrostatic forces, also maintain and determine the protein fold in space (25). The crystal structure of the molecule gives us the information needed to build a contact map. As a first approach, we can consider two residues to be linked if they share a peptide, disulfide or hydrogen bond. The resulting graph provides a picture of the protein topology that can be analyzed using the previous techniques.
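As an illustration of Eq. (1), the topological overlap of every pair of vertices can be computed directly from an adjacency list. The following is only a minimal sketch: the function name `topological_overlap`, the dictionary-of-sets graph representation and the toy graph (two triangles joined by one edge) are assumptions made for this example, not part of the original analysis.

```python
from itertools import combinations

def topological_overlap(adj):
    """Return {(u, v): O(u, v)} for an undirected graph without self-loops.

    adj maps each node to the set of its neighbours (hypothetical format).
    """
    overlap = {}
    for u, v in combinations(sorted(adj), 2):
        shared = len(adj[u] & adj[v])        # J: common neighbours ...
        direct = 1 if v in adj[u] else 0     # ... plus one if u and v are linked
        k_min = min(len(adj[u]), len(adj[v]))
        overlap[(u, v)] = (shared + direct) / k_min if k_min else 0.0
    return overlap

# Toy graph: triangles a-b-c and d-e-f joined by the single edge c-d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

overlap = topological_overlap(adj)
print(overlap[("a", "b")])   # pair inside a triangle
print(overlap[("c", "d")])   # pair bridging the two triangles
```

In this toy graph, pairs inside a triangle receive a higher overlap than the bridging pair, which is the behaviour the definition is meant to capture.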
Figure 2. Left: Protein molecules can be understood in terms of a graph. Here we show the hierarchical organization of the pig heart GTP-Specific Succinyl-CoA Synthetase (EC 6.2.1.4), as uncovered by the hierarchical clustering algorithm described in the text. This method reveals two basic modules (corresponding to the two chains A and B), which are marked using a dashed box. These two modules also include a lot of internal substructure, shown by the further boxing of the upper right modules. Right: The corresponding protein structure.
An example of the modular structures revealed by protein contact maps is shown in Fig. 2. Here the crystal structure of an enzyme, pig heart GTP-specific succinyl-CoA synthetase (EC 6.2.1.4), has been chosen as an illustrative example. The hierarchical clustering algorithm described in the text reveals two large modules (corresponding to the two polypeptides, chains A and B), which are marked using a dashed box. These two modules have a large amount of internal structure, here indicated by means of further boxing of the upper right modules. The presence of this nested hierarchy of sub-modules suggests that proteins also display a hierarchical, modular organization. Indeed, measurements of average path length and clustering over a large database of proteins already indicate that protein contact maps are small worlds (26).
The previous method can be applied to any system that can be characterized as a topological graph of interactions. It can provide valuable information concerning the overall organization of an important class of cellular network involving those proteins that control the expression of genes: the so-called transcription factors. In Fig. 3 we plot the largest connected component of the human transcription factor (HTF) network. It has been shown that this network exhibits a scale-free, small-world architecture (27), consistent with other biological networks reviewed in this book. Transcription factors are an essential subset of interacting proteins, since they are responsible for the control of gene expression (for further details, see Chapter 4). They interact with DNA regions and tend to form transcriptional regulatory complexes. Thus, the final effect of one of these complexes will be determined by its transcription factor composition. Fig. 3a shows the heterogeneous structure displayed by this network, in which a few nodes (the hubs, indicated as black circles) are linked to many other proteins.
Using the previous technique, a well-defined hierarchical structure is found (Fig. 3b). The map shows a nested, hierarchical structure, with small modules as dark boxes across the diagonal, which have a large overlap. However, there are some weak connections between modules, as shown by the tiny lines in the topological overlap matrix. The algorithm weights the (topological) association of any node to the others. Thus, it is possible to build a dendrogram of relations where we can see also a
hierarchy, since modules are not related at the same level as would be expected in a pure modular network (28). Are these topological modules functionally relevant here? In other words, are these nested structures reflecting some functional organization?
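The clustering procedure that produces such a dendrogram can be sketched as follows. The text does not state how the value of an aggregated vertex is recomputed, so this sketch assumes average linkage (the mean of the pairwise overlaps between the two groups); the function names and the toy similarity values are hypothetical.

```python
from itertools import combinations

def average_similarity(c1, c2, sim):
    """Mean pairwise similarity between two clusters (average linkage, assumed)."""
    total = sum(sim[frozenset((a, b))] for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def hierarchical_clustering(nodes, sim):
    """Repeatedly merge the two most similar clusters until one remains.

    Returns the merge history as a list of (merged_cluster, similarity).
    """
    clusters = [(n,) for n in nodes]
    merges = []
    while len(clusters) > 1:
        best = max(combinations(clusters, 2),
                   key=lambda pair: average_similarity(pair[0], pair[1], sim))
        score = average_similarity(best[0], best[1], sim)
        clusters = [c for c in clusters if c not in best]
        merged = best[0] + best[1]
        clusters.append(merged)
        merges.append((merged, score))
    return merges

# Toy overlap values: two tight pairs, weakly related to each other.
nodes = ["a", "b", "c", "d"]
sim = {frozenset(p): 0.1 for p in combinations(nodes, 2)}
sim[frozenset(("a", "b"))] = 0.9
sim[frozenset(("c", "d"))] = 0.8

merges = hierarchical_clustering(nodes, sim)
for cluster, score in merges:
    print(cluster, round(score, 3))
```

The merge history plays the role of the dendrogram: groups joined at high similarity are candidate modules, and cutting the history at a chosen similarity level yields a partition such as the groups A to G discussed here.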
Figure 3. Protein interactions allow complex networks to be defined. In (a) the human transcription factor network is shown. Here most elements have just one or two links and a handful of the transcription factors have multiple connections. Numbered black filled nodes are the most highly connected transcription factors. Here we indicate: 1, TATA binding protein (TBP); 2, p53; 3, p300; 4, retinoid X receptor α (RXRα); 5, retinoblastoma protein (pRB); 6, Nuclear factor NFκB p65 subunit (RelA); 7, c-jun; 8, c-myc; 9, c-fos. The right picture shows the corresponding topological overlap matrix and the dendrogram (Panel b). A to G are the seven topological groups defined by tracing a dashed line through the dendrogram. In Table 2 we provide a biological description and functional features of each group.
The previous modules are essentially organized around a handful of hub transcription factors. What is their functional meaning? In Table 1 we summarize some essential information associated with the nine most connected elements. As we can see, their functional importance is considerable, since they are involved in key cellular processes and their failure is tied to complex diseases such as cancer. Consistent with previous studies, the essential character of these transcription factors is reflected in their high degree. An example is given by the tumor suppressor protein p53, a hub integrating regulatory interactions
involving cell cycle, cell differentiation, DNA repair, senescence or angiogenesis. In accordance with its key role in maintaining genome integrity and its importance in current therapies, this gene is considered a so-called Achilles heel of cancer (29). If we look at the specific composition of each module, as detected from the hierarchical overlap, we can see that this topological method properly identifies meaningful sets of interacting transcription factors. In Fig. 4a we show the previous network using a color-coded representation based on the modular information obtained from the topological overlap. A detailed study of the properties of each protein involved in the network (27) showed that such topological groups retain structural and functional properties, as indicated in Table 2.

Table 2. Structural and functional features of the groups obtained from the topological overlap matrix.

Group | Structural features | Functional features
A | 77% bHLH domains. | Muscle and neural tissue specific, sex determination. Includes E proteins family related to lymphocyte differentiation. Includes E-box type A TF.
B | 47% bHLH-bZip domains. | c-myc related factors (59%). Includes E-box type B TF. Related to cell proliferation.
C | 36% rel homology region. 40% bZip domains. | TF involved in NFκB pathway, AP1 complex and others.
D | 24% fork head domains. | E2F/pRB pathway, histone deacetylases (HDAC). pRB and p53 isoforms…
E | 22% histone folding. Major part of specific interacting regions. | Basal transcriptional machinery for promoters type I, II, III, PTF/SNAP complex and TBP related factors.
F | 42% Zn finger domains. | Contains 90% of the members of the nuclear receptor superfamily (which are also Zn fingers) of the HTFN.
G | 31% MAD domains. | SMAD family proteins and β-catenin and APC related factors.
Figure 4. The modular, scale-free architecture of the human transcription factor network is shown here (a) by coloring each group of topologically related proteins with a color indicating the functional group to which it belongs. The color scale and its relation with the previous classification (shown in Fig. 3) are indicated in the small box. The scaffold graph for this network is shown in (b). Here only those connected nodes such that at least one has a degree larger than kc=11 are shown. We can see that each hub has an associated group of functionally related proteins, consistent with the expected specificity associated to each topological module. Together with the hub proteins, the connectors are also captured. A different characterization of the network correlations and its modularity is provided by the correlation profile analysis taking into account self-interactions (Panel a) and avoiding them (Panel b).
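The scaffold (core) graph of panel (b) keeps a link only when at least one of its two end points has a degree larger than the cut-off kc. A minimal sketch of that filter is given below; the function name `scaffold_graph`, the edge-list representation and the toy network are assumptions for illustration, and self-interactions are assumed to have been removed beforehand.

```python
def scaffold_graph(edges, k_c):
    """Keep the edges for which at least one end point has degree > k_c."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return [(u, v) for u, v in edges if max(degree[u], degree[v]) > k_c]

# Toy network: two hubs (h1, h2), a connector x of degree two linking them,
# and one peripheral link (a, b) between two low-degree nodes.
edges = [("h1", "a"), ("h1", "b"), ("h1", "c"), ("h1", "x"),
         ("h2", "x"), ("h2", "d"), ("h2", "e"), ("h2", "f"),
         ("a", "b")]

core = scaffold_graph(edges, k_c=3)
print(core)
```

With this cut-off the connector x survives because both of its links touch a hub, while the peripheral link between a and b is dropped.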
The modular organization of this network around a core of key proteins can be highlighted by using a scaffold graph (27). This core provides a well-defined sub-graph containing all the hubs and their interaction partners. A pair of connected proteins is retained in the so-called core graph if the degree of at least one protein of the pair is larger than a predefined cut-off kc. The core graph for this network is shown in Fig. 4b. In this way we keep the hubs, around which modules are organized, as well as those proteins that do not belong to the hub set but act as connectors. These are shown in the core graph as nodes of degree two linking pairs of hubs. They keep the whole system integrated but also help to isolate hubs, and thus increase global robustness against damage propagation.
An additional tool that also helps to detect modular organization has been used here. It is known as the correlation profile. It involves the comparison of the correlations of a given graph with those of its randomized counterpart (30). Since we are comparing degree correlations, it is important in this case to create a random graph that has the same degree distribution. This is achieved through a rewiring process in which the degree distribution is preserved. If two edges that have no vertices in common are chosen at random, the simple exchange of their starting vertices gives a graph with the same degree distribution but otherwise random. Iterating this process as many times as twice the number of edges yields a reasonably randomized graph with preserved degree distribution. If we denote by P(k_0, k_1) the probability of finding an edge connecting two nodes with degrees k_0 and k_1, and by P_r(k_0, k_1) its randomized equivalent, we can measure two quantities, namely
$$R(k_0, k_1) = \frac{P(k_0, k_1)}{P_r(k_0, k_1)}, \qquad (2)$$

whose deviation from unity reveals the correlations, and

$$Z(k_0, k_1) = \frac{P(k_0, k_1) - P_r(k_0, k_1)}{\sigma_r(k_0, k_1)}, \qquad (3)$$

quantifying the statistical significance of R(k_0, k_1), i.e., its Z-score. The value σ_r(k_0, k_1) is the standard deviation of P_r(k_0, k_1) in an ensemble of randomized networks.
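The rewiring procedure and the two quantities of Eqs. (2) and (3) can be sketched in a few lines. This is only an illustrative implementation under stated assumptions: the graph is an undirected edge list without self-interactions, swaps that would create a duplicate link are rejected, the "twice the number of edges" rule counts attempted swaps, and P_r is taken as the ensemble mean. All function names and the toy graph are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev

def degrees(edges):
    """Degree of every node in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def rewire(edges, rng):
    """Degree-preserving randomization by repeated end-point exchange."""
    edges = list(edges)
    present = {frozenset(e) for e in edges}
    for _ in range(2 * len(edges)):          # attempts: twice the number of edges
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:               # undirected: pick a random orientation
            c, d = d, c
        if len({a, b, c, d}) < 4:
            continue                         # the two edges share a vertex
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue                         # assumption: no multiple links
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

def pair_fractions(edges):
    """P(k0, k1): fraction of edges joining nodes of degrees k0 <= k1."""
    deg = degrees(edges)
    counts = Counter(tuple(sorted((deg[u], deg[v]))) for u, v in edges)
    return {pair: c / len(edges) for pair, c in counts.items()}

def correlation_profile(edges, n_random=200, seed=0):
    """Return {(k0, k1): (R, Z)} for degree pairs present in the real graph."""
    rng = random.Random(seed)
    real = pair_fractions(edges)
    ensemble = [pair_fractions(rewire(edges, rng)) for _ in range(n_random)]
    profile = {}
    for pair, p in real.items():
        values = [sample.get(pair, 0.0) for sample in ensemble]
        p_r, sigma = mean(values), pstdev(values)
        ratio = p / p_r if p_r > 0 else float("inf")
        z = (p - p_r) / sigma if sigma > 0 else float("nan")
        profile[pair] = (ratio, z)
    return profile

edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2),
         (3, 4), (4, 5), (5, 6), (6, 7), (7, 8)]
profile = correlation_profile(edges)
```

Only degree pairs that occur in the real graph are reported here; a fuller profile would also scan pairs that appear only in the randomized ensemble. Handling self-interactions, as in the two profiles discussed in the text, would need an additional rule that this sketch does not attempt.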
Fig. 4 shows two correlation profiles of the HTF network, computed in the presence (Panel a) or absence (Panel b) of self-interactions. The HTF network exhibits an abundance of nodes with self-interactions, meaning that many proteins are able to form homo-oligomers. Self-interactions are not a usual network property; however, comparison of these two correlation profiles shows evident changes, indicating the importance of self-interactions across the degree distribution. Part of the modularity of the HTF network is explained by such self-connectivity, since a common regulatory mechanism in TFs is based on homo- or hetero-oligomerization that modulates the activity of the transcriptional complex. One example is the c-myc/max/mad regulatory network (31), formed by members of the basic-helix-loop-helix superfamily (32). They contain domains with the ability to form oligomers by themselves (33). This is not exclusive to them, as it also occurs with the interacting regions of steroid hormone nuclear receptors (34) and the leucine zipper transcription factors (33). The presence of one of these domains not only confers the ability to homo-oligomerize but also allows binding to another transcription factor that contains the same domain. In other words, according to the modularity definition proposed earlier, they have a higher probability of binding to each other than to other proteins.

3. Modular Networks: The Role of Tinkering

How is modularity generated in the evolution of complex networks? Since modular organization is an obvious source of adaptation, it must have been a key component in the success of Darwinian selection. However, network growth through multiplicative processes is also an important ingredient in generating heterogeneities. Amplification processes, such as those derived from a tinkering process in which parts of the system are copied and pasted, are able to create complex correlations, and even modular structures.
Let us consider a generic model of node duplication and rewiring. This model will be restricted to a topological framework, thus ignoring any kind of functional constraint. The first component of the model allows the system to grow by means of the copy process of previous units (together with their links). The second introduces novelty by means
of changes in the wiring pattern, constrained in our approach to the newly created genes (35,36). This constraint is required if we assume that conservation of interactions is due to functional restrictions and that further changes in the regulation map are limited. Such a constraint would be strongly relaxed for a newly created (and redundant) unit. Within the context of protein interaction networks, the evolution of the globin gene family provides an example of genome evolution by gene duplication. Here the primitive, single-chain globin molecule (found in many insects and primitive fish) is closely related to the hemoglobin molecule present in higher vertebrates. Hemoglobin is a tetramer formed by two copies each of two slightly different globin chains. About 500 million years ago, a series of gene duplications and mutations took place, establishing the two different globin genes which allow, through the interaction of their protein products, a more efficient transport of oxygen. This is not a unique example: phylogenetic studies have shown that basic-helix-loop-helix transcription factors and steroid hormone nuclear receptors have evolved by gene amplification and domain shuffling (37,38).
The model used here considers a simple set of rules. First, a node is chosen at random and duplicated. Afterwards, redundant links are removed with some probability δ. Additionally, with probability β, a link connecting the old and the copied nodes is introduced. Here we fix β = 0.1 and use different probabilities of removal. In Fig. 5 we plot the connected component of the graph for different values of the removal probability. As expected, increasing deletion rates generate sparser networks, and in all cases we can show that, together with a heterogeneous distribution of links, proto-modules are also obtained.
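The three rules just listed translate almost directly into code. The sketch below is one possible reading of them, not the authors' implementation: it assumes a seed graph of two linked nodes, applies link removal only to the links inherited by the new copy, and keeps isolated nodes in the returned graph. The function name `grow_network` and its parameters are hypothetical.

```python
import random

def grow_network(n_final, delta, beta=0.1, seed=0):
    """Grow an undirected graph by node duplication and link removal.

    delta: probability that each link inherited by the copy is removed.
    beta:  probability of linking the copy to the node it was copied from.
    Returns an adjacency dict {node: set(neighbours)}.
    """
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}                       # assumed seed graph
    while len(adj) < n_final:
        parent = rng.randrange(len(adj))         # node chosen at random
        child = len(adj)                         # label of the new copy
        # The copy inherits each of the parent's links unless it is removed.
        kept = {nb for nb in sorted(adj[parent]) if rng.random() >= delta}
        if rng.random() < beta:
            kept.add(parent)                     # link between old and copied node
        adj[child] = kept
        for nb in kept:
            adj[nb].add(child)
    return adj

graph = grow_network(200, delta=0.45)
n_links = sum(len(nbrs) for nbrs in graph.values()) // 2
print(len(graph), "nodes,", n_links, "links")
```

To compare with Fig. 5 one would extract the largest connected component of `graph` for several values of `delta` and then apply the topological overlap analysis to it.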
Actually, careful analysis of the statistical properties shown by these graphs indicates that most statistical patterns exhibited by protein interaction networks seem matched by the toy model (35, 36). A physical view of the problem confirms this prediction: due to fluctuations, stochastic network dynamics gives rise to modularity (39). These findings somewhat suggest that modularity, being a fundamental target of selection and the engine of evolvability, might actually be much easier to get than we would expect. Due to internal constraints associated to the
rules of network growth, proto-modular features are in fact inevitable and ready to be used by selective forces.
Figure 5. The architecture of a tinkered network, as generated by the simple model of duplication and re-wiring, for different values of the deletion rate (panel labels: δ = 0.15, 0.25, 0.45 and 0.65). As the rate of removal increases, the network becomes more and more sparse, and modules emerge spontaneously. Here the largest connected component is shown. Computing the hierarchical overlap, it can be shown that nested modularity is present, in spite of the lack of functional traits in the model.
4. Conclusions

It is generally acknowledged that modules define functional units and as such are the targets of selection (2,40). In this context, some authors describe modules as a critical level of biological organization (see also 2,11,16). In order to reach specialization and division of labor, evolutionary paths have provided the source of innovation by generating modular structures at multiple levels. The trend is very general and, not surprisingly, appears to be widespread in the design of computational systems, such as large software packages or electronic circuits. Modularity in complex networks can be detected at two levels: (a) by looking at the topological organization of connections or (b) by studying the common functional traits shared by different components. In this chapter we have seen that modular structures can be found at vastly different scales. The modular pattern found in protein contact maps is related to the differential folding dynamics displayed by different protein domains. Here the contact map detects such separated modules as a nested hierarchy, which highlights the presence of well-defined domains and further internal structure. Such a nested hierarchy is likely to be related to common mechanisms of folding at multiple scales. In this context, the most visible modular substructures correspond to domains, whereas the internal details within each sub-module are likely to be a by-product of such common features. At a larger scale, the protein interaction map completely ignores the details inherent to each protein. The resulting network thus only retains the set of links relating different proteins. As we have seen, the topological patterns found through the hierarchical clustering algorithm are consistent with the functional traits that relate proteins inside each module. Thus function and structure (at this level) come together. When looking at these structures, the obvious question is what came first.
Selection has been exploiting modules and their combinatorial explosions in a rather successful way. However, the previous topological models suggest that there is no need to assume a priori that modularity is generated through natural selection. The answer to the question is that topology and function (at the local scale) have been coevolving. The mechanisms allowing complexity to increase are
multiplicative: tinkering is inevitable and thus the amplification processes associated with duplication are the basic rules of play. But these rules immediately lead to heterogeneities that are available for selection. On the one hand, scale-free structures are inevitable and provide a source of efficient communication among different parts and robustness against random failure (see Chapter 1). Groups of related elements that are less connected to others can be shaped in order to generate isolated subsystems. Further copying of elements (or basic blocks) rapidly increases the modular patterns and the resulting division of labor, robustness and evolvability. It is interesting to see that such tinkered evolution is not restricted to biology. Technological graphs also have a scale-free architecture and modular organization. Here tinkering is also widely used as the complexity of the whole becomes large enough (see Table 3). Complex engineered designs require the coordination of a large number of interconnected sub-systems, and as their number increases, so do the constraints. Once given sub-systems have been shown to successfully implement a given function, they are easily re-used in other parts of the structure, either as basic blocks (such as logic gates in electronic circuits) or as modified subsystems, as occurs in software structures. We thus have a surprising case of convergent design in both cellular and man-made structures, where common global and local patterns are obtained by playing with available pieces and putting them together to perform complex functions.

Table 3. Tinkered networks: Here four different systems are considered, sharing their modular architecture and the relevant contribution of duplication and rewiring to their final structure.

System | p(K) | Modules | Origin of tinkering
Electronic Digital Circuits | Scale-Free γ ≈ 3 | Logic units | Design Constraints
Software Maps | Scale-Free γ ≈ 2.5 | Functional packages | Extensive reuse of code
Protein Structure | Single-Scale | Domains | Combinatorics by domain shuffling
Protein Networks | Truncated SF γ ≈ 2.5? | Protein modules | Duplication & rewiring
Acknowledgments

The authors thank the members of the CSL for useful discussions. Special thanks to Baldo Oliva for discussions on protein architecture. This work has been supported by grants FIS-2004, DELIS and NIH and by the Santa Fe Institute (RVS, SV).

References

1. Albert, R. and Barabási, A.L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics. 74, 47-97.
2. Hartwell, L.H., Hopfield, J.J., Leibler, S. and Murray, A.W. (1999). From molecular to modular cell biology. Nature. 402, C47-52.
3. Strogatz, S.H. (2001). Exploring complex networks. Nature. 410, 268-276.
4. Bornholdt, S. and Schuster, H.G. (2003). Handbook of Graphs and Networks: From the genome to the Internet. Weinheim: Wiley-VCH.
5. Dorogovtsev, S.N. and Mendes, J.F.F. (2003). Evolution of networks: From the genome to the Internet and WWW. Oxford: Oxford University Press.
6. Newman, M.E.J., Barabási, A.L. and Watts, D. (2005). The structure and growth of networks. Princeton: Princeton Univ. Press.
7. Ben-Naim, E., Frauenfelder, H. and Toroczkai, Z. (2004). Complex networks. Berlin: Springer Verlag.
8. Ancel, L.W. and Fontana, W. (2000). Plasticity, evolvability, and modularity in RNA. J Exp Zool. 288, 242-283.
9. Zeki, S. and Shipp, S. (1988). The functional logic of cortical connections. Nature. 335, 311-317.
10. Solé, R.V., Salazar-Ciudad, I. and García-Fernandez, J. (2002). Common Pattern Formation, Modularity and Phase Transitions in a Gene Network Model of Morphogenesis. Physica A. 305, 640-647.
11. Dassow, G. and Munro, E. (1999). Modularity in animal development and evolution: elements of a conceptual framework for EvoDevo. J Exp Zool. (Mol. Dev. Evol.). 285, 307-32.
12. Ravasz, E., Somera, A.L., Mongru, D.A., Oltvai, Z.N. and Barabasi, A.L. (2002). Hierarchical organization of modularity in metabolic networks. Science. 297, 1551-1555.
13. Erdös, P. and Rényi, A. (1960). On the evolution of random graphs. Publ Math Inst Hung Acad Sci. 5, 17-60.
14. Valverde, S. and Solé, R.V. (2005). Network motif in computational graph: A case study in software architecture. Phys Rev E. 72, 026107.
15. Cancho, R.F., Janssen, C. and Solé, R.V. (2001). Topology of technology graphs: Small world patterns in electronic circuits. Phys Rev E. 64, 046119.
16. Wagner, G.P. and Altenberg, L. (1996). Complex Adaptations and the Evolution of Evolvability. Evolution. 50, 967-976.
17. Stadler, B.M., Stadler, P.F., Wagner, G.P. and Fontana, W. (2001). The topology of the possible: formal spaces underlying patterns of evolutionary change. J Theor Biol. 213, 241-274.
18. Jacob, F. (1976). Evolution as tinkering. Science. 196, 1161-1166.
19. Duboule, D. and Wilkins, A.S. (1998). The evolution of 'bricolage'. TIG. 14, 54-59.
20. Solé, R.V., Ferrer, R., Montoya, J.M. and Valverde, S. (2002). Selection, Tinkering and Emergence in Complex Networks. Complexity. 8, 20-33.
21. Schuster, P. (2000). Taming combinatorial explosion. Proc Natl Acad Sci U.S.A. 97, 7678-7680.
22. Barabasi, A.L. and Albert, R. (1999). Emergence of scaling in random networks. Science. 286, 509-512.
23. Baron, M., Norman, D.G. and Campbell, I.D. (1991). Protein modules. Trends Biochem Sci. 16, 13-17.
24. Sonnhammer, E.L. and Kahn, D. (1994). Modular arrangement of proteins as inferred from analysis of homology. Protein Sci. 3, 482-492.
25. Creighton, T.E. (1996). Proteins: Structure and Molecular Properties. New York: W. H. Freeman and Company.
26. Fernández, P. and Solé, R.V. (2005). Graphs as models of large-scale biochemical organization. In: D. Bonchev and D.H. Rouvray (eds.), Complexity in Chemistry, Biology, and Ecology. New York: Springer.
27. Rodriguez-Caso, C., Medina, M.A. and Solé, R.V. Topology, tinkering and evolution of the human transcription factor network. FEBS Journal, submitted.
28. Barabasi, A.L. and Oltvai, Z.N. (2004). Network biology: understanding the cell's functional organization. Nat Rev Genet. 5, 101-113.
29. Vogelstein, B. and Kinzler, K.W. (2001). Achilles' heel of cancer? Nature. 412, 865-866.
30. Maslov, S. and Sneppen, K. (2002). Specificity and stability in topology of protein networks. Science. 296, 910-913.
31. Luscher, B. (2001). Function and regulation of the transcription factors of the Myc/Max/Mad network. Gene. 277, 1-14.
32. Atchley, W.R. and Fitch, W.M. (1997). A natural classification of the basic helix-loop-helix class of transcription factors. Proc Natl Acad Sci U.S.A. 94, 5172-5176.
33. Branden, C. and Tooze, J. (1999). Introduction to protein structure. New York: Garland Publishing, Inc.
34. Zhang, X.K. and Pfahl, M. (1993). Hetero- and homodimeric receptors in thyroid hormone and vitamin A action. Receptor. 3, 183-191.
35. Sole, R.V., Pastor-Satorras, R., Smith, E. D. and Kepler, T. (2002). A model of large-scale proteome evolution. Adv. Complex Systems. 43-54.
36. Pastor-Satorras, R., Smith, E. and Sole, R.V. (2003). Evolving protein interaction networks through gene duplication. J Theor Biol. 222, 199-210.
37. Morgenstern, B. and Atchley, W.R. (1999). Evolution of bHLH transcription factors: modular evolution by domain shuffling? Mol Biol Evol. 16, 1654-1663.
38. Laudet, V. (1997). Evolution of the nuclear receptor superfamily: early diversification from an ancestral orphan receptor. J Mol Endocrinol. 19, 207-226.
39. Guimera, R., Sales-Pardo, M. and Amaral, L.A.N. (2004). Modularity from fluctuations in random graphs and complex networks. Phys Rev E. 70, 025101.
40. Solé, R.V., Salazar-Ciudad, I. and García-Fernandez, J. (2002). Common Pattern Formation, Modularity and Phase Transitions in a Gene Network Model of Morphogenesis. Physica A. 305, 640-647.
CHAPTER 3 INFERENCE OF BIOLOGICAL REGULATORY NETWORKS: MACHINE LEARNING APPROACHES
Florence d'Alché-Buc Informatique, Biologie Intégrative et Systèmes Complexes, FRE 2873 CNRS and Epigenomics Project, Genopole®, 523, place des Terrasses de l’Agora , 91000 Evry, France
1. Introduction

Regulatory processes at work in the cell imply various interactions between macromolecular components. Usually referred to as biological regulatory networks, these processes involve genes, RNA and proteins. Through time, some genes act via their products as regulators of other genes, inhibiting or inducing the expression of the regulee. This nonstop interaction between regulators and regulees enables the cell to produce adequate responses to internal or external signals. Hence, a coherent global dynamical behavior seemingly emerges from the elementary individual behaviors. Identifying and understanding these regulatory mechanisms now appears as one of the key challenges in systems biology, with potential applications in therapeutic targeting, drug design, diagnosis and management of disease (1). For the last three decades these genetic regulatory networks have been an object of study for pioneering modelers who have provided convincing arguments in favor of the primordial role of dynamics in these systems. Networks involving at most tens of genes have been
studied and modeled using either the continuous paradigm of differential equations (2,3) or the discrete frameworks of Boolean networks (4) and discrete automata (5). The development of these frameworks has contributed to a better understanding of some of the major regulatory mechanisms involved in the cell, yielding a comprehensive view of the dynamical phenomena responsible, for instance, for epigenetic switches. While there still exist numerous fundamental challenges in direct modeling (6,7), the development and spread of microarray data has caused a surge of interest in the reverse modeling task, which aims at identifying the true regulatory system from the observed gene expression profiles. This identification step was actually missing in the classic loop of modeling, leaving to biologists and modelers the task of assigning values to model parameters as well as the choice of the network structure. The arrival of large scale transcriptomic data has now opened the door to the automated learning of both the parameters and the structure of such networks. By exploiting regularities in the data, machine learning enables the identification of the unknown parameters and, to some extent, also provides the structure of networks. This chapter aims at reviewing the main streams in this fast-developing field with a focus on a comparative study of the properties of each family of approaches. It is not meant to be exhaustive but rather aims to provide a comprehensive view of what has been achieved so far using the machine learning framework, and to outline which points require further developments.

1.1. Feasibility of Inference

From the biological side, it is not clear that regulatory networks can be identified from mRNA concentrations alone, owing to the role of additional sources of regulation such as micro RNA.
Even if the reverse-modeling approach is restricted to transcriptional regulatory networks, the interlacing roles of genes and proteins in the overall regulation of the cell clearly show that deciphering gene networks would be made easier by considering side information such as known interactions between proteins, biochemical reactions involved in metabolic pathways and various other information linked to the genes.
From the computer science and statistics side, the identification of regulatory networks from data falls into the field of machine learning or empirical inference. Given a class of mathematical functions, a training dataset and some prior knowledge, a learning algorithm seeks the function in this class that achieves a good fit to the data while restraining the complexity of the mathematical function. We will show in this chapter that there are many ways to express the reverse-modeling of regulatory networks as a machine learning problem. However, regardless of the adopted approach, the hardest point to solve is the identification of the network structure. This problem, when expressed as a combinatorial one, is known to be NP-hard (8). NP-hardness either calls for a relaxation of the combinatorial problem into a continuous one or demands heuristics to explore the finite but huge set of candidate networks for a given number of genes. Moreover, the feasibility of the task depends tightly on the size of the training dataset and on how much the learning problem is constrained by prior knowledge. With few data and limited prior knowledge, the learning problem is underconstrained and multiple solutions exist.

1.2. Overview of Methods

Between the first papers in 1998 about the learning of regulatory networks and today, three main research directions have been progressively explored:
• the learning of dynamical models of genetic regulation,
• the learning of static models of genetic regulation,
• the elucidation of the sole structure of the network without a model of the behavior of the gene network.
The first direction is complementary to direct modeling and is the closest to a comprehensive view of interaction networks. The model to be learnt takes time into account and thus allows the representation of dynamical behaviors. Once learnt, it can be used for simulation and prediction and thus, as a long-term goal, for therapeutic targeting. However, this approach suffers from the lack of available time-series data: its future success depends crucially on the production of such experimental data.
F. d'Alché-Buc
With the second direction, time is no longer taken into account and the model (mainly a Bayesian network) gives a view of the causal dependencies between variables (states of genes) but can no longer encode feedback loops or cycles in the network. More data exist for such a simplified view, which has contributed to the success of this approach.
In the third direction, the idea of modeling is abandoned and the focus is put only on learning the structure itself. In this case, the learning framework is still unsupervised, with for instance techniques that compute the mutual information between two gene expression profiles. If we enlarge our scope to other biological networks, such as undirected protein-protein interaction networks or enzyme networks, we find in the literature new supervised approaches that aim to capture the features characterizing an edge between two biological nodes. Currently, these methods are limited to well-known or partially known networks, but they could be of interest for gene regulatory network inference.
In this review, we focus on the reverse-modeling approaches that infer not only the network but also the functional parameters that govern its behavior. It should be emphasized that each of these approaches raises difficult technical points, some of which remain unsolved. These points, which can appear as limitations for applications, are sometimes underestimated in the literature. In this chapter, they are discussed in order to promote research in machine learning in the corresponding areas. In parallel, by exposing the mechanics of the learning machinery, the chapter provides comparison tools to biologists and modelers who may be put off by the large range of inference tools.
The main research streams in the inference of regulatory networks are presented here from a transversal point of view. Solving a machine learning task generally requires solving three major issues: the representation issue, the optimization issue and the validation issue. The features of the network inference methods are described here according to this organization, thereby outlining their similarities and differences. The chapter is structured as follows. Section 2 introduces the notion of gene regulatory networks and the problem of statistical inference with its different steps. Section 3 describes how the representation issue has been solved through various choices of models. Section 4 is devoted to the optimization problem that underlies the network inference task. Section 5 covers the validation of network inference methods, namely statistical validation as well as biological validation. A conclusion and some perspectives are given in Section 6.

2. The Inference of Gene Regulatory Networks as a Machine Learning Problem

2.1. Gene Regulatory Networks

The cellular response to input signals depends on the state of the proteome in the cell. Contrary to the genome, the proteome differs from one cell to another, depending on the cell type, and the amount and nature of the proteins present vary over time. The major mechanism controlling the production of proteins is the regulation of gene expression [a]. This regulation can take several forms, but one of the most important is transcriptional regulation, i.e. some genes regulate the transcription of other genes. A gene i directly regulates a gene j if the protein encoded by i is a transcription factor for gene j, i.e. if it binds to DNA on a specific site near the sequence coding for j, called a regulatory region of j, and activates or inhibits its transcription. Regulation can be indirect, e.g. i activates j, which activates k, and cooperative, i.e. several genes produce proteins that form a complex regulating the same target gene. As said before, all these regulations take effect through time: some amount of time is required for transcription, then for translation, for finding the binding site and for the binding itself. Hence a complex dynamical system, made of the genes and their products, is at work, the proteins being the main variables of interest. Depending on the degree of abstraction of the model, we can either consider only the gene expression levels as the main variables or include the protein concentrations as well.

[a] The reader can find a more comprehensive introduction to gene regulatory networks in Chapter 4.
A transcriptional gene regulatory network can thus be fully defined by a directed, labeled graph that describes the presence, the orientation and the sign of the regulations, together with equations, for instance differential equations, that define mathematically the meaning of the arrows of the graph. Depending on the degree of abstraction and simplification of the model, the automated inference of gene regulatory networks amounts either to extracting both the structure and the parameters of the dynamical equations, or to extracting only the dependencies between variables without taking time into account. Machine learning, the field of automated inference, offers not only the concepts to express the problem formally from the statistical and computational points of view but also some guidelines to address it.

2.2. Machine Learning: A Short Definition

According to Herbert Simon, machine learning (9,10) concerns "any process by which a system improves its performance". Let us consider some mathematical function h that is supposed to model a fixed but unknown process, and call this function h a hypothesis. At the beginning of the learning procedure, the hypothesis is randomly initialized within a given class of hypotheses or biased by some available knowledge. Then the learning system is provided with observations of the true process. We say that this system is able to learn if it can exploit the available data, called the examples, to correct the parameters and possibly the structure of the hypothesis in order to model the underlying process. The quality of the resulting hypothesis is evaluated by its generalization ability, i.e. how correctly it predicts the true process at new points. All the choices that govern a learning process are made in order to ensure good generalization. This point is crucial to understanding the difference between machine learning and parameter fitting.
Given some fixed data, it is always possible to choose a hypothesis space in which a sufficiently complex function fits the data exactly. A model of regulatory networks obtained in this way would obviously be of no use to biologists, because it would be very sensitive to any small variation in the data and thus unable to make correct predictions. On the contrary, machine learning aims at finding a hypothesis that achieves a trade-off between an accurate fit of the observed data and the ability to deal with unseen data. Achieving this trade-off is related to the control of the bias and the variance of the estimator built on the training sample.
One can distinguish two main general frameworks for machine learning: batch learning and on-line learning. In batch learning, the data are provided once and learning occurs with no additional interaction with the environment, while in on-line learning the learning process runs constantly, fed by a stream of data. On-line learning makes active learning possible, which consists in asking for targeted data when learning cannot be achieved properly with the available information. Major advances in the reverse modeling of biological networks come from batch learning, which relies on a strong theoretical background.
Another distinction among learning algorithms concerns supervision: when learning uses labeled examples of the form (input, output), it is said to be supervised. Supervised learning ranges from regression to classification and has been extensively studied. In unsupervised learning the notion of input and output is absent: data are used to build clusters or to infer generative models that could have produced the data. Semi-supervised learning uses both labeled and unlabeled examples, usually to learn some input/output relation.

2.3. A Methodology for the Conception of a Learning Algorithm

First of all, an appropriate class of hypotheses has to be defined. If this class can be endowed with an inner product and is complete for the corresponding norm, it is called a Hilbert space. The hypothesis class must be rich enough to include at least one accurate model of the true process and simple enough to make the optimization step computationally feasible. This choice is closely related to the encoding of the data.
Pre-processing the raw data into more or less complex features leads to simpler hypothesis classes. This whole step is often referred to as the representation step.
Second, the search for an optimal hypothesis in the hypothesis class is translated into an optimization problem: the maximization of some objective function under some constraints. These choices are guided by the nature of the problem at hand and by theoretical considerations. The usual ingredients of an objective function are a term that measures how well the hypothesis fits the observed data and a second term that penalizes the complexity of the hypothesis. Then an optimization algorithm devoted to this objective function has to be developed or adapted from a class of existing optimization algorithms such as Expectation-Maximization (EM), linear or quadratic programming, and stochastic algorithms. This complete step, which remains closely linked to the first one, is called here the optimization step.
Third, the hypothesis provided by the learning algorithm needs to be evaluated in order to be validated or rejected. This phase deals with the estimation of the generalization error (the probability that the hypothesis goes wrong). Robust estimation methods such as the bootstrap or cross-validation are used for this purpose. We call this last step the internal validation step in order to differentiate it from the biological validation of the results, which is always required when inferring biological models. We will show that the validation question is far from being solved in the context of network reconstruction.

[Figure 1 diagram; box labels: data, knowledge, Representation, Optimization, Validation, Learnt model.]
Figure 1. Methodology for machine learning.
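The three steps above can be illustrated on a deliberately tiny problem. The following sketch is not taken from the chapter: it assumes a one-parameter linear hypothesis y = w.x, a squared-error fit term, a quadratic complexity penalty weighted by a hypothetical parameter lam, and k-fold cross-validation as the internal validation step. Minimizing the penalized error is equivalent to maximizing its negative, which matches the "objective function" wording used above. All names and the synthetic data are illustrative assumptions.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Optimization step: minimize sum((y - w*x)^2) + lam * w^2 (closed form)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def cv_error(xs, ys, lam, k=5):
    """Internal validation: k-fold cross-validated mean squared prediction error."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))          # interleaved folds
        tr_x = [xs[i] for i in range(n) if i not in test_idx]
        tr_y = [ys[i] for i in range(n) if i not in test_idx]
        w = fit_ridge_1d(tr_x, tr_y, lam)          # fit on the training part only
        total += sum((ys[i] - w * xs[i]) ** 2 for i in test_idx)
    return total / n

# Synthetic data (assumption): true slope 2.0 plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]

# Choose the penalty weight by estimated generalization error, then refit.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = min(candidates, key=lambda lam: cv_error(xs, ys, lam))
w_hat = fit_ridge_1d(xs, ys, best_lam)
```

The point of the design is that the penalty weight is selected on held-out data, not on the data used for fitting, which is what separates learning from plain parameter fitting.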
The various network inference approaches fall into this machine learning framework (Fig. 1), although most of them do not discuss each of the key issues of machine learning. We propose here to present the main network inference approaches from this point of view in order to highlight the advantages and drawbacks of the existing approaches. The reader can refer to Fig. 2 for the instantiation of the learning methodology in the case of network inference.

[Figure 2 diagram; box labels: Prior knowledge, Class of models, Inferred model, Data, Learning, Validation, Objective function, Prior knowledge.]
Figure 2. Scheme of network inference.
3. Representation Issues

3.1. Prerequisites

Choosing a class of functions for reverse-modeling differs from what is done in direct modeling (6). In reverse modeling, more importance is given to the conditions of learnability, when they are known, and to the existence of an algorithm able to ensure generalization properties, whereas direct modeling is concerned solely with the trade-off between biological relevance and the level of abstraction necessary to provide keys for behavior analysis. For the network inference task, most works in the literature implicitly attempt to fulfill the following requirements:
• white-box requirement: the class of functions should provide relevant insight into the biological processes to be unraveled, i.e. correspond to a more or less refined model that may be used in simulation and prediction;
• availability of the data and adequacy of the model class to the data;
• the complexity of the model should be roughly in keeping with the sample size and the prior knowledge;
• existence or development of an appropriate learning algorithm: an algorithm devoted to the model class should exist or be developed on purpose, with theoretical convergence properties and a computational complexity as low as possible.
3.2. Questions When Accounting for Dynamics

Considering the crucial role played by dynamics in regulatory processes, it is not surprising that when large-scale data such as microarray data appeared in the mid-nineties, dynamical models of gene regulatory networks were rapidly considered for empirical inference. Only by accounting for time can one easily model feedback loops, which play a major role in regulation processes. Learning a dynamical model of a regulatory network can lead to the discovery of the underlying network topology and of the parameters of the dynamics. From the nature of the discovered dynamics one can deduce the kind of behavior that the biological system will adopt in response to a given input signal. Thus the identification of a dynamical model enables prediction, opening the door in principle to therapeutic targeting or drug discovery. However, in order to learn a dynamical model of gene networks, the training sample must be composed of gene expression time-series measured at discrete time-points. In principle, the concentrations of proteins through time would also be required, but they are practically never available. Several questions about the training data then arise: What is the time scale? Are enough gene expression time-series available? Are they long enough? In parallel, the main issues for the model class concern the following choices: stationary or non-stationary models, deterministic or non-deterministic models, continuous-time or discrete-time models, continuous-valued or discrete-valued variables, linear or non-linear models.

3.2.1. Encoding the Data

It is possible, but still quite expensive, to process DNA microarrays at different time steps and thus to produce time-series of gene expression. The best-known datasets of gene expression kinetics concern the cell cycle of Saccharomyces cerevisiae (11) and its responses to various stresses (12). Temporal data are especially relevant for the biologist when evaluating the nature of the response of an organism to some specific signal. Ideally, several independent time-series (supposedly from the same fixed stochastic process) should be acquired for learning. Assigning various initial conditions to the (real) regulatory network makes it possible to observe various trajectories and thus to cover the space of trajectories, making learning possible. However, this requirement is not easily fulfilled with microarray data: very often biologists focus their study on one specific stress or signal. Replicated experiments provide at most two or three time-series, which may start from nearly the same initial conditions. Moreover, it should be emphasized that the number of training time-series cannot in general be traded for the length of a single training time-series, because starting from some specific initial conditions does not in general allow every possible state of the dynamics to be visited. These remarks imply that when a single training time-series is available, the learning algorithm may find, for a given family of models, not one but a whole set of candidate networks that could have produced the data, the problem being under-constrained.
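A toy illustration of this under-constrained situation is sketched below. It assumes, purely for the sake of the example, a noise-free two-gene linear update x(t+1) = W x(t); the two weight matrices and the initial conditions are hypothetical. Both networks reproduce the same single trajectory exactly, and only a second trajectory started elsewhere tells them apart.

```python
def simulate(W, x0, steps):
    """Iterate the toy linear update x(t+1) = W x(t) and return the trajectory."""
    traj = [list(x0)]
    for _ in range(steps):
        x = traj[-1]
        traj.append([sum(W[i][j] * x[j] for j in range(len(x)))
                     for i in range(len(W))])
    return traj

# Two structurally different candidate networks (hypothetical weights).
W_self = [[0.5, 0.0], [0.0, 0.5]]    # each gene depends only on itself
W_cross = [[0.0, 0.5], [0.5, 0.0]]   # each gene depends only on the other gene

# Started from equal expression levels, the two networks are indistinguishable.
same_start = simulate(W_self, [1.0, 1.0], 5) == simulate(W_cross, [1.0, 1.0], 5)

# A different initial condition separates them.
other_start = simulate(W_self, [1.0, 0.0], 5) == simulate(W_cross, [1.0, 0.0], 5)
```

Here `same_start` is `True` and `other_start` is `False`: lengthening the first time-series would not help, whereas adding a trajectory from a new initial condition does.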
3.2.2. Identifiability, Learnability and Sample Complexity

In systems and control theory, algorithms have been developed to test the identifiability of a system (linear or non-linear). In contrast to biological modeling, control theory can generate arbitrary inputs to a system, so the real challenge for these algorithms is their algorithmic complexity and their ability to conclude that a system is not identifiable when this is the case. In the reverse-modeling of biological networks, the issues are slightly different: we need to design experiments that will provide specific inputs to the biological system. Sontag (2002) proved that, for a system based on differential equations with r parameters, 2r+1 experiments are enough for identification, assuming exact measurements (13). However, if we choose the framework of machine learning for the empirical inference of parameters, the critical issue is again the generalization ability of the learnt model. Both PAC-learning theory and statistical learning theory offer a theoretical tool to address this issue: the sample complexity. Sample complexity measures how many data are needed for learning to provide an approximation of the true function with a bounded generalization error. For some simple classes of mathematical functions such as Boolean functions (14), results exist, usually loose upper bounds, that provide insight into the feasibility of learning. However, to our knowledge most of these results hold for static i.i.d. training samples and not for temporal data. Nevertheless, a few works extend the static results to the learning of stochastic processes: they use the fact that the data points of a time-series coming from a stationary system have a specific dependency that can be described by mixing coefficients. Depending on the nature of this dependency, it may be possible to obtain sample complexity results and, more generally, generalization error bounds. As results in this area will help the design of experiments, the coming years will surely witness a surge of interest in them.
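To give a feeling for what such an upper bound looks like, the sketch below evaluates the classical PAC bound for a finite hypothesis class H and a learner that is consistent with i.i.d. examples: m ≥ (ln|H| + ln(1/δ)) / ε examples suffice for an error of at most ε with probability at least 1 − δ. As an assumption made for this example only, H is taken to be the Boolean functions of at most k inputs chosen among n candidate regulators of one target gene, counted generously as C(n,k) · 2^(2^k). The bound applies to static i.i.d. samples, not to time-series, in line with the caveat above.

```python
import math

def log_num_boolean_hypotheses(n, k):
    """ln of an upper bound on the number of Boolean functions of at most
    k variables chosen among n: C(n, k) * 2**(2**k)."""
    return math.log(math.comb(n, k)) + (2 ** k) * math.log(2)

def boolean_pac_bound(n, k, eps, delta):
    """Number of i.i.d. examples sufficient for a consistent learner to reach
    error <= eps with probability >= 1 - delta (finite-class PAC bound)."""
    return math.ceil((log_num_boolean_hypotheses(n, k) + math.log(1 / delta)) / eps)

# Example: 10 candidate regulators, at most 2 inputs per gene,
# 10% error, 95% confidence.
m_small = boolean_pac_bound(10, 2, 0.1, 0.05)   # 96
m_large = boolean_pac_bound(100, 2, 0.1, 0.05)  # more candidates, more examples
```

The bound grows only logarithmically with the number of candidate regulators n but exponentially with the maximal number of inputs k, which is one reason why k is usually kept small in Boolean network inference.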
3.2.3. Time-Scale, Sampling Frequency and Irregular Sampling

Another important issue when dealing with dynamical biological systems concerns the sampling rate. Currently the sampling rate is chosen during the experimental design step without integrating the requirements of the learning and identification problem. Experimenters usually have a relatively precise idea of the time scale when the biological system considered is a prokaryote or an early eukaryote. For more complex eukaryotes, various events take place and their effects accumulate. The sampling rate is chosen to achieve a good trade-off between the quantity of information expected from the experiments and their cost and feasibility. However, if the time-dependent signals under consideration (kinetics of gene expression) are not properly sampled, the learning process will produce an irrelevant hypothesis. The sampling frequency should therefore satisfy some property related to the time-scale of the true dynamical system. For regulatory network inference, Bay et al. (2004) pointed out this problem, well known in the econometrics community since Granger (1969) as the temporal aggregation bias. Temporal aggregation occurs when a process is sampled more slowly than the natural rate at which it changes (15,16). Various works in econometrics have shown that it makes correct inference extremely difficult in both continuous-time and discrete-time systems. Bay et al. (2004) discuss a few current solutions and test them empirically on artificial data and autoregressive models (15). Interpolation, for instance with cubic splines, is often used as a pre-processing of the training data in order to increase the number of time-points (Bansal et al., 2006) (17). However, Bay et al. (2004) showed that such an interpolation may lead to spurious causal relations and proposed three other avenues for addressing the bias issue: the use of alternative measurement technologies to get improved sampling rates, the addition of perturbation experiments that can be treated as causal interventions, and the use of background knowledge in conjunction with the gene expression time-series (15).
As said before, time-series of gene expression are very often difficult to obtain for numerous organisms: acquiring these data requires a lot of time and a complex protocol. Because of these difficulties, experimenters usually choose very sparse time-points, generally at the very beginning of a signal induction and at some time where a stable level is supposed to be reached. Sampling is then irregular, which in theory causes no difficulty (at least to some extent) for an appropriate estimation algorithm: the model is iterated several times so as to fit the next observed time-points and to fill in the missing observations. Algorithms such as Expectation-Maximization can help to estimate the missing observations together with the other parameters, provided that the other, regularly sampled time-points are sufficiently numerous.

3.2.4. Continuous versus Discretized Encoding

The nature of the observed data strongly influences the choice of representation. Discretization makes sense when the goal of modeling is to characterize the kinds of dynamics that the biological system can exhibit (5,4,18). However, the discretization of continuous measurements can produce quantization noise and make the data useless. Continuous encoding avoids this drawback and allows the use of continuous optimization tools.

3.3. Deterministic Models of Dynamics

Stationarity is usually assumed when modeling regulatory networks. This question is little discussed in the network inference literature because, as a first approximation, the assumption is reasonable when focusing on a specific system over a short temporal window. First-order dynamical models with synchronous updating are the simplest and most general way to encode dynamical gene regulatory networks (GRN). Most of the models studied in the inference framework have been inspired by the systems of differential equations introduced to describe the behavior of genes and proteins over time at a more or less fine level of detail. Even at the most refined level of detail, modelers have generally considered that first-order dependencies are sufficient to encode the relation between past and present. Let us denote by xi(t) the expression level of gene i at time t.
General continuous time systems based on first order differential equations can be described as:
$$\forall i \in \{1,\dots,n\}, \quad \frac{dx_i(t)}{dt} = f_i(x_1(t),\dots,x_n(t)) \tag{1}$$
Discrete-time variants of these models are also considered, under the assumption that the time-lag is appropriate to the system under study. The transition from the state of the network at time t to its state at time t+1 is given by the following equations:
$$\forall i \in \{1,\dots,n\}, \quad x_i(t+1) = f_i(x_1(t),\dots,x_n(t)) \tag{2}$$
with fi ∈ F. The family of transition functions F ranges from Boolean functions to more complex functions. For a gene i, the function fi encodes which genes play a regulating role, and thus reduces to some function gi such that fi(x1(t),...,xn(t)) = gi(x1[i](t),...,xn[i](t)), where n[i] is the exact number of regulators of gene i, 1[i] being its first regulator, 2[i] the second, etc. In the following, we introduce the most important classes of models encountered in the literature, from the most coarse-grained to the most fine-grained.

3.3.1. Temporal Boolean Network Models

In (temporal) Boolean networks, first introduced by Kauffman (see for instance ref. 4) and since used for network inference (18,19), the gene expression variables are binary and F is Fbool(k), the set of Boolean functions that depend on at most k variables. The functions can be described extensionally by a truth table or more compactly by a DNF formula or a decision tree. The most interesting features of Boolean networks are their simplicity and the fact that they give rise to a finite space of possible states for the network under study. Thus, given a known Boolean network, it is possible to identify and study its attractors, the only limit being the size of the network. From the learning perspective, we should note that this model requires a drastic discretization of the observed gene expression levels into binary values, which introduces a quantization noise that can strongly bias the learning process and prevent accurate modeling.
Figure 3. A Boolean network G(V,F) and its wiring diagram G′(V′,F′). The state transition table shows the expression patterns at time t (Input) and the expression patterns at time t+1 (output).
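Because the state space of a Boolean network is finite, its attractors can be found by exhaustive simulation, as the following sketch shows. The three-gene network and its update rules are hypothetical and are not the network of Fig. 3; the function names are likewise illustrative. Updating is synchronous, as in equation (2).

```python
from itertools import product

def step(state, rules):
    """Synchronous update: every gene applies its Boolean rule to the current state."""
    return tuple(int(rule(state)) for rule in rules)

def attractors(rules, n):
    """Enumerate all 2**n states, follow each trajectory until a state repeats,
    and collect the cycles reached (fixed points are cycles of length 1)."""
    found = set()
    for start in product((0, 1), repeat=n):
        seen = {}                       # state -> position along the trajectory
        state = start
        while state not in seen:
            seen[state] = len(seen)
            state = step(state, rules)
        # 'state' is the first repeated state: the cycle starts at its position.
        cycle = [s for s, pos in seen.items() if pos >= seen[state]]
        m = cycle.index(min(cycle))     # rotate to a canonical starting state
        found.add(tuple(cycle[m:] + cycle[:m]))
    return found

# Hypothetical 3-gene network: gene 0 is repressed by gene 2,
# gene 1 copies gene 0, gene 2 requires both gene 0 and gene 1.
rules = [
    lambda s: not s[2],
    lambda s: s[0],
    lambda s: s[0] and s[1],
]
found = attractors(rules, 3)
```

For this toy network every initial state ends up in a single cycle of five states. The enumeration costs 2^n trajectories, which is why, as noted above, the size of the network is the practical limit of such an analysis.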
3.3.2. Linear Networks

Linear networks, introduced by D’Haeseleer, assume that the effect of regulators is additive (20). This is a restriction compared with Boolean networks, which allow conjunctive effects. In particular, linear models cannot account for a complex of proteins acting as a transcription factor. These models are characterized by continuous expression variables and F = Flin, the set of affine functions:
$$x_i(t+1) = f_i(x_1(t),\dots,x_n(t)) = \sum_j w_{ij}\, x_j(t) + b_i \tag{3}$$
D’Haeseleer motivated his linear model as a discrete version of the linear differential equations used to describe the kinetics of gene expression. In this model, the functions can still be interpreted in terms of structure: the coefficient wij can be read as the strength of the coupling between the regulator expression xj(t) and the regulee expression xi(t). A null coefficient means that gene j does not regulate gene i, a positive coefficient suggests an induction of i by j, i.e. a positive regulation, and a negative coefficient means an inhibition of i by j. From the learning point of view, the model can easily be identified using regularized linear regression, as further discussed in Section 4. From the point of view of dynamical systems, linearity is a serious drawback because biological systems are characterized by saturation and are not supposed to take unbounded values. The possible dynamical behaviors reduce to stable, unstable and periodic solutions (21). Bansal et al. (17) also proposed a slightly more complex linear dynamical system that integrates additional terms accounting for external perturbations encoded by a vector u(t):
$$x_i(t+1) = f_i(x_1(t),\dots,x_n(t)) + g_i(u(t)) = \sum_j a_{ij}\, x_j(t) + \sum_l b_{il}\, u_l(t) \tag{4}$$
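As a minimal sketch of how the linear model of equation (3) can be identified by regularized linear regression, the code below fits each row of W by ridge regression on consecutive pairs (x(t), x(t+1)) of one time-series. Several simplifications are assumptions of this example and not part of the methods cited above: the bias term is omitted, the data are noise-free and simulated from a hypothetical two-gene matrix W_true, and the small linear systems are solved with a hand-written Gauss-Jordan routine to keep the example free of external libraries.

```python
def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ridge_row(X_prev, y_next, lam):
    """Weights of one target gene: minimize ||y - X w||^2 + lam * ||w||^2."""
    d = len(X_prev[0])
    A = [[sum(x[a] * x[b] for x in X_prev) + (lam if a == b else 0.0)
          for b in range(d)] for a in range(d)]
    rhs = [sum(x[a] * y for x, y in zip(X_prev, y_next)) for a in range(d)]
    return solve(A, rhs)

def fit_linear_network(series, lam=1e-6):
    """Estimate W row by row from one time-series (list of state vectors)."""
    X_prev = series[:-1]
    n = len(series[0])
    return [ridge_row(X_prev, [x[i] for x in series[1:]], lam) for i in range(n)]

# Simulate a hypothetical 2-gene network: gene 1 inhibits gene 0 (negative w01),
# gene 0 activates gene 1 (positive w10).
W_true = [[0.9, -0.4], [0.3, 0.8]]
series = [[1.0, 0.5]]
for _ in range(12):
    x = series[-1]
    series.append([sum(w * xj for w, xj in zip(row, x)) for row in W_true])

W_hat = fit_linear_network(series)
```

The signs of the entries of `W_hat` can then be read as described above: a negative estimate of w01 suggests that gene 1 inhibits gene 0. With noisy or short series, the penalty weight `lam` would have to be larger and chosen by a validation procedure.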
3.3.3. Artificial Recurrent Neural Networks

A straightforward extension of linear networks is the recurrent artificial neural network: the variables are also continuous and fi is now defined as follows:
$$x_i(t+1) = f_i(x_1(t),\dots,x_n(t)) = s\Big(\sum_j w_{ij}\, x_j(t)\Big) \tag{5}$$
where s is a sigmoid function [b]. In this case, we write F = Fann. Applying a threshold function after the linear combination of the regulators makes the expression level of the regulee saturate, which is an advantage over linear models. Learning is still possible and was extensively studied in the late 1980s using back-propagation through time or even evolutionary algorithms. Note that the coefficients wij keep the same interpretation as before.

[b] A sigmoid function is a smoothed threshold function.
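The effect of the sigmoid can be seen by iterating the same weight matrix with and without it. The sketch below uses the logistic function as one possible choice of s and a hypothetical two-gene weight matrix; both are assumptions made for illustration.

```python
import math

def sigmoid(a):
    """Logistic function, one common choice of smoothed threshold."""
    return 1.0 / (1.0 + math.exp(-a))

def linear_step(W, x):
    """One step of the linear model (equation (3) without bias)."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def rnn_step(W, x):
    """One step of the recurrent network: sigmoid of the same linear combination."""
    return [sigmoid(a) for a in linear_step(W, x)]

# Hypothetical weights: gene 0 activates itself and gene 1; gene 1 inhibits itself.
W = [[2.0, 0.0], [3.0, -1.0]]

x_lin = [1.0, 1.0]
x_rnn = [1.0, 1.0]
for _ in range(20):
    x_lin = linear_step(W, x_lin)   # gene 0 doubles at every step
    x_rnn = rnn_step(W, x_rnn)      # every level stays strictly between 0 and 1
```

After 20 steps the linear expression level of gene 0 has grown to 2^20, whereas the levels of the sigmoid network remain bounded, which is the saturation property mentioned above.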
3.4. Probabilistic Models of Dynamics

Various studies have shown that genetically identical cells exhibit great diversity even when they are exposed to the same input signals (22,23). The presence of intrinsic noise can explain these variations. The inherently random nature of the biochemical reactions of gene expression is now commonly accepted. This is of course a strong argument in favor of probabilistic or stochastic models. Moreover, it is well known that acquisition techniques such as DNA microarrays generate measurement noise, which should also be taken into account explicitly as extrinsic noise. Temporal Boolean networks on the one hand, and linear and non-linear models on the other, have led to probabilistic models called respectively PBN (Probabilistic Boolean Networks) and DBN (Dynamical Bayesian Networks). As there is a strong link between PBN and DBN, as stressed by Lähdesmäki et al. (2006), we focus here on DBN for the sake of clarity (24).
Figure 4. First order Markov model.
DBN belong to the family of graphical models. In a probabilistic graphical model, the variables of interest are considered as random variables and a graph represents the causal dependencies between them. Graphical models provide a useful framework to represent the joint probability of all the variables. In DBN, the variables considered depend on time, and time naturally provides the direction of causality. A first structure is thus given by this temporal dependency and is necessarily linear. This very broad definition encompasses autoregressive models, Markov chains and hidden Markov models. Let us consider that the variables take continuous values. The simplest model (represented in Fig. 4) is a first-order Markov model, characterized by the following joint probability distribution over a temporal window [1,T]:
$$P(x(1),\dots,x(T)) = P(x(1))\, P(x(2) \mid x(1)) \cdots P(x(T) \mid x(T-1))$$
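The practical value of this factorization is that the log-probability of a whole time-series becomes a sum of one-step terms. The sketch below makes this explicit for a scalar variable, assuming (for illustration only) a Gaussian initial distribution and Gaussian transitions whose mean is a linear function a.x(t-1) of the previous value; the parameter names are hypothetical.

```python
import math

def log_gauss(v, mean, var):
    """Log-density of a univariate Gaussian N(mean, var) evaluated at v."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)

def markov_log_likelihood(xs, a, var, init_mean=0.0, init_var=1.0):
    """log P(x(1),...,x(T)) = log P(x(1)) + sum over t of log P(x(t) | x(t-1)),
    with Gaussian transitions centred on a * x(t-1) (an assumed form)."""
    ll = log_gauss(xs[0], init_mean, init_var)
    for prev, cur in zip(xs, xs[1:]):
        ll += log_gauss(cur, a * prev, var)
    return ll

# A series that halves at every step is far more probable under a = 0.5
# than under a = -0.5.
xs = [1.0, 0.5, 0.25]
ll_good = markov_log_likelihood(xs, a=0.5, var=0.1)
ll_bad = markov_log_likelihood(xs, a=-0.5, var=0.1)
```

Comparing such log-likelihoods across parameter values is the basis of likelihood-based estimation for this family of models.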
3.4.1. Linear Models and Linear State-Space Models

A straightforward way to cope with stochastic dynamical processes is to take the previous deterministic models and add some noise. De Hoon et al. (2003), inspired by the seminal work of Chen et al. (1999), proposed to start from linear differential equations, discretize time and add Gaussian noise (25,26). This setting leads to a multidimensional linear dynamical system for which a likelihood criterion can be maximized. For the whole network state, we have:
$$x(t+\Delta t) - x(t) = \Delta t\, W x(t) + \varepsilon(t) \tag{6}$$
If the noise is homoscedastic (diagonal covariance with the same value on the diagonal), then (6) can be written as:
$$\forall i \in \{1,\dots,n\}, \quad x_i(t+\Delta t) = x_i(t) + \Delta t\, w_i\, x(t) + \varepsilon_i(t) \tag{7}$$
where wi is the ith row vector of the matrix W and εi(t) the ith coordinate of ε(t). Moreover, equation (6) can be rewritten in terms of conditional probabilities instead of a transition equation. Given ε(t) an isotropic Gaussian noise N(μ; σ2.I), we have:
$$P(x(t+\Delta t) \mid x(t)) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{\lVert x(t+\Delta t) - x(t) - \Delta t\, W x(t) - \mu \rVert^2}{2\sigma^2}\right) \tag{8}$$
As for linear deterministic models, the interpretation of the coefficients wij remains valid. Moreover, this probabilistic model provides an estimate of the underlying true distribution of gene expression. The model proposed by De Hoon only accounts for extrinsic noise. It can be tempting to differentiate intrinsic noise from extrinsic noise and to consider two spaces: the observation space and the hidden state-space. The true stochastic process lives in the hidden state-space, where noise is present. In this model, called a state-space model, an example of which is represented in Fig. 5, the observed data are supposed to be produced by the hidden process.

Figure 5. Representation of an unrolled state-space model. Dependencies between the hidden variables have been explicitly represented by arrows.

The Inertial Dynamic Bayesian Network (IDBN) was introduced and tested in Perrin et al. (2003), d’Alché-Buc et al. (2005) and Quach et al. (2006) (27,28,29). The definition of this model also starts from linear differential equations, but of second order. With second-order dependencies, a linear model is able to capture the delay between the expression of a regulator and that of its regulee. Moreover, by a change of variables, this second-order model can be recast as a first-order model based on a hidden state-space. Denoting by y(t) the vector of observed gene expression levels and by x(t) the vector composed of the true gene expression levels followed by their derivatives, x(t) = [e1(t),...,en(t), e’1(t),...,e’n(t)], the equations are the following:
$$x(t+1) = A\, x(t) + \zeta(t) \tag{9a}$$
$$y(t) = C\, x(t) + \varepsilon(t) \tag{9b}$$
where $\varepsilon(t)$ and $\zeta(t)$ are Gaussian noises in the observation space and in the hidden state space, respectively. Other state-space models based on linear first-order equations have been proposed in which external inputs are added to (9a) (30,31). All these models are particular instances of dynamic Bayesian networks, and the joint probability density of the observations can be written in terms of conditional probabilities:

$$p(y(1), \ldots, y(T)) = \int A(1) \prod_{t=1}^{T-1} A(t+1)\; dx(1) \cdots dx(T) \tag{10}$$
with the following notations: $A(1) = p(y(1) \mid x(1))\, p(x(1))$ and $A(t+1) = p(y(t+1) \mid x(t+1))\, p(x(t+1) \mid x(t))$ for $t = 1, \ldots, T-1$. Being able to deal with additional hidden variables is one of the most important features of this approach: if $d$ is the dimension of the state space, adding one hidden unit amounts to adding one dimension to equation (9a) (30,31). Learning algorithms such as Expectation-Maximization make it possible to estimate such hidden variables and the corresponding parameters.

3.4.2. Dynamical Bayesian Networks Using Non-Parametric Regression for Conditional Probability Distributions (CPD)

In DBN, it is also possible to decompose the joint probability distribution in terms of one-dimensional variables $x_i(t)$, each of which represents the expression level of a single gene $i$ at time $t$. The model then has two structures: the first, which is linear, is given by time; the second corresponds to the dependencies between one-dimensional state variables at times $t-1$ and $t$. We call $Pa[i,t-1]$ the set of parents of the variable $x_i(t)$. An oriented edge from a node $j$ considered at time $t-1$ to a node $i$ considered at time $t$ means that node $j$ belongs to the set of parents of node $i$.
$$P(x_1(1), \ldots, x_n(T)) = P(x(1)) \prod_{1 < t \le T}\ \prod_{1 \le i \le n} P(x_i(t) \mid Pa[i,t-1]) \tag{11}$$
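As a minimal sketch of the factorization in equation (11), the following Python snippet evaluates the log-probability of a short binary time series under a two-gene DBN. The parent sets, the table values and the helper name `dbn_log_prob` are illustrative assumptions and are not taken from the works cited; the initial distribution P(x(1)) is assumed uniform for simplicity.

```python
import math

# Hypothetical two-gene DBN: gene 0 has parent {0}, gene 1 has parents {0, 1}.
PARENTS = {0: (0,), 1: (0, 1)}

# CPD tables: P(x_i(t) = 1 | parent values at t-1). Values are illustrative only.
CPD = {
    0: {(0,): 0.2, (1,): 0.9},
    1: {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.8, (1, 1): 0.95},
}


def dbn_log_prob(series):
    """Log of equation (11) for a list of binary state tuples x(1), ..., x(T)."""
    n = len(series[0])
    # Uniform P(x(1)) over the 2^n initial states (an assumption of this sketch).
    logp = -n * math.log(2.0)
    for t in range(1, len(series)):
        prev, cur = series[t - 1], series[t]
        for i in range(n):
            pa_values = tuple(prev[j] for j in PARENTS[i])
            p_one = CPD[i][pa_values]
            logp += math.log(p_one if cur[i] == 1 else 1.0 - p_one)
    return logp


series = [(1, 0), (1, 1), (1, 1)]
print(dbn_log_prob(series))
```

Because the score is a sum of per-gene, per-time-step terms, changing the parent set of one gene only changes the terms of that gene.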
The model resembles a BN: there is no hidden state, and all the variables are supposed to be observable. Within this context, Kim et al. (2003) proposed to model the CPD using decompositions on a spline basis (32).

3.4.3. Models of Biochemical Processes

To provide a deep understanding of regulatory processes, modelers have introduced highly detailed models of the biochemical processes at work in the cell (33). These models, based on differential equations, describe transcription factor binding, diffusion, and protein and RNA degradation. In contrast to the previous approaches, the complexity of these detailed models makes inference unrealistic at a large scale. Zak et al. (2002) even showed that gene expression data alone do not allow a three-gene model based on such equations to be identified, and that additional information concerning protein-DNA interactions is needed to ensure identifiability (34). We present here the general form of these differential equations as proposed by Chen et al. (1999) (26). The previous notations are completed by the following ones: $z_i(t)$ is the concentration of protein $i$, $z(t)$ is the protein concentration vector, $f(z(t))$ is the transcription function, $L$ is an $n \times n$ non-degenerate diagonal matrix encoding the translational constants, and $V$ (resp. $U$) is an $n \times n$ non-degenerate diagonal matrix encoding the degradation rates of mRNAs (resp. proteins):

$$\frac{dx(t)}{dt} = f(z(t)) - V\,x(t) \tag{12a}$$

$$\frac{dz(t)}{dt} = L\,x(t) - U\,z(t) \tag{12b}$$
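To make the coupled mRNA/protein dynamics of (12a)-(12b) concrete, here is a hedged forward-Euler sketch for a single self-repressing gene, with the mRNA degradation term written as V x(t). The transcription function `self_repression`, all rate values and the function names are assumptions chosen for illustration; they do not come from Chen et al. (1999).

```python
def simulate(x0, z0, f, V, L, U, dt, steps):
    """Forward-Euler integration of dx/dt = f(z) - V x and dz/dt = L x - U z.

    V, L and U are lists holding the diagonal entries of the matrices.
    """
    x, z = list(x0), list(z0)
    n = len(x)
    for _ in range(steps):
        fz = f(z)
        dx = [fz[i] - V[i] * x[i] for i in range(n)]
        dz = [L[i] * x[i] - U[i] * z[i] for i in range(n)]
        x = [x[i] + dt * dx[i] for i in range(n)]
        z = [z[i] + dt * dz[i] for i in range(n)]
    return x, z


def self_repression(z):
    # Hypothetical transcription function: the protein represses its own gene.
    return [2.0 / (1.0 + z[0])]


V, L, U = [1.0], [0.5], [0.25]
x, z = simulate([0.0], [0.0], self_repression, V, L, U, dt=0.01, steps=5000)
print(x, z)
```

At the end of the run both derivatives are close to zero, i.e. transcription balances mRNA degradation and translation balances protein degradation.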
A slightly simpler but stochastic model was considered in Nachman et al. (2004) (35). Let us consider a gene $i$ with two transcription factors, proteins $j$ and $k$. The mRNA level still depends on a transcription rate, computed using a function $f$ of the two regulators $j$ and $k$, and on the degradation rate $\delta_i$, but the protein concentration is no longer linked to translation or to its own degradation.
$$\frac{dx_i(t)}{dt} = r_i(t) - \delta_i\, x_i(t) \tag{13a}$$

with:

$$r_i(t) = f(z_j(t), z_k(t); \theta_{i,j,k})\,(1 + \varepsilon_i(t)) \tag{13b}$$

$$z_i(t) = z_i(t-1) + \varepsilon_{h,i}(t) \tag{13c}$$
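The following sketch simulates equations (13a)-(13c) for one target gene with an Euler step. The multiplicative regulation function `product_rate`, the noise levels and the parameter values are hypothetical placeholders; the function $f$ actually used by Nachman et al. (2004) is not reproduced here.

```python
import random


def random_walk(z0, steps, sd, rng):
    """Regulator level following (13c): z(t) = z(t-1) + Gaussian noise."""
    z = [z0]
    for _ in range(steps - 1):
        z.append(z[-1] + rng.gauss(0.0, sd))
    return z


def simulate_gene(f, theta, delta, z_j, z_k, x0, dt, noise_sd, rng):
    """Euler simulation of (13a)-(13b) for one target gene.

    z_j and z_k hold the levels of the two regulators at each time step.
    """
    x = x0
    trajectory = [x]
    for zj, zk in zip(z_j, z_k):
        rate = f(zj, zk, theta) * (1.0 + rng.gauss(0.0, noise_sd))  # (13b)
        x = x + dt * (rate - delta * x)                              # (13a)
        trajectory.append(x)
    return trajectory


def product_rate(zj, zk, theta):
    # Hypothetical regulation function: rate proportional to both regulators.
    return theta * zj * zk


rng = random.Random(0)
z_j = random_walk(1.0, 200, 0.05, rng)
z_k = random_walk(1.0, 200, 0.05, rng)
traj = simulate_gene(product_rate, 2.0, 0.5, z_j, z_k, 0.0, 0.05, 0.1, rng)
print(traj[-1])
```

With both noise levels set to zero and constant regulators, the trajectory relaxes to the ratio of the transcription rate to the degradation rate.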
Note that this last model belongs to the family of Dynamical Bayesian Networks and thus inherits their associated learning framework.

3.5. Static Models of Causal Dependencies

Learning dynamical models requires a large amount of data in the form of several independent and identically distributed time series. If one ignores the temporal aspect of regulatory processes, the network reconstruction problem reduces to the search for causality chains among variables, which in principle requires i.i.d. observations. As DNA microarrays make it possible to measure simultaneously the expression levels of thousands of genes across several conditions or tissues, it can be interesting to exploit these "snapshots" of the behavior of regulatory networks with a static model. Major advances in this area concern the use and development of Bayesian Networks (36-38). The pioneering work of Friedman et al. (2000) introduced the use of Bayesian Networks (BN) for discovering and representing statistical dependencies between gene expression levels (36). Compared to DBN, this static model keeps two major advantages inherent to graphical models. First, it still takes into account the inherently stochastic character of biological processes and can cope with noisy data. Second, it still provides a generative model that can be used for simulation and prediction. However, the loss of the temporal aspect makes it impossible to account for feedback loops. The acyclicity of the graph G, which is required here in order to propagate information properly through the nodes, appears as a strong limitation of these approaches (39,40). Nevertheless, BN have been widely used for uncovering the structure of regulatory networks (38). From classic BN (36) to extensions such as probabilistic relational models (41), module networks (42) and factor graph networks (43), the spectrum of relevant variants is quite large.
3.5.1. Bayesian Networks

BN have been widely used in diagnosis problems where various causes lead to some consequence of interest. A BN is defined by a directed acyclic graph (DAG) $G$ and a set of conditional probability distributions that govern the behavior of each variable (node). If there is an oriented edge from node $j$ to node $i$, node $j$ is said to be a parent of node $i$. In our setting, nodes correspond to genes. The behavior of a variable is governed by its conditional probability given its parents. As in DBN, the object of interest is the joint probability distribution of the node variables. We have:

$$P(x_1, \ldots, x_n) = \prod_{1 \le i \le n} P(x_i \mid Pa[i]) \tag{14}$$
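Equation (14) can be illustrated with a small discrete example. The three-gene graph and the table entries below are invented for illustration only; the sketch computes the joint probability of one configuration and checks that the factorized distribution sums to one over all configurations.

```python
from itertools import product

# Hypothetical three-gene network: g0 -> g1, and (g0, g1) -> g2.
PARENTS = {0: (), 1: (0,), 2: (0, 1)}

# CPD tables: P(x_i = 1 | parent values). Values are illustrative only.
CPT = {
    0: {(): 0.3},
    1: {(0,): 0.1, (1,): 0.7},
    2: {(0, 0): 0.05, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9},
}


def joint_probability(x):
    """Equation (14): product of P(x_i | Pa[i]) over all nodes."""
    p = 1.0
    for i, parents in PARENTS.items():
        p_one = CPT[i][tuple(x[j] for j in parents)]
        p *= p_one if x[i] == 1 else 1.0 - p_one
    return p


total = sum(joint_probability(x) for x in product((0, 1), repeat=3))
print(joint_probability((1, 1, 0)), total)
```

A node with an empty parent set, such as g0 here, simply contributes its marginal probability.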
If the set $Pa[i]$ of parent variables of a gene $i$ is empty, then $P(x_i \mid Pa[i])$ reduces to $P(x_i)$. The main representational issues are whether to encode variables as discrete or continuous and whether the conditional probability distributions (CPD) should be parametric or non-parametric. In most existing works on gene network inference, discrete variables with CPD tables (non-parametric encoding) have been chosen for the sake of simplicity. The parameters of CPD tables are especially easy to estimate through maximum likelihood (ML) approaches. There are arguments for and against the discretization of expression levels: on the one hand, the presence of noise means that a high level of accuracy in the encoding is not required; on the other hand, quantization can induce a bias that might hide some causal dependencies.

3.5.2. Probabilistic Relational Models

Probabilistic relational models (PRMs) (44) were proposed in Segal et al. (2001) to include multiple types of information concerning genes, such as experiment type, putative binding sites, or functional information (41). While this kind of modeling is promising, its complexity currently limits its scalability. For this reason, we focus here on two other extensions that attempt to find common factors in the network.
3.5.3. Module Networks

Module networks exploit the idea that in many large domains, variables can be partitioned into sets such that the variables of each set share a similar set of parents (42,1). In gene regulatory network modeling, this assumption reflects the well-accepted idea that some sets of genes are co-regulated by the same inputs in order to coordinate their joint activity. In the Module Network framework, a module $M_j$ is defined by a set of parents $Pa[M_j]$ and a conditional probability template (CPT) $P(M_j \mid Pa[M_j])$ that specifies a distribution over the values of the variables in the module for each assignment of the parents' values. The "parameters" of the model are now the number of modules, the dependency structure over the modules, and the conditional probability template of each module. A non-parametric tool, regression trees, is used to model these conditional distributions, allowing great flexibility in the dependencies. Such modeling has proved relevant when the network involves regulatory cascades with large groups of co-regulated genes, for instance in the yeast cell cycle.

3.5.4. Factor Graph Networks (FGN)

Gat-Viks et al. (2005) proposed to use factor graph models, which Kschischang et al. (2001) first introduced for coding/decoding problems (43,44). Like PRMs, an FGN, illustrated in Fig. 6, makes it possible to incorporate various biological entities, e.g. mRNAs, proteins, metabolites and various stimulators, but it also mixes noisy continuous measurements with discrete regulatory logic. Moreover, an FGN makes it possible, at least in principle, to encode feedforward loops, but this feature is neither detailed nor exploited in the biological application, the study of yeast pathways, described in Gat-Viks et al. (2005) (43). The model can be described as a probabilistic graphical model, defined by the joint distribution over logical ($x$) and sensor ($y$) variables:

$$P(x, y) = \frac{1}{Z} \prod_i \theta_i(x_i, Pa[i])\, \psi_i(x_i, y_i) \tag{15}$$

where $Z$ is a normalization constant.
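As a rough illustration of equation (15), the sketch below builds a two-variable model with made-up potentials and computes the distribution of the logical states given the sensor readings by brute-force enumeration. The potentials `theta` and `psi`, the two-node graph and the numerical values are assumptions for illustration; they are not the potentials used by Gat-Viks et al. (2005). Since only the ratio of products is needed for P(x | y), the constant Z cancels.

```python
from itertools import product

STATES = (0, 1)              # discrete logical states
PARENTS = {0: (), 1: (0,)}   # hypothetical model: x0 regulates x1


def theta(i, xi, pa):
    """Hypothetical regulation potential theta_i(x_i, Pa[i])."""
    if i == 0:
        return 1.0  # no parents: flat potential
    return 0.9 if xi == pa[0] else 0.1  # x1 tends to copy x0


def psi(xi, yi):
    """Hypothetical 'discretizer' potential linking state x_i to reading y_i."""
    centre = 1.0 if xi == 1 else 0.0
    return 1.0 / (1.0 + (yi - centre) ** 2)


def unnormalized(x, y):
    """Product of potentials in equation (15), without the 1/Z factor."""
    w = 1.0
    for i, parents in PARENTS.items():
        w *= theta(i, x[i], tuple(x[j] for j in parents)) * psi(x[i], y[i])
    return w


def posterior_over_states(y):
    """P(x | y) by enumeration of all logical configurations."""
    weights = {x: unnormalized(x, y) for x in product(STATES, repeat=len(y))}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


post = posterior_over_states((0.9, 0.8))
print(max(post, key=post.get))
```

Enumeration is only feasible for a handful of variables; it is used here purely to show how the two kinds of potentials combine.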
The sensor variables $y$ represent the continuous measurements, which are discretized into the logical variables $x$ using the “discretizer” distributions $\psi_i(x_i, y_i)$, while the $\theta_i(x_i, Pa[i])$ encode the regulation function of gene $i$. As the discretization scheme offers too much flexibility, which may lead to overfitting, the authors proposed to control this effect by imposing the same discretization scheme on all variables.
Figure 6. An overview of the factor graph network model. Fig. taken from Gat-Viks et al. (2005) (43).
4. Learning and Optimization

Once a hypothesis space has been defined, the next step in machine learning is to define a cost functional that expresses how suitable a given hypothesis is with respect to the machine learning requirements, together with an optimization algorithm that minimizes this cost functional. In order to simplify notations, a hypothesis will be identified with its parameter vector $\theta$. We can thus state the learning problem as follows: from training data $D$, we want to minimize $J(\theta; D)$ over $\Theta$, the set of all values of the parameter vector $\theta$. In the static case, $D$ is the set of observations corresponding to snapshots of the network state recorded in replicated experiments:

$$D = \{x_1, \ldots, x_n\} \tag{16}$$
Observations in $D$ are supposed to be drawn independently and identically from a fixed but unknown probability distribution. In the dynamical case, $D$ can in principle be a set of $n$-dimensional time series corresponding to independent realizations of the same stationary stochastic process. For the sake of simplicity, learning algorithms will be described using only one time series $D$:

$$D = \{x(1), \ldots, x(T)\} \tag{17}$$
Let us recall that the goal of statistical learning is to ensure that the final hypothesis generalizes well. A well-known way to fulfill this requirement consists in striking a compromise between the accuracy of the model on the training data and its simplicity. However, some works on network inference have set the problem as an exact learning problem or a best-fit problem, leading to a system of equations to be solved without any constraint on the complexity of the model. Examples of these approaches can be found in section 4.1, while section 4.2 is devoted to statistical learning of regulatory networks.

4.1. Exact Learning and Best-Fit Approaches

For instance, the first procedures for the identification of deterministic Boolean networks were purely combinatorial approaches that exhaustively search for Boolean transition functions consistent with the available observations (18). However, the noise usually observed in real data strongly limits this approach. A more recent work concerns the linear model proposed by Bansal et al. (2006) (17). First, they applied a cubic smoothing spline filter with an adjustable parameter to reduce the fluctuations in the data. Second, they increased the number of time points by interpolating the smoothed data using a piecewise cubic interpolation. Third, they solved the linear system of equations in a reduced space defined by the first k principal components of the data. In order to come back to the original input space, the solution found in the k-dimensional space is projected back onto it. Moreover, the authors extract the parameters of the continuous differential equation system from the inferred discrete model. As the solution is approximated due to
the use of the first k principal components of the data, it in fact induces some complexity reduction by means of dimensionality reduction. However, the method does not in principle incorporate a regularization term that penalizes large coefficients.

4.2. Statistical Learning

A well-known means to achieve good generalization properties for a learnt model consists in striking, during learning, a compromise between good accuracy of the model on the training data and low complexity of this model. Ensuring the simplicity of the current hypothesis avoids overfitting the training data. Generally, the cost function takes the following form:

$$J(\theta; D) = J_D(\theta) + \lambda\, \Omega(\theta) \tag{18}$$
where $J_D(\theta)$ is an empirical loss term computed on the training sample and $\Omega(\cdot)$ is generally a regularization operator whose role is to make the hypothesis as smooth or simple as possible. When prior knowledge is available, it is also possible to encapsulate it into the penalization term or to add a specific third term. Whenever learning from gene expression data is underconstrained, prior knowledge is expected to play an important role by reducing the richness of the hypothesis space. For example, known transcription factors and gene functions are complementary information that can be used for that purpose.

4.2.1. Mean Squared Error and Weight Decay for Neural Networks

Artificial neural networks can be learnt within the supervised framework of function approximation. $J_D(\theta)$ can be written directly as a mean squared error, without taking into account the recursive outputs of the network model:

$$J_D(\theta) = \sum_t \lVert f(x(t); \theta) - x(t+1) \rVert^2 \tag{19}$$
Alternatively, it is possible to take into account the fact that each output at time $t$ depends on all the previous outputs, in order to apply the backpropagation through time algorithm. Moreover, the mean squared error should be penalized by a weight decay term to reduce the number of interactions between genes:

$$\Omega(\theta) = \Omega(w) = \lVert w \rVert^2 \tag{20}$$
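The combination of the squared error (19) with the weight decay term (20) can be sketched as follows. For brevity, the neural network is replaced by a plain linear map f(x) = W x, which is an assumption of this example, as are the toy data, the value of λ and the step size; the optimizer is ordinary gradient descent rather than any method from the cited works.

```python
def predict(W, x):
    """Linear one-step-ahead prediction f(x) = W x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def penalized_cost(W, series, lam):
    """J = sum_t ||W x(t) - x(t+1)||^2 + lam * ||W||^2."""
    loss = 0.0
    for x, x_next in zip(series[:-1], series[1:]):
        loss += sum((p - target) ** 2 for p, target in zip(predict(W, x), x_next))
    penalty = sum(w * w for row in W for w in row)
    return loss + lam * penalty


def gradient_step(W, series, lam, lr):
    """One gradient-descent step on the penalized cost."""
    n = len(W)
    grad = [[2.0 * lam * W[i][j] for j in range(n)] for i in range(n)]
    for x, x_next in zip(series[:-1], series[1:]):
        pred = predict(W, x)
        for i in range(n):
            err = pred[i] - x_next[i]
            for j in range(n):
                grad[i][j] += 2.0 * err * x[j]
    return [[W[i][j] - lr * grad[i][j] for j in range(n)] for i in range(n)]


# Toy time series generated by a hypothetical two-gene linear system.
W_true = [[0.9, 0.0], [0.5, 0.5]]
series = [[1.0, 0.5]]
for _ in range(5):
    series.append(predict(W_true, series[-1]))

W = [[0.0, 0.0], [0.0, 0.0]]
initial_cost = penalized_cost(W, series, lam=0.1)
for _ in range(500):
    W = gradient_step(W, series, lam=0.1, lr=0.05)
final_cost = penalized_cost(W, series, lam=0.1)
print(initial_cost, final_cost)
```

Increasing `lam` shrinks all weights towards zero, which is the effect the weight decay term is meant to have on the inferred interactions.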
The estimation of the weights implicitly provides the structure, which can be deduced from the weight matrix by discretization. Mjolsness et al. have employed evolutionary algorithms to search the space of hypotheses (46).

4.2.2. Maximum A Posteriori Approaches for Learning Parameters of Bayesian Networks

When only parameters are considered for learning BN and DBN, Maximum A Posteriori (MAP) approaches provide a principled way to achieve statistical learning. Let us first notice that a MAP approach amounts to maximizing the data likelihood penalized by a complexity term: for continuous temporal variables under a Gaussian noise assumption, this reduces to the penalized squared loss described in subsection 4.2.1. More generally, these methods concern two cases:

• learning of the parameters of the conditional probabilities in BN and DBN when the structure of the network is given (47);
• learning of all the linear models for which the network structure is embedded into the coefficients of the transition matrix (27).
In the second case, learning the parameters also provides the structure, and thus the NP-hard problem of structure learning is avoided. Of course, there is no free lunch here: as with ANN, a procedure for extracting regulations needs to be defined in order to recover the final structure. Before introducing the Maximum A Posteriori approach, we present how the parameters of a BN can be learned using a maximum log-likelihood
approach. The idea is to maximize the probability of observing the training data given the parameters $\theta$. Given the training data $D$, the term to minimize is:

$$J_D(\theta) = -\ln P(D \mid \theta) \tag{21}$$
It is easy to show that when the BN is a linear dynamic model with no hidden variables and the noise is Gaussian, minimizing this term is equivalent to minimizing the mean squared loss used in the approximation framework. To obtain the desired generalization property, the log-likelihood term needs to be balanced by a penalization term. This can be done in the Bayesian learning framework. Adopting a Bayesian point of view, the model parameters are considered as random variables, just like the variables under study. The parameters $\theta$ have a prior distribution, and this prior is turned into a posterior distribution once the training data $D$ have been observed:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)} \tag{22}$$
Maximum A Posteriori approaches are the simplest Bayesian way to incorporate prior knowledge or a simplicity constraint on the model to be learnt. As $P(D)$ does not depend on the parameters to be determined, the new loss function is:

$$J(\theta; D) = -\ln [P(D \mid \theta)\, P(\theta)] = -\ln P(D \mid \theta) - \ln P(\theta) \tag{23}$$
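A small numerical check of equation (23) is sketched below for a one-gene model x(t+1) = w x(t) + noise. The scalar model, the toy series, the noise and prior variances and the grid search are all assumptions made for illustration. With Gaussian noise and a zero-mean Gaussian prior on w, the minimizer of (23) coincides with a ridge-regression estimate whose penalty weight is the ratio of the two variances, which the code compares against the grid minimizer.

```python
import math


def neg_log_likelihood(w, series, noise_var):
    """-ln P(D | w) for x(t+1) = w x(t) + Gaussian noise of variance noise_var."""
    nll = 0.0
    for x, x_next in zip(series[:-1], series[1:]):
        nll += 0.5 * math.log(2.0 * math.pi * noise_var)
        nll += (x_next - w * x) ** 2 / (2.0 * noise_var)
    return nll


def neg_log_prior(w, prior_var):
    """-ln P(w) for a zero-mean Gaussian prior of variance prior_var."""
    return 0.5 * math.log(2.0 * math.pi * prior_var) + w * w / (2.0 * prior_var)


def map_objective(w, series, noise_var, prior_var):
    """Equation (23): J = -ln P(D | w) - ln P(w)."""
    return neg_log_likelihood(w, series, noise_var) + neg_log_prior(w, prior_var)


series = [1.0, 0.8, 0.7, 0.5, 0.45, 0.3]  # toy expression profile
noise_var, prior_var = 0.01, 0.25

# Grid search over w in [-2, 2] with step 0.001.
grid = [k / 1000.0 for k in range(-2000, 2001)]
w_map = min(grid, key=lambda w: map_objective(w, series, noise_var, prior_var))

# Closed form of the same minimizer (ridge with lambda = noise_var / prior_var).
sxy = sum(a * b for a, b in zip(series[:-1], series[1:]))
sxx = sum(a * a for a in series[:-1])
w_ridge = sxy / (sxx + noise_var / prior_var)
print(w_map, w_ridge)
```

The additive constants in both terms do not depend on w, which is why they can be ignored when only the location of the minimum matters.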
For linear DBN, a first simple choice for $\Omega(\theta) = -\ln P(\theta)$ is to enforce the parsimony of the coefficients of the transition matrix, which encodes the presence of regulators (Perrin et al. 2003): for instance, a Gaussian prior with mean 0 and covariance $\sigma^2 I$, with $I$ the identity matrix. In the simplest case with no hidden state space, we eventually have:

$$\Omega(w) = \lVert w \rVert_k \tag{24}$$
where $k$ denotes the order of the $L_k$ norm: $k$ usually equals 2, which corresponds to a Gaussian prior, but the $L_1$ norm has proved more efficient in several contexts at driving some weights to 0. Another kind of prior could be used here if some regulators are known, such as the informative priors used in full Bayesian approaches (40). When all the variables are observable, the maximum likelihood principle can be easily implemented. When some of the variables are not observable, as in state-space models, the Expectation-Maximization (EM) algorithm is appropriate. EM alternates between computing the expected values of the unobserved variables conditionally on the observed data and maximizing the complete likelihood (or posterior), assuming that the previously computed expected values are correct. Introduced by Dempster et al. (1977), this estimation method is guaranteed to converge towards a local maximum. In the case of linear dynamical systems, the EM equations correspond to the Kalman filter and smoother equations.

4.2.3. Structure Learning

In the general case, the structure is unknown and must be explicitly discovered. The problem of structure learning in static BN, i.e. learning a DAG that best explains the data, is known to be NP-hard, and the number of DAGs on $n$ variables is super-exponential in $n$ (8). Learning the structure of a DBN does not require the acyclicity assumption but still remains NP-hard. An algorithm for structure inference generally includes two components: a scoring function and a search strategy. A Bayesian score based on the posterior probability of a graph structure $G$ given the data $D$ is usually defined to measure the adequacy between the model and the data. For both BN and DBN, it takes the following form:

$$\mathrm{Score}(G; D) = \ln P(D \mid G) + \ln P(G) \tag{25}$$
This score is the log posterior probability of the network structure $G$ given the data, up to the term $\ln P(D)$, which has been dropped as in the MAP approach.
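To give a feeling for the size of the search space over which such a score must be optimized, the number of labeled DAGs on n nodes can be computed with Robinson's recurrence. The recurrence is a standard combinatorial result and is not taken from the chapter; the function name `count_dags` is ours.

```python
from math import comb


def count_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    a = [1]  # a[0] = 1: the empty graph
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            # k nodes without incoming edges, attached freely to the other m - k.
            total += (-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
        a.append(total)
    return a[n]


for n in range(1, 8):
    print(n, count_dags(n))
```

Already for five genes there are 29281 candidate structures, which is why scoring every DAG is only possible for toy problems.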
In a full Bayesian approach, the computation of the log-likelihood $\ln P(D \mid G)$ requires marginalizing over the distribution of possible parameters $\theta$, which is nevertheless analytically tractable when variables are discrete (40). Priors on the network structure $G$ can either be related to the size of the structure or, more locally, introduce information about the presence of edges. For instance, Bernard and Hartemink (2005) propose to encode into a prior the existence of a regulation by a known transcription factor (40). As an exhaustive search is impossible, most authors simply use heuristic search algorithms that apply incremental changes in order to improve the score of the structure. Because the joint probability factorizes as a product over the variables, the resulting closed-form expression of the log marginal likelihood is a sum of terms, each of which corresponds to one variable. Consequently, computing the new score after a local change of the network structure (adding or deleting a single edge) remains cheap. However, in the case of static BN, the DAG structure needs to be checked at each step to avoid introducing a cycle. Heuristic search algorithms such as simulated annealing or Markov chain Monte Carlo can be used, but they do not always avoid getting trapped in local optima. Evolutionary algorithms have also been developed for structure learning.

5. Validation

5.1. Introduction to Validation

Let us briefly recall that the primary goal of network inference is to produce relevant hypotheses about unknown regulatory networks. Within this scope, validation can essentially take two forms. First, the results of the learning algorithm have to be assessed by some quality measure in order to inform the biologist about the success or failure of the inference process.
Second, if the learnt model is considered relevant according to the criteria of statistical machine learning, it is mandatory to confront the proposed model with new sources of knowledge or data in order to confirm or refute each piece of inferred information. These two validation steps are the price to pay for a procedure that is useful for the reverse-modeling process. To simplify the discussion, we will call the first kind of validation statistical validation and the second one biological validation. To illustrate the usefulness of the first point (statistical validation), we can describe at least two cases in which the hypothesis resulting from learning can be considered at least partially unsuccessful:

• Case 1: the learning procedure itself provides confidence scores associated with each parameter (including structural ones), and these scores are (all) very low.
• Case 2: more generally, the learnt model is shown not to be robust to data variance, using for instance bootstrapped samples or likelihood ratio tests on data sampled from the posterior distribution Pr(D|M).
Both cases may happen if the data are especially noisy or inconsistent in some part of the working space. Case 2 can also be encountered if the model class is too rich, leading to overfitting of the training data. The good news, however, is that the classic approach in statistical machine learning incorporates tools for this internal validation, and that many works in network inference rely on these tools. To implement the second validation step, biologists and computer scientists or statisticians face two possibilities: they can design additional experiments targeted at testing the most relevant extracted pathways, or they can screen the literature and mine existing biological databases to test the validity of the hypotheses using independent sources of information. An increasing number of works are now devoted to this kind of modeling loop, which can be hoped to open the door to true discoveries. Alternatively, most authors evaluate their inference tool by exploiting public data related to known networks in order to show the quality of their approach, and also present extensive results on artificial data generated from artificial networks, allowing an empirical study of the algorithm's behaviour.
5.2. Statistical Validation of Network Inference

All the machine learning algorithms described in section 4 produce a final hypothesis or a set of hypotheses. To complete the reverse-modeling procedure, a process able to reject or accept the proposed model must be defined and applied. Note that the produced hypothesis may be interesting with regard to only some of the variables and yet bring local but valuable information about the network. We present here three methods that contribute to assessing the quality of the learning results:

• model selection via (re-)sampling and hypothesis tests
• prediction on unseen data
• performance evaluation on known networks (simulated or real)
5.2.1. Model Selection via Sampling and Re-sampling Methods

Statistical estimation methods such as the bootstrap offer a convenient means to measure the sensitivity of a quality criterion to data variance, especially when the posterior distribution Pr(D|M) is unknown or difficult to estimate. Once a quality criterion is fixed, the bootstrap can be used to estimate the variance of this criterion when the learning sample varies. This procedure makes it possible to measure to what extent the algorithm provides a robust model. The bootstrap is frequently used to select a final hypothesis among a set of candidate hypotheses: in machine learning, this is one of the most general ways to perform model selection, since no information about the underlying distribution is needed. However, in the case of BN, if a learning procedure yields a set of hypotheses {m1,...,mk}, then sampling from the posterior distribution Pr(D|mj) is possible and can be exploited in a likelihood ratio test. This is for instance what Gat-Viks et al. proposed in 2005 in order to choose among several FGN models whose structure was fixed. Currently, very few papers in network inference use such a process; most prefer to evaluate the task as a class prediction task in the toy case where the true network is available.
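The bootstrap procedure described above can be sketched in a few lines. Here the "quality criterion" is simply the Pearson correlation between two simulated genes, standing in for whatever score a real inference method would produce; the synthetic data, the number of resamples and the helper names are assumptions of this example.

```python
import random
import statistics


def pearson(pairs):
    """Pearson correlation between two genes, used as a toy quality criterion."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # degenerate resample: no variation in one gene
    return sxy / (sxx * syy) ** 0.5


def bootstrap_std(data, criterion, n_resamples, rng):
    """Bootstrap estimate of the standard deviation of criterion(data)."""
    values = []
    for _ in range(n_resamples):
        # Draw a sample of the same size with replacement.
        resample = [rng.choice(data) for _ in range(len(data))]
        values.append(criterion(resample))
    return statistics.stdev(values)


rng = random.Random(0)
data = []
for _ in range(30):
    x = rng.gauss(0.0, 1.0)
    data.append((x, 0.8 * x + rng.gauss(0.0, 0.3)))

spread = bootstrap_std(data, pearson, 200, rng)
print(pearson(data), spread)
```

A large spread relative to the value of the criterion would suggest that the corresponding inferred dependency is not robust to variations of the learning sample.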
5.2.2. Prediction on Unseen Data

Some authors have measured the generalization ability of the learnt model. For static data, it can be measured through cross-validation (49,50), while for temporal data sequential cross-validation can be used (27). In probabilistic models, this amounts to evaluating the predictive likelihood of each model. Pe'er et al. have for instance used 5-fold cross-validation to evaluate the performance of MinReg (50). The learning data were split into 5 equal parts and MinReg was run 5 times, thus providing five BN. Each time, only 4/5 of the sample was used as a training set to learn both the structure and the parameters of the network model, while 1/5 of the data was withheld. For each test sample belonging to the mth set, they used the mth inferred model and the expression of the regulators to predict the expression levels of the variables in that sample. When data are sufficiently numerous, this provides a robust means to evaluate the generalization ability of the learning method.

5.2.3. Performance Evaluation on Known Networks (Simulated or Real)

When learning algorithms are applied to known networks (real or simulated), it is possible to evaluate the results of the learning process in terms of class prediction. The existence of a regulation from one gene (regulator) to another gene (regulee) is used as the class to be predicted. As in information retrieval or in medical diagnosis, the quality of a reverse-engineering system can be evaluated by counting the true positive regulations (TP), the false positive regulations (FP), the true negative regulations (TN) and the false negative regulations (FN) for this class (39,27,51). Let us define the sensitivity and the specificity as:

sensitivity = recall = TP/(TP+FN)
specificity = TN/(TN+FP)

Moreover, in each learning algorithm there usually exists a way to tune the prediction with some parameter γ. When γ varies in some relevant interval, the trade-off between sensitivity(γ) and specificity(γ) can be
represented by a curve, called the ROC curve, which plots the true positive rate against the false positive rate (43).

5.3. Biological Validation

Network inference aims at helping biologists to discover new pathways and refine existing ones. Some research teams have tried to check the information conveyed by the inferred networks against additional sources. As an example, we focus on the work of Segal et al. in 2003: they applied their module network approach to yeast cell cycle data (11), starting from a set of candidate regulators containing both putative and known transcription factors. First, biological validation consisted of scoring the functional coherence of each module based on its genes covered by annotations significantly enriched in the module (p-value < 0.01). Second, they tested whether their method correctly predicted the targets of each regulator by analyzing the distribution of differentially expressed genes in the modules. Third, they tried to identify the process regulated by each regulator and thus corresponding to the modules. They selected three hypotheses suggested by the method, involving uncharacterized putative regulators, and processed the relevant yeast deletion strains. Altogether, two of the three regulators were confirmed by the various additional sources of information, including the new experiments. While most attempts at unraveling gene regulatory networks have focused on yeast, the methods have begun to be applied to higher eukaryotes such as mouse or human (43,1). In Segal et al. (2004), an original analysis of the conditional activity of expression modules in cancer was carried out by the authors of the module network approach (1). The underlying idea is to link modules of co-regulated genes resulting from module network inference to clinical conditions including tissue and tumor type, diagnostic and prognostic information, and molecular markers. Margolin et al.
(2006) used the system ARACNe for the reconstruction of regulatory networks from expression profiles of human B cells (52). Their results suggested a hierarchical, scale-free network,
where a few highly interconnected genes (hubs) account for most of the interactions. Confrontation of the inferred network with available data led to the identification of MYC as an important hub, which controls known target genes as well as new ones, which were biochemically validated. The newly identified MYC targets include some major hubs. This approach can be generally useful for the analysis of normal and pathologic networks in mammalian cells.

6. Conclusion and Perspectives

Recent years have witnessed an explosion of machine learning methods for the reverse-modeling of biological regulatory networks from genomic data. This review attempted to highlight some of the methodological advantages brought by machine learning as well as the difficult technical points yet to be solved. We put particular emphasis on the following points:

• the limited amount of available transcriptomics data relative to the dimension of the genomes
• the presence of noise in the observations as well as in the biological processes
• the interlacing roles of several variables, some of which are unobserved
• the non-linearity of the dynamic regulatory processes
• the necessary trade-off between model simplicity and biological relevance.
Nevertheless, a first generation of tools has demonstrated, on problems of limited complexity and on biological case studies, that network inference is possible but requires prior knowledge integration and dimension reduction to be scalable. Regarding these last points, probabilistic graphical models, from linear factor analysis to more complex Bayesian networks, have appeared as a promising framework, supporting the management of uncertainty, the estimation of hidden variables and modular approaches. From the literature, it appears that the most interesting machine learning works are those which have been
carefully validated statistically and biologically. This suggests that, in an ideal setting, computer scientists and statisticians should be involved in the discovery process from the beginning, i.e. from the experiment design to the biological validation of the discovered hypotheses. Regarding this modeling and discovery process, there is a large open field for various learning applications, from complex data analysis to active learning. We therefore think that the use of machine learning in the processes of discovery in systems biology is only in its infancy. For this approach to be fully successful, several issues must be addressed. First, we should exploit the wide knowledge stored in existing databases in order to be able to confront these sources with experimental data from the lab. Second, if network inference is to take place at the scale of a whole genome, the detailed models must be replaced by hierarchical and modular approaches that reduce the dimension of the search space. Third, the identification of biological dynamical systems should benefit from the results obtained in dynamical systems theory. Fourth, integrative views that combine, for instance, genetic networks, protein-protein interaction networks and metabolic networks have not yet been worked out and must be considered. Finally, active learning combined with design of experiments (DOE) should be explored (53).
References
1. Segal, E., Friedman, N., Koller, D. and Regev, A. (2004). A module map showing conditional activity of expression modules in cancer. Nature Genetics. 36, 1090-1098.
2. Tyson, J.J., Hong, C.I., Thron, C.D. and Novak, B. (1999). A Simple Model of Circadian Rhythms Based on Dimerization and Proteolysis of PER and TIM. Biophys J. 77, 2411-2417.
3. Gonze, D., Halloy, J. and Goldbeter, A. (2002). Deterministic versus Stochastic Models for Circadian Rhythms. J Biol Phys. 28, 637-653.
4. Kauffman, S.A. (1993). The origins of order: self-organization and selection in evolution. Oxford University Press, New York.
5. Thomas, R. (1999). Deterministic chaos seen in terms of feedback circuits: analysis, synthesis, "labyrinth chaos". Int J Bifurcation and Chaos. 9, 1889-1905.
6. De Jong, H. (2002). Modeling and Simulation of Genetic Regulatory Systems: a literature review. Journal of Computational Biology. 9, 67-103.
7. Richard, A., Comet, J.P. and Bernot, G. (2005). R. Thomas' modeling of biological regulatory networks: introduction of singular states in the qualitative dynamics. Fundamenta Informaticae. 65, 373-392.
8. Chickering, D.M., Heckerman, D. and Meek, C. (2004). Large-Sample Learning of Bayesian Networks is NP-Hard. J. Machine Learning Research. 5, 1287-1330.
9. Michalski, R.S., Carbonell, J.G. and Mitchell, T.M. (1984). Machine learning: an artificial intelligence approach. Springer-Verlag, Berlin, Heidelberg, New York.
10. Vapnik, V. (1999). The nature of statistical learning theory. Springer Verlag.
11. Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D. and Futcher, B. (1998). Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Molecular Biology of the Cell. 9(12), 3273-3297.
12. Gasch, A.P., Huang, M., Metzner, S., Botstein, D., Elledge, S.J. and Brown, P.O. (2001). Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol Biol Cell. 12, 2987-3003.
13. Sontag, E.D. (2002). For differential equations with r parameters, 2r+1 experiments are enough for identification. J Nonlinear Sci. 12, 553-583.
14. Anthony, M. (2003). Data Classification by Multi-threshold Functions. Proceedings of the Workshop on Discrete Mathematics and Data Mining, 3rd SIAM International Conference on Data Mining, San Francisco.
15. Bay, S.D., Chrisman, L. and Pohorille, A. (2004). Temporal Aggregation Bias and Inference of Causal Regulatory Networks. Journal of Computational Biology. 11, 971-985.
16. Granger, C.W.J. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica. 37, 424-438.
17. Bansal, M., Della Gatta, G. and di Bernardo, D. (2006). Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics. 22, 815-822.
18. Akutsu, T., Miyano, S. and Kuhara, S. (1999). Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pacific Symposium on Biocomputing. 4, 17-28.
19. Silvescu, A. and Honavar, V. (1997). Temporal boolean network models of genetic networks and their inference from gene expression time series. Complex Systems. 11, 1-18.
20. D’Haeseleer, P., Liang, S. and Somogyi, R. (2000). Genetic Network Inference: from co-expression clustering to reverse-engineering. Bioinformatics. 16, 707-726.
21. Hirsch, M.W., Devaney, R.L. and Smale, S. (2004). Differential Equations, Dynamical Systems, and an Introduction to Chaos. Elsevier, Academic Press.
22. McAdams, H.H. and Arkin, A. (1997). Stochastic mechanisms in gene expression. Proc Natl Acad Sci U.S.A. 94, 814-819.
23. Raser, J.M. and O'Shea, E.K. (2005). Noise in Gene Expression: Origins, Consequences, and Control. Science. 309, 2010-2013.
24. Lähdesmäki, H., Hautaniemi, S., Shmulevich, I. and Yli-Harja, O. (2006). Relationships between probabilistic Boolean networks and dynamic Bayesian networks as models of gene regulatory networks. Signal Processing. 86, 814-834.
25. de Hoon, M.J.L., Imoto, S., Kobayashi, K., Ogasawara, N. and Miyano, S. (2003). Inferring Gene Regulatory Networks from Time-Ordered Gene Expression Data of Bacillus Subtilis Using Differential Equations. Pacific Symposium on Biocomputing. 2003, 17-28.
26. Chen, T., He, H.L. and Church, G.M. (1999). Modeling Gene Expression with Differential Equations. Proceedings of Pacific Symposium on Biocomputing. 1999, 29-40.
27. Perrin, B.E., Lahaye, P.J., Ralaivola, L., Mazurie, A., Bottani, S., Mallet, J. and d’Alché-Buc, F. (2003). Inference of gene regulatory network with Dynamic Bayesian Network. Bioinformatics. 19, i38-i49.
28. d’Alché-Buc, F., Lahaye, P.J., Perrin, B.E., Ralaivola, L., Vujasinovic, T., Mazurie, A. and Bottani, S. (2005). Dynamic model of gene regulatory networks based on inertia principle. Bioinformatics Using Computational Intelligence Paradigms, Series: Studies in Fuzziness and Soft Computing, Vol. 176. Seiffert, U., Jain, L.C. and Schweizer, P. (eds.), Springer. 93-117.
29. Quach, H.M., Geurts, P. and d’Alché-Buc, F. (2006). Elucidating the structure of genetic regulatory networks. Proc of ESANN, M. Verleysen, Bruges 2006, 26-28, 569-574.
30. Rangel, C., Angus, J., Ghahramani, Z., Lioumi, M., Sotheran, E., Gaiba, A., Wild, D.L. and Falciani, F. (2004). Modelling T-cell activation using gene expression profiling and state space models. Bioinformatics. 20, 1361-1372.
31. Beal, M.J., Falciani, F., Ghahramani, Z., Rangel, C. and Wild, D. (2005). A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics. 21, 349-356.
32. Kim, S., Imoto, S. and Miyano, S. (2003). Dynamic Bayesian network and nonparametric regression for nonlinear modeling of gene networks from time series gene expression data. Proc. of CMSB. 2003, 104-113.
33. Arkin, A., Ross, J. and McAdams, H.H. (1998). Stochastic kinetic analysis of developmental pathway bifurcation in phage λ-infected Escherichia coli cells. Genetics. 149, 1633-1648.
34. Zak, D., Doyle, F. and Schwaber, J. (2002). Local identifiability: when can genetic networks be inferred from microarray data? Proceedings of the Third International Conference on Systems Biology, December 12–14, Stockholm, Sweden, Karolinska Institute. 236–237.
35. Nachman, I., Regev, A. and Friedman, N. (2004). Inferring quantitative models of regulatory networks from expression data. Bioinformatics. 20, 248-256.
36. Friedman, N., Linial, M., Nachman, I. and Pe'er, D. (2000). Using Bayesian Network to Analyze Expression Data. J Computational Biology. 7, 601-620.
37. Hartemink, A., Gifford, D., Jaakkola, T. and Young, R. (2002). Bayesian methods for elucidating genetic regulatory networks. IEEE Intelligent Systems, special issue on Intelligent Systems in Biology. 17, 37-43.
38. Friedman, N. (2004). Inferring Cellular Networks Using Probabilistic Graphical Models. Science. 303, 799-805.
39. Husmeier, D. (2003). Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics. 19, 2271-2282.
40. Bernard, A. and Hartemink, A. (2005). Informative Structure Priors: Joint Learning of Dynamic Regulatory Networks from Multiple Types of Data. Pacific Symposium on Biocomputing 2005 (PSB’05), Altman, R., Dunker, A.K., Hunter, L., Jung, T. and Klein, T. (eds). World Scientific, New Jersey. 459–470.
41. Segal, E., Taskar, B., Gasch, A., Friedman, N. and Koller, D. (2001). Rich probabilistic models for gene expression. Bioinformatics. 17, S243-52.
42. Segal, E., Pe'er, D., Regev, A., Koller, D. and Friedman, N. (2003). Learning module networks. Proceedings of the 19th Conference in Uncertainty in Artificial Intelligence, August 7-10 2003, Acapulco, Mexico, Christopher Meek and Uffe Kjaerulff (eds), Morgan Kaufmann. 523-524.
43. Gat-Viks, I., Tanay, A., Raijman, D. and Shamir, R. (2005). The factor graph network model for biological systems. RECOMB 2005, Lecture Notes in Bioinformatics 3500, Springer, Berlin. 31647, 31-45.
44. Kschischang, F.R., Frey, B.J. and Loeliger, H.A. (2001). Factor Graphs and the Sum-Product Algorithm. Trans. IEEE on Information Theory. 47, 498-519.
45. Koller, D. and Pfeffer, A. (1998). Probabilistic Frame-Based Systems. Proceedings of the Fifteenth National Conference on Artificial Intelligence and Tenth Innovative Applications of Artificial Intelligence Conference, AAAI 98, IAAI 98, July 26-30, 1998, Madison, Wisconsin, USA. AAAI Press / The MIT Press. 580-587.
46. Mjolsness, E., Mann, T., Castaño, R. and Wold, B. (2000). From Coexpression to Coregulation: An Approach to Inferring Transcriptional Regulation among Gene Classes from Large-Scale Expression Data. Advances in Neural Information Processing Systems 12, Solla, S., Leen, T.K. and Mueller, K.-R. (eds). 928-936.
47. Murphy, K. and Mian, S. (1999). Modelling gene expression data using dynamic Bayesian networks. Technical report, Computer Science Division, University of California, Berkeley, CA.
48. Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B. 39, 1-38.
49. Girolami, M. and Breitling, R. (2004). Biologically valid linear factor models of gene expression. Bioinformatics. 20, 3021-3033.
50. Pe'er, D., Regev, A. and Tanay, A. (2006). Minreg: A Scalable Algorithm for Learning Parsimonious Regulatory Networks in Yeast and Mammals. Journal of Machine Learning Research. 7, 167-189.
51. Yu, J., Smith, V., Wang, P., Hartemink, A. and Jarvis, E. (2004). Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics. 20, 3594-3603.
52. Margolin, A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Dalla Favera, R. and Califano, A. (2006). ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 7 (Suppl 1), S7.
53. King, R.D., Whelan, K.E., Jones, F.M., Reiser, P.G.K., Bryant, C.H., Muggleton, S.H., Kell, D.B. and Oliver, S.G. (2004). Functional genomic hypothesis generation and experimentation by a robot scientist. Nature. 427, 247-252.
CHAPTER 4 TRANSCRIPTIONAL NETWORKS
François Képès Epigenomics Project, Genopole®, CNRS, University of Évry, France
1. Introduction
Among the many molecular interactions that together participate in the cellular dynamics in a spatially and temporally ordered fashion, this Chapter focuses on those that regulate gene expression. Transcription of DNA into RNA is the first, and often the most regulated, step in gene expression. As most genes are regulated at the transcriptional level, and 5-10% of protein-encoding genes encode regulatory proteins, the interactions between regulated genes and regulatory proteins constitute a complex web called the transcriptional network or genetic network. This interaction network is currently among the most accessible through genome-wide experiments. Large-scale monitoring of genetic expression is inspired by the premise that the functional state of an organism is largely determined by the expression status of its genes. As a first approximation, the latter may be described as the quantity of each RNA molecule transcribed from a gene that is present in the cell at a given moment. Over time, these quantities evolve as a function of interactions between regulatory proteins and regulated genes. This Chapter will review the state of the art in describing and analyzing transcriptional networks. This analysis will be anchored both in real-world biology and in complex systems approaches. Sections 2 and
3 will define the interaction partners and their mode of interaction, respectively. Section 4 will specify the bench and computer methods to obtain a chart of transcriptional interactions. Section 5 will focus on the formal representations of the chart of transcriptional interactions. Sections 6 and 7 will address the global and local topologies, respectively. Section 8 will jointly discuss the modularity of these networks and their dynamics, as these two aspects are tightly linked in practice. Section 9 will highlight how the network topology may unfold in the geometry of the cell. The concluding Section will sketch future directions.
2. Interacting Partners
The two molecular partners of transcriptional regulation are regulatory proteins and DNA regulatory regions.
2.1. Genes and DNA Regulatory Regions
According to one definition, a gene is an abstract entity that carries two types of information: the first is the code for sequential assembly of a macromolecule; the second specifies the quantity of that macromolecule to be synthesized. The physical support of genes is always a nucleic acid polymer, usually DNA. The DNA coding region bears the first piece of information and the regulatory region bears the second. If the coded macromolecule is a protein, the DNA coding zone is the open reading frame (ORF), and the DNA regulatory region is usually (often only, in microorganisms) located upstream from the coding region (see Gene A, on the left of Fig. 1).
2.2. Regulatory Proteins or Dedicated Transcription Factors
When a regulatory protein binds to a DNA regulatory region, it modulates the expression of one or more genes. These genes are either grouped together and share the same DNA regulatory region, to which the regulatory protein binds (bacterial operons, for example), or are
dispersed along various points of the DNA molecule, where copies of the regulatory protein can bind. What are the limits of the definition of a regulatory protein? We are interested here in proteins that regulate a subset of the genes of an organism. Other proteins, also called regulatory, play a more general role in initiating transcription (for example the eukaryotic transcription factor of type II). In principle, these generalists, like RNA polymerase itself, are essential for the transcription of all genes that encode proteins. However, since their action is non-specific, they are not covered in this Section. Nevertheless, the boundaries between “generalist” and “dedicated” regulatory proteins have become blurred. Indeed, the new, post-genomic ChIP-Chip technique (see Section 4) may be used to locate regulatory proteins at their DNA binding sites. This technique is sometimes used in combination with specific inactivation of each regulatory gene. Studies in yeast have shown that certain dedicated regulatory proteins affect up to 5% of the genes of an organism, whereas certain “generalists” affect only 3% (1). Therefore, an alternative view would be that both types of regulatory proteins cover a quasi-continuous range of influences. According to current knowledge, the threshold between generalist and dedicated regulatory proteins may be empirically established around 3 to 5% of all genes. Dedicated regulatory proteins, which are discussed here, are defined as being below this threshold. They will be called "transcription factors" in the following Sections.
3. Mode of Interaction Between Transcription Factors and DNA Regulatory Regions
Most cellular processes are rooted in dynamic interactions among a great number of biological molecules that both implement and undergo regulation. Biological function emerges from a network of interactions among active macromolecules and small molecules, and only exceptionally from a single macromolecule.
In other words, the network generally constitutes a filter between the isolated macromolecule and the function, that blurs direct causality. Besides, global feedback is essential. The state of the molecular interaction network feeds back onto the RNA state (affecting alternative splicing, for example). Likewise, the state of
the network feeds back onto the DNA state (affecting the pattern of active and inactive genes, for example). Distinctions are classically made among interactions between proteins (see Chapter 5), between an enzyme protein and its substrate (see Chapter 6), and between a transcription factor and its DNA binding site (this Chapter). Other types of interactions are often ignored due to lack of genome-wide data, although they are physiologically relevant. As the transcriptional network is analysed separately here, it is important to recall that it acts in tight relation with the other networks in the cell (see Chapter 7). A good example of this fact may be found in the description of the eukaryotic cell cycle as a color-coded mixed network (2). Moreover, even transcriptional regulation in the strictest sense relies mechanistically on the interaction of the transcription factor not only with its DNA targets, but also with other transcription factors and with the RNA polymerase complex, which in eukaryotes harbors the "generalist" factors mentioned above. Therefore, transcriptional regulation itself often incorporates protein-protein interactions in addition to the protein-DNA ones.
3.1. Genetic Interactions
Interactions between regulatory and regulated genes are customarily referred to as “genetic interactions”. However, this terminology poses a problem, since there is a fundamental asymmetry between the regulated gene and the regulatory gene. Regulatory genes exercise their effects via their products, which are transcription factors. This implies that a graph of genetic interactions is naturally of the directed type. Bearing this fact in mind, we will continue to use the customary terminology.
3.2. Genetic Interaction Map
At the qualitative level, the transcription factor, the product of a regulatory gene, can activate (positive effect) or inhibit (negative effect) a regulated gene target.
A dual effect is also sometimes observed, which may be either positive or negative, according to the circumstances. “Regulated” genes or target genes bear regulatory regions. Regulatory
genes may themselves be regulated, and target genes may themselves be regulatory, in which case they participate in a genetic regulatory pathway or cascade. If such a regulatory pathway is closed onto itself, it forms a feedback circuit. Since the incoming and outgoing connections of a gene can be multiple, the circuits and pathways are sometimes linked in a fully connected component of variable size. Some of these situations are represented in Fig. 1, including the case of a gene that self-regulates, which is otherwise said to form a unary feedback circuit. Fig. 2 represents around 10% of the yeast transcriptional network. The whole dataset comprises about 900 interactions among 500 genes and gathers the results of classical biochemical and genetic analyses from about 700 articles (3). It is interesting to note that today, thousands of interactions among thousands of genes may be reported in a single article (4).
Figure 1. Interactions among genes. DNA, the physical support of all genes, is represented as bipartite fragments: the regulatory region (RR) on the left and the coding region (ORF, or open reading frame) on the right. Gene A is neither regulatory nor regulated. Gene B is not regulated, but its product is a transcription factor which activates transcription of gene C and inhibits transcription of gene D upon binding to their regulatory regions. Gene D is regulated but not regulatory. The product of gene C inhibits transcription of gene E and of itself. In so doing, gene C constitutes a feedback circuit, or unary self-regulating loop. Gene C also participates in a genetic regulatory pathway or cascade from gene B to gene E, since it is both regulatory and regulated. Here, B and C are regulatory genes, and C, D, and E are regulated genes.
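The situations listed in the Figure 1 caption can be written down as a small signed, directed graph. The sketch below is only an illustration: the edge list is read off the caption, while the function names (`classify`, `unary_circuits`, `cascades`) and the +1/-1 sign encoding are assumptions introduced here, not notation from the chapter.

```python
# Signed, directed edges read off the Figure 1 caption:
# (regulatory gene, regulated gene, sign); +1 = activation, -1 = inhibition.
# The encoding and all function names are illustrative assumptions.
GENES = ["A", "B", "C", "D", "E"]
EDGES = [
    ("B", "C", +1),  # product of B activates C
    ("B", "D", -1),  # product of B inhibits D
    ("C", "E", -1),  # product of C inhibits E
    ("C", "C", -1),  # product of C inhibits C itself
]


def classify(genes, edges):
    """Tell, for each gene, whether it is regulatory and/or regulated."""
    regulators = {src for src, _, _ in edges}
    targets = {dst for _, dst, _ in edges}
    return {g: {"regulatory": g in regulators, "regulated": g in targets}
            for g in genes}


def unary_circuits(edges):
    """Genes that regulate themselves (unary feedback circuits)."""
    return sorted({src for src, dst, _ in edges if src == dst})


def cascades(edges, start, end):
    """All simple directed paths from start to end, ignoring self-loops."""
    successors = {}
    for src, dst, _ in edges:
        if src != dst:
            successors.setdefault(src, []).append(dst)
    paths = []

    def walk(node, path):
        if node == end:
            paths.append(path)
            return
        for nxt in successors.get(node, []):
            if nxt not in path:  # keep paths simple (no repeated gene)
                walk(nxt, path + [nxt])

    walk(start, [start])
    return paths


if __name__ == "__main__":
    print(classify(GENES, EDGES))
    print(unary_circuits(EDGES))
    print(cascades(EDGES, "B", "E"))
```

Because each edge points from the regulatory gene to the regulated gene, the asymmetry discussed in Section 3.1 is built into the data structure itself.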
Figure 2. Partial map of transcriptional interactions in the baker's yeast Saccharomyces cerevisiae. For clarity, arrows are not drawn, but transcriptional influences go from left to right, as indicated by the thick arrow at the top. One-quarter of the 500 studied genes are regulatory, 52 of which are linked to at least one other regulatory gene. These 52 regulatory genes are represented here in a graph that displays causal relationships. In order to reflect the causal flow, genes that lack a known regulator (except for self-regulation) are listed in the left-hand column. Each of the other genes is then placed in the leftmost column such that all its regulators lie to its left. There are five columns in this map, and the sixth one, which comprises only non-regulatory genes (encoding, e.g., structural or catalytic proteins), is consequently not represented. Self-activation is indicated in bold, self-inhibition in bold italics, activation in thick lines, inhibition in thin lines, dual regulation (one case) with a dashed gray line, and essential genes (whose knock-out is lethal) are boxed. Adapted from ref. 3.
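The column placement rule stated in the Figure 2 caption can be sketched as a simple layering procedure. This is a minimal sketch under stated assumptions: the function name `assign_columns` and the toy gene names in the usage example are hypothetical, and the procedure assumes that, once self-regulation is set aside, the map contains no feedback circuit (otherwise no left-to-right placement exists and an error is raised).

```python
def assign_columns(genes, edges):
    """Place each gene in the leftmost column lying to the right of all its
    regulators. Column 0 holds genes with no regulator other than themselves.

    `edges` is a list of (regulatory gene, regulated gene) pairs.
    Hypothetical helper, not taken from the chapter.
    """
    regulators = {g: set() for g in genes}
    for src, dst in edges:
        if src != dst:  # self-regulation does not constrain the placement
            regulators[dst].add(src)

    columns = {}
    remaining = set(genes)
    while remaining:
        # Genes whose regulators have all been placed already.
        placeable = [g for g in remaining
                     if all(r in columns for r in regulators[g])]
        if not placeable:
            raise ValueError("feedback circuit among: "
                             + ", ".join(sorted(remaining)))
        newly_placed = {
            g: 1 + max((columns[r] for r in regulators[g]), default=-1)
            for g in placeable
        }
        columns.update(newly_placed)
        remaining -= set(placeable)
    return columns


if __name__ == "__main__":
    toy_genes = ["g1", "g2", "g3", "g4"]
    toy_edges = [("g1", "g2"), ("g1", "g3"), ("g2", "g3"),
                 ("g3", "g3"), ("g4", "g3")]
    print(assign_columns(toy_genes, toy_edges))
```

In the toy example, `g1` and `g4` have no regulator and land in column 0, `g2` in column 1, and `g3`, which is regulated by `g1`, `g2` and `g4` as well as by itself, in column 2.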
4. Methodology
Transcriptomic methods aim at systematically obtaining exhaustive information on the concentration, activity and localization of RNAs, as well as on molecular interactions that regulate gene transcription. Current methods provide RNA concentration (or a ratio of this concentration relative to a standard concentration) and have been applied mostly to one category of RNA called messenger RNA or mRNA. They are progressively being extended to other categories of RNA such as small RNAs
that are involved in regulation of macromolecular biosynthesis. Transcriptional regulation has been systematically addressed only at the level of interactions between protein and DNA, although other types of interaction, such as between protein and small molecules, have long been known to also play a role. While these methods are both high-throughput and massively parallel, the output is not exhaustive, but it is of the same order of magnitude as the exhaustive set. Methods for measuring the quantity of mRNA, as well as for investigating the relationship between regulatory proteins and their gene targets, are briefly examined below. They are more thoroughly described in the book edited by Schena in 2000 (5).
4.1. Complementary DNA Microarrays
Used throughout the world today, microarrays are plates (usually glass) onto which complementary DNA (cDNA) probes are deposited by high-speed robotized printing methods. They are very well adapted to analyzing the expression of up to 10,000 genes derived from sequencing projects (Fig. 3). Measurement is carried out by differential hybridization so as to minimize errors due to variations in cDNA printing. mRNA from two different sources (for example, a drug-treated assay and an untreated control) is usually reverse-transcribed into cDNA and labeled with either of two fluorophores, one for the assay and one for the control. Differential hybridization of the microarray probes is carried out in liquid phase with a mixture of both cDNA preparations, thereby minimizing systematic errors (Fig. 3). After hybridization and washing, each fluorescent signal is independently evaluated and used to calculate the ratio of the concentrations of the experimental and control nucleic acids. This ratio is generally used as the starting point for interpreting the results.
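As a minimal sketch of how such a ratio might be computed per gene from the two fluorescence channels, the snippet below takes the assay-over-control ratio and expresses it on a log2 scale. The log2 transform is a common convention rather than something stated in the text, and the function name, the `floor` cut-off and the intensity values are all hypothetical.

```python
import math


def log2_ratios(assay, control, floor=1.0):
    """Per-gene log2(assay / control) from two fluorescence channels.

    `assay` and `control` map gene names to background-corrected
    intensities. Genes whose signal falls below `floor` in either
    channel are skipped, since their ratio would be dominated by noise.
    The cut-off value is an illustrative assumption.
    """
    ratios = {}
    for gene, a in assay.items():
        c = control[gene]
        if a < floor or c < floor:
            continue
        ratios[gene] = math.log2(a / c)
    return ratios


if __name__ == "__main__":
    # Made-up intensities for three genes.
    assay = {"g1": 800.0, "g2": 100.0, "g3": 0.5}
    control = {"g1": 200.0, "g2": 400.0, "g3": 300.0}
    print(log2_ratios(assay, control))
```

On the log2 scale, a fourfold excess and a fourfold deficit appear as symmetric values (+2 and -2), which is one reason this representation is often preferred over the raw ratio.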
In pioneering efforts, such microarrays have been used to measure the genetic expression levels of the complete baker's yeast — Saccharomyces cerevisiae — genome (around 6,400 distinct cDNA sequences) during various kinds of treatment such as induction of sporulation (6), or throughout the entire cell cycle (7). Such data sets and
Figure 3. A microarray. Complementary DNAs (cDNAs) are prepared from two messenger RNA (mRNA) sources (for example, the treated sample on the right and the untreated sample on the left). They are labeled with two different fluorophores, one red and one green. The two cDNA preparations are then mixed, and the mixture is probed in liquid phase on a microarray bearing gene probes representative of the entire genome, for example, gene "G". A probe for gene "G" consists of cDNA derived from gene "G" that has been attached to the microarray at a given spot. Hybridization between the fluorescent cDNA and the solid-phase cognate cDNA is detected, following washes, by measuring green and red fluorescences on each spot. The ratio of green over red signals indicates whether there is any excess or deficit of a given mRNA in the original assay population compared to the control.
more recent ones are available in the public domain. Guides explaining how to construct devices that deposit cDNA onto the surfaces of microarrays, as well as how to analyze fluorescence levels, may also be found on the Internet. In addition, pre-prepared microarrays are commercially available for an increasing number of organisms, including human, rat, mouse, plant (Arabidopsis), and some bacteria. As already mentioned, they are readily adaptable to a wide range of available cDNA probes corresponding to individual laboratory requirements.
4.2. Oligonucleotide Chips
These chips are produced mainly by Affymetrix®, and consist of small silicon plates to which thousands of short oligonucleotides (20 nucleotides or more) are attached. The oligonucleotides are synthesized directly on the surface of the chip by photolithography and light-controlled chemical synthesis. Due to the combinatorial nature of the process, it is possible to probe simultaneously a very large number of mRNA molecules. Today chips consist of as many as 200,000 different probes, usually including several for each mRNA molecule (Fig. 4). Chips may contain different exons of an intron-containing gene, or some perfect-pairing probes as well as a few mispaired probes with a single nucleotide mismatch (5). However, the preparation and reading of oligonucleotide chips requires expensive equipment, and at present only commercially produced standard chips are available at an accessible price. This does not afford laboratories the latitude to pose specific questions. The future of the microarray/oligonucleotide chip industry appears to lie in the development of synthetic probes that are longer than the earlier versions (but shorter than cDNA sequences) and which are attached after synthesis and verification. A disadvantage of all the preceding methods is their lack of sensitivity, which means that tens of thousands of cells have to be mixed in order to obtain enough material for a microarray experiment. This can be a problem if only a small amount of tissue is available, e.g. from an isolated embryonic tissue. As will be discussed in Section 5.9, this can also be a problem for the interpretation, as the signal is averaged over many cells, and this average may not reflect the behaviour of any single cell in the set.
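The last point, that a population average may not describe any single cell, can be illustrated with a deliberately extreme toy case. The numbers below are invented for illustration only: half of the cells do not express the gene at all, and the other half express it strongly.

```python
from statistics import mean

# Hypothetical mRNA counts for one gene in eight individual cells:
# four cells are "off" and four are strongly "on".
cells = [0, 0, 0, 0, 100, 100, 100, 100]

# What a measurement on the pooled material would report.
population_average = mean(cells)

# Distance between the pooled value and the closest individual cell.
closest_cell_gap = min(abs(c - population_average) for c in cells)

if __name__ == "__main__":
    print(population_average, closest_cell_gap)
```

Here the pooled value sits midway between the two subpopulations, so no individual cell is anywhere near it. Real populations are rarely this clean, but the same effect arises whenever expression is heterogeneous across cells.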
Figure 4. Portion of an oligonucleotide chip. Each fluorescent patch corresponds to a specific gene probe.
4.3. Normalization of cDNA Microarray Experiments
There are many sources of variation in cDNA microarray experiments, which affect the measured gene expression levels. For instance, the response dynamics are diminished by auto-absorption of fluorescence, steric hindrance reducing access to the probe, etc. There are also differences in labeling and detection efficiencies between the two fluorescent dyes, and often differences in the amount of starting material. These problems are exacerbated by the cost of microarrays, which limits the number of hybridization time points in the kinetics, as each time point corresponds to the consumption of one chip. They impose limits on the quantity of useful information that can be obtained from massively parallel measurements. The term "normalization" refers to the process of removing such systematic variation by a set of transformations applied to expression data. Normalization usually consists first in balancing the individual hybridization intensities in order to make meaningful comparisons. Better results are obtained when normalization is made to depend on signal intensity rather than being a global uniform process (8). Replication is important for identifying and reducing variation. Biological replicates use samples independently derived from distinct biological sources. They reveal biological variability and random variation in sample preparation. Technical replicates involve replicated elements within an array, replicate hybridizations, or multiple independent elements for a given gene within an array. They reveal the natural and systematic variability that occurs in performing the assay (9,10).
4.4. Reverse Transcription–Polymerase Chain Reaction (RT-PCR)
In order to evaluate gene expression using RT-PCR (reverse transcription-polymerase chain reaction), mRNA is first reverse-transcribed into cDNA, then amplified by PCR until detectable levels are reached.
Using internal calibration techniques, it is possible to obtain exceptionally high levels of sensitivity (on the order of 1 molecule per microliter of sample volume) and dynamic range (6 to 8 orders of magnitude). This method requires using primers for all genes of interest
and, unlike the techniques described above, is not of the parallel type. It is therefore crucial that the procedure be automated in order for it to function on a large scale. In practice, it is used to obtain more precise data on one to a few dozen genes that have previously been identified by high-throughput screening as being particularly interesting (e.g. ref. 11).
4.5. Serial Analysis of Gene Expression (SAGE)
Serial analysis of gene expression (SAGE) utilizes a very different technique to measure mRNA levels. First, cDNA is synthesized from mRNA; then a DNA tag long enough (~10-20 basepairs) to identify a gene unambiguously is cut from each cDNA fragment at a precise site. The tags are then concatenated into a long double-stranded DNA sequence and the concatenated DNA is amplified and sequenced. For the present purpose, amplification must be proportionate, so that relative tag frequencies are preserved. If tag T1 is 10 times more frequent than tag T2 in the concatenated DNA sequence, it indicates that the mRNA that contains T1 is 10 times more abundant than the mRNA containing T2 (12). The SAGE method presents two advantages: 1) it is not necessary to know the mRNA sequence in advance, so that unknown but expressed genes may be detected, and 2) it utilizes a sequencing technology that is already in routine use in numerous laboratories. However, the drawbacks are that SAGE involves a somewhat complicated procedure and necessitates massive sequencing. It has already been used, for instance, to analyze the entire set of S. cerevisiae genes expressed during various phases of the cell cycle, as well as the expression of tens of thousands of human genes, comparing healthy and cancer cells.
4.6. Chromatin Immunoprecipitation
A recently developed approach has permitted direct investigation on a genomic scale of the gene targets of proteins that regulate transcription by binding to DNA (Fig. 5). During the first phase, a chemical fixative such as formaldehyde is added to a culture of living and metabolizing cells.
The bivalent reagent rapidly penetrates into cells, where it forms covalent bonds between two chemical groups that are in close spatial proximity. In particular, it can form a solid bridge between a transcription factor and its DNA binding site, if one happens to be bound at that instant. The cells are then ruptured and their DNA is extracted and broken by a mechanical process into random fragments of about 1 kbp. DNA fragments associated with the protein of interest are then immunoprecipitated, using an antibody specifically directed against this protein. Precipitation is followed by deproteinization. Optionally, PCR may then be used to amplify the DNA, but tricks must be used to deal with DNA that has been chemically damaged by the fixative. In a second phase, the precipitated DNA may be identified by hybridization to a microarray or biochip. Ideally, the microarray should include probes that are representative of both gene and intergene regions. Since regulatory regions are usually intergenic, one would in principle expect the precipitated DNA to hybridize with intergenic probes, not with gene probes. The first phase is known as Chromatin ImmunoPrecipitation (ChIP). Since the second phase requires the use of a microchip, the whole method is called the ChIP-chip technique. It has been used extensively to discover new protein/DNA interactions in yeast (13,14,4) and in humans (15). The most thorough analysis, in yeast, recovered on the order of 10,000 interactions among 4000 genes for 200 transcription factors (4). However, this approach runs into problems, especially those involving background noise due to low antibody specificity. These necessitate the use of relatively arbitrary thresholds to analyze the results. Nevertheless, as expected, the precipitated DNA preferentially hybridizes with intergene probes, rather than with gene probes. It also hybridizes more often with DNA regulatory regions that contain the motif recognized by the transcription factor than with zones that do not. The results indicate that the number of regulated targets per sequence-specific transcription factor is higher than previously thought on the basis of classical genetics and biochemistry experiments. Thus, the S. cerevisiae transcription factor Rap1p has around 300 targets, which is around 5% of the organism's genes (13).
Figure 5. The Chromatin ImmunoPrecipitation-biochip (ChIP-chip) technique. Proteins (geometric icons) associated with DNA (thick line) are covalently bonded by a fixative such as formaldehyde (X), which penetrates and kills living cells. The DNA is mechanically fragmented into random pieces. A DNA-binding protein is then immunoprecipitated with specific antibodies. Fragments of the co-precipitated DNA are deproteinized and marked with a red fluorophore. In parallel, DNA fragments representative of the entire genome are marked with a green fluorophore (control). The red and green DNA fragments are then mixed. To identify the enriched DNA fragment, the mixture is used to probe a microarray or biochip that contains both genic and intergenic regions of the organism being studied. After washing, the intensity of the two fluorescences is measured. For each microarray probe, the intensity ratio indicates the level of enrichment in protein-associated DNA. Once a DNA fragment has been identified, its neighboring genes are declared putative targets of the DNA-binding protein that had been immunoprecipitated.
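The thresholding step mentioned above can be illustrated with a minimal sketch. It assumes that each probe has one red (immunoprecipitated) and one green (whole-genome control) intensity, and it uses a log2 ratio cutoff of 1.0; the function name, the probe names and the cutoff are all hypothetical choices made for illustration, not values taken from the cited studies.

```python
import math

def enriched_probes(red, green, min_log2_ratio=1.0):
    """Return the probes whose log2(red/green) ratio reaches the cutoff.

    red, green: dicts mapping probe name -> fluorescence intensity
    (immunoprecipitated sample and whole-genome control).
    min_log2_ratio: hypothetical cutoff; in practice it is chosen
    rather arbitrarily, as the text points out.
    """
    hits = []
    for probe, r in red.items():
        g = green.get(probe)
        if g is None or g <= 0 or r <= 0:
            continue  # no usable signal for this probe
        if math.log2(r / g) >= min_log2_ratio:
            hits.append(probe)
    return sorted(hits)

# Hypothetical intensities for two intergenic probes and one gene probe.
red = {"intergenic_1": 800.0, "gene_1": 120.0, "intergenic_2": 450.0}
green = {"intergenic_1": 100.0, "gene_1": 110.0, "intergenic_2": 200.0}
hits = enriched_probes(red, green)
print(hits)
```

Raising or lowering the cutoff trades false positives against missed targets, which is why the resulting target lists should be read as putative.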
Finally, it is important to note that this method, in its current form, does not provide any information on whether the measured interaction corresponds to a transcriptional influence and, if it does, it does not provide the sign of the interaction (activation or inhibition). It merely indicates that a given protein was positioned in the immediate neighborhood of this stretch of DNA at the time the fixative acted. As a
matter of fact, this method has also been used to study other DNA-binding proteins that do not influence transcription, such as those that are involved in initiating DNA replication at the origins of replication.
4.7. Bioinformatics
A protein that regulates transcription preferentially binds to DNA sites that are identified by a certain sequence or small subset of similar sequences (Fig. 6). These potential sites may be detected using textual analysis of chromosome sequences. If the predicted position of such a site relative to the coding part of a gene is correct, a notion that depends heavily on the organism under study, it is reasonable to assume that transcription of the gene may be regulated by the DNA-binding protein. However, in practice, this approach meets various difficulties, the most serious of which is that such sequences are usually very short and degenerate. The number of potential protein binding sites is therefore disproportionately large. To cut down on false positive hits, several tricks or techniques have been deployed. One can eliminate sites that are located in coding regions, which often are as dense as those located in regulatory regions. It is sometimes possible to obtain better results by noting when these potential sites appear three or more times in rapid succession in the regulatory region of the same gene (16,17). Empirically, such repetition often corresponds to effective regulation, which may be demonstrated using a direct approach. In higher eukaryotes, the respective positioning of binding sites for different transcription factors may similarly be used as a criterion to improve the quality of predictions (18). Finally, a promising recent avenue of research consists in integrating sequence conservation across a few related genomes to improve the accuracy of prediction (e.g. ref. 19). Other methods score the energies of the individual contacts between each nucleotide and the DNA-binding protein.
In general, to search conveniently for novel sites, these methods assume independent contributions from these individual contacts, such that the total energy of the interaction is the sum of the energies of the individual contacts. This additivity assumption is often a good approximation (20).
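The additivity assumption can be sketched with a position weight matrix, in which each position of a candidate site contributes an independent score and the site score is their sum. The sketch below uses log2-odds scores as a stand-in for contact energies, which is an assumption of this example; the counts, the pseudocount, the uniform background and the function names are likewise hypothetical.

```python
import math

def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert per-position nucleotide counts into log2-odds scores."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix

def score_site(matrix, site):
    """Additive score: the sum of independent per-position contributions."""
    return sum(column[base] for column, base in zip(matrix, site))

def scan(matrix, sequence):
    """Score every window of the sequence; return (best_score, best_offset)."""
    width = len(matrix)
    scores = [(score_site(matrix, sequence[i:i + width]), i)
              for i in range(len(sequence) - width + 1)]
    return max(scores)

# Hypothetical counts from ten aligned sites of a 4-bp, "TGAC"-like motif.
counts = [
    {"T": 9, "A": 1},
    {"G": 10},
    {"A": 8, "C": 2},
    {"C": 10},
]
pwm = log_odds_matrix(counts)
best_score, best_offset = scan(pwm, "AATGACGT")
print(best_offset, round(best_score, 2))
```

Because such motifs are short and degenerate, a scan of a whole genome with this kind of matrix returns many high-scoring windows, which is the false-positive problem described above.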
Figure 6. Usual representation in the form of a logo of a consensus DNA site at which a transcription factor is likely to bind. The motif presented in this example comprises 12 significant positions. Only one DNA strand is shown. The sum of the heights of the letters at a given position indicates the information content in bits. The relative sizes of the letters correspond to the frequency of the nucleotide at each position. For example, only C can be found at position 5; both G and A are found at position 6.
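The letter heights in such a logo can be computed from nucleotide frequencies. The sketch below uses the common convention that the information content of a position is 2 bits minus its Shannon entropy, without any small-sample correction; the equal frequencies of G and A at position 6 are assumed for illustration, and the function names are hypothetical.

```python
import math

def information_content(freqs):
    """Information content in bits of one position: 2 - Shannon entropy."""
    entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    return 2.0 - entropy

def letter_heights(freqs):
    """Height of each letter: its frequency times the position's information."""
    ic = information_content(freqs)
    return {base: f * ic for base, f in freqs.items() if f > 0}

position_5 = {"C": 1.0}            # only C is found
position_6 = {"G": 0.5, "A": 0.5}  # G and A, assumed equally frequent
print(information_content(position_5), information_content(position_6))
print(letter_heights(position_6))
```

A fully conserved position thus reaches the maximum of 2 bits, whereas a position where all four nucleotides are equally frequent carries no information and is invisible in the logo.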
4.8. Combining Several Approaches
The emerging practice is to combine two or three of the above approaches, in order to discriminate among significant interactions between genes (14,21). In this view, genes are considered to be the probable targets of a regulatory protein if they simultaneously satisfy the following three conditions: 1) their regulatory region contains at least one site that is recognized by the regulatory protein (bioinformatics); 2) their DNA co-precipitates with the regulatory protein (ChIP-chip); 3) their transcription level is modified after a stimulus known to trigger a response involving the regulatory protein (kinetic experiment using a microarray or biochip).
5. Computational Modeling
The dynamic implications of the underlying logic of regulatory networks cannot be deduced solely from bench approaches, in particular because the molecular components are entangled into a complex web of interactions. Increasingly, formal models and computational simulations are required in conjunction with bench experimentation, in order to link the dynamical behavior of variables (trajectory, attractor) to a specific network topology. This Section reviews the formalisms that are commonly used to describe and study regulatory networks. A more mathematically-minded review has recently appeared elsewhere on the
same subject (22). The choice among the formalisms discussed below must be based on careful consideration of their shortcomings and their strong points, given the problem and data at hand, and given the available computer power. Increasingly, hybrid formalisms have been proposed to cope with particular data structures.
5.1. Graphs and Their Derivatives
A graph is a set of vertices and edges connecting some pairs of vertices. It is customary at all scales of biology to use graphs to represent interaction maps, including genetic interactions (23). One potential problem with pure graphs is that they describe a static relational topology, while data are sometimes more informative than just the presence or absence of a relation between vertices. Starting from graphs, it is possible to add various types of conditional, directional or spatiotemporal information to the relations between vertices. The following graph derivatives are relevant to transcriptional networks.
• For transcriptional regulator/target interactions, the graph is directed, as the relation connects two objects with asymmetrical roles (see Section 3.1).
• Vertices are of the catalytic type, rather than of the stoichiometric type. In the latter case, they describe resources that are consumed (e.g. a metabolic graph where vertices denote metabolites). In the former case, they are recycled to the identical form (indeed, the gene is not consumed by being influenced or by regulating another gene through its product).
• A function may be assigned to each edge, that formally describes the interrelation between the vertices it connects. Such refined functions effectively provide information relevant to the dynamics of the graph, thereby allowing it to escape its inherent limitation of depicting a static topology. For the dynamical interpretation of genetic networks, a signed graph is particularly valuable. However, whereas classical genetic and biochemical data provide a sign for the interaction, the ChIP-chip technique does not.
• Graphs are not suitable when it comes to expressing interactions that involve more than two partners in an obligate fashion. The formalism of hypergraphs allows non-binary interactions to be expressed and could be used in the above context. If computational efficiency requires using graphs, and at the cost of a limited loss of information, an n-ary complex can be broken down into the set of all possible binary interactions between subcomponents (a "clique", see Chapter 1).
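Two of the derivatives listed above can be sketched in a few lines: a signed directed graph stored as a mapping from (regulator, target) pairs to signs, and the expansion of an n-ary complex into a clique of binary interactions. The gene and protein names are hypothetical, and the data layout is only one of several reasonable choices.

```python
from itertools import combinations

def clique_expansion(complex_members):
    """Break an n-ary complex into all binary interactions between members."""
    return sorted(combinations(sorted(set(complex_members)), 2))

# Signed, directed regulator -> target edges: +1 activation, -1 inhibition.
regulation = {
    ("geneA", "geneB"): +1,
    ("geneB", "geneA"): -1,
    ("geneA", "geneA"): +1,  # self-activation
}

# A hypothetical four-protein complex becomes 4 * 3 / 2 = 6 binary edges.
pairs = clique_expansion(["P1", "P2", "P3", "P4"])
print(len(pairs), pairs[0])
```

The expansion loses the information that all four partners are required together, which is the limited loss referred to above.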
5.2. Boolean Modeling
An idealized model based on elementary mechanisms may sometimes capture the essence of a complex behavior. In a boolean model,
• each gene may receive one or several inputs from other genes or from itself;
• assuming a sigmoidal (highly cooperative) relation between input and output, a gene may be considered as a first approximation to be either on (1, transcribed) or off (0, untranscribed) (Fig. 7);
• time takes discrete values and all gene states are simultaneously computed at each time step;
• the output at time t+1 is calculated from the input at time t according to boolean functions.
Figure 7. From continuous to discrete (axes: Input, Output). The output is some sigmoidal function of the input (triangles), owing to cooperative intermolecular interactions. The boolean simplification replaces this curve with a step function (squares). Only two output states remain, minimal or maximal.
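The synchronous update rule can be sketched as follows. The three-gene network (A activates B, B activates C, C inhibits A) and the function names are hypothetical and serve only to show how an attractor is reached.

```python
def step(state, rules):
    """Synchronous update: every gene's next value is computed from the current state."""
    return {gene: int(rule(state)) for gene, rule in rules.items()}

def find_attractor(state, rules, max_steps=100):
    """Iterate until a state repeats; return the repeating cycle of states."""
    seen = []
    for _ in range(max_steps):
        key = tuple(sorted(state.items()))
        if key in seen:
            return seen[seen.index(key):]
        seen.append(key)
        state = step(state, rules)
    return None

# Hypothetical network: A activates B, B activates C, C inhibits A.
rules = {
    "A": lambda s: not s["C"],
    "B": lambda s: s["A"],
    "C": lambda s: s["B"],
}
attractor = find_attractor({"A": 1, "B": 0, "C": 0}, rules)
print(len(attractor))
```

In this toy case the negative circuit produces a cyclic attractor of six states rather than a fixed point.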
Boolean modeling has been used for network inference (24). It has also been used to study the global dynamic properties of large-scale regulatory systems, in particular genetic networks, given local rules that bear for instance on the average degree of connection between genes. It is efficient, even for large genetic networks, at the expense of strong simplifying assumptions on the absence of intermediate gene expression levels (25,26).
5.3. Generalized Logical Modeling
Transcriptomic data generally do not show extreme gene expression values, but rather intermediate ones. Although this observation may often reflect a mixture in varying proportions of cells that are each in different extreme states, it is likely on the basis of careful studies conducted at smaller scales that at least some genes are expressed at more than two possible levels. More importantly, transcription factors likely have different thresholds for their different target genes. The generalized logical method proposed by René Thomas takes this fact into account by allowing logical variables to assume several discrete values (27). Here a variable is an abstraction for the cellular concentration of one transcription factor. If a transcription factor encoded by gene A influences k genes, each with a different threshold, then the logical variable for A can take at most k+1 values, one for each threshold and 0 if no threshold is exceeded. In practice, A will have only a limited number of different significant thresholds, and consequently the number of values A can take will often be smaller than k. State transitions are not necessarily synchronous for this formalism. Indeed, a synchronous step would sometimes entail jumping several thresholds at once, which cannot occur in vivo because processes such as protein accumulation are continuous. In logical networks, transitions are made more realistic by being desynchronized, i.e. by jumping one threshold at a time.
Furthermore, time delays, such as those arising from biosynthetic steps, can be taken into account. Generalized logical networks have been used to model various small regulatory systems, including developmental networks and bacterial genetic switches (28). In the case of developmental networks, it has been
possible to manually introduce notions of compartmentalization (29). This approach appears to be an efficient compromise between the drastic simplifications of the boolean model and the excessive parameter dependence of the differential approaches. Moreover, it offers the possibility to exhaustively verify the temporal properties of a system, by taking advantage of the whole corpus of formal methods from computer science (30).
5.4. Petri Nets
A Petri net typically allows the description and modeling of concurrent systems. Although it has mostly been used so far to model technological systems (seat reservations, communication protocols etc.), it has also been proposed to describe and model biological networks, including genetic networks (31). A Petri net consists of places, transitions and arcs. A place contains tokens which may flow through arcs according to some general rules. An arc connects a place to a transition, or vice versa. A transition comprises incoming and outgoing arcs that connect to places. When a transition is triggered, a token is taken from each input place and one is added to each output place through the corresponding arcs. This may be used to represent interactions of the stoichiometric type. In addition, a "test" arc checks for the presence of a token in its source place but does not consume it. Thus, it may be used to represent interactions of the catalytic type. Some extensions to the classical Petri net have become popular for biological applications, including the hybrid Petri net, which has two kinds of places and two kinds of transitions. The places and transitions are either discrete or continuous. The discrete places and transitions are defined as above. A continuous place holds a positive real number. A continuous transition continuously fires at a rate determined by parameters assigned to the transitions in hybrid Petri nets, or to the places in hybrid dynamic nets, or to both in hybrid functional Petri nets (31).
Petri nets have been employed to model a variety of small biological systems, including regulatory gene networks (31,32).
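The firing rule of a classical discrete Petri net, including "test" arcs, can be sketched as below. The class, its method names and the two-place example (a transcription factor that is tested but not consumed, and an mRNA that is produced and degraded) are hypothetical illustrations, not a reproduction of any published model.

```python
from collections import Counter

class PetriNet:
    """Minimal discrete Petri net with ordinary arcs and 'test' arcs."""

    def __init__(self, marking):
        self.marking = Counter(marking)  # place -> number of tokens
        self.transitions = {}

    def add_transition(self, name, inputs=(), outputs=(), tests=()):
        # inputs: places that lose one token per arc when the transition fires
        # outputs: places that gain one token per arc
        # tests: places that must hold a token but are not consumed
        self.transitions[name] = (Counter(inputs), Counter(outputs), tuple(tests))

    def enabled(self, name):
        inputs, _, tests = self.transitions[name]
        has_inputs = all(self.marking[p] >= n for p, n in inputs.items())
        has_tests = all(self.marking[p] >= 1 for p in tests)
        return has_inputs and has_tests

    def fire(self, name):
        """Fire the transition if it is enabled; return whether it fired."""
        if not self.enabled(name):
            return False
        inputs, outputs, _ = self.transitions[name]
        self.marking.subtract(inputs)
        self.marking.update(outputs)
        return True

net = PetriNet({"TF": 1, "mRNA": 0})
net.add_transition("transcribe", outputs=["mRNA"], tests=["TF"])  # catalytic
net.add_transition("degrade", inputs=["mRNA"])                    # stoichiometric
net.fire("transcribe")
net.fire("transcribe")
net.fire("degrade")
print(dict(net.marking))
```

The transcription factor token survives any number of firings, whereas each degradation consumes one mRNA token, which mirrors the catalytic versus stoichiometric distinction made above.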
5.5. Bayesian Networks
Using the Bayesian formalism, a chart of regulatory interactions is represented by a directed acyclic graph, i.e. a graph with oriented edges and without circuits. In this graph, a vertex corresponds to a molecular entity such as a gene or a protein, and holds a random variable representing the gene expression level or the protein concentration. A conditional probability distribution is defined for the variable of each vertex, given the variables of its direct inputs in the directed graph. A joint probability distribution is finally defined from all conditional distributions. As such, this formalism allows information to be propagated within the model. Moreover, when data are available, it is possible to apply algorithms of statistical inference to estimate parameters of the conditional probability distributions, and to identify plausible structures. In this vein, Bayesian modeling has been successfully used for inference of small networks of transcriptional interactions (see Chapter 3). This formalism is interesting because it is strongly anchored in statistics, and it appears to be well suited to directly handle noisy data. Furthermore, it can be used even under conditions of incomplete knowledge, and prior knowledge can be introduced.
5.6. Ordinary Differential Equations
Using the widespread formalism of ordinary differential equations, concentrations of transcription factors and their targets are represented by time-dependent variables. Regulation is modeled by expressing the rate of synthesis of an mRNA as a function of the concentrations of all the transcription factors that regulate the gene that encodes it. A degradation term can be added to account for the exponential decay of the i-th molecule, which may result from true degradation, but also from dilution due to growth or diffusion. In this case, the equation represents a balance between synthesis and decay. Time delays can also be easily introduced in the function.
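As an illustration, the balance between synthesis and decay for one target mRNA can be written dm/dt = beta * h(r) - gamma * m, where m is the mRNA concentration, r the concentration of its regulator, beta the maximal synthesis rate, gamma the decay rate and h a sigmoidal regulation function. These symbols, the choice of a Hill function for h, the parameter values and the forward-Euler scheme are all assumptions of this sketch.

```python
def hill(x, K, n):
    """Sigmoidal activation: fraction of maximal synthesis at regulator level x."""
    return x**n / (K**n + x**n)

def simulate(regulator, beta, gamma, K, n, m0=0.0, dt=0.01, t_end=100.0):
    """Forward-Euler integration of dm/dt = beta * hill(regulator) - gamma * m."""
    m = m0
    for _ in range(int(round(t_end / dt))):
        dm_dt = beta * hill(regulator, K, n) - gamma * m
        m += dt * dm_dt
    return m

# With the regulator at its half-saturation level K, hill() equals 0.5, so the
# analytical steady state is beta * 0.5 / gamma = 2.0 * 0.5 / 0.1 = 10.
m_final = simulate(regulator=1.0, beta=2.0, gamma=0.1, K=1.0, n=4)
print(round(m_final, 2))
```

Forward Euler is used here only for brevity; stiff or strongly non-linear systems usually call for an adaptive solver.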
One popular non-linear function that accounts for real cases of sigmoidal response curves for gene expression is the Hill function. Analytical solutions are often impossible to reach for non-linear
functions, and practitioners usually resort to numerical simulations that calculate approximate values for the variables at each successive time step. However, it is sometimes possible to analyse specific features of the dynamical system known as steady states and limit cycles. Such states and cycles are interpreted in the biological setting as representing bona fide stable or recurring physiological situations. The robustness of these steady states or limit cycles to changes in parameter values may additionally be assessed using bifurcation analysis. Numerical simulations of non-linear ordinary differential equations have been used to study systems such as the regulatory switch of bacteriophage Lambda between host cell lysis and lysogenic growth, where the kinetic parameters are few and have been measured very carefully (33). A general difficulty with ordinary differential equations is that they rely on an accurate knowledge of the numerical parameters, and this knowledge is seldom available at present, although this situation can only improve. In the absence of proper experimental measurements, parameter values can be chosen, using a manual or semiautomatic procedure, to fit the experimentally observed behavior. However, mere fitting does not guarantee that the parameters are right or that the numerical model is relevant to the biological situation. Furthermore, predictive extrapolations following fitting are unsafe. One way to circumvent the analytical difficulties encountered with non-linear differential equations is to approximate them with a series of linear differential equations, yielding "piecewise-linear differential equation" models (34,22). For instance, the sigmoidal relation depicted in Fig. 7 could be approximated by a step function. Often, this approximation does not change the qualitative properties of the solutions.
Thus, piecewise-linear differential equation models stand between non-linear ones and logical models and have the advantage of strongly constraining local behavior in the phase space.
5.7. Partial Differential Equations
So far, the spatial aspects have not been considered, except in a rather superficial way in a few applications. In all other cases, spatial homogeneity was assumed, or the formalism could not handle spatial
aspects. However, simple considerations of how a cell functions, even a prokaryotic cell deprived of any internal membranes, tell us that this assumption is wrong within a cell, not to mention the case of multicellular organisms.
Partial differential equations, and their close relatives called reaction-diffusion equations, incorporate spatial and compartmental considerations. They have been extremely popular in morphogenetic studies (35,36). However, the predictions made on the basis of partial differential and reaction-diffusion equations are generally sensitive to parameter values, boundary conditions and domain shape.
5.8. Stochastic Equations
Bacteria may contain as few as ten molecules of a given transcription factor, a few copies of the DNA binding site, or one molecule of a given mRNA. It is thus questionable to assume, as has been done so far with differential approaches, that molecular concentrations vary continuously. It is equally questionable to neglect fluctuations (internal noise) in the timing of molecular processes and assume a perfect determinism, i.e. that two identical systems starting from the same initial states will follow an identical trajectory. This concern also holds true in eukaryotes (37). Accordingly, regulatory systems have also been modeled in a stochastic fashion, to account for the imperfect determinism, and in a discrete fashion, to account for the small number of molecules. One possibility with respect to the lack of determinism is to add to a rate equation a term that accounts for the noise in the system (Langevin's equation). Another possibility is to simulate step by step the time evolution of the system (38). In this latter case, stochasticity is introduced at the level of two variables which represent the time interval between two successive steps, and the next reaction to occur (39).
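The step-by-step approach can be sketched for the simplest possible system, one mRNA species that is synthesized at a constant rate and degraded in proportion to its copy number. The two random draws in the loop correspond to the two variables named above. The function name, rate constants and seed are hypothetical, and the sketch follows the usual direct-method form of stochastic simulation rather than any specific implementation from the cited references.

```python
import random

def simulate_birth_death(k_syn, k_deg, n0, t_end, rng):
    """Stochastic simulation of mRNA copy number n.

    Reactions: synthesis (propensity k_syn) and degradation (propensity k_deg * n).
    Returns the trajectory as a list of (time, copy_number) pairs.
    """
    t, n = 0.0, n0
    trajectory = [(t, n)]
    while True:
        propensities = [k_syn, k_deg * n]
        total = sum(propensities)
        if total == 0:
            break
        # First random variable: waiting time until the next reaction.
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Second random variable: which reaction occurs.
        if rng.random() * total < propensities[0]:
            n += 1
        else:
            n -= 1
        trajectory.append((t, n))
    return trajectory

rng = random.Random(0)  # fixed seed so that the run is reproducible
trajectory = simulate_birth_death(k_syn=1.0, k_deg=0.1, n0=0, t_end=50.0, rng=rng)
print(len(trajectory), trajectory[-1])
```

Repeating the run with different seeds gives different trajectories from identical initial states, which is the imperfect determinism discussed above.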
Stochastic simulations have been notably applied to the developmental choice of bacteriophage Lambda between host cell lysis and lysogenic growth. An interesting outcome of the observed fluctuations is that stochasticity may be one good way to account for phenotypic diversity, i.e. the fact that different individuals in an apparently homogeneous population have different behaviors (33). In
practice, the stochastic approach may yield realistic simulations provided that the reaction mechanisms are known in great detail, at the cost of heavy computations, given the number of complex simulations to be run. When it is possible to widen either the time or the space scale, the phenomenon under study may be approximated by less costly deterministic models.
5.9. Modeling Strategy
Post-genomic data obtained from kinetic experiments may be thought to contain the information required to model the underlying network. However, two kinds of difficulties lie in the path of these modeling attempts: on the one hand, experimentally measured parameters reveal intrinsic variability and dispersion, since the observations almost always concern a population, not an individual. On the other hand, experimental options, which are often constrained by practical considerations, generate extrinsic variability (due to measurement) as well as data whose structure may not be well adapted to modeling. More generally, a typical set of post-genomic data concerns thousands of variables, either genes or proteins. However, even in the best circumstances, this set includes only a few hundred experimental situations, thus a few hundred numerical values for each variable. Under these conditions, in which the model is under-determined by the experimental facts, one would expect the computational method to propose a model which, in "attempting" to account for the facts, contains some false correlations. Therefore, it would most often be useful to have supplementary data available to further constrain the model. Such data could derive from prior knowledge of molecular interactions, or of the consensus binding sequences. It is also sometimes possible to convert prior knowledge obtained from higher organizational levels into constraints that are expressed in the same language as the model.
Constraining the model by using prior information concerning what is known or plausible from the biological point of view probably remains our best ally in attempting to tackle the curse of dimensionality! How to include this information in the inference and modeling processes is the real art of the modeler.
6. Global Topology of Transcriptional Networks
6.1. Introduction to Topological Analysis
The set of molecular networks in a cell constitutes a web that is heterogeneous both in its nodes and edges. If we emphasize the genetic network, then information rapidly flows from the patterns of genetic activity through a cascade of inter- and intra-cellular signaling functions, before slowly returning toward the regulation of gene expression. Other perspectives are possible, starting from and returning to protein or membrane activity patterns. In any case, the challenge is to identify significant connections in these regulatory networks and to determine the abstract principles underlying the architecture and dynamics of the network, which allow it to function in a reliable yet flexible manner. The accumulation of data has made network architecture accessible. Expressed in the symbolism of graphs, describing network architecture consists in describing the links (edges) that connect nodes (vertices), and possibly the rules, functions, and weights that may be assigned to the links. Often the only information available is whether or not a link exists (a pure graph), which is insufficient for modeling network dynamics. In addition, such dynamics would require introducing the notion of temporal delay, and delays are seldom available in a quantitative manner. Thus network dynamics often continue to escape us.
6.2. Analysis of the Global Topology
Empirical and theoretical results indicate that networks may be divided into two major categories, according to the connectivity distribution, $p_k$, which indicates the probability that a node is connected to k other nodes (see Chapter 1). The first category of networks is characterized by a $p_k$ that reaches a maximum at an average value $k_{average}$, and that diminishes exponentially for higher values of k: $p_k \sim C e^{-\beta k}$, where $\beta$ and $C$ are constants. In such "exponential" networks, each node has approximately the same number $k_{average}$ of links.
In the second category of networks, $p_k$ decreases according to a power law: $p_k \sim C k^{-\gamma}$, where $\gamma$ and $C$ are
constants. In such "power-law" networks, the distribution tail for high k is fatter, making the node population much less homogeneous than for the case of exponential distribution. There would be many nodes with few links, and a small number of nodes with many links (Chapter 1). One property of a network is its average diameter, which is the minimum number of edges connecting any two nodes in the network, averaged over the set of all possible pairs. Intuitively, the diameter of a network must have some impact on its dynamics. For example, information takes longer to flow through a large-diameter network. Another general network property is the presence of a giant component, i.e. a large sub-network in which a path connects any pair of vertices. The question of the threshold size at which the word "giant" applies is of little importance, since the giant component phenomenon corresponds to a sudden change of the network phase from fluid to frozen. This sharp jump in the number of vertices in the largest connected component marks the phase transition into a network that includes a giant component. The existence of a giant component in fact also depends on local topology (40).
6.3. A Case Study and Its Biological Interpretation
For the moment, the genetic networks of the baker's yeast S. cerevisiae and of the enteric bacterium Escherichia coli have been sufficiently investigated to allow some genome-wide observations (3,41). In the multicellular world, the most thorough analysis bears on the transcriptional influences on one gene involved in the development of the sea urchin (42,43). Here the analysis of the global topology will be illustrated with the case of the S. cerevisiae genetic network, which has been studied by the author (3). The graph of genetic interactions is signed; that is, each edge bears an interaction sign, positive for activation and negative for inhibition. It is also directed, for the molecular reasons discussed above.
As a consequence, incoming and outgoing connectivities will be considered separately. The distribution of incoming interactions obeys an exponential law, whose exponent is –0.45 for S. cerevisiae and –1.2 for E. coli (Fig. 8a).
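Counting incoming and outgoing connectivities separately is straightforward once the network is stored as a list of directed (regulator, target) edges. The sketch below uses a hypothetical five-edge network and hypothetical names; it returns, for each connectivity k, the number of nodes that have it, which is the raw material for plots such as those of Fig. 8.

```python
from collections import Counter

def degree_histograms(edges):
    """Return (incoming, outgoing) histograms: connectivity k -> number of nodes.

    edges: list of (regulator, target) pairs. Only nodes with at least one
    incoming (respectively outgoing) edge appear in each histogram.
    """
    edges = list(edges)
    in_degree = Counter(target for _, target in edges)
    out_degree = Counter(regulator for regulator, _ in edges)
    return Counter(in_degree.values()), Counter(out_degree.values())

# Hypothetical network: TF1 regulates three genes, TF2 regulates g3 and TF1.
edges = [("TF1", "g1"), ("TF1", "g2"), ("TF1", "g3"), ("TF2", "g3"), ("TF2", "TF1")]
incoming, outgoing = degree_histograms(edges)
print(dict(incoming), dict(outgoing))
```

Plotting the logarithm of the counts against k (semilog) or against log k (log-log) then distinguishes an exponential from a power-law distribution by which plot is closer to a straight line.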
Figure 8. Connectivity of the genetic regulatory network of yeast. This network includes 500 genes and 900 interactions, a small number of which are shown in Fig. 2. a) Incoming connectivity (semilog plot): the number of transcription factors per regulated gene follows an exponential distribution. b) Outgoing connectivity (log-log plot): the number of regulated genes per transcription factor approximates a power law distribution.
In practice, the shallower slope observed for yeast indicates that the maximum number of different transcription factors that may regulate the same gene is higher than in bacteria. The average and maximum connectivities are respectively 2.3 and 13 for S. cerevisiae, and 2 and 6 for E. coli. This reflects the greater sophistication of the machinery regulating eukaryotic transcription and the longer regulatory regions of eukaryotic promoters. In 2003, Kauffman et al. analyzed the yeast transcriptional network in terms of simplified Boolean network models, with the aim of determining feasible rule structures, given the requirement of stable solutions of the generated Boolean networks. They found that generated models with canalyzing Boolean rules are remarkably stable, whereas those with random Boolean rules are only marginally stable. Furthermore, substantial parts of the generated networks appear to reach the same state, regardless of the initial state. Thus, their ensemble approach suggests that the yeast network shows highly ordered dynamics. Outgoing connectivity has no such limits. The total number of DNA targets and the protein concentration constitute its only molecular limits. Indeed, outgoing connectivity does not obey an exponential law, but approaches a power law (Fig. 8b). The exponent is around -1 for both organisms, indicating that the number of outgoing connections $k\,p_k$ is distributed equally over k. This -1 value also corresponds to the phase transition of a generalized random graph. Average connectivity is 8 for S. cerevisiae and 3 for E. coli. It is among the essential genes that are boxed in Fig. 2 (therefore sensitive to sabotage attack) that those richest in direct and indirect targets are found (3). In these networks, one observes a small maximum diameter (5 steps; see Fig. 2).
7. Local Topology
7.1. Analysis of the Local Topology
Several criteria may be taken into account to describe the local topology of a network.
Those involving motifs will be briefly dealt with here, and in depth in Section 8. Another local criterion is edge apportionment. For
a given average connectivity (the total number of edges divided by the total number of vertices), edges may be apportioned in extremely diverse ways (see also Chapter 1). Independent of the type of overall distribution, edges may be distributed uniformly among vertices, or display various degrees of local clustering. The formal criterion determining the presence of a "small world" is a reduced diameter, as in random networks, combined with a strong local clustering, as in regular networks. In spite of such strong local clustering, a "small world" still includes a connected giant component (44).
7.2. A Case Study of a Microorganism
Here the analysis of the local topology will be illustrated again with the case of the S. cerevisiae genetic network, which shows a very strong local clustering, beyond that of a "small world", as it is highly fragmented. The number of feedback circuits seems small, even if it is much greater than what would be predicted for a random graph constrained by the same empirical connectivity distributions. In E. coli, self-inhibition (unary negative circuits) widely predominates (45). In S. cerevisiae, there is a slight excess of positive circuits, nearly all of which are self-activations (Fig. 2) (3). Positive circuits are implicated in various differentiation processes, of which more examples are indeed found in the eukaryotic yeast, which displays differentiative capacity, than in E. coli, which is a non-sporulating bacterium (see footnote a). The overwhelming predominance of the shortest possible circuits could reflect cellular economy of both response time and biosynthetic energy (3). Finally, overall fragmentation of the genetic network could limit information crosstalk at the transcriptional level. However, the number of feedback circuits would greatly increase and fragmentation would diminish if the genetic network were in contact with other molecular networks (see Chapter 7).
a Sporulation, seen as an alternative to vegetative growth, is a case of a differentiative event.
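The two local quantities used above, average connectivity and local clustering, can be made concrete with a small sketch. It assumes an undirected reading of the graph and uses two hypothetical five-vertex networks (`ring` and `clustered`); neither the function names nor the toy graphs come from the chapter. Both graphs have the same average connectivity, yet their edges are apportioned very differently.

```python
from itertools import combinations


def average_connectivity(edges, n_vertices):
    """Total number of edges divided by the total number of vertices."""
    return len(edges) / n_vertices


def build_adjacency(edges):
    """Undirected adjacency sets (assumption: edge direction is ignored)."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def local_clustering(adjacency, v):
    """Fraction of pairs of neighbours of v that are themselves linked."""
    neighbours = adjacency[v]
    k = len(neighbours)
    if k < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
    return linked / (k * (k - 1) / 2)


def mean_clustering(edges):
    """Average of the local clustering over all vertices that have an edge."""
    adjacency = build_adjacency(edges)
    return sum(local_clustering(adjacency, v) for v in adjacency) / len(adjacency)


# Hypothetical toy graphs: 5 vertices and 5 edges each.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]        # edges spread uniformly
clustered = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]   # one triangle plus a tail

for name, edges in (("ring", ring), ("clustered", clustered)):
    print(name, average_connectivity(edges, 5), mean_clustering(edges))
```

The ring contains no triangle, so its clustering is zero, whereas the second graph concentrates three of its five edges in one triangle.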
Transcriptional Networks
7.3. A Case Study of a Multicellular Organism
Whereas genome-wide data are available for a few microorganisms, as has just been discussed, this is not the case for multicellular organisms. The best datasets bearing on transcriptional regulation in a multicellular organism represent near-exhaustive information on the regulatory neighbors of one gene and its product. This allows for the analysis of a local topology just as before, but analysis of the global topology cannot be conducted. Among others, genes involved in the development of the organism must be finely regulated both in time and space. Therefore, it is expected that such developmental genes are regulated in a particularly sophisticated way. This sophisticated regulation has been revealed in great detail in the case of the early (first 24-30 hours) development of the sea urchin larva. One particular gene, Endo16, which encodes a polyfunctional secreted protein of the midgut in the late embryo and larva, has been the subject of extensive analysis. The Endo16 gene is the terminus of a transcriptional network, and Davidson's group has accumulated near-exhaustive knowledge of its regulatory inputs (afferences) (42,43). The regulatory region of the Endo16 gene is about 2300 bp long. By comparing the expression of gene constructs that harbored various combinations of the target sites, it was possible to demonstrate that this regulatory region comprises cis-regulatory modules. Each such module contains target sites for four to eight different interactions with transcription factors. This combination of sites uniquely defines the regulatory functions encoded in the DNA sequence of the cis-regulatory module. For example, the most essential modules for Endo16 regulation are modules A and B. Module A controls the early Endo16 expression in the embryonic endoderm, integrates transactions originating from other modules and communicates to the basal transcription apparatus the global output information. Module B controls later Endo16 expression level in the midgut. Together, these two modules include 13 specific sites, targeted by nine different transcription factors. Of these nine transcription factors, only two are important "drivers", i.e. they are presented variably in embryonic time and space.
The overall functioning of this system can be described in terms of a computational model integrating the logical interrelations between these components together with quantitative information on the amplification factor corresponding to some interactions (the amplification factor represents the ratio of the expression levels of the target gene, with the considered transcription factor bound versus unbound). Davidson proposes the notion of a genomic regulatory code, justified by the reduced complexity that it affords in comparison to a fully detailed biochemical model. Operators in this code can for instance be "AND" gates, toggle switches, amplifiers (42,43). Thus far, it seems as though the Endo16 system operates in a purely logical way. However, there are two difficulties with such a view. Firstly, the issue of how and why one of these transcription factors considered as input has been activated or raised above a critical concentration level was left aside. The experimental output is merely a measure of the Endo16 expression level for a given configuration of transcription factor concentrations. By focusing only on the immediate effectors of the Endo16 gene, some feedback loops may have been missed by the analysis thus far. In other terms, the concentrations of these transcription factors may not all be independent variables. Secondly, this interesting work also shows the importance of the 'hardware', which consists of the physical ordering of the transcription factor target sites and of the cis-regulatory modules along the DNA sequence. Indeed, the relative positions of these sites and modules with respect to the unique transcription initiation site, where all the information must be integrated, determine the frequency of transcription initiation by the basal apparatus. Although each cis-regulatory module is both necessary and sufficient, one must keep in mind that its operation requires that it be properly positioned. Thus, the 'hardware' appears to be essential in determining the inner workings of the 'software', i.e. the logical operation of the set of sites and factors that was just reported.
7.4. Combinatorial Transcription Logic
The above-described study of the sea urchin developmental gene raises the question of how to implement in a parsimonious way various
multigenic regulatory logics. Many control functions may be implemented synthetically by a transcriptional network in live cells, i.e. through interactions between regulatory genes in the modular spirit described in Section 8, and the first attempts date back to the year 2000 (46-48). However, Buchler et al. (49) have worked out at a theoretical level an alternative view based on the "regulated recruitment" of transcription factors and RNA polymerase proposed by Ptashne and Gann (50,51). In this alternative view, transcriptional signals are combinatorially integrated at the cis-regulatory level, i.e. through the interplay of transcription factors bound on the same regulatory region of their common target gene. Their combinatorial schemes include most common logical functions, such as the and gate, meaning that both activators must be present for their target gene to be activated (Fig. 9).
[Figure 9 diagram labels: TF 1, TF 2, RNA polymerase, binding site 1, binding site 2, P; logical function: and]
Figure 9. Cis-regulatory construct that implements an and logical gate. The RNA polymerase will be recruited on its P site and the gene transcription will consequently start (symbolized by the arrow) only if both transcription factors TF 1 and TF 2 are bound on their respective binding sites. Gene activation does not occur when only one of these transcription factors is bound, which entails that these binding sites have to be weak. Dashed lines indicate cooperative interactions between protein pairs. Inspired by Buchler et al., 2003 (49).
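The and gate of Fig. 9 can be illustrated with a minimal statistical-weight sketch. This is not the model of ref. 49 itself: the function name, the choice that only TF 2 contacts the polymerase, and all parameter values (`p`, `w_tf`, `w_pol`, and the dimensionless concentrations) are assumptions chosen for illustration. Each configuration of the promoter receives a weight, and the promoter activity is taken as the probability that the polymerase occupies site P.

```python
def promoter_occupancy(x1, x2, p=0.01, w_tf=400.0, w_pol=100.0):
    """Probability that RNA polymerase occupies site P.

    x1, x2 : dimensionless TF concentrations [TF]/K; small values mimic weak sites
    p      : weight of polymerase bound on its own (weak promoter, assumed value)
    w_tf   : cooperativity between bound TF 1 and bound TF 2 (assumed value)
    w_pol  : cooperativity between bound TF 2 and polymerase (assumed value)
    """
    # Configurations without polymerase: empty, TF 1 only, TF 2 only, both TFs.
    without_pol = 1 + x1 + x2 + w_tf * x1 * x2
    # Same four configurations with polymerase bound; TF 2 recruits it.
    with_pol = p * (1 + x1 + w_pol * x2 + w_pol * w_tf * x1 * x2)
    return with_pol / (without_pol + with_pol)


present = 0.05  # weak sites: each factor alone rarely occupies its site
truth_table = {
    (tf1, tf2): promoter_occupancy(tf1 * present, tf2 * present)
    for tf1 in (0, 1)
    for tf2 in (0, 1)
}
for inputs, occupancy in truth_table.items():
    print(inputs, round(occupancy, 3))
```

With these illustrative values the occupancy is several-fold higher when both factors are present than when only one is, which is the qualitative behavior the caption describes. Stronger binding sites (larger `present`) erode the gate, in line with the remark that the sites have to be weak.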
In principle, this view would be compatible with shallower transcriptional networks and would allow for shorter response times and lower biosynthetic costs, as compared to a pure network view. Of course, these two views are not mutually exclusive but should rather be regarded as two layers of regulation. Genes may act together in a network to perform a certain logical operation, while each being regulated with a given cis-regulatory logic of its own. A third approach to regulatory combinatorics consists of evolving circuits towards
performing a fixed task, for instance acting as a bistable switch, from a given set of allowed interactions, for instance protein-DNA and protein-protein interactions. In such attempts, the obtained circuits have provided a variety of functional designs for a given task in transcriptional regulation (52) and in signal transduction (53). Some of these designs are also found in known biological networks. This third approach should thus prove helpful as a way to understand and create small functional subnetworks endowed with diverse functions.
8. Dynamics and Modularity in Macromolecular Networks
In Section 4, we saw that the remarkable progress of post-genomics leads to a map (now much closer to completeness than a mere ten years ago) of the macromolecular components and their interactions. This map should be used to understand the regulatory logic of living organisms. However, this will only be possible if we know how to read the map, a task beyond the grasp of the unequipped human mind. In attempting to interpret the map and to reduce the otherwise insoluble overall problem, it is important to shed light on the inherent modularity of the map. The relevance of this modularity is discussed below, as well as its tight link with the dynamical aspects, which explains why these two aspects are treated together. A more thorough and general discussion of modularity may be found in Chapter 2.
8.1. Modularity
8.1.1. Interest of Modularity
Partitioning a molecular network into sub-networks is of interest only under three conditions. The first condition is that such modules must be biologically relevant, for example by underlining functionality. For instance, a module could include all the actors involved in the response to a hormone. If this condition is not met, partitioning is just a mathematical game of no biological interest.
The second condition is that it must be possible to attribute a characteristic dynamics to a module. For example, a negative circuit (see below) could generate a homeostatic behavior. If this second condition is not satisfied, partitioning will not achieve the objective of helping to understand the functioning of the whole. The third condition is that modules may be recomposed among themselves (or that they be immersed into a larger network), while retaining their principal properties, especially their dynamics. For example, the negative circuit, which in isolation generates homeostasis, must retain that property when placed in a wider context. On this condition, modularity permits compacting certain representations and facilitates comparison among organisms.
8.1.2. Implementing Modularity
There are two ways of partitioning. Constructively, it amounts to selecting and assembling a small number of vertices into a sub-network. Reductively, a mathematical criterion, based for example on local connectivity, is applied to the whole network, so as to fragment it into sub-networks. Statistical analysis can demonstrate the interest of the modules detected, using an analytic approach (3) or numerical simulation (41) to demonstrate that they are significantly over- or under-represented in a natural macromolecular network. Note, however, that there is not necessarily a correlation between the exceptional representation of a module and its functional or evolutionary pertinence (54). A likely reason is that motifs within the networks are not isolated, that is, they strongly aggregate and have important edge and/or node sharing with the rest of the network (55,54). This fact makes it currently difficult to fulfill the above third condition, except under strict provisions. Until now, the constructive approach, in conjunction with generally rudimentary statistical analysis, has been used more frequently. Its major results are described in Section 8.2. The reductive approach is briefly summarized in Section 8.3.
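The numerical-simulation route to assessing over-representation can be sketched as follows. The sketch counts one module type (the feedforward loop of Section 8.2.2) in a directed graph and compares the count with an ensemble of randomized graphs. The randomization scheme used here, repeated edge swaps that preserve each vertex's in-degree and out-degree, is an assumption for illustration and is not necessarily the procedure of ref. 41; the function names and the toy network are likewise hypothetical.

```python
import random
from statistics import mean, pstdev


def count_feedforward_loops(edges):
    """Count ordered triples (a, b, c) with edges a->b, b->c and a->c."""
    edge_set = set(edges)
    successors = {}
    for a, b in edge_set:
        successors.setdefault(a, set()).add(b)
    count = 0
    for a, b in edge_set:
        if a == b:
            continue
        for c in successors.get(b, ()):
            if c != a and c != b and (a, c) in edge_set:
                count += 1
    return count


def randomize(edges, n_swaps, rng):
    """Degree-preserving randomization: swap the targets of two random edges."""
    edges = list(edges)
    edge_set = set(edges)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        # Reject swaps that would create a self-loop or a duplicate edge.
        if a == d or c == b or (a, d) in edge_set or (c, b) in edge_set:
            continue
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


def motif_z_score(edges, n_random=200, seed=0):
    """Observed count, mean count in the null ensemble, and z-score."""
    rng = random.Random(seed)
    observed = count_feedforward_loops(edges)
    null = [
        count_feedforward_loops(randomize(edges, 10 * len(edges), rng))
        for _ in range(n_random)
    ]
    spread = pstdev(null)
    z = (observed - mean(null)) / spread if spread > 0 else float("nan")
    return observed, mean(null), z


# Hypothetical toy network (assumed to have no self-loops or duplicate edges).
toy_network = [("A", "B"), ("B", "C"), ("A", "C"), ("A", "D"), ("D", "C"), ("C", "E")]
print(motif_z_score(toy_network))
```

A large positive z-score indicates over-representation relative to this particular null model. As noted above, such a score alone says nothing about the functional or evolutionary pertinence of the module.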
8.2. A Taxonomy of Modules Involved in Regulatory Networks
A few modules obtained constructively from directed graphs, that have been subjected to investigation, are gathered here, along with an example of topology and a probable dynamics. A more mathematical treatment of this topic may be found in ref. 56.
8.2.1. Feedback Circuits
A directed interaction pathway that is closed on itself constitutes a loop or feedback circuit. Interactions can be activating or inhibiting. Two types of feedback circuits may be identified (Fig. 10a): “positive” circuits, which include an even number of inhibitory interactions, and “negative” circuits, which include an odd number of inhibitory interactions. This terminology is justified by the fact that a vertex has an activating (positive) effect on itself in a positive circuit and an inhibitory (negative) effect on itself in a negative circuit. These two types of circuit have very different dynamic and biological properties (28). The presence of a positive circuit is required for “multi-stationarity” (28), i.e. the coexistence of several possible stationary states (57,58). Indeed, if the concentration of the molecule that corresponds to node N increases, its formation will be further activated, whereas if its concentration diminishes, its formation will be less activated. Thus, as a first approximation, the dynamics tends toward either the maximum or minimum value, and any intermediate equilibrium is metastable (Fig. 10a, left). This property can be used to build a functional toggle switch in bacterial cells (47). It may be at the heart of a differentiation process. As already mentioned, a number of positive circuits are found in natural genetic networks, and this number seems to increase in proportion to the importance of the developmental program of the organism involved.
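The parity rule and the behavior of a positive circuit can be sketched in a few lines. The first function applies the definition given above (even number of inhibitions, positive circuit; odd number, negative circuit). The second integrates a one-variable self-activating circuit; the Hill-type production term, the unit degradation rate, and the parameter values `n` and `k` are assumptions made for illustration, not taken from the cited references.

```python
def circuit_sign(interaction_signs):
    """Return +1 for a positive circuit, -1 for a negative circuit.

    interaction_signs: one entry per interaction along the closed path,
    +1 for activation and -1 for inhibition.
    """
    inhibitions = sum(1 for s in interaction_signs if s < 0)
    return +1 if inhibitions % 2 == 0 else -1


def simulate_self_activation(x0, steps=4000, dt=0.01, n=4, k=0.5):
    """Euler integration of dx/dt = x^n / (k^n + x^n) - x (assumed form)."""
    x = x0
    for _ in range(steps):
        x += dt * (x**n / (k**n + x**n) - x)
    return x


print(circuit_sign([-1, -1]))      # mutual inhibition: two inhibitions
print(circuit_sign([-1]))          # self-inhibition: one inhibition
print(circuit_sign([-1, -1, -1]))  # three inhibitions in a ring

# Two starting concentrations on either side of the intermediate equilibrium x = k.
print(simulate_self_activation(0.4))
print(simulate_self_activation(0.6))
```

Starting just below the intermediate equilibrium the concentration decays toward the low state, and starting just above it the concentration rises toward the high state, so the same circuit supports two stationary states.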
A negative circuit contributes to homeostasis (parameter stability) (28), since if the concentration of a molecule that corresponds to node N increases, its formation will be further inhibited, whereas if the concentration of the molecule decreases, its formation will be less inhibited; it therefore tends towards a stable equilibrium value (Fig. 10a,
Figure 10. Relation between topological and dynamical properties. Several topologies (above) and their associated typical dynamics (below) are shown. Arrow, activating influence; square arrowhead, inhibitory influence. a) Feedback circuits: two families are distinguished by the number of inhibitory interactions connecting a vertex to itself. A positive circuit (left) consists of an even number of inhibitory interactions. Here, A self-activates via B. If A is high, it will remain so, and B will remain low, and vice-versa. A negative circuit (center and right) has an odd number of inhibitory interactions. Here, the negative circuit in the center includes an inhibitory interaction: A self-inhibits. If A is high, it further self-inhibits, thus will diminish. If A is low, it self-inhibits less, thus will increase. The negative circuit on the right consists of three inhibitory interactions. With proper time delays and protein half-lives, the system may behave as an oscillator. b) Feedforward loops. These two families are distinguished by the coherence of action of A on C, whether this action is via B or is direct. For example, on the left, A activates C directly and indirectly via B; the action of A on C is coherent, and the resulting dynamics is rather easy to qualitatively predict. On the right, A inhibits C directly, and activates C indirectly; the action of A on C is incoherent. Several topologies exist for each family, although only one is shown here. c) Cascades. The number of steps in a cascade, as well as the duration of each step, determine the delay of the response to the initial stimulus. d) Fans. Only the single-input (SIM) and single-output modules (SOM) are represented. The SIM may provide a temporal expression program, according to an activation threshold hierarchy for regulated genes. When the single regulator A rises or descends, gene B, which is sensitive to the lowest threshold, will be activated first and deactivated last. Gene C, sensitive to the highest threshold, is activated last and deactivated first. The SOM can provide an “AND” logical gate. The single target, D, is triggered only after all A–C regulators are active.
middle). This property can be used to build a homeostatic device in bacterial cells (48). A negative circuit can also lead to more or less dampened oscillations, according to how the parameters for the interactions are set and to the half-lives of the molecules that correspond to the nodes (Fig. 10a, right). This property can be used to build an oscillator, called the 'repressilator', in bacterial cells (46). As already mentioned, unary negative circuits (auto-inhibition) are considerably over-represented in E. coli.
8.2.2. Regulatory Triangles (“Feedforward Loops”)
A “triangle” in directed graphs is also known as a feedforward loop, and consists of an input vertex (“In”) that influences a second vertex. These two vertices jointly influence an output vertex (“Out”). The feedforward loop is said to be “coherent” if the direct effect of the input vertex on the output vertex has the same sign (activating or inhibiting) as its net effect through the indirect path. If not, the loop is said to be “incoherent” (Fig. 10b). Each of these two circuit families includes four possible topologies with different dynamic and biological properties (59,60). Coherent triangles of the type represented at the left of Fig. 10b are found to be over-represented in genetic networks, both in bacteria (41) and in yeast (3). If activation of C requires simultaneous activation by A and B (A AND B), B must progressively accumulate under the effect of A in order to cross its threshold, finally allowing activation of C. Thus, this triangle filters the transients (which do not leave time for B to accumulate), responds only to persistent stimulation (which does allow B to accumulate), and quickly shuts down when A ceases to activate B and C. More generally, numerical simulations indicate that coherent triangles introduce a delay in the response when the signal goes either up or down. The same approach suggests that incoherent triangles introduce an acceleration in the response when the signal goes either up or down (59,60). While the incoherent triangle represented at the right of Fig. 10b has also been observed, it is infrequent (23). In empirical networks, it also includes an additional interaction (B self-inhibits). These characteristics make any prediction of dynamic behavior difficult, unless the model is constrained by some prior biological knowledge.
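The transient-filtering behavior of the coherent triangle with an AND requirement can be reproduced with a deliberately crude sketch. The model below is an assumption made for illustration: A is a square pulse, B relaxes toward A with first-order kinetics, and C is counted as active whenever A is on and B exceeds a threshold. The function name, rate, threshold, and pulse lengths are hypothetical.

```python
def simulate_coherent_ffl(pulse_length, total_time=20.0, dt=0.01,
                          threshold=0.5, rate=1.0):
    """Return the total time during which C is active.

    A is on (1.0) for 0 <= t < pulse_length and off (0.0) afterwards.
    B follows dB/dt = rate * (A - B), integrated with the Euler method.
    C is active only when A is on AND B is above its threshold.
    """
    b = 0.0
    c_on_time = 0.0
    steps = int(total_time / dt)
    for i in range(steps):
        t = i * dt
        a = 1.0 if t < pulse_length else 0.0
        if a == 1.0 and b > threshold:
            c_on_time += dt
        b += dt * rate * (a - b)
    return c_on_time


short_pulse = simulate_coherent_ffl(pulse_length=0.3)
long_pulse = simulate_coherent_ffl(pulse_length=5.0)
print(short_pulse, long_pulse)
```

In this sketch the short pulse ends before B reaches its threshold, so C never switches on, whereas the long pulse activates C after a delay set by the accumulation of B. C switches off as soon as A disappears, because the AND condition is no longer met.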
8.2.3. Cascades
A cascade is a chain of vertices that influence one another sequentially. For the case in which each interaction introduces a non-negligible temporal delay (for example, the time necessary for biosynthesis, i.e. transcription followed by translation in the case of genetic networks), the delay introduced by the cascade is roughly a function of its length (Fig. 10c). It is also interesting to note that cascades are often short in microorganisms (left), for which a quick reaction to an external stimulus is essential. In contrast, cascades in multicellular organisms are often long (right), thereby introducing delays that the organism can exploit for its developmental program (61). This phenomenon is accentuated by the larger quantity of introns in multicellular organisms, which increase the time required for mRNA synthesis. If several steps each introduce an amplification factor (for example, kinase cascades), the cascade globally permits strong amplification. If several interactions are cooperative, the foot of the cascade responds in a quasi all-or-none manner.
8.2.4. Combination of Cascades and Positive Circuits
A developmental program consists of a series of irreversible steps, each of which takes a relatively precise length of time (42). One way to satisfy these constraints would be to introduce a cascade that inserts a delay, followed by a positive circuit that irreversibly locks the mechanism. A linear suite of such mechanisms would implement the series of steps that constitute the entire program. Control points could be added to this simplified diagram, each of which would represent a prerequisite for the next key step. For example, attaining a minimum mass would be a condition for the next cell division.
8.2.5. Fans
A fan consists of a few upstream vertices (“In”) that influence some downstream vertices (“Out”) under a closure condition.
This condition means that all the incoming influences of the downstream vertices are in the fan, and reciprocally, that all the outgoing influences of the upstream
vertices are in the fan. At present, we can say nothing about the general case of fan dynamics; we discuss only single-output modules and single-input modules. A single-output module is a set of vertices that jointly and exclusively regulate a single output vertex (“multigenic regulation”), allowing, for example, fine regulation of genes by combining numerous inputs, each of which represents one aspect of the state of the cell. In known cases, the single-output module implements an “AND” logical gate; that is, all the regulators must be present and activating in order for the output to be activated (Fig. 10d, right). A single-input module comprises a vertex that regulates a set of output vertices which have no other incoming influence ("pleiotropic regulation"). In empirical genetic networks, single-input modules are over-represented. In some cases, it has been surmised that a single-input module allows the sequential firing of its output genes. It suffices that the concentration of the input regulator steadily increases, and that the regulated output genes respond to different doses of this regulator. Indeed, the most sensitive gene would consequently trigger the early events of the global response, while the least sensitive would trigger the late events (Fig. 10d, left). This effectively unfolds in time the effects of a single regulator and may be used for instance in sequential building of a structure or sequential response to a stress (62).
8.3. Community Structure
Two approaches to the partitioning problem were mentioned in Section 8.1.2. In the case of transcriptional networks, the constructive way has been the subject of much more work than the reductive one, and has been described above in great detail. The reductive way consists of applying a mathematical criterion to the whole network, so as to fragment it into sub-networks called "communities", thus revealing a "community structure".
Few conclusive results have been obtained in the case of molecular networks, and this second approach will only briefly be reviewed below. A community is a group of vertices in which there are more edges between vertices within the group than to vertices outside of it. Although
the partitioning of a network into such groups is a well-studied problem, older algorithms tend to work well only in special cases (63). Several algorithms have been proposed and have been shown to reliably extract known community structure in real world networks (64-68). However, each of these algorithms requires knowledge of the entire structure of the graph. Recently, new measures and efficient algorithms have been proposed that make it possible to draw quantitative conclusions about community structure in graphs that are not fully known and that must be explored one vertex at a time (69,70). A small fraction of the research on community structures has been devoted to molecular networks, with results that were generally less convincing than those from other fields of application. This lack of a clear success suggests a more entangled, less modular structure of molecular networks (55), as compared to social or technological ones. A few applications to molecular networks are surveyed below. Spirin and Mirny applied an algorithm based on superparamagnetic clustering to a network of protein-protein interactions. In line with the underlying biology, they found two types of modules: stable protein complexes (e.g. splicing machinery), and dynamic functional units (e.g. signaling cascades) (71). Bader and Hogue described a novel graph theoretic clustering algorithm called "Molecular Complex Detection". The method is based on vertex weighting by local neighborhood density and outward traversal from a locally dense seed protein to isolate the dense regions according to given parameters. They applied their algorithm to large protein-protein interaction networks from the yeast S. cerevisiae, and identified densely connected regions that may represent molecular complexes (72). Gagneur et al. have introduced an algorithmic method, termed modular decomposition, that defines the organization of protein-protein interaction networks as a hierarchy of nested modules. The method is applied to experimental data on the pro-inflammatory tumor necrosis factor-α / NF-κB transcription factor pathway (73). In one of the very few attempts to tackle gene networks, Wilkinson and Huberman have presented a method for creating a network of gene co-occurrences in article abstracts from the literature, and for partitioning it into communities of related genes. They applied their method to produce communities of genes related to colon cancer. They found cases of
overlapping communities, in which a node common to two communities was an indicator of a link between two groups of related genes (74).
9. Spatial Aspects
Little is known about how these genetic networks unfold in the three-dimensional geometry of live cells. Yet, circumstantial evidence abundantly shows that transcription is a spatially heterogeneous process. There is also some evidence that this spatial heterogeneity is relevant to the physiology of transcription. Circumstantial evidence mostly relies on morphological observations. In particular, imaging of metazoan nuclei has unveiled thousands of small mRNA foci that each comprise 14-20 different transcription units (75,65). A link between chromosomal structure and transcriptional dynamics is also provided by the observation that mRNA synthesis inhibitors disperse DNA sequences originally confined to distinct regions and increase DNA loop mobility (66,67). As far as prokaryotes are concerned, the 3-D clustering model predicts that transcription foci and long-range DNA loops should dynamically self-organize in the nucleoid, especially around the most active transcription units. Indeed, morphological approaches on live cells demonstrate discrete foci, each comprising hundreds of RNA polymerases engaged on the rRNA encoding operons in Bacillus subtilis (76), or containing transcription factors in E. coli (77). Various structural and biochemical approaches suggest that, in actively growing E. coli cells, there are ~50 independent loop domains per genome (78,79). 3-D focusing is expected to depend on transcriptional activity. Indeed, starvation reduces the transcription rate and disperses the polymerases (76). Along the same lines, in vivo structural data indicate that the number of loop domains drops as bacteria are grown on a poor medium (80) or enter stationary phase (81).
This morphological evidence takes a more precise meaning if it is balanced by transcriptomic data that show regularities in the respective positioning of co-regulated or co-expressed genes along chromosomes. Such regularities have been categorized in two classes called 1-D and 3-D clustering for reasons discussed below (82). 1-D clustering means
that co-regulated genes tend to be found next to each other along the 1-D sequence of DNA. In the literature, 1-D clustering has been reported for co-expressed animal genes (83-87), for co-regulated S. cerevisiae and E. coli genes (16,88-90), and for S. cerevisiae genes encoding subunits of the same stable complex (91). 3-D clustering means that co-regulated genes are positioned at regular intervals along the chromosome. Both in S. cerevisiae (88,92) and in E. coli (89), co-regulated or co-expressed genes tend to be regularly spaced along the chromosome. The same spacing is observed for most transcription factors within the E. coli circular chromosome in the nucleoid, or within any of the 32 yeast chromosome arms in the nucleus. Importantly, the periods differ among the yeast chromosome arms and among four different E. coli strains. Furthermore, in E. coli, target genes are located at periodic intervals from the gene encoding their transcription factor. This transcription factor/target pattern is more pronounced than the target/target one, thereby suggesting that the former causes the latter (89). In S. cerevisiae, no specific location of the transcription factor's gene with respect to its target genes is observed (88). Based on these results, a solenoidal topology of chromosomes was proposed. Essentially, this model posits that the interacting partners, i.e. several copies of a given transcription factor and of its DNA binding sites, gather within the nucleus/nucleoid into small sub-volumes called "foci". Binding at genuine regulatory sites on DNA would thus be optimized by locally increasing the concentration of transcription factors and their binding sites. That this is the case for different sites within one gene has actually been demonstrated (93,94) and modeled (95).
The solenoidal model extrapolates from these demonstrated intragenic cases to intergenic ones to propose the mechanism of local concentration effects to account for the observed chromosomal regularities. As many transcription factors are simultaneously active and some share targets, the resulting collection of foci provides a potent self-organizational principle for the chromosome, and consequently for the functional nuclear architecture (Fig. 11). Other chromosome-long regularities have been detected by spectral or autocorrelation analysis. In E. coli, patterns in the spatial series of transcriptional activity were observed and classified into three categories: short-range, of up to 16 kbp; medium-range, over 100-125 kbp like in the
Figure 11. Solenoidal loops. The DNA fiber is shown as a coil. The gene targets of one transcription factor are shown in violet, those of another factor are shown in blue. As both target sets are periodically spaced along the fiber, most of the target genes from any of these factors are clustered in 3-D space, even though a few individual targets may not be at the proper position. Modified from ref. 83.
above studies; and long-range, over 600-800 kbp (96). Using wavelet analysis, a study encompassing 163 chromosomes from eubacteria and archaebacteria has uncovered spatial patterns and correlated some of them with various organism-specific features (97). For both B. subtilis and E. coli, Carpentier et al. found that the co-expression of genes varies as a function of the distance between the genes along the chromosome. They observed long-range correlations, i.e. the changes in the level of expression of any gene are correlated either positively or negatively with the changes in the expression level of other genes located at well-defined long-range distances. They also found short-range correlations, which they interpreted to suggest that the location of these co-expressed genes corresponds to DNA turns on the nucleoid surface (14–16 genes). They explained their results by a solenoidal structure based on two types of spirals (short and long). The
long spirals would be uncoiled expressed DNA while the short ones would correspond to coiled unexpressed DNA (99).
10. Conclusion and Perspectives
Biologists are fond of wiring diagrams abstracting the system's components and their interactions. Network-based approaches extend this common viewpoint, while providing a well-paved path to more formal analysis. In particular, a simple topological or dynamical analysis may sometimes allow one to reject a biologist's model with little effort. A deeper analysis may shed light on the plausible mechanisms of a well-described but poorly understood phenomenon. Besides this explanatory capacity, the analysis may in different settings be of predictive value or increase the efficiency of subsequent experimental testing. However, what often appears in the biologist's cartoons and is not directly amenable to network-based analysis stricto sensu is the spatial aspect of the biological process under scrutiny. As we have seen, this gap is only now starting to be filled. Many questions remain open. The immersion of the network in a wider web of heterogeneous interactions (see Chapter 7), and into the geometry of the cell (see Section 9) will be crucial for getting closer to biological realism, not so much out of "bio-envy", but because it is already clear that this will result in qualitative jumps in the dynamical features of the system (99,93). Models attempting to account for the evolutionary shaping of today's networks have recently flourished, and will continue to become increasingly sophisticated as they feed on an increasing staple of data, but no conceptual revolution is foreseeable on that side (see Chapter 8). As discussed in Section 8, the modular approach has yet to deliver on its promises of recomposing networks from isolated subnetworks, and of hinting at how different levels of regulation intermingle. Thus far, the modular approach has developed a parallel system of description rather than contributed to the rationalization of the notion of biological function. The role of noise is far from being fully explored. Despite some progress, we do not yet understand how robust behaviors emerge amidst stochastic molecular events, and how cells control or use noise in regulatory networks.
Breakthroughs are expected on that side. Part of the answers will come from the wide application of some pioneering techniques that make it possible to monitor molecular events, including gene expression, in single cells (e.g. ref. 100). To tackle these issues, biologists, physicists, chemists, computer scientists and mathematicians are needed. Together, they pave the way for a new approach in life sciences dubbed "Systems biology". This renewal of integrative approaches in biology has its roots in the legacy of quantitative and theoretical biology, and in the conceptual developments on self-organizing systems in the 1980s. In accord with the "-omics" era, it shows a molecular slant.

References

1. Holstege, F.C.P. et al. (1998). Dissecting the regulatory circuitry of a eukaryotic genome. Cell. 95, 717-728.
2. Kohn, K.W. (1999). Molecular interaction map of the mammalian cell cycle control and DNA repair systems. Mol Biol Cell. 10, 2703-34.
3. Guelzim, N., Bottani, S., Bourgine, P. and Képès, F. (2002). Topological and causal structure of the yeast genetic network. Nature Genetics. 31, 60-63.
4. Harbison, C.T., Gordon, D.B., Lee, T.I., Rinaldi, N.J., Macisaac, K.D., Danford, T.W., Hannett, N.M., Tagne, J.B., Reynolds, D.B., Yoo, J., Jennings, E.G., Zeitlinger, J., Pokholok, D.K., Kellis, M., Rolfe, P.A., Takusagawa, K.T., Lander, E.S., Gifford, D.K., Fraenkel, E. and Young, R.A. (2004). Transcriptional regulatory code of a eukaryotic genome. Nature. 431, 99-104.
5. Schena, M. (2000). Microarray biochip technology. BioTechniques Books Division (Eaton Publishing, Natick, MA, USA).
6. Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P.O. and Herskowitz, I. (1998). The transcriptional program of sporulation in budding yeast. Science. 282, 699-705.
7. Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D. and Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 9, 3273-97.
8. Park, T. et al. (2003). Evaluation of normalization methods for microarray data. BMC Bioinformatics. 4, 33.
9. Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J. and Speed, T.P. (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 30.
10. Quackenbush, J. (2002). Microarray data normalization and transformation. Nature Genet. Supp. 32, 496-501.
11. Chen, H.Y., Yu, S.L., Chen, C.H., Chang, G.C., Chen, C.Y., Yuan, A., Cheng, C.L., Wang, C.H., Terng, H.J., Kao, S.F., Chan, W.K., Li, H.N., Liu, C.C., Singh, S., Chen, W.J., Chen, J.J. and Yang, P.C. (2007). A five-gene signature and clinical outcome in non-small-cell lung cancer. N Engl J Med. 356, 11-20.
12. Chan, W.Y., Lee, T.L., Wu, S.M., Ruszczyk, L., Alba, D., Baxendale, V. and Rennert, O.M. (2006). Transcriptome analyses of male germ cells with serial analysis of gene expression (SAGE). Mol Cell Endocrinol. 250, 8-19.
13. Lieb, J.D., Liu, X., Botstein, D. and Brown, P.O. (2001). Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nat Genet. 28, 327-34.
14. Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, I., Zeitlinger, J., Jennings, E.G., Murray, H.L., Gordon, D.B., Ren, B., Wyrick, J.J., Tagne, J.B., Volkert, T.L., Fraenkel, E., Gifford, D.K. and Young, R.A. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 298, 799-804.
15. Schreiber, J., Rolfe, P.A., Gifford, D.K., Fraenkel, E., Bell, G.I. and Young, R.A. (2004). Control of pancreas and liver gene expression by HNF transcription factors. Science. 303, 1378-81.
16. Wagner, A. (1999). Genes regulated cooperatively by one or more transcription factors and their identification in whole eukaryotic genomes. Bioinformatics. 15, 776-784.
17. Kielbasa, S.M., Korbel, J.O., Beule, D., Schuchhardt, J. and Herzel, H. (2001). Combining frequency and positional information to predict transcription factor binding sites. Bioinformatics. 17, 1019-26.
18. Rajewsky, N., Vergassola, M., Gaul, U. and Siggia, E.D. (2002). Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics. 3, 30.
19. Xuan, Z. et al. (2005). Genome-wide promoter extraction and analysis in human, mouse, and rat. Genome Biology. 6, R72.
20. Benos, P.V., Bulyk, M.L. and Stormo, G.D. (2002). Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res. 30, 4442-4451.
21. Gifford, D.K. (2001). Blazing pathways through genetic mountains. Science. 293, 2049-51.
22. De Jong, H. (2002). Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol. 9, 67-103.
23. Képès, F. (2003a). Integrative biology of molecular networks and beyond. Biology International. 44, 48-60.
24. Liang, S., Fuhrman, S. and Somogyi, R. (1998). Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pac Symp Biocomput. 1998, 18-29.
25. Kauffman, S., Peterson, C., Samuelsson, B. and Troein, C. (2003). Random Boolean network models and the yeast transcriptional network. Proc Natl Acad Sci U.S.A. 100, 14796-9.
26. Serra, R., Villani, M. and Semeria, A. (2004). Genetic network models and statistical properties of gene expression data in knock-out experiments. J Theor Biol. 227, 149-57.
27. Thomas, R. (1980). Logical description, analysis, and feedback loops; in Nicolis G (ed): Aspects of Chemical Evolution. 17th Solvay Conference on Chemistry. Chichester, Wiley. 247-282.
28. Thomas, R. and d'Ari, R. (1990). Biological Feedback (CRC Press, Boca Raton, FL, U.S.A.).
29. Mendoza, L., Thieffry, D. and Alvarez-Buylla, E.R. (1999). Genetic control of flower morphogenesis in Arabidopsis thaliana: a logical analysis. Bioinformatics. 15, 593-606.
30. Bernot, G., Comet, J.P., Richard, A. and Guespin, J. (2004). Application of formal methods to biological regulatory networks: extending Thomas' asynchronous logical approach with temporal logic. J Theor Biol. 229, 339-347.
31. Matsuno, H., Doi, A., Nagasaki, M. and Miyano, S. (2000). Hybrid Petri net representation of gene regulatory network. Pac Symp Biocomput. 2000, 341-52.
32. Steggles, L.J., Banks, R., Shaw, O. and Wipat, A. (2007). Qualitatively modelling and analysing genetic regulatory networks: a Petri net approach. Bioinformatics. 23, 336-43.
33. Arkin, A., Ross, J. and McAdams, H.A. (1998). Stochastic kinetic analysis of developmental pathway bifurcation in phage Lambda-infected Escherichia coli cells. Genetics. 149, 1633-1648.
34. Glass, L. (1975). Classification of biological networks by their qualitative dynamics. J Theor Biol. 54, 85-107.
35. Turing, A.M. (1951). The chemical basis of morphogenesis. Philos Transact Royal Soc B. 237, 37-72.
36. Meinhardt, H. (1982). Models of biological pattern formation (Academic Press, London).
37. Raser, J.M. and O'Shea, E.K. (2005). Noise in gene expression: origins, consequences, and control. Science. 309, 2010-3.
38. Kerszberg, M. (2004). Noise, delays, robustness, canalization and all that. Curr Opin Genet Dev. 14, 440-5.
39. Gillespie, D.T. (1977). Exact stochastic simulation of coupled chemical reactions. J Phys Chem. 81, 2340-2361.
40. Newman, M.E.J., Strogatz, S.H. and Watts, D.J. (2001). Random graphs with arbitrary degree distribution and their applications. Phys. Rev. 64, 2.
41. Shen-Orr, S.S., Milo, R., Mangan, S. and Alon, U. (2002). Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet. 31, 64-8.
42. Yuh, C.H., Bolouri, H. and Davidson, E.H. (1998). Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science. 279, 1896-1902.
43. Istrail, S. and Davidson, E.H. (2005). Logic functions of the genomic cis-regulatory code. Proc Natl Acad Sci U.S.A. 102, 4954-4959.
44. Watts, D.J. and Strogatz, S.H. (1998). Collective dynamics of 'small-world' networks. Nature. 393, 440-2.
45. Thieffry, D., Huerta, A.M., Perez-Rueda, E. and Collado-Vides, J. (1998). From specific gene regulation to genomic networks: a global analysis of transcriptional regulation in Escherichia coli. Bioessays. 20, 433-40.
46. Elowitz, M.B. and Leibler, S. (2000). A synthetic oscillatory network of transcriptional regulators. Nature. 403, 335-338.
47. Gardner, T.S., Cantor, C.R. and Collins, J.J. (2000). Construction of a genetic toggle switch in E. coli. Nature. 403, 339-342.
48. Becskei, A. and Serrano, L. (2000). Engineering stability in gene networks by autoregulation. Nature. 405, 590-593.
49. Buchler, N.E., Gerland, U. and Hwa, T. (2003). On schemes of combinatorial transcription logic. Proc Natl Acad Sci U.S.A. 100, 5136-41.
50. Ptashne, M. and Gann, A. (1997). Transcriptional activation by recruitment. Nature. 386, 569-577.
51. Ptashne, M. and Gann, A. (2002). Genes and signals (Cold Spring Harbor Laboratory Press, Plainview, NY).
52. François, P. and Hakim, V. (2004). Design of genetic networks with specified functions by evolution in silico. Proc Natl Acad Sci U.S.A. 101, 580-5.
53. Soyer, O.S., Pfeiffer, T. and Bonhoeffer, S. (2006). Simulating the evolution of signal transduction pathways. J Theor Biol. 241, 223-32.
54. Mazurie, A., Bottani, S. and Vergassola, M. (2005). An evolutionary and functional assessment of regulatory network motifs. Genome Biol. 6, R35.
55. Dobrin, R., Beg, Q.K., Barabasi, A.L. and Oltvai, Z.N. (2004). Aggregation of topological motifs in the Escherichia coli transcriptional regulatory network. BMC Bioinformatics. 5, 10.
56. Wolf, D.M. and Arkin, A.P. (2003). Motifs, modules and games in bacteria. Curr Opin Microbiol. 6, 125-34.
57. Soulé, C. (2003). Graphic requirements for multistationarity. ComPlexUs. 1, 123-133.
58. Richard, A. (2006). Modèle formel pour les réseaux de régulation génétique. PhD thesis, University of Évry.
59. Mangan, S., Zaslaver, A. and Alon, U. (2003). The coherent feedforward loop serves as a sign-sensitive delay element in transcription networks. J Mol Biol. 334, 197-204.
60. Mangan, S. and Alon, U. (2003). Structure and function of the feed-forward loop network motif. Proc Natl Acad Sci U.S.A. 100, 11980-5.
61. Rosenfeld, N. and Alon, U. (2003). Response delays and the structure of transcription networks. J Mol Biol. 329, 645-54.
62. Kalir, S., McClure, J., Pabbaraju, K., Southward, C., Ronen, M., Leibler, S., Surette, M.G. and Alon, U. (2001). Ordering genes in a flagella pathway by analysis of expression kinetics from living bacteria. Science. 292, 2080-3.
63. Kernighan, B.W. and Lin, S. (1970). An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal. 49, 291-307.
64. Girvan, M. and Newman, M.E.J. (2002). Community structure in social and biological networks. Proc. Natl. Acad. Sci. U.S.A. 99, 7821-7826.
65. Newman, M.E.J. and Girvan, M. (2004). Finding and evaluating community structure in networks. Phys. Rev. 69, 026113.
66. Radicchi, F., Castellano, C., Cecconi, F., Loreto, V. and Parisi, D. (2004). Defining and identifying communities in networks. Proc. Natl. Acad. Sci. U.S.A. 101, 2658-2663.
67. Newman, M.E.J. (2004). Fast algorithm for detecting community structure in networks. Phys. Rev. 69, 066133.
68. Wu, F. and Huberman, B.A. (2004). Finding communities in linear time: A physics approach. Eur. Phys. J. B 38, 331-338.
69. Clauset, A., Newman, M.E.J. and Moore, C. (2004). Finding community structure in very large networks. Phys. Rev. 70, 066111.
70. Clauset, A. (2005). Finding local community structure in networks. ArXiv:physics 0503036 v1.
71. Spirin, V. and Mirny, L.A. (2003). Protein complexes and functional modules in molecular networks. PNAS. 100, 12123-12128.
72. Bader, G.D. and Hogue, C.W. (2003). An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 4, 2.
73. Gagneur, J., Krause, R., Bouwmeester, T. and Casari, G. (2004). Modular decomposition of protein-protein interaction networks. Genome Biology. 5, R57.
74. Wilkinson, D.M. and Huberman, B.A. (2004). A method for finding communities of related genes. PNAS. 101, 5241-5248.
75. Jackson, D.A., Iborra, F.J., Manders, E.M. and Cook, P.R. (1998). Numbers and organization of RNA polymerases, nascent transcripts, and transcription units in HeLa nuclei. Mol. Biol. Cell. 9, 1523-1536.
76. Lewis, P.J., Thaker, S.D. and Errington, J. (2000). Compartmentalization of transcription and translation in Bacillus subtilis. EMBO J. 19, 710-718.
77. Talukder, A.A., Hiraga, S. and Ishihama, A. (2000). Two types of localization of the DNA-binding proteins within the Escherichia coli nucleoid. Genes Cells. 5, 613-626.
78. Pettijohn, D.E. (1996). In The nucleoid, F.C. Neidhardt, Ed.-in-Chief (Am. Soc. Microbiol., Washington, DC), 2nd edition, vol. 1, 158-166.
79. Brunetti, R., Prosseda, G., Beghetto, E., Colonna, B. and Micheli, B. (2001). The looped domain organization of the nucleoid in histone-like protein defective Escherichia coli strains. Biochimie. 83, 873-882.
80. Sinden, R.R. and Pettijohn, D.E. (1981). Chromosomes in living Escherichia coli cells are segregated into domains of supercoiling. Proc. Natl. Acad. Sci. U.S.A. 78, 224-228.
81. Staczek, P. and Higgins, N.P. (1998). Gyrase and Topo IV modulate chromosome domain size in vivo. Mol. Microbiol. 29, 1435-1448.
82. Képès, F. and Vaillant, C. (2003). Transcription-based solenoidal model of chromosomes. ComPlexUs. 1, 171-180.
83. Hager, E.J. and Miller, O.L. Jr. (1991). Ultrastructural analysis of polytene chromatin of Drosophila melanogaster reveals clusters of tightly linked coexpressed genes. Chromosoma. 100, 173-186.
84. Caron, H., van Schaik, B., van der Meer, M., Baas, F., Riggins, G., van Sluis, P., Hermus, M.C., van Asperen, R., Boon, K., Voute, P.A., Heisterkamp, S., van Kampen, A. and Versteeg, R. (2001). The human transcriptome map: clustering of highly expressed genes in chromosomal domains. Science. 291, 1289-1292.
85. Spellman, P.T. and Rubin, G.M. (2002). Evidence for large domains of similarly expressed genes in the Drosophila genome. J Biol. 1, 5.
86. Boutanaev, A.M., Kalmykova, A.I., Shevelyov, Y.Y. and Nurminsky, D.I. (2002). Large clusters of co-expressed genes in the Drosophila genome. Nature. 420, 666-669.
87. Roy, P.J., Stuart, J.M., Lund, J. and Kim, S.K. (2002). Chromosomal clustering of muscle-expressed genes in Caenorhabditis elegans. Nature. 418, 975-979.
88. Képès, F. (2003b). Periodic epi-organization of the yeast genome revealed by the distribution of promoter sites. J Mol Biol. 329, 859-865.
89. Képès, F. (2004). Periodic transcriptional organization of the E. coli genome. J Mol Biol. 340, 957-964.
90. Hershberg, R., Yeger-Lotem, E. and Margalit, H. (2005). Chromosomal organization is shaped by the transcription regulatory network. Trends Genet. 21, 138-42.
91. Teichmann, S.A. and Veitia, R.A. (2004). Genes encoding subunits of stable complexes are clustered on the yeast chromosomes: an interpretation from a dosage balance perspective. Genetics. 167, 2121-2125.
92. Mercier, G., Berthault, N., Touleimat, N., Képès, F., Fourel, G., Gilson, E. and Dutreix, M. (2005). A haploid-specific transcriptional response to irradiation in Saccharomyces cerevisiae. Nucleic Acids Res. 33, 6635-6643.
93. Müller-Hill, B. (1998). The function of auxiliary operators. Molec. Microbiol. 29, 13-18.
94. Dröge, P. and Müller-Hill, B. (2001). High local protein concentrations at promoters: strategies in prokaryotic and eukaryotic cells. Bioessays. 23, 179-183.
95. Vilar, J.M. and Leibler, S. (2003). DNA looping and physical constraints on transcription regulation. J Mol Biol. 331, 981-9.
96. Jeong, K.S., Ahn, U. and Khodursky, A.B. (2004). Spatial patterns of transcriptional activity in the chromosome of Escherichia coli. Genome Biol. 5, R86.
97. Allen, T., Price, N.D., Joyce, A.R. and Palsson, B.O. (2006). Long-range periodic patterns in microbial genomes indicate significant multi-scale chromosomal organization. PLoS Comput Biol. 2, 2.
98. Carpentier, A.S., Torresani, B., Grossmann, A. and Henaut, A. (2005). Decoding the nucleoid organisation of Bacillus subtilis and Escherichia coli through gene expression data. BMC Genomics. 6, 84.
99. Takahashi, K., Arjunan, S.N. and Tomita, M. (2005). Space in systems biology of signaling pathways--towards intracellular molecular crowding in silico. FEBS Lett. 579, 1783-8.
100. Levsky, J.M., Shenoy, S.M., Pezo, R.C. and Singer, R.H. (2002). Single-cell gene expression profiling. Science. 297, 836-40.
CHAPTER 5

PROTEIN INTERACTION NETWORKS

Kai Tan and Trey Ideker
Department of Bioengineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
1. Introduction

Every living cell is governed by a vast network of interacting proteins, RNA, DNA, metabolites, and other molecules. Interactions among proteins are especially crucial to a wide variety of cellular processes: assembly of the structural compartments of a cell such as the cytoskeleton and nuclear pore; signal transduction pathways such as the classical mitogen-activated protein kinase (MAPK) cascade involved in pheromone signaling; enzyme-protein substrate interactions; and assembly of large molecular machines such as DNA polymerase and the proteasome. Knowledge of the stable and transient protein interactions in a cell facilitates functional annotation of novel genes and provides insight into its higher-order organization. Considered individually, protein interactions stimulate the formulation of hypotheses that can be tested experimentally. For example, a membrane protein found to interact with a transcription factor might seem at first to be a "false positive", but such findings have also led to unexpected new insights into signal transduction, as in the case of Notch and Suppressor of hairless (1) or the SREBPS transcription factors that localize to the ER membrane (2).
Further, when combined with diverse large-scale data such as microarray gene expression profiles (3) or genomic phenotypes (4-6), protein interaction networks provide a more complete picture of cellular pathways and responses than has ever before been available. Such an integrated network is useful because it provides a lucid means of summarizing existing biological knowledge about molecular behavior. Recent years have witnessed an explosive growth of research on protein interactions and networks, with new experimental techniques, data sets, analyses, and modeling methods being published at an ever increasing rate. In particular, the last few years have seen the arrival of large-scale protein interaction data sets from the multicellular organisms fruit fly (7) and round worm (8). Biologists are now faced with the challenge of deciphering these complex metazoan networks with the ultimate goal of describing the network of protein interactions in humans. In this chapter, we summarize current technologies for generating large-scale protein interaction data, as well as for visualizing and modeling protein interaction networks together with complementary large-scale data of various types. We cover recent work to extend and compare these models across different species or biological conditions, and we describe efforts to understand the evolution and dynamics of protein interaction networks.

2. Methodologies to Obtain Protein Interaction Data

Traditionally, protein interactions have been studied individually by genetic, biochemical, and biophysical techniques. However, the speed with which protein sequences are now discovered (or predicted) has created a need for high-throughput methods for interaction detection as well. Consequently, a variety of experimental and computational approaches have been introduced in the past several years that can tackle the problem at large scale, resulting in a vast amount of interaction data in the public domain.
As described in the following text, yeast two-hybrid and mass spectrometry (MS) technologies aim to detect physical binding between proteins, whereas genetic interactions and computational methods seek to predict protein functional associations.
Such functional associations may or may not result from physical binding.

2.1. Experimental Technologies to Identify Protein-Protein Interactions

A variety of methods are now available for measuring protein-protein interactions, such as co-immunoprecipitation (9), the two-hybrid system (10; Fig. 1a), and the glutathione-S-transferase (GST) pull-down assay (11), the first two being the most widespread. Many of the protein-protein interactions that occur in vivo are maintained when a cell is lysed under non-denaturing conditions. Co-immunoprecipitation takes advantage of this fact to detect and identify physiologically relevant protein-protein interactions. The principle is straightforward: if protein X is immunoprecipitated with an antibody to X, then protein Y, which is stably associated with X in vivo, may also precipitate in vitro. To identify novel associated proteins after immunoprecipitation, mass spectrometry has become the method of choice because of its sensitivity, speed, and ability to identify post-translational modifications (12).

Tandem mass spectrometry (MS/MS) is typically used to identify proteins from complex mixtures (Fig. 1b). The protein mixture is digested to form peptides which are introduced into the first mass spectrometer to separate them according to mass (detected as a mass-to-charge ratio). Peptides of a fixed size are selected and directed towards a so-called "collision cell" in which the peptides collide with molecules of an inert gas (such as argon) and break apart into fragments. The resulting fragments are analyzed by a second mass spectrometer which measures the mass of each fragment to produce a peptide "fragmentation profile". The peptides then serve as surrogate markers for the protein sequence. Proteins are identified by searching the resulting peptide mass fingerprint through sequence databases.
To identify novel protein interactions, co-immunoprecipitation can be used initially to collect a mixture of interacting proteins followed by protein identification by MS/MS.
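The database-search step described above can be illustrated with a minimal sketch. It is a deliberate simplification: it matches intact peptide masses only and ignores the fragmentation spectra that real MS/MS search engines score. The protein names and sequences are hypothetical, the digestion rule (cleavage after K or R unless followed by P) is the usual approximation for trypsin rather than something stated in this chapter, and the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(sequence):
    """Cut after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        nxt = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def identify(observed_masses, database, tol=0.01):
    """Rank proteins by the number of observed masses they can explain."""
    scores = {}
    for name, seq in database.items():
        theoretical = [peptide_mass(p) for p in tryptic_digest(seq)]
        scores[name] = sum(
            any(abs(m - t) <= tol for t in theoretical)
            for m in observed_masses
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical two-protein "database" and a simulated measurement.
database = {"protA": "MKTAYIAKQR", "protB": "GSHMLEDPRVK"}
observed = [peptide_mass(p) for p in tryptic_digest(database["protB"])]
ranking = identify(observed, database)
print(ranking[0][0])  # best-scoring protein
```

In practice the tolerance, the handling of missed cleavages and modifications, and the scoring function all matter a great deal; the sketch only shows why peptides can act as surrogate markers for the protein they came from.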
Figure 1. Principles of two high-throughput technologies for identifying protein interactions. [a] Yeast two-hybrid system. Typical two-hybrid screens use a library of random DNA or cDNA fused to a transcriptional activation domain (AD), expressed in yeast ('preys'; circles denote plasmids). The library clones are mated to a strain of opposite mating type that expresses a protein of interest ('bait', B) as a fusion to a DNA-binding domain (DBD). If bait and prey interact in the resulting diploid cells, they reconstitute a transcription factor, which activates a reporter gene whose expression allows the diploid cell to grow on selective media (here, without histidine). Positive clones have to be picked, their DNA isolated and the encoded plasmids sequenced in order to identify interacting proteins. Reproduced with permission from (13). [b] Mass spectrometry. Intact proteins are proteolytically digested. The resulting peptide mixture is fractionated and introduced into a mass spectrometer. The mass spectrometer is responsible for separating peptide ions by their mass-to-charge (m/z) ratio. The peptides then serve as surrogate markers for the protein sequence. Proteins are identified by searching the resulting mass spectra through sequence databases.
In the two-hybrid system (Fig. 1a), a protein “bait” of interest (B) is fused to the DNA binding domain (DB) of a transcription factor such as Gal4p. A second “prey” protein (P) is fused to the transcriptional activation domain (AD) of the same transcription factor. A physical interaction between B and P reconstitutes a functional transcription factor that can activate expression of a reporter gene. Usually, multiple reporter genes that allow growth selection on different media are used to increase the specificity of detection. Because the two-hybrid system is carried out in vivo and only requires the manipulation of DNA, it is amenable to automation and high-throughput methods. Currently, MS/MS is the most practicable way to identify the components of a protein complex but typically does not provide information about interaction topology. In this regard, the two-hybrid system can provide complementary information about direct interactions, revealing which specific proteins bind to which others within a protein complex or signaling pathway. Both yeast two-hybrid technology (14,15) and co-immunoprecipitation followed by MS (16,17) were initially applied in the yeast Saccharomyces cerevisiae (baker’s yeast, a model eukaryotic cell) to generate large-scale protein interaction data. More recently, yeast two-hybrid technology has also been used to generate large-scale protein interaction data in the multicellular organisms Drosophila melanogaster (fruit fly) (7) and Caenorhabditis elegans (round worm) (8). Protein interactions do not always represent physical binding events. For example, genetic interaction, in which two gene mutations have a combined effect not exhibited by either mutation alone, constitutes yet another interaction type that is being measured at high throughput. 
Two major types of genetic interactions are synthetic lethal interactions, in which mutations in two nonessential genes are lethal when combined; and suppressor interactions, in which one mutation is lethal but combination with a second restores cell viability. Screens for genetic interactions have been used extensively to shed light on pathway organization in model organisms (18-21), while in humans, genetic interactions are critical in linkage analysis of complex diseases (22) and in high-throughput drug screening (23). For species such as yeast, recent
experiments have defined large genetic networks cataloguing thousands of such interactions (20,24-26).

2.2. Computational Approaches to Predict Protein-Protein Interactions

In the late 1990s, several related methods were proposed for predicting protein interactions from DNA sequence information; they received much attention owing to the increasing number of complete genomes becoming available. These methods relied on the exploitation of "genomic context" in the form of structural or evolutionary constraints. One form of genomic context is the co-occurrence of orthologous genes across entire genomes, which defines a phylogenetic profile (27,28). Such a profile associates each gene with a binary representation of the presence/absence of its orthologs in different genomes. Genes that "travel" together during evolution are assumed to be involved in similar cellular processes. It is then possible to predict the functional association of genes that possess similar profiles. This method becomes more powerful with an increasing number of genomes because this allows more accurate profiles to be constructed. However, evolutionary processes such as gene duplication, loss, and horizontal gene transfer could hamper accurate construction of phylogenetic profiles (29).

Another genomic-context-based approach (30,31) exploits the notion of gene fusion, in which several genes in one species are merged or concatenated in other species into a single gene which encodes a multifunctional, multidomain protein. This event is maintained by selection, possibly due to the selective advantage of a decreased regulatory load (30). Proteins that are fused in one genome are likely to interact, physically or at least functionally, in other genomes.

An approach analogous to the gene fusion method includes analysis of gene neighborhoods in genomes (32,33).
The basic assumption is that genes which interact or are functionally associated tend to be located in physical proximity to each other on the genome presumably to increase the probability of co-transferring genes whose products must all be present for functionality. The most apparent case of this phenomenon occurs in prokaryotes in which related genes are often co-localized into
so-called “operons”. Although operons do not generally occur in eukaryotic systems, it is still possible to infer functional association of a pair of genes if their homologs tend to be close in many genomes. A new trend in de novo protein interaction prediction is to search for coordinated mutations between the sequences of interacting proteins, e.g. as has been observed for ligand-receptor interactions (34,35). The assumption is that the interacting proteins must co-evolve to preserve the interaction over time and thus the functional activity mediated by the interaction. Pazos and Valencia have used such a method to perform large-scale predictions of interactions with high statistical significance, resulting in 2,742 putative protein interactions for E. coli (35). Ramani and Marcotte introduced further methods to align phylogenetic trees of interacting protein families to define specific interaction partners (36,37). They suggest a model for the evolution of interacting protein families in which interaction partners are duplicated in coupled processes. Other computational methods have been developed for predicting novel protein interactions through analysis of examples of known interactions. The common theme here is to transfer the existing annotation of a known gene to a newly sequenced gene product. This is based on the concept that sequence and structural similarities between gene products suggest functional similarities. One type of annotation transfer is based on structural data of known interacting proteins. New interactions can be inferred between pairs of proteins for which the sequences are compatible with known crystal structures of heterodimers (38,39) and between pairs of proteins with domains that are often observed in interacting proteins (40). 
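As an illustration of the genomic-context methods introduced at the start of this section, the comparison of phylogenetic profiles can be sketched in a few lines. The gene names, the five-genome profiles and the distance threshold below are all hypothetical, and Hamming distance is only one of several similarity measures that could be used; this is a sketch of the idea, not the procedure of refs. 27 and 28.

```python
from itertools import combinations

# Hypothetical presence (1) / absence (0) of orthologs in five genomes.
profiles = {
    "geneA": (1, 0, 1, 1, 0),
    "geneB": (1, 0, 1, 1, 0),
    "geneC": (0, 1, 0, 1, 1),
    "geneD": (1, 0, 1, 0, 0),
}


def hamming(p, q):
    """Number of genomes in which the two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


def predict_associations(profiles, max_distance=1):
    """Return gene pairs whose profiles differ in at most max_distance genomes."""
    pairs = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        d = hamming(p1, p2)
        if d <= max_distance:
            pairs.append((g1, g2, d))
    return pairs


predicted = predict_associations(profiles)
for g1, g2, d in predicted:
    print(g1, g2, d)
```

With only five genomes, identical profiles arise easily by chance, which is one way to see why the text notes that the method becomes more powerful as the number of sequenced genomes grows.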
Another type of annotation transfer is the “interolog” approach where a pair of proteins in one species is predicted to interact if their best sequence matches in another species were reported to interact (41,42). Large protein-protein interaction data sets are now available for a variety of species (Table 1) including S. cerevisiae (16,17,15,43,14), H. pylori (44), E. coli (45), D. melanogaster (7), C. elegans (8,46), and H. sapiens (47). In light of these vast scientific resources made available through experimental and computational analyses, several databases storing interaction data are now in wide usage (Table 2). Most of these databases contain interaction data derived from both high-throughput
analyses and small-scale experiments. Besides being data warehouses, some of these databases have developed new methods for data exchange and visualization to facilitate the study of molecular interaction networks.

Table 1. Current estimates of the volume of experimental protein-protein interaction data available in the public domain. Spoke and matrix are the models used to derive binary interactions from co-immunoprecipitation/mass spectrometry data. Spoke: only interactions between the bait protein and the rest of the complex members are allowed; matrix: all possible interactions among all members of a complex are allowed.

Species / Method                              Number of Proteins                Number of Interactions

H. pylori
  Two-hybrid assays                           710 [Rain et al. 2001]            1425

E. coli
  Co-immunoprecipitation/Mass spectrometry    530 [Butland et al. 2005]         5420 (spoke)

S. cerevisiae
  Two-hybrid assays                           934 [Uetz et al. 2000]            854
                                              4131 [Ito et al. 2001]            3986
  Co-immunoprecipitation/Mass spectrometry    1361 [Gavin et al. 2002]          3221 (spoke), 31304 (matrix)
                                              1560 [Ho et al. 2002]             3589 (spoke), 25333 (matrix)
  Synthetic lethal assays                     1029 [Tong et al. 2004]           3627
  DIP (small scale experiments)               1629                              5068

C. elegans
  Two-hybrid assays                           2898 [Li et al. 2004]             4027

D. melanogaster
  Two-hybrid assays                           7048 [Giot et al. 2003]           20405

H. sapiens
  Co-immunoprecipitation/Mass spectrometry    32 [Bouwmeester et al. 2004]      221
  HPRD (small scale experiments)              2750 [Peri et al. 2004]           10534
Protein Interaction Networks
Table 2. Brief overview of protein interaction databases.

Protein interactions
ADVICE | http://advice.i2r.a-star.edu.sg
BIND | http://bind.ca/
Bioverse | http://bioverse.compbio.washington.edu/
Curagen | http://curatools.curagen.com/pathcalling_portal/index.htm
CYGD/MIPS | http://mips.gsf.de/services/ppi
DIP | http://dip.doe-mbi.ucla.edu/
GRID | http://biodata.mshri.on.ca/grid/servlet/Index
HPRD | http://www.hprd.org/
Hybrigenics/PIMRider | http://pim.hybrigenics.com/pimriderext/common/
MINT | http://mint.bio.uniroma2.it/mint/
PLEX | http://apropos.icmb.utexas.edu/plex/plex.html
STRING | http://string.embl.de/

Protein networks/pathways
Biobase/Transpath | http://www.biobase.de/pages/products/transpath.html
Biocarta | http://www.biocarta.com/genes/index.asp
Genmapp | http://www.genmapp.org/links.html
Reactome | http://www.reactome.org/
3. Computational Modeling of Protein Networks

3.1. Visualization of Protein Interaction Networks

Numerous articles and textbooks include figures showing different types of molecules and interactions between them. However, these figures typically invoke a limited number of components to describe an isolated biochemical process or signaling pathway, are carefully tailored to illustrate a predetermined concept, and rely heavily on accompanying textual descriptions (48). In contrast, there is a pressing need for visual representations that can systematically present and organize the extremely large amounts of protein-interaction and expression data rapidly accumulating in the wake of two-hybrid screens, DNA microarray technology, and high-throughput proteomics. Such displays
are not hand-tailored to illustrate a foregone conclusion, but should ideally stimulate the discovery of new protein functions and biological relationships. As the raw data become increasingly complex with each type of supplemental information, tools that are both visual and interactive become increasingly important for emphasizing and extracting the key features.

[Figure 2, panel (a): list of interactions]
Ypt1 - YPL246C
Akr2 - YHR105W
Yip1 - YGL161W
YPL246C - Vam7
YGL161W - Pep12
YPL246C - YHR105W
YHR105W - YGL161W

[Figure 2, panel (b): network layout of the same proteins: Ypt1, Akr2, Yip1, YPL246C, YHR105W, YGL161W, Vam7, Pep12]

Figure 2. List [a] versus graphical network representation [b] of protein interactions. The two representations differ in localization (a protein occurs multiple times in the list but exactly once in the layout); context (in the layout, the neighbors of a protein are easily identified and studied); and mental image (the network layout allows proteins to be memorized by position) (51). In positioning the nodes, secondary information can be employed to guide the layout; for example, proteins can be spatially grouped by localization or function. In this way, a particular arrangement of the proteins can even increase the information content.
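The difference between the two representations in Fig. 2 can be illustrated with a minimal Python sketch. It takes the seven interaction pairs of panel (a) and builds an adjacency map in which every protein appears exactly once, as in panel (b). The function and variable names are illustrative only and are not taken from any of the tools discussed in this chapter.

```python
from collections import defaultdict

# Interaction pairs exactly as listed in Fig. 2(a).
INTERACTIONS = [
    ("Ypt1", "YPL246C"),
    ("Akr2", "YHR105W"),
    ("Yip1", "YGL161W"),
    ("YPL246C", "Vam7"),
    ("YGL161W", "Pep12"),
    ("YPL246C", "YHR105W"),
    ("YHR105W", "YGL161W"),
]


def build_adjacency(pairs):
    """Return {protein: set of interaction partners} for an undirected network."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


adjacency = build_adjacency(INTERACTIONS)

# Localization: a protein may occur several times in the list ...
occurrences_in_list = sum(pair.count("YHR105W") for pair in INTERACTIONS)
# ... but is a single key in the network representation.
# Context: its neighbors are available in one lookup.
neighbors = adjacency["YHR105W"]
```

In the list, YHR105W has to be found by scanning every pair, whereas in the adjacency map its partners are retrieved directly, which mirrors the "localization" and "context" advantages named in the caption.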
Although protein-protein interactions were originally reported as lists of protein pairs (e.g. ref. 13), more and more often they are represented graphically as two-dimensional networks. Fig. 2 illustrates the difference on a small set of protein-protein interactions in yeast: while both representations contain identical information, the network representation (called a layout) has fundamental advantages with respect to human perception. Hand-formatted maps (such as those in Michal 1998; Kohn 1999) are usually of high quality, but are available only for very limited datasets because of the large amount of work required to construct them (49,50). Accordingly, the large numbers of protein interactions in public databases (Table 2) have stimulated the development of a range of automated layout algorithms to visualize them. Several software tools are available for visualizing physical or genetic interaction networks. Examples of network visualization tools
include: Cytoscape (52), Osprey (53), Pajek (54), ProViz, and WebInterViewer. These software packages have either been designed to visualize protein interactions or can be customized for that task. Such software enables a variety of routine operations on the network: automated network layout; association of data attributes (such as gene expression profile and gene ontology) with different network components; mapping of data attributes to visual properties (such as node and edge color, shape and size); and network filtering. Specific features of each program are compared side by side in Table 3.
Table 3. Protein network visualization and analysis tools.

Feature | Cytoscape V2.0 | Osprey V1.2.0 | Pajek V1.01 | ProViz V1.0 | WebInterViewer

General
Website | http://www.cytoscape.org | http://biodata.mshri.on.ca/osprey | http://vlado.fmf.uni-lj.si/pub/networks/pajek | http://cbi.labri.fr/eng/proviz.htm | http://interviewer.inha.ac.kr
License | Free | Free for educational, research, and not-for-profit use | Free for noncommercial use | Free | Free
Platform | Linux, Mac, Windows | Linux, Mac, Windows | Windows | Linux | Linux, Mac, Windows

Data Exchange
Import: Files | Flat file (space-delimited interactions, node and edge attributes, gene functional annotations), GML | Flat file (tab-delimited gene names, interactions, experimental system, source, literature evidence) | Flat file (space-delimited gene names, interactions), Vega graphs, Gedcom, Ucinet DL… | Tulip, PSI-MI (XML) | Flat file (tab-delimited gene names and interactions), GML, XML
Import: Databases | - | GRID interaction data | - | IntAct interaction data | DB on InterViewer3 server or local data server
Import: Additional | Expression data, arbitrary data attributes on nodes and edges | - | - | - | -
Export: Text files | Flat file (space-delimited, genes, interactions), GML | Flat file (tab-delimited, genes and interactions) | Flat file (space-delimited, node and edge attributes), Vega graphs, Gedcom, Ucinet DL… | Tulip, PSI-MI | Flat file (tab-delimited, genes and interactions), XML, EdgeCnt, IG1
Export: Image files | EPS, JPEG, PDF, PNG, PS, SVG | JPEG, PNG, SVG | BMP, EPS/PS, Kinemage, MDL, SVG, VRML | PNG | BMP (with copyright note)

Visualization
Graph layout | 5 algorithms | 7 algorithms | 7 algorithms | 3 algorithms | 2 algorithms
Data attributes: Proteins | All imported properties | GO terms | All imported properties | GO terms | -
Data attributes: Interactions | All imported properties | Source, experimental system (e.g. two-hybrid), literature evidence | All imported properties | PSI-MI terms | -
Visual mappings: Proteins | Color, shape, line type, size, label, font | Color | Color, line type, size | - | -
Visual mappings: Interactions | Color, line type, arrow, label, font | Color | Color, line type, arrow | - | -
Filters: Proteins | Attribute values | GO terms | Attribute values | GO terms | -
Filters: Interactions | Type (e.g. protein-DNA) | Experimental system, source | Attribute values | PSI-MI terms | -
Filters: Network | Node degree, distance | Node degree, distance | Node degree, distance | Node distance | Node distance

Analysis
Multiple data superposition | + | + | + | - | -
Subnetwork identification | MCODE, ActiveModules plug-ins | - | - | - | -
Group and collapse nodes | - | - | - | + | Group cliques, nodes with same interactions
Network comparison | PathBLAST plug-in | - | Intersection, union, difference | Find shared nodes and edges | Find shared nodes and edges
Extras | Many plug-ins for extended analysis, e.g. network comparison via PathBLAST | - | Many operations on graphs and metric computation | URL links to external source for node and edge properties | Data server for central data storage; list of connected groups

Conclusions
Pros | Flexible and extensible through many existing and user-defined plug-ins; superposition of gene expression and other data | Direct import and quick visualization from GRID DB; superposition of different datasets | General network visualization and analysis tool; multiple formats for exporting images; rich set of operations on graphs and metric computation | Interaction filter based on PSI-MI controlled vocabulary terms; new analyses as plug-ins using the Tulip graph management platform | Central storage of data on server
Cons | Requires substantial preprocessing of data, e.g. special network formats and data attribute lists | Limited visualization possibilities for external data sets (outside of GRID) | Single platform; not specifically designed for molecular interaction networks; requires much data preprocessing | Single platform; limited visualization functionality | No visualization of protein or interaction attributes (e.g. expression); only one filter; very brief documentation
3.2. Topological Properties of Protein Interaction Networks

Along with other types of cellular networks, such as metabolic, regulatory and genetic networks, the topological properties of protein interaction networks have been intensely studied since the first large-scale data sets were published. In the past few years, the rapidly developing theory of complex networks has led to the discovery that the architectural features of molecular interaction networks within a cell are shared to a large extent by other complex systems, such as the Internet, the US power grid and even social networks (55). This unexpected universality indicates that similar laws may govern most complex networks in nature, which allows expertise from large-scale, nonbiological systems to be used to characterize the organizing principles of cellular networks. Several recent studies have indicated that protein interaction networks in diverse species also have the features of a so-called scale-free network, which means that the connectivity distribution of the network follows a power-law function (see Chapter 1 for a detailed discussion) (56,57,44,7,8,45). This topological feature is illustrated in Fig. 3, which shows the protein interaction map of S. cerevisiae generated by a systematic two-hybrid screen. Whereas most proteins in the network participate in only a few interactions, a few proteins (hubs) participate in many interactions, a typical feature of scale-free networks. Protein interaction networks also exhibit another common architectural feature of all complex networks, the so-called “small world effect”: any two nodes can be connected by a path of only a few links (see Chapter 1). Within the cell, this effect was first observed with metabolic networks, in which paths of only three to four reactions can link most pairs of metabolites (58,59).
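The two quantities discussed above, the connectivity (degree) distribution and the path length between nodes, are simple to compute. The following Python sketch uses a small hypothetical network in which protein "H" plays the role of a hub; the edge list is invented for illustration and is far too small to test for a power law, so it only shows how the quantities are obtained.

```python
from collections import Counter, deque

# Hypothetical toy network: protein "H" acts as a hub.
EDGES = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"), ("H", "E"),
         ("A", "B"), ("E", "F"), ("F", "G")]


def build_adjacency(edges):
    """Return {protein: set of partners} for an undirected network."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def degree_distribution(adjacency):
    """Map each degree k to the number of proteins with exactly k partners."""
    return Counter(len(partners) for partners in adjacency.values())


def shortest_path_length(adjacency, source, target):
    """Breadth-first search; returns the number of links, or None if unreachable."""
    queue = deque([(source, 0)])
    visited = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in adjacency[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


adjacency = build_adjacency(EDGES)
distribution = degree_distribution(adjacency)
path_g_to_c = shortest_path_length(adjacency, "G", "C")
```

On real interaction data, the degree counts would be plotted on log-log axes to inspect whether they are compatible with a power law, and the path lengths would be averaged over all pairs of proteins.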
Although both “scale-free topology” and “small world connectivity” have clear mathematical definitions, the biological consequences of these topological properties remain to be studied. The presence of hubs seems to be a general feature of all cellular networks and they fundamentally determine the network’s global behavior (in terms of both scale-free and small world connectivity). The biological importance of hubs is supported by the over-representation of genetic interactions between hubs in protein interaction networks (60) and by the
Figure 3. A map of protein-protein interactions in Saccharomyces cerevisiae, based on an early systematic yeast two-hybrid experiment (14), illustrating that a few highly connected nodes hold the network together. The color of a node indicates the phenotypic effect of removing the corresponding protein (red = lethal, green = non-lethal, orange = slow growth, yellow = unknown). Reproduced with permission from Barabasi and Oltvai, 2004 (55).
over-representation of hub genes among all lethal genes revealed by genome-wide deletion studies (56). In addition to the aforementioned global topological features, protein interaction networks also possess recurring local topological features known as “network motifs” (see Chapter 2). Network motifs are defined as particular patterns of interaction (i.e. isomorphic subgraphs) that are over-represented compared to randomized versions of the same network. Significant motifs were first shown to exist in transcriptional regulatory networks (61) and subsequently in a variety of biological networks (62-64). The high degree of evolutionary conservation of motif
constituents within the yeast protein interaction network (63) further indicates that motifs are indeed of direct biological relevance. Many network motifs, for instance the feed-forward loop and the single input motif (Fig. 4), are also well known in circuit design and other engineering fields and thus can be studied in detail using similar approaches from these fields. Indeed, as a first step in this direction, the highly significant feed-forward loop has been shown to function as a sign-sensitive delay element in transcriptional regulatory networks, that is, a circuit that responds rapidly to step-like stimuli in one direction and with a delay to steps in the opposite direction (65).
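The definition of a motif as a pattern that is over-represented relative to randomized networks can be sketched in a few lines of Python. The example below counts feed-forward loops (X regulates Y, Y regulates Z, and X also regulates Z) in a hypothetical directed network and compares the count with the mean over randomized versions. The randomization used here, shuffling the targets of all edges, is a deliberately simple stand-in chosen for brevity; it is not the procedure of the cited studies and may create self-loops or duplicate edges, which the counter simply ignores.

```python
import random
from itertools import permutations

# Hypothetical regulatory network: (regulator, target) pairs.
REGULATION = [("X", "Y"), ("Y", "Z"), ("X", "Z"),
              ("X", "W"), ("W", "Z"), ("Z", "V")]


def count_feed_forward_loops(edges):
    """Count ordered triples (x, y, z) with edges x->y, y->z and x->z."""
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    return sum(
        1
        for x, y, z in permutations(nodes, 3)
        if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set
    )


def shuffle_targets(edges, rng):
    """Randomize the network by permuting edge targets (keeps in- and out-degree counts)."""
    sources = [s for s, _ in edges]
    targets = [t for _, t in edges]
    rng.shuffle(targets)
    return list(zip(sources, targets))


def mean_count_in_randomized(edges, trials=200, seed=0):
    """Average feed-forward loop count over `trials` randomized networks."""
    rng = random.Random(seed)
    total = sum(count_feed_forward_loops(shuffle_targets(edges, rng))
                for _ in range(trials))
    return total / trials


observed = count_feed_forward_loops(REGULATION)
expected = mean_count_in_randomized(REGULATION)
```

A pattern would be called a motif when `observed` is significantly larger than the values obtained from the randomized ensemble; a real analysis would use a stricter randomization and a proper significance test.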
Figure 4. Examples of network motifs in the yeast regulatory network. Regulators are represented by circles; gene promoters are represented by rectangles. Binding of a regulator to a promoter is indicated by a solid arrow. Genes encoding regulators are linked to their respective regulators by dashed arrows. For example, in the autoregulation motif, the Ste12 protein binds to the STE12 gene, which is transcribed and translated into Ste12 protein. Reproduced with permission from Lee et al. 2002 (43).
3.3. Integrating Protein Interaction Networks with Complementary Data

Just as BLAST has proven instrumental for querying sequence databases to identify genes, new pathway discovery and search tools enable us to query a protein interaction network to identify particular interaction pathways in a systematic fashion. For example, several groups (66-69) have applied “co-clustering” approaches to identify groups of proteins that are co-expressed and also closely connected by interactions in the network. In many cases, these “expression-activated networks” correspond to well-known protein complexes, regulatory pathways, or metabolic reaction pathways, such as the 26S proteasome complex (69), the core galactose-induction circuit (68), and the glycolysis pathway (67). Other methods (70,43,71,72) use probabilistic approaches to match changes in gene expression with the transcriptional and/or protein signaling interactions that are most likely to regulate them directly. These methods start with a cluster of differentially expressed genes and incrementally choose a small set of transcription factors which, by virtue of their levels and/or protein-DNA interactions in the network, can maximally predict the observed levels of differential expression in the cluster. All of these approaches serve to reduce network complexity by pinpointing just those regions whose gene/protein states are perturbed by the conditions of interest, while removing false positive interactions and interactions not involved in the perturbation response. Software is available for several of these approaches, such as the GRAM approach by Bar-Joseph et al. (70). Others are implemented as extensions to existing network visualization software, such as MCODE (73) and the ActiveModules approach (68), which are implemented as plug-ins to Cytoscape.
The key concept behind the more advanced queries is that, by interrogating a protein interaction network with other (complementary) large-scale data such as gene expression profiles, it is possible to condense and partition the enormous quantity of data into a small number of relevant pieces suitable for lower-level investigation and modeling. Such an approach reinforces the common signal present in both data sets while filtering out some of the independent noise.
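The idea of reinforcing the signal shared by two data sets can be illustrated with a deliberately simplified co-clustering sketch in Python. It keeps only those interactions whose two proteins have strongly correlated expression profiles and then reports the connected groups that remain. This is not the algorithm of any of the cited methods (GRAM, MCODE or ActiveModules); the protein names, expression values and the correlation threshold `min_r` are all hypothetical.

```python
from math import sqrt

# Hypothetical expression profiles (four conditions per protein).
EXPRESSION = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.0, 6.0, 8.5],
    "P3": [1.1, 2.1, 2.9, 4.2],
    "P4": [4.0, 3.0, 2.0, 1.0],
    "P5": [3.9, 3.1, 2.2, 0.8],
}
# Hypothetical physical interactions.
INTERACTIONS = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")]


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpressed_modules(interactions, expression, min_r=0.8):
    """Connected groups of proteins linked by interactions AND co-expression."""
    adjacency = {}
    for a, b in interactions:
        if pearson(expression[a], expression[b]) >= min_r:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    modules, assigned = [], set()
    for start in adjacency:
        if start in assigned:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        assigned |= component
        modules.append(component)
    return modules


modules = coexpressed_modules(INTERACTIONS, EXPRESSION)
```

In this toy data set the P3-P4 interaction is discarded because the two profiles are anti-correlated, so the network splits into two candidate modules. An interaction supported by only one data type is filtered out, which is the sense in which independent noise is reduced.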
As an example application, Begley et al. (4,5) performed a series of network queries to screen for protein pathways and complexes important for cellular recovery from DNA damage. Begley et al. used a systematic phenotyping approach in which growth phenotypes were recorded for a set of 1,615 yeast single-gene knockout strains exposed to methyl methanesulfonate (MMS, a mutagen). Of the knockout strains, 416 grew more slowly in the presence of MMS and showed less than 67% of the growth rate of a wild type strain exposed to identical MMS conditions (4,5). These strains were assigned an “MMS sensitive” phenotype, and the genes deleted from each of these were designated as “MMS essential”. To elucidate protein networks involved in the DNA damage response, the MMS phenotypic state data were integrated with a large combined protein-protein and protein-DNA interaction network for yeast (Fig. 5a). In a preliminary step, proteins whose distance from MMS-essential proteins was greater than one interaction were removed from the network. Within this filtered network, ActiveModules was used to search for connected subnetworks having a higher-than-expected proportion of MMS-essential proteins. This search identified four significant modules associated with MMS sensitivity. Fig. 5b shows three of these: in addition to proteins already known to be associated with the damage response, the modules contained significant numbers of proteins involved in protein degradation (e.g. Vma6, Pep12, and Snf7) and several proteins of unknown function. These likely occur because toxins such as MMS also cause damage to proteins, activating protein degradation and turnover machinery as an integral part of the cellular response.

3.4. Network Alignment and Comparison

A major emerging challenge of protein network biology is to systematically compare and contrast biological networks over different species, conditions, cell types, disease states, or points in time.
For this purpose, methods are being developed to compare/contrast protein interaction networks to predict protein interactions (36); to assess the specificity of protein interactions (37); and to identify conserved
interaction complexes and pathways (74,75). Recently, we have developed pairwise network alignment algorithms that are used to detect linear interaction paths (74) or dense clusters of interactions (75) that are conserved between networks. For instance, the algorithm PathBLAST searches for high-scoring “pathway alignments” involving a pair of paths, one from each network, in which proteins of the first path are paired with putative orthologs occurring in the same order in the second path (Fig. 6a). We have also developed a similar algorithm to search for dense interacting clusters of proteins rather than linear paths (75).
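The notion of a pathway alignment described above can be sketched in Python as follows. The code enumerates simple paths of a fixed length in two small hypothetical networks and reports the best-scoring pair of paths in which proteins at the same position are putative orthologs. It is a simplified illustration only: unlike PathBLAST it allows no insertions or mismatches, and the similarity scores in `SIMILARITY` are invented numbers standing in for scores that would in practice be derived from sequence comparison.

```python
# Hypothetical networks as adjacency lists (undirected interactions).
NETWORK_1 = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
NETWORK_2 = {"a": ["b", "x"], "b": ["a", "c"], "c": ["b"], "x": ["a"]}

# Hypothetical similarity scores between putative orthologs.
SIMILARITY = {("A", "a"): 3.0, ("B", "b"): 2.0, ("C", "c"): 4.0, ("D", "x"): 1.0}


def simple_paths(adjacency, length):
    """All simple paths containing exactly `length` proteins."""
    paths = []

    def extend(path):
        if len(path) == length:
            paths.append(tuple(path))
            return
        for neighbor in adjacency.get(path[-1], ()):
            if neighbor not in path:
                extend(path + [neighbor])

    for start in adjacency:
        extend([start])
    return paths


def best_pathway_alignment(net1, net2, similarity, length):
    """Return (score, path1, path2) for the best gap-free alignment, or None."""
    best = None
    for path1 in simple_paths(net1, length):
        for path2 in simple_paths(net2, length):
            pairs = list(zip(path1, path2))
            # Every aligned position must pair two putative orthologs.
            if all(pair in similarity for pair in pairs):
                score = sum(similarity[pair] for pair in pairs)
                if best is None or score > best[0]:
                    best = (score, path1, path2)
    return best


best = best_pathway_alignment(NETWORK_1, NETWORK_2, SIMILARITY, length=3)
```

Exhaustive enumeration is only feasible for toy networks; it is used here because it makes the definition of an alignment explicit, not because it reflects how a practical search would be organized.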
Figure 5. Screening damage phenotypes vs. the interaction network. [a] A protein interaction network was integrated with 1,615 yeast deletion phenotypes gathered in response to MMS. [b] A search of the network found protein complexes containing significant numbers of MMS-essential proteins. Three of four identified regions are shown. Dark gray nodes represent MMS-essential proteins; white nodes were untested. Reproduced with permission from Begley et al. 2002 (4).
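Two steps of the MMS screen summarized in Fig. 5 lend themselves to a short Python sketch: the preliminary filter that keeps only proteins within one interaction of an MMS-essential protein, and a test of whether a candidate module contains more MMS-essential proteins than expected by chance. The hypergeometric test below is a generic stand-in chosen for illustration and is not the scoring function of ActiveModules; the toy network and the module composition (8 essential proteins among 10 tested) are hypothetical, while 1,615 and 416 are the strain counts quoted in the text.

```python
from math import comb


def within_one_interaction(adjacency, essential):
    """Proteins that are essential or directly interact with an essential protein."""
    keep = set(essential) & set(adjacency)
    for protein in essential:
        keep |= set(adjacency.get(protein, ()))
    return keep


def hypergeometric_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` items without replacement."""
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favorable / comb(population, draws)


# Hypothetical toy network: E1 is MMS-essential, N1..N3 are not.
TOY_NETWORK = {"E1": {"N1"}, "N1": {"E1", "N2"}, "N2": {"N1", "N3"}, "N3": {"N2"}}
filtered = within_one_interaction(TOY_NETWORK, {"E1"})

# Enrichment of a hypothetical module: 8 of 10 tested proteins are MMS-essential,
# against a background of 416 sensitive strains among 1,615 tested.
p_value = hypergeometric_tail(population=1615, successes=416, draws=10, observed=8)
```

A small `p_value` indicates that the module holds a higher-than-expected proportion of MMS-essential proteins; in a genuine search, many candidate subnetworks are evaluated, so a correction for multiple testing would also be needed.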
These two network structures, paths versus dense clusters, attempt to capture different biological mechanisms that may be conserved. Very approximately, paths model signal transduction pathways while dense
clusters of interactions model protein complexes. PathBLAST is available as a web-based query at http://www.pathblast.org. Target protein-protein interaction networks are currently available for H. pylori, S. cerevisiae, C. elegans, D. melanogaster, M. musculus, and H. sapiens. A related method that uses cross-species data for predicting protein interactions is the interolog approach (41,42): a pair of proteins in one species is predicted to interact if their best sequence matches in another species were reported to interact. As an example of network evolutionary comparison, a protein network alignment was performed between the protein-protein networks of the budding yeast S. cerevisiae and the human gastric pathogen H. pylori (74). Both the yeast network (14,489 interactions among 4,688 proteins, assembled from mass spectrometry and two-hybrid studies) and the H. pylori network (1,465 interactions among 732 proteins from a single two-hybrid study) (44) were extracted from the DIP database (76). The yeast and bacterial networks were analyzed to select the 150 highest-scoring pathway alignments of length four (four proteins per path), corresponding to a level of significance of p ≤ 0.05. By combining all overlapping pathway alignments, each of the 150 fell into one of five conserved network regions, two of which are shown in Fig. 6 [b-c]. Interestingly, although the putative yeast-bacterial orthologs in these regions generally had significant sequence homology (i.e. BLAST E-values < 10^-10), over 50% of these orthologs were in fact not the overall best BLAST matches possible between the two species’ genomes. Rather, they were identified by their close proximity to other orthologous proteins in the protein network. Although an entire network vs. network comparison is invaluable for cataloguing all of the homologous pathways between and within organisms, it is also desirable to query a single protein network with specific pathways of interest. 
This procedure is similar to using BLAST to interrogate a sequence database with a short nucleotide or amino-acid sequence query. As an example of this approach, we queried the S. cerevisiae protein network with a classic mitogen activated protein kinase (MAPK) pathway associated with the filamentation response, consisting of a MAPK (Kss1), a MAPK kinase or MAPKK (Ste7), and a MAPKK kinase or MAPKKK (Ste11). MAPK pathways transmit
incoming signals to the nucleus through activation cascades in which each kinase phosphorylates the next one downstream. As shown in Fig. 6d, the pathway query identified two other well-known MAPK pathways
Figure 6. PathBLAST network alignment across species. [a] A model pathway alignment between two protein networks, where interactions in a pathway appear vertically and horizontal dotted lines link proteins with significant sequence similarity. Insertions (e.g. protein C) or mismatches (e.g. proteins E and g) in the alignment are permitted but penalized. Panels [b-c] show aligned regions from the networks of H. pylori (orange; left) vs. S. cerevisiae (green; right). Bacterial/yeast protein pairs with significant sequence similarity are placed on the same row (e.g. deaD and Dbp2 in row 1 of [b]). [d] Querying the yeast network with a specific MAP kinase pathway involved in the yeast filamentation response. In panels [b-d], solid links indicate direct protein interactions, whereas dotted links indicate a single protein insertion (an additional protein in one of the compared networks). Reproduced with permission from Ideker et al. (2005) (96).
as the highest-scoring hits (the low- and high-osmolarity response pathways Bck1-Mkk1-Slt2 and Ssk2-Pbs2-Hog1). Such methods will be instrumental in extending comparative molecular biology from the level of DNA and protein sequences to the level of the protein network.

4. Robustness of Protein Interaction Networks

Robustness is a property that allows a system to maintain its functions despite external and internal perturbations. This property has been widely observed in many biological systems, such as chemotaxis (77), circadian rhythms (78), and segmental pattern formation in embryogenesis (79). Understanding the origin and principles of robustness in biological networks enables us to put various observations about the networks into perspective and facilitates the discovery of principles at the systems level. A prominent feature of all cellular networks studied so far is their scale-free nature. Unlike random networks, scale-free networks are highly resistant to random failures. By simulation studies, Albert and colleagues (80) showed that even if 80% of randomly selected nodes fail, the remaining 20% still form a compact cluster with a path connecting any two nodes. This is because random failure mainly affects nodes with few network connections, the absence of which does not disrupt the network’s overall integrity. On the other hand, removal of hubs rapidly disintegrates the network into small isolated node clusters. These computational simulations suggest that hub proteins have an important role in cellular fitness. In fact, deletion analyses indicate that in S. cerevisiae only about 10% of the proteins with fewer than five interactions are essential, but this fraction increases to over 60% for proteins with more than 15 interactions. This indicates that a protein’s number of interactions plays an important role in determining its deletion phenotype (56). The importance of hubs is further supported by their evolutionary conservation: highly connected S. 
cerevisiae proteins have a smaller evolutionary distance to their orthologs in C. elegans (81) and are more likely to have orthologs in higher organisms (82). Although hubs are essential for protecting protein networks from accidental failures, their attack vulnerability makes them ideal targets for manipulating and
controlling the network. For instance, from a therapeutic point of view, hub proteins can be used to screen against small molecule libraries to identify potential drug targets. In addition to the global topological features that ensure the robustness of protein networks, local topological features, i.e. network motifs, are also used to maintain robustness. Negative feedback loops are a principal mode of control that enables a robust response to perturbations (83). Alon and colleagues (77) have shown that bacteria use negative feedback in signal transduction systems to attain the perfect adaptation that allows chemotaxis to occur in response to a wide range of stimuli. Positive feedback contributes to robustness by amplifying stimuli so that the activation level of downstream pathways can be clearly distinguished from non-stimulated states and these states can be maintained. The best-documented example of a positive feedback loop functioning in signal transduction is the Mos-mitogen-activated protein kinase (MAPK) cascade in Xenopus oocytes (84). This cascade is activated when oocytes are induced to mature by the steroid hormone progesterone. The positive feedback loop in the signal transduction process ensures that the oocyte converts a graded, reversible triggering stimulus into an all-or-none, irreversible cell-fate decision.

5. Evolution of Protein Interaction Networks

As the most prominent feature of protein interaction networks and other cellular networks, scale-free topology and its origin have attracted the attention of many researchers. Two kinds of evolutionary processes have been invoked to explain this topological feature of protein interaction networks. The first kind of process consists of gene duplications followed by either silencing of one of the duplicated genes or by functional divergence of the duplicates. 
In terms of the protein interaction network, a gene duplication corresponds to the addition of a node with links identical to the original node, followed by the divergence of some of the redundant links between the two duplicate nodes. Barabasi and Albert (1999) were the first to suggest that gene duplication is the major mechanism for generating the scale-free topology of protein interaction networks (85). According to their growth and preferential
attachment model, duplicated genes produce identical proteins that interact with the same protein partners. Therefore, each protein that is in contact with a duplicated protein gains an extra link. Highly connected proteins are more likely to have a link to a duplicated protein than their sparsely connected cousins, and therefore they are more likely to gain new links if a randomly selected protein is duplicated. A mathematical model of the growth of networks based on this principle produces scale-free topologies with parameters comparable to those of real-world networks (85). Two lines of empirical evidence support this model. First, an analysis of metabolic networks shows that metabolites of some of the most ancient pathways, such as glycolysis and the tricarboxylic acid (TCA) cycle, are among the most connected substrates of the network (59). Second, in terms of protein interaction networks, comparative genomics analyses have revealed that, on average, evolutionarily older proteins have higher connectivity than their younger counterparts (86,87). The preferential attachment model aims to capture a general mechanism of network evolution capable of producing the observed scale-free topology. But it is likely to operate under functional constraints, as protein function determines types of binding partners, the degree of connectivity, and time of origin of the network (88). The second type of evolutionary process consists of point mutations in a gene resulting in modifications of the interface between interacting proteins (89). Consequently, the corresponding protein may gain new connections (attachment) or lose some of its existing connections (detachment) to other proteins. Berg et al. (2004) refer to these attachment and detachment processes collectively as link dynamics (90). They estimate the empirical rates of link dynamics and gene duplication in the yeast protein network and find the former to be at least one order of magnitude higher than the latter. 
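The duplication argument made earlier in this section, that highly connected proteins are the most likely to gain a link when a randomly chosen protein is duplicated, follows from simple counting: a protein with k partners gains a link whenever one of those k partners is the duplicated protein, so with N proteins and uniform choice the probability is k/N. The Python sketch below illustrates this on a hypothetical four-protein network; the names and the helper functions are illustrative only.

```python
from fractions import Fraction

# Hypothetical network: one hub with three singly connected partners.
NETWORK = {
    "Hub": {"P1", "P2", "P3"},
    "P1": {"Hub"},
    "P2": {"Hub"},
    "P3": {"Hub"},
}


def link_gain_probability(adjacency):
    """P(protein gains a link) when one uniformly chosen protein is duplicated."""
    n = len(adjacency)
    return {protein: Fraction(len(partners), n)
            for protein, partners in adjacency.items()}


def duplicate(adjacency, protein, copy_name):
    """Return a new network in which `copy_name` inherits all links of `protein`."""
    grown = {p: set(partners) for p, partners in adjacency.items()}
    grown[copy_name] = set(adjacency[protein])
    for partner in adjacency[protein]:
        grown[partner].add(copy_name)
    return grown


gain = link_gain_probability(NETWORK)
grown = duplicate(NETWORK, "P1", "P1_copy")
```

Here the hub gains a link in three out of four possible duplications, each peripheral protein in only one out of four, which is the "rich get richer" effect of preferential attachment. Subsequent divergence of the duplicates (loss of redundant links) is not modeled in this sketch.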
Based on this observation, they propose a new model for the evolution of protein networks in which link dynamics due to point mutations are the major evolutionary forces shaping the scale-free topology of the network while slower gene duplication processes mainly affect its size. According to this model, the fast link turnover rate leads to the fast loss of connectivity of proteins encoded by duplicate genes. This is consistent with an earlier observation
that the majority of duplicate pairs have few or no interaction partners in common (57). All of this previous research on protein network evolution has been directed towards understanding the origin of its global structural features. In contrast, little is known about the evolutionary process(es) that shape the network’s local wiring diagrams, i.e. network motifs, although it is often implied that the local properties reflect solely evolutionary selection towards desirable functional traits (61,65,62). A recent study (91) demonstrates that a network’s global and local structures mutually define and predict each other, raising intriguing questions about how the evolution of network motifs shapes a network’s overall structure and vice versa.

6. Perspectives

Although significant advances have been made in the past few years, protein network biology is still in its infancy. Future progress is expected in several directions. First and most importantly, to further expand our knowledge about protein interaction networks, we need to improve our data-gathering capabilities. This means developing highly sensitive and accurate methods that allow data collection under various cellular functional and temporal states as well as in different cell types in the case of metazoans. These new data sets will not only improve coverage of the networks but also enable us to ask questions about the dynamics of protein interaction networks. In contrast to the yeast protein network, the human network is largely unexplored. Based on the existing data for yeast proteins, a conservative estimate puts the total number of protein interactions in human at roughly 40,000-200,000 (92). Currently, about 20,000-30,000 total interactions are recorded in the literature (93), mostly from small-scale studies with a few medium-scale studies centered on particular pathways (94) or cellular machineries (95). 
Thus, there is a pressing need for experimental methods to be scaled up to the size of the human proteome. Meanwhile, novel computational approaches need to be developed to transfer as many interactions as possible from model organisms to humans. Also, just as theoretical advances in sequence evolution were
essential for the development of modern sequence analysis algorithms, further advances in our understanding of network evolution will surely benefit many aspects of network analysis, such as cross-species network comparisons.

Interaction data provide a high-level representation of the key molecular components and interactions of a biological system. Queries against this interaction network highlight particular pathways and complexes of interest, which are then prime candidates for low-level verification and modeling as important signaling and compensatory pathways. Over successive iterations of modeling and experiment, the network model becomes annotated with increasingly low-level and pathway-specific parameters such as physico-chemical reaction rates, binding constants, and diffusion and transport coefficients. The promise of this approach is that ultimately, protein network models may provide a comprehensive “wiring diagram” lending global insight into normal and diseased cell function.

References

1. Artavanis-Tsakonas, S., Rand, M.D. and Lake, R.J. (1999). Notch signaling: cell fate control and signal integration in development. Science. 284, 770-6.
2. Edwards, P.A., Tabor, D., Kast, H.R. and Venkateswaran, A. (2000). Regulation of gene expression by SREBP and SCAP. Biochim Biophys Acta. 1529, 103-13.
3. DeRisi, J.L., Iyer, V.R. and Brown, P.O. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 278, 680-686.
4. Begley, T.J., Rosenbach, A.S., Ideker, T. and Samson, L.D. (2002). Damage Recovery Pathways in Saccharomyces cerevisiae Revealed by Genomic Phenotyping and Interactome Mapping. Mol Cancer Res. 1, 103-112.
5. Deutschbauer, A.M., Williams, R.M., Chu, A.M. and Davis, R.W. (2002). Parallel phenotypic analysis of sporulation and postgermination growth in Saccharomyces cerevisiae. Proc Natl Acad Sci U.S.A. 99, 15530-15535.
6. Begley, T.J., Rosenbach, A.S., Ideker, T. and Samson, L.D. (2004). Hot spots for modulating toxicity identified by genomic phenotyping and localization mapping. Mol Cell. 16, 117-125.
7. Giot, L., Bader, J.S., Brouwer, C., Chaudhuri, A., Kuang, B. et al. (2003). A protein interaction map of Drosophila melanogaster. Science. 302, 1727-1736.
8. Li, S., Armstrong, C.M., Bertin, N., Ge, H., Milstein, S. et al. (2004). A map of the interactome network of the metazoan C. elegans. Science. 303, 540-543.
9. Lane, D.P. and Crawford, L.V. (1979). T antigen is bound to a host protein in SV40-transformed cells. Nature. 278, 261-3.
10. Fields, S. and Song, O. (1989). A novel genetic system to detect protein-protein interactions. Nature. 340, 245-6.
11. Kaelin, W.G., Jr., Pallas, D.C., DeCaprio, J.A., Kaye, F.J. and Livingston, D.M. (1991). Identification of cellular proteins that can interact specifically with the T/E1A-binding region of the retinoblastoma gene product. Cell. 64, 521-32.
12. Aebersold, R. and Mann, M. (2003). Mass spectrometry-based proteomics. Nature. 422, 198-207.
13. Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S. et al. (2000). A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 403, 623-627.
14. Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M. et al. (2001). A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U.S.A. 98, 4569-4574.
15. Gavin, A.C., Bosche, M., Krause, R., Grandi, P., Marzioch, M. et al. (2002). Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 415, 141-147.
16. Ho, Y., Gruhler, A., Heilbut, A., Bader, G.D., Moore, L. et al. (2002). Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 415, 180-183.
17. Hartman, J.L., Garvik, B. and Hartwell, L. (2001). Principles for the buffering of genetic variation. Science. 291, 1001-1004.
18. Avery, L. and Wasserman, S. (1992). Ordering gene function: the interpretation of epistasis in regulatory hierarchies. Trends Genet. 8, 312-316.
19. Thomas, J.H. (1993). Thinking about genetic redundancy. Trends Genet. 9, 395-399.
20. Guarente, L. (1993). Synthetic enhancement in gene interaction: a genetic tool come of age. Trends Genet. 9, 362-366.
21. Sham, P. (2001). Shifting paradigms in gene-mapping methodology for complex traits. Pharmacogenomics. 2, 195-202.
22. Dolma, S., Lessnick, S.L., Hahn, W.C. and Stockwell, B.R. (2003). Identification of genotype-selective antitumor agents using synthetic lethal chemical screening in engineered human tumor cells. Cancer Cell. 3, 285-296.
23. Huang, L.S. and Sternberg, P.W. (1995). Genetic Dissection of Developmental Pathways. In Methods in Cell Biology (eds. H.F. Epstein and D.C. Shakes). 99-122. Academic Press, San Diego.
24. Tong, A.H., Lesage, G., Bader, G.D., Ding, H., Xu, H. et al. (2004). Global mapping of the yeast genetic interaction network. Science. 303, 808-813.
25. Tong, A.H., Evangelista, M., Parsons, A.B., Xu, H., Bader, G.D. et al. (2001). Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science. 294, 2364-2368.
26. Ouzounis, C. and Kyrpides, N. (1996). The emergence of major cellular processes in evolution. FEBS Lett. 390, 119-23.
27. Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D. and Yeates, T.O. (1999). Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U.S.A. 96, 4285-8.
28. Galperin, M.Y. and Koonin, E.V. (2000). Who's your neighbor? New computational approaches for functional genomics. Nat Biotechnol. 18, 609-13.
29. Enright, A.J., Iliopoulos, I., Kyrpides, N.C. and Ouzounis, C.A. (1999). Protein interaction maps for complete genomes based on gene fusion events. Nature. 402, 86-90.
30. Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O. and Eisenberg, D. (1999). Detecting protein function and protein-protein interactions from genome sequences. Science. 285, 751-3.
31. Dandekar, T., Snel, B., Huynen, M. and Bork, P. (1998). Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci. 23, 324-8.
32. Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G.D. and Maltsev, N. (1999). The use of gene clusters to infer functional coupling. Proc Natl Acad Sci U.S.A. 96, 2896-901.
33. Goh, C.S., Bogan, A.A., Joachimiak, M., Walther, D. and Cohen, F.E. (2000). Coevolution of proteins with their interaction partners. J Mol Biol. 299, 283-93.
34. Goh, C.S. and Cohen, F.E. (2002). Co-evolutionary analysis reveals insights into protein-protein interactions. J Mol Biol. 324, 177-92.
35. Pazos, F. and Valencia, A. (2001). Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng. 14, 609-14.
36. Ramani, A.K. and Marcotte, E.M. (2003). Exploiting the co-evolution of interacting proteins to discover interaction specificity. J Mol Biol. 327, 273-84.
37. Russell, R.B., Alber, F., Aloy, P., Davis, F.P., Korkin, D., Pichaud, M., Topf, M. and Sali, A. (2004). A structural perspective on protein-protein interactions. Curr Opin Struct Biol. 14, 313-24.
38. Aloy, P., Bottcher, B., Ceulemans, H., Leutwein, C., Mellwig, C. et al. (2004). Structure-based assembly of protein complexes in yeast. Science. 303, 2026-9.
39. Ng, S.K., Zhang, Z. and Tan, S.H. (2003). Integrative approach for computationally inferring protein domain interactions. Bioinformatics. 19, 923-929.
40. Matthews, L.R., Vaglio, P., Reboul, J., Ge, H., Davis, B.P., Garrels, J., Vincent, S. and Vidal, M. (2001). Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Res. 11, 2120-6.
41. Yu, H., Luscombe, N.M., Lu, H.X., Zhu, X., Xia, Y. et al. (2004). Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res. 14, 1107-18.
42. Rain, J.C., Selig, L., De Reuse, H., Battaglia, V., Reverdy, C. et al. (2001). The protein-protein interaction map of Helicobacter pylori. Nature. 409, 211-215.
43. Butland, G., Peregrin-Alvarez, J.M., Li, J., Yang, W., Yang, X., Canadien, V., Starostine, A., Richards, D., Beattie, B., Krogan, N., Davey, M., Parkinson, J., Greenblatt, J. and Emili, A. (2005). Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature. 433, 531-537.
44. Walhout, A.J., Sordella, R., Lu, X., Hartley, J.L., Temple, G.F. et al. (2000). Protein interaction mapping in C. elegans using proteins involved in vulval development. Science. 287, 116-122.
45. Peri, S., Navarro, J.D., Amanchy, R., Kristiansen, T.Z., Jonnalagadda, C.K. et al. (2003). Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 13, 2363-71.
46. Pirson, I., Fortemaison, N., Jacobs, C., Dremier, S., Dumont, J.E. and Maenhaut, C. (2000). The visual display of regulatory information and networks. Trends Cell Biol. 10, 404-8.
47. Michal, G. (1998). On representation of metabolic pathways. Biosystems. 47, 1-7.
48. Kohn, K.W. (1999). Molecular interaction map of the mammalian cell cycle control and DNA repair systems. Mol Biol Cell. 10, 2703-34.
49. Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B. and Ideker, T. (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498-504.
50. Breitkreutz, B.J., Stark, C. and Tyers, M. (2003). Osprey: a network visualization system. Genome Biol. 4, R22.
51. Batagelj, V. and Mrvar, A. (1998). Pajek - Program for Large Network Analysis. Connections. 21, 47-57.
52. Barabasi, A.L. and Oltvai, Z.N. (2004). Network biology: understanding the cell's functional organization. Nat Rev Genet. 5, 101-113.
53. Jeong, H., Mason, S.P., Barabasi, A.L. and Oltvai, Z.N. (2001). Lethality and centrality in protein networks. Nature. 411, 41-2.
54. Wagner, A. (2001). The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol Biol Evol. 18, 1283-92.
55. Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N. and Barabasi, A.L. (2000). The large-scale organization of metabolic networks. Nature. 407, 651-4.
56. Wagner, A. and Fell, D.A. (2001). The small world inside large metabolic networks. Proc Biol Sci. 268, 1803-10.
57. Ozier, O., Amin, N. and Ideker, T. (2003). Global architecture of genetic interactions on the protein network. Nat Biotechnol. 21, 490-1.
58. Shen-Orr, S.S., Milo, R., Mangan, S. and Alon, U. (2002). Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet. 31, 64-8.
59. Yeger-Lotem, E., Sattath, S., Kashtan, N., Itzkovitz, S., Milo, R., Pinter, R.Y., Alon, U. and Margalit, H. (2004). Network motifs in integrated cellular networks of transcription-regulation and protein-protein interaction. Proc Natl Acad Sci U.S.A. 101, 5934-9.
60. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. and Alon, U. (2002). Network motifs: simple building blocks of complex networks. Science. 298, 824-7.
61. Wuchty, S., Oltvai, Z.N. and Barabasi, A.L. (2003). Evolutionary conservation of motif constituents in the yeast protein interaction network. Nat Genet. 35, 176-9.
62. Mangan, S., Zaslaver, A. and Alon, U. (2003). The coherent feedforward loop serves as a sign-sensitive delay element in transcription networks. J Mol Biol. 334, 197-204.
63. Jansen, R., Greenbaum, D. and Gerstein, M. (2002). Relating whole-genome expression data with protein-protein interactions. Genome Res. 12, 37-46.
64. Ge, H., Liu, Z., Church, G.M. and Vidal, M. (2001). Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat Genet. 29, 482-486.
65. Ideker, T., Ozier, O., Schwikowski, B. and Siegel, A.F. (2002). Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics. 18, S233-240.
66. Hanisch, D., Zien, A., Zimmer, R. and Lengauer, T. (2002). Co-clustering of biological networks and gene expression data. Bioinformatics. 18, S145-S154.
67. Pe'er, D., Regev, A. and Tanay, A. (2002). Minreg: Inferring an active regulator set. Bioinformatics. 18, S258-S267.
68. Bar-Joseph, Z., Gerber, G.K., Lee, T.I., Rinaldi, N.J., Yoo, J.Y. et al. (2003). Computational discovery of gene modules and regulatory networks. Nat Biotechnol. 21, 1337-1342.
69. Yeang, C.-H. and Jaakkola, T. (2003). Physical network models and multi-source data integration. The Seventh Annual International Conference on Research in Computational Molecular Biology (RECOMB).
70. Bader, G.D. and Hogue, C.W. (2003). An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 4, 2.
71. Kelley, B.P., Sharan, R., Karp, R.M., Sittler, T., Root, D.E. et al. (2003). Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc Natl Acad Sci U.S.A. 100, 11394-11399.
72. Sharan, R., Ideker, T., Kelley, B.P., Shamir, R. and Karp, R. (2004). Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data. Proceedings of the Eighth Annual International Conference on Research in Computational Molecular Biology (RECOMB), 282-289.
73. Sharan, R., Suthram, S., Kelley, R.M., Kuhn, T., McCuine, S., Uetz, P., Sittler, T., Karp, R.M. and Ideker, T. (2005). Conserved patterns of protein interaction in multiple species. Proc Natl Acad Sci U.S.A. 102, 1974-9.
74. Xenarios, I., Salwinski, L., Duan, X.J., Higney, P., Kim, S.M. et al. (2002). DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 30, 303-305.
75. Alon, U., Surette, M.G., Barkai, N. and Leibler, S. (1999). Robustness in bacterial chemotaxis. Nature. 397, 168-71.
76. Morohashi, M., Winn, A.E., Borisuk, M., Bolouri, H., Doyle, J. and Kitano, H. (2002). Robustness as a measure of plausibility in models of biochemical networks. J Theor Biol. 216, 19-30.
77. Von Dassow, G., Meir, E., Munro, E.M. and Odell, G.M. (2000). The segment polarity network is a robust developmental module. Nature. 406, 188-92.
78. Albert, R., Jeong, H. and Barabasi, A.L. (2000). Error and attack tolerance of complex networks. Nature. 406, 378-82.
79. Fraser, H.B., Hirsh, A.E., Steinmetz, L.M., Scharfe, C. and Feldman, M.W. (2002). Evolutionary rate in the protein interaction network. Science. 296, 750-2.
80. Krylov, D.M., Wolf, Y.I., Rogozin, I.B. and Koonin, E.V. (2003). Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 13, 2229-35.
81. Kitano, H. (2004). Biological robustness. Nat Rev Genet. 5, 826-37.
82. Ferrell, J.E., Jr. (2002). Self-perpetuating states in signal transduction: positive feedback, double-negative feedback and bistability. Curr Opin Cell Biol. 14, 140-8.
83. Barabasi, A.L. and Albert, R. (1999). Emergence of scaling in random networks. Science. 286, 509-12.
84. Wagner, A. (2003). How the global structure of protein interaction networks evolves. Proc Biol Sci. 270, 457-66.
85. Eisenberg, E. and Levanon, E.Y. (2003). Preferential attachment in the protein network evolution. Phys Rev Lett. 91, 138701.
86. Kunin, V., Pereira-Leal, J.B. and Ouzounis, C.A. (2004). Functional evolution of the yeast protein interaction network. Mol Biol Evol. 21, 1171-6.
87. Jones, S. and Thornton, J.M. (1996). Principles of protein-protein interactions. Proc Natl Acad Sci U.S.A. 93, 13-20.
88. Berg, J., Lassig, M. and Wagner, A. (2004). Structure and evolution of protein interaction networks: a statistical model for link dynamics and gene duplications. BMC Evol Biol. 4, 51.
89. Vazquez, A., Dobrin, R., Sergi, D., Eckmann, J.P., Oltvai, Z.N. and Barabasi, A.L. (2004). The topological relationship between the large-scale attributes and local interaction patterns of complex networks. Proc Natl Acad Sci U.S.A. 101, 17940-5.
90. Bork, P., Jensen, L.J., von Mering, C., Ramani, A.K., Lee, I. and Marcotte, E.M. (2004). Protein interaction networks from yeast to human. Curr Opin Struct Biol. 14, 292-9.
91. Peri, S., Navarro, J.D., Kristiansen, T.Z., Amanchy, R., Surendranath, V. et al. (2004). Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res. 32, 497-501.
92. Bouwmeester, T., Bauch, A., Ruffner, H., Angrand, P.O., Bergamini, G. et al. (2004). A physical and functional map of the human TNF-alpha/NF-kappa B signal transduction pathway. Nat Cell Biol. 6, 97-105.
93. Andersen, J.S., Wilkinson, C.J., Mayor, T., Mortensen, P., Nigg, E.A. and Mann, M. (2003). Proteomic characterization of the human centrosome by protein correlation profiling. Nature. 426, 570-4.
94. Uetz, P. (2002). Two-hybrid arrays. Curr Opin Chem Biol. 6, 57-62.
95. Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z. et al. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 298, 799-804.
96. Ideker, T., Tan, K. and Uetz, P. (2005). Visualization and integration of protein interaction networks. In Protein-Protein Interactions: A Molecular Cloning Manual (eds. E.A. Golemis and P.D. Adams). 839-856. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.
CHAPTER 6 METABOLIC NETWORKS
David A. Fell School of Life Sciences, Oxford Brookes University, Headington, Oxford, OX3 0BP, UK
1. Introduction

Metabolism is the set of chemical conversions that occur inside living organisms as they convert their nutrients into cellular materials, or use them to provide energy to power their activities. For example, most organisms can oxidize the sugar glucose, using oxygen, to generate carbon dioxide and water. However, whereas sugar that is burned reacts directly with oxygen, the metabolic process is broken up into a large number of steps, none of which involves any of the carbon atoms of glucose reacting with oxygen. This is typical of metabolism: every metabolic function of a living cell is implemented as a series of individual chemical conversions, the majority of which are not apparent outside the cell because most of the chemical intermediates, or metabolites, formed and utilized in the process are unable to escape through the cell membrane.

Although a few of the reactions of metabolism occur spontaneously, most would not occur to any significant extent without catalysis. The catalysts of cellular metabolism are the proteins known as enzymes. Each enzyme has a relatively high specificity in terms of the metabolites on which it can act, and usually catalyzes only a single reaction on those metabolites so that the products of the reaction are strictly determined, unlike in chemical catalysis where
side reactions are common. As with all proteins, the instructions for assembling an enzyme are encoded in genes, ‘written’ as sequences using the four chemical ‘letters’ that make up an organism’s DNA. The process of synthesis involves the production of a copy of the instructions in the form of messenger RNA (or mRNA), which is translated into protein.

The study of metabolism started over a hundred years ago, when it was shown that the conversion of glucose to ethanol could be catalyzed by cell-free extracts from yeast. It continued with the discovery of pathways for the utilization or synthesis of specific substances, such as glycolysis for the breakdown of glucose, and Krebs’ tricarboxylic acid cycle. All these pathways must join up, since plants and many bacteria can make all their own constituents from a small set of simple precursors. In spite of this incontrovertible evidence that metabolism forms a single, interconnected network, a glance at the organization of any modern biochemistry textbook shows that the pathway paradigm predominates.

A justification of the pathway designation is that it labels the major routes through the network. This is indeed plausible in the context of the cells in which the pathways were discovered, such as yeast for glycolysis, and pigeon breast muscle for the tricarboxylic acid cycle. In these cases, the designation as a pathway is merited because observation of relative mass flows, or metabolic fluxes, in the network concerned confirms that these are primary routes. However, pathway names are being adopted in bioinformatics databases as gene ontology terms that are attached to the genes for enzymes in organisms in the absence of any equivalent evidence, or even in some cases in contradiction to the observations. (For example, non-cyclic versions of the tricarboxylic acid cycle are common (1), and even in organisms containing the enzymes for the full cycle, the dominant flux patterns are not necessarily cyclic.)
Nevertheless, trends in both experimental and theoretical biochemistry are starting to make the metabolic network a more natural focus of investigation, and this approach will be developed in the rest of this Chapter.

2. Interacting Partners

There are different ways of viewing the metabolic network according to the focus selected. The complexity of the network is such that there
is no single approach that can readily capture all its aspects. This is perhaps not immediately apparent when the basic unit of the network is considered: a reaction catalyzed by an enzyme in which one of the chemical reactants in metabolism, a metabolic intermediate or substrate, is transformed into another intermediate, or product. However, even if water is disregarded as a reactant in view of its ubiquity, something of the order of three quarters of metabolic reactions actually involve two or more substrates reacting to form two or more products.

If we take the case of two substrates, B and Q, reacting to form two products, C and R, how do we represent the network relationships between the four? In many metabolic diagrams (e.g. KEGG maps, at http://www.genome.ad.jp/kegg/, or BioCyc, at http://biocyc.org/), one particular pair, say B and C, is shown as the principal link, and the involvement of Q and R is suppressed or shown as a byproduct of the main conversion. Why is this? One reason is that metabolism is first and foremost carbon chemistry; carbon is the largest component of living cells. Hence the implication of showing a linkage between B and C is that the majority of the carbon atoms in B are also present in C. In the case where the maps show two substrates with equal prominence, it is often because both contribute carbon atoms to the product, and where two products are shown, it is because the substrate has split into two fragments. Thus there is implicit chemical and biochemical knowledge involved in taking the statement that the reaction ‘B + Q → C + R’ occurs in the network and converting that to a network link ‘B → C’ on the basis that it is the principal conversion involved in the reaction. More systematic and automated approaches to deciding the principal connections in the network have been developed (2,3), based on mapping the correspondences between atoms in the substrates and atoms in the products.
For our purposes, we need to consider the implications of including or omitting co-substrate/co-product pairs such as Q and R, since some of them have major implications for the connectivity of the network. This is because these pairs typically participate in a large number of reactions in which the identities of the principal reactants B and C differ. Furthermore, some of these reactions will convert R back to Q, so that the total cyclic flux converting Q to R and back again is far higher than
the rates of conversion of Q and R to any other compounds.

Figure 1. Pathway links through co-metabolites. Two distant pathways, one involving B1 and C1, the other B2 and C2, are apparently linked by a route from B1 through R to C2. However, few or none of the atoms of B1 can reach C2 through this route.

One example is the pair ATP and ADP, standard abbreviations for the most connected metabolites in the metabolic network. They have a major role in the energetics of the cell, but in chemical terms they differ only in that ATP, with three phosphate groups, has one more phosphate than ADP. Hence the reactions in which they occur involve transfer of a phosphate group, but there is no interchange of carbon between the ATP/ADP pair and the B/C pair. If we wanted to trace the flow of phosphate through metabolism, then ATP and ADP would have to be accorded the status of the principal reactants, and biochemists’ maps would need to be redrawn to illustrate this.

The reason for not ignoring pairs such as ATP/ADP entirely is that they impose constraints on the carbon flows even though they are not directly contributing to them in a major way (except that a growing cell must make them). This is because, in a cell at steady state, the total rate of conversion of ATP to ADP is equal to the rate of conversion of ADP back to ATP. In other words, in addition to the links in the network with flows that are constrained by mass conservation of carbon, there are additional linkages across the network imposing
constraints because of the mass conservation of these co-substrate/co-product pairs. A model that includes these links is more realistic, since a metabolizing cell at steady state is simultaneously satisfying all the constraints, but the interpretation of the network connectivity becomes more difficult. For example, if we have two separate reactions both involving the pair Q/R, such as:

B1 + Q → C1 + R
B2 + R → C2 + Q

there is the possibility (see Fig. 1 and ref. 4) of construing an invalid route (in terms of carbon metabolism):

B1 → R → C2

This route, however, could be valid for a dynamical interaction between B1 and C2. As the rates of enzyme reactions are influenced by the concentrations of their substrates and products, a perturbation of the concentration of B1 could affect the rate of the first of the pair of reactions above, inducing a perturbation in the concentrations of Q and R that could in turn affect the rate of the second reaction, thus transmitting the perturbation to C2.

If the matter ended there, it would be a simple choice between including or omitting the co-substrate/co-product pairs in the representation, and choosing appropriate methods of analysis and interpretation. However, a further complication is that such clear-cut distinctions are not possible. For instance, the compounds 2-oxoglutarate, glutamate and glutamine occur with a high frequency throughout metabolism as the pairs 2-oxoglutarate/glutamate and glutamate/glutamine, in large part because of their role in the transfer of nitrogen (in the form of amino groups). However, many cells can live on these compounds as a carbon source because 2-oxoglutarate is a central and well-connected component of carbon metabolism, so it is not straightforward to disentangle their roles in carbon and nitrogen metabolism.
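The invalid route through the Q/R pair can be reproduced in a few lines. The following is a minimal sketch, not a method taken from this chapter: it assumes the simplest possible graph construction (every substrate of a reaction is linked to every product), and the names `metabolite_graph`, `shortest_path` and the `ignore` parameter are hypothetical, introduced only for illustration.

```python
from collections import deque

# The two reactions from the text, as (substrates, products).
REACTIONS = {
    "r1": (["B1", "Q"], ["C1", "R"]),
    "r2": (["B2", "R"], ["C2", "Q"]),
}

def metabolite_graph(reactions, ignore=frozenset()):
    """Link every substrate to every product of the same reaction.

    Metabolites listed in `ignore` (assumed co-metabolites) are left out.
    """
    graph = {}
    for substrates, products in reactions.values():
        for s in substrates:
            if s in ignore:
                continue
            for p in products:
                if p not in ignore:
                    graph.setdefault(s, set()).add(p)
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Naive graph: the co-metabolite pair Q/R creates a spurious link.
naive_route = shortest_path(metabolite_graph(REACTIONS), "B1", "C2")
# Carbon-oriented graph: Q and R are excluded, so no route exists.
carbon_route = shortest_path(
    metabolite_graph(REACTIONS, ignore={"Q", "R"}), "B1", "C2"
)

print(naive_route)   # ['B1', 'R', 'C2']
print(carbon_route)  # None
```

Dropping Q and R removes the spurious carbon route, but it also discards the dynamical coupling between the two reactions described above, which is why the choice of representation depends on the question being asked.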
The necessity of representing multi-substrate, multi-product reactions creates difficulties for adequate representation of metabolism as a graph (for example with metabolites as nodes and reactions as edges). Apart from the issue of deciding which nodes should be connected, in the light of the discussion above, there will be correlated edges that must be traversed in tandem. Some of these problems can be addressed by the use of bipartite graph representations such as Petri nets.

Biochemists distinguish between reactions that are bi-directional or reversible and those that are uni-directional or irreversible. Unfortunately there is no clear division (see ref. 5 for a discussion) since the extent of the reverse reaction depends on the standard free energy change of the reaction and the concentrations of the substrates and products in the cell. Although the standard free energy change is a characteristic of the reaction, the steady state concentrations of metabolites are variables of the metabolic system, and change according to conditions. Values for the standard free energies of biochemical reactions are available from databases (http://xpdb.nist.gov/enzyme_thermodynamics/enzyme_thermodynamics_data.html) (6), though these are far from comprehensive. Forsythe et al. have described a method for inferring the favoured directions of metabolic reactions consistent with the overall thermodynamic gradient and the likely limits to the range of feasible metabolite concentrations in the cell (7).

Some methods of network analysis require that reversible reactions are represented twice, as separate reactions for each direction, but if this is the case, then the trivial cycles formed by the forward and reverse steps of a reaction must be eliminated from any solutions. The case has been made, particularly for kinetic models, that representing reactions as irreversible is an oversimplification that may well misrepresent the network behaviour (8).
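The dependence of reversibility on concentrations can be made explicit with the standard thermodynamic relation (a textbook identity, not a formula given in this chapter). For the generic reaction B + Q → C + R used earlier, and writing [X] for the cellular concentration of X (illustrative notation only):

\[
\Delta G = \Delta G^{\circ\prime} + RT \ln \frac{[C]\,[R]}{[B]\,[Q]}
\]

Here \Delta G^{\circ\prime} is the standard free energy change, R in the prefactor RT is the gas constant (not the metabolite R), and T is the absolute temperature. Net flux runs in the forward direction only when \Delta G < 0, so a reaction with a modest \Delta G^{\circ\prime} can change direction as metabolite concentrations change, whereas one with a large negative \Delta G^{\circ\prime} stays effectively irreversible over the feasible concentration range.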
In the description above, the enzymes are nearly invisible entities that just determine, by their presence or absence, whether a reaction can occur or not. At a more detailed level of description, however, the enzymes are themselves molecules that undergo transient reactions with their substrates and products. Each overall metabolic reaction above can be broken up into a number of discrete events involving the enzyme molecule, which is released unchanged at the end of the catalytic process.
A major reason for not descending to this level of detail is that there are several different molecular mechanisms that are known to implement multi-substrate, multi-product reactions (9), but the proportion of enzymes whose reaction mechanisms have been thoroughly investigated is relatively small. In network analyses based on bipartite graphs, such as Petri net methods (10), the substrates have formal directed links to the enzyme, and the enzyme has formal directed links to the products.

It is also possible to change the perspective completely and regard metabolism as an instance of a functional protein network. In this case, the enzymes, as the proteins, are the nodes, and they are linked if the product of some reaction of one is the substrate of some reaction of another. Here the metabolites become almost invisible, though the issues described above still apply in making the selection of metabolites that are regarded as relevant to the definition of links.

3. Methodology

3.1. Mathematical Representation

All methods of analysis require a mathematical representation of the metabolic network, and the starting point for this is provided by the stoichiometry matrix. To illustrate this, consider the following small network:

[Diagram not reproduced; the network consists of the four reactions listed below.]
Apart from this diagrammatic form, two other types of representation contain the same information:
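\[
\begin{array}{c} S_1 \\ S_2 \end{array}
\begin{bmatrix}
1 & -1 & 1 & 0 \\
0 & 1 & -1 & -1
\end{bmatrix}
\qquad\qquad
\begin{array}{l}
\texttt{r1: X0 -> S1} \\
\texttt{r2: S1 -> S2} \\
\texttt{r3: S2 -> S1} \\
\texttt{r4: S2 -> X1}
\end{array}
\]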
The one on the left is the stoichiometry matrix, where the numbers in the matrix represent the moles of the metabolite specified by the row that take part in the reaction specified by the column (with negative numbers indicating substrates of the reaction). The form on the right is a symbolic list of reactions, each starting with a reaction name (here r1 to r4) followed by the balanced equation for the reaction in text form. All three forms can be inter-converted, but the starting point is to define the set of reactions that occur in the network.

Note that the stoichiometry matrix shown above only contains rows for what are termed the internal metabolites of the network. This is the extent of the system definition needed for many of the modes of analysis of the steady states of the system, where the production and consumption of each internal metabolite are in balance. The external metabolites are not in general balanced, as there must be a mass flow through the metabolic network maintaining its displacement from chemical equilibrium. If the consumption and production of these externals is to be computed, then a full stoichiometry matrix is required, with additional rows describing the involvement of externals in the reactions. As implied in the previous Section, a decision needs to be taken about whether the co-substrates of each reaction are treated as external or internal metabolites.

The main additional piece of information that may not be coded in the stoichiometry matrix is whether reactions are reversible or irreversible. Some methods treat reversible reactions as two separate irreversible reactions, in which case the reversible reactions are present as a pair of columns in the stoichiometry matrix, one the negative of the other.
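The conversion from the reaction list to the stoichiometry matrix, and the steady-state balance of the internal metabolites, can be sketched as follows. This is an illustrative sketch only: the parser handles just the simple `name: A + 2 B -> C` text form assumed here, and the helper names (`parse_side`, `stoichiometry_matrix`, `net_production`) and the flux values are hypothetical, not taken from the chapter.

```python
# Reaction list in the text form used above; X0 and X1 are external.
REACTION_LIST = """
r1: X0 -> S1
r2: S1 -> S2
r3: S2 -> S1
r4: S2 -> X1
"""
INTERNAL = ["S1", "S2"]

def parse_side(side):
    """Parse one side of an equation, e.g. 'A + 2 B' -> {'A': 1, 'B': 2}."""
    coeffs = {}
    for term in side.split("+"):
        parts = term.split()
        if len(parts) == 2:
            count, name = int(parts[0]), parts[1]
        else:
            count, name = 1, parts[0]
        coeffs[name] = coeffs.get(name, 0) + count
    return coeffs

def stoichiometry_matrix(text, metabolites):
    """Return (reaction names, matrix); rows follow `metabolites`."""
    names, columns = [], []
    for line in text.strip().splitlines():
        name, equation = line.split(":")
        lhs, rhs = equation.split("->")
        column = {m: 0 for m in metabolites}
        for m, n in parse_side(lhs).items():
            if m in column:
                column[m] -= n  # substrates are negative
        for m, n in parse_side(rhs).items():
            if m in column:
                column[m] += n  # products are positive
        names.append(name.strip())
        columns.append(column)
    matrix = [[col[m] for col in columns] for m in metabolites]
    return names, matrix

def net_production(matrix, fluxes):
    """Net rate of change of each row metabolite for given reaction rates."""
    return [sum(n * v for n, v in zip(row, fluxes)) for row in matrix]

names, N = stoichiometry_matrix(REACTION_LIST, INTERNAL)
print(N)  # [[1, -1, 1, 0], [0, 1, -1, -1]]

# Hypothetical rates for r1..r4 that balance S1 and S2 (steady state).
balanced = net_production(N, [2, 3, 1, 2])
# Hypothetical rates that do not balance: S1 accumulates, S2 is depleted.
unbalanced = net_production(N, [2, 1, 0, 2])
print(balanced, unbalanced)  # [0, 0] [1, -1]
```

In this small network r2 and r3 together behave as one reversible reaction written as two irreversible ones, and their columns, (-1, 1) and (1, -1), are indeed the negatives of each other.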
Methods that enter reactions once in the matrix can either maintain a separate record of whether reactions are reversible or irreversible, or else arrange the reaction (column) order so as to partition the matrix into reversible and irreversible components. Other representations, such as an adjacency matrix for constructing a directed graph of the network, or a Petri net, can readily be derived from the stoichiometry matrix.
3.2. Defining the Biological System
At this point, a number of difficult biological questions are raised, relating to defining the relevant metabolic network. It will surprise no
Metabolic Networks
one that bacteria, plants and animals all differ in the reactions that are present in their metabolism. Even within these kingdoms, however, there is great variation between species, driven by differences in their natural environments and modes of life, particularly in the way these affect the nature and composition of the organisms’ habitual nutrients. Some of the traditional metabolic charts (e.g. the IUBMB–Nicholson chart; http://www.tcd.ie/Biochemistry/IUBMB-Nicholson/chart.html) provide a superset of known metabolic reactions, as is also the case with the pathways illustrated as maps on the KEGG website (http://www.genome.ad.jp/kegg/). It is, of course, unlikely that any of these maps are comprehensive, since the discovery phase of metabolic biochemistry has never ended, and new reactions and metabolites continue to be added. There would be limited application, however, of carrying out a network analysis on such a superset of known metabolism since it would not represent the biochemistry of any organism that has ever lived. Unfortunately, the alternative approach of representing the metabolism of a specific organism also has drawbacks. Unless the organism is one of the small set of model organisms favoured by biochemists, there are not likely to have been systematic explorations of which of the expected reactions are present. If there are no reasons to suspect the contrary, it will be assumed that the organism has the ability to produce the enzymes that are found in its nearest phylogenetic relatives. Even then, organisms express the enzymes that catalyze particular reactions according to their current environment and nutrients, the stage of their life cycle, and, for multicellular organisms, the tissue type and its metabolic specialization. Since the enzymes are coded by the genes, the sequence of letters in the total DNA of an organism, its genome sequence, contains the information about the set of enzymes it is able to produce. 
The increasing automation and decreasing cost of DNA sequencing has resulted in a rapidly growing number of species, particularly bacteria, whose genome sequences have been determined. The problem is that we cannot completely decipher the information: detecting the protein–coding regions in the genome (the open reading frames, or ORFs), from the sequence information alone, is not an exact science, and neither is the
assignment of the proteins coded by the ORFs to particular enzyme activities. Even for organisms that have long been subject to biochemical study (such as yeast or the bacterium Escherichia coli), a significant fraction of the ORFs have still not been given functional assignments. Another issue is that the presence of a gene for an enzyme is not a guarantee that it is directing the synthesis of the enzyme in the organism in the circumstances for which the metabolism is being investigated. The only certain test is to analyze directly whether a specific reaction is being catalyzed by extracts from the organism. Unfortunately, such tests cannot be easily automated in a manner that would allow the detection of the presence or absence of each of hundreds or thousands of enzymes simultaneously, so other approaches are currently used to try and define the active portions of a metabolism. The next closest thing to measuring enzyme activity is to detect whether the protein possessing the activity is present. This can be attempted with high–throughput proteomics techniques, where the proteins are separated by chromatography or electrophoresis, and the amino acid sequences of fragments of the proteins are determined by mass spectrometry. The amino acid sequence can be linked to a corresponding nucleotide sequence in the genome sequence. Although proteomics instrumentation and techniques are advancing rapidly, it is still not possible to produce a comprehensive catalogue of all the proteins present in a cell, across the dynamic range from the least abundant to most abundant proteins. In addition, all types of cell carry out chemical modifications (known as post-translational modifications) of at least some of the synthesized proteins in response to environmental and intracellular cues, with the effect of either enabling or preventing their enzymic activities. Hence detection of the presence of an enzyme protein does not guarantee that it is active in metabolism. 
An even less direct, but currently easier, method of inferring the set of enzymes present in given conditions is to measure the set of mRNAs being transcribed from the DNA. This is done with microarray technology, where the concentration of thousands of different RNA molecules can be measured in a single experiment. Although on average the level of mRNA reflects the amount of enzyme protein in the cell, the relationship is highly variable. Another problem is that the finite sensitivity of the measurements means that it is not possible to
categorically state that the gene for an enzyme has not been transcribed and translated. At the other extreme, it is possible that a gene has been transcribed to give an RNA molecule (or at least, the parts of it that are detected on a microarray), but that the translation of the message to give enzymic protein does not occur. Apart from the experimental issues involved in defining the set of reactions occurring in a cell under specific circumstances, there are problems in collating the information provided by different types of experiment. A trivial but pervasive reason for this is the differences in nomenclature used in different fields of biological study, to which I will return below. The more significant issue is that there are many exceptions to the simple mapping of a gene to an enzyme to a reaction. Each of these relationships can be a one–to–many, many–to–one or many–to–many mapping. Firstly, there is the case of multiple genes specifying an enzyme activity, which itself divides into two major instances. The first is the case of different genes specifying different variants of an enzyme (isoenzymes) that may be expressed in different environmental conditions or different tissues of a multicellular organism to provide enzymes with different kinetic characteristics. Where the different genes can be expressed in the same cell, however, this relationship also implies the existence of multiple enzymes for the same reaction. The second case is where an enzyme is composed of a number of subunits, each specified by a separate gene, that all have to be present and functional to assemble a working enzyme. Secondly, there is the converse case of a single gene that specifies a multi–functional enzyme. Most commonly, such enzymes catalyze two or more adjacent reactions from the network, and in this case a simple treatment is to represent the overall reaction as a single reaction.
However, there are cases (for example, aspartate kinase I and homoserine dehydrogenase I of Escherichia coli) where there is an intervening reaction and network branch–point between the two steps. Furthermore, multi–functional enzymes in the network of one species can be replaced by separate individual enzymes in another. A third case is that an enzyme may act on more than one substrate molecule, so that it can catalyze two or more reactions in the network.
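The gene to enzyme to reaction relationships described above can be made concrete with a small data structure. This is only a sketch: the gene and reaction names are hypothetical, and the encoding (a list of alternative enzymes per reaction, each enzyme being the set of genes whose products must all be present) is one possible choice assumed here, not a standard taken from the databases discussed in this Chapter.

```python
# Hypothetical gene-enzyme-reaction table, for illustration only.
# Each reaction lists its alternative enzymes (isoenzymes); each enzyme
# is the set of genes (subunits) that must ALL be functional.
GENE_REACTION_MAP = {
    "R1": [{"geneA"}, {"geneB"}],   # two isoenzymes: either gene suffices
    "R2": [{"geneC", "geneD"}],     # one enzyme built from two subunits
    "R3": [{"geneE"}],              # multi-functional enzyme: geneE
    "R4": [{"geneE"}],              #   catalyzes both R3 and R4
}


def active_reactions(gene_reaction_map, deleted_genes):
    """Reactions that still have at least one complete enzyme."""
    deleted = set(deleted_genes)
    return {
        reaction
        for reaction, enzymes in gene_reaction_map.items()
        if any(not (subunits & deleted) for subunits in enzymes)
    }


def lost_reactions(gene_reaction_map, deleted_genes):
    """Reactions removed from the network by deleting the given genes."""
    before = active_reactions(gene_reaction_map, [])
    return before - active_reactions(gene_reaction_map, deleted_genes)
```

With this table, deleting `geneA` alone removes no reaction (the isoenzyme encoded by `geneB` remains), deleting `geneC` removes `R2`, and deleting `geneE` removes both `R3` and `R4`.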
Most enzymes will react with a range of artificial substrate analogues in the laboratory; for our purposes, the issue is which ones, of the different substrates it can act on, will be available to it in a particular cell. This, of course, is itself a network issue, because it cannot be answered without considering which products will be produced by other enzymes in the system. The only feasible approach to take seems to be to produce a comprehensive list of all the reactions that can be catalyzed by the total set of enzymes, and then eliminate reactions that are isolated from the rest of the network because there is no route to their substrates. Finally, it should not be overlooked that there are some reactions, including hydrolyses, hydrations, decarboxylations, anomerizations and isomerizations that occur as spontaneous chemical reactions at an appreciable rate. Cells may produce enzymes to accelerate these further, but absence of the enzyme does not prevent the reaction occurring. The motivation for describing all this detail of the biological complexity is that it is highly relevant to the study and interpretation of robustness and damage of the metabolic network. The issues above all affect the correspondence between biological events, such as gene mutations, and their effects on the metabolic network. In other networks, it is possible to selectively remove both single nodes and single links. This is not the case with metabolism. There is no one–step method, for example, to remove a single metabolite from the network, apart from an end–product of metabolism. In general, it would be necessary to eliminate all producing reactions and all reversible consuming reactions by deleting the enzymes responsible, and it may well not be possible to implement this without simultaneously disconnecting other metabolites from the network. 
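The pruning step mentioned above (list every reaction the enzymes could catalyze, then eliminate those with no route to their substrates) might be sketched as follows. The rule used here, that a reaction is kept only once all of its substrates can be formed from the nutrients, is an assumption made for the sketch; it treats every reaction as irreversible and takes no account of co-substrates that the cell already contains. Metabolite and reaction names are placeholders.

```python
def prune_unreachable(reactions, nutrients):
    """Split reactions into those reachable from the nutrients and the rest.

    reactions maps a name to (substrates, products). A reaction is kept
    as soon as all its substrates are available; its products then become
    available too. The loop repeats until nothing new can be added.
    """
    available = set(nutrients)
    kept = []
    progress = True
    while progress:
        progress = False
        for name, (substrates, products) in reactions.items():
            if name not in kept and set(substrates) <= available:
                kept.append(name)
                available |= set(products)
                progress = True
    removed = [name for name in reactions if name not in kept]
    return kept, removed


# Placeholder candidate list: r3 and r4 need metabolites (Q, Z) that
# nothing in the network produces, and r5 depends on the product of r4.
CANDIDATES = {
    "r1": (["A"], ["B"]),
    "r2": (["B"], ["C"]),
    "r3": (["Q"], ["C"]),
    "r4": (["C", "Z"], ["D"]),
    "r5": (["D"], ["E"]),
}

kept, removed = prune_unreachable(CANDIDATES, nutrients=["A"])
```

In this toy case `r1` and `r2` are kept, while `r3`, `r4` and `r5` are eliminated; note that `r5` is lost only as a consequence of `r4`, which is why the question is a network issue.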
Experimentally, major methods for manipulating metabolism are gene mutation and ‘silencing’, but a single action on the genetic network may either eliminate multiple reactions in the metabolic network, or eliminate none at all, for the reasons given above. Drugs and biocides include compounds that act by damaging the metabolic network of an organism by inhibiting or inactivating specific enzymes; examples include penicillin and the herbicide glyphosate. Again, there is the possibility that blocking one enzyme acts on multiple reactions of the network, though in practical terms, an additional issue is that few enzyme inhibitors are completely specific for a single enzyme. The
question of how to perform biologically–relevant damage analysis will be returned to later in this Chapter.
4. Computational Modelling
4.1. Sources of Data
The starting point for any analysis is evidently the metabolic network of a chosen organism. De novo construction of a genome–scale reaction list for the metabolic network of an organism is a significant effort, requiring a reasonable knowledge of metabolic biochemistry in general, and of the biochemical specializations of the chosen organism. Something of the order of a thousand reactions, involving a similar number of metabolites, may need to be defined, though the main core of metabolism can be described by a few hundred reactions and metabolites. One option is to use an existing list. Palsson’s group has published analyses of several genome–scale models (11,12) and makes available the reaction lists in electronic form on the group’s web site (http://gcrg.ucsd.edu/); another recent example is the model of Streptomyces coelicolor (13). Given their size, it would be preferable if the reaction lists could be assembled from the databases of annotated genomes and enzyme reactions. The databases available for this, and the steps involved have been described by Francke et al. (14). Useful collections that include annotated genomes with links to enzyme and reaction data include BioCyc (http://biocyc.org/) and KEGG (http://www.genome.ad.jp/kegg/), especially as the files containing the data are publicly available and there are interfaces on the databases for automated querying. However, the experience in my group (15) is that models made this way do not correspond to a complete and connected metabolism, and considerable manual intervention is required to arrive at a network that could conceivably represent a functional metabolism. Typical problems have recently been described by Borodina et al. (13) in the preparation of a Streptomyces coelicolor model, many arising from incomplete or inaccurate annotation of the genome sequence. The reported difficulties of using the databases (16) in defining an accurate enzyme graph include
lack of definitive lists of the full range of substrates used by enzymes in the standard source of enzyme nomenclature (http://www.expasy.org/enzyme/) and lack of consistency in the names of metabolites that are only partially addressed in more extensive data repositories such as KEGG. These latter authors initiated steps to resolve the problem by modifying the datasets. An important area where improvements are being sought is the initial annotation of a new genome sequence to indicate the enzymes encoded. In the first instance, if the gene has not already been linked to an enzyme by experimental evidence of function, the standard approach is to search for sequence homology between the putative gene and known enzymes in other organisms. This process can be automated with suitable bioinformatics tools, but errors are inevitable because, although many sequence changes have no effect on the specificity of an enzyme, it is known to be possible to change specificity with a single change. In addition, the function of the gene that is used as the reference in the assignment by homology may itself have been assigned by homology, so there may be chains of inference before a link to direct experimental evidence is reached. Several groups have sought to improve the process by using information about known metabolic networks to search for sets of enzymes that need to occur together to make connected routes, and searching for sequences in the genome being annotated that might code for enzymes that would fill gaps in the route. An example application is the metabolic annotation of bacterial carbohydrate metabolisms (17). More recently, the metabolism coded in the human genome has been investigated (18). Computer tools are being developed to automate some of these and other approaches, such as BioMiner (19), metaSHARK (20) and STRING (21). As indicated above, the definition of the network for further analysis necessarily involves making decisions about which metabolites are external.
This will usually include nutrients used and metabolic waste products produced by the organism. In the case of growing cells, it is usually necessary to regard the components of biomass as externals. The classification of co–substrates such as ATP as external or internal has to be appropriate to the intended analysis.
4.2. Algorithms and Analyses
4.2.1. Graph Analysis
Some of the network characteristics are easily determined from the stoichiometric matrix. For example, the degree of a node in the metabolite graph is given by the number of non–zero entries in the corresponding row; the degree of a reaction is given by the number of non–zero entries in the relevant column. Other characteristics, such as mean shortest path lengths, clustering coefficients and diameter can be computed with appropriate graph analysis tools after conversion of the stoichiometry matrix to an appropriate adjacency matrix (see Chapter 1).
4.2.2. Flux Balance Analyses
For reasons to be discussed in the next Subsection, approaches other than graphical network analysis have been taken to the structural analysis of metabolic networks. What these other methods have in common is that they explore the constraints on feasible patterns of flow through the network that are imposed by the requirement that the synthesis and degradation of every internal metabolite is balanced in a metabolic steady state. Any metabolic pathway at steady state satisfies the relationship N·v = 0, where N is the stoichiometry matrix defined above and v is a vector of steady state reaction rates. For the substrate cycle network shown previously:
\[
\begin{array}{c} S_1 \\ S_2 \end{array}
\begin{bmatrix} 1 & -1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix}
\cdot
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \end{bmatrix}
\]
Any observed set of velocities at steady state will be a linear combination of a set of vectors K referred to as a kernel or a basis for the null space of the stoichiometry matrix. It does not matter for our purposes how a basis is calculated (e.g. ref. 22), though numerical mathematics programs generally provide a function for doing this. In this
case, a suitable basis could be:
\[
K = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}
\]
The vectors are, in essence, prototype solutions of the problem, since there are no unique solutions in this case (or for metabolic networks in general). The number of null space vectors tells us the number of independent fluxes that can exist in the pathway: in this case, a linear flux and a cyclic flux. The null space vectors can be considered as pathways through the network, and any feasible set of velocities at steady state is a linear combination of these null space vectors, e.g.:
\[
K = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}
\]
and
\[
\begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}
\cdot
\begin{bmatrix} a \\ b \end{bmatrix}
=
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} a \\ a+b \\ b \\ a \end{bmatrix}
\]
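For readers who wish to check the kernel for themselves, the following sketch computes a null space basis by exact row reduction and forms a linear combination of the basis vectors. It is an illustration under stated assumptions: the function names are invented here, the elimination is a plain Gauss–Jordan procedure on rational numbers rather than the method of ref. 22, and the basis vectors it returns are the same two vectors as the columns of K but listed in the opposite order.

```python
from fractions import Fraction

# Stoichiometry matrix of the substrate cycle (rows S1, S2; columns r1..r4).
N = [[1, -1, 1, 0],
     [0, 1, -1, -1]]


def null_space(matrix):
    """Return a basis of {v : matrix . v = 0} using exact arithmetic."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0])
    pivots = []
    r = 0
    for c in range(n_cols):
        if r == len(rows):
            break
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        lead = rows[r][c]
        rows[r] = [x / lead for x in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    basis = []
    for free in (c for c in range(n_cols) if c not in pivots):
        vec = [Fraction(0)] * n_cols
        vec[free] = Fraction(1)
        for row_index, p in enumerate(pivots):
            vec[p] = -rows[row_index][free]
        basis.append(vec)
    return basis


def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def combine(basis, coefficients):
    """Linear combination of the basis vectors."""
    return [sum(c * vec[i] for c, vec in zip(coefficients, basis))
            for i in range(len(basis[0]))]


cycle, linear = null_space(N)          # (0,1,1,0) and (1,1,0,1)
v = combine([linear, cycle], [2, 3])   # a = 2, b = 3
```

With a = 2 and b = 3 the combination is (a, a+b, b, a) = (2, 5, 3, 2), and `mat_vec(N, v)` returns zeros, confirming that the flux pattern satisfies the steady state condition.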
Without the kinetic rate functions, we cannot predict actual values of a and b for a cell, but it seems that we can decompose the network into components that correspond to different ways in which the reactions can operate. For various reasons, a basis for the null space cannot be guaranteed to correspond to a physically realizable path through the network, and is also not a unique decomposition of the network into a set of paths. An alternative that overcomes this limitation is elementary modes analysis developed primarily by Stefan Schuster (23,24). This computes all the minimal paths through the network that satisfy all the constraints on the synthesis and degradation of every internal metabolite. Palsson’s group
has developed extreme pathway analysis, which shares many attributes with elementary modes analysis (25,26). A number of programs are freely available that implement elementary modes analysis of a metabolic network. A potential problem with elementary modes analysis is that the total set of independent routes through the network can be extremely large, probably more than $10^{6}$ for a genome–scale network. Another approach to analyzing the equation shown above is linear programming (27-31). Here it is necessary to add additional constraints on the external metabolites, for example, in terms of the maximum allowable net consumption of a particular nutrient, and assign a cost function that must be maximized or minimized, for example, the maximum amount of a given product that can be formed from a fixed amount of nutrient. Linear programming solutions can be obtained with standard codes; for large genome–scale models, commercial programs may give the best performance. The solution obtained will be either a single flow pattern in the network that optimizes the cost function, or a set of flow patterns that give the same optimal value. No information is obtained about non–optimal routes through the network.
4.3. Are Paths in a Graph Metabolic Pathways?
The reason for introducing flux balance analyses was to be able to explore the question of whether the paths obtained by treating the stoichiometry matrix as a graph are equivalent to those determined by flux balance analysis methods, and if the two approaches differ, where this difference arises. Consider the possibility that humans can convert acetyl–CoA (derived from fat in the diet) to the carbohydrate glucose by the metabolic route summarized in Fig. 2a. There is no doubt that a directed path can be traced via citrate, succinate and oxaloacetate to glucose. Yet this is not a functioning metabolic route; humans cannot achieve net formation of glucose from acetyl–CoA in spite of the existence of this path. The reason is that there is no possibility for material, in the form of carbon, to follow this path. The two-carbon acetyl fragment needs to be joined to a four-carbon molecule, oxaloacetate, to make citrate. On the path from citrate to oxaloacetate, two carbons are lost as CO2, so at this point the pathway has merely
restored the oxaloacetate that was used at the start, and has not generated any new material that can follow the path to glucose. Flux balance analysis methods such as linear programming and elementary modes do not propose the existence of a route from acetyl–CoA to glucose in this network because they take into account the mass conservation constraints.
Figure 2. Glucose from acetyl–CoA? a) The potential but invalid route available in humans. b) The route available in many plants, bacteria and fungi.
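The difference between a path in the graph and a stoichiometrically valid route can be checked numerically. The sketch below uses a deliberately lumped version of the two networks in Fig. 2, which is an assumption made here for brevity: isocitrate, 2-oxoglutarate, malate and PEP are folded into single steps, co-substrates are ignored, and for panel b the bypass via glyoxylate is written as one reaction consuming citrate and a second acetyl–CoA. The reaction names are invented for the example.

```python
from collections import deque

# Lumped reactions (an assumption of this sketch, see the text above).
ROUTE_A = {
    "cs":  ({"acetyl-CoA": 1, "oxac": 1}, {"citrate": 1}),
    "tca": ({"citrate": 1}, {"succinate": 1, "CO2": 2}),
    "so":  ({"succinate": 1}, {"oxac": 1}),
    "gng": ({"oxac": 2}, {"glucose": 1, "CO2": 2}),
}
# Panel b adds the lumped glyoxylate bypass.
ROUTE_B = dict(ROUTE_A)
ROUTE_B["glx"] = ({"citrate": 1, "acetyl-CoA": 1}, {"succinate": 1, "oxac": 1})

INTERNAL = ["oxac", "citrate", "succinate"]


def path_exists(reactions, start, goal):
    """Breadth-first search on the substrate -> product graph."""
    seen, queue = {start}, deque([start])
    while queue:
        met = queue.popleft()
        if met == goal:
            return True
        for substrates, products in reactions.values():
            if met in substrates:
                for product in products:
                    if product not in seen:
                        seen.add(product)
                        queue.append(product)
    return False


def net_production(reactions, fluxes):
    """Net rate of production of each metabolite for the given fluxes."""
    net = {}
    for name, (substrates, products) in reactions.items():
        v = fluxes.get(name, 0)
        for met, coeff in substrates.items():
            net[met] = net.get(met, 0) - coeff * v
        for met, coeff in products.items():
            net[met] = net.get(met, 0) + coeff * v
    return net


def is_steady(reactions, fluxes):
    net = net_production(reactions, fluxes)
    return all(net.get(met, 0) == 0 for met in INTERNAL)


cycle_only = {"cs": 1, "tca": 1, "so": 1}
with_glucose = {"cs": 1, "tca": 1, "so": 1, "gng": 1}
bypass = {"cs": 2, "glx": 2, "so": 2, "gng": 1}
```

In the lumped panel a network, `path_exists` finds a path from acetyl–CoA to glucose, yet the only balanced pattern (`cycle_only`) makes no glucose, and adding any glucose flux (`with_glucose`) drains oxaloacetate. In the panel b network, `bypass` is balanced and converts four acetyl–CoA into one glucose and two CO2.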
Organisms that can convert acetyl–CoA to glucose have the reaction network shown in Fig. 2b. Here, citrate is converted to succinate without the loss of CO2; two carbons are retained in the form of glyoxylate, which reacts with a second molecule of acetyl–CoA to form an extra four-carbon oxaloacetate molecule in addition to the one derived from succinate. There is a sense in which the route from acetyl–CoA to glucose shown in Fig. 2a does exist: if the acetate component of acetyl–CoA is labelled by replacing the normal carbon atoms with ¹³C or ¹⁴C, the isotopic label does travel to glucose because the two CO2 molecules
released as citrate is converted to succinate do not come from the acetyl–CoA but from the oxaloacetate. That is, individual carbon atoms traverse the path, but there is no net mass transfer along it.
5. Global Topology of the Network
5.1. Introduction
The realization that a metabolic network is an object that can be the subject of computation rather than just a descriptive map was slow to come. One of the first analyses was the analysis of metabolic yields by linear programming, referred to above (27-31). Then, in her formalization of the mathematical basis of metabolic control analysis, Reder (32) showed that there were stoichiometric constraints on the allowable steady states of a metabolic network, and that these constraints were calculable via the null space of the stoichiometry matrix, without reference to the kinetic characteristics of the enzymes. Although they did not propose a formal method for finding them, Leiser and Blum (33) proposed that, in order to estimate the cyclic fluxes contributing to a metabolic flux pattern measured by radioactive isotopes, a metabolic network could be decomposed into a number of underlying linear and cyclic fundamental pathways. Shortly after this, computer–based route–tracing methods were developed to determine the existence and yields of routes through metabolic pathways to specific products (34,35) in order to optimize productivity in biochemical engineering processes. The computation was limited in the size of network because of combinatorial explosion in the number of partial routes that had to be examined on the way to a solution. Apart from this, other developments in analysis of metabolic networks continued to focus on the stoichiometry matrix, leading to the development of elementary modes analysis (23,24) and the related extreme pathway analysis (25,26). At this point, two developments came together to promote the concept of graphical analysis of metabolic networks as opposed to the analysis of stoichiometric constraints. Andreas Wagner and I were
interested in the ideas coming from complex systems theory that suggested that robust, evolvable systems would be modular. We wondered whether this would apply to metabolism, especially as it was commonly claimed that all cellular components could be synthesized from a small number of central metabolites, potentially the connectors between the metabolic modules, even though suggested lists of these central metabolites (e.g. refs. 36,37) differed in length and members. Barabási’s group was pursuing the idea that a number of natural networks that had previously been modelled as random graphs (e.g. refs. 38,39) had distinctly different characteristics, and they turned to the metabolic network as another potential illustration of this point. Both we and Barabási’s group found that the metabolic network shared the characteristics of a ‘small world’ network (40,41).
5.2. Small World Characteristics
The key characteristic of metabolism, in instances from a number of different organisms when represented as a graph, is that it is clearly not a random graph of the Erdős–Rényi type (see Chapter 1) (40,41). Specifically, if the graph is generated with metabolites as nodes and reactions as edges, the number of connections per node is not distributed as would be expected for random variation about the mean value. In fact, the distribution of number of connections approximately follows a power law whereby a small number of metabolites have a large number of connections and most metabolites have only one or two connections. (Irrespective of the mathematics, we would expect the metabolic graph to be fully connected, at least if we have a complete knowledge of metabolism, since we know that many bacteria and plants can generate all their biomass constituents from single sources of carbon, nitrogen, phosphorus and sulphur.) The graph also shared two other characteristics with the ‘small world’ graphs defined by Watts (42,43): the mean shortest path length between metabolites was as short as for a random graph, but the average clustering coefficient, which measures the fraction of the adjacent nodes of a metabolite that are also adjacent to one another, was very much higher than for a random graph of the same average connectivity.
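The three quantities referred to above (connections per node, mean shortest path length and clustering coefficient) are straightforward to compute for any undirected graph. The following sketch uses a four-node toy graph invented for the purpose, not metabolic data, and assumes a connected graph stored as a dictionary of neighbour sets.

```python
from collections import Counter, deque
from itertools import combinations

# Toy undirected graph (hypothetical): a triangle A-B-C with D hanging off C.
GRAPH = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}


def degree_distribution(graph):
    """Map each degree k to the number of nodes having that degree."""
    return Counter(len(neighbours) for neighbours in graph.values())


def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbours of node that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, w in combinations(neighbours, 2) if w in graph[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(graph):
    return sum(clustering_coefficient(graph, n) for n in graph) / len(graph)


def mean_shortest_path(graph):
    """Mean over all ordered pairs of distinct, mutually reachable nodes."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs
```

For the toy graph the average clustering coefficient is 7/12 and the mean shortest path length is 4/3; the small world comparison in the text consists of setting such values against those of a random graph with the same number of nodes and the same average connectivity.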
Our conclusions (41) and those of Jeong et al. (40) were essentially similar even though the metabolite graphs had been constructed in different ways. Our graph recorded an edge between two metabolites if they occurred in the same reaction, whether as substrate, co–substrate, product or co–product, and irrespective of the reversibility of the reaction. We did repeat the analysis with the most common co–substrates and co–products removed, but with essentially similar results. Jeong et al. included metabolites and enzymes and a hypothetical enzyme–metabolite intermediate of the reaction; the graph was directed according to direction of the reaction, and reversible reactions were represented by two directed edges (40). This means that their network for the same organism (E. coli) was significantly larger, though only the metabolite nodes were counted in the calculations. The conclusion seems inescapable that metabolism has neither the regular layout (linear pathways, occasional branches, some cycles) conveyed by biochemistry textbooks and generally used for theoretical investigations, nor a random structure as assumed by Kauffman (39). Its exact characteristics, and the origin and significance of these characteristics, have remained controversial.
5.3. Short Path Lengths
Both we and Jeong et al. reported the mean shortest path between metabolites as close to 3, increasing to nearly 4 when common co–substrates were removed, which is of the same order as that expected for a random graph of the same average connectivity. However, this should not be interpreted as the average length of a metabolic pathway, since the analysis does not guarantee that the sequence of steps constitutes a stoichiometrically valid pathway for the conversion of the starting metabolite into the other, for the reasons discussed in Section 4.3 above. It does represent a route whereby a perturbation of one metabolite concentration could spread to another through perturbation of the rates of the intervening reactions. Since metabolic systems mostly relax to steady states, the short paths could mean that the system rapidly dissipates the perturbation, as suggested in ref. 43. Supporting the view that the network architecture would not tend to sustain perturbations is the
observation that the E. coli metabolic network has more cycles of length three and fewer long cyclic sub–graphs than would be expected for both random and synthetic small world graphs (44). Arita measured a more meaningful path length by constructing a directed graph in which two metabolites were connected by an edge only if at least one carbon atom of one is transferred to the other by the reaction (2). This ensures that paths cannot go via substances such as ATP and NADH which transfer phosphate and hydrogen (or strictly, electrons) respectively and not carbon. This does not guarantee that the routes satisfy the strict definition of metabolic pathway given above, but the majority of them should be functional. This increased the path length to 8.4, which is probably more representative of an average metabolic conversion. However, there is still some uncertainty about this figure, since the metabolic model only accounted for about half of the carbon atoms of the total set of metabolites being reachable from glucose, whereas all of them must be because the bacterium can grow on glucose as a sole carbon source. The reasons for the incomplete reachability include the computation not traversing cyclic routes (to ensure completion of the computation) and missing connections owing to gaps and inconsistencies in the underlying metabolic data (2). Rahman et al. have developed a tool that uses automated scoring criteria to determine which product most closely resembles which substrate, and to compute shortest paths and other statistics using only these principal links and ignoring small molecules (3). For the E. coli metabolic network, the average path length is reported as 8.1. Thus both ref. 2 and ref. 3 have contradicted the claim that metabolism has path lengths as short as those expected in a random graph.
5.4. Power Law Connectivity
The claim (40,41) that the connectivity distribution of metabolites is described by a power law relationship, in a range of different organisms, has been met with some scepticism, though it is accepted that the distribution is generally different from a statistically random one. Zhu and Qin, however, have recently disputed that all metabolic networks have this characteristic; they accept it is reasonable for eukaryotes and
bacteria, but not for the smaller networks of the archaea, which are closer to random networks (45). A weakness of the case in favour of a power law relationship is that the data barely span three orders of magnitude, and are not linear (on a log scale) over the whole of that range since there are deviations at the extremes. Unfortunately, the relationship is unlikely to be tested any further in the future, since there is no reason to expect that there is enough undiscovered metabolism to extend the coverage by another decade. Certainly, the limited range of data means that metabolism is not the best example of the claims for the widespread occurrence of power law connectivity in biological and man–made networks. Behind the arguments is the issue of whether metabolism can be grouped with other networks that all possess similar properties and obey similar principles, as promoted by the Barabási group (40,46), or whether the properties can be seen as the result of a mixture of chemical constraints and biological processes that produce network properties with a superficial, but not strong or deep, resemblance to other small–world networks (47,48). For the biologist, this is the interesting question. Does the current metabolite connectivity distribution provide information about the processes and constraints that have been active during the evolution of metabolism? Meléndez–Hevia and colleagues (49,50) have argued that certain metabolic pathways can be shown to be optimal solutions to the chemical constraints on feasible biochemical reaction networks, but it remains to be seen how far this line of argument can be extended. I will give further consideration below to the issue of the evolution of the structure.
5.5. Modular Structure
Biochemists think of metabolism as having a modular structure, with the broad divisions of anabolism, catabolism and central (or anaplerotic) metabolism subdivided further into major pathways, such as glycolysis, etc. Functional decomposition of metabolic networks by elementary modes analysis (24) suggests that the traditional biochemical classifications do not capture the rich potential of the biochemical network, though elementary modes analysis itself needs refinement in its implementation in order to reveal internal structure of the network (51).
Barabási’s group itself noted (52) that there was an apparent contradiction between the power law connectivity of metabolites, which implies self–similarity of metabolism on different scales, and the high clustering coefficient (41), which implies a modular topology of densely interconnected units linked by a few inter–module links. They analyzed the clustering coefficients in the metabolic networks of 43 different organisms, and concluded that the clustering coefficients were not only higher than for random networks, but also higher than expected for a simple scale–free network. They observed an inverse relationship between the clustering coefficient and the size of the network measured in terms of its number of links, which they claimed is consistent with the properties expected of a network showing ‘hierarchical modularity’. They then applied a hierarchical clustering algorithm to the ‘topological overlap matrix’, which measures the degree of commonality in the sets of neighbours of any two metabolites, for a reduced metabolic map of E. coli. This demonstrated distinct clusters that appeared to relate to functional subdivisions of metabolism. Ma and Zeng analyzed the structure of E. coli metabolism by a cluster analysis based on the reachability of one metabolite from another, taking into consideration the directionality of the metabolite graph (53). On this basis, they identified a ‘giant strong component’ (GSC) of strongly inter–connected metabolites that were fully inter–convertible, but that only constituted about a third of the total metabolites. In addition, there was a substrate subset that could give rise to components of the GSC but could not be formed from it, and a product subset that was formed from the GSC but that could not produce it. There was a remnant, the isolated subset, that had weak connections to the substrate and product sets, but no connections to the GSC.
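As a rough illustration of the reachability-based decomposition just described, the following Python sketch splits a small directed metabolite graph into a giant strong component, a substrate subset, a product subset and an isolated subset. The toy edge list, the metabolite labels and the function names are invented for illustration and are not taken from ref. 53; strong components are found here by intersecting forward and backward reachability sets, which is an assumption made for brevity and not necessarily the algorithm used by Ma and Zeng.

```python
from collections import defaultdict


def reachable(start_nodes, adjacency):
    """Return all nodes reachable from start_nodes (inclusive) via adjacency."""
    seen = set(start_nodes)
    stack = list(start_nodes)
    while stack:
        node = stack.pop()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def bow_tie(edges):
    """Split a directed graph into GSC, substrate, product and isolated subsets."""
    forward, backward, nodes = defaultdict(set), defaultdict(set), set()
    for source, target in edges:
        forward[source].add(target)
        backward[target].add(source)
        nodes.update((source, target))

    # The strong component of a node is the set of nodes that it can reach
    # and that can also reach it; keep the largest such component as the GSC.
    gsc, unassigned = set(), set(nodes)
    while unassigned:
        node = next(iter(unassigned))
        component = reachable([node], forward) & reachable([node], backward)
        unassigned -= component
        if len(component) > len(gsc):
            gsc = component

    substrates = reachable(gsc, backward) - gsc  # can feed the GSC
    products = reachable(gsc, forward) - gsc     # can be made from the GSC
    isolated = nodes - gsc - substrates - products
    return {"gsc": gsc, "substrates": substrates,
            "products": products, "isolated": isolated}


# Hypothetical toy network: S feeds the cycle A -> B -> C -> A, which makes P;
# X links S to P without ever touching the cycle.
toy_edges = [("S", "A"), ("A", "B"), ("B", "C"), ("C", "A"),
             ("C", "P"), ("S", "X"), ("X", "P")]
parts = bow_tie(toy_edges)
```

In this toy case the cycle A, B, C forms the GSC, S is the only substrate, P the only product, and X falls into the isolated subset because it touches the substrate and product sets but never the GSC.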
They characterized the substrate, GSC, product clusters as a ‘bow–tie’ structure. Csete and Doyle also proposed that metabolism shares this bow–tie organization with a number of other networks, and that this allows highly–optimized trade-offs between numerous requirements such as low reaction complexity, minimization of genome size, efficiency, evolvability and adaptability to perturbations (47). Tanaka arranged the stoichiometry matrix for the metabolism of Helicobacter pylori by biochemically–defined functional classes and
arranged the low–connectivity metabolites to diagonalize the matrix; highly–connected ‘precursor’ metabolites and ‘carrier molecules’ were separated at the bottom of the matrix (54). He argues that the connectivity distributions of the metabolites in separate functional modules are exponential rather than power–law, and that the power–law arises from the combination of the connectivity distributions of the many metabolites unique to the modules with the few highly–connected carrier metabolites common to all the modules. This arises from the constraints on the structure of the stoichiometry matrix imposed by the ‘bow–tie’ structure of metabolism. For these reasons, he terms the metabolic network “scale–rich”.
5.6. Evolution of the Structure
It is obvious that metabolism is a product of evolution, but the high degree of commonality of the reactions of central metabolism in all known living organisms implies that much of this evolution occurred before the last common ancestor. Hence there is little direct evidence to allow either a reconstruction of the stages of development of metabolism or an assessment of the relative importance of different constraints, such as chemical feasibility of different conversions, that may have influenced its course. One of the reasons, therefore, for the interest in the finding of power–law connectivity in the metabolite graph was that Barabási and Albert had suggested that a growing network would come to have this characteristic if new nodes were preferentially connected to the nodes that already had the highest connectivity (55). In the case of metabolism, this would seem to be a plausible conjecture, since new enzymes (and hence new reactions) must evolve from existing enzymes, either by gene duplication and subsequent divergence through mutation, or through genetic recombination events that bring together functional domains of different proteins in new arrangements. In either case, the most connected metabolites would be those for which there would be the largest number of enzymes, thus increasing the chance that the highly connected metabolites would contribute one of the reactants to any novel reaction. The implication of this is that the set of most highly connected metabolites will be similar in
all extant organisms (40) and that the most connected metabolites represent the original core of metabolism (41). In fact, the most highly connected metabolites do match reasonably well with the protometabolism suggested by Morowitz (56,57). Furthermore, Light et al. have provided evidence that the enzymes in E. coli that occur most widely in bacteria and eukaryotes have higher connectivities than average, which is consistent with the preferential attachment hypothesis (58). Pfeiffer et al. simulated evolutionary growth of metabolic networks by plausible biological mechanisms to study the emergence of network connectivity (48). They found that this did not lead to power–law connectivity, but the connectivity was similar to that seen in subnetworks of E. coli of similar size. This seems to offer independent support to the argument by Tanaka that power law connectivity is not seen in modules of metabolism (54).
5.7. Robustness and Damage
Jeong et al. argued that the power–law connectivity implied that metabolic networks would be robust to random attacks on their integrity, which in biological terms would correspond to mutations of the genes coding for the enzymes (40). (Destruction of a metabolite or enzyme molecule is not in itself greatly significant, since there are large numbers of them per cell, and they are replaceable in any viable cell.) They simulated mutation by removing nodes at random from the metabolite graph and determining the effect on the properties of the remaining graph. This is not, however, a fair analogy of mutation, which removes the group of reactions catalyzed by the affected enzyme, corresponding to one or more edges of the metabolite graph. This may not remove any metabolites at all if there are other producing and consuming enzymes for all the metabolites associated with the mutated enzyme. On the other hand, there may be occasional cases where loss of an enzyme function could remove a number of metabolites if it catalyzes a set of reactions that are the unique producing reactions for the set of products involved. Assessment of the consequences of mutation thus requires careful consideration of the consequences of removing edges from the
metabolite graph. A graphical approach to this consists of removing the edges corresponding to an enzyme from the graph and determining whether there are then any metabolites that are no longer reachable. The reactions (edges) involving any such metabolites are also deleted and the process is continued until no further metabolites are affected, at which point the damage is assessed as the total number of metabolites disconnected from the graph by the original event (59). Contrary to the scenario envisaged by the Barabási group, the highest levels of damage occurred with enzymes producing metabolites of low connectivity, and little damage was associated with enzymes producing metabolites with high connectivity. The nine percent of enzymes causing the most damage corresponded to a significant fraction of the enzymes known to be essential for viability from experimental studies. (For reference, the fraction of gene products found to be essential for microbial growth in one or more environments, when mutated one at a time, is only of the order of a quarter of the total. In yeast, about 70% of the genes reveal no obvious phenotypic effects in single deletion mutants, though they make a measurable contribution to fitness; ref. 60.) Although this damage analysis offers a significantly better assessment of the robustness of the metabolic network, it has the potential for under-estimating the damage caused by mutation since the measure of damage does not include any weighting for whether the metabolites removed are necessary for any vital function such as growth, nor whether the remaining paths on the graph allow stoichiometric production of the essential components of biomass. (The ‘choke point’ analysis of Yeh et al. is even more limited in this respect, since it only identifies metabolites with unique producing or consuming reactions; ref. 61.)
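The iterative deletion procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the five-reaction toy network, the reaction names and the `damage` function are hypothetical, and "no longer reachable" is read here as "not a nutrient and no longer produced by any remaining reaction", which may differ in detail from the published method.

```python
def damage(reactions, deleted, nutrients):
    """Count metabolites cut off when the reactions in `deleted` are removed.

    reactions: dict mapping reaction name -> (substrates, products), both sets.
    deleted:   names of the reactions catalysed by the mutated enzyme.
    nutrients: externally supplied metabolites, never counted as lost.
    """
    # Only metabolites that the intact network produces can be damaged.
    made_intact = set().union(*(prods for _, prods in reactions.values()))
    active = {name: rxn for name, rxn in reactions.items()
              if name not in deleted}
    lost = set()
    while True:
        still_made = set().union(*(prods for _, prods in active.values()))
        newly_lost = made_intact - still_made - nutrients - lost
        if not newly_lost:
            return len(lost)
        lost |= newly_lost
        # Delete every remaining reaction that consumes a lost metabolite.
        active = {name: (subs, prods)
                  for name, (subs, prods) in active.items()
                  if not (subs & lost)}


# Hypothetical toy network: c can be made by two routes (r3 and r4),
# whereas a, b and d each have a single producing reaction.
toy_reactions = {
    "r1": ({"glc"}, {"a"}),
    "r2": ({"a"}, {"b"}),
    "r3": ({"b"}, {"c"}),
    "r4": ({"a"}, {"c"}),
    "r5": ({"c"}, {"d"}),
}
toy_nutrients = {"glc"}

damage_r1 = damage(toy_reactions, {"r1"}, toy_nutrients)  # sole route to a
damage_r2 = damage(toy_reactions, {"r2"}, toy_nutrients)  # sole route to b
damage_r3 = damage(toy_reactions, {"r3"}, toy_nutrients)  # c has a bypass
```

In the toy network, deleting r1 disconnects all four downstream metabolites, deleting r2 removes only b, and deleting r3 causes no damage because r4 still supplies c. As the text notes, such a count says nothing about whether the lost metabolites are needed for growth.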
These issues had already been addressed by Palsson’s group using flux balance analysis methods; the reasons why such approaches are more appropriate to answering such questions have been considered earlier in this Chapter. For example, Edwards et al. computed the effect of mutations on the growth characteristics of E. coli correctly, relative to the experimental observations, in 86% of the instances examined (12). Extreme pathway analysis has been used to define network robustness in terms of the number of independent routes available to biomass precursor molecules (62), and elementary modes analysis has been used to define a
measure related to the fraction of the routes remaining after deletion of a single enzyme (63). Sets of enzymes whose removal abolishes a particular network function have been identified by determining the minimal cut sets of the elementary modes (64). The identification of sites that are vulnerable to damage has potential applications in finding drug targets in pathogen metabolism, and in genetic engineering where it is desired to suppress pathways that are diverting precursors away from the desired end–product. It is less obvious that the metabolic network structure will have evolved to maximize robustness of the network against mutation. Evolution acts at the level of the population, not the individual, and in microorganisms, in which the main components of current metabolism evolved, the occasional cell that acquires a mutation that reduces its fitness will be outgrown by its wild–type siblings. Extended survival of a mutant cell is more likely to be detrimental to the population, by unnecessarily depriving wild–type cells of resources. The robustness that matters is the ability to maintain a coordinated metabolism that produces biomass constituents efficiently in the correct proportions in a fluctuating or even hostile environment. Protection against mutations is achieved by minimizing errors in DNA replication and by repair mechanisms.
6. Dynamics
It is difficult at this time to make very specific statements about the dynamics of complete metabolic networks. This is not because of lack of interest: metabolic engineering and drug development could be revolutionized if we could predict the detailed response of metabolism with accuracy. Currently, the only cell for which we can claim a relatively complete model of its metabolism is the human red cell (e.g. ref. 65), which has a sluggish and very limited metabolic repertoire. The main problem is that, whatever the limitations to our knowledge of the structure of the metabolic network, it is far more complete than our knowledge of the kinetic characteristics of the enzymes of the cell. Enzymes have complex, non–linear kinetic responses to metabolite concentrations, and the number that have been investigated in any detail is a small fraction of the total. It is possible to build useful and
informative simulations of small segments of metabolism of particular cells (e.g. ref. 66), but we lack the data to extend this approach to whole cells at the moment. There are two available on–line collections of metabolic models (67,68). Because of this shortage of kinetic data, it is not easy to test the applicability to metabolism of the suggestion by Watts and Strogatz that small–world architecture causes perturbations to dissipate rapidly (43). Consistent with this suggestion, however, is that the metabolic network contains a larger proportion of short cycles and a deficit of larger cycles relative to both random graphs and synthetic small-world networks of the same connectivity (41). This conclusion was reached using the E. coli network developed by Wagner and Fell, and since the study concerns routes for dynamic interactions, it does not matter whether the routes correspond to valid metabolic conversions (41). Given the lack of specific and detailed kinetic knowledge, conclusions about the control and regulation of metabolic dynamics are largely based on approximations to the behaviour in the vicinity of the steady state. Metabolic control analysis (5) is one such approach, but this is too large a topic to enter on here. Further simplifications in the study of dynamics are involved in the flux balance analysis methods discussed earlier in this Chapter. As we shall see in the next Section, it is important, however, to take heed of Einstein’s saying: “Make everything as simple as possible, but not simpler.”
6.1. Distribution of Fluxes
The Barabási group extended their claims about power–law relationships in metabolism by suggesting that fluxes in the network showed such a distribution, with a few very large fluxes and many very small ones (69). Furthermore, they suggested that the small fluxes were resistant to perturbations of the network, whereas the large fluxes were more sensitive. The first claim is not very profound, and the second is an artefact of their methodology. Imagine a microorganism that can sustain growth on a single carbon-containing nutrient, and that requires a hundred small molecule precursors for its biomass, in approximately the same proportions and with the same average number of carbon atoms as
the nutrient. Conservation of mass requires that one hundred units of flux through the nutrient uptake pathway will produce on average one unit of flux in each of the outputs of biomass precursors. In between the inputs and outputs, as the flux diverges through the network, flux levels will decrease as the number of branches increases. Providing the number of reactions between successive branch–points is approximately constant, a graph of log of flux against number of reactions will have a slope of -1. In spite of the simplifications of my scenario, the outcome is exactly what is observed in the experimental data shown in Fig. 1d of Almaas et al. (69). The theoretical data presented by the authors was derived by calculation of fluxes in a model of E. coli metabolism using the models and linear programming methodology developed by the Palsson group (12). The fluxes obtained by linear programming are constrained to produce the biomass precursors in the proportions set by the vector of biomass composition, which is assumed to be constant, as a first approximation. (In fact, biomass composition does change somewhat according to nutrients and growth rate, though the variation is a second order effect that can legitimately be neglected for the purposes of the Palsson group’s work.) Thus, when Almaas et al. reported that the smallest fluxes were relatively invariant, they were merely recovering the constraint that they had used to set up the linear programming problem; it is only the larger fluxes further back in the network that are free to adjust in response to perturbations (69). It is thus difficult to see that the frequency distribution of fluxes in a metabolic network could be a sensitive indicator of network structure and connectivity, since mass conservation and the input–output constraints will be dominant.
7. Conclusion and Perspective
The recent interest in the structural properties of networks has highlighted the limitations of viewing metabolism as the sum of a small number of underlying pathways. The discovery of the approximate power–law connectivity of metabolites (40,41) showed that it was also not possible to capture the essence of metabolism with models that have either very regular or entirely random structures. Whether power–law
connectivity is the best approximation to metabolic structure, and whether this has further implications for the properties of metabolic networks, remains controversial. Power law connectivity implies a self–similar structure of metabolism on different scales, but there are indications that metabolism does have a modular organization, as biochemists have long believed intuitively. How to objectively reveal that substructure, in a way that is helpful to biologists in understanding the different metabolic phenotypes of different organisms, remains a challenge. Although power–law connectivity could have arisen by growth of metabolism by a preferential attachment process, other plausible evolutionary scenarios and other constraints on a feasible metabolism can also account for the current connectivity distribution properties, so we can not derive easy insight into the evolution of metabolism in this way. However, the need to integrate and interpret the flood of information coming from genome sequencing will continue to motivate the search for improved ways to understand the properties of metabolic networks.
References
1. Huynen, M.A., Dandekar, T. and Bork, P. (1999). Variation and evolution of the citric acid cycle: A genomic perspective. Trends Microbiol. 7, 281–291.
2. Arita, M. (2004). The metabolic world of Escherichia coli is not small. Proc Natl Acad Sci U.S.A. 101, 1543–1547.
3. Rahman, S.A., Advani, P., Schunk, R., Schrader, R. and Schomburg, D. (2005). Metabolic pathway analysis web service (pathway hunter tool at CUBIC). Bioinformatics. 21, 1189–1193.
4. Croes, D., Couche, F., Wodak, S.J. and Van Helden, J. (2006). Metabolic PathFinding: Inferring relevant pathways in biochemical networks. J Mol Biol. 356, 222–236.
5. Fell, D.A. (1997). Understanding the Control of Metabolism. Portland Press, London.
6. Goldberg, R.N., Tewari, Y.B. and Bhat, T.N. (2004). Thermodynamics of enzyme–catalyzed reactions – a database for quantitative biochemistry. Bioinformatics. 20, 2874–2877.
7. Forsythe, R.G., Karp, P.D. and Mavrovouniotis, M.L. (1997). Estimation of equilibrium constants using automated group contribution methods. Comput Appl Biosci. 13, 537–543.
8. Cornish-Bowden, A. and Cárdenas, M.L. (2001). Information transfer in metabolic pathways. Effects of irreversible steps in computer models. Eur J Biochem. 268, 6616–6624.
9. Cornish-Bowden, A. (1995). Fundamentals of Enzyme Kinetics. Portland Press, London, 2nd edn.
10. Voss, K., Heiner, M. and Koch, I. (2003). Steady state analysis of metabolic pathways using Petri nets. In Silico Biol. 3, 367–387.
11. Edwards, J.S. and Palsson, B.O. (1999). Systems properties of the Haemophilus influenzae Rd metabolic genotype. J Biol Chem. 274, 17410–17416.
12. Edwards, J.S. and Palsson, B.O. (2000). The Escherichia coli MG1655 in silico metabolic genotype: Its definition, characteristics and capabilities. Proc Natl Acad Sci U.S.A. 97, 5528–5533.
13. Borodina, I., Krabben, P. and Nielsen, J. (2005). Genome–scale analysis of Streptomyces coelicolor A3(2) metabolism. Genome Res. 15, 820–829.
14. Francke, C., Siezen, J.S. and Teusink, B. (2005). Reconstructing the metabolic network of a bacterium from its genome. Trends Microbiol. 13, 550–558.
15. Poolman, M.G., Bonde, B.K., Gevorgyan, A., Patel, H.H. and Fell, D.A. (2006). Challenges to be faced in the reconstruction of metabolic networks from public databases. IEEE Proc. Syst Biol. 153, 379–384.
16. Horne, A.B., Hodgman, T.C., Spence, H.D. and Dalby, A.R. (2004). Constructing an enzyme–centric view of metabolism. Bioinformatics. 20, 2050–2055.
17. Dandekar, T., Schuster, S., Snel, B., Huynen, M. and Bork, P. (1999). Pathway alignment: Application to the comparative analysis of glycolytic enzymes. Biochem J. 343, 115–124.
18. Romero, P., Wagg, J., Green, M.L., Kaiser, D., Krummenacker, M. and Karp, P.D. (2004). Computational prediction of human metabolic pathways from the complete human genome. Genome Biol. 6, R2, 1–13.
19. Sirava, M., Schäfer Eigisperger, M., Kaufmann, M., Kohlbacher, O., Bornberg-Bauer, E. and Lenhof, H.P. (2002). BioMiner – modelling, analyzing and visualizing biochemical pathways and networks. Bioinformatics. 18, S219–S230.
20. Pinney, J.M., Shirley, M.W., McConkey, G.A. and Westhead, D.R. (2005). MetaSHARK: Software for automated metabolic network prediction from DNA sequence and its application to the genomes of Plasmodium falciparum and Eimeria tenella. Nucleic Acids Res. 33, 1399–1409.
21. Von Mering, C., Jensen, L.J., Snel, B., Hooper, S.D., Krupp, M., Foglierini, M., Jouffre, N., Huynen, M.A. and Bork, P. (2005). STRING: Known and predicted protein–protein associations, integrated and transferred across organisms. Nucleic Acids Res. 33, D433–D437.
22. Knuth, D.E. (1981). Seminumerical Algorithms, vol. 2 of The Art of Computer Programming. Addison–Wesley, Reading, second edn.
23. Schuster, S., Dandekar, T. and Fell, D.A. (1999). Detection of elementary flux modes in biochemical networks: a promising tool for pathway analysis and metabolic engineering. Trends Biotechnol. 17, 53–60.
24. Schuster, S., Fell, D.A. and Dandekar, T. (2000). A general definition of metabolic pathways useful for systematic organization and analysis of complex metabolic networks. Nat Biotechnol. 18, 326–332.
25. Schilling, C.H., Schuster, S., Palsson, B. and Heinrich, R. (1999). Metabolic pathway analysis: Basic concepts and scientific applications in the post–genomic era. Biotechnol Prog. 15, 296–303.
26. Papin, J.A., Price, N.D., Wiback, S.J., Fell, D.A. and Palsson, B.O. (2003). Metabolic pathways in the post-genome era. Trends Biochem Sci. 28, 250–258.
27. Papoutsakis, E.T. (1984). Equations and calculations for fermentations of butyric–acid bacteria. Biotechnol Bioeng. 26, 174–187.
28. Watson, M.R. (1986). A discrete model of bacterial metabolism. CABIOS. 2, 23–27.
29. Fell, D.A. and Small, J.R. (1986). Fat synthesis in adipose tissue: an examination of stoichiometric constraints. Biochem J. 238, 781–786.
30. Varma, A. and Palsson, B.O. (1993a). Metabolic capabilities of Escherichia coli. 1. synthesis of biosynthetic precursors and cofactors. J Theor Biol. 165, 477–502.
31. Varma, A. and Palsson, B.O. (1993b). Metabolic capabilities of Escherichia coli. 2. optimal growth patterns. J Theor Biol. 165, 503–522.
32. Reder, C. (1988). Metabolic control theory: a structural approach. J Theor Biol. 135, 175–201.
33. Leiser, J. and Blum, J.J. (1987). On the analysis of substrate cycles in large metabolic systems. Cell Biophysics. 11, 123–138.
34. Mavrovouniotis, M.L. and Stephanopoulos, G. (1990). Computer–aided synthesis of biochemical pathways. Biotech Bioeng. 36, 1119–1132.
35. Mavrovouniotis, M.L. and Stephanopoulos, G. (1992). Synthesis of reaction mechanisms consisting of reversible and irreversible steps. 2. A synthesis approach in the context of simple examples. Ind Eng Chem Res. 31, 1625–1637.
36. Ingraham, J.L., Maaloe, O.E. and Neidhardt, F.C. (1983) Growth of the Bacterial Cell. Sinauer Associates Inc, Sunderland, MA.
37. Holmes, W.H. (1986) The central metabolic pathways of Escherichia coli: Relationship between flux and control at a branch point, efficiency of conversion to biomass, and excretion of acetate. Curr Top Cellul Regul. 28, 69–105.
38. Kauffman, S.A. (1967) Metabolic stability and epigenesis in randomly constructed genetic nets. J Theor Biol. 22, 437–467.
39. Kauffman, S.A. (1993) The Origins of Order. Oxford University Press, Oxford.
40. Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N. and Barabási, A.L. (2000) The large–scale organization of metabolic networks. Nature. 407, 651–654.
41. Wagner, A. and Fell, D.A. (2001) The small world inside large metabolic networks. Proc Roy Soc (London) Series B. 268, 1803–1810.
42. Watts, D.J. (1997) The Structure and Dynamics of Small–World Systems. Ph.D. thesis, Cornell University.
43. Watts, D.J. and Strogatz, S.H. (1998) Collective dynamics of ‘small–world’ networks. Nature. 393, 440–442.
44. Gleiss, P.M., Stadler, P.F., Wagner, A. and Fell, D.A. (2001) Relevant cycles in chemical reaction networks. Adv Complex Syst. 4, 207–226.
45. Zhu, D. and Qin, Z.S. (2005) Structural comparison of metabolic networks in selected single cell organisms. BMC Bioinformatics. 6, 8, 1–12.
46. Barabási, A.L. (2002) Linked. Perseus Publishing, Cambridge, Massachusetts.
47. Csete, M. and Doyle, J. (2004) Bow ties, metabolism and disease. Trends Biotechnol. 22, 446–450.
48. Pfeiffer, T., Soyer, O.S. and Bonhoeffer, S. (2005) The evolution of connectivity in metabolic networks. PLOS Biology. 3, 1269–1275.
49. Meléndez-Hevia, E. and Isidoro, A. (1985) The game of the pentose phosphate cycle. J Theor Biol. 117, 251–263.
50. Meléndez-Hevia, E., Waddell, T.G. and Cascante, M. (1996) The puzzle of the Krebs citric acid cycle: Assembling the pieces of chemically feasible reactions and opportunism in the design of metabolic pathways during evolution. J Mol Evol. 43, 293–303.
51. Schuster, S., Pfeiffer, T., Moldenhauer, F., Koch, I. and Dandekar, T. (2002) Exploring the pathway structure of metabolism: Decomposition into subnetworks and application to Mycoplasma pneumoniae. Bioinformatics. 18, 351–361.
52. Ravasz, E., Somera, A.L., Mongru, D.A., Oltvai, Z.N. and Barabási, A.L. (2002) Hierarchical organization of modularity in metabolic networks. Science. 297, 1551–1555.
53. Ma, H.W. and Zeng, A.P. (2003) The connectivity structure, giant strong component and centrality of metabolic networks. Bioinformatics. 19, 1423–1430.
54. Tanaka, R. (2005) Scale–rich metabolic networks. Phys Rev Lett. 94, 168101–1–168101–4.
55. Barabási, A.L. and Albert, R. (1999) Emergence of scaling in random networks. Science. 286, 509–512.
56. Morowitz, H.J. (1992) Beginnings of Cellular Life: Metabolism Recapitulates Biogenesis. Yale University Press, New Haven.
57. Morowitz, H.J. (1999) A theory of biochemical organization, metabolic pathways and evolution. Complexity. 4, 39–53.
58. Light, S., Kraulis, P. and Elofsson, A. (2005) Preferential attachment in the evolution of metabolic networks. BMC Genomics. 6, 159.
59. Lemke, N., Herédia, F., Barcellos, C.K., Dos Reis, A.N. and Mombach, J.C.M. (2004) Essentiality and damage in metabolic networks. Bioinformatics. 20, 115–119.
60. Thatcher, J.W., Shaw, J.M. and Dickinson, W.J. (1998) Marginal fitness contributions of nonessential genes in yeast. Proc Natl Acad Sci U.S.A. 95, 253–257.
61. Yeh, I., Hanekamp, T., Tsoka, S., Karp, P.D. and Altman, R.B. (2004) Computational analysis of Plasmodium falciparum metabolism: Organizing genomic information to facilitate drug discovery. Genome Res. 14, 917–924.
62. Papin, J.A., Price, N.D., Edwards, J.S. and Palsson, B.O. (2002) The genome–scale metabolic extreme pathway structure in Haemophilus influenzae shows significant network redundancy. J Theor Biol. 215, 67–82.
63. Wilhelm, T., Behre, J. and Schuster, S. (2004) Analysis of structural robustness of metabolic networks. Syst Biol. 1, 114–120.
64. Klamt, S. and Gilles, E.D. (2004) Minimal cut sets in biochemical reaction networks. Bioinformatics. 20, 226–234.
65. Mulquiney, P.J. and Kuchel, P.W. (1999) Model of 2,3–bisphosphoglycerate metabolism in the human erythrocyte based on detailed enzyme kinetic equations: Equations and parameter refinement. Biochem J. 342, 581–596.
66. Chassagnole, C., Fell, D.A., Rais, B., Kudla, B. and Mazat, J.P. (2001) Control of the threonine-synthesis pathway in Escherichia coli: A theoretical and experimental approach. Biochem J. 356, 433–444.
67. Olivier, B.G. and Snoep, J.L. (2004) Web-based kinetic modelling using JWS online. Bioinformatics. 20, 2143–2144.
68. Novère, N.L., Bornstein, B., Broicher, A., Courtot, M., Donizelli, M., Dharuri, H., Li, L., Sauro, H., Schilstra, M., Shapiro, B., Snoep, J.L. and Hucka, M. (2006) BioModels database: A free, centralized database of curated, published, quantitative kinetic models of biochemical and cellular systems. Nucleic Acids Res. 34, 689–691.
69. Almaas, E., Kovács, B., Vicsek, T., Oltvai, Z.N. and Barabási, A.L. (2004) Global organization of metabolic fluxes in the bacterium Escherichia coli. Nature. 427, 839–843.
CHAPTER 7 HETEROGENEOUS MOLECULAR NETWORKS
Vincent Schächter Genoscope, CEA, CNRS UMR8030, 2 rue Gaston Crémieux, 91000, Evry, France
1. Introduction
With the advent of complete genome sequences, closely followed by the production of large post-genomic datasets on molecular interactions and cellular states in model organisms, several types of large-scale networks of interactions between genes or proteins have become available. Each of these network types has motivated a variety of topological analyses aimed at shedding light on its underlying design principles. Models, simulations and a variety of behavioral analyses have been elaborated for those networks that directly represent molecular interactions. For each type, a variety of computational methods aimed at reconstructing it from experimental data and/or predicting the corresponding type of interaction in other organisms have been proposed. Most importantly, each network type provides a specific type of insight into the inner workings of the cell. This separation is artificial, however, an artifact of data production technologies as well as a methodological choice, initially necessary to break down the complexity of the real processes. Metabolism, signaling, and regulation at the transcriptional, translational, and post-translational levels are intertwined. Studying the underlying networks in isolation is
only the first step towards a more ambitious goal : understand the structure and dynamics of cellular processes well enough to characterize their interrelation, and ultimately to construct models of these processes with real predictive power on fine cellular state observables. A number of experimental and methodological challenges stand in the path to that goal. Data coverage and reliability are perhaps the greatest limitations, bounding the scope and interpretation of analyses on both topology and dynamics. Some methodological challenges arise precisely because of the need to compensate for this state of affairs. For instance, predicting interactions of a given type using interaction information from another type of network requires a good handle on the correlation between the respective network topologies, motivating the design of new methods for topological analyses. In addition, together with requirements for tractability of network dynamics at genome scale, data scarcity restricts computational modeling options to qualitative models. Yet qualitative models designed to describe the dynamics of different network types are not necessarily compatible ; thus the need to design computational models that can jointly handle two types of networks at the right level of approximation and still yield useful predictions. In order to design such models and the corresponding predictive methods, it is necessary to better understand the correlations between the dynamics of different component networks, motivating in turn the design of new “model-driven” statistical methods that maximally exploit available information. In this Chapter, we will review the state of the art in heterogeneous network analyses within the context of macromolecular interaction networks. Section 2 will specify the scope of the review, defining the types of networks under study. Section 3 will delineate how component networks and heterogeneous networks are reconstructed. 
Section 4 will focus on the computational models that have proven instrumental in jointly analyzing two or more types of networks. Sections 5 and 6 constitute the core of the Chapter, respectively addressing topology and dynamics. Both will begin with analyses of correlations between two component networks, then move on to analyses of the composite network. Such correlations, if significant enough, may provide a basis not only for insights into the joint functioning of two network types, but
Heterogeneous Molecular Networks
also for predictive methods. Section 7 will discuss how one type of network can help expand, refine or validate another, by capitalizing on the results of topological or dynamical analysis. We will then conclude the Chapter and sketch future directions.

2. Heterogeneous Networks: Definitions and Scope

Whereas there is little doubt that cellular processes involve several distinct types of biochemical interactions and may also be usefully characterized by many types of functional relationships between genes and gene products, one can envision several different, yet each perfectly reasonable, definitions of “heterogeneous network” in this context. We chose not to stray excessively from the current state of the art, both with respect to the nature of the networks available for combination and analysis, and with respect to the type of integration that can be performed with these. We will call heterogeneous or composite any network which includes at least two different types of interactions, or, equivalently, which combines at least two network types. The latter will be called component networks, and essentially correspond to the types of networks that are being reconstructed and studied piecemeal, for their own sake, at genome scale in model organisms.

2.1. Types of Component Networks

The component networks we will consider in this Chapter include direct, if sometimes partial, representations of cellular processes: metabolic networks and transcriptional regulation networks. Both their topology and the dynamics of the processes they represent can be studied, given appropriate experimental information and a suitable modeling framework. Regulatory networks (see Chapter 4) may represent actual interactions between transcription factors and binding sites of the genes they regulate, or they may represent indirect regulatory influences between genes.
Likewise, metabolic networks (see Chapter 6) may be provided as “full” networks of biochemical reactions catalyzed by enzymes with their corresponding genes, or as more abstract networks
connecting reactions/enzymes that follow one another in known metabolic pathways. Protein-protein interaction networks (see Chapter 5) represent both stable and transient associations between proteins, in a unified, static framework. There is no cellular process directly encoded nor intrinsic dynamics to study in a protein interaction network ; yet its structure together with additional information may provide useful information on the dynamics of complex assembly or signaling, as both are implemented through protein interactions. Genetic interaction networks represent relationships between genes at the level of their phenotypic role : a genetic interaction occurs between two genes whenever mutations on both have a combined phenotypic effect not caused by a single mutation on either. While a genetic interaction does not directly represent a physical process, it does imply the existence of a mechanism that causes the phenotypic change. There are several types of genetic interactions, but large scale studies have mostly focused on synthetic lethal interactions, in which mutations in two nonessential genes are lethal when combined. Networks of functional links are abstract representations of relationships between genes or gene products that do not necessarily have any direct correspondence with cellular processes or phenotypic observations (1). Prominent among these are the so-called genome context associations, predicted on the basis of comparative analyses on sets of complete genomes. Each type of genome context link is thought to suggest a functional association in the form of physical interaction, pathway involvement or generally similar cellular role between the corresponding proteins, on the basis of a simple evolutionary argument. A conserved gene neighborhood link between two genes denotes that these two genes (more accurately, their orthologs in a given organism) consistently occur in the same genomic neighborhood. 
A gene fusion relationship indicates that the corresponding proteins exist separately in one genome yet are “fused” in another, yielding a multi-domain protein. A phylogenetic co-occurrence link connects two genes that exhibit sufficiently similar patterns of occurrence over a large set of genomes. Finally, we will also consider coexpression networks, which can be seen both as networks of functional links and as degenerate versions of
regulatory networks. Coexpression links connect genes that have similar expression profiles according to a given expression dataset.

2.2. Types of Composite Networks

The component network type classification provided above stems primarily from historical differences in focus and experimental methods between different scientific schools of thought: the study of transcriptional regulation by molecular biologists has required a different mindset and a different toolset than, for instance, the study of metabolism by biochemists. Actual cellular processes, in contrast, are implemented through combinations of the different mechanisms represented by the component molecular interaction networks. Almost every pairwise combination is biologically relevant. For instance, metabolism is naturally coupled with regulation: the concentration of enzymes is regulated at the transcriptional level, but their activity is also regulated by metabolite concentrations. Protein-protein interactions are key to the assembly of molecular complexes that govern transcription or facilitate catalysis. Transient protein-protein interactions may also facilitate metabolic reactions through the “metabolic channeling” mechanism (2). Moreover, they can participate in the implementation of indirect regulatory influences through signaling cascades. Genetic interactions provide direct experimental evidence of functional cooperation between genes; they are caused by cellular processes of varying complexity that may involve regulation, metabolism, and protein-protein interactions. As such, they are both different in nature from and complementary to molecular interaction networks, which they historically helped construct. Networks of predicted functional links also complement networks of experimentally determined interactions by providing indirect evidence of functional cooperation, often based on evolutionary considerations.
It is worth noticing that while molecular interaction types are intrinsically limited by the underlying biochemical mechanisms, more abstract networks of functional links are constrained only by the imagination of the computational biologist – as far as predictive methods are concerned – and by the available datasets.
Efforts to integrate different types of interactions in order to describe the functioning of small-scale systems – e.g. specific biosynthetic pathways or signaling cascades – have been underway for decades (3). At that scale, topological analyses are hardly interesting, unless performed on many instances of the small system within the context of larger networks, e.g. the identification of overrepresented network motifs in a composite network. It is also possible to acquire information on the quantitative parameters necessary for detailed simulations and dynamical analyses. Obstacles to the integration of the different types of dynamics break down, as every type of interaction can be translated into the language of (possibly stochastic) differential equations. For example, regulatory influences can be added to a differential-equation based model of the dynamics of a metabolic pathway in a straightforward manner, by modifying the equations describing the evolution of metabolite concentrations and by adding an equation that describes the evolution of regulator concentration (see footnote a). In short, both the topics of interest and the methodologies of choice to address them differ significantly between smaller-scale systems and their genome-scale network counterparts. In order to keep a reasonably tight focus on the most significant recent advances, we thus chose to restrict our study of heterogeneous networks to “large-scale” networks generated from high-throughput datasets or curated databases, and to dynamical models derived from these networks. Finally, in keeping with the state of the art in biological network analyses, this Chapter on heterogeneous networks will mainly focus – with some exceptions – on networks including exactly two modes of interaction.
a. See Klipp, E., Heinrich, R. et al. (2002). "Prediction of temporal gene expression. Metabolic optimization by re-distribution of enzyme activities." Eur J Biochem. 269, 5406-13, and Zaslaver, A., Mayo, A.E. et al. (2004). "Just-in-time transcription program in metabolic pathways." Nat Genet. 36, 486-491, for an example of a purely metabolic model of a linear biosynthesis pathway later followed by its regulated counterpart.
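The coupling described earlier in this Section, in which regulation is added to a differential-equation model of a metabolic pathway, can be written schematically as follows. This is an illustrative sketch only: the symbols $x_i$ (metabolite concentrations, with $x_0$ the external substrate), $v_i$ (reaction rates), $r$ (regulator concentration), $g$ (regulator production) and $\delta$ (regulator decay rate) are assumptions introduced here and are not taken from the cited models.

```latex
% Unregulated linear pathway: metabolite i is produced by step i-1 and consumed by step i.
\frac{dx_i}{dt} = v_{i-1}(x_{i-1}) - v_i(x_i), \qquad i = 1, \dots, n

% Regulated counterpart: the rates now depend on the regulator r,
% and one equation is added for the regulator itself.
\frac{dx_i}{dt} = v_{i-1}(x_{i-1}, r) - v_i(x_i, r), \qquad
\frac{dr}{dt} = g(x_1, \dots, x_n) - \delta\, r
```

The metabolite equations keep their structure; only the rate laws gain a dependence on $r$, which is why the extension is straightforward at small scale, provided the kinetic parameters are known.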
3. Methodology: Initial Reconstruction of Heterogeneous Networks

Obtaining large-scale interaction networks requires a combination of experimental and computational methods, with an added computational layer needed in order to construct the heterogeneous network from its components. As one of the main motivations for the study of heterogeneous networks is the refinement and extension of one component network, using predictive methods based on information from the other components, we will distinguish between the initial reconstruction of a composite network, and the subsequent refinement/prediction phases. The former constitutes the main topic of this Section, while the latter will be addressed later in this Chapter, in Section 7. Note also that the reconstruction methodologies presented below are mostly aimed at recovering network topology. Putting “numbers on the arrows” necessitates cellular state data and reverse-engineering methods, in addition to interaction data, or experimental methods that have not yet been raised to high-throughput level.

3.1. Obtaining the Component Networks

Component networks are typically reconstructed from a combination of high-throughput experimental assays and knowledge from the literature or from databases describing or summarizing preexisting experimental work, using a variety of more or less sophisticated computational methods to interpret and integrate raw data with more structured biological information. Text-mining techniques can provide a useful complement to these approaches, through a more systematic exploitation of the literature (4). For a recent review and list of available pathway resources of various types, see for instance (5,6). We briefly recall below the main types of experimental assays and reconstruction methods typical of each component network type; the interested reader will find more detailed reviews in the corresponding Chapters of this volume and additional predictive methods in Section 7.
Protein-protein interactions (see Chapter 5): large sets of protein-protein interaction (PPI) data have been generated using high-throughput variants of yeast two-hybrid to identify binary interactions (7-10), as well as immunoprecipitation techniques to isolate multi-protein complexes (11,12). The coverage, reliability and biases of these high-throughput technologies have been extensively discussed in several articles (see for instance 13,14). In the context of this Chapter, it is worth remembering that protein interaction networks are relatively sparse (average degree ~4 or 5).

Regulatory networks (see Chapter 4): large datasets on direct transcriptional interactions – interactions between a transcription factor and the binding site of a regulated gene – have been compiled from the literature (15) and from chromatin immunoprecipitation together with microarray experiments (16,17). These direct approaches have been complemented by a family of reverse-engineering methods (see for instance 18-20) aimed at inferring the network from cellular state information, mainly mRNA expression-level measurements over series of conditions and/or time points (21-23), possibly complemented by other types of data. Such methods predict partial networks of regulatory influences which explain the expression dataset best according to a given underlying probabilistic model of regulation, possibly including some a priori knowledge on the network topology; they do not focus on delineating the precise molecular mechanisms that implement each of these observed influences within the cell. It is also worth remembering that regulatory networks that are reverse-engineered from expression datasets reflect those regulatory influences that were active in the specific conditions of the experiment.
Metabolic networks (see Chapter 6): computational reconstruction of metabolic networks typically relies on functional annotations that associate genes with enzymatic functions, thereby identifying the set of reactions encoded by the organism and bootstrapping the metabolic network reconstruction procedure (24). The metabolic network is then completed using physiological knowledge on the specific organism to disambiguate between pathway variants, to “fill gaps”, and to ensure that known physiological features – such as the ability to grow on a given medium or produce a given compound – can be reproduced by the
reconstructed network. While this network completion phase typically requires heavy curation work, models can greatly help in streamlining and accelerating the process (25-27).

Genetic interactions: genetic interactions have classically been identified by mutant screens, but recent studies have applied high-throughput ‘reverse’ methods such as synthetic genetic arrays (SGA) (28,29) or synthetic lethal analysis by microarrays (SLAM) (30) to identify ~4000 synthetic-lethal and synthetic-sick interactions in yeast. The experimental coverage of genetic interactions is thought to be very low.

Functional links: the predictive methods referred to in this Chapter are systematically applied to a collection of complete genomes and made available through the STRING server (1).

3.2. Reconstructing the Composite Network

It may be possible in theory and even preferable to directly reconstruct composite networks (e.g. regulated metabolism) from a diverse set of experimental data, pertaining to both molecular interactions and cellular states. Such a direct approach would require not only datasets complete and reliable enough, but also a very good understanding of the dynamical interplay between component networks, as well as models and reconstruction methods powerful enough to search over the space of composite models, e.g. to directly discriminate between metabolic regulation and transcriptional regulation. While some of the network refinement methods presented later in this Chapter (e.g. refs. 31,32) or iterative refinement of constraint-based models (33,34) may be precursors of direct reconstruction approaches, these are still beyond our present capabilities. Instead, the current state of the art is limited to integration of component networks obtained from separate, initial reconstruction phases.
The basic principle behind component network integration is fairly simple and rests implicitly on a multigraph representation (see the “Computational modeling” Section) : networks to be integrated are defined over the same set of elements, e.g. the set of proteins of a given organism. Integration is achieved by merging them into a single network
with several types of links – or edge colors, in graph theory parlance – each drawn from one of the component networks. In practice, however, things are not quite as simple: networks are usually defined over subsets of the entire gene or protein complement of a species, and meaningful integration requires that the overlap of these subsets be sufficiently large. In addition, if differences of reliability between network types are to be taken into account, an integrated reliability scoring scheme needs to be designed (see ref. 1), with the corresponding pitfalls and level of arbitrariness involved in comparing apples and oranges.

4. Computational Modeling

The modeling framework, i.e. the formal representation of the network of interactions, constitutes the foundation upon which both the analyses and the reconstruction/refinement methods are built. Its choice is constrained by the nature of the intended analyses – topological or dynamical. Each formal framework comes with a set of methods and tools, and with specific analytical strengths and weaknesses. For instance, well-established techniques exist for the study of steady states and attractors in random Boolean networks, even large-scale ones (35). The modeling framework is equally constrained by the nature and level of detail of the information available on the underlying biological system. For example, the lack of quantitative parameters for a significant fraction of network interactions precludes the direct use of detailed kinetic models to study the dynamics of large-scale networks (36). Requirements arising from the reconstruction methodology may also influence the modeling framework. For instance, both the existence of reliability scores on different types or sources of experimental evidence and the use of approaches that search for the “best fit” between a model from a given model space and experimental data are arguments in favor of the use of probabilistic graphical models (19).
These three categories of constraints (data, reconstruction methodology, analyses) apply to attempts at modeling molecular networks of any single type, and turn the choice of a modeling framework within a specific biological context into a fine balancing act. One consequence is the sheer number of formalisms in use for, e.g.
regulatory network modeling (see ref. 36 for a review primarily focused on continuous models). Requirements further differ between network types: for instance, the dynamical behavior of regulatory networks can be gainfully modeled using discrete/stochastic event-based transition systems, whereas metabolism is generally best described using some sort of continuous flux value representation. In the case of heterogeneous networks, the situation, being even more constrained, becomes paradoxically simpler: very few modeling frameworks have been successfully applied to the joint representation and analysis of two (or more) large component networks of different types. While several frameworks do exist that can describe some features of the dynamics of regulated metabolism or signaling, for instance, they have been applied only to relatively small systems (see also Section 2). We introduce below three computational frameworks that have already had some success in the context of heterogeneous network reconstruction and analysis: graph-based models, logical functions, and regulated constraint-based models.

4.1. Topological Properties of Composite Networks can be Assessed Using Graph-based Models

Models based on graphs – including variants and extensions – have been the most extensively used option to date. Each component network is described as a graph of interactions between genes or proteins in a reasonably straightforward process (see the corresponding Chapters in this volume), although the specific graph representation may depend on the targeted analyses. For instance, the metabolic network enjoys a natural bipartite graph structure, where metabolite vertices can only be connected to reaction/enzyme vertices through consumption and production edges.
In the context of integrative analyses with other network types, however, it is typically reduced to a simple directed “enzyme-enzyme” graph structure where vertices represent enzymes and an enzyme pair is connected if the product of a reaction catalyzed by the first enzyme is a substrate of a reaction catalyzed by the second one.
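The reduction just described can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the toy reactions, the enzyme names `E1` to `E4`, the protein interaction edge, and the helper names `enzyme_graph` and `composite` are all hypothetical. It derives the directed enzyme-enzyme graph from a reaction list, then tags each edge with its network type so that it can be merged with another component network, in the spirit of the edge colors mentioned in Section 3.2.

```python
from collections import defaultdict

# Hypothetical toy reactions: enzyme -> (substrates, products).
REACTIONS = {
    "E1": ({"glucose"}, {"g6p"}),
    "E2": ({"g6p"}, {"f6p"}),
    "E3": ({"f6p"}, {"fbp"}),
    "E4": ({"g6p"}, {"6pg"}),
}

def enzyme_graph(reactions):
    """Directed edges (e1, e2) such that a product of e1 is a substrate of e2."""
    consumers = defaultdict(set)  # metabolite -> enzymes that consume it
    for enzyme, (substrates, _) in reactions.items():
        for metabolite in substrates:
            consumers[metabolite].add(enzyme)
    edges = set()
    for enzyme, (_, products) in reactions.items():
        for metabolite in products:
            for consumer in consumers[metabolite]:
                if consumer != enzyme:
                    edges.add((enzyme, consumer))
    return edges

def composite(**component_edges):
    """Merge edge sets into one multigraph of (source, target, edge_type) triples."""
    return {(u, v, kind) for kind, edges in component_edges.items() for u, v in edges}

metabolic = enzyme_graph(REACTIONS)
ppi = {("E1", "E2")}  # hypothetical protein-protein interaction, stored as one pair
network = composite(metabolic=metabolic, ppi=ppi)
print(sorted(network))
```

Storing the edge type inside each triple lets two vertices be linked by several edges at once, here a metabolic edge and a protein interaction edge between `E1` and `E2`.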
Regulatory networks are represented as directed graphs, where vertices stand indifferently for genes or transcription factors, and directed edges connect regulators to regulatees. Protein-protein interactions are typically represented as undirected graphs, as are functional links. Integrating into one heterogeneous network then becomes just a matter of assigning different types (or colors, in graph-theoretical parlance) to edges from the different component networks and of defining a single graph over the union of component graph vertices that includes all these edges. Formally, the structure of the composite graph should be a bit more involved to allow for the existence of more than one edge between two vertices. The resulting structure is thus a multigraph with different edge colors, possibly with a mixture of directed and undirected edges, on which all the more or less classical graph analyses, such as searches for paths, dense subgraphs, graph motifs, etc., may be performed, taking edge color into account or not (37). This basic structure may be extended in a variety of ways. For instance, edges may be weighted with interaction reliability scores. As a more complex example, scores may represent probabilities that the interaction occurs given priors on model structure and an experimental dataset, within a given probabilistic framework. Indeed, Bayesian networks that encode conditional (in)dependence between model variables on graph edges, or other probabilistic graphical model variants, are representations of choice for learning biological network structure from data (19) (see also Chapter 3).

4.2. Structural Properties of Steady-state Dynamics can be Assessed Using Stoichiometric Models of Metabolism

Constraint-based models of metabolism (27) represent metabolic processes at steady state: a global state of the metabolic network is defined as a distribution of reaction fluxes.
The constraint-based modeling framework emerged in the 90s as a radical simplification of kinetic models – only the reaction network and reaction stoichiometry are needed to build a model, no thermodynamic or kinetic parameter values – and was developed to allow tractable modeling of genome-scale
metabolic networks. It has since been applied to a variety of reconstruction, structural analyses and predictive tasks on large metabolic networks in bacteria and yeast, yielding valuable biological insights. The core idea is that expressed phenotypes must satisfy the constraints imposed on the molecular functions of a cell. The focus, rather than being on fully instantiated descriptions of the system’s behavior, is on sets of such descriptions, i.e. sets of flux distributions (“metabolic phenotypes”) compatible with a set of constraints representing the current knowledge on the structure of the network, on thermodynamic and kinetic parameters, and on input/output relationships of the network with its environment. Constraints can be used to predict with a reasonable degree of accuracy various metabolic capabilities and structural features of the entire flux distribution space. That space can be refined incrementally as new constraints are added, ensuring some robustness in structural analyses and metabolic behavior predictions with respect to modifications of the model. Alternatively, a quasi-steady-state assumption on the metabolic network can be used to generate dynamic profiles of cell growth. Time is discretized into small steps, and the metabolic model may be used to predict the optimal flux distribution at each step. This type of dynamic modeling was shown to correlate well with the growth of E. coli on glucose minimal media under aerobic and anaerobic conditions, predicting quantitatively the uptake of glucose and growth rate as well as by-product secretion. The purely metabolic constraint-based framework was extended to include regulatory constraints, that are self-imposed by the cell (38-40). The principle is the following: known regulatory influences are represented as a Boolean network, where variables include external metabolites, regulatory proteins, and enzymes, the end-points of the regulatory effects of interest. 
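The two kinds of constraints discussed above can be illustrated on a toy network. The sketch below is not a flux-balance optimization: it only checks whether a candidate flux distribution lies in the solution space, i.e. satisfies the steady-state mass balance and the flux bounds, and shows how a Boolean regulatory rule shrinks that space by closing a reaction. The network, the bound values, the rule and the names `regulated_bounds` and `is_feasible` are all assumptions made for illustration.

```python
# Toy stoichiometric model: rows are internal metabolites, columns are reactions.
REACTIONS = ["uptake", "A_to_B", "A_to_C", "export_B", "export_C"]
S = {
    "A": [1, -1, -1, 0, 0],
    "B": [0, 1, 0, -1, 0],
    "C": [0, 0, 1, 0, -1],
}

def regulated_bounds(repressor_active):
    """Flux bounds under a hypothetical Boolean rule: A_to_C is off when the repressor is on."""
    bounds = {name: (0.0, 10.0) for name in REACTIONS}
    if repressor_active:
        bounds["A_to_C"] = (0.0, 0.0)  # enzyme not expressed: reaction unavailable
    return bounds

def is_feasible(fluxes, bounds, tol=1e-9):
    """True if the fluxes balance every internal metabolite and respect all bounds."""
    balanced = all(
        abs(sum(coef * v for coef, v in zip(row, fluxes))) <= tol
        for row in S.values()
    )
    within = all(
        bounds[name][0] - tol <= v <= bounds[name][1] + tol
        for name, v in zip(REACTIONS, fluxes)
    )
    return balanced and within

candidate = [4.0, 3.0, 1.0, 3.0, 1.0]  # splits the uptake flux between B and C
print(is_feasible(candidate, regulated_bounds(repressor_active=False)))
print(is_feasible(candidate, regulated_bounds(repressor_active=True)))
```

The same flux distribution is admissible without the regulatory constraint and excluded with it, which is the sense in which regulation reduces the steady-state solution space. An actual regulated flux-balance simulation would additionally optimize an objective over the remaining space at each time step.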
Given a set of initial internal and environmental conditions, the Boolean network dynamics can be computed stepwise, and the state of the regulatory network used to constrain the optimal flux distribution at each step, refining the dynamic
Figure 1. In silico modeling of metabolism and transcriptional regulation using the constraints-based approach. A, the constraints-based approach to metabolic modeling. The metabolic genotype is defined from the known genes in the genome, as identified in metabolic databases and in the literature. Once the metabolic network has been defined, known invariant constraints that the network must obey are applied to the cell, enabling the network to be described geometrically as a closed solution space. Flux-balance analysis can be used to identify particular optimal solutions (such as optimization of growth) within the space (blue point), which represent possible behaviors of the cell. Assuming that metabolism is in a quasi-steady state relative to cell growth, the dynamic behavior of the cell may be simulated using numerical integration and flux-balance analysis at each time step. As shown in B, transcriptional regulation reduces the steady-state solution space. For example, external glucose is sensed by various regulatory proteins in the cell, among them CRP, which is activated (red arrow), and Mlc, which is inactivated (green arrow) by the glucose signal. As a result, transcription of glucose ABC transporter operon ptsHI-crr is not repressed by Mlc, whereas transcription of a glycerol kinase gene, glpK, is repressed by CRP. The presence or absence of expression of these genes leads to the availability or unavailability of the respective reactions or transport processes in the metabolic network and possibly to the removal of available extreme pathway basis vectors from the steady-state solution space. The result is a time-dependent solution space defined by temporary regulatory constraints in addition to the invariant constraints mentioned above, which may exhibit a new behavior if the previous solution is no longer in the space.
Dynamic simulations incorporating such time-dependent constraints are able to simulate a wider range of cell phenotypes, particularly when regulatory effects have a dominant influence on metabolic behavior. Reproduced from Covert, M.W. and Palsson B.O. (2002). "Transcriptional regulation in constraints-based metabolic models of Escherichia coli". J Biol Chem. 277, 28058-64. Copyright © 2006 by the American Society for Biochemistry and Molecular Biology.
profile predictions. The rFBA model, a combination of a Boolean model of regulation with a constraint-based model of steady-state metabolism, thus allows crude simulations of regulated metabolism (see Fig. 1 and Section 7) and also provides the basis for recent attempts at computational refinement of joint regulatory/metabolic models from experimental data (see Section 6).

5. Logical Functions as a Unifying Framework for Steady State Dynamics of Regulation and Metabolism

Gat-Viks and co-workers (2004) introduce MetaReg, a unified framework aimed at the reconstruction and analysis of heterogeneous regulation in biological networks (31). The model covers four variable types (mRNAs, proteins, internal and external metabolites). The regulation functions represent in a unified – but indistinct – manner several types of control mechanisms at the transcriptional, translational, post-translational and metabolic levels. Technically, the framework may be seen as a variation on the Boolean network formalism, where variables may have more than two states. Formally, a model $M$ is defined as a set $U$ of variables, a set $S = \{1, \ldots, k\}$ of discrete states attainable by the variables, and a set of regulation functions $f^v : S^{|N(v)|} \to S$ for each $v \in U$. The function $f^v$ defines the state of a regulated variable $v$ as a function of the states of its regulator variables $N(v) = \{r^v_1, \ldots, r^v_{d_v}\}$. The model graph of $M$ is the digraph $G_M = (U, A)$ representing the direct dependencies among variables, i.e. $(u, v) \in A$ iff $u \in N(v)$. A model state $s$ is defined as an assignment of states to each of the variables in the model, $s : U \to S$. The set of stimulators $U_I$ includes all variables with zero indegree, i.e. those variables that determine the boundary conditions of the system, for instance external metabolites. A model stimulation is an assignment of states to all the model stimulators, $q : U_I \to S$. Steady states of the system are called ‘modes’.
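To make these definitions concrete, the sketch below enumerates the steady states of a small hypothetical model by brute force: a state is kept when every regulated variable equals its regulation function applied to the states of its regulators. The variables, the two-state alphabet, the regulation functions and the function name `modes` are assumptions chosen for illustration, and exhaustive enumeration is used only because the example is tiny; it is not presented as the algorithm of the MetaReg authors.

```python
from itertools import product

STATES = (0, 1)  # two discrete states, an illustrative choice

# Hypothetical toy model with a feedback loop: enzyme -> int_met -| repressor -| enzyme.
REGULATORS = {            # N(v) for each regulated variable v
    "enzyme": ("ext_met", "repressor"),
    "repressor": ("int_met",),
    "int_met": ("enzyme",),
}
FUNCTIONS = {             # regulation function for each regulated variable
    "enzyme": lambda ext, rep: int(ext == 1 and rep == 0),
    "repressor": lambda met: int(met == 0),
    "int_met": lambda enz: enz,
}

def modes(stimulation):
    """All states that extend the stimulation and agree with every regulation function."""
    regulated = sorted(REGULATORS)
    found = []
    for values in product(STATES, repeat=len(regulated)):
        state = dict(stimulation)
        state.update(zip(regulated, values))
        agrees = all(
            FUNCTIONS[v](*(state[r] for r in REGULATORS[v])) == state[v]
            for v in regulated
        )
        if agrees:
            found.append(state)
    return found

for external in STATES:
    print(external, modes({"ext_met": external}))
```

Here `ext_met` has no regulators and plays the role of a stimulator. Because the loop in this toy model is a positive feedback, one stimulation can admit more than one steady state.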
For a model $M$ and state $s$, $s$ is said to agree with $M$ on $v$ if $f^v(s(r^v_1), \ldots, s(r^v_{d_v})) = s(v)$. A model state $s$ of $M$ is a mode if $s$ agrees with $M$ on every $v \in U \setminus U_I$. The authors show that modes of the system can be computed for a given
environmental condition even in the presence of feedback loops, which previous network reconstruction schemes could not handle. The MetaReg framework corresponds to a drastic simplification of heterogeneous network dynamics, yet, precisely because of this simplicity, allows for a rigorous formulation of learning problems. For instance, a formalization of experimental conditions together with a scoring scheme for candidate models provides a foundation for the design of algorithms that learn regulation functions from data. In 2006, the MetaReg framework was extended and modified to yield the (probabilistic) factor graph network model, better adapted to the goal of learning logical variables from noisy, continuous measurements (32). In the factor graph model, knowledge on the logical regulation function is formalized as conditional probabilities, and continuous, noisy measurements, represented as “sensor variables”, are linked to the discrete (but unknown a priori) state variables by so-called discretizer distributions.

6. Topology of Composite Networks

The first layer of analysis once a large-scale protein-protein interaction, metabolic or regulatory network has been reconstructed or expanded from a high-throughput assay is the search for salient topological properties that can hopefully be interpreted as ‘traces’ of underlying biological mechanisms, shedding light either on their dynamics or their evolution or both. The topological properties that have been studied in this context include the distribution of vertex degrees, the distribution of the clustering coefficient and other notions of density, the distribution of vertex-vertex distances (see Chapter 1), or the distribution of network motif occurrences (41). Studies typically compare the statistics of interest in the network under study against the same statistics in an ensemble of random networks embodying the currently admitted “null hypothesis” on network structure.
Some of the biases that have been observed have led to interesting, if controversial, hypotheses on the underlying biological design principles. In the case of heterogeneous networks, topology-related questions may be classified within two categories.
The first one is the extension to heterogeneous networks of classical topological studies, i.e. the study of the “usual” topological properties, extended to take into account the richer structure of graphs with several types of edges. For instance, shortest path length distributions can be computed on paths that cross at least two types of edges. The heterogeneous context adds another interesting twist to this type of analysis, namely the “cooperation” bias: how does the specific way in which two component networks are connected together influence the property under scrutiny? The second category of analysis, specific to heterogeneous networks, and perhaps coming (chrono)logically before the first, deals with the correlation between topological features in two component networks: how is the existence of an edge or the degree of a vertex in one network type related to the probability of occurrence of an edge or the degree distribution in the second network type? We will start with this latter category, then move on to joint analyses.

6.1. Correlation Between Topological Properties in Pairs of Coupled Networks

One of the most natural questions that arise given two types of interaction networks over the same sets of genes or proteins is whether a correlation can be identified between interactions of different types that occur between the same pair of entities. This question can be readily extended to interactions of different types between neighboring pairs of genes, to pairs of interaction patterns, or to any couple of topological features where each belongs to only one component network. While intuitive, these questions are not statistically trivial, and defining them rigorously helps distinguish the general from the particular. Balasubramanian et al. (2004) propose a statistical methodology to test the level of association between disparate genomics data sources (42).
Each type of functional genomics data – of the “state” rather than “interaction” category – is first transformed into a network of functional links, using appropriate similarity measures. For instance, expression profiles yield co-expression links by computing Spearman
V. Schächter
correlations for each pair of profiles and using random permutations of expression values to define a significance threshold. These networks are defined over the same set of vertices, e.g. the set of all yeast genes. For each network pair, the intersection network can
Figure 2. Permutation tests to evaluate pairwise interaction graph associations. Reproduced from Balasubramanian, R., LaFramboise, T. et al. (2004). "A graph-theoretic approach to testing associations between disparate sources of functional genomics data". Bioinformatics. 20, 3353-62.
be used to quantify the association between the two datasets; pairwise association between networks is tested by assessing whether the edges in (observed) network A are overrepresented in network B. Two different tests are defined. In the edge permutation test, edges from graph A are permuted and the number of edges in the intersection with B is computed. The proportion of permutations for which the number of edges in the intersection graph is at least as large as the observed value X is an approximate p-value for testing the null hypothesis of no association between the two graphs. The second test, the node label permutation test, conditions on the entire structure of both graphs rather than on the number of edges alone: the p-value then represents the probability of observing at least the real number of intersecting edges given no association between the graphs, conditional on their complete structure. These two permutation schemes differ by their underlying reference distribution; the choice depends on the nature of the tested component networks. For instance, the edge permutation test, which assumes very little in the way of structure, may not be suited for networks that appear to be highly structured. The permutation method provides a first test of the degree of association between component graphs, and is applicable to any pair of such graphs that is potentially related. The methodology could be extended along several directions, for instance by generalizing to k>2 graphs or by adding weights in order to incorporate reliability scores on various data or interaction types. Balasubramanian et al. demonstrate its use by testing pairwise associations between 3 types of data: gene-expression profiles, mutant growth phenotypes and GO functional annotations.
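The two permutation tests described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of Balasubramanian et al.: the function names are hypothetical, graphs are undirected and unweighted, and "permuting the edges of A" is read here as redrawing the same number of edges uniformly among all vertex pairs.

```python
import random
from itertools import combinations

def edge_set(pairs):
    """Store undirected edges as frozensets so (u, v) equals (v, u)."""
    return {frozenset(p) for p in pairs}

def edge_permutation_pvalue(nodes, edges_a, edges_b, n_perm=1000, seed=0):
    """Edge permutation test (assumed reading): redraw |A| edges uniformly
    among all vertex pairs and count how often the overlap with B is at
    least as large as the observed overlap."""
    rng = random.Random(seed)
    a, b = edge_set(edges_a), edge_set(edges_b)
    observed = len(a & b)
    all_pairs = [frozenset(p) for p in combinations(sorted(nodes), 2)]
    hits = 0
    for _ in range(n_perm):
        permuted = set(rng.sample(all_pairs, len(a)))
        if len(permuted & b) >= observed:
            hits += 1
    return observed, hits / n_perm

def node_label_permutation_pvalue(nodes, edges_a, edges_b, n_perm=1000, seed=0):
    """Node label permutation test: relabel the vertices of A at random,
    which keeps the complete structure of both graphs intact."""
    rng = random.Random(seed)
    a, b = edge_set(edges_a), edge_set(edges_b)
    observed = len(a & b)
    labels = sorted(nodes)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        relabel = dict(zip(labels, shuffled))
        permuted = {frozenset(relabel[v] for v in e) for e in a}
        if len(permuted & b) >= observed:
            hits += 1
    return observed, hits / n_perm
```

Both functions return the observed intersection size X together with the approximate p-value; the only difference between them is the reference distribution, which is the point made in the text.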
The first two networks are constructed using Spearman correlation between expression profiles and between growth phenotype profiles, respectively; the third, functional network is derived from GO annotations using a similarity measure based on the depth of the terms shared in the GO ‘molecular function’ (MF) and ‘biological process’ (BP) graphs, filtered by dissimilarities in cellular localization. Results confirm the existence of a significant association between the GO-derived network and the expression-derived network. Somewhat more surprisingly, the phenotypic data shows strong association with GO-MF but not with GO-BP; the authors conjecture that this might reflect the fact that the phenotype dataset that was used corresponded to growth in suboptimal conditions, under which normal biological processes may have been repressed and genes may have been more involved in stress response. Phenotype data shows strong association with one cell-cycle related mRNA expression dataset (21) after clustering, yet only weak association with another, stress-response related, expression dataset (23). This result led the authors to hypothesize that the fitness measured in the mutant phenotype dataset was more strongly influenced by the signal related to the role of a gene in the cell-cycle than by the signal related to its stress-response role.
6.1.1. Metabolism and Coexpression
Several studies have analyzed the relationship between metabolism and coexpression, initially in individual metabolic pathways, then at larger scale (43,44), in order to gain better insight into the regulation of metabolism. From their study of local correlations between the two networks, Ihmels et al. (2004) concluded that transcriptional regulation biases metabolic flow toward linearity by coexpressing only distinct branches at metabolic branchpoints, and observed that individual isozymes were often separately coregulated with distinct processes (43). This latter property could be interpreted as a way to reduce crosstalk between pathways using a common reaction. They could also identify a hierarchical organization induced by transcriptional regulation on metabolic pathways, which get organized into groups of varying expression coherence.
In a recent study Kharchenko et al. (2005) systematically investigated local and global correlations between coexpression and the metabolic (enzyme-enzyme) network in yeast (45). First addressing the variation of the degree of coexpression (see footnote b) with metabolic network distance, they confirm the intuitive notion that genes close to each other in the metabolic network tend to have on average a higher degree of coexpression. While positive coexpression is strongest among adjacent genes and decreases monotonically with network distance, negative coexpression peaks at intermediate distances. The former correlation pattern is consistent with the one observed between the metabolic network and functional links based on genome context methods; it can be interpreted as favoring the optimization of metabolic fluxes (see for instance ref. 46). The authors interpret the latter correlation by suggesting that regulation defines small positively coexpressed regions of metabolism, which may exhibit some degree of negative coexpression with each other. Furthermore, positive coexpression and functional links are stronger in linear parts of the metabolic network, whereas negative coexpression is stronger in branched regions: this corroborates other studies, e.g. those suggesting the existence of a compensatory pathway effect (47). Finally, an analysis of 2- and 3-gene network motifs in the metabolic network shows that coexpression in divergent branches is significantly stronger than that observed in convergent branches, suggesting that regulation controls more tightly reactions in which one precursor can be used to synthesize several different compounds.
Footnote b. While the coexpression measure is not presented as a network, i.e. the network is not explicitly constructed from the outset by choosing an expression profile similarity cutoff, we felt that the analysis does implicitly address the correlations between the metabolic and the coexpression networks.
6.1.2. Protein-Protein Interactions and Coexpression
In a systematic study including several large yeast protein-protein interaction datasets, Jansen et al. (48) investigate the relationship between complex-related protein-protein interactions and mRNA expression levels, using both an aggregated dataset of absolute expression levels, and two large expression profile datasets, the classical time-course cell-cycle dataset (21) and the Rosetta yeast compendium (49).
They show evidence for strong coexpression between subunits of complexes known to be permanently stable, the relationship being much weaker for transient complexes as well as for protein-protein interactions derived from yeast two-hybrid or classical genetics experiments. Studying several well-characterized complexes, the authors show that these can be broken into subcomponents that exhibit high expression correlation, i.e. that seem more “permanent” than the entire complex.
This correlation was confirmed by subsequent work and provided the basis for a protein interaction prediction method (50) (see Section 7). Further investigation of the transient vs permanent distinction led to an elegant study on the dynamics of complex assembly during the cell-cycle (51) (see Section 2).
6.1.3. Genetic Interactions and Physical Interactions
Several groups have jointly studied the networks of genetic interactions and physical interactions in yeast, not only showing the existence of correlations, but also providing clues on the mechanism behind specific synthetic-lethal effects. Ozier et al. (2003) showed that highly connected nodes in the protein interaction network were more likely to interact genetically with each other (52). Global assessment by Tong et al. (2004) of the (then) current set of known genetic interactions in yeast suggested that two proteins that are close within the genetic network are more likely to physically interact, and that genes belonging to the same protein complex often exhibit similar patterns of genetic interactions (29). On the other hand, the authors estimate that the genetic interaction network is at least 4 times denser than the protein interaction network, and that only a small fraction of genetically interacting pairs (~1%) physically interact.
In a more detailed study, Ye et al. (2005) investigated correlations between the protein interaction network and the genetic congruence network, where congruent genes are pairs of genes with similar sets of genetic partners (53). The analysis shows that high genetic congruence of a gene pair is correlated with the existence of a protein-protein interaction, and that distances in these two networks are “commensurate”, a property not shared with the genetic network. These results support the conclusion that genetically interacting pairs usually belong to compensatory pathways, rather than exhibit direct physical interactions.
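To make the notion of genetic congruence concrete, the sketch below scores a gene pair by how unlikely its number of shared genetic partners would be under random partner assignment. The hypergeometric tail and the function names are assumptions made for illustration; the text only states that congruent genes have similar sets of genetic partners, and the scoring actually used by Ye et al. may differ.

```python
from math import comb, log10

def shared_partner_pvalue(n_genes, partners_a, partners_b):
    """Hypergeometric tail: probability that two genes with these numbers of
    genetic partners share at least the observed number of partners when
    partners are drawn at random among n_genes genes."""
    k = len(partners_a & partners_b)
    na, nb = len(partners_a), len(partners_b)
    total = comb(n_genes, nb)
    tail = sum(comb(na, i) * comb(n_genes - na, nb - i)
               for i in range(k, min(na, nb) + 1))
    return tail / total

def congruence_score(n_genes, partners_a, partners_b):
    """Higher score means more similar genetic interaction profiles
    (assumed definition: -log10 of the tail probability above)."""
    return -log10(shared_partner_pvalue(n_genes, partners_a, partners_b))
```

A congruence network can then be obtained by linking every gene pair whose score exceeds a chosen cutoff, and compared with the protein interaction network as discussed above.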
In addition, the authors suggest that genetic congruence might be a better predictor of protein-protein interaction. Such findings are consistent with those of Kelley et al. (2005), who integrate genetic and several types of physical interactions (protein-protein, protein-DNA, and KEGG metabolic reactions) by identifying
joint patterns in the composite networks that correspond to likely interpretations of one type of interaction in terms of the other (54). Specifically, they search for statistically significant occurrences of the following two patterns: ‘between-pathways’ and ‘within-pathway’, where pathways are defined here in the rather generic sense of sets of proteins that are densely connected in the physical network. The former pattern denotes pairs of pathways that are densely connected by genetic interactions; its natural interpretation is that the two pathways have redundant or complementary function, and that deletion of one genetic interaction partner will suppress the function of one pathway only. The latter pattern corresponds to sets of proteins that are densely connected by both physical and genetic interactions, the interpretation being that a single gene may be dispensable for overall pathway function, but that additive effects of multiple deletions are lethal. In order to identify sets of proteins that are well explained by either of the above two models, the authors use a greedy network search procedure and a natural log-odds score to rate candidate patterns. The score contrasts models of dense subgraphs of physical interactions (pathways) or genetic interactions (between two pathways), where the pairwise interaction probability is uniform and high, with the background random model, where the probability of observing each interaction is determined by estimating the fraction of networks with identical degree distribution which also contain that interaction. Approximately 40% of the available synthetic lethal interactions could be interpreted by including them in one of the above patterns, with ~3.5 times more explanations relating pairs of pathways than within-pathway explanations.
In other words, genetic interactions seem to be primarily explained by interactions between complementary processes, while protein-protein interactions connect genes involved within the respective processes. These two network types are, in a sense, orthogonal. The authors also find that more than half of the physical interactions in between-pathways models and 52/91 in within-pathways models are significantly enriched for proteins with a given molecular function. Taking advantage of that correlation, the authors predict functional annotations by propagating existing annotations in between- and within-pathways models to uncharacterized proteins.
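The log-odds score used to rate candidate pathway patterns can be illustrated as follows. This is a sketch under assumptions: the dense-model probability `beta` and the table `background_prob` are hypothetical inputs (in the study described above, background probabilities are estimated from random networks with identical degree distribution), and the greedy search that proposes candidate protein sets is not shown.

```python
from itertools import combinations
from math import log

def log_odds_score(proteins, observed_edges, background_prob, beta=0.9):
    """Sum, over all protein pairs in the candidate set, of the log ratio
    between a dense model (every pair interacts with probability beta) and
    a background model (pair-specific probability from background_prob)."""
    observed = {frozenset(e) for e in observed_edges}
    score = 0.0
    for u, v in combinations(sorted(proteins), 2):
        pair = frozenset((u, v))
        r = background_prob[pair]  # assumed to lie strictly between 0 and 1
        if pair in observed:
            score += log(beta / r)
        else:
            score += log((1 - beta) / (1 - r))
    return score
```

A densely connected candidate set scores high when its interactions are unlikely under the background model; a sparse set is penalized for every missing pair.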
It is worth emphasizing that the above analyses are inherently limited by the current coverage and construction biases of the protein-protein interaction network (14) and, even more so, of the genetic network. The latter is not only incomplete relative to the expected “real” number of genetic interactions in yeast, but it may also be particularly biased by query gene selection in asymmetrical SGA experiments. For instance, the 150 query genes chosen by Tong et al. (2004) in their large-scale synthetic-lethal study all enjoy a fairly high number of genetic interaction partners (29). Recognizing this worse-than-usual situation, Ye et al. (2005) carefully distinguish in their analyses between the asymmetric network including all available genetic screen results, and the symmetric genetic network covering only interactions that occur between genes that have been used as queries in SGA experiments (53).
6.2. Topological Features of the Composite Network
The search for significant global or local topological features in the composite network can be seen as a natural sequel to association analyses. How are the global topological properties that were assessed separately influenced by the coupling between component networks? If local features are correlated across component networks, whether in a simple or complex manner, can the additional statistical support provided by the integration of the two networks be leveraged to identify patterns in the composite networks, such as dense clusters or network motifs?
The first line of investigation raises basic methodological issues. Bourguignon et al. (2006) introduce a general method to assess cooperation between pairs of component networks, based on randomizations that preserve the complete structure of both component networks (55). Shuffles of the original composite network are defined as those networks that are composed exactly of the two original component networks, the variable part being the way these are ‘glued’ together.
They are used to assess the bias of topological features of the composite network, e.g. the distribution of shortest paths, against the distribution of networks preserving the invariant(s). By measuring the influence of the specific manner in which two component graphs interface, shuffles allow
a rough description of the interplay between the coupling of component networks and the composite topology.
Figure 3. An example of shuffling on a heterogeneous graph with 3 component graphs. The heterogeneous graph G is obtained by “glueing” component graphs G1, G2, G3 onto a set of vertices V using 3 maps p1: V1→V, p2: V2→V, p3: V3→V. Given 3 permutations α1: V1→V1, α2: V2→V2, α3: V3→V3 on the vertices of the component graphs, the shuffled graph G_{α1,α2,α3} is obtained by glueing α1(G1), α2(G2), α3(G3) using the same maps p1, p2, p3. In the example shown, α1 exchanges u and v (α1(u)=v, α1(v)=u), while α2=Id and α3=Id. Reproduced from Bourguignon, Danos et al. (2006).
Shuffles are easily computable and can be generated uniformly, by drawing from a set of acceptable permutations. The latter property is in contrast with more informal randomizations based on sequential rewiring strategies, where each rewiring step perturbs the structure while preserving one or more local invariants. While these approaches may prove to be asymptotically equivalent in some cases, they typically do not provide a direct definition nor the means to uniformly sample the set of randomizations which preserve the invariant, since the order of the rewiring steps matters. Bourguignon et al (2006) illustrate the use of the shuffle methodology by studying the degree of cooperation between the regulatory and the protein-protein interaction network in yeast, yielding an intriguing result on the topology of the composite network (55).
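A shuffle can be sketched as follows for two component graphs defined over the same vertex set. This is a simplified illustration, not the authors' code: every vertex permutation is treated as acceptable, only the first component is permuted (α2 = Id), and the topological feature is the mean shortest-path length of the glued network. All function names are hypothetical.

```python
import random
from collections import deque

def glue(edges_1, edges_2):
    """Composite adjacency: union of both components' undirected edges."""
    adj = {}
    for u, v in list(edges_1) + list(edges_2):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def mean_distance(adj):
    """Average shortest-path length over connected ordered pairs (BFS)."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

def shuffle_component(edges, vertices, rng):
    """Apply a uniformly drawn vertex permutation to one component graph."""
    perm = list(vertices)
    rng.shuffle(perm)
    alpha = dict(zip(vertices, perm))
    return [(alpha[u], alpha[v]) for u, v in edges]

def shuffle_distribution(edges_1, edges_2, vertices, n=200, seed=0):
    """Feature values over n shuffles: component 1 is permuted, component 2
    is kept fixed, and the two are glued back together each time."""
    rng = random.Random(seed)
    vertices = sorted(vertices)
    return [mean_distance(glue(shuffle_component(edges_1, vertices, rng), edges_2))
            for _ in range(n)]
```

The bias of the real network is then read off by locating `mean_distance(glue(edges_1, edges_2))` within the returned distribution, for instance as the fraction of shuffles with an equal or larger value.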
Within the population of networks which embed the same two original component networks, the real network exhibits simultaneously higher biconnectivity (the number of pairs of nodes which are connected using both subgraphs) and higher distances. A first interpretation might be that the real graph is trading off compactness for better biconnectivity. Additional analysis shows this interpretation to be incorrect, however. Both biconnectivity and pairwise distances were further compared between the real composite graph and ‘equatorial’ shuffles that also preserve the interface between component networks: networks that have been subjected to equatorial shuffles do not lose any biconnectivity, but still tend to be more compact, suggesting that these two properties are independent. Interpreting biconnectivity as a rough measure of the capability to transmit a signal that crosses between component graphs, the authors suggest the following tentative explanation: the actual network may be adapted to both high signal flow (biconnectivity) and high signal specificity, since, at constant biconnectivity, longer average distances make it easier for receivers to distinguish emitters. Obviously, confirmation of this interpretation requires further investigation beyond the realm of topology.
Possible extensions on the methodological front include generalizations to k>2 graphs, defining other categories of invariants, or incorporating reliability scores on various data or interaction types by adding weights on the edges. It seems likely that significant effort will be focused in the coming years on bridging the gap between informal yet intuitive statistical analyses and principled statistical methodology, both on single-type and on heterogeneous networks.
6.2.1. Layered Structure of the Protein Interaction Network
Smidtas et al. (2006) focus on the structure of the joint TRI/PPI network directly upstream of the control of genes by TFs, i.e. immediately upstream of the interface between the two networks (56). The analysis shows that the protein-protein interaction network exhibits a distinct layered structure upstream of its interface with the protein-DNA interaction network: 3 layers of proteins (transcription factors, their
direct interaction partners, and the interaction partners of the latter) differ by their within-layer and between-layer connectivities. This observation appears consistent with subcellular localization data and an analysis of functional annotations.
[Figure 4 graphic not reproduced. Recoverable labels: layers of 1069 CoCoTF (incl. 84 TF and 205 CoTF), 230 CoTF (incl. 44 TF) and 157 TF upstream of the genes; connectivity annotations 12 CoCoTF / CoTF, 6 int / CoTF, 2 TF / CoTF, 2 CoTF / TF, 8.5 P / P and 3 TF / Gene; bar chart (0% to 100%) of subcellular localization for the TF, Co-reg, Co-coreg, Protein and Random sets, with compartments grouped as: nuclear matrix, nucleus; nucleolus, nuclear pore; cytoplasm, ER membrane, spindle pole body, vacuolar membrane, actin and tubulin cytoskeleton, nuclear envelope; other (cell wall, ER lumen, Golgi, mitochondria, plasma membrane).]
Figure 4. Layered structure of protein-protein interactions upstream of transcriptional regulation and their spatial distribution among cell compartments. Subcellular localization data was taken from MIPS (Mewes 2002). Error bars represent the standard deviation computed for 10 sets of 500 randomly selected proteins. They help validate visually the statistical significance of localization distribution differences. Reproduced from Smidtas (2006).
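The three protein layers described above (transcription factors, their direct interaction partners, and the partners of the latter) can be extracted from a PPI edge list as sketched below. The function name is hypothetical, and the layers are allowed to overlap, which is an assumption suggested by counts such as "230 CoTF (incl 44 TF)" in Figure 4.

```python
def ppi_layers(ppi_edges, transcription_factors):
    """Return the TF, CoTF and CoCoTF layers of a PPI network.
    CoTF: direct interaction partners of at least one TF.
    CoCoTF: direct interaction partners of at least one CoTF.
    Layers may overlap (a TF interacting with another TF is also a CoTF)."""
    adj = {}
    for u, v in ppi_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    tf = set(transcription_factors)
    cotf = set().union(*(adj.get(t, set()) for t in tf))
    cocotf = set().union(*(adj.get(c, set()) for c in cotf))
    return tf, cotf, cocotf
```

Within-layer and between-layer connectivities can then be obtained by counting the PPI edges whose endpoints fall in the corresponding sets.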
6.2.2. Modules Defined over Composite Networks
A module can be loosely defined as a group of physically or functionally linked molecules that work together to achieve a (relatively) discrete and (relatively) well-defined function (57). Classical illustrations of the module notion include 'physical' modules, e.g. stable protein-protein and protein-RNA complexes that are at the core of several key cellular functions, but also modules with no direct physical realization, such as sets of temporally co-regulated proteins that govern various stages of the cell-cycle. The quest for both the “right” definition of module and the identification of functional modules has spurred considerable interest and effort in recent years, resulting in a variety of proposed module identification methods. Some of these methods rely purely on topology,
others seek to bridge the gap with dynamics and/or assess evolutionary conservation. They also differ by the type of information used as input: interaction, cellular state, or both. For a review, see Chapter 1. We focus here on the added value brought about by heterogeneous networks, i.e. the possibility to search for modules that are supported by interactions of multiple types.
One natural approach is to search for clusters of densely interacting proteins in the composite network. Two 'extreme' notions of densely interacting clusters can be defined, with possible intermediates: clusters that are dense in the composite network, or clusters that are dense in all of the component networks. Von Mering et al. (2003) show that networks of functional links based solely on comparative genomics can identify known metabolic pathways with high accuracy (58). As a first step, they integrate three networks of genomic context functional links, computed using the conserved gene neighborhood, the gene fusion and the phylogenetic profile prediction methods on 89 complete genomes. The composite (weighted) multi-species network, connecting ~20000 clusters of orthologous genes (260000 proteins participating in ~2 million links), is first projected on the E. coli K12 genome, yielding ~113000 links. Modules are then searched for through the use of 3 different unsupervised clustering methods: single-linkage clustering, Markov clustering, and the unweighted pair-group method with arithmetic mean (UPGMA), after exploration and fine-tuning of the respective parameter spaces. For benchmarking purposes, the authors compared the resulting clusters with EcoCyc’s metabolic pathways (59). The best performance was obtained by UPGMA clustering, recovering more than half of the set of metabolic pathways with very high specificity.
An assessment of the relative contributions of the 3 types of functional links shows that 56% of the pathways were detected separately by all methods, 20% by 2 methods out of 3 and 24% by one method only. On the other hand, the gene neighborhood method alone detected 89% of the reference module/pathways set : further analysis of the data may be required in order to assess the added value of integration for module discovery. Interestingly, the authors observe that biosynthesis pathways are recovered much more efficiently than degradation pathways, perhaps
because they are more linear, more energy consuming and more highly regulated, thus under stronger evolutionary constraints.
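Of the three clustering methods mentioned above, UPGMA is the simplest to sketch on a weighted network of functional links. The version below is a naive illustration under assumptions: link weights are read as similarities, missing links count as zero, and merging stops at a hypothetical similarity `threshold`; the parameter tuning performed by von Mering et al. is not reproduced.

```python
def upgma_clusters(nodes, similarity, threshold):
    """Agglomerative average-linkage (UPGMA) clustering: repeatedly merge
    the two clusters with the highest mean pairwise similarity, until no
    pair of clusters reaches the threshold.
    similarity maps frozenset({a, b}) to a link weight."""
    clusters = [frozenset([n]) for n in nodes]

    def mean_sim(c1, c2):
        total = sum(similarity.get(frozenset((a, b)), 0.0)
                    for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, (i, j) = max(
            ((mean_sim(c1, c2), (i, j))
             for i, c1 in enumerate(clusters)
             for j, c2 in enumerate(clusters) if i < j),
            key=lambda item: item[0],
        )
        if best < threshold:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

Each returned cluster is a candidate module, which can then be benchmarked against a reference set of pathways as in the study described above.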
[Figure 5 graphic not reproduced. Legend of functional categories: Information Processing (Translation, Transcription, DNA); Cellular Processes (Transport, Motility, Signalling); Metabolism (Anabolism, Catabolism, Energy); Unassigned/Uncharacterized, or multiple assignments. Highlighted example: YfbU, YeaH, YcgB, YeaG, with a predicted Integrin I domain in YeaH and a predicted ATPase domain in YeaG.]
Figure 5. A network of predicted functional modules in E. coli. Only modules of size four or larger are shown. Nodes represent single proteins or groups of highly similar proteins as defined in the COG database. Genomic context links within predicted modules are shown in dark gray, and those across modules are shown in light gray. Reproduced from von Mering, Zdobnov et al. (2003).
The notion of network motif was introduced in the context of topological analysis of the E. coli transcriptional regulation network (60), and later elaborated on for other single-type interaction networks (61,41). A network motif is a pattern of interaction that is significantly overrepresented in the network under study, relative to a family of random networks with the same size and degree distribution (see also Chapter 2). This overrepresentation may be seen as reflecting an evolutionary constraint, one possible explanation being that a specific dynamical behavior typical of that motif topology is positively selected, for instance because it has an important signal-processing role. It has been argued that network motifs of a given interaction network type characterize it at a local level, hinting at underlying “design principles”. While the general argument is controversial, several dynamical
interpretations for network motifs have been confirmed later by detailed experimental and theoretical studies on motif dynamics (62,46,63,64).
Yeger-Lotem et al. (2004) generalized the notion to motifs occurring in networks that include more than one type of interaction, with an application to a combined network of protein-protein and transcriptional regulation interactions in yeast (65). Extending the original method (60), their algorithm detects composite motifs that are overrepresented in a statistically significant manner by comparing to an ensemble of networks obtained by randomizing the network under study. The randomization procedure switches edges iteratively in a way that preserves both the extended degree of a node (the number of incoming and outgoing edges of each type) and the edge profile of each node pair (the set of edges that connects the pair, with type and direction). The authors identified one significant two-protein motif, defining a mixed feedback loop involving both types of interactions, and five types of three-protein motifs. Two of these are “pure” motifs of either protein-protein interactions (PPIs) or transcriptional regulation interactions (TRIs), but the other three are “true” composite motifs: interacting transcription factors that coregulate a gene, interacting proteins that are coregulated by the same transcription factor, and a mixed feedback loop between transcription factors that regulate a gene (see Fig. 6 for details on their topology and interpretation). Of the many possible four-protein patterns, 63 were identified as motifs, virtually all of which included one or more of the three-protein motifs: 36 could be seen as extensions of one three-protein motif with an additional node, whereas 21 could be seen as combinations of one or more smaller motifs. Conversely, every three-protein motif pair but one could be combined in at least one way to produce a four-protein motif.
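The detection of a composite motif can be illustrated on one of the three-protein motifs listed above, interacting transcription factors that coregulate a gene. The sketch below is a simplification and not the algorithm of Yeger-Lotem et al.: only the directed TRI edges are switched while the PPI edges stay fixed, which preserves the number of incoming and outgoing edges of each type at every node but does not enforce the edge-profile constraint on node pairs. Function names and the number of switches are assumptions.

```python
import random

def count_coregulating_pairs(tri, ppi):
    """Count (TF pair, gene) occurrences where two interacting TFs regulate
    the same gene. tri: set of (tf, gene); ppi: set of frozenset({p, q})."""
    regulators = {}
    for tf, gene in tri:
        regulators.setdefault(gene, set()).add(tf)
    count = 0
    for tfs in regulators.values():
        ordered = sorted(tfs)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                if frozenset((a, b)) in ppi:
                    count += 1
    return count

def switch_tri_edges(tri, n_switches, rng):
    """Degree-preserving randomization of the TRI edges only:
    (a -> g1, b -> g2) becomes (a -> g2, b -> g1) when no duplicate arises."""
    edges = set(tri)
    for _ in range(n_switches):
        (a, g1), (b, g2) = rng.sample(sorted(edges), 2)
        if a == b or g1 == g2 or (a, g2) in edges or (b, g1) in edges:
            continue
        edges -= {(a, g1), (b, g2)}
        edges |= {(a, g2), (b, g1)}
    return edges

def motif_pvalue(tri, ppi, n_random=200, seed=0):
    """Fraction of randomized networks with at least as many occurrences
    of the motif as the real network."""
    rng = random.Random(seed)
    real = count_coregulating_pairs(tri, ppi)
    hits = sum(
        count_coregulating_pairs(switch_tri_edges(tri, 10 * len(tri), rng), ppi) >= real
        for _ in range(n_random)
    )
    return real, hits / n_random
```

A pattern would be called a motif when this fraction is small; the same scheme extends to other composite patterns by replacing the counting function.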
These results lend weight to the interpretation of smaller motifs as network building blocks, and led the authors to propose that combination operations could be viewed as an algebra for composite network construction. In a recent study, Zhang et al. (2005) continued along the same path by searching for overrepresented motifs in a network integrating five types of links in yeast: coexpression links, regulatory interactions, protein-protein interactions, genetic interactions, and sequence homology (47). For instance, many overlapping triangles are expected to
Motif (illustrations not reproduced)
A. Protein clique
B. Interacting transcription factors that coregulate a third gene
C. Feed-forward loop
D. Coregulated interacting proteins
E. Mixed-feedback loop between transcription factors that coregulate a gene
Figure 6. Mixed transcriptional regulation (TRI) – protein-protein interaction (PPI) motifs. These motifs were highly statistically significant in both the stringent and nonstringent networks; in all 1,000 randomized networks the number of their occurrences was lower than in the actual network. A node represents a gene and its protein product; a red, directed edge represents a TRI; a black, bidirected edge represents a PPI. Reproduced from Yeger-Lotem, Sattath et al. (2004).
occur once occurrence probabilities are conditioned on the presence of a dense local cluster, leading to a rise of the threshold for biological significance of motif overrepresentation. The fact that motifs overlap further complicates their interpretation in terms of dynamics, as these interpretations are typically constructed for ideal, isolated motifs, connecting with the rest of the network through specific input and output nodes.
Figure 7. Protein motifs as combinations or extensions of smaller motifs. Four-protein network motifs discovered in the stringent network. (a) Motifs that can be represented as combinations of three-protein network motifs. When there is more than one possible way to generate a four-protein motif, the combination involving the more abundant three-protein motifs is presented. Dangling motifs, where a fourth node is connected to only one of the nodes of the three-protein motif, are not presented. A three-protein motif may appear more than once in a combination that yields a four-protein motif [e.g. entry (A,D)]. (b) Motifs that cannot be constructed from three-protein motifs. i, the bi-fan motif; ii, a motif containing a feed-forward loop; iii–vi, motifs that appear as extensions of smaller network motifs, for which one of the PPIs in each smaller motif (Left) is extended to a series of PPIs by means of an intermediate protein (Right). A node represents a gene and its protein product; a red, directed edge represents a TRI; and a black, bidirected edge represents a PPI. Reproduced from Yeger-Lotem, E., Sattath, S. et al. (2004).
[Figure 8 graphic not reproduced. Panels (a) to (h) correspond to motif sets A to H; each panel shows a motif example and a theme example. Key to link types: S: synthetic sickness or lethality; H: sequence homology; X: correlated expression; P: stable physical interaction; R: transcriptional regulation.]
Figure 8. Three-node motifs and corresponding themes in the integrated S. cerevisiae network. (a) A motif corresponding to the ‘feed-forward’ theme; (b) motifs corresponding to the ‘co-pointing’ theme; (c) motifs corresponding to the ‘regulonic complex’ theme; (d) motifs corresponding to the ‘protein complex’ theme; (e) motifs corresponding to the theme of neighborhood clustering of the integrated SSL/homology network; (f) motifs corresponding to the ‘compensatory complex members’ theme; (g) motifs corresponding to the ‘compensatory protein and complex/process’ theme; (h) other unclassified motifs. Each of (a-g), from left to right, shows a schematic diagram unifying the collection of motifs in that set, the list of motifs with the motif statistics, a specific example of a subgraph matching one or more of these motifs, and a larger structure corresponding to the network theme. Each colored link represents one of the five interaction types according to the color scheme (bottom right). For a given motif, Nreal is the number of corresponding subgraphs in the real network, and Nrand describes the number of corresponding subgraphs in a randomized network, represented by the average and the standard deviation. A node labeled ‘etc.’ signifies that the structure contains more nodes with connectivity similar to the labeled node. Reproduced from (Zhang, King et al. 2005).
Without addressing frontally these delicate issues, Zhang et al. (2005) propose that motifs be viewed as “signatures” of higher-order structures (47). They introduce the notion of “network themes”, defined loosely as “higher-order interconnection patterns that encompass multiple occurrences of network motifs and reflect a common organizational principle, and provide examples of how motifs assemble into themes. Fig. 8 summarizes their findings on 3-node motifs and corresponding themes ; note that theme names in the legend correspond to the unifying biological interpretation behind the respective theme. The space of 4-node composite motifs being considerably larger (> 5000 types), the study focuses on the subset of such motifs that fit with the “compensatory complexes/processes” theme, in which a protein has compensatory function with other proteins in a complex or a process. This theme can obviously be seen as a generalized version of the “between-pathway” model described above. As a final step, the authors propose an abstract representation of the composite network as a thematic map of compensatory complexes : after searching for patterns of complexes connected by genetic interactions, each complex is collapsed into one node and each sufficiently enriched set of genetic interactions between a pair of complexes is collapsed into one “compensating link”. The nature and degree of functional relevance of each specific module or motif type are still mostly open questions. The above topological analyses, however, do provide a useful breakdown of heterogeneous network complexity into elementary bricks, as well as useful clues for the theoretical and experimental studies that have already started following themc.
[c] See Smidtas, S. (2006), Local and global analyses of heterogeneous molecular interaction networks, Bioinformatics, University of Evry, for an example of an experimentally confirmed theoretical analysis of the function of a heterogeneous motif detected in a purely topological manner.
Heterogeneous Molecular Networks
7. Dynamics

Progressing on to the dynamics of cellular processes restricts the scope of relevant component networks: only those types of interactions that can be translated into some kind of relationship between cellular state variables can be taken directly into account. Direct physical interactions are not the only option, however: a regulatory link representing an indirect influence, mediated by an unknown cascade of protein interactions and transcription factors, may still participate in a simulation or analysis of the model dynamics. Genetic interactions and functional links are excluded: while these can be correlated with dynamical properties, the former are one step of abstraction too remote to be translated into time-dependent processes, and the latter mostly represent evolutionary relationships between genes.

Component networks of interest will thus be restricted to metabolic, regulatory and, to some extent, protein-protein interaction networks. The latter are a tricky case, as they represent physical but time-independent traces of different types of dynamic phenomena: stable complex existence, transient complex assembly/disassembly, and signal transduction. Whereas the first type of “dynamics” is simple enough, the other two have non-trivial interactions with regulation and metabolism. The dynamics of signaling arguably falls outside the scope of this Chapter on large-scale heterogeneous network analyses. We will mention below an interesting analysis of the regulation of complex assembly.

As already mentioned in the “Computational Modeling” Section, studying the behavior of a composite network requires the choice of an adequate framework, where the dynamics of two or more types of networks can be jointly expressed and analyzed, in spite of the differences in nature and in time scale. For instance, regulatory network dynamics, which typically occurs on a slower time scale, can be naturally approximated as a series of discrete changes, while in a metabolic network it is more natural to think in terms of continuous metabolite fluxes and fast relaxation towards steady state. This difficulty comes in addition to the usual challenges in studying the dynamics of single-type (component) networks.
The design of adequate frameworks and methods to directly analyze the dynamics of heterogeneous networks in a biologically meaningful manner is neither a new problem nor a solved one. Several groups have been focusing on the search for tractable approximations of differential-equations-based joint models of regulation and metabolism, i.e. simplifications that capture salient features of network dynamics. Approaches as diverse as Biochemical Systems Theory (66), Metabolic Control Analysis (69, 68) or, much more recently, different types of qualitative models (70, 71) can be understood in that light. Past approaches with a long and distinguished history are now being revisited in the light of genomic and post-genomic datasets, and new and promising approaches are emerging.

Scaling up to whole-cell networks while retaining some predictive capability remains a major challenge, however. Most work on the analysis of heterogeneous network dynamics has focused so far on relatively small systems; for the time being, only very crude approximations of network dynamics are able to yield useful predictions on the dynamics of large heterogeneous networks. A stepping-stone towards that goal is to study the interplay between the dynamics of component networks, as this may lead to a better understanding of which reciprocal influences are dominant and which can be neglected, and ultimately guide the design of useful approximations.

In keeping with the structure of the “Topology” Section, the first part of this Section will thus focus on attempts at characterizing correlations between the dynamics of two component networks. The second part will address ongoing efforts to directly characterize the dynamics of large-scale composite networks.

7.1. Interplay Between the Dynamics of Component Networks

In order to gain insight into the relationships between the dynamics of different types of networks, a method of choice is to search for correlations between experimental information interpreted in the light of the respective dynamical models. One possible approach would be to directly relate two types of measurements corresponding to the time-dependent cellular states (trajectories) of the respective dynamics, for example gene expression time-courses with metabolite concentrations taken at comparable time-points. This is quite a labor-intensive path if the set of state variables is large and, to the best of our knowledge, has not been attempted for genome-scale systems. Another possibility is to relate time-course measurements reflecting the dynamics of one component network with structural features of the other component network which can be interpreted in terms of its dynamical capabilities. We focus here on two examples that fall within this latter category.

7.1.1. Correlation Between Steady State Fluxes and mRNA Expression Levels Assessed Using Stoichiometric Models of Metabolism

Stelling et al. (2002) investigate the existence of a correlation between metabolic gene expression and a specific coefficient that characterizes the role of an enzyme and its corresponding reaction in the steady-state dynamics of the metabolic network (72). The authors start from the fact that no quantitative correlation between metabolic fluxes and expression patterns was observed in preexisting studies, and search for an indirect correlation. This leads them to the definition of control-effective fluxes, a parameter that jointly characterizes metabolic efficiency and metabolic flexibility. Control-effective fluxes are based on the notion of elementary flux modes, which are themselves defined within the context of stoichiometric models of metabolism (see the Computational Modeling Section) as the minimal subnetworks that enable a given metabolic network to operate at steady state. Each elementary mode can be assigned a metabolic efficiency score, relating the mode output (biomass or ATP production) to the investment required to produce the enzymes necessary for that mode, in a given environment.
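To make the construction concrete, the following minimal sketch computes an efficiency score for each elementary mode and then the efficiency-weighted average flux per reaction, which is the quantity the text goes on to call the control-effective flux. It is an illustration only: the two-mode toy network, the use of summed absolute fluxes as a proxy for enzyme investment, and the function names `mode_efficiency` and `control_effective_fluxes` are assumptions, and the normalization may differ from the exact formulation of Stelling et al.

```python
# Toy sketch of efficiency-weighted fluxes over elementary modes.
# Assumption: the enzyme investment of a mode is approximated by the sum of
# absolute fluxes in that mode; this proxy is chosen for illustration only.

def mode_efficiency(mode, output_reaction):
    """Output flux of the mode divided by its total (proxy) investment."""
    investment = sum(abs(v) for v in mode.values())
    return mode.get(output_reaction, 0.0) / investment if investment else 0.0

def control_effective_fluxes(modes, output_reaction):
    """Average |flux| of each reaction over all modes, weighted by mode efficiency."""
    weights = [mode_efficiency(m, output_reaction) for m in modes]
    total = sum(weights)  # assumed non-zero: at least one mode produces output
    reactions = sorted({r for m in modes for r in m})
    return {
        r: sum(w * abs(m.get(r, 0.0)) for w, m in zip(weights, modes)) / total
        for r in reactions
    }

# Two hypothetical elementary modes, each mapping reaction -> steady-state flux.
modes = [
    {"uptake": 1.0, "r1": 1.0, "growth": 2.0},             # efficiency 2/4
    {"uptake": 1.0, "r2": 1.0, "r3": 1.0, "growth": 1.0},  # efficiency 1/4
]
cef = control_effective_fluxes(modes, "growth")
print(cef)
```

In this toy case the reaction `r1`, used only by the more efficient mode, receives a larger weighted flux than `r2` or `r3`, while `uptake`, shared by both modes, keeps its full flux.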
The control-effective flux of a reaction is then computed as the average flux through that reaction in all elementary modes, the flux in each mode being weighted by the mode efficiency. Analyzing expression data for E. coli growing exponentially on acetate vs glucose, the authors show that control effective fluxes correlate significantly with differential expression. This is in contrast
with measures based purely on efficiency, e.g. weighting modes only by biomass production capability. Whereas this specific analysis did not yield a precise new biological insight, it is an interesting example of a correlation between traces of regulatory network dynamics and a carefully crafted structural property of individual reactions, computed using a “pure” model of metabolic network dynamics at steady state. Knowledge of such correlations could help guide the design of joint dynamical models of metabolism and regulation, for instance by motivating the search for more refined principles behind flux distribution prediction than “just” the optimization of biomass.

7.1.2. Synchronization Between Complex Formation and Cell-Cycle Regulation

Building on topological analyses showing correlations between protein-protein interactions and coexpression, de Lichtenberg et al. (2005) analyze the dynamics of protein complex assembly during the Saccharomyces cerevisiae cell cycle by combining a protein-protein interaction network with DNA microarray time series (51). As a first step, the authors extracted from the expression dataset a set of 600 genes that were periodically expressed, with a clear time point where expression peaks. On that basis, they constructed a protein-protein interaction network between the corresponding proteins by combining all the available interaction datasets, ranking interactions using a topology-based confidence score, and filtering to exclude interactions between proteins annotated as belonging to different compartments. They complemented the resulting network of 184 “dynamic” proteins (412 out of 600 had no sufficiently reliable evidence for interaction) with a set of 116 constitutively expressed “static” proteins. This procedure yielded a network of 29 dense “modules”, representing complexes or groups of complexes, each “time-tagged” with a specific point of existence during the cell cycle.
Figure 9. Temporal protein interaction network of the yeast mitotic cell cycle. Cell cycle proteins that are part of complexes or other physical interactions are shown within the circle. For the dynamic proteins, the time of peak expression is shown by the node color; static proteins are represented by white nodes. Outside the circle, the dynamic proteins without interactions are both positioned and colored according to their peak time and thus also serve as a legend for the color scheme in the network. Reproduced from de Lichtenberg, Jensen et al. (2005).

Static proteins that were placed within a heavily time-dependent complex could then be identified as cell-cycle related, indirect targets of transcriptional regulation. Indeed, half of the interaction partners of periodically regulated proteins were found to be static subunits, which could be identified as such only through the integration of the protein interaction network with expression data (Fig. 9). In contradiction with previous claims about cell-cycle related genes, the time-tagged interaction network did not support the hypothesis of
just-in-time synthesis of complex subunits. The authors propose instead an explanation that fits better with the observed evidence, just-in-time assembly, where only some subunits of each complex are transcriptionally regulated and control the timing of complex assembly. They argue that this putative control mechanism would have the advantage of being less costly for the cell. This study shows that joint analysis of a static network of protein interactions together with traces of the regulatory dynamics controlling the presence of interaction partners can lead to a dynamical picture of complex assembly, turning, in a sense, an essentially static representation into a dynamic one.

7.2. Investigating the Dynamics of Heterogeneous Networks

Modeling and predicting the dynamics of large heterogeneous networks is mostly beyond the reach of current methods. As already mentioned in the Computational Modeling Section, however, there has been recent progress in improving predictions on the dynamical behavior of large-scale metabolic models by integrating regulatory network information within the constraint-based framework (38, 39). We provide additional detail on how that rFBA framework may be used to simulate cell growth in a given environment.

The transcriptional regulatory network of the cell can be described as a Boolean network. Since the time constants characterizing transcriptional regulation are generally on the order of a few minutes, considerably slower than the time constants associated with metabolism, the state of the regulatory network can be seen as imposing temporary, adjustable constraints on the metabolic flux solution space. For each time step, the regulatory network state is computed from the conditions at that time. Specifically, transcription may be altered by the presence or surplus of an intracellular metabolite, an extracellular metabolite, regulatory proteins, signaling molecules, or any combination thereof.
The metabolic model, constrained by the state of the regulatory network, is then used to predict an optimal flux distribution. Extracellular concentrations are then computed and used as initial conditions for the next time step.
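The time-stepping scheme just described can be sketched on a deliberately tiny example. Everything specific below is an assumption made for illustration: the two substrates, the single Boolean repression rule, the uptake limits and yields, and the function names. In particular, the metabolic optimization is reduced to a case (upper bounds only, positive yields) whose optimum can be read off directly, whereas a genome-scale model would call a linear-programming solver at that step.

```python
# Toy regulated flux-balance loop (illustrative; not a genome-scale model).
EPS = 1e-9  # concentrations below this are treated as exhausted

def regulatory_state(conc):
    """Boolean network step: which uptake genes are expressed."""
    glucose_present = conc["glucose"] > EPS
    return {
        "glucose_uptake": glucose_present,
        # Hypothetical rule: acetate uptake is repressed while glucose remains.
        "acetate_uptake": conc["acetate"] > EPS and not glucose_present,
    }

def optimal_fluxes(genes_on, conc, biomass, dt, vmax, yields):
    """Maximize growth = sum(yield * uptake) under regulatory and supply bounds.

    With only upper bounds and positive yields the optimum sits at the bounds;
    a real model would solve a linear program here.
    """
    fluxes = {}
    for substrate in conc:
        if genes_on[substrate + "_uptake"]:
            supply = conc[substrate] / (biomass * dt)  # cannot use more than exists
            fluxes[substrate] = min(vmax[substrate], supply)
        else:
            fluxes[substrate] = 0.0  # reaction switched off by regulation
    growth = sum(yields[s] * v for s, v in fluxes.items())
    return fluxes, growth

def simulate(conc, biomass, steps, dt=0.1):
    vmax = {"glucose": 10.0, "acetate": 5.0}    # hypothetical uptake limits
    yields = {"glucose": 0.1, "acetate": 0.04}  # hypothetical biomass yields
    history = []
    for _ in range(steps):
        # 1. Regulatory state from current conditions.
        genes_on = regulatory_state(conc)
        # 2. Optimal flux distribution of the constrained metabolic model.
        fluxes, growth = optimal_fluxes(genes_on, conc, biomass, dt, vmax, yields)
        # 3. New extracellular concentrations, used as next initial conditions.
        conc = {s: max(0.0, c - fluxes[s] * biomass * dt) for s, c in conc.items()}
        biomass += growth * biomass * dt
        history.append((genes_on, dict(conc), biomass))
    return history

history = simulate({"glucose": 1.0, "acetate": 1.0}, biomass=0.1, steps=40)
```

Inspecting `history` shows acetate left untouched until glucose is exhausted, after which the acetate uptake gene switches on: a qualitative gene-expression event obtained alongside the quantitative growth curve.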
This extension of pure metabolic constraint-based models into a combined regulatory/metabolic model allows (approximate) quantitative dynamic simulation of substrate uptake, cell growth and by-product secretion, but also qualitative simulation of gene transcription events and the presence of proteins in the cell. For instance, Barrett et al. analyze the properties of gene-expression time-series in a large variety of different growth conditions by using such an integrated regulatory/metabolic model of E. coli (73). This allows them to identify interesting structure in the space of “functional states” (compact representations of the time-series), confirming by a theoretical analysis that the regulatory network responds primarily to the available electron acceptor and the presence of glucose as a carbon source.

8. Interaction Prediction and Network Refinement

As discussed in Section 2, the reconstruction of large-scale molecular interaction networks in model organisms often rests upon labor-intensive, costly experimental strategies, complemented by computational methods to help interpret the assays. While these strategies often yield false positives, false negatives are a given: the resulting molecular interaction networks are invariably known to be incomplete relative to the real version. The extent of that coverage problem varies with the type of network and the organism: from a number of “gaps” bounded by the percentage of uncharacterized genes in the metabolic reaction network of E. coli or S. cerevisiae, to a vast majority of unknown genetic interactions in yeast, the only organism for which high-throughput synthetic lethal screens have been performed. Networks reconstructed from a combination of de novo experiments and comparative methods on other organisms are even more incomplete.

Computational methods aimed at predicting new/missing interactions can help in several ways.
By providing sets of interactions that are more likely to occur given existing experimental evidence, they can help design the next round of experimental work in cases where exhaustiveness is not (yet) a viable option. They can also help generate testable hypotheses in a case such as metabolism, where systematic
identification methods for enzymatic activities do not really exist, some prior on the nature of the biochemical reaction being required for assay design. Finally, predicted interactions that have not been experimentally validated but are thought to be sufficiently reliable may be added to the existing network model, yielding a richer base for topological and dynamical analyses. As we will show in this Section, heterogeneous networks arise naturally in this context.

Interaction prediction and network refinement methods are often inspired by, and in some sense dual to, observed correlations between the topology and/or dynamics of component networks. Several successful interaction prediction methods have been directly inspired by correlations observed between two component networks, the statistical bias being used to constrain the space of possible network structures. Likewise, partial instances of salient topological features (e.g. modules or motifs) identified in the combined network can be “completed”, yielding putative interactions. Finally, it should be noted that the prediction of an interaction with the help of another type of interaction network often comes with a candidate mechanistic explanation, in contrast with many “black-box” predictive methods.

8.1. Filling Gaps in Metabolic Networks

Whereas metabolic network reconstruction strategies based on sequence homology have been remarkably successful overall, they fail to assign functions to a considerable fraction of genes (31-80%) in completely sequenced genomes, and have been known to produce vague or incorrect annotations (74). Often, there exists biological evidence that strongly suggests the existence of a given pathway in a species, yet one or more enzymes catalyzing the critical reaction steps cannot be identified via sequence homology methods alone. The gene may not be present in that species, the reaction may be bypassed, or an alternative pathway may exist.
The enzyme could also be encoded by a gene with little similarity to known genes encoding similar enzymatic activities, because of convergent evolution or horizontal transfer. The identification of the enzymes
catalyzing individual metabolic reactions in a reasonably complete metabolic network is known as the ‘missing genes problem’ (75). In this context, functional links of the gene-neighborhood, gene-fusion and phylogenetic-profile types are natural extensions of pure homology-based methods, leveraging the availability of many complete genomes with simple yet elegant evolutionary reasoning. They have been used to predict missing metabolic genes, both individually and in combination (56).

Exploiting their observations on the correlation between the structure of the metabolic network and the coexpression network/regulatory network dynamics, Kharchenko et al. (2004) introduce the “Metabolic Expression Placement” (MEP) method (76). Given a specific unassigned metabolic reaction, located at a node L in the metabolic (enzyme-enzyme) network, and a set of candidate genes, the MEP algorithm ranks each candidate using a cost function that evaluates the similarity between its expression profile and the combined expression of a metabolic network neighbourhood of L. In practice, the neighbourhood consists of the enzymes for which there is a known gene, belonging to three layers of increasing radii (1 to 3); the closer the enzyme is to L, the greater the relative weight of the respective gene expression in the cost function. The MEP algorithm was validated using the S. cerevisiae metabolic network reconstruction (77) and the Rosetta compendium expression dataset (49), by computing the “self-rank” of each gene, a form of leave-one-out procedure. Results show that >20% of known enzymes can be predicted within the top 50 out of 5594 candidates for their enzymatic function, and that this figure rises to 70% when the test is restricted to metabolic enzymes that were significantly perturbed across multiple conditions of the expression dataset.
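A ranking of this kind can be sketched in a few lines. The sketch is not the published MEP cost function: it assumes Pearson correlation as the similarity measure, layer weights of 1, 1/2 and 1/3 for radii 1 to 3, and a score where higher is better; the network, gene names and expression profiles are hypothetical.

```python
# Sketch of expression-based candidate ranking for a gap in a metabolic network.
from math import sqrt

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def layers(adjacency, start, max_radius=3):
    """Breadth-first search: nodes at distance 1..max_radius from start."""
    seen, frontier, result = {start}, [start], []
    for _ in range(max_radius):
        nxt = []
        for node in frontier:
            for nb in adjacency.get(node, ()):
                if nb not in seen:
                    seen.add(nb)
                    nxt.append(nb)
        result.append(nxt)
        frontier = nxt
    return result

def rank_candidates(adjacency, gap, gene_of, expression, candidates):
    """Order candidates by weighted similarity to the gap's neighbourhood, best first."""
    scores = {}
    for cand in candidates:
        num = den = 0.0
        for radius, layer in enumerate(layers(adjacency, gap), start=1):
            for enzyme in layer:
                gene = gene_of.get(enzyme)  # enzymes with no known gene are skipped
                if gene is None or gene == cand:
                    continue
                weight = 1.0 / radius       # assumed: closer enzymes weigh more
                num += weight * pearson(expression[cand], expression[gene])
                den += weight
        scores[cand] = num / den if den else 0.0
    return sorted(candidates, key=lambda c: scores[c], reverse=True)

# Hypothetical enzyme-enzyme network around the unassigned node "L".
adjacency = {"L": ["E1"], "E1": ["L", "E2"], "E2": ["E1"]}
gene_of = {"E1": "g1", "E2": "g2"}
expression = {
    "g1": [1, 2, 3, 4], "g2": [2, 2, 4, 5],
    "cA": [1, 2, 3, 5], "cB": [4, 3, 2, 1],
}
ranking = rank_candidates(adjacency, "L", gene_of, expression, ["cB", "cA"])
print(ranking)
```

Here candidate `cA`, whose profile follows that of the neighbouring enzymes, is ranked ahead of the anti-correlated `cB`.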
In other words, the expression dataset has predictive power on metabolic network structure when and where it captures the underlying dynamics of the regulatory network. As a final example, Von Mering’s module identification method based on the search for dense clusters of functional links (58) can also be used as a predictor. An analysis of false positive assignments to pathways suggests that 40% of these result from the limited resolution of orthology assignments (similar E. coli proteins that belong to the same
orthologous group have been assigned to different pathways in EcoCyc), and that 50% may represent real connections between pathways that have not yet been recorded in EcoCyc. This observation led the authors to use predicted modules for two types of predictions: extensions to known pathways and functional links between pathways. Many of the predicted metabolic modules contained proteins with no annotations in EcoCyc, up to a total of 300. These are predicted to represent extensions to metabolic pathways and placed accordingly within the metabolic network. Moreover, in the average metabolic module, 58% of the proteins are enzymes annotated in EcoCyc, 23% are identified as putative enzymes in other databases and can be associated at the pathway level, and 19% seem to be hypothetical or noncatalytic. The authors remark that hypothetical proteins are good candidates for enzymes that can “fill” gaps in our current pathway knowledge, and exhibit several such examples in the predicted modules, some of which were confirmed by the literature. Several links were also identified between pathways that were not connected in EcoCyc, with no obvious explanation by shared metabolites or orthology assignment artefacts; these warrant further investigation.

8.2. Predicting Genetic Interactions

Mapping genetic interactions is an extremely labor-intensive endeavor: the network is thought to be far denser than, and largely non-overlapping with, the network of protein-protein interactions, and comprehensive assessment of synthetic lethals in e.g. Saccharomyces cerevisiae would require constructing 18 million double mutants, not to mention higher organisms. To date, SGA analysis has been used to assess ~4% of gene pairs in one growth condition (29): there clearly is room for help from predictive methods.
The type of associations between genetic and physical interactions detected by Kelley’s method (54) for the identification of within- and between-pathway explanations, Ye’s congruent networks (53), or the “compensatory” theme of Zhang et al. (2005) (47) can be leveraged to predict new genetic interactions. These different methods actually implement variations on the same principle: network motif completion.
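As a minimal sketch of motif completion, consider the between-pathway setting discussed next, where a missing genetic interaction between two pathways is supported by counting incomplete 2×2 motifs. The reading of “incomplete motif” used here (the other three links of the 2×2 bipartite motif are all observed), the data layout and the function name are assumptions made for illustration.

```python
# Sketch of 2x2 bipartite motif completion between two pathways.
# Assumption: an "incomplete motif" for a missing pair (a, b) is a choice of
# a2 in pathway A and b2 in pathway B whose other three links
# (a-b2, a2-b, a2-b2) are all observed genetic interactions.
from itertools import product

def predict_genetic_interactions(pathway_a, pathway_b, genetic, threshold):
    """Return {(a, b): support} for unobserved pairs with enough incomplete motifs."""
    def linked(x, y):
        return frozenset((x, y)) in genetic

    predictions = {}
    for a, b in product(pathway_a, pathway_b):
        if linked(a, b):
            continue  # interaction already observed
        support = sum(
            1
            for a2, b2 in product(pathway_a, pathway_b)
            if a2 != a and b2 != b
            and linked(a, b2) and linked(a2, b) and linked(a2, b2)
        )
        if support >= threshold:
            predictions[(a, b)] = support
    return predictions

# Hypothetical pathways (gene sets defined by physical interactions) and
# observed genetic interactions, stored as unordered pairs.
pathway_a = ["a1", "a2", "a3"]
pathway_b = ["b1", "b2", "b3"]
genetic = {frozenset(p) for p in product(pathway_a, pathway_b)}
genetic.discard(frozenset(("a1", "b1")))  # the one untested/missing pair

predicted = predict_genetic_interactions(pathway_a, pathway_b, genetic, threshold=4)
print(predicted)  # {('a1', 'b1'): 4}
```

Raising the threshold trades coverage for reliability: with `threshold=5` the same toy data yields no prediction.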
Focusing for instance on the version of Kelley and Ideker (2005), the between-pathway model suggests that proteins in one pathway genetically interact with the same partners in the second pathway (54). This should in principle lead to the occurrence of a complete bipartite genetic interaction “matching” between the two pathways defined by physical interactions. When an almost-complete matching of that type is detected, the remaining interaction is predicted. In practice, for each “missing” genetic interaction between two pathways, the method counts the number of incomplete 2×2 gene motifs that would include the interaction if completed: the interaction is predicted if that number passes a given threshold. Predictions reached an 87% estimated accuracy using cross-validation at a threshold of eight or more incomplete motifs, against 5% when bipartite motif completion was performed without imposing that the motifs belong to a between-pathway model. Completing within-pathway models, a.k.a. dense clusters of genetic interactions, was significantly less successful in predicting genetic interactions.

Wong et al. (2004) use an even richer heterogeneous network than Zhang, King et al. (2005) to predict pairs of synthetic lethal genes in yeast (65, 47). The approach is formulated slightly differently, as the use of probabilistic decision trees to integrate several types of data: localization, mRNA coexpression, physical interaction, MIPS functional categories, genome-context functional links, sequence homology, as well as some characteristics of network topology. Each data type is represented as one or several binary characteristics on gene pairs, which is equivalent to a network of links between genes. In addition to characteristics directly encoding an interaction or a similarity between attributes of the two genes, the authors include 11 types of so-called “2hop” characteristics.
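The 2hop characteristics, spelled out in the sentences that follow, reduce to a path search over typed links. A minimal sketch, assuming each relationship type is stored as a set of unordered gene pairs (the data layout and function names are illustrative, not taken from Wong et al.):

```python
# Sketch: deriving a "2hop" characteristic from typed gene-gene links.

def neighbours(links, gene):
    """Genes sharing a link of the given type with `gene`."""
    return {other for pair in links if gene in pair for other in pair if other != gene}

def two_hop(networks, type_y, type_z, a, b):
    """True if some third gene has a type_y link with a and a type_z link with b."""
    via = neighbours(networks[type_y], a) & neighbours(networks[type_z], b)
    return bool(via - {a, b})

# Hypothetical links: A physically interacts with C, and C is SSL with B.
networks = {
    "physical": {frozenset(("A", "C"))},
    "SSL": {frozenset(("C", "B"))},
}
print(two_hop(networks, "physical", "SSL", "A", "B"))  # True
print(two_hop(networks, "SSL", "physical", "A", "B"))  # False
```

Since gene pairs are unordered, a full feature extractor would presumably test both orientations of each pair; the sketch checks a single orientation to keep the definition visible.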
Each 2hop Y-Z relationship between a pair of genes A-B signifies the existence of a 2-step path from A to B, passing through a gene C, such that A has relationship type Y with C and C has relationship type Z with B. For example, if protein A physically interacts with C and C genetically interacts (“SSL” link) with B, then the pair A-B has the “2hop physical-SSL” characteristic. The probabilistic decision tree is used to model the conditional probability that a gene pair is SSL given a combination of its other
characteristics. During the learning phase, a training set of gene pairs is propagated from the root to the leaves, each node sorting gene pairs between two daughter nodes based on the characteristic that seems most informative for SSL interaction. At the end of that phase, each gene pair from the training set is assigned to a single leaf of the tree; each leaf is given a score based on its fraction of SSL pairs. The tree can then be used to predict the status of another gene pair by mapping it to the leaf that corresponds to its combination of characteristics, with a reliability corresponding to the score of that leaf.

The method was assessed using cross-validation on SSL-tested gene pairs: a plot of sensitivity vs false-positive rate at various score thresholds showed a sensitivity of 80% at a false-positive rate of 18%. This result is interpreted by the authors as suggesting that a large-scale screen guided by their method could capture 80% of SSL interactions by testing <20% of gene pairs. In addition, analyzing the decision tree structure ranked both characteristics and their combinations according to their predictive power. The top predictor characteristics of SSL pairs were 2hop SSL-SSL and 2hop physical-SSL, in agreement with both the within- and between-pathway models described above. The best combinations included 2hop physical-SSL together with same functional class and same subcellular localization, but also, more interestingly, 2hop SSL-SSL together with 2hop physical-SSL, 2hop SSL-coexpression and same phenotype (Fig. 10). A few examples provide convincing evidence that the study of such combinations may provide valuable clues on the specific mechanisms behind genetic interactions.

8.3. Predicting Protein-Protein Interactions

Protein-protein interaction datasets are known to be incomplete and sometimes contradictory (14), a problem magnified by the fact that protein-interaction networks are very sparse: even reliable methods can generate many false positives when applied at the scale of a genome. Considerable effort has been devoted to the design of computational methods for protein interaction prediction, using protein or domain
sequence, structure, or subtle patterns of phylogenetic distributions (see ref. 78 for a review). We will focus here on those methods that utilize, among other types of evidence, networks of genomic-context and coexpression links.
Figure 10. Significant combinations of pairwise relationships. Gene-pair relationships. (a and b) Known (a) and predicted (b) SSL gene pairs from the highest-scoring leaf of the decision tree. (c and d) Known (c) and predicted (d) SSL gene pairs from the third-highest-scoring leaf. P, physical interaction; S, synthetic sick or lethal interaction; X, correlated mRNA expression. Reproduced from Wong, Zhang et al. (2004).
Computational efforts that used networks of functional links predicted from genomic context (e.g. fusion, neighborhood and phylogenetic profile) (79, 80) predated the publication of the first large-scale protein interaction detection assays. After a comparative assessment of the four major yeast two-hybrid and immunoprecipitation datasets together with these predictive methods, it became clear that the strategy of choice for both validation and de novo prediction was to integrate evidence from several sources, among which coexpression networks play an important role.
Jansen et al. (2003) introduced an approach based on Bayesian networks that weighs and combines several types of genomic features (coexpression, coessentiality and colocalization) in order to both (in)validate experimental evidence for interaction and predict novel interactions (50). The core idea is to assess each type and source of evidence against a ‘gold-standard’ reference interaction set, yielding a statistical reliability score for that source. The gold standard was chosen as follows: the MIPS complexes catalog (81) was used for positives, and a curated list of proteins known to be localized in different compartments for negatives. The Bayesian network was then used to predict the interaction likelihood for each protein pair by combining evidence sources according to their reliability, yielding so-called ‘Probabilistic Interactome’ (PI) predictions. This procedure was applied separately to interaction datasets, resulting in an ‘experimental’ PI (PIE), and to genomic datasets, yielding a ‘predicted’ PI (PIP). These two sets of predictions were integrated in a subsequent step, yielding the ‘total’ network or PIT. Fig. 11 provides an overview of the data integration and prediction scheme.
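The idea of scoring each evidence source against a gold standard and then combining sources can be sketched for the simplest, naive-Bayes case, where sources are assumed independent and each contributes a likelihood ratio. The counts, the prior and the function names below are hypothetical; the sketch illustrates the principle rather than the published networks or their parameters.

```python
# Sketch of naive-Bayes evidence integration with likelihood ratios.
# Assumptions: binary evidence, independent sources, hypothetical counts.

def likelihood_ratio(pos_with, pos_total, neg_with, neg_total, observed):
    """P(evidence | interacting) / P(evidence | non-interacting) from gold-standard counts."""
    p_pos = pos_with / pos_total
    p_neg = neg_with / neg_total
    if observed:
        return p_pos / p_neg
    return (1.0 - p_pos) / (1.0 - p_neg)

def posterior_probability(prior, ratios):
    """Posterior odds = prior odds times the product of the likelihood ratios."""
    odds = prior / (1.0 - prior)
    for lr in ratios:
        odds *= lr
    return odds / (1.0 + odds)

# source: (positives showing it, positives, negatives showing it, negatives)
gold_counts = {
    "coexpression": (600, 1000, 1000, 10000),
    "colocalization": (800, 1000, 2000, 10000),
}
pair_evidence = {"coexpression": True, "colocalization": True}

ratios = [
    likelihood_ratio(*gold_counts[s], observed=pair_evidence[s])
    for s in gold_counts
]
p = posterior_probability(prior=0.001, ratios=ratios)
print(ratios, p)
```

With these toy numbers the two sources carry likelihood ratios of about 6 and 4, raising a prior of 0.1% to a posterior of roughly 2.3%: individually weak evidence becomes informative once combined, although the low prior keeps the absolute probability modest.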
Figure 11. ‘A Bayesian networks approach for protein-protein interaction prediction from genomic data.’ The diagram lists the data sources (in vivo pull-down: Gavin, Ho; Y2H: Uetz, Ito; mRNA co-expression: Rosetta, cell cycle; GO process; MIPS function; essentiality), the integration processes (fully connected Bayes, naïve Bayes) and the resulting probabilistic interactomes (PIE, PIP, PIT). Reproduced from Jansen, Yu et al. (2003).
The results were cross-validated against the gold standard, and shown to be superior to a naive voting procedure using the same evidence types, confirming the added value of combining datasets. Pairwise predictions were also used to identify putative complexes as dense clusters of interactions; several of these were later confirmed by TAP-tagging experiments.

Rhodes et al. (2005) went one step further and predicted human protein-protein interactions by integrating several types of evidence with a naive Bayes classifier (82). Their method takes as input the following information types on gene pairs: interaction datasets from model organisms mapped to putative human orthologs, human coexpression networks, shared functional annotation (using Gene Ontology), and interaction-enriched domain pairs (using InterPro). After suitable calibration, the method predicted 40000 interactions and was validated on an independent test set of known interactions; two new interactions were confirmed experimentally.

These methods show that a network of coexpression links can be instrumental in refining a protein-protein interaction network, in conjunction with other types of evidence. Their predictive power depends on the number and specific nature of the underlying expression datasets. As these accumulate and diversify, it seems reasonable to expect an increase in prediction accuracy.

8.4. Refining the Structure and Logic of Transcriptional Regulation Network

Several methods have relied at least partially on metabolic network structure to predict transcriptional units (TUs). For instance, Zheng et al. (2002) propose an algorithm that identifies putative operon structure by using the fact that the enzymes encoded by genes in an operon tend to catalyze successive reactions in metabolic pathways (83). Specifically, the algorithm searches for a subnetwork of the metabolic network for which the corresponding enzyme-encoding genes are clustered together on the genome.
Operons predicted in E. coli, on the basis of 42 other microbial genomes, were compared with a selected metabolism-related operon
dataset from the RegulonDB database, showing good prediction sensitivity (89%) and specificity (87%). Romero and Karp (2004) recently improved a TU predictor that used only intergenic distance and functional classification of genes by adding information on metabolic pathways, protein complexes and transporters, available in the EcoCyc database, and obtained a moderate improvement (correct prediction of 80% of the known E. coli TUs and 69% of the known operons, versus 75% of TUs and 65% of operons without metabolic network structure information) (84). Since many genes are not directly related to metabolism, however, transcription-unit predictions based on this type of approach are limited in principle.
On the other hand, very promising attempts at automatically refining simple dynamical heterogeneous network models have emerged in the last couple of years. Covert et al. (2004) propose a method to systematically detect conflicts between predictions of a regulated constraint-based metabolic model of E. coli and experimental data on mutant growth phenotypes and expression profiles, leading to manual improvement of the regulatory network to eliminate these conflicts (33). This iterative, “model-driven” approach was very recently extended in Herrgard et al. (2006) (85). These authors not only pinpoint conflicts between predictions of a regulated metabolic model of S. cerevisiae and similar types of experimental data, but also propose a computational network expansion strategy that suggests new regulatory rules that would eliminate them. While manual selection of biologically acceptable regulatory rules still appears necessary, and the method is not principled in the sense that there is neither model scoring nor systematic search for best fit in model space, this research track appears quite fruitful, if only because of the link between diverse experimental assays and genome-scale models.
Gat-Viks et al. (2004, 2006) improve on previous work on learning regulatory network logic from data, using the unified discrete framework described in Section 2 to learn regulation functions in heterogeneous models of regulated metabolism or signaling, from gene expression, protein expression and growth phenotype data (31, 32).
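To make the idea of learning a regulation function from data more tangible, here is a deliberately naive brute-force sketch. It is not the probabilistic procedure of Gat-Viks et al.: it simply enumerates every Boolean function of two regulators and keeps those that agree best with a set of discretised observations. All names (`all_truth_tables`, `score`, `best_tables`) and the toy observations are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

# A Boolean regulation function of two regulators, stored as a truth table:
# (regulator_1, regulator_2) -> state of the target gene.
TruthTable = Dict[Tuple[bool, bool], bool]
Observation = Tuple[bool, bool, bool]  # (regulator_1, regulator_2, target)

INPUTS: List[Tuple[bool, bool]] = list(product([False, True], repeat=2))


def all_truth_tables() -> List[TruthTable]:
    """Enumerate all 16 Boolean functions of two inputs."""
    return [dict(zip(INPUTS, outputs))
            for outputs in product([False, True], repeat=len(INPUTS))]


def score(table: TruthTable, observations: List[Observation]) -> int:
    """Count the observations that the truth table reproduces."""
    return sum(table[(r1, r2)] == target for r1, r2, target in observations)


def best_tables(observations: List[Observation]) -> List[TruthTable]:
    """Return every truth table that reaches the maximal score."""
    tables = all_truth_tables()
    best = max(score(t, observations) for t in tables)
    return [t for t in tables if score(t, observations) == best]


# Hypothetical discretised measurements, for illustration only.
observations = [
    (True, False, True),
    (True, True, False),
    (False, False, False),
    (False, True, False),
]
for table in best_tables(observations):
    print(table)
```

With all four input combinations observed, a single function remains (the target is on only when the first regulator is present and the second is absent). With fewer observations several functions tie, which is one reason why principled scoring and prior knowledge matter at realistic scale.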
The approach of Gat-Viks et al., while still limited in the scope of its biological applications – the learning procedure was applied to the regulation of lysine biosynthesis and to the response to osmotic stress – may allow step-by-step construction of principled answers to essential questions such as: “What is the most likely regulatory logic hidden behind a given interaction graph and a set of experimental results?”
9. Conclusions and Perspectives
As Sharan and Ideker (2006) point out in a recent review on network comparison methods, one aspect that differentiates network comparison from sequence comparison, a field that has been blooming for more than 30 years, is the fact that our understanding of the dynamical and even the topological interrelations between the different types of network is in its infancy (86). While it is well known that protein-protein interactions can implement signaling cascades or improve metabolic pathway efficiency through metabolite channeling, that protein-DNA interactions are the scaffold of gene regulatory control or that genetic interactions may connect parallel pathways, both the principles and the details of how the cell integrates these in order to self-sustain or produce a coordinated response to environmental stimuli are still by and large unknown.
As we have seen in this Chapter, there are several intermediate steps on the path towards an integrated understanding of cellular processes. Perhaps the main conclusion is a rather predictable one: it is necessary to understand how component networks interact in order to design computational models and analytical methods capable of providing real insights into the dynamics of heterogeneous networks. Significant efforts have already been dedicated to assessing correlations between the topologies of different network types, a logical first step since topological information is available, by definition, for all the network types considered here.
Another fruitful research path has been to leverage the correlations observed between component network topologies in order to predict new interactions. Such methods have been designed for the expansion and refinement of metabolic, protein-protein interaction, and genetic interaction networks – providing candidate mechanistic explanations along the way for the latter type of network.
The situation is more difficult with respect to dynamics: the only type of ‘cellular state’ data able to provide direct knowledge on the dynamics of a molecular network at genome scale is still gene-expression time-course data. At the time of writing, experimental evidence on the dynamics of large-scale metabolic networks is still very scarce, but the situation may evolve in the next few years with the coming of age of metabolomics and fluxomics. Logically, existing attempts to correlate the dynamics of component networks have relied on expression data to provide regulation-related temporal aspects, together with static (for protein-protein interactions) or steady-state (for metabolism) models.
Computational models are key to dynamical analyses, as they determine the nature and level of detail of predictions on network dynamics. The very approximations needed to make a model of network dynamics tractable at genome scale often create incompatibility with computational models for other network types: for instance, it is precisely the effects of regulatory control on metabolism that need to be neglected at first to construct genome-scale metabolic models. The hope is that new insights gained by studying component network interplay will lead to the design of refined models that remain tractable while increasing their predictive resolution and accuracy. Recent work on models that integrate transcriptional regulation and metabolism at genome scale confirms that this hope may be well-founded. In the next few years, as more molecular interaction and cellular state datasets are generated, part of the burden that rests for the moment on computational models and methods may be alleviated.
Yet, the larger the coverage and depth of available experimental information, the higher the degree of abstraction that may be required to extract meaningful biological insights on the working of cellular processes from that information, and generalize these insights from one organism to another.
Acknowledgements
The author is grateful to F. Képès, P. Bourguignon and M. Heinig for fruitful discussions, and acknowledges support from Genoscope and the ENFIN Network of Excellence funded by the European Commission within its FP6 Programme, under the thematic area ‘Life sciences, genomics and biotechnology for health,’ contract number LSHG-CT2005-518254.
References
1. von Mering, C., Jensen, L.J. et al. (2005). STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res. 33, 433-7. 2. Welch, G.R. and Marmillot, P.R. (1991). Metabolic "channeling" and cellular physiology. J Theor Biol. 152, 29-33. 3. Tyson, J.J., Chen, K. et al. (2001). Network dynamics and cell physiology. Nat Rev Mol Cell Biol. 2, 908-16. 4. Hoffmann, R., Krallinger, M. et al. (2005). Text mining for metabolic pathways, signaling cascades, and protein networks. Sci. STKE. 2005, 21. 5. Bader, G.D., Cary, M.P. et al. (2006). Pathguide: a pathway resource list. Nucleic Acids Res. 34, 504-6. 6. Uetz, P. (2000). A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 403, 623-627. 7. Ito, T., Chiba, T. et al. (2001). A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U.S.A. 98, 4569-74. 8. Cary, M.P., Bader, G.D. et al. (2005). Pathway information for systems biology. FEBS Lett. 579, 1815-20. 9. Rain, J.C. (2001). The protein-protein interaction map of Helicobacter pylori. Nature. 409, 211-215. 10. Li, S. (2004). A map of the interactome network of the metazoan C. elegans. Science. 303, 540-543. 11. Gavin, A., Bosche, M. et al. (2002). Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 415, 141-7. 12. Ho, Y. (2002). Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 415, 180-183. 13. Schachter, V. (2002). Protein-interaction networks: from experiments to analysis. Drug Discov Today. 7, S48-54. 14. von Mering, C., Krause, R. et al. (2002). Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 417, 399-403. 15. Guelzim, N., Bottani, S. et al. (2002). Topological and causal structure of the yeast transcriptional regulatory network. Nat Genet. 31, 60-3. 16. Lee, T.I. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 298, 799-804. 17. Harbison, C.T., Gordon, D.B. et al. (2004). Transcriptional regulatory code of a eukaryotic genome. Nature. 431, 99-104. 18. Segal, E., Shapira, M. et al. (2003). Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet. 34, 166-76.
19. Friedman, N. (2004). Inferring Cellular Networks Using Probabilistic Graphical Models. Science. 303, 799-805. 20. Hartemink, A.J. (2005). Reverse engineering gene regulatory networks. Nat Biotechnol. 23, 554-5. 21. Cho, R.J., Campbell, M.J. et al. (1998). A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell. 2, 65-73. 22. Spellman, P.T., Sherlock, G. et al. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol. Cell. 9, 3273-97. 23. Causton, H.C., Ren, B. et al. (2001). Remodeling of yeast genome expression in response to environmental changes. Mol Biol Cell. 12, 323-37. 24. Karp, P.D., Paley, S. et al. (2002). The Pathway Tools software. Bioinformatics. 18, 225-32. 25. Reed, J.L. and Palsson, B.O. (2003). Thirteen years of building constraint-based in silico models of Escherichia coli. J Bacteriol. 185, 2692-9. 26. Segre D., Zucker J. et al. (2003). From annotated genomes to metabolic flux models and kinetic parameter fitting. Omics. 7, 301-16. 27. Price, N.D., Reed, J.L. et al. (2004). Genome-scale models of microbial cells: evaluating the consequences of constraints. Nat Rev Microbiol. 2, 886-97. 28. Tong, A.H. (2001). Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science. 294, 2364-2368. 29. Tong, A.H. (2004). Global mapping of the yeast genetic interaction network. Science. 303, 808-813. 30. Ooi, S.L., Pan, X. et al. (2006). Global synthetic-lethality analysis and yeast functional profiling. Trends Genet. 22, 56-63. 31. Gat-Viks, I., Tanay, A. et al. (2004). Modeling and analysis of heterogeneous regulation in biological networks. J Comput Biol. 11, 1034-49. 32. Gat-Viks, I., Tanay, A. et al. (2006). A probabilistic methodology for integrating knowledge and experiments on biological networks. J Comput Biol. 13, 165-81. 33. Covert, M.W., Knight, E.M. et al. (2004). 
Integrating high-throughput and computational data elucidates bacterial networks. Nature. 429, 92-6. 34. Barrett, C.L. and Palsson, B.O. (2006). Iterative reconstruction of transcriptional regulatory networks: an algorithmic approach. PLoS Comput Biol. 2, 52. 35. Kauffman, S.A. (1993). The Origins of Order: Self Organization and Selection in Evolution. New York, Oxford University Press. 36. de Jong, H. (2002). Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol. 9, 67-103. 37. West, D.B. (1996). Introduction to Graph Theory, Prentice Hall. 38. Covert, M.W., Schilling, C.H. et al. (2001). Regulation of gene expression in flux balance models of metabolism. J Theor Biol. 213, 73-88. 39. Covert, M.W. and Palsson, B.O. (2003). Constraints-based models: regulation of gene expression reduces the steady-state solution space. J Theor Biol. 221, 309-25. 40. Covert, M.W. and Palsson, B.O. (2002). Transcriptional regulation in constraints-based metabolic models of Escherichia coli. J Biol Chem. 277, 28058-64.
41. Milo, R. (2002). Network motifs: simple building blocks of complex networks. Science. 298, 824-827. 42. Balasubramanian, R., LaFramboise, T. et al. (2004). A graph-theoretic approach to testing associations between disparate sources of functional genomics data. Bioinformatics. 20, 3353-62. 43. Ihmels, J., Levy, R. et al. (2004). Principles of transcriptional control in the metabolic network of Saccharomyces cerevisiae. Nat Biotechnol. 22, 86-92. 44. Patil, K.R. and Nielsen, J. (2005). Uncovering transcriptional regulation of metabolism by using metabolic network topology. Proc Natl Acad Sci U.S.A. 102, 2685-9. 45. Kharchenko, P., Church, G.M. et al. (2005). Expression dynamics of a cellular metabolic network. Mol Syst Biol. 1, 2005, 0016. 46. Zaslaver, A., Mayo, A.E. et al. (2004). Just-in-time transcription program in metabolic pathways. Nat Genet. 36, 486-491. 47. Zhang, L.V., King, O.D. et al. (2005). Motifs, themes and thematic maps of an integrated Saccharomyces cerevisiae interaction network. J Biol. 4, 6. 48. Jansen, R., Greenbaum, D. et al. (2002). Relating whole-genome expression data with protein-protein interactions. Genome Res. 12, 37-46. 49. Hughes, T.R., Marton, M.J. et al. (2000). Functional discovery via a compendium of expression profiles. Cell. 102, 109-26. 50. Jansen, R., Yu, H. et al. (2003). A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science. 302, 449-53. 51. de Lichtenberg, U., Jensen, L.J. et al. (2005). Dynamic complex formation during the yeast cell cycle. Science. 307, 724-7. 52. Ozier, O., Amin, N. et al. (2003). Global architecture of genetic interactions on the protein network. Nat Biotechnol. 21, 490-1. 53. Ye, P., Peyser, B.D. et al. (2005). Commensurate distances and similar motifs in genetic congruence and protein interaction networks in yeast. BMC Bioinformatics. 6, 270. 54. Ye, P., Peyser, B.D. et al. (2005). 
Gene function prediction from congruent synthetic lethal interactions in yeast. Mol Syst Biol. 1, 2005, 0026. 55. Kelley, R. and Ideker, T. (2005). Systematic interpretation of genetic interactions using protein networks. Nat Biotechnol. 23, 561-6. 56. Bourguignon, P., Danos, V. et al. (2006). Property-driven statistics of biological networks. Transactions in Computational Systems Biology. 57. Smidtas, S. (2006). Local and global analyses of heterogeneous molecular interaction networks. Bioinformatics. University of Evry. 58. Hartwell, L.H., Hopfield, J.J. et al. (1999). From molecular to modular cell biology. Nature. 402, C47-52. 59. von Mering, C., Zdobnov, E.M. et al. (2003). Genome evolution reveals biochemical networks and functional modules. Proc Natl Acad Sci U.S.A. 100, 15428-33. 60. Karp, P.D., Riley, M. et al. (2002). The EcoCyc Database. Nucleic Acids Res. 30, 56-8.
61. Shen-Orr, S.S., Milo, R. et al. (2002). Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet. 31, 64-8. 62. Alon, U. (2003). Biological networks: the tinkerer as an engineer. Science. 301, 1866-7. 63. Mangan S., Alon U. (2003). Structure and function of the feed-forward loop network motif. Proc Natl Acad Sci U.S.A. 100, 11980-5. 64. Kalir, S., Mangan, S. et al. (2005). A coherent feed-forward loop with a SUM input function prolongs flagella expression in Escherichia coli. Mol Syst Biol. 1, 2005, 0006. 65. Mangan, S., Itzkovitz, S. et al. (2006). The incoherent feed-forward loop accelerates the response-time of the gal system of Escherichia coli. J Mol Biol. 356, 1073-81. 66. Wong, S.L., Zhang, L.V. et al. (2004). Combining biological networks to predict genetic interactions. Proc Natl Acad Sci U.S.A. 101, 15682-7. 67. Savageau, M.A. (1976). Biochemical Systems Analysis: a Study of Function and Design in Molecular Biology. Reading, Massachusetts, Addison-Wesley. 68. Heinrich, R., Schuster, S. (1996). The Regulation of Cellular Systems. New York, Chapman and Hall. 69. Fell, D. (1997). Understanding the Control of Metabolism. London, Portland Press. 70. Simao, E., Remy, E. et al. (2005). Qualitative modelling of regulated metabolic pathways: application to the tryptophan biosynthesis in Escherichia coli. Bioinformatics. 21, 190-196. 71. Siegel, A., Radulescu, O. et al. (2006). Qualitative analysis of the relation between DNA microarray data and behavioral models of regulation networks. Biosystems. 84, 153-74. 72. Stelling, J., Klamt, S. et al. (2002). Metabolic network structure determines key aspects of functionality and regulation. Nature. 420, 190-3. 73. Barrett, C.L., Herring, C.D. et al. (2005). The global transcriptional regulatory network for metabolism in Escherichia coli exhibits few dominant functional states. Proc Natl Acad Sci U.S.A. 102, 19103-8. 74. Iliopoulos, I., Tsoka, S. et al. (2003). 
Evaluation of annotation strategies using an entire genome sequence. Bioinformatics. 19, 717-26. 75. Osterman, A., Overbeek, R. (2003). Missing genes in metabolic pathways: a comparative genomics approach. Curr Opin Chem Biol. 7, 238-51. 76. Kharchenko, P., Vitkup, D. et al. (2004). Filling gaps in a metabolic network using expression information. Bioinformatics 20 Suppl 1, I178-I185. 77. Forster, J., Famili, I. et al. (2003). Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome Res. 13, 244-53. 78. Bork, P., Jensen, L.J. et al. (2004). Protein interaction networks from yeast to human. Curr Opin Struct Biol. 14, 292-9. 79. Enright, A.J., Iliopoulos I. et al. (1999). Protein interaction maps for complete genomes based on gene fusion events. Nature. 402, 86-90. 80. Marcotte, E.M., Pellegrini, M. et al. (1999). A combined algorithm for genome-wide prediction of protein function. Nature. 402, 83-6. 81. Mewes, H.W. (2002). MIPS : a database for genomes and protein sequences. Nucleic Acids Res. 30, 31-34.
82. Rhodes, D.R., Tomlins, S.A. et al. (2005). Probabilistic model of the human protein-protein interaction network. Nat Biotechnol. 23, 951-9. 83. Zheng, Y., Szustakowski, J.D. et al. (2002). Computational identification of operons in microbial genomes. Genome Res. 12, 1221-30. 84. Romero, P.R. and Karp, P.D. (2004). Using functional and organizational information to improve genome-wide computational prediction of transcription units on pathway-genome databases. Bioinformatics. 20, 709-17. 85. Herrgard, M.J., Lee, B.S. et al. (2006). Integrated analysis of regulatory and metabolic networks reveals novel regulatory mechanisms in Saccharomyces cerevisiae. Genome Res. 16, 627-35. 86. Sharan, R. and Ideker, T. (2006). Modeling cellular machinery through biological network comparison. Nat Biotechnol. 24, 427-33. 87. Barabasi, A.L. and Oltvai, Z.N. (2004). Network biology: understanding the cell's functional organization. Nat Rev Genet. 5, 101-13. 88. Klipp, E., Heinrich, R. et al. (2002). Prediction of temporal gene expression. Metabolic optimization by re-distribution of enzyme activities. Eur J Biochem. 269, 5406-13. 89. Milo, R., Itzkovitz, S. et al. (2004). Superfamilies of evolved and designed networks. Science. 303, 1538-42. 90. Yeger-Lotem, E., Sattath, S. et al. (2004). Network motifs in integrated cellular networks of transcription-regulation and protein-protein interaction. Proc Natl Acad Sci U.S.A. 101, 5934-9.
CHAPTER 8 EVOLUTION OF REGULATORY NETWORKS
Amélie Veron, Dion Whitehead and Erich Bornberg-Bauer Division of Bioinformatics, Institute for Evolution and Biodiversity, The Westphalian Wilhelm's University of Münster, Germany
1. Introduction
The variety of shapes and forms observed in plants and animals is ultimately a result of their different genomes. The genome encodes many regulatory networks consisting of interacting genes and gene products. Understanding the evolution of such networks will explain not only the variation and diversification of various body plans during evolution, but also how biological complexity in general is generated.
Networks are a focus in many areas of biology. New technologies in the fields of genomics, proteomics and transcriptomics provide opportunities to study many biological components simultaneously, revealing complex interacting networks that span multiple biological scales, from interacting populations of cells to networks of genes and gene products. Graph analysis has revealed similar patterns in the structures of qualitatively different networks, hinting that common rules or constraints may be guiding the evolution of different networks.
The basic elements in gene regulatory networks are the transcription factors. A transcription factor may bind to a region of DNA, usually upstream of a gene, and cause an increase or decrease in the transcription rate of the gene, which may itself code for a transcription factor. When
such transcription factors interact with and/or regulate each other, they can form complex networks (see Chapter 4). Some transcription factors are arranged into specific subnetwork structures that occur much more often than expected by chance, suggesting a maintained function: these preserved small networks are called motifs (see Chapter 2). At a higher scale, groups of motifs are organised into modules that are controlled by key regulators, and often consist of evolutionarily related transcription factors.
Current evidence suggests that the major driving forces in the evolution of regulatory networks operate at two levels: at the level of the gene, through series of (whole-genome or single-gene) duplications, and at the level of domains, through domain rearrangements. Although large-scale chromosomal duplications have occurred, it is uncertain how important they are compared to single-gene duplications. The DNA-binding domains of transcription factors comprise a small set of conserved domains. The functions of the transcription factors are fine-tuned during evolution by altering secondary regulation domains. Much less is known about the evolution of promoter regions as they are sparsely located, difficult to predict accurately, and depend on the overall context of the transcription factor. Recently a debate has begun on the contribution of neutral and positive selection to the evolution of transcription factor networks.
The independent evolution of similar yet complex traits presents a challenge when explaining biological complexity. The occurrence of a single complex trait may be attributed to chance, but the repeated generation of similar, yet complex structures implies that chance should be relegated to a secondary role. For example, the structure of the eye in higher vertebrates is intriguingly similar to the eye in the octopus, yet no such eye existed in their last common ancestor, suggesting that both eyes evolved independently. Other examples are the repeated evolution of flight by bats and birds, and the streamlining of body structures in response to an aquatic environment in fish, saurians, birds and mammals. These events are often termed phenotypically convergent evolution and can be seen as an incarnation of Gould's metaphoric question of "What would be conserved if the tape were played twice?" (1).
Computer-based experiments have demonstrated that complex structures can arise repeatedly during evolution from different initial
conditions, for example at the level of abstract chemistry (2) and primordial evolution in an RNA world (3). Whole organism evolution is much more difficult to study and analyse as there are many conflicting fitness parameters, probably with complex interdependencies. Gene regulation appears to be the main platform where evolution tinkers with possibilities. However, when complex, interdependent networks are changed, the result is often detrimental. Explaining how evolution occurs on such networks is fundamental in understanding evolution and the origins of complexity. Gene regulation occurs mostly at the transcription level in both Eukaryotes and Prokaryotes; the number of transcription factors ranges from a few hundred in E. coli to more than 3000 in humans (4,5). The picture in Eukaryotes is complicated further still by the seeming importance of non-coding RNAs that edit the raw transcript (6). Thus, a small amount of shuffling or duplicating connections can lead to a combinatorial number of different networks. Combined with this plasticity is an apparent contradiction: small changes in single genes can have drastic consequences during development (7) while at the same time regulatory networks display a robustness to random changes (8). Since the development of body structures requires the concerted action of many interacting genes, the question of how such structures evolve in the first place is highly relevant and difficult. Do networks evolve in big leaps by duplicating a whole module during a genome duplication? Do new transcription factors always emerge together with their up- and down-stream binding sites? Observations from developmental molecular biology indicate that reuse of regulatory genes (and sets of genes) plays an important role in the evolution of more complex forms. For example, the Hox gene clusters are a set of genes that not only have a high degree of sequence conservation between metazoan phyla, but also retain chromosomal order. 
They have many roles in development, and it appears that the clusters have been duplicated in more complex organisms such as vertebrates (9), suggesting that large scale duplications of developmental pathways have provided raw material for increases in organism complexity.
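The duplication scenarios raised above can be made concrete with a toy graph operation: when a gene is duplicated together with its regulatory region, the copy initially inherits both the regulators and the targets of the original. The sketch below is an illustrative assumption rather than a model taken from the cited studies; the function `duplicate_gene` and the gene names are hypothetical.

```python
from typing import Dict, Set

# Regulatory network as adjacency sets: regulator -> set of regulated genes.
Network = Dict[str, Set[str]]


def duplicate_gene(network: Network, gene: str, copy: str) -> Network:
    """Return a new network in which `copy` is a duplicate of `gene`.

    The copy inherits every outgoing link (targets) and every incoming link
    (regulators) of the original. The input network is left unchanged.
    """
    if copy in network or any(copy in targets for targets in network.values()):
        raise ValueError(f"{copy!r} already exists in the network")
    new_net: Network = {reg: set(targets) for reg, targets in network.items()}
    new_net.setdefault(gene, set())
    # Incoming links: every regulator of `gene` now also regulates `copy`.
    for targets in new_net.values():
        if gene in targets:
            targets.add(copy)
    # Outgoing links: `copy` regulates whatever `gene` regulates.
    new_net[copy] = set(new_net[gene])
    return new_net


# Hypothetical three-gene network, for illustration only.
network = {"tfA": {"tfB", "geneX"}, "tfB": {"tfA"}}
duplicated = duplicate_gene(network, "tfA", "tfA2")
for regulator in sorted(duplicated):
    print(regulator, "->", sorted(duplicated[regulator]))
```

A single duplication doubles the number of links attached to the duplicated gene in this toy network, which hints at how quickly wiring can grow before divergence removes or rewires some of the inherited links.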
Here we summarise briefly some of the most recent insights from genomics, proteomics, transcriptomics, and in silico studies on regulatory network evolution. After defining a gene regulatory network, we break the subject into three Sections. The first looks at network components, such as transcription factors and binding sites, and discusses the rules and constraints of evolution on these components. The second Section takes a more holistic view, regarding a large-scale network as an evolving entity. The third Section discusses recent work on the evolutionary dynamics of regulatory networks, addressing the dynamical processes that give rise to observed network structures. We close with some speculations which attempt to provide a coherent picture on how complex networks may have evolved. It can only be preliminary since the field is moving rapidly. Therefore, much caution is required in the interpretation of current findings and knowledge.
2. Definitions
From a simplified perspective, a regulatory network consists of three components (see also Fig. 1 and Chapter 4):
1. The gene encoding a transcription factor (TF);
2. The upstream (cis-) regulatory region of the gene by which (through another transcription factor) transcription of the gene is mediated;
3. The regulatory region of a different gene to which the transcription factor finally binds.
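The three components listed above map naturally onto a small data structure. The following sketch is one possible encoding, given only as an illustration: the class `Gene`, its fields and the helper `targets_of` are hypothetical names, and the three-gene example is invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Gene:
    """A gene together with its upstream (cis-) regulatory region."""
    name: str
    encodes_tf: bool = False  # component 1: the gene encodes a TF
    # Component 2: TFs that bind this gene's own regulatory region.
    regulators: Set[str] = field(default_factory=set)


def targets_of(genes: Dict[str, Gene], tf: str) -> List[str]:
    """Component 3: genes whose regulatory region is bound by `tf`."""
    return sorted(g.name for g in genes.values() if tf in g.regulators)


# Hypothetical toy network: A and B regulate each other, A also regulates C.
genes = {
    "A": Gene("A", encodes_tf=True, regulators={"B"}),
    "B": Gene("B", encodes_tf=True, regulators={"A"}),
    "C": Gene("C", encodes_tf=False, regulators={"A"}),
}
print(targets_of(genes, "A"))
```

Storing the regulators on the regulated gene mirrors the biology, where the binding sites sit in that gene's regulatory region; the targets of a given TF are then recovered by a lookup.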
As we will discuss later, the links from transcription factor to regulatory region are mostly one-to-many, i.e. each transcription factor regulates more than one gene and most genes are controlled by some, albeit generally few, transcription factors. Most transcription factors are regulated by at least one other transcription factor, with the exception of maternal transcription factors, giving rise to quite complex regulatory networks. Each of these elements can be decomposed into sub-elements. For example, regulatory regions are often modular; the transcription factors
themselves typically contain a DNA binding domain associated with regulatory domains. These are dimerisation domains to form homo- or hetero-dimers and other protein interaction domains which mediate interaction with different proteins such as signalling proteins, allowing the TF to react to physiological or environmental changes.
Figure 1. Diagram of a transcription factor and its binding site, which comprise the units of gene regulatory networks. Thin arrows represent binding, thick arrows represent transcription and translation. The transcription factor A binds to the binding site (dark grey DNA) of gene B, activating transcription of B and production of transcription factor B. B then in turn binds to the promoter region of gene A, activating transcription of gene A. The effects of each transcription factor on its target gene may be different; for example, B may instead inhibit the expression of gene A, resulting in a form of negative feedback. A can also affect its own transcription, shown here within the shaded grey box. Such regulation is called auto-regulation. The activity of A may be further regulated by e.g. external stimuli acting on the second domain of A. By linking together many such transcription factors and inputs for external stimuli, large networks can be formed that can process biological information about the state of the cell and its surroundings.
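The two-gene loop of Figure 1 can be explored with a minimal Boolean abstraction, in which each gene is either active or inactive and both genes are updated at the same time. This abstraction and the synchronous update rule are simplifying assumptions made for illustration and are not part of the figure; the names `step` and `trajectory` are hypothetical.

```python
from typing import List, Tuple

State = Tuple[bool, bool]  # (A active, B active)


def step(state: State, b_inhibits_a: bool) -> State:
    """One synchronous update of the two-gene loop.

    A activates B. B either inhibits A (negative feedback) or activates it.
    """
    a, b = state
    next_a = (not b) if b_inhibits_a else b
    next_b = a
    return (next_a, next_b)


def trajectory(start: State, b_inhibits_a: bool, n_steps: int) -> List[State]:
    """Return the start state followed by `n_steps` successive updates."""
    states = [start]
    for _ in range(n_steps):
        states.append(step(states[-1], b_inhibits_a))
    return states


# Negative feedback: starting with A on and B off, the loop cycles.
for state in trajectory((True, False), b_inhibits_a=True, n_steps=4):
    print(state)
```

Under these assumptions the negative-feedback loop returns to its starting state after four updates, whereas with B activating A the states in which both genes are on, or both are off, stay unchanged.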
3. Evolution at the Molecular Level
3.1. Transcription Factors: DNA-Binding Domains
The DNA binding domains of transcription factors are among the most ancient protein domains and are derived from a relatively small set of
folds (10-13). During evolution, these domains have recombined, forming such families as the bZIP proteins and nuclear receptors (NR) that comprise an ancient DNA-binding domain and some additional domains. The family of bHLH transcription factors is another example of a complex family history (see Fig. 2). In this family, the DNA binding basic region is adjacent to the HLH domain. The HLH domain is an ancient homo- and hetero-dimerising domain and is characteristic of the protein family. The specific combination of additional domains (basic DNA binding domain, leucine zipper, PAS or Orange) defines the monophyletic subgroups with distinct functions, many of which have emerged subsequently during evolution. The evolution of most of the above-mentioned transcription factors appears to follow a pattern of single-gene duplications originating from a
Figure 2. Schema of the protein interaction network of bHLH proteins. Each bHLH group has evolved separately from an ancestral gene, primarily through series of gene duplications. Some proteins appear as hubs and, typically, are descendants of ancient homo-dimerising proteins. The corresponding domain-architectures are given in corresponding boxes. Homo-dimerising proteins are drawn as ellipses, hetero-dimerising ones as rectangles and hetero-dimerising interactions as connecting lines. After (14, 23) and with permission from Cell. Mol. Life Sci. (Birkhäuser Verlag).
single ancestral protein (bHLH, NR, and bZIP proteins, and TFs in E. coli) (14,15). Thus, those protein families are monophyletic. Although many TF families are present in all kingdoms, some TFs are more exclusive. For example, some Helix-Turn-Helix (HTH) families appear mainly in bacteria (16) while the nuclear receptors are particularly frequent in C. elegans (17). The Zn2/Cys6 type zinc cluster only appears in fungal genomes (18). The MADS-box genes are restricted to Eukarya, and were duplicated before the plant-animal divergence (19). Both resulting types of genes are found in all Eukarya; however, the type II genes have particularly proliferated in plants, giving rise to the large MIKC family (20). In general, the distribution of protein families is more homogeneous in Bacteria than in Eukarya: a possible reason is frequent horizontal transfer (21-23). There are around 12 to 15 families of DNA-binding domains according to structural classifications. The most frequent family in Eukarya is that of the zinc fingers, which are strongly over-represented in animals and plants.
3.2. Transcription Factors: Dimerisation Domains
Dimerisation is the binding of two similar transcription factors whose function is determined by both binding partners. Transcription factor dimers can then subsequently bind (or not) to DNA. Ignoring whether DNA binding precedes dimerisation or vice versa, the network of dimerisation pairs is of particular interest as it can be analysed using abundant protein interaction data and phylogenetic trees (14). For the bHLH family, protein members of the "D-group" act as modulators. They have a dimerisation (HLH) domain but no DNA-binding domain; they can therefore act as repressors by binding to and deactivating the dimerisation partner. From phylogenetic profiling it was concluded that the ancestral proteins are likely to have been homo-dimerising (24). 
Following a series of single gene duplications, as mentioned above, they have produced a complex interaction network where the network hubs are descendants of the ancestral protein. The MADS-box genes display what might be called "evolvability": by evolving the dimerisation domains, the MADS-box genes can interact
with many different dimerisation partners and achieve a specificity for the target gene and corresponding physiological conditions. A large fraction of fungal and animal MADS-box genes interact with phylogenetically unrelated partners, such as homeodomain or zinc-finger proteins. Even though the plant MADS-box genes form complexes mostly with other MADS-box genes (25), the multimerisation of some plant MADS-domain proteins (termed MIKC-type) has been facilitated by the acquisition of a second protein-interaction domain during evolution (26). This combination of high specificity with the ability to bind potentially many other dimerisation partners has provided adaptability to the MADS network.
3.3. Transcription Factors: Effector Domains
Effector domains trigger specific responses upon activation from a regulatory domain. Examples of such responses include flagella rotation, enzymatic catalysis, or relaying a signal from a signalling pathway to the DNA-binding property of a transcription factor. To our current knowledge, no studies have linked genetic regulatory networks with non-genetic networks such as signalling pathways, even though many effector domains are protein-interaction domains. It has been suggested that in the bHLH system, the loss (leucine zipper) and accretion (PAS, Orange) of domains have supported the structural differentiation of duplicated genes (14). As a result, these proteins lost the ability to bind to the parental domains, hence removing an evolutionary constraint, and the proteins became free to differentiate into other networks. Thus, domain architecture mediates structural changes which facilitate the formation of networks.
Apart from the DNA-binding domain (DBD), the nuclear receptors contain an effector domain that binds ligands (LBD) such as hormones. For example, in response to a hormone signal the effector domain binds specifically to a steroid that has entered the nucleus. 
A structural change in the LBD induces a structural change in the DBD, activating or deactivating its regulatory function by changing the DNA binding capability. In the absence of a signal, such nuclear receptors are inert or
repressors. Structural studies and phylogenetic analysis showed that during evolution, some LBDs have lost their ligand-binding ability (27).

3.4. Transcription Factors: Cofactors

Cofactors mediate transcriptional activity by interacting directly or indirectly with transcription factors. They include ligands, other proteins, hormones, and small molecules such as sugars. Such small molecules are less likely to change over evolution, as they are often signalling molecules from the external environment and are much smaller than cellular proteins. Cofactors often carry information from the outside world, and by binding to specific transcription factors, they allow specific genes to be activated at specific times under specific conditions. This is done by modifying either the DNA-binding domain or the effector domain of the transcription factor (25), or by triggering the movement of the transcription factor within the cell, for example from the cytoplasm to the nucleus (28). Few studies on such occurrences have been published, likely because the tight integration of signalling pathways, signalling molecules, and metabolic pathways makes such relationships difficult to unravel. Cofactors can play a role in linking different networks; an example is the LIM proteins, which interact with bHLH and zinc-finger transcription factors, protein kinases and numerous other types of proteins (29). Transcription factors, such as the above-mentioned D-group of the bHLH proteins, can also be viewed as cofactors because they block other bHLH proteins through dimerisation (see Fig. 2). Transcription factors can act as repressors or activators whose strength can change according to other interacting proteins or cofactors. These interactions are integrated by the promoter regions, and a set of transcription factors, cofactors, and promoter regions can be regarded as an information processing module (30,31).

3.5. Evolution of Promoters

The sequences that determine when a gene is expressed are as important for the evolution of the gene as the gene itself. DNA sequences coding
for promoters are qualitatively different from protein-coding sequences in that the interaction of a promoter with other cellular components is determined directly by the structure of the promoter sequence, whereas coding sequences must first be translated into amino acid sequences. This is reflected in differences in organization, structure and function of promoter sequences compared to coding sequences, and consequently in sequence variation (32). Populations appear to have extensive variation in promoter sequences, and how this variation affects or is affected by evolution is still uncertain. Some of this variation is under the influence and constraints of the DNA structure and corresponding chromatin structure (33,34). Despite extensive variation, the fact that certain promoter regions are preserved across distantly related species and that some promoter sequences are consistently underrepresented implies direct evolutionary selection (for a comprehensive review, see ref. 30). The distribution of promoters in Eukaryotes varies widely. Eukaryotic promoters usually have many transcription factor binding sites (TFBSs) and can vary widely in length, from a few hundred bases up to 100 kb. The TFBSs are usually 5-8 bp in length and widely scattered over a promoter, with 80-90% of the promoter containing no binding sites. On a promoter, several transcription factors and cofactors may interact in a context- and combination-dependent manner according to certain physiological and environmental conditions. These complexes can act to activate or repress the corresponding gene and require a modular organisation of the binding sites (25). Regions without transcription factor binding sites can also modulate interactions and influence the local DNA conformation (30). 
Although promoters undergo the same types of mutations and sequence rearrangements as coding sequences, the subsequent selection pressure is different because mutation events have distinctive consequences, for example during organismal development (32). The complexity, context dependency, and subtlety of signalling interactions mean that systematic studies are difficult and sometimes produce apparent contradictions. For example, only 2-3% of the expression variation in yeast duplicates can be explained by motif divergence in cis-regulatory regions (35), whereas point mutations in essential binding
sites can have dramatic consequences on the phenotype, especially if the gene is a transcription factor influencing the expression of many other genes. Several known phenomena could explain this contradiction. For example, the existence of compensatory binding site mutations has been shown by Tautz and coworkers (36), where a potentially deleterious mutation in the binding site of a transcription factor is neutralised by a complementary mutation in its binding partner. In addition, not all transcription factor binding sites identified by computational studies are functional, and non-functional sites are likely to play a smaller role in gene regulation (32). Because transcription factor modules depend on the physiological and environmental context, the consequences of mutations in them can be very context dependent, whereas mutations in coding regions will always have the same consequences (30). The regulation of transcription factors plays a key role in morphological diversity. This is illustrated by Hox genes, which govern early organism development. Their regulation has diverged spatially between species (37). Simple modifications within the cis-regulatory region of a TF can explain both minor and major changes between species, without involving any disruption of gene structure, contrary to what would occur if the mutation happened in the coding sequence of the effector genes (37). Therefore, evolution of regulatory regions is thought to be a major source of diversity. Some more general principles were inferred from studies on large-scale "omics" data. When a target gene is duplicated it will only be of use if it also has a functioning regulatory region. Unless it has been copied together with its previous regulatory region (or is copied accidentally into a position of another transcribed region), it will be inert and thus rapidly become a pseudogene. 
Consequently, target genes with close homology (as inferred by sequence similarity) can be expected to be more strongly co-regulated than those which have drifted further apart since, on average, promoter changes will be correlated with genetic drift. Indeed, a positive correlation is observed between the degree of sequence similarity and the co-regulation of two duplicated genes (38). This has been explicitly shown for yeast target genes that are controlled by the same TFs (39,40). The yeast coexpression network
properties have been successfully explained based on this model of co-duplication of genes with their TFBSs, deletion and duplication of individual TFBSs and gene loss (38).

3.6. Co-evolution and Decoupled Evolution

From the standpoint of the neutral theory, it is expected that a universally valid and exact molecular clock would exist if, for a given molecule, the mutation rate for neutral alleles per year were exactly equal among all organisms at all times (41). Based on this assumption, Wagner tested for a correlation between divergence in sequence and divergence in expression patterns, and found that, for yeast duplicates, there was no significant association (42). More recent studies, using more comprehensive data sets (43,44), suggested that a positive correlation existed between sequence similarity of duplicated genes and their coexpression (38). Another study by the same group (45) combined co-expression and TFBS data, revealing a high conservation of gene co-regulation between C. elegans and S. cerevisiae. It was observed that, in the case of a gene duplication in one of the species, only one of the duplicated genes had a conserved co-expression. Snel et al. then proposed a model for gene duplication in which one copy would maintain the relations of the ancestral gene, while the other would be "free" of selection constraints and could differentiate and/or undergo sub-functionalisation. Recently, the question of how network genes diverge in their transcriptional regulation after duplication was asked and tested (46), based on the data set of ref. 44. The authors found that divergence after duplication is often rapid and that there is a frequent net loss of TFBSs. However, this study has to be balanced with the observation that although the number of shared modules decreased, the number of modules in regulatory regions of duplicated genes in yeast is stable whatever the age of the duplication event (39). 
The evolution of regulatory units has been extensively studied in Bacteria, but transferring this knowledge to Eukarya is problematic since eukaryotic regulation is more complex (in particular as far as the binding sites are
concerned) and the high frequency of horizontal gene transfer in Bacteria has a strong influence on network organization (47).

4. Motifs to Modules to Networks

Although the molecular basis of evolution is relatively well understood, very little is known about the large-scale evolution of biomolecular networks. In the previous Section, the perspective taken was of the evolution of individual network components and their general evolutionary history. This perspective is necessary to understand the "rules" and constraints that nature has imposed on the different components. However, a more global or holistic approach is necessary to understand how the constraints affect the evolution of large-scale networks. As a consequence of the abundance of data which recently became available, the idea has emerged to conceptualise genetic networks as directed graphs with nodes corresponding to the transcription factors linked by edges to their target genes (see Chapter 4). The above-mentioned most basic elements (TF, upstream binding site, downstream binding site) can be combined into a next higher level, the so-called network motifs (48,13,49). Probably the most basic motif is the auto-regulatory loop (see Fig. 1, shaded box): a transcription factor regulates its own expression. In particular, for the inhibitory auto-feedback loop, Becskei and Serrano (2000) have elegantly shown in theory and experiment that it is an important basic building block enabling stability against perturbation. Generally, these motifs should not be seen as separate units of transcription, since they often overlap (51), but rather as basic architectures with a certain response behaviour. For example, the feed forward loop (FFL, see Fig. 4), also known as an oriented triangle (48), was shown experimentally to be designed such that noise from the upstream signals is eliminated while there is still a rapid response of the target genes (52). 
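The directed-graph view described above can be illustrated with a minimal Python sketch. It assumes a regulatory network given as a list of (regulator, target) pairs; the gene names and the helper functions `build_regulatory_graph` and `autoregulated` are hypothetical and serve only to show how an auto-regulatory loop appears as an edge from a node to itself.

```python
from collections import defaultdict

def build_regulatory_graph(interactions):
    """Map each regulator to the set of genes it regulates (directed edges)."""
    targets = defaultdict(set)
    for regulator, target in interactions:
        targets[regulator].add(target)
    return dict(targets)

def autoregulated(targets):
    """Return the regulators that carry an edge to themselves (auto-regulatory loops)."""
    return sorted(tf for tf, genes in targets.items() if tf in genes)

# Hypothetical interactions; the names are placeholders, not real genes.
interactions = [
    ("tfA", "tfA"),    # tfA regulates its own expression
    ("tfA", "geneX"),
    ("tfB", "geneX"),
    ("tfB", "tfA"),
]
graph = build_regulatory_graph(interactions)
loops = autoregulated(graph)
```

Storing only outgoing edges keeps the sketch short; whether a loop is activating or inhibitory would require a sign on each edge, which is omitted here.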
Alon and coworkers have investigated the structure of both genetic and protein-protein interaction networks using graph analysis algorithms (53). As a main conclusion they found that certain topologies of small subnets are statistically strongly over-represented (49,44).
Figure 3. Units of transcription factors are often arranged into motifs, which are topological structures occurring more often than chance alone could explain. Motifs could represent specific functions, for example modulation of expression by the feed forward loop. Motifs can then be arranged into modules, which are groups of motifs controlling the higher order logic of information processing in a cell (Figure adapted from ref. 55).
Conant and Wagner have followed up this idea and introduced the notion of common ancestry for gene circuits or motifs: two motifs share a common ancestor if every pair of genes in the two circuits is derived from a common ancestor, i.e. all pairs in the circuits must be duplicated genes (54). Finding that essentially no pair of motifs with identical topology had common ancestry, they concluded that their emergence is the result of convergent evolution and not of duplication of one or a few ancestral circuits, and noted that convergent evolution was more likely to be important in module topology than in protein sequence (54). Intuitively, such a conclusion suggests that there is positive selection on these motifs. The possible scenarios for the evolution of elementary networks, such as motifs, require the investigation of loss and gain of interactions in terms of duplication or co-duplication of transcription factors and target genes, as Teichmann and Babu (2004) have elaborated in a seminal paper (55). Duplication of a target gene alone will result in two genes (co-)controlled by the same TF, duplication of the TF alone will mean the target genes are controlled by the two TFs, while co-duplication will lead to genes co-regulated by both TFs (see Fig. 4). Changes in sequences of
regulatory regions and/or binding domains or secondary domains will then remove or add further links. Clearly, these basic mechanisms will lead to single input motifs (SIMs) and multiple input motifs (MIMs), while FFLs require different mechanisms like the gain of a new module in the promoter of a pre-existing gene (see Fig. 4, also for definitions). Intuitively, one might assume that motifs have arisen predominantly by duplication of the whole unit. However, in line with the aforementioned observation of convergent motif evolution, although many (≈ 45%) of the regulatory interactions in E. coli and yeast have arisen by gene duplication and conservation of the corresponding interactions (55), the main mechanisms of generating motifs are frequent losses and gains of interactions following the more elementary events mentioned above (38). Thus, at the motif level, large-scale duplication events have probably not played a major role in the evolution of genetic networks, as has recently been shown (56). The third level of network organisation according to ref. 13 comprises transcriptional modules (Fig. 3). These are collections of motifs which are fairly independently regulated. They correspond to functional units, as has been concluded from several studies using expression data (44). Modules represent collections of transcription factors that are expressed under distinct experimental conditions (57) and largely controlled by one (or very few) regulators, as was shown by hierarchical clustering of expression profiles (58). Tavazoie and coworkers (59) showed experimentally that the clusters of co-expressed genes largely correspond to sets of genes with similar function according to database classifications. It is worth noting here that modules of networks have also been identified for metabolic pathways, although in terms of the smallest stoichiometric units that cannot be further decomposed (60,61), and can be defined for signalling pathways as well. 
Future research will show if these units correspond to evolutionarily related units and if their genes, too, are highly co-regulated. For metabolic pathways, genomic clustering, indicating a strong evolutionary relationship, has been shown (62). For genetic networks, the first indirect evidence of a common evolutionary origin was provided by Schwikowski et al. (63), who showed that functionally related proteins tend to have overrepresented protein-protein
interactions and that most interactions are confined to proteins acting in the same cell compartment, including the nucleus and the transcription factors contained therein. We can now return to one of the initial questions on convergent evolution of the camera eye. A recent study based on EST data suggested that, in spite of differences in the actual topology, eye development in both organisms is controlled by sets of pairwise homologous genes although they are not involved in eye development in the common ancestor (64). Assuming now that these regulatory genes correspond to modules in the sense mentioned above (which remains to be proven) and are controlled by few transcription factors, it is conceivable that the modules arose repeatedly from this (or these) ancestral genes through the mechanism outlined above. Further analyses of global features of genetic networks and protein interaction networks were based on concepts borrowed from graph analysis and have revealed a scale-free topology (65,66). This means there are few genes, so-called hubs, controlling many others and many genes with only few links. However, regulatory networks are dynamic: it was shown that large-scale topological changes happened in the yeast genome and that, although a few TFs serve as permanent hubs, most act transiently only during certain conditions (67). The formula $C_i = 2n_i / (k_i(k_i - 1))$ defines a clustering coefficient, where $k_i$ is the number of neighbours of node i and $n_i$ is the number of links connecting the neighbours of node i to each other (66). A value close to 0 with many neighbours indicates a hub, a value close to 1 a cluster. Protein-interaction networks are very useful since they can help investigate networks emerging from dimerisation and secondary regulatory domains of TFs. In contrast to gene regulatory networks, they induce non-directed graphs. 
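The clustering coefficient defined above can be computed directly from an adjacency list. The following Python sketch assumes an undirected network stored as a dictionary of neighbour sets; the toy network and the function name `clustering_coefficient` are hypothetical, and returning 0 for nodes with fewer than two neighbours is a convention chosen here, not one stated in the text.

```python
def clustering_coefficient(adjacency, node):
    """C_i = 2 n_i / (k_i (k_i - 1)) for an undirected graph.

    adjacency maps each node to the set of its neighbours.
    """
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention: the undefined case is set to 0
    # n_i: number of links connecting the neighbours of `node` to each other
    links = sum(
        1
        for u in neighbours
        for v in neighbours
        if u < v and v in adjacency[u]
    )
    return 2.0 * links / (k * (k - 1))

# Toy network (hypothetical): a hub "H" with four mutually unlinked partners,
# and a fully connected triangle A-B-C.
toy = {
    "H": {"p1", "p2", "p3", "p4"},
    "p1": {"H"}, "p2": {"H"}, "p3": {"H"}, "p4": {"H"},
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"},
}
hub_c = clustering_coefficient(toy, "H")  # neighbours share no links
tri_c = clustering_coefficient(toy, "A")  # both neighbours are linked
```

In this toy example the hub has many neighbours but none of them interact with each other, whereas every neighbour of a triangle node is linked, which reproduces the two limiting cases mentioned in the text.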
Observing the directed graph of gene-regulatory networks in yeast (48) from two distinct perspectives revealed a fundamentally different behaviour of the two sub-graphs: (i) the graph of incoming interactions, i.e. the population of target genes being regulated by transcription factors, where the degree of connectivity (in-degree) is the number of regulatory proteins binding to a promoter region, and (ii) the graph of outgoing interactions, i.e. the population of transcription factors regulating target genes, where the degree of connectivity (out-degree) is
the number of target genes to which a transcription factor binds. It appears (48) that the in-degree has a very narrow distribution: around the mean number of transcription factors, the probability decreases at an exponential rate (nearly all target genes have the same number of transcription factors), while the out-degree has a broader distribution and follows a power law (see Chapter 4). Knockout studies in yeast suggested that target genes regulated by more than 9 TFs are, in general, less essential (they contain proportionally fewer lethal genes), while the most important transcription factors are the ones binding the most target genes (68). It is also worth noting that more complex organisms have a higher number of regulatory genes per target gene (5), suggesting that higher complexity is achieved in large part through the duplication and diversification of transcription factors and of their interactions.

5. Models of Network Evolution and Simulation Studies

Observing the dynamics of regulatory network evolution is difficult in vivo as biological evolutionary processes are normally very slow. Thus, networks reconstructed from species living today are only snapshots of a dynamic process. Like cosmologists who wish to study the birth and death processes of stars, researchers turn to computer-based models to test their theories. Simulation is widely used to test and improve current knowledge of existing biological networks, as well as to test and improve models (e.g. the λ phage life cycle (69) and B. subtilis sporulation (70)). Simulation studies also represent a way of testing rules or scenarios regarding growth and evolution for different types of biological networks. A general description of gene network evolution is a series of (whole genome or single gene) duplications, with changes in the topology obtained by gain and loss of interaction between elements of the network. 
The idea of gene and genome duplications providing the raw material for subsequent evolutionary tinkering was comprehensively presented more than 30 years ago (71). However, the rules concerning the frequency of duplication and modification of interaction,
conservation of a gene duplicate, and the creation and deletion of interactions are still under question, as well as their relative frequencies and importance. In this Section, several simulation studies are presented that cover most of today's hypotheses on the evolution of biomolecular networks.

5.1. Simulation Principles

In the real world, regulatory, protein interaction and metabolic networks have different types of elements playing different roles, resulting in different dynamics. For example, metabolic networks appear to be optimized for efficiency, whereas it is unlikely that transcriptional regulatory networks are under such direct selection pressure. Metabolic networks are concerned with the flow of materials and energy, whereas regulatory networks are concerned with the flow of information. From the point of view of simulating the evolution of the network, after all necessary simplifications the main actors are the proteins involved in the process studied. Most of the time, the system can be represented as a graph, with the vertices representing proteins which are connected by edges if they interact (the interaction being regulatory, physical or through a common metabolite). Growth of the system then arises mainly by gene duplications (be it single genes, whole genomes, or anything in between). The topology of the network is a consequence of local modification of protein function or DNA binding regions resulting in a gain, loss, or adjustment of the interaction. Several models have been proposed, starting with the random model (73), followed by more specific growing random models giving rise to scale-free networks (65), and finally models dedicated to biology, which will be presented in the next paragraphs. The general procedure of current methods is to define a model of network growth, simulate some networks, and then compare the simulated networks with appropriate real biological networks. 
The comparison measures are often global statistical properties, such as the scaling coefficient (γ), the clustering coefficient, or the subgraph count (74). In fact, the comparison of networks is a field that needs more research, as it is often not clear what biological significance
global statistical properties have. Ideally, the model should be built on testable assumptions (regarding the duplication or mutation events). What follows next is a discussion of different models of network evolution.

5.2. Duplication–Mutation Models

These models investigate the dynamics of growing networks based on network growth operations. For example, after the duplication of a gene, there are several evolutionary possibilities (Fig. 4).

[Figure 4 panels: initial transcription factor and target gene; (i) duplication of transcription factor; (ii) duplication of target gene; (iii) and (iv) duplication of transcription factor and target gene.]
Figure 4. Growth and evolution of regulatory networks by duplication, modified from ref. 55. The circles represent transcription factors and the boxes are the target genes. The thin arrows represent regulation, while the large grey arrows represent evolutionary events. From an initial transcription factor and binding site, three types of duplication can occur, each followed by different evolutionary events. (i) After duplication of the transcription factor, the new transcription factor can lose affinity for the old target gene and acquire a new target. (ii) After duplication of the target gene, the new gene can lose its transcription factor binding site and gain a new one (or maintain both targets). Duplication of both the transcription factor and the target gene results in multiple possibilities: (iii) the new pair can gain another transcription factor, or (iv) a transcription factor can gain a new interaction. Further duplications and mutations result in more complex network structures.
Pastor-Satorras et al. have developed a simple model based on gene duplication and correlated rewiring of the graph, applied to protein-protein interaction networks (75). When a protein is duplicated, all interactions of the original protein with its partners are copied. Each of the new edges has a probability δ of being deleted, while an edge between the new node and another randomly chosen node can be created with probability α. The model is quite robust to the value of the edge creation rate α as long as it stays small (proportional to 1/N, N being the size of the network). The deletion probability δ, on the other hand, is very important for the topology, and can be chosen to give realistic values of the scaling factor γ. Indeed, as introduced by (65), the connectivity distribution of the proteins in the protein interaction network follows a scale-free power-law function: $P(k) = c\,k^{-\gamma}$, where k is the connectivity and γ is the scaling factor. This simple model gives global statistical properties (such as the average connectivity degree, the clustering coefficient and the average path length) that are similar to those of biological networks. Moreover, as observed in the yeast PPI network, about 40% of the vertices are excluded from the main cluster (75). Duplication of a protein (through gene duplication) can also be correlated to its interacting partners. A stochastic model of the evolutionary growth of metabolic and signal transduction networks includes duplication rates for upstream and downstream partners of the randomly chosen duplicating protein (76). In this model, proteins are composed of 2 (or 1) domains, a downstream and an upstream domain. At each time step, a domain is randomly chosen and has the opportunity to duplicate with a given probability. Interacting partners, the upstream and downstream domains, can also be duplicated with specific probabilities. The second type of mutation included in the network is the creation of new edges (interactions) between domains. 
This model uses a finite number of domains, which can be seen as domain families. The number of domains of each family in the network follows a scale-free law, as does the global metabolic network. The resulting network fits the yeast protein network in terms of connectivity (76). Alternatively, greater emphasis can be placed on the rewiring of the network, simulating a high mutation rate of the proteins or their regulatory regions. In the case of gene expression networks, Bhan et al. (77) have
designed several models based on 3 sets of yeast expression data. The networks are derived from a kinetic model of gene expression that shows the influence of one expression level on another, but without considering whether this influence is activation or repression. Based on two different seeds (random or high clustering), the networks are grown with one of two possible growth rules. For the first growth rule, named partial duplication, at each time step a node is chosen to be duplicated with all its interactions. Each new interaction is then removed with a probability close to 50%. The other implemented scenario, gene duplication with preferential rewiring, treats both events (duplication and edge rewiring) as stochastic events, each of them having a probability of 0.5. The networks obtained from both seeds with both growth models are then compared with biological data, growing random networks (65) and a pure duplication model (no edge is deleted) for the values of the clustering coefficient, the average path length and the scaling coefficient. Both partial duplication and duplication with rewiring are found to fit the 3 yeast data sets: diauxic shift, cell cycle data (alpha factor) and cell cycle data (cdc20) (78,79). Link dynamics (addition and deletion of edges) rather than duplication might be responsible for the global network topology (56). From yeast PPI data, estimates were obtained for the rates of gene duplication ($10^{-3}$ per gene per million years), gain of interaction (or attachment rate, $10^{-1}$ new interaction partners per node per million years) and loss of interaction (~$10^{-1}$) (80). The fact that the link dynamics are estimated to be much faster than the duplication rate suggests that the events might be decoupled (56). Observing the existing network, the authors found that weakly connected nodes tended to be connected to highly connected nodes (56). Therefore, the authors concluded that the link attachment was asymmetric. 
Biologically, this asymmetry might reflect the fact that in order to establish a new interaction, only one protein needs to mutate. Specifically, at each time step, new links are added at a rate of 0.59 new interactions per node per million years. A node is chosen randomly and its new interaction partner is determined preferentially depending on its connectivity. Meanwhile, edges are removed when the average network connectivity becomes too great, with the edges to remove chosen with equal probability in the
network. The connectivity distributions of the obtained networks are compatible with the data (56).

5.3. Mutation-Only Models

Some models do not represent gene duplication, but only rewiring of the network. Two such models applying to genetic networks are reviewed in (81). Regulatory interactions are represented by Boolean networks, and more particularly by threshold networks, where the nodes are the genes and the edges are the regulatory interactions between a gene product and the regulatory region of another gene. In this type of network, each node can take two discrete values $\sigma_i = \pm 1$. The edges have a weight of +1 (activation) or -1 (repression). The absence of interaction is represented by a weight $w_{ij} = 0$ (the expression level of gene j does not influence the expression level of gene i). The weight has to be multiplied by the value representing the state of the gene (+1: on; -1: off). These state values reflect the input received by a node and are updated simultaneously at each time step:

$$\sigma_i(t+1) = \begin{cases} +1 & \text{if } \sum_j w_{ij}\,\sigma_j(t) \ge 0 \\ -1 & \text{if } \sum_j w_{ij}\,\sigma_j(t) < 0 \end{cases}$$

The first model (82) assumes that robustness is an evolutionary force and thus incorporates a selection based on robustness. At each time step, 1) a daughter network is created by (i) adding, (ii) removing, or (iii) replacing a weight in the network, where each operation occurs with a probability of 1/3. Specifically, this means that the weight of an edge will be turned from 0 to ±1 or vice versa. 2) Both the mother and the daughter network are iterated until they have reached and completed the same attractor cycle, or until the time when $\sigma_i$ differs between the two networks. The daughter network is kept if it has the same dynamics as the original network. Otherwise the mother network is kept. Such a selection clearly favors robustness: only mutations conserving the expression pattern are kept.
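The threshold dynamics and the robustness-based selection step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of ref. 82: `weights[i][j]` is read as the effect of gene j on gene i (as the sum in the update rule implies), the mutation operator is simplified to reassigning one random weight rather than drawing add, remove and replace with probability 1/3 each, and all function names and the two-gene example are hypothetical.

```python
import random

def step(weights, state):
    """Synchronous update: sigma_i becomes +1 if sum_j w_ij * sigma_j >= 0, else -1."""
    n = len(state)
    return tuple(
        1 if sum(weights[i][j] * state[j] for j in range(n)) >= 0 else -1
        for i in range(n)
    )

def trajectory(weights, state):
    """States visited from `state` until one repeats (transient plus one full cycle)."""
    seen, path = set(), []
    while state not in seen:  # terminates: there are only 2**n states
        seen.add(state)
        path.append(state)
        state = step(weights, state)
    return path

def mutate(weights, rng):
    """Assumed simplification: set one random weight to a different value in {-1, 0, +1}."""
    n = len(weights)
    daughter = [row[:] for row in weights]
    i, j = rng.randrange(n), rng.randrange(n)
    daughter[i][j] = rng.choice([w for w in (-1, 0, 1) if w != daughter[i][j]])
    return daughter

def select(mother, state, rng):
    """Keep the daughter only if it reproduces the mother's trajectory from `state`."""
    daughter = mutate(mother, rng)
    if trajectory(daughter, state) == trajectory(mother, state):
        return daughter
    return mother

# Hypothetical two-gene network: gene 1 represses gene 0, gene 0 activates gene 1.
w = [[0, -1], [1, 0]]
start = (1, 1)
cycle = trajectory(w, start)
evolved = select(w, start, random.Random(0))
```

Comparing whole trajectories from a single initial state is the simplest way to express "same dynamics"; a fuller treatment would sample several initial conditions.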
The authors use the "frozen component" (the set of silent genes for a given attractor) and initial conditions to compare their network to random and natural networks. The networks obtained with the first model have a significantly bigger frozen component than random networks, suggesting a much simpler expression pattern than in the random networks. Moreover, the biological expression patterns seem to be surprisingly simple, thus confirming the hypothesis of robustness as an evolutionary principle in nature (81). The second model (83) concentrates on local functional changes of single genes and their influence on global network architecture and dynamics. The model still uses threshold networks, but the algorithm differs in that each time the system reaches either a dynamical attractor or a maximal time Tmax, a random node is chosen and modified. If during the last run towards the attractor the node has not changed state (the gene was silent), then a non-zero weight is added between the gene and another randomly chosen one. Conversely, if the node has changed state, then one of its non-zero weights is randomly chosen and deleted. The system is then iterated until either the attractor or Tmax is reached. This system simulates local adaptations: if a gene is completely silent, either its regulation patterns change or it is eliminated; if a gene is overexpressed, it will be down-regulated. Both models presented in ref. 81 exhibit simple expression patterns that can be found in nature, particularly in switch-like networks. Further random knock-out studies will be needed to determine the structure of the regulatory network and to fully test the models (81).

5.4. Using Subgraph Count

Subgraphs, like FFLs (feed forward loops, in which a transcription factor A regulates the expression of genes B and C, and transcription factor B also regulates the transcription of gene C) or other patterns, have been counted in different networks (84). 
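Counting a subgraph such as the feed forward loop amounts to checking the three edges of the definition above for every ordered triple of nodes. The brute-force Python sketch below is only an illustration for small networks; the function name `count_feed_forward_loops` and the example edges are hypothetical, and the counting procedures used in ref. 84 are not reproduced here.

```python
from itertools import permutations

def count_feed_forward_loops(edges):
    """Count ordered triples (A, B, C) with edges A->B, A->C and B->C."""
    edge_set = set(edges)
    nodes = {n for edge in edge_set for n in edge}
    return sum(
        1
        for a, b, c in permutations(nodes, 3)
        if (a, b) in edge_set and (a, c) in edge_set and (b, c) in edge_set
    )

# Hypothetical network: A regulates B and C, B regulates C, C regulates D.
ffl_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
n_ffl = count_feed_forward_loops(ffl_edges)

# A three-node cycle contains no feed forward loop.
n_cycle = count_feed_forward_loops([("A", "B"), ("B", "C"), ("C", "A")])
```

The loop over all ordered triples scales with the cube of the number of nodes, so genome-scale networks would need a neighbour-based enumeration instead.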
In that study (84), it was observed that networks of the same type (e.g. transcriptional networks of different organisms) share not only network motifs but also a subgraph ratio profile (SP), with similar proportions of motifs and anti-motifs. In contrast, networks of different types (e.g. a transcriptional network versus a language network) have very
different SP. Some subgraphs (FFLs, multiple input, ...) have been shown to play an important role in biological networks (52). Moreover, ref. 74 revealed the link between commonly used statistical features (scaling factor and clustering coefficient) and the subgraph content of a network. Raw subgraph counts can be a very powerful means of distinguishing which model a given network arose from, and can capture more structural information than the global statistical features mentioned earlier (85). Using a supervised learning technique from the machine learning community, the AdaBoost algorithm (86), Middendorf et al. generate a decision tree able to determine which of seven mechanisms best reproduces the Drosophila melanogaster protein-protein interaction (PPI) network. The seven model mechanisms tested in this study (85) are:
• DMC (duplication mutation complementation): at each time step of the growth process, a vertex is randomly chosen and duplicated, together with all its interactions (edges). Each one of the edges from the original vertex or its twin is then deleted with a fixed probability qdel, while the twin vertices have the probability qcon to interact.
• DMR (duplication followed by random mutation): after duplication, only the twin edges can be removed, while new edges are created with random nodes of the network. No interaction is set between a vertex and its duplicate.
• LPA (linear preferential attachment): favours attachment of a new vertex to a highly connected node.
• RDS (random static network): the vertices are connected randomly.
• RDG (random growing network): new edges are added randomly between existing vertices.
• AGV (aging vertex network): the probability of attachment of a new edge decreases with the age of a vertex.
• SMW (small-world network): intermediate between regular ring lattices and randomly connected graphs.
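The DMC mechanism in the list above can be sketched as a short growth simulation. This is an illustrative reading of the description, not the code used in ref. 85: the function name `grow_dmc`, the single-edge seed graph and the rule that, for each shared neighbour, the edge to be deleted is taken from the original or the twin with equal chance are all assumptions.

```python
import random

def grow_dmc(n_vertices, q_del, q_con, seed=None):
    """Grow an undirected graph by duplication-mutation-complementation.

    Returns an adjacency dict {vertex: set of neighbours}.
    Assumed seed graph: two vertices joined by one edge.
    """
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}
    while len(adj) < n_vertices:
        original = rng.choice(sorted(adj))
        twin = len(adj)
        neighbours = sorted(adj[original])
        # Duplication: the twin inherits every edge of the original.
        adj[twin] = set(neighbours)
        for v in neighbours:
            adj[v].add(twin)
        # Complementation: with probability q_del, a shared neighbour
        # loses its edge to either the original or the twin (assumed 50/50).
        for v in neighbours:
            if rng.random() < q_del:
                loser = rng.choice((original, twin))
                adj[loser].discard(v)
                adj[v].discard(loser)
        # The two twins interact with probability q_con.
        if rng.random() < q_con:
            adj[original].add(twin)
            adj[twin].add(original)
    return adj

network = grow_dmc(200, q_del=0.5, q_con=0.1, seed=42)
n_edges = sum(len(nb) for nb in network.values()) // 2
print(len(network), "vertices,", n_edges, "edges")
```

Networks generated this way, together with networks from the other six mechanisms, would form the labelled training set on which a classifier is trained using subgraph counts as features.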
A total of 7000 networks are generated using these models and are used as a learning set for the algorithm which generates the decision tree. Technically, the decision tree assigns scores to each of the 7 models
according to the number of subgraphs found in the scanned network. All subgraphs that can be constructed by a walk of length 8 are counted, which represents 148 non-isomorphic graphs (i.e. no relabelling of the vertices can make any two of these graphs identical). The decision tree is then applied to the Drosophila PPI network. The DMC growth model is found to best describe the Drosophila PPI network, suggesting that evolution favours functional complementarity rather than random addition or deletion of functions. However, certain sets of subgraphs are best described by other growth models, such as DMR, LPA or RDG. This means that although DMC is the pure model that best explains the data, real biological networks most probably grow and evolve following a mix of several mechanisms. The DMC mechanism can be used as a base on which to build new mechanisms.
5.5. Models Motivated by Pattern Formation
In the previous models (except for the mutation-only model), selection was assumed to be neutral and evolution a completely stochastic process. While the types and relative importance of the selection pressures that operate on regulatory networks are unclear at present, some form of direct or indirect selection must exist. Introducing a selection pressure in a model of network evolution means that some networks are “fitter” than others, so a fitness function must be defined. In studying genetic models of morphogenesis and pattern formation, Salazar-Ciudad et al. developed and studied a model of regulatory network evolution in which the fittest networks were those that best produced a defined pattern of gene products, for example alternating high and low concentrations of a gene product, or gradients (87). It was observed that certain topological structures occurred at different frequencies and at different times: so-called “emergent” networks occurred earlier in the simulated evolution, and “hierarchic” networks later.
The authors' interpretation was that emergent networks are more likely to occur and to produce the required pattern but are unstable, being easily disabled by chance mutations. On the other hand, hierarchic networks occur less often but are more robust to mutations. A gradual
restructuring from emergent to hierarchic networks was observed in the simulations. This supports the idea that underlying regulatory networks can accumulate change and diversity while the resulting phenotype remains constant. To what degree this occurs in nature is currently unknown.
5.6. Can Biological Network Evolution Be Modeled?
Modeling biomolecular network evolution implies the existence of a global mechanism ruling the system. However, detailed studies of individual families reveal different evolutionary dynamics (24). In all models that fit real data sets, the evolution of gene families is strongly influenced by an "interaction" between the paralogues (88), suggesting that selection shapes the network and acts with different strength on different gene families. The global features are explained by local properties: both detailed, local studies and broad, global studies are necessary in order to understand complex systems such as biomolecular networks in a fully integrated way (74).
6. Conclusions
Contemporary biological research is characterised by the large amount of available data, whereas previously, biological networks had to be slowly pieced together from many experiments. The availability of many genomes allows experimental data to be extended from well-studied model organisms, such as E. coli and S. cerevisiae, onto newly sequenced genomes, allowing comparisons between species that further our understanding of the evolution of complex networks. At the level of the individual protein or transcription factor, cross-species observations reveal that regulatory genes that have key roles in conserved processes, such as development, are often maintained over startlingly long periods, for example the master regulatory gene for eyes (89).
The most ancestral genetic networks probably consisted of a small group of DNA binding domains that diverged into many different types of transcription factor while preserving their general fold or structure. The
most ancestral proteins are believed to have been single domain proteins. They evolved further through a series of single gene duplications, and the resulting genes are believed to be arranged into modules, which can be reused to build up higher complexity. Most proteins appear to have undergone many cycles of domain rearrangements (90), in which different domains, such as sensor domains, were added and lost again. Motifs are the organisational level above single transcription factor units (Fig. 3). An increasing number of findings provide evidence that structurally similar motifs arise repeatedly during the course of evolution. While it has not been established whether the conserved motifs are due to positive selection pressure or to the constraints and dynamics of network growth (although the former seems more likely), they are examples of convergent evolution at the level of the regulatory network. More complex modules, arising through a series of single gene duplications, are collections of regulatory genes with a tightly linked function and correlated expression. The importance of large scale gene duplications is uncertain. While they were probably crucial in allowing the evolution of complex vertebrates, e.g. via the duplication of the Hox genes (91), it is not clear if these kinds of events still play an important role in vertebrate evolution (92). For plants, however, whole genome duplications appear to occur quite often (93). Due to the difficulty in exactly defining promoters and determining their binding partners, the evolution of promoters is less well understood, although certainly of great importance. Differences in promoters could explain the paradox of the large amount of variation in body forms compared to a comparatively small difference in genomes. This also makes sense from the viewpoint of adaptability, as small changes in the organisation of promoters can have drastic consequences for gene expression and thus the body plan.
In summary, it is beginning to look possible to explain the origin and evolution of complex networks. The differences in phenotypes appear to arise from the standard set of genetic operators: gene duplications, deletions, redundancy, divergence, neutral and adaptive selection, modular linking and rearranging. The next steps are finding out the frequency, dynamics and constraints of these operators in the context of
evolution and adaptation. Armed with this knowledge, the door will be open to begin making predictions and possibly reconstructing the evolution of complex biological networks.
References
1. Gould, S. (1989). Wonderful Life: The Burgess Shale and the Nature of History. W.W. Norton, New York.
2. Fontana, W. and Buss, L. (1994). What would be conserved if "the tape were played twice"? Proc Natl Acad Sci. U.S.A. 91, 757-61.
3. Fontana, W. and Schuster, P. (1998). Continuity in evolution: on the nature of transitions. Science. 280, 1451-5.
4. Levine, M. and Tjian, R. (2003). Transcription regulation and animal diversity. Nature. 424, 147-51.
5. van Nimwegen, E. (2003). Scaling laws in the functional content of genomes. Trends Genet. 19, 479-84.
6. Ambros, V. (2001). microRNAs: tiny regulators with great potential. Cell. 107, 823-6.
7. Wilkins, A. (1986). Genetic analysis of animal development. John Wiley and Sons, New York.
8. von Dassow, G., Meir, E., Munro, E.M. and Odell, G.M. (2000). The segment polarity network is a robust developmental module. Nature. 406, 188-92.
9. Wilkins, A.S. (2002). The Evolution of Developmental Pathways. Chapter 9, 127-136. Sinauer Associates.
10. Perez-Rueda, E. and Collado-Vides, J. (2000). The repertoire of DNA-binding transcriptional regulators in Escherichia coli K-12. Nucleic Acids Res. 28, 1838-47.
11. Aravind, L. and Koonin, E. (1999). DNA-binding proteins and evolution of transcription regulation in the archaea. Nucleic Acids Res. 27, 4658-70.
12. Riechmann, J., Heard, J., Martin, G., Reuber, L., Jiang, C., Keddie, J. et al. (2000). Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science. 290, 2105-10.
13. Babu, M., Luscombe, N., Aravind, L., Gerstein, M. and Teichmann, S. (2004). Structure and evolution of transcriptional regulatory networks. Curr. Opin. Struct. Biol. 14, 283-91.
14. Amoutzias, G., Robertson, D. and Bornberg-Bauer, E. (2004). The evolution of protein interaction networks in regulatory proteins. Comparative and Functional Genomics.
15. Amoutzias, G., Robertson, D., Oliver, S. and Bornberg-Bauer, E. (2004). Convergent networks by single-gene duplications in higher eukaryotes. EMBO Rep. 5, 274-9.
16. Huffman, J. and Brennan, R. (2002). Prokaryotic transcription regulators: more than just the helix-turn-helix motif. Curr Opin Struct Biol. 12, 98-106.
17. Sluder, A., Mathews, S., Hough, D., Yin, V. and Maina, C. (1999). The nuclear receptor superfamily has undergone extensive proliferation and diversification in nematodes. Genome Res. 9, 103-20.
18. Akache, B., Wu, K. and Turcotte, B. (2001). Phenotypic analysis of genes encoding yeast zinc cluster proteins. Nucleic Acids Res. 29, 2181-90.
19. Alvarez-Buylla, E., Pelaz, S., Liljegren, S., Gold, S., Burgeff, C., Ditta, G. et al. (2000). An ancestral MADS-box gene duplication occurred before the divergence of plants and animals. Proc Natl Acad Sci. U.S.A. 97, 5328-33.
20. Becker, A. and Theissen, G. (2003). The major clades of MADS-box genes and their role in the development and evolution of flowering plants. Mol. Phylogenet. Evol. 29, 464-89.
21. Lespinet, O., Wolf, Y., Koonin, E. and Aravind, L. (2002). The role of lineage-specific gene family expansion in the evolution of eukaryotes. Genome Res. 12, 1048-59.
22. Coulson, R., Enright, A. and Ouzounis, C. (2001). Transcription-associated protein families are primarily taxon-specific. Bioinformatics. 17, 95-7.
23. Bornberg-Bauer, E., Beaussart, F., Kummerfeld, S., Teichmann, S.A. and Weiner, J. (2005). The evolution of domain arrangements in proteins and interaction networks. Cell Mol Life Sci. 62, 435-45.
24. Amoutzias, G., Weiner, J. and Bornberg-Bauer, E. (2005). Network phylogeny of protein interactions in eukaryotic transcription factors. Gene. 347, 247-53.
25. Messenguy, F. and Dubois, E. (2003). Role of MADS box proteins and their cofactors in combinatorial control of gene expression and cell development. Gene. 316, 1-21.
26. Kaufmann, K., Melzer, R. and Theissen, G. (2005). MIKC-type MADS-domain proteins: structural modularity, protein interactions and network evolution in land plants. Gene. 347, 183-98.
27. Schwabe, J. and Teichmann, S. (2004). Nuclear receptors: the evolution of diversity. Sci STKE. 2004, 4.
28. Kawana, K., Ikuta, T., Kobayashi, Y., Gotoh, O., Takeda, K. and Kawajiri, K. (2003). Molecular mechanism of nuclear translocation of an orphan nuclear receptor, SXR. Mol Pharmacol. 63, 524-31.
29. Matthews, J. and Visvader, J. (2003). LIM-domain-binding protein 1: a multifunctional cofactor that interacts with diverse proteins. EMBO Rep. 4, 1132-7.
30. Wray, G., Hahn, M., Abouheif, E., Balhoff, J., Pizer, M., Rockman, M. et al. (2003). The evolution of transcriptional regulation in eukaryotes. Mol Biol Evol. 20, 1377-419.
31. Emberly, E., Rajewsky, N. and Siggia, E. (2003). Conservation of regulatory elements between two species of Drosophila. BMC Bioinformatics. 4, 57.
32. Rodriguez-Trelles, F., Tarrio, R. and Ayala, F. (2003). Evolution of cis-regulatory regions versus codifying regions. Int J Dev Biol. 47, 665-73.
33. Wang, H., Noordewier, M. and Benham, C. (2004). Stress-induced DNA duplex destabilization (SIDD) in the E. coli genome: SIDD sites are closely associated with promoters. Genome Res. 14, 1575-84.
34. Bode, J., Goetze, S., Heng, H., Krawetz, S. and Benham, C. (2003). From DNA structure to gene expression: mediators of nuclear compartmentalization and dynamics. Chromosome Res. 11, 435-45.
35. Zhang, Z., Gu, J. and Gu, X. (2004). How much expression divergence after yeast gene duplication could be explained by regulatory motif evolution? Trends Genet. 20, 403-7.
36. Tautz, D. (2000). Evolution of transcriptional regulation. Curr. Opin. Genet. Dev. 10, 575-9.
37. Carroll, S. (2000). Endless forms: the evolution of gene regulation and morphological diversity. Cell. 101, 577-80.
38. van Noort, V., Snel, B. and Huynen, M. (2004). The yeast coexpression network has a small-world, scale-free architecture and can be explained by a simple model. EMBO Rep. 5, 280-4.
39. Papp, B., Pal, C. and Hurst, L. (2003). Evolution of cis-regulatory elements in duplicated genes of yeast. Trends Genet. 19, 417-22.
40. Maslov, S., Sneppen, K., Eriksen, K. and Yan, K. (2004). Upstream plasticity and downstream robustness in evolution of molecular networks. BMC Evol Biol. 4, 9.
41. Gojobori, T., Moriyama, E. and Kimura, M. (1990). Molecular clock of viral evolution, and the neutral theory. Proc Natl Acad Sci. U.S.A. 87, 10015-8.
42. Wagner, A. (2000). Decoupled evolution of coding region and mRNA expression patterns after gene duplication: implications for the neutralist-selectionist debate. Proc Natl Acad Sci. U.S.A. 97, 6579-84.
43. Hughes, T., Marton, M., Jones, A., Roberts, C., Stoughton, R., Armour, C. et al. (2000). Functional discovery via a compendium of expression profiles. Cell. 102, 109-26.
44. Lee, T., Rinaldi, N., Robert, F., Odom, D., Bar-Joseph, Z., Gerber, G. et al. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 298, 799-804.
45. Snel, B., van Noort, V. and Huynen, M. (2004). Gene co-regulation is highly conserved in the evolution of eukaryotes and prokaryotes. Nucleic Acids Res. 32, 4725-31.
46. Evangelisti, A. and Wagner, A. (2004). Molecular evolution in the yeast transcriptional regulation network. J. Exp. Zoolog. Part B Mol Dev Evol. 302, 392-411.
47. McAdams, H., Srinivasan, B. and Arkin, A. (2004). The evolution of genetic regulatory systems in bacteria. Nat Rev Genet. 5, 169-78.
48. Guelzim, N., Bottani, S., Bourgine, P. and Kepes, F. (2002). Topological and causal structure of the yeast transcriptional regulatory network. Nat Genet. 31, 60-3.
49. Shen-Orr, S., Milo, R., Mangan, S. and Alon, U. (2002). Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet. 31, 64-8.
50. Becskei, A. and Serrano, L. (2000). Engineering stability in gene networks by autoregulation. Nature. 405, 590-3.
51. Dobrin, R., Beg, Q., Barabasi, A. and Oltvai, Z. (2004). Aggregation of topological motifs in the Escherichia coli transcriptional regulatory network. BMC Bioinformatics. 5, 10.
52. Mangan, S., Zaslaver, A. and Alon, U. (2003). The coherent feedforward loop serves as a sign-sensitive delay element in transcription networks. J Mol Biol. 334, 197-204.
53. Yeger-Lotem, E., Sattath, S., Kashtan, N., Itzkovitz, S., Milo, R., Pinter, R. et al. (2004). Network motifs in integrated cellular networks of transcription-regulation and protein-protein interaction. Proc Natl Acad Sci. U.S.A. 101, 5934-9.
54. Conant, G. and Wagner, A. (2003). Convergent evolution of gene circuits. Nat Genet. 34, 264-6.
55. Teichmann, S. and Babu, M. (2004). Gene regulatory network growth by duplication. Nat Genet. 36, 492-6.
56. Berg, J., Lassig, M. and Wagner, A. (2004). Structure and evolution of protein interaction networks: a statistical model to link dynamics and gene duplications. BMC Evol Biol. 4, 51.
57. Ihmels, J., Friedlander, G., Bergmann, S., Sarig, O., Ziv, Y. and Barkai, N. (2002). Revealing modular organization in the yeast transcriptional network. Nat Genet. 31, 370-7.
58. Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D. et al. (2003). Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet. 34, 166-76.
59. Tavazoie, S., Hughes, J., Campbell, M., Cho, R. and Church, G. (1999). Systematic determination of genetic network architecture. Nat Genet. 22, 281-5.
60. Schuster, S., Fell, D. and Dandekar, T. (2000). A general definition of metabolic pathways useful for systematic organization and analysis of complex metabolic networks. Nat Biotechnol. 18, 326-32.
61. Schilling, C. and Palsson, B. (1998). The underlying pathway structure of biochemical reaction networks. Proc Natl Acad Sci. U.S.A. 95, 4193-8.
62. Lee, J. and Sonnhammer, E. (2003). Genomic gene clustering analysis of pathways in eukaryotes. Genome Res. 13, 875-82.
63. Schwikowski, B., Uetz, P. and Fields, S. (2000). A network of protein-protein interactions in yeast. Nat Biotechnol. 18, 1257-61.
64. Ogura, A., Ikeo, K. and Gojobori, T. (2004). Comparative analysis of gene expression for convergent evolution of camera eye between octopus and human. Genome Res. 14, 1555-61.
65. Barabasi, A. and Albert, R. (1999). Emergence of scaling in random networks. Science. 286, 509-12.
66. Barabasi, A. and Oltvai, Z. (2004). Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 5, 101-13.
67. Luscombe, N., Babu, M., Yu, H., Snyder, M., Teichmann, S. and Gerstein, M. (2004). Genomic analysis of regulatory network dynamics reveals large topological changes. Nature. 431, 308-12.
68. Yu, H., Greenbaum, D., Xin Lu, H., Zhu, X. and Gerstein, M. (2004). Genomic analysis of essentiality within protein networks. Trends Genet. 20, 227-31.
69. Arkin, A., Ross, J. and McAdams, H.H. (1998). Stochastic kinetic analysis of developmental pathway bifurcation in phage lambda-infected Escherichia coli cells. Genetics. 149, 1633-48.
70. de Jong, H., Geiselmann, J., Batt, G., Hernandez, C. and Page, M. (2004). Qualitative simulation of the initiation of sporulation in Bacillus subtilis. Bull Math Biol. 66, 261-99.
71. Ohno, S. (1970). Evolution by gene duplication. Springer-Verlag, Berlin, New York.
72. Edwards, J.S., Ibarra, R.U. and Palsson, B.O. (2001). In silico predictions of Escherichia coli metabolic capabilities are consistent with experimental data. Nat Biotechnol. 19, 125-30.
73. Erdős, P. and Rényi, A. (1960). On the evolution of random graphs. Publ Math Inst Hung Acad Sci. 5, 17-61.
74. Vazquez, A., Dobrin, R., Sergi, D., Eckmann, J., Oltvai, Z. and Barabasi, A. (2004). The topological relationship between the large-scale attributes and local interaction patterns of complex networks. Proc Natl Acad Sci. U.S.A. 101, 17940-5.
75. Pastor-Satorras, R., Smith, E. and Sole, R. (2003). Evolving protein interaction networks through gene duplication. J Theor Biol. 222, 199-210.
76. Rzhetsky, A. and Gomez, S. (2001). Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics. 17, 988-96.
77. Bhan, A., Galas, D. and Dewey, T. (2002). A duplication growth model of gene expression networks. Bioinformatics. 18, 1486-93.
78. DeRisi, J., Iyer, V. and Brown, P. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 278, 680-6.
79. Spellman, P., Sherlock, G., Zhang, M., Iyer, V., Anders, K., Eisen, M. et al. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 9, 3273-97.
80. Wagner, A. (2003). How the global structure of protein interaction networks evolves. Proc R Soc Lond B. Biol Sci. 270, 457-66.
81. Bornholdt, S. (2001). Modeling genetic networks and their evolution: a complex dynamical systems perspective. Biol Chem. 382, 1289-99.
82. Bornholdt, S. and Sneppen, K. (2000). Robustness as an evolutionary principle. Proc R Soc Lond B. Biol Sci. 267, 2281-6.
83. Bornholdt, S. and Rohlf, T. (2000). Topological evolution of dynamical networks: global criticality from local dynamics. Phys Rev Lett. 84, 6114-7.
84. Milo, R., Itzkovitz, S., Kashtan, N., Levitt, R., Shen-Orr, S., Ayzenshtat, I. et al. (2004). Superfamilies of evolved and designed networks. Science. 303, 1538-42.
85. Middendorf, M., Ziv, E. and Wiggins, C. (2005). Inferring network mechanisms: The Drosophila melanogaster protein interaction network. Proc. Natl. Acad. Sci. U.S.A. 102, 3192-7.
86. Freund, Y. and Schapire, R. (1997). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comp. Sys. Sciences. 55, 119-139.
87. Salazar-Ciudad, I., Newman, S.A. and Sole, R.V. (2001). Phenotypic and dynamical transitions in model genetic networks (1): Emergence of patterns and genotype-phenotype relationships. Evol Dev. 3, 84-94.
88. Karev, G., Wolf, Y., Berezovskaya, F. and Koonin, E. (2004). Gene family evolution: an in-depth theoretical and simulation analysis of non-linear birth-death-innovation models. BMC Evol Biol. 4, 32.
89. Halder, G., Callaerts, P. and Gehring, W.J. (1995). New perspectives on eye evolution. Curr Opin Genet Dev. 5, 602-9.
90. Bornberg-Bauer, E., Beaussart, F., Kummerfeld, S., Teichmann, S. and Weiner, J. (2005). The evolution of domain arrangements in proteins and interaction networks. Cell Mol Life Sci. 62, 435-45.
91. Jozefowicz, C., McClintock, J. and Prince, V. (2003). The fates of zebrafish hox gene duplicates. J Struct Funct Genomics. 3, 185-94.
92. Wolfe, K.H. (2001). Yesterday's polyploids and the mystery of diploidization. Nat Rev Genet. 2, 333-41.
93. Bancroft, I. (2002). Insights into cereal genomes from two draft genome sequences of rice. Genome Biol. 3, REVIEWS1015.
CHAPTER 9 COMPLEXITY IN NEURONAL NETWORKS
Yves Frégnac, Michelle Rudolph, Andrew P. Davison and Alain Destexhe Unité de Neurosciences Intégratives et Computationnelles (UNIC), UPR CNRS 2191, 1 Avenue de la Terrasse, 91198 Gif-sur-Yvette, France
[email protected]
1. Introduction
The brain can be thought of as a collective ensemble ranging in the spatial domain from microscopic elements (molecules, receptors, ionic channels, synapses) to macroscopic entities (layers, nuclei, cortical areas, neural networks) (Fig. 1). The same multi-scale analysis can be replicated in the temporal domain, by decomposing brain activity into a multitude of dynamic processes with time constants ranging from microseconds (molecule transconformation, channel opening) to years (postnatal cell replacement, for example in the bird song system; long-term memories, for example in vertebrate hippocampus). A tantalising challenge to the field of systems and computational neuroscience is to bind in a coherent way these different hierarchies of organisation on the basis of experimentally defined descriptors, each of which is endowed with a specific spatio-temporal domain and measurement precision.
[Figure 1 labels, from largest to smallest scale: 1 m, CNS; 10 cm, systems; 1 cm, maps; 1 mm, networks; 100 μm, neurons; 1 μm, synapses; 1 Å, molecules. Inset labels: receptive field; cell; (A) simple; (B) complex; lateral geniculate cells; simple cortical cell.]
Figure 1. Schematic levels of spatial integration in the nervous system. The spatial scale at which anatomical organizations can be identified varies over many orders of magnitude. Icons to the right represent structures at distinct levels in a bottom-up fashion: (bottom) a chemical synapse, (middle-bottom) a network model of how thalamic afferent cells could be connected to simple cells in visual cortex, (middle-top) maps of orientation preference and ocular dominance in a primary visual area; (top) the subset of visual areas forming visual cortex and their interconnections (adapted from ref. 1).
An issue central to the theme of complexity in biological systems concerns the inferences that can be made from one level of integration to the next, along reductionist top-down (from macroscopic to microscopic) or synthetic bottom-up (from microscopic to macroscopic) axes. The question addressed by combining computational and systems neuroscience tools is 'to what extent can properties specific to one level of organisation be predicted by those demonstrated at lower levels of organisation?', or, in other words, is the 'whole' the sum of the 'parts'? This question applies not only at the structural level but also at the functional level. In the latter case, one wants to assess to what extent can the global systemic behaviour (e.g. network dynamics) be reproduced on the basis of knowledge of the 'intrinsic reactivity' of the elements (ion channels, neurons) and the 'extrinsic' relational links established between
them (synapses)? Any failure of the linear/additive synthetic approach opens the Pandora's Box of complexity. As underlined by Tomaso Poggio in a famous essay (1), one of the main reasons explaining the conceptual distance between brain theoreticians and biologists is the relative ignorance of the nature and properties of the biophysical substrate that implements the elementary stages of neural information processing. The classical vision of the neuron and its integrative function as a summating unit, with multiple input lines, static synaptic gains, a postsynaptic threshold and a single 'all-or-none' spiking output reflects our incapacity to recognise the necessity of non-linear operations on graduated, analogue input signals. Theoreticians dealing with McCulloch-Pitts (2) assemblies were initially tempted to use only additions and subtractions, while the neuronal machinery of the brain is obviously capable of non-linear input-output relationships, such as temporal integration and division of excitation by inhibition, and thus of more elaborate computations. The existence of distributed local non-linearities in the process of assembly making is a first indication of the complexity of living networks. An additional difficulty in crossing bridges between different organisational levels is the lack of systematic methods for reducing complexity. The problem here is to define methods and rationales for separating informationless or 'noisy' variability from structural diversity capable of providing distinct biophysical substrates for different functions. This chapter will present a review of the structural and functional complexity of neurons and networks, with an emphasis on the vertebrate brain and more specifically neocortex. We will also point out major unsolved issues that limit a bottom-up synthetic approach, such as the lack of separability between intrinsic excitability of neurons and extrinsic modulation by synaptic activity. 
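The contrast drawn above between a purely summating unit and the division of excitation by inhibition can be illustrated with a toy calculation. This is a hedged sketch rather than a biophysical model: the function names, the all-or-none threshold and the constant `sigma` in the divisive unit are assumptions chosen for illustration only.

```python
def threshold_unit(inputs, weights, threshold):
    """Summating unit: weighted sum of inputs, then all-or-none output."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

def divisive_unit(excitation, inhibition, threshold, sigma=1.0):
    """Toy non-linear unit: excitation divided by (sigma + inhibition).

    sigma is a hypothetical constant that keeps the division finite.
    """
    drive = excitation / (sigma + inhibition)
    return 1 if drive >= threshold else 0

# Weak and strong versions of the same excitation/inhibition mix.
for excitation, inhibition in [(3.0, 2.0), (30.0, 20.0)]:
    additive = threshold_unit([excitation, inhibition], [1.0, -1.0], 2.0)
    divisive = divisive_unit(excitation, inhibition, 2.0)
    print(excitation, inhibition, additive, divisive)
```

With subtraction, scaling both inputs tenfold eventually drives the unit above threshold, whereas with division the output depends mostly on the ratio of excitation to inhibition and the unit stays silent.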
A third problem is whether current knowledge of brain processes is advanced enough to allow the derivation of function from structure. The classical engineering approach, the so-called black-box system approach, relies only on the study of input-output relationships and inference of transfer functions. The link with biological 'hardware' is not guaranteed and proceeds mostly from analogy (e.g. lateral inhibition and Mexican hat profile of sensory receptive fields). A fallacious strategy followed by
many cortical physiologists for the last 50 years has been to map operational engineering models onto the biological structure and ignore the 'hidden' structural complexity. By doing so, one may wrongly transpose to a more 'microscopic' level the additive/linear nature of the global computation realized at a more integrated level (4), without realizing that non-linear interactions between non-linear elements can subserve a global linear transform (5-7). For instance, the functional study of sensory perception in the brain shows numerous abuses of mapping ad-hoc serial linear-non-linear (L-NL) engineering models onto biological recurrent networks, whose topological architecture is composed mostly of non-linear reverberating loops.
1.1. Interacting Partners
As has been clear since Golgi and Cajal, the principal components of the vertebrate brain are neurons. Neurons are a particular class of cells, endowed with excitable membranes and specialised compartments (soma, dendrite and axon in the vertebrate brain, soma and neuropile in the invertebrate brain), and secreting neurotransmitter molecules across synapses. The action potentials they emit, which propagate along axons, control neurotransmission and as such are considered the major 'signal' source in the network. However, supporting glial cells, to which one attributes a dominant importance in metabolic maintenance, also play a role in the slower dynamics of the brain (8). In contrast to neurons, glia do not transmit spikes, although they are heavily involved in the capture, transport, diffusion and clearance of ions and transmitters. Electrically, the graded potential of glia reflects the local extracellular potassium concentration modulated by the spiking behaviour of neighbouring cells.
For the sake of simplicity, the putative role of glia in signal processing and synaptic communication will be ignored in the rest of the chapter, although strong evidence points to its importance in regulating network plasticity (9). Note also that not all neural-based computations are expressed at the spiking level. In some neuron types (e.g. olfactory bulb granule cells) different regions of the dendritic arbour function relatively independently, so that sub-neuronal structures could be also considered as the relevant interacting partners (see below) (10).
Complexity in Neuronal Networks
As alluded to in the Introduction, the prevailing view in understanding network function is to use a Lego-type approach, and reproduce the collective behaviour of neural ensembles by combining elementary bricks. This is usually done by progressing in a bottom-up and hierarchical fashion across various scales of spatial integration, from ion channel, conductance, membrane compartments to the full network range (3). Two hidden assumptions, whose validity will be disputed in this chapter, should be stated clearly: 1) to a certain degree, one is expecting that the 'whole' assembly behaviour can be simulated by 'adding' together the different 'parts' (synapses and neurons); 2) intrinsic excitability properties of neurons, i.e. their electrical reactivity to intracellularly injected electrical currents, are separable from the modulation of neuronal integrative properties by the activity-driven synaptic inputs of the other partner-neurons in the network studied. According to these views, and keeping in mind their limitations, the intrinsic properties of individual neurons studied in isolation can be thought of as cellular 'building blocks' that contribute to how a specific neuron will respond to a given synaptic input. Such a reductionist approach has often been attempted in simplified preparations such as in vitro slices maintained in artificial cerebro-spinal fluid (ACSF) (11,12) or cultures (13). Until recently, such reference networks were studied in the absence of external drive and devoid at rest of spontaneous or ongoing activity, mostly because of the massive deafferentation in the slicing process and the ionic concentrations chosen for the ACSF. If the network was found to be spontaneously active, the influence of the rest of the network was further suppressed by pharmacological blockade of synaptic transmission. 
In the case of paucineuronal nets, such as invertebrate sensorimotor ganglia composed of giant cells with invariant morphology, the network may be so small that total blockade of the afferent connectivity to any given cell can be obtained by injecting a fluorescent compound into the other somata and then photoinactivating all the putative synaptic partners. Early experiments in molluscs revealed that isolated neurons could generate oscillatory bursts of action potentials, and it soon became clear that individual neurons from all species display a large variety of intrinsic
Y. Frégnac, M. Rudolph, A. P. Davison and A. Destexhe
membrane potential patterns such as bursting, plateaux, post-inhibitory rebound, and spike-frequency adaptation (14). Interestingly, although the importance of intrinsic properties for circuit dynamics has been accepted by the entire community of small (mostly invertebrate) circuit researchers for almost twenty-five years, until relatively recently most workers studying large cell assemblies in the vertebrate brain, both experimentalists and theoreticians, have continued to assume that circuit dynamics depend exclusively on synaptic connectivity and synaptic strength. This view has changed in the last five years, mainly for two reasons: 1) patch recording techniques are no longer restricted to the soma and now allow recording simultaneously from several points distributed along the dendritic structure in the same neuron. More and more recordings from central neurons have shown the prevalence of complex voltage- and time-dependent firing properties which must shape, to a certain extent, circuit function; 2) digital computer simulation power has increased tremendously and allows simulation in real-time of the kinetics of the spatial distribution of voltage change and calcium influx across the entire neurite. The neuron is no longer considered as a point-like integrator receiving multiple input lines and emitting a thresholded binary output (3), but as a complex tridimensional spatio-temporal integrator (15). Consequently, the complex morphology of central neurons must be taken into account to fully understand their integrative properties. Classic theoretical work on integration in passive cables (16-19) has shown the strong filtering exerted by passive dendritic structures. In particular, synaptic inputs are differentially attenuated or distorted according to their position on the dendrite. Recent dendritic recordings have revealed that, as proposed 30 years ago (20), dendrites are not simple passive structures, but contain a myriad of ion channel types (21-25). 
Dendrites can have regenerative properties (26-29) and initiate Na⁺ or Ca²⁺ spikes, propagating towards or away from the soma (21,25) (Fig. 2). The presence of dendritic ion channels may also contribute to the renormalisation of synaptic inputs (30-33), correcting for dendritic filtering, or may produce more subtle effects such as establishing coincidence detection (34,35). The emergence of efficient techniques for 3D morphological reconstruction of single neurons, combined with sophisticated numerical tools for simulating cable equations in such structures (36-38), has revolutionised this area, and such tools have now become standard (36,37,39). Computational studies are now able to incorporate measurements from in vitro and in vivo studies (40-42) and infer the function of dendritic trees under such in vivo-like conditions. One of the main findings is that the dendritic tree may function in fundamentally different modes of integration in these states (see below), and consequently that the network might perform qualitatively different operations during in vivo-like states (42).

Figure 2. Snapshots of the distribution of membrane potential in a cat cortical (layer VI) pyramidal neuron during simulated in vivo activity. The colors indicate the membrane potential (from violet, -80 mV, to yellow, -25 mV; see scale). Time runs from left to right, then top to bottom (steps of 0.5 ms, except the first frame, which was taken 4 ms earlier). The simulation shows a dendritic action potential which propagates forward to the soma and evokes a backpropagating action potential in all other dendrites. These forward-propagating action potentials may play an important role in the integrative properties of pyramidal neurons during in vivo-like states (modified from ref. 42, with permission).
Even considering only neocortex, a huge variety of neuronal types can be discriminated based solely on the diversity of morphology and of the intrinsic properties of cortical neurons (43,44). The most commonly established correlation between morphology and the excitability pattern revealed by current injection concerns 'regular-spiking' neurons. Neurons with this property display responses with a prominent spike-frequency adaptation (regular increase in interval between successive spikes). These neurons correspond mostly to a spiny pyramidal cell morphology. Other neuron types include different variants of bursting cells, such as the 'intrinsically-bursting' cells of layer V, the 'fast rhythmic bursting' or 'chattering' cells, and the 'low-threshold spiking' cells. The latter types correspond to either excitatory (pyramidal) cells or various types of interneurons (45). More recent, extensive studies show that structural diversity of cortical neurons is not limited to the stereogeometry of axons and dendrites or to the multiple excitability patterns that a step of depolarising current produces in the recorded cell, but extends also to neurochemical markers (calcium-binding proteins, neuropeptides, neurotransmitters or their synthesis enzymes), synaptic dynamics (connectivity and its plasticity, decay time constants of excitatory and inhibitory post-synaptic potentials, EPSPs and IPSPs), as well as to a specific repertoire of expressed proteins (e.g. ion channels, receptors). The genomic expression identity profile can be revealed in the patch-recorded cell by harvesting the cytoplasmic content at the end of the recording session and by applying off-line multiplex reverse transcription polymerase chain reaction (RT-PCR). Although initial cortical cell classifications concerned mostly excitatory cells (which represent 80% of the whole population), more refined attempts have been made in the case of inhibitory interneurons (the remaining 20%), and in this latter case a consensus in taxonomy has almost been reached (46,47).
Recent attempts correlating firing properties with protein expression corroborate the existence of clear-cut GABAergic interneuron subtypes (48,49). Present data also suggest the existence of molecular determinants underlying oscillatory and synchronous network
activity and lead to the conclusion that different types of interneurons may subserve distinct functions, for example by participating in the generation of oscillatory activity in different frequency bands (50,51). These examples illustrate striking high-order correlations between the morphologies of the axon and dendrites, firing patterns, spike and AHP characteristics, EPSP and IPSP kinetics, synaptic dynamics, coupling through gap junctions to neurons of the same class, and the expression of distinct protein markers. It is very probable, as was demonstrated for interneurons in the spinal cord by Jessell and colleagues (52), that different classes of neocortical interneurons differentiate under the control of different promoters and play specific roles in the building-up of circuits. In general, if one assumes the existence of repeated modules with some degree of structural invariance, it appears absolutely essential to come to terms with neuronal diversity in order to understand the function of so-called canonical circuits (review in ref. 53). There are however several caveats with this descriptive approach of structural diversity. The number of neurons in an anatomical cortical column (54) or functional hypercolumn (55) is roughly on the order of the number of classes based on anatomical, electrophysiological, and genomic criteria, implying that to a certain extent some neurons may be the sole exemplars of their class. Furthermore, the multiplex RT-PCR technique has its own experimental limitations. The first is quantitative and concerns the existence of high probabilities of false negatives. The second is that mRNA measurement in the slice looks like a static 'photograph' made after a massive disturbance of activity imposed by the slicing process itself, which may have already resulted in spurious activity-dependent regulation of gene expression. A third, fundamental issue is that in principle one should not limit oneself to cytoplasmic mRNA harvesting. 
The search should be extended to the proteome and membrane-bound proteins in order to establish the cell-by-cell distribution of receptors and ions. In other words, it is likely that multiplex RT-PCR will not give access to the 'molecular shape' of the neuron (Changeux, personal communication).
2. Modes of Interaction

2.1. Synaptic Transmission

The dominant mode of interaction between neurons in the brain for fast information processing is via synapses. Of these, the most important type is the chemical synapse, in which a presynaptic depolarisation of the cell membrane triggers the release of chemical neurotransmitters, which diffuse across a small gap and bind to receptor proteins embedded in the membrane of the postsynaptic neuron. This leads to the opening of ion channels, which may lead to depolarisation or hyperpolarisation of the postsynaptic cell membrane, or to shunting of other currents entering the cell. Besides chemical synapses, electrical synapses (gap junctions), which provide direct contact (in the Ångström range) between cell membranes and allow current to flow between the cells through a purely resistive link (with almost zero time-constant), play an important role in many brain regions. In neocortex, electrical synapses are involved in synchronisation of neuronal activity of specific classes of inhibitory interneurons (56-58). There is a great diversity of chemical synapse types, with different neurotransmitters and many different types of receptor proteins, each of which may also have subtypes, with different subtypes expressed in different neuron types. In the central nervous system, the principal neurotransmitters involved in fast information transmission are glutamate, associated with excitatory synapses (those producing a depolarisation of the postsynaptic membrane), and γ-aminobutyric acid (GABA), associated with inhibitory synapses (those that generally produce a hyperpolarisation of the postsynaptic membrane). Synaptic receptor proteins may either be ionotropic (the binding site is on an ion channel protein and causes the opening of the channel - also known as ligand-gated ion channels) or metabotropic (the binding triggers an intracellular signal cascade which leads indirectly to ion channel opening).
Ionotropic receptors tend to operate with a more rapid time course. Ionotropic receptors that bind glutamate include α-amino-3-hydroxy-5-methyl-4-isoxazolepropionic acid (AMPA) and kainate receptors, which have a rapid time course of opening/closing (sub-millisecond rise time and decay time constants below 10 ms), and N-methyl-D-aspartic acid (NMDA) receptors, which are both ligand-gated and voltage-gated, and have a slower time course (decay time constants from a few tens to a few hundreds of milliseconds). There is only one type of ionotropic receptor that binds GABA, the GABAA receptor, which has a rapid time course. The exact effect that the opening of a synaptic ion channel has on the postsynaptic membrane potential depends on the value of the membrane potential during the opening and on the reversal potential of the channel, which in turn depends on the permeability of the channel to different ion species and on the concentrations of the ions on either side of the membrane. The reversal potential of glutamatergic receptor channels is around 0 mV, which means that activation of such receptors almost always leads to an inward flow of current and depolarisation of the cell membrane (the exception is near the peak of the action potential when the membrane potential exceeds 0 mV). The reversal potential of GABAA channels is near -70 mV, which is near the resting membrane potential of the cell, and above the reversal potential for potassium ion-gated channels. This means that if the cell is strongly hyperpolarised by a potassium current, then opening of a GABAA channel leads to depolarisation of the cell, an excitatory effect. If the membrane potential is near the GABAA reversal potential, the main effect of the channel is an electrical shunt, increasing the total conductance of the membrane and clamping the membrane potential to the GABAA reversal potential. Thus, starting from this resting condition, when inhibition is applied alone, no change in the resting potential will be seen, hence the denomination of 'silent' inhibition.
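The dependence of the synaptic effect on reversal potential can be illustrated with a minimal sketch. It assumes a single passive compartment at steady state, in which the membrane potential is the conductance-weighted mean of the reversal potentials. The function name and all numerical values (leak conductance, synaptic conductances) are hypothetical and chosen only for illustration; the reversal potentials of 0 mV and -70 mV are those quoted in the text.

```python
def steady_state_potential(g_leak, e_leak, g_exc=0.0, e_exc=0.0,
                           g_inh=0.0, e_inh=-70.0):
    """Steady-state potential (mV) of one passive compartment.

    Each conductance (nS) pulls the membrane towards its reversal
    potential (mV); the result is the conductance-weighted mean.
    """
    g_total = g_leak + g_exc + g_inh
    return (g_leak * e_leak + g_exc * e_exc + g_inh * e_inh) / g_total


G_LEAK = 10.0   # nS, hypothetical leak conductance
E_REST = -70.0  # mV, rest chosen equal to the GABAA reversal potential

rest = steady_state_potential(G_LEAK, E_REST)
# Inhibition alone at rest: the potential does not move ('silent' inhibition).
inh_alone = steady_state_potential(G_LEAK, E_REST, g_inh=20.0)
# Excitation alone, then the same excitation with the inhibitory shunt open.
exc_alone = steady_state_potential(G_LEAK, E_REST, g_exc=5.0)
exc_with_inh = steady_state_potential(G_LEAK, E_REST, g_exc=5.0, g_inh=20.0)
# A cell held at -90 mV is depolarised when the GABAA conductance opens.
from_hyperpolarised = steady_state_potential(G_LEAK, -90.0, g_inh=20.0)

for label, value in [("rest", rest), ("inhibition alone", inh_alone),
                     ("excitation alone", exc_alone),
                     ("excitation + inhibition", exc_with_inh),
                     ("inhibition from -90 mV", from_hyperpolarised)]:
    print(f"{label}: {value:.1f} mV")
```

With these illustrative values, inhibition alone leaves the potential at -70 mV, while the same inhibitory conductance reduces the excitatory depolarisation from about 23 mV to 10 mV. A steady-state calculation ignores the time course of the conductances, so it cannot show the shortening of the response.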
In the case where excitation occurs concomitantly with inhibition, the excitatory evoked depolarisation will be reduced in amplitude and shortened in time-course, hence the term 'divisive or shunting inhibition'. Although disputed for a long time (e.g. ref. 59), there is now clear evidence for the existence and the role of shunting inhibition in shaping cortical functional selectivity in vivo (5,7). Until recently it was thought that all synapses of a neuron release only one type of neurotransmitter (Dale's Law, 60), although there is now evidence to the contrary. Neurons are usually described as 'excitatory' or 'inhibitory' depending on the effects of the neurotransmitter they
principally release. The almost-fixed association which is usually made between the synaptic type and the electrical 'sign' of the synaptic effect (i.e. depolarisation vs hyperpolarisation) is to be interpreted with care. As pointed out above, the functional effect will depend on the resting state of the neuron and on the degree of invariance of the reversal potentials of the ions whose transport through the membrane is responsible for the voltage change. For instance, in developing networks in the neonate, inhibitory GABAA synapses are depolarising, due to a much more positive chloride reversal potential than in adult neurons, and in fact have a positive effect in driving spiking activity (61). Furthermore, in the invertebrate, synaptic gain can occasionally be reversed in sign by appropriate conditioning events (62). Another important feature worth noting is the large proportion of anatomically identified synaptic contacts which are functionally silent in the resting condition, and can become expressed functionally once the neuron becomes strongly engaged by depolarising input and activation of specific receptor subtypes (63,64). This effect is most prominent in the immature vertebrate brain. The interaction types discussed above are those which are most generally considered when discussing fast information processing in neuronal networks. Besides these, however, there are other possible fast interactions, e.g. between neurons mediated by current flow through the extracellular space (ephaptic interactions), that may be of importance in certain structures or circuits, and many slower interactions, mediated via glia or through diffusion of signalling molecules (e.g. nitric oxide, hormones).

2.2. Synaptic Dynamics and Plasticity

The conductance change associated with synaptic transmission is generally characterised by its rise time, decay time constant, and peak amplitude.
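As an illustration of these three descriptive parameters, the sketch below uses a difference of two exponentials, a common modelling choice that is assumed here rather than taken from the chapter. The function names are hypothetical, and the time constants are only indicative of the fast (AMPA-like) and slow (NMDA-like) ranges mentioned in Section 2.1.

```python
import math


def peak_time(tau_rise, tau_decay):
    """Time (ms) at which the difference of exponentials is maximal."""
    if tau_decay <= tau_rise:
        raise ValueError("tau_decay must be larger than tau_rise")
    return (tau_rise * tau_decay / (tau_decay - tau_rise)
            * math.log(tau_decay / tau_rise))


def synaptic_conductance(t, g_peak, tau_rise, tau_decay):
    """Conductance at time t (ms) after a presynaptic spike at t = 0.

    The waveform is scaled so that its maximum equals g_peak.
    """
    if t < 0:
        return 0.0
    t_pk = peak_time(tau_rise, tau_decay)
    norm = math.exp(-t_pk / tau_decay) - math.exp(-t_pk / tau_rise)
    shape = math.exp(-t / tau_decay) - math.exp(-t / tau_rise)
    return g_peak * shape / norm


# Illustrative values only: a fast and a slow conductance, peak 1 nS each.
for name, tau_r, tau_d in [("fast", 0.5, 5.0), ("slow", 5.0, 100.0)]:
    samples = [synaptic_conductance(t, 1.0, tau_r, tau_d) for t in (1, 10, 50)]
    print(name, [round(g, 3) for g in samples])
```

In this static description the three parameters are fixed numbers; the remainder of this section explains why they should instead be treated as dynamic variables.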
However, one should not consider that synaptic transmission is reduced to a simple static multiplicative factor. The efficacy or 'gain' in transmission at a given synapse should in fact be seen as a dynamic variable, which is regulated by its past synaptic activity (homosynaptic
plasticity) and the global state of network dynamics (heterosynaptic plasticity and network-driven homeostasis). At many synapse types, these descriptive parameters have their own dynamics on a time scale of tens of milliseconds, determined by the pattern history of presynaptic spikes. In addition to the genomic, electrical and morphological diversity that we reviewed earlier (see Section 2), the synaptic interactions between different neuronal types also display various short-term plasticity mechanisms, such as facilitation or depression, which result from the modulation to various degrees of conductance amplitude, the size of the available pool of neurotransmitter and its release probability (65-67,45). On longer time scales, from minutes upwards, the efficacy of synapses (probability of transmitter release in response to a presynaptic action potential and size of postsynaptic conductance change) may be changed by activity-dependent and/or by homeostatic processes (reviews in refs. 68,69). The most classical forms of synaptic plasticity found in neocortex are induced by convergence of synchronous activation sources acting at the pre and/or postsynaptic levels (review in ref. 70). Long-term potentiation (LTP) (71-73) and depression (LTD) (74-77) are found both in hippocampus and neocortex. Although it was initially assumed that synchrony between presynaptic activity and postsynaptic depolarisation was the key event in plasticity induction, more diverse and reversible forms of associative plasticity have been reported and result from various patterns of time-locked associations between presynaptic input and the postsynaptic spike (78,69). For instance, spike-timing-dependent plasticity (STDP), initially demonstrated in vitro in hippocampus (79) but immediately forgotten (!), was rediscovered in neocortical slices from young animals and in hippocampal organotypic cultures.
The synaptic change rule differs from the classical correlation-based postulate, since the sign of the change of synaptic strength depends critically on the temporal order between pre- and post-synaptic spikes. More specifically, in cortical and hippocampal excitatory synapses, when the presynaptic event precedes the postsynaptic event by less than 50 ms, synaptic strength is augmented, whereas the reverse temporal order leads to a reduction in synaptic strength (35,80,81). Reverse-sign STDP rules have also been found, in cerebellum-like structures in the electric fish
(82). However, the impact of multiple correlations and spontaneous activity in the pre- and post-synaptic cells, the existence of STDP in vivo and its applicability in the intact brain are still debated, which motivates building computational models. The obvious consequence of the diversity that may exist in associative synaptic plasticity rules is that 1) not all synapses follow a unique rule and 2) the same network can use multiple forms of adaptation for different purposes, through each one of these different plasticity algorithms. For instance, LTP is considered as a major substrate of associative memory formation (83). In spite of the fact that its implication in behavioural learning remains disputed, this rule obeys the general principle that 'neurons that fire together, wire together' and has been considered as responsible for functional epigenesis in sensory neocortex (84-86). Microcircuits that incorporate spike-timing dependent plasticity (STDP) rules account most accurately for the emergence of causal chains within neuronal assemblies and best support phase sequence learning (87) or multimodal coordinate transformation (88). Other circuits which incorporate a mirrored form of STDP may be used for enhancing novel information and filtering out expected changes in sensory environment due to motor exploration. Hypothetically, these different plasticity algorithms could coexist together in the same network and operate in a synergistic fashion, for instance by acting on different synaptic types and cell targets on different time scales. In addition to these dominant forms of plasticity, other processes are designed to stabilise the integrative function of the cell within a reference working range. Five forms of 'homeostasis' are generally recognised (89). The simplest form regulates the efficacy of transmission around a mean synaptic gain or between two boundary values. 
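The spike-timing-dependent rule described earlier in this section, combined with the simplest regulation that keeps efficacy between two boundary values, can be sketched as follows. This is a pair-based rule with an exponential window, which is a common modelling assumption and not a form specified in the chapter. The amplitudes, time constants and the symmetric 50 ms cut-off are hypothetical, as are the function names.

```python
import math


def stdp_weight_change(dt_ms, a_plus=0.01, a_minus=0.012,
                       tau_plus=20.0, tau_minus=20.0, window_ms=50.0):
    """Weight change for one spike pair, with dt_ms = t_post - t_pre.

    Pre-before-post (dt_ms > 0) potentiates, post-before-pre depresses.
    Pairs separated by more than window_ms are ignored (an assumption
    applied symmetrically here).
    """
    if dt_ms == 0 or abs(dt_ms) > window_ms:
        return 0.0
    if dt_ms > 0:
        return a_plus * math.exp(-dt_ms / tau_plus)
    return -a_minus * math.exp(dt_ms / tau_minus)


def apply_stdp(weight, pre_spikes, post_spikes, w_min=0.0, w_max=1.0):
    """Sum the changes over all spike pairs, then clip to [w_min, w_max]."""
    for t_pre in pre_spikes:
        for t_post in post_spikes:
            weight += stdp_weight_change(t_post - t_pre)
    return min(max(weight, w_min), w_max)


# Presynaptic spike 10 ms before the postsynaptic spike: strengthening.
print(apply_stdp(0.5, pre_spikes=[0.0], post_spikes=[10.0]))
# Reverse temporal order: weakening.
print(apply_stdp(0.5, pre_spikes=[10.0], post_spikes=[0.0]))
```

Swapping the signs of the two branches would give a reverse-sign rule of the kind reported in cerebellum-like structures; the hard clipping is only one of several possible ways to keep the weight within bounds.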
When evaluated on a longer time scale, the fast input-dependent regulation of synaptic transmission, described earlier, results in maintaining average synaptic efficacy approximately constant in the face of rapid changes in the probability of transmitter release (90). Furthermore, the probability of inducing potentiation, depression or depotentiation depends on the previous stimulation history of the network ('metaplasticity' in ref. 91) and on the initial efficacy state of the synapse. A second form of homeostasis, predicted in many models of learning, acts more globally,
at the neuronal level (92). It assumes that the sum (or sum of squares) of all the synaptic weights of synapses impinging on the cell remains constant. A third form of homeostasis limits the anatomical divergence of growing axons and corresponds to the conservation of the sum of the synaptic weights of all contacts made in the network by the same parent axon (93). The fourth form of homeostasis concerns the capacity of a cell to maintain a similar output (e.g. its firing level) in spite of strong alterations in the activity pattern of the network (94,95) (reviews in refs. 89,96). This last regulatory process involves activity-dependent tuning of postsynaptic sensitivity. Examples so far come from the regulation of intrinsic conductances. A last form of homeostasis concerns possible changes in the site of impulse initiation and the direction of impulse propagation in the dendrite (97).

2.3. Interdependency Between Intrinsic Excitability and Extrinsic Synaptic Factors

Knowledge of the intrinsic conductance repertoire of the cell under study and of the biophysical conditions of expression (voltage-dependency, activity priming) is generally used to reinterpret the role of intrinsic properties in the functional context of the full network. For example, in many rhythmic central pattern generating circuits, the timing of the neurons within a circuit is often governed by the intrinsic post-inhibitory rebound properties of the neurons in the circuit (98). Similarly, it is clear that neurons with strong plateau properties can turn a rapid synaptic input into more sustained firing, thus changing the timing of discharge between the presynaptic neuron and follower neuron (99,100).
One of the first suggestions that the intrinsic properties of individual cortical neurons might be crucial for understanding network behaviour comes from the early work of Bremer (101,102) who suggested that the rhythmic activity seen in the electroencephalogram (EEG) might result from the interactions among neurons that display an intrinsic propensity to oscillate, as is now thought to be the case (103,104). Bremer further suggested that the cortex should not be viewed as a system passively driven by its inputs, but rather as a system having spontaneous, intrinsic activity which is modulated by sensory inputs (101), a theme which has
been explored in detail (105,106). A striking illustration of this last point in mammalian neocortex is the discovery of the implication of intrinsic high frequency oscillatory properties of cortical neurons in the functional binding of their activity during the transient time course of a percept (107). Such intrinsic oscillatory potential had in fact been observed in vitro much earlier (108), but totally ignored until in vivo studies suggested they could be the hallmark of a specific class of cells, chattering type, involved in the genesis of gamma band activity during sensory processing (109). At a more microscopic level, active properties intrinsic to the neuronal membrane, such as dendritic action potentials, may play a central role in synaptic plasticity because they provide the strong depolarisation necessary to establish coincidence of pre- and postsynaptic activity required for inducing synaptic changes (35,72). Interestingly, this coincidence can be established by local dendritic spikes, without participation of the soma, which opens the possibility for local dendritic computations - or associations - to occur without participation of the cell body (68,110). These problems are heavily investigated today. In addition, the efficacy of a synaptic input depends on the conductance state of the neuron and its dendrites, and therefore will depend on the level of network activity. The responsiveness to a given input has been shown, both experimentally and theoretically, to strongly depend on background inputs (reviewed in ref. 42). This 'contextual' dependence of integrative properties is of primary importance to understand cellular operations. The impact of such dependence at the network level still remains a subject largely uncharacterised, both theoretically and experimentally. Evidence that network-driven activity may even control the intrinsic repertoire of conductances of any given neuron comes from the field of invertebrates (111-113). 
This process can be thought of as an extreme case of contextual control resulting from the non-synaptic diffusion of neuromodulators or hormones in the network, and its implication in the reorganisation of cortical network dynamics has not yet been addressed. Pioneering work in the invertebrate motor ganglia, in particular in the stomatogastric ganglion of the lobster, showed that the repertoire of intrinsic conductances could in fact change dramatically in the presence
of neuromodulatory signals secreted by specific cells in the afferent sensory network (114). The activity of such cells was found to be a determinant in triggering the expression of active conductances in cells which are part of a central motor program generator assembly. The turning on and off of neuromodulation was found to reorganise the dynamics of the full network in distinct functional assemblies associated with different motor behaviours (115,112) (see Fig. 3).
Figure 3. Reconfiguration of network dynamics under the influence of orchestra-leader cells (called PS here). The same network is functionally reorganized into independent assemblies depending on the activity state of neuromodulating cells (PS) (upper row). These assemblies are characterized by specific excitability patterns expressed by each of the composing cells and specific phase relationships between them (middle row). The functional role of each assembly during the swallowing stomatogastric cycle is depicted in each cartoon below. (a) When PS is silent, the oesophageal, gastric, and pyloric networks (top) generate independent rhythmic output patterns (middle) involved in regionally specific and separate behavioural tasks (bottom). (b) When PS is rhythmically active, it drives opening of the oesophageal valve (bottom), and by breaking down preexisting networks and using certain neurons, it constructs a single novel network (top) that generates a coordinated motor pattern (middle) appropriate for swallowing behaviour. (c) When PS is again silent, the oesophageal valve closes (bottom) and motor units immediately resume their original network activity while units (i.e., gastric and pyloric) controlling regions more caudal to the sphincter continue to generate a single pattern before resuming their separate activities (adapted from ref. 112).
These data support the hypothesis of the existence of 'orchestra leader'-like neurons, whose activity conditions and formats merging and segmentation across functional assemblies. Their identification, or even evidence of their existence, in larger networks such as mammalian neocortex is still lacking, but there is already ample evidence that modulatory signals linked with various amines and acetylcholine change the repertoire of expressed conductances. Like central pattern generators, thalamic circuits are subject to neuromodulatory influences (116). In this case, neuromodulators such as acetylcholine, norepinephrine or serotonin affect intrinsic currents and switch the circuit from an oscillatory mode to a 'relay mode' in which oscillations are abolished (117). These neuromodulators are present in activated states, promoting the relay of sensory information by the thalamus, while their diminished levels during slow-wave sleep promote the participation of the thalamus in the genesis of large-scale synchronised oscillations in the entire thalamocortical system. Some aspects of these mechanisms, in particular the type of processing of sensory information that thalamic neurons perform, are still unclear and subject to current investigation (118). We conclude from this brief review that, to a certain extent, both in invertebrate ganglia and in vertebrate brain, the attempt to separate intrinsic and extrinsic factors in the control of cellular excitability is doomed to fail. Thus, the 'whole' cannot be the sum of the 'parts'.

3. Computational Modelling

We give here only a brief overview of computational modelling of neuronal networks. More complete treatments may be found in, for example, refs. 38, 119 and 120.

3.1. Modelling Cells and Synaptic Interactions

We will consider here only that class of neuron models in which the action potential is explicitly represented (spiking models), and neglect traditional neural network-type models and mean-field models which represent only the time-averaged mean activity of neurons or populations. The class of spiking models may be further subdivided into
Complexity in Neuronal Networks
biophysically-realistic models and simplified models (integrate-and-fire, spike response models). The concept of the integrate-and-fire model dates back to Lapicque (121). In its standard form, the model represents a purely passive membrane with no non-linear properties until the membrane potential reaches a fixed spike threshold, at which point the potential is reset to some sub-threshold value. To match the model behaviour more closely to that of real neurons, sub-threshold non-linearities (e.g. quadratic (122) or exponential (123)), a second state variable (124,125), and approximations of the effects of background synaptic activity (126) have variously been added. Integrate-and-fire-type neurons are usually point models, although multi-compartment models have also been used (e.g. ref. 127). Biophysically realistic models build on the work of Hodgkin and Huxley, who developed a mathematical description of the kinetics of sodium and potassium channels (128) which is still the basis of almost all ion channel models used today, and of Rall, who introduced compartmental modelling of spatially-extended structures such as dendrites (15). The level of detail in modelling synaptic interactions covers a large range, from a step change in the postsynaptic potential to modelling of biochemical pathways, quantised neurotransmitter release and uptake, and post-synaptic binding and channel kinetics. The most commonly used representation is that a presynaptic action potential causes a postsynaptic conductance change with a fixed time course.

3.2. Modelling Networks

Traditionally, neural networks are viewed as being composed of simple neuronal units interconnected in large and specifically structured networks, hence forming complicated systems. Mathematical tools borrowed from statistics and physics provide in many cases an effective, possibly stochastic or probabilistic, description of the dynamics of these complicated systems.
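As a brief illustration of the simplified models of Section 3.1, the following Python sketch simulates a standard leaky integrate-and-fire point neuron in which each presynaptic spike causes a conductance change with a fixed, exponentially decaying time course. It is only a minimal sketch under stated assumptions: the function name, the forward-Euler integration scheme, the expression of the synaptic conductance relative to the leak conductance, and all parameter values are illustrative choices, not taken from the models cited above.

```python
from collections import Counter


def simulate_lif(input_times_ms, t_stop_ms=200.0, dt_ms=0.1,
                 tau_m_ms=20.0, v_rest_mv=-65.0, v_thresh_mv=-50.0,
                 v_reset_mv=-65.0, e_syn_mv=0.0, tau_syn_ms=5.0,
                 g_peak=0.05):
    """Return output spike times (ms) of a leaky integrate-and-fire neuron.

    The membrane is purely passive below threshold. The synaptic
    conductance g is dimensionless (expressed relative to the leak
    conductance; an assumption made to keep the sketch unit-free).
    """
    n_steps = int(round(t_stop_ms / dt_ms))
    # Number of presynaptic spikes arriving in each time step.
    arrivals = Counter(int(round(t / dt_ms)) for t in input_times_ms)

    v = v_rest_mv
    g = 0.0
    output_spikes = []
    for step in range(n_steps):
        # Each presynaptic spike increments the conductance by g_peak.
        g += g_peak * arrivals.get(step, 0)
        # Passive leak plus synaptic current, integrated by forward Euler.
        dv = (-(v - v_rest_mv) - g * (v - e_syn_mv)) / tau_m_ms
        v += dt_ms * dv
        # Fixed time course: exponential decay of the conductance.
        g -= dt_ms * g / tau_syn_ms
        # Fixed threshold, then reset to a sub-threshold value.
        if v >= v_thresh_mv:
            output_spikes.append(step * dt_ms)
            v = v_reset_mv
    return output_spikes
```

For example, `simulate_lif([10.0])` delivers a single weak input that stays sub-threshold, whereas `simulate_lif([float(t) for t in range(1, 101)], g_peak=0.3)` delivers a sustained input train that drives the model across threshold repeatedly.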
Classical examples date back to Anderson, Cooper and Hopfield (review in ref. 129), who showed that such networks can be described by a formalism analogous to spin-glasses (Ising models), or the mean-field description of networks of interconnected oscillators and
Y. Frégnac, M. Rudolph, A. P. Davison and A. Destexhe
linear neurons. However, an unfortunate common aspect of this approach is also that when departing from idealised model systems by endowing single neuronal units with more realistic, non-linear dynamic behaviours, the availability of mathematical tools which allow an exact or even a statistical description ceases quickly. Looked at carefully, it is exactly this now experimentally evidenced non-linearity in the dynamical behaviour of neuronal units which might, or does, in fact constitute the crucial ingredient for the immense computational power of the biological neuronal circuitry. Starting from the simplest description of neurons as threshold or integrate-and-fire units, in the past decades much experimental and theoretical work was done in filling up these neurons with a plethora of biophysical and functional properties giving rise to the wealth of observed dynamic behaviours. The level of detail is already sufficient to provide a theoretical picture of single neuron dynamics which comes very close to its biological original, but with the caveat that at the same time the border towards complex systems is also passed. The latter renders many exact and statistical mathematical tools useless, and linearisations or approximate numerical descriptions remain the only way to access a vanishingly small regime of possible dynamic behaviours. Similar to the statistical description of networks composed of many, but simply-behaving, components, systems with only few, but highly non-linear, components can be accessed with mathematical tools borrowed from chaos theory (e.g. refs. 130,131). Indeed, such systems can be viewed as low-dimensional deterministic chaotic systems. Corresponding measures and notations were successfully applied to characterise the behaviour observed in isolated neuronal systems such as bursting (132) or spontaneously discharging neurons in the mammalian cortex (for a recent review see ref. 133).
However, in order to approach a real understanding of the brain or macroscopic functional sub-structures of it as complex systems, it is not sufficient to restrict oneself to small isolated units. Instead, the dynamic aspect of neural behaviour has to be extended to whole networks comprised of such units. A first attempt in this direction was the characterisation of the large-scale dynamic behaviour of the brain in the context of chaotic systems using experimental data from available electroencephalographic
(EEG) studies and new distributed imaging techniques such as functional magnetic resonance imaging (fMRI). Indeed, studies dating back two decades indicate the presence of low-dimensional chaos in human EEG data (134). Although still subject to dispute, this suggestion found much support afterwards (e.g. refs. 135,136).

In the absence of useful analytical tools for studying large, biophysically-detailed neuronal networks, numerical simulation is the most widely used tool. Large scale models of specific cortical or subcortical regions (137-140) have given insights into the functioning of these regions. A more general approach aims to understand the behaviour of generic cortical circuits, comparing numerical results with available analytical predictions where possible (e.g. ref. 141).

4. Complexity in Structural and Functional Network Topology

4.1. Diversity of Structural Network Topology

As noted above, there are at least dozens, and possibly hundreds, of distinct cell types in neocortex. The connectivity between cell types, however, is far from random. A broadly similar distribution of neuron types and similar local connectivity is found throughout neocortex, although there are regional variations and specialisations. Neocortex is divided into six layers on the basis of histological studies; some layers are further sub-divided. Most cell types, defined narrowly, have cell bodies that are only found in a single layer, and have dendrites and axons that have a layer-specific pattern of ramification. The layered structure of cerebral cortex has also motivated the introduction of microcircuits (review in ref. 142). This is based on the high specificity of intracortical inter-layer connections (143), and the fact that thalamic inputs end preferentially in layers I, IV and VI and systematically avoid other layers (144,145).
Long-range intracortical connections are made quasi-exclusively by layer II-III cells (146), and cortico-thalamic input originates exclusively from layer VI and the lower part of layer V. This highly specific arrangement has motivated the conceptual introduction of cortical microcircuits (147-149), although there is no clear 'cortical
module' but rather a continuum with clear connectivity templates (150,151) (review in ref. 53). In parallel with this modular decomposition based primarily on anatomical descriptions of the circuit, similar arrangements have been validated on a functional basis. One way to identify primary functional modules is to follow, step by step, the vertically dominant integration flow of activity evoked by thalamic input (see Fig. 4, third top left drawing). EPSP/IPSP sequences show stereotyped behaviour as a function of layer in cortex (149) (but see ref. 152). At a more integrated level, numerous studies of sensory evoked responses using single unit electrophysiological recordings, since the pioneering work of Mountcastle (153,154) in somatosensory cortex and Hubel and Wiesel (155) in visual cortical areas, have emphasised the invariance of receptive fields along vertical electrode penetrations (orthogonal to the layer planes). In visual cortex, this 'columnar' arrangement holds not only in terms of spatial location of the receptive fields in the visual field, defined primarily by the thalamic afferent input, but also in terms of orientation preference, a property profoundly influenced by intracortical recurrency. These pioneering studies led to the specific proposal of functional columns of different scales. The macrocolumn is defined as a complex processing and distributing unit that links a number of inputs to a number of outputs via overlapping internal processing chains (minicolumns) (156). One should note, however, that the definition of the functional column initially applied only to the input/output circuit formed by serial excitatory links from layer IV (input layer) to layer VI (one of the output layers). The laminar relay description (IV→II-III→V→VI) within the column is based on the assumption that axons are connected to neurons whose somata are located in the layer to which the axon projects (157). 
Since these initial studies, other definitions of columnar entities have been given, all of which coexist within the same network, giving a crystalline organisational architecture whose elementary motif depends on the computation under study (review in ref. 53). These different views are summarised in Fig. 4.
Figure 4. Different levels of integration are schematised from left to right and top to bottom: (a) The cortical pyramidal cell and its membrane compartments represent an elementary site of synaptic convergence; (b) Bundles of axons of pyramidal cells form radial microcolumns; (c) One of the best studied input/output circuits characterises the serial processing of layer IV afferents by first-order targets, the stellate cells in layer IV, which, after a series of successive relays in layer II-III and layer V, terminate on layer VI neurons which send their axons out of the functional column (158); (d) The canonical microcircuit exemplifies the high level of recurrency of excitatory local connections whereas the inhibitory interneurons control the gating of the avalanche of excitatory amplification (149); (e) The concept of meta-column, introduced by Somers et al. (159), corresponds to the network influence carried via long-distance horizontal connections in the supragranular layers (see inset), that needs to be added to the column to predict its context-dependent behaviour; (f) This last schema summarises the hypothesis of selection of computational circuits (red volume) by the neuromodulatory action of ACh fibres running in layer I; (g) Inverted contrast picture of two biocytin-labelled layer II/III pyramidal cells connected by horizontal axons (Frégnac and Friedlander, unpublished). Taken from ref. 53, with permission.
If one extends the search for modules to the scale of the whole cortex, it is clear that different areas are specialised for different computations, each with a different role in sensory perception, generation of motor commands, memory, or other areas of cognition. The basis of this functional diversification may in part reflect differences in individual neuron properties or in the distributions of neuron types, but is presumably mainly due to differences in connectivity, although very little is known about any such connectivity-computation correlations. Although there is strong evidence that the morphological lay-out of these dedicated networks operates under strong genetic constraints, there is also ample evidence for a shaping of cortical circuit anatomy by activity. The most remarkable illustrations can be found in the developing thalamocortical pathway, during a 'critical' period in which sensory experience shapes cortical organisation to a certain extent (160). In an imaginative experiment performed in the developing ferret, the group of Mriganka Sur was able to turn an auditory cortical area (A1) into a new visual area by 1) depriving, early in development, the auditory thalamus (which feeds the primary auditory cortex) of its normal input, and 2) substituting rerouted visual afferents for this input (161,162). Without going into the details of such a complex and artificial operation, this rewiring of afferent pathways resulted in two major restructurings. The first, expressed at the functional level, was the genesis of visual receptive fields in an 'auditory' cortex, showing all the attributes of a normal visual cortical area: a large proportion of the rewired A1 cells were found to be orientation selective, and the orientation network visualised with optical imaging techniques showed the progressive shift in orientation preference and the existence of pinwheel singularities that are characteristic of the normal topology of the V1 visual network.
The second, seen at the structural level, was a rewiring of intracortical long-distance 'horizontal' connectivity according to a pattern normally seen only in the V1 area. Thus, the imposition of a drastic change in sensory activation and patterns of afferent activation (in sensory thalamus at an early stage of development) induced a structural reorganisation in the rewired A1 area, resulting in a binding-architecture anatomy indistinguishable from that found in a normal V1 visual cortical area. We conclude from this example that sensory cortical circuits are capable of
profound activity-dependent reorganisation, in such a way as to realise a computational 'fit' with the statistical nature of the information to be processed.

4.2. Complexity of Structural Network Topology

Leaving specificities and differences between cortical areas, we now consider the commonalities between brain regions and the regularity in cortical structure, to ask what we can learn from a more theoretical view of cortical connectivity and topology, and to what degree cortical networks resemble other complex networks found in physics, biology and sociology. In general, it is true to say that although an increasing amount is known about the connectivity of local circuits in cortex, at least in a statistical sense (the probabilities of connection between pairs of nearby neurons of given types), and the general strengths of connectivity between different cortical areas (163), much less is known about medium-to-long range connections in cortex, although much experimental work has been done to reveal anatomical connectivity patterns in a variety of brain networks (164). The term anatomical connectivity refers here to the set of physical or structural, i.e. synaptic, connections which link all neuronal units comprising the network (142). Among the most notable early studies in this direction is the quantitative neuroanatomical work of the group around Braitenberg (165), which analysed cortical tissues of mice with respect to principal connectivity patterns and connection probabilities. Based on the idea of a rather homogeneous distribution of excitatory neurons, these investigations revealed a high short-range intra-cortical connection probability, which decays exponentially with a decay length of a few hundred micrometres, as well as a more sparse, patchy, long-range cortico-cortical connectivity.
Interestingly, although each given neuron connects to tens of thousands of other neurons in its near vicinity, the probability that more than one contact is made with another neuron at the same time was found to be vanishingly small. These results suggest that the cortical network is comprised of interconnected, rather local, processing units or
cortical microcircuits, containing a few hundred thousand densely connected neurons. More recently, extensive anatomical studies revealed comparative aspects of neocortical circuitry in different species, ranging from mice through giraffes (166) to humans, with respect to the number (density) of neurons and their synaptic terminals as well as to patterns of neuronal interconnectivity. The reported results highlight a vast variability in density within the neuronal populations that constitute cortical microcircuits across cortical areas and in different species. Viewed in the light of results obtained in earlier studies, these findings suggest that this variability in the topological structure goes hand in hand with functional differences in the specific cortical circuits, and is the result of evolutionary adaptation of circuits in different species to particular functions (review in ref. 53).

4.3. Structural Network Topology and the 'Small-World' Analogy

Given the limitations of the classical approaches noted above - studying either large, regular networks of simple units or small networks with complex units - in understanding the behaviour and the emergence of behaviour in large, complex networks, we now ask whether the anatomical studies briefly reviewed above provide support for a view of the cortex as a small-world, scale-free system such as has been linked in other fields with self-organisation and emergent properties of complex systems (see Chapters 1 and 2). Szentágothai's module concept shows at least that the cortex consists of integrative units densely packed with neurons and linked with each other by long-range connections. Together with Braitenberg's statistical investigations of the long-range anatomical connectivity, this suggests indeed the existence of a cortical network topology with properties similar to those found in small-world networks, namely high clustering due to modules and small connectivity length due to long-range connections.
Concrete support in this direction, although not explicitly linked to the small-world phenomenon, follows from numerous neuroanatomical studies of large-scale cortico-cortical pathways in different mammalian species. Most notable here are the investigations of Angelucci, Kennedy, Lund, Rakic, White, and many
others in the cerebral cortex, in particular the macaque and cat visual cortex, as well as experimental studies of the group around Scannell in the cat cortico-cortical and cortico-thalamic system (167,168), and in the rat hippocampus by Burns and Young (169). All these anatomical investigations provided maps of large-scale pathways in the investigated brain areas and graphically confirmed earlier findings which show a highly hierarchically organised structure with 'streams' and 'systems' with connections which mostly link to nearest neighbours (170). Moreover, these densely intra-connected systems reflect functionally specialised sets of cortical areas, suggesting once more that function and structure are closely linked at the system level (171). However, these early studies did not go far beyond drawing of large-scale connectivity maps of neuronal populations. Only recently were these maps of network pathways re-examined with specific graph-theoretical tools (172) in order to find a decisive answer to the question whether the brain network, or at least parts of it, shares properties of scale-free, small-world networks otherwise so commonly found in nature. For this purpose, anatomical maps are represented as directed graphs (170,173) in which vertices, or nodes, describe neuronal units (neurons, populations of neurons, brain areas) and (directed) edges describe connections (synaptic or 'streams') between these units. With this, the average path length and cluster index can be calculated and compared with those typical for random, lattice or small-world networks. As mentioned above, two criteria must be fulfilled for the latter (174). First, the average path length should be small and comparable with that of random networks. Second, the cluster index should be much larger than that of random networks, and closer to that of highly ordered networks, such as lattice networks.
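The two criteria can be illustrated with a short Python sketch that computes the average path length and the cluster index for a ring lattice, a partially rewired ('small-world') network and a fully rewired (random-like) network. This is a minimal sketch under several assumptions: the graphs are undirected for simplicity (the anatomical maps discussed here are directed), the rewiring procedure follows the general Watts-Strogatz idea, and all function names and parameter values are hypothetical rather than taken from the cited studies.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Undirected ring in which each node links to its k nearest
    neighbours on each side."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(1, k + 1):
            adj[i].add((i + j) % n)
            adj[(i + j) % n].add(i)
    return adj


def rewire(adj, p, rng):
    """Redirect each edge, with probability p, to a randomly chosen
    node that is not yet a neighbour (Watts-Strogatz-style)."""
    n = len(adj)
    new = {i: set(nb) for i, nb in adj.items()}
    edges = sorted((i, j) for i in adj for j in adj[i] if i < j)
    for i, j in edges:
        if rng.random() < p:
            candidates = [m for m in range(n)
                          if m != i and m not in new[i]]
            if candidates:
                m = rng.choice(candidates)
                new[i].discard(j)
                new[j].discard(i)
                new[i].add(m)
                new[m].add(i)
    return new


def cluster_index(adj):
    """Mean, over all nodes, of the fraction of neighbour pairs that
    are themselves connected (nodes with < 2 neighbours count as 0)."""
    total = 0.0
    for nb in adj.values():
        k = len(nb)
        if k < 2:
            continue
        links = sum(1 for a in nb for b in nb if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs,
    found by breadth-first search from every node."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def compare_networks(n=200, k=3, p_small_world=0.1, seed=1):
    """Return (path length, cluster index) for the three network types."""
    rng = random.Random(seed)
    lattice = ring_lattice(n, k)
    networks = {
        "lattice": lattice,
        "small-world": rewire(lattice, p_small_world, rng),
        "random": rewire(lattice, 1.0, rng),
    }
    return {name: (average_path_length(g), cluster_index(g))
            for name, g in networks.items()}
```

Calling `compare_networks()` returns one (average path length, cluster index) pair per network type. In this toy setting, the partially rewired network is expected to combine a path length much shorter than that of the lattice with a cluster index well above that of the fully rewired network, which is the signature referred to in the two criteria above.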
Recently, Sporns and Zwi investigated the large-scale cortical connectivity maps obtained from previous neuroanatomical studies in the macaque and cat cortex (175). They found indeed that in all cases the cortical connectivity patterns had properties of small-world networks. Pairs of neuronal units are linked together by short paths, as in random networks, despite the spatial extent and rather sparse connectivity of the network. Moreover, neighbouring neuronal units shared many more interconnections than typical for random networks, resulting in a correspondingly high cluster index (Fig. 5). However, and this must be
Figure 5. Topological connectivity pattern in biological neural systems. Top: Large-scale connectivity matrices of the macaque visual cortex, the macaque cortex and the cat cortex. In the latter case, the density-based connectivity is shown. Bottom: Average path length (left) and cluster index (right) deduced from the large-scale connectivity matrices of cortico-cortical pathways. Results from the macaque visual cortex, the macaque cortex and the cat cortex are shown, along with the average path length and cluster index for corresponding random and lattice networks. The arrows indicate that the small average path length observed in the biological systems is closer to that seen in random networks, whereas the cluster index in biological systems is closer to that of corresponding lattice networks. Figs. and data modified from ref. 175.
viewed as a rather surprising result, in this study no clear evidence was found for scale-free degree distributions in the large-scale connection maps. In fact, the investigated networks exhibit a rather homogeneous degree distribution. This, indeed, could be a major setback for the argument outlined above, which could otherwise support the view that brain networks form complex systems. Even more crucially, does this mean that the brain, or its substructures, obeys rules which deviate from those seen in such abundance in other natural and social systems? In fact, experiments done in sensory systems of various species long before the investigations of topological connectivity patterns in mammalian
cortices indicate otherwise. Specifically, studies of functional, rather than structural, aspects of sensory coding do provide strong evidence for a power-law, scale-free, behaviour.

4.4. Functional Network Topology and the 'Scale-Free' Analogy

Investigations showing power-law behaviour in neuronal systems date back to the early second half of the last century. For example, Landgren (176) investigated the response of cat carotid sinus baroreceptor units during constant intrasinusal pressure stimuli and found a power-law scaling in the impulse frequency as a function of time. Similar findings were reported in the transient response (impulse frequency) as a function of time as well as the impulse frequency modulation (sensitivity) as a function of the forcing frequency in cockroach mechanoreceptors (177), the decay of the response impulse frequency of a slowly adapting stretch receptor following step stretches and the peak impulse frequency as a function of the velocity of stretching in the stretch receptor of crayfish (178), or the gain as a function of the light modulation frequency in the lateral eye of Limulus (179). More recently, an investigation of Teich and colleagues (180) showed that the discharge statistics of cat retinal ganglion cells and lateral-geniculate nucleus cells show long-duration power-law correlations. In most of these cases, the power-law behaviour was linked to how sensory information is coded, i.e. to a specific functional aspect of sensory processing. These findings suggest the possibility that specific scaling laws might also hold for functional networks in higher brain areas, such as the cortex, where the investigation of topological connectivity patterns did not reveal such a structural signature.
With the availability of recording techniques which cover spatially separated brain regions, such as distributed multielectrode EEG recordings or fMRI, as well as new mathematical analysis methods, which allow the effective activity in such spatially extended systems to be screened and characterised (181), the experimental and theoretical basis now exists to address this possibility. Among the first clear indications that, indeed, the functional dynamics in macroscopic brain
networks might be governed by power-laws was the work of Jung and collaborators (182). In their study of spiral chemical waves mediated by the network noise in cultured networks of glial cells, they found that these waves could cover a few centimetres and persist for many seconds. In turn, a power-law distribution of wave sizes was reported along with the argument that the processes which create these waves do not exhibit a preference for a specific spatial or temporal scale, i.e. are scale-free. Recently, similar results were reported in experimental investigations of the propagation of spontaneous activity in mature organotypic cultures and acute slices of the rat cortex (183). By continuously recording spontaneous local field potentials using a multielectrode array, Beggs and Plenz showed that these waves of activity are similar to avalanches, a phenomenon well-known to exhibit power-law amplitude distributions. Indeed, the propagation of these 'neuronal avalanches' in terms of size and lifetime obeyed a power law (186). Moreover, their spatiotemporal patterns were stable and significantly repeatable for hours of recordings (183). The authors suggested that these avalanches may constitute a novel mode of activity in cortical networks which differs profoundly from those known for a long time, such as oscillatory or synchronised network states. The presence of scale-free or avalanche dynamics is, however, unclear for association cortex. In a recent analysis (187), we found no evidence for avalanche or scale-free dynamics in neuronal activity from parietal cortex of awake cats, but rather that the dynamics shows exponential scaling, as if neuronal discharges were described by Poisson stochastic processes. This corroborates the work of Softky and Koch showing that the statistics of neuronal discharges in another type of association cortex (area MT) are similar to those of Poisson processes.
It still remains possible that the cortex switches from Poisson-like to scale-free dynamics as a function of its state of attention or arousal, but such a hypothesis has not been tested yet. Thus, there is presently no direct in vivo evidence for scale-free dynamics in association cortex; the only evidence so far was obtained in primary sensory cortex or thalamus, suggesting that this type of dynamics may be mostly relevant to sensory pathways.
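The distinction drawn above between power-law (scale-free) and exponential (Poisson-like) scaling can be illustrated with a small Python sketch that fits both distributions to a set of event sizes by maximum likelihood and compares the resulting log-likelihoods. This is only a crude illustration and not the analysis used in the cited studies: the function names, the choice of continuous distributions above a lower cut-off `x_min`, and the simple likelihood comparison are all assumptions made for this example.

```python
import math


def fit_power_law(xs, x_min):
    """Maximum-likelihood fit of p(x) ~ x**(-alpha) for x >= x_min.

    Returns (alpha, log-likelihood). Assumes at least one sample
    lies strictly above x_min.
    """
    tail = [x for x in xs if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    alpha = 1.0 + len(tail) / log_sum
    loglik = (len(tail) * math.log((alpha - 1.0) / x_min)
              - alpha * log_sum)
    return alpha, loglik


def fit_exponential(xs, x_min):
    """Maximum-likelihood fit of p(x) ~ exp(-rate * (x - x_min))
    for x >= x_min. Returns (rate, log-likelihood)."""
    tail = [x for x in xs if x >= x_min]
    excess = sum(x - x_min for x in tail)
    rate = len(tail) / excess
    loglik = len(tail) * math.log(rate) - rate * excess
    return rate, loglik


def prefers_power_law(xs, x_min):
    """True if the power-law fit has the higher log-likelihood."""
    _, ll_power = fit_power_law(xs, x_min)
    _, ll_exp = fit_exponential(xs, x_min)
    return ll_power > ll_exp
```

As a usage example, synthetic power-law samples can be drawn by inverse-transform sampling, e.g. `x_min * (1 - u) ** (-1 / (alpha - 1))` with `u` uniform on [0, 1), and exponential samples with `x_min + random.expovariate(rate)`; `prefers_power_law` should then favour the distribution that generated the data. A rigorous analysis would additionally require a goodness-of-fit test and a principled choice of `x_min`.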
Figure 6. Functional connectivity pattern in biological neural systems. Left: Matrix of functional connectivity in the macaque cortex deduced from the spread of activity between cortical areas as determined by strychnine neuronography (fig. taken from ref. 184). Right top: Average path length and cluster index deduced from the functional connectivity matrix in comparison with the values for a corresponding random network (data from ref. 184). Whereas the average path length of the biological functional network was small as in a corresponding random network, the cluster index was markedly larger, suggesting a functional connectivity with small-world structure. Right bottom: Functional network (top) and degree-distribution (bottom) extracted from functional magnetic resonance imaging data on the human cortex during a behavioural task. The degree-distribution shows a scale-free behaviour. Fig. taken from preprint by Chialvo et al. (185).
At the level of functional interactions, direct evidence that the cortical network of functional interactions in vivo is not homogeneous but rather is segregated into mutually interacting functional assemblies comes from the work of Stephan and colleagues in the macaque cerebral cortex (184). Functional connectivity refers here to deviations from
statistical independence in the activity between spatially separated neuronal units, and measures their temporal correlation or spectral coherence (188). Stephan et al. systematically collected and investigated data from many earlier studies on the spread of activity after strychnine-induced disinhibition and indeed found, using graph-theoretical tools, that the functional cortical network is closer to a small-world than to a random or lattice network (Fig. 6). These findings are in good agreement with results obtained in investigations of the anatomical connectivity patterns in the mammalian cerebral cortex (see above), and provide once more strong arguments for the close link between functional and structural connectivity. However, in the cited study, again no evidence was found for a scale-free degree distribution, but rather for a single-scaled one. Finally, using fMRI data, the network of functionally correlated, spatially separated brain sites in humans was investigated (185). It was found that both the distribution of functional connections and the probability of finding a link as a function of distance between vertices closely followed a scale-free behaviour (Fig. 6). In conjunction with the finding that the average path length between vertices in this fMRI network was very small and accompanied by a clustering coefficient which was much larger than that of equivalent random networks, these findings suggest that, indeed, the large-scale functional connectivity in the brain follows that typically seen in scale-free small-world networks.

5. Complexity in Network Dynamics

5.1. A Possible Role of 'Noise' in the Functional Dynamics of Cortical Networks

One of the most striking differences between cerebral cortex and central pattern generating networks is that cortical neurons in vivo show a considerable degree of apparent randomness in their activity. The membrane potential of cortical neurons is subject to highly fluctuating activity (Fig.
7a), mostly of synaptic origin (189), consistent with the extremely dense connectivity in cortex (150,190,142). An essential
characteristic of this 'synaptic noise' is that it sets the membrane in a 'high-conductance state', which may impact considerably on the integrative properties of cortical neurons (42). Computational models have predicted that in high-conductance states cortical neurons follow
Figure 7. Synaptic noise as a building block of neocortical circuit computations. (a) Intracellular recordings in cat parietal cortex in vivo showing the sustained membrane potential fluctuations during activated states. (b) Enhancement of responsiveness by synaptic noise. Computational models of pyramidal neurons simulated synaptic noise from the random release of thousands of glutamatergic and GABAergic synapses distributed in soma and dendrites. Left: the response to additional excitatory inputs leads to probabilistic responses (red line; 40 trials shown). Right panel: response curve of the neuron without noise (red), with noise (green) and with the equivalent leak conductance (blue). Synaptic noise changed the gain of neurons and enhanced the responsiveness to low-amplitude inputs (*). (c) Equalisation of synaptic efficacies by synaptic noise. Synaptic inputs were simulated at different distances from the soma (left scheme). The response probability is shown as a function of input strength (Amplitude) and localisation with respect to soma (Path distance). Input efficacy was weakly dependent on the location of the input in dendrites. Modified from ref. 42.
several 'computational principles' (42). First, their responsiveness is strongly modulated by synaptic noise, and in some cases it may boost the response to synaptic inputs (191) (Fig. 7b), similar to stochastic resonance phenomena (192). Some of these predictions were confirmed experimentally using dynamic-clamp (193-195). Second, complex interactions between synaptic noise and dendritic ion channels may considerably reduce the dependence of the efficacy of synaptic inputs on their location in dendrites (196), resulting in a more 'democratic' dendritic tree in which each synapse would have an approximately equal vote in firing an action potential in the axon (Fig. 7c). This scheme is, however, only valid for isolated inputs, and the integration of multiple inputs may reveal the existence of 'dendritic subunits', as suggested by experiments (197) and models (198). Third, high-conductance states sharpen temporal resolution, allowing cortical neurons to detect millisecond coincidences and therefore resolve precisely timed inputs (42,199). However, how cortical neurons integrate thousands of inputs distributed in their dendrites still remains an open problem. Finally, an obvious consequence of synaptic noise is that cortical neurons display a high trial-to-trial variability in their responses (200), a feature often seen in vivo. Consequently the only measures that make sense for a cortical neuron in vivo are probabilities, and indeed probabilities have been used for several decades to characterise responses recorded in cortex in vivo, in the form of 'post-stimulus time histograms' (199). There is also a whole family of computational models of cortical coding based on probabilistic models (201), some of which will be mentioned below. The property of noise-induced enhancement of responsiveness (190), or gain modulation by noise (194), suggests two possible views about the computational role of synaptic noise.
One view is that the apparently stochastic membrane potential fluctuations result from the processing of a large number of 'meaningful' signals. Assuming that these signals are weakly correlated, any given input sees the other inputs as 'noise', and can be boosted by these other inputs. The neuron would therefore multiplex many such weakly correlated signals (202). Alternatively, it may be that the cortical network produces self-generated stochastic activity in order to be in a regime of enhanced responsiveness where
Complexity in Neuronal Networks
afferent inputs can be processed efficiently. Here, synaptic noise is viewed as the result of a particular network state where responsiveness is optimised, perhaps related to attentional mechanisms (191,195; see discussion in ref. 53). These lines of evidence suggest that 'noise' is one of the building blocks that need to be taken into account to understand cortical computations, in addition to the intrinsic and synaptic properties introduced earlier. In contrast to what is usually thought, the effect of 'noise' may be not detrimental but possibly beneficial for near-threshold signalling and decision making. However, other theoretical frameworks, which are no longer limited to the classical Shannonian transmission of a signal source by a noisy channel, should also be considered. When the impact of 'noise' is studied no longer at the level of input/output operations performed by the single cell but at the level of assembly dynamics, network-driven 'noise' reflects the activity of all unseen units and depends on the degree of redundancy/sparseness achieved by the population code. Under such a dynamic regime, 'noise' no longer acts as an additive or multiplicative constant at the cell level, but becomes embedded in the global dynamics of the system (203). Recent work in visual cortex, based on the trial-by-trial reliability of the spike pattern as well as of the subthreshold membrane potential trajectory, indicates that 'noise' covaries with the global uncertainty provided by the sensory context (204,205). The variability of the network dynamics can be derived from the respective dimensions of the sensory input information to be processed, i.e. its complexity, and the intrinsic information capacity of the network.
5.2. Self-Organisation and Adaptive Properties in Network Dynamics
The concept of self-organisation dates back to the British psychiatrist and engineer W.R. Ashby (1956), who became one of the founding fathers of systems theory and cybernetics in the late 1940s (207). Based on the work of Ashby and his successors, we can now formulate several principles that are linked to the dynamics of self-organisation. First, self-organising systems exhibit a balance between positive feedback, which
leads to an acceleration of the development of the system, and negative feedback, which is required to stabilise the system. Second, self-organising systems show an adaptation to their environment. For that, the system needs a huge variety of stable states, i.e. the number of states must be large enough to allow the system to react to environmental perturbations yet remain stable. Third, the process of self-organisation is equivalent to an increase in coherence or synchrony which can span the whole system and is accompanied by a decrease of statistical entropy. Interestingly, this principle apparently points towards the well-known thermodynamic paradox. However, according to Ashby, dynamic systems always tend to evolve to a specific attractor state, which can be viewed as a specific state of equilibrium which reduces the uncertainty about the system's state and, therefore, minimises the system's statistical entropy. This argumentation, indeed, circumvents the paradox. Fourth, in self-organising systems, only correlation or coherence patterns which can maintain themselves can result from the inherent dynamics and, therefore, continue to exist. This property is a manifestation of the concept of closure and self-sufficiency. It has been shown that many physical and chemical systems develop a hierarchical, structured, i.e. complex, architecture. Finally, this organisational closure turns a collection of interacting elements into an individual coherent whole, which has properties that arise out of its organisation and that cannot be reduced to the properties of its elements. This is the birthplace of emergent properties. Generalising this last point of organisational closure, self-organisation also means adaptation of system constituents, populations or sub-components to themselves. This leads, naturally, to the concept of self-regulation or self-control.
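The first and third principles above (a balance of positive and negative feedback, and convergence to attractors that lowers statistical entropy) can be made concrete with a toy system. The sketch below is only an illustration under stated assumptions: the equation dx/dt = x - x^3, the Euler step size, the ensemble size and the histogram-based entropy estimate are all chosen here for simplicity and do not come from Ashby's work or from the chapter.

```python
import math
import random


def step(x, dt=0.1):
    """One Euler step of dx/dt = x - x**3.

    The linear term acts as positive feedback (growth away from 0); the cubic
    term acts as negative feedback that stabilises the state near +1 or -1.
    """
    return x + dt * (x - x ** 3)


def entropy_bits(states, lo=-2.0, hi=2.0, n_bins=10):
    """Shannon entropy (in bits) of the histogram of states over [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for x in states:
        idx = min(max(int((x - lo) / width), 0), n_bins - 1)
        counts[idx] += 1
    total = len(states)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def run_ensemble(n_systems=1000, n_steps=300, seed=1):
    """Evolve an ensemble from random initial states; return (initial, final)."""
    rng = random.Random(seed)
    initial = [rng.uniform(-2.0, 2.0) for _ in range(n_systems)]
    final = list(initial)
    for _ in range(n_steps):
        final = [step(x) for x in final]
    return initial, final


if __name__ == "__main__":
    initial, final = run_ensemble()
    print(f"entropy before: {entropy_bits(initial):.2f} bits")
    print(f"entropy after:  {entropy_bits(final):.2f} bits")
```

The ensemble starts spread over the whole interval and ends concentrated on the two attractors, so the uncertainty about the state of any one system drops. This is the sense in which convergence to an attractor reduces statistical entropy in this toy setting.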
Self-regulation requires, first, that the system is able to produce a sufficient variety of actions to cope with possible perturbations (external and intrinsic) and, second, that the system selects the most adequate counter-action for a given perturbation. Here, variety keeps the system far from equilibrium, i.e. endows it with many possible attractor states, characteristic of chaotic systems. In contrast, selectivity pushes towards a sufficiently small number of attractor states in order to ensure stability, as is typical for deterministic systems. The coexistence, or union, of variety and selectivity leads to the idea that complex
adaptive systems tend to reside on the edge of chaos (208,209), a regime between chaotic disorder and deterministic order. Moreover, the mechanism by which complex systems tend to maintain themselves on this critical edge is called self-organised criticality (210). Interestingly, as has been shown to be the case in many natural and social contexts, the system's behaviour on this edge is typically governed by power laws. This closes the loop of argumentation, although with many remaining gaps in the explanation, for a connection between the definition of a system as being complex and the property of possessing a topological or functional small-world structure which follows a power-law distribution in some of its characteristic properties.
6. Conclusion and Perspectives: Complexity as a Computational Principle?
An alternative to attempting to explain cortical function on the basis of canonical microcircuits is to exploit the phenomenal diversity of cortical structure and dynamics. Cortical neurons display a wide diversity of morphologies and intrinsic properties (45), and synaptic dynamics are highly variable, with properties ranging from facilitating to depressing synapses (65). The essential feature of cortical anatomy may be precisely that there is no canonical pattern of functional connectivity, consistent with the considerable apparently-random component of cortical structural connectivity templates (150,151). Taking these observations together, one may argue that the cortex is a circuit, ever adapting to the functional task, that seems to maximise its own complexity, both in terms of recruitment of distributed elementary intrinsic properties at the single-cell level and in terms of relational topology at the network connectivity level. In line with this view, computational models are now emerging, in which the goal is to take advantage of the special information processing capabilities - and memory - of such complex systems.
Such large-scale networks can transform temporal codes into spatial codes by self-organisation (211), and computing frameworks have been proposed which exploit the capacity of such complex networks to cope with complex input streams (212,213). In these examples, information is
stored in the ongoing activity of the network, in addition to its synaptic weights. Progress in the study of complex systems may also be crucial for understanding cortical computations. We have mentioned above the idea that cortical wiring is given by the repetition of canonical microcircuits (53,147-149), which was contrasted with the idea that cortical wiring templates are highly variable (150,142,151). We can go further and note that the nervous system is perfectly able to wire well-ordered, almost crystalline structures, suggesting that the considerable variability and diversity found in neocortex is important for its computations. We may have to abandon the usual concept of 'large networks of identical units' and replace it with the idea that the cortex consists of 'large networks of diverse elements', guided by a continuum of properties, rather than of prototypical units or microcircuits. Advances in this field will depend on combined progress in molecular tagging, instrumentation (e.g. two-photon (214)) and multiple-scale monitoring of brain activity. New methods are already available to link levels of integration and visualise the collective dynamics of large distributed ensembles of spiking elements. Such tools find a natural field of application in the delineation of parts of the brain, and more specifically of cortical areas, involved in specific cognitive functions and in the genesis or processing of mental representations (215). However, their present use and repeated abuse illustrate the fact that most advanced brain imaging techniques (calcium imaging with two-photon, PET [d], fMRI, BOLD [e], EEG-MEG [f]) rely on explicative variables (metabolic, haemodynamic) which differ greatly from those used to decipher the neural code and information transfer at a more microscopic level (current source density, evoked potentials, spike counts).
Increased efforts have to be made in establishing correlations and, when possible, transfer functions between variables collected at different levels of integration and scales of observation (216,217). Ignoring, or underestimating, elementary sources of complexity in the living brain has already led to specific locations being attributed to consciousness, and we should not have to wait long for the pineal gland to be reinstated as the centre of mind/soul. It is only when the complexity of brain processes is fully recognised that one may hope to go beyond the past errors of phrenology and system behaviour linearisation.
[d] PET: Positron Emission Tomography. [e] BOLD: Blood Oxygenation Level Dependent. [f] MEG: Magnetoencephalography.
Acknowledgements
This review work has been supported by the European Union under the Bio-inspired Intelligent Information Systems Programme, project reference IST-2004-15879 (FACETS).
References
1. Churchland, P. and Sejnowski, T. (1988). Perspectives on cognitive neuroscience. Science. 242, 741-745.
2. Poggio, T. (1983). Visual algorithms. In: O.J. Braddick and A.C. Sleigh (eds.), Physical and Biological Processing of Images, 128-153. Springer-Verlag, Berlin.
3. McCulloch, W. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115-133.
4. Ferster, D. and Miller, K.D. (2000). Neural mechanisms of orientation selectivity in the visual cortex. Annu. Rev. Neurosci. 23, 441-471.
5. Borg-Graham, L. and Monier, C. (1998). Visual input evokes transient and strong shunting inhibition in visual cortical neurons. Nature. 393, 369-373.
6. Shapley, R., Hawken, M. and Ringach, D.L. (2003). Dynamics of orientation selectivity in the primary visual cortex and the importance of cortical inhibition. Neuron. 38, 689-699.
7. Monier, C., Chavane, F., Baudot, P., Graham, L.J. and Frégnac, Y. (2003). Orientation and direction selectivity of synaptic inputs in visual cortical neurons: a diversity of combinations produces spike tuning. Neuron. 37, 663-680.
8. Oliet, S., Piet, R. and Poulain, D. (2001). Control of glutamate clearance and synaptic efficacy by glial coverage of neurons. Science. 292, 923-926.
9. Todd, K., Serrano, A., Lacaille, J. and Robitaille, R. (2006). Glial cells in synaptic plasticity. J. Physiol. 99, 75.
10. Jacobs, R. (1999). Computational studies of the development of functionally specialized neural modules. Trends Cogn. Sci. 3, 31-38.
11. Connors, B., Gutnick, M. and Prince, D. (1982). Electrophysiological properties of neocortical neurons in vitro. J. Neurophysiol. 48, 1302-1320.
12. Hille, B. (1991). Ionic channels of excitable membranes, Sinauer Associates, Sunderland, 2nd edn.
13. Gähwiler, B.H. (1981). Organotypic monolayer cultures of nervous tissue. J. Neurosci. Meth. 4, 329-342.
14. Kandel, E. (1976). Cellular basis of behavior: an introduction to behavioral neurobiology. W.H. Freeman.
15. Rall, W. (1964). Theoretical significance of dendritic tree for input-output relation. In: R.F. Reiss (ed.), Neural Theory and Modeling, 73-97. Stanford University Press, Stanford.
16. Rall, W. (1967). Distinguishing theoretical synaptic potentials computed for different soma-dendritic distributions of synaptic input. J. Neurophysiol. 30, 1138-1168.
17. Rall, W. (1969). Time constants and electrotonic length of membrane cylinders and neurons. Biophys. J. 9, 1483-1508.
18. Rall, W. and Rinzel, J. (1973). Branch input resistance and steady attenuation for input to one branch of a dendritic neuron model. Biophys. J. 13, 648-687.
19. Segev, I., Rinzel, J. and Shepherd, G.M. (eds.) (1995). The Theoretical Foundation of Dendritic Function. Selected Papers of Wilfrid Rall with Commentaries. MIT Press, Cambridge, MA.
20. Llinás, R. (1975). Electroresponsive properties of dendrites in central neurons. Adv. Neuro. 12, 1-13.
21. Johnston, D., Magee, J.C., Colbert, C.M. and Christie, B.R. (1996). Active properties of neuronal dendrites. Annu. Rev. Neurosci. 19, 165-186.
22. Migliore, M. and Shepherd, G.M. (2002). Emerging rules for the distributions of active dendritic conductances. Nat. Rev. Neurosci. 3, 362-370.
23. Segev, I. and Rall, W. (1998). Excitable dendrites and spines: earlier theoretical insights elucidate recent direct observations. Trends Neurosci. 21, 453-460.
24. Yuste, R. and Tank, D.W. (1996). Dendritic integration in mammalian neurons, a century after Cajal. Neuron. 16, 701-716.
25. Stuart, G., Spruston, N. and Häusser, M. (eds.) (2000). Dendrites, MIT Press, Cambridge, MA.
26. Wong, R.K., Prince, D.A. and Basbaum, A.I. (1979). Intradendritic recordings from hippocampal neurons. Proc. Natl. Acad. Sci. U.S.A. 76, 986-990.
27. Benardo, L.S., Masukawa, L.M. and Prince, D.A. (1982). Electrophysiology of isolated hippocampal pyramidal dendrites. J. Neurosci. 2, 1614-1622.
28. Regehr, W., Kehoe, J.S., Ascher, P. and Armstrong, C. (1993). Synaptically triggered action potentials in dendrites. Neuron. 11, 145-151.
29. Andreasen, M. and Lambert, J.D. (1995). Regenerative properties of pyramidal cell dendrites in area CA1 of the rat hippocampus. J. Physiol. 483, 421-441.
30. Schwindt, P.C. and Crill, W.E. (1995). Amplification of synaptic current by persistent sodium conductance in apical dendrite of neocortical neurons. J. Neurophysiol. 74, 2220-2224.
31. Stuart, G. and Sakmann, B. (1995). Amplification of EPSPs by axosomatic sodium channels in neocortical pyramidal neurons. Neuron. 15, 1065-1076.
32. Berger, T., Larkum, M. E. and Lüscher, H. R. (2001). High I(h) channel density in the distal apical dendrite of layer V pyramidal cells increases bidirectional attenuation of EPSPs. J. Neurophysiol. 85, 855-868.
33. Williams, S. R. and Stuart, G. J. (2000). Site independence of EPSP time course is mediated by dendritic I(h) in neocortical pyramidal neurons. J. Neurophysiol. 83, 3177-3182.
34. Stuart, G. J. and Sakmann, B. (1994). Active propagation of somatic action potentials into neocortical pyramidal cell dendrites. Nature. 367, 69-72.
35. Markram, H., Lübke, J., Frotscher, M. and Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science. 275, 213-215.
36. Hines, M. L. and Carnevale, N. T. (1997). The NEURON simulation environment. Neural Comput. 9, 1179-1209.
37. Hines, M. L. and Carnevale, N. T. (2001). NEURON: a tool for neuroscientists. Neuroscientist. 7, 123-135.
38. Koch, C. and Segev, I. (eds.) (1998). Methods in Neuronal Modeling, MIT Press, Cambridge, MA.
39. Bower, J.M. and Beeman, D. (1997). The Book of GENESIS: Exploring realistic neural models with the GEneral NEural SImulation System, TELOS, Springer-Verlag, New York, 2nd edn.
40. Steriade, M., Timofeev, I. and Grenier, F. (2001). Natural waking and sleep states: a view from inside neocortical neurons. J. Neurophysiol. 85, 1969-1985.
41. Matsumura, M., Cope, T. and Fetz, E. (1988). Sustained excitatory synaptic input to motor cortex neurons in awake animals revealed by intracellular recording of membrane potentials. Exp. Brain Res. 70, 463-469.
42. Destexhe, A., Rudolph, M. and Paré, D. (2003). The high-conductance state of neocortical neurons in vivo. Nat. Rev. Neurosci. 4, 739-751.
43. Nowak, L.G., Azouz, R., Sanchez-Vives, M.V., Gray, C. M. and McCormick, D. A. (2003). Electrophysiological classes of cat primary visual cortical neurons in vivo as revealed by quantitative analyses. J. Neurophysiol. 89, 1541-1566.
44. Connors, B.W. and Gutnick, M.J. (1990). Intrinsic firing patterns of diverse neocortical neurons. Trends Neurosci. 13, 99-104.
45. Gupta, A., Wang, Y. and Markram, H. (2000). Organizing principles for a diversity of GABAergic interneurons and synapses in the neocortex. Science. 287, 273-278.
46. Yuste, R. (2005). Origin and classification of neocortical interneurons. Neuron. 48, 591-604.
47. Alonso-Nanclares, L., Anderson, S., Ascoli, G., Benavides-Piccione, R., Burkhalter, A., Buzsaki, G., Cauli, B., DeFelipe, J., Fairen, A., Feldmeyer, D., Fishel, G., Frégnac, Y., Freund, T. F., Fukuyi, K., Glarreta, M., Goldberg, J., Helmstaedter, M., Hensch, T., Hestrin, S., Kisvarday, Z., Lambolez, B., Lewis, D., McBain, C., Marin, O., Markham, H., Monyer, H., Muñoz, A., Petersen, C., Rockland, K., Rossier, H., Ruby, B., Somogyi, P., Staiger, J. F., Tamas, G., Thomson, A., Toledo-Rodriguez, M., Wang, X.-J., Wang, Y., West, D. and Yuste, R. (2005). Nomenclature of features of GABAergic interneurons of the cerebral cortex. Petilla. Classification endorsed by the American Society for Neuroscience.
48. Toledo-Rodriguez, M., Blumenfeld, B., Wu, C., Luo, J., Attali, B., Goodman, P. and Markram, H. (2004). Correlation maps allow neuronal electrical properties to be predicted from single-cell gene expression profiles in rat neocortex. Cereb. Cortex. 14, 1310-1327.
49. Monyer, H. and Markram, H. (2004). Interneuron diversity series: Molecular and genetic tools to study GABAergic interneuron diversity and function. Trends Neurosci. 27, 90-94.
50. Blatow, M., Rozov, A., Katona, I., Hormuzdi, S.G., Meyer, A.H., Whittington, M.A., Caputi, A. and Monyer, H. (2003). A novel network of multipolar bursting interneurons generates theta-frequency oscillations in neocortex. Neuron. 38, 805-817.
51. Whittington, M.A. and Traub, R.D. (2003). Interneuron diversity series: inhibitory interneurons and network oscillations in vitro. Trends Neurosci. 26, 676-682.
52. Tsuchida, T., Ensini, M., Morton, S.B., Baldassare, M., Edlund, T., Jessell, T.M. and Pfaff, S.L. (1994). Topographic organization of embryonic motor neurons defined by expression of LIM homeobox genes. Cell. 79, 957-970.
53. Frégnac, Y.R., Blatow, M., Changeux, J.P., DeFelipe, J., Markram, H., Lansner, A., Maass, W., McCormick, D., Michel, C. et al. (2006). Ups and downs in the genesis of cortical computation. In: Grillner, S. et al. (ed.), Microcircuits: The Interface between Neurons and Global Brain Function, vol. 93 of Dahlem Workshop Report, 397-437. MIT Press, Cambridge, MA.
54. Szentágothai, J. (1975). The 'module-concept' in cerebral cortex architecture. Brain Res. 95, 475-496.
55. Hubel, D.H. and Wiesel, T.N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. 160, 106-154.
56. Dudek, F., Andrew, R., MacVicar, B., Snow, R. and Taylor, C. (1983). Recent evidence for and possible significance of gap junctions and electrotonic synapses in the mammalian brain. In: Basic Mechanisms of Neuronal Hyperexcitability, 31-73. A. Liss, New York.
57. Galarreta, M. and Hestrin, S. (1999). A network of fast-spiking cells in the neocortex connected by electrical synapses. Nature. 402, 72-75.
58. Gil, Z., Connors, B. and Amitai, Y. (1997). Differential regulation of neocortical synapses by neuromodulators and activity. Neuron. 19, 679-686.
59. Douglas, R., Martin, K. and Whitteridge, D. (1988). Selective responses of visual cortical cells do not depend on shunting inhibition. Nature. 332, 642-644.
60. Dale, H. (1935). Pharmacology and nerve-endings. Proc. R. Soc. Med. (Lond). 28, 319-332.
61. Cherubini, E., Gaiarsa, J.L. and Ben-Ari, Y. (1991). GABA: an excitatory transmitter in early postnatal life. Trends Neurosci. 14, 515-519.
62. Alkon, D., Sanchez-Andrés, J. V., Ito, E., Oka, K., Yoshika, T. and Collin, C. (1992). Long-term transformation of an inhibitory into an excitatory GABAergic synaptic response. Proc. Natl. Acad. Sci. U.S.A. 89, 11862-11866.
63. Liao, D., Hessler, N. and Malinow, R. (1995). Activation of postsynaptically silent synapses during pairing-induced LTP in CA1 region of hippocampal slice. Nature. 375, 400-404.
64. Malenka, R.C. and Nicoll, R.A. (1997). Silent synapses speak up. Neuron. 19, 473-476.
65. Thomson, A. M. (2000). Facilitation, augmentation and potentiation at central synapses. Trends Neurosci. 23, 305-312.
66. Markram, H. and Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature. 382, 807-810.
67. Tsodyks, M., Pawelzik, K. and Markram, H. (1998). Neural networks with dynamic synapses. Neural Comput. 10, 821-835.
68. Frégnac, Y. (1999). A tale of two spikes. Nature Neurosci. 2, 299-301.
69. Frégnac, Y. (2002). Hebbian synaptic plasticity. In: M. A. Arbib (ed.), The Handbook of Brain Theory and Neural Networks, 2nd edition, 515-522. MIT Press, Cambridge, MA.
70. Brown, T., Ganong, A., Kairiss, E. and Keenan, C. (1990). Hebbian synapses: biophysical mechanisms and algorithms. Annu. Rev. Neurosci. 13, 475-511.
71. Bliss, T. V. and Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol. 232, 331-356.
72. Kelso, S. R., Ganong, A. H. and Brown, T. H. (1986). Hebbian synapses in hippocampus. Proc. Natl. Acad. Sci. U.S.A. 83, 5326-5330.
73. Bliss, T. V. P. and Collingridge, G.L. (1993). A synaptic model of memory: long-term potentiation in the hippocampus. Nature. 361, 31-39.
74. Artola, A., Bröcher, S. and Singer, W. (1990). Different voltage-dependent thresholds for inducing long-term depression and long-term potentiation in slices of the rat visual cortex. Nature. 347, 69-72.
75. Mulkey, R. and Malenka, R. (1992). Mechanisms underlying induction of homosynaptic long-term depression in area CA1 of the hippocampus. Neuron. 9, 967-975.
76. Dudek, S. and Bear, M. (1992). Homosynaptic long-term depression in area CA1 of hippocampus and effects of N-methyl-D-aspartate receptor blockade. Proc. Natl. Acad. Sci. U.S.A. 89, 4363-4367.
77. Kirkwood, A., Dudek, S., Gold, J., Aizenman, C. and Bear, M. (1993). Common forms of synaptic plasticity in hippocampus and neocortex in vitro. Science. 260, 1518-1521.
78. Bi, G. and Poo, M. (2001). Synaptic modification by correlated activity: Hebb's postulate revisited. Annu. Rev. Neurosci. 24, 139-166.
79. Levy, W. and Steward, O. (1983). Temporal contiguity requirements for long-term associative potentiation/depression in the hippocampus. Neuroscience. 8, 791-797.
80. Bi, G. and Poo, M. (1998). Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci. 18, 10464-10472.
81. Feldman, D. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron. 27, 45-56.
82. Bell, C., Han, V., Sugawara, Y. and Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature. 387, 278-281.
83. Hebb, D. O. (1949). The Organization of Behavior, John Wiley and Sons, New York.
84. Frégnac, Y., Shulz, D., Thorpe, S. and Bienenstock, E. (1988). A cellular analogue of visual cortical plasticity. Nature. 333, 367-370.
85. Frégnac, Y. and Shulz, D. (1999). Activity-dependent regulation of receptive field properties of cat area 17 by supervised Hebbian learning. J. Neurobiol. 41, 69-82.
86. Kirkwood, A., Lee, H. and Bear, M. (1995). Co-regulation of long-term potentiation and experience-dependent synaptic plasticity in visual cortex by age and experience. Nature. 375, 328-331.
87. Mehta, M., Quirk, M. and Wilson, M. (2000). Experience-dependent asymmetric shape of hippocampal receptive fields. Neuron. 25, 707-715.
88. Davison, A. and Frégnac, Y. (2006). Learning cross-modal transformations through spike timing-dependent plasticity. J. Neurosci. In press.
89. Frégnac, Y. (1998). Homeostasis or synaptic plasticity? Nature (News and Views). 391, 845-846.
90. O'Donovan, M.J. and Rinzel, J. (1997). Synaptic depression: a dynamic regulator of synaptic communication with varied functional roles. Trends Neurosci. 20, 431-433.
91. Abraham, W. and Bear, M. (1996). Metaplasticity: the plasticity of synaptic plasticity. Trends Neurosci. 19, 126-130.
92. Glanzman, D., Kandel, E. and Schacher, S. (1991). Target-dependent morphological segregation of Aplysia sensory outgrowth in vitro. Neuron. 7, 903-913.
93. Willshaw, D. and Von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organization. Proc. R. Soc. Lond. B Biol. Sci. 194, 431-445.
94. Turrigiano, G., Abbott, L. and Marder, E. (1994). Activity-dependent changes in the intrinsic properties of cultured neurons. Science. 264, 974-977.
95. Turrigiano, G., Leslie, K., Desai, N., Rutherford, L. and Nelson, S. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature. 391, 892-896.
96. Miller, K. and MacKay, D. (1996). The role of constraints in Hebbian learning. Neural Comput. 6, 100-126.
97. Chen, W.R., Midtgaard, J. and Shepherd, G.M. (1997). Forward and backward propagation of dendritic impulses and their synaptic control in mitral cells. Science. 278, 463-467.
98. Getting, P.A. (1989). Emerging principles governing the operation of neural networks. Annu. Rev. Neurosci. 12, 185-204.
99. Hartline, D.K., Russell, D.F., Raper, J.A. and Graubard, K. (1988). Special cellular and synaptic mechanisms in motor pattern generation. Comp. Biochem. Physiol. C. 91, 115-131.
100. Kiehn, O. and Eken, T. (1998). Functional role of plateau potentials in vertebrate motor neurons. Curr. Opin. Neurobiol. 8, 746-752.
101. Bremer, F. (1938). L'activité électrique de l'écorce cérébrale. Actualités Scientifiques et Industrielles. 658, 3-46.
102. Bremer, F. (1949). Considérations sur l'origine et la nature des 'ondes' cérébrales. Electroencephalogr. Clin. Neurophysiol. 1, 177-193.
103. Destexhe, A. and Sejnowski, T.J. (2003). Interactions between membrane conductances underlying thalamocortical slow-wave oscillations. Physiol. Rev. 83, 1401-1453.
104. Steriade, M. (2003). Neuronal Substrates of Sleep and Epilepsy. Cambridge University Press, Cambridge, UK.
105. Llinás, R.R. (2002). I of the Vortex: from Neurons to Self, MIT Press, Cambridge, MA.
106. Llinás, R.R. and Paré, D. (1991). Of dreaming and wakefulness. Neuroscience. 44, 521-535.
107. Singer, W. and Gray, C.M. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci. 18, 555-586.
108. Stafström, C.E., Schwindt, P.C. and Crill, W. E. (1984). Repetitive firing in layer V neurons from cat neocortex in vitro. J. Neurophysiol. 52, 264-277.
109. Gray, M. and McCormick, D. (1996). Chattering cells: Superficial pyramidal neurons contributing to the generation of synchronous oscillations in the visual cortex. Science. 274, 109-113.
110. Golding, N.L., Staff, N.P. and Spruston, N. (2002). Dendritic spikes as a mechanism for cooperative long-term potentiation. Nature. 418, 326-331.
111. Hooper, S. and Moulins, M. (1989). Switching of a neuron from one network to another by sensory-induced changes in membrane properties. Science. 244, 1587-1589.
112. Meyrand, P., Simmers, J. and Moulins, M. (1991). Construction of a pattern-generating circuit with neurons of different networks. Nature. 351, 60-62.
113. Le Masson, G., Marder, E. and Abbott, L. (1993). Activity-dependent regulation of conductances in model neurons. Science. 259, 1915-1917.
114. Dickinson, P. and Nagy, F. (1983). Control of a central pattern generator by an identified modulatory interneurone in crustacea: II. induction and modification of plateau properties in neurones. J. Exp. Biol. 105, 59-82.
115. Dickinson, P.S., Mecsas, C. and Marder, E. (1990). Neuropeptide fusion of two motor-pattern generator circuits. Nature. 344, 155-158.
116. Steriade, M., Jones, E.G. and McCormick, D.A. (eds.) (1997). Thalamus, Elsevier, Amsterdam.
117. McCormick, D. A. (1992). Neurotransmitter actions in the thalamus and cerebral cortex and their role in neuromodulation of thalamocortical activity. Prog. Neurobiol. 39, 337-388.
118. Sherman, S.M. and Guillery, R.W. (2001). Exploring the Thalamus, Academic Press, New York.
119. Gerstner, W. and Kistler, W. (2002). Spiking Neuron Models, Cambridge University Press, Cambridge, UK.
120. Feng, J. (ed.) (2003). Computational Neuroscience: A Comprehensive Approach, Chapman and Hall/CRC, Boca Raton, FL.
121. Lapicque, L. (1907). Recherches quantitatives sur l'excitation électrique des nerfs traitée comme une polarisation. J. Physiol. Pathol. Gen. 9, 620-635.
122. Ermentrout, B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comput. 8, 979-1001.
123. Fourcaud-Trocmé, N., Hansel, D., van Vreeswijk, C. and Brunel, N. (2003). How spike generation mechanisms determine the neuronal response to fluctuating inputs. J. Neurosci. 23, 11628-11640.
124. Izhikevich, E.M. (2004). Which model to use for cortical spiking neurons? IEEE Trans. Neural Networks. 15, 1063-1070.
125. Brette, R. and Gerstner, W. (2005). Adaptive exponential integrate-and-fire model as an effective description of neuronal activity. J. Neurophysiol. 94, 3637-3642.
126. Rudolph, M. and Destexhe, A. (2006). Integrate-and-fire neurons with high-conductance state dynamics for event-driven simulation strategies. Neural Comput. In press.
127. Bressloff, P. (1995). Dynamics of a compartmental model integrate-and-fire neuron with somatic potential reset. Physica D. 80, 399-412.
128. Hodgkin, A.L. and Huxley, A.F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. 117, 500-544.
129. Anderson, J. and Rosenfeld, E. (eds.) (1991). Neurocomputing: Foundations of Research, MIT Press.
130. King, C.C. (1991). Fractal and chaotic dynamics in nervous systems. Prog. Neurobiol. 36, 279-308.
131. Faure, P. and Korn, H. (2001). Is there chaos in the brain? I. Concepts of nonlinear dynamics and methods of investigation. C. R. Acad. Sci. III. 324, 773-793.
132. Schiff, S.J., Jerger, K., Duong, D.H., Chang, T., Spano, M.L. and Ditto, W.L. (1994). Controlling chaos in the brain. Nature. 370, 615-620.
133. Korn, H. and Faure, P. (2003). Is there chaos in the brain? II. Experimental evidence and related models. C. R. Biol. 326, 787-840.
134. Babloyantz, A., Salazar, J. and Nicolis, G. (1985). Evidence of chaotic dynamics of brain activity during the sleep cycle. Phys. Lett. A. 111, 152-156.
135. Freeman, W. J. (2000). Mesoscopic neurodynamics: from neuron to brain. J. Physiol. (Paris). 94, 303-322.
136. Pritchard, W.S. and Duke, D.W. (1995). Measuring “chaos” in the brain: a tutorial review of EEG dimension estimation. Brain Cogn. 27, 353-397.
137. Maex, R. and De Schutter, E. (1998). Synchronization of Golgi and granule cell firing in a detailed network model of the cerebellar granule cell layer. J. Neurophysiol. 80, 2521-2537.
138. Davison, A. P., Feng, J. and Brown, D. (2003). Dendrodendritic inhibition and simulated odor responses in a detailed olfactory bulb network model. J. Neurophysiol. 90, 1921-1935.
139. Rangan, A.V., Cai, D. and McLaughlin, D.W. (2005). Modeling the spatiotemporal cortical activity associated with the line-motion illusion in primary visual cortex. Proc. Natl. Acad. Sci. U.S.A. 102, 18793-18800.
140. Traub, R.D., Contreras, D., Cunningham, M.O., Murray, H., LeBeau, F.E.N., Roopun, A., Bibbig, A., Wilent, W. B., Higley, M. J. and Whittington, M.A. (2005). Single-column thalamocortical network model exhibiting gamma oscillations, sleep spindles, and epileptogenic bursts. J. Neurophysiol. 93, 2194-2232.
Complexity in Neuronal Networks
337
141. Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci. 8, 183-208. 142. White, E.L. (1989). Cortical circuits, Birkhauser, Boston. 143. Thomson, A.M. and Bannister, A.P. (2003). Interlaminar connections in the neocortex. Cereb. Cortex. 13, 5-14. 144. Herkenham, M. (1980). Laminar organization of thalamic projections to the rat neocortex. Science. 207, 532-535. 145. Peters, A. and Jones, E. (eds.) (1985). Cerebral cortex, vol. 3. Plenum, New York. 146. Gilbert, C. and Wiesel, T. (1989). Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J. Neurosci. 9, 2432-2442. 147. Mountcastle, V. B. (1979). An organizing principle for cerebral function: the unit module and the distributed system. In: F.O. Schmidt and F.G. Worden (eds.), The Neurosciences: Fourth Study Program, 21-42. MIT Press, Cambridge, MA. 148. Hubel, D.H. and Wiesel, T.N. (1963). Shape and arrangement of columns in cat's striate cortex. J. Physiol. 165, 559-568. 149. Douglas, R. and Martin, K. (1991). A functional microcircuit for cat visual cortex. J. Physiol. 440, 735-769. 150. Braitenberg, V. and Schüz, A. (1998). Cortex: statistics and geometry of neuronal connectivity, y.Springer-Verlag, Berlin. 151. Silberberg, G., Gupta, A. and Markram, H. (2002). Stereotypy in neocortical microcircuits. Trends Neurosci. 25, 227-230. 152. Contreras, D., Dürmüller, N. and Steriade, M. (1997). Absence of a prevalent laminar distribution of IPSPs in association cortical neurons of cat. J. Neurophysiol. 78, 2742-2753. 153. Mountcastle, V. (1957). Modality and topographic properties of single neurons in a cat's somatosensory cortex. J. Neurophysiol. 20, 408-434. 154. Mountcastle, V.B. (1982). An organization principle for celebral function: The unit module and the distributed system. In: V. Mountcastle and V. 
Edelman (eds.), Mindful Brain: Cortical Organization and the Group-Selective Theory of Higher Brain Function, 7-50. MIT Press, Cambridge, MA. 155. Hubel, D. and Wiesel, T. (1977). Functional architecture of macaque monkey visual cortex. Proc. R. Soc. Lond. B Biol. Sci. 198, 1-59. 156. Mountcastle, V. (1997). The columnar organization of the neocortex. Brain. 120, 701-722. 157. Gilbert, C.D. and Wiesel, T. N. (1983). Clustered intrinsic connections in cat visual cortex. J. Neurosci. 3, 1116-1133. 158. Gilbert, C. and Wiesel, T. (1979). Morphology and intracortical projections of functionally characterized neurones in the cat visual cortex. Nature. 280, 120-125. 159. Somers, D., Todorov, E., Siapas, A., Toth, L., Kim, D. and Sur, M. (1998). A local circuit approach to understanding integration of long-range inputs in primary visual cortex. Cereb. Cortex. 8, 204-217. 160. Wiesel, T. (1982). Postnatal development of the visual cortex and the influence of the environment (Nobel lecture). Nature. 299, 583-591. 161. Sharma, J., Angelucci, A. and Sur, M. (2000). Induction of visual orientation modules in auditory cortex. Nature. 404, 841-847.
338
Y. Frégnac, M. Rudolph, A. P. Davison and A. Destexhe
162. von Melchner, L., Pallas, S. and Sur, M. (2000). Visual behaviour mediated by retinal projections directed to the auditory pathway. Nature. 404, 871-876. 163. Binzegger, T., Douglas, R.J. and Martin, K.A.C. (2004). A quantitative map of the circuit of cat primary visual cortex. J. Neurosci. 24, 8441-53. 164. Felleman, D. J. and Van Essen, D.C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex. 1, 1-47. 165. Braitenberg, V. and Schüz, A. (1991). Anatomy of the cortex, Springer, Berlin. 166. DeFelipe, J., Alonso-Nanclares, L. and Arellano, J.I. (2002). Microstructure of the neocortex: comparative aspects. J. Neurocytol. 31, 299-316. 167. Scannell, J.W., Blakemore, C. and Young, M.P. (1995). Analysis of connectivity in the cat cerebral cortex. J. Neurosci., 15, 1463-1483. 168. Scannell, J.W., Burns, G.A., Hilgetag, C. C., O'Neil, M.A. and Young, M.P. (1999). The connectional organization of the cortico-thalamic system of the cat. Cereb. Cortex. 9, 277-299. 169. Burns, G.A. and Young, M.P. (2000). Analysis of the connectional organization of neural systems associated with the hippocampus in rats. Philos. Trans. R. Soc. Lond. B Biol. Sci. 355, 55-70. 170. Young, M.P. (1992). Objective analysis of the topological organization of the primate cortical visual system. Nature. 358, 152-155. 171. Hilgetag, C.C., Burns, G.A., O'Neill, M.A., Scannell, J.W. and Young, M.P. (2000). Anatomical connectivity defines the organization of clusters of cortical areas in the macaque monkey and the cat. Philos. Trans. R. Soc. Lond. B Biol. Sci. 355, 91-110. 172. Bienenstock, E. (1996). On the dimensionality of cortical graphs. J. Physiol. (Paris), 90, 251-256. 173. Sporns, O. (2002). Graph theory methods for the analysis of neural connectivity patterns. In: R. Kötter (ed.), Neuroscience Databases. A Practical Guide., 171-186. Klüwer, Boston, MA. 174. Watts, D.J. and Strogatz, S.H. (1998). Collective dynamics of 'small-world' networks. Nature. 
393, 440-442. 175. Sporns, O. and Zwi, J.D. (2004). The small world of the cerebral cortex. Neuroinformatics. 2, 145-162. 176. Landgren, S. (1952). On the excitation mechanism of the carotid baroceptors. Acta. Phys. Scandinav. 26, 1-34. 177. Chapman, K. and Smith, R. (1963). A linear transfer function underlying impulse frequency modulation in a cockroach mechanoreceptor. Nature. 197, 699-700. 178. Brown, M. and Stein, R. (1966). Quantitative studies on the slowly adapting stretch receptor of the crayfish. Kybernetik. 3, 175-185. 179. Biederman-Thorson, M. and Thorson, J. (1971). Dynamics of excitation and inhibition in the light-adapted Limulus eye in situ. J. Gen. Physiol., 58, 1-19. 180. Teich, M.C., Heneghan, C., Lowen, S.B., Ozaki, T. and Kaplan, E. (1997). Fractal character of the neural spike train in the visual system of the cat. J.Opt. Soc. Am. A Opt. Image Sci. Vis. 14, 529-546. 181. Lachaux, J., Rodriguez, E., Martinerie, J. and Varela, F. (1999). Measuring phase synchrony in brain signals. Hum. Brain Mapp. 8, 194-208.
Complexity in Neuronal Networks
339
182. Jung, P., Cornell-Bell, A., Madden, K.S. and Moss, F. (1998). Noise-induced spiral waves in astrocyte syncytia show evidence of self-organized criticality. J. Neurophysiol. 79, 1098-1101. 183. Beggs, J.M. and Plenz, D. (2004). Neuronal avalanches are diverse and precise activity patterns that are stable for many hours in cortical slice cultures. J. Neurosci. 24, 5216-5229. 184. Stephan, K.E., Hilgetag, C.C., Burns, G.A., O'Neill, M.A., Young, M.P. and Kötter, R. (2000). Computational analysis of functional connectivity between areas of primate cerebral cortex. Philos. Trans. R. Soc. Lond. B Biol. Sci. 355, 111-126. 185. Chialvo, D.R., (2004). Critical brain networks. Physica A. 340, 756 186. Beggs, J. M. and Plenz, D. (2003). Neuronal avalanches in neocortical circuits. J. Neurosci. 23, 11167-11177. 187. Bedard, C., Kroger, H. Destexhe A., (2006). Does the 1/f frequency scaling of brain signals reflect self-organized critical states, Physical Review Letters. 97, 118102 188. Sporns, O., Chialvo, D.R., Kaiser, M. and Hilgetag, C.C. (2004). Organization, development and function of complex brain networks. Trends Neurosci. 8, 418425. 189. Paré, D., Shink, E., Gaudreau, H., Destexhe, A. and Lang, E.J. (1998). Impact of spontaneous synaptic activity on the resting properties of cat neocortical pyramidal neurons in vivo. J. Neurophysiol. 79, 1450-1460. 190. DeFelipe, J. and Fariñas, I. (1992). The pyramidal neuron of the cerebral cortex: morphological and chemical characteristics of the synaptic inputs. Prog. Neurobiol. 39, 563-607. 191. Hô, N. and Destexhe, A. (2000). Synaptic background activity enhances the responsiveness of neocortical pyramidal neurons. J. Neurophysiol. 84, 1488-1496. 192. Wiesenfeld, K. and Moss, F. (1995). Stochastic resonance and the benefits of noise: from ice ages to crayfish and SQUIDs. Nature. 373, 33-36. 193. Destexhe, A., Rudolph, M., Fellous, J. and Sejnowski, T. (2001). 
Fluctuating synaptic conductances recreate in vivo-like activity in neocortical neurons. Neuroscience. 107, 13-24. 194. Chance, F.S., Abbott, L. and Reyes, A.D. (2002). Gain modulation from background synaptic input. Neuron. 35, 773-782. 195. Shu, Y., Hasenstaub, A., Badoual, M., Bal, T. and McCormick, D.A. (2003). Barrages of synaptic activity control the gain and sensitivity of cortical neurons. J. Neurosci. 23, 10388-10401. 196. Rudolph, M. and Destexhe, A. (2003). A fast-conducting, stochastic integrative mode for neocortical neurons in vivo. J. Neurosci. 23, 2466-2476. 197. Wei, D.S., Mei, Y.A., Bagal, A., Kao, J.P., Thompson, S. M. and Tang, C. M. (2001). Compartmentalized and binary behavior of terminal dendrites in hippocampal pyramidal neurons. Science. 293, 2272-2275. 198. Mel, B.W. (1994). Information processing in dendritic trees. Neural Computation. 6, 1031-1085. 199. Softky, W. (1994). Sub-millisecond coincidence detection in active dendritic trees. Neuroscience. 58, 13-41.
340
Y. Frégnac, M. Rudolph, A. P. Davison and A. Destexhe
200. Moore, G.P., Perkel, D. H. and Segundo, J.P. (1966). Statistical analysis and functional interpretation of neuronal spike data. Annu. Rev. Physiol. 28, 493-522. 201. Rao, R., Olshausen, B. and Lewicki, M. (eds.) (2002). Probabilistic Models of the Brain, MIT Press, Cambridge, MA. 202. Rudolph, M. and Destexhe, A. (2001). Do neocortical pyramidal neurons display stochastic resonance? J. Comput. Neurosci. 11, 19-42. 203. Rieke, F., Warland, D., de Ruyter van Steveninck, R. and Bialek, W. (1997). Spikes: Exploring the Neural Code, MIT Press, Cambridge, MA. 204. Fregnac, Y., Baudot, P., Levy, M., Marre, O. (2005). An intracellular view of time coding and sparseness of cortical representation in V1 neurons during virtual oculomotor exploration of natural scenes. Cosyne05, Conf17, 30. 205. Marre, O., Baudot, P., Levy, M. and Frégnac, Y. (2005). High timing precision and reliability, low redundancy and low entropy code in V1 neurons during visual processing of natural scenes. In: Soc. Neurosc. Abstr. 285.5., Washinton D.C., 285.5. Washinton D.C. 206. Ashby, W. (1947). Principles of the self-organizing dynamic system. J. Gen. Psych. 37, 125-128. 207. Ashby, W. (1956). An Introduction to Cybernetics. Chapman and Hall, London. 208. Langton, C. (1990). Computation at the edge of chaos: Phase transitions and emergent computation. Physica D. 42, 12-37. 209. Kauffman, S. (1993). Origins of Order: Self-Organization and Selection in Evolution, Oxford University Press, Oxford. 210. Bak, P. (1997). How Nature Works: The Science of Self-organized Criticality, Oxford University Press, Oxford. 211. Buonomano, D.V. and Merzenich, M.M. (1995). Temporal information transformed into a spatial code by a neural network with realistic properties. Science. 267, 1028-1030. 212. Maass, W., Natschläger, T. and Markram, H. (2002). Real-time computing without stable states: a new framework for neural computation based on perurbations. Neural Comput. 14, 2531-2560. 213. Bertschinger, N. 
and Natschläger, T. (2004). Real-time computation at the edge of chaos in recurrent neural networks. Neural Comput. 16, 1413-1436. 214. Ohki, K., Chung, S., Ch'ng, Y. H., Kara, P. and Reid, R.C. (2005). Functional imaging with cellular resolution reveals precise micro-architecture in visual cortex. Nature. 433, 597-603. 215. Dehaene, S. and Naccache, L. (2001). Towards a cognitive neuroscience of consciousness: basic evidence and a workspace framework. Cognition. 79, 1-37. 216. Logothetis, N., Pauls, J., Augath, M., Trinath, T. and Oeltermann, A. (2001). Neurophysiological investigation of the basis of the fMRI signal. Nature. 412, 150157. 217. Grinvald, A. (2005). Imaging input and output dynamics of neocortical networks in vivo: exciting times ahead. Proc. Natl. Acad. Sci. U.S.A. 102, 14125-14126.
CHAPTER 10 NETWORKS OF THE IMMUNE SYSTEM
Robin E. Callard* and Jaroslav Stark†
*Immunobiology Unit, Institute of Child Health and CoMPLEX, University College London, UK
†CISBIC and Department of Mathematics, Imperial College London, UK
1. Introduction
The immune system is made up of a complex set of interacting cells and molecules that protect a living organism against invading microorganisms such as viruses, bacteria, fungi and various parasites. In doing so, the immune system must be able to distinguish between harmful invaders of this sort and commensal organisms such as the bacteria found on the skin and in the gut, which are tolerated and not attacked by the immune system. The immune system must therefore distinguish not just between self and non-self but also between harmful and not harmful. The diversity of potentially harmful pathogens also demands that the immune system has a range of protection mechanisms and that the most appropriate response is elicited to protect the host. These requirements have resulted in an extraordinarily complex system with many hundreds of different signalling molecules controlling the function of at least twenty different cell types. A great deal is known about the individual cells and molecules and their interactions but formal analysis of the immune system as a network or collection of network modules has hardly begun.
In vertebrates, the immune system is commonly considered as two subsystems: innate and adaptive. Of these, innate immunity is the more ancient and is found in one form or another in almost every living organism, including plants. It is characterised by both soluble and cell surface molecules known as Pattern Recognition Receptors (PRRs) that recognise and bind to Pathogen Associated Molecular Patterns (PAMPs) expressed by micro-organisms. The PRRs are encoded in the germ line and their specificity is therefore fixed (barring mutations) from one generation to another. The innate system responds rapidly to an invading pathogen but it has no memory. That is, the response is essentially the same and not heightened on repeated exposure to the same pathogen.
The adaptive system evolved more recently and is found only in vertebrates. It is characterised by the random generation of receptors for antigen during ontogeny that are expressed on T cells (T Cell Receptors) and B cells (B Cell Receptors). The receptors expressed on each individual T cell and B cell have the same amino acid sequence and specificity. Antibody (immunoglobulin) is the soluble form of the B cell receptor (BCR), which binds to and helps eliminate invading pathogens. Recognition of an invading pathogen by T cells and B cells bearing receptors specific for antigenic determinants on the pathogen results in proliferation and increased numbers of the T and B cells. This is the basis for the specificity and memory that are characteristic of the adaptive response.
It is a common mistake to consider the innate and adaptive immune responses of vertebrates to be independent. Many experimental immunologists focus their research on some aspect or other of the adaptive immune response and may ignore the contribution to the response made by the innate system, and vice versa. This error can and does lead to misunderstanding and/or wrong interpretations of the “immune system”.
Although the adaptive system evolved later than the innate system, it evolved on the background of a fully functional innate system, and functionally the two “systems” interact closely, sharing many of the same components. Invertebrates have no adaptive immune system but are nevertheless at least as successful in evolutionary terms as vertebrates, showing how effective the innate system can be. In this
chapter, the immune system will be considered as an integrated cell and molecular system that includes both innate and adaptive immunity. The immune system is a very complex collection of interconnected cells and molecules that may be separated in both space and time. Making the right response to an invading pathogen depends on integrating intracellular gene and signalling networks with communication networks between many different cells and cell types. This includes migration of signalling molecules to different compartments within the cell and migration of activated cells to specific anatomical sites where they interact with other cell types. The immune system can therefore be considered as a distributed network operating over different time scales in spatially distinct local microenvironments. These properties make the immune system very difficult to study experimentally and most modern immunology concentrates on characterising the properties of individual cells and molecules rather than the system as a whole. As a result, many important questions about the immune system remain unanswered. These include:
• How does the immune system distinguish between self and non-self?
• How does the immune system distinguish between harmful and useful micro-organisms (i.e. when to attack and when not to attack)?
• How does the immune system make the most appropriate response to an invading pathogen?
• How do populations of cells reach a consensus response, e.g. what is the relationship between single cell responses and responses of populations?
• Why do we still not understand the mechanisms of immunological diseases such as allergy and autoimmune disease?
• How can we predict the outcome of manipulation of the immune system for therapy? (still largely trial and error)
Answers to these questions will probably depend on a systems biology (immunology) approach. One way of doing this is to consider the
immune system as a network of interacting molecules and cells and apply techniques for analysing networks to the problem. The description of the immune network that follows does not (and cannot) at this stage address these issues, but a systems (network) immunology approach may be required to solve them. In this chapter we will consider whether or not the immune system can be usefully considered as a network and if so what network analyses could be used to gain deeper understanding of how it works. To do this, the components that make up the immune system and the nature of their connections should first be considered. We will not discuss immune idiotypic networks as these have little relevance to normal immune regulation and do not take into account the well characterized molecular and cellular interactions that control the immune response. Only the immune response to an invading micro-organism will be considered here and not ontogeny of the cells and tissues that make up the immune system.
2. Outline of the Biology of the Immune Response
Before considering the molecular and cellular networks of the immune system itself, it is appropriate first to outline the processes that underlie a typical immune response to an invading pathogen (Fig. 1). The usual site of entry for an invading micro-organism or pathogen is the skin, gut or mucosal surfaces. Once in a suitable site, it then sets about obtaining nutrition from the host and replicating. Whatever the micro-organism, its presence and activities will normally be detected by the host's innate immune system, which will mount a defence response.
Recognition of PAMPs on the pathogen by PRRs activates the innate immune system with the release of inflammatory mediators including cytokinesᵃ such as IL1 (interleukin-1) and TNFα (tumour necrosis factor α) as well as the migration of cells of the innate system (monocytes, dendritic cells, natural killer cells, polymorphonuclear cells, neutrophils, granulocytes etc) to the site of infection. Together, these components mount a formidable defence against the invading pathogen. But the innate response also has a role in activating the adaptive immune system. At the site of infection, the pathogen will come into contact with local professional antigen presenting cells such as dendritic cells (DC) and/or (in the skin) Langerhans cells. These highly specialized cells function at the interface between the innate and adaptive immune systems. Binding of PRRs expressed by professional antigen presenting cells (DC) to PAMPs on the pathogen causes the DCs to internalize the pathogen and break it up into small peptide fragments, which are then re-expressed on the DC surface in association with MHCᵇ class II molecules. In addition, dendritic cells activated by pathogens express surface activation molecules, secrete cytokines and migrate from the local site of infection through the lymphatic system to draining lymph node(s) where they come into contact with T cells and present the antigen to them. Crucially, different patterns of surface molecule expression and cytokine production by activated dendritic cells will occur depending on the particular PAMPs expressed on the pathogen and on the cytokines and inflammatory mediators produced in the local microenvironment. In this way, pathogens bearing particular sets of PAMPs will activate dendritic cells in different ways giving rise to distinct T cell responses (1). In theory at least, this interaction may also allow for feedback between the type of response elicited by a particular set of PAMPs and killing of the pathogen expressing the PAMPs, which may be important for selecting the best response for eliminating a particular pathogen (2).
ᵃ Cytokines are small soluble proteins that convey signals from one cell to another. They are sometimes known as messenger molecules.
Figure 1. Immune response to an invading micro-organism. A pathogen such as a virus or bacterium gains access to the host, usually through the skin or mucosal surface. The innate immune system will normally recognize the presence of the invader and initiate a defence response including activation of dendritic cells. The dendritic cells internalize the pathogen and break it up into antigenic fragments. They then migrate from the site of infection through the lymphatic system to draining lymph nodes where the antigenic fragments derived from the invading pathogen are displayed in association with MHC on the surface of the dendritic cells. CD4 T cells are activated by antigen fragments presented in association with MHC class II and CD8 T cells are activated by antigen fragments presented in association with MHC class I. In addition, the strength and type of T cell response elicited (e.g. Th1 and Th2) depends on the array of surface activation molecules expressed by the dendritic cells and production of cytokines. The activated T cells then proliferate and differentiate into effector T cells (CD4 Th1 and Th2 helper cells, CD8 cytotoxic cells, T regulatory cells etc.) and migrate to specific local microenvironments. For example, Th cells will migrate to B cell areas in the lymph node and spleen to provide help for B cell proliferation and differentiation into switched high affinity IgG, IgA or IgE antibody producing cells. The activated B cells then migrate into specific local microenvironments in spleen, bone marrow etc as antibody secreting plasma cells.
Once activated by antigen on dendritic cells, T cells proliferate and differentiate into effector cells such as CD8ᶜ cytotoxic T cells (Tc) which target and kill virally infected cells, CD4 T cells that mediate cellular immune responses such as delayed type hypersensitivity and CD4 helper T cells (Th) that migrate to B cell areas in the lymph node and stimulate antigen activated B cells to proliferate and differentiate into antibody producing cells. In each case, the combined network of signals received by T cells determines the expression of T cell surface activation molecules and
production of cytokines characteristic of each effector cell. For example, Th cells express CD40ᵈ ligand (CD40L) and produce cytokines (e.g. IL4) that control B cell differentiation and Ig class switching to ensure the appropriate antibodies (IgG, IgA, IgE etc) are produced to the invading pathogen (3-6).
ᵇ The MHC (major histocompatibility complex) is a complex of cell surface proteins that bind to antigen peptides derived from foreign proteins for T cells to respond to. They come in two forms: class I, which are expressed on all cells of the body and are recognised by CD8+ cytotoxic T cells, and class II, which are only expressed by “professional” antigen presenting cells such as dendritic cells for presentation of antigen to CD4+ helper T cells.
ᶜ CD4 and CD8 are cell surface antigens found on T cells that recognize peptide on MHC class II and class I respectively.
ᵈ The CD nomenclature refers to different surface proteins expressed on white blood cells (leucocytes) and other cells. CD40L is found on activated T cells and is important for T cell interactions with antigen presenting cells and B cells, both of which express the complementary receptor CD40.
3. The Components of the Immune System
The human body has been estimated to have 10¹⁴ cells of about 200 cell types with 25,000 genes. The number of transcription factors, kinases, phosphatasesᵉ and receptors encoded by the human genome is estimated to be approximately 1850, 518, 150, and 1543 respectively. By taking into account splice variants and post-translational modification, it can be estimated that there may be 30,864, 10,360 and 3,000 distinct receptor, kinase and phosphatase states respectively, each of which may have different properties in the network (7). In addition to the large number of different protein states, network interactions between the different moieties allow for even greater combinatorial control.
ᵉ Kinases and phosphatases are enzymes in the cell that contribute to intracellular signalling pathways.
The immune network includes many different components and cell types making up about 10% of all cells in the human body. Each cell has been estimated to have a maximum of 40 different receptors of about 10⁵ proteins each. Of course, the immune system does not function in isolation but interacts with virtually all the different cells that make up the human body. The number of potential connections in the immune network within cells and between cells is therefore very large indeed. The immune system can be considered as a hierarchy of networks at different levels, including genes, signalling molecules, cells and organs. These act together in a coordinated fashion to defend the host against infection. The first step in analysing the immune network is to identify
the different components involved. These can be considered at five different levels of organisation:
• Gene networks within each cell. These consist of genes and the transcription factors that bind to regulatory sequences on the gene and control gene expression (synthesis of RNA). Each transcription factor is itself a gene product (protein) under the control of another set of transcription factors and subject to activation by intracellular signalling events.
• Intracellular signalling networks. These consist of networks of proteins and other small molecules that link receptors on the cell surface to gene regulatory events.
• Cell networks. These comprise T cells and B cells of the adaptive system, dendritic cells, monocytes, macrophages, natural killer cells and others of the innate system, and many other cells that may not always be considered as part of the immune system such as endothelial cells, stromal cells, and even nerve cells. All these cells communicate with each other through cell surface molecules and soluble mediators such as cytokines, and other low molecular weight molecules.
• Local microenvironments within tissues and organs also make up part of the immune network. These consist of spatially defined specialised areas such as germinal centres in lymph nodes where B cells undergo rapid division, affinity maturation and class switching or the thymic medullaᶠ where developing T cells are selected by their affinity for self MHC.
• Organs (bone marrow, spleen, thymus, lymph nodes). Although these may not always be thought of as a network, their function is controlled and modified by communication with other tissues/organs via cell traffic, small molecules acting at a distance and also the nervous system. For instance, lymphoblastoid B cells migrate from lymph nodes after they have undergone affinity maturationᵍ and Ig class switchingʰ to the bone marrow where they receive other signals from the bone marrow stroma to differentiate into antibody producing plasma cells (8).
ᶠ Inner part of the thymus where T cells are generated from pre T stem cells derived from the bone marrow.
ᵍ Affinity maturation is the increase in affinity of antibody for antigen during the immune response. It occurs by somatic mutation of the antibody molecule and selection for antibody with high affinity. The process is T helper cell dependent.
ʰ Ig class switching is the process by which the class and subclass of antibody (IgG 1-4, IgA, IgE) produced by the antibody-forming cell is determined. Initially the antibody made by B cells is IgM and this switches to the other isotypes in response to specific cell surface and cytokine signals from T helper cells.
The different levels of organisation within the immune network are separated from each other by time and space (Fig. 2).
[Figure 2 diagram labels: Networks between organized tissues and/or local microenvironments; Intercellular networks; Intracellular signalling networks; Gene networks]
Figure 2. Levels of immune network organisation. The different gene, signalling, intercellular and tissue networks are spatially compartmentalized in the cell nucleus, cytoplasm, local microenvironments of lymphoid tissues and organs respectively. Integration demands communication between network components at each level and vertical integration between the networks at different levels. For example, migration and organisation of T and B cells within compartmentalized local microenvironments of lymph nodes depends on communication between gene networks, intracellular signalling networks, and the network of interacting lymphocytes, follicular dendritic cells and stromal cells (9-12).
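As a rough check on the combinatorial estimates quoted earlier in this section (ref. 7), the short Python sketch below divides the estimated number of protein states by the number of encoding genes for each class. The variable and function names are illustrative only, and the multiplier of roughly 20 states per gene that comes out is an inference from the quoted figures, not a number stated in the text.

```python
# Counts quoted in the text (ref. 7): genes encoding each protein class, and
# the estimated number of distinct states once splice variants and
# post-translational modification are taken into account.
GENES = {"receptor": 1543, "kinase": 518, "phosphatase": 150}
STATES = {"receptor": 30_864, "kinase": 10_360, "phosphatase": 3_000}


def states_per_gene(genes, states):
    """Return the average number of distinct states per encoding gene."""
    return {name: states[name] / genes[name] for name in genes}


ratios = states_per_gene(GENES, STATES)
for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.1f} states per gene")
```

On these figures each class works out at about 20 states per gene, which suggests that a single common multiplier may lie behind all three estimates.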
There are significant differences in scale between the different levels of network organisation. For example, interactions between tissues/organs through cell migration or release of cytokines are on time scales of minutes to hours and over distances ranging from millimetres to about a metre in humans, whereas intercellular networks within a local microenvironment work on time scales of seconds to minutes over distances of a few millimetres. The signalling and gene networks within cells may operate on time scales of less than a second to several hours if migration of molecules from cytoplasm to nucleus is required. These differences in time scales and distances need to be taken into account when constructing models of immune networks.
4. The Architecture of the Immune Network
At each level, connections between the network components are mediated mostly by proteins that bind to other proteins, nucleic acids (DNA) and small signalling molecules such as cyclic AMP. To give an idea of how various components (nodes) of the immune system communicate within and between networks, examples at each level of the immune system are considered:
4.1. Gene Networks
A good example of the complexity involved in gene regulation can be seen in the control of IL4 transcription (13). Regulation of IL4 involves more than 13 transcription factors (TF) binding to the IL4 promoter and a further 7 on the DNA enhancer region of the IL4 gene (Fig. 3). All TFs are proteins and therefore subject to transcriptional control by another set of TFs (see Chapter 4). In this way they can be seen to make up a gene regulatory network. Some of the regulatory proteins binding to the promoter and enhancer regions such as STAT4 are activated by signalling pathways (see below) and move from cytoplasm to nucleus when activated whereas others are under transcriptional control only.
Figure 3. Transcriptional regulation of IL4 production. Adapted from Li-Weber and Krammer (2003) (13). NFκB activated in the cytoplasm of T cells (see Fig. 4) migrates to the nucleus, binds to the IL4 promoter and increases IL4 transcription. NF-IL6 is activated by phosphorylation through many different signalling pathways and regulated by binding to other proteins such as the C/EBP family. GATA3 is produced in response to Th2 signals such as STAT6 activated by IL4 binding to the IL4 receptor (positive feedback) and is itself under control of other transcription factors. NFAT and AP1 are activated by TCR engagement and are required for IL4 production. The different TFs are themselves subject to transcriptional control by another set of TFs and constitute the gene network that controls IL4 production.
4.2. Intracellular Signalling Networks

Closely linked to gene networks are the intracellular signalling networks that connect the internal cell response to the extracellular environment. Typically, signals from the extracellular milieu and/or neighbouring cells are detected when soluble mediators (cytokines) such as TNF, or ligands expressed on the surface of adjacent cells, bind to receptors on the cell surface. Binding of ligands to specific surface receptors triggers a complex set of intracellular events in a signalling network that controls gene expression and cell function. An example of a signalling network crucial for the immune system is NFκB activation (Fig. 4). NFκB activates a set of other gene networks important in inflammation, innate and adaptive immunity and control of apoptosis (14).

4.3. Intercellular Signalling Networks

An immune response involves complex interactions between various cell types that communicate with each other by binding of cell surface molecules to specific receptors on the surface of other cells (cell contact) and/or by release of soluble messenger molecules (cytokines), which
Figure 4. The NFκB signalling network. The IκB/NFκB signalling molecule is kept inactive in the cytoplasm by three IκB isoforms. Cell stimulation, for example by TNF binding to its receptor, activates the IKK complex to form activated IKK (IKKa), resulting in phosphorylation and degradation of IκB proteins. The liberated NFκB then translocates to the nucleus where it binds to promoter sites on IκBα, A20 and other genes important in inflammation, apoptosis and immune activation. Newly synthesised IκBα enters the nucleus and binds to NFκB, forming a negative feedback loop. The complex then moves from the nucleus back into the cytoplasm, and newly synthesised A20 inactivates IKKa to form inactive IKK (IKKi). The dynamics of this network result in oscillations of NFκB between the cytoplasm and nucleus, which have been shown experimentally (15,16).
activate target cells (at a distance) by binding to cytokine receptors also expressed on the cell surface. More than two hundred different surface receptors and ligands have now been identified and any one cell may express 40 or more distinct surface signalling molecules (17-19). Signalling at a distance has received particular attention in the modelling community (20) and has led to the concept of the cytokine network. This is an extremely complex nonlinear system with many components and examples of both positive and negative feedback. It is
integral to the behaviour of the immune system, enabling cell communication at a distance, i.e. by cells that are not in contact. The distance may be very small, as in local microenvironments, or considerable, as in the case of cytokines circulating in blood. The cytokine network should not be considered in isolation, independent of the many other signalling molecules essential for cell communication, especially cell-surface receptor-ligand interactions. This is particularly important because many cell surface interactions modify both production of and response to cytokines. For example, B cell or T cell responses to IL2 depend on expression of the IL2 receptor subunits that bind IL2 (IL2Rα/CD25, which binds IL2 but does not signal, and IL2Rβ/CD122, which binds IL2 and signals) together with the non-binding signalling subunit (IL2Rγ). The functional IL2R is a trimer of all three subunits and is not expressed on resting T cells or B cells but is induced on activation, for example by ligand binding to the antigen receptor (TCR or BCR). The cytokine network should therefore be modelled as only one part of the communication mechanisms in the intercellular networks discussed above.

An example of the complexity of these interactions is illustrated by T cell collaboration with B cells required for antibody production, immunoglobulin isotype switching and affinity maturation (Fig. 5). Effective collaboration between T cells and B cells involves a signalling network combining more than 20 different receptor, surface ligand and cytokine interactions that link intracellular networks with extracellular events (21). In this particular example, antigen presentation by B cells to T helper cells induces surface expression of CD40 ligand (CD40L) on T cells over several hours. CD40L on activated T cells then binds to CD40 on B cells. B cell activation by CD40 is required for isotype switching and affinity maturation.

4.4. Networks of Microenvironments

The immune system is a highly distributed system of semi-autonomous functional modules partitioned in defined local microenvironments. Each module can be considered as a distinct cell network (Fig. 6). For example, Th1 and Th2 differentiation by T cells activated by antigen
Figure 5. Dialogue between T helper cells and B cells. Binding of T cell receptor to antigen peptide associated with MHC class II on B cells leads to T cell activation and expression of CD40L. At the same time, signalling through B cell MHC class II leads to expression of CD80. The appearance of CD40L and CD80 on the cell surface takes several hours. These ligands then bind to their receptors (CD40 on B cells and CD28 on T cells) amplifying the initial activation event leading to B cell and T cell proliferation. In addition, binding of CD80 to CD28 induces cytokine production by T cells needed for immunoglobulin class switching. CD40 signalling is needed for B cell affinity maturation and also for class switching. Many other receptor ligand interactions between T cells and B cells are also involved but are not illustrated here.
presenting dendritic cells in T dependent areas of spleen and lymph nodes (22-25), and T cell collaboration with B cells for antibody production, isotype switching and affinity maturation in germinal centres (26-29), each take place within a distinct module. Examples of local microenvironments are the distinct B cell and T cell areas in lymph nodes and spleen; thymic medulla and cortex; light and dark zones of germinal centres; and the marginal zone in the spleen. Distinct cell interaction networks are contained in each local microenvironment, and the structure of each microenvironment itself depends on networks of cell interactions for its formation. For example,
the cytokine TNF is required for formation of B cell follicles in the lymph node and subsequent germinal centre formation depends on migration of newly activated B cells into the follicle in response to chemokine and cell interaction signals (30,11).
Figure 6. Network compartments/modules of the immune system. The immune system is compartmentalized into cell network modules that perform different functions at different times. Shown here are networks between dendritic cells, epithelial cells and monocytes at sites of infection in skin or mucosal surface that result in activation and migration of dendritic cells; between T cells, dendritic cells and stromal cells in T dependent areas of lymph nodes; between T cells, B cells and follicular dendritic cells in germinal centres of lymph nodes; and between B cell blasts and stromal cells in bone marrow for differentiation of antibody producing plasma cells. Communication between these network modules is by chemokine directed cell migration and by circulation of cells and soluble mediators (cytokines).
The local microenvironments are connected by migration of cells from one compartment to another in response to chemokines (31,28,32) and by continuous circulation of lymphocytes and other immunocytes (monocytes, natural killer cells, dendritic cells etc) between interstitial spaces, the lymphatic system and blood. Communication between networks in particular local microenvironments can also be mediated by cytokines, hormones and the nervous system (33-36). Network modelling
of the complete immune system clearly needs to take into account this modular structure and the different time scales of cell migration and circulation between the different compartments.

5. Integration Between Different Levels

There are large differences in time scales, from less than 10⁻¹ s for events such as protein modification and changes in Ca²⁺ concentration, through minutes, hours or even days for responses such as gene transcription, cell migration, proliferation, differentiation and apoptosis. A systematic approach to the modelling and mathematical analysis of the immune network therefore requires integration of nonlinear events that occur over very different spatio-temporal scales. Moreover, although the focus in this chapter has been on biochemical events in intracellular and extracellular signalling, other processes, including coupling between mechanical and biochemical events such as in the cytoskeleton, are also important and present major challenges (37,38).

The importance of integrating intracellular signalling networks that control the behaviour of an individual cell with signalling between cells in a population is particularly well illustrated in Th1 and Th2 differentiation. The decision by the immune system to generate Th1 or Th2 cells in response to antigen can be a matter of life or death. Th1 cells are defined by production of interferon-γ and are required for immunity to intracellular infections such as viruses and mycobacteria. Failure to generate an appropriate Th1 response can result in fatal infection, whereas inappropriate Th1 responses can lead to autoimmune and inflammatory diseases. Th2 cells, on the other hand, are defined by the production of IL4 and are required for protection against some parasites such as nematode worms. Failure to make an appropriate Th2 response can lead to overwhelming parasitic infection, whereas inappropriate Th2 responses are responsible for allergic diseases including asthma and atopic dermatitis.
A number of models of Th1 and Th2 differentiation have been described. Most of these have concentrated on either the cross regulation of Th1 and Th2 cells by cytokines, particularly IL4 and IFNγ, or on intracellular transcriptional control of GATA-3 (Th2 responses) and T-bet (Th1 responses) (22,25). There is good evidence that the initial
decision to make a Th1 or Th2 response occurs when T cells interact with antigen presenting (dendritic) cells (1). In the past year, a crucial signal that determines Th1 or Th2 differentiation has been identified as Notch on T cells binding to either Delta (Th1) or Jagged (Th2) expressed on dendritic cells, depending on how the dendritic cells are activated by different pathogens (39). We have recently shown that the intracellular signalling network controlling GATA-3 (Th2) and T-bet (Th1) expression behaves as a bistable switch, so that Th1 and Th2 differentiation is mutually exclusive within any one cell. Long-term commitment to either differentiation pathway, as well as consensus within the population of T cells, depends however on integration of the intracellular GATA-3 and T-bet signalling network with the extracellular cytokine signalling network (25). These considerations show that integration of networks across scales is crucial for biological function, and future models of the immune system must take this into account.

6. Modelling Immune Networks

As the above description makes clear, developing network models of the immune system is a challenging task, which needs to be carried out at a variety of different levels. At the gene and intra-cellular signalling level there is little to distinguish cells in the immune system from other cell types. At this level, therefore, any of the approaches used to study intracellular networks in cell biology generally may be appropriate. Many of these are discussed elsewhere in this volume (see Chapters 4, 5 and 7). The greatest amount of progress in this area has been in bacteria and simple unicellular eukaryotes such as yeast. Biologically useful models of intra-cellular networks for multicellular organisms are much rarer and typically much less comprehensive.
As an example, although good models exist of the cell cycle in yeast (40), only parts of the cycle have been modelled in mammalian cells and there is no comprehensive model of the whole cycle in that case. Cell proliferation is a fundamental response of many immune cell types (e.g. T- and B-cells proliferate rapidly when activated), emphasizing the problem of modelling immune networks at this level. Turning to immune specific intra-cellular networks, such as T-cell receptor recognition, and the downstream
signalling cascade, whilst many of the key molecular components have been identified, there is insufficient information about their connections and their interactions to warrant comprehensive network models. Typically therefore such models incorporate only a handful of components (e.g. ref. 41,22,25) although one or two more complex network models are beginning to appear (e.g. ref. 42). These models have provided insight into how affinity of antigen binding can determine T cell receptor activation and indicate the existence of bistable switches in T cell activation and in Th1 versus Th2 differentiation. A detailed discussion of how models of the immune system can illuminate the biology can be found elsewhere (43,25,44).

A further problem at this level is that most experimental data is measured from samples of a population of cells, containing as many as 10⁶ (or more) cells. Measured values are thus some poorly defined average over many cells. However, models implicitly describe the behaviour of a signalling network within a single cell. Unfortunately, the averaged behaviour of many cells may not correspond to the behaviour of any particular instance of a single cell. This is often ignored, with the consequence that comparisons between computational models and real data are of dubious validity. This problem is not unique to the immune system, but is particularly problematic there due to the rich heterogeneity of cell types and cell behaviours. There are two possible approaches to overcoming this. In one, modelling can be restricted to experimental data from single cells. With modern imaging techniques this is becoming increasingly possible, and a number of excellent examples of modelling intra-cellular networks based on this now exist (45,16). The alternative is to explicitly model a whole population of cells, each of which contains an intra-cellular network in a different state, and possibly with different parameters.
Whilst constructing such a model is in itself not difficult, there are few if any tools for its subsequent analysis, and even fitting such a model to experimental data is potentially challenging. The only model of such a population of intra-cellular networks that we are aware of is the state-structured Th1/Th2 model described above (25). This in fact goes somewhat further and also incorporates cytokine signalling between cells. On the other hand, the intra-cellular network incorporated in the
model is extremely simplistic, with only two explicitly modelled transcription factors. Nor is any attempt made in this paper to fit such a model to experimental data. We believe that the development of such “population of networks” models (Fig. 7) is a very important, but extremely difficult, task for theoretical immunology in the future.
Figure 7. In order to model most experimental data collected from populations of cells it is not sufficient to describe a single intra-cellular network. Rather, the correct framework must somehow describe a population of networks, each of which is potentially in a different state. It may also be necessary to allow each instance of the network to have different parameters.
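The gap between a single-cell network and a population of such networks can be made concrete with a small sketch. The code below is not the state-structured model of ref. 25. It assumes a generic, symmetric mutual-inhibition switch between two hypothetical factors x and y (standing in loosely for T-bet and GATA-3), with illustrative parameter values (a = 4, Hill exponent n = 2) chosen only because they give two stable states. Each cell is assumed to differ only in its initial state, and the function names are introduced here for illustration.

```python
import random

def step(x, y, a=4.0, n=2, dt=0.02):
    """One explicit-Euler step of a symmetric mutual-inhibition switch.

    x and y are hypothetical, dimensionless factor levels; a and n are
    illustrative values, not parameters taken from the chapter.
    """
    dx = a / (1.0 + y ** n) - x   # x is produced unless repressed by y
    dy = a / (1.0 + x ** n) - y   # y is produced unless repressed by x
    return x + dt * dx, y + dt * dy

def simulate_cell(x0, y0, steps=3000):
    """Integrate one cell's network from its own initial state."""
    x, y = x0, y0
    for _ in range(steps):
        x, y = step(x, y)
    return x, y

def simulate_population(n_cells=100, n_biased_to_x=60, seed=0):
    """Population of identical networks that differ only in initial state."""
    rng = random.Random(seed)
    finals = []
    for i in range(n_cells):
        bias = rng.uniform(0.05, 0.5)      # small initial asymmetry
        if i >= n_biased_to_x:
            bias = -bias                   # remaining cells lean towards y
        finals.append(simulate_cell(1.0 + bias / 2, 1.0 - bias / 2))
    return finals

finals = simulate_population()
mean_x = sum(x for x, _ in finals) / len(finals)
n_high = sum(1 for x, _ in finals if x > 2.0)
print("cells with high x:", n_high, "of", len(finals))
print("population mean of x:", round(mean_x, 2))
```

Under these assumptions every simulated cell settles into one of two states (x high and y low, or the reverse), whereas the population mean of x lies between the two and is not the state of any individual cell. This is the sense in which a bulk measurement can fail to describe any single instance of the network.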
Modelling immune networks becomes even more challenging when we move to interactions between cells and/or microenvironments. The difficulty is that at this level cells move between different locations, and their behaviour depends on location, internal state and external signal. This makes it potentially difficult to describe the network of interactions, and it is not even clear that current mathematical concepts of networks are adequate to describe the immune system at this level. Most existing models at this level assume that the populations of cells, and signalling molecules, are “well mixed”. As a result, the rate at which they interact is usually assumed to be proportional to their concentrations. This leads to a network model where the nodes represent overall concentrations of the different components, and the links in the network represent interactions. A typical example is the model of Th1/Th2 differentiation in Fig. 8. As such models become more detailed in response to the growing availability of in vivo data, it becomes apparent that the various interactions do not necessarily occur in the same spatial location. Furthermore some of the signalling molecules may be produced in one
location, but act at another, and some of the cells may undergo transformations, then migrate to a different location where they perform their intended action. Having the right cell type at the wrong location may thus be as problematic as having the wrong cell type altogether. An example is the presence of activated T cells in the joints in rheumatoid arthritis.
Figure 8. Network models such as this one of Th1/Th2 differentiation typically assume a well mixed population of cells and signals. Adapted from Yates et al. (25).
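The well mixed assumption described above, in which interaction rates are proportional to concentrations, can be sketched as a small mass-action network. The example is a hypothetical one, not the model of Fig. 8: a cytokine binds reversibly to a receptor to form a complex, with both rate constants set to 1 in arbitrary units. The species names, rate constants and helper functions are assumptions made for illustration.

```python
def mass_action_rates(conc, reactions):
    """Net rate of change of every species under mass-action kinetics.

    Each reaction is (reactants, products, k); its flux is k times the
    product of the reactant concentrations (the well mixed assumption).
    """
    rates = {name: 0.0 for name in conc}
    for reactants, products, k in reactions:
        flux = k
        for name in reactants:
            flux *= conc[name]
        for name in reactants:
            rates[name] -= flux
        for name in products:
            rates[name] += flux
    return rates

def integrate(conc, reactions, dt=0.01, steps=2000):
    """Explicit-Euler integration of the network; returns the final state."""
    conc = dict(conc)
    for _ in range(steps):
        rates = mass_action_rates(conc, reactions)
        for name in conc:
            conc[name] += dt * rates[name]
    return conc

# Hypothetical two-reaction network: binding and unbinding.
reactions = [
    (("cytokine", "receptor"), ("complex",), 1.0),   # k_on (assumed)
    (("complex",), ("cytokine", "receptor"), 1.0),   # k_off (assumed)
]
initial = {"cytokine": 1.0, "receptor": 1.0, "complex": 0.0}
final = integrate(initial, reactions)
print({name: round(value, 3) for name, value in final.items()})
```

Here the nodes are overall concentrations and the links are the two reactions. Nothing in the state records where a molecule is, which is exactly the information that is lost when production and action occur in different locations.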
In order to model such situations, it becomes necessary to incorporate spatial position into the mathematical description of the network. This requires careful consideration of what the nodes of such a network represent. The most obvious approach is to divide the system into a number of discrete spatial compartments. A simple example would be blood circulation and lymphatic circulation. Each cell type (and if necessary each molecular signal) is then duplicated to give a node for each combination of cell type and position. For example, instead of having a single network node to represent Th1 cells, we would have two: one to represent Th1 cells in the bloodstream and one to represent Th1 cells in the lymph system. Good examples are the models developed by Kirschner et al. to describe changes in homing behaviour by T-cells due to HIV infection (46,47). Other good examples outside immunology with
a strong network theme include models of pattern formation in Drosophila (48). The drawback with such an approach is that as the number of cell types/signalling molecules and of spatial compartments grows, the total number of nodes in the network can quickly become difficult to manage, and any analysis of the resulting network can become impossible. In particular, if the well mixed network has N nodes and there are M spatial compartments, the compartmental version of the network has N × M nodes. Determining the links between such nodes and estimating kinetic parameters is likely to become impractical for even moderate sized systems.

This problem becomes even more acute if the spatial location needs to be described by continuous variables, as for instance in models of germinal centres (e.g. ref. 28,49). A variety of models, using techniques ranging from partial differential equations to cellular automata, have been developed, but none has a particularly network oriented approach, and indeed it is difficult to envisage how network concepts can usefully be incorporated in such models. Clearly, in the future it will be necessary to devise more structured modelling frameworks which make better use of the spatial and physiological relationships between nodes in such spatially distributed systems. Currently, the mathematical and computational tools to do this are not available, and this would therefore seem to be a very fruitful area for further research. Of course it is not certain that the mathematical structures that are eventually developed will even merit the name of networks.

References
1. Kapsenberg, M.L. (2003). Dendritic-cell control of pathogen-driven T-cell polarization. Nat Rev Immunol. 3, 984-993.
2. Kalinski, P. and Moser, M. (2005). Consensual immunity: success-driven development of T-helper-1 and T-helper-2 responses. Nat Rev Immunol. 5, 251-260.
3. Coffman, R.L., Lebman, D.A. and Rothman, P. (1993). Mechanism and regulation of immunoglobulin isotype switching. Adv Immunol. 54, 229-270.
4. Fuleihan, R., Ramesh, N. and Geha, R.S. (1993). Role of CD40-CD40 ligand interaction in Ig-isotype switching. Curr Opin Immunol. 5, 963-967.
5. Snapper, C.M. and Mond, J.J. (1993). Towards a comprehensive view of immunoglobulin class switching. Immunol Today. 14, 15-17.
6. Zhang, K. (2003). Accessibility control and machinery of immunoglobulin class switch recombination. J Leukoc Biol. 73, 323-332.
7. Papin, J.A., Hunter, T., Palsson, B.O. and Subramaniam, S. (2005). Reconstruction of cellular signalling networks and analysis of their properties. Nat Rev Mol Cell Biol. 6, 99-111.
8. Shapiro-Shelef, M. and Calame, K. (2005). Regulation of plasma-cell development. Nat Rev Immunol. 5, 230-242.
9. Banks, T.A., Rouse, B.T., Kerley, M.K., Blair, P.J., Godfrey, V.L., Kuklin, N.A., Bouley, D.M., Thomas, J., Kanangat, S. and Mucenski, M.L. (1995). Lymphotoxin-alpha deficient mice: effects on secondary lymphoid development and humoral immune responsiveness. J Immunol. 155, 1685-1693.
10. Manser, T. (2004). Textbook germinal centers? J Immunol. 172, 3369-3375.
11. Matsumoto, M., Mariathasan, S., Nahm, M.H., Baranyay, F., Peschon, J.J. and Chaplin, D.D. (1996). Role of lymphotoxin and the type I TNF receptor in the formation of germinal centres. Science. 271, 1289-1291.
12. von Andrian, U.H. and Mempel, T.R. (2003). Homing and cellular traffic in lymph nodes. Nat Rev Immunol. 3, 867-878.
13. Li-Weber, M. and Krammer, P.H. (2003). Regulation of IL4 gene expression by T cells and therapeutic perspectives. Nat Rev Immunol. 3, 534-543.
14. Tian, B. and Brasier, A.R. (2003). Identification of a nuclear factor kappa B-dependent gene network. Recent Prog Horm Res. 58, 95-130.
15. Lipniacki, T., Paszek, P., Brasier, A.R., Luxon, B. and Kimmel, M. (2004). Mathematical model of NF-kappaB regulatory module. J Theor Biol. 228, 195-215.
16. Nelson, D.E., Ihekwaba, A.E., Elliott, M., Johnson, J.R., Gibney, C.A., Foreman, B.E., Nelson, G., See, V., Horton, C.A., Spiller, D.G. et al. (2004). Oscillations in NF-kappaB signaling control the dynamics of gene expression. Science. 306, 704-708.
17. Mason, D., Andre, P., Bensussan, A., Buckley, C., Civin, C., Clark, E., de Haas, M., Goyert, S., Hadam, M., Hart, D. et al. (2001). CD antigens 2001. Cell Immunol. 211, 81-85.
18. Zola, H., Swart, B., Boumsell, L. and Mason, D.Y. (2003). Human Leucocyte Differentiation Antigen nomenclature: update on CD nomenclature. Report of IUIS/WHO Subcommittee. J Immunol Methods. 275, 1-8.
19. Zola, H. and Swart, B.W. (2003). Human leucocyte differentiation antigens. Trends Immunol. 24, 353-354.
20. Callard, R.E., George, A.J.T. and Stark, J. (1999). Cytokines Chaos and Complexity. Immunity. 11, 507-513.
21. Clark, E.A. and Ledbetter, J.A. (1994). How B and T cells talk to each other. Nature. 367, 425-428.
22. Hofer, T., Nathansen, H., Lohning, M., Radbruch, A. and Heinrich, R. (2002). GATA-3 transcriptional imprinting in Th2 lymphocytes: a mathematical model. Proc Natl Acad Sci U.S.A. 99, 9364-9368.
23. Morel, P.A. and Oriss, T.B. (1998). Crossregulation between Th1 and Th2 cells. Crit Rev Immunol. 18, 275-303.
24. Callard, R.E. (2000). Cytokine-modulated regulation of helper T cell populations. J Theor Biol. 206, 539-560.
25. Yates, A.J., Callard, R.E. and Stark, J. (2004). Combining cytokine signalling with T-bet and GATA-3 regulation in Th1 and Th2 differentiation: a model for cellular decision-making. J Theor Biol. 231, 181-196.
26. Iber, D. and Maini, P.K. (2002). A mathematical model for germinal centre kinetics and affinity maturation. J Theor Biol. 219, 153-175.
27. Kepler, T.B. and Perelson, A.S. (1993). Cyclic re-entry of germinal center B cells and the efficiency of affinity maturation. Immunol Today. 14, 412-415.
28. Meyer-Hermann, M. (2002). A mathematical model for the germinal center morphology and affinity maturation. J Theor Biol. 216, 273-300.
29. Oprea, M., van Nimwegen, E. and Perelson, A.S. (2000). Dynamics of one-pass germinal center models: implications for affinity maturation. Bull Math Biol. 62, 121-153.
30. Cupedo, T. and Mebius, R.E. (2005). Cellular interactions in lymph node development. J Immunol. 174, 21-25.
31. Cyster, J.G. (1999). Chemokines and cell migration in secondary lymphoid organs. Science. 286, 2098-2102.
32. Moser, B. and Loetscher, P. (2001). Lymphocyte traffic control by chemokines. Nature Immunol. 2, 123-128.
33. Gefter, M. (1997). Effects of growth hormone and insulin-like growth factor I on T and B lymphocytes and immune function. Acta Paediatr Suppl. 423, 76-79.
34. Serafeim, A. and Gordon, J. (2001). The immune system gets nervous. Curr Opin Pharmacol. 1, 398-403.
35. Straub, R.H. (2004). Complexity of the bi-directional neuroimmune junction in the spleen. Trends Pharmacol Sci. 25, 640-646.
36. Wang, J., Whetsell, M. and Klein, J.R. (1997). Local hormone networks and intestinal T cell homeostasis. Science. 275, 1937-1939.
37. Ingber, D.E. (2003a). Tensegrity I. Cell structure and hierarchical systems biology. J Cell Sci. 116, 1157-1173.
38. Ingber, D.E. (2003b). Tensegrity II. How structural networks influence cellular information processing networks. J Cell Sci. 116, 1397-1408.
39. Amsen, D., Blander, J.M., Lee, G.R., Tanigaki, K., Honjo, T. and Flavell, R.A. (2004). Instruction of distinct CD4 T helper cell fates by different notch ligands on antigen-presenting cells. Cell. 117, 515-526.
40. Chen, K.C., Calzone, L., Csikasz-Nagy, A., Cross, F.R., Novak, B. and Tyson, J.J. (2004). Integrative analysis of cell cycle control in budding yeast. Mol Biol Cell. 15, 3841-3862.
41. Chan, C., George, A.J. and Stark, J. (2004). Feedback control of T cell receptor activation. Proc R Soc Lond B in press.
42. Li, Q.J., Dinner, A.R., Qi, S., Irvine, D.J., Huppa, J.B., Davis, M.M. and Chakraborty, A.K. (2004). CD4 enhances T cell sensitivity to antigen by coordinating Lck accumulation at the immunological synapse. Nature Immunol. 5, 791-799.
43. Yates, A.J., Chan, C.C.W., Callard, R.E., George, A.J.T. and Stark, J. (2001). An approach to modelling in immunology. Brief. Bioinform. 2, 245-257.
44. Callard, R.E. and Yates, A.J. (2005). Immunology and mathematics: crossing the divide. Immunology. 115, 21-33.
45. Dower, S.K. and Qwarnstrom, E.E. (2003). Signalling networks, inflammation and innate immunity. Biochem Soc Trans. 31, 1462-1471.
46. Bajaria, S.H., Webb, G., Cloyd, M. and Kirschner, D. (2002). Dynamics of naive and memory CD4+ T lymphocytes in HIV-1 disease progression. J Acquir Immune Defic Syndr. 30, 41-58.
47. Kirschner, D., Webb, G.F. and Cloyd, M. (2000). Model of HIV-1 disease progression based on virus-induced lymph node homing and homing-induced apoptosis of CD4+ lymphocytes. J Acquir Immune Defic Syndr. 24, 352-362.
48. von Dassow, G., Meir, E., Munro, E.M. and Odell, G.M. (2000). The segment polarity network is a robust developmental module. Nature. 406, 188-192.
49. Meyer-Hermann, M.E. and Maini, P.K. (2005). Cutting edge: back to "one-way" germinal centers. J Immunol. 174, 2489-2493.
CHAPTER 11 A HISTORY OF THE STUDY OF ECOLOGICAL NETWORKS
Louis-Félix Bersier Unit of Ecology and Evolution, Fribourg University, ch. du Musée 10, CH-1700 Fribourg, Switzerland
1. Introduction

Ecology is the science of how organisms interact with each other and with their environment. Given this definition, first proposed by Haeckel (1866) (see footnote a), a non-ecologist may suppose that the study of networks of interactions between species in ecosystems is a mature and well-established domain of ecology (1). It is not. A likely reason is the difficulty of documenting interactions: it is easy to observe organisms, but examining interactions of any kind between species is much more elusive. For this reason, the vast majority of studies in ecology and evolution deal with few interacting species. Fortunately, the study of ecological networks is currently enjoying a burst of interest and is promised a bright future: a paramount incentive for the strong development of network analysis is the urgent need to understand how ecosystems will react to global changes.

a The original definition is: "By ecology we mean the body of knowledge concerning the economy of nature - the investigation of the total relations of the animal both to its inorganic and its organic environment; including, above all, its amical and inimical relations with those animals and plants with which it comes directly or indirectly into contact - in a word, ecology is the study of all those complex interrelations referred to by Darwin as the conditions of the struggle for existence." (source: http://www.unijena.de/-page-364-lang-en.html)
In this chapter, I will review the major steps in the development of the study of ecological networks. I will discuss some hypotheses on the underlying processes behind observed patterns in network structure. Before going into the history and state-of-the-art of this research theme, it is first necessary to go through some definitions.

Intraspecific interactions (interactions between members of the same species) are of course essential for survival and reproduction. They are very rich and give rise to complex hierarchical patterns of structured interactions. Intraspecific interactions will not be tackled in this chapter, with the exception of cannibalistic interactions, and I will concentrate only on interspecific interactions. Also, I will not concentrate on just one pair of interacting species, but on communities.

A first difficulty lies in the definition of this term (2,3). Typically, the term community defines the set of all species living in a given location (4). If the physical conditions in this location are more or less homogeneous, it forms a biotope and the species living in it a biocoenosis. Different biotopes of the same kind can be interspersed geographically and linked by migration of the constituent species; they are structured in so-called metacommunities (5). The term community is sometimes used to describe a subset of a whole community. This is unfortunate since precise words exist, but which definition is pertinent can most often be deduced easily from the context. The smallest subset is a guild, a group of species using similar resources in a similar way (6). A group of taxonomically related species is a taxocene. This term is not widely used in the ecological literature despite the fact that most studies and theories on biodiversity are best suited to such subsets (e.g. 7-9). When concentrating on feeding interactions between species, one can define a food-web, i.e.
a group of species linked by such interactions, thus describing the paths by which biomass flows through the community. A commonly used subset of a food-web is the trophic level, which typically describes a subset of a community with similar feeding habits, e.g. herbivores (see ref. 10 for other definitions). Finally, it is worth mentioning the concept of ecosystem, which considers not only the interactions between species in a community, but also the interactions between species and their physical environment. In the following, we
A History of the Study of Ecological Networks
will tackle mostly whole communities or subsets of them described by the type of interactions linking the species. Classically, ecological interactions between species are classified according to their reciprocal effects (11). In this context, the effect of one species on another can be measured in terms of the consequences for growth rate, population size, or relative fitness. Relative fitness is the ratio of the growth rate of the species of interest in the presence of the interacting species to that in its absence. Fig. 1 illustrates the possible outcomes for a pair of species. Typically, ecologists have tackled only one type of interaction at a time. Very few studies merge different interactions at the community level – for example predation and mutualism in Melián (2005) (12).
Figure 1. Types of biological interactions between two species. A zero indicates an absence of measurable effect.
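The sign-based classification summarized in Fig. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the tolerance used to decide that an effect is "not measurable", and the label "neutralism" for the zero/zero case are hypothetical choices, not taken from the figure. Each effect is expressed as a relative fitness, i.e. the growth rate of a species in the presence of its partner divided by the growth rate in its absence.

```python
# Hypothetical sketch: classify a pairwise interaction from reciprocal effects.
# Keys are sorted sign pairs, so the order of the two species does not matter.
INTERACTION_TYPES = {
    ("+", "+"): "mutualism",
    ("+", "-"): "predation",      # one species gains, the other loses
    ("+", "0"): "commensalism",
    ("-", "-"): "competition",
    ("-", "0"): "amensalism",
    ("0", "0"): "neutralism",     # assumed label for "no measurable effect"
}


def effect_sign(relative_fitness, tolerance=0.05):
    """Return '+', '-' or '0' for a relative fitness value.

    relative_fitness = growth rate with the partner / growth rate without it.
    Values within `tolerance` of 1 are treated as no measurable effect
    (the tolerance is an arbitrary assumption for this sketch).
    """
    if relative_fitness > 1 + tolerance:
        return "+"
    if relative_fitness < 1 - tolerance:
        return "-"
    return "0"


def classify_interaction(fitness_a, fitness_b, tolerance=0.05):
    """Classify the interaction between species A and B.

    fitness_a: relative fitness of A in the presence of B.
    fitness_b: relative fitness of B in the presence of A.
    """
    signs = sorted((effect_sign(fitness_a, tolerance),
                    effect_sign(fitness_b, tolerance)))
    return INTERACTION_TYPES[tuple(signs)]


if __name__ == "__main__":
    print(classify_interaction(1.4, 0.6))   # consumer gains, prey loses
    print(classify_interaction(1.3, 1.2))   # both gain
    print(classify_interaction(1.0, 0.5))   # one unaffected, one harmed
```

Sorting the pair of signs keeps the lookup table small; a real analysis would of course need a statistical test rather than a fixed tolerance to decide whether an effect is measurable.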
When considering a group of species linked by one type of interaction, a useful representation of the topology of such an ecological network is a graph whose nodes (or vertices) are the members of the group, and whose links (or edges) are the interactions between them. Such a graph is undirected for mutualistic and competitive interactions, and directed for predation, amensalism, and commensalism. In the latter, edges have a direction and are sometimes called arcs. The same information can be captured in an adjacency matrix. This is a square binary matrix whose rows and columns are labelled by the members of the group ordered in a similar way. A zero in the matrix indicates an absence of interaction and a one an interaction between species in the corresponding row and column. Adjacency matrices are symmetric for undirected graphs but not for directed ones. For such non-symmetric matrices, it is necessary to indicate the status of the species. For trophic interactions, consumers are typically listed columnwise, and prey
rowwise, forming a food-web matrix $A = [a_{ij}]$, with $a_{ij} = 1$ if species j consumes species i, and 0 otherwise. There is little doubt that the interest and recognition of the importance of ecological networks was bolstered by Darwin (1859) (13) himself, when he described natural communities as an entangled bank: "It is interesting to contemplate an entangled bank, clothed with many plants of many kinds, with birds singing on the bushes, with various insects flitting about, and with worms crawling through the damp earth, and to reflect that these elaborately constructed forms, so different from each other, and dependent on each other in so complex a manner, have all been produced by laws acting around us." A description of these underlying laws is made more explicit when Darwin described the succession of forests: "When we look at the plants and bushes clothing an entangled bank, we are tempted to attribute their proportional numbers and kinds to what we call chance. But how false a view is this! Every one has heard that when an American forest is cut down, a very different vegetation springs up; but it has been observed that the trees now growing on the ancient Indian mounds, in the Southern United States, display the same beautiful diversity and proportion of kinds as in the surrounding virgin forests. What a struggle between the several kinds of trees must here have gone on during long centuries, each annually scattering its seeds by the thousand; what war between insect and insect - between insects, snails, and other animals with birds and beasts of prey - all striving to increase, and all feeding on each other or on the trees or their seeds and seedlings, or on the other plants which first clothed the ground and thus checked the growth of the trees! 
Throw up a handful of feathers, and all must fall to the ground according to definite laws; but how simple is this problem compared to the action and reaction of the innumerable plants and animals which have determined, in the course of centuries, the proportional numbers and kinds of trees now growing on the old Indian ruins!" Competitive and trophic interactions lie clearly at the heart of Darwin's concept of the entangled bank. But together with this admirable description comes the warning of the complexity of the task! And the discovery of underlying processes is not only a daunting undertaking; the
simple description of the interactions between the members of a community is already a difficult exercise. Moreover, competition is observationally more elusive than predation, and this is probably why ecologists have historically been more interested in ecological networks of trophic interactions – food-webs. We will now pass through some historically important cornerstones in the study of ecological networks. It is not in the scope of this Chapter to describe comprehensively the history of the study of ecological networks. Interested readers can refer to authoritative works by Hagen (1992) (14), Golley (1993) (15), or by Kingsland (1995) (16) for the history of dynamical models. My aim here is to review the principal contributions until the nineteen-seventies, and to pay more attention to the major routes of development of this discipline since this period. As will be apparent, the main body of research on the network of interactions between species in a community tackled the links between consumers and their prey. Network representations incorporating other kinds of interactions are much more recent.

2. The Pioneers

The first attempt to represent the trophic interactions in a community as a network was made by the Italian scientist Lorenzo Camerano (1880) (17). Earlier verbal descriptions of food-webs existed (e.g. ref. 18), but Camerano was apparently the first to link species in a diagrammatic manner (19). It must be pointed out that nodes in Camerano's food-web are actually functional groups (e.g. amphibians, carnivorous fish) rather than true species (Fig. 2). The motivation behind this study was to provide a theoretical background to a practical problem: which animals are useful and which are harmful to crops, one of the most debated issues at that time. Camerano's contribution is extremely interesting for several reasons. First, it shows that the idea that communities are at equilibrium was prevalent at that time. 
Being at equilibrium means that the densities of the plants and animals do not vary "significantly" above or below what is usually observed. Camerano had a deep feeling that this equilibrium was achieved through fine-tuned interactions between species. The idea that communities are dynamic through species' mutual
Figure 2. The first food-web graph reported in the literature, from Camerano (1880) (17). The graph was redrawn to improve readability. The structure is similar but species names were translated from Italian to English. (Adapted from Camerano, 1880) (17).
interactions was implicit in Darwin's (1859) (13) aforementioned description of forest succession, as well as in Möbius's (1877) (20) definition of the biocoenosis. It was already expressed explicitly by Edward Forbes (1843) (21) when he described the change in species abundance and identity in animal communities of the Aegean Sea. What is apparently new with Camerano is the notion of a balance of nature resulting in an equilibrium in the densities of the component species, which is naturally re-established after a perturbation due to the abnormal growth of a species. This concept of equilibrium communities has had a huge influence in ecology (22,23), and continues to be one of the big issues in this field (24,25). For example, the classical work of May (1974) on the stability of model ecosystems is based on the assumption of communities being at equilibrium (26). Second, the contribution of Camerano teaches us that the question of top-down or bottom-up control of communities was already a lively one
more than one hundred years ago. Top-down control refers to a situation where regulation occurs predominantly through predators. A classical example is the sea otter, which is able to prevent sea urchins from overgrazing seaweeds (27). If otters are removed from the system, sea urchins are able to completely deplete kelp forests. If this concept is correct, Camerano deduced that the removal of birds would be detrimental since they destroy insects which damage crops. Bottom-up control (or donor control) refers to a situation where prey (here plants) control the density of their consumers, while the consumers have little impact on their prey. In such bottom-up systems, birds would have little effect in destroying insects since the abundance of the latter is primarily determined by the availability of plant food, and not by their predators. Camerano apparently supported an intermediate position: while he assumed that animals develop in proportion to their available food, he also recognized that a perturbation in the number of carnivores could alter the abundance of herbivores and subsequently of plants. From this, it is apparent that Camerano foreshadowed the current debate on the importance of top-down versus bottom-up control of communities (e.g. ref. 28). He also cleverly introduced the idea of indirect effects cascading through a food chain. If species A is eaten by species B, and B is in turn eaten by species C, then C will have an indirect positive effect on A. This idea was extended to higher trophic levels. It anticipated the models of population dynamics applied to lake communities found in contemporary literature (e.g. ref. 29). Finally, Camerano noted that an articulate answer to the question of useful and harmful species could only be attained when considering the interactions inside the whole community, and not simply between plants and animals. From these elements, one would guess that Camerano has had a vivid legacy in ecological thinking. 
He has not. Camerano's work remained unknown to ecologists until very recently (30,31). For example, no citation of his work appears in an earlier contribution on the history of ecology (32), and the important concepts he developed had to be reinvented – a situation not uncommon in the course of scientific progress. In this vein, it is interesting to consider the work of Cajander, a Finnish ecologist who also conceived many modern concepts of ecology in the early years of the 20th century, but was also forgotten (33). However, while Cajander's thoughts gave a central
importance to species interactions in order to understand community organization, and notably why communities are at equilibrium, he apparently did not introduce a network representation.
Figure 3. The food-web of the boll weevil, from Pierce et al. (1912) (34). Direction of arrows is from consumer to prey; numbers and size of boxes refer to the number of species in the group. (Reprinted from Pierce et al. 1912 (34), with permission from USDA.)
The second network-oriented approach to an ecological problem only appeared 30 years later. It is due to Pierce, Cushman, and Hood (1912) (34) who, like Camerano, were interested in a practical
problem of agricultural relevance, the control of boll weevil populations, which caused substantial economic damage to the cotton industry. They attempted to describe the predatory and parasitic interactions of the boll weevil as a web of interactions (Fig. 3). Allee et al. (1949) reproduced this figure in their Chapter on the history of ecology because of its novelty and since it presaged much of the research of that time (32). It is interesting that one of the main conclusions of Pierce et al. is that the constellation of predators of the boll weevil has a high rank in the struggle against this pest, and that parasites could be introduced to improve its control. It is likely one of the first attempts to use biological control to enhance crop production. What is exemplary is the care in tackling this problem through an understanding of how player species are structured in a network of interactions. In the same epoch, a food-web graph was published by Shelford (1913), a pioneer of community ecology. It was however based on hypothetical data, and is thus not of great interest (35). For many, the father of food-web ecology is Charles Elton, who analyzed in some detail the trophic interactions among the species inhabiting Bear Island, during the 1921 University of Oxford expedition to Spitsbergen (Fig. 4; refs. 36,37). Elton's contribution to modern ecology was essential. He described important concepts like the pyramid of numbers – the number of individuals decreases in higher trophic levels – and the food cycle, which is exemplified in Fig. 4 – the diagram traces the flows of nitrogen in the Bear Island community. He gave a strong impetus to a move from a descriptive ecology to a functional one: it is not enough to tally species in an ecosystem, one must know what they do. The food cycle is an illustration of the role of species in a community. 
Elton also pointed out the importance of allochthonous inputs in Bear Island, with most of the nitrogen coming from the sea. The role of such imports for the understanding of food-web structure has recently received renewed interest from the late Gary Polis and collaborators (e.g. refs. 38-40). He also cleverly noted the importance of body size on food-web organization, with consumers tending to be larger than their prey, and the opposite for parasites. This apparently trivial observation has nontrivial consequences for food-web structure, as will be seen later. Finally, it is very important to point out that Elton, in line
Figure 4. Diagram of the nitrogen cycle on Bear Island, from Summerhayes and Elton (1923) (36). Dotted lines are links that were not observed but probable. (Reprinted from Summerhayes and Elton, 1923, with permission of Blackwell Science Ltd.) (36)
with contemporary thinking, viewed food cycles as the regulatory process responsible for the fact that communities remain at equilibrium. These three pioneering works all recognized the importance of studying the interactions between species in order to understand the organization and dynamics of communities. This was not in itself original in the ecological thinking of that time. The novelty lies in the diagrammatic approach, specifically treating the community as a network. It is interesting to note that there was a debate in the nineteen thirties about the existence of a "balance of nature". A widespread view among ecologists was that of a "divinely determined stability, orderliness and predictability in natural systems" (41) – the balance of nature. Such a vision was reinforced by influential ecologists like Clements, who considered communities as behaving like autonomous super-organisms (42). The opposite view was taken by Gleason (1926) (43), who considered communities to be structured more by chance immigration events and individual selection. In the same line, Elton was also opposed to the concept of a balance of nature. He assembled data on mammals
showing that their abundances constantly vary, with irregular amplitudes and periods (44). However, he was obviously not opposed to the fact that communities may reach some stable state. Indeed, Elton is famous for his list of empirical and theoretical arguments in favour of the positive relationship between complexity and stability in ecological systems (45). That more complex systems – in terms of number of species, and number and strength of interactions – should be more stable than simpler ones was a paradigmatic view among ecologists in the fifties and sixties. May's (1974) challenge to this paradigm has triggered a bloom of new studies on ecological networks, as we will soon see; but we first go back to the first half of the twentieth century (26). There were apparently few studies of whole communities that used a network approach after Summerhayes and Elton's (1923) (36) description of the Bear Island food-web (e.g. refs. 46,47). These early studies represented a bold attempt to embrace the whole complexity of communities, with the aim of understanding their dynamical functioning. However, apart from their descriptive value, few generalisations came out of these works. Ecologists were simply not armed to analyze and extract information from these interaction networks.

3. Energy Based Approaches

A major step was Lindeman's (1942) contribution to the understanding of community structure as governed by energetic constraints (48). Elton had observed that species occurred in quite discrete size categories, with larger ones typically being scarcer and eating smaller ones, thus forming a pyramid of numbers: the number of individuals decreases as one moves up in trophic levels. Why does such a pattern exist? This intriguing question was answered by Lindeman (1942) (48), who considered communities as systems that transform energy. 
Energy from sunlight is processed essentially through photosynthesis – with an efficiency of about 2% – to form plant biomass, which is eaten by the second trophic level – herbivores – in turn consumed by species at higher levels. Lindeman recognized that energy transfer from one trophic level to the other is inefficient – 10% as a gross estimate – and that this energetic constraint is a basic organizing principle of ecological systems. Elton's
pyramid of numbers can be understood as a consequence of energy flow between trophic levels and of body size differences between predators and their prey.
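The arithmetic behind this energetic constraint can be made explicit with a minimal sketch, assuming the two round figures quoted above (about 2% of sunlight fixed by photosynthesis, and about 10% of energy passed from one trophic level to the next). The function name, the solar input value and the level labels are hypothetical and serve only as an illustration.

```python
def energy_by_trophic_level(solar_input, n_levels,
                            photosynthetic_efficiency=0.02,
                            transfer_efficiency=0.10):
    """Return the energy available at each trophic level.

    Level 0 is plant biomass (solar input times photosynthetic efficiency);
    each higher level receives a fixed fraction of the level below.
    Both efficiencies are the gross estimates quoted in the text.
    """
    if n_levels < 1:
        raise ValueError("n_levels must be at least 1")
    energy = [solar_input * photosynthetic_efficiency]
    for _ in range(n_levels - 1):
        energy.append(energy[-1] * transfer_efficiency)
    return energy


if __name__ == "__main__":
    # Arbitrary solar input, in arbitrary energy units per unit area and time.
    labels = ["plants", "herbivores", "carnivores", "top carnivores"]
    levels = energy_by_trophic_level(1_000_000.0, len(labels))
    for label, value in zip(labels, levels):
        print(f"{label:>15}: {value:12.1f}")
```

With these assumed efficiencies, each step up the food chain leaves one tenth of the energy of the level below, so only a few levels can be supported. If predators are in addition larger than their prey, the same energy is shared among fewer individuals, which is the reasoning behind the pyramid of numbers.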
Figure 5. Energy flow network of the mesohaline area of Chesapeake Bay during summer (carbon flows in mg·m⁻²·summer⁻¹), from Baird and Ulanowicz (1989) (49). POC, particulate organic carbon; DOC, dissolved organic carbon; circles refer to external sources, "bullets" to autotrophs, hexagons to heterotrophic species, "birdhouses" to nonliving storages, and ground symbols to respiration. Number inside the boxes is the standing biomass in mg/m². (Reprinted from Baird and Ulanowicz, 1989, with permission of the Ecological Society of America.) (49).
Lindeman's breakthrough paved the way for an important school in ecology which considered energy flow through ecosystems (50). Brothers Eugene and Howard Odum initiated this approach by studying the impact of nuclear bomb tests on marine ecosystems, with the assumption that disequilibrium between productivity and respiration is an indication of an unsustainable system. These energy budget studies at the ecosystem level gained in popularity with the development of a formal symbolism aimed at graphically representing the flows of energy through the various actors of the ecosystem (see Fig. 5). The Odums developed an important body of ecosystem theory (51-53). Notably, Howard is
famous for his thermodynamic view of ecosystems and the recognition that energy takes various forms when flowing through the ecosystem – the concepts of transformity and emergy (53-56). One of the Odum brothers' prominent ideas is the view that ecosystems evolve toward higher levels of homeostasis, and consequently higher stability, through a combination of selection at the system level and evolution. They considered ecosystems as self-developing in terms of energy: species and interactions are selectively reinforced towards more efficient energy use at the system level. Recycling loops, providing nutrients back into greater production, are particularly important in this respect: they are naturally favoured since they are auto-reinforcing. Such loops, depicted in Fig. 6, are called "indirect-mutualism" (57); they are autocatalytic in the sense that an increase in activity of one node will consequently increase the activity of all nodes. This early view of self-organization and evolution at the ecosystem level may appear a courageous tenet for the 1960s. At that time most evolutionists dismissed any idea of group selection (58) in favour of individual selection, mainly because of the emergent conflicts between the various levels of selection (59). But schools of evolutionists and of system ecologists have largely progressed independently. It is interesting to note that evolutionists have since recognized the importance of selection at levels higher – and lower – than the individual (e.g. refs. 60-63). Notably, that ecosystems are prone to selection has been demonstrated experimentally in artificial conditions (64). The recent development of the "extended evolutionary theory" (65),
Figure 6. Example of indirect mutualism. Such configurations are autocatalytic loops. When embedded within a larger system, the species involved in such loops will be naturally reinforced. A simple biological example of such a loop is the system formed by the aquatic carnivorous plants Utricularia, which excrete on their surface exudates that are consumed by a periphytic community, in turn consumed by zooplankton that close the loop when they are captured inside the hollow leaves of the plant.
may in a way reconcile the Odum brothers' early ideas with current evolutionary theory. This new theory considers evolution not only as a one-way process with natural selection acting from the environment to the individuals, but as a two-way process where selected individuals concomitantly modify their environment through "niche construction". The modified environment is bequeathed to the descendants, a case of ecological inheritance. For the specific concern of network analysis, the Odums were not so much interested in the structure and architecture of the ecosystem networks, but in their global functioning in terms of energy processing. They used very global measures of ecosystem "health", for example the ratio of primary production to community respiration (52). However, they pioneered a holistic and systemic analysis of ecosystems, and instigated concepts that lie at the heart of current thinking in network research, notably self-organization. They are also the fathers of system ecology, a school that regrettably appears to be forgotten by many researchers in the field of network analysis. This school developed a strong and wide-ranging body of theory on ecosystem analyses (e.g. refs. 66,67). A multi-faceted account can be found in Patten and Jørgensen (1995) (68). The work of Ulanowicz (69,57) is particularly relevant for the study of networks, where ecosystems are considered weighted networks: nodes are species (or functional groups), and links are biomass flows, typically expressed in milligrams of carbon per square meter and year [mgC·m⁻²·y⁻¹]; exogenous inputs, respiration, and exports are taken into account (see Fig. 5). Ulanowicz developed a suite of macrodescriptors of weighted networks based on information theoretical indices. It is outside the scope of the present Chapter to review all descriptors here, but three particularly useful measures are worth mentioning. 
"Average mutual information" (AMI) is a measure of constraint in the network: a maximally connected system (all nodes are connected to all others) with links of equal weight has the minimum AMI – there is no constraint since biomass can flow everywhere (Fig. 7a); a simple chain connecting all nodes will have maximum AMI – biomass is constrained to follow this route (Fig. 7b). For a given ecosystem, a way to scale this information measure is to multiply AMI with the total system throughput (TST, the sum of all flows). This yields a quantity
called the "ascendency". A core hypothesis of system ecology is that evolving systems naturally gain ascendency. According to system ecologists, the driving forces are autocatalytic loops – indirect mutualism (Fig. 6) – which will necessarily enhance the importance of flows comprised in their paths. This asymmetric force will favour some links to the detriment of others, yielding more and more articulated networks (of the kind of Fig. 7b). Ascendency is a macro-descriptor of ecosystem development that is based on information on the weighted network structure of the ecosystem, with weights corresponding to flows of biomass between species. Another intriguing descriptor of food-web structure is the so-called "effective connectance" (70). In food-web ecology, connectance is a measure of the density of links in a network. Unfortunately, it has several definitions (71), the most widely used being directed connectance, the number of observed links L divided by the number of possible links (72). The number of possible links is the number of species (or nodes) squared, $S^2$ (thus, directed connectance is equal to 1 in Fig. 7a, and to 0.2 in Fig. 7b). However, directed connectance considers all links as equal and thus disregards the variation in flow within weighted networks. The effective connectance, which is based on the conditional entropy of the system, is a way of expressing the density of links in weighted networks. It is a fundamental measure that has fostered interesting hypotheses on the structure and dynamics of communities (73). Ascendency and effective connectance are two examples of descriptors of weighted networks developed in system
Figure 7. Two extreme configurations of a five compartment network. In (a), there is no constraint where energy can flow in this maximally connected network, while the route is unique in (b).
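The two configurations of Fig. 7 can be used to check these descriptors numerically. The sketch below assumes the usual information-theoretic form of average mutual information for a flow matrix, AMI = Σ (T_ij/T) log2(T_ij T / (T_i. T_.j)), where T_ij is the flow from compartment i to compartment j, T_i. and T_.j are row and column totals, and T is the total system throughput. It is a simplified illustration: exogenous inputs, respiration and exports are ignored, all flows are set to 1, and configuration (b) is assumed to be a closed loop of five links, which is what a directed connectance of 0.2 implies. Function names are hypothetical.

```python
from math import log2


def total_system_throughput(flows):
    """Sum of all flows (TST) in a square flow matrix."""
    return sum(sum(row) for row in flows)


def directed_connectance(flows):
    """Number of non-zero links L divided by S squared."""
    n_species = len(flows)
    n_links = sum(1 for row in flows for flow in row if flow > 0)
    return n_links / n_species ** 2


def average_mutual_information(flows):
    """AMI in bits; flows[i][j] is the flow from compartment i to j."""
    tst = total_system_throughput(flows)
    out_totals = [sum(row) for row in flows]
    in_totals = [sum(column) for column in zip(*flows)]
    ami = 0.0
    for i, row in enumerate(flows):
        for j, flow in enumerate(row):
            if flow > 0:
                ami += (flow / tst) * log2(
                    flow * tst / (out_totals[i] * in_totals[j]))
    return ami


def ascendency(flows):
    """AMI scaled by total system throughput."""
    return total_system_throughput(flows) * average_mutual_information(flows)


# Fig. 7a (assumed): every compartment sends an equal flow to every compartment.
fully_connected = [[1.0] * 5 for _ in range(5)]
# Fig. 7b (assumed): a single closed route through the five compartments.
single_route = [[1.0 if j == (i + 1) % 5 else 0.0 for j in range(5)]
                for i in range(5)]

if __name__ == "__main__":
    for name, flows in [("7a", fully_connected), ("7b", single_route)]:
        print(name,
              directed_connectance(flows),
              round(average_mutual_information(flows), 3),
              round(ascendency(flows), 3))
```

Under these assumptions the fully connected network has zero AMI (knowing where a unit of biomass comes from says nothing about where it goes), while the single route reaches the maximum possible for five compartments, log2(5) bits.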
ecology. Other measures derived from information theory have been devised to describe different aspects of ecological networks (69), and they also have a strong potential to prove very useful in future analyses of various kinds of weighted networks.

4. Complexity and Stability

It is interesting to note that the school of system ecology – which dealt mainly with ecosystems – has developed quite independently from "classical" community ecologists. For example, Paine's (74) seminal work on food-web complexity does not cite any of Odum's works, nor does Morin (3) give any reference on system ecology in his textbook on community ecology. This paucity of interactions between the two communities may appear surprising when one traces back to the source of the interest in network analysis. As described above, the Odums paved the way for system ecology, but Eugene (51) is also the author of a proposition that, I believe, lies at the heart of the current interest in network analysis by community ecologists. Odum (51) suggested that, in a community, the amount of choice (in other words, of alternative paths) a quantum of energy has in going from autotrophs to higher trophic levels is a measure of community stability. This statement was used by MacArthur (75) in a very influential paper on the link between complexity and stability in food-webs. In fact, this link between the static and dynamic properties of a community was made so strongly that the number of connections in a food-web was taken as a direct measure of stability – without regard to any yearly observations of the abundances of species. This statement was a strong incentive for the view that complexity begets stability – the "conventional wisdom" among ecologists in the 1960s (76). Apart from this reasoning based on the density of links in networks, Elton (45) provided a suite of arguments in favour of the complexity-stability positive relationship. 
MacArthur's (75) contribution gave a strong impetus for the study of the effect of diversity on stability in real communities. It is worth mentioning that MacArthur suggested a function to describe the stability of a network, and used Shannon's famous measure of information for this purpose. However, the application of Shannon's formula in this context has apparently been
forgotten by most community ecologists, while it has been widely used by system ecologists. The vast majority of ecological studies using Shannon's formula applied it to measure the species diversity of communities (77), and not to interactions between species. Two reasons can explain this shift: first, MacArthur deduced that greater stability could be achieved with more species, given that the number of prey per species remains constant, and second it is much more difficult to quantify the magnitude of interactions – MacArthur remained very vague on this point – than the importance of species, readily given by their abundance. Few field studies of ecological networks were published in the years after MacArthur's (75) paper (e.g. refs. 78,79). Among them, a very important contribution is that of Robert Paine (74) on the study of food-webs in marine rocky intertidal zones along a latitudinal gradient. This work is noteworthy because it provides a precise estimation of the relative importance of interactions – in terms of the frequency of predation acts, and of the energy involved in each link. The vast majority of subsequent work on food-webs does not include information on the importance of trophic interactions, the omission of which has hindered the discovery of robust patterns in the structure of food-webs (80). Another unusual feature lies in the experimental approach taken by Paine. By removing the top predators (starfishes), he was able to show that the community became simpler because one of the prey species was able to out-compete the other species and occupy the available space. Even species not preyed upon by starfishes were affected by their removal, an experimental demonstration of an indirect effect called keystone predation (see below). Also, when comparing tropical vs. temperate systems, Paine suggested that the stability of annual production would allow more predators to be supported, thus increasing the diversity of the whole system. 
Though this argument is not directly related to MacArthur's assertion that more complex systems are more stable, it is certainly not incompatible with it. It remains that ecologists in the late 1960s were unanimous about the positive relationship between complexity and stability. One of Elton's (45) arguments in favour of this relationship was that simple two-species predator-prey models are inherently unstable. Of course, this argument says nothing about the stability of models with
many interacting species, but obviously assumes that they should be inherently more stable. It is this assumption that has been challenged by the seminal work of May (81,26,82). He found that, on the basis of mathematical models, randomly assembled systems will remain stable if the product of average interaction strength α and the square root of S·C remains smaller than one, with S the number of species in a food-web, C the connectance, that is the total number of trophic links L divided by the number of possible links ($S^2$), and α the average interaction strength, a measure of the magnitude of the effect of the abundance of one species on the abundance of another. This is known as the "May-Wigner stability criterion" ($\alpha\sqrt{SC} < 1$) (83). In other words, assuming a constant α, food-webs will become unstable with an increase of either species diversity, number of interactions, or both. This discovery was a real bombshell. It is important to recognize the assumptions of May's analysis: the system is at equilibrium, and only local stability is assessed by the analysis; the interactions between species are random, i.e. the network is a random graph: any species can interact with any other, with probability set by the value of connectance C; the interaction strength between species is set by a random number drawn from a normal distribution with mean 0 and standard deviation 1. May himself was certainly the person most aware of the ecological limitations of his criterion. Few results in ecology have elicited so many critiques, and fostered such a wealth of studies. Empirical ecologists have complained about the irrelevance of May's analysis for real systems, and Polis (84) hoped that descriptions of real food-webs would pound the final nail into the "coffin of May's paradox". Theoreticians have challenged the criterion itself, and Cohen and Newman (85) showed that the criterion was not as general as originally thought (but see ref. 86). 
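As a rough illustration of how the criterion is evaluated, the following sketch computes the connectance of a binary food-web matrix (with entry [i][j] equal to 1 if species j consumes species i, the convention used earlier in this chapter) and then the quantity α√(SC). The function names, the three-species example and the value chosen for α are hypothetical; the sketch only evaluates the inequality and does not reproduce May's random-matrix analysis.

```python
from math import sqrt


def connectance_from_food_web(food_web):
    """Directed connectance C = L / S**2 of a binary food-web matrix.

    food_web[i][j] == 1 if species j consumes species i, else 0.
    """
    n_species = len(food_web)
    n_links = sum(sum(row) for row in food_web)
    return n_links / n_species ** 2


def may_wigner_value(alpha, n_species, connectance):
    """Left-hand side of the criterion: alpha * sqrt(S * C)."""
    return alpha * sqrt(n_species * connectance)


def satisfies_may_criterion(alpha, n_species, connectance):
    """True if alpha * sqrt(S * C) < 1."""
    return may_wigner_value(alpha, n_species, connectance) < 1


# Hypothetical three-species chain: plant (0) <- herbivore (1) <- carnivore (2).
chain = [
    [0, 1, 0],   # the plant is eaten by the herbivore
    [0, 0, 1],   # the herbivore is eaten by the carnivore
    [0, 0, 0],   # nothing eats the carnivore
]

if __name__ == "__main__":
    alpha = 0.5  # assumed average interaction strength
    c_chain = connectance_from_food_web(chain)
    print(c_chain, satisfies_may_criterion(alpha, len(chain), c_chain))
    # Same alpha, but a larger and denser hypothetical web (S = 100, C = 0.1).
    print(may_wigner_value(alpha, 100, 0.1),
          satisfies_may_criterion(alpha, 100, 0.1))
```

With α held constant, the small chain satisfies the inequality whereas the larger web does not, which is the sense in which increasing the number of species or of links pushes a randomly assembled system towards instability.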
Whatever the critiques, it is indisputable that the huge merit of May's study was to force ecologists to think deeply about the complexity-stability question, and this was the starting point of a fruitful body of research in different disciplines. One can recognize at least four interdependent research directions in which ecologists have ventured to answer this question. First, May's analysis is purely theoretical, and ecologists have since conducted many experiments to test the relationship between diversity and stability. A major initiator of this kind of research agenda is
David Tilman, who started in the early 1980s a long-term field experiment with plants aimed at determining whether diversity affects ecological functioning in terms of biomass production (87-90). European researchers set up a similar kind of experiment in eight countries (91), and experiments were also conducted in controlled laboratory settings (92). The interpretation of the results raised controversy (93,94), a major difficulty being to disentangle the effect of diversity per se from the statistical effect of selecting by chance a plant with a high productivity, such a probability being of course higher with a higher number of species. This again fostered theoretical efforts to reconcile the opposing views (95,96,97). The main result can be summarized as follows: within an ecosystem, plant diversity is positively correlated with stability at the community level (higher yield), but not always with stability at the population level (98,25). This positive relationship is due to the complementarity effect of diversity (i.e. to a better utilization of all resources through finer resource partitioning and/or positive interactions), and not to the selection effect (i.e. a higher chance of having a strong competitor in the plant mixture). However, it must be noted firstly that the kind of stability envisaged in such experiments is quite different from the stability envisaged by Elton, MacArthur or May, and secondly that these experiments mostly concerned only one trophic level – plants – and made no explicit reference to the structure of interactions between the species. In this respect, the connection to ecological networks is very loose. It is also worth noting that the relationship between diversity and ecosystem functioning is not as straightforward when higher trophic levels are taken into account (99,100). The experiment of Fagan (101) is an intriguing study linking stability and the structure of trophic interactions in natural food-webs.
Fagan explored the effect of omnivory on stability in natural patches dominated by two species of plants on Mt St-Helens. He manipulated the level of disturbance through aphicide application, and of omnivory by adding or removing wolf spiders (Pardosa sp.) and damselbugs (Nabis sp.), the former are omnivorous, the latter not (Fig. 8). According to May's (81) findings, omnivorous species should destabilize the system since they increase the level of connectance. Other theoretical works supported this idea (102-104). But Fagan found the opposite: increased levels of
omnivory tended to stabilize the dynamics of the community. Other experiments in microcosms supported this result (105,106). In all, we see that experimental challenges of May's findings tend to demonstrate that complexity increases stability in communities. A second direction of research following May's contribution is the exploration of the influence of interaction strength on stability. The notion of interaction strength is deeply entwined with classical Lotka-Volterra models of species interactions, where the effect of species i on species j is modelled as a simple law of mass action, i.e. the effect is proportional to the product of both abundances times the interaction strength αji (Fig. 9). Such a modelling approach is unsound for ecological systems as it leads to nonsensical situations: for example, a single predator j in the presence of a huge number N of prey i should alone consume αji·N prey, without being satiated. This has generated a large body of mostly theoretical research on so-called functional responses, i.e. the number of prey eaten per predator, and to a much lesser degree on numerical responses, i.e. the number of predators produced per prey consumed, with sometimes heated debates (107-112). May's approach accommodates any form of response, since local stability analyses are evaluated close enough to the equilibrium point that a linear response can be assumed. However, what appears crucial in May's approach for the negative relationship between stability and diversity is the assumption that interaction strengths between the species of a community follow a normal distribution, with mean 0 and standard deviation 1. In fact, it has been found that the distribution of the αs plays a major role in stability. The first study pointing out this possibility is Yodzis' analysis of the dynamics of systems where the structure and the strengths of interactions were derived from the observation of real ecological systems (113).
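The satiation problem of the mass-action term can be made concrete with a short Python sketch. It contrasts the per-predator intake α·N with a saturating functional response; the saturating form used here (a Holling type II curve with a handling time) is an assumption introduced for illustration and is not taken from the text, and all parameter values are hypothetical.

```python
def mass_action_intake(alpha, prey):
    """Per-predator intake under the law of mass action: grows without bound."""
    return alpha * prey

def type2_intake(attack, handling, prey):
    """Saturating intake (assumed Holling type II form); bounded by 1 / handling."""
    return attack * prey / (1.0 + attack * handling * prey)

# Hypothetical parameters, for illustration only.
alpha, handling = 0.1, 0.5
for prey in (10, 100, 1000, 10000):
    print(prey, mass_action_intake(alpha, prey),
          round(type2_intake(alpha, handling, prey), 3))
```

At low prey abundance the two curves nearly coincide, which is consistent with the remark that a linear response can be assumed close to the equilibrium point.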
Yodzis found that, compared to random systems, such biologically plausible ones were much more stable. However, it was not clear why. Answers awaited the studies of McCann et al. (1998), Berlow (1999), and Neutel et al. (2002) (114-116). McCann et al. (1998) used nonlinear models of simple systems where interaction strengths between species were allowed to vary (114). It was found that weak interactions
Figure 8. Feeding relationships among arthropod species in the Mount Saint Helens blowdown zone (Fagan 1997) (101). Pardosa is a generalist predator and Nabis a specialist one. Manipulations of their abundances and application of a disturbance to the system revealed that the generalist species had a stabilizing effect on community dynamics. (Reprinted from Fagan, 1997 (101), with permission of the University of Chicago Press).
Figure 9. A food-web graph depicting the flows of biomass (a), and the corresponding graph with interaction strengths (b). In the case of a trophic interaction between a consumer j and a prey i, αij is negative and αji positive. Note that this graph possesses two loops of size three (α31- α23- α12 and α13- α21- α32), configurations explored in Neutel et al. (2002)(116).
have a stabilizing effect because they act to dampen oscillations between consumers and prey, thereby allowing higher densities and consequently diminishing the probability of extinction. Berlow (1999) (115) additionally found that weak interactions are proportionately extremely variable in strength, and suggested that this feature could enhance spatial variability, which is known to promote the persistence of species (117-121). More recently, Neutel et al. (2002) (116) concentrated on loops in food-webs (see Fig. 9), which are known to have a destabilizing effect. They found that long loops in fact contain many weak links, which stabilizes the dynamics of the system. All these studies show that the presence of many weak links stabilizes food-web dynamics and generates a positive relationship between complexity and stability. Importantly, the presence of many weak links is consistent with observations of real systems (122). Robert Paine (122-125) initiated the functional description of food-webs. He argued that a purely topological description of the trophic interactions between species can bring only little information on the functioning of a system, and that quantitative information on biomass flows is likewise unsatisfactory (Fig. 10). The key problem is that neither description provides information on interaction strengths. It is consequently not possible to link the structure of a system with its dynamical behaviour. Paine advocated instead estimating interaction strengths between species by a suite of removal experiments (122). This approach has the advantage that the measurements relate directly to system functioning, but it is hardly feasible in species-rich systems (126-128). In any case, this discussion highlights the difficulty of documenting interactions between species in a way that ties theory and empirical studies.
Third, May's (1972) study is based on the assumption that communities lie at equilibrium, that is, that the abundances of all species remain constant – a point in the phase space (81). Local stability analyses evaluate what happens when abundances are displaced slightly from this equilibrium point; the dynamics is locally stable when the trajectory goes back to the equilibrium point, and unstable if it diverges. This assumption recalls the early ideas of a balance of nature, with the abundances of species adjusting to each other to attain a precisely fixed
state. However, ecologists have grown more and more suspicious of this early belief about community dynamics, and became aware that the perspective of non-equilibrium dynamics was ecologically sensible. Species abundances show fluctuations which are driven by biotic and
Figure 10. Three ways of looking at interactions in a food-web (Paine 1980) (122). (a) A purely topological web indicating the trophic interactions; (b) the energy flows provide quantitative information on the importance of flows, but not necessarily on the functional organisation of the community; (c) the functional web (or interaction web) is based on controlled manipulations measuring the effect of the removal of a consumer on the abundance of prey. It can reveal that weak interactions in terms of biomass flow may have strong dynamical effects. (From data in Paine 1980) (122).
abiotic interactions. Studies relaxing the equilibrium assumption have flourished (e.g. refs. 129,130,114,131), and appropriate measures of stability had to be used (103). In general, it was found that population fluctuations can strongly promote the coexistence of species (25). In this respect, an intriguing study links non-equilibrium dynamics and network structure in model systems (132). Michalski and Arditi found that the structure of a food-web was typically much simpler in terms of connectance under equilibrium assumptions than under non-equilibrium dynamics (Fig. 11). The way dynamics is modelled affects not only species diversity, but also the structure of connections between species. It must be noted that this quest for the consequences of non-equilibrium dynamics has largely been restricted to theoretical studies. Few analyses of the dynamics of real communities have been undertaken (e.g. 133-135). They confirmed the view that communities do not lie at equilibrium. Fourth, a strong assumption in May's analysis is that connections between species are random. This can be perceived as a null hypothesis for the architecture of real food-webs. Ecologists have searched for regularities in the topological structure of food-webs, found that real networks are not random graphs, and developed models to explain these patterns. This theme is intimately linked to network structure and will be expanded on at some length in the following section.
Figure 11. Interactions in a food-web in an equilibrium and a non-equilibrium context (Michalski and Arditi 1995) (132). (a) The potential interactions allowed by the dynamical model studied; (b) effective food-web structure realized at the equilibrium; (c) possible food-web structure under non-equilibrium dynamics. (Reprinted with permission from Michalski and Arditi, 1995, (132) Proc. R. Soc. B 259, 217-222).
Figure 12. Example of two food-web graphs and the corresponding niche-overlap, resource, and intervality graphs. The food-web graphs depict the flows of biomass between prey (1 to 4) and consumers (A to D). The niche-overlap graph is constructed by linking consumers that share at least one prey. The resource graph is constructed by linking prey that share at least one consumer; predators can be superimposed on this graph, giving rise to a picture of the niche of the whole community (consumer A is a point above prey 1; consumers B and C are the dashed and dotted lines, respectively; consumer D is the grey triangle (above) or the grey square (below)). The interval graph is built by representing consumers as segments that overlap if they share at least one prey. The upper food-web graph produces a non-rigid (also called non-chordal, or non-triangulated) niche-overlap graph: consumers form a circuit of four points without a chord shortening the circuit (such a chord is present in the lower niche-overlap graph). The upper resource graph contains a topological hole between prey 1, 2, and 3. The upper food-web graph is non-interval: it is not possible to arrange the consumers as segments along a single dimension (it is not possible to place D in this example); two dimensions are needed. Analyses of early collections of food-webs revealed a strong excess of rigid niche-overlap graphs, of resource graphs without holes, and of interval food-webs (as in the lower food-web).
5. The Topological Structure of Food Webs
The pioneer in the search for topological regularities in food-webs is Joel Cohen. He was the first researcher to assemble a collection of food-webs – from the literature – and to explore the regularities found in such data. He was also the first to confront these observations with theoretical models of food-web structure (136,137). Cohen explored how the niche structure of communities could be deduced from information on food-web structure. For this purpose, he developed so-called niche-overlap graphs (Figs. 12,13), and applied graph theoretical tools to analyze them. The niche-overlap graph is built from information of the food-web graph to
obtain a representation of the structure of exploitative competition (that is, competition for resources, in the present case trophic resources) within the community. Simply, an edge is drawn between species that consume at least one species in common, and are thus potentially competing for that prey. This new graph is by itself very informative, but Cohen went one step further by exploring the intervality of such graphs (Fig. 12). The discovery was intriguing: most niche-overlap graphs of the dataset were found to be interval. The biological consequence is that it should be possible to arrange all species along a single niche dimension. The idea that trophic resources within a community could be mapped on a single dimension was certainly a surprise at that time, contrasting with the view of a multidimensional niche (138,139). Cohen offered some possible explanations of this pattern, but the prevailing one was offered by Lawton and Warren (1988) (140): this single dimension could simply reflect the body size of the prey.
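The construction rule for the niche-overlap graph described above can be sketched in a few lines of Python. The function name and the small diet table are hypothetical and serve only as an illustration; they are not data from Cohen's collection.

```python
from itertools import combinations

def niche_overlap_graph(diets):
    """Undirected edges between consumers that share at least one prey.

    diets maps each consumer to the set of prey it eats.
    """
    edges = set()
    for a, b in combinations(sorted(diets), 2):
        if diets[a] & diets[b]:
            edges.add((a, b))
    return edges

# Hypothetical food-web: consumers A to D feeding on prey 1 to 4.
diets = {"A": {1}, "B": {1, 2}, "C": {2, 3}, "D": {3, 4}}
print(sorted(niche_overlap_graph(diets)))
```

In this toy example the consumers form a simple path, so they can be laid out as overlapping segments along one line, which is the situation the intervality analysis looks for.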
Figure 13. Food-web graph (centre) of Narragansett Bay and its derived niche-overlap and resource graphs (data from ref. 143). The niche-overlap graph provides a representation of the competitive interactions for shared resources between the consumers. The resource graph can serve as a representation of the communal niche (see Fig. 12), but also provides information on so-called apparent competition (see Chapter 6 on indirect effects). Note that niche-overlap and resource graphs are undirected. Species are: 1. flagellates, diatoms; 2. detritus; 3. macroalgae, eelgrass; 4. Acartia, other copepods; 5. sponges; 6. benthic macrofauna and infauna; 7. clams; 8. ctenophores; 9. meroplankton, fish larvae; 10. Pacific menhaden; 11. bivalves; 12. crabs, lobsters; 13. butterfish; 14. striped bass; 15. bluefish; 16. mackerel; 17. other demersal species; 18. starfish; 19. flounder; 20. man.
If predators consume prey restricted to a range of sizes, and consume all prey within this range, then a single niche dimension is sufficient. This is a nice example of the discovery of a non-intuitive constraint on food-web structure thanks to network analysis. It paved the way for further studies, notably on static and dynamical models of food-webs (141-143). The research on food-web graphs was pushed one step further by Sugihara. First, he developed a different way of representing niche space from food-web information. He noted that Cohen's (137) niche-overlap graph was lacking the multidimensional portrait suggested by the Hutchinsonian niche, and proposed to use instead so-called resource graphs (Figs. 12,13). The resource graph is an inside-out version of the niche-overlap graph: prey that share at least one predator are linked by an undirected edge. The interest of such a graph is that it is then possible to superimpose the consumers as simplexes (Fig. 12). This gives rise to a solid "tinker-toy" model of the niche of the whole community where prey are vertices and consumers polyhedral structures. These polyhedra have dimension n-1, with n the number of prey categories, which preserves the multidimensional flavour of the niche. Sugihara went on to explore regularities in resource graphs from a collection of real food-webs (data from Briand 1983) (144). He concentrated on the presence or absence of topological holes in resource graphs, and on the rigidity (also called chordal property, or triangulation) of niche-overlap graphs (see Fig. 12). Rigidity is a property of graphs that is closely related to intervality – all interval graphs are rigid, but the converse is not true – while there is no necessary connection between rigidity and absence of holes. Sugihara found that most analyzed communities were rigid and lacked topological holes, in contrast to randomly constructed matrices.
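The two constructions discussed above, the resource graph and the rigidity test, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the example graphs are hypothetical, and rigidity is tested here by repeatedly deleting simplicial vertices (vertices whose neighbours are all mutually linked), a standard characterization of chordal graphs rather than a procedure taken from Sugihara's work. Topological holes are not tested.

```python
from itertools import combinations

def resource_graph(diets):
    """Undirected edges between prey that share at least one consumer."""
    edges = set()
    for prey in diets.values():
        edges.update(combinations(sorted(prey), 2))
    return edges

def is_rigid(nodes, edges):
    """True if the undirected graph is chordal (rigid, triangulated)."""
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    remaining = set(nodes)
    while remaining:
        for v in sorted(remaining):
            nbrs = sorted(adj[v] & remaining)
            # v is simplicial if its remaining neighbours form a clique
            if all(b in adj[a] for a, b in combinations(nbrs, 2)):
                remaining.remove(v)
                break
        else:
            return False  # no simplicial vertex left: a chordless circuit exists
    return True

# Hypothetical niche-overlap graphs on four consumers.
square = {("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")}
print(is_rigid("ABCD", square))                  # circuit of four without chord
print(is_rigid("ABCD", square | {("A", "C")}))   # same circuit with a chord
print(sorted(resource_graph({"A": {1, 2}, "B": {2, 3}})))
```

The first graph is the chordless circuit of four consumers mentioned in the caption of Fig. 12; adding a single chord makes it rigid.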
In other words, the niche space of communities is densely packed. This simple result led Sugihara to postulate a simple assembly rule for food-webs: a new species added to a community must compete within a single guild (that is, a group of consumers sharing similar prey), and is prohibited from bridging two or more guilds. This simply means that it is not possible for an incoming species to join two guilds, say of insectivores and of nectarivores. If this rule is respected, rigidity and lack of topological holes are guaranteed. This rule
can broadly reflect an invasion process acting at an ecological time scale, or a speciation process acting at an evolutionary time scale. It gives rise to hierarchically structured communities, which can be represented by dendrograms representing phylogeny or niche overlaps, where more and more similar species are grouped together in a tree-like manner. Interestingly, this line of reasoning was successfully applied to models of species abundances in taxocenes (145,9). The works of both Cohen and Sugihara are very compelling examples of how network analyses can shed new light on non-trivial ecological processes (146). Food-web graphs themselves – and not derived niche-overlap or resource graphs – have also elicited a wealth of studies to uncover regularities in their structure. Again, Joel Cohen is a pioneer in this undertaking (136,30,147). Food-web graphs are complex objects, and descriptors were devised to extract ecologically meaningful information (for a list of descriptors, see e.g. Yodzis 1989 (10), or Bersier et al. 2002 (148)). As seen previously, individual food-webs appear complex and variable. However, the first collections of large numbers of webs have yielded simple and intriguing patterns. These collections are those of Cohen (1978) (137), Briand (1983) (144), Sugihara et al. (1989) (149), and Cohen et al. (1990) (143). The most important of these patterns are the following (140,180,104): 1) food chains are typically short and consist of five or fewer trophic levels (103,149,143); 2) connectance declines as species number increases in a way that the product of connectance and species number is roughly constant (149,103,150,30,151) – this is equivalent to a constant link density (L divided by S); 3) the fraction of top species (i.e.
having no predator), of intermediate species (having prey and predator), and of basal species (having no prey), and the fraction of links between top and intermediate, top and basal, intermediate and intermediate, intermediate and basal species are scale invariant: they stay roughly constant across a variety of webs spanning a wide range in the number of species they contain (147,149,143,152). These regularities were coined ‘scaling laws’. However, critics of food-web theory argued that such generalisations were artefacts due to the poor quality of the data sets (e.g. 150,124,153,154). Many of these data on food-webs were not gathered by the original investigators with the intention of producing realistic
food-webs, but were most often provided as accessory information on the global structure of the studied systems. The lack of quality and uniformity in these early food-web collections is striking. Typically, "species" in higher trophic levels are resolved at the species level (e.g. fishes), while those in lower levels are crudely categorised (e.g. zooplankton, see Fig. 13). Following these criticisms, food-webs designed specifically to represent the full complexity of ecological systems were assembled (154-167,14). Indeed, these high-quality data challenged the validity of the previously recognized generalizations. The least robust property appeared to be the link density. It was soon proposed that a power law may be a more accurate description for this property (168,143,24), and none of the recently compiled collections of food-webs upholds the scale invariance for this property (169,156,170,152,72,160). This led Martinez (1992) (72) to hypothesize that the directed connectance (L/S²) was the scale-invariant measure of food-web structure, and not the link density. Analyses of single highly resolved webs (154,153,157,158) revealed values for the food-web properties that were not consistent with those reported from earlier studies: e.g. the link density was much higher, the fraction of top species much lower, and the various measures of chain length were found to be much larger. Most analyses of collections of food-webs produced results that diverged from the ‘scaling laws’: the properties varied with scale (156,160). This led Martinez and Lawton (1995) (171) to put forward that the scale-dependent hypothesis was more successfully predictive and precise than the old paradigm of scale invariance (see also next Chapter).
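The descriptors at the centre of this debate are simple to compute from a list of trophic links. The following Python sketch, with a hypothetical function name and a made-up four-species web, returns the link density L/S, the directed connectance L/S², and the fractions of top, intermediate, and basal species as defined above.

```python
def web_descriptors(links):
    """Basic descriptors of a food-web given as a set of (prey, consumer) pairs."""
    species = {s for link in links for s in link}
    has_predator = {prey for prey, _ in links}
    has_prey = {consumer for _, consumer in links}
    S, L = len(species), len(links)
    return {
        "S": S,
        "L": L,
        "link_density": L / S,              # L divided by S
        "directed_connectance": L / S**2,   # L divided by S squared
        "top": len(has_prey - has_predator) / S,
        "intermediate": len(has_prey & has_predator) / S,
        "basal": len(has_predator - has_prey) / S,
    }

# Hypothetical web, for illustration only.
links = {("plant", "herbivore"), ("plant", "omnivore"),
         ("herbivore", "omnivore"), ("herbivore", "carnivore")}
print(web_descriptors(links))
```

Under the early ‘scaling laws’ the link density would stay roughly constant as S grows, whereas under the constant-connectance hypothesis L/S² would.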
Apart from this line of evidence based on the scaling properties of food-webs, the recognition that food-webs were more complex than previously thought arose from increasing knowledge of feeding biology, and from recognition of the importance of additional trophic pathways (detrital channel, allochthonous inputs, life-history omnivory, or cannibalism) compared to those traditionally depicted by the trophic level ideal (156,153,85,38,172). This complexity does not fit into the classical view of the trophic level concept, which led Polis and Strong (1996) (173) to question the usefulness of this concept, and to propose a new model that accounts for the full complexity of natural systems.
Independently of this debate about the scaling behaviour of food-web properties, the importance of temporal resolution in the construction of food-webs was pointed out (174,155,175,176,159,161). It was generally noted that food-webs sampled over specific periods of time produced values for the properties that were very different (especially for the link density) from those obtained with cumulated versions of the same webs. Since the species and links observed over the complete study period are never all present on any particular sampling date, time-specific webs provide greater realism. For example, Schoenly et al. (1996) (177) studied the impact of insecticide in rice-plantation food-webs: cumulative versions of sprayed and unsprayed webs revealed no between-treatment differences, while time-specific webs before, during, and after spraying revealed classic examples of secondary pest resurgence and early-season losses of natural enemies over the spray interval. Another important factor affecting our perception of the structure of food-webs is the degree of sampling effort used to document the webs. Goldwasser and Roughgarden (1997) (178), and Martinez et al. (1999) (166) found that most food-web properties were highly sensitive to this sampling effect. Bersier et al. (1999) (80) went one step further by showing that inherently scale-dependent systems could appear scale-invariant when sampled with a low intensity. This result reconciled the opposing views on food-web structure, since it gave an explanation for the scale-invariance observed in early collections of food-webs, which were sampled with a low level of detail. More importantly, this result revealed a basic problem with high-quality food-webs: such food-webs incorporate trophic links with huge differences in their importance in terms of biomass flow. High-quality data typically include some very strong links, and a lot of links of intermediate and low importance.
Yet, food-web ecologists most often disregarded this variability and treated all links as equal (Fig. 14). This calls for a quantitative rather than a qualitative approach of food-web structure. Quantitative descriptors of weighted networks have been devised since (148, 179), and preliminary results show that food-web structure is indeed scale-dependent, with most descriptors having nonlinear relationships with species richness (179).
Recently, new methodologies to analyze the topology of food-webs have yielded extremely interesting results (see Strogatz 2001 (181) for a review). They are partly borrowed from analyses of the robustness of communication networks like the Internet (e.g. ref. 182), which have a scale-free structure – that is, most nodes have few links and few nodes have a large number of links.
Figure 14. (a) Qualitative and (b) quantitative representation of the food web of Chesapeake Bay mesohaline ecosystem (see also Fig. 5). The width of the links is proportional to the amount of biomass flow. It exemplifies the problem of analysing the topology of food-webs without taking link magnitude into account.
This pattern is revealed by the degree distribution of a network, which plots the cumulative frequency distribution of nodes with a given number of links (number of nodes with 1, 2, 3... links). A power law distribution demonstrates a scale-free structure (see Chapter 1). Melian and Bascompte (2002) (183) have compared different networks (protein, random and food-web networks) and analyzed yet another aspect of their structure: they concentrated on average connectivity of neighbours in a network. To achieve this, first all nodes with one link are considered, and the average number of links of their single neighbour is computed; then, this process is repeated for nodes with 2, 3, 4... links, and the mean
number of links of the neighbours is plotted on a log-log scale, with the average connectivity as ordinate and number of links as abscissa. It revealed that food-webs have a very different structure than random and protein networks. In protein networks, nodes with many links tend to be connected to nodes with very few links; this is not the case for food-webs where the reverse happens. This gives rise to groups of species that form highly connected subwebs (Fig. 15), a pattern explored in Melian and Bascompte (2004) (184). In another intriguing study, Williams et al. (2002) (185) have studied the minimum number of links necessary to go from any one species to any other in a food-web. This is a very important feature to understand how a perturbation in a network (e.g. the extinction of a species) may propagate to the rest of the network. They found that food-webs possess the so-called "small-world" property, that is, the
Figure 15. A representation of the food-web of Ythan estuary (data from Hall and Raffaelli 1991) (184) where dense subsets of trophically interacting species are emphasized (from Melián and Bascompte 2004) (184). Typically, food-webs are organized around few subwebs of highly interacting species. (Reprinted from Melián and Bascompte, 2004, with permission of the Ecological Society of America.)
characteristic path length – the mean distance between all nodes – is typically very short (between 1.4 and 2.7 for the dataset studied). These are examples of important patterns revealed by applying network analysis tools. They show that food-webs have a highly cohesive structure, with typically one or a few large and highly trophically interconnected subwebs, and with the vast majority of species no more than three links apart from each other. The quest for patterns in food-web structure continues to attract ecologists and theoreticians, with explorations on universal scaling in networks (186a,186b,187,188). This quest for regularities in food-web structure is a basic step in the process of developing models of food-web structure. If indeed such general patterns are valid, they represent the fundamental patterns in real ecosystems that must be incorporated into models, as basic constraints and as standards for testing hypothesized processes underlying food-web structure. Models have been formulated to answer this question (141,189-193) (see Chapter 12). The so-called niche-hierarchy model of Sugihara (194,189) has been discussed above. It is important to be aware that it focuses on niche-overlap graphs, but not directly on food-web graphs. So, while it constrains the possible configurations of food-webs, it does not actually provide a recipe to build a food-web matrix. Cohen and Newman (1985) (141) were the first to suggest a model that reproduces the topology of trophic interactions in a community, the cascade model. Two parameters are needed, the number of species and the number of trophic interactions – a shared characteristic of this kind of static model of food-web structure. The cascade model is stochastic and based on a very simple rule: all species are ranked according to a single hierarchy, and species can only consume prey of lower rank, with a probability identical for all species and equal to twice the connectance (i.e. 2·L/S²).
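The cascade rule as stated above translates almost directly into code. The sketch below is a minimal illustration: the function name, the use of integer ranks, and the parameter values are assumptions made for the example, and the link probability 2·L/S² is taken from the description in the text.

```python
import random

def cascade_web(S, L, rng):
    """One stochastic realization of the cascade rule.

    Species are ranked 0 .. S-1; a species may only eat species of lower
    rank, each potential link being drawn with probability 2*L/S**2.
    Returns a set of (prey, consumer) pairs.
    """
    p = 2.0 * L / S**2
    links = set()
    for consumer in range(S):
        for prey in range(consumer):  # strictly lower rank only
            if rng.random() < p:
                links.add((prey, consumer))
    return links

# Hypothetical parameters; a seeded generator makes the run repeatable.
web = cascade_web(S=10, L=20, rng=random.Random(0))
print(len(web), "links drawn for a target of 20")
```

Because prey always have a lower rank than their consumers, every realization contains neither cannibalism nor loops, whatever the random draws.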
This gives rise to matrices that are triangulated, where cannibalistic as well as longer loops (e.g. A eats B and B eats A) are forbidden (Fig. 16b). As said above, the single niche dimension in this model was interpreted to represent a body size hierarchy: consumers must usually be larger than their prey to be able to consume them (140). This very simple constraint appears to have non trivial consequences for the topology of food-webs. Interestingly, the idea of a single hierarchy came from the
observation that many real food-webs from early datasets were interval. However, Cohen and Palka (1990) (195) later discovered that the cascade model produces an excess of non-triangulated webs compared to observed food-webs, at least for webs with more than 16 species. They concluded that most reported webs with a small number of species were incomplete representations of real communities, and that consequently more than one dimension was necessary to describe trophic niches of communities. A later study explored quantitatively the level to which food-webs departed from intervality by counting the number of chordless circuits with four consumers (see Fig. 12) in niche-overlap graphs (196). It was found that the cascade model generates a large surplus of such circuits compared to the niche-overlap graph of Ythan estuary (197). Williams and Martinez (2000) (190) confirmed that the cascade model poorly reproduces the structure of highly resolved food-webs. However, the importance of body size for food-web structure has been confirmed (167), and fostered the development of a new analytical representation of food-web structure where abundance, body size, and trophic information are combined (198,199). The partial failure of the cascade model to reproduce real patterns led Williams and Martinez (2000) (190) to formulate the niche model (Fig. 16c), which is based on the assumption of a single trophic niche dimension where consumers eat all prey within a range. This produces contiguous diets for all species; that is, it assumes that it is possible to arrange all prey species so that no gap is present in the diet of any consumer (see Fig. 16c, and note that all columns have a continuous range of ones). Note that this model generates nothing but interval food-webs. It is intriguing since it is able to reproduce closely many patterns seen in food-web structure, and represents a major improvement over the cascade model.
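The contiguity of diets in the niche model can be illustrated with a simplified Python sketch. The text only states that consumers eat all prey within a range along a single niche dimension; the specific way the range is drawn here (a niche value uniform on [0, 1], a range width equal to the niche value times a Beta(1, 1/(2C) - 1) variate, and a range centre drawn uniformly between half the width and the niche value) is an assumption following the usual presentation of the model, and refinements of the published model are omitted. The function name and parameter values are hypothetical.

```python
import random

def niche_web(S, C, rng):
    """Simplified niche-model realization; returns (prey, consumer) index pairs.

    Species are indexed in increasing order of niche value. Requires C < 0.5.
    """
    beta = 1.0 / (2.0 * C) - 1.0
    niche = sorted(rng.random() for _ in range(S))
    links = set()
    for consumer, n in enumerate(niche):
        width = n * rng.betavariate(1.0, beta)   # feeding range width
        centre = rng.uniform(width / 2.0, n)     # range centre at or below n
        low, high = centre - width / 2.0, centre + width / 2.0
        for prey, m in enumerate(niche):
            if low <= m <= high:                 # eat everything in the range
                links.add((prey, consumer))
    return links

web = niche_web(S=20, C=0.15, rng=random.Random(1))
print(len(web), "links")
```

Since each diet is an interval on the niche axis and species are indexed along that axis, every column of the resulting matrix is a continuous run of ones, which is the property noted for Fig. 16c.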
The niche model, however, suffers from a serious drawback: the contiguous diets it assumes are never observed in recent highly resolved food-webs (191). That is, it is not possible to order the prey species so as to remove all gaps in the diets of consumers. This is another hint that more than one dimension is needed to represent the trophic structure of real communities. Cattin et al. (2004) (191) used an evolutionary approach to generate food-web matrices: they postulated that diets of species could be understood as resulting from phylogenetic constraints and adaptation (see also ref. 200). A consumer’s
Figure 16. (a) Food-web matrix of Bridge Brook Lake (data from Havens 1992) (152), and one realization of stochastic models of food-web structure with similar numbers of species and links: (b) the cascade model (152), (c) the niche model (190), and (d) the nested-hierarchy model (191). Columns are species in their role of consumers, and rows are species in their role of prey; a "1" indicates a trophic interaction.
diet is constrained by its phylogenetic origin: taxonomically related species share similar ancestral morphological features that influence the kind of species they can prey on. For example, all warblers of the Phylloscopus genus possess a beak suited to preying on insects. This connection between phylogenetic origin and diet was substantiated with statistical analyses relating matrices of trophic and taxonomic similarity. However, phylogenetic constraints are not sufficient to explain trophic structure, since species have to adapt to varying environments in order to survive, diverging from close relatives in their behaviour, and possibly innovating by using new food sources. Cattin et al. (2004) (191) proposed simple rules that incorporated these evolutionary processes, and found that food-webs generated in this way were very close to observed ones, with the advantage over the niche model of correctly accounting for the level of chordless circuits found in real communities. The prime difference between former models and the nested-hierarchy model is that the process of generating food-webs is sequential in the latter: consumers are added one by one to the community, and their diet depends on existing ones. This feature is meant to represent a process where new species are not free to consume any kind of prey if they are taxonomically related to already present species. Other models of food-web structure based on evolutionary dynamics have been devised (192,193), and their ability to closely reproduce observed trophic
structures is impressive. The success of evolutionary models reflects a change in ecological thinking about communities. Earlier studies mainly focused on ecological processes like competition and niche theory, predation, or trophodynamics to explain community patterns (e.g. refs. 201-206). Community ecologists have become more and more aware of the importance of historical effects, and notably of "deep history" – in other words, phylogeny – on community structure (e.g. refs. 4,207-212).

Finally, another fundamental question is the effect of food-web topology on community dynamics. This research is very important for understanding community responses to extinctions of species, which may be driven by climate change. Interestingly, it has been tackled by purely static analyses of network topology, as well as by more traditional dynamical models. The former approach explores the possibility of secondary extinctions, that is, of possible cascading effects after the removal of species (213-215). In this vein, Allesina and Bodini (2004) (216) have developed the use of so-called dominator trees to visualize species that are bottlenecks for the flow of biomass within the community. Food-web dynamics are treated in more detail in Chapter 12.

6. Indirect Effects

A very important aspect in the study of ecological networks is the concept of indirect effects, that is, effects between two species that are not directly interacting, and which are mediated by other interacting species in the network. Such indirect effects can have a profound influence on community dynamics. This aspect is typically overlooked in many studies on populations and, especially for conservation issues, the omission of important third-party players in the system studied can lead to inappropriate management recommendations (217-219).
Other examples come from the evolutionary biology of plant-herbivore interactions, which are typically studied without regard to other interacting species. When considered in a community context, the outcomes of such interactions are often not accounted for by classical theories (220,221). Though indirect effects can in principle
occur between any species in a food-web (222), they are most often considered between adjacent species, which allows a classification of basic forms: trophic cascade, apparent competition, keystone predation, and consumptive competition (Fig. 17). The last is the most straightforward and lies at the basis of niche-overlap graphs (Figs. 12, 13). The term trophic cascade was coined by Robert Paine (1980) (122), but the idea of top-down effects propagating through food-chains was already explicit in Camerano (1880) (17), and was at the core of theories of community organization by Hairston et al. (1960) (223) and Fretwell (1977) (224). A fascinating study by Carpenter and co-workers (225,226) showed how the fourth trophic level can affect the first one in a lake ecosystem: large fishes are able to suppress many small fishes, in turn unleashing herbivores and eventually strongly decreasing the biomass of plants. Removing the large predatory fishes produces a massive bloom of primary producers. Trophic cascades have been under deep scrutiny by ecologists, with questions about their importance in aquatic versus terrestrial systems (227-230), about the processes triggering their occurrence (231-234), or about their dynamics (176,235,236). Indirect mutualism is another form of indirect effect that was discussed above (Fig. 6). Keystone predation is yet another kind of indirect effect: by preying upon a superior competitor, a consumer can enhance the abundance of an inferior competitor. The existence of configurations leading to apparent competition was first described by Holt (1977) (237): the increase in abundance of a prey may increase the abundance of its predator; if this predator is shared by another prey, which does not compete with the first one, then the abundance of this second prey may decrease, giving rise to the appearance of competition between the two prey (Holt and Lawton 1994) (238). It is worth noting that Sugihara's (1982) (194) resource graphs (Fig.
13) provide a representation of possible apparent competition between prey in a community. By giving a complete picture of biomass flows, food-webs in essence incorporate all the information needed to appreciate the importance of indirect effects in a community. However, it took ecologists a rather long time to appreciate the dynamical importance of such effects – for example, indirect effects are not treated in the community ecology textbook of Putman (239), while a whole chapter is devoted to this subject in Morin (3).
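Both kinds of graph mentioned above can be read directly off a binary food-web matrix. The following is a minimal sketch, assuming the row-prey, column-consumer convention of Fig. 16. The function name and the four-species example are hypothetical. A shared consumer only flags a pair of prey as candidates for apparent competition; it says nothing about the strength of the effect.

```python
from itertools import combinations

def overlap_graphs(web):
    """web[i][j] == 1 means consumer j eats prey i.
    Returns (niche_overlap, resource_graph) as sets of species pairs."""
    S = len(web)
    diets = [{i for i in range(S) if web[i][j]} for j in range(S)]    # prey of j
    enemies = [{j for j in range(S) if web[i][j]} for i in range(S)]  # consumers of i
    pairs = list(combinations(range(S), 2))
    # Consumers sharing at least one prey: possible consumptive competition.
    niche_overlap = {(a, b) for a, b in pairs if diets[a] & diets[b]}
    # Prey sharing at least one consumer: possible apparent competition.
    resource_graph = {(a, b) for a, b in pairs if enemies[a] & enemies[b]}
    return niche_overlap, resource_graph

# Hypothetical web: species 0 and 1 are basal; 2 eats 0 and 1; 3 eats 1.
web = [[0, 0, 1, 0],
       [0, 0, 1, 1],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
niche_overlap, resource_graph = overlap_graphs(web)
# niche_overlap == {(2, 3)}: both consumers eat species 1
# resource_graph == {(0, 1)}: both prey are eaten by species 2
```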
Figure 17. Typical forms of indirect interactions. Solid lines are direct interactions and dotted lines are the resulting indirect interaction.
In the context of network analyses, the various kinds of indirect effects have been considered as modules or building blocks of food-webs (240) – with the addition of configurations like intraguild predation, a case where an omnivorous predator and its prey both share the same resource (241,242). For example, Bascompte et al. (2005) (243) have extracted from a large and highly resolved food-web all modules of two types: simple food-chains and intraguild predation. They analyzed how interaction strength was distributed within these modules. In a simple food-chain module with three species and two links, this entails looking at the co-occurrence of strong interactions. They found that two co-occurring strong links were rare and that, when present, they were "shortcut" by a strong omnivory link more often than expected by chance – that is, they formed an intraguild predation module. Overall, such patterns have a strong stabilizing effect on community dynamics. One of the most successful research programs on indirect effects is the work of Charles Godfray and co-workers on apparent competition in ecological networks. They studied mostly herbivore-parasitoid systems (parasitoids have a mode of life intermediate between predation and parasitism: they typically lay their eggs inside a host, and the larvae develop by feeding within its living body; most parasitoids are wasps). The beauty of this system is that it is possible to accurately estimate the magnitude of the interactions by collecting hosts and counting the number of parasitoid larvae; it is also possible to take into account secondary parasitoids, which lay their eggs inside the parasitoid larvae or
mummy when they are still in their herbivore host. In this way, it is possible to document very precisely the trophic interactions within such systems (244-250). Once quantitative food-webs are available, it is possible to produce quantitative resource graphs (called in this case parasitoid overlap graphs) where herbivores are linked if they share a parasitoid. The interest of obtaining such weighted graphs is that it is then possible to isolate pairs of herbivores that are likely to be affected by strong apparent competition. Such pairs were subjected to experimental manipulation and the significance of indirect effects was assessed in this way (247,251,252).

7. Networking with Non Trophic Interactions

As said in the introduction, networks of species linked by different kinds of interactions are quite uncommon compared to the wealth of food-web studies, and up to now, I have considered almost exclusively networks of species linked by trophic interactions. From food-webs, it is possible to generate graphs of species linked by consumptive competition. Such niche-overlap graphs have been discussed above. When considering competition without information from food-webs, ecologists have typically assessed the magnitude of interactions indirectly through measures of body size, morphology, and micro-habitat use (e.g. refs. 253,254), with the assumption that these are sufficient to account for most mechanisms leading to competition (e.g. consumption of shared resources, or preemption of space). From such information, it is possible to compute a square quantitative matrix of niche similarity. Pairs of species overlapping extensively in their niche use are expected to be strong competitors. There are to my knowledge few studies where such competition matrices were used to generate networks. The preferred analytical tool is cluster analysis, which provides a hierarchical structure of community organization.
An exception is Sugihara's (1982) (194) analysis of a dataset of 11 bird communities from Cody (1974) (201). He built the niche overlap graphs of such communities at various thresholds of interaction magnitude, and found that the absence of chordless cycles (see Fig. 12) was a robust feature of communities whatever the level of competition used to construct such graphs.
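The procedure described above can be sketched as follows, assuming a symmetric matrix of pairwise niche similarities. The similarity values, thresholds, and function names are illustrative assumptions. The test for chordless cycles (repeated removal of a vertex whose neighbours are all mutually linked) is a standard graph-theoretic check for the absence of chordless cycles of four or more vertices, and is not necessarily the method used in the original study.

```python
from itertools import combinations

def overlap_graph(similarity, threshold):
    """Link two species when their niche similarity reaches the threshold."""
    n = len(similarity)
    adj = {v: set() for v in range(n)}
    for a, b in combinations(range(n), 2):
        if similarity[a][b] >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def has_no_chordless_cycle(adj):
    """True if the graph has no chordless cycle of length four or more.
    Repeatedly remove a vertex whose neighbours form a clique; if the
    process gets stuck, a chordless cycle is present."""
    adj = {v: set(nb) for v, nb in adj.items()}   # work on a copy
    while adj:
        removable = next(
            (v for v, nb in adj.items()
             if all(b in adj[a] for a, b in combinations(nb, 2))),
            None)
        if removable is None:
            return False
        for nb in adj.pop(removable):
            adj[nb].discard(removable)
    return True

# Hypothetical similarities for four species arranged in a ring.
similarity = [[1.0, 0.8, 0.2, 0.8],
              [0.8, 1.0, 0.8, 0.2],
              [0.2, 0.8, 1.0, 0.8],
              [0.8, 0.2, 0.8, 1.0]]
for threshold in (0.1, 0.5, 0.9):
    graph = overlap_graph(similarity, threshold)
    print(threshold, has_no_chordless_cycle(graph))
```

In this constructed example the intermediate threshold leaves a four-species ring without a chord, so the check fails there, while the fully connected and empty graphs pass. Sugihara's finding was that real communities passed at all thresholds.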
Mutualistic interactions between plants and their pollinators have typically been thought to be quite specialized. However, recent studies have shown that this is often not the case: many plants have numerous pollinators and many pollinators visit different plant species. This discovery comes from the analysis of mutualistic networks of plants-pollinators and plants-seed dispersers (255-257). It is important in many respects. First, it shows that, when viewed at a community level, coevolution between pollinators and plants is not "multi-channel", that is, between tightly co-adapted pairs, but diffuse over the whole network. Second, this knowledge is crucial when assessing the possible dispersal of pollen of genetically modified plants, and is also very important for conservation issues (258). Finally, it is very important for community dynamics. Bascompte et al. (2003) (259) and Jordano et al. (2003) (260) have analyzed precisely the structure of plant-animal networks in a large dataset comprising more than 50 networks. The former study found that such networks are organized in a highly nested way: specialist species interact only with subsets of the species that interact with the more generalist species. This pattern generates highly asymmetrical structures, with the community organized in a cohesive manner around a central core of interacting species. The latter study explored scaling relationships in this dataset. They found that, contrary to other studied networks, the degree distribution (the cumulative frequency distribution of numbers of species with a given number of links) was not scale-free, that is, it did not decay as a power law.
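The two properties examined in these studies can be illustrated on a small binary plant-animal matrix. This is a minimal sketch with hypothetical function names and data. The strict subset test below only illustrates the definition of nestedness; published analyses rely on graded nestedness indices, since real networks are never perfectly nested.

```python
def cumulative_degree_distribution(degrees):
    """Return {k: fraction of species with at least k links}."""
    n = len(degrees)
    return {k: sum(d >= k for d in degrees) / n
            for k in range(1, max(degrees) + 1)}

def is_perfectly_nested(matrix):
    """matrix[p][a] == 1 if plant p interacts with animal a.
    Perfect nestedness: ordered from generalist to specialist, each
    plant's partner set is a subset of the previous plant's set."""
    partners = sorted(
        ({a for a, x in enumerate(row) if x} for row in matrix),
        key=len, reverse=True)
    return all(small <= big for big, small in zip(partners, partners[1:]))

# Hypothetical network: three plants (rows) and three animals (columns).
nested = [[1, 1, 1],
          [1, 1, 0],
          [1, 0, 0]]
plant_degrees = [sum(row) for row in nested]              # [3, 2, 1]
distribution = cumulative_degree_distribution(plant_degrees)
print(is_perfectly_nested(nested), distribution)
```

A matrix in which each plant has its own exclusive partner, such as `[[1, 0], [0, 1]]`, fails the subset test and corresponds to the specialized, pairwise picture that these studies called into question.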
They hypothesized that the construction of such networks followed a process where morphological and phenological constraints restricted the number of possible links between species, thus generating networks with a strong core of interacting species and a wealth of satellite species attached loosely to this core, a pattern that may confer a high dynamical robustness to these systems. Finally, to my knowledge very few studies at the community level have incorporated different kinds of interactions in the same network. One such work is the recent contribution of Melián, Bascompte and Jordano (in Melián 2005) (12), who tackled the analysis of a network comprising plants, herbivores, pollinators, and seed-dispersers (a total of 394 species). They analyzed degree distributions and subwebs (see above), and found, interestingly, that plant species that were involved in
the core of interactions with pollinators were at the same time exposed heavily to herbivores. Looking only at mutualistic interactions may give the impression that this core of interacting species forms a barrier against the propagation of disturbances in this system. Incorporating herbivores reveals that these species are also more prone to be destabilized by herbivores. Overall, both effects may cancel out. This shows that one must be careful when drawing conclusions about community dynamics from studies of only one kind of interaction, and highlights the need for a multi-interactive perspective on ecological networks.

8. Future Avenues of Research

I hope that the present chapter has given the reader a palette of possibilities for studying ecological networks. I believe that apprehending natural communities as ecological networks has a strong potential to become a paradigmatic approach in ecology, firstly because analytical tools are becoming more and more available, but more importantly because ecological networks provide a framework to link various schools in ecology and evolution that have largely progressed independently. For example, evolutionary theories about plant-herbivore or host-parasite interactions can be scaled up, and energy-based theories of ecosystem structure can be scaled down, to the same network level. Such an undertaking is paved with theoretical and empirical challenges. To be successful, I believe that developments are particularly needed in the following domains. First, it is desirable to tackle ecological networks in a multi-interactive manner, that is, to integrate all kinds of interactions.
Trophic interactions are of course paramount to understanding the dynamics of communities, but other kinds of links are as crucial, some of which are overlooked in ecological networks – for example facilitation (261) and ecosystem engineering (262-264), without forgetting ecological stoichiometry – the balance of different chemicals in ecological interactions (265). Models and theories of community structure and functioning need to prove useful in their predictions, for example when effects of climatic forcing or of other perturbations are considered. I believe that this goal can be fully achieved only by integrating all kinds of interactions in the same
framework. Second, interactions should not only be reported as present or absent, but also quantified in appropriate units, and the abundance of species should also be estimated. Weighted networks provide much more sensible information on community structure, and suffer less from effects due to different levels of sampling effort exerted in documenting communities (266). Third, such an undertaking will be successful only with high-quality datasets, which requires a strong involvement of field ecologists and taxonomists. Here, a difficult methodological aspect is the quantitative documentation of interactions. Trophic interactions are notably difficult to quantify from observations, and molecular methods may be helpful in this task (e.g. ref. 267). Finally, communities must be studied at various temporal and spatial scales (268-270). The recent concepts of metacommunities (271) and meta-ecosystems (5) emphasize dispersal and allochthonous inputs as key processes for community structure. They are highly relevant to this perspective, which views communities as networks of networks. In all, these issues are a plea for a multidisciplinary approach to the ecology of natural communities. This should keep theoreticians from forgetting that nodes in a network are not just abstract objects but species, and population ecologists from forgetting that species interact not only with their prey or parasites, but within a larger constellation of interacting species. Ecological networks make connections between species in a community; they should also encourage ecologists and evolutionists of diverse disciplines, as well as mathematicians and physicists, to make connections in a fruitful network.

Acknowledgements

I am grateful to all colleagues and friends who helped me to sharpen my ideas about community ecology, to all field ecologists involved in the data collection presented here, and to Russell Naisbit for revising the text. My apologies to all whose contributions could not be covered in this chapter.
This research was supported by a Swiss NSF grant (3100A0-113843).
References

1. Haeckel, E. (1866) Generelle Morphologie der Organismen, 2 Vols. Georg Reimer Verlag, Berlin.
2. Fauth, J.E., Bernardo, J., Camara, M., Resetarits, W.J.J., Van Buskirk, J. and McCollum, S.A. (1996) Simplifying the jargon of community ecology: a conceptual approach. American Naturalist. 147, 282-286.
3. Morin, P. (1999) Community Ecology. Blackwell Science.
4. Drake, J.A. (1990) Communities as assembled structures: do rules govern pattern? Trends in Ecology and Evolution. 5, 159-164.
5. Loreau, M. and Mouquet, N. (1999) Immigration and the maintenance of local species diversity. American Naturalist. 154, 427-440.
6. Root, R. (1967) The niche exploitation pattern of the blue-gray gnatcatcher. Ecological Monographs. 37, 317-350.
7. Hubbell, S.P. (2001) The Unified Neutral Theory of Biodiversity and Biogeography. Princeton University Press, Princeton, Oxford.
8. Volkov, I., Banavar, J.R., Hubbell, S.P. and Maritan, A. (2003) Neutral theory and relative species abundance in ecology. Nature. 424, 1035-1037.
9. Sugihara, G., Bersier, L.F., Southwood, T.R.E., Pimm, S.L. and May, R.M. (2003) Predicted correspondence between species abundances and dendrograms of niche similarities. Proceedings of the National Academy of Sciences of the United States of America. 100, 5246-5251.
10. Yodzis, P. (1989) Introduction to Theoretical Ecology. Harper and Row, New York.
11. Ricklefs, R.E. and Miller, G.L. (1999) Ecology. W.H. Freeman and Company, New York.
12. Melián, C.J. (2005) On the Structure and Dynamics of Ecological Networks. PhD thesis, Universidad de Alcalá, Spain.
13. Darwin, C. (1859) On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. 1st edition. John Murray, London.
14. Hagen, J. (1992) An entangled bank: The origins of ecosystems ecology. Rutgers Univ. Press, New Brunswick.
15. Golley, F.B. (1993) A history of the ecosystem concept in ecology: More than the sum of the parts. Yale Univ. Press, New Haven.
16. Kingsland, S.E. (1995) Modeling nature: episodes in the history of population ecology. University of Chicago Press, Chicago.
17. Camerano, L. (1880) Dell'equilibrio dei viventi merce la reciproca distruzione. Atti della Reale Accademia delle Scienze di Torino. 15, 393-414. (For an English translation by Claudia M. Jacobi: Levin, S.A., ed. (1994) Frontiers in Mathematical Biology. Springer Verlag, Berlin. 360-380.)
18. Darwin, C. (1839) Journal of Researches into the Geology and Natural History of the Various Countries Visited by H.M.S. Beagle, Under the Command of Captain Fitzroy, R.N. from 1832 to 1836. Henry Colburn, London. Reprinted: Culture et Civilisation, Brussels, 1969.
19. Cohen, J.E. (1994) Lorenzo Camerano's contribution to early food web theory. In Frontiers in Mathematical Biology (Levin, S.A.), 351-359. Springer Verlag, Berlin.
20. Möbius, K. (1877) Die Auster und die Austernwirtschaft. Wiegand, Hempel and Parey, Berlin.
21. Forbes, E. (1843) Report on the Mollusca and Radiata of the Aegean sea, and on their distribution, considered as bearing on geology. Report of the 13th meeting of the British Association for the Advancement of Science held at Cork. 130-93.
22. Ehrlich, P.R. and Birch, L.C. (1967) The "balance of nature" and "population control". American Naturalist. 101, 97-107.
23. Slobodkin, L.B., Smith, F.E. and Hairston, N.G. (1967) Regulation in terrestrial ecosystems, and the implied balance of nature. American Naturalist. 101, 109-124.
24. Pimm, S.L. (1991) The Balance of Nature? Ecological Issues in the Conservation of Species and Communities. The University of Chicago Press, Chicago.
25. McCann, K.S. (2000) The diversity-stability debate. Nature. 405, 228-233.
26. May, R.M. (1974) Stability and complexity of model ecosystems. 2nd ed. Princeton University Press, Princeton.
27. Estes, J.A., Tinker, M.T., Williams, T.M. and Doak, D.F. (1998) Killer whale predation on sea otters linking oceanic and nearshore ecosystems. Science. 282, 473-476.
28. Brett, M.T. and Goldman, C.R. (1997) Consumer versus resource control in freshwater pelagic food webs. Science. 275, 384.
29. Carpenter, S.R., Kitchell, J.F. and Hodgson, J.F. (1985) Cascading trophic interactions and lake productivity. BioScience. 35, 634-639.
30. Cohen, J.E. and Briand, F. (1984) Trophic links of community food web. Proceedings of the National Academy of Sciences of the United States of America. 81, 4105-4109.
31. Scheffer, M. (1999) Searching explanations of nature in the mirror world of math. Conservation Ecology. 3, art. 11.
32. Allee, W.C., Emerson, A.E., Park, O., Park, T. and Schmidt, K.P. (1949) Principles of animal ecology. Saunders, Philadelphia.
33. Oksanen, L. (1991) A century of community ecology: how much progress? Trends in Ecology and Evolution. 6, 294-296.
34. Pierce, W.D., Cushman, R.A. and Hood, C.E. (1912) The insect enemies of the cotton boll weevil. U.S. Department of Agriculture Bureau of Entomology. Bulletin 100, 99.
35. Shelford, V.E. (1913) Animal Communities in Temperate America as Illustrated in the Chicago Region. The Geographic Society of Chicago. Bulletin 5.
36. Summerhayes, V.S. and Elton, C.S. (1923) Contributions to the ecology of Spitsbergen and Bear Island. Journal of Ecology. 11, 214-286.
37. Elton, C.S. (1927) Animal ecology. Sidgwick and Jackson, London.
38. Polis, G.A. and Hurd, S.D. (1996) Linking marine and terrestrial food webs: Allochthonous input from the ocean supports high secondary productivity on small islands and coastal land communities. American Naturalist. 147, 396-423.
39. Huxel, G.R., McCann, K. and Polis, G.A. (2002) Effects of partitioning allochthonous and autochthonous resources on food web stability. Ecological Research. 17, 419-432.
40. Stapp, P. and Polis, G.A. (2003) Marine resources subsidize insular rodent populations in the Gulf of California, Mexico. Oecologia. 134, 496-504.
41. Cuddington, K. (2001) The “Balance of Nature” metaphor and equilibrium in population ecology. Biology and Philosophy. 16, 463-479.
42. Clements, F.E. (1916) Plant succession, an analysis of the development of vegetation. Carnegie Institution of Washington. 242, 1-512.
43. Gleason, H.A. (1926) The individualistic concept of the plant association. Bulletin of the Torrey Botanical Club. 53, 7-26.
44. Elton, C.S. (1930) Animal Ecology and Evolution. Oxford University Press, New York.
45. Elton, C.S. (1958) The Ecology of Invasion by Animals and Plants. Methuen, London.
46. Richards, O.W. (1926) Studies on the ecology of English heaths. III. Animal communities of the felling and burn successions at Oxshott Heath, Surrey. Journal of Ecology. 14, 244-281.
47. Bird, R.D. (1930) Biotic communities of the aspen parkland of central Canada. Ecology. 11, 356-442.
48. Lindeman, R.L. (1942) The trophic-dynamic aspect of ecology. Ecology. 23, 399-417.
49. Baird, D. and Ulanowicz, R.E. (1989) The seasonal dynamics of the Chesapeake Bay ecosystem. Ecological Monographs. 59, 329-364.
50. Ulanowicz, R.E. (1995) Ecosystem trophic foundations: Lindeman exonerata. In Complex Ecology: The Part-Whole Relation in Ecosystems (Patten, B.C. and Jørgensen, S.E.). 549-560. Prentice Hall PTR, Englewood Cliffs, NJ.
51. Odum, E.P. (1953) Fundamentals of Ecology. Saunders, Philadelphia.
52. Odum, E.P. (1969) The strategy of ecosystem development. Science. 164, 262-270.
53. Odum, H.T. (1988) Self-organization, transformity, and information. Science. 242, 1132-1139.
54. Patten, B.C. (1995) Network integration of ecological extremal principles: exergy, emergy, power, ascendancy, and indirect effects. Ecological Modelling. 79, 75-84.
55. Hau, J.L. and Bakshi, B.R. (2004) Promise and problems of emergy analysis. Ecological Modelling. 178, 215-225.
56. Brown, M.T. and Ulgiati, S. (2004) Energy quality, emergy, and transformity: H.T. Odum's contributions to quantifying and understanding systems. Ecological Modelling. 178, 201-213.
57. Ulanowicz, R.E. (1997) Ecology, the Ascendent Perspective. Complexity in Ecological Systems Series. Columbia University Press, New York.
58. Wynne-Edwards, V.C. (1962) Animal Dispersion in Relation to Social Behaviour. Hafner Publishing, New York.
59. Wilson, D.S. (2001) Evolutionary biology: Struggling to escape exclusively individual selection. Quarterly Review of Biology. 76, 199-205.
60. Lewontin, R.C. (1970) The units of selection. Annual Review of Ecology and Systematics. 1, 1-18.
61. Maynard Smith, J. (1976) Group selection. Quarterly Review of Biology. 51, 277-283.
62. Wilson, D.S. (1997) Introduction: Multilevel selection theory comes of age. American Naturalist. 150, S1-S4.
63. Keller, L. (ed.) (1999) Levels of Selection in Evolution. Princeton University Press, Princeton, New Jersey.
64. Swenson, W., Wilson, D.S. and Elias, R. (2000) Artificial ecosystem selection. Proceedings of the National Academy of Sciences of the U.S.A. 97, 9110-9114.
65. Odling-Smee, F.J., Laland, K.N. and Feldman, M.W. (2003) Niche Construction: The Neglected Process in Evolution. Princeton University Press, Princeton, New Jersey.
66. Patten, B.C. (ed.) (1975) System Analysis and Simulation in Ecology. Academic Press, New York.
67. Fath, B.D. and Patten, B.C. (1999) Review of the foundations of network environ analysis. Ecosystems. 2, 167-179.
68. Patten, B.C. and Jørgensen, S.E. (eds) (1995) Complex Ecology: The Part-Whole Relation in Ecosystems. Prentice Hall PTR, Englewood Cliffs, NJ.
69. Ulanowicz, R.E. (1986) Growth and Development: Ecosystems Phenomenology. Springer Verlag.
70. Ulanowicz, R.E. and Wolff, W.F. (1991) Ecosystem flow networks: Loaded dice? Mathematical Biosciences. 103, 45-68.
71. Warren, P.H. (1994) Making connections in food webs. Trends in Ecology and Evolution. 9, 136-141.
72. Martinez, N.D. (1992) Constant connectance in community food webs. American Naturalist. 139, 1208-1218.
73. Ulanowicz, R.E. (2002) The balance between adaptability and adaptation. BioSystems. 64, 13-22.
74. Paine, R.T. (1966) Food web complexity and species diversity. American Naturalist. 100, 65-75.
75. MacArthur, R.H. (1955) Fluctuations of animal populations, and a measure of community stability. Ecology. 36, 533-536.
76. Begon, M., Harper, J.L. and Townsend, C.R. (1996) Ecology. Individuals, Populations and Communities. 3rd edition. Blackwell Science, Malden, MA, U.S.A.
77. Magurran, A.E. (2004) Measuring Biological Diversity. Blackwell Science, Malden, MA.
78. Niering, W.A. (1963) Terrestrial ecology of Kapingamarangi Atoll, Caroline Islands. Ecological Monographs. 33, 131-160.
79. Clarke, T.A., Flechsig, A.O. and Grigg, R.W. (1967) Ecological studies during Project Sealab II. Science. 157, 1381-1389.
80. Bersier, L.F., Dixon, P. and Sugihara, G. (1999) Scale-invariant or scale-dependent behavior of the link density property in food webs: A matter of sampling effort? American Naturalist. 153, 676-682.
81. May, R.M. (1972) Will a large complex system be stable? Nature. 238, 413-414.
82. May, R.M. (2001) Stability and complexity of model ecosystems. 2nd ed. with a new introduction. Princeton University Press, Princeton.
83. Hastings, H. (1982) The May-Wigner stability theorem. Journal of Theoretical Biology. 97, 155-166.
84. Polis, G.A. (1994) Food webs, trophic cascades and community structure. Australian Journal of Ecology. 19, 121-136.
85. Cohen, J.E. and Newman, C.M. (1984) The stability of large random matrices and their products. Annals of Probability. 12, 283-310.
86. Sinha, S. and Sinha, S. (2005) Evidence of universality for the May-Wigner stability theorem for random networks with local dynamics. Physical Review E. 71, 020902 1-4.
87. Tilman, D. and Downing, J.A. (1994) Biodiversity and stability in grasslands. Nature. 367, 363-365.
88. Tilman, D. (1996) Biodiversity: population versus ecosystem stability. Ecology. 77, 350-363.
89. Tilman, D. (1999) The ecological consequences of changes in biodiversity: A search for general principles. Ecology. 80, 1455-1474.
90. Tilman, D., Wedin, D. and Knops, J. (1996) Productivity and sustainability influenced by biodiversity in grassland ecosystems. Nature. 379, 718-720.
91. Hector, A., Schmid, B., Beierkuhnlein, C., Caldeira, M.C., Diemer, M., Dimitrakopoulos, P.G., Finn, J.A., Freitas, H., Giller, P.S., Good, J., Harris, R., Högberg, P., Huss-Danell, K., Joshi, J., Jumpponen, A., Körner, C., Leadley, P.W., Loreau, M., Minns, A., Mulder, C.P.H., O'Donovan, G., Otway, S.J., Pereira, J.S., Prinz, A., Read, D.J., Scherer-Lorenzen, M., Schulze, E.D., Siamantziouras, A.S.D., Spehn, E.M., Terry, A.C., Troumbis, A.Y., Woodward, F.I., Yachi, S. and Lawton, J.H. (1999) Plant diversity and productivity experiments in European grasslands. Science. 286, 1123-1127.
92. Naeem, S., Thompson, L.J., Lawlor, S.P., Lawton, J.H. and Woodfin, R.M. (1994) Declining biodiversity can alter the performance of ecosystems. Nature. 368, 734-737.
93. Doak, D.F., Bigger, D., Harding, E.K., Marvier, M.A., O'Malley, R.E. and Thomson, D. (1998) The statistical inevitability of stability-diversity relationships in community ecology. American Naturalist. 151, 264-276.
94. Tilman, D., Lehman, C.L. and Bristow, C.E. (1998) Diversity-stability relationships: statistical inevitability or ecological consequence. American Naturalist. 151, 277-282.
95. Loreau, M. and Hector, A. (2001) Partitioning selection and complementarity in biodiversity experiments. Nature. 412, 72-76.
96. Loreau, M., Naeem, S., Inchausti, P., Bengtsson, J., Grime, J.P., Hector, A., Hooper, D.U., Huston, M.A., Raffaelli, D., Schmid, B., Tilman, D. and Wardle, D.A. (2001) Biodiversity and ecosystem functioning: Current knowledge and future challenges. Science. 294, 804-808.
97. Loreau, M., Naeem, S. and Inchausti, P. (eds) (2002) Biodiversity and Ecosystem Functioning Synthesis and Perspectives. Oxford University Press, Oxford, UK.
412 98. 99. 100.
101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111. 112.
113. 114. 115. 116. 117. 118. 119.
L.-F. Bersier McGrady-Steed, J. and Morin, P.J. (2000) Biodiversity, density compensation, and the dynamics of populations and functional groups. Ecology, 81, 361–373. Paine, R.T. (2002) Trophic control of production in a rocky intertidal community. Science. 296, 736-739. Thébault, E. and Loreau, M. (2003) Food-web constraints on biodiversity– ecosystem functioning relationships. Proceedings of the National Academy of Sciences of the U.S.A. 100, 14949-14954. Fagan, W. F. (1997) Omnivory as a stabilizing feature of natural communities. American Naturalist. 150, 554-567. Pimm, S.L. and Lawton, J.H. (1978) On feeding on more than one trophic level. Nature.275, 542–544. Pimm, S.L. (1982) Food webs. Chapman and Hall, New York. Pimm, S.L., Lawton, J.H. and Cohen, J.E. (1991) Food web patterns and their consequences. Nature. 350, 669–674. Holyoak, M. and Sachdev, S. (1999) Omnivory and the stability of simple food webs. Oecologia. 117, 413–419. Morin, P. (1999b) Productivity, intraguild predation, and population dynamics in experimental food webs. Ecology. 80, 752–760. Holling, C.S. (1966) The functional response of invertebrate predators to prey density. Memoirs of the Entomological Society of Canada. 48, 1-60. Arditi, R. and Ginzburg, L.R. (1989) Coupling in predator-prey dynamics: ratio dependence. Journal of Theoretical Biology. 139, 311-326. Berryman, A.A. (1992) The origins and evolution of predator-prey theory. Ecology. 73, 1530-1535. Abrams, P.A. (1994) The fallacies of "ratio-dependent" predation. Ecology, 75, 1842-1850. Abrams, P.A. and Ginzburg, L.R. (2000) The nature of predation: prey dependent, ratio dependent or neither? Trends in Ecology and Evolution. 15, 337-341. Schenk, D., Bersier, L.F. and Bacher, S. (2005) An experimental test of the nature of predation: neither prey- nor ratio-dependent. Journal of Animal Ecology. 74, 8691. Yodzis, P. (1981) The stability of real ecosystems. Nature 289, 674–676. McCann, K.S, Hastings, A. and Huxel, G.R. 
(1998) Weak trophic interactions and the balance of nature. Nature. 395, 794–798. Berlow, E. (1999) Strong effects of weak interactions in ecological communities. Nature. 398, 330–334. Neutel, A.M., Heesterbeek, J.A.P. and de Ruiter, P.C. (2002) Stability in real food webs: Weak links in long loops. Science. 296, 1120-1123. Huffaker, C.B. (1958) Experimental studies on predation: dispersion factors and predator-prey oscillations. Hilgardia. 27, 343-383. Hochberg, M.E. and Lawton, J.H. (1990) Spatial heterogeneities in parasitism and population dynamics. Oikos. 59, 9-14. Holt, R.D. and Hassell, M.P. (1993) Environmental heterogeneity and the stability of host parasitoid interactions. Journal of Animal Ecology. 62, 89-100.
A History of the Study of Ecological Networks
413
120. Keitt, T.H. (1997) Stability and complexity on a lattice: Coexistence of species in an individual-based food web model. Ecological Modelling. 102, 243-258. 121. Petchey, O.L., Gonzalez, A. and Wilson, H.B. (1997) Effects on population persistence: the interaction between environmental noise colour, intraspecific competition and space. Proceedings of the Royal Society of London Series BBiological Sciences. 264, 1841-1847. 122. Paine, R.T. (1992) Food-web analysis through field measurement of per capita interaction strength. Nature. 355, 73-75. 123. Paine, R.T. (1980) Food webs: linkage, interaction strength and community infrastructure. Journal of Animal Ecology. 49, 667-685. 124. Paine, R.T. (1982) Intertidal food webs: Does connectance describe their essence. In Current Trends in Food Web Theory – Report on a Food Web Workshop (DeAngelis, D.L., Post, W.M. and Sugihara, G.), 11-15. Oak Ridge National Laboratory, Oak Ridge, Tennessee. 125. Paine, R.T. (1988) Food web: road maps of interactions or grist for theoretical development? Ecology. 69, 1648-1654. 126. Raffaelli, D.G. and Hall, S.J. (1996) Assessing the relative importance of trophic links in food webs. In Food Webs: Integration of Patterns and Dynamics (Polis, G.A. and Winemiller, K.O.), 185-191. Chapman and Hall, New York. 127. Wootton, J.T. (1997) Estimates and tests of per capita interaction strength: Diet, abundance, and impact of intertidally foraging birds. Ecological Monographs. 67, 45-64. 128. Laska, M.S. and Wootton, J.T. (1998) Theoretical concepts and empirical approaches to measuring interaction strength. Ecology. 79, 461-476. 129. DeAngelis, D. and Waterhouse, J.C. (1987) Equilibrium and nonequilibrium concepts in ecological models. Ecological Monographs. 57, 1–21. 130. Law, R., and Morton, D. (1996) Permanence and the assembly of ecological communities. Ecology. 77, 762–775. 131. Huisman, J. and Weissing, F.J. (1999) Biodiversity of plankton by species oscillations and chaos. Nature. 
402, 407–410. 132. Michalski, J. and Arditi, R. (1995) Food web structure at equilibrium and far from it: Is it the same? Proceedings of the Royal Society of London B Biological Sciences. 259, 217-222. 133. Keitt, T.H. and Marquet, P.A. (1996) The introduced Hawaiian avifauna reconsidered: Evidence for self-organized criticality? Journal of Theoretical Biology. 182, 161-167. 134. Bengtsson, J., Baillie, S.R. and Lawton, J.H. (1997) Community variability increases with time. Oikos. 78, 249-256. 135. Keitt, T.H. and Stanley, H.E. (1998) Dynamics of North American breeding bird populations. Nature. 393, 257-260. 136. Cohen, J.E. (1977) Food webs and the dimensionality of trophic niche space. Proceedings of the National Academy of Sciences of U.S.A. 74, 4533-4536. 137. Cohen, J.E. (1978) Food Webs and Niche Space. Princeton University Press, Princeton, NJ.
414
L.-F. Bersier
138. Hutchinson, G. (1957) Concluding remarks. Cold Spring Harbor Symposia on Quantitative Biology. 22, 415-427. 139. Leibold, M.A. (1995) The niche concept revisited: Mechanistic models and community context. Ecology. 76, 1371-1382. 140. Lawton, J.H. and Warren, P.H. (1988) Static and dynamic explanations for patterns in food webs. Trends in Ecology and Evolution. 3, 242-245. 141. Cohen, J.E. and Newman, C.M. (1985) A stochastic theory of community food webs I. Models and aggregated data. Proceedings of the Royal Society of London B Biological Sciences. 224, 421-448. 142. Cohen, J.E. and Newman, C.M. (1988) Dynamic basis of food web organization. Ecology. 69, 1655-1664. 143. Cohen, J.E., Briand, F. and Newman, C.N. (1990) Community Food Webs: Data and Theory. Springer-Verlag, Berlin. 144. Briand, F. (1983) Environmental Control of Food Web Structure. Ecology. 64, 253-263. 145. Sugihara, G. (1980) Minimal community structure: an explanation of species abundance patterns. American Naturalist. 116, 770-787. 146. Pimm, S.L. (1988) The geometry of niches. Community Ecology. A workshop held at Davis, CA, April 1986 (Hastings, A.), 92-111. Springer-Verlag, Berlin. 147. Briand, F. and Cohen, J.E. (1984) Community food webs have scale-invariant structure. Nature. 307, 264-267. 148. Bersier, L.F., Banašek-Richter, C. and Cattin, M.F. (2002) Quantitative descriptors of food-web matrices. Ecology. 83, 2394-2407. 149. Sugihara, G., Schoenly, K. and Trombla, A. (1989) Scale invariance in food web properties. Science. 245, 48-52. 149a. Rejmánek, M. and Stary, P. (1979) Connectance in real biotic communities and critical values for stability of model ecosystems. Nature, 280, 311-313. 150. May, R.M. (1983) The structure of food webs. Nature (London). 301, 566-568. 151. Bersier, L.F. and Sugihara, G. (1997) Scaling regions for food web properties. Proceedings of the National Academy of Sciences of the United States of America. 94, 1247-1251. 152. Havens, K.E. 
(1992) Scale and Structure in Natural Food Webs. Science (Washington DC). 257, 1107-1109. 153. Polis, G.A. (1991) Complex trophic interactions in deserts: an empirical critique of food-web theory. American Naturalist. 138, 123-155. 154. Martinez, N.D. (1991) Artifacts or attributes? Effects of resolution on the Little Rock Lake food web. Ecological Monographs. 61, 367-392. 155. Warren, P.H. (1989) Spatial and temporal variation in the structure of a freshwater food web. Oikos. 55, 299-311. 156. Winemiller, K.O. (1990) Spatial and temporal variation in tropical fish trophic networks. Ecological Monographs. 60, 331-367. 157. Hall, S.J. and Raffaelli, D. (1991) Food-web patterns: lessons from a species-rich web. Journal of Animal Ecology. 60, 823-842. 157a. Hall, S.J. and Raffaelli, D.G. (1993) Food webs: theory and reality. Advances in Ecological Research. 24, 187-239.
A History of the Study of Ecological Networks
415
158. Goldwasser, L. and Roughgarden, J. (1993) Construction and analysis of a large Caribbean food web. Ecology. 74, 1216-1233. 159. Closs, G.P. and Lake, P.S. (1994) Spatial and temporal variation in the structure of an intermittent-stream food web. Ecological Monographs. 64, 1-21. 160. Deb, D. (1995) Scale-dependence of food web structures: Tropical ponds a paradigm. Oikos. 72. 245-262. 161. Tavares-Cromar, A.F. and Williams, D.D. (1996) The importance of temporal resolution in food web analysis: Evidence for a detritus-based stream . Ecological Monographs. 66, 91-113. 162. Reagan, D.P. and Waide, R.B. (1996) The food web of a tropical rain forest. The University of Chicago Press, Chicago, Illinois, U.S.A. 163. Opitz, S. (1996) Trophic interactions in Caribbean coral reefs. International Centre for Living AquaticResource Management (now WorldFish Center), Penang, Malaysia. 164. Yodzis, P. (1998) Local trophodynamics and the interaction of marine mammals and fisheries in the Benguela ecosystem. Journal of Animal Ecology. 67, 635-658. 165. Almunia, J., Basterretxea, G., Aristegui, J. and Ulanowicz, R.E. (1999) Benthicpelagic switching in a coastal subtropical lagoon. Estuarine Coastal and Shelf Science. 49, 363-384. 166. Martinez, N.D., Hawkins, B.A., Dawah, H.A. and Feifarek, B.P. (1999) Effects of sampling effort on characterization of food-web structure. Ecology. 80, 10441055. 167. Memmott, J., Martinez, N.D. and Cohen, J.E. (2000) Predators, parasitoids and pathogens: Species richness, trophic generality and body sizes in a natural food web. Journal of Animal Ecology. 69, 1-15. 168. Schoener, T.W. (1989) Food webs from the small to the large. Ecology. 70, 15591589. 169. Winemiller, K.O. (1989) Must connectance decrease with species richness. American Naturalist. 134, 960-968. 170. Warren, P.H. (1990) Variation in food-web structure: the determinants of connectance. American Naturalist. 136, 689-700. 171. Martinez, N.D. and Lawton, J.H. 
(1995) Scale and food-web structure: from local to global. Oikos. 73, 148-154. 172. Deb, D. (1997) Trophic uncertainty vs parsimony in food web research. Oikos. 78, 191-194. 173. Polis, G.A. and Strong, D.R. (1996) Food web complexity and community dynamics. American Naturalist. 147, 813-846. 174. Kitching, R.L. (1987) Spatial and temporal variation in food webs in water-filled treeholes. Oikos. 48, 280-288. 175. Lockwood, J.A., Christiansen, T.A. and Legg, D.E. (1990) Arthropod preypredator ratios in a sagebrush habitat: Methodological and ecological implications. Ecology. 71, 996-1005. 176. Schoenly, K. and Cohen, J.E. (1991) Temporal variation in food web structure: 16 empirical cases. Ecological Monographs. 61, 267-298.
416
L.-F. Bersier
177. Schoenly, K.G., Cohen, J.E., Heong, K.L., Arida, G.S., Barrion, A.T. and Litsinger, J.A. (1996) Quantifying the impact of insecticides on food web structure of rice-arthropod populations in a Philippine farmer's irrigated field: A case study. In Food Webs: Integration of Patterns and Dynamics (Polis, G. A. and Winemiller, K.O.). 343-351. Chapman and Hall, New York. 178. Goldwasser, L. and Roughgarden, J. (1997) Sampling effects and the estimation of food-web properties. Ecology, 78, 41-54. 179. Banašek-Richter, C. (2004) Quantitative descriptors and their perspectives for food-web ecology. PhD thesis, Neuchâtel University, Switzerland. 180. Lawton, J.H. (1989) Food webs. In Ecological Concepts (Cherrett, J. M.), 43-78. Blackwell Scientific, Oxford. 181. Strogatz, S.H. (2001) Exploring complex networks. Nature. 410, 268-276. 182. Albert, R., Jeong, H. and Barabási, A.L. (2000) Error and attack tolerance of complex networks. Nature. 406, 378-382. 183. Melián, C.J. and Bascompte, J. (2002) Complex networks: two ways to be robust? Ecology Letters. 5, 705-708. 184. Melián, C.J. and Bascompte, J. (2004) Food web cohesion. Ecology. 85, 352-358. 185. Williams, R.J., Berlow, E.L., Dunne, J.A., Barabasi, A.L. and Martinez, N.D. (2002) Two degrees of separation in complex food webs. Proceedings of the National Academy of Sciences of the United States of America. 99, 12913-12916. 186. Garlaschelli, D., Caldarelli, G. and Pietronero, L. (2003) Universal scaling relations in food webs. Nature. 423, 165-168. 186a. Garlaschelli, D., Caldarelli, G. and Pietronero, L. (2005) Food-web topology: Universal scaling in food-web structure? (reply). Nature. 435, E4. 187. Brose, U., Ostling, A., Harrison, K. and Martinez, N.D. (2004) Unified spatial scaling of species and their trophic interactions. Nature. 428, 167-171. 188. Camacho, J. and Arenas, A. (2005) Universal scaling in food-web structure? Nature. 435, E3-E4. 189. Sugihara, G. 
(1983) Holes in niche space: a derived assembly rule and its relation to intervality. In Current Trends in Food Web Theory (DeAngelis, D. L., Post, W., and Sugihara, G.), 25-35. Report 5983, Oak Ridge National Laboratory, Oak Ridge, TN. 190. Williams, R.J. and Martinez, N.D. (2000) Simple rules yield complex food webs. Nature. 404, 180-183. 191. Cattin, M.F., Bersier, L.F., Banašek-Richter, C., Baltensperger, R. and Gabriel, J.P. (2004) Phylogenetic constraints and adaptation explain food-web structure. Nature. 427, 835-839. 192. Loeuille, N. and Loreau, M. (2005) Evolutionary emergence of size-structured food webs. Proceedings of the National Academy of Sciences of the United States of America. 102, 5761-5766. 193. Rossberg, A.G., Matsuda, H., Amemiya, T. and Itoh, K. (2005) An explanatory model for food-web structure and evolution. Ecological Complexity. 2, 312-321. 194. Sugihara, G. (1982) Niche hierarchy: structure assembly and organization in natural communities. Princeton University, Princeton, U.S.A.
A History of the Study of Ecological Networks
417
195. Cohen, J.E. and Palka, Z.J. (1990) A stochastic theory of community food webs. V. Intervality and triangulation in the trophic-niche overlap graph. American Naturalist. 135, 435-463. 196. Huxham, M., Beaney, S. and Raffaelli, D. (1996) Do parasites reduce the chances of triangulation in a real food web? Oikos. 76, 284-300. 197. Huxham, M., Raffaelli, D. and Pike, A. (1995) Parasites and food web patterns. Journal of Animal Ecology. 64, 168-176. 198. Cohen, J.E., Jonsson, T. and Carpenter, S.R. (2003) Ecological community description using the food web, species abundance, and body size. Proceedings of the National Academy of Sciences of the United States of America. 100, 17811786. 199. Reuman, D.C. and Cohen, J.E. (2004) Trophic links’ length and slope in the Tuesday Lake food web with species’ body mass and numerical abundance. Journal of Animal Ecology. 73, 852-866. 200. Cousins, S.H. (1985) The trophic continuum in marine ecosystems: Structure and equations for a predictive model. Canadian Journal of Fisheries and Aquatic Sciences. 213, 76-93. 201. Cody, M.L. (1974) Competition and the Structure of Bird Communities. Princeton University Press, Princeton. 202. Diamond, J.M. (1975) Assembly of species communities. In Ecology and Evolution of Communities (Cody, M. L. and Diamond, J. M.), 342-444. Belknap/Harvard University Press, Cambridge, MA. 203. Connor, E.F. and Simberloff, D. (1979) The assembly of species communities: Chance or competition. Ecology. 60, 1132-1140. 204. Schoener, T.W. (1983) Field experiments on interspecific competition. American Naturalist. 122, 240-285. 205. Connell, J.H. (1983) On the prevalence and relative importance of interspecific competition: Evidence from field experiments. American Naturalist. 122, 661-696. 206. Diamond, J.M. and Case, T.J. (1986) Community Ecology. Harper and Row Publ., New York. 207. Price, P.W. 
(1994) Phylogenetic constraints, adaptive syndromes, and emergent properties: from individuals to population dynamics. Researches on Population Ecology. 36, 3-14. 208. Price, P.W. (2003) Macroevolutionary theory on macroecological patterns. Cambridge University Press, Cambridge, UK. 209. Losos, J.B. (1996) Phylogenetic perspectives on community ecology. Ecology. 77, 1344-1354. 210. Robinson, W.D., Brawn, J.D. and Robinson, S.K. (2000) Forest bird community structure in central Panama: Influence of spatial scale and biogeography. Ecological Monographs. 70, 209-235. 211. Vanni, M.J., Flecker, A.S., Hood, J.M. and Headworth, J.L. (2002) Stoichiometry of nutrient recycling by vertebrates in a tropical stream: linking species identity and ecosystem processes. Ecology Letters. 5, 285-293.
418
L.-F. Bersier
212. Vitt, L.J. and Pianka, E.R. (2005) Deep history impacts present-day ecology and biodiversity. Proceedings of the National Academy of Sciences of the United States of America. 102, 7877-7881. 213. Borrvall, C., Ebenman, B. and Jonsson, T. (2000) Biodiversity lessens the risk of cascading extinction in model food webs. Ecology Letters, 3, 131-136. 214. Solé, R.V. and Montoya, J.M. (2001) Complexity and fragility in ecological networks. Proceedings of the Royal Society of London Series B-Biological Sciences. 268, 2039-2045. 215. Dunne, J.A., Williams, R.J. and Martinez, N.D. (2002) Network structure and biodiversity loss in food webs: robustness increases with connectance. Ecology Letters. 5, 558-567. 216. Allesina, S. and Bodini, A. (2004) Who dominates whom in the ecosystem? Energy flow bottlenecks and cascading extinctions. Journal of Theoretical Biology. 230, 351-358. 217. Courchamp, F., Langlais, M. and Sugihara, G. (1999a) Cats protecting birds: modelling the mesopredator release effect. Journal of Animal Ecology. 68, 282292. 218. Courchamp, F., Langlais, M. and Sugihara, G. (1999b) Control of rabbits to protect island birds from cat predation. Biological Conservation, 89, 219-225. 219. Courchamp, F., Langlais, M. and Sugihara, G. (2000) Rabbits killing birds: modelling the hyperpredation process. Journal of Animal Ecology. 69, 154-164. 220. Strauss, S.Y. (1991) Direct, indirect, and cumulative effects of three native herbivores on a shared host plant. Ecology. 72, 543-558 221. Strauss, S.Y. and Irwin, R.E. (2004) Ecological and evolutionary consequences of multispecies plant-animal interactions. Annual Review of Ecology Evolution and Systematics. 35, 435-466. 222. Ulanowicz, R.E. and Puccia, C.J. (1990) Mixed trophic impacts in ecosystems. Coenoses. 5, 7-16. 223. Hairston, N.G., Smith, F.E. and Slobodkin, L.B. (1960) Community structure, population control and competition. American Naturalist. 94, 421-425. 224. Fretwell, S.D. 
(1977) The regulation of plant communities by the food chains exploiting them. Perspectives in Biology and Medicine. 20, 169-185. 225. Carpenter, S.R., Kitchell, J.F., Hodgson, J.R., Cochran, P.A., Elser, J.J., Elser, M.M., Lodge, D.M., Kretchmer, D., He, X. and von Ende, C.N. (1987) Regulation of lake primary productivity by food web structure. Ecology. 68, 1863-1876. 226. Carpenter, S.R. and Kitchell, J.F. (1988) Consumer control of lake productivity. BioScience. 38, 764-769. 227. Strong, D.R. (1992) Are trophic cascades all wet? Differentiation and donorcontrol in speciose ecosystems. Ecology. 73, 747-754. 228. Pace, M.L., Cole, J.J., Carpenter, S.R. and Kitchell, J.F. (1999) Trophic cascades revealed in diverse ecosystems. Trends in Ecology and Evolution. 14, 483-488. 229. Chase, J.M. (2000) Are there real differences among aquatic and terrestrial food webs? Trends in Ecology and Evolution. 15, 408-412.
A History of the Study of Ecological Networks
419
230. Schmitz, O.J., Hamback, P.A. and Beckerman, A.P. (2000) Trophic cascades in terrestrial systems: A review of the effects of carnivore removals on plants. American Naturalist. 155, 141-153. 231. Schmitz, O.J. (1994) Resource edibility and trophic exploitation in an old-field food web. Proceedings of the National Academy of Sciences of the United States. 91, 5364-5367. 232. Schmitz, O.J., Beckerman, A.P. and O'Brien, K.M. (1997) Behaviorally mediated trophic cascades: Effects of predation risk on food web interactions. Ecology (Washington D C). 78, 1388-1399. 233. MacKay, N.A. and Elser, J.J. (1998) Factors potentially preventing trophic cascades: Food quality, invertebrate predation, and their interaction. Limnology and Oceanography. 43, 339-347. 234. Bell, T. (2002) The ecological consequences of unpalatable prey: phytoplankton response to nutrient and predator additions. Oikos, 99, 59-68. 235. Morin, P.J. and Lawler, S.P. (1995) Food web architecture and population dynamics: Theory and empirical evidence. Annual Review of Ecology and Systematics. 26, 505-529. 236. McCann, K.S., Hastings, A. and Strong, D.R. (1998) Trophic cascades and trophic trickles in pelagic food webs. Proceedings of the Royal Society of London Series B-Biological Sciences. 265, 205-209. 237. Holt, R.D. (1977) Predation, apparent competition, and the structure of prey communities . Theoretical Population Biology. 12, 197-229. 238. Holt, R.D. and Lawton, J.H. (1994) The ecological consequences of shared natural enemies. Annual Review of Ecology and Systematics. 25, 495-520. 239. Putman, R.J. (1994) Community Ecology. Chapman and Hall, London. 240. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. and Alon, U. (2002) Network motifs: simple building blocks of complex networks. Science. 298, 824-827. 241. Polis, G.A., Myers, C.A. and Holt, R. (1989) The ecology and evolution of intraguild predation: potential competitors that eat each other. 
Annual Review of Ecology and Systematics. 20, 297-330. 242. Holt, R.D. (1996) Community modules. In Multitrophic Interactions in Terrestrial Communities (Gange, A.C. and Brown, V.K.). 333-350. Blackwell Science, Oxford, UK. 243. Bascompte, J., Melián, C.J. and Sala, E. (2005) Interaction strength combinations and the overfishing of a marine food web. Proceedings of the National Academy of Sciences of the United States of America. 102, 5443-5447. 244. Memmott, J., Godfray, H.C.J. and Gauld, I.D. (1994) The structure of a tropical host-parasitoid community. Journal of Animal Ecology. 63, 521-540. 245. Memmott, J. and Godfray, H.C.J. (1994) The use and construction of parasitoid webs. In Parasitoid community ecology (Hawkins, B. A. and Sheehan, W.). 300318. Oxford University Press, Oxford. 246. Wilson, H.B., Hassell, M.P. and Godfray, H.C.J. (1996) Host-parasitoid food webs: Dynamics, persistence, and invasion. American Naturalist. 148, 787-806.
420
L.-F. Bersier
247. Müller, C.B., Adriaanse, I.C.T., Belshaw, R. and Godfray, H.C.J. (1999) The structure of an aphid-parasitoid community. Journal of Animal Ecology. 68, 346370. 248. Rott, A.S. and Godfray, H.C.J. (2000) The structure of a leafminer-parasitoid community. Journal of Animal Ecology. 69, 274-289. 249. Valladares, G.R., Salvo, A. and Godfray, H.C.J. (2001) Quantitative food webs of dipteran leafminers and their parasitoids in Argentina. Ecological Research. 16, 925-939. 250. Lewis, O.T., Memmott, J., Lasalle, J., Lyal, C.H.C., Whitefoord, C. and Godfray, H.C.J. (2002) Structure of a diverse tropical forest insect-parasitoid community. Journal of Animal Ecology. 71, 855-873. 251. Müller, C.B. and Godfray, H.C.J. (1997) Apparent competition between two aphid species. Journal of Animal Ecology. 66, 57-64. 252. Morris, R.J., Lewis, O.T. and Godfray, H.C.J. (2004) Experimental evidence of apparent competition in a tropical forest food web. Nature. 428, 310-313. 253. Inger, R.F. and Colwell, R.K. (1977) Organization of contiguous communities of amphibians and reptiles in Thailand. Ecological Monographs. 47, 229-253. 254. Hofer, U., Bersier, L.F. and Borcard, D. (2004) Relating niche and spatial overlap at the community level. Oikos. 106, 366-376. 255. Jordano, P. (1987) Patterns of mutualistic interactions in pollination and seed dispersal: Connectance, dependence asymmetries, and coevolution. American Naturalist. 129, 657-677. 256. Jordano, P. (1995) Angiosperm fleshy fruits and seed dispersers: A comparativeanalysis of adaptation and constraints in plant-animal interactions. American Naturalist. 145, 163-191. 257. Memmot, J. (1999) The structure of a plant-pollinator food web. Ecology Letters. 2, 276-280. 258. Kearns, C.A., Inouye, D.W. and Waser, N.M. (1998) Endangered mutualisms: The conservation of plant-pollinator interactions. Annual Review of Ecology and Systematics. 29, 83-112. 259. Bascompte, J., Jordano, P., Melián, C.J. and Olesen, J.M. 
(2003) The nested assembly of plant-animal mutualistic networks. Proceedings of the National Academy of Sciences of the United States of America. 100, 9383-9387. 260. Jordano, P., Bascompte, J. and Olesen, J.M. (2003) Invariant properties in coevolutionary networks of plant-animal interactions. Ecology Letters. 6, 69-81. 261. Tiunov, A.V. and Scheu, S. (2005) Facilitative interactions rather than resource partitioning drive diversity-functioning relationships in laboratory fungal communities. Ecology Letters. 8, 618-625. 262. Jones, C.G., Lawton, J.H. and Shachak, M. (1997) Positive and negative effects of organisms as physical ecosystem engineers. Ecology. 78, 1946-1957. 263. Wilby, A., Shachak, M. and Boeken, B. (2001) Integration of ecosystem engineering and trophic effects of herbivores. Oikos. 92, 436-444. 264. Berkenbusch, K. and Rowden, A.A. (2003) Ecosystem engineering - moving away from 'just-so' stories. New Zealand Journal of Ecology. 27, 67-73.
A History of the Study of Ecological Networks
421
265. Elser, J.J., Sterner, R.W., Gorokhova, E., Fagan, W.F., Markow, T.A., Cotner, J.B., Harrison, J.F., Hobbie, S.E., Odell, G.M. and Weider, L.J. (2000) Biological stoichiometry from genes to ecosystems. Ecology Letters. 3, 540-550. 266. Banašek-Richter, C., Cattin, M.F. and Bersier, L.F. (2004) Sampling effects and the robustness of quantitative and qualitative food-web descriptors. Journal of Theoretical Biology. 226, 23-32. 267. Sheppard, S.K. and Harwood, J.D. (2005) Advances in molecular ecology: tracking trophic links through predator–prey food-webs. Functional Ecology. 19, 751-762. 268. Holt, R.D. (2002) Food webs in space: On the interplay of dynamic instability and spatial processes. Ecological Research. 17, 261-273. 269. McCann, K.S., Rasmussen, J.B. and Umbanhowar, J. (2005) The dynamics of spatially coupled food webs. Ecology Letters. 8, 513-523. 270. Tscharntke, T., Klein, A.M., Kruess, A., Steffan-Dewenter I. and Thies C. (2005) Landscape perspectives on agricultural intensification and biodiversity – ecosystem service management. Ecology Letters. 8, 857-874. 271. Leibold, M.A., Holyoak, M., Mouquet, N., Amarasekare, P., Chase, J.M., Hoopes, M.F., Holt, R.D., Shurin, J.B., Law, R., Tilman, D., Loreau, M. and Gonzalez, A. (2004) The metacommunity concept: a framework for multi-scale community ecology. Ecology Letters. 7, 601-613.
This page intentionally left blank
CHAPTER 12 DYNAMIC NETWORK MODELS OF ECOLOGICAL DIVERSITY, COMPLEXITY, AND NONLINEAR PERSISTENCE
Richard J. Williams*,† and Neo D. Martinez†
* Microsoft Research Ltd, 7 J J Thomson Ave, Cambridge CB3 0FB, UK
† Pacific Ecoinformatics and Computational Ecology Lab, 1604 McGee Ave, Berkeley, CA 94703, USA
[email protected], [email protected]
Explorations of ecological networks have led a long line of scientists to debate the influence of diversity, in terms of species richness (the number of nodes), and of complexity, in terms of the number and structure of interactions. This research on how vast numbers of interacting species manage to coexist reveals a deep disparity between the ubiquity of complex ecosystems in nature and their mathematical improbability in theory. Here, we show how integrating models of food-web structure and nonlinear bioenergetic dynamics bridges this disparity and helps elucidate the mechanics of ecological complexity. Structural constraints of these networks, including the trophic hierarchy, contiguity, and looping formalized by the “niche model,” are shown to greatly increase persistence in complex model ecosystems. Behavioral nonlinearities, including interference between consumers and reduced consumption of rare resources, formalized by predator interference and new “type II.2” functional responses, further increase the diversity of dynamically persisting species. Integrating these empirically observed regularities yields remarkably comprehensive, extensible, and ecologically realistic models that revise the role of omnivory and emphasize the importance of network structure, short food chains, and behavioral ecology. These models also provide a powerful framework for adding non-trophic effects and for developing field experiments and hypotheses. Such work illuminates opportunities and pitfalls for other areas of network science developed in other chapters of this book, especially for areas without the 50-year research history that ecological networks enjoy. Highlighted opportunities include the integration of complex network structure and nonlinear dynamics; highlighted pitfalls include overdependence on one or a few network properties to describe network structure.

1. Introduction

One of the most important and least settled questions in ecology concerns the roles of network diversity and complexity in the dynamics and functioning of ecosystems (1). Scientists still have difficulty explaining why diversity, in terms of vast numbers of species, and complexity, in terms of species’ myriad interactions, are so ubiquitous in ecological systems (1,2). Early studies found that diversity and complexity increased ecosystem stability because the addition of consumers can prevent their prey from competitively excluding other prey (3) and because more feeding links among more species generally reduced the risk of species depending on few resources (4). Later, influential models of ecosystems demonstrated that diversity and complexity may actually destabilize ecosystems, either by increasing the chance of positive feedback loops (5,6) or because additional omnivorous interactions increase the time needed for perturbed species to return to equilibrium (7). While much early work emphasized equilibrium-based modeling and comparative empiricism applied to large ecosystems, later research placed more emphasis on nonlinear modeling and experimental empiricism focused on small modules within ecosystems.
These recent models and experiments find increases in complexity, such as the addition of weak and omnivorous interactions (8-10), and increases in diversity, such as increases in numbers of species and functional groups (11,12), to be stabilizing. Other work has confirmed the theoretically destabilizing effects of large amounts of complexity in experimental communities (13). These findings leave much disparity between the improbability of diverse complex ecosystems in theory and their pervasiveness in nature. In particular, it is unclear whether the stabilizing effects of omnivory (8), weak links (9,14), and diversity in small modules (11) and single trophic levels (12) also apply to large networks with several trophic levels.

Here, we address these issues by examining the persistence of species in nonlinear dynamical models of large complex ecological networks. These models replace limiting and unrealistic modeling assumptions (e.g., that food webs are random networks and that populations are at equilibrium (6,15)) with more mechanistic and empirically consistent biological diversity and complexity (16,9). Currently, few analyses of such models examine the nonlinear dynamics of more than ten species. We present results from integrated network models of ecosystem structure and dynamics with up to fifty species and systematically analyze the larger parameter space that such higher dimensionality creates. The structural “niche model” component successfully predicts at least 14 structural network properties of the largest and most complex food webs in the primary literature (17-21). The dynamical bioenergetic model component successfully simulates persistent and non-persistent stable, cyclic, and chaotic dynamics (22) that are often found in nature (23). We refer to the number of species as the diversity of the network and to its linkage density as the network’s complexity. Function refers to processes associated with species’ interactions, including rates of consumption and preferences for different prey.
We explore the interplay of structure and nonlinear dynamics by systematically varying diversity, complexity, and function in order to “elucidate the devious strategies which make for stability in enduring natural systems” (6). Our exploration expands on previously proposed strategies and shows how recently discovered structural and functional properties of ecological networks appear to promote stability and persistence in large complex ecosystems.
2. Models
Our bioenergetic network models are constructed in two steps. The first step specifies the structure of a food web network using one of three
R. J. Williams and N. D. Martinez
different stochastic models. The second step uses a nonlinear bioenergetic model to compute the dynamics of the network.
2.1. Structural Models and Food-Web Topology
All three models of network structure require the number of species in the system (S) and the density of trophic links (L) in terms of directed connectance (C = L/S²) as input parameters, but vary in the degree to which they constrain network organization. In the random model (24,25), any possible link among S species occurs with the same probability, equal to C of the empirical web. This creates webs as free as possible from biological structuring while maintaining the fundamental observed network properties of S and C. The modified (17) cascade model (24) creates a hierarchical structure by assigning each species a random value drawn uniformly from the interval [0,1] and giving each species a probability p = 2CS/(S-1) of consuming each species whose value is less than its own, and no others. The niche model (17) similarly assigns each species a randomly drawn “niche value.” The species are constrained to consume all species within one beta-distributed range of values whose mean = C and whose uniformly and randomly chosen center is less than the consumer’s niche value. Some niche model networks contain energetically unsustainable closed loops such as pairs of mutual predators that are preyed on but have no prey themselves. These networks are most common in small systems with low connectance and are eliminated from further consideration here.
When describing food webs, several conventions are employed. Top species have links to resources but not to consumers. Intermediate species have links to both resources and consumers. Basal species such as plants have links to consumers but not to resource species. Trophic levels quantify the path length between consumers and plant-derived energy, with plants being assigned trophic level 1.
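The three structural models described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, `web[i][j]` being true is taken to mean that species i consumes species j, and the niche-model details that the text does not spell out (a Beta(1, β) range with β = 1/(2C) − 1 scaled by the consumer's niche value, and a center drawn uniformly between half the range and the niche value) are assumptions borrowed from the usual formulation of the niche model. The removal of webs with unsustainable loops is omitted.

```python
import random

def random_web(S, C, rng):
    """Random model: each of the S*S possible directed links occurs with probability C."""
    return [[rng.random() < C for _ in range(S)] for _ in range(S)]

def cascade_web(S, C, rng):
    """Cascade model: species i eats each lower-valued species with p = 2CS/(S-1)."""
    p = 2.0 * C * S / (S - 1)
    # Indices stand in for the sorted random values, so j < i means "lower value".
    return [[j < i and rng.random() < p for j in range(S)] for i in range(S)]

def niche_web(S, C, rng):
    """Niche model: each consumer eats every species inside one contiguous range."""
    beta = 1.0 / (2.0 * C) - 1.0          # assumption: makes the mean range equal C
    niche = sorted(rng.random() for _ in range(S))
    web = []
    for n_i in niche:
        r = n_i * rng.betavariate(1.0, beta)   # width of the feeding range
        c = rng.uniform(r / 2.0, n_i)          # center at or below the consumer's value
        web.append([c - r / 2.0 <= n_j <= c + r / 2.0 for n_j in niche])
    return web

def connectance(web):
    """Directed connectance C = L / S^2."""
    S = len(web)
    return sum(map(sum, web)) / S**2
```

Because the species are sorted by niche value, each row of a niche web marks a contiguous block of prey, which is the structural constraint that distinguishes it from the cascade model. Under these assumptions C must be below 0.5 for the beta parameter to be positive.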
Herbivores are at trophic level 2 while carnivores of herbivores are at trophic level 3 and so on. Omnivores feed from more than one trophic level and typically have noninteger trophic levels (26). Herbivores only eat basal species. To remove the confounding variability of the number of basal species, omnivory and herbivory are calculated as the fractions of consumers that
are omnivores and herbivores respectively. Similarly, to better measure the trophic height of the consumers independent of the fraction of basal species, mean trophic level is the mean of all consumer species’ trophic levels. Among a variety of definitions of trophic level, we use a modification of previous trophic level definitions (27,28) that weights each consumer’s prey equally (26). A species’ connectivity is its total number of links (both incoming and outgoing) divided by the mean connectivity (2L/S) of the network.
2.2. Bioenergetic Model of Nonlinear Food-Web Dynamics
The dynamic model closely follows previous work (16,29,8,9) but is generalized to n species and arbitrary functional responses. Extending the earlier notation (16) to n-species systems, the variation of Bi, the biomass of species i, over time t, is given by
$$B_i'(t) = G_i(\vec{B}) - x_i B_i(t) + \sum_{j=1}^{n} \left( x_i y_{ij} \alpha_{ij} F_{ij}(\vec{B}) B_i(t) - x_j y_{ji} \alpha_{ji} F_{ji}(\vec{B}) B_j(t) / e_{ji} \right) \qquad (1)$$
The first term Gi(B) = ri Bi(t) (1 - Bi(t)/Ki) is the gross primary production rate of species i, where ri is the intrinsic growth rate that is non-zero only for basal species, and Ki is the carrying capacity; the second term is metabolic loss, where xi is the mass-specific metabolic rate; the third and fourth terms are gains from resources and losses to consumers respectively, where yij is the maximum rate at which species i assimilates species j per unit metabolic rate of species i; αij is the relative preference of species i for species j compared to the other prey of species i. αij is normalized so that the sum of αij (0 ≤ αij ≤ 1) across all j is 1 for consumer species and 0 for basal species; Fij(B), a non-dimensional functional response that may depend on resource and consumer species’ biomasses, gives the fraction of the maximum ingestion rate of predator species i consuming prey species j; eij is the conversion efficiency with which the biomass of species j lost due to consumption by species i is converted into the biomass of species i. Dividing the last term by eji converts the biomass assimilated by consumer j into biomass lost by
resource i. Non-zero αij’s are assigned according to the topology specified by the structural models. The many parameters in these equations have been estimated from empirical measurements (16) and there are wide ranges of biologically plausible values. While a wide variety of functional responses have been proposed in the literature (30-34), our model uses two different families of functional responses (FH and FBD, Fig. 1) that have both mechanistic and empirical justifications (34). The first FH, (22) is based on a parameterized form (35,36) of Holling’s (37,30) type II and III responses and generalizes earlier multiple species type II responses (9,10). FH of predator i consuming prey j is
$$F_{Hij}(\vec{B}) = \frac{B_j(t)^{1+q_{ij}}}{\sum_{k=1}^{n} \alpha_{ik} B_k(t)^{1+q_{ij}} + B_{0ji}^{1+q_{ij}}}, \qquad (2)$$
where B0ji is the half saturation density of species j when consumed by species i and qij controls the form of FH. As q increases, the functional response decelerates feeding on relatively rare resources and accelerates feeding on relatively abundant resources (Fig. 1). The range 0 ≤ qij ≤ 1 generalizes FH so that it can vary smoothly from the standard type II response (qij = 0) used in many earlier studies (16,29,8,9,38,10) to the standard type III response (qij = 1) (25,36,16) that stabilizes two-species systems. The FBD response models predator interference (34) by extending earlier models (32,33) to consumers of multiple species. FBD of predator i consuming prey j is
$$F_{BDij}(\vec{B}) = \frac{B_j(t)}{\sum_{k=1}^{n} \alpha_{ik} B_k(t) + (1 + c_{ij} B_i(t)) B_{0ji}} \qquad (3)$$
Similar to FH (3), FBD includes a control parameter cij ≥ 0 that quantifies the intensity of predator interference. When cij = 0, FBD is the standard type II response with no predator interference, and empirical studies suggest c ≈ 1 (34).
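To make the roles of q and c concrete, the two functional responses of Eqs. 2 and 3 can be written as small Python functions. This is a minimal sketch with hypothetical function and argument names; it assumes a single value of B0, q, and c per consumer, as in the simplified parameterization used in this chapter.

```python
def holling_response(B, alpha_i, j, B0, q):
    """Eq. 2: fraction of maximal ingestion of prey j by a consumer
    whose prey preferences are alpha_i (one entry per species in B)."""
    h = 1.0 + q
    denom = sum(a * b**h for a, b in zip(alpha_i, B)) + B0**h
    return B[j]**h / denom

def bd_response(B, alpha_i, j, B0, c, B_consumer):
    """Eq. 3: Beddington-DeAngelis response with predator interference c."""
    denom = sum(a * b for a, b in zip(alpha_i, B)) + (1.0 + c * B_consumer) * B0
    return B[j] / denom

# One prey species, B0 = 0.5: compare type II (q = 0) with type III (q = 1).
for prey_biomass in (0.1, 0.5, 1.0):
    f2 = holling_response([prey_biomass], [1.0], 0, 0.5, 0.0)
    f3 = holling_response([prey_biomass], [1.0], 0, 0.5, 1.0)
    print(prey_biomass, round(f2, 3), round(f3, 3))
```

With a single prey, both q values give F = 0.5 at a prey biomass of B0; below B0 the q = 1 response is lower than type II and above B0 it is higher, which is the deceleration on rare resources and acceleration on abundant resources described in the text. Setting c = 0 in `bd_response` recovers the type II response.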
[Figure 1: fraction of maximal consumption rate F (0 to 0.8) plotted against biomass of prey (0 to 2 B0), with curves for q = 1, q = 0.25, q = 0.1, c = 1, and c = q = 0.]
Figure 1. Illustration of the effects of control parameters q and c on fractions of maximal consumption rates (F) according to the FH and FBD functional responses in equations 2 and 3, respectively. Note that FBD also depends on the density of consumers that pushes the half saturation density (B0) of the dotted c = 1 line left or right as the consumer density decreases or increases, respectively.
Predator interference and type III responses are known to stabilize small food web modules (33,39,40,16) but have not previously been used to study the dynamics of relatively species-rich systems. In addition, small deviations from the type II response such as our “type II.2 response” (q = 0.2) have only recently been introduced and applied to food-web models with 10 or fewer species (21). We simplify the dynamical model through our choice of parameter values. First, we choose a single value for each of the parameters Ki=1, ri=1, xi=0.5, yij = 6, eij=1, and B0ij=0.5 for each set of a model’s iterations. Simulations that draw these parameters from normal distributions with specified mean and standard deviation (eij>1 not allowed) gave similar results to fixed parameter simulations (results not shown). Second, even though functional responses could be different for each link in the network (21), we specify a single value of qij or cij, so each link within a network is of the same type.
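With these fixed parameter values, the right-hand side of Eq. 1 reduces to a short function. The following Python sketch is one possible reading, not the authors' code: the function name and the `basal` flag list are hypothetical, the FH response of Eq. 2 is used, and the metabolic term is applied to every species exactly as Eq. 1 is written (whether basal species should carry it is not settled by the text). The integration scheme is left out because the chapter does not specify one.

```python
def biomass_derivatives(B, alpha, basal, q=0.2, r=1.0, K=1.0,
                        x=0.5, y=6.0, e=1.0, B0=0.5):
    """Right-hand side of Eq. 1 with a single value for each parameter.

    B      -- list of biomasses, one per species
    alpha  -- alpha[i][j] is the preference of consumer i for prey j (0 if no link)
    basal  -- basal[i] is True for species with logistic growth
    """
    n = len(B)
    h = 1.0 + q

    def F(i, j):
        # Eq. 2: Holling-type response of consumer i feeding on prey j.
        denom = sum(alpha[i][k] * B[k]**h for k in range(n)) + B0**h
        return B[j]**h / denom

    dB = []
    for i in range(n):
        growth = r * B[i] * (1.0 - B[i] / K) if basal[i] else 0.0
        gains = sum(x * y * alpha[i][j] * F(i, j) * B[i]
                    for j in range(n) if alpha[i][j] > 0)
        losses = sum(x * y * alpha[j][i] * F(j, i) * B[j] / e
                     for j in range(n) if alpha[j][i] > 0)
        dB.append(growth - x * B[i] + gains - losses)
    return dB

# Two-species example: species 0 is basal, species 1 eats species 0.
print(biomass_derivatives([0.5, 0.5], [[0, 0], [1, 0]], [True, False], q=0.0))
```

In the two-species example the consumer feeds at half its maximal rate because the prey biomass equals B0, so the consumer's gain term is x·y·0.5·B1 = 0.75 and the same amount (divided by e = 1) is removed from the prey.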
Unless stated otherwise, we assume that predator species have equal preference (αij) for all their prey. If ni is the number of prey that species i consumes, αij = 1/ni for each species j in the diet of species i. We also systematically vary the αij of omnivores to examine the effects of skewing diets to higher or lower trophic level prey. The range of αij is defined by a preference skewness k = αimax/αimin, where αimax and αimin are the preferences for the prey items of species i with the maximum and minimum trophic level (TLmax and TLmin) respectively. For each prey species j of species i, we define bij = 1 + (k − 1)(TLj − TLmin)/(TLmax − TLmin), where TLj is the trophic level of prey item j. The preference of species i for prey item j is then αij = bij / Σl bil, where the sum is across all prey items l of species i. When k = 1, all prey preferences of an omnivore are equal; when k < 1, low trophic level prey are preferred and when k > 1, high trophic level prey are preferred.
Each simulation begins by building an initial random, cascade, or niche model web of a certain size (S0) and connectance (C0). The integrated structure-dynamic model then computes which species persist with positive biomass greater than an extinction threshold of 10⁻¹⁵ after 4000 time steps. Following any extinctions, a “persistent web” with SP species and connectance CP remains. As the structural models are stochastic, this procedure is repeated a large number of times so that the statistical properties of the integrated structure-dynamic model can be ascertained. Both the functional response control parameters and a predator’s preferences among prey are varied to study effects of food-web dynamics on persistence and food-web structure. For each model iteration, we define absolute persistence PA = SP and relative persistence as PR = SP/S0. Overall persistence P is the mean value of PR across a set of iterations. Topological properties of the persistent webs were compared to different versions of niche webs. Here, we focus on the distribution of trophic levels and connectivity among species by examining the fractions of top, intermediate, basal, omnivorous, and herbivorous species, mean trophic level, and the standard deviation of the connectivity of each species.
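The skewed preferences defined above are easy to check numerically. The sketch below is illustrative only; the function names are hypothetical, and the equal-preference fallback for a consumer whose prey all share one trophic level is an assumption made to avoid dividing by zero.

```python
def skewed_preferences(prey_trophic_levels, k):
    """Return alpha_ij for one consumer, given its prey's trophic levels and skewness k."""
    tl_min = min(prey_trophic_levels)
    tl_max = max(prey_trophic_levels)
    n = len(prey_trophic_levels)
    if tl_max == tl_min:
        # Assumption: with no spread in prey trophic level, preferences stay equal.
        return [1.0 / n] * n
    b = [1.0 + (k - 1.0) * (tl - tl_min) / (tl_max - tl_min)
         for tl in prey_trophic_levels]
    total = sum(b)
    return [value / total for value in b]

def relative_persistence(S_P, S_0):
    """P_R = S_P / S_0 for a single model iteration."""
    return S_P / S_0

# An omnivore with prey at trophic levels 1, 2 and 3 and skewness k = 2:
# b = [1, 1.5, 2], so the preferences are b / 4.5.
print(skewed_preferences([1, 2, 3], 2.0))
```

For k = 2 the preference for the highest trophic-level prey is exactly twice that for the lowest, matching the definition k = αimax/αimin, and the preferences still sum to 1 as Eq. 1 requires.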
3. Topology and Dynamics
We analyzed the behavior of our dynamic network models with respect to the combined variation of several key parameters. The models’ high dimensionality prevents full examination of all the combinations of parameter values that were analyzed. Instead, we present a sequence of results that describes the effects of varying a few parameters and then fix these parameters and analyze effects of varying other parameters. Fixing the parameters at different values quantitatively changes the results. Therefore, we focus on overall behaviors that resist qualitative changes due to alternative choices. Perhaps most importantly, varying topology and the functional response control parameters profoundly affects persistence. Fig. 2a shows the effect of varying q on 30-species webs with an intermediate level (17) of C0 = 0.15 for food webs with initial topologies built using the random, cascade and niche models. All other input parameters are constant across all trials of the stochastic models unless otherwise indicated. Most or all species go extinct in every trial of random webs, and q and c have little if any effect on their relative persistence (P < 0.05). The structural constraints provided by the cascade model and especially the niche model increase P by more than an order of magnitude. In addition to this enormous effect of network structure, a large change in persistence occurs when q is increased from 0 to 0.1. In this range of q, cascade-web P increases 32% from 0.34 to 0.44 and niche-web P increases 44% from 0.43 to 0.62. Compared to cascade webs, niche webs are 27% to 50% more robust for any fixed q from 0 to 0.3 and more strongly increase in persistence for q > 0. Fig. 2b shows that predator interference causes a similar change in the persistence of 30-species webs when c varies across a biologically reasonable (34) range.
The effect of c on persistence is similar to the effects of q but, unlike q’s asymptotic effects, increasing c continually increases persistence across the whole range of values examined. Due to the similar effects of q and c, we present further results only for intermediately robust responses with q = 0.2 or c = 1.0, a choice that highlights the effects of altering other model parameters in a representative manner.
[Figure 2: two panels plotting persistence (0 to 0.7) for the niche, cascade and random models. (a) S0 = 30, C0 = 0.15, functional response FH, plotted against q (0 to 0.3). (b) S0 = 30, C0 = 0.15, functional response FBD, plotted against c (0 to 2).]
Figure 2. Mean overall persistence (P) of model food webs vs. functional response control parameter for networks built using the random, cascade (□) and niche (○) models. In (a) q controls the parameterized Holling (type II to “type II.3”) functional response (Eq. 2); in (b) c controls the Beddington-DeAngelis (BD) predator interference functional response (Eq. 3). All networks initially have S0 = 30 and C0 = 0.15. Values shown are averages of 500 trials.
Relative persistence (PR = PA / S0) of niche-model webs decreases linearly both with increasing initial network size (S0) and with increasing initial connectance (C0) (Fig. 3) as shown by linear regressions of PR as a function of the product S0C0, the network’s initial value of L/S. For the type II.2 response (q=0.2) with constant C0 = 0.15, PR = 0.87 – 0.05 S0 C0 (R2=0.48, n=2500); with constant S0 = 30, PR = 0.93 – 0.06 S0 C0 (R2=0.23, n=3500). Despite the negative effect of S0 on PR, absolute persistence (PA) increases with S0 from roughly 11 when S0 = 15 to approximately 25 when S0 = 50. We compared variation in CP with SP among persistent webs that were initially constructed with the niche model to two other sets of model webs (Fig. 4). These sets were created by starting with a set of niche webs using fixed parameters S0 = 30 and C0 = 0.15 and then randomly deleting species (41,42) to create networks with the same S as the persistent webs. Two deletion algorithms were used. One deletes species entirely at random and the other randomly deletes only non-basal species (42). C of niche webs increases with the number of entirely random deletions but varies little when basal species are protected. Despite the strong negative effects of C0 on P, CP of the most robust webs (SP > 21, PR > 0.7) is typically greater than the C of niche webs subjected to random deletions (Fig. 4). This suggests that structurally peculiar subsets of niche webs with relatively high C yield remarkably persistent networks (42). Both S and C affect many topological properties of empirical and niche-model webs (17,18,20,43). We examined how dynamic extinctions affect network topology by controlling for these effects and comparing the persistent webs with two sets of 1000 niche webs (Fig. 5a-e). One set had the initial values of S0 = 30 and C0 = 0.15 as inputs and non-basal species were randomly deleted until S = Sp. 
This compares persistent webs of a certain size to similarly sized niche webs subjected to randomized extinctions that leave C relatively unchanged (C ≈ C0 ≈ CP, Fig. 4). The second set was created using the values S = SP and C = CP as inputs into the niche model, allowing comparison between persistent webs of a certain size and similarly sized niche webs not subject to extinctions.
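The two deletion procedures used to build the comparison webs can be sketched as follows. This is an illustrative sketch with hypothetical names, not the authors' code: `web[i][j]` being true is taken to mean that species i consumes species j, and "basal" is determined once on the initial web (a species with no prey), which is an assumption since the text does not say whether species that lose all their prey during deletion are reclassified.

```python
import random

def delete_species(web, target_S, protect_basal, rng):
    """Randomly delete species until target_S remain.

    If protect_basal is True, only species that had prey in the initial web
    (non-basal species) may be deleted.
    """
    S = len(web)
    basal = {i for i in range(S) if not any(web[i])}
    keep = list(range(S))
    while len(keep) > target_S:
        candidates = [i for i in keep if not (protect_basal and i in basal)]
        if not candidates:      # nothing left that may be deleted
            break
        keep.remove(rng.choice(candidates))
    return [[web[i][j] for j in keep] for i in keep]

def connectance(web):
    """Directed connectance C = L / S^2."""
    S = len(web)
    return sum(map(sum, web)) / S**2

# Four-species example: 0 and 1 are basal, 2 eats both, 3 eats 2.
T, F = True, False
web = [[F, F, F, F],
       [F, F, F, F],
       [T, T, F, F],
       [F, F, T, F]]
smaller = delete_species(web, 3, protect_basal=True, rng=random.Random(1))
print(len(smaller), connectance(smaller))
```

Comparing `connectance` of the reduced webs with CP of the persistent webs of the same size is the kind of comparison summarized in Fig. 4.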
[Figure 3: two panels plotting persistence for the functional response with q = 0.2 and the BD functional response with c = 1, against (a) S0 (15 to 50) and (b) C0 (0.05 to 0.3).]
Figure 3. Mean overall persistence (P) of model food webs vs. (a) initial network size S0 for niche model networks with C0 = 0.15, and (b) initial network connectance C0 for niche model networks with S0 = 30. The dynamical model uses (○) the parameterized Holling type II.2 functional response where q = 0.2 (Eq. 2) and (□) the Beddington-DeAngelis (BD) predator interference functional response with c = 1 (Eq. 3). Values shown are averages of 500 trials. The regression lines are (a) Type II.2: P = 0.874 – 0.00770S0, r² = 0.996; BD: P = 0.799 – 0.00682S0, r² = 0.992 and (b) Type II.2: P = 0.927 – 1.923C0, r² = 0.998; BD: P = 0.862 – 1.799C0, r² = 0.997.
[Figure 4: connectance C (0.1 to 0.2) plotted against network size S (10 to 30) for the dynamical model, random deletions, and random deletion of consumers.]
Figure 4. Mean connectance C of model food webs versus dynamically persistent network size S (×), with error bars showing plus and minus two standard errors of the estimated mean. The points without error bars show the mean connectance of 1000 niche model networks that have species deleted at random (○) or have consumer species deleted at random (□). All initial networks are built using the niche model with S0 = 30, C0 = 0.15, and the dynamical model uses our Holling “type II.2” functional response where q = 0.2 (Eq. 2).
Compared to either set of niche webs, persistent webs consistently have a higher fraction of basal species and consumers with lower mean trophic levels, especially in the larger, most persistent webs (SP > 25, Fig. 5ab). Both of these properties vary with SP in the same direction as, but less strongly than, they vary with S in niche webs. The fractions of consumer species that are omnivores or herbivores are higher in the persistent webs than in the niche webs (Fig. 5c-d). This helps explain persistent webs’ lower mean trophic levels. The differences in herbivore and basal species richness tend to lose their statistical significance as webs get smaller, while the differences in mean trophic level also get smaller but remain significant. The fraction of omnivorous consumers was often slightly (5-10%) though not significantly higher in the highly robust persistent webs (SP > 25), whereas there was a slight deficit of omnivores in less robust persistent networks (SP < 15). The standard
[Figure 5: five panels plotted against S (10 to 30) for the dynamical model, the niche model, and random deletion of consumers: fraction of basal species, mean trophic level of consumers, omnivory as fraction of consumers, herbivory as fraction of consumers, and SD of node connectivity.]
Figure 5. Mean and variation of model food-web properties versus persistent network size S (×). Error bars show plus and minus two standard errors of the estimated mean. Points without error bars show the mean property value in 1000 niche model networks with the same size and connectance as the dynamical model networks (○) and in 1000 niche model networks with the same initial size and connectance as the dynamically constrained networks that then had consumer species deleted at random (□). Properties shown are (a) fraction of basal species, (b) trophic level, (c) fraction of consumers that are omnivores, (d) fraction of consumers that are herbivores, and (e) standard deviation of node connectivity. Initial networks are built using the niche model with S0 = 30, C0 = 0.15, and the dynamical model uses our Holling “type II.2” functional response with q = 0.2 (Eq. 2).
deviations of node connectivity were similar between persistent and niche webs but random deletions increased these deviations above those in persistent webs (Fig. 5e). This similarity also applies to the standard deviation of the number of incoming and outgoing links taken separately, properties previously termed the generality and vulnerability, respectively (17). We examined omnivory more finely by altering the skewness of omnivores’ preference for prey at different trophic levels. Such skewness has profound effects on overall persistence, P (Fig. 6), similar to the effects of varying the functional-response parameter q. Niche webs are most persistent (P ≈ 0.42 when q = 0 and P ≈ 0.64 when q = 0.2) when omnivores prefer lower trophic-level resources but avoid near exclusive consumption of the lowest trophic-level resources (0.2 < skewness < 0.8). Persistence drastically falls to as low as P ≈ 0.25 when q = 0 and P ≈ 0.34 when q = 0.2 as omnivores more strongly prefer upper trophic-level resources (skewness = 10).
[Figure 6: persistence (0 to 0.7) plotted against skewness k (axis ticks at 0.1, 1 and 10) for q = 0 and q = 0.2.]
Figure 6. Mean (n=500) overall persistence P of model food webs vs. skewness k of the prey preference of omnivores. When k = 1, all prey preferences of an omnivore are equal; when k < 1, low trophic level prey are preferred and when k > 1, high trophic level prey are preferred (see methods). All networks initially have S0 = 30, C0 = 0.15, and the dynamical model uses parameterized Holling type II (q = 0) and II.2 (q = 0.2) functional responses (Eq. 2).
3.1. Effects of Structure on Dynamics
Our results generally illuminate how the structure of ecological networks may influence their function by examining the effects of diversity and complexity on in silico ecosystem dynamics. One early and remarkably durable theory based on linear stability analyses of random networks (6) proposed that S and C have hyperbolically negative effects on stability. Qualitatively similar effects occur in our nonlinear analyses of more ecologically realistic networks, but the effects are linear rather than hyperbolic (6), perhaps due to the differences between linear stability and nonlinear persistence. C affects persistence much more strongly than does S. This is illustrated by the regressions in which variance in C explains over twice as much variance of PR as does variance in S. This greater importance of C than S to persistence had been previously noted, but the negative effects of C observed here are opposite to the previously noted positive effects (42,10,2). Analyzing the effects of deleting species or otherwise challenging persistent webs to study their robustness may clarify this discrepancy. Beyond the classic effects of S and C on dynamics, our study illustrates the overriding importance of the overall arrangement of links among species. Random webs have almost no persistence while the hierarchical ordering of the cascade model vastly increases persistence. The contiguous niches and looping (43) in the niche model appear to confer even more persistence on food-web networks. The hierarchical ordering of the cascade and niche models is easily interpreted as a mechanistic formalization of energy flowing from plants to upper trophic levels. Models that ignore such distinctions between plants and animals by making all species capable of growing without consuming other species (2) fail to detect the significance of nonrandom and hierarchical network structure (45).
Niche space as formalized by the niche model is much less easily interpreted and deserves more study to understand which evolutionary, ecological, and mathematical factors underlie the model’s improved empirical fit (17) and persistence (Fig. 2).
These effects of network structure on dynamics closely mirror the degree to which model networks mimic the structure of ecological networks (17,18,20,46,47,21). The random model mimics very few properties of networks and dynamically sustains very few species. The cascade model mimics several natural network properties such as the fractions of top, intermediate and basal species and dynamically sustains ~50% or less of the original species within cascade webs. The niche model mimics over a dozen network properties and typically sustains between half and two thirds of the original species within niche webs. This suggests that a structural model that even more accurately mimics natural webs will dynamically sustain even larger fractions of species. This also suggests that scientists should be somewhat skeptical of models that mimic very few network properties, especially if these properties can be reproduced by a vast number of highly disparate models (48). Instead, models such as the niche model that match a broad and well populated range of network properties may provide much more robust frameworks for integrating and exploring network structure and dynamics of the natural systems of interest.
3.2. Effects of Dynamics on Structure
Our work illuminates how the functioning of ecological networks influences their structure by examining the effects of nonlinear dynamics on the topology of complex food webs. Within network science, such analyses and influences may be only generalizable to networks such as food webs and pollination webs (45) whose nodes critically depend on interactions for their continued existence. Within ecology, our results show for the first time that the stabilizing effects of both predator interference and respective decelerated and accelerated feeding on rare and abundant resources found in small modules of two species also apply to much larger networks with 30 or more species. This enables large complex food webs to sustain many more species than networks governed by standard type II responses. This remarkable persistence greatly increases the potential to theoretically and computationally add
other ecological processes such as facilitation, age-structured populations, migration, and environmental stochasticity to models of large ecological networks, which should further facilitate exploration of their effects on ecological structure and dynamics. We also show that small and perhaps empirically undetectable changes in functional responses foster greatly increased persistence in model ecosystems (21). This suggests that tiny amounts of prey switching behavior of consumers (37,2) or refuge seeking behavior of resources (36,49) have large effects on the structure and dynamics of complex ecological networks. This suggestion complements recent empirical findings (33,49), suggesting these functions as some of nature’s more prevalent and important stabilizing strategies. More effects of network function on network structure are seen in comparisons between persistent webs and webs generated by structural models free from explicit dynamics. Persistent webs typically have C similar to that in niche webs whose consumers are randomly deleted but lower than that in niche webs subjected to random deletions of any species. More strikingly, persistent webs have higher fractions of basal species and consumers with lower mean trophic levels than do niche webs. This is consistent with the niche model’s systematic overestimation of empirically observed food-chain lengths (17), assuming that empirical webs have more persistent topologies than do niche webs. While the SD of node connectivity shows few differences between niche webs subjected to dynamic loss of species and random loss of consumers, more detailed investigation of degree distributions (20) could illuminate differences hidden by our relatively coarse analysis. Given the niche model’s overestimation of the mean trophic level of consumers in large persistent webs by almost a whole level (Fig. 5a) and its underestimation of the fraction of herbivores by ~0.07 (Fig. 5c), we tested the niche model against these properties of the seven empirical webs originally compared to the niche model (17). Table 1 shows that the niche model consistently overestimates mean trophic level by 0.2-2.4 levels and consistently underestimates the fraction of herbivores by
0.01-0.32. Apparently, dynamics alters these properties of niche webs to become even more similar to empirically observed properties. The empirically observed fraction of basal species is well explained by the niche model (17), so the higher fraction of basal species observed in the dynamically constrained networks (Fig. 5b) appears to conflict with empirical findings. This discrepancy may be due to highly aggregated and poorly described basal species in the empirical data. For example, basal species in the St. Martin island food web (50) are categories of plant material such as seeds, leaves, etc. Many basal taxa in the Bridge Brook Lake (51) food web are trophically identical, suggesting that the trophic links are poorly resolved (52). Therefore, the fraction of basal species in the observed trophic-species networks and the niche model’s fit to these fractions could be methodological artifacts of taxonomic and trophic resolution. The importance of basal species to persistence emphasizes the need for high quality data resolved evenly at all trophic levels (53). Alternatively, artifacts of the dynamical model might cause the discrepancy (45). Our models assume that basal species do not compete for shared resources. Adding competition among basal species might lower the fraction of basal species in the persistent webs.
Table 1. Errors of niche model predictions of the fraction of herbivores (Herbivory) and mean trophic level (TL) of consumers in empirical food webs. S is the number of trophic species. C is directed connectance. Error is measured both as the difference between the model’s mean property and the empirically observed property (in parentheses) and in more rigorously comparable terms of the number of model standard deviations that the empirically observed property differs from the model’s mean (17).
Food Web Name        S     C       Herbivory        TL of consumers
St. Martin Island    42    0.12    -2.7 (-0.15)     1.4 (0.79)
Bridge Brook L.      25    0.17    -3.9 (-0.19)     1.5 (1.23)
Coachella Valley     29    0.31    -1.3 (-0.04)     0.6 (1.24)
Chesapeake Bay       31    0.072   -0.2 (-0.01)     0.6 (0.21)
Skipwith Pond        25    0.32    -7.8 (-0.29)     0.1 (2.39)
Ythan Estuary        78    0.061   -4.1 (-0.20)     1.6 (0.60)
Little Rock L.       92    0.12    -12.7 (-0.32)    2.5 (1.52)
Mean                               -4.62 (-0.17)    1.17 (1.14)
Std error                          1.65 (0.04)      0.30 (0.27)
3.3. Omnivory
One of the more confusing interdependencies between food-web structure and dynamics concerns the issue of omnivory. There is a close positive and confounding relationship between omnivory and C in earlier studies (8,10), since increasing C typically makes consumers more omnivorous and increasing omnivory typically increases C. We help clarify this issue by controlling for the strong effects of C on persistence (Fig. 4) and showing that the prevalence of omnivorous consumers in persistent webs is usually similar to that in niche webs (Fig. 5d), which is typically much less than in cascade webs (17). If structural omnivory had an unusually strong positive effect on persistence, one would expect higher omnivory in the most persistent niche webs and more persistence in cascade webs. This is not generally supported by our results. Contemporary modeling studies also tend to confound increasing omnivory with lowering consumers’ trophic levels by increasing omnivory in a narrow fashion. That is, omnivory is typically created by adding short paths that enable carnivores to consume primary production (8,10). Adding this type of omnivory lowers the consumer’s trophic level. Omnivory that increases a consumer’s trophic level, for example by adding carnivorous links to an herbivore’s diet, is typically avoided. Omnivores that prefer higher trophic level prey strongly decrease persistence compared to omnivores lacking such preference, while variable preference for low levels has much less effect (Fig. 6). These findings, combined with consumers’ relatively low trophic levels and high prevalence of basal species and herbivores in the most persistent niche webs, suggest that shortening food chains and reducing trophic levels account for the stabilizing effects previously attributed to omnivory. In contrast, omnivory strongly decreases persistence in food webs when omnivores engage in the empirically unusual (19) destabilizing behavior of preferring prey at higher trophic levels.
4. Conclusion
Our analyses address several historically perplexing aspects of the remarkable complexity and persistence of natural ecosystems and show how more empirically prevalent aspects of ecological interactions (17,33,25,49) may confer persistence on large complex ecosystems. Both food-web structure as characterized by the relatively successful niche model and food-web function as characterized by decelerated consumption of rare resources (49), predator interference (33), and omnivores’ preferences for lower trophic-level prey (25) greatly increase the diversity and complexity that persists in ecological networks. While all models are simplifications of nature, formal inclusion of these frequently observed regularities indicates that our ecological models may be the most biologically informed and empirically well-corroborated in terms of their detailed diversity, complexity, structure, function, and dynamics. Some of the increased persistence resulting from including these factors appears to have been mistakenly attributed to unqualified omnivory. The strong effects of predator interference and decelerated and accelerated feeding on relatively rare and abundant resources, respectively, suggest that other behaviors that reduce consumption of rare resources, e.g. prey switching (37,2), will also stabilize large complex networks. In contrast, responses that increase consumption of rarer and higher trophic level resources, e.g. economic exploitation of relatively rare carnivorous fishes (55), can be expected to decrease persistence. Perhaps a larger lesson to be taught by these studies is that several solutions to the devilishly difficult problem of understanding the structure and nonlinear function of complex networks have been found in the subtle details of these networks.
That is, the fine structure of the particular location of the links may matter much more than particular distributions of degrees among nodes or species among trophic levels that can be simulated by a wide variety of network models (21). Similarly, the particular trajectory by which functional responses reach their maximum consumption rates may matter more than the presence or magnitude of these rates’ asymptotic maxima. While interdisciplinary
network theory has much to offer many scientific disciplines (56,55), useful application of such theory may critically depend on understanding the devil of disciplinary details essential to the structure and function of real-world complex networks. Within the discipline of ecology, the persistent models described here provide new tools to explore non-trophic processes including invasions, extinctions, experimental manipulations, environmental variability, and spatial processes. Such processes could be simulated by manipulating our model’s parameters, e.g. stochastically varying basal species’ carrying capacity, and adding different functions, e.g. density-dependent migration. More study of these models as well as empirical and especially experimental tests of their findings could significantly refine our results. Such combined studies could do much to bring about exciting new insights regarding the trophic and non-trophic interactions in the large complex networks that sustain the stunning, yet tragically diminishing, levels of diversity in nature.
Acknowledgements
Jennifer A. Dunne, Ulrich Brose, and Jessica Green are greatly appreciated for comments on the manuscript. NSF provided support for RJW and NDM. NDM especially thanks the National Center for Ecological Analysis and Synthesis and the NSF funded IGERT Program in Nonlinear Systems and hospitality of the Telluride House, both at Cornell University, for support. Correspondence and Requests for materials should be addressed to: E-mail:
[email protected], Telephone: 510-295-7624, Fax: 970-349-7481
References
1. McCann, K. (2000). The diversity-stability debate. Nature. 405, 228-233.
2. Kondoh, M. (2003). Foraging adaptation and the relationship between food-web complexity and stability. Science. 299, 1388-1391.
3. Paine, R.T. (1966). Food web complexity and species diversity. American Naturalist. 100, 65-75.
4. MacArthur, R.H. (1955). Fluctuation of animal populations and a measure of community stability. Ecology. 36, 533-536.
5. Gardner, M.R. and Ashby, W.R. (1970). Connectance of large dynamic (cybernetic) systems: critical values for stability. Nature. 228, 784.
6. May, R.M. (1973). Stability and Complexity in Model Ecosystems. Princeton Univ Press, Princeton.
7. Pimm, S.L. and Lawton, J.H. (1978). On feeding on more than one trophic level. Nature. 275, 542-544.
8. McCann, K. and Hastings, A. (1997). Re-evaluating the omnivory-stability relationship in food webs. Proc R Soc Lond. B. 264, 1249-1254.
9. McCann, K., Hastings, A. and Huxel, G.R. (1998). Weak trophic interactions and the balance of nature. Nature. 395, 794-798.
10. Fussmann, G.F. and Heber, G. (2002). Food web complexity and chaotic population dynamics. Ecology Letters. 5, 394-401.
11. Naeem, S., Thompson, L.J., Lawler, S.P., Lawton, J.H. and Woodfin, R.M. (1994). Declining biodiversity can affect the functioning of ecosystems. Nature. 368.
12. Tilman, D., Reich, P.B., Knops, J., Wedin, D., Mielke, T. and Lehman, C. (2001). Diversity and productivity in a long-term grassland experiment. Science. 294, 843-845.
13. Fox, J.W. and McGrady-Steed, J. (2002). Stability and complexity in microcosm communities. Journal of Animal Ecology. 71, 749-756.
14. Berlow, E.L. (1999). Strong effects of weak interactions in ecological communities. Nature. 398, 330-334.
15. Yodzis, P. (2000). Diffuse effects in food webs. Ecology. 81, 261-266.
16. Yodzis, P. and Innes, S. (1992). Body-size and consumer-resource dynamics. American Naturalist. 139, 1151-1173.
17. Williams, R.J. and Martinez, N.D. (2000). Simple rules yield complex food webs. Nature. 404, 180-183.
18. Camacho, J., Guimera, R. and Amaral, L.A. (2002). Robust patterns in food web structure. Phys Rev Let. 88, 228102.
19. Williams, R.J. and Martinez, N.D. (2002). Trophic levels in complex food webs: theory and data. Santa Fe Institute Working Paper. 02-10-056.
20. Dunne, J.A., Williams, R.J. and Martinez, N.D. (2002b). Food-web structure and network theory: the role of connectance and size. Proc Nat Acad Sci. 99, 12917-12922.
21. Stouffer, D.B., Camacho, J., Guimera, R., Ng, C.A. and Amaral, L.A. (2005). Quantitative patterns in the structure of model and empirical food webs. Ecology. 86, 1301-1311.
22. Williams, R.J. and Martinez, N.D. (2004b). Stabilization of chaotic and nonpermanent food web dynamics. European Physics Journal B.
23. Kendall, B.E., Prendergast, J. and Bjornstad, O.N. (1998). The macroecology of population dynamics: taxonomic and biogeographic patterns in population cycles. Ecology Letters. 1, 160-164.
24. Cohen, J.E., Briand, F. and Newman, C.M. (1990). Community food webs: data and theory. Springer-Verlag, Berlin.
25. Solow, A.R. and Beet, A.R. (1998). On lumping species in food webs. Ecology. 79, 2013-2018.
26. Williams, R.J. and Martinez, N.D. (2004a). Limits to trophic levels and omnivory in complex food webs: theory and data. American Naturalist. 163, 458-468.
27. Levine, S. (1980). Several measures of trophic structure applicable to complex food webs. Journal of Theoretical Biology. 83, 195-207.
28. Adams, S.M., Kimmel, L.B. and Plokey, R.G. (1983). Sources of organic matter for reservoir fish production: A trophic-dynamics analysis. Canadian Journal of Fisheries and Aquatic Science. 40, 1480-1495.
29. McCann, K. and Yodzis, P. (1995). Biological conditions for chaos in a three-species food chain. Ecology. 75, 561-564.
30. Holling, C.S. (1959b). Some characteristics of simple types of predation and parasitism. Can. Entom. 91, 385-399.
31. Hassell, M.P. and Varley, G.C. (1969). New inductive population model for insect parasites and its bearing on biological control. Nature. 223, 1133-1136.
32. Beddington, J.R. (1975). Mutual interference between parasites or predators and its effects on searching efficiency. Journal of Animal Ecology. 51, 331-340.
33. DeAngelis, D.L., Goldstein, R.A. and O'Neill, R.V. (1975). A model for trophic interaction. Ecology. 56, 881-892.
34. Skalski, G.T. and Gilliam, J.F. (2001). Functional responses with predator interference: viable alternatives to the Holling type II model. Ecology. 82, 3083-3092.
35. Real, L.A. (1977). The kinetics of functional response. American Naturalist. 111, 289-300.
36. Real, L.A. (1978). Ecological determinants of functional response. Ecology. 60, 481-485.
37. Holling, C.S. (1959a). The components of predation as revealed by a study of small-mammal predation of the European pine sawfly. Can. Entom. 91, 293-320.
38. Post, D.M., Conners, M.E. and Goldberg, D.S. (2000). Prey preference by a top predator and the stability of linked food chains. Ecology. 81, 8-14.
39. Murdoch, W.W. and Oaten, A. (1975). Predation and population stability. Adv. Ecol. Res. 9, 1-131.
40. Hassell, M.P. (1978). The dynamics of arthropod predator-prey systems. Princeton University Press, Princeton.
41. Solé, R.V. and Montoya, J.M. (2001). Complexity and fragility in ecological networks. Proc Roy Soc B. 268, 2039-2045.
42. Dunne, J.A., Williams, R.J. and Martinez, N.D. (2002a). Network structure and biodiversity loss in food webs: Robustness increases with connectance. Ecology Letters. 5, 558-567.
43. Williams, R.J., Berlow, E.L., Dunne, J.A., Barabási, A.L. and Martinez, N.D. (2002). Two Degrees of Separation in Complex Food Webs. Proc Nat Acad Sci. 99, 12913-12916.
44. Neutel, A.M., Heesterbeek, J.A.P. and de Ruiter, P.C. (2002). Stability in real food webs: weak links in long loops. Science. 296, 1120-1123.
45. Bascompte, J.P., Jordano, P., Melian, C.J. and Olesen, J.M. (2003). The nested assembly of plant-animal mutualistic networks. Proceedings of the National Academy of Sciences of the United States of America. 100, 9383-9387.
46. Sarnelle, O. (2003). Nonlinear effects of an aquatic consumer: causes and consequences. American Naturalist. 161, 478-496.
47. Dunne, J.A., Williams, R.J. and Martinez, N.D. (2004). Network structure and robustness of marine food webs. Marine Ecology Progress Series. 273, 291-302.
48. Goldwasser, L. and Roughgarden, J. (1993). Construction of a large Caribbean food web. Ecology. 74, 1216-1233.
49. Havens, K. (1992). Scale and structure in natural food webs. Science. 257, 1107-1109.
50. Goldwasser, L. and Roughgarden, J. (1993). Construction and Analysis of a Large Caribbean Food Web. Ecology. 74, 1216-1233.
51. Martinez, N.D., Hawkins, B.A., Dawah, H.A. and Feifarek, B. (1999). Effects of sampling effort on characterization of food-web structure. Ecology. 80, 1044-1055.
52. Cohen, J.E., Beaver, R.A., Cousins, S.H., DeAngelis, D.L., Goldwasser, L., Heong, K.L., Holt, R.D., Kohn, A.J., Lawton, J.H., Martinez, N., O'Malley, R., Page, L.M., Patten, B.C., Pimm, S.L., Polis, G.A., Rejmánek, M., Schoener, T.W., Schoenly, K., Sprules, W.G., Teal, J.M., Ulanowicz, R.E., Warren, P.H., Wilbur, H.M. and Yodzis, P. (1993). Improving Food Webs. Ecology. 74, 252-258.
53. Brose, U., Williams, R.J. and Martinez, N.D. (2003). Comment on "Foraging adaptation and the relationship between food-web complexity and stability". Science. 301, 918b-918c.
54. Pauly, D., Christensen, V., Guénette, S., Pitcher, T.J., Sumaila, U.R., Walters, C.J., Watson, R. and Zeller, D. (2002). Toward sustainability in world fisheries. Nature. 418, 689-695.
55. Albert, R. and Barabasi, A.L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics. 74, 47-97.
56. Strogatz, S.H. (2001). Exploring complex networks. Nature. 410, 268-276.
CHAPTER 13 INFECTION TRANSMISSION THROUGH NETWORKS
James S. Koopman Dept. of Epidemiology, University of Michigan, USA
[email protected]
1. Introduction
The epidemiologic analysis of systems that cause infection in human populations informs the spending of billions of dollars and affects the lives of millions. This is the case for deciding on how to stem the ravages of a new pandemic flu strain, how to treat water to prevent transmission of diverse infectious agents such as Cryptosporidia, how to focus HIV treatment to stop transmission, and how to respond to emerging infections like SARS. Network models are making new and important contributions to such analyses. Network model abstractions, however, may not always be appropriate. Infection transmission systems are complex and diverse. Infectious agents usually spread from one individual, species, or environment to another via multiple modes. For example, many respiratory pathogens can spread through the air over long distances, via droplets from sneezing over short distances, and via touching mucosa then surfaces that are touched by others who touch their mucosa. Each mode alone would generate different transmission dynamics. Combinations of modes might occur in patterns that make one or another mode the key to control. For example, even though droplet spread may be more frequent than airborne transmission, airborne transmission might play a key role in the network and thus offer a better target for control. The effect of interventions on
different modes of transmission varies by the (a) paths that infection transmission can traverse, (b) natural histories of infection and immunity, (c) survivability of infectious agents outside a host, (d) environmental factors such as temperature, humidity, acidity, oxidative potential, etc., (e) the dose of infectious agent required to initiate an infection, (f) occurrence of other comorbidities affecting host immunocompetence, (g) complex evolutionary patterns that generate diverse strains with only partial cross-reactivity, and many other factors. For more than a century, the major traditions for making simplifying model assumptions were based on differential equation analyses that ignored the influence of enduring contact networks. Those traditions arose to accommodate the analytical tools available rather than as the result of a careful consideration of what simplifying assumptions might do to the inferences made from analyzing a model. Those traditions provided insights and helped organize thinking about infection control. But they failed to generate a progressive science of transmission systems characterized by increasingly robust and data-driven model development. Infection transmission science is still a data-poor discipline with many isolated methods and theories. Perhaps network analyses could play an integrative role that helps epidemiologists validate their methods and theories. One reason for the failure of differential equation models to launch a transmission system science is that they intrinsically ignore a number of network phenomena that are centrally important to infection transmission. Newman lists some of these (1,2) including (a) the small world effect, (b) transitivity, (c) degree distributions, (d) network resilience, (e) mixing patterns, (f) degree correlations, (g) community structure, (h) network navigation, (i) component size distributions, and (j) distributions of betweenness centrality (see Chapters 1 and 2).
Powerful new tools for analyzing infection transmission networks have been developed. Beginning with the specification of probability generating functions (pgfs) for network degree distributions, a series of analytic methods have made it possible to solve for the probabilities of epidemics and the expected sizes of epidemics under increasingly complex conditions without burdensome simulations that make it difficult to assess the implications of model variations (2-12).
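The generating-function approach mentioned above can be illustrated with a minimal sketch. It assumes the simplest textbook setting, which the chapter does not spell out: an undirected random network specified only by its degree distribution, and one fixed probability T that infection crosses any given edge. The function names, the Poisson example, and the truncation at degree 60 are illustrative choices, not part of the cited work.

```python
import math

def poisson_degrees(mean, kmax=60):
    """Truncated, renormalised Poisson degree distribution p[0..kmax]."""
    p = [math.exp(-mean) * mean**k / math.factorial(k) for k in range(kmax + 1)]
    total = sum(p)
    return [pk / total for pk in p]

def g0(p, x):
    """pgf of the degree of a randomly chosen node."""
    return sum(pk * x**k for k, pk in enumerate(p))

def g1(p, x):
    """pgf of the number of other edges of a node reached by following an edge."""
    mean_degree = sum(k * pk for k, pk in enumerate(p))
    return sum(k * pk * x**(k - 1) for k, pk in enumerate(p) if k > 0) / mean_degree

def critical_transmissibility(p):
    """Per-edge transmission probability above which large epidemics are possible."""
    m1 = sum(k * pk for k, pk in enumerate(p))
    m2 = sum(k * k * pk for k, pk in enumerate(p))
    return m1 / (m2 - m1)

def expected_epidemic_size(p, T, tol=1e-12, max_iter=100000):
    """Fraction of nodes infected in a large epidemic (about 0 below threshold)."""
    # u: probability that a neighbour reached along an edge is not infected
    # through any of its other edges. Iterating from 0 converges to the
    # smallest solution of u = g1(1 - T + T*u).
    u = 0.0
    for _ in range(max_iter):
        u_next = g1(p, 1.0 - T + T * u)
        converged = abs(u_next - u) < tol
        u = u_next
        if converged:
            break
    return 1.0 - g0(p, 1.0 - T + T * u)

if __name__ == "__main__":
    degrees = poisson_degrees(3.0)
    print(critical_transmissibility(degrees))
    print(expected_epidemic_size(degrees, 0.6))
```

For a Poisson degree distribution with mean 3 the threshold is about 1/3, and at T = 0.6 the expected size is roughly 0.73 of the population. Changing the degree distribution only requires passing a different list of probabilities, which is what makes this kind of calculation convenient for exploring model variations without simulation.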
Additionally, efficient simulation and network analysis algorithms and greater computational power improve our ability to analyze networks currently intractable by theory alone. Nevertheless, network analyses using these new tools employ simplifying assumptions that are as extreme as the assumption of mass action in differential equation models. If the effects of these radically unrealistic assumptions in network models are ignored by network scientists, then network model analyses are likely to leave transmission system science in the state of very slow growth generated by mass action models. One reason to hope that network analyses could lead to rapid growth in transmission system science is their potential to relate to new sources of data, including data on contact patterns, data on environmental agent identification, and data on patterns of nucleotide sequence variation in infected individuals. For that promise to be realized, epidemiologists have to get involved with network scientists in the work of analyzing infection transmission. Otherwise the needed data will not materialize and the relevant questions that can advance both the theory and practice of infection control will not be addressed. To involve epidemiologists more fully, network scientists will do well to address the identification and control of risk factors, an organizing principle in epidemiologic analysis. Accordingly, I urge network scientists to pursue these three directions in the analysis of infection transmission:
• Establish a method for comparing inferences derived from network models with those derived from other model forms. This will improve the quality and efficiency of infection control decisions and motivate better data collection.
• Create a theoretical framework to integrate nucleotide sequence data from clinical and environmental isolates into transmission system models. Those data hold a large and untapped potential to increase the accuracy of network model specification.
• Consider the role of the distributions of risk factors within a network in determining its epidemic potential. Developing this lexicon of risk factors in network science will further align the interests of the epidemiologic and network sciences. Only when network theorists get into this dominant mode of epidemiologic thinking are epidemiologists likely to begin using and developing their work.
After further introducing these three issues, I will review broad issues in network analysis of infection transmission systems. Rather than thoroughly reviewing all recent advances, I seek to provide a vision of tasks for network scientists that I think will contribute most to infection control. An area I do not cover where network analysis is particularly important is the evolution of infectious diseases (13).
2. Inference Robustness Assessment
The robustness of infection control decisions or causal inferences is assessed by relaxing the simplifying assumptions of a model and evaluating whether that leads to a change in the inference. Note that we are talking here about assessing the robustness of inferences, not of models. We are not talking about the robustness of statistical inference. Statistical inferences are relevant only from samples to statistical target populations and do not depend upon causal inferences. The inferences we are concerned with are about the general applicability of a theory or about the causal consequences of control actions. The simplifying assumptions of a model may be relaxed by changing model elements while keeping the form of the model constant. For example, a network model may have a process that generates a network structure. Some assumption in that process could be relaxed by elaborating the process so it is more realistic. Certain assumptions are intrinsic to model forms. The intrinsic assumptions of network models and of compartmental models using mass action contact formulations are both important to relax. One can relax the assumptions of network models by transiting to mass action formulated compartmental models and vice versa.
Another process of inference robustness assessment is to transit from one model form making one set of simplifying assumptions to another model form making a different set of model assumptions (14-16). One
ideal for such transitions is to formulate transitions between models in such a way that both model forms should theoretically generate identical output. This can be done, for example, in the case of the transition between mass action and network models by formulating dynamic network models (16). Formulating models of different types such that both should theoretically behave the same has two advantages (14-16). First, it helps validate model code by ensuring that two quite different model forms generate the same output when they theoretically should do so. Second, it allows for relaxation of assumptions within a single model form rather than across two model forms where all the different assumptions may not be identified. Inference robustness assessment validates a model for a specific use. Validation of any model for all purposes is impossible. Demonstrating that a model generates observed patterns does not validate any model use. Conversely, a model can be validated for a specific use even if many aspects of the patterns it generates differ from observed patterns. A model-based inference is not invalidated just because some aspects of the model differ from those of the real world. It is only invalidated if realistically relaxing assumptions changes inferences. Thus inference robustness assessment entails demonstrating that the ways a model differs from the real world do not affect model-based inferences. Given the multitude of ways that simplifying model assumptions might be realistically relaxed, the task of inference robustness assessment can never come to a definitive end. It is practically impossible to look at every combination of recognized ways to relax simplifying model assumptions in order to assess inference robustness. Even if all recognized ways to relax model assumptions could be examined, insightful scientists would discover new assumptions and ways to relax them. The best we can hope for from inference robustness assessment is a process that leads to consensus.
Inference robustness assessment is not a formula for finding the truth. It is a social process that can bring out threats to finding the truth that individuals pursuing the truth on their own may not perceive.
2.1. Transitions within Mass Action Models from Continuous Population Deterministic to Discrete Individual Stochastic Models
We deal first with this deterministic to stochastic transition within mass action models even though it does not involve network models because the issues addressed in this transition are part of transitions from ordinary differential equation models of infection transmission to network models. But these issues are better clarified without adding the network issues. Mass action models can be formulated either as deterministic continuous population models using differential equations or stochastic discrete individual models. The continuous population deterministic models assume that the size of each compartment in the model is effectively infinite. The stochastic compartmental models relax that assumption but in so doing they must add a stochastic process to the model and assumptions about that stochastic process. Usually the stochastic model is the more realistic one. It makes less radical and more realistic assumptions than the deterministic model. It may have more parameters because it adds realism to the model, but that does not mean it makes more assumptions – just that it makes more realistic assumptions. In an illustrative case, an inference about the relative utility of contact tracing with quarantine programs and mass immunization was made using deterministic mass action models (17,18). These models had nice characteristics such as the inclusion of the human resources needed to carry out these programs in the model. They were analyzed in a manner that demonstrated the robustness of the inference regarding the relative utility of these two programs to a variety of assumptions about the nature of the transmission system. But this inference, in turn, was shown not to be robust to the deterministic model assumptions of continuous populations (19-21).
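To make the contrast between the two formulations concrete, the following minimal sketch compares a deterministic and a stochastic version of the same mass action model. It is not any of the models in the studies cited above: it assumes a simple SIS model (infection without lasting immunity) with hypothetical rates beta and gamma, and it uses a standard event-by-event (Gillespie-type) simulation for the discrete individual version.

```python
import random

def deterministic_endemic_fraction(beta, gamma):
    """Endemic infected fraction of the ODE dI/dt = beta*I*(1 - I/N) - gamma*I."""
    return max(0.0, 1.0 - gamma / beta)

def gillespie_sis(n, beta, gamma, i0, t_end, rng):
    """Discrete individual stochastic SIS model with mass action mixing.

    Returns (time-averaged infected fraction over [0, t_end], final number
    infected). A final count of 0 means the infection died out.
    """
    t, i, infected_time = 0.0, i0, 0.0
    while i > 0:
        infection_rate = beta * i * (n - i) / n
        recovery_rate = gamma * i
        total_rate = infection_rate + recovery_rate
        dt = rng.expovariate(total_rate)      # waiting time to the next event
        if t + dt >= t_end:
            infected_time += i * (t_end - t)  # no further event before t_end
            break
        infected_time += i * dt
        t += dt
        if rng.random() * total_rate < infection_rate:
            i += 1                            # one new infection
        else:
            i -= 1                            # one recovery
    return infected_time / (n * t_end), i

if __name__ == "__main__":
    rng = random.Random(2007)
    print("ODE endemic fraction:", deterministic_endemic_fraction(1.2, 1.0))
    runs = [gillespie_sis(10, 1.2, 1.0, 1, 200.0, rng) for _ in range(200)]
    print("runs that died out (N=10):", sum(1 for _, i in runs if i == 0))
    print("large population:", gillespie_sis(1000, 2.0, 1.0, 500, 20.0, rng))
```

In this sketch the differential equation predicts a positive endemic level whenever beta exceeds gamma, regardless of population size. The discrete version with only ten individuals almost always loses the infection, while with a thousand individuals it fluctuates near the deterministic level over the simulated period.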
Various reasons for the lack of robustness were postulated (20), the main one being local die-out of infection, which occurs in discrete individual stochastic models but not in continuous population deterministic models (22). Stochastic effects like those just discussed for smallpox are not always important. An illustration of how the natural history of infection and immunity may determine whether stochastic phenomena must be
captured is found in Rohani et al. (23). They find that the dynamics of whooping cough are strongly determined by stochastic phenomena but those of measles are not. The transition from deterministic mass action models formulated as ordinary differential equations (ODEs) to stochastic mass action models creates a dramatic change in system equilibrium states for infection transmission models (24). Whereas the deterministic ODE models may have mathematically derivable equilibrium relationships, stochastic models always have a chance of going to the state of no one being infected and thus reach an equilibrium where no one is infected. There is, however, a pseudo-equilibrium for stochastic infection transmission models that, given large population sizes, approximates the equilibrium of the deterministic models. It is the deviations from this equality that are informative. Given limited population sizes, when endemic infection levels bounce up by chance, the number of recoveries driving the infection levels down always increases because these are proportional to the number of infected individuals. In contrast, when infection levels bounce down due to chance, the force driving them up (infected cases that are the source of transmissions) always decreases (25,26). This drives equilibrium levels down. Likewise, as population structure increases by the specification of further mixing groups (22), the chance that infection will die out of a population subgroup increases and this further lowers the equilibrium level of infection in the stochastic mass action model. Network models with individuals as nodes can be viewed in this regard as structured mass action models where every linkage is a two-person mixing group. Thus when transiting from deterministic mass action ODE models to network models, the stochastic phenomena just discussed will be especially important. But there are even more important aspects of this transition which we consider next.
2.2. Stochastic Compartmental Mass Action — Network Transitions
Consider a population with individuals always in motion who establish transient linkages that can transmit infection. Linkages may consist of joint presence in a building, the use of a common bathroom,
conversations, hugs, preparing and sharing food, etc. Some of these transient relationships recur regularly over many years such as those in families. Others recur only over months, such as those in academic institutions. Some recur only a couple of times, such as those involved in business deals. And some never recur. The temporal pattern of recurrence may be quite different for different settings. Most social encounters are temporally structured so that they occur at regular intervals or at least after minimal separation times. Now contrast the mass action and network abstractions of such a population. One network model abstraction has individuals as nodes with directed or undirected arcs that are fixed by some definition of what transient relationships are sufficiently strong and long enough or will come into play at crucial moments so as to constitute an arc. It assumes individuals do not move. The contact process generating transmission events across an arc is almost always assumed to be Poisson with exponentially distributed event intervals. These assumptions about fixed relationships, no movement, and Poisson processes for events may be true for models of some physical systems, but are not remotely true for any infection transmission system. The mass action model, in contrast, assumes all interactions are instantaneous with instantaneously thorough mixing that results from extremely rapid movement after each contact, and that numbers of contacts accumulate in a Poisson process (27). These assumptions may also be true for models of some physical systems, but again, are not remotely true for any infection transmission system. Both model forms making such radically unrealistic assumptions may be useful. But they might lead to different inferences. The question then is which assumptions are affecting the robustness of inferences. Are the network model inferences or the mass action model inferences more robust?
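A small sketch may help fix the difference between the two abstractions just described. It follows the contacts of a single focal individual under the shared assumption of Poisson timing, and varies only who can be met: a short fixed list of neighbours (the network abstraction) or anyone in a large population (the mass action abstraction). The population size, contact rate, and function names are hypothetical values chosen for illustration.

```python
import random

def poisson_event_times(rate, t_end, rng):
    """Event times of a Poisson process: exponentially distributed intervals."""
    times = []
    t = rng.expovariate(rate)
    while t < t_end:
        times.append(t)
        t += rng.expovariate(rate)
    return times

def partners_contacted(candidates, rate, t_end, rng):
    """Partner met at each contact event of one focal individual."""
    return [rng.choice(candidates) for _ in poisson_event_times(rate, t_end, rng)]

def repeat_fraction(partners):
    """Fraction of contact events that involve a partner already met."""
    if not partners:
        return 0.0
    return 1.0 - len(set(partners)) / len(partners)

if __name__ == "__main__":
    rng = random.Random(0)
    fixed_neighbours = [1, 2, 3, 4]             # network abstraction: fixed arcs
    everyone_else = list(range(1, 10000))       # mass action: thorough mixing
    network_contacts = partners_contacted(fixed_neighbours, 1.0, 100.0, rng)
    mixing_contacts = partners_contacted(everyone_else, 1.0, 100.0, rng)
    print("network repeat fraction:", repeat_fraction(network_contacts))
    print("mass action repeat fraction:", repeat_fraction(mixing_contacts))
```

With the same overall contact rate, nearly every contact in the fixed network repeats an earlier partner, while under thorough mixing almost every contact is with someone new. Neither extreme resembles the recurrence patterns of real social encounters described above.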
Network and mass action models make almost opposite unrealistic assumptions about both movement of individuals and duration of contact. Both make assumptions about the timing of contacts that are unrealistic and have been shown recently to have important influences on transmission system dynamics and inferences about the effectiveness of control measures (28). Assumptions about movement and duration of contact are both more easily relaxed in compartmental models. This can
be done by defining contact groups and movement of individuals between contact groups. But assumptions about micro-network structure are more easily relaxed in network models. To see what we mean by this, consider a transition between model forms. A network model equivalent to the mass action model could be formulated with linkages between every pair of individuals and a rate of transmission corresponding to the effective contact rate between any pair of individuals in the mass action model. The network model can then relax the mass action assumptions of instantaneous contact and instantaneously thorough mixing by eliminating a fraction of arcs and correspondingly increasing transmission probabilities to keep constant the overall rate of contact events. As the nodal degree is decreased and the transmission probabilities across edges are increased, the opportunities for infection transmission are decreased. Fewer contacts of infected individuals are with susceptible individuals because of the increased chance that a potential contact event is with the source case for transmission. The transitions from mass action to the equivalent network model and then the relaxation of mass action assumptions in the network model form could and should be used to assess the robustness of any inferences in the mass action model to realistic violation of the instantaneous contact assumption. Alternatively the transition from network to mass action models could be formulated by having a rate of formation and break up for pairs making the network model a dynamic network model. If the formation of pairs results from a mass action process, then as the breakup rates are increased, the network model approximates a mass action model (16,29). Even in the completely identical random network and mass action formulations, the network model has patterns of arcs in small sets of individuals that the mass action model does not have. We refer to such patterns involving triads, tetrads, etc. 
as micro-network structure. This micro-network structure consists of patterns of relationships between two contacts of an individual, or between second or higher generations of contacts of an individual. For compartmental models to handle such micro-network structures, they would have to define mixing groups for every small group conformation (14,15,30). That would so quickly explode model structure that it would make the model useless. The
J. S. Koopman
micro-network structures in network models can be elaborated in ways that are impossible to elaborate within the context of the mass action model. Micro-network structures may derive from basic social processes and thus be scientifically generalizable across populations. Network models can relax the random mixing assumptions of the mass action model by assuming different micro-network patterns or by formulating forces that lead to different patterns. In one such exercise, micro-network structure was changed by adding clustering of contacts (31). This led to notable effects on the probability of extinction in stochastic models and the fraction of the population left susceptible at the end of an epidemic. In another such exercise, the equilibrium level of infection for infections without permanent immunity was found to change dramatically as formation and breakup rates were progressively changed (16). Such a transition was further explored in limited pair formation models, with similar inferences being made about the effects of changing pair formation and breakup rates in relationship to infection recovery rates (32). The behavior of key epidemiological constants such as the basic reproduction number, endemic infection levels, and critical vaccination levels, and their formulation in terms of model parameters, differ significantly between mass action and network models even when both micro- and macro-network structure are random. This has been shown using probability generating function formulations of networks (8), by analyzing pair models (33), and by analyzing dynamic network models (16).

Any aspect of contact structure that can be readily defined within the context of compartmental models by specifying risk groups and mixing groups we define as macro-network structure. Macro-network patterns can be defined in either the compartmental or network model context by formulating rates of contact between different classes of individuals.
Each real world situation modeled is likely to have a unique macro-network structure, with many hypotheses about the nature and effects of that structure that deserve exploration. That means macro-network structure is not as generalizable as micro-network structure.
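The transition from the mass action model to an equivalent network model, described earlier in this section, can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions rather than a formulation taken from the cited studies: every individual has the same degree k, the network is locally tree-like, the infectious period is exponential with recovery rate gamma, and the total effective contact rate beta per individual is held constant by assigning rate beta/k to each edge. The function names are hypothetical.

```python
def edge_transmissibility(beta, gamma, k):
    """Probability that infection crosses one edge before the infected end recovers.

    Assumption: the per-individual contact rate beta is spread evenly over k edges.
    """
    rate_per_edge = beta / k
    return rate_per_edge / (rate_per_edge + gamma)


def secondary_cases(beta, gamma, k):
    """Expected secondary cases of a non-index case on a tree-like k-regular network.

    One of the k edges leads back to the source case, so only k - 1 remain.
    """
    return (k - 1) * edge_transmissibility(beta, gamma, k)


if __name__ == "__main__":
    beta, gamma = 2.0, 1.0  # illustrative values only
    for k in (2, 5, 10, 100, 1000):
        print(k, round(secondary_cases(beta, gamma, k), 3))
    print("mass action limit:", beta / gamma)
```

Under these assumptions the expected number of secondary cases rises toward the mass action value beta/gamma as k grows, and falls as arcs are eliminated, because more contact events are repeated across the same edge or directed back at the source case.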
Infection Transmission through Networks
Although both mass action and network models can relax macro-network assumptions, they do not do so in an identical fashion. Network models assign limited numbers of ongoing linkages between individuals at different sites in order to create the macro-network structure, while mass action models create homogeneous contact processes at a site. That is to say, network models incorporate micro-network structures into their definition of macro-network structures while mass action models do not. Analyses of highly structured continuous populations that use mass action assumptions increasingly recognize the need to assess the robustness of their inferences by considering network models (34). Network model analyses need to do the reverse.

Fixed network models assume no one is changing their spatial or social relationships to other individuals. There is no motion on these dimensions. Mass action models assume everyone is in so much motion that mixing is instantaneously thorough. Currently, mass action models can relax their macro-network motion assumptions more flexibly than network models can. Movement assumptions can be relaxed in the compartmental model form by defining compartments of individuals in different contexts and moving individuals between contexts (35-37). Movement is not explicit in network models. Assumptions about movement effects on symmetries of contact, however, can be addressed in network models (5). Moreover, work on integrating movement into network model analysis is proceeding (38). A common compartmental model of structured populations called structured mixing (30) makes assumptions that obscure movement effects between different sites. That formulation takes a statistical mechanics approach to movement so that it assumes all individuals are simultaneously at all sites at all times with specified probabilities. Thus one does not explicitly formulate movement.
Metapopulation models (35-37,39) and mechanistic movement models (40) formulate movement more explicitly. Mechanistic models are more detailed and keep track of everyone at any site by their population of origin while metapopulation models keep less track of the history of where individuals have been (40). Besides greater flexibility for relaxing motion assumptions, there are other reasons to transit from network models to mass action models. One
is that analysis of the effects of relaxing macro-network assumptions may be computationally easier. Another is that when models are used in data analyses, the number of parameters to be estimated from the data must be reduced, and that might be easier using mass action model forms. Later, in Section 2.5 on mathematical analysis of network models, we suggest that advances in network analysis might make network models more flexible for both these situations.

We have talked about robustness assessment so far as a process of elaborating models to relax simplifying assumptions, and noted that network and mass action models have different advantages and disadvantages for this purpose. Inference robustness assessment, however, should also encompass the process of making inferences about model form or model parameters from data analysis. Different assumptions can be made in analytical models. But to undertake any data analysis, it may be necessary to reduce model parameters and perhaps even model structure. Thus a major reason for model transition in inference robustness assessment is to assist in the estimation of model parameters from data by imposing unrealistic simplifying assumptions. Network and mass action models do this in different ways and thus may offer robustness assessment for statistical inferences as well as for scientific inferences. Since statistical inferences are often used in making causal or control action inferences, inference robustness assessments should most often incorporate assessing the robustness of statistical inferences.

2.3. Dynamic Network Models

Because both network and mass action models are far from reality, a robustness assessment might employ models where linkages between individuals form and break up. The simplest such models have only isolated pairs of individuals with no individual in two pairs at once.
The relationship between pair models with these characteristics and mass action compartmental models has been explored in several studies (42). More general dynamic network models where pairing is a mass action process have enabled further exploration of the relationships between network and mass action models (16,29).
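The simplest pair model, with no individual in two pairs at once, can be sketched numerically. The following is a minimal illustration under assumptions of our own, not the formulation of the cited studies: each single individual enters a pair at a constant rate rho, each pair dissolves at rate sigma, and p denotes the fraction of individuals currently in a pair, so that dp/dt = rho(1 - p) - sigma p. The symbols rho and sigma and the function names are hypothetical.

```python
def paired_fraction(rho, sigma, p0=0.0, dt=0.001, t_end=50.0):
    """Euler integration of dp/dt = rho * (1 - p) - sigma * p."""
    p = p0
    for _ in range(int(round(t_end / dt))):
        p += dt * (rho * (1.0 - p) - sigma * p)
    return p


def equilibrium_paired_fraction(rho, sigma):
    """Closed-form equilibrium of the same equation."""
    return rho / (rho + sigma)


if __name__ == "__main__":
    # Illustrative rates only: raising both rates together keeps the paired
    # fraction fixed while shortening the mean pair duration 1 / sigma.
    for scale in (1, 10, 100):
        rho, sigma = 0.5 * scale, 1.5 * scale
        print(scale, equilibrium_paired_fraction(rho, sigma), 1.0 / sigma)
```

In this sketch, scaling formation and breakup rates up together leaves the paired fraction unchanged while pairs become fleeting, which is the direction in which a dynamic network model is expected to approach mass action behavior.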
The equilibrium relationships of dynamic pairing models that do not allow for concurrency can be derived by examining expected equilibrium status at pair formation and pair breakup (16). But equilibrium and other analyses for more general dynamic network models with continually varying nodal degree are a challenge that network scientists should consider taking on, because meeting it would provide a key tool for inference robustness assessments.

2.4. ODE Network Models of Correlation in Infection Status

The transition from mass action deterministic models to network models may take another direction where pairings are added as separate compartments within an ODE model. An early influential model of this type was formulated by Dietz and Hadeler (43). Chick et al. explored the relationship between these deterministic models and stochastic dynamic network models (16). A generalization of the Dietz and Hadeler model stays in the deterministic ODE model framework but can examine additional model structure while still switching from instantaneous contact with instantaneous thorough mixing to fixed linkages (33,35,44-47). These are deterministic ODE models of infection spread in networks with either random structure or specifications on the level of increased linkage within small groups like triads (48). They have continuous populations rather than individuals, but the continuous population is formulated in terms of pairs, triplets, or higher order structures. In these models, all relationships between individuals are enduring in nature. Assumptions about how infection status is distributed in higher order structures beyond pairs (triplets, tetrads, etc.), however, usually imply that these higher order structures are continually changing. Thus these network models differ significantly from fixed network models.

Correlation models as presented by Rand (33) formulate the infection and immunity status of each egocentric linkage made by each individual. A random graph can be assumed.
Triad closure probabilities can be specified (47). And even higher order structures may be formulated (48). Consider first the model formulation where the infection status of each egocentric linkage of each individual is formulated. These are called pair
models. The basic approach is to construct models at the levels of individuals and dyads by first formulating models at the level of individuals, dyads, and triads and then formulating the triads in terms of individual infection status and dyad status. Many different assumptions can be made in the process of modeling triads in terms of individuals and dyads. Even with the random graph formulation, the differences in the mathematical relationships defining the basic reproduction number R0 in the mass action and correlation models are considerable (33). R0 represents the endemic or epidemic threshold parameter and, under certain restrictive assumptions of random contact, represents the number of secondary cases that an infected individual will generate over the entire course of their infection (49). There are also important differences in mathematical relationships for the endemic level of infection and for critical vaccination levels needed to control transmission. There are also differences in the dynamics of infection, as correlation models broaden out the shape of an epidemic (31).

Pair correlation models might be a good first step in robustness assessments for inferences made using mass action assumptions. They relax those assumptions in the direction of network model assumptions but can preserve both random mixing assumptions and a degree of tractability for numerical analysis. After a first transition from mass action models to pair correlation models on a random graph, a second step would be to stay within the random graph context but pursue different models of correlation in higher order structures. To accomplish this, one can move to modeling quadruplets in terms of individuals, pairs and triplets, modeling quintuplets in terms of all the lower level structures, etc. One approach to such modeling has been “moment closure” methods (33,44,47). But other approaches sometimes work better (46).
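The pair-level construction can be sketched for an infection with permanent immunity. The code below is a minimal illustration and not the specific model of any cited study. It assumes a random graph in which every individual has exactly k neighbours, a per-link transmission rate tau, a recovery rate gamma, and the commonly used closure that approximates a triple A-B-C by ((k - 1)/k)[AB][BC]/[B]. All function and variable names are hypothetical, and the mass action comparison uses beta = tau * k so that the overall rate of contact events matches.

```python
def pair_sir(n, k, tau, gamma, i0, dt=0.01, t_end=60.0):
    """Euler integration of a pair approximation SIR model; returns (final S, peak I)."""
    s, i = float(n - i0), float(i0)
    ss = k * s * s / n          # ordered S-S pairs, no initial correlation assumed
    si = k * s * i / n          # S-I pairs
    q = (k - 1) / k             # closure factor for triples
    peak = i
    for _ in range(int(round(t_end / dt))):
        ssi = q * ss * si / s   # closure for S-S-I triples
        isi = q * si * si / s   # closure for I-S-I triples
        ds = -tau * si
        di = tau * si - gamma * i
        dss = -2.0 * tau * ssi
        dsi = tau * (ssi - isi - si) - gamma * si
        s, i, ss, si = s + dt * ds, i + dt * di, ss + dt * dss, si + dt * dsi
        peak = max(peak, i)
    return s, peak


def mass_action_sir(n, beta, gamma, i0, dt=0.01, t_end=60.0):
    """Euler integration of the standard mass action SIR model; returns (final S, peak I)."""
    s, i = float(n - i0), float(i0)
    peak = i
    for _ in range(int(round(t_end / dt))):
        new_infections = beta * s * i / n
        s, i = s - dt * new_infections, i + dt * (new_infections - gamma * i)
        peak = max(peak, i)
    return s, peak


if __name__ == "__main__":
    n, k, tau, gamma, i0 = 10000, 10, 0.3, 1.0, 10   # illustrative values only
    print("pair model (final S, peak I):", pair_sir(n, k, tau, gamma, i0))
    print("mass action (final S, peak I):", mass_action_sir(n, tau * k, gamma, i0))
```

Comparing the two outputs for the same overall contact rate gives a first indication of how sensitive an inference is to the instantaneous contact assumption. In this parameter setting the pair model leaves more of the population susceptible at the end of the epidemic than the mass action model does.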
As a second step in inference robustness assessment, one can proceed to add micro-network structure by specifying the probabilities of closed triads and of higher level structures (47). As a third step in relaxing model assumptions for an inference robustness assessment, different types of linkages from the same individual can be specified and the frequency of one type of linkage can be modeled as conditional on
another (48). Using this approach for sexually transmitted infections (STIs), it has been clearly demonstrated that constraining formulations to the pair level is inadequate to capture important aspects of infection transmission dynamics (45). Since sexual linkages can be characterized fairly well into short and longer duration classes, and since there is a fair amount of data on these classes of linkages, this approach to adding complexity to micro-network structure is logical for STIs.

As one moves from random graph correlation models to higher level structure models and as one moves from modeling correlations at the dyad level to correlations in these higher level structures, model complexity explodes and the logical structure becomes challenging, at least for a medical doctor epidemiologist. As one adds further population structure that puts individuals into different risk groups and further distinctions by type of interaction between individuals, again complexity explodes. Adding macro-network structure creates even further challenges but it is possible to formulate pair correlation models with metapopulation structure (48). But that state of affairs also holds for the “pgf” approach to network model formulation and analysis (8) which we deal with next.

2.5. Mathematical Analysis of Network Models with other Structures

One aspect of network models that makes them particularly useful within the context of inference robustness assessment is the availability of percolation theory for their analysis and the success of applying percolation theory in models formulated using probability generating functions (pgf) of degree distributions and group structure or mixing site structure (8). As mentioned earlier, discrete individual mass action models with realistic population structure are mostly analyzed using simulations.
The process of relaxing single assumptions and analyzing the effects of such relaxations using simulations alone is exceedingly laborious, to the point that it will be undertaken only superficially. The potential for percolation analyses to provide solutions for complex stochastic models could significantly facilitate many inference robustness assessments. These solutions may be closed form solutions or solutions that only require simple numeric approximation to the solution
of transcendental equations (4,8,9). Such solutions will be most readily available for simple model structures. As mentioned earlier, relaxing assumptions within simple models can provide some practical level of assurance about the robustness of inferences to these same assumptions in more complex models. Of course, relaxing assumptions two at a time may reveal lack of robustness in inferences that is not revealed by relaxing assumptions one at a time. But the conceptual clarity of how relaxing assumptions can demonstrate the lack of inference robustness that comes from simple models may be preferable within the social process of science.

Many different lattice structures, structuring of local and distant contacts, degree distributions, and linkage patterns between degree distributions can be formulated. These include small world networks (50-55), scale-free networks (56-58), different formulations of clustering (11,59,60), and other diverse specific structures (2,10,60-69). This proliferation of analyzable model structures is most encouraging for the development of robustness assessment methodologies. This is the case even when specific formulations are not very realistic. If two different formulations cover the extremes of reality and an inference holds across both of them, robustness to the characteristics on which they and presumably more realistic models differ has been demonstrated. A nice illustration of this principle was used in assessing inferences about SARS. Ancel Meyers et al. used Poisson and power law models to cover the extremes of more realistic models (5). Additionally, it is easy to see how small world network formulations relate to realistic contact patterns and how assumptions in lattice models can be relaxed by adding distant linkages. Likewise, it is easy to see how adding local structure to random graph models can realistically relax random graph assumptions to the point where small world formulations are achieved.
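The kind of solution referred to above, requiring only simple numeric approximation, can be sketched for a random network specified by its degree distribution. The code is a minimal illustration under standard bond percolation assumptions: a configuration-model network with no clustering, and a single transmissibility T applying independently to every edge. With G0 the pgf of the degree distribution and G1 the pgf of the excess degree, the epidemic threshold is T_c = <k> / (<k^2> - <k>), and the expected epidemic size above threshold is 1 - G0(1 - T + Tu), where u solves u = G1(1 - T + Tu). The function names and the parameter values are hypothetical.

```python
import math


def poisson_degrees(mean, k_max=80):
    """Poisson degree distribution truncated at k_max (list indexed by degree)."""
    return [math.exp(-mean) * mean ** k / math.factorial(k) for k in range(k_max + 1)]


def power_law_degrees(alpha, k_max=80):
    """Truncated power law p_k proportional to k**(-alpha) for 1 <= k <= k_max."""
    weights = [0.0] + [k ** (-alpha) for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def critical_transmissibility(p):
    """Epidemic threshold T_c = <k> / (<k^2> - <k>)."""
    k1 = sum(k * pk for k, pk in enumerate(p))
    k2 = sum(k * k * pk for k, pk in enumerate(p))
    return k1 / (k2 - k1)


def epidemic_size(p, t, iterations=2000):
    """Expected fraction infected in a large epidemic, by fixed-point iteration."""
    mean_k = sum(k * pk for k, pk in enumerate(p))

    def g0(x):
        return sum(pk * x ** k for k, pk in enumerate(p))

    def g1(x):
        return sum(k * pk * x ** (k - 1) for k, pk in enumerate(p) if k > 0) / mean_k

    u = 0.5
    for _ in range(iterations):
        u = g1(1.0 - t + t * u)
    return 1.0 - g0(1.0 - t + t * u)


if __name__ == "__main__":
    for name, dist in (("Poisson", poisson_degrees(3.0)), ("power law", power_law_degrees(2.5))):
        print(name, critical_transmissibility(dist), epidemic_size(dist, 0.5))
```

Running the same inference on both degree distributions is one way to bracket more realistic models between two extremes, in the spirit of the SARS analysis mentioned above.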
A particularly promising model formulation using pgfs in solvable models formulates the network as a bipartite structure between humans and sites where transmission takes place (4). Such models facilitate both adding realistic structure and the process of finding parameter values that fit observations. In a model of Mycoplasma pneumoniae transmission in a psychiatric hospital, Ancel Meyers, Newman, et al. made an inference that even though caregivers were at less risk than patients,
restricting the number of wards on which they worked and protecting them from infection within the wards where they worked was more important to infection control than reducing patient to patient transmission. The robustness of this inference, however, to realistic relaxation of the simplifying assumptions in the model is in further need of assessment. The particular model form used makes the assumption that the number of infected patients on a ward does not affect the risk of a caregiver getting infected when they work on a ward. Realistic relaxation of this assumption seems likely to increase the utility of controlling patient to patient transmission while decreasing the utility of protecting caregivers from infection. While simulations relaxed other model assumptions such as the form of probability distributions, they did not relax this most important assumption. But then, as we have said, all robustness assessments are in a state of incompletion and the social process of submitting analyses to scrutiny by other scientists will always reveal further possible robustness assessments. The fact that a clear formulation of network structure allowed individuals other than the authors to assess assumptions whose realistic violation might be a threat to inference robustness makes this approach promising even though the particular inferences in this paper may not hold up.

2.6. Transiting from Undirected to Directed Graphs

Most compartmental models of infection transmission assume that transmission is symmetric. This corresponds to arcs representing the potential for transmission between individuals being undirected. There are, however, some strong differences in transmission potential in different directions between individuals with different characteristics. For example, sexual transmission often has higher transmission probabilities from males to females than from females to males because males deposit infectious agents into sites where females retain them.
Another example is that transmission of nosocomial agents may be higher from doctor to patient than the other way around due to susceptibility differences. Not only is average transmissibility different, but the distribution of transmissibility may be different. Contagiousness, for example, might be
more highly clustered than susceptibility. We will say more on that in Section 5.2 on modes of transmission later. In compartmental models, these asymmetries are usually handled by susceptibility and contagiousness parameters being assigned to individuals with different characteristics. It would seem that the same tactic would work for network models. But some asymmetries are total. If a transmission medium is contaminated by one person and then picked up by another, timing and movement issues are likely to make at least some transmissions unidirectional. To my knowledge, issues regarding mixed directionality of transmission have not been addressed in the compartmental model context. It would seem possible to do so by specifying a site compartment making bipartite graphs like those just discussed in the network models of Meyers et al. (4). But directionality is not as natural an issue in the compartmental model context as it is in the network model context. In individual level models, where transmission occurs via contamination of media and movement of individuals is explicitly modeled across time, directionality of transmission is an intrinsic part of the model (70).

Pourbohloul et al. have shown that percolation analysis of directed graphs reveals significantly different behavior from similar analysis of undirected graphs (3). They showed that in network models with mixed undirected and directed arcs the probability of large epidemics and the sizes of those epidemics are not identical, as they are for undirected graphs. A particular source of directionality addressed in these network models arises when patients go to make contact with health care workers when they become ill but the reverse does not happen. But directionality arising from different frequencies or timing of media contamination events and uptake of contamination from media seems to be a much more ubiquitous source of asymmetry.
Network models may provide the most feasible and flexible way to assess the robustness of inferences to assumptions about symmetry of contact. If the fraction of arcs that are directed can be varied by group, then the consequences of directed media contamination can be assessed and inference robustness limits demonstrated without having to model the media explicitly.
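The distinction between the probability and the size of a large epidemic on directed graphs can be illustrated with a deliberately simplified sketch. It is not the semi-directed formulation of Pourbohloul et al. We assume that every arc is directed, that a node's in-degree and out-degree are independent, that the network is a large random graph with the given degree distributions, and that each arc transmits independently with probability T. Under those assumptions the epidemic probability from a random seed depends only on the out-degree distribution, and the epidemic size depends only on the in-degree distribution. The function names and degree distributions are hypothetical.

```python
import math


def poisson_degrees(mean, k_max=80):
    """Poisson distribution truncated at k_max (list indexed by degree)."""
    return [math.exp(-mean) * mean ** k / math.factorial(k) for k in range(k_max + 1)]


def pgf(p, x):
    """Probability generating function of a degree distribution."""
    return sum(pk * x ** k for k, pk in enumerate(p))


def giant_fraction(p, t, iterations=1000):
    """1 - u, where u is the smallest solution of u = G(1 - t + t * u).

    Iterating upward from u = 0 converges to the smallest fixed point.
    """
    u = 0.0
    for _ in range(iterations):
        u = pgf(p, 1.0 - t + t * u)
    return 1.0 - u


def epidemic_probability(out_degrees, t):
    """Chance that a random seed starts a large epidemic (follows arcs forward)."""
    return giant_fraction(out_degrees, t)


def epidemic_size(in_degrees, t):
    """Fraction of nodes reached by a large epidemic (follows arcs backward)."""
    return giant_fraction(in_degrees, t)


if __name__ == "__main__":
    out_degrees = [0.0, 0.0, 1.0]       # everyone contaminates exactly two others
    in_degrees = poisson_degrees(2.0)   # exposure is spread at random; same mean of 2
    print(epidemic_probability(out_degrees, 1.0), epidemic_size(in_degrees, 1.0))
```

In this illustration every seed starts a large epidemic, yet roughly a fifth of the population is never reached, so the two quantities separate as soon as in-degree and out-degree distributions differ.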
2.7. Models Involving Contact Processes That Generate Networks

Another way to pursue needed realism when inferences are not robust to the extreme assumptions of network or mass action models is to formulate models of contact process mechanisms. These may involve either modeling the media through which transmission takes place or specifying locations where contact might occur. Such models play an important role in bioterrorist and emerging infection control models that we will discuss in this section.

Three traditions pursuing more realistic complexity in contact processes within infection transmission system models are “agent based” models, transportation-based contact generation models that handle movement in a metapopulation manner, and structured mixing models taking the statistical mechanics approach to movement. These three approaches are represented by the first three infection transmission modeling groups supported by NIH under the MIDAS (Modeling Infectious Disease Agent Systems) program (71). This is the major effort supported by the US government to understand how to control the population transmission dynamics of emerging infections and bioterrorist dissemination of infectious agents. The initial focus of these research groups has been on smallpox and pandemic influenza. A group linking Johns Hopkins University and the Brookings Institution has taken what they call an “agent-based” approach (72) which moves individuals around on realistically structured grids and defines contact by grid proximity. A group at Emory uses the structured mixing with the statistical mechanics approach discussed earlier (73,74). These are the same sort of models used to analyze smallpox that we mentioned earlier (21).
A group now at Virginia Tech and formerly at Los Alamos National Labs builds simulations on models of the transportation network and generates contact by having infected individuals contaminate environments and susceptible individuals pick up contamination from environments (70,75). They call their model structure “EPISIMS”. All three of these models could be used by network analysts to relax simplifying assumptions in their network models and assess the robustness of inferences to realistic violation of their network
assumptions. Conversely, all three of these model forms are very complex and do not readily allow for the generation of generalizable knowledge. Network analyses can relate to these model forms and improve their capacity to generate generalizable knowledge in two different ways. First, network models can be used within inference robustness assessment strategies that whittle down their complexity while maintaining their accuracy by preserving key assumptions. Second, the output of these models includes highly dynamic and complex networks that should be described efficiently to better understand what is determining the behavior of these models.

All three approaches could potentially use media contamination mechanisms to effect contact and transmission but only the models of the Virginia Tech/LANL group do so. This group is pursuing methods to directly formulate static networks for analysis. They first model stochastic processes that incorporate various forces leading to contact. They do so by modeling the media of transmission or by modeling contact processes at sites where contact is made. Modeling the media is especially applicable to agents that survive for some time in the environment. That includes a great many agents of interest for enteric transmission like noroviruses and rotaviruses as well as agents like smallpox. Even influenza is probably mostly spread through agents that have survived for at least hours in their transit between hosts.

Each of the three MIDAS groups views and analyzes the networks that are generated by its mechanistic simulations. The Virginia Tech and Emory groups view the resulting networks as bipartite graphs where humans link sites. The networks generated are subject to analyses regarding their structure. The purpose of analyzing networks generated by models is of course not to make inferences about the mechanisms that generated the network. These are completely known.
Rather the purpose is to describe the networks along many dimensions and relate the behavior of the system to these descriptions. For example observed distributions of reproduction numbers for individuals in different parts of the network or with different risk factors can be described and related to epidemic size or epidemic risk. Analysis of networks generated by specific mechanistic processes and specific rules for defining network links may be one of the most
important uses for network analyses. It could generate, and has generated, insights not possible without the network analysis (70). Complex contact process models make it difficult to determine what is really happening within a model. Defining networks as the result of model processes and analyzing the resulting network patterns can help one perceive and analyze the patterns that such models generate.

Analyzing the rules used to define linkages between individuals from the output of complex mechanistic models and seeing how these relate to processes involved in the population dissemination of infection could be especially helpful in designing questionnaires that capture the most useful definitions of linkages. The questionnaires may ask respondents whether specific conditions meeting such definitions exist. These may involve time spent at places and behaviors engaged in. Of course, having information on the details of contact processes might be better. But, as discussed in Section 5.4, it is exceedingly difficult to get detailed data to describe networks. Thus simplification of questionnaires in ways that still capture the essential elements needed for robust inferences from network model analysis is important.

Output from complex dynamic network simulations has been used to generate fixed networks for sexually transmitted infections (76). In this work, fixed network linkages that are directional are defined by the timing of dynamic linkages that are non-directional. This approach could be used for questionnaire simplification.

When transmission is not direct but via surfaces, water, food, or any means that can take an infectious agent across time and space from one individual to another, the need to model the media itself may arise. For example, when water carries infectious agents between individuals, it has been judged important to incorporate the water into the transmission models (77-81).
Network graphs then become bipartite, with human-media arcs but no human-human or media-media arcs. Models that incorporate mechanistic details about how potentially infection transmitting linkages between individuals get generated and broken, or models that specify the paths that infectious agents take through different media to connect individuals, may also be used directly in a robustness assessment strategy. They of course relax the simplifying assumptions of models without these mechanistic details.
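The bipartite representation can be sketched with a small data structure. The code below is an illustration only. It assumes that a record of which sites or media each person touched is available, and it collapses the bipartite graph into an undirected person-to-person contact set, which discards the timing and directionality discussed in Section 2.6. The function name and the example data are hypothetical.

```python
from itertools import combinations


def project_to_contacts(visits):
    """Collapse a person-to-site bipartite graph into person-to-person links.

    visits maps each person to the set of sites (or media) they are linked to.
    Two people are linked if they share at least one site.
    """
    people_at_site = {}
    for person, sites in visits.items():
        for site in sites:
            people_at_site.setdefault(site, set()).add(person)

    contacts = set()
    for people in people_at_site.values():
        for a, b in combinations(sorted(people), 2):
            contacts.add((a, b))
    return contacts


if __name__ == "__main__":
    # Hypothetical example: one caregiver working on two wards.
    visits = {
        "patient_1": {"ward_A"},
        "caregiver": {"ward_A", "ward_B"},
        "patient_2": {"ward_B"},
    }
    print(sorted(project_to_contacts(visits)))
```

Keeping the bipartite form rather than the projection preserves the information needed to relax assumptions about what happens at each site.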
2.8. Statistical Analysis of Network Structure

Two types of statistical analysis are relevant to inference robustness assessment. The first involves statistical analysis of network conformation parameters from data on who is connected to whom. We consider two ways such analyses can be used in inference robustness assessments. The second involves statistical estimation of infection transmission system model parameters by fitting network conformations to observed patterns of infection in a population. Both types of statistical inference might be integral to causal or control action inferences.

When one is making causal or control action inferences about particular situations, one must consider a range of possible network conformations that might represent the particular situation one is modeling. Statistical analysis of network data on who is connected to whom might assist with this task. This type of data is obtained in various ways, as outlined later in Section 5.4. All of the different methodologies for gathering this data are problematic. But any available information of this type deserves to be integrated into inference robustness assessments.

One way to use the output of analyses of contact patterns from contact histories is to enter the full range of network parameters consistent with observations into an inference robustness assessment. For example, the effect of age contact structures on inferences about what age groups are the principal transmitters of respiratory infections has been evaluated using this approach (82). As another example, consider inferences about whether a contact tracing strategy will be effective. Network model analysis might show that this depends on how much clustering of contacts there is in a population. The degree of clustering might then be assessed statistically from contact histories.
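As a minimal sketch of assessing clustering from contact histories, the code below computes one common descriptive measure, transitivity, defined as the fraction of connected triples (two contacts of the same individual) in which those two contacts are also linked to each other. It assumes a complete, undirected list of reported contacts with no self-links, which real contact histories rarely provide. The function name and the example data are hypothetical.

```python
from itertools import combinations


def transitivity(edges):
    """Fraction of connected triples that are closed into triangles."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    triples = 0
    closed = 0
    for contacts in neighbours.values():
        # Each pair of contacts of the same individual forms one connected triple.
        for u, v in combinations(contacts, 2):
            triples += 1
            if v in neighbours[u]:
                closed += 1   # each triangle is counted once per corner
    return closed / triples if triples else 0.0


if __name__ == "__main__":
    # Hypothetical contact history: a closed triad plus one extra contact.
    reported = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
    print(transitivity(reported))
```

A point estimate of this kind is only a starting input. For an inference robustness assessment one would want the range of values consistent with the data rather than a single number.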
Another way to perform inference robustness assessments using statistical models of networks is to use statistical models that make different assumptions about the underlying shape of the data. For example, clustering might be assessed using different statistical models that make different assumptions about the forces leading to clustering. Then the estimates from these different models might be used in a robustness assessment.
Measures of transitivity and connectivity are especially relevant to epidemiology. One productive approach to statistical estimation of relevant transitivity and connectivity parameters builds upon exponential random graph models (83). Generalizable descriptive statistics of network characteristics like connectivity may be useful for inference robustness assessment in a couple of ways. First, they may generate hypotheses that lead to inferences whose robustness deserves assessment. Second, such descriptions might help understand how contact processes lead to effects on infection transmission systems through effects on network conformation. These do not fall into the main inference robustness assessment activity of relaxing simplifying assumptions. But they help rationalize when and how to undertake that activity.

With regard to estimating transmission system parameters by fitting network models, Markov Chain Monte Carlo (MCMC) methods have demonstrated both theoretical and practical utility (84-89). A particularly promising development is the integration of network structure modification steps directly into MCMC estimation algorithms (84). The approach taken is to impute missing information in the form of a graph that describes potential infectious contacts between individuals. The graph may initially be random but then connections can be switched in iterative steps one at a time until some stability in posterior distributions is achieved. This was done for estimating transmission probabilities within and between groups, for estimating transmission thresholds, and for estimating the number of network connections within and between groups. These latter entities represent the imputed missing data derived from the posterior distributions of an MCMC process that adds and subtracts network connections in iterative steps.
While it might seem that such an imputation would be quite demanding, the limited range of possible networks that are consistent with the data gives this approach practical utility. MCMC output is in a particularly useful form for inference robustness assessments. When comparing two control action choices, the full range of posterior distributions consistent with the data can be sampled and the frequency with which one choice or the other is best can be observed.
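The comparison of two control actions across posterior draws can be sketched as follows. This is a minimal illustration, not the procedure of the cited studies: the stand-in posterior distributions, the two hypothetical actions (A halves within-group transmission, B halves between-group transmission), and all function names are assumptions introduced here.

```python
import random

def draw_stand_in_posterior(n_draws, seed=0):
    """Stand-in for real MCMC output: each draw is one joint sample of the
    quantities the text lists (all distributions here are made up)."""
    rng = random.Random(seed)
    return [
        {
            "p_within": rng.betavariate(8, 12),     # transmission prob., within group
            "p_between": rng.betavariate(3, 17),    # transmission prob., between groups
            "links_within": rng.randint(40, 60),    # imputed within-group connections
            "links_between": rng.randint(80, 160),  # imputed between-group connections
        }
        for _ in range(n_draws)
    ]

def expected_transmissions(draw, within_scale=1.0, between_scale=1.0):
    """Toy outcome: expected transmissions across all links under one draw."""
    return (within_scale * draw["p_within"] * draw["links_within"]
            + between_scale * draw["p_between"] * draw["links_between"])

def fraction_a_best(draws):
    """Share of posterior draws in which action A leaves fewer transmissions.
    Action A halves within-group transmission; action B halves between-group."""
    wins = sum(
        expected_transmissions(d, within_scale=0.5)
        < expected_transmissions(d, between_scale=0.5)
        for d in draws
    )
    return wins / len(draws)

draws = draw_stand_in_posterior(5000)
share = fraction_a_best(draws)
print(f"Action A is best in {share:.1%} of posterior draws")
```

A share near 0 or 1 would suggest that the choice between actions is robust to the uncertainty remaining in the posterior; a share near one half would suggest it is not.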
A different sort of interaction between inference robustness assessment and statistical analysis is the use of model transitions to simplify the statistical models of transmission systems. Models with too much realistic detail may not be appropriate for statistical analyses either because they have too many parameters that need estimation or because the extent of realistic detail in the model makes calculations too burdensome for practical purposes. Inference robustness assessment might first determine what type of detail in a model is needed for robust inferences. That might entail a model that is too complex and detailed for practical use in a parameter estimation procedure. Then a variety of model reduction approaches could be used to generate statistical models that are more practical for parameter estimation. Such reduction approaches might impose diverse simplifying assumptions that are known to be incorrect. For example, complex models with detailed movement parameters for different classes of individuals might be reduced to either mass action or network models. If the use of estimation models making different model assumptions makes no difference for an inference, robustness has been demonstrated.
2.9. Designing Network Models with Robustness Assessment in Mind
What can network scientists working on transmission system analyses do to facilitate integration of their models into an inference robustness assessment framework? First, they should clearly separate macro-network structure issues from micro-network structure issues. Micro-network structures that are generated by contact processes should be the focus of network analysis. The social processes and spatial relationships determining macro-network structures are most likely unique in each situation one might model. Thus one will want to make hypotheses about and explore the consequences of differences in macro-network structure.
As just discussed, that is currently easier in a deterministic compartmental model context. Therefore network models that facilitate transition to compartmental models that capture macro-network structure are preferred. Compartmental model macro-network structure is generated by distinct
rates of contact between different risk groups. Therefore the network model should generate its macro-network structure in the same way. Second, network modelers should identify and state clearly all model assumptions. The failure to recognize the effects of model assumptions is the major cause of faulty inferences. The failure to recognize model assumptions is the main reason for failing to see their effects. Once model assumptions are made clear and the consequences of their realistic violation are considered, then inference robustness assessments follow naturally.
3. Nucleotide Sequence Traces through Contact Networks
Nucleotide sequencing is cheap, fast, widely available, and becoming more so. How best to use it is yet to be defined. Epidemiologists currently use it to make inferences about whether transmission could have occurred between two individuals and/or to determine which set of individuals were infected from a common source (90). That provides useful information about transmission systems but, for the same reasons we will discuss in Section 5.4 regarding collection of histories to analyze networks of contact, we will never be able to get an overall view of the shape of the network from such individual level deductions. We need population inference processes. Molecular information about infectious agents isolated from individuals can be used in two ways to assist population inferences about networks. It can be used to classify individuals or to establish genetic distances. Individuals can be classified as having agents that are mainly circulating in one population or another. This has been helpful for tuberculosis (91,92), sexually transmitted infections (93,94) and HIV. HIV strains characteristic of intravenous drug users, homosexuals, and/or heterosexuals can often be identified. But the strains characteristic of one group can be found in some members of other groups.
Such distributions of genetic types by risk group can be fit by models to estimate model parameters. Restriction Fragment Length Polymorphism (RFLP) analysis of microbial genomes of Tb organisms from individuals of different ages was used to classify organisms into small groups of identical strains.
Clusters of identical Tb strains were analyzed with regard to age and gender structure in relationship to the overall population of infected individuals to infer what age groups and genders are infecting what other age groups and genders with tuberculosis (91,92). For STIs, strains of Chlamydia were classified from individuals that were linked by extensive but quite incomplete contact tracing (93). This helped fill in some gaps where linkage information was incomplete and to check for the consistency and inconsistency of histories not just in individual linkages but with regard to the pattern of linkages. Similar work has been undertaken for gonorrhea (94). Genetic distance data indicate how many transmissions there might be along the transmission tree that led from one individual down two separate chains of infection to two individuals whose distance is being measured. Genetic distances can be used to analyze the shape of genetic trees and therefore of transmission trees. The transmission trees generated by models can then be fit to the patterns of genetic trees for the agents. A task network scientists should undertake is to extract the information on transmission trees in a way that helps inferences about the conformation and behavior of infection transmission systems. It seems there should be information on both micro- and macro-network structure in nucleotide sequences.
Figure 1. Consider the traces that could have been provided by genetic analysis of four infectious agents from individuals A through D. Infection has percolated to these individuals through network nodes consisting of other individuals in whom the agent first replicated and then transmitted. Genetic analysis will reveal that agents from C and D are closely related to each other and equidistant from the agent of B and only distantly related to the agent of A.
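The distance pattern described in the caption of Figure 1 can be reproduced with a toy calculation. The sketch below is only illustrative: the 16-base sequences, the mutation positions, and the helper names are invented here, and a simple count of differing sites (Hamming distance) stands in for a proper phylogenetic distance.

```python
from itertools import combinations

def mutate(seq, changes):
    """Return seq with the {position: base} substitutions applied."""
    bases = list(seq)
    for pos, base in changes.items():
        bases[pos] = base
    return "".join(bases)

def hamming(x, y):
    """Number of sites at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))

# Hypothetical descent: each transmission chain accumulates substitutions.
root = "A" * 16
seq_a = mutate(root, {0: "C", 1: "C"})                      # early, separate branch
bcd_ancestor = mutate(root, {i: "G" for i in range(2, 8)})  # path shared by B, C, D
seq_b = mutate(bcd_ancestor, {8: "T"})
cd_ancestor = mutate(bcd_ancestor, {9: "T", 10: "T"})       # path shared by C, D only
seq_c = mutate(cd_ancestor, {11: "C"})
seq_d = mutate(cd_ancestor, {12: "C"})

samples = {"A": seq_a, "B": seq_b, "C": seq_c, "D": seq_d}
distances = {
    (u, v): hamming(samples[u], samples[v])
    for u, v in combinations(sorted(samples), 2)
}
for pair, dist in distances.items():
    print(pair, dist)
```

The shared substitutions are the record of the path the agents have in common; the private substitutions mark where the chains diverged.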
The genetic tree corresponding to the transmission tree can be inferred using phylogenetic analysis methods. The tree will be inferred with error. But the extent of random error can be quantified such that the range of shapes to that tree can be specified. Thus a record is kept in the sequences of infectious agents regarding what elements their paths have in common and where they diverge. We will point out in Section 5.4 how difficult it is to piece together chains of contact from questionnaires. The infectious organisms, however, have been going around keeping a record of where they have been. Just as the human genome has demonstrated the migratory patterns of humans out of Africa better than any history, pathogen genomes are structured by their transmission histories and they can document that better than any set of contact histories can. Network analysis should take on the challenge of extracting that information in ways that allow inferences about the contact network shapes that could have generated the transmission tree shapes. There are a number of epidemiological questions that might be pursued from analysis of nucleotide sequence data. These include: Which groups are involved in sustaining transmission? Which are only being peripherally infected? Which are amplifying and disseminating transmission? There are similar questions relevant to risk factors: In what categories of individuals will changes in risk factors generate the broadest public benefit in the total population? Risk factor changes might involve behavior changes, sanitation intervention, hygiene improvement, vaccinations or infection treatment. Conversely we can ask in what categories of individuals will interventions have little value because they will become infected by some means despite the intervention and still carry on the chain of transmission. One key to extracting such information is linking sequence data to macro-network data describing contact rates between risk groups.
To date, nucleotide sequence data has been available mainly from convenience samples not linked to other network data. Thus the assumption of random mixing has been employed and inferences sought have been restricted to past patterns of infected population sizes (95). If network scientists find ways to extract the information on networks from their nucleotide sequence traces, then epidemiologists will design studies
to collect the data and insist that bioinformatics databases are linked to network data.
4. Risk Factors for Transmissibility
Controlling the spread of infection is a goal of public health. Epidemiology is the basic science of public health relevant to this task. But epidemiology in the last half of the 20th century focused on individual risk factors and, at least in the United States, defined itself by the use of methodologies that assume there is no interaction between individuals and thus assume away system behavior driven by networks of interaction between individuals. The dominant methods of epidemiological analyses assess risk factor effects using models that assume the outcome of one individual is independent of outcomes in other individuals. That is to say, they assume away infection transmission along with any network connections between individuals. This methodology was driven by chronic, non-infectious diseases. But it is also applied to infectious diseases and their control. It has had the nefarious effect of causing epidemiologists to ignore the most important risk factors – those that affect contagiousness and those that affect network conformation. Epidemiologists seek causes by comparing the experience of individuals with and without a risk factor. This can only detect risk factors that increase an individual’s susceptibility. Study designs that compare the infection experience of individuals exposed to source case infections with different risk factors affecting contagiousness are rare and expensive. Infection transmission system modelers seem to have almost the opposite problem. The main scientists modeling infection transmission as a causal system phenomenon at the population level have not been trained as epidemiologists. They are mathematicians and mathematical biologists. But these scientists have largely ignored the importance of risk factor identification and elimination in the control of infection transmission.
Most of the work by mathematical biologists on analyzing infection transmission systems in the 20th century used methodology that made it hard to analyze data to detect contagiousness or transmission enhancing environmental risk factors. They did not model individuals on
whom risk factor data could be assessed in a manner familiar to epidemiologists. The standard approaches to infection transmission system modeling are illustrated in a couple of very helpful texts (41,49). Notably, however, these texts do not deal with the analysis of transmissibility-related risk factors. Using discrete individual transmission models for epidemiological data analysis could potentially provide methods to determine the roles of risk factors in transmission systems, including risk factors that increase contagiousness. That is a highly worthy goal. Let us consider why. Part of the importance of risk factors that increase contagiousness is that there is more variability in contagiousness and more potential for control of risk factors that generate such variability. The greater variability of contagiousness derives from both behavioral and biological factors. The greater potential for control involves more modifiable behaviors, the potential of treatment to decrease contagiousness but not susceptibility, the potential for decontamination efforts to stop transmission, and correctable hygiene and sanitation deficiencies that affect contagiousness. Further increasing the importance of risk factors that increase transmissibility is the fact that they have greater system effects than risk factors that increase susceptibility by the same amount (96). That is because the most susceptible individuals are the first to get infected and the consumption of these susceptible individuals then slows transmission. There is no comparable consumption of the most contagious individuals. Identifying risk factors that affect transmissibility is tightly related to identifying modes of transmission. Classification of transmission modes can be made in diverse ways. Transmission modes might include the following and many more categories:
• Airborne transmission, where infectious organisms stay suspended in air through the formation of droplet nuclei that can travel considerable distances
• Droplet transmission, where organisms settle quickly out of the air and contact must usually be within 1.5 meters to be effective
• Fecal-oral transmission involving any route, direct or indirect, from feces to mouth
• Skin to skin transmission through direct contact
• Sexual transmission
• Surface or fomite mediated indirect transmission from either droplet or skin sources
• Water borne transmission
• Food borne transmission
• Vector borne transmission involving intermediate, often arthropod, hosts.
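Returning to the earlier point that a risk factor raising contagiousness has a greater system effect than one raising susceptibility by the same amount, the consumption argument can be checked with a small calculation. The sketch below is an assumption-laden illustration rather than the analysis of reference (96): it uses the standard final-size relation for a randomly mixing population with no network structure, and the parameter values (a risk factor in 20% of people that doubles either trait, and beta = 1.5) are invented.

```python
import math

def final_size(beta, groups, iterations=2000):
    """Total attack fraction in a randomly mixing population.

    groups: list of (fraction, susceptibility, contagiousness).
    Iterates the final-size relation
        force = beta * sum_i f_i * c_i * z_i,   z_i = 1 - exp(-s_i * force)
    starting from a large outbreak so the non-trivial root is found.
    """
    attack = [0.5] * len(groups)
    for _ in range(iterations):
        force = beta * sum(f * c * z for (f, s, c), z in zip(groups, attack))
        attack = [1.0 - math.exp(-s * force) for (f, s, c) in groups]
    return sum(f * z for (f, s, c), z in zip(groups, attack))

beta = 1.5
# Same risk factor prevalence and same multiplier; only the affected trait differs.
more_contagious = [(0.8, 1.0, 1.0), (0.2, 1.0, 2.0)]
more_susceptible = [(0.8, 1.0, 1.0), (0.2, 2.0, 1.0)]

z_contagious = final_size(beta, more_contagious)
z_susceptible = final_size(beta, more_susceptible)
print(f"risk factor doubles contagiousness:  {z_contagious:.3f}")
print(f"risk factor doubles susceptibility: {z_susceptible:.3f}")
```

Both populations start with the same initial reproduction number under these assumptions, yet the outbreak is smaller when the factor acts on susceptibility, because the highly susceptible group is depleted early.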
Most infectious agents are transmitted via multiple modes. Epidemiology has developed many methods to determine which modes of transmission exist but none to determine what roles the different modes of transmission play in the transmission system. Although we have long known that various common cold viruses are transmitted directly via skin to skin contact, and via air and surfaces, we have little intuition as to what effects interrupting just one mode of transmission or the other would have on population levels of infection. Likewise, although we know that HIV can be transmitted via either oral or anal sex, the population effects of reducing these modes of transmission are highly controversial with few helpful analyses. The traditional epidemiology approach allows for specifying modes of transmission by determining how contacts of different kinds are associated with infection risks. For example, the association of anal sex and intravenous drug injection with HIV infection has been observed with enough consistency and control of confounding factors to convince epidemiologists that HIV can be transmitted via each of these modes. But because the association methodologies used ignore network structures, the role of different modes of transmission in a transmission system cannot be determined by those methods. Network structure can cause the reduction of contacts via one mode to have markedly different effects from the reduction of the same number of contacts via another mode. Consider an example where one group is sustaining transmission via mode A and disseminating infection via that mode to another group that transmits mainly via mode B. If chains of transmission via mode B die out quickly, then eliminating a specified number of transmissions via mode A could cut off extended
chains of transmission while eliminating the same number of transmissions via mode B would have much smaller indirect effects. This may be the case for waterborne vs. direct transmission of many enteric pathogens (77-79,81), anal vs. oral sex as well as transmission during early vs. late infection for HIV (97,98), and possibly airborne vs. direct transmission of respiratory infection. Traditional epidemiology cannot address these issues. A systems analysis approach is needed for this. Network models might have multiple links between individuals corresponding to different modes of transmission in individual network models. In bipartite graph models where sites connect individuals and vice versa, different sites might have transmission weighted by the modes of transmission involved. The first approach might facilitate analysis of individual risk factor effects while the second might facilitate analysis of environmental risk factors.
5. Overview of Infection Transmission Network Models
Let us now consider the elements of network models in light of the three priorities for network analysis of infectious diseases. We first consider infection processes within hosts and then transmission processes between hosts before considering the standard issues covered by most chapters in this book, including choice of interacting partners in a network model, the choice of interaction events, methodological issues, computational issues, network macrotopology, network microtopology, spatial issues, dynamics, and control.
5.1. Infection Processes and Infectious Agent Characteristics
A major division of infection processes that affects model structure is whether infections are microparasitic or macroparasitic. In microparasitic infections the infectious agent replicates in the host such that infectious load and transmissibility of an infected person is assumed to be independent of the total exposure dose and exposure doses are not cumulative over time.
Microparasitic infections also have the characteristic that re-exposure to more infectious agent after infection begins is irrelevant because such exposures are always far less than the
infectious load an individual already has. Most viral and bacterial infections and many unicellular parasitic infections are microparasitic. In macroparasitic infections the infectious agents acquired from the environment do not replicate in the host but rather each acquired agent constitutes a new source of infectious agents that can be transmitted from one host to another or indeed back to the original host, usually via some environmental contamination or vector. In most macroparasitic infections acquired immunity does not significantly inhibit the acquisition of additional infectious agents by the already infected host. Most infections with helminths such as intestinal worms are macroparasitic. Most network models are of microparasitic infections. Within these, the temporal patterns of infection and of acquired immunity dictate different model assumptions. The simplest standard model assumes that upon transmission an individual goes from a uniformly susceptible state (S) to a uniform infected and infectious state (I) and thereafter to a completely immune state (R). This SIR model form is useful when seeking initial insights into system behavior for some infections. Its realistic violation, however, can lead to marked changes in system behavior. Almost always there is an incubation period (E) during which an exposed individual has acquired an infectious agent but has not yet become contagious. Thus an initial realistic relaxation of assumptions is to transition to an SEIR model form. More importantly, no infections induce complete and everlasting immunity. Thus elaboration of the R state is often called for in robustness assessment. Some infections, such as measles, have seemed to induce nearly lifetime immunity but elimination of infection in the population has revealed that enduring immunity requires continual re-exposure to the agent to boost immune responses. All infections induce some acquired immunity.
For a few, however, the immunity seems too negligible to be included in a model. Thus a simple SIS model form might be used in seeking insights about system behavior. Often the lack of immunity is due to the fact that infectious agents are so variable that immunity stimulated against one particular strain is not helpful in protecting against the vast majority of agents that are not distinguished by the sophisticated methods needed to detect the multitude of immune stimulation and immune response variations that are possible. This is especially true for infections like gonorrhea.
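The three model forms just described (SIR, SEIR, and SIS) differ only in which state transitions are allowed, which a short deterministic sketch makes concrete. This is a mass action illustration with no network structure; the rates, the step size, and the function name are assumptions chosen for illustration and are not taken from any cited model.

```python
def simulate(form, beta=0.5, sigma=0.25, gamma=0.2, days=200, dt=0.1, i0=0.001):
    """Euler integration of SIR, SEIR, or SIS dynamics on population fractions.

    beta: transmission rate, sigma: rate of leaving the incubation state E,
    gamma: rate of leaving the infectious state I. Returns a list of
    (s, e, i, r) tuples, one per time step.
    """
    if form not in ("SIR", "SEIR", "SIS"):
        raise ValueError(f"unknown model form: {form}")
    s, e, i, r = 1.0 - i0, 0.0, i0, 0.0
    history = []
    for _ in range(round(days / dt)):
        new_infections = beta * s * i * dt
        recoveries = gamma * i * dt
        if form == "SEIR":
            onsets = sigma * e * dt  # exposed individuals becoming contagious
            s, e, i, r = (s - new_infections, e + new_infections - onsets,
                          i + onsets - recoveries, r + recoveries)
        elif form == "SIR":
            s, i, r = s - new_infections, i + new_infections - recoveries, r + recoveries
        else:  # SIS: recovered individuals return directly to S
            s, i = s - new_infections + recoveries, i + new_infections - recoveries
        history.append((s, e, i, r))
    return history

for form in ("SIR", "SEIR", "SIS"):
    s, e, i, r = simulate(form)[-1]
    print(f"{form}: S={s:.3f} E={e:.3f} I={i:.3f} R={r:.3f}")
```

Under these assumed rates the SIR and SEIR runs burn out while the SIS run settles at an endemic level, which is the kind of marked change in system behavior that relaxing the immunity assumption can produce.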
In this and many other common bacterial infections the evolution of different antigenic variations that affect immunity is so rapid and involves such intricate complexity that it makes little sense to distinguish different immune types. In many viral infections such as polio, and some bacterial infections such as those caused by Streptococcus pneumoniae, the immune variants fall into a number of fairly cohesive categories with little cross-reactive immunity between them so that it becomes worthwhile to keep track of each individual variant in an infection transmission system model. The model assumption that a single unvarying infectious agent is involved is almost never true. It is the variations in the infectious agents that make the genomic sequencing discussed earlier a potentially useful approach to working out infection transmission networks. But in a few unusual cases the variations do not affect immunity very much. One of those cases is hepatitis A. Worldwide there is no meaningful difference in the immunity stimulated by highly variable hepatitis A viruses. Thus hepatitis A viruses that have only 60% homology in terms of nucleotide sequences can stimulate immunity to each other almost as well as each agent can stimulate immunity to itself. More commonly there is some cross-reactive immunity to different variants of the infectious agent but this cross-reactive immunity is not complete. It is difficult in most model forms to handle this situation completely. Model complexity explodes as different cross-reacting strains are added to the model. One approach to handling these realistic effects is to capture the effects of continuing agent changes by modeling waning immunity across multiple immune levels (99). Compartmental models cannot handle agent and immune process diversity well. Even given modest agent and risk behavior diversity, the number of compartments can reach astronomical levels.
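The growth in the number of compartments can be made concrete with simple arithmetic. The counting rule below is one simplifying assumption among several possible ones: each individual is taken to be susceptible, infected, or immune with respect to each strain independently, and the population is further split into risk groups. The function name and the example values are illustrative only.

```python
def compartment_count(n_strains, states_per_strain=3, n_risk_groups=1):
    """Compartments needed if every combination of per-strain status
    (for example S, I, or R for each strain) is tracked in every risk group."""
    return n_risk_groups * states_per_strain ** n_strains

for n_strains in (1, 2, 5, 10, 20):
    print(n_strains, "strains:", compartment_count(n_strains, n_risk_groups=4))
```

A discrete individual simulation avoids this growth because it stores one status vector per person rather than one compartment per possible combination.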
Discrete individual simulations, including network model simulations and individual event history mass action simulations, are needed. As agent and immune response diversity increases, it seems likely that micro-network structures would gain in importance. This speculation has not been examined to my knowledge. The increased importance would derive from the fact that as each variant moves through a population, it might do so on very sparse networks of immunity holes left by other variants. This
could be especially important for bacterial infections and modeling of such complex relationships may be necessary to devise an effective strategy for controlling agents like those that cause otitis and sinusitis (99).
5.2. Modes of Transmission
The ways that infectious agents get out of a host to begin their transit to another host are highly diverse and highly dependent upon both biological and behavioral characteristics of the host. Even within respiratory infections there is great diversity. Different mucous membranes may be affected. Skin can provide an exit route directly or by touching mucous membranes. Agents may be aerosolized at lower, middle, or upper levels of the respiratory tract with consequences on the size and survivability of resulting aerosols. The routes agents can travel to another host are also diverse. The effectiveness of different routes is not only a function of the agent, but also of environmental factors like humidity, sunlight levels, characteristics of the inanimate objects that might carry an infectious agent, and above all behaviors of those involved in transmission. Again just considering the respiratory agents, there will be diverse mixtures of the role of skin, surface, droplet, or aerosol mediated transmission. The extent of mixture may vary not only by agent but by environment. The routes an agent may take to enter a host and the effects of host factors on which routes might work are again highly diverse. Likewise infectious agents are highly diverse with regard to how well they survive in the environment as they transit from one person to another, how much they can multiply outside of the human host, and the number of organisms in an exposure dose that is required before transmission becomes likely. Given this complexity, modes of transmission, as discussed earlier, can be conceptualized in many different ways.
Issues of how modes of transmission affect transmission dynamics have been glossed over in both epidemiological investigations and models of transmission. Differences between airborne and droplet spread or in the survivability of an agent in the environment are usually ignored. I believe these are important issues to address. When a new infectious
disease like SARS emerges, the actions that will work to control it will depend on the mixture of modes of transmission an agent has and the mixture of environments where different modes can act. To focus control efforts, the mode of transmission must be known. When deciding on actions designed to reduce contact in order to control transmission, the relative importance of transmission in crowd settings vs. intimate gatherings will vary markedly by mode of transmission. From a modeling standpoint, a first decision in addressing modes of transmission is whether to incorporate the media into the model or not. The models of the Virginia Tech group originally incorporated media. Individuals go to a site and contaminate media or pick up contamination from media at the site. In those models, great complexity was added to the parts of the model that specify to which sites individuals go. Little specificity is added to media contamination to correspond to specific differences in airborne, droplet, skin, or surface spread. This seems like a serious imbalance that needs to be addressed by proper robustness assessments. It seems likely that important inferences might not be robust to tacit model assumptions made about the mixture of transmission modes in models that do not specify these. These deserve field investigation so that modelers can specify them in their models. But even without field data, inferences might be assessed as to their robustness to different space-time patterns of contact between people that are involved in infection transmission. The appropriateness of decisions on what details to include or leave out will depend on the inferences that one pursues. If one were focusing on how to extrapolate information from transmission in a hospital to transmission at community sites, one would most likely need more details with regard to the media.
If one were focusing on how resources should be directed to populations with different geographic and social structures, perhaps the movement of individuals would be more important to ensure inference robustness. A second modeling decision depending upon modes of transmission is whether to make contact symmetric or asymmetric. There are two aspects to symmetry. The first is symmetry as to who is contagious and who is susceptible. The second is how timing of contact or timing of
movement of individuals affects who can transmit to whom. The first type of asymmetry is generated both by biological and behavioral factors. For organisms transmitted fecal-orally, there are a few individuals who will contaminate media like swimming water with feces and many individuals who will take up the media orally. Thus asymmetric transmission should be considered. The same may be true for airborne transmission as only individuals with respiratory tract conformations that will aerosolize infectious agent might put many agents into the air while anyone breathing can pick the agent out of the air. This first type of asymmetry can be handled by specifying contagiousness and susceptibility differently for individuals with different characteristics. This can be done within the context of models that do or do not incorporate media and that employ structured mixing or metapopulation movement mechanisms for determining population patterns of contact. The second type of asymmetry arises because one person comes in contact with media before another person. The latter person cannot transmit to the first but the first can transmit to the latter. A judgment needs to be made as to whether such asymmetries will balance out in a manner that does not require their modeling. For airborne transmission, that seems likely. As the survival of the organism increases and the dilution rate of the media decreases, asymmetries may become more important. Also if the social structure of contacts generates a particular order of contact, such as might happen in needle sharing, then directional asymmetry will be more important. Since enteric organisms like rotavirus and norovirus survive a long time in the environment, robustness of any inference to symmetry assumptions would deserve greater consideration than for organisms like Shigella sonnei that will die out quickly. 
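The second type of asymmetry can be illustrated by deriving directed arcs from the order in which individuals touch a shared medium. The sketch below is a hypothetical construction, not a model from the literature cited here: the visit lists, the single survival window, and the function names are all assumptions.

```python
def directed_arcs(visits, survival_time):
    """Arcs (source, target) such that the target touches the medium after the
    source does, while an agent left by the source could still be viable.

    visits: list of (person, time) contacts with one shared medium.
    """
    arcs = set()
    for source, t_source in visits:
        for target, t_target in visits:
            if source != target and 0 < t_target - t_source <= survival_time:
                arcs.add((source, target))
    return arcs

def reciprocity(arcs):
    """Fraction of arcs whose reverse arc is also present (0.0 if no arcs)."""
    if not arcs:
        return 0.0
    return sum((v, u) in arcs for (u, v) in arcs) / len(arcs)

# A scripted order of contact, as might occur in needle sharing.
scripted = [("a", 0), ("b", 1), ("c", 2)]
# Repeated, alternating contact with the same medium.
alternating = [("a", 0), ("b", 1), ("a", 2), ("b", 3)]

print(sorted(directed_arcs(scripted, survival_time=2)), reciprocity(directed_arcs(scripted, 2)))
print(sorted(directed_arcs(alternating, survival_time=1)), reciprocity(directed_arcs(alternating, 1)))
```

When reciprocity is close to one, the asymmetries balance out and undirected arcs may be an adequate simplification; when it is close to zero, inferences that rest on undirected arcs deserve a robustness check.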
Models that incorporate media and use metapopulation movement formulations have a built-in mechanism for generating directional asymmetries in contact. Models without media or that use the statistical mechanics assumptions of structured mixing cannot formulate directional timing of contact. Network models that begin with undirected arcs and that can be readily modified to add patterns of asymmetry that might be consistent
with different hypothesized modes of transmission might be the easiest way to assess whether temporal asymmetries in contact processes could be important to consider in inference robustness assessments. Incorporating media and movement to handle asymmetries is a much more complicated task.
5.3. Interacting Partners in Infection Transmission Network Models
It has already been explained how the choice of whether or not media is to be modeled determines whether or not the interacting partners will be pairs of individuals or individuals and transmission media. In order to fit into network theories and percolation analysis algorithms, network models with discrete individuals as the interacting nodes and undirected arcs are often chosen. In the network model discussed earlier of Mycoplasma pneumoniae transmission in a psychiatric hospital, bipartite graphs of contact sites (wards) and individuals were constructed but rather extreme assumptions about the mode of transmission were made by not specifying any mode of transmission (4). In order to enhance analytical power, the assumption was made that transmission from a site to an individual was independent of how many infected individuals there were at a site. That might apply to agents that are aerosolized at a very high level and that require very low doses to cause infection. It does not correspond to droplet transmission, which is one of the principal modes among several modes via which Mycoplasma pneumoniae can be transmitted. Making assumptions that enhance analytic power is always justified when the goal is merely to gain insights into system dynamics or behavior. But even general insights can be wrong or misused in pursuing modeling goals. Thus it is worthwhile considering what could lead to lack of robustness for a percolation analysis or other network analysis.
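The difference between the two site transmission assumptions can be seen by comparing an individual's infection risk as the number of infected people at a site grows. In the sketch below the first function encodes the assumption described above, that risk does not depend on how many infected individuals are present. The second is an alternative introduced here for contrast, in which each infected individual poses an independent risk, as might be closer to droplet spread. The function names and probabilities are illustrative assumptions.

```python
def risk_site_independent(n_infected, p_site):
    """Risk is p_site whenever the site holds at least one infected person."""
    return p_site if n_infected > 0 else 0.0

def risk_per_source(n_infected, p_contact):
    """Each infected person independently transmits with probability p_contact."""
    return 1.0 - (1.0 - p_contact) ** n_infected

for n in (0, 1, 2, 5, 10):
    print(n, risk_site_independent(n, 0.2), round(risk_per_source(n, 0.2), 3))
```

The two assumptions agree when a site holds a single infected individual and diverge as cases accumulate, so an inference drawn under the first deserves to be checked against the second.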
When infection transmission is via skin to skin contact, via air, via droplets or via contamination of inanimate objects that can carry infection from one person to another, then undirected network models may be appropriate. Even then, if the organism survives for any time outside of the hosts modeled as nodes, then inferences based on network analysis of undirected arcs between hosts might lack robustness.
Likewise, if behaviors involving contamination or picking up of contamination are socially scripted such that one class of person performs a contaminating act and another performs an act that picks up contamination, then inferences from analyses employing undirected arcs may not be robust. Whether direct transmission or transmission via a medium is assumed, the connections between individual hosts are defined by the existence of an ongoing potential for the mode of transmission involved to carry infection from one person to the other. Abstracting such connections into permanent connections between hosts is an extreme simplification that is never closely approximated in the real world. It is this abstraction, however, that brings the tools of network analysis to bear on infection transmission. The appropriate definition for a fixed arc that relates such a model to the real world is problematic but we leave that for discussion in the “methodology” Section that follows.

5.4. Methodology

The book “Network Epidemiology” (100) puts field studies of network patterns into three levels. These all seek to describe micro-network conformations using data obtained in interviews about contacts.

5.4.1. Micro-network Interview Data

The first level of study seeks merely to describe local networks around randomly selected subjects by asking them about their contacts. An example where graph theoretic considerations have helped derive useful population measures from such data deals with concurrency of sexual partnerships (101). Given consistency with mixing assumptions, it was shown that inferences of epidemiological importance about population network patterns can be made from individual level data (101). Specifically, the extent to which concurrent links are linked together can be inferred.

The second level of study seeks to describe partial networks by following out the contacts of specific classes of individuals.
The typical methodology used is to have subjects name their contacts and then go to
these contacts and follow out their contacts. The only infections for which this is attempted to any extent that might be useful for constructing a network are sexually transmitted infections (STIs). Various strategies to trace out distant links and try to recompose them into a picture of the entire population have been developed for STIs. Tracing chains uses a strategy that keeps the sample size approximately constant as one goes various generations out. Snowball sampling follows out all contacts so the sample snowballs in size. Only a handful of studies have pursued either approach and none have achieved very complete descriptions of contact networks using this approach. It is rare that more than half of contacts can be traced in such an effort. For the very high risk populations where construction of the network is most important, 10% tracing represents a good effort. Thus there are many chances for biases and distortions to arise in networks of direct contact constructed from contact tracing data. An outstanding study of this type examined networks of contacts in Colorado Springs, Colorado that could have spread HIV infection (102). Eventually 3% of the Colorado Springs population had entries in this network. Many useful observations about the characteristics of the network were made, including the observation of a surprisingly close relationship between social distance and geographic distance in the key populations that could spread HIV infection (103). Another useful study of this type was conducted in Manitoba (93). The third study level accomplishes complete network descriptions by gathering data on every individual in a population in a manner that allows specification of other individuals to whom each individual is connected. Only a handful of such studies in general populations of any size have been conducted. One of these was conducted in the Nang Rong area of Northeast Thailand. 
The data in Nang Rong accrued in various steps without being designed at the start specifically for that purpose. The data permits some description of how the network evolved over time (104). This network data was used in a general heuristic manner by the Emory group within the MIDAS program to assess the chances for controlling an emergent H5N1 epidemic where this agent breaks out of its numerous avian hosts in Thailand to begin human to human transmission (74). The idea is that it might be useful to stock anti-
influenza drugs and use them intensely so such an epidemic would not start a pandemic that might bring high levels of mortality and social disruption. One study in a subpopulation of school age youth was able to describe networks at the school level with relative completeness (105). This was the adolescent health study directed by the Carolina Population Center of the University of North Carolina. Since the dominant social ties of middle and high school students are to other students, this provides a nearly complete description of ties within one age group but not outside it. The study enquired about social and romantic partners in a questionnaire administered to everyone on one day and then pursued more details in subsequent studies of selected subjects. Useful insights relevant to adolescent health and the structure of networks have been made from this study data but there have been no formal transmission model analyses performed. The challenge for network scientists is how to use limited and potentially biased data of this sort to make robust inferences. As statistical methodology improves for integrating such data into infection transmission system analyses, we can expect epidemiologists to pursue the collection of more such data. The trick will be to get data and methods development into a positive feedback loop where improving methodology encourages more support for data collection which in turn will justify more methodological development. Emphasis in methodological development for using the easiest to collect data, namely egocentric data, will do the most to stir new data collection. Analysis of such data within an inference robustness framework that assesses the effects of assumptions about further details in micro-network structure and assumptions about macro-network structure might then justify the collection of level two types of data for the micro-network structure issues and contact matrix data for the macro-network structure issues. 
A fundamental problem with interview data methodologies is the definition of contact that is relevant to infection transmission. For sexually transmitted infections interview data may be attainable using acceptable definitions of what constitutes contact and people may be able to remember contact events with some reasonable accuracy. But for other infections this is highly questionable. Currently, for example, the
MIDAS projects are building influenza transmission system models without addressing the issue of how different types of contact and different environmental conditions affect airborne transmission, which may occur over considerable time and distance, differently from droplet spread infection, which will have more restricted time and space dimensions. Some contact pattern studies have used data on who has spoken to whom under the assumption that if individuals have spoken, they could have transmitted infection (82). But given that most respiratory infections are spread by both airborne and droplet spread mechanisms and the nature of contact for these two modes is quite different, it is not clear how good this definition might be. No matter how far off transmission networks are from conversation networks, the use of conversation networks is superior to the use of assumptions about contact patterns that have little empirical support. In this regard, interview-obtained contact pattern assessments using data on who has spoken to whom were found superior in transmission system analyses for making inferences regarding the effects of age specific contact patterns on the dissemination of droplet spread infections like mumps (82). The methodology used assessed the improvement in prediction of infection patterns from using the interview-based data compared with simpler assumptions about mixing. Clearly this fits within the inference robustness assessment methodology we advocated earlier. If we fail to develop methodologies that distinguish which modes of transmission are acting under different conditions and what role these play in a transmission system, we are less likely to control any new emerging infection. Ideally for an emerging infection, one would like to determine the modes of infection that are likely to be acting in the general community from data that can be gathered in the hospital setting.
In order to do that, we will need definitions that distinguish contacts with differential risk for airborne or droplet spread transmission in the community. We will then need a combination of environmental and epidemiological data in the hospital setting to characterize the infectious agents as to their relative propensities to transmit via these modes. This can all be done. It is an issue of how much effort and resources are dedicated to the task. Perhaps this can all be done more efficiently by integrating nucleotide sequence data more fully into the task.
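The difference between the two second-level tracing strategies described in this section, tracing chains and snowball sampling, can be sketched on a toy contact list. This is a minimal illustration under stated assumptions: the `contacts` dictionary, the function names, and the rule of following exactly one randomly chosen contact per person in the chain design are hypothetical simplifications, not the protocols used in the cited field studies.

```python
import random


def snowball_sample(contacts, seeds, generations):
    """Follow out every named contact, so the sample grows each generation."""
    sampled = set(seeds)
    frontier = list(seeds)
    for _ in range(generations):
        next_frontier = []
        for person in frontier:
            for contact in contacts.get(person, []):
                if contact not in sampled:
                    sampled.add(contact)
                    next_frontier.append(contact)
        frontier = next_frontier
    return sampled


def chain_sample(contacts, seeds, generations, rng):
    """Follow one randomly chosen untraced contact per person (an assumed
    rule), so each generation adds at most as many people as there are seeds."""
    sampled = set(seeds)
    frontier = list(seeds)
    for _ in range(generations):
        next_frontier = []
        for person in frontier:
            untraced = [c for c in contacts.get(person, []) if c not in sampled]
            if untraced:
                chosen = rng.choice(untraced)
                sampled.add(chosen)
                next_frontier.append(chosen)
        frontier = next_frontier
    return sampled


if __name__ == "__main__":
    # Toy network: every person names three further contacts.
    toy = {i: [3 * i + 1, 3 * i + 2, 3 * i + 3] for i in range(13)}
    print(len(snowball_sample(toy, [0], 3)))
    print(len(chain_sample(toy, [0], 3, random.Random(1))))
```

Neither function models incomplete tracing; dropping a random fraction of named contacts before following them would be one way to explore the biases that low tracing rates introduce.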
5.4.2. Macro-network Interview Data

In addition to these methodologies that focus on completing micro-network structures, various approaches can be used that seek only to describe macro-network structures. One such approach is to have individuals identify characteristics of their partners that they can readily ascertain and that the investigator can use to classify the subject as well. This has been called the matrix construction approach (100). Another approach is to have subjects identify the sites they visit where contacts could be generated and then to make assumptions about the encounter and linkage processes that lead two people at a site to make a contact and then to engage in behaviors that might entail transmission. This has not been widely attempted but anecdotal experiences indicate that it might be useful. It is reasonable to think it might be subject to fewer biases than the partial network description approaches. It will be subject to the validity of assumptions about contact process at a site, but the robustness of inferences to various violations of assumptions can be readily assessed.

5.4.3. Environmental Contamination Data

Data on the level of environmental contamination can help construct and analyze models either by specifying the potential for different modes of transmission to act under different conditions or as data the model can fit. Media sampled could be air or surfaces or vacuum cleaning filters. Environmental scientists and epidemiologists have only used such data for the first purpose. For that purpose it is important that organisms that are identified in the environment be viable organisms that could start an infection in a new host. Methodology for determining viability is expensive and consequently few studies have been conducted of environmental contamination levels except in special media like food and water where safety for consumption must be ensured.
On the other hand, it is cheaper to identify genome segments without regard to viability. If models have compartments for both viable and nonviable infectious agents, they can be fit to such data.
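A model with separate compartments for viable and nonviable agents in a sampled medium might look like the following sketch. The two-compartment structure, the rate names, and the constant number of infected hosts are all assumptions made for illustration; they are not a formulation given in this chapter.

```python
def simulate_contamination(infected, shed_rate, inactivation_rate,
                           degradation_rate, t_end, dt=0.01):
    """Euler integration of an assumed two-compartment medium model:

        dV/dt = shed_rate * infected - inactivation_rate * V
        dN/dt = inactivation_rate * V - degradation_rate * N

    V is the viable agent load in the medium, N the nonviable load whose
    genome segments are still detectable. `infected` is held constant.
    """
    viable, nonviable = 0.0, 0.0
    for _ in range(int(round(t_end / dt))):
        d_viable = shed_rate * infected - inactivation_rate * viable
        d_nonviable = inactivation_rate * viable - degradation_rate * nonviable
        viable += dt * d_viable
        nonviable += dt * d_nonviable
    return viable, nonviable


def genome_signal(viable, nonviable):
    """A genome-segment assay detects both compartments, regardless of viability."""
    return viable + nonviable


if __name__ == "__main__":
    v, n = simulate_contamination(infected=10, shed_rate=2.0,
                                  inactivation_rate=1.0,
                                  degradation_rate=0.5, t_end=50.0)
    print(v, n, genome_signal(v, n))
```

Fitting would then compare `genome_signal` with the cheaper genome-segment measurements, while only the viable compartment would feed back into transmission.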
5.4.4. Nucleotide Sequence Data

Nucleotide sequence data for some infections is being extensively gathered for reasons not related to transmission analysis. Such data might be useful for model analyses. For example, most HIV-infected individuals get their reverse transcriptase and polymerase genes sequenced in order to assess the potential for resistance to emerge to each of the different antiviral agents that might be used. Such sequences are probably not as valuable for epidemiological analyses as envelope gene sequences. But they could still have great value. For the most part, however, sequence data relevant to transmission analysis is unlikely to be gathered until modelers demonstrate that making robust control decisions depends on it. The first step in that direction is to develop methods to use information on transmission tree patterns in transmission system analyses. A key issue for using nucleotide sequence information in transmission system analyses is to link the sequences to data on where the person experiencing the sequenced infection is located in the system. Such data linkage is unlikely to be available without special studies. The most likely data to be available will characterize individuals by their risk factors. If models classify individuals by these risk factors, then patterns of genetic distances between individuals with different risk factors can be compared with patterns of transmission distances generated by models.

5.5. Computational Modeling

Network model simulations of infection transmission have been constructed in “agent based” simulation languages like Swarm (106), Ascape (107), and Repast (108). The Hopkins-Brookings Institute group in MIDAS uses Ascape. These are all higher level languages for general construction of agent based models available at modest cost or free. Epidemiologists, however, generally do not have the programming skills needed to use these languages.
Commercial modeling packages like AnyLogic (109) have proven to be valuable for model construction by researchers who are not sophisticated programmers. Various other high level simulation programs have been developed and made publicly
available for specific infections. Probably more than 20 different general programs for discrete individual simulation of infection transmission that were intended for public use and intended to be adaptable to particular infection problems have been developed. In general, however, each question is unique enough so that it is hard to use these general programs. To date, the most productive researchers have developed code that is narrowly applicable to the specific issues they are investigating. Good software might open up this field so that more epidemiologists get involved. But our effort to construct a general dynamic network simulation for sexually transmitted infections called GERMS (29) proved to be too complex to maintain and adapt readily to the multitude of unique questions that need to be addressed. We sought to simulate realistic processes of encounter and linkage formation and breakup for sexual relationships. Each time a relationship is formed, it changes the encounter and linkage environment of others. Therefore, we formulated all events as independent Markov processes and simulated one event at a time using mass action formulations in structured mixing settings to determine the sum of all encounter rates in a population. This process proved hard to tune so that specific network patterns were produced. An ideal model construction and analysis environment that this naïve epidemiologist would love to have is one that allows for compartmental model formulation of macro-network structure using structured mixing formulations. Analysis of such models using both numerical solutions of deterministic differential equations and stochastic simulations of discrete individual probability formulations should be possible. Then within the mixing sites the conversion of mass action formulations to network formulations with different micro-network structures should be facilitated. 
This is simpler than the dynamic network environment of GERMS (29) in that the mass action environment for encounter processes does not change with every event. The ability to transit between deterministic and stochastic mass action and network formulations would facilitate inference robustness assessment and allow one to make computational tradeoffs between memory and calculations in addressing specific problems.
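The idea of moving between a deterministic and a stochastic treatment of the same mass action formulation can be sketched with a single well-mixed SIS population. This is not GERMS code and omits partnership formation and breakup entirely; the function names, the SIS structure, and the parameters `beta` and `gamma` are assumptions chosen for illustration.

```python
import random


def sis_deterministic(n, i0, beta, gamma, t_end, dt=0.01):
    """Euler solution of dI/dt = beta * S * I / n - gamma * I, with S = n - I."""
    i = float(i0)
    for _ in range(int(round(t_end / dt))):
        i += dt * (beta * (n - i) * i / n - gamma * i)
    return i


def sis_stochastic(n, i0, beta, gamma, t_end, rng):
    """Discrete-individual simulation of the same rates, one event at a time.

    Events are treated as independent Markov processes: the waiting time to
    the next event is exponential in the sum of all event rates, and the
    event type is chosen in proportion to its rate.
    """
    t, i = 0.0, i0
    while i > 0:
        infection_rate = beta * (n - i) * i / n
        recovery_rate = gamma * i
        total_rate = infection_rate + recovery_rate
        t += rng.expovariate(total_rate)
        if t > t_end:
            break
        if rng.random() * total_rate < infection_rate:
            i += 1  # one susceptible becomes infected
        else:
            i -= 1  # one infected individual recovers to susceptible
    return i


if __name__ == "__main__":
    print(sis_deterministic(1000, 10, 0.5, 0.25, 100.0))
    print(sis_stochastic(1000, 10, 0.5, 0.25, 100.0, random.Random(0)))
```

In this sketch the mass action environment depends only on the counts of susceptible and infected individuals, which is what keeps it simpler than a dynamic network in which every partnership event changes the encounter environment of others.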
5.6. Macro Topology of the Network

Most broad aspects of macro-network structure of any infection transmission system within either urban or rural areas remain unknown because of the difficulties pointed out in the methodology Section. There are a few advances, however, in this regard. Studies of the space-time patterns of common infections from surveillance data over many years have demonstrated very clearly that population contact patterns specified by urban size and degrees of connection between different urban regions are clear determinants of infection patterns (110-112).

There has long been an interest regarding macro-network contact structures by age group and the patterns of childhood infections. Most contact matrices employed, however, have been arbitrary and unrealistic. Recently, however, interview data on contact patterns by age has been used in fitting infection transmission system models to observed patterns of mumps and of pandemic influenza (82). The conclusion was reached that the 13-19 age group plays an important role in spreading this sort of infection and should be a focus of control efforts for any emerging infection spread via droplets. The robustness of this inference was not fully assessed and there are reasons to think it may be wrong. This assessment of the effect of macro-network structure should clearly be reexamined using network models with different micro-network assumptions. But as this sort of assessment is done for an increasing number of infections, patterns that modelers should use in any robustness assessment for an inference will be increasingly narrowed.

Transportation system data has been touted as a key guide for control of emerging infections (70,75). Analyses of transportation determined patterns of infection have assumed uniform susceptibility to infection. In an emerging infection that a population has never experienced before, this might be a reasonable assumption.
But in most cases, immunity is a key issue to deal with in the context of population contact structure. One problem is how to separate out the influence of actual structure of contact rates between different population segments and the influence of immunity from previous infection having circulated through the same population structures and transmission systems in the past. Several investigators have tried to use travel data to predict patterns of spread for
influenza through countries or the world (113-117). In these studies model parameters were fit to observed annual epidemic patterns of influenza. But cumulative immunity from one annual epidemic to another is clearly acting and this immunity has not been taken into account. In fact the spatial-temporal patterns of influenza emergence in any year are due to the speed in different areas with which the epidemic reaches the fairly high infection levels that have to be achieved before an influenza epidemic is perceived. That speed is a function of immunity levels. The pattern of spatial spread has little to do with the spatial-temporal patterns of when epidemics are perceived. We have shown in unpublished analyses that this failure greatly distorts conclusions about the flow of infection and its control.

One vision for eventually determining the macro-network structure of a population is to combine various sorts of information. These would include the type of data and modeling structure used by the Virginia Tech group, inferences about the macro-network structure made from the study of transmission of numerous different infectious agents and their nucleotide sequence patterns in the same population, sequential serological studies in selected individuals whose contacts within the transmission system are documented, and biological studies of the potential for transmission via various modes of the various agents studied. It is my belief that every health officer should some day be able to determine what institutions and social events bring people together to form the key network connections that could sustain and amplify circulation of various infectious agents in their population.
It should be possible on the basis of such an analysis for a health officer to make decisions about changing contact structure, where to focus case detection and isolation efforts, where to focus chemoprophylaxis of infectious agents, where to concentrate tracing and quarantine efforts, and where to ensure maximal vaccination coverage. When pandemic flu hits again, it seems quite reasonable that with such knowledge, even if control of transmission is lost, total infection rates and mortality could be reduced by as much as 80% if proper actions are taken. Currently we are a long way from that. But the future is coming fast.
5.7. Micro Topology of the Network

A correlation model analysis of childhood infections demonstrates that the micro-network topology of infection transmission systems brings infected individuals in contact more often with immune and other infected individuals than would be expected by chance (47). Analysis of the contact patterns generated by simulation based on transportation pattern influence on movement to mixing sites results in a similar conclusion (70). This significantly slows transmission dynamics, broadens out epidemic peaks, and reduces variance in infection levels (31).

Expected patterns for sexual contact, airborne transmission, droplet spread, and transmission via inanimate objects or vectors should be different in this regard. In a purely heterosexual contact network there will be no closed triplets and only a modest increase in quadruplets where partners switch back and forth. In airborne transmission of respiratory infections at local mixing sites sharing the same airspace, all triplets involving people at the site will be closed and higher level structures like quadruplets or quintuplets will be saturated as well. On the other hand, as the population size in airborne contact in each unit goes up and the per person transmission rate goes down, the more the system will behave as if there is random mixing (31). But for droplet spread of respiratory infections, more direct contact will limit the formation of saturated small contact units except in households.

There has been considerable work on following infectious agents through air, on surfaces, on hands, etc. (118). But there is little field data and few tracer studies of particles in the environment that bear on issues related to how many people will be contaminated by someone in different environments. In summary, our knowledge of the microtopology of infection transmission systems is minimal.

5.8. Spatial Aspects

Space is important to infection transmission at three different scales.
On a global or large region scale, the space covered by air travel is important in the spread of emerging and other infections (116). Within localities,
distance patterns to residences or contact sites play an important role (103). At transmission sites, spatial issues arise regarding the distance that an infectious organism leaving a host might travel to find another host (118). The distinction between the space-time dimensions of airborne, droplet, and surface transmission is an important issue that, unfortunately, has not been adequately addressed. The lack of field information here may justify theoretical work on network effects that ignores this issue. But rather than ignoring this issue, it might help for theorists to incorporate it into inference robustness assessments in order to motivate appropriate data collection in this area.

5.9. Dynamics and Control

Much of the theoretical work on infection transmission systems over the past 20 years has been demonstrating how contact patterns affect infection transmission dynamics. Comments relevant to this have been dispersed throughout this document and are especially found in the Section on micro-network topology. Likewise, comments on infection control are dispersed throughout this document. The Section on inference robustness assessment has this as a major focus.

6. Conclusion and Perspective

In summary, theory for a science of infection transmission system analysis is flourishing, but data is languishing. One of the most exciting advances in the development of analytic tools for this science has been the incorporation of population structure into network analyses using percolation theory. But as long as such analyses must assume a fixed network, the robustness of inferences made will be suspect and will need to be assessed by realistically relaxing the assumption of fixed relationships. Addressing the data limitations may first require advancing analysis theory in ways that facilitate inference robustness assessments. When such assessments clarify what data are needed to make robust inferences, then the collection of such data is more likely to be supported.
Particularly valuable data that relates more specifically to network
models than to any other model forms includes nucleotide sequence data from infected individuals along with additional data on their position in the transmission system. Robustness assessments that consider how much more robust infection control inferences could be if they could be based on such data need to be performed to motivate the collection of such data. But first network scientists need to develop new methodology to incorporate such data into transmission system analyses. Recent work using pgfs in the construction and analysis of network models that permits analyses using minimal computations should give network model analysis a central role within the pantheon of model forms that might be used in an inference robustness assessment. The key to using network models in this way is to specify their assumptions and to specify the assumptions in other models that they can relax. A model does not have to specifically incorporate structures involved in transmission in order to relax assumptions about such structures. For example, different social relationships may imply different micro- and macro-network structures. Models do not need to incorporate these social processes to assess the validity of inferences with regard to realistic violation of assumptions about the effects of social structures on contact patterns. They just have to show that across a range of patterns that different social structures might be generating, the infection control inference or causal inference holds. The inferences assessed in inference robustness assessments need to be chosen carefully in order to advance support for this area of science. Network modelers need to work with epidemiologists to define control issues that they can address. As much as possible, modelers should couch the inferences they seek in terms of risk factor effects. That will facilitate use of model analysis based inferences by epidemiologists and better point the way to new data collection. 
Infection transmission system behavior and the effects of control measures are sensitive to a seemingly endless number of real world complexities. Every infectious agent is likely to circulate differently within the social structures that create potentially infection transmitting contacts between individuals. This is true even when the range of agents considered is limited to those causing respiratory infections because all respiratory infections are transmitted by a variety of different modes and
the particular mixture of those modes is likely to make crucial differences relevant to infection control decisions. But as the science of infection transmission system analysis advances, we can expect a time to come when transmission system analysis in each health jurisdiction is routinely conducted in a manner relevant to all known infectious problems and in regard to the potential emergence of new problems. As health occupies an increasing segment of economies, such routine analyses will make multi-billion dollar differences in the performance of those economies and improve and prolong the lives of almost everyone.

References

1. Newman, M.E. (2003). The Structure and Function of Complex Networks. SIAM Review. 45, 167-256.
2. Newman, M.E. (2003). Mixing patterns in networks. Phys Rev E Stat Nonlin Soft Matter Phys. 67, 026126.
3. Pourbohloul, B., Meyers, L.A., Skowronski, D.M., Krajden, M., Patrick, D.M. and Brunham, R.C. (2005). Modeling Control Strategies of Respiratory Pathogens. Emerging Infectious Diseases. 11, 1249-1256.
4. Meyers, L.A., Newman, M.E.J., Martin, M. and Schrag, S. (2003). Applying network theory to epidemics: Control measures for Mycoplasma pneumoniae outbreaks. Emerging Infectious Diseases. 9, 204-210.
5. Meyers, L.A., Pourbohloul, B., Newman, M.E., Skowronski, D.M. and Brunham, R.C. (2005). Network theory and SARS: predicting outbreak diversity. J Theor Biol. 232, 71-81.
6. Newman, M.E. and Ziff, R.M. (2000). Efficient Monte Carlo algorithm and high-precision results for percolation. Phys Rev Lett. 85, 4104-7.
7. Newman, M.E. and Ziff, R.M. (2001). Fast Monte Carlo algorithm for site or bond percolation. Phys Rev E Stat Nonlin Soft Matter Phys. 64, 016706.
8. Newman, M.E.J. (2002). Spread of epidemic disease on networks. Physical Review E. 66.
9. Newman, M.E.J., Jensen, I. and Ziff, R.M. (2002). Percolation and epidemics in a two-dimensional small world. Physical Review E. 65.
10. Newman, M.E. (2002). Assortative mixing in networks. Phys Rev Lett. 89, 208701.
11. Newman, M.E. (2003). Properties of highly clustered networks. Phys Rev E Stat Nonlin Soft Matter Phys. 68, 026121.
12. Newman, M.E. (2004). Analysis of weighted networks. Phys Rev E Stat Nonlin Soft Matter Phys. 70, 056131.
13. Read, J.M. and Keeling, M.J. (2003). Disease evolution on networks: the role of contact structure. Proceedings of the Royal Society of London - Series B: Biological Sciences. 270, 699-708.
14. Koopman, J.S. (2004). Modeling Infection Transmission. Annual Reviews of Public Health. 25, 303-326.
15. Koopman, J.S., Jacquez, G. and Chick, S.E. (2001). New data and tools for integrating discrete and continuous population modeling strategies. Annals of the New York Academy of Sciences. 954, 268-94.
16. Chick, S.E., Adams, A.L. and Koopman, J.S. (2000). Analysis and simulation of a stochastic, discrete-individual model of STD transmission with partnership concurrency. Mathematical Biosciences. 166, 45-68.
17. Kaplan, E.H., Craft, D.L. and Wein, L.M. (2003). Analyzing bioterror response logistics: the case of smallpox. Mathematical Biosciences. 185, 33-72.
18. Kaplan, E.H., Craft, D.L. and Wein, L.M. (2002). Emergency response to a smallpox attack: the case for mass vaccination. 6, 10935-40.
19. Bozzette, S.A., Boer, R., Bhatnagar, V., Brower, J.L., Keeler, E.B., Morton, S.C. and Stoto, M.A. (2003). A model for a smallpox-vaccination policy. New England Journal of Medicine. 348, 416-25.
20. Koopman, J.S. (2003). Controlling Smallpox. Science. 298, 1342-1344.
21. Halloran, M.E., Longini, I.M., Jr., Nizam, A. and Yang, Y. (2003). Containing bioterrorist smallpox. Science. 298, 1428-32.
22. Koopman, J.S., Chick, S.E., Simon, C.P., Riolo, C.S. and Jacquez, G. (2002). Stochastic effects on endemic infection levels of disseminating versus local contacts. Mathematical Biosciences. 180, 49-71.
23. Rohani, P., Keeling, M.J. and Grenfell, B. (2002). The Interplay between Determinism and Stochasticity in Childhood Diseases. The American Naturalist. 5, 469-480.
24. Jacquez, J.A. and Simon, C.P. (1993). The stochastic SI model with recruitment and deaths. I. Comparison with the closed SIS model. Mathematical Biosciences. 117, 77-125.
25. Riggs, T.W. and Koopman, J.S. (2004). A stochastic model of vaccine trials for endemic infections using group randomization. Epidemiol Infect. 132, 927-38.
26. Riggs, T.W. and Koopman, J.S. (2005). Maximizing statistical power in group randomized vaccine trials. Epidemiology and Infection. Available online.
27. Koopman, J.S. (2005). Mass Action and System Analysis of Infection Transmission. In Ecological Paradigms Lost: Routes to Theory Changes (Cuddington, K. and Beisner, B.E., eds.). Academic Press.
28. Vazquez, A. and Barabasi, A.L. (2005). The impact of non-Poisson contact processes on virus spreading. DIMACS Computational and Mathematical Epidemiology Seminar Series.
29. Koopman, J.S., Chick, S.E., Riolo, C.S., Adams, A.L., Wilson, M.L. and Becker, M.P. (2000). Modeling contact networks and infection transmission in geographic and social space using GERMS. Sexually Transmitted Diseases. 27, 617-26.
30. Jacquez, J.A., Simon, C.P. and Koopman, J.S. (1989). Structured Mixing: Heterogeneous Mixing by the Definition of Activity Group. In Mathematical and Statistical Approaches to AIDS Epidemiology (Castillo-Chavez, C., ed.). 83, 316-349. Springer-Verlag, Heidelberg.
31. Keeling, M. (2005). The implications of network structure for epidemic dynamics. Theor Popul Biol. 67, 1-8.
32. Lloyd-Smith, J.O., Getz, W.M. and Westerhoff, H.V. (2004). Frequency-dependent incidence in models of sexually transmitted diseases: portrayal of pair-based transmission and effects of illness on contact behaviour. Proc Biol Sci. 271, 625-34.
33. Rand, D.A. (1999). Correlation Equations and Pair Approximations for Spatial Ecologies. CWI Quarterly. 12, 329-368.
34. Grassly, N.C., Fraser, C. and Garnett, G.P. (2005). Host immunity and synchronized epidemics of syphilis across the United States. Nature. 433, 417-21.
35. Bolker, B. and Grenfell, B. (1995). Space, persistence and dynamics of measles epidemics. Philosophical Transactions of the Royal Society of London - Series B: Biological Sciences. 348, 309-20.
36. Rohani, P., May, R.M. and Hassell, M.P. (1996). Metapopulations and equilibrium stability: the effects of spatial structure. J Theor Biol. 181, 97-109.
37. Keeling, M.J. and Gilligan, C.A. (2000). Metapopulation dynamics of bubonic plague. Nature. 407, 903-6.
38. Miramontes, O. and Luque, B. (2002). Dynamical small-world behavior in an epidemical model of mobile individuals. Physica D. 168, 379-385.
39. Grenfell, B. and Harwood, J. (1997). (Meta)population dynamics of infectious diseases. TREE. 12, 395-9.
40. Keeling, M.J. and Rohani, P. (2002). Estimating spatial coupling in epidemiological systems: a mechanistic approach. Ecology Letters. 5, 20-29.
41. Diekmann, O. and Heesterbeek, J.A.P. (2000). Mathematical Epidemiology of Infectious Diseases: Model Building, Analysis and Interpretation. Mathematical and Computational Biology (Levin, S., Ed.), Wiley, Chichester.
42. Heesterbeek, J.A. and Metz, J.A. (1993). The saturating contact rate in marriage and epidemic models. Journal of Mathematical Biology. 31, 529-539.
43. Dietz, K. and Hadeler, K.P. (1988). Epidemiological models for sexually transmitted diseases. Journal of Mathematical Biology. 26, 1-25.
44. Bauch, C. and Rand, D.A. (2000). A moment closure model for sexually transmitted disease transmission through a concurrent partnership network. Proc Biol Sci. 267, 2019-27.
45. Bauch, C.T. (2002). A versatile ODE approximation to a network model for the spread of sexually transmitted diseases. J Math Biol. 45, 375-95.
46. Filipe, J.A. and Maule, M.M. (2003). Analytical methods for predicting the behaviour of population models with general spatial interactions. Mathematical Biosciences. 183, 15-35.
47. Keeling, M.J., Rand, D.A. and Morris, A.J. (1997). Correlation models for childhood epidemics. Proceedings of the Royal Society of London - Series B: Biological Sciences. 264, 1149-56.
48. Keeling, M.J. (2005). Extensions to Mass Action Mixing. Chapter 6, 107-55. In Ecological Paradigms Lost: Routes to Theory Changes (Cuddington, K. and Beisner, B.E., eds.). Academic Press.
49. Anderson, R.M. and May, R.M. (1991). Infectious Diseases of Humans: Dynamics and Control, Oxford University Press.
50. Moore, C. and Newman, M.E.J. (2000). Epidemics and percolation in small-world networks. Physical Review E. 61, 5678-5682.
51. Kuperman, M. and Abramson, G. (2001). Small world effect in an epidemiological model. Physical Review Letters. 86, 2909-12.
52. Newman, M.E. and Watts, D.J. (1999). Scaling and percolation in the small-world network model. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics. 60, 7332-42.
53. Newman, M.E., Moore, C. and Watts, D.J. (2000). Mean-field solution of the small-world network model. Phys Rev Lett. 84, 3201-4.
54. Watts, D.J. (1999). Small Worlds: the Dynamics of Networks between Order and Randomness, Princeton University Press, Princeton.
55. Small, M., Shi, P. and Tse, C.K. (2004). Plausible models for propagation of the SARS virus. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences Special Section on Nonlinear Theory and Its Applications. 2379-2386.
56. Grabowski, A. and Kosinski, R.A. (2004). Epidemic spreading in a hierarchical social network. Phys Rev E Stat Nonlin Soft Matter Phys. 70, 031908.
57. Liljeros, F., Edling, C.R., Amaral, L.A., Stanley, H.E. and Aberg, Y. (2001). The web of human sexual contacts. Nature. 411, 907-8.
58. Boguñá, M., Pastor-Satorras, R. and Vespignani, A. (2002). Absence of epidemic threshold in scale-free networks with connectivity correlations. cond-mat/0208163.
59. Newman, M.E. (2001). Clustering and preferential attachment in growing networks. Phys Rev E Stat Nonlin Soft Matter Phys. 64, 025102.
60. Szendroi, B. and Csanyi, G. (2004). Polynomial epidemics and clustering in contact networks. Proc R Soc Lond B Biol Sci. 271, S364-6.
61. Boguñá, M. and Pastor-Satorras, R. (2002). Epidemic spreading in correlated complex networks. Phys Rev E Stat Nonlin Soft Matter Phys. 66, 047104.
62. Girvan, M. and Newman, M.E. (2002). Community structure in social and biological networks. Proc Natl Acad Sci U.S.A. 99, 7821-6.
63. Jin, E.M., Girvan, M. and Newman, M.E. (2001). Structure of growing social networks. Phys Rev E Stat Nonlin Soft Matter Phys. 64, 046132.
64. Olinky, R. and Stone, L. (2004). Unexpected epidemic thresholds in heterogeneous networks: the role of disease transmission. Phys Rev E Stat Nonlin Soft Matter Phys. 70, 030902.
65. Pastor-Satorras, R. and Vespignani, A. (2001). Epidemic dynamics and endemic states in complex networks. Phys Rev E Stat Nonlin Soft Matter Phys. 63, 066117.
66. Pourbohloul, B. and Brunham, R.C. (2004). Network models and transmission of sexually transmitted diseases. Sex Transm Dis. 31, 388-90.
67. Sander, L.M., Warren, C.P., Sokoloff, I.M., Simon, C.P. and Koopman, J.S. (2002). Percolation on heterogeneous networks as a model for epidemics. Mathematical Biosciences. 180, 293-305.
68. Eguiluz, V.M. and Klemm, K. (2002). Epidemic Threshold in Structured Scale-Free Networks. Physical Review Letters. 89, 108701.
69. Dodds, P.S. and Watts, D.J. (2004). Universal Behavior in a Generalized Model of Contagion. Physical Review Letters. 92, 218701.
70. Eubank, S., Guclu, H., Kumar, V.S., Marathe, M.V., Srinivasan, A., Toroczkai, Z. and Wang, N. (2004). Modelling disease outbreaks in realistic urban social networks. Nature. 429, 180-4.
71. NIH. (2003). Pilot Projects for models of infectious disease agent study. (MIDAS), Vol. 2003. NIH.
72. Epstein, J.M., Cummings, D.A.T., Chakravarty, S., Singa, R. M. and Burke, D.S. (2002). Toward a Containment Strategy for Smallpox Bioterror: An Individual-Based Computational Approach. In Center on Social and Economic Dynamics, 24.
73. Longini Jr., I.M., Halloran, M.E., Nizam, A. and Yang, Y. (2004). Containing pandemic influenza with antiviral agents. American Journal of Epidemiology. 159, 623-33.
74. Longini Jr., I.M., Nizam, A., Xu, S., Ungchusak, K., Hanshaoworakul, W., Cummings, D.A.T. and Halloran, E.M. (2005). Containing Pandemic Influenza at the Source. Science. 1115717.
75. Barrett, C.L., Eubank, S. and Smith, J.P. (2005). If smallpox strikes Portland. Scientific American. 292, 42-49.
76. Riolo, C.S., Koopman, J.S. and Chick, S.E. (2001). Methods and measures for the description of epidemiologic contact networks. Journal of Urban Health. 78, 446-57.
77. Eisenberg, J.N., Brookhart, M.A., Rice, G., Brown, M. and Colford, J.M., Jr. (2002). Disease transmission models for public health decision making: analysis of epidemic and endemic conditions caused by waterborne pathogens. Environ Health Perspect. 110, 783-90.
78. Eisenberg, J.N., Soller, J.A., Scott, J., Eisenberg, D.M. and Colford, J.M., Jr. (2004). A dynamic model to assess microbial health risks associated with beneficial uses of biosolids. Risk Anal. 24, 221-36.
79. Chick, S.E., Soorapanth, S. and Koopman, J.S. (2002). Waterborne Microbial Infections: Inferring Transmission Parameters That Influence Water Treatment Decisions. INSEAD Working Papers. 1-33.
80. Chick, S.E., Koopman, J.S. and Soorapanth, S. (2003). Inferring Infection Transmission Parameters That Influence Water Treatment Decisions. Management Science. 49, 920-935.
81. Chick, S.E., Soorapanth, S. and Koopman, J.S. (2004). Microbial Risk Assessment for Drinking Water. In Handbook of Operations Research/Management Science Applications in Health Care (Sainfort, F., Brandeau, M. and Pierskalla, W., eds.). Kluwer.
82. Wallinga, J., Teunis, P. and Kretzschmar, M. (2005). Using data on social contacts to estimate age specific transmission parameters for respiratory spread infectious agents. American Journal of Epidemiology. 164, 936-44.
83. Snijders, T.A.B., Pattison, P.E., Robins, G.L. and Handcock, M.S. (2006). Sociological Methodology. 36, 99-153.
84. Demiris, N. and O'Neill, P.D. (2005). Bayesian inference for stochastic multitype epidemics in structured populations via random graphs. Journal of the Royal Statistical Society, Series B. 67, 731-746.
85. Becker, N.G., Britton, T. and O'Neill, P.D. (2003). Estimating vaccine effects on transmission of infection from household outbreak data. Biometrics. 59, 467-475.
86. O'Neill, P.D. and Roberts, G.O. (1999). Bayesian inference for partially observed stochastic epidemics. Journal of the Royal Statistical Society, Series A. 162, 121-129.
87. O'Neill, P.D. (2002). A tutorial introduction to Bayesian inference for stochastic epidemic models using Markov chain Monte Carlo methods. Mathematical Biosciences. 180, 103-114.
88. O'Neill, P.D. and Marks, P.J. (2005). Bayesian model choice and infection route modelling in an outbreak of Norovirus. Statistics in Medicine. 24, 2011-24.
89. Chu, H. and Halloran, M.E. (2004). Estimating vaccine efficacy using auxiliary outcome data and a small validation sample. Stat Med. 23, 2697-711.
90. Tenover, F.C., Arbeit, R.D. and Goering, R.V. (1997). How to select and interpret molecular strain typing methods for epidemiological studies of bacterial infections: a review for healthcare epidemiologists. Molecular Typing Working Group of the Society for Healthcare Epidemiology of America. Infection Control and Hospital Epidemiology. 18, 426-39.
91. Borgdorff, M.W., Nagelkerke, N.J., van Soolingen, D. and Broekmans, J.F. (1999). Transmission of tuberculosis between people of different ages in The Netherlands: an analysis using DNA fingerprinting. Int J Tuberc Lung Dis. 3, 20-26.
92. Borgdorff, M.W., Nagelkerke, N.J., de Haas, P.E. and van Soolingen, D. (2001). Transmission of Mycobacterium tuberculosis depending on the age and sex of source cases. Am J Epidemiol. 154, 934-43.
93. Wylie, J.L., Cabral, T. and Jolly, A.M. (2005). Identification of networks of sexually transmitted infection: a molecular, geographic, and social network analysis. J Infect Dis. 191, 899-906.
94. Ghani, A.C., Ison, C.A., Ward, H., Garnett, G.P., Bell, G., Kinghorn, G.R., Weber, J. and Day, S. (1996). Sexual partner networks in the transmission of sexually transmitted diseases. An analysis of gonorrhea cases in Sheffield, UK. Sexually Transmitted Diseases. 23, 498-503.
95. Pybus, O.G., Rambaut, A. and Harvey, P.H. (2000). An integrated framework for the inference of viral population history from reconstructed genealogies. Genetics. 155, 1429-37.
96. Koopman, J.S., Simon, C.P. and Riolo, C.S. (2005). When to Control Endemic Infections by Focusing on High-Risk Groups. Epidemiology. 16, 621-7.
97. Jacquez, J.A., Koopman, J.S., Simon, C.P. and Longini, I.M., Jr. (1994). Role of the primary infection in epidemics of HIV infection in gay cohorts. Journal of Acquired Immune Deficiency Syndromes. 7, 1169-84.
98. Koopman, J.S., Jacquez, J.A., Welch, G.W., Simon, C.P., Foxman, B., Pollock, S.M., Barth-Jones, D., Adams, A.L. and Lange, K. (1997). The role of early HIV infection in the spread of HIV through populations. Journal of Acquired Immune Deficiency Syndromes and Human Retrovirology. 14, 249-58.
99. Koopman, J.S., Lin, X., Chick, S.E. and Gilsdorf, J. (2004). Transmission Model Analysis of Nontypeable Haemophilus influenzae Immunity. In Handbook of Operations Research/Management Science Applications in Health Care (Sainfort, F., Brandeau, M. and Pierskalla, W., eds.). Kluwer.
100. Morris, M. (2004). Network Epidemiology: A Handbook for Survey Design and Data Collection, Oxford University Press, Oxford.
101. Kretzschmar, M. and Morris, M. (1996). Measures of concurrency in networks and the spread of infectious disease. Mathematical Biosciences. 133, 165-95.
102. Potterat, J.J., Phillips-Plummer, L., Muth, S.Q., Rothenberg, R.B., Woodhouse, D.E., Maldonado-Long, T.S., Zimmerman, H.P. and Muth, J.B. (2002). Risk network structure in the early epidemic phase of HIV transmission in Colorado Springs. Sex Transm Infect. 78, 59-63.
103. Rothenberg, R., Muth, S.Q., Malone, S., Potterat, J.J. and Woodhouse, D.E. (2005). Social and geographic distance in HIV risk. Sex Transm Dis. 32, 506-12.
104. Rindfuss, R.R., Jampaklay, J., Entwisle, B., Sowangdee, Y., Faust, K. and Prasartkul, P. (2004). The Collection and Analysis of Social Network Data in Nang Rong, Thailand. In Network Epidemiology: A Handbook for Survey Design and Data Collection. 175-200. Oxford University Press, Oxford.
105. Bearman, P.S., Moody, J., Stovel, K. and Thalji, L. (2004). Social and Sexual Networks: The National Longitudinal Study of Adolescent Health. In Network Epidemiology: A Handbook for Survey Design and Data Collection (Morris, M., ed.), 201-224. Oxford University Press, Oxford.
106. Terna, P. (1998). Simulation Tools for Social Scientists: Building Agent Based Models with SWARM. Journal of Artificial Societies and Social Simulation 1.
107. Parker, M.T. (2001). What is Ascape and Why Should You Care? Journal of Artificial Societies and Social Simulation. Available from 4.
108. Collier, N. (2003). Repast: An extensible framework for agent simulation. Available from http://repast.sourceforge.net/projects.html.
109. Technologies, X. (2005). AnyLogic 4.0 User Manual. Available from http://www.xjtek.com/products/anylogic/40/.
110. Grenfell, B.T., Bjornstad, O.N. and Kappey, J. (2001). Travelling waves and spatial hierarchies in measles epidemics. Nature. 414, 716-723.
111. Rohani, P., Earn, D.J. and Grenfell, B.T. (1999). Opposite patterns of synchrony in sympatric disease metapopulations. Science. 286, 968-71.
112. Rohani, P., Earn, D.J. and Grenfell, B.T. (2000). Impact of immunisation on pertussis transmission in England and Wales. Lancet. 355, 285-6.
113. Longini, I.M., Jr., Fine, P.E. and Thacker, S.B. (1986). Predicting the global spread of new infectious agents. American Journal of Epidemiology. 123, 383-91.
114. Grais, R.F., Ellis, J.H. and Glass, G.E. (2003). Assessing the impact of airline travel on the geographic spread of pandemic influenza. European Journal of Epidemiology. 18, 1065-1072.
115. Grais, R.F., Ellis, J.H., Kress, A. and Glass, G.E. (2004). Modeling the spread of annual influenza epidemics in the U.S.: The potential role of air travel. Health Care Management Science. 7, 127-134.
116. Hufnagel, L., Brockmann, D. and Geisel, T. (2004). Forecast and control of epidemics in a globalized world. Proc Natl Acad Sci U.S.A. 101, 15124-9.
117. Hyman, J.M. and LaForce, T. (2003). Modeling the spread of influenza among cities. In Bioterrorism: Mathematical Modeling Applications in Homeland Security (Banks, H. T. and Castillo-Chavez, C., eds.). Society for Industrial and Applied Mathematics.
118. Haas, C.N., Rose, J.B. and Gerba, C.P. (1999). Quantitative Microbial Risk Assessment, John Wiley, NY.
INDEX
adaptive immunity, 343, 351
agent-based, 467
amplification, 33, 34, 37, 112, 119, 313
apoptosis, 351, 356
Arabidopsis thaliana, 90
average connectivity, 109, 110, 182, 183, 276, 395
B cell, 76, 342, 345–349, 353–355
Bacillus subtilis, 124, 273
bacteriophage lambda, 103, 104
Barabasi–Albert, 9, 153, 187
basal species, 392, 426, 427, 433, 435, 436, 439–442, 444
basic blocks, 37
Bayesian network, 44, 58, 60, 61, 63, 64, 69, 77, 102, 210, 246
Bayesian score, 71
best-fit, 14, 67, 208, 248
betweenness centrality, 2, 450
biocoenosis, 366, 370
bioenergetic model, 425–427
bioinformatics, 96, 97, 164, 176, 216, 232, 476
biological networks, 2, 4, 8, 9, 16, 22, 23, 28, 44, 47, 52, 101, 114, 145, 148, 152, 213, 273, 274, 276, 280–282, 284
biomass, 14, 176, 182, 189–191, 235, 236, 366, 375, 376, 378, 383, 385, 386, 389, 394, 395, 400, 401, 427, 430
bipartite graph, 168, 169, 209, 466, 468, 479, 485
Boolean modeling, 99, 100
Boolean network, 42, 55, 56, 58, 67, 109, 208, 211, 213, 238, 278
Bose-Einstein condensation, 11, 12
building blocks, 23, 228, 325, 402
Caenorhabditis elegans, 3, 4, 139, 140, 150, 152, 263, 268
cancer, 29, 76, 93, 121
carnivore, 371, 426, 442
cascade, 87, 106, 117, 119, 133, 153, 233, 300, 358, 397, 401, 426, 430–432, 438, 439, 442
catalytic, 88, 98, 101, 168
causal dependency, 44
chemokine, 355
chromatin immunoprecipitation, 93, 206
circuit
  feedback, 87, 110, 116, 117
  integrated, 24
  negative, 110, 115–118
  positive, 110, 116, 117, 119
clustering coefficient, 3, 4, 8, 9, 11, 22, 177, 182, 186, 214, 272, 274, 276, 280, 322
co-evolution, 268
co-immunoprecipitation, 135, 137, 140
co-occurrence, 121, 138, 402
coexpression network, 202, 218, 241, 245, 247, 267
combinatorial, 23, 43, 67, 113, 259, 347
combinatorial explosion, 23, 36, 181
combinatorial transcription logic, 112
community structure, 120, 121, 375, 400, 405, 450
compartmentalization, 22, 101
complementary DNA, 89
complexity, 36, 37, 43, 48, 50, 52, 62, 64, 67–69, 77, 112, 147, 164, 174, 186, 199, 203, 232, 257–259, 266, 273, 283, 291, 293, 294, 311, 315, 322, 325, 327, 328, 350, 353, 368, 375, 380–382, 384, 386, 393, 423–425, 438, 443, 463, 467, 468, 481–483
component network, 200, 201, 203, 205, 207, 209, 210, 215, 217, 222, 224, 226, 233–235, 240, 249, 250
composite network, 200, 203–205, 207, 209, 214, 221–223, 225, 226, 228, 232–234
computation, 44, 72, 181, 184, 294, 312, 314
computational, 36, 46, 50, 97, 99, 105, 112, 134, 138, 139, 152, 155, 199, 200, 203, 205, 206, 209, 213, 239, 244, 245, 248–250, 267, 291, 297, 304, 310, 313, 315, 323, 324, 327, 358, 361, 451, 479, 492
computational modeling, 97, 141, 175, 200, 208, 308, 491
connectance, 379, 382, 383, 388, 392, 393, 397, 426, 430, 433–436, 441
connectivity, 4, 11, 26, 33, 106, 108–110, 115, 144, 154, 165, 184, 186–189, 191–193, 224, 231, 272, 276, 277, 295, 296, 298, 311, 314–317, 319, 321, 322, 327, 396, 427, 430, 436, 437, 440, 471
contact group, 457
contagiousness, 465, 476, 477, 484
convergence, 50, 303, 313
conversion, 164–166, 177, 183, 184, 427, 492
correlation profile, 31–33
cortical networks, 315, 320, 322
cryptosporidia, 449
cytokines, 344–348, 350, 351, 353, 355, 356
cytoskeleton, 133, 356
decoupled evolution, 268
degree distribution, 2, 3, 8–10, 16, 22, 32, 215, 221, 227, 318, 322, 395, 404, 440, 450, 463, 464
determinism, 104
deterministic models, 51, 54, 59, 105, 454, 455, 461
differential equations
  ordinary, 102, 103, 455
  partial, 103, 104, 361
differentiation, 30, 116, 264, 345, 347, 353, 355, 356, 358–360
divergence, 153, 187, 263, 266, 268, 283, 305
diversity, 58, 104, 267, 282, 293, 298–300, 303, 304, 311, 327, 328, 341, 368, 380–382, 384, 388, 423–425, 438, 443, 444, 481, 482
DNA, 27, 28, 30, 45, 51, 58, 62, 63, 83–86, 89, 93–97, 104, 109, 111, 112, 114, 122–124, 133, 136–138, 141, 147, 148, 152, 164, 171, 172, 190, 220, 236, 249, 257, 261, 263–266, 274, 282, 350
Drosophila melanogaster, 3, 139, 140, 150
drug targets, 153, 190
duplication, 10, 33, 35, 37, 153, 154, 259, 268, 270, 273, 275–277, 280, 283
dynamic models, 70, 427
dynamical models, 43, 50, 54, 63, 204, 234, 236, 369, 391, 400, 425
dynamics, 1, 26, 34, 36, 41, 50, 51, 54, 58, 83, 84, 92, 98, 106, 107, 109, 114–117, 120, 122, 134, 154, 155, 190, 191, 200, 201, 204, 208–211, 213, 214, 220, 226, 228, 229, 233–236, 238, 240, 241, 249, 250, 260, 273–275, 277–279, 282, 283, 292, 294, 296, 298, 299, 302, 303, 306, 307, 309, 310, 319, 320, 322, 325, 327, 328, 371, 374, 384–386, 388, 399, 400, 402, 404, 405, 423–427, 429–431, 438–443, 449, 455, 456, 462, 463, 467, 479, 482, 485, 495, 496
ecological networks, 365, 366, 368, 369, 375, 380, 381, 383, 400, 402, 405, 406, 423, 425, 438, 439, 443
ecology, 2, 365, 370, 371, 373, 376, 378, 380, 382, 401, 405, 406, 424, 439, 444
ecosystems, 365, 370, 376, 378, 380, 397, 406, 423–425, 440, 443
edge, 32, 44, 61, 64, 72, 98, 107, 109, 115, 143, 184, 208, 210, 215, 217, 228–230, 276–278, 280, 327, 390, 391
elementary mode, 178, 180, 181, 185, 189, 235
encoding, 47, 51, 54, 62, 64, 83, 88, 122, 123, 146, 240, 243, 247, 260
enzyme, 28, 86, 133, 163, 165, 167, 168, 172–176, 183, 188–190, 204, 209, 218, 235, 240, 241, 247
epidemiological, 458, 475–477, 482, 486, 489, 491
epidemiology, 471, 476, 478, 479, 486
Erdös–Rényi, 9, 22, 182
errors, 89, 176, 190, 435, 436, 441
Escherichia coli, 6, 8, 13–16, 107, 109, 110, 118, 122–124, 139, 140, 183, 184, 186, 188, 189, 191, 192, 211, 226, 227, 235, 239, 241, 247, 248, 259, 263, 271, 282
eukaryote, 53
evolution, 21–23, 33, 34, 37, 104, 134, 138, 139, 153–155, 185, 187, 190, 193, 204, 214, 240, 257–262, 264, 265, 267–270, 272–275, 281–283, 365, 377, 405, 452, 481
evolvability, 23, 34, 37, 186, 263
extinction, 386, 396, 430, 458
extreme pathway, 179, 181, 189, 212
factor graph network, 63, 65, 66, 214
fan, 119, 230
feedforward loop, 65, 117, 118
fluctuations, 34, 67, 104, 323, 324, 387
flux, 13–16, 47, 164, 165, 177–179, 181, 192, 209, 211, 212, 235, 236, 238
flux balance analysis, 179, 180, 189, 191
fluxes, distribution of, 192
food web, 2, 3, 389, 395, 425, 426, 429, 431, 432, 434, 435, 437, 439, 441, 442
functional association, 22, 134, 138, 139, 202
fundamental pathway, 181
gene duplication, 34, 138, 153, 154, 187, 258, 262, 263, 268, 271, 274, 276–278, 283
gene expression, serial analysis of, 93
gene fusion, 138, 202, 226
gene neighborhood, conserved, 202, 226
gene transfer, horizontal, 138, 269
genetic distance, 473, 474, 491
genetic interaction, 86, 98, 107, 134, 137, 142, 144, 202, 203, 207, 220–222, 228, 232, 233, 239, 242–244, 249
genetic recombination, 187
genome sequence, 171, 172, 175, 176, 199
genomics, 2, 114, 154, 215, 216, 226, 251, 257, 260
genotype, 23, 212
geometry, 84, 122, 125
graph, 9, 27, 28, 32, 34, 46, 58, 64, 65, 71, 86, 88, 98, 102, 106, 107, 109, 110, 121, 168, 175, 177, 179, 182–184, 186–189, 192, 208–210, 213, 214, 216, 217, 223, 224, 249, 257, 269, 272, 274, 276, 317, 322, 367, 373, 382, 385, 389–391, 398, 461–464, 471, 486
  directed, 102, 116, 118, 170, 184, 210, 269, 272, 317, 465, 466
  undirected, 210, 367, 466
Helicobacter pylori, 13, 139, 140, 150
herbivore, 400, 402, 435
heterogeneity, 21, 22, 122, 358
heterogeneous, 14, 21, 22, 28, 34, 106, 122, 125, 200, 201, 204, 205, 209, 210, 213–215, 223, 224, 226, 232–234, 238, 240, 243, 248, 249
hierarchical, 4, 5, 7, 8, 10, 16, 22, 26–28, 30, 35, 36, 76, 78, 186, 218, 271, 295, 326, 366, 403, 426, 438
hierarchy, 2, 4, 28, 29, 36, 117, 121, 347, 397, 423
high-throughput, 89, 93, 134, 136, 137, 139, 141, 204–207, 214, 239
homeostasis, 115, 116, 303, 304, 377
Homo sapiens, 139, 140, 150
homogeneity, 103
homogeneous, 14, 104, 107, 263, 315, 318, 321, 366, 459
hub, 29, 31, 32, 77, 145, 152, 272
hypergraph, 99
identifiability, 52, 62
immune network, 344, 347–350, 356, 357, 359
immune system, 341–348, 350, 351, 353, 355–359
infection, 344–347, 355, 356, 360, 449–452, 454–458, 461–463, 465, 467, 469–471, 474–476, 478–481, 483, 485–491, 493–497
infection control, 450–452, 465, 467, 496–498
inference, 42–46, 49, 50, 52–55, 62, 64, 67, 71–74, 76–78, 100, 102, 105, 176, 293, 452, 454, 464, 470, 473, 484, 493
inference robustness, 452, 453, 460–466, 468, 470–473, 483, 485, 488, 489, 492, 496, 497
innate immunity, 342
integration, 10, 77, 201, 204, 207, 212, 222, 226, 237, 246, 265, 291, 293, 295, 296, 312, 313, 324, 328, 349, 356, 357, 424, 471, 472
interolog, 139, 150
invasion, 392
irreversible, 119, 153, 168, 170
isomorphic subgraph, 145
keystone predation, 381, 401
layered structure, 224, 225, 311
learnability, 49, 52
linear network, 56, 57
linear programming, 179–181, 192
link, 2, 9, 13, 16, 26, 27, 34, 58, 76, 97, 106, 114, 122, 144, 154, 165, 176, 202, 231–233, 243, 248, 277, 280, 293, 300, 315, 317, 322, 328, 348, 353, 369, 380, 381, 386, 392–395, 402, 405, 426, 429, 468, 491
logical functions, 113, 209, 213
logical modeling, generalized, 100
machine learning, 42–47, 49, 52, 66, 72–74, 77, 78, 280
macro-network, 458–460, 463, 472, 474, 475, 488, 490, 492–494, 497
macromolecular network, 114, 115
major transitions, 23
mass action, 384, 451–463, 467, 472, 481, 492
mass conservation, 166, 180, 192
mass spectrometry, 134–136, 140, 150, 172
mechanistic movement, 459
metabolic network, 3–6, 13, 16, 78, 144, 154, 164, 166, 169, 170, 174–179, 181, 184, 185, 187–193, 201, 206, 209–212, 218, 233, 235, 236, 240–242, 247, 248, 250, 274, 276
metabolic pathway, 42, 177, 179, 181, 183–185, 202, 204, 218, 226, 242, 247–249, 265, 271
metabolism, 13–16, 163–169, 171, 172, 174–176, 182–188, 190–193, 199, 203, 207, 209, 210, 212, 213, 218, 219, 233–236, 238, 239, 247, 248, 250
metabolite, 13–15, 168, 170, 174, 177, 182, 183, 185–190, 203, 204, 209, 233, 235, 238, 249, 274
  external, 170, 179, 211, 213
  internal, 170, 177, 178
metabolomics, 250
metapopulation, 459, 463, 467, 484
metazoan, 122, 134, 259
micro-network, 457–459, 462, 472, 481, 486, 488, 490, 492, 493, 495, 496
microarray, 15, 42, 50, 51, 89, 91, 92, 94, 95, 97, 134, 141, 172
microenvironment, 346, 350, 354
microorganism, 110, 191
microorganisms, 84, 111, 119, 190
migration, 343, 344, 349, 350, 355, 356, 366, 440, 444
modular architecture, 21, 22, 26, 37
modularity, 4, 10, 23, 25, 26, 31, 33–36, 84, 114, 115
module, 10, 22, 26, 30, 31, 36, 63, 65, 76, 111, 112, 114, 115, 120, 186, 225, 226, 232, 241, 242, 259, 265, 270, 271, 316, 353, 402
molecular complex detection, 121
morphological, 122, 267, 296, 303, 314, 399, 404
morphology, 295, 296, 298, 403
motif, 5, 6, 95, 97, 145, 146, 227–232, 242, 243, 266, 269, 271, 279, 312
multi-scale, 291
multicellular organism, 104, 111, 119, 137, 171, 173, 357
mutation, 137, 174, 187–190, 202, 266–268, 275, 276, 278, 280, 281, 349
mutualism, 367, 377, 379, 401
network alignment, 148–150
network navigation, 450
network resilience, 450
neural network, 4, 57, 68, 291, 308, 309
niche, 378, 389–392, 397, 401, 403, 423, 425, 426, 430–443
node, 2–12, 22, 28, 33, 34, 61, 64, 106, 115, 116, 122, 143, 152, 153, 177, 182, 183, 217, 228–232, 241, 272, 276–280, 345, 346, 355, 360, 377, 436, 437, 440
noise, 22, 54, 55, 58, 59, 64, 67, 69–71, 77, 95, 104, 125, 147, 269, 320, 323, 324
nonlinear persistence, 438
normalization, 65, 92
oligonucleotide, 90, 91
omnivores, 426, 430, 435–437, 442
open reading frame, 84, 171
optimization, 13, 44, 45, 47, 66, 212, 219, 236
orthologous, 138, 150, 226, 242
orthology, 241, 242
oscillation, 118, 308, 352, 386
pattern formation, 152, 281, 361
peptide, 135, 136, 346
permutation test, 216, 217
petri net, 101, 168–170
phenotype, 23, 148, 152, 217, 244, 248, 267, 282
phylogenetic, 34, 138, 139, 171, 202, 226, 241, 245, 263, 265, 398, 475
phylogenetic co-occurrence, 202
phylogenetic profile, 138, 226, 241, 245
phylogeny, 392, 400
physical interaction, 137, 202, 220, 221, 233, 242, 243, 245
pleiotropy, 23
polymerase chain reaction, 92, 298
post-translational modification, 135, 172, 347
power law, 2, 11, 14–16, 22, 106, 108, 109, 182, 184, 186, 188, 193, 320, 327, 393, 395, 404, 464
predation, 367, 369, 381, 400–402
predator, 381, 384, 385, 391, 392, 401, 402, 423, 427–432, 434, 439, 443
prediction, 34, 43, 50, 63, 74, 75, 96, 118, 139, 204, 205, 220, 226, 236, 239, 240, 244–248, 489
predictive, 75, 103, 125, 200, 201, 203, 205, 207, 211, 234, 240–242, 244, 245, 247, 250, 280, 393
prey, 136, 137, 367–369, 371–373, 376, 381, 384–386, 389–392, 397, 401, 402, 406, 424–428, 430, 437, 440, 442, 443
probabilistic model, 58, 59, 75, 206, 324
probabilistic relational model, 63, 64
prokaryote, 53
proliferation, 30, 342, 345, 356, 357, 464
promoter, 146, 258, 261, 265–267, 271, 272, 350, 351
propagation, 32, 57, 69, 305, 320, 405
protein complex, 121, 137, 147, 149, 150, 206, 220, 231, 236, 248
protein interaction network, 3, 6, 15, 34, 78, 121, 134, 141, 144, 145, 147–150, 152, 153, 155, 202, 206, 220, 223, 224, 233, 236, 247, 269, 272, 276, 280
proteomics, 141, 172, 257, 260
random network, 4, 7, 9, 10, 110, 152, 185, 186, 214, 227, 277, 279, 317, 321, 322, 425, 438, 457
reductionist, 1, 291, 295
regulatory complex, 28
regulatory region, 45, 84–86, 95–97, 109, 111, 113, 260, 266–268, 271, 276, 278
relative persistence, 430, 431, 433
repertoire, 190, 298, 305, 306, 308
representation, 2, 7, 30, 43, 44, 47, 49, 54, 97, 115, 138, 142, 144, 156, 167–169, 207–209, 232, 238, 239, 309, 367, 372, 390, 395, 398, 401
restriction fragment length polymorphism, 473
reverse engineering, 75, 205, 206
reverse transcription, 92, 298
reversible, 13, 153, 168, 170, 174, 183, 303
risk factor, 451, 468, 475–477, 479, 491, 497
RNA, 21, 41, 42, 62, 83, 85, 86, 88, 92, 113, 122, 133, 164, 172, 225, 259, 348
robustness, 21–23, 32, 37, 103, 152, 153, 174, 188–190, 211, 259, 278, 279, 395, 404, 438, 452–454, 456, 457, 459, 460, 462, 464–467, 469–472, 480, 483–485, 490, 493, 496, 497
Saccharomyces cerevisiae, 3, 6, 8, 14, 93, 95, 107, 109, 110, 121, 123, 139, 140, 144, 150, 152, 231, 239, 248, 268, 282
sample complexity, 52
sampling, 53, 54, 74, 394, 406, 487
scaffold graph, 31, 32
scale-free network, 7, 9, 10, 16, 76, 144, 152, 274, 464
self-similar, 186, 193
self-similarity, 186
sequence database, 135, 136, 147, 150
signal transduction, 114, 133, 149, 153, 233, 276
signaling, 106, 121, 133, 137, 141, 147, 156, 199, 202–204, 209, 233, 238, 248, 249
  intercellular, 351
  intracellular, 347–349, 351, 356
simulation, 43, 50, 63, 115, 152, 233, 239, 273, 274, 296, 311, 430, 451, 491, 492, 495
small world, 22, 23, 28, 110, 144, 182, 184, 450, 464
social networks, 144
solenoidal, 123, 124
spatial, 94, 103, 104, 122, 123, 125, 225, 291, 295, 296, 312, 317, 320, 327, 359–361, 386, 406, 444, 459, 472, 479, 494–496
stability, 116, 269, 326, 370, 374, 377, 380–384, 386, 424, 425, 438, 471
state-space model, 59, 61, 71
static models, 43, 63, 397
stationarity, 54, 116
stationary, 51, 52, 67, 116, 122
statistical learning, 67–69
statistical validation, 45, 73, 74
steady-state dynamics, 235
stochastic, 34, 48, 51, 52, 58, 59, 62, 63, 67, 104, 125, 204, 209, 276, 277, 281, 309, 320, 324, 397, 426, 430, 431, 454, 455, 461, 468, 492
stochastic equations, 104
stochastic models, 58, 426, 431, 454, 455, 458, 463
stoichiometric, 13, 98, 101, 177, 181, 189, 210, 235, 271
stoichiometry matrix, 169, 170, 177, 179, 181, 186
Streptomyces coelicolor, 175
sub-graph, 3, 5, 6, 32, 231, 272, 274, 279, 280
subnetwork, 247, 258
substrate, 13, 86, 133, 165, 167–170, 173, 177, 183, 184, 186, 209, 239, 293, 304
synapse, 300, 302–304, 324
synaptic plasticity, 303, 304, 306
synthetic, 91, 137, 140, 184, 191, 202, 207, 220–222, 239, 242, 243, 245, 291, 293
systems biology, 41, 78, 126, 343
T cell, 342, 345–348, 351, 353, 355, 357, 358, 360
taxonomy, 116, 298
technological graphs, 37
temporal aggregation, 53
time series, 44, 50–53, 63, 67, 236, 239
tinkering, 23, 33, 37, 273
topological overlap, 25, 26, 28–30, 186
topology
  global, 106, 107, 111, 181
  local, 107, 109–111
training, 43, 47, 50–53, 66–68, 70, 73, 75, 244
training sample, 47, 50, 52, 68
transcription, 4, 5, 8, 45, 62, 83, 85, 88, 93, 96, 97, 113, 119, 122, 203, 204, 212, 238, 239, 248, 259, 350, 351, 356
transcription factor, 28–31, 33, 45, 56, 62, 68, 72, 76, 84–86, 94–97, 100, 102, 104, 108, 109, 111–113, 121–123, 133, 136, 137, 147, 201, 206, 210, 224, 228, 233, 257, 259–267, 269–273, 275, 279, 282, 283, 347, 348, 350, 351, 359
transcriptomics, 77, 257, 260
transitivity, 450, 471
translation, 45, 62, 119, 173, 261
transmissibility, 465, 476, 477, 479
transmission, 295, 300, 302, 304, 325, 449–452, 454–457, 462–491, 493–497
transportation, 16, 467, 493, 495
tricarboxylic acid cycle, 154, 164
trophic interaction, 367–369, 373, 381, 383, 385, 386, 397, 403, 405, 444
trophic level, 366, 371, 373, 375, 380, 383, 392, 393, 401, 425, 426, 430, 435–438, 440–443
two-hybrid, 134–137, 140, 141, 144, 150, 206, 219
validation, 44, 45, 48, 72, 73, 75, 76, 78, 243–245, 453
vertex, 26, 102, 116–118, 120, 121, 214, 215, 280
visualization, 140–143, 147
wiring diagram, 125, 155, 156