HANDBOOK OF STATISTICAL SYSTEMS BIOLOGY

Systems Biology is now entering a mature phase in which the key issues are characterising uncertainty and stochastic effects in mathematical models of biological systems. The area is moving towards a full statistical analysis and probabilistic reasoning over the inferences that can be made from mathematical models. This handbook presents a comprehensive guide to the discipline for practitioners and educators, providing a full and detailed treatment of these important and emerging subjects. Leading experts in systems biology and statistics have come together to provide insight into the major ideas in the field, and in particular methods of specifying and fitting models, and estimating the unknown parameters.

This book:
• Provides a comprehensive account of inference techniques in systems biology.
• Introduces classical and Bayesian statistical methods for complex systems.
• Explores networks and graphical modelling as well as a wide range of statistical models for dynamical systems.
• Discusses various applications for statistical systems biology, such as gene regulation and signal transduction.
• Features statistical data analysis on numerous technologies, including metabolic and transcriptomic technologies.
• Presents an in-depth presentation of reverse engineering approaches.
• Includes colour illustrations to explain key concepts.

This handbook will be a key resource for researchers practising systems biology, and those requiring a comprehensive overview of this important field.

Editors: Michael P. H. Stumpf, Imperial College London, UK; David J. Balding, Institute of Genetics, University College London, UK; Mark Girolami, Department of Statistical Science, University College London, UK
Handbook of Statistical Systems Biology
Edited by
MICHAEL P. H. STUMPF Imperial College London, UK
DAVID J. BALDING Institute of Genetics, University College London, UK
MARK GIROLAMI Department of Statistical Science, University College London, UK
This edition first published 2011
© 2011 John Wiley & Sons, Ltd

Registered office: John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data
Stumpf, M. P. H. (Michael P. H.)
Handbook of statistical systems biology / Michael P.H. Stumpf, David J. Balding, Mark Girolami.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-71086-9 (cloth)
1. Systems biology–Statistical methods–Handbooks, manuals, etc. 2. Biological systems–Mathematical models–Handbooks, manuals, etc. 3. Uncertainty–Mathematical models–Handbooks, manuals, etc. 4. Stochastic analysis–Mathematical models–Handbooks, manuals, etc. I. Balding, D. J. II. Girolami, Mark, 1963- III. Title.
QH324.2.S79 2011
570.1'5195–dc23
2011018218

A catalogue record for this book is available from the British Library.

Print ISBN: 978-0-470-71086-9
ePDF ISBN: 978-1-119-97061-3
obook ISBN: 978-1-119-97060-6
ePub ISBN: 978-1-119-95204-6
Mobi ISBN: 978-1-119-95205-3

Set in 10/12pt Times by Thomson Digital, Noida, India
Contents

Preface
Contributors

A  METHODOLOGICAL CHAPTERS

1  Two Challenges of Systems Biology
   William S. Hlavacek
   1.1  Introduction
   1.2  Cell signaling systems
   1.3  The challenge of many moving parts
   1.4  The challenge of parts with parts
   1.5  Closing remarks
   References

2  Introduction to Statistical Methods for Complex Systems
   Tristan Mary-Huard and Stéphane Robin
   2.1  Introduction
   2.2  Class comparison
        2.2.1  Models for dependent data
        2.2.2  Multiple testing
   2.3  Class prediction
        2.3.1  Building a classifier
        2.3.2  Aggregation
        2.3.3  Regularization
        2.3.4  Performance assessment
   2.4  Class discovery
        2.4.1  Geometric methods
        2.4.2  (Discrete) latent variable models
        2.4.3  Inference
   References

3  Bayesian Inference and Computation
   Christian P. Robert, Jean-Michel Marin and Judith Rousseau
   3.1  Introduction
   3.2  The Bayesian argument
        3.2.1  Bases
        3.2.2  Bayesian analysis in action
        3.2.3  Prior distributions
        3.2.4  Confidence intervals
   3.3  Testing hypotheses
        3.3.1  Decisions
        3.3.2  The Bayes factor
        3.3.3  Point null hypotheses
        3.3.4  The ban on improper priors
        3.3.5  The case of nuisance parameters
        3.3.6  Bayesian multiple testing
   3.4  Extensions
        3.4.1  Prediction
        3.4.2  Outliers
        3.4.3  Model choice
   3.5  Computational issues
        3.5.1  Computational challenges
        3.5.2  Monte Carlo methods
        3.5.3  MCMC methods
        3.5.4  Approximate Bayesian computation techniques
   Acknowledgements
   References

4  Data Integration: Towards Understanding Biological Complexity
   David Gomez-Cabrero and Jesper Tegner
   4.1  Storing knowledge: Experimental data, knowledge databases, ontologies and annotation
        4.1.1  Data repositories
        4.1.2  Knowledge Databases
        4.1.3  Ontologies
        4.1.4  Annotation
   4.2  Data integration in biological studies
        4.2.1  Integration of experimental data
        4.2.2  Ontologies and experimental data
        4.2.3  Networks and visualization software as integrative tools
   4.3  Concluding remarks
   References

5  Control Engineering Approaches to Reverse Engineering Biomolecular Networks
   Francesco Montefusco, Carlo Cosentino and Declan G. Bates
   5.1  Dynamical models for network inference
        5.1.1  Linear models
        5.1.2  Nonlinear models
   5.2  Reconstruction methods based on linear models
        5.2.1  Least squares
        5.2.2  Methods based on least squares
        5.2.3  Dealing with noise: CTLS
        5.2.4  Convex optimization methods
        5.2.5  Sparsity pattern of the discrete-time model
        5.2.6  Application examples
   5.3  Reconstruction methods based on nonlinear models
        5.3.1  Approaches based on polynomial and rational models
        5.3.2  Approaches based on S-systems
        5.3.3  A case-study
   References

6  Algebraic Statistics and Methods in Systems Biology
   Carsten Wiuf
   6.1  Introduction
   6.2  Overview of chapter
   6.3  Computational algebra
   6.4  Algebraic statistical models
        6.4.1  Definitions
        6.4.2  Further examples
   6.5  Parameter inference
   6.6  Model invariants
   6.7  Log-linear models
   6.8  Reverse engineering of networks
   6.9  Concluding remarks
   References

B  TECHNOLOGY-BASED CHAPTERS

7  Transcriptomic Technologies and Statistical Data Analysis
   Elizabeth Purdom and Sach Mukherjee
   7.1  Biological background
   7.2  Technologies for genome-wide profiling of transcription
        7.2.1  Microarray technology
        7.2.2  mRNA expression estimates from microarrays
        7.2.3  High throughput sequencing (HTS)
        7.2.4  mRNA expression estimates from HTS
   7.3  Evaluating the significance of individual genes
        7.3.1  Common approaches for significance testing
        7.3.2  Moderated statistics
        7.3.3  Statistics for HTS
        7.3.4  Multiple testing corrections
        7.3.5  Filtering genes
   7.4  Grouping genes to find biological patterns
        7.4.1  Gene-set analysis
        7.4.2  Dimensionality reduction
        7.4.3  Clustering
   7.5  Prediction of a biological response
        7.5.1  Variable selection
        7.5.2  Estimating the performance of a model
   References

8  Statistical Data Analysis in Metabolomics
   Timothy M. D. Ebbels and Maria De Iorio
   8.1  Introduction
   8.2  Analytical technologies and data characteristics
        8.2.1  Analytical technologies
        8.2.2  Preprocessing
   8.3  Statistical analysis
        8.3.1  Unsupervised methods
        8.3.2  Supervised methods
        8.3.3  Metabolome-wide association studies
        8.3.4  Metabolic correlation networks
        8.3.5  Simulation of metabolic profile data
   8.4  Conclusions
   Acknowledgements
   References

9  Imaging and Single-Cell Measurement Technologies
   Yu-ichi Ozaki and Shinya Kuroda
   9.1  Introduction
        9.1.1  Intracellular signal transduction
        9.1.2  Lysate-based assay and single-cell assay
        9.1.3  Live cell and fixed cell
   9.2  Measurement techniques
        9.2.1  Western blot analysis
        9.2.2  Immunocytochemistry
        9.2.3  Flow cytometry
        9.2.4  Fluorescent microscope
        9.2.5  Live cell imaging
        9.2.6  Fluorescent probes for live cell imaging
        9.2.7  Image cytometry
        9.2.8  Image processing
   9.3  Analysis of single-cell measurement data
        9.3.1  Time series (mean, variation, correlation, localization)
        9.3.2  Bayesian network modeling with single-cell data
        9.3.3  Quantifying sources of cell-to-cell variation
   9.4  Summary
   Acknowledgements
   References

10  Protein Interaction Networks and Their Statistical Analysis
    Waqar Ali, Charlotte Deane and Gesine Reinert
    10.1  Introduction
    10.2  Proteins and their interactions
          10.2.1  Protein structure and function
          10.2.2  Protein–protein interactions
          10.2.3  Experimental techniques for interaction detection
          10.2.4  Computationally predicted data-sets
          10.2.5  Protein interaction databases
          10.2.6  Error in PPI data
          10.2.7  The interactome concept and protein interaction networks
    10.3  Network analysis
          10.3.1  Graphs
          10.3.2  Network summary statistics
          10.3.3  Network motifs
          10.3.4  Models of random networks
          10.3.5  Parameter estimation for network models
          10.3.6  Approximate Bayesian Computation
          10.3.7  Threshold behaviour in graphs
    10.4  Comparison of protein interaction networks
          10.4.1  Network comparison based on subgraph counts
          10.4.2  Network alignment
          10.4.3  Using functional annotation for network alignment
    10.5  Evolution and the protein interaction network
          10.5.1  How evolutionary models affect network alignment
    10.6  Community detection in PPI networks
          10.6.1  Community detection methods
          10.6.2  Evaluation of results
    10.7  Predicting function using PPI networks
    10.8  Predicting interactions using PPI networks
          10.8.1  Tendency to form triangles
          10.8.2  Using triangles for predicting interactions
    10.9  Current trends and future directions
          10.9.1  Dynamics
          10.9.2  Integration with other networks
          10.9.3  Limitations of models, prediction and alignment methods
          10.9.4  Biases, error and weighting
          10.9.5  New experimental sources of PPI data
    References

C  NETWORKS AND GRAPHICAL MODELS

11  Introduction to Graphical Modelling
    Marco Scutari and Korbinian Strimmer
    11.1  Graphical structures and random variables
    11.2  Learning graphical models
          11.2.1  Structure learning
          11.2.2  Parameter learning
    11.3  Inference on graphical models
    11.4  Application of graphical models in systems biology
          11.4.1  Correlation networks
          11.4.2  Covariance selection networks
          11.4.3  Bayesian networks
          11.4.4  Dynamic Bayesian networks
          11.4.5  Other graphical models
    References

12  Recovering Genetic Network from Continuous Data with Dynamic Bayesian Networks
    Gaëlle Lelandais and Sophie Lèbre
    12.1  Introduction
          12.1.1  Regulatory networks in biology
          12.1.2  Objectives and challenges
    12.2  Reverse engineering time-homogeneous DBNs
          12.2.1  Genetic network modelling with DBNs
          12.2.2  DBN for linear interactions and inference procedures
    12.3  Go forward: how to recover the structure changes with time
          12.3.1  ARTIVA network model
          12.3.2  ARTIVA inference procedure and performance evaluation
    12.4  Discussion and Conclusion
    References

13  Advanced Applications of Bayesian Networks in Systems Biology
    Dirk Husmeier, Adriano V. Werhli and Marco Grzegorczyk
    13.1  Introduction
          13.1.1  Basic concepts
          13.1.2  Dynamic Bayesian networks
          13.1.3  Bayesian learning of Bayesian networks
    13.2  Inclusion of biological prior knowledge
          13.2.1  The 'energy' of a network
          13.2.2  Prior distribution over network structures
          13.2.3  MCMC sampling scheme
          13.2.4  Practical implementation
          13.2.5  Empirical evaluation on the Raf signalling pathway
    13.3  Heterogeneous DBNs
          13.3.1  Motivation: Inferring spurious feedback loops with DBNs
          13.3.2  A nonlinear/nonhomogeneous DBN
          13.3.3  MCMC inference
          13.3.4  Simulation results
          13.3.5  Results on Arabidopsis gene expression time series
    13.4  Discussion
    Acknowledgements
    References

14  Random Graph Models and Their Application to Protein–Protein Interaction Networks
    Desmond J. Higham and Nataša Pržulj
    14.1  Background and motivation
    14.2  What do we want from a PPI network?
    14.3  PPI network models
          14.3.1  Lock and key
          14.3.2  Geometric networks
    14.4  Range-dependent graphs
    14.5  Summary
    References

15  Modelling Biological Networks via Tailored Random Graphs
    Anthony C. C. Coolen, Franca Fraternali, Alessia Annibale, Luis Fernandes and Jens Kleinjung
    15.1  Introduction
    15.2  Quantitative characterization of network topologies
          15.2.1  Local network features and their statistics
          15.2.2  Examples
    15.3  Network families and random graphs
          15.3.1  Network families, hypothesis testing and null models
          15.3.2  Tailored random graph ensembles
    15.4  Information-theoretic deliverables of tailored random graphs
          15.4.1  Network complexity
          15.4.2  Information-theoretic dissimilarity
    15.5  Applications to PPINs
          15.5.1  PPIN assortativity and wiring complexity
          15.5.2  Mapping PPIN data biases
    15.6  Numerical generation of tailored random graphs
          15.6.1  Generating random graphs via Markov chains
          15.6.2  Degree-constrained graph dynamics based on edge swaps
          15.6.3  Numerical examples
    15.7  Discussion
    References

D  DYNAMICAL SYSTEMS

16  Nonlinear Dynamics: A Brief Introduction
    Alessandro Moura and Celso Grebogi
    16.1  Introduction
    16.2  Sensitivity to initial conditions and the Lyapunov exponent
    16.3  The natural measure
    16.4  The Kolmogorov–Sinai entropy
    16.5  Symbolic dynamics
    16.6  Chaos in biology
    References

17  Qualitative Inference in Dynamical Systems
    Fatihcan M. Atay and Jürgen Jost
    17.1  Introduction
    17.2  Basic solution types
    17.3  Qualitative behaviour
    17.4  Stability and bifurcations
    17.5  Ergodicity
    17.6  Timescales
    17.7  Time series analysis
    References

18  Stochastic Dynamical Systems
    Darren J. Wilkinson
    18.1  Introduction
    18.2  Origins of stochasticity
          18.2.1  Low copy number
          18.2.2  Other sources of noise and heterogeneity
    18.3  Stochastic chemical kinetics
          18.3.1  Reaction networks
          18.3.2  Markov jump process
          18.3.3  Diffusion approximation
          18.3.4  Reaction rate equations
          18.3.5  Modelling extrinsic noise
    18.4  Inference for Markov process models
          18.4.1  Likelihood-based inference
          18.4.2  Partial observation and data augmentation
          18.4.3  Data augmentation MCMC approaches
          18.4.4  Likelihood-free approaches
          18.4.5  Approximate Bayesian computation
          18.4.6  Particle MCMC
          18.4.7  Iterative filtering
          18.4.8  Stochastic model emulation
          18.4.9  Inference for stochastic differential equation models
    18.5  Conclusions
    Acknowledgements
    References

19  Gaussian Process Inference for Differential Equation Models of Transcriptional Regulation
    Neil Lawrence, Magnus Rattray, Antti Honkela and Michalis Titsias
    19.1  Introduction
          19.1.1  A simple systems biology model
    19.2  Generalized linear model
          19.2.1  Fitting basis function models
          19.2.2  An infinite basis
          19.2.3  Gaussian processes
          19.2.4  Sampling approximations
    19.3  Model based target ranking
    19.4  Multiple transcription factors
    19.5  Conclusion
    References

20  Model Identification by Utilizing Likelihood-Based Methods
    Andreas Raue and Jens Timmer
    20.1  ODE models for reaction networks
          20.1.1  Rate equations
    20.2  Parameter estimation
          20.2.1  Sensitivity equations
          20.2.2  Testing hypothesis
          20.2.3  Confidence intervals
    20.3  Identifiability
          20.3.1  Structural nonidentifiability
          20.3.2  Practical nonidentifiability
          20.3.3  Connection of identifiability and observability
    20.4  The profile likelihood approach
          20.4.1  Experimental design
          20.4.2  Model reduction
          20.4.3  Observability and confidence intervals of trajectories
          20.4.4  Application
    20.5  Summary
    Acknowledgements
    References

E  APPLICATION AREAS

21  Inference of Signalling Pathway Models
    Tina Toni, Juliane Liepe and Michael P. H. Stumpf
    21.1  Introduction
    21.2  Overview of inference techniques
    21.3  Parameter inference and model selection for dynamical systems
          21.3.1  Model selection
    21.4  Approximate Bayesian computation
    21.5  Application: Akt signalling pathway
          21.5.1  Exploring different distance functions
          21.5.2  Posteriors
          21.5.3  Parameter sensitivity through marginal posterior distributions
          21.5.4  Sensitivity analysis by principal component analysis (PCA)
    21.6  Conclusion
    References

22  Modelling Transcription Factor Activity
    Martino Barenco, Daniel Brewer, Robin Callard and Michael Hubank
    22.1  Integrating an ODE with a differential operator
    22.2  Computation of the entries of the differential operator
          22.2.1  Taking into account the nature of the biological system being modelled
          22.2.2  Bounds choice for polynomial interpolation
    22.3  Applications
    22.4  Estimating intermediate points
    Acknowledgements
    References

23  Host–Pathogen Systems Biology
    John W. Pinney
    23.1  Introduction
    23.2  Pathogen genomics
    23.3  Metabolic models
    23.4  Protein–protein interactions
    23.5  Response to environment
    23.6  Immune system interactions
    23.7  Manipulation of other host systems
    23.8  Evolution of the host–pathogen system
    23.9  Towards systems medicine for infectious diseases
    23.10  Concluding remarks
    Acknowledgements
    References

24  Bayesian Approaches for Mass Spectrometry-Based Metabolomics
    Simon Rogers, Richard A. Scheltema, Michael Barrett and Rainer Breitling
    24.1  Introduction
    24.2  The challenge of metabolite identification
    24.3  Bayesian analysis of metabolite mass spectra
    24.4  Incorporating additional information
    24.5  Probabilistic peak detection
    24.6  Statistical inference
    24.7  Software development for metabolomics
    24.8  Conclusion
    References

25  Systems Biology of microRNAs
    Doron Betel and Raya Khanin
    25.1  Introduction
    25.2  Current approaches in microRNA Systems Biology
    25.3  Experimental findings and data that guide the developments of computational tools
    25.4  Approaches to microRNA target predictions
    25.5  Analysis of mRNA and microRNA expression data
          25.5.1  Identifying microRNA activity from mRNA expression
          25.5.2  Modeling combinatorial microRNA regulation from joint microRNA and mRNA expression data
    25.6  Network approach for studying microRNA-mediated regulation
    25.7  Kinetic modeling of microRNA regulation
          25.7.1  A basic model of microRNA-mediated regulation
          25.7.2  Estimating fold-changes of mRNA and proteins in microRNA transfection experiments
          25.7.3  The influence of protein and mRNA stability on microRNA function
          25.7.4  microRNA efficacy depends on target abundance
          25.7.5  Reconstructing microRNA kinetics
    25.8  Discussion
    References

Index
Preface
Systems biology is data-rich. Technological advances over the past 50 years have allowed us to probe, map and interfere with biological organisms in a number of ways. Most notable are perhaps the sequencing efforts that are continuing to catalogue the genetic diversity of life, including our own species. A whole host of other techniques from biochemistry, molecular, cell and structural biology have been used to study the function of the protein products and other biomolecules that are encoded by these sequences. So we live in a time with ready access to sophisticated techniques that allow us to study how biological systems – ranging from single molecules to whole organisms – function and work.

But systems biology is also hypothesis-rich. By this we mean that there are an overwhelmingly large number of potential mechanisms that could explain many, if not all, biological systems. And each of these models has associated unknown parameters, most of which cannot be measured directly using experimental approaches.

Systems biology is also one of the most fertile fields for modern statistics. The richness in both data and hypotheses poses serious challenges to classical statistical theory. Fisher, Neyman and Pearson and their successors typically dealt with problems where there are only a small number of hypotheses that are evaluated in light of adequate data derived from well designed experiments. The situation in systems biology could hardly be more different: in humans we have some 24 000 genes (and probably several hundred thousand protein products), but only a small number of measurements for each of these genes. A priori each of these genes (or worse, each combination of these genes) could be involved in any biological process or phenotype of interest; the number of hypotheses is vastly larger than the amount of available data. But the resulting so-called ‘large p small n problem’ and the multiple testing problem are only a small part of the problem. The lack of suitable models weighs much more heavily.

Mechanistic models, framed in suitable mathematical language, allow us to summarize our knowledge about biological systems and make testable predictions, which probe our understanding. Iteration between modelling and experimental analysis will thus be required, and in the long run is believed to yield better understanding of biological systems in both health and disease. But where do these models come from? In a recent polemic, Sydney Brenner has put down the challenge, stating essentially that solving the so-called inverse problem in systems biology is doomed to fail. Given the central role that learning or inferring the structure and dynamics of biological systems has for systems biology, this would amount to the long-term failure of the whole enterprise (also of synthetic biology, which cannot do without the mechanistic insights and models provided by systems biology). Rather than inferring models from data, Brenner proposes to use maps, i.e. mathematical models that connect the different molecular entities inside cells, tissues, organs or whole organisms, to put the wealth of information collected by traditional reductionist molecular and cell biology research into context. Where these maps are coming from or how they are constructed is not clear, however. In the chapters in this handbook we hope to provide a more optimistic but also nuanced perspective on the inverse problem in systems biology.
The different chapters in this handbook provide accessible accounts of basic statistical methodologies and their application in a systems biology setting. There is ample need for such a unified account. First of all, the field is progressing rapidly and technological advances have allowed researchers to gather data at a phenomenal rate. Not all data are good data, however, from a statistical or reverse-engineering perspective. The type of data collected and the manner in which they are collected can make or break any
statistical analysis. Thus some familiarity with statistical methodologies, but also with their potential pitfalls, will be essential for the design of better experiments and technologies.

Secondly, once a model is given, mathematical analysis and exploration is relatively straightforward. But specifying the model and inferring its parameters are fraught with statistical challenges. The curse of dimensionality is encountered almost everywhere; data are noisy, incomplete and exhibit high levels of dependence and collinearity. Learning anything from such data is challenging. But the fact that biological systems change constantly with time and in response to environmental, physiological and developmental cues means that the window we have for observing a well specified system may be very small indeed.

Thirdly, frequently we are dealing with mathematical models that are much more complex and challenging than those typically considered in statistics. Many nice properties of classical probability models are absent from the contingent, complex and complicated models considered in e.g. the context of metabolic networks or signal transduction networks. Moreover, a host of recent results has led us to re-evaluate our perspective on inference of parameters for dynamical systems. Being able to infer parameters is intimately related to properties of the dynamical system that include stability (of equilibrium solutions) and identifiability. These in turn, however, change with the parameters. In other words, the same dynamical systems may in effect be identifiable in some regions of parameter space but not others. To make any progress in this arena requires us to be aware of both statistics and dynamical systems theory.

This brings us to the fourth point: systems biology is a highly interdisciplinary research area. None of the present practitioners has received any formal training in systems biology; instead they come from a diverse set of backgrounds ranging from mathematics, computer science, physics and the engineering sciences to biology and medicine. The different modelling and experimental approaches must be melded together in order to make progress. Since data take a central part in this dialogue, statistics must also play an essential role at this interface between traditional disciplines.

This handbook aims to introduce researchers, practitioners and students to the statistical approaches that are making an impact on cutting edge systems biology research. It is born out of the editors’ belief that the inferential perspective is essential to the whole enterprise of systems biology, and that there is a lack of suitable resources for researchers in the field. We are therefore grateful to Wiley for providing us with the opportunity to develop this handbook. This would, of course, not have been possible without the cooperation of the many contributing authors. Producing comprehensive reviews and overviews over methodologies in such a rapidly moving field is challenging and may often be considered as a distraction from the work we would really like to be getting on with. We are therefore delighted and hugely grateful for the warm response that we have had from our contributors. Each chapter provides insights into some of the areas that we believe are essential for tackling inference problems in systems biology (and biomedical research more generally).
Overlap between different chapters is unavoidable, but rather than having resulted in redundancy or repetitiveness, these areas of overlap really serve to highlight the different perspectives and validity of alternative approaches.

We are also hugely grateful to Kathryn Sharples, who helped to get this project off the ground and provided invaluable assistance in the early stages. We thank Richard Davies for his unwavering support, diligence and help in bringing the work on the handbook to fruition. He, together with Heather Kay and Prachi SinhaSahay, also delivered the most challenging aspect of the book by editing and proofing chapters and making a consistent whole out of the many individual contributions. Wiley and especially Kathryn, Richard, Heather and Prachi have been wonderful to work with and have accommodated the editors’ wishes and concerns with great patience and grace.

M. Stumpf, D. Balding and M. Girolami
Contributors

Waqar Ali, Department of Statistics, University of Oxford, UK.
Alessia Annibale, Department of Mathematics, King’s College London, The Strand, London, UK.
Fatihcan M. Atay, Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany.
Martino Barenco, Institute of Child Health, University College London, UK.
Michael Barrett, Institute of Infection, Immunity and Inflammation, College of Medical, Veterinary and Life Sciences, University of Glasgow, UK.
Declan G. Bates, College of Engineering, Mathematics and Physical Sciences, University of Exeter, UK.
Doron Betel, Institute of Computational Biomedicine, Weill Cornell Medical College, New York, USA.
Rainer Breitling, Institute of Molecular, Cell and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, UK.
Daniel Brewer, Institute of Cancer Research, Sutton, UK.
David Gomez-Cabrero, Unit of Computational Medicine, Department of Medicine, Karolinska Institutet, Stockholm, Sweden.
Robin Callard, Institute of Child Health, University College London, UK.
Anthony C. C. Coolen, Department of Mathematics, King’s College London, The Strand, London, and Randall Division of Cell and Molecular Biophysics, King’s College London, New Hunt’s House, London, UK.
Carlo Cosentino, School of Computer and Biomedical Engineering, Università degli Studi Magna Græcia di Catanzaro, Catanzaro, Italy.
Charlotte Deane, Department of Statistics, University of Oxford, UK.
Timothy M. D. Ebbels, Biomolecular Medicine, Department of Surgery and Cancer, Imperial College, London, UK.
Luis Fernandes, Randall Division of Cell and Molecular Biophysics, King’s College London, New Hunt’s House, London, UK.
Franca Fraternali, Randall Division of Cell and Molecular Biophysics, King’s College London, New Hunt’s House, London, UK.
Celso Grebogi, Institute for Complex Systems and Mathematical Biology, School of Natural and Computing Sciences, King’s College, University of Aberdeen, UK.
Marco Grzegorczyk, Department of Statistics, TU Dortmund University, Dortmund, Germany.
Desmond J. Higham, Department of Mathematics and Statistics, University of Strathclyde, Glasgow, UK.
William S. Hlavacek, Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA.
Antti Honkela, Helsinki Institute for Information Technology, University of Helsinki, Finland.
Tristan Mary-Huard, AgroParisTech and INRA, Paris, France.
Michael Hubank, Institute of Child Health, University College London, UK.
Dirk Husmeier, Biomathematics and Statistics Scotland (BioSS) JCMB, Edinburgh, UK.
Maria De Iorio, Department of Epidemiology and Biostatistics, Imperial College, St Mary’s Campus, Norfolk Place, London, UK.
Jürgen Jost, Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany, and Santa Fe Institute for the Sciences of Complexity, USA.
Raya Khanin, Bioinformatics Core, Memorial Sloan-Kettering Cancer Center, New York, USA.
Jens Kleinjung, Division of Mathematical Biology, MRC National Institute for Medical Research, London, UK.
Shinya Kuroda, Department of Biophysics and Biochemistry, Graduate School of Science, University of Tokyo, and CREST, Japan Science and Technology Agency, University of Tokyo, Japan.
Neil Lawrence, Department of Computer Science and Sheffield Institute for Translational Neuroscience, University of Sheffield, UK.
Gaëlle Lelandais, DSIMB, INSERM, University of Paris Diderot and INTS, Paris, France.
Juliane Liepe, Division of Molecular Biosciences, Imperial College London, UK.
Sophie Lèbre, LSIIT, University of Strasbourg, France.
Jean-Michel Marin, Institut de Mathematiques et Modelisation de Montpellier, Université de Montpellier 2, France.
Francesco Montefusco, College of Engineering, Mathematics and Physical Sciences, University of Exeter, UK and School of Computer and Biomedical Engineering, Università degli Studi Magna Græcia di Catanzaro, Catanzaro, Italy.
Alessandro Moura, Institute for Complex Systems and Mathematical Biology, School of Natural and Computing Sciences, King’s College, University of Aberdeen, UK.
Sach Mukherjee, Department of Statistics, University of Warwick, Coventry, UK.
Yu-ichi Ozaki, Laboratory for Cell Signaling Dynamics, RIKEN Quantitative Biology Center, Kobe, Japan.
John W. Pinney, Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College London, UK.
Nataša Pržulj, Department of Computing, Imperial College London, UK.
Elizabeth Purdom, Department of Statistics, University of California at Berkeley, USA.
Magnus Rattray, Department of Computer Science and Sheffield Institute for Translational Neuroscience, University of Sheffield, UK.
Andreas Raue, Institute of Physics and Freiburg Institute for Advanced Studies (FRIAS) and Centre for Biological Systems Analysis (ZBSA), University of Freiburg, Germany.
Gesine Reinert, Department of Statistics, University of Oxford, UK.
Christian P. Robert, Université Paris-Dauphine, CEREMADE, France, and CREST and ENSAE, Malakoff, France.
Stéphane Robin, AgroParisTech and INRA, Paris, France.
Simon Rogers, School of Computing Science, University of Glasgow, UK.
Judith Rousseau, Université Paris-Dauphine, CEREMADE, France, and CREST and ENSAE, Malakoff, France.
Richard A. Scheltema, Proteomics and Signal Transduction, Max Planck Institute for Biochemistry, Martinsried, Germany.
Marco Scutari, UCL Genetics Institute (UGI), University College London, UK.
Korbinian Strimmer, Institute for Medical Informatics, Statistics and Epidemiology (IMISE), University of Leipzig, Germany.
Michael P. H. Stumpf, Division of Molecular Biosciences, Imperial College London, UK.
Jesper Tegner, Unit of Computational Medicine, Department of Medicine, Karolinska Institutet, Stockholm, Sweden.
Jens Timmer, Institute of Physics and Freiburg Institute for Advanced Studies (FRIAS) and Centre for Biological Systems Analysis (ZBSA), University of Freiburg, Germany.
Michalis Titsias, School of Computer Science, University of Manchester, UK.
Tina Toni, Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, USA.
Adriano V. Werhli, Centro de Ciências Computacionais, Universidade Federal do Rio Grande (FURG), Rio Grande, RS, Brazil.
Darren J. Wilkinson, School of Mathematics and Statistics, Newcastle University, Newcastle-upon-Tyne, UK.
Carsten Wiuf, Bioinformatics Research Centre, Aarhus University, Denmark.
Part A METHODOLOGICAL CHAPTERS
1  Two Challenges of Systems Biology

William S. Hlavacek
Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA
1.1  Introduction
Articulating the challenges faced by a field, or rather the problems that seem worthy of pursuit, can be a useful exercise. A famous example is Hilbert’s problems (Hilbert 1902), which profoundly influenced mathematics. This entire handbook can be viewed as an attempt to define challenges faced by systems biologists, especially challenges requiring a statistical approach, and to provide tools for addressing these challenges. This particular chapter provides a personal perspective. If the editors had invited someone else to write it, I am sure it would have materialized in a very different form, as there are many challenges worthy of pursuit in this field. Much has been written about systems biology, and excellent overviews of the field are available (for example, see Kitano 2002). For many years, I believe one of the challenges faced by systems biology has been community building. In recent years, thanks in part to events such as the International Conference on Systems Biology (Kitano 2001) and the q-bio Summer School and Conference (Edwards et al. 2007), a community of systems biology researchers has established itself. This community identifies with the term ‘systems biology’ and broadly agrees upon its loose definition and the general goals of the field. If one is looking for a definition of systems biology, the work presented at the meetings mentioned above serves the purpose well. A characteristic of systems biology is the systems approach, which is aided by modeling. In systems biology, there has been special interest in the study and modeling of cellular regulatory systems, such as genetic circuits (Alon 2006). This chapter will be focused on a discussion of challenges faced by modelers. Of course, modeling would be aided by technology development, the generation of new quantitative data and in other ways, but I will be concerned mainly with the practice of modeling. Moreover, I will focus on modeling of cell signaling systems, although I believe the discussion is relevant for modeling of other types of cellular regulatory systems. We need models of cellular regulatory systems to accelerate elucidation of the molecular mechanisms of cellular information processing, to understand design principles of cellular regulatory systems (i.e. the
relationships between system structures and functions), and ultimately, to engineer cells for useful purposes. Cell engineering encompasses a number of application areas, including manipulation of cellular metabolism for production of desirable metabolic products (Keasling 2010), such as biofuels; identification of drug targets for effective treatment or prevention of diseases (Klipp et al. 2010), such as cancer; and creation of cells with entirely new functions and properties through synthetic biology approaches (Khalil and Collins 2010). In each of these areas, predictive models can be used to guide the manipulation of cellular phenotypes through interventions at the molecular level (e.g. genetic modifications or pharmacological perturbations). In traditional engineering fields, predictive models play a central role, and I believe predictive modeling will be just as central to cell engineering efforts in the future. Another more immediate reason to seek models of cellular regulatory systems is that model-guided studies of the molecular mechanisms of cellular information processing can accelerate the pace at which these mechanisms and their design principles are elucidated (Wall et al. 2004; Alon 2006; Novák and Tyson 2008; Rafelski and Marshall 2008; Mukherji and van Oudenaarden 2009). A model of a cellular regulatory system essentially represents a hypothesis about how the system operates and such a model can be used to interpret experimental observations, to design experiments, and generally to extend the reach of human intuition and reasoning. A model can be constructed in a systematic step-by-step fashion and the logical consequences of a model, no matter how complicated, can usually be determined. Although models are useful because they make predictions, models are also useful for other reasons, which are sometimes overlooked, as recently discussed by Lander (2010).

It is often assumed, presumably because of the long tradition of modeling in science and engineering and the many successes of modeling (e.g. celestial modeling), that modeling of cell signaling systems should be fairly straightforward and that all the necessary tools (e.g. numerical methods and software implementations of these methods) should be more or less readily available. I do not believe that this is the case or even close to the case. I believe that predictive modeling of cell signaling systems will require the solution of unique and difficult problems and that there is much work to be done.
1.2  Cell signaling systems
Cell signaling (Hunter 2000; Scott and Pawson 2009) controls the phenotypes and fates of cells. By conveying information through a signal-initiated cascade of molecular interactions, a cell signaling system enables a cell to sense changes in its environment and respond to them. Cell signaling is mediated in large part by complex networks of interacting proteins, each of which is itself complex. A signaling protein may contain a catalytic subunit that is tightly regulated, such as a tyrosine kinase (Bradshaw 2010; Lemmon and Schlessinger 2010), as well as other functional components, including modular protein interaction domains [e.g. a Src homology 2 (SH2) domain] (Pawson and Nash 2003), short linear motifs (e.g. a proline-rich sequence that can be recognized by an SH3 domain) (Gould et al. 2010), and sites of post-translational modification (e.g. substrates of kinases and phosphatases) (Walsh et al. 2005). Many of the components of signaling proteins enable and regulate interactions with binding partners, and regulated assembly of protein complexes plays an important role in cell signaling by co-localizing enzymes and substrates, which affects both enzyme activity and specificity (Ptashne and Gann 2001). Although many of the molecular-level properties of specific cell signaling systems have been elucidated and characterized, we are only now beginning to investigate system-level properties, i.e. how the parts work together to process information (Aldridge et al. 2006; Kholodenko 2006; Kholodenko et al. 2010). A systems-level understanding of cell signaling is central to a basic understanding of cell biology. This type of understanding is also relevant for finding better treatments for a number of diseases involving dysregulation of cell signaling systems, such as cancer. The malignancy of cancer cells is driven and/or sustained in part by mutations that affect signaling proteins, many of which are the products of oncogenes and tumor
suppressor genes (Hanahan and Weinberg 2000). We can hope to reverse the effects of deleterious mutations by understanding the system-level properties of cell signaling systems (Kreeger and Lauffenburger 2010; Xu and Huang 2010). However, the complexity of cell signaling systems poses a formidable barrier to understanding and rational manipulation of system-level properties. There is a pressing need to find better ways to reason about cell signaling, so that we can relate the molecular interactions in cell signaling systems to the higher level phenomena that emerge from these interactions (Nurse 2008). Addressing this need now seems timely, given recent advances in proteomics that allow quantitative multiplex assays of the molecular states of cell signaling systems (Gstaiger and Aebersold 2009; Choudhary and Mann 2010; Kolch and Pitt 2010) and opportunities to take advantage of molecular therapeutics (Gonzalez-Angulo et al. 2010), i.e. drugs that specifically target signaling proteins, such as imatinib (Deininger et al. 2005; Hunter 2007).

A fairly standard aid for reasoning about a cell signaling system is a network wiring diagram, which is used to provide a visual representation and summary of available knowledge about a system, in particular the molecules involved and their interactions. Wiring diagrams come in many forms, from simple cartoons to elaborate depictions that make use of formalized iconography (Kohn 1999; Kitano et al. 2005). A testament to the importance of wiring diagrams is the Systems Biology Graphical Notation (SBGN) project, an effort to develop standardized conventions for diagrammatic representation of biological knowledge (Le Novère et al. 2009). Although diagrams are clearly useful, they lack predictive power. A more powerful aid for reasoning about a cell signaling system is a mathematical model (Kholodenko et al. 2010; Kreeger and Lauffenburger 2010; Xu and Huang 2010). Models of cell signaling systems are less commonly used and less accessible than network wiring diagrams, but modeling provides a means to cast available knowledge in a form that allows predictions to be made via computation. Once a model has been formulated, computational procedures can usually be applied to elucidate the logical consequences of the model. We can expect modeling to be useful for understanding cell signaling systems, especially given emerging capabilities in synthetic biology (Bashor et al. 2010; Grünberg and Serrano 2010; Lim 2010), which present opportunities for model-guided cell engineering. However, it is becoming clear that traditional modeling approaches, such as the formalism of ordinary differential equations (ODEs), will not be entirely adequate. Modeling of cell signaling systems presents unique challenges for which new approaches are needed. Two of these challenges are discussed below.
1.3  The challenge of many moving parts
To address many interesting questions, we need large models (Hlavacek 2009). The reason is twofold. First, we need models of entire cell signaling systems and indeed models that encompass sets of connected cell signaling systems and other regulatory systems (e.g. metabolic circuits) to understand how the molecular states of cells affect their fates and phenotypes. The molecular state of a cell can be perturbed and potentially rationally manipulated. For example, pharmacological inhibition of a receptor tyrosine kinase might be used to impact cell-cycle control and reduce proliferation. However, the effect of a perturbation must generally propagate through a network of regulatory systems to bring about a change in phenotype, and these systems are rife with interconnections and feedback loops (Tyson et al. 2003; Brandman and Meyer 2008). The effects of some perturbations can be intuited. Others can be predicted with greater accuracy on the basis of simple or coarse-grained models or on the basis of mechanistically detailed but circumscribed models (Tyson and Novák 2010). However, at some point, to obtain deep understanding and/or exquisite control, we will seek to develop models that are large in scope and heavy in detail. Such models may only promise minor improvements in understanding or controllability, but even minor improvements, given the potential benefits, could provide strong motivation to pursue ambitious modeling efforts. In the United States, the annual cost of treating cancer is estimated to now be greater than US $100 billion and rising (Mariotto et al. 2011). This cost could perhaps
be reduced and/or care could be improved with greater predictive capabilities with regard to the effects of perturbations on cell signaling. The second reason we need large models is that a cell signaling system typically comprises a large number of proteins and other signaling molecules (e.g. lipids), as well as a larger number of molecular interactions. For example, documentation of a cell signaling system in NetPath (netpath.org) includes, on average, information about >80 different molecules (counts from ‘Molecules Involved’ fields) and >150 molecular interactions (counts from ‘Physical Interactions’ and ‘Enzyme Catalysis’ fields) (Kandasamy et al. 2010). This problem can be dramatically exacerbated when one attempts to translate biological knowledge of a cell signaling system into mathematical form because the number of formal elements in a conventional, or traditional, model (e.g. variables or equations) is typically much larger than the number of molecules or interactions considered in the model. For example, a recent model encompassing only about 30 proteins is composed of nearly 500 ODEs (Chen et al. 2009). Models that are far more comprehensive than available models, such as models that couple receptor signaling to cell cycle control, can be developed, evaluated, and incrementally improved through an iterative process, including parameter estimation, experimental tests of model predictions, and model refinement. This iterative process is often discussed, but the process is difficult to implement. It will continue to be difficult to implement until certain types of resources and conventions are developed to aid modelers. In addition, methods are needed to make modeling more automatic, such as methods for data-driven model specification [for an example of inference of network structure in the context of cell signaling, see Ciaccio et al. (2010)], automated comparison of model predictions to known, formalized system properties (Batt et al. 2005; Calzone et al. 2006; Gong et al. 2010), and automated model-guided design of experiments (Bongard and Lipson 2007). The resources I have in mind would provide the type of information available at the Tyson Lab website (mpf.biol.vt.edu), where for example, one can find an impressively long list of budding yeast mutants and their cell cycle phenotypes. This collection of information, which would be time consuming to reproduce, serves to document the basis of the model of Chen et al. (2004) and also provides a set of known system properties, which can be used to validate any new model for the budding yeast cell cycle or to compare alternative models. There is a wealth of information in the primary literature about system properties, which are mostly qualitative in nature. This information could perhaps be better used by modelers to (tightly) constrain the values of parameters in models. The information available about a cell signaling system is often so abundant that only a fraction of the available information can be held in memory at a time. In fact, it may be impossible to even enter into memory all the available information. Available information must be systematized (ideally in the form of reusable and extensible models), reporting of new knowledge must be made more formal, and protocols and standardized conventions for accomplishing these tasks are needed. 
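To make the idea of formalized system properties concrete, here is a minimal sketch, in Python, of how a single qualitative observation of the kind collected at the Tyson Lab website, say 'the wild type oscillates but a particular mutant does not', might be encoded as a machine-checkable benchmark against which candidate models are screened automatically. The function names, thresholds and the crude peak detector are invented for illustration and are not taken from any of the cited studies.

```python
def count_peaks(x, min_prominence=0.1):
    """Crude peak detector: count local maxima in a simulated time course
    that rise at least min_prominence above their immediate neighbours."""
    peaks = 0
    for i in range(1, len(x) - 1):
        if x[i] > x[i - 1] and x[i] > x[i + 1] and x[i] - min(x[i - 1], x[i + 1]) >= min_prominence:
            peaks += 1
    return peaks

def satisfies_known_properties(x_wildtype, x_mutant):
    """Two qualitative observations stated as predicates: the wild type
    oscillates (at least 3 peaks over the observation window) and the
    mutant does not (fewer than 2 peaks)."""
    return count_peaks(x_wildtype) >= 3 and count_peaks(x_mutant) < 2

# A candidate model would be simulated twice (wild type and an in silico
# 'knockout') and retained only if satisfies_known_properties(...) is True.
```

A library of such predicates, one per documented phenotype, is one possible realization of the resource described above; richer specifications of this kind would use a temporal logic, as discussed next.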
Translating available knowledge into formal elements of a model specification or into benchmarks for validating a model is currently a rate-limiting step in the modeling process, and it requires an unusual combination of skills. (It also requires a rare temperament.) With the adoption of conventions for reporting observed system properties, this burden could be lifted from modelers. A standard language for specifying dynamical system properties could perhaps be developed on the basis of a temporal logic (Calzone et al. 2006; Gong et al. 2010). An important element of such a language will be a capacity to represent system perturbations, such as mutations. Progress on formal representation of perturbations has been made by Danos et al. (2009). It should be noted that a common concern about the development of a large model is the difficulty of estimating the many parameters in such a model. Recently, Apgar et al. (2010) demonstrated, via an assumed ground truth model and synthetic data, that the parameters of a model can be identified via fitting to measurements from an unexpectedly small number of experiments if the experiments are well chosen and under the assumptions that model structure is correct and all variables of the model are observable in the form of time-series data. Despite these caveats, this study suggests cautious optimism. The parameters in large models may be more identifiable than conventional wisdom suggests, and new statistical methods could perhaps be
developed for better addressing the problem of parameter estimation in the context of large models for cell signaling systems. One need in this area is the development of parameter estimation methods that respect thermodynamic constraints, which are easily enforced in the case of a small model but not so easily enforced in the case of a large model (Yang et al. 2006; Ederer and Gilles 2007, 2008; Vlad and Ross 2009; Jenkinson et al. 2010).
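As an illustration of the kind of constraint involved, one standard way to respect detailed balance by construction (a sketch only, not the method of any of the papers cited above) is to parameterize equilibrium constants through species free energies, so that the product of equilibrium constants around any closed reaction cycle is automatically 1 (the Wegscheider condition). The species names, free-energy values and constant RT below are invented for illustration.

```python
import math

RT = 2.58  # kJ/mol, roughly the thermal energy scale at 37 degrees C

def equilibrium_constants(free_energy, edges):
    """free_energy: dict mapping species name -> free energy (kJ/mol).
    edges: list of (reactant, product) pairs.
    Returns one equilibrium constant per edge, K = exp(-dG / RT)."""
    return [math.exp(-(free_energy[p] - free_energy[r]) / RT) for r, p in edges]

# Example: the closed cycle A <-> B <-> C <-> A.
G = {"A": 0.0, "B": -3.1, "C": 1.7}            # the free parameters being fitted
cycle = [("A", "B"), ("B", "C"), ("C", "A")]
K = equilibrium_constants(G, cycle)

product = 1.0
for k in K:
    product *= k
print(product)   # 1.0 up to round-off, whatever values G takes
```

With this parameterization an estimation routine can vary the free energies (and the forward rate constants) freely without ever producing a thermodynamically inconsistent parameter set, which is exactly what becomes hard to guarantee when cycle constraints are imposed one by one in a large model.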
1.4  The challenge of parts with parts
The parts of cell signaling systems (e.g. proteins) themselves have parts (e.g. sites of post-translational modification). To faithfully represent the mechanisms of cell signaling, we need models that account for the site-specific details of molecular interactions (Hlavacek et al. 2006; Hlavacek and Faeder 2009; Mayer et al. 2009). A signaling protein is typically composed of multiple functional components or sites. Tracking information about these sites is critical for modeling the system-level dynamics of molecular interactions because protein interactions are context sensitive, i.e. they depend on site-specific details. For example, many protein– protein interactions are regulated by phosphorylation (Pawson and Kofler 2009). It is difficult to document and/or visualize the contextual dependencies of molecular interactions in a way that is both clear and precise. Online resources, such as NetPath (Kandasamy et al. 2010), Phospho.ELM (Dinkel et al. 2011) and Reactome (Matthews et al. 2009), provide (limited) information about the site-specific details of protein–protein interactions through combinations of illustrations, lists (e.g. a list of sites of phosphorylation), comments, and literature citations. It is also difficult to incorporate the site-specific details of molecular interactions into a conventional or traditional model (e.g. ODEs) (Hlavacek et al. 2003). Nevertheless, our mechanistic understanding of site-specific details is fairly advanced and such details can be monitored experimentally (Gstaiger and Aebersold 2009; Choudhary and Mann 2010; Kolch and Pitt 2010), which encourages a consolidation of this knowledge in the form of models. As mentioned above, ODEs are traditionally used to model the kinetics of (bio)chemical reaction systems, and many models of cell signaling systems take the form of a system of coupled ODEs (Kholodenko et al. 1999; Schoeberl et al. 2002). Another commonly used modeling approach involves using Gillespie’s method, or kinetic Monte Carlo (KMC), to execute a stochastic simulation of the kinetics of a given list of reactions (Gillespie 2007; Voter 2007). This approach is typically applied in preference to the ODE-based approach when one is concerned about stochastic fluctuations that arise from small population sizes (McAdams and Arkin 1999). Both approaches require a listing of the reactions that are possible and the chemical species that can be populated in a system. The approach of using ODEs to represent the kinetics of chemical reaction systems was first applied by Harcourt and Esson (1865), long before we understood how cells process information. Cell signaling systems have features that distinguish these systems from other reaction systems, such as many reaction systems of importance in the chemical processing industry. In a typical commercial reactor or test tube, there are large numbers of molecules that can populate a relatively small number of chemical species. In contrast, in a cell signaling system, because of the combinatorial potential of protein interactions, relatively small numbers of molecules populate a tiny fraction of a large, possibly astronomical or even super astronomical number of possible chemical species (Endy and Brent 2001; Bray 2003; Hlavacek et al. 2003, 2006). This feature of cell signaling systems has been called combinatorial complexity. 
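For concreteness, the following is a bare-bones sketch of Gillespie's direct method applied to a small, explicitly listed reaction network, the style of stochastic simulation described above. The reactions, rate constants and copy numbers are invented for illustration.

```python
import random

# Species indices: 0 = L (ligand), 1 = R (receptor), 2 = LR, 3 = pLR
# (modified complex).  The traditional approach requires every reaction to be
# listed explicitly: (rate constant, reactant indices, state-change vector).
reactions = [
    (0.01, (0, 1), (-1, -1, +1, 0)),   # L + R  -> LR
    (0.10, (2,),   (+1, +1, -1, 0)),   # LR     -> L + R
    (0.05, (2,),   (0, 0, -1, +1)),    # LR     -> pLR
    (0.02, (3,),   (0, 0, +1, -1)),    # pLR    -> LR
]

def propensity(rate, reactants, x):
    a = rate
    for i in reactants:
        a *= x[i]
    return a

def gillespie(x0, t_end, seed=1):
    random.seed(seed)
    x, t = list(x0), 0.0
    trajectory = [(t, tuple(x))]
    while t < t_end:
        a = [propensity(k, r, x) for k, r, _ in reactions]
        a_total = sum(a)
        if a_total == 0.0:
            break                                   # nothing left to fire
        t += random.expovariate(a_total)            # waiting time to next event
        u, cumulative, j = random.uniform(0.0, a_total), 0.0, 0
        for j, aj in enumerate(a):                  # pick which reaction fires
            cumulative += aj
            if cumulative >= u:
                break
        x = [xi + d for xi, d in zip(x, reactions[j][2])]
        trajectory.append((t, tuple(x)))
    return trajectory

print(gillespie((200, 100, 0, 0), t_end=100.0)[-1])
```

The point of the combinatorial complexity argument is that, for a realistic signaling protein with tens of modifiable sites and binding partners, the explicit reaction list that this style of simulation (or the corresponding system of ODEs) requires would have to enumerate an astronomical number of species and reactions, which is what motivates the rule-based approach discussed next.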
Combinatorial complexity hinders application of traditional modeling approaches, because in these approaches, one must write an equation for each chemical species that could be populated in a system, or the equivalent. To overcome the problem of combinatorial complexity, the rule-based modeling approach has been developed (Hlavacek et al. 2006). In this approach, graphs are used to model proteins, and graph-rewriting rules are used to model protein interactions. Rules can be specified using a formal language, such as the BioNetGen language (BNGL) (Faeder et al. 2009) or Kappa (Feret et al. 2009; Harmer et al. 2010), which is closely
related to BNGL. The two basic ideas underlying the rule-based modeling approach are to treat molecules as the building blocks of chemical species and to model the molecular interactions that generate chemical species as rules rather than to specify a list of reactions connecting the chemical species that can be populated, which is equivalent to specifying a set of ODEs under the assumption of mass action kinetics. A rule can be viewed as implicitly defining a set of individual reactions, which involve a common transformation. A rule includes a specification of the necessary and sufficient conditions that a chemical species must satisfy to undergo a transformation defined by the rule and a rate law for the transformation. If the conditions are minimal, the number of individual reactions implied by the rule is maximal. For example, if ligand–receptor binding is independent of receptor signaling, then a rule can be specified that captures ligand–receptor interaction without considering the interaction of the receptor with cytosolic signaling proteins. However, there is no free lunch. The cost of the rule-based modeling approach is that the reactions implied by a rule are all assumed to be characterized by the same rate law. Thus, the approach relies on a coarse-grained description of reaction kinetics, because each individual reaction in a system may truly be characterized by a unique rate law. Although it is important to keep this limitation of rule-based modeling in mind, the reliance on coarse-grained reaction kinetics is not onerous, as the granularity of a rule can be adjusted as needed. At the finest level of granularity, a rule implies only a single reaction, i.e. a rule is equivalent to a reaction in the traditional sense. Thus, the rule-based modeling approach can be viewed as a generalization of conventional modeling approaches. Rule-based modeling provides a practical solution to the problem of combinatorial complexity. A number of methods for simulating rule-based models have been developed and these methods have been implemented in various software tools (Blinov et al. 2004; Lok and Brent 2005; Meier-Schellersheim et al. 2006; Moraru et al. 2008; Colvin et al. 2009, 2010; Lis et al. 2009; Mallavarapu et al. 2009; Andrews et al. 2010; Gruenert et al. 2010; Ollivier et al. 2010; Sneddon et al. 2010). A few recently developed tools are briefly discussed below to point to newly available modeling capabilities and to identify capability gaps. The tools discussed below all implement statistical methods. Once a rule-based model has been specified, it can be simulated in a number of different ways. For example, the set of ODEs for mass action kinetics corresponding to the list of reactions implied by a model can be derived from the rules of the model (Blinov et al. 2004; Faeder et al. 2005), or a smaller set of ODEs for the dynamics of collective variables can be derived from the rules (Borisov et al. 2008; Conzelmann and Gilles 2008; Koschorreck and Gilles 2008; Feret et al. 2009; Harmer et al. 2010). However, statistical methods are particularly well suited for simulation of a rule-based model, because tracking individual molecules in a stochastic simulation can be easier than tracking the population levels of chemical species in a deterministic simulation (Morton-Firth and Bray 1998; Danos et al. 2007; Yang et al. 2008; Colvin et al. 2009, 2010). The stochastic simulation compiler (SSC) of Lis et al. 
(2009) implements a ‘next subvolume’ method for simulation of spatial rule-based models, in which subvolumes of a reaction volume are considered to be well mixed but transport between subvolumes is included to capture the effects of diffusion. Within a subvolume, KMC is used to simulate reaction kinetics, i.e. Gillespie’s method is applied. SSC uses a constructive solid geometry approach, in which Boolean operators are used to combine primitive shapes/objects, to define surfaces and objects. A limitation of SSC is its reliance on the computationally expensive procedure of network generation, as SSC implements a ‘generate-first’ approach to simulation (Blinov et al. 2005). In this approach, a rule set is expanded into the corresponding reaction network (or equivalently, the list of reactions implied by the rules) before a simulation is performed. Network generation is expensive and impractical for many rule-based models. It is also unnecessary, as various ‘network-free’ approaches are available (Danos et al. 2007; Yang et al. 2008; Colvin et al. 2009, 2010). In these approaches, rules are used directly to advance a stochastic simulation. In fact, network-free simulation can be viewed as an extension of Gillespie’s method in which a list of reactions is replaced with a list of rules. A number of software tools implement network-free simulation methods, including DYNSTOC (Colvin et al. 2009), RuleMonkey (Colvin et al. 2010) and NFsim (Sneddon et al. 2010), which are all compliant with BNGL. None of these tools have a spatial modeling
capability, i.e. the methods implemented in DYNSTOC, RuleMonkey and NFsim all rely on the assumption of a well-mixed reaction compartment. Network-free simulation tools compliant with Kappa with similar capabilities and limitations, such as KaSim, are available at kappalanguage.org/tools. Another software tool for simulation of spatial rule-based models is Smoldyn (Andrews et al. 2010), which performs particle-based reaction-diffusion calculations. These types of calculations are well suited for dealing with complex geometrical constraints, such as a reconstructed 3D surface (Mazel et al. 2009). Smoldyn implements an ‘on-the-fly’ approach to simulation (Faeder et al. 2005; Lok and Brent 2005), which limits network generation by expanding a list of reactions as chemical species become populated during the course of a simulation. However, on-the-fly simulation still relies on network generation, and because of this, on-the-fly simulation is impractical for many rule-based models (Hlavacek et al. 2006; Yang et al. 2008). Interestingly, the network-free simulation approaches implemented in DYNSTOC, RuleMonkey, NFsim and KaSim are particle-based methods, and it should be straightforward to extend any of these methods to include molecule positions (spatial coordinates) and propagation of molecules in space, on or off a lattice. Powerful tools are available for specifying model geometry and visualizing results in 3D (Czech et al. 2009), and such tools could perhaps be coupled to rule-based simulation capabilities. Finally, I wish to call attention to the simulation capability provided by the SRSim package (Gruenert et al. 2010), which couples BNGL (Faeder et al. 2009) to LAMMPS (Plimpton 1995), a package for performing molecular dynamics (MD) calculations. The MD simulation capability provided by SRSim allows molecules and their binding sites to be modeled as patchy particles and molecular interactions to be modeled in terms of force fields. This capability could be used to study the effects of orientation constraints on molecular interactions, molecular size (i.e. volume exclusion), and both translational and rotational diffusion. A wealth of structural information is available about the modular domains of signaling proteins, other molecules involved in signaling, and biomolecular complexes (Rose et al. 2011). Structures or models for protein signaling complexes could perhaps be built and used to identify potential steric clashes and other factors that potentially impact protein–protein interactions involved in signaling. Relating such structural information to function could perhaps be aided by system-level modeling of the dynamics of protein–protein interactions. A model for a signaling complex is static, whereas assembly of a signaling complex during the course of cell signaling is dynamic. Tools such as SRSim could provide a bridge for connecting structural data available about signaling proteins to system-level dynamics of protein–protein interactions, i.e. to protein function. The study of Monine et al. (2010) provides another example, but no general-purpose software, of how structural constraints on molecular interactions can be considered within the framework of rule-based modeling.
1.5 Closing remarks
Interest in modeling of cellular regulatory systems has increased dramatically in recent years, especially interest in stochastic modeling. In 2010, the classic paper of Gillespie (1977) was cited over 300 times, which is more than the number of times the paper was cited in the 20 years before Gillespie’s method was applied by McAdams and Arkin (1997) to study gene regulation. Although modeling of cellular regulatory systems is more commonplace now than ever, the models being considered are usually small models, which also tend to incorporate little of the known mechanistic details (e.g. the site-specific details of protein–protein interactions). Much can be learned from small models, but cellular regulatory systems are large, and there are interesting questions that can be addressed with large and mechanistically detailed models. I am reminded of the saying attributed to Einstein, ‘everything should be as simple as it can be, but not simpler.’ It will not be easy to develop useful large models of cell signaling systems for a variety of reasons. One reason is that there is no biological expert available to help the ambitious modeler, who may be a physicist or mathematician by training, with formalization of the known biology. I could be wrong, but I do not
believe there is a person alive today who has what can be considered a complete command of the known facts about any well-studied cell signaling system. One could be discouraged by this situation or motivated by it, as the systematization of knowledge provided by a model could help us make better use of available knowledge. Another difficulty for the ambitious modeler is the lack of support that can be expected from colleagues through the peer review process. Once an ODE-based model, for example, crosses some threshold of complexity, a modeler can expect little proofreading of equations, questioning of model assumptions, or helpful suggestions on how to improve parameter estimates. This situation points to a need to develop methods for better communicating the content of large models. Finally, one should keep in mind that models should be formulated to address specific questions, not for the sake of modeling. Finding the right questions to ask, the ones that can be answered with the help of large models, is the crux of the matter. Whether, or rather when, these questions can be found will determine when large modeling efforts become commonplace in the future.
References Aldridge BB, Burke JM, Lauffenburger DA and Sorger PK 2006 Physicochemical modelling of cell signalling pathways. Nat. Cell Biol. 8, 1195–1203. Alon U 2006 An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman and Hall/CRC, Boca Raton, FL. Andrews SS, Addy NJ, Brent R and Arkin AP 2010 Detailed simulations of cell biology with Smoldyn 2.1. PLoS Comput. Biol. 6, e1000705. Apgar JF, Witmer DK, White FM and Tidor B 2010 Sloppy models, parameter uncertainty, and the role of experimental design. Mol. Biosyst. 6, 1890–1900. Bashor CJ, Horwitz AA, Peisajovich SG and Lim WA 2010 Rewiring cells: synthetic biology as a tool to interrogate the organizational principles of living systems. Annu. Rev. Biophys. 39, 515–537. Batt G, Ropers D, de Jong H, Geiselmann J, Mateescu R, Page M and Schneider D 2005 Validation of qualitative models of genetic regulatory networks by model checking: analysis of the nutritional stress response in Escherichia coli. Bioinformatics 21, i19–i28. Blinov ML, Faeder JR, Goldstein B and Hlavacek WS 2004 BioNetGen: software for rule-based modeling of signal transduction based on the interactions of molecular domains. Bioinformatics 20, 3289–3291. Blinov ML, Faeder JR, Yang J, Goldstein B and Hlavacek WS 2005 ‘On-the-fly’ or ‘generate-first’ modeling? Nat. Biotechnol. 23, 1344–1345. Bongard J and Lipson H 2007 Automated reverse engineering of nonlinear dynamical systems. Proc. Natl. Acad. Sci. USA 104, 9943–9948. Borisov NM, Chistopolsky AS, Faeder JR and Kholodenko BN 2008 Domain-oriented reduction of rule-based network models. IET Syst. Biol. 2, 342–351. Bradshaw JM 2010 The Src, Syk, and Tec family kinases: distinct types of molecular switches. Cell. Signal. 22, 1175–1184. Brandman O and Meyer T 2008 Feedback loops shape cellular signals in space and time. Science 322, 390–395. Bray D 2003 Molecular prodigality. Science 299, 1189–1190. Calzone L, Fages F and Soliman S 2006 BIOCHAM: an environment for modeling biological systems and formalizing experimental knowledge. Bioinformatics 22, 1805–1807. Chen KC, Calzone L, Csikasz-Nagy A, Cross FR, Novak B and Tyson JJ 2004 Integrative analysis of cell cycle control in budding yeast. Mol. Biol. Cell 15, 3841–3862. Chen WW, Schoeberl B, Jasper PJ, Niepel M, Nielsen UB, Lauffenburger DA and Sorger PK 2009 Input-output behavior of ErbB signaling pathways as revealed by a mass action model trained against dynamic data. Mol. Syst. Biol. 5, 239. Choudhary C and Mann M 2010 Decoding signalling networks by mass spectrometry-based proteomics. Nat. Rev. Mol. Cell Biol. 11, 427–439. Ciaccio MF, Wagner JP, Chuu CP, Lauffenburger DA, and Jones RB 2010 Systems analysis of EGF receptor signaling dynamics with microwestern arrays. Nat. Methods 7, 148–155.
Two Challenges of Systems Biology 11 Conzelmann H and Gilles ED 2008 Dynamic pathway modeling of signal transduction networks: a domain-oriented approach. Methods Mol. Biol. 484, 559–578. Colvin J, Monine MI, Faeder JR, Hlavacek WS, Von Hoff DD and Posner RG 2009 Simulation of large-scale rule-based models. Bioinformatics 25, 910–917. Colvin J, Monine MI, Gutenkunst RN, Hlavacek WS, Von Hoff DD and Posner RG 2010 RuleMonkey: software for stochastic simulation of rule-based models. BMC Bioinformatics 11, 404. Czech J, Dittrich M and Stiles JR 2009 Rapid creation, Monte Carlo simulation, and visualization of realistic 3D cell models. Methods Mol. Biol. 500, 237–287. Danos V, Feret J, Fontana W and Krivine J 2007 Scalable simulation of cellular signaling networks. Lect. Notes Comput. Sci. 4807, 139–157. Danos V, Feret J, Fontana W, Harmer R and Krivine J 2009 Rule-based modelling and model perturbation. Lect. Notes Comput. Sci. 5750, 116–137. Deininger M, Buchdunger E and Druker BJ 2005 The development of imatinib as a therapeutic agent for chronic myeloid leukemia. Blood 105, 2640–2653. Dinkel H, Chica C, Via A, Gould CM, Jensen LJ, Gibson TJ and Diella F 2011 Phopho.ELM: a database of phosphorylation sites—update 2011. Nucleic Acids Res. 39, D261–D267. Ederer M and Gilles ED 2007 Thermodynamically feasible kinetic models of reaction networks. Biophys. J. 92, 1846– 1857. Ederer M and Gilles ED 2008 Thermodynamic constraints in kinetic modeling: thermodynamic-kinetic modeling in comparison to other approaches. Eng. Life Sci. 8, 467–476. Edwards JS, Faeder JR, Hlavacek WS, Jiang Y, Nemenman I and Wall ME 2007 q-bio 2007: a watershed moment in modern biology. Mol. Syst. Biol. 3, 148. Endy D and Brent R 2001 Modelling cellular behaviour. Nature 409, 391–395. Faeder JR, Blinov ML, Goldstein B and Hlavacek WS 2005 Rule-based modeling of biochemical networks. Complexity 10, 22–41. Faeder JR, Blinov ML and Hlavacek WS 2009 Rule-based modeling of biochemical systems with BioNetGen. Methods Mol. Biol. 500, 113–167. Feret J, Danos V, Krivine J, Harmer R and Fontana W 2009 Internal coarse-graining of molecular systems. Proc. Natl. Acad. Sci. USA 106, 6453–6458. Gillespie DT 1977 Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361. Gillespie DT 2007 Stochastic simulation of chemical kinetics. Annu. Rev. Phys. Chem. 58, 35–55. Gonzalez-Angulo AM, Hennessy BT and Mill GB 2010 Future of personalized medicine in oncology: a systems biology approach. J. Clin. Oncol. 28, 2777–2783. Gong H, Zuliani P, Komuravelli A, Faeder JR and Clarke EM 2010 Analysis and verification of the HMGB1 signaling pathway. BMC Bioinformatics 11, S10. Gould CM, Diella F, Via A, Puntervoll P, Gem¨und C, Chabanis-Davidson S, Michael S, Sayadi A, Bryne JC, Chica C, Seiler M, Davey NE, Haslam N, Weatheritt RJ, Budd A, Hughes T, Pas J, Rychlewski L, Trav´e G, Aasland R, Helmer-Citterich M, Linding R and Gibson TJ 2010 ELM: the status of the 2010 eukaryotic linear motif resource. Nucleic Acids Res. 38, D167–D180. Gruenert G, Ibrahim B, Lenser T, Lohel M, Hinze T and Dittrich P 2010 Rule-based spatial modeling with diffusing, geometrically constrained molecules. BMC Bioinformatics 11, 307. G¨unberg R and Serrano L 2010 Strategies for protein synthetic biology. Nucleic Acids Res. 38, 2663–2675. Gstaiger M and Aebersold R 2009 Applying mass spectrometry-based proteomics to genetics, genomics and network biology. Nat. Rev. Genet. 10, 617–627. Hanahan D and Weinberg RA 2000 The hallmarks of cancer. 
Cell 100, 57–70. Harcourt AV and Esson W 1865 On the laws of connexion between the conditions of a chemical change and its amount. Proc. R. Soc. 14, 470–474. Harmer R, Danos V, Feret J, Krivine J and Fontana W 2010 Intrinsic information carriers in combinatorial dynamical systems. Chaos 20, 037108. Hilbert D 1902 Mathematical problems. Bull. Am. Math. Soc. 8, 437–479. Hlavacek WS 2009 How to deal with large models? Mol. Syst. Biol. 5, 240.
12 Handbook of Statistical Systems Biology Hlavacek WS and Faeder JR 2009 The complexity of cell signaling and the need for a new mechanics. Sci. Signal. 2, pe46. Hlavacek WS, Faeder JR, Blinov ML, Perelson AS and Goldstein B 2003 The complexity of complexes in signal transduction. Biotechnol. Bioeng. 84, 783–794. Hlavacek WS, Faeder JR, Blinov ML, Posner RG, Hucka M and Fontana W 2006 Rules for modeling signal-transduction systems. Sci. STKE 2006, re6. Hunter T 2000 Signaling—2000 and beyond. Cell 100, 113–127. Hunter T 2007 Treatment for chronic myelogenous leukemia: the long road to imatinib. J. Clin. Invest. 117, 2036–2043. Jenkinson G, Zhong X and Goutsias J 2010 Thermodynamically consistent Bayesian analysis of closed biochemical reaction systems. BMC Bioinformatics 11, 547. Kandasamy K, Mohan SS, Raju R, Keerthikumar S, Sameer Kumar GS, Venugopal AK, Telikicherla D, Navarro JD, Mathivanan S, Pecquet C, Kanth Gollapudi S, Gopal Tattikota S, Mohan S, Padhukasahasram H, Subbannayya Y, Goel R, Jacob HKC, Zhong J, Sekhar R, Nanjappa V, Balakrishnan L, Subbaiah R, Ramachandra YL, Abdul Rahiman B, Keshava Prasad TS, Lin JX, Houtman JCD, Desiderio S, Renauld JC, Constantinescu SN, Ohara O, Hirano T, Kubo M, Singh S, Khatri P, Draghici S, Bader GD, Sander C, Leonard WJ and Pandey A 2010 NetPath: a public resource of curated signal transduction pathways. Genome Biol. 11, R3. Keasling JD 2010 Manufacturing molecules through metabolic engineering. Science 330, 1355–1358. Khalil AS and Collins JJ 2010 Synthetic biology: applications come of age. Nat. Rev. Genet. 11, 367–379. Kholodenko BN, Demin OV, Moehren G and Hoek JB 1999 Quantification of short term signaling by the epidermal growth factor receptor. J. Biol. Chem. 274, 30169–30181. Kholodenko BN 2006 Cell-signalling dynamics in time and space. Nat. Rev. Mol. Cell Biol. 7, 165–176. Kholodenko BN, Hancock JF and Kolch W 2010 Signalling ballet in space and time. Nat. Rev. Mol. Cell Biol. 11, 414–426. Kitano H (ed.) 2001 Foundations of Systems Biology. MIT Press, Cambridge, MA. Kitano H 2002 Systems biology: a brief overview. Science 295, 1662–1664. Kitano H, Funahashi A, Matsuoka Y and Oda K 2005 Using process diagrams for the graphical representation of biological networks. Nat. Biotechnol. 23, 961–966. Klipp E, Wade RC and Kummer U 2010 Biochemical network-based drug-target prediction. Curr. Opin. Biotechnol. 21, 511–516. Kohn KW 1999 Molecular interaction map of the mammalian cell cycle control and DNA repair systems. Mol. Biol. Cell 10, 2703–2734. Kolch W and Pitt A 2010 Functional proteomics to dissect tyrosine kinase signalling pathways in cancer. Nat. Rev. Cancer 10, 618–629. Koschorreck M and Gilles ED 2008 ALC: automated reduction of rule-based models. BMC Syst. Biol. 2, 91. Kreeger PK and Lauffenburger DA 2010 Cancer systems biology: a network modeling perspective. Carcinogenesis 31, 2–8. Lander AD 2010 The edges of understanding. BMC Biol. 8, 40. Lemmon MA and Schlessinger J 2010 Cell signaling by receptor tyrosine kinases. Cell 141, 1117–1134. Le Nov`ere N, Hucka M, Mi H, Moodie S, Schreiber F, Sorokin A, Demir E, Wegner K, Aladjem MI, Wimalaratne SM, Bergman FT, Gauges R, Ghazal P, Kawaji H, Li L, Matsuoka Y, Villeger A, Boyd SE, Calzone L, Courtot M, Dogrusoz U, Freeman TC, Funahashi A, Ghosh S, Jouraku A, Kim S, Kolpakov F, Luna A, Sahle S, Schmidt E, Watterson S, Wu G, Goryanin I, Kell DB, Sander C, Sauro H, Snoep JL, Kohn K and Kitano H 2009 The systems biology graphical notation. Nat. Biotechnol. 27, 735–741. 
Lim WA 2010 Designing customized cell signalling circuits. Nat. Rev. Mol. Cell Biol. 11, 393–403. Lis M, Artyomov MN, Devadas S and Chakraborty AK 2009 Efficient stochastic simulation of reaction-diffusion processes via direct compilation. Bioinformatics 25, 2289–2291. Lok L and Brent R 2005 Automatic generation of cellular reaction networks with Moleculizer 1.0. Nat. Biotechnol. 23, 131–136. Mallavarapu A, Thomson M, Ullian B and Gunawardena J 2009 Programming with models: modularity and abstraction provide powerful capabilities for systems biology. J. R. Soc. Interface 6, 257–270.
Two Challenges of Systems Biology 13 Mariotto AB, Robin Yabroff K, Shao Y, Feuer EJ and Brown ML 2011 Projections of the cost of cancer care in the United States: 2010–2020. J. Natl. Cancer Inst. 103, 117–128. Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, de Bono B, Garapati P, Hemish J, Hermjakob H, Jassal B, Kanapin A, Lewis S, Mahajan S, May B, Schmidt E, Vastrik I, Wu G, Birney E, Stein L and D’Eustachio P 2009 Reactome knowledge base of human biological pathways and processes. Nucleic Acids Res. 37, D619–D622. Mayer BJ, Blinov ML and Loew LM 2009 Molecular machines or pleiomorphic ensembles: signaling complexes revisited. J. Biol. 8, 81. Mazel T, Raymond R, Raymond-Stintz M, Jett S and Wilson BS 2009 Stochastic modeling of calcium in 3D geometry. Biophys. J. 96, 1691–1706. McAdams HH and Arkin A 1997 Stochastic mechanisms in gene expression. Proc. Natl. Acad. Sci. USA 94, 814–819. McAdams HH and Arkin A 1999 It’s a noisy business! Genetic regulation at the nanomolar scale. Trends Genet. 15, 65–69. Meier-Schellersheim M, Xu K, Angermann B, Kunkel EJ, Jin T and Germain RN 2006 Key role of local regulation in chemosensing revealed by a new molecular interaction-based modeling method. PLoS Comput. Biol. 2, e82. Monine MI, Posner RG, Savage PB, Faeder JR and Hlavacek WS 2010 Modeling multivalent ligand-receptor interactions with steric constraints on configurations of cell-surface receptor aggregates. Biophys. J. 96, 48–56. Moraru II, Schaff JC, Slepchenko BM, Blinov ML, Morgan F, Lakshminarayana A, Gao F, Li Y and Loew LM 2008 Virtual Cell modelling and simulation software environment. IET Syst. Biol. 2, 352–362. Morton-Firth CJ and Bray D 1998 Predicting temporal fluctuations in an intracellular signalling pathway. J. Theor. Biol. 192, 117–128. Mukherji S and van Oudenaarden A 2009 Synthetic biology: understanding biological design from synthetic circuits. Nat. Rev. Genet. 10, 859–871. Nov´ak B and Tyson JJ 2008 Design principles of biochemical oscillators. Nat. Rev. Mol. Cell Biol. 9, 981–991. Nurse P 2008 Life, logic and information. Nature 454, 424–426. Ollivier JF, Shahrezaei V and Swain PS 2010 Scalable rule-based modelling of allosteric proteins and biochemical networks. PLoS Comput. Biol. 6, e1000975. Pawson T and Kofler M 2009 Kinome signaling through regulated protein–protein interactions in normal and cancer cells. Curr. Opin. Cell Biol. 21, 147–153. Pawson T and Nash P 2003 Assembly of cell regulatory systems through protein interaction domains. Science 300, 445–452. Plimpton SJ 1995 Fast parallel algorithms for short-range molecular dynamics. J. Comput. Phys. 117, 1–19. Ptashne M and Gann A 2001 Genes & Signals. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY. Rafelski SM and Marshall WF 2008 Building the cell: design principles of cellular architecture. Nat. Rev. Mol. Cell Biol. 9, 593–602. Rose PW, Beren B, Bi C, Bluhm WF, Dimitropoulos D, Goodsell DS, Prli´c A, Quesada M, Quinn GB, Westbrook JD, Young J, Yukich B, Zardecki C, Berman HM and Bourne PE 2011 The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Res. 39, D392–D401. Schoeberl B, Eichler-Jonsson C, Gilles ED and M¨uller G 2002 Computational modeling of the dynamics of the MAP kinase cascade activated by surface and internalized EGF receptors. Nat. Biotechnol. 20, 370–375. Scott JD and Pawson T 2009 Cell signaling in space and time: where proteins come together and when they’re apart. Science 326, 1220–1224. 
Sneddon MW, Faeder JR and Emonet T 2010 Efficient modeling, simulation and coarse-graining of biological complexity with NFsim. Nat. Methods doi:10.1038/nmeth.1546. Tyson JJ and Novák B 2010 Functional motifs in biochemical reaction networks. Annu. Rev. Phys. Chem. 61, 219–240. Tyson JJ, Chen KC and Novak B 2003 Sniffers, buzzers, toggles and blinkers: dynamics of regulatory and signaling pathways in the cell. Curr. Opin. Cell Biol. 15, 221–231. Vlad MO and Ross J 2009 Thermodynamically based constraints for rate coefficients of large biochemical networks. WIREs Syst. Biol. Med. 1, 348–358. Voter AF 2007 Introduction to the kinetic Monte Carlo method. In Radiation Effects in Solids (eds Sickafus KE, Kotomin EA, and Uberuaga BP), pp. 1–23. Springer, Dordrecht. Wall ME, Hlavacek WS and Savageau MA 2004 Design of gene circuits: lessons from bacteria. Nat. Rev. Genet. 5, 34–42.
Walsh CT, Garneau-Tsodikova S and Gatto Jr GJ 2005 Protein posttranslational modifications: the chemistry of proteome diversifications. Angew. Chem. Int. Ed. Engl. 44, 7342–7372. Xu AM and Huang PH 2010 Receptor tyrosine kinase coactivation networks in cancer. Cancer Res. 70, 3857–3860. Yang J, Bruno WJ, Hlavacek WS and Pearson JE 2006 On imposing detailed balance in complex reaction mechanisms. Biophys. J. 91, 1136–1141. Yang J, Monine MI, Faeder JR and Hlavacek WS 2008 Kinetic Monte Carlo method for rule-based modeling of biochemical networks. Phys. Rev. E 78, 031910.
2 Introduction to Statistical Methods for Complex Systems
Tristan Mary-Huard and Stéphane Robin
AgroParisTech and INRA, Paris, France
2.1 Introduction
The aim of the present chapter is to introduce and illustrate some concepts of statistical inference useful in systems biology. Here we limit ourselves to the classical, so-called ‘frequentist’ statistical inference, where parameters are fixed quantities that need to be estimated. The Bayesian approach will be presented in Chapter 3. Modelling and inference techniques are illustrated in three recurrent problems in systems biology:

Class comparison aims at assessing the effect of some treatment or experimental condition on some biological response. This requires proper statistical modelling to account for the experimental design, various covariates or stratifications, and dependence between the measurements. As systems biology often deals with high-throughput technologies, it also raises multiple testing issues.

Class prediction refers to learning techniques that aim at building a rule to predict the status (e.g. well or ill) of an individual, based on a set of biological descriptors. An exhaustive list of classification algorithms is out of reach, but general techniques such as regularization or aggregation are of prime interest in systems biology, where the number of variables often exceeds the number of observations by far. Evaluating the performance of a classifier also requires relevant tools.

Class discovery aims at uncovering some structure in a set of observations. These techniques include distance-based or model-based clustering methods and make it possible to identify distinct groups of individuals in the absence of a prior classification. However, the underlying structure may have more complex forms, each raising specific issues in terms of inference.

This chapter focuses on generic statistical concepts and methods that can be applied no matter which technology is used for the data acquisition. In practice, applications to any biological problem will necessitate
both a relevant strategy for the data collection and a careful tuning of the methods to obtain meaningful results. These two steps of data collection (or experimental design) and adaptation of the generic methods require taking into account the nature of the data. They are therefore dependent on the data acquisition technology, and will be discussed in Part B of this Handbook. In this chapter, the data are assumed to arise from a static process. The analysis of a dynamic biological system would require more sophisticated methods, such as partial differential equations or network modelling. These topics are not discussed here as they will be reviewed in depth in Parts C and D. Lastly, a basic knowledge of statistics is assumed, covering topics including point estimation (in particular maximum likelihood estimation), hypothesis testing, and a background in regression and linear models.
2.2 Class comparison
We consider here the general problem of assessing the effect of some treatment, experimental condition or covariate on some response. We first address the problem of modelling the data resulting from the experiments, focusing on how to account for the dependency between the observations. We then turn to the problem of multiple testing, which is recurrent in high-throughput data analyses.
2.2.1 Models for dependent data
Many biological experiments aim at observing the effects of a given treatment (or combination of treatments) on a given response. ‘Treatment’ is used here in a very broad sense, including controlled experimental conditions, uncontrolled covariates, time, population structure, etc. In the following n will stand for the total number of experiments. Linear (Gaussian) models (Searle 1971; Dobson 1990) provide a general framework to describe the influence of a set of controlled conditions and/or uncontrolled covariates, summarized in an n × p-dimensional matrix X, on the observed response gathered in an n-dimensional vector Y as

Y = Xθ + E     (2.1)

where θ is the p-dimensional vector containing all parameters. In the most classical setting, the response is supposed to be Gaussian, and the dependency structure between the observations is then fully specified by the (co-)variance matrix Σ = V(Y) = V(E), which contains the variance of each observation on the diagonal and the covariances between pairs of observations elsewhere. In the simplest setting, the responses are supposed to be independent with the same variance σ², that is Σ = σ²I.
2.2.1.1 Writing the right (mixed) model
In more complex experiments, the assumption that observations are independent does not hold and the structure of Σ needs to be adapted. Because it contains n(n + 1)/2 parameters, the shape of Σ has to be strongly constrained to allow good inference. We first present here some typical experimental settings and the associated dependency structures.

Variance Components Consider the study of the combined effects of the genotype (indexed by i) and of the cell type (k) on some gene expression. Several individuals (j) from each genotype are included and cells from each type are harvested in each of them. In such a setting the expected response is E(Yijk) = μik, which is often decomposed into a genotype effect, a cell type effect and an interaction as μik = μ + αi + βk + (αβ)ik.
The most popular way to account for the dependency between measures obtained on the same individual is to add a random term Uij associated with each individual. The complete model can then be written as

Yijk = μik + Uij + Eijk     (2.2)

where all Uij and Eijk are independent centred Gaussian variables with variance σU² and σE², respectively. The variance of one observation is then σ² = σU² + σE², where σU² is the ‘biological’ variance and σE² is the ‘technical’ one (Kerr and Churchill 2001). The random effect induces a uniform correlation between observations from the same individual since:

Cov(Yijk, Yi′j′k′) = σU²   if (ij) = (i′j′), k ≠ k′     (2.3)

and 0 if (ij) ≠ (i′j′). The matrix form of this model is a generalization of (2.1):

Y = Xθ + ZU + E     (2.4)
where Z describes the individual structure: each row corresponds to one measurement and each column to one individual, and Z contains a 1 at the intersection if the measurement has been made on that individual, and a 0 otherwise. The name ‘mixed’ in ‘linear mixed models’ comes from the simultaneous presence of fixed (θ) and random (U) effects. It corresponds to the simplest form of so-called ‘variance components’ models. The variance matrix corresponding to (2.3) is Σ = σU² ZZ′ + σE² I. Applications of such a model to gene expression data can be found in Wolfinger et al. (2001) or Tempelman (2008).

Repeated Measurements Consider now a similar design where, in place of cell types, we compare successive harvesting times (indexed by t) within each individual. The uniform correlation within each individual given in (2.3) may then seem inappropriate, for it does not account for the delay between times of observation. A common dependency form is then the so-called ‘autoregressive’ structure, which states that
Cov(Yijt, Yi′j′t′) = σ² ρ^|t−t′|   if (ij) = (i′j′)

and 0 otherwise. This amounts to assuming that the correlation decreases (at an exponential rate) with the time delay. Such a variance structure cannot be put in a simple matrix form similar to (2.4). Note that Equation (2.1) is still valid, but with a nondiagonal variance matrix Σ = V(E).

Spatial Dependency It is also desirable to account for spatial dependency when observations have some spatial localization. Suppose one wants to compare treatments (indexed by i), and that replicates (j) have respective localizations ℓij. A typical variance structure (Cressie 1993) is

V(Yij) = σ² = σS² + σE²,   Cov(Yij, Yi′j′) = σS² ρ^‖ℓij − ℓi′j′‖

where σE² accounts for the measurement error variability and ρ controls the speed at which the dependency decreases with distance. The dependency structures described above can of course be combined. Also note that this list is far from exhaustive. The limitations often come from the software at hand or the specific computing developments that can be made. A large catalogue of such structures can be found in software such as SAS (2002–03) or R (www.r-project.org).
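As an illustration of how a variance components model such as (2.2) can be fitted in practice, here is a minimal sketch using the Python package statsmodels (R's nlme/lme4 or SAS PROC MIXED are the tools more commonly used in this setting); the simulated data set, the numbers of genotypes, individuals and cell types, and the effect sizes are all purely illustrative.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for i in range(3):                      # genotype
    for j in range(5):                  # individual within genotype
        u = rng.normal(0.0, 1.0)        # random individual effect ('biological' variance)
        for k in range(3):              # cell type within individual
            mu = 10.0 + 0.5 * i + 1.0 * k                      # illustrative fixed effects
            rows.append({"expr": mu + u + rng.normal(0.0, 0.5),  # 'technical' noise
                         "genotype": i, "celltype": k,
                         "individual": f"g{i}_ind{j}"})
df = pd.DataFrame(rows)

# Model (2.2): fixed genotype x cell-type effects, random individual intercept,
# fitted by ReML (the statsmodels default)
fit = smf.mixedlm("expr ~ C(genotype) * C(celltype)", data=df,
                  groups=df["individual"]).fit(reml=True)
print(fit.summary())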
2.2.1.2 Inference
Some problems related to the inference of mixed linear models are still unresolved. We only provide here an introduction to the most popular approaches and emphasize some practical issues that can be faced when using them.

Estimation Mixed model inference requires estimating both θ and Σ. We start with the estimation of Σ, which reduces to the estimation of a few variance parameters such as σ², σU², σE², σS², ρ in the examples given above. Moment estimates can be obtained (Searle 1971; Demindenko 2004), typically for variance component models. Such estimates are often based on sums of squares, that is, squared distances between Y and its projections on various linear spaces, such as span(X), span(Z) or span(X, Z). The expectation of these sums of squares can often be related to the different variance parameters and the estimation then reduces to solving a set of linear equations. The maximum likelihood (ML) estimator is defined as

(θ̂, Σ̂)ML = arg max log P(Y; θ, Σ)
and can be used for all models. Unfortunately, ML variance estimates are known to be biased in many (almost all) situations, because both θ and Σ have to be estimated at the same time. The most popular way to circumvent this problem consists of moving to a model where θ is known (Verbeke and Molenberghs 2000). Defining Ỹ = TY for some matrix T such that TX = 0, we obtain a Gaussian vector Ỹ which satisfies

E(Ỹ) = 0   and   V(Ỹ) = TΣT′.

The most natural choice for T is the projector on the linear space orthogonal to span(X). The so-called ‘restricted’ maximum likelihood (ReML) estimate of Σ is then defined as

Σ̂ReML = arg max log P(Ỹ; Σ).
Note that ReML estimates are ML estimates and therefore inherit all their properties (consistency, asymptotic normality). Also note that ReML provides less biased variance estimates than ML. The ReML estimates can be shown to be unbiased only for some specific cases such as orthogonal designs, where they are equivalent to moment estimates. The estimates of θ (and the variance of this estimator) are the same for ML and generalized least squares (Searle 1971):

θ̂ = (X′Σ⁻¹X)⁻¹ X′Σ⁻¹Y,   V(θ̂) = (X′Σ⁻¹X)⁻¹.

These estimates are often calculated with a simple plug-in of one of the estimates of Σ described above. However, as it relies on the shape of Σ specified in the model, such an estimator of V(θ̂) is not robust to a misspecification of the dependency structure. Alternative estimators such as the ‘sandwich’ estimator can be defined, which are both consistent and robust (Diggle et al. 2002).

Tests As many experiments aim to assess the significance of a given effect, we are often interested in testing the hypothesis H0 stating that the corresponding contrast between the elements of θ is null. The global procedure is similar to that of model (2.1): writing the contrast under study as a linear combination of the parameters, c′θ, we get the usual test statistic

T = c′θ̂ / √(c′ V̂(θ̂) c),     (2.5)

which approximately follows a Student's t distribution with ν degrees of freedom under H0.
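To make the plug-in computation explicit, here is a minimal numpy sketch of the generalized least squares estimate and of a contrast statistic of the form (2.5); the design matrix, the covariance matrix Σ (treated here as known) and the contrast are simulated and purely illustrative.

import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Sigma = 0.5 * np.eye(n) + 0.5 * np.exp(-lags / 5.0)      # an autoregressive-like covariance
theta = np.array([1.0, 2.0, 0.0])
Y = X @ theta + rng.multivariate_normal(np.zeros(n), Sigma)

Sinv = np.linalg.inv(Sigma)
V_theta = np.linalg.inv(X.T @ Sinv @ X)        # V(theta_hat) = (X' Sigma^-1 X)^-1
theta_hat = V_theta @ X.T @ Sinv @ Y           # theta_hat = (X' Sigma^-1 X)^-1 X' Sigma^-1 Y

c = np.array([0.0, 0.0, 1.0])                  # contrast: last coefficient is null under H0
T = c @ theta_hat / np.sqrt(c @ V_theta @ c)   # test statistic as in (2.5)
print(theta_hat, T)

In practice Σ would first be estimated (e.g. by ReML) and then plugged in, which is precisely why the robustness issue discussed above arises.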
A central practical issue is the determination of the (approximate) degrees of freedom of the Student's t distribution, as they depend on both the experimental design and the contrast itself. As an example, in model (2.2), we first consider the average difference between genotypes i and i′, μi• − μi′•, which is estimated by Yi•• − Yi′•• (the notation ‘•’ means that the variable is averaged over the index it replaces) (Searle 1971). We then consider the difference between two cell types k and k′, μ•k − μ•k′, and its estimate Y••k − Y••k′. If the design is completely balanced with I genotypes, J individuals per genotype and K cell types within each individual, the respective variances of these estimates are

V(μ̂i• − μ̂i′•) = 2σU²/J + 2σE²/(JK),   V(μ̂•k − μ̂•k′) = 2σE²/(IJ).

We see here that the biological variance σU² does not affect the contrast on the cell types (which are harvested within each individual) whereas it is predominant for the genotype contrast, which refers to differences between individuals. The distribution of the test statistic is Student in both cases, but with different degrees of freedom: IJ − I for the genotype and IJK − IK for the cell types. The intuition is that the number of observations is the number of individuals IJ for the genotype, whereas it is the total number of measures IJK for the cell type. Hence, in such a design the power will be (much) greater for distinguishing cell types than genotypes. For more complex dependency structures such as repeated measurements or spatial data, the distribution of (2.5) is only approximate and the degrees of freedom ν need to be approximated (Fai and Cornelius 1996; Kenward and Roger 1997; Khuri et al. 1998).
2.2.1.3 Generalized Linear Mixed Models
Observations cannot always be modelled with a Gaussian distribution. The linear model can however easily be generalized to most usual distributions such as binomial, Poisson, Gamma, etc., giving rise to the ‘generalized’ linear model (Dobson 1990; Demindenko 2004; Zuur et al. 2009). A key ingredient is the introduction of a ‘link function’ g between the expectation of the responses and the linear model, as g(EY) = Xθ. For non-Gaussian models, a canonical parametrization of the correlation between the observations does not always exist. A specific modelling of the dependency is then required. As an example, in the Poisson case, the variance components model (2.4) will typically be rewritten as g[E(Y|U)] = Xθ + ZU, where g is the log function and the coordinates of U are independent Gamma distributed variables. However, the Gaussian modelling can be re-used in all generalized linear models, stating that g[E(Y|V)] = Xθ + V, where V is a centred Gaussian vector with a variance matrix similar to those seen above in the Gaussian context.
2.2.2 Multiple testing
Models such as those described in the preceding section make it possible to assess or reject the existence of some effect on a given response. They typically allow one to assess whether a given gene is differentially expressed between several conditions in microarray experiments, accounting for possible covariates and dependency between the replicates. Because of the large number of genes spotted on each microarray, we are then faced with a multiple testing problem, since we get one test statistic Ti similar to (2.5) for each of the genes. This example has become commonplace (Dudoit et al. 2003), but similar issues are encountered in SNP studies, QTL detection and mass spectrometry, that is, in all statistical analyses dealing with high-throughput technologies.
2.2.2.1 The basic problem
One Test Setting We first consider the hypothesis Hi that states that the expression level of gene i is not affected by the treatment under study. To assess Hi, a test statistic Ti is calculated from the observations such that a large value of Ti will lead to rejecting Hi (gene i is then said to be ‘differentially expressed’). More precisely, denoting F the cumulative distribution function (c.d.f.) of Ti if Hi holds, we calculate a p-value defined as Pi = 1 − F(Ti) and reject Hi if Pi is below a pre-specified level t. Pi measures the significance of the test and t is the level of the test, that is, the risk (probability) of rejecting Hi when it is actually true. When testing one hypothesis at a time, t is generally set to 1% or 5%.

Multiple Testing Now consider m null hypotheses H1, . . . , Hm. Suppose that among these m hypotheses, m0 actually hold. For a given level t, denote FP(t) the number of false positives, that is, the number of tests i for which Hi actually holds whereas Pi falls below t. The expected number of false positives is E[FP(t)] = m0t. This number can exceed several hundreds when m0 reaches 10⁴ or 10⁶, as in microarray or SNP analyses. The primary goal of a multiple testing procedure (MTP) is to keep this number of false positives small, by tuning the threshold t.

Distribution of the p-Value Most MTPs apply to the set of p-values P1, . . . , Pm and the control of FP(t) relies essentially on their distribution. If Hi is true, we have

Ti ∼ F   ⇒   Pi ∼ U[0;1],

which means that, when Hi holds, Pi is uniformly distributed over the interval [0; 1]. As we shall see, this property is one of the key ingredients of all MTPs. It is therefore highly recommended to check it graphically by simply plotting the histogram of the Pi's: it should show a peak close to 0 (corresponding to truly differentially expressed genes associated with small Pi's) and a roughly uniform spread over the rest of the interval [0, 1] [see Figure 2.1(a)].
2.2.2.2 Global risk
Because FP(t) is random, an MTP aims at controlling some characteristic of its distribution. Such a characteristic can be viewed as a global risk since it considers all the tests at the same time. Our goal is to maintain it at some pre-specified value α. We assume that m0 is known; if not, it can be either estimated or replaced by a surrogate (see later). Family-Wise Error Rate (FWER)
The most drastic approach is to keep FP(t) close to 0, that is, to make

FWER = Pr{FP(t) > 0}

small. The goal is then to find the right level t to guarantee a targeted FWER α. When all tests are assumed to be independent, FP(t) has a binomial B(m0, t) distribution, so

FWER = 1 − (1 − t)^m0 = α   ⟺   t = 1 − (1 − α)^(1/m0).

This MTP is known as the Sidak procedure. When independence is not assumed, the FWER can be upper bounded via the Bonferroni inequality, which leads to

FWER ≤ m0 t = α   ⟺   t = α/m0.

For a small α, we have 1 − (1 − α)^(1/m0) ≈ 1 − (1 − α/m0) = α/m0, so the two MTPs lead to similar thresholds t.
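As a quick numerical check (the values of m0 and α below are arbitrary), the two thresholds can be compared directly:

m0, alpha = 10_000, 0.05
t_sidak = 1 - (1 - alpha) ** (1.0 / m0)   # Sidak threshold
t_bonferroni = alpha / m0                 # Bonferroni threshold
print(t_sidak, t_bonferroni)              # both are about 5e-06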
False Discovery Rate (FDR) Because it basically aims at maintaining FP at 0, FWER control often leads to overly stringent thresholds, especially when m0 is large. Benjamini and Hochberg (1995) suggest instead controlling the FDR, that is, the expected proportion of false positives among all positives: FDR(t) = E[FP(t)/P(t)], where P(t) denotes the number of Pi's smaller than t [the precise definition of the FDR is slightly different, to account for the case where P(t) = 0]. These authors propose the following MTP: denoting P(1) ≤ · · · ≤ P(i) ≤ · · · ≤ P(m) the ordered p-values, define i∗ = max{i : m0P(i)/i ≤ α} and reject all hypotheses Hi such that Pi ≤ P(i∗); then E[FDR(P(i∗))] ≤ α. The intuition behind this is that if t is chosen as t = P(i), then P(t) = i and E[FP(t)] = m0P(i), so E[FP(t)/P(t)|P(i)] = m0P(i)/i.
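A minimal implementation of this step-up procedure is sketched below; the simulated p-values (a mixture of uniform null p-values and Beta-distributed alternatives) are purely illustrative, and m0 is conservatively replaced by m by default, a point taken up in the next subsection.

import numpy as np

def bh_threshold(pvals, alpha=0.05, m0=None):
    """Benjamini-Hochberg step-up procedure; returns the rejection threshold P_(i*)."""
    p = np.sort(np.asarray(pvals))
    m = p.size
    m0 = m if m0 is None else m0                              # conservative surrogate for m0
    below = np.nonzero(m0 * p / np.arange(1, m + 1) <= alpha)[0]
    return p[below.max()] if below.size else 0.0

rng = np.random.default_rng(2)
pvals = np.concatenate([rng.uniform(size=900),                # 900 true null hypotheses
                        rng.beta(0.2, 5.0, size=100)])        # 100 alternatives, p-values near 0
t_star = bh_threshold(pvals, alpha=0.05)
print(t_star, (pvals <= t_star).sum(), "rejections")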
2.2.2.3 Some extensions
Estimation of π0 Most MTPs involve the number of true null hypotheses m0, which is unknown in general. A conservative strategy consists of replacing m0 with m, but this leads to a (sometimes dramatic) loss of power. A reliable estimation of m0, or equivalently of π0 = m0/m, strongly improves the MTP. Thanks to the large number of tested hypotheses, the proportion π0 of true ones can be accurately estimated, and several estimates have been proposed (Langaas et al. 2005; Blanchard and Roquain 2009; Celisse and Robin 2010; and references therein). The basic idea traces back to Storey (2002), who suggests considering the number N(λ) of p-values exceeding a given threshold λ, above which only true null hypotheses should be found. π0 can then be estimated by π̂0 = N(λ)/[m(1 − λ)], and m̂0 = m π̂0 can then be plugged into the MTP. The properties of the resulting procedure can be assessed via a careful study of the behaviour of the empirical c.d.f. of the p-values and of the associated false positive and false negative empirical processes (Genovese and Wasserman 2002, 2004). The choice of λ can also be made in an adaptive way (Celisse and Robin 2010).

Dependency The independence hypothesis stated up to now does not hold in most real situations: gene expressions are typically correlated and the test statistics associated with each gene are therefore not independent. A strong departure from this hypothesis may have strong consequences (Kim and van de Wiel 2008). When permutation tests are used, the dependency structure can be embedded in the permutation algorithm, so no specific adjustment is required (Westfall and Young 1993). In a more general context, the MTP can be adapted to still control FDR under dependency. This field is active and we only provide here a few references: Sun and Cai (2008) consider a Markovian dependency, Friguet et al. (2009) propose a general approach to model the dependency, and Blanchard and Roquain (2008) consider a more general context and define adaptive MTPs.

Local FDR FDR only informs us about the proportion of false positives within a set of rejected tests. Some information about the probability for each of them to be a false positive is clearly desirable in practice. This is the aim of the local FDR (ℓFDR) defined by Efron et al. (2001), who rephrase the multiple testing problem as a mixture model (see Section 2.4). By construction, p-values associated with true null hypotheses have a uniform distribution over [0, 1], so the distribution of all p-values displayed in Figure 2.1(a) can be written as

g(p) = π0 + (1 − π0)f(p)
Figure 2.1 (a) Typical distribution of the p-values in a multiple testing setting. (b) Receiver operating characteristic (ROC) curves for four different classifiers A, B, C and D
where f is some density concentrated toward 0. This modelling is consistent with most estimates of π0 presented above. The local FDR of hypothesis Hi is then defined as in (2.19): ℓFDRi = π0/g(Pi). Various estimates of f and π0 have been proposed in this context. Efron et al. (2001) considered an empirical Bayes approach, Allison et al. (2002) proposed a parametric mixture of Beta distributions, while McLachlan et al. (2006) applied a Gaussian mixture to probit-transformed p-values. This last transformation turns out to be very efficient in practice, as it zooms into the region where true and false hypotheses are mixed, making the mixture more identifiable. More flexible modellings have also been proposed, including a semi-parametric estimate of f (Robin et al. 2007) or estimation under a convexity constraint (Strimmer 2008). R packages are associated with most of these procedures; the ‘Mutoss’ R project aims at gathering them (r-forge.rproject.org/projects/mutoss/).
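The sketch below illustrates Storey's estimator of π0 together with a deliberately crude, histogram-based version of the local FDR; the simulated p-values, the choice λ = 0.5 and the 20-bin density estimate are conventions of this sketch only, not part of the published procedures.

import numpy as np

def storey_pi0(pvals, lam=0.5):
    """Storey (2002): pi0_hat = #{P_i > lambda} / (m (1 - lambda))."""
    pvals = np.asarray(pvals)
    return (pvals > lam).sum() / (pvals.size * (1.0 - lam))

rng = np.random.default_rng(3)
pvals = np.concatenate([rng.uniform(size=900), rng.beta(0.2, 5.0, size=100)])
pi0_hat = storey_pi0(pvals)
print(pi0_hat)                       # close to the true proportion 0.9 here

# Crude local FDR in the spirit of Efron et al. (2001):
# estimate the marginal density g by a histogram and set lfdr_i = pi0_hat / g(P_i)
dens, edges = np.histogram(pvals, bins=20, range=(0.0, 1.0), density=True)
g_at_p = dens[np.clip(np.digitize(pvals, edges) - 1, 0, dens.size - 1)]
lfdr = np.clip(pi0_hat / g_at_p, 0.0, 1.0)
print(lfdr[pvals < 0.01][:5])        # small local FDR for the most extreme p-values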
2.3 Class Prediction
2.3.1 Building a classifier
2.3.1.1 Classification problem
In classification, the goal is to predict the class or label Y of an observation, according to some information X collected on this observation. Here are two examples of classification problems in computational biology:
• Cancer diagnosis. Based on information X that consists of gene activity measured via microarray technology on a tissue sample collected from an individual, one wants to determine the disease status Y of the individual, which can be 0 if the individual is healthy, 1 if a moderate form of cancer is diagnosed or 2 if the cancer is invasive (Golub et al. 1999).
• Protein–protein interaction. The problem is to predict whether a protein interacts (i.e. binds) with a protein of interest. In this problem, each candidate protein can be described by its amino-acid sequence and/or elements of its surface geometry, which constitute the available information X, and the label Y is +1 if the candidate protein interacts, −1 otherwise (Hue et al. 2010).
In these examples the information X is of various forms, while Y is a qualitative variable with usually few possible values. In the following we only consider binary classification problems where Y = −1 or +1. In order to predict Y from X, one has to build a classifier, i.e. a function f : X → {−1, +1}, whose predictions are as accurate as possible. The prediction performance of a classifier can be quantified by its classification error rate

L(f) = Pr{f(X) ≠ Y}.     (2.6)

Theoretically, the problem is simple: define the Bayes classifier as

fB(x) = +1 if p+1(x) > p−1(x) and −1 otherwise, or equivalently, fB(x) = +1 if p+1(x) > 0.5 and −1 otherwise,     (2.7)
where pℓ(x) = Pr{Y = ℓ | X = x} is the posterior probability for point x to belong to class ℓ (to have label ℓ). This classifier is known to be optimal, in the sense that it minimizes the classification error rate. However, in practice this classifier is not available, because the joint distribution P(X, Y) is usually unknown. In fact, the only information at our disposal to build a classifier is a training set, i.e. a sample of n independent observations (X1, Y1), . . . , (Xn, Yn) for which both information and labels are known. Still, the Bayes classifier gives important clues about how to build a classifier: the decision to classify an observation as +1 is based on the posterior probabilities p+1(x) and p−1(x). One then needs either to estimate these probabilities, or at least to identify the set of points x such that p+1(x) > p−1(x). Many classification algorithms have been proposed to build a classifier; three of the most popular ones are presented next.
2.3.1.2 Some classification algorithms
For the sake of simplicity, we will assume here that X is a collection of p quantitative descriptors X1, . . . , Xp, such as the expression measurements of genes in a given tissue.

Logistic Regression The logistic regression algorithm is based on the following hypothesis: for any point x, we have

log[p+1(x)/p−1(x)] = α0 + Σ_{j=1}^{p} αj xj = g(x).     (2.8)
(x) p +1 = j xj ⎠ . 0 + α = exp ⎝α R p−1 (x) j=1
The logistic regression classifier is then fLReg (x) =
+1
>1 if R
−1
otherwise.
A consequence of hypothesis (2.8) is that the boundary between the two classes, i.e. the set of points such that p−1(x) = p+1(x), is linear. Therefore the logistic regression classifier results in a hyperplane that splits the space X in two. Note that the model can be extended to quadratic or higher polynomial decompositions. One difficulty is then to make a relevant choice of the polynomial order. A low order may result in an overly simplistic regression function and a poor fit to the training data. A high order will result in a very good fit to the training data, but will generalize poorly to new observations. This last phenomenon is often referred to as ‘over-fitting’.
Nearest Neighbour (NN) Algorithm The posterior probabilities p−1(x0) and p+1(x0) of any new observation x0 can be locally estimated by looking at points that are close to x0 in the training set. This is the principle of the kNN algorithm (Fix and Hodges 1951, 1952): first, for a given observation x0 to classify, find X(1), . . . , X(k), the k points closest to x0 in the training set, then classify x0 according to a majority vote decision rule among these k neighbours. We have:

fkNN(x0) = +1 if p̂+1(x0) > p̂−1(x0), and −1 otherwise,   where p̂ℓ(x0) = (1/k) Σ_{i=1}^{k} I{Y(i) = ℓ}.

A variant of this simple algorithm is the weighted kNN algorithm (WkNN), where each neighbour has a specific weight w(i)(x0) in the majority vote decision rule, according to its proximity to x0 (the closer to x0, the higher the weight). When elaborating a kNN classifier, the difficulty lies in the choice of the distance measure between points and of the number of neighbours k to consider.

Classification Tree A classification tree is a classifier that predicts the label of an observation by checking a sequence of conditions of the form ‘is measurement Xj higher than threshold sj for this observation?’. Since the prediction is the result of this sequential checking process, such classifiers can be graphically represented as binary trees, where each node corresponds to a condition, except for the terminal nodes, which correspond to the predictions. Assuming that the size of the tree (its number of nonterminal nodes) is K, there are K + 1 profiles of answers, each leading to one of the K + 1 terminal nodes Nd1, . . . , NdK+1. Assuming without loss of generality that nodes Nd1, . . . , Ndk are associated with label +1 and nodes Ndk+1, . . . , NdK+1 with label −1, we have:

ftree(x0) = +1 if x0 ∈ Ndℓ for some ℓ ∈ {1, . . . , k}, and −1 otherwise.

When elaborating a tree classifier, the difficulty lies in the choice of the number K of conditions to check, their ordering, and the variable Xj and threshold sj that should be used at each condition. Many algorithms have been proposed to build tree classifiers, CART (Breiman et al. 1984) and C4.5 (Quinlan 1993) being the most popular ones. Compared with the kNN algorithm, a major specificity of tree classifiers lies in their embedded variable selection process: at most K of the p variables will be used for the classification of any observation. This may be of importance when the number of variables is large, with most of them being irrelevant for the prediction purpose and only a small subset being informative.

These examples illustrate the great diversity of existing classification algorithms, many more being detailed in Hastie et al. (2001). One may wonder how these algorithms would perform in high dimensional settings where the number of descriptors p in X is much larger than the number of observations (often referred to as the ‘p ≫ n’ paradigm), a classical situation in computational biology. It appears that all will deteriorate in performance as the dimension increases, but for different reasons. Due to their internal variable selection process, classification trees may be unstable: small changes in the training set will yield different classifiers with different sets of selected variables, and large discrepancies in prediction. Nearest neighbours will be much more stable since no variable selection is performed, but the pertinent information may be diluted if most variables are uninformative, leading to poor prediction performance. Lastly, logistic regression may be inefficient too, since it requires estimating one parameter per variable, each estimation being vitiated by noise. We therefore need efficient strategies to deal with high dimensional data. Two of them, aggregation and regularization, are presented in Sections 2.3.2 and 2.3.3, respectively.
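To make the three algorithms concrete, here is a minimal sketch using the Python package scikit-learn on a simulated ‘p ≫ n’ data set; the data generator, the choice k = 5 and the single train/test split are purely illustrative, and a single hold-out split is only a rough way of evaluating the error rate.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Simulated 'p >> n' data: 100 observations, 500 descriptors, only 10 informative
X, y = make_classification(n_samples=100, n_features=500, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "5-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    "classification tree": DecisionTreeClassifier(random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    err = np.mean(clf.predict(X_te) != y_te)   # empirical classification error rate
    print(name, round(err, 3))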
2.3.2 Aggregation
Aggregation may be motivated by different purposes: to ease the interpretation of the classification rule, or to improve the predictions of a given classification algorithm. When interpretation is at stake, aggregation is usually performed at the variable level. If the goal is to improve the prediction performance, one will rather aggregate classifiers. We briefly review variable aggregation and then focus on the two classical strategies to aggregate classifiers, bagging and boosting.

2.3.2.1 Variable aggregation
High dimensional data are usually characterized by a high level of redundancy: different variables or sets of variables may share the same information about the label. When classification is based on a subset of the p variables at hand, as in a classification tree, different (and sometimes nonoverlapping) subsets may be used to achieve the same prediction performance. The choice between the equivalent subsets will arbitrarily depend on the training sample, and small changes of the training sample may lead to drastic changes in subset selection, hence to noninterpretable classifiers (Michiels et al. 2005). To deal with the redundancy problem, several authors proposed to first aggregate the variables that share the same information into clusters, and then to build a classifier based on these clusters. Several strategies have been proposed (Hastie et al. 2000; Dettling and Buhlmann 2002; Mary-Huard and Robin 2009), that differ from each other mainly on two points:
• supervised/unsupervised: the aggregation method can take into account the label or not;
• summarizing: once the clustering of redundant variables is done, the classifier can be built on the complete clusters, on some of the variables of each cluster, or on a synthesized variable that sums up the cluster information (for instance a linear combination of the cluster variables).
Variable aggregation should be distinguished from variable compression, where one tries to summarize the information shared by the whole set of variables. While variable compression often improves the classification error rate, it usually does not lead to an interpretable classifier.

2.3.2.2 Classifier aggregation
Bagging Bagging can be understood as an application of the classical bootstrap (Efron and Tibshirani 1993) to classification algorithms. In the classical setting where a parameter θ has to be estimated from a sample, bootstrap consists of re-sampling the sample at hand B times, obtaining an estimation of θ from each sample and then averaging the B estimates. This averaging process results in a bootstrap estimator whose variance is lower than the initial estimator, i.e. the one obtained by estimating θ from the initial sample. The same idea can be exploited in the classification context to improve the performance of an unstable classification algorithm A. Given a training sample, B bootstrap samples are drawn. A classifier f_A^b is obtained by applying algorithm A to each bootstrap sample, and the bagging classifier is then defined as

f_A^{bag}(x0) = +1 if \hat{p}_{+1}(x0) > \hat{p}_{-1}(x0), and −1 otherwise,   (2.9)

where \hat{p}_{\ell}(x0) = (1/B) \sum_{b=1}^{B} I{f_A^b(x0) = \ell}.
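A minimal sketch of (2.9) in Python is given below; it is illustrative only, and the decision stump used as the unstable base learner is our own toy choice, not part of the chapter.

import numpy as np

def bagging_predict(x0, X, y, base_fit, B=50, rng=None):
    """Train B classifiers on bootstrap samples and take a majority vote at x0.
    base_fit(Xb, yb) must return a function x -> {-1, +1}; it plays the role of algorithm A."""
    rng = rng or np.random.default_rng(0)
    n = len(y)
    votes = []
    for _ in range(B):
        idx = rng.integers(0, n, n)        # bootstrap sample, drawn with replacement
        f_b = base_fit(X[idx], y[idx])     # classifier built on the b-th bootstrap sample
        votes.append(f_b(x0))
    return 1 if np.mean(np.array(votes) == 1) > 0.5 else -1

def stump_fit(Xb, yb):
    """A deliberately unstable base learner: a one-variable decision stump.
    (Assumes both classes are present in the bootstrap sample.)"""
    diff = Xb[yb == 1].mean(axis=0) - Xb[yb == -1].mean(axis=0)
    j = int(np.argmax(np.abs(diff)))       # variable whose class means differ most
    s = Xb[:, j].mean()
    sign = 1 if diff[j] > 0 else -1
    return lambda x: sign if x[j] > s else -sign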
While bagging can be applied to any classification algorithm, one may wonder in which circumstances it is fruitful to use it. This can be theoretically answered by looking at the regression case. Assume for this section that the variable Y is quantitative, and that an algorithm A is available to predict Y from X. Let μ_A(x0) = E_P[f_A(x0)] denote the average prediction at a fixed point x0 (with unknown label Y0) over all the possible classifiers obtained by applying A to a sample drawn from the joint distribution P of (X, Y). We have (Breiman 1996):

E_P[(Y0 − f_A(x0))^2] = E_P[(Y0 − μ_A(x0))^2] + V_P[f_A(x0)],
where V_P[f_A(x0)] is the variance of the prediction at point x0. From this equation we deduce two important results:
• The average discrepancy between the true label of x0 and the prediction given by f_A(x0) is always higher than the discrepancy between the true label of x0 and the average prediction μ_A(x0).
• The difference between E_P{[Y0 − f_A(x0)]^2} and E_P{[Y0 − μ_A(x0)]^2} is the variance V_P[f_A(x0)], which directly measures the stability of the prediction of algorithm A at point x0.
The first point means that it is always better (equivalent at worst) to use μ_A(x0) rather than f_A(x0). Since μ_A(x0) cannot be computed in practice (since P is unknown), we can use its bagging version (2.9) as a surrogate. The second point shows that the relative improvement due to bagging will be greater as the variance (instability) of algorithm A increases. As an example, the CART algorithm described in Section 2.3.1.2 is a classical example of an unstable classification algorithm (because of its internal variable selection process), and many articles demonstrated the efficiency of bagging when applied to CART. In contrast, the kNN algorithm is a much more stable algorithm, and application of bagging to kNN yields little improvement of the classification performance (Breiman 1996). Choosing the number of bootstrap samples is an open question. Selecting a small B can lead to suboptimality of the resulting classifier in terms of classification performance, but increasing B comes at a cost in terms of computational burden that is not worthwhile if the gain in performance is marginal. A discussion on this topic can be found in Sutton (2005).

Boosting In Schapire (2002), it is argued that building a classifier with good prediction performance is difficult, but building moderately efficient classifiers is often simple. The idea is then to combine such classifiers to build a prediction rule that will be more accurate than any one of the basic classifiers. The boosting algorithm builds the combination iteratively: at each step a new basic classifier is added to the combination. This classifier is trained on a weighted version of the original training set, where increased weights are given to observations that were frequently misclassified during the previous iterations. The final combination then votes to predict the label of any new observation. Many boosting algorithms have been proposed. Here we briefly introduce the Adaboost algorithm of Freund and Schapire (1997). For a given classification algorithm A, starting with initial weights w_i^1 = 1/n for observations i = 1, . . . , n, Adaboost runs B times the following program:
1. Build classifier f_A^b on the training set with weights w_i^b.
2. Compute err_b = \sum_{i=1}^{n} w_i^b I{f_A^b(x_i) ≠ y_i} / \sum_{i=1}^{n} w_i^b.
3. Compute α_b = log[(1 − err_b)/err_b].
4. Set w_i^{b+1} ∝ w_i^b exp(α_b I{f_A^b(x_i) ≠ y_i}).
The final classifier is then:

f_A^{boost}(x0) = +1 if \sum_{b=1}^{B} α_b f_A^b(x0) > 0, and −1 otherwise.   (2.10)
Equation (2.10) and point 3 show that the weight α_b given to classifier f_A^b in the final voting rule depends on its performance err_b: efficient classifiers receive more weight in the final decision. Point 4 of the program shows that points misclassified during step b will have their weights inflated by a factor exp(α_b) at step b + 1. While many applications showed the efficiency of boosting in practice (Dettling and Buhlmann 2003), one may wonder why boosting works. Compared with bagging, boosting does not rely on the fact that many classifiers are built on samples drawn from the same empirical distribution, since the distribution (i.e. the weights) changes at each boosting iteration. In fact, theoretical arguments have been advanced that show that the way boosting concentrates on observations that are hard to classify is a key to its accuracy (Schapire et al. 1998). Hard-to-classify points often lie on the boundary between the two classes in X, i.e. in regions where p+1(x) ≈ p−1(x), hence the need to focus on these points to sharply evaluate these two quantities at stake in (2.7). As for bagging, boosting can be applied to any classification algorithm, and in practice is often (but not only) applied to trees of moderate size or neural networks (Schwenk and Bengio 1998; Dietterich 2000). The choice of the number of iterations is also critical. In bagging, increasing the number of iterations makes the resulting classification rule more robust. In boosting, increasing the number of iterations may lead to over-fitting of the data. Over-fitting refers to classifiers that exhibit optimistic performance on the training set that is not reproducible on future data to classify. This usually happens when the classification algorithm is free to adapt to the training data without restriction. In boosting, the higher the number of iterations, the higher the risk of over-fitting. However, some authors observed that in practice boosting was robust to over-fitting, even for a large number of iterations. This behaviour is still under theoretical investigation (Friedman et al. 1998).
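The Adaboost recursions above translate directly into code. The following Python sketch is ours (not from the chapter); base_fit is a placeholder for any weighted learning routine.

import numpy as np

def adaboost(X, y, base_fit, B=20):
    """Steps 1-4 of the Adaboost program; base_fit(X, y, w) must return a classifier x -> {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    classifiers, alphas = [], []
    for b in range(B):
        f_b = base_fit(X, y, w)
        pred = np.array([f_b(x) for x in X])
        err = np.sum(w * (pred != y)) / np.sum(w)   # weighted error err_b
        err = np.clip(err, 1e-10, 1 - 1e-10)        # guard against division by zero
        alpha = np.log((1 - err) / err)             # classifier weight alpha_b
        w = w * np.exp(alpha * (pred != y))         # inflate weights of misclassified points
        w = w / w.sum()
        classifiers.append(f_b)
        alphas.append(alpha)
    # final voting rule of Equation (2.10)
    return lambda x0: 1 if sum(a * f(x0) for a, f in zip(alphas, classifiers)) > 0 else -1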
2.3.3 Regularization
Once a classification algorithm is chosen and the training data collected, obtaining a classifier usually requires the solution of an optimization problem. The function to optimize can be directly the classification error rate (or one of its surrogates, such as the resubstitution error rate, see Section 2.3.4), or some other objective function that depends on the training sample. In large dimension problems, a direct resolution of the optimization problem may lead to over-fitting. Regularization refers to the strategy that consists of adding some constraints to the optimization process to avoid the aforementioned problems.
2.3.3.1 L2 regularization
Penalized Logistic Regression (LR) To build a LR classifier, one has to estimate parameters α_j, j = 0, . . . , p. This is done by minimizing the negative log-likelihood of the training data

\sum_{i=1}^{n} log{1 + exp[−y_i g(x_i)]},
with notation of Equation (2.8). In large dimension problems, applying minimization without any constraint on the norm of the parameters will lead to inconsistent parameter estimates, with possibly infinite values. To
avoid this problem, one can consider the quadratically regularized log-likelihood (Zhu and Hastie 2004):

\sum_{i=1}^{n} log{1 + exp[−y_i g(x_i)]} + λ ||g||_2^2,   (2.11)

where the last term is a quadratic penalization function defined as ||g||_2^2 = \sum_{j=1}^{p} α_j^2.
Support Vector Machine (SVM) SVM is another popular classification algorithm, that can be understood as a classification method with an embedded regularization process. Indeed, the SVM algorithm (Vapnik 1995) fits a function g(x) = α_0 + \sum_{j=1}^{p} α_j x_j that minimizes

\sum_{i=1}^{n} (1 − y_i g(x_i))_+ + λ ||g||_2^2,   (2.12)
where (u)_+ = u if u is positive, 0 otherwise. The resulting SVM classifier is

f_SVM(x) = +1 if g(x) > 0, and −1 otherwise.

Interpretation It is worth noticing that expressions (2.11) and (2.12) are similar, and lead to a general interpretation of regularized methods. The regularized optimization problem can be rewritten as:

\sum_{i=1}^{n} ℓ(g, (x_i, y_i)) + λ R(g).
The loss function ℓ(·, ·) evaluates the quality of adjustment of function g to the training data: if g(x_i) and y_i have the same sign, then the classifier associated with g makes the right prediction for x_i, and the loss is low. If the signs of g(x_i) and y_i differ, the loss is high. Note that the loss increases according to −y_i g(x_i): severe classification errors make the loss higher. The penalization function R(·) takes into account the statistical cost of function g. For example, in (2.11) and (2.12), the regularization function is the Euclidean norm of the parameters (also called the L2-norm). This term implies that, given two functions g1 and g2 with identical fits to the training set, the simplest one (in terms of norm) should be preferred. Regularization amounts to finding a trade-off between adjustment and cost. Depending on parameter λ, emphasis will be put either on fit or on cost. Tuning parameter λ to achieve optimal performance is a difficult task. In practice, λ can be chosen by cross-validation (see Section 2.3.4).
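As a concrete illustration of the L2-regularized criterion (2.11), the sketch below fits a penalized logistic regression by plain gradient descent. The code is ours; the learning rate, number of iterations and the convention of leaving the intercept unpenalized are arbitrary but common choices, and λ would in practice be tuned by cross-validation.

import numpy as np

def fit_penalized_lr(X, y, lam=1.0, n_iter=500, lr=0.1):
    """Minimize the averaged logistic loss (1/n) sum_i log(1 + exp(-y_i g(x_i)))
    plus lam * ||alpha||_2^2, with g(x) = alpha_0 + x' alpha."""
    n, p = X.shape
    alpha0, alpha = 0.0, np.zeros(p)
    for _ in range(n_iter):
        margin = y * (alpha0 + X @ alpha)
        grad_m = -1.0 / (1.0 + np.exp(margin))      # d/dm of log(1 + exp(-m))
        alpha0 -= lr * np.mean(grad_m * y)           # intercept is not penalized
        alpha -= lr * (X.T @ (grad_m * y) / n + 2 * lam * alpha)
    return alpha0, alpha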
2.3.3.2 Other regularization functions
We observed that penalized LR and SVM correspond to the same regularization strategy, applied with different loss functions. It is also possible to consider alternative regularization functions, that are more adapted to the nature of the data, or to the classification problem at hand.

L1 Regularization In large dimension problems, it is sometimes assumed that only a small subset of variables is informative to predict the label. In this situation, a relevant classifier should only depend on this (unknown) subset. While L2 regularization globally shrinks coefficients, in practice, even for a large value of λ, none or only a few of them will be shrunk exactly to 0. Conversely, the use of the L1 regularization function R(g) = \sum_{j=1}^{p} |α_j| will tend to produce sparse solutions where only a few parameters will be nonzero (Tibshirani 1996; Lee et al. 2006). Therefore, L1 regularization (also called Lasso regularization) is commonly used as a variable selection strategy.
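The sparsity-inducing effect of the L1 penalty can be illustrated with the soft-thresholding operator, which is the elementary building block of most Lasso solvers (e.g. within coordinate descent or proximal-gradient schemes). This small sketch is ours and only meant to contrast L1 with L2 shrinkage.

import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam * |.|: coefficients below lam are set exactly to 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
print(soft_threshold(z, 0.5))     # small coefficients become exactly 0 (sparsity)
print(z / (1 + 2 * 0.5))          # L2 shrinkage only rescales: no exact zeros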
Fused Lasso Regularization In some situations, a natural ordering of the predictor variables can be defined. Consider for instance Comparative Genomic Hybridization (CGH) array experiments where variables correspond to probe measurements. Since two adjacent probes on a given chromosome are likely to share the same copy number, their associated weights in the decision rule should be similar. A relevant classifier should be sparse (only some chromosomal regions are informative), and parameter estimates corresponding to adjacent variables (i.e. adjacent probes) should be identical (Rapaport et al. 2008; Tibshirani and Wang 2008). For this purpose, the following regularization function can be considered:

R(g) = λ_1 \sum_{j=1}^{p} |α_j| + λ_2 \sum_{j} |α_j − α_{j+1}|.
The first term corresponds to the Lasso penalty that guarantees sparsity, and the second term causes the classification algorithm to produce similar weights for adjacent variables. These two examples illustrate the flexibility of regularization methods, and shed light on their ability to benefit from prior information about the data structure and to produce classifiers with good prediction performance. In practice, the implementation of regularization methods may raise difficult optimization problems, which have to be taken into account when elaborating a new regularization function.
2.3.4 Performance assessment
In previous sections we have presented many algorithms to build or combine classifiers. While some algorithms may be preferred for some specific applications, we usually have little clue about how to select the classification algorithm to use. In practice, the choice will often be based on the respective performance of each classifier. It is therefore critical to have at our disposal efficient tools to assess the performance of any classifier, both for evaluation and comparison. An intuitive strategy would be to evaluate the classifier \hat{f} on the training data, and compare the predictions obtained with the true labels. The hat notation in \hat{f} underlines the fact that \hat{f} was already built using the training data. More precisely, we can compute the empirical error rate (EER, also called the resubstitution error rate)

EER(\hat{f}) = (1/n) \sum_{i=1}^{n} I{\hat{f}(x_i) ≠ y_i}.
Unfortunately, this strategy leads to a biased estimation of the true classification error rate of the classifier, as defined in (2.6). Indeed, since the data are used both for building and evaluating the classifier, EER will lead to an optimistic estimation of the true error rate. Therefore no fair evaluation or comparison between classifiers is possible using EER. From this, we conclude that an unbiased evaluation necessitates splitting the data into two distinct samples, one for construction and one for evaluation. The next section presents different strategies of splitting that are commonly used.

2.3.4.1 Hold-out and cross-validation
Hold-Out (HO) Usually, a single sample is collected at once, and then split into a train and a test sample, for construction and evaluation, respectively. This strategy is called Hold-out. Defining e = (X_1, Y_1), . . . , (X_{n−m}, Y_{n−m}) and ē = (X_{n−m+1}, Y_{n−m+1}), . . . , (X_n, Y_n), the HO estimator is

HO(\hat{f}^e) = (1/m) \sum_{j=n−m+1}^{n} I{\hat{f}^e(X_j) ≠ Y_j} = (1/m) \sum_{j ∈ ē} I{\hat{f}^e(X_j) ≠ Y_j},

where \hat{f}^e is the classifier built on sample e. Although HO is unbiased, the splitting of the data into train/test induces some randomness due to the arbitrary choice of the withdrawn data points.
Leave-m-Out (LmO) To avoid an arbitrary choice between classifiers based on a particular splitting of the data, every possible subset of m observations is successively left out of the initial sample and used as the test set. The algorithm with the smallest averaged classification error rate is then selected. The LmO estimator is then

LmO(\hat{f}) = \binom{n}{m}^{−1} \sum_{e} [(1/m) \sum_{j ∈ ē} I{\hat{f}^e(X_j) ≠ Y_j}] = \binom{n}{m}^{−1} \sum_{e} HO(\hat{f}^e),

where the sum is over all the possible subsets e of size n − m. While LmO solves the randomness problem of HO, its computational cost is prohibitive for large values of m, and in practice only the leave-one-out (LOO) procedure is routinely used.

K-Fold (KF) An intermediate strategy consists in dividing the complete dataset into K subsamples ē_1, . . . , ē_K with roughly equal size n/K, each subsample being successively used for validation. This leads to the following estimator:

KF(\hat{f}) = (1/K) \sum_{k} HO(\hat{f}^{e_k}).
KF with m = n/K is a compromise between HO and LmO: it is more stable than HO since it averages the results of different samplings, and its computational cost is much lower than that of LmO.

Choosing the Train/Test Proportions All these procedures require the tuning of a parameter that balances the sizes of the training and test sets: m for HO and LmO, K for KF. Choosing different values for this parameter yields different evaluations of the performance, and sometimes different conclusions about which classifier performs best. The optimal tuning of this parameter is still an open question. A review can be found in Arlot and Celisse (2010).

About the Good Use of Cross-Validation We emphasize again that all the resampling methods presented here assume that the test set is never used in any step of the training. In particular, selecting a subset of variables, or tuning the inner parameters of a classification algorithm, are part of the training process and therefore should not involve the test set. In Ambroise and McLachlan (2002) for instance, the authors showed that the performance estimation could be severely biased when the variable selection step is performed external to, rather than within, the cross-validation loop, an error that still occurs in publications.
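The K-fold estimator can be sketched in a few lines of Python (the code and the choice K = 5 are ours, for illustration only). Note that any variable selection or parameter tuning must happen inside the loop, using the training part only.

import numpy as np

def kfold_error(X, y, fit, K=5, rng=None):
    """Average the hold-out error over K validation folds.
    fit(X_train, y_train) must return a classifier x -> label."""
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        f_k = fit(X[train], y[train])            # the test fold is never used here
        pred = np.array([f_k(x) for x in X[test]])
        errors.append(np.mean(pred != y[test]))
    return float(np.mean(errors))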
2.3.4.2 ROC curves
So far, we worked under the implicit hypothesis that all types of classification error have the same cost: whether an observation is misclassified as +1 or misclassified as −1 is equivalent. In many applications this assumption does not hold: for instance, in cancer diagnosis, to misclassify an individual as ‘healthy’ or ‘ill’ does not have the same consequences. In such cases, one needs to focus either on the sensitivity or the specificity of the classification rule, where

Sensitivity = Pr{f(X) = 1 | Y = 1} and Specificity = Pr{f(X) = −1 | Y = −1}.

In Section 2.3.1.1, we defined the optimal classifier as

f_B(x) = +1 if p_{+1}(x) > t, and −1 otherwise,
where t = 0.5. This particular choice of threshold t assumes a balance between sensitivity and specificity. It is possible to change the trade-off between these two quantities by tuning the threshold: the higher t, the higher the specificity (and the lower the sensitivity). The evolution of the sensitivity/specificity trade-off can be investigated for a given classifier via the ROC curve, that plots sensitivity versus 1−specificity as t varies from 0 to 1. Figure 2.1(b) exhibits four ROC curves obtained from classifiers A, B, C and D. The bottom-left corner corresponds to high values of t, for which specificity is high and sensitivity low. As t decreases, sensitivity increases whereas specificity decreases. For instance, if one requests a specificity of at least 80% then classifier B will score a sensitivity of 65%. ROC curves may also be useful to compare classifiers. For instance, in Figure 2.1 classifier C should be preferred to the other ones since its ROC curve is uniformly higher. In practice though, different classifiers may be optimal for different sets of values of t, resulting in no obvious leader: in Figure 2.1, if a high specificity is requested, one should prefer B to D. In contrast, if the goal is to achieve high sensitivity, D should be preferred to B. The area under the curve (AUC) provides a summary of the classifier performance over all values of t. A perfect classifier with high sensitivity and specificity at the same time would score an AUC of 1. Classifiers can then be compared using their AUC scores, according to the rule ‘the higher the better’. According to this criterion, classifier B should be preferred to classifier D since their AUC scores are 0.78 and 0.71, respectively.
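A minimal sketch of how ROC points and the AUC can be computed from estimated posterior probabilities is given below; the code is ours and uses the standard rank-based (Mann–Whitney) expression of the AUC rather than numerical integration of the curve.

import numpy as np

def roc_points(scores, labels):
    """Sensitivity and 1-specificity for every threshold t on the score p_hat_{+1}(x)."""
    thresholds = np.unique(scores)[::-1]
    sens, one_minus_spec = [], []
    for t in thresholds:
        pred = np.where(scores >= t, 1, -1)
        sens.append(np.mean(pred[labels == 1] == 1))              # sensitivity
        one_minus_spec.append(np.mean(pred[labels == -1] == 1))   # 1 - specificity
    return np.array(one_minus_spec), np.array(sens)

def auc(scores, labels):
    """AUC = probability that a positive observation scores higher than a negative one."""
    pos, neg = scores[labels == 1], scores[labels == -1]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))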
2.4 Class discovery
Clustering is one of the most widely used techniques to find some structure within a dataset of entities. This problem is also known as ‘unsupervised classification’ since the classes are discovered based on a set of unlabelled observations. This is in strong contrast with the previous section, where classes were ‘learned’ based on a set of labelled observations. Clustering aims at finding an underlying structure (i.e. clusters) that is a priori unknown. Although other descriptive techniques, such as principal component analysis (PCA), may sometimes reveal the existence of clusters, this is not their primary goal. They are therefore not considered as clustering methods. Gene clustering based on expression data has become commonplace in the last few decades and tens (hundreds?) of strategies have been proposed to this end. Two main approaches can be distinguished: geometric (or distance-based) techniques and probabilistic (or model-based) techniques. We will briefly introduce the former and concentrate on the latter. As for the notations, we consider a dataset of n observations (i = 1, . . . , n), each characterized by a vector X_i, that we aim at clustering into K distinct clusters. Depending on the application, K may be known a priori or not.

2.4.1 Geometric methods

2.4.1.1 Hierarchical algorithms
Distance-based clustering traces back to the early stages of taxonomy (Sokal and Sneath 1963). The most typical algorithms provide a hierarchical structure and work as follows. Start with n clusters each made of one observation. At each step the two closest individuals (or groups of individuals) are merged to make a new cluster (Hastie et al. 2001). The number of clusters is hence reduced by 1 at each step. The clustering process is often depicted as a tree, also called a ‘dendrogram’, with n leaves corresponding to each individual, and one root corresponding to the final cluster that includes them all. Such algorithms are based on three key ingredients [a more detailed presentation can be found in Mary-Huard et al. (2006)]. Distance between observations. We first need a distance (or similarity, or dissimilarity) measure d(i, j) between observations, to decide which ones have to be first clustered. It is not possible to summarize the huge
variety of distances that can be considered. It goes from simple Euclidean distances d(i, j) = ||X_i − X_j|| when observations X_i are vectors of real numbers [e.g. gene expression levels, see D'haeseleer (2005) for some examples] to so-called kernels when observations are made of complex heterogeneous descriptors (Schölkopf and Smola 2002).
Aggregation criterion. Once the hierarchical clustering process has started, some observations are gathered into clusters that are themselves to be clustered. This requires defining a second distance D(A, B), where A and B are clusters. Again, there exists an impressive list of such distances (single linkage, diameter, UPGMA, etc.), which often give their name to the clustering algorithm itself.
Stopping rule. Because clustering aims at finding a ‘reasonable’ number of groups, the process has to be stopped before it ends with one single cluster. In the absence of modelling, the choice of the number of clusters often relies on some heuristics, such as finding long branches in the dendrogram, or finding change-points in the behaviour of the variance between clusters along the clustering process.
2.4.1.2 K means
Nonhierarchical methods where the number of groups is known a priori also exist. The most emblematic is probably the K means algorithm where the K clusters are associated with K centroids (Hastie et al. 2001). The algorithm starts with K randomly chosen centroids and then alternates two steps:
1. Each observation is clustered with its closest centroid;
2. Centroids are updated as the mean of the observations clustered together.
This kind of algorithm is known to converge very quickly and to be very sensitive to the choice of the initial centroids (‘seeds’).
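The two alternating steps can be sketched as follows (our illustrative code; in practice one would restart from several random seeds precisely because of this sensitivity to initialization).

import numpy as np

def kmeans(X, K, n_iter=100, rng=None):
    """Plain K means: alternate assignment (step 1) and centroid update (step 2)."""
    rng = rng or np.random.default_rng(0)
    centroids = X[rng.choice(len(X), K, replace=False)]    # random initial 'seeds'
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                       # step 1: closest centroid
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):           # stop when assignments stabilize
            break
        centroids = new_centroids
    return labels, centroids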
2.4.2 (Discrete) latent variable models
The clustering problem can be set in a model-based setting via the definition of latent random variables Z_i ∈ {1, . . . , K} corresponding to the labels of each observation. (Equivalently, we define the binary variable Z_{ik} which is 1 if Z_i = k and 0 otherwise.) In this setting, the observations X_i are often supposed to be independent conditionally on the Z_i, with label-dependent distribution:

{X_i}_{i=1...n} independent | {Z_i}_{i=1...n},   (X_i | Z_{ik} = 1) ∼ f(·; θ_k).   (2.13)
f is often chosen within a family of parametric distributions (with parameter θk ) that seem relevant for the observed data Xi . Note that f can be any distribution, up to non-parametric regression (Luan and Li 2003) or even more complex structures. Denoting X = {Xi } the observed data and Z = {Zi } the unobserved labels, the conditional log-likelihood of X under (2.13) is
log P(X|Z) = \sum_{i} \sum_{k} Z_{ik} log f(X_i; θ_k).   (2.14)
Such models mostly differ for the distribution of the unobserved labels Z. We give here a quick overview of the most classical models: Mixture. The simplest model assumes that the labels are independent and identically distributed (i.i.d.) with multinomial distribution M(1; π) where π is the vector of proportions of each group (McLachlan and Peel 2000), so we have
log P(Z) = \sum_{i,k} Z_{ik} log π_k.   (2.15)
Hidden Markov Model (HMM). When data are observed along a single direction (such as time or genomic position), it may be relevant to introduce a dependency between the labels of adjacent positions. So-called HMMs often actually refer to hidden Markov chain models where Z is assumed to be a Markov chain with transition matrix π:
log P(Z) = \sum_{i,k,ℓ} Z_{i−1,k} Z_{iℓ} log π_{kℓ}.   (2.16)
Such models are intensively used in sequence analysis (Durbin et al. 1999).
Hidden Markov Random Field (HMRF). When the data have a more general spatial organization (such as pixels in an image), {Z_i} can be assumed to be a Markov random field where each Z_i only depends on its neighbours and is conditionally independent from the rest (Besag 1974).
Stochastic Block Model (SBM). This last example refers to interaction graphs where nodes can be proteins or species linked to each other by edges revealing possible physical interactions or some similarity in terms of phenotype. The observed data are then the (possibly valued) edges {X_{ij}} between the n nodes. The SBM is a mixture model stating that the unknown labels are i.i.d. M(1; π) and that the edge values are conditionally independent with distribution depending on the labels of both nodes (Nowicki and Snijders 2001): X_{ij} | (Z_i = k, Z_j = ℓ) ∼ f(·; θ_{kℓ}). The log-likelihood of Z is hence the same as in (2.15) but, due to the network structure, the conditional log-likelihood of X differs from (2.14):
log P(X|Z) = \sum_{i,j} \sum_{k,ℓ} Z_{ik} Z_{jℓ} log f(X_{ij}; θ_{kℓ}).   (2.17)
2.4.3 Inference

2.4.3.1 Maximum likelihood inference
As for all incomplete data models, the main issue for the inference of such models comes from the unobserved labels Z_i. Although several approaches do exist, ML is the most commonly used. In order to maximize the likelihood of the observed data log P(X), most inference algorithms include a ‘retrieval’ step where the conditional distribution of the unobserved labels given the observations P(Z|X) is calculated or approximated.
Likelihoods The likelihood of the observed data log P(X) is often difficult (or even impossible) to handle, whereas the likelihood of the complete dataset log P(X, Z) = log P(Z) + log P(X|Z) generally has a much more convenient form, as seen in Section 2.4.2. A connection between these likelihoods can be made (Dempster et al. 1977) as

log P(X) = E[log P(X, Z)|X] + H[P(Z|X)]   (2.18)

where H(F) stands for the entropy of distribution F, defined as H(F) = −∫ F(z) log F(z) dz. The calculation of the two elements in the right-hand side requires the evaluation of the conditional distribution P(Z|X), or at least some of its moments. All expectation-maximization (EM) like algorithms alternate
1. E-step: evaluation of P(Z|X);
2. M-step: maximization of E[log P(X, Z)|X] with respect to the parameters.
The M-step does not, in general, raise more difficulties than classical ML inference.
Conditional Distribution The evaluation of P(Z|X) is hence the crucial step for the inference of such models. We go back to some models presented above to illustrate how this can be achieved. Mixture. Because all couples (Xi , Zi ) are independent, the Zi are conditionally independent and the distribution of each of them is given by the Bayes formula:
P(Z_i = k | X) = P(Z_i = k | X_i) = π_k f(X_i; θ_k) / \sum_{ℓ} π_ℓ f(X_i; θ_ℓ).   (2.19)
HMM. Due to the underlying Markov structure, the Z_i are not conditionally independent. However, adding (2.16) and (2.14) shows that only single labels Z_{ik} and couples of adjacent labels Z_{i−1,k} Z_{iℓ} are involved in log P(X, Z). This proves that the Z_i still have a Markovian dependency structure. The conditional distribution P(Z|X) can be obtained via the so-called ‘forward–backward’ recursion, with a linear complexity (Cappé et al. 2005). The forward step iteratively calculates the conditional distribution of Z_i given the past observations: P(Z_i | X_1, . . . , X_i). The backward step completes the calculation of all P(Z_i|X), starting from P(Z_n | X_1, . . . , X_n) = P(Z_n|X), and going back to P(Z_1|X).
HMRF and SBM. The conditional distribution of the labels displays a more intricate dependency structure. In HMRF it is still a Markovian structure but, in the absence of natural order in two dimensions or more, no efficient recursion such as ‘forward–backward’ can be derived. In SBM, the sum of (2.15) and (2.17) involves both single labels Z_{ik} and all couples of labels Z_{ik} Z_{jℓ}. This proves that their conditional dependency structure is a clique (Daudin et al. 2008) and no simplification can be expected. In such models, the conditional distribution P(Z|X) can only be either estimated via Monte-Carlo techniques [giving rise to stochastic versions of EM such as Stochastic EM (SEM) (Celeux and Diebolt 1992) and Stochastic Approximation EM (SAEM) (Delyon et al. 1999)] or approximated with mean-field (Jaakkola 2000) or message-passing (Winn and Bishop 2005) techniques.
Variational Point of View From a more general point of view, these inference strategies can be considered as variational techniques which aim at finding the distribution Q(Z) that best approximates P(Z|X). The most popular strategy consists in minimizing the Kullback–Leibler (KL) divergence KL[Q(Z); P(Z|X)], where KL[F(Z); G(Z)] = ∫ F(z) log[F(z)/G(z)] dz. As the KL divergence is always positive, for any distribution Q(Z), log P(X) can be lower-bounded by log P(X) − KL[Q(Z); P(Z|X)], that can be reformulated as (Jaakkola 2000)

log P(X) ≥ ∫ Q(z) log P(X, z) dz + H(Q).   (2.20)

This lower bound is similar to (2.18), except that P(Z|X) is replaced by Q(Z). Regular EM algorithms correspond to cases (such as mixtures or HMM) where P(Z|X) can actually be computed, so Q(Z) is chosen as Q(Z) = P(Z|X). The divergence is hence set to zero, so (2.20) equals (2.18) and ML inference is achieved. In more complex cases (HMRF, SBM), Q(Z) is chosen as the best possible approximation (in terms of KL divergence) of P(Z|X) among a certain class of manageable distributions. It results in the maximization of a lower bound of the likelihood, which is not equivalent to ML inference (Gunawardana and Byrne 2005). Such strategies can be generalized to many graphical models (Jordan et al. 1999) that will be introduced later in this book. Note that the class of incomplete data models is not limited to those described here, but also includes mixed models described in Section 2.2. Another interesting example of such a modelling can be found in Beal et al. (2005). Note also that a Bayesian version of variational inference does exist in the context of exponential family distributions (Beal and Ghahramani 2003).
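As an illustration of the EM alternation in the simplest case, the following sketch (ours, with arbitrary initialisation choices) fits a univariate Gaussian mixture; the E-step is exactly the Bayes formula (2.19) and the M-step maximizes E[log P(X, Z)|X] in the parameters.

import numpy as np

def em_gaussian_mixture(x, K, n_iter=100):
    """EM for a K-component univariate Gaussian mixture."""
    n = len(x)
    pi = np.full(K, 1.0 / K)
    mu = np.quantile(x, np.linspace(0.1, 0.9, K))    # crude initialisation
    sigma = np.full(K, x.std())
    for _ in range(n_iter):
        # E-step: tau[i, k] = P(Z_i = k | x_i), Equation (2.19)
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of proportions, means and standard deviations
        Nk = tau.sum(axis=0)
        pi = Nk / n
        mu = (tau * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((tau * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    return pi, mu, sigma, tau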
2.4.3.2 Additional issues
Model Selection The determination of the number of groups K is a problem per se. Although in some situations it may be made based on some prior knowledge, K generally needs to be estimated as well as the other parameters. Because K class mixtures are part of the set of K + 1 class mixtures, the maximized likelihood log P(X; \hat{π}_K, \hat{θ}_K) cannot be used to choose K as it always increases when K increases. In the context of ML inference, most criteria consist of adding a penalty to the likelihood, which accounts for the complexity of the model (Burnham and Anderson 1998). The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are the two most famous examples of that kind. The latter is an approximation of the probability P(K, X) based on a Laplace approximation. It is hence only valid for reasonably large datasets. BIC is clearly most popular in unsupervised classification (McLachlan and Peel 2000) and is defined as

BIC(K) = log P(X; \hat{π}_K, \hat{θ}_K) − 0.5 (# parameters) × log n,

where \hat{π} and \hat{θ} are the ML estimates of π and θ. In the example of a HMM where each parameter θ_k has dimension D, there are K(K − 1) independent transition probabilities, so the total number of parameters is K(D + K − 1). In the context of classification, the fit of the model may not be the only desirable property to achieve. It is also important to account for the reliability of the classification, which can be measured by the entropy of the conditional distribution H[P(Z|X)]. A large value of this entropy reveals that the classification is uncertain for a large part of the observations. Biernacki et al. (2000) defined the Integrated Completed Likelihood (ICL) criterion, which consists of adding this entropy to the penalty. Combined with the decomposition (2.18), this reduces to

ICL(K) = E[log P(X; \hat{π}, \hat{θ})|X] − 0.5 (# parameters) × log n.

Variable Selection When the clustering is based on a large set of descriptors (e.g. X_i ∈ R^d, d ≫ 1), the question of selecting the relevant descriptors is raised naturally. The distribution of relevant descriptors is supposed to change from one group to another, whereas irrelevant descriptors keep a constant distribution for all observations. If the labels were known, the problem would reduce to a class comparison problem (see Section 2.2). In the unsupervised context, it must be combined with the determination of the classes and this raises both theoretical and algorithmic issues (Raftery and Dean 2006; Maugis et al. 2009).
Classification As explained above, the inference of latent-variable models provides the (approximate) conditional distribution of the labels P(Z|X). The posterior probability P(Z_i = k|X) may seem sufficient as it provides a ‘fuzzy’ classification. This also informs us about the reliability of the classification: classification is clearly risky when all conditional probabilities are similar. When a formal classification is needed, the maximum a posteriori (MAP) strategy consists of classifying each observation in the class with highest conditional probability:

\hat{Z}_i = arg max_k P(Z_i = k | X).
It is worth noting that, in general, the set of most probable labels {\hat{Z}_i} is not equal to the most probable set of labels \hat{Z}. As an example, in an HMM, the most probable hidden path

\hat{Z} = arg max_z P(Z = z | X)
can be retrieved via the Viterbi algorithm (Cappé et al. 2005). It is not made from the succession of MAP predictions \hat{Z}_i at each position. The difference is due to the distinction between the marginal classification of one observation and the joint classification of all observations. The latter provides a path which is typically more consistent with the estimated transition probabilities than the former, which concentrates on the separate classification of each observation.
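A compact sketch of the Viterbi recursion (ours, working in log-space for numerical stability; the input arrays are assumed to contain log emission, transition and initial probabilities) is given below.

import numpy as np

def viterbi(log_emis, log_trans, log_init):
    """Most probable hidden path argmax_z P(Z = z | X) for an HMM.
    log_emis[i, k] = log f(x_i; theta_k); log_trans[k, l] = log pi_{kl}."""
    n, K = log_emis.shape
    delta = log_init + log_emis[0]             # best log-score of paths ending in each state
    back = np.zeros((n, K), dtype=int)
    for i in range(1, n):
        scores = delta[:, None] + log_trans    # scores[k, l]: best path ending in k, then moving to l
        back[i] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emis[i]
    path = np.zeros(n, dtype=int)
    path[-1] = delta.argmax()
    for i in range(n - 2, -1, -1):             # backtrack through the stored argmax pointers
        path[i] = back[i + 1, path[i + 1]]
    return path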
References Allison DB, Gadbury G, Heo M, Fernandez J, Lee CK, Prolla TA and Weindruch RA 2002 Mixture model approach for the analysis of microarray gene expression data. Comput. Statist. Data Anal. 39, 1–20. Ambroise C and McLachlan J 2002 Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Nat Acad. Sci. USA 99, 6562–6. Arlot S and Celisse A 2010 A survey of cross-validation procedures for model selection. Statist. Surv. 4, 40–79. Beal MJ and Ghahramani Z 2003 The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayes. Statist. 7, 543–52. Beal MJ, Falciani F, Ghahramani Z, Rangel C and Wild D 2005 A bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics 21(3), 349–56. Benjamini Y and Hochberg Y 1995 Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSSB 57(1), 289–300. Besag J 1974 Spatial interaction and the statistical analysis of lattice systems (with discussion). J. R. Statist. Soc. B 36(2), 192–326. Biernacki C, Celeux G and Govaert G 2000 Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Machine Intel. 22(7), 719–25. Blanchard G and Roquain E 2008 Two simple sufficient conditions for FDR control. Electron. J. Statist. 2, 963–92. Blanchard G and Roquain E 2009 Adaptive false discovery rate control under independence and dependence. J. Machine Learn. Res. 10, 2837–71. Breiman L 1996 Bagging predictors. Mach. Learn. 24, 123–40. Breiman L, Friedman J, Olshen R and Stone C 1984 Classification and Regression Trees. Wadsworth International, Belmont, CA. Burnham KP and Anderson RA 1998 Model Selection and Inference: a Practical Information-Theoretic Approach. John Wiley & Sons, Ltd, New York. Capp´e O, Moulines E and Ryd´en T 2005 Inference in Hidden Markov Models. Springer, Berlin. Celeux G and Diebolt J 1992 A stochastic approximation type EM algorithm for the mixture problem. Stochastics 1–2, 119–34. Celisse A and Robin S 2010 A cross-validation based estimation of the proportion of true null hypotheses. J. Statist. Planning Inference. 140, 3132–47. Cressie N 1993 Statistics for Spatial Data. John WIley & Sons, Ltd, New York. Daudin JJ, Picard F and Robin S 2008 A mixture model for random graphs. Stat. Comput. 18(2), 173–83. Delyon B, Lavielle M and Moulines E 1999 Convergence of a stochastic approximation version of the EM algorithm. Ann. Statist. 27(1), 94–128. Demindenko E 2004 Mixed Models: Theory and Applications. John Wiley & Sons, Ltd, Hoboken, NJ. Dempster AP, Laird NM and Rubin DB 1977 Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B 39, 1–38. Dettling M and Buhlmann P 2002 Supervised clustering of genes. Genome Biol. 3(12), 1–15. Dettling M and Buhlmann P 2003 Boosting for tumor classification with gene expression data. Bioinformatics 19(9), 1061–1069. D’haeseleer P 2005 How does gene expression clustering work? Nat. Biotechnol. 23, 1499–501. Dietterich T 2000 An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Mach. Learn. 40(2), 139–57. Diggle P, Liang K and Zeger S 2002 Analysis of Longitudinal Data, 2nd edn. Oxford Science, London. Dobson AJ 1990 An Introduction to Generalized Linear Models. Chapman & Hall, London. Dudoit S, Shaffer JP and Boldrick JC 2003 Multiple hypothesis testing in microarray experiments. Stati. Sci. 
18(1), 71–103. Durbin R, Eddy S, Krogh A and Mitchison G 1999 Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge. Efron B and Tibshirani R 1993 An Introduction to the Bootstrap. Chapman & Hall/CRC, Boca Raton, FL.
Efron B, Tibshirani R, Storey JD and Tusher V 2001 Empirical Bayes analysis of a microarray experiment. J. Am. Statist. Assoc. 96, 1151–60. Fai A and Cornelius P 1996 Approximate F-tests of multiple degree of freedom hypotheses in generalized least squares analyses of unbalanced split-plot experiments. J. Stat. Comput. Simul. 54, 363–78. Fix E and Hodges J 1951 Discriminatory analysis, nonparametric discrimination: consistency properties. Technical Report 4, US Air Force, School of Aviation Medicine. Fix E and Hodges J 1952 Nonparametric discrimination: small sample performance. Technical Report 11, US Air Force, School of Aviation Medicine. Freund Y and Schapire R 1997 A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–39. Friedman J, Hastie T and Tibshirani R 1998 Additive logistic regression: a statistical view of boosting. Ann. Statist. 28, 2000. Friguet C, Kloareg M and Causeur D 2009 A factor model approach to multiple testing under dependence. J. Am. Statist. Assoc. 104(488), 1406–15. Genovese C and Wasserman L 2002 Operating characteristics and extensions of the fdr procedure. J. R. Statist. Soc. B 64(3), 499–517. Genovese C and Wasserman L 2004 A stochastic process approach to false discovery control. Ann. Statist. 32(3), 499–518. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh M, Downing J, Caligiuri M, Bloomfield C and Lander E 1999 Class prediction and discovery using gene expression data. Science 286, 531–7. Gunawardana A and Byrne W 2005 Convergence theorems for generalized alternating minimization procedures. J. Mach. Learn. Res. 6, 2049–73. Hastie T, Tibshirani R, Eisen M, Alizadeh A, Levy R, Staudt L, Chan W, Botstein D and Brown P 2000 ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol. 1(RESEARCH0003), 1–21. Hastie T, Tibshirani R and Friedman J 2001 The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York. Hue M, Riffle M, Vert J and Noble W 2010 Large-scale prediction of protein-protein interactions from structures. BMC Bioinformatics 11, 144. Jaakkola T 2000 Tutorial on variational approximation methods. In Advanced Mean Field Methods: Theory and Practice (eds Opper M and Saad D), pp. 129–59. MIT Press, Cambridge, MA. Jordan MI, Ghahramani Z, Jaakkola T and Saul LK 1999 An introduction to variational methods for graphical models. Mach. Learn. 37(2), 183–233. Kenward M and Roger J 1997 Small sample inference for fixed effects from restricted maximum likelihood. Biometrics 53, 983–97. Kerr MK and Churchill G 2001 Experimental design gene expression microarrays. Biostatistics 2, 183–201. Khuri A, Mathew T and Sinha B 1998 Statistical Tests for Mixed Linear Models. John Wiley & Sons, Ltd, New York. Kim KI and van de Wiel MA 2008 Effects of dependence in high-dimensional multiple testing problems. BMC Bioinformatics 9, 114. Langaas M, Lindqvist B and Ferkingstad E 2005 Estimating the proportion of true null hypotheses, with applications to DNA microarray data. J. R. Statist. Soc. B 67(4), 555–72. Lee S, Lee H, Abbeel P and Ng A 2006 Efficient L1 regularized logistic regression. AAAI. Luan Y and Li H 2003 Clustering of time-course gene expression data using a mixed-effects model with b-splines. Bioinformatics 19(4), 474–82. Mary-Huard T and Robin S 2009 Tailored aggregation for classification. IEEE Trans. Pattern Anal. Mach. Intell. 31, 2098–105. 
Mary-Huard T, Picard F and Robin S 2006 Introduction to statistical methods for microarray data analysis. In Mathematical and Computational Methods in Biology (eds Maass A, Martinez S and Pecou E), pp. 56–126. Hermann, Paris. Maugis C, Celeux G and Martin-Magniette ML 2009 Variable selection in model-based clustering: a general variable role modeling. Comput. Stat. Data Anal. 53(11), 3872–82. McLachlan G and Peel D 2000 Finite Mixture Models. John Wiley & Sons, Ltd, New York.
38 Handbook of Statistical Systems Biology McLachlan G, Bean R and Ben-Tovim Jones L 2006 A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics 22, 1608–15. Michiels S, Koscielny S and Hill C 2005 Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 365, 488–92. Nowicki K and Snijders T 2001 Estimation and prediction for stochastic block-structures. J. Am. Statist. Assoc. 96, 1077–87. Quinlan J 1993 C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc, San Francisco, CA. Raftery A and Dean N 2006 Variable selection for model-based clustering. J. Am. Statist. Assoc. 101(473), 168–78. Rapaport F, Barillot E and Vert J 2008 Classification of arraycgh data using fused svm. Bioinformatics. 24, i375–82. Robin S, Bar-Hen A, Daudin JJ and Pierre L 2007 A semi-parametric approach for mixture models: Application to local false discovery rate estimation. Comput. Statist. Data Anal. 51(12), 5483–93. SAS 2002-03 SAS OnlineDoc 9.1.3. support.sas.com/onlinedoc/913/. Schapire R 2002 The boosting approach to machine learning: an overview. In Nonlinear Estimation and Classification (eds Denison DD, Hansen MH, Holmes C, Mallick B and Yu B), pp. 149–72. Springer, Berlin. Schapire R, Freund Y, Bartlett P and Lee W 1998 Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist. 26(5), 1651–86. Sch¨olkopf B and Smola A 2002 Learning with Kernels. MIT Press, Cambridge, MA. Schwenk H and Bengio Y 1998 Training methods for adaptive boosting of neural networks. In Conference on Advances in Neural Information (ed. Cohn DA), pp. 647–53. Searle S 1971 Linear Models, 1st edn. John Wiley & Sons, Ltd, New York. Sokal RR and Sneath PHA 1963 Principles of Numerical Taxonomy. Freeman, San Francisco, CA. Storey JD 2002 A direct approach to false discovery rate. J. R. Statist. Soc. B 64(3), 479–98. Strimmer K 2008 fdrtool: a versatile R package for estimating local and tail area-based false discovery rates. Bioinformatics 24(12), 1461–2. Sun W and Cai TT 2008 Large-scale multiple testing under dependence. J. R. Statist. Soc. B 71(2), 393–424. Sutton CD 2005 Classification and regression trees, bagging, and boosting. In Handbook of Statistics: Data Mining and Data Visualization (eds Rao C, Wegman E and Solka J), pp. 303–29. Elsevier/North-Holland, Amsterdam. Tempelman RJ 2008 Statistical analysis of efficient unbalanced factorial designs for two-color microarray experiments. Int. J. Plant Genomics 2008, 584360. Tibshirani R 1996 Regression shrinkage and selection via the lasso. J. R. Statist. Soc B 58(1), 267–88. Tibshirani R and Wang P 2008 Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics 9, 18–29. Vapnik V 1995 The Nature of Statistical Learning Theory. Springer Verlag, New York. Verbeke G and Molenberghs G 2000 Linear Mixed Models for Longitudinal Data. Springer Verlag, New York. Westfall PH and Young SS 1993 Resampling-Based Multiple Testing: Examples and Methods for P-value Adjustment. John Wiley & Sons, Ltd, New York. Winn JM and Bishop C 2005 Variational message passing. J. Mach. Learn. Res. 6, 661–94. Wolfinger RD, Gibson G, Wolfinger ED, Bennett L, Hamadeh H, Bushel P, Afshari C and Paules RS 2001 Assessing gene significance from cDNA microarray expression data via mixed models. J. Comput. Biol. 8(6), 625–37. Zhu J and Hastie T 2004 Classification of gene microarrays by penalized logistic regression. 
Biostatistics 5(3), 427–43. Zuur A, Ieno E, Walker N, Saveliev A and Smith G 2009 Mixed Effects Models and Extensions in Ecology with R. Springer, New York.
3
Bayesian Inference and Computation
Christian P. Robert 1,3, Jean-Michel Marin 2 and Judith Rousseau 1,3
1 Université Paris-Dauphine, CEREMADE, France
2 Institut de Mathématiques et Modélisation de Montpellier, Université de Montpellier 2, France
3 CREST and ENSAE, Malakoff, France

3.1 Introduction
This chapter provides an overview of Bayesian inference, mostly emphasising that it is a universal method for summarising uncertainty and making estimates and predictions using probability statements conditional on observed data and an assumed model (Gelman 2008). The Bayesian perspective is thus applicable to all aspects of statistical inference, while being open to the incorporation of information items resulting from earlier experiments and from expert opinions. We provide here the basic elements of Bayesian analysis when considered for standard models, referring to Marin and Robert (2007) and to Robert (2007) for book-length treatments.1 In the following, we refrain from embarking upon philosophical discussions about the nature of knowledge [see, e.g., Robert (2007) chapter 10] opting instead for a mathematically sound presentation of an eminently practical statistical methodology. We indeed believe that the most convincing arguments for adopting a Bayesian version of data analyses are in the versatility of this tool and in the large range of existing applications, rather than in those polemical arguments [for such perspectives, see, e.g., Jaynes (2003) and MacKay (2002)].
3.2 The Bayesian argument

3.2.1 Bases
We start this section with some notations about the statistical model that may appear to be over-mathematical but are nonetheless essential. 1 The
chapter borrows heavily from chapter 2 of Marin and Robert (2007).
Handbook of Statistical Systems Biology, First Edition. Edited by Michael P. H. Stumpf, David J. Balding and Mark Girolami. © 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd.
40 Handbook of Statistical Systems Biology
Given an independent and identically distributed (i.i.d.) sample Dn = (x1 , . . . , xn ) from a distribution represented by its density fθ , with an unknown parameter θ ∈ , like the mean μ of the benchmark normal distribution, the associated likelihood function is (θ|Dn ) =
n
fθ (xi ).
i=1
This quantity is a fundamental entity for the analysis of the information provided about the parameter θ by the sample Dn , and Bayesian analysis relies on this function to draw inference on θ.2 Since all models are approximations of reality, the choice of a sampling model is wide-open for criticisms [see, e.g., Templeton (2008)], but those criticism go far beyond Bayesian modelling and question the relevance of completely built models for drawing inference or running predictions. We will therefore address the issue of model assessment later in the chapter. The major input of the Bayesian perspective, when compared with a standard likelihood approach, is that it modifies the likelihood – which is a function of θ – into a posterior distribution on the parameter θ – which is a probability distribution on defined by π(θ|Dn ) =
(θ|Dn )π(θ) . (θ|Dn )π(θ) dθ
(3.1)
The factor π(θ) in (3.1) is called the prior (often omitting the qualificative density) and it must be determined to start the analysis. A primary motivation for introducing this extra factor is that the prior distribution summarises the prior information on θ; that is, the knowledge that is available on θ prior to the observation of the sample Dn . However, the choice of π(θ) is often decided on practical or computational grounds rather than on strong subjective beliefs or on overwhelming prior information. As will be discussed later, there also exist less subjective choices, made of families of so-called noninformative priors. The radical idea behind Bayesian modelling is thus that the uncertainty on the unknown parameter θ is more efficiently modelled as randomness and consequently that the probability distribution π is needed on as a reference measure. In particular, the distribution Pθ of the sample Dn then takes the meaning of a probability distribution on Dn that is conditional on (the event that the parameter takes) the value θ, i.e. fθ is the conditional density of x given θ. The above likelihood offers the dual interpretation of the probability density of Dn conditional on the parameter θ, with the additional indication that the observations in Dn are independent given θ. The numerator of (3.1) is therefore the joint density on the pair (Dn , θ) and the (standard probability calculus) Bayes theorem provides the conditional (or posterior) distribution of the parameter θ given the sample Dn as (3.1), the denominator being called the marginal (likelihood) m(Dn ). There are many arguments which make such an approach compelling. When defining a probability measure on the parameter space , the Bayesian approach endows notions such as the probability that θ belongs to a specific region with a proper meaning and those are particularly relevant when designing measures of uncertainty like confidence regions or when testing hypotheses. Furthermore, the posterior distribution (3.1) can be interpreted as the actualisation of the knowledge (uncertainty) on the parameter after observing the data. At this early stage, we stress that the Bayesian perspective does not state that the model within which it operates is the ‘truth’, nor that it believes that the corresponding prior distribution π it requires has a connection with the ‘true’ production of parameters (since there may even be no parameter at all). It simply provides an inferential machine that has strong optimality properties under the right model and that can similarly be evaluated under any other well-defined alternative model. Furthermore, the Bayesian approach 2 Resorting
to an abuse of notations, we will also call (θ|Dn ) our statistical model, even though the distribution with the density fθ is, strictly speaking, the true statistical model.
Bayesian Inference and Computation
41
includes techniques to check prior beliefs as well as statistical models (Gelman 2008), so there seems to be little reason for not using a given model at an earlier stage even when dismissing it as ‘un-true’ later (always in favour of another model). 3.2.2
Bayesian analysis in action
The operating concept that is at the core of Bayesian analysis is that one should provide an inferential assessment conditional on the realised value of Dn , and Bayesian analysis gives a proper probabilistic meaning to this conditioning by allocating to θ a (reference) probability (prior) distribution π. Once the prior distribution is selected, Bayesian inference formally is ‘over’; that is, it is completely determined since the estimation, testing, prediction, evaluation, and any other inferential procedures are automatically provided by the prior and the associated loss (or penalty) function.3 For instance, if estimations θˆ of θ are evaluated via the quadratic loss function ˆ = θ − θ ˆ 2, L(θ, θ) the corresponding Bayes estimator is the expected value of θ under the posterior distribution, θˆ = E[θ|Dn ], for a given sample Dn . For instance, observing a frequency 38/58 of survivals among 58 breast-cancer patients and assuming a binomial B(58, θ) with a uniform U(0, 1) prior on θ leads to the Bayes estimate 1 58 38 θ 38 θ (1 − θ)20 dθ 38 + 1 θ = 0 1 58 = , 38 20 58 +2 0 38 θ (1 − θ) dθ since the posterior distribution is then a beta Be(38 + 1, 20 + 1) distribution. When no specific loss function is available, the posterior expectation is often used as a default estimator, although alternatives also are available. For instance, the maximum a posteriori (MAP) estimator is defined as θˆ = arg max π(θ|Dn ) = arg max π(θ)(θ|Dn ), θ
θ
(3.2)
where the function to maximise is usually provided in closed form. However, numerical problems often make the optimisation involved in finding the MAP estimator far from trivial. Note also here the similarity of (3.2) with the maximum likelihood (ML) estimator: The influence of the prior distribution π(θ) progressively disappears with an incrasing number of observations, and the MAP estimator recovers the asymptotic properties of the ML estimator. See Schervish (1995) for more details on the asymptotics of Bayesian estimators. As an academic example, consider the contingency table provided in Figure 3.1 on survival rate for breastcancer patients with or without malignant tumours, extracted from Bishop et al. (1975), the goal being to distinguish between the two types of tumour in terms of survival probability. We then consider each entry of the table on the number of survivors (first column of figures in Figure 3.1) to be independently Poisson distributed P(Nit θi ), where t = 1, 2, 3, denotes the age group, i = 1, 2, the tumour group, distinguishing between malignant (i = 1) and nonmalignant (i = 2), and Nit is the total number of patients in this age group and for this type of tumour (i.e. the sums of the rows). Therefore, denoting by xit the number of survivors in age group t and tumour group i, the corresponding prbability density is fθi (xit |Nit ) = e−θi Nit
3 Hence
(θi Nit )xit , xit !
the concept, introduced above, of a complete inferential machine.
x ∈ N.
4
5
[Figure 3.1(a) reproduces the following survival data; Figure 3.1(b) plots the two gamma posterior densities against θ over the range 0.4 to 1.2.]

age group    malignant    survive: yes    no
under 50     no           77              10
under 50     yes          51              13
50–69        no           51              11
50–69        yes          38              20
above 70     no            7               3
above 70     yes           6               3

Figure 3.1 (a) Data describing the survival rates of some breast-cancer patients (Bishop et al. 1975) and (b) representation of two gamma posterior distributions on the survival rates θi differentiating between malignant (dashed, i = 1) versus nonmalignant (solid, i = 2) breast cancer survival rates
The corresponding likelihood on θi (i = 1, 2) is thus

L(θi|D3) = ∏_{t=1}^{3} (θi Nti)^{xti} exp{−θi Nti},

which, under an exponential θi ∼ Exp(2) prior with rate parameter 2, leads to the posterior

π(θi|D3) ∝ θi^{x1i + x2i + x3i} exp{−θi (2 + N1i + N2i + N3i)},
i.e. a Gamma(x1i + x2i + x3i + 1, 2 + N1i + N2i + N3i) distribution. The choice of the (prior) exponential parameter 2 corresponds to a prior estimate of 50% survival probability over the period.⁴ In the case of the nonmalignant breast cancers, the parameters of the (posterior) Gamma distribution are a = 136 and b = 161, while, for the malignant cancers, they are a = 96 and b = 133. Figure 3.1 shows the difference between both posteriors, the nonmalignant case being stochastically closer to 1, hence indicating a higher survival rate. [Note that the posterior in this figure gives some weight to values of θ larger than 1. This drawback can be fixed by truncating the exponential prior at 1 or even more easily by a beta-binomial modelling with a binomial B(Nit, θi) distribution on the xit's and a beta Be(1/2, 1/2) prior on the θi's.]
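These posterior parameters can be recovered directly from the data of Figure 3.1(a); the short R sketch below is only illustrative and assumes the table layout reconstructed above (object names are ours):

surv  <- cbind(nonmalignant = c(77, 51, 7),  malignant = c(51, 38, 6))    # survivors x_ti
total <- cbind(nonmalignant = c(87, 62, 10), malignant = c(64, 58, 9))    # row totals N_ti
a <- colSums(surv) + 1          # shape: x_1i + x_2i + x_3i + 1
b <- colSums(total) + 2         # rate:  2 + N_1i + N_2i + N_3i
rbind(shape = a, rate = b)      # (136, 161) and (96, 133), as in the text
curve(dgamma(x, a[1], b[1]), 0.4, 1.2, ylab = "posterior density")   # nonmalignant (solid)
curve(dgamma(x, a[2], b[2]), add = TRUE, lty = 2)                    # malignant (dashed)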
3.2.3 Prior distributions
The selection of the prior distribution is an important issue in Bayesian modelling. When prior information is available about the data or the model, it can be used in building the prior, and we will see some illustrations of this recommendation in the following chapters. In many situations, however, the selection of the prior distribution is quite delicate in the absence of reliable prior information, and generic solutions must be chosen instead. Since the choice of the prior distribution can have a considerable influence on the resulting inference, this choice must be conducted with the utmost care. It is straightforward to come up with examples where a particular choice of the prior leads to absurd decisions.

⁴ Once again, this is an academic example. The prior survival probability would need to be assessed by a physician in a real life situation.
Before entering into a brief description of some existing approaches of constructing prior distributions, note that, as part of model checking, every Bayesian analysis needs to assess the influence of the choice of the prior, for instance through a sensitivity analysis. Since the prior distribution models the knowledge (or uncertainty) prior to the observation of the data, the sparser the prior information is, the flatter the prior should be. There actually exists a category of priors whose primary aim is to minimise the impact of the prior selection on the inference; they are called noninformative priors and we will detail them below.
When the sample model is from an exponential family of distributions⁵ with densities of the form

fθ(x) = h(x) exp{θ · R(x) − Ψ(θ)},     θ, R(x) ∈ R^p,

where θ · R(x) denotes the canonical scalar product in R^p, there exists an associated class of priors called the class of conjugate priors, of the form

π(θ|ξ, λ) ∝ exp{θ · ξ − λΨ(θ)},

which are parameterised by two quantities, λ > 0 and ξ, the rescaled quantity λ⁻¹ξ being of the same nature as R(x). These parameterised prior distributions on θ are appealing for the simple computational reason that the posterior distributions are exactly of the same form as the prior distributions; that is, they can be written as

π(θ|ξ(Dn), λ(Dn)),     (3.3)
where (ξ(Dn), λ(Dn)) is defined in terms of the sample of observations Dn (Robert 2007, section 3.3.3). Equation (3.3) simply says that the conjugate prior is such that the prior and posterior densities belong to the same parametric family of densities but with different parameters. In this conjugate setting, it is the parameters of the posterior density themselves that are 'updated', based on the observations, relative to the prior parameters, instead of changing the whole shape of the distribution. To avoid confusion, the parameters involved in the prior distribution on the model parameter are usually called hyperparameters. (They can themselves be associated with prior distributions, then called hyperpriors.) The computation of estimators, of confidence regions or of other types of summaries of interest on the conjugate posterior distribution often becomes straightforward.

⁵ This framework covers most of the standard statistical distributions, see Lehmann and Casella (1998) and Robert (2007). If this sounds too high a level of abstraction, consider the special cases of Poisson and normal distributions.

As a first illustration, note that a conjugate family of priors for the Poisson model is the collection of gamma distributions Ga(a, b), since

fθ(x)π(θ|a, b) ∝ θ^{a−1+x} e^{−(b+1)θ}

leads to the posterior distribution of θ given X = x being the gamma distribution Ga(a + x, b + 1). [Note that this includes the exponential distribution Exp(2) used on the dataset of Figure 3.1. The Bayesian estimator of the average survival rate, associated with the quadratic loss, is then given by θ̂ = (1 + x1 + x2 + x3)/(2 + N1 + N2 + N3), the posterior mean.]
As a further illustration, consider the case of the normal distribution N(μ, 1), which is indeed another case of an exponential family, with θ = μ, R(x) = x, and Ψ(μ) = μ²/2. The corresponding conjugate prior for the normal mean μ is thus normal, N(λ⁻¹ξ, λ⁻¹). This means that, when choosing a conjugate prior in a normal setting, one has to select both a mean and a variance a priori. (In some sense, this is the advantage of using a conjugate prior, namely that one has to select only a few parameters to determine the prior distribution. Conversely, the drawback of conjugate priors is that the information known a priori on μ either may be insufficient to determine both parameters or may be incompatible with the structure imposed by conjugacy.) Once ξ and λ are selected, the posterior distribution on μ for a single observation x is determined by Bayes' theorem,

π(μ|x) ∝ exp(xμ − μ²/2) exp(ξμ − λμ²/2)
       ∝ exp{−(1 + λ)[μ − (1 + λ)⁻¹(x + ξ)]²/2},

i.e. a normal distribution with mean (1 + λ)⁻¹(x + ξ) and variance (1 + λ)⁻¹. An alternative representation of the posterior mean is

[λ⁻¹/(1 + λ⁻¹)] x + [1/(1 + λ⁻¹)] λ⁻¹ξ,     (3.4)
that is, a weighted average of the observation x and the prior mean λ−1 ξ. The smaller λ is, the closer the posterior mean is to x (under the additional assumption that λ−1 ξ remains fixed as λ goes to zero). The general case of an i.i.d. sample Dn = (x1 , . . . , xn ) from the normal distribution N (μ, 1) is processed in exactly the same manner, since xn is a sufficient statistic with normal distribution N (μ, 1/n): each 1 in (3.4) is then replaced with n−1 . The general case of an i.i.d. sample Dn = (x1 , . . . , xn ) from the normal distribution N (μ, σ 2 ) with an unknown θ = (μ, σ 2 ) also allows for a conjugate processing. The normal distribution does indeed remain an exponential family when both parameters are unknown. It is of the form
(σ²)^{−λσ−3/2} exp{−[λμ(μ − ξ)² + α]/2σ²},

since

π((μ, σ²)|Dn) ∝ (σ²)^{−λσ−3/2} exp{−[λμ(μ − ξ)² + α]/2σ²}
              × (σ²)^{−n/2} exp{−[n(μ − x̄)² + s²x]/2σ²}
              ∝ (σ²)^{−λσ(Dn)} exp{−[λμ(Dn)(μ − ξ(Dn))² + α(Dn)]/2σ²},     (3.5)

where s²x = Σ_{i=1}^n (xi − x̄)². Therefore, the conjugate prior on θ is the product of an inverse gamma distribution on σ², IG(λσ, α/2), and, conditionally on σ², a normal distribution on μ, N(ξ, σ²/λμ).
The apparent simplicity of conjugate priors is however not a reason that makes them altogether appealing, since there is no further (strong) justification to their use. One of the difficulties with such families of priors is the influence of the hyperparameter (ξ, λ). If the prior information is not rich enough to justify a specific value of (ξ, λ), arbitrarily fixing (ξ, λ) = (ξ₀, λ₀) is problematic, since it does not take into account the prior uncertainty on (ξ₀, λ₀) itself. To improve on this aspect of conjugate priors, one solution is to consider a hierarchical prior, i.e. to assume that γ = (ξ, λ) itself is random and to consider a probability distribution with density q on γ, leading to

θ|γ ∼ π(θ|γ),     γ ∼ q(γ),

as a joint prior on (θ, γ). The above is equivalent to considering, as a prior on θ,

π(θ) = ∫ π(θ|γ) q(γ) dγ.
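For the simpler normal mean case, the conjugate updating (3.4) reduces to a two-line computation; the R sketch below (function name and test values are ours) illustrates how the posterior mean moves between the observation and the prior mean λ⁻¹ξ as λ varies:

conj_update <- function(x, xi, lambda)
  c(mean = (x + xi) / (1 + lambda),      # posterior mean, the weighted average (3.4)
    var  = 1 / (1 + lambda))             # posterior variance
conj_update(x = 1.3, xi = 0, lambda = 0.1)   # weak prior: mean close to the observation
conj_update(x = 1.3, xi = 0, lambda = 10)    # strong prior: mean shrunk towards 0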
As a general principle, q may also depend on some further hyperparameters η. Higher order levels in the hierarchy are thus possible, even though the influence of the hyper(-hyper)-parameter η on the posterior distribution of θ is usually smaller than that of γ. But multiple levels are nonetheless useful in complex populations as those found in animal breeding (Sørensen and Gianola 2002). Instead of using conjugate priors, even when mixed with hyperpriors, one can opt for the so-called noninformative (or vague) priors (Robert et al. 2009) in order to attenuate the impact on the resulting inference. These priors are defined as refinements of the uniform distribution, which rigorously does not exist on unbounded spaces. A peculiarity of those vague priors is indeed that their density usually fails to integrate to one since they have infinite mass, i.e. π(θ)dθ = +∞,
and they are defined instead as positive measures, the first and foremost example being the Lebesgue measure on Rp . While this sounds like an invalid extension of the standard probabilistic framework – leading to their denomination of improper priors – it is quite correct to define the corresponding posterior distributions by (3.1), provided the integral in the denominator is defined, i.e. π(θ)(θ|Dn ) dθ = m(Dn ) < ∞. In some cases, this difficulty disappears when the sample size n is large enough. In others like mixture models (see also Section 3.3), the impossibility of using a particular improper prior may remain whatever the sample size is. It is thus strongly advised, when using improper priors in new settings, to check that the above finiteness condition holds. The purpose of noninformative priors is to set a prior reference that has very little bearing on the inference (relative to the information brought by the likelihood function). More detailed accounts are provided in Robert (2007, section 1.5) about this possibility of using σ-finite measures in settings where genuine probability prior distributions are too difficult to come by or too subjective to be accepted by all. While a seemingly natural way of constructing noninformative priors would be to fall back on the uniform (i.e. flat) prior, this solution has many drawbacks, the worst one being that it is not invariant under a change of parameterisation. To understand this issue, consider the example of a Binomial model: the observation x is a B(n, p) random variable, with p ∈ (0, 1) unknown. The uniform prior π(p) = 1 could then sound like the most natural noninformative choice; however, if, instead of the mean parameterisation by p, one considers the logistic parameterisation θ = log[p/(1 − p)] then the uniform prior on p is transformed into the logistic density π(θ) = eθ /(1 + eθ )2 by the Jacobian transform, which is not uniform. There is therefore a lack of invariance under reparameterisation, which in its turn implies that the choice of the parameterisation associated with the uniform prior is influencing the resulting posterior. This is generaly considered to be a drawback. Flat priors are therefore mostly restricted to location models x ∼ p(x − θ), while scale models x ∼ p(x/θ)/θ are associated with the log-transform of a flat prior, that is, π(θ) = 1/θ.
Figure 3.2 Two posterior distributions on a normal mean corresponding to the flat prior (solid) and a conjugate prior (dotted) for a dataset of 90 observations. Source: Marin and Robert (2007)
In a more general setting, the (noninformative) prior favoured by most Bayesians is the so-called Jeffreys' (1939) prior, which is related to Fisher's information I^F(θ) by

πJ(θ) = |I^F(θ)|^{1/2},

where |I| denotes the determinant of the matrix I. Since the mean μ of a normal model N(μ, 1) is a location parameter, the standard choice of noninformative prior is then π(μ) = 1 (or any other constant). Given that this flat prior formally corresponds to the choice λ = ξ = 0 in the conjugate prior, it is easy to verify that this noninformative prior is associated with the posterior distribution N(x, 1). An interesting consequence of this remark is that the posterior density (as a function of the parameter θ) is then equal to the likelihood function, which shows that Bayesian analysis subsumes likelihood analysis in this sense. Therefore, the MAP estimator is also the ML estimator in that special case. Figure 3.2 provides the posterior distributions associated with both the flat prior on μ and the conjugate N(0, σ̂²/10) prior for a crime dataset discussed in Marin and Robert (2007). The difference between the two posteriors is still visible after 90 observations and it illustrates the impact of the choice of the hyperparameter (ξ, λ) on the resulting inference.

3.2.4 Confidence intervals
As should now be clear, the Bayesian approach is a complete inferential approach. Therefore, it covers among other things confidence evaluation, testing, prediction, model checking, and point estimation. Unsurprisingly, the derivation of the confidence intervals (or of confidence regions in more general settings) is based on the posterior distribution π(θ|Dn ). Since the Bayesian approach processes θ as a random variable and conditions upon the observables Dn , a natural definition of a confidence region on θ is to determine C(Dn ) such that π(θ ∈ C(Dn )|Dn ) = 1 − α
(3.6)
where α is either a predetermined level such as 0.05,⁶ or a value derived from the loss function (that may depend on the data). The important difference from a traditional perspective is that the integration here is done over the parameter space, rather than over the observation space. The quantity 1 − α thus corresponds to the probability that a random θ belongs to this set C(Dn), rather than to the probability that the random set contains the 'true' value of θ.

⁶ There is nothing special about 0.05 when compared with, say, 0.87 or 0.12. It is just that the famous 5% level is adopted by most as an acceptable level of error.

Given this interpretation of a confidence set (called a credible set by Bayesians in order to stress this major difference with the classical confidence set), the determination of the best⁷ confidence set turns out to be easier than in the classical sense: it simply corresponds to the values of θ with the highest posterior values,

C(Dn) = {θ; π(θ|Dn) ≥ kα},

where kα is determined by the coverage constraint (3.6). This region is called the highest posterior density (HPD) region.
When the prior distribution is not conjugate, the posterior distribution is not necessarily so easily managed. For instance, if the normal N(μ, 1) distribution is replaced with the Cauchy distribution, C(μ, 1), in the likelihood

ℓ(μ|Dn) = ∏_{i=1}^n fμ(xi) = ∏_{i=1}^n 1 / [π(1 + (xi − μ)²)],

there is no conjugate prior available and we can consider a normal prior on μ, say N(0, 10). The posterior distribution is then proportional to

π̃(μ|Dn) = exp(−μ²/20) ∏_{i=1}^n 1 / (1 + (xi − μ)²).

Solving π̃(μ|Dn) = k is not possible analytically, only numerically, and the derivation of the bound kα requires some amount of trial-and-error in order to obtain the correct coverage. Figure 3.3 gives the posterior distribution of μ for the observations x1 = −4.3 and x2 = 3.2. For a given value of k, a trapezoidal approximation can be used to compute the approximate coverage of the HPD region. For α = 0.95, a trial-and-error exploration of a range of values of k then leads to an approximation of kα = 0.0415 and the corresponding HPD region is represented in Figure 3.3(a).

⁷ In the sense of offering a given confidence coverage for the smallest possible length/volume.

Figure 3.3 (a) Posterior distribution of the location parameter μ of a Cauchy sample for a N(0, 10) prior and corresponding 95% HPD region. Source: Marin and Robert (2007). (b) Representation of a posterior sample of 10³ values of (μ, σ²) for the normal model, x1, . . . , x10 ∼ N(μ, σ²) with x̄ = 0, s² = 1 and n = 10, under Jeffreys' prior, along with the pointwise approximation to the 10% HPD region (darker shading). Source: Robert and Wraith (2009)

As illustrated in the above example, posterior distributions are not necessarily unimodal and thus the HPD regions may include several disconnected sets. This may sound counterintuitive from a classical point of view, but it must be interpreted as indicating indeterminacy, either in the data or in the prior, about the possible values of θ. Note also that HPD regions are dependent on the choice of the reference measure that defines the volume (or surface).
The analytic derivation of HPD regions is rarely straightforward but, due to the fact that the posterior density is most often known up to a normalising constant, those regions can be easily derived from posterior simulations. For instance, Figure 3.3(b) illustrates this derivation in the case of a normal N(θ, σ²) model with both parameters unknown and Jeffreys' prior, when the sufficient statistics are x̄ = 0 and s² = 1, based on n = 10 observations.
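The trial-and-error calibration of kα can be mimicked on a grid; the following R sketch (a crude grid search rather than the trapezoidal scheme of the text, with our own variable names) approximates the 95% HPD region for the two-observation Cauchy example:

x <- c(-4.3, 3.2)
post_unnorm <- function(mu)                     # unnormalised posterior under the N(0, 10) prior
  exp(-mu^2 / 20) / ((1 + (x[1] - mu)^2) * (1 + (x[2] - mu)^2))
mu <- seq(-10, 10, length.out = 1e4); h <- mu[2] - mu[1]
dens <- post_unnorm(mu); dens <- dens / (sum(dens) * h)       # normalise on the grid
ks <- sort(dens)
coverage <- rev(cumsum(rev(ks))) * h            # coverage of {dens >= k} for each candidate k
k_alpha <- ks[max(which(coverage >= 0.95))]
plot(mu, dens, type = "l"); abline(h = k_alpha, lty = 2)      # HPD region: {mu; dens >= k_alpha}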
3.3 Testing hypotheses
Deciding about the validity of some restrictions on the parameter θ or on the validity of a whole model – like whether or not the normal distribution is appropriate for the data at hand – is a major and maybe the most important component of statistical inference. Because the outcome of the decision process is clearcut, accept (coded by 1) or reject (coded by 0), the construction and the evaluation of procedures in this set-up are quite crucial. While the Bayesian solution is formally very close to a likelihood ratio statistic, its numerical values and hence its conclusions often strongly differ from the classical solutions.

3.3.1 Decisions
Without loss of generality, and including the set-up of model choice, we represent null hypotheses as restricted parameter spaces, namely θ ∈ Θ0. For instance, θ > 0 corresponds to Θ0 = R+. The evaluation of testing procedures can be formalised via the 0–1 loss that equally penalises all errors: if we consider the test of H0 : θ ∈ Θ0 versus H1 : θ ∉ Θ0, and denote by d ∈ {0, 1} the decision made by the researcher and by δ the corresponding decision procedure, the loss

L(θ, d) = 1 − d if θ ∈ Θ0,   and   L(θ, d) = d otherwise,

is associated with the Bayes decision (estimator)

δπ(x) = 1 if Pπ(θ ∈ Θ0|x) > Pπ(θ ∉ Θ0|x),   and   δπ(x) = 0 otherwise.

This estimator is easily justified on an intuitive basis since it chooses the hypothesis with the largest posterior probability. The Bayesian testing procedure is therefore a direct transform of the posterior probability of the null hypothesis.

3.3.2 The Bayes factor
A notion central to Bayesian testing is the Bayes factor

Bπ10 = [Pπ(θ ∈ Θ1|x)/Pπ(θ ∈ Θ0|x)] / [Pπ(θ ∈ Θ1)/Pπ(θ ∈ Θ0)],
which corresponds to the classical odds or likelihood ratio, the difference being that the parameters are integrated rather than maximised under each model. While it is a simple one-to-one transform of the posterior probability, it can be used for Bayesian testing without resorting to a specific loss, evaluating the strength of the evidence in favour or against H0 by the distance of log10(Bπ10) from zero (Jeffreys 1939). This somehow ad-hoc perspective provides a reference for hypothesis assessment with no need to define the prior probabilities of H0 and H1, which is one of the advantages of using the Bayes factor. In general, the Bayes factor does depend on prior information, but it can be perceived as a Bayesian likelihood ratio since, if π0 and π1 are the prior distributions under H0 and H1, respectively, Bπ10 can be written as

Bπ10 = ∫_{Θ1} fθ(x)π1(θ) dθ / ∫_{Θ0} fθ(x)π0(θ) dθ = m1(x)/m0(x),

thus replacing the likelihoods with the marginals under both hypotheses. By integrating out the parameters within each hypothesis, the uncertainty on each parameter is taken into account, which induces a natural penalisation for larger models, as intuited by Jeffreys (1939). The Bayes factor is connected with the Bayesian information criterion (BIC, see Robert 2007, chapter 5), with a penalty term of the form d log n/2, which makes the penalisation induced by Bayes factors in regular parametric models explicit. In a wide generality, the Bayes factor asymptotically corresponds to a likelihood ratio with a penalty of the form d* log n*/2 where d* and n* can be viewed as the effective dimension of the model and number of observations, respectively, see Berger et al. (2003) and Chambaz and Rousseau (2008). The Bayes factor therefore offers the major appeal that it does not require to compute a complexity measure (or penalty term) – in other words, to define what is d* and what is n* – which often is quite complicated and may depend on the true distribution.

3.3.3 Point null hypotheses
At face value, testing a point null hypothesis, H0 : θ = θ0, is essentially meaningless (how often can one distinguish θ = 0 from θ = 0.0001?). The prior reflects this difficulty in the construction of the Bayesian procedure, given that, for an absolutely continuous prior π, Pπ(θ = θ0) = 0. The only logical incorporation of the question within the inferential framework is indeed to consider it from a model choice perspective, namely whether or not the simpler model imposing that θ = θ0 is compatible with the observations.
From this perspective, testing point null hypotheses naturally requires a modification of the prior distribution so that, when testing H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1,

π(Θ0) > 0   and   π(Θ1) > 0

hold, whatever the measures of Θ0 and Θ1 for the original prior, which means that the prior must be decomposed as

π(θ) = Pπ(θ ∈ Θ0) × π0(θ) + Pπ(θ ∈ Θ1) × π1(θ)

with positive prior probabilities on both Θ0 and Θ1. Note that this modification makes sense from both informational and operational points of view. Given that H0 : θ = θ0 corresponds to a simpler (or more parsimonious) model, the fact that the hypothesis is tested implies that this simpler model is a possible model and this fact brings some additional prior information on the model structure, hence by simplification on the parameter θ. In particular, if H0 is tested and accepted, this means that, in most situations, the (reduced) model under H0 should be used rather than the (full) model
considered previously. Thus, a prior distribution under the reduced model must be available from the start for potential later inference. (Formally, the fact that this later inference depends on the selection of H0 should also be taken into account.)
In the special case Θ0 = {θ0}, π0 is the Dirac mass at θ0, denoted Iθ0(θ), which simply means that Pπ0(θ = θ0) = 1, and we need to introduce a separate prior weight of H0, namely,

ρ = Pπ(θ = θ0)   and   π(θ) = ρ Iθ0(θ) + (1 − ρ) π1(θ).

Then,

π(Θ0|x) = fθ0(x)ρ / ∫ fθ(x)π(θ) dθ = fθ0(x)ρ / [fθ0(x)ρ + (1 − ρ)m1(x)].
In the case when x ∼ N(μ, σ²) and μ ∼ N(ξ, τ²), consider the test of H0 : μ = 0. We can choose ξ equal to 0 if we do not have additional prior information. Then the Bayes factor is the ratio of marginals under both hypotheses, μ = 0 and μ ≠ 0,

Bπ10 = m1(x)/f0(x) = [σ/√(σ² + τ²)] e^{−x²/2(σ²+τ²)} / e^{−x²/2σ²},

and

π(μ = 0|x) = [1 + (1 − ρ)/ρ · √(σ²/(σ² + τ²)) exp{τ²x²/2σ²(σ² + τ²)}]⁻¹
is the posterior probability of H0. Table 3.1 gives an indication of the values of the posterior probability when the normalised quantity x/σ varies. This posterior probability again depends on the choice of the prior variance τ²: the dependence is actually quite severe, as shown in the next section with the Jeffreys–Lindley paradox.

3.3.4 The ban on improper priors
Unfortunately, this decomposition of the prior distribution into two subpriors brings a serious difficulty related to improper priors, which amounts in practice to banning their use in testing situations. In fact, when using the representation

π(θ) = Pπ(θ ∈ Θ0) × π0(θ) + Pπ(θ ∈ Θ1) × π1(θ),

the probability weights Pπ(θ ∈ Θ0) and Pπ(θ ∈ Θ1) are meaningful only if π0 and π1 are normalised probability densities. Otherwise, they cannot be interpreted as weights.

Table 3.1 Posterior probability of μ = 0 for different values of z = x/σ, with ρ = 1/2, for τ = σ (top row) and τ² = 10σ² (bottom row), when using a normal N(0, τ²) prior on μ under the alternative

z                          0       0.68    1.28    1.96
π(μ = 0|z), τ = σ          0.586   0.557   0.484   0.351
π(μ = 0|z), τ² = 10σ²      0.768   0.729   0.612   0.366

Source: Marin and Robert (2007).
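The entries of Table 3.1 follow directly from the posterior probability formula above; a small R sketch (our own function name, and assuming the square-root form of the variance ratio as written above):

post_prob_null <- function(z, tau2, sigma2 = 1, rho = 0.5) {
  bf10 <- sqrt(sigma2 / (sigma2 + tau2)) *
    exp(tau2 * z^2 / (2 * (sigma2 + tau2)))     # Bayes factor B10, with x = z * sigma
  1 / (1 + (1 - rho) / rho * bf10)
}
z <- c(0, 0.68, 1.28, 1.96)
rbind("tau = sigma"       = post_prob_null(z, tau2 = 1),
      "tau^2 = 10 sigma^2" = post_prob_null(z, tau2 = 10))   # reproduces Table 3.1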
Table 3.2 Posterior probability of H0 : μ = 0 for the Jeffreys' prior π1(μ) = 1 under H1. As explained in the text, these probabilities have no validity and should not be used

x              0.0     1.0     1.65    1.96    2.58
π(μ = 0|x)     0.285   0.195   0.089   0.055   0.014

Source: Marin and Robert (2007).
In the instance when x ∼ N(μ, 1) and H0 : μ = 0, the improper (Jeffreys') prior is π1(μ) = 1; if we write

π(μ) = (1/2) I0(μ) + (1/2) · Iμ≠0,

then the posterior probability is

π(μ = 0|x) = e^{−x²/2} / [e^{−x²/2} + ∫_{−∞}^{+∞} e^{−(x−θ)²/2} dθ] = 1 / [1 + √(2π) e^{x²/2}].

A first consequence of this choice is that the posterior probability of H0 is bounded from above by

π(μ = 0|x) ≤ 1/(1 + √(2π)) = 0.285.

Table 3.2 provides the evolution of this probability as x goes away from 0. An interesting point is that the numerical values are close to the p-values used in classical testing (Casella and Berger 2001). If we are instead testing H0 : θ ≤ 0 versus H1 : θ > 0, then the posterior probability is

π(θ ≤ 0|x) = (1/√(2π)) ∫_{−∞}^0 e^{−(x−θ)²/2} dθ = Φ(−x),

and the answer is now exactly the p-value found in classical statistics.
The difficulty in using an improper prior also relates to what is called the Jeffreys–Lindley paradox, a phenomenon that shows that limiting arguments are not valid in testing settings. In contrast to estimation settings, the noninformative prior no longer corresponds to the limit of conjugate inferences. In fact, for a conjugate prior, the posterior probability

π(θ = 0|x) = [1 + (1 − ρ0)/ρ0 · √(σ²/(σ² + τ²)) exp{τ²x²/2σ²(σ² + τ²)}]⁻¹

converges to 1 when τ goes to +∞, for every value of x, as already illustrated by Figure 3.4. This limiting 'noninformative' procedure differs from the noninformative answer [1 + √(2π) exp(x²/2)]⁻¹ above.
The fundamental issue that bars us from using improper priors on one or both of the sets Θ0 and Θ1 is a normalising difficulty: if g0 and g1 are measures (rather than probabilities) on the subspaces Θ0 and Θ1, the choice of the normalising constants influences the Bayes factor. Indeed, when gi is replaced by ci gi (i = 0, 1), where ci is an arbitrary constant, the Bayes factor is multiplied by c0/c1. Thus, for instance, if the Jeffreys' prior is flat and g0 = c0, g1 = c1, the posterior probability

π(θ ∈ Θ0|x) = ρ0 c0 ∫_{Θ0} fθ(x) dθ / [ρ0 c0 ∫_{Θ0} fθ(x) dθ + (1 − ρ0) c1 ∫_{Θ1} fθ(x) dθ]

is completely determined by the choice of c0/c1. This implies, for instance, that the function [1 + √(2π) exp(x²/2)]⁻¹ obtained earlier has no validity whatsoever. Hence, the numerical values found in Table 3.2
Figure 3.4 Range of the Bayes factor Bπ10 when τ goes from 10⁻⁴ to 10. (Note: The x-axis is in logarithmic scale.) Source: Marin and Robert (2007)
are arbitrary, because they rely on the choice of an arbitrary pair of normalising constants, c0 = c1 = 1. The call to improper priors in settings where the same improper prior cannot be used under both hypotheses is therefore prohibited. (Note that the case of the test of μ > 0 against μ < 0 above escapes this prohibition.) Since improper priors are an important part of the Bayesian approach, there have been many proposals to overcome this ban. Most use a device that transforms the prior into a proper probability distribution by using a portion of the data Dn and then use the other part of the data to run the test as in a standard situation. The variety of available solutions is due to the many possibilities of removing the dependence on the choice of the portion of the data used in the first step. The resulting procedures are called pseudo-Bayes factors. See Robert (2007, chapter 5) for more details.
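Two of the numerical statements above are easy to check in R; the lines below (ours) reproduce the upper bound 0.285 obtained under the improper prior and the agreement with the one-sided p-value for the test of θ ≤ 0, the case that escapes the prohibition:

post_prob_improper <- function(x) 1 / (1 + sqrt(2 * pi) * exp(x^2 / 2))
post_prob_improper(0)                     # upper bound 1/(1 + sqrt(2*pi)) = 0.285
x <- 1.96
post <- integrate(function(t) dnorm(x - t), -Inf, 0)$value   # flat prior: pi(theta <= 0 | x)
c(posterior = post, p_value = pnorm(-x))  # identical to the classical one-sided p-value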
3.3.5 The case of nuisance parameters
In some settings, some parameters are shared by both hypotheses (or by both models) that are under comparison. Since they have the same meaning in each model, the above ban can be partly lifted and a common improper prior can be used on these parameters, in both models. For instance, consider a regression model, represented as

y|X, β, σ ∼ N(Xβ, σ²In),     (3.7)

where X denotes the (n, p) matrix of regressors – upon which the whole analysis is conditioned – y the vector of the n observations, and β is the vector of the regression coefficients. [This is a matrix representation of the repeated observation of

yi = β1 xi1 + . . . + βp xip + σ εi,     εi ∼ N(0, 1),

when i varies from 1 to n.] Variable selection in this set-up means removing covariates, that is, columns of X, that are not significantly contributing to the expectation of y given X. In other words, this is about
              Estimate    BF        log10(BF)
(Intercept)    9.2714    26.334     1.4205 (***)
X1            -0.0037     7.0839    0.8502 (**)
X2            -0.0454     3.6850    0.5664 (**)
X3             0.0573     0.4356   -0.3609
X4            -1.0905     2.8314    0.4520 (*)
X5             0.1953     2.5157    0.4007 (*)
X6            -0.3008     0.3621   -0.4412
X7            -0.2002     0.3627   -0.4404
X8             0.1526     0.4589   -0.3383
X9            -1.0835     0.9069   -0.0424
X10           -0.3651     0.4132   -0.3838

evidence against H0: (****) decisive, (***) strong, (**) substantial, (*) poor

Figure 3.5 R output of a Bayesian regression analysis on a processionary caterpillar dataset with 10 covariates analysed in Marin and Robert (2007). The Bayes factor on each row corresponds to the test of the nullity of the corresponding regression coefficient
testing whether or not a null hypothesis like H0 : β1 = 0 holds. From a Bayesian perspective, a possible noninformative prior distribution on the generic regression model (3.7) is the so-called Zellner's (1986) g-prior, where the conditional⁸ π(β|σ) prior density corresponds to a normal N(0, nσ²(XᵀX)⁻¹) distribution on β, Aᵀ denoting the transposed matrix associated with A, and where a 'marginal' improper prior on σ², π(σ²) = σ⁻², is used to complete the joint distribution. While this construction of the prior may sound too artificial to allow for any meaningful interpretation, this is actually the closest one can get to an exchangeable prior on the components of Xβ due to the noninvertibility of X. The choice of the factor n in the scale amounts to weighting the prior informational input as much as the one from a single observation of the regression model. Obviously a further incentive for using this prior is that it allows for closed-form expressions (Marin and Robert 2007).
With this default (or reference) prior modelling, and when considering the submodel corresponding to the null hypothesis H0 : β1 = 0, with parameters β−1 and σ, we can use a similar g-prior distribution

β−1|σ, X ∼ N(0, nσ²(X−1ᵀ X−1)⁻¹),

where X−1 denotes the regression matrix missing the column corresponding to the first regressor, and σ² ∼ π(σ²) = σ⁻². Since σ is a nuisance parameter in this case, we may use the improper prior on σ² as common to all submodels and thus avoid the indeterminacy in the normalising factor of the prior when computing the Bayes factor

B01 = ∫∫ f(y|β−1, σ, X)π(β−1|σ, X−1) dβ−1 σ⁻² dσ² / ∫∫ f(y|β, σ, X)π(β|σ, X) dβ σ⁻² dσ².

Figure 3.5 reproduces a computer output from Marin and Robert (2007) that illustrates how this default prior and the corresponding Bayes factors can be used in the same spirit as significance levels in a standard

⁸ The fact that the prior distribution on β depends on the matrix of regressors X is not contradictory with the Bayesian paradigm of defining the prior distribution before observing the data, in that the whole model is defined conditional on X. The potential randomness of the regressors is not accounted for in this analysis.
regression model, each Bayes factor being associated with the test of the nullity of the corresponding regression coefficient. For instance, only the intercept and the coefficients of X1, X2, X4, X5 are significant. This output mimics the standard lm R function outcome in order to show that the level of information provided by the Bayesian analysis covers at least the classical output. (We stress that all items in Figure 3.5 are obtained via closed-form formulae.) Obviously, this reproduction of a frequentist output is not the whole purpose of a Bayesian data analysis, quite the opposite: it simply reflects on the ability of a Bayesian analysis to produce automated summaries, just as in the classical case, but the inferential abilities of the Bayesian approach are considerably wider. [For instance, testing simultaneously the nullity of β3, β6, . . . , β10 is of identical difficulty, as detailed in Marin and Robert (2007, chapter 3). An example of inference simplified by the Bayesian approach is prediction, where the distribution of a future yi can be computed in closed form by integrating the parameters on the posterior.]
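To give a concrete sense of how such Bayes factors are produced, the R sketch below computes B01 for dropping a single covariate under a centred g-prior with g = n and π(σ²) = σ⁻²; the closed-form marginal used here is the standard one for this centred prior (up to a factor common to both models), and the data are simulated purely for illustration:

log_marginal <- function(y, X, g = nrow(X)) {     # log m(y | X), up to a common constant
  q <- ncol(X); n <- length(y)
  Py <- X %*% solve(crossprod(X), crossprod(X, y))       # projection of y onto span(X)
  A  <- sum(y^2) - g / (g + 1) * sum(y * Py)
  -q / 2 * log(1 + g) - n / 2 * log(A)
}
set.seed(1)
n <- 50; X <- cbind(1, matrix(rnorm(n * 3), n, 3))
y <- drop(X %*% c(1, 0, 0.8, -0.5) + rnorm(n))
exp(log_marginal(y, X[, -2]) - log_marginal(y, X))  # B01 for removing the first regressor (true coefficient 0)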
3.3.6 Bayesian multiple testing
Recently, massive multiple testing or comparison problems have received attention, in frameworks such as gene expression data where it is of interest to detect genes that are differentially expressed among a (very) large number of genes, or other types of applications; see for instance Müller et al. (2008) for an excellent review on the subject. Consider thus a large number, say m, of tests or comparisons to perform, which we write as γi = 0 if H0 is true and γi = 1 if H1 is true, for i = 1, . . . , m. Typically i indexes the gene in gene expression data. As explained in Chapter 2, if all tests are performed independently with a fixed type one error, the number of false positive tests will increase with m, which is often an undesirable feature. A natural way to take into account this multiplicity in a Bayesian framework is to build a global hierarchical model in the following way. Let zi denote some summary statistic for the ith test (gene), and assume that zi ∼ f0i under H0 and zi ∼ f1i under H1, where both f0i and f1i possibly depend on parameters. For instance, zi ∼ N(θi, σ²), where θi = 0 under H0 and θi ≠ 0 under H1. In this example, the corresponding hierarchical model can be written as (i = 1, . . . , m):

γi ∼iid B(1, 1 − p0),
zi|γi = 1 ∼ N(θi, σ²),   zi|γi = 0 ∼ N(0, σ²),
θi ∼iid πθ(·),   σ² ∼ πσ.
In most cases the probability p0 that a hypothesis is null is unknown, so that in a fully Bayesian approach it is assumed to be distributed from a prior probability πp. An alternative is to use an empirical Bayesian approach by computing a frequentist estimate of p0, see for instance Efron et al. (2001). As shown by Scott and Berger (2006), the individual posterior probabilities of H0, pi = Pr[γi = 0|z], where z = (z1, . . . , zm), decrease, for a given zi, when the number of observations under the null (and m) increases, illustrating the fact that in the computation of the posterior probabilities pi, multiplicity is implicitly taken into account. However the pi depend heavily on the prior distribution πp, see Scott and Berger (2006) for numerical illustration of this dependence. Obviously other models can also be considered for f0 and f1, with possibly nonparametric models for f1, and it is also possible to extend this model to cases where the zi are not assumed to be conditionally independent, see for instance Berry and Berry (2004) and Gopalan and Berry (1998).
It is often of interest to decide which are the tests (genes) to accept (not differentially expressed) among the m different tests; it is then not enough to compute the pi, and thresholds ti must be chosen so that a test is considered as significant if pi ≤ ti. We then set δi = 1 if pi ≤ ti and 0 otherwise. The choice of such a threshold can be obtained by defining a loss function. In Müller et al. (2004), loss functions of the form

L(γ, z) = c F̄D + F̄N,     L(γ, z) = c F̄DR + F̄NR,

or minimising F̄N (respectively F̄NR) subject to F̄D ≤ α (respectively F̄DR ≤ α), are considered, where F̄DR is the posterior expectation of the False Discovery Rate (FDR), i.e. F̄DR = Σ_{i=1}^m δi pi / Σ_{i=1}^m δi, F̄D = Σ_{i=1}^m δi pi, F̄N = Σ_{i=1}^m (1 − δi)(1 − pi), and F̄NR = F̄N [Σ_{i=1}^m (1 − δi)]⁻¹ is the posterior expectation of the ratio of false negatives over the ratio of negatives; see also Chapter 2 for comments on these quantities. They all lead to decision rules of the form δi = 1 if and only if pi ≤ t, where the threshold t depends on the loss function. Alternative loss functions can and have been considered, see Scott and Berger (2006) and Müller et al. (2008), and the choice of the loss function is typically based on specific features and aims of the analysis under study.
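In practice the threshold t is found by scanning the ordered pi until the posterior expected FDR exceeds the target; the R sketch below does this on mock posterior probabilities (all numbers and names are invented for illustration):

set.seed(2)
p <- c(rbeta(900, 8, 1), rbeta(100, 1, 8))   # mock p_i = Pr(gamma_i = 0 | z): mostly nulls
alpha <- 0.05
ord <- order(p)
fdr_bar <- cumsum(p[ord]) / seq_along(p)     # posterior FDR when rejecting the k smallest p_i
k <- sum(fdr_bar <= alpha)                   # largest k keeping the posterior FDR below alpha
delta <- rep(0, length(p)); if (k > 0) delta[ord[seq_len(k)]] <- 1
c(threshold = if (k > 0) p[ord[k]] else NA, rejections = sum(delta))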
3.4 Extensions
The above description of inference is only an introduction and is thus not representative of the wealth of possible applications resulting from a Bayesian modelling. We consider below three extensions inspired from Marin and Robert (2007).

3.4.1 Prediction
When considering a sample Dn = (x1, . . . , xn) from a given distribution, there can be a sequential or dynamic structure in the model that implies that future observations are expected. While more realistic modelling may involve probabilistic dependence between the xi, we consider here the simpler set-up of predictive distributions in i.i.d. settings. If xn+1 is a future observation from the same distribution fθ(·) as the sample Dn, its predictive distribution given the current sample is defined as

fπ(xn+1|Dn) = ∫ f(xn+1|θ, Dn)π(θ|Dn) dθ = ∫ fθ(xn+1)π(θ|Dn) dθ.

The motivation for defining this distribution is that the information available on the pair (xn+1, θ) given the data Dn is summarised in the joint posterior distribution fθ(xn+1)π(θ|Dn) and the predictive distribution above is simply the corresponding marginal on xn+1. This is nonetheless coherent with the Bayesian approach, which then considers xn+1 as an extra unknown.
For the normal N(μ, σ²) set-up, using a conjugate prior on (μ, σ²) of the form

(σ²)^{−λσ−3/2} exp{−[λμ(μ − ξ)² + α]/2σ²},

the corresponding posterior distribution on (μ, σ²) given Dn is

N((λμξ + n x̄n)/(λμ + n), σ²/(λμ + n)) × IG(λσ + n/2, [α + s²x + nλμ(x̄ − ξ)²/(λμ + n)]/2),

denoted by

N(ξ(Dn), σ²/λμ(Dn)) × IG(λσ(Dn), α(Dn)/2),

and the predictive on xn+1 is derived as

fπ(xn+1|Dn) ∝ ∫ (σ²)^{−λσ−2−n/2} exp{−(xn+1 − μ)²/2σ²}
              × exp{−[λμ(Dn)(μ − ξ(Dn))² + α(Dn)]/2σ²} d(μ, σ²)
            ∝ ∫ (σ²)^{−λσ−n/2−3/2} exp{−[(λμ(Dn) + 1)(xn+1 − ξ(Dn))²/λμ(Dn) + α(Dn)]/2σ²} dσ²
            ∝ [α(Dn) + (λμ(Dn) + 1)(xn+1 − ξ(Dn))²/λμ(Dn)]^{−(2λσ+n+1)/2}.

Therefore, the predictive of xn+1 given the sample Dn is a Student's t distribution with mean ξ(Dn) and 2λσ + n degrees of freedom. In the special case of the noninformative prior, λμ = λσ = α = 0 and the predictive is

fπ(xn+1|Dn) ∝ [s²x + n(xn+1 − x̄n)²/(n + 1)]^{−(n+1)/2}.

This is again a Student's t distribution with mean x̄n, scale sx√((n + 1)/n²), and n degrees of freedom.
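The Student's t form of the noninformative predictive can be checked by simulation; the R sketch below (simulated data, our own names) draws (μ, σ²) from the posterior corresponding to λμ = λσ = α = 0 and compares the predictive sample with the stated t distribution:

set.seed(3)
x <- rnorm(10, mean = 2, sd = 1.5); n <- length(x)
xbar <- mean(x); s2 <- sum((x - xbar)^2)
sigma2 <- s2 / rchisq(1e5, df = n)                 # sigma^2 | D_n ~ IG(n/2, s2/2)
mu     <- rnorm(1e5, xbar, sqrt(sigma2 / n))       # mu | sigma^2, D_n ~ N(xbar, sigma^2/n)
x_new  <- rnorm(1e5, mu, sqrt(sigma2))             # predictive draws of x_{n+1}
scale  <- sqrt(s2 * (n + 1) / n^2)                 # Student-t scale from the text
c(simulated = mean(x_new <= xbar + scale), student_t = pt(1, df = n))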
3.4.2 Outliers
Since normal modelling is often an approximation to the 'real thing', there may be doubts about its adequacy. As already mentioned above, we will deal later with the problem of checking that the normal distribution is appropriate for the whole dataset. Here, we consider the somehow simpler problem of assessing whether or not each point in the dataset is compatible with normality. There are many different ways of dealing with this problem. We choose here to take advantage of the derivation of the predictive distribution above: if an observation xi is unlikely under the predictive distribution based on the other observations, then we can argue against its distribution being equal to the distribution of the other observations. For each xi ∈ Dn, we consider fπi(x|Dni) as being the predictive distribution based on Dni = (x1, . . . , xi−1, xi+1, . . . , xn). Considering fπi(xi|Dni) or the corresponding cumulative distribution function (c.d.f.) Fπi(xi|Dni) (in dimension one) gives an indication of the level of compatibility of the observation with the sample. To quantify this level, we can, for instance, approximate the distribution of Fπi(xi|Dni) as uniform over [0, 1] since Fπi(·|Dni) converges to the true c.d.f. of the model. Simultaneously checking all Fπi(xi|Dni) over i may signal outliers.
The detection of outliers must pay attention to the Bonferroni fallacy, which is that extreme values do occur in large enough samples. This means that, as n increases, we will see smaller and smaller values of Fπi(xi|Dni) even if the whole sample is from the same distribution. The significance level must therefore be chosen in accordance with this observation, for instance using a bound a on Fπi(xi|Dni) such that 1 − (1 − a)ⁿ = 1 − α, where α is the nominal level chosen for outlier detection.
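A possible implementation of this outlier check, using the leave-one-out Student-t predictive of the previous section and a Šidák-type per-observation bound a = 1 − (1 − α)^{1/n} (our choice of correction), could look as follows in R:

loo_cdf <- function(x) {                 # F_i(x_i | D_n^i) for each observation
  n <- length(x)
  sapply(seq_len(n), function(i) {
    xi <- x[-i]; m <- mean(xi); s2 <- sum((xi - m)^2); k <- n - 1
    pt((x[i] - m) / sqrt(s2 * (k + 1) / k^2), df = k)
  })
}
set.seed(4)
x <- c(rnorm(30), 6)                     # one suspicious observation appended
u <- loo_cdf(x)
a <- 1 - (1 - 0.05)^(1 / length(x))      # per-observation bound for a 5% nominal level
which(u < a | u > 1 - a)                 # two-sided flags; typically only observation 31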
3.4.3 Model choice
For model choice, i.e. when several models are under comparison for the same observation Mi : x ∼ fi (x|θi ),
i ∈ I,
where I can be finite or infinite, the usual Bayesian answer is similar to the Bayesian tests as described above. The most coherent perspective (from our viewpoint) is actually to envision the tests of hypotheses as particular cases of model choices, rather than trying to justify the modification of the prior distribution criticised by Gelman (2008). This also incorporates within model choice the alternative solution of model averaging, proposed by Madigan and Raftery (1994), which strives to keep all possible models when drawing inference.
The idea behind Bayesian model choice is to construct an overall probability on the collection of models ∪i∈I Mi in the following way: the parameter is θ = (i, θi), i.e. the model index and, given the model index equal to i, the parameter θi in model Mi; then the prior measure on the parameter θ is expressed as

dπ(θ) = Σ_{i∈I} pi dπi(θi),     Σ_{i∈I} pi = 1.

As a consequence, the Bayesian model selection associated with the 0–1 loss function and the above prior is the model that maximises the posterior probability

π(Mi|x) = pi ∫_{Θi} fi(x|θi)πi(θi) dθi / Σ_j pj ∫_{Θj} fj(x|θj)πj(θj) dθj

across all models. Contrary to classical plug-in likelihoods, the marginal likelihoods involved in the above ratio do compare on the same scale and do not require the models to be nested. As mentioned in Section 3.3.5, integrating out the parameters θi in each of the models takes into account their uncertainty, thus the marginal likelihoods ∫_{Θi} fi(x|θi)πi(θi) dθi are naturally penalised likelihoods. In most parametric set-ups, when the number of parameters does not grow to infinity with the number of observations and when those parameters are identifiable, the Bayesian model selector as defined above is consistent, i.e. with increasing numbers of observations, the probability of choosing the right model goes to 1.
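Once the integrated likelihoods are available, the posterior model probabilities follow from a single normalisation; a small R sketch with purely illustrative numbers:

log_m <- c(M1 = -112.4, M2 = -110.9, M3 = -115.2)   # hypothetical log marginal likelihoods
p     <- rep(1 / 3, 3)                              # prior model probabilities p_i
w <- log(p) + log_m
post <- exp(w - max(w)); post / sum(post)           # pi(M_i | x), computed in a stable way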
3.5 Computational issues
The computational challenge offered by Bayesian inference has led to a specific branch of Bayesian statistics addressing these issues. In particular, the past twenty years have witnessed a tremendous and healthy surge in computational Bayesian statistics, due to the development of powerful approximation methods like Markov chain Monte Carlo (MCMC), sequential Monte Carlo (SMC), and approximate Bayesian computation (ABC) techniques. To some extent, this branch of Bayesian statistics is now so intricately connected with Bayesian inference that some notions like Bayesian model choice (Section 3.4.3) and Bayesian model comparison hardly make sense without it. In this final section, we briefly present simulation-based computational methods. For comprehensive entries on those methods, we refer the reader to Chen et al. (2000), Liu (2001) and Robert and Casella (2004, 2009), pointing out that books like Albert (2009) and Marin and Robert (2007) encompass both Bayesian inference and computational methodologies in a single unifying perspective.

3.5.1 Computational challenges
While the previous sections illustrated the Bayesian paradigm through simple closed-form examples, this is not feasible for the majority of problems handled by Bayesian methodology. Complex models essentially
do not allow for closed-form expressions and they require alternative ways of exploration of the posterior distributions, most of which are based upon (computer generated) simulation.
A first and most obvious level of difficulty corresponds to the case when the quantity of interest does not allow for an analytic expression. This situation even occurs with standard conjugate priors, as shown in Section 3.2.4 with the issue of the derivation of a credible region where we were 'forced' to mention simulation before this dedicated section. (The proposed numerical solution usually is quite easy to construct once a simulation method is implemented.) More generally, any mildly complex transform of the parameters will not enjoy a closed-form posterior expectation (Section 3.2.2). For instance, the computation of Bayes factors requires the use of specific computational techniques like path sampling and bridge sampling [see Marin and Robert (2010) for details].
A second common level of difficulty corresponds to the case when the posterior distribution is not a standard distribution. This is essentially the case for all nonconjugate situations (Section 3.2.3). In such settings, the derivation of almost any summary related with the posterior distribution involves a numerical step, which can be stochastic or deterministic. For instance, the analysis of the regression model (3.7) under Zellner's (1986) g-prior led to closed-form expressions in Section 3.3.5, in particular for the Bayes factors. If, instead, we consider the case of the generalised linear models, this facility disappears. We recall that those models (McCullagh and Nelder 1989) assume a dependence of y on x that is partly linear,

y|x, β ∼ f(y|xᵀβ, φ),

where φ > 0 is a dispersion parameter that does not depend on x. The posterior distribution of (β, φ) given (X, y):

π(β, φ|X, y) ∝ ∏_{i=1}^n f(yi|xiᵀβ, φ) π(β, φ|X)     (3.8)
is never available as a standard distribution outside the normal linear model, even when using Marin and Robert (2007) generalised g-prior,

β|X ∼ N(0, n(XᵀX)⁻¹).

For instance, a standard statistical model corresponding to this case is the probit model

P(Y = 1|x) = 1 − P(Y = 0|x) = Φ(xᵀβ),

where Φ denotes the standard normal cumulative distribution function. The corresponding posterior distribution

π(β|X) ∏_{i=1}^n Φ(xiᵀβ)^{yi} Φ(−xiᵀβ)^{1−yi},     (3.9)
is available in closed form, up to the normalising constant, but it is not a standard distribution and thus it cannot be easily handled. A third level of difficulty could be called ‘engorging’ in that we are faced with many cases, each of which is easily handled, but in such a number that the exhaustive examination of all cases is impossible. The most common illustration of this type of difficulty is model choice, described in Section 3.4.3, and in particular variable selection. Indeed, when considering a regression model like (3.7), if we try to compare all possible subsets of covariates as in Section 3.3.5, handling p regressors implies that 2p models are under comparison. Even with closed-form expressions for the evidence, the exploration of the model space is clearly impossible when contemplating 30 or more covariates.
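As an illustration of this second level of difficulty, the probit posterior (3.9) can be evaluated pointwise even though it is not a standard distribution; the R sketch below writes its logarithm under the generalised g-prior, with simulated data and our own names:

set.seed(7)
n <- 100; X <- cbind(1, rnorm(n)); y <- rbinom(n, 1, pnorm(X %*% c(0.5, 1)))
log_post <- function(beta) {                        # log pi(beta | X, y), up to a constant
  eta <- drop(X %*% beta)
  sum(y * pnorm(eta, log.p = TRUE) + (1 - y) * pnorm(-eta, log.p = TRUE)) -
    drop(t(beta) %*% crossprod(X) %*% beta) / (2 * n)    # N(0, n (X'X)^{-1}) prior term
}
log_post(c(0, 0))      # can be computed pointwise, but the normalising constant cannot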
A fourth and final level of difficulty occurs when the likelihood function is itself unavailable. The reason for the difficulty may be that the computation time of one single value of the likelihood function requires several seconds, as in the cosmology analysis of Wraith et al. (2009) where the likelihood is represented by a computer program. It may also be that the only possible representation of the likelihood function is as an integral or a sum over a possibly large number of latent variables corresponding to a joint (unobserved) likelihood, as for instance in stochastic volatility models (Jacquier et al. 1994; Chib et al. 2002) which include as many latent variables as observations. Another illustration is provided by phylogenetic trees (Beaumont et al. 2002; Cornuet et al. 2008), where the likelihood is obtained as a sum over all possible ancestral trees. A different setting occurs when the likelihood function is missing a normalising term that cannot be reconstituted by completion and thus requires a custom-built solution. This is the case of Gibbs random fields, found in spatial statistics and image analysis, where distributions of the form

f(y|X, β) = exp{β Σ_{i=1}^n Σ_{ℓ∼i} I_{yi=yℓ}} / Z(β)

over a finite space of labels yi ∈ {1, . . . , G} involve a normalising constant Z(β) that cannot be computed, due to the Gⁿ terms in the summation. (The symbol ℓ ∼ i means that the individuals i and ℓ are neighbours for a predefined neighbourhood relation denoted by ∼.)

3.5.2 Monte Carlo methods
The generic approach for solving computational problems related with Bayesian analysis is to use simulation, i.e. to produce via a computer program a sample from the posterior distribution and to use the simulated sample to approximate the procedures of interest. This approach goes under the generic name of Monte Carlo methods. Recall that a standard Bayesian estimate of a transform h(θ) of the parameter is the posterior expectation of this function

I = ∫ h(θ)π(θ|y) dθ.

There is therefore a general requirement for computing integrals in an efficient manner. The standard Monte Carlo approximation of I is given by

Î^MC = (1/N) Σ_{i=1}^N h(θ^(i)),

where the θ^(i) are generated from the posterior distribution π(θ|y). For instance, the approximation of a credible region alluded to in Section 3.2.4 used this approach. When given a sample from π(θ|y), the θi corresponding to the α% highest values of π(θ|y) provide an approximation both to the α level of the posterior density and to the localisation of the HPD region. Figure 3.3 shows how such an approximate HPD region can be constructed. [More details can be found in, e.g., Marin and Robert (2010).]
When the computing effort N grows to infinity, the approximation Î^MC converges to I and the speed of convergence is 1/√N if h is square-integrable against π(θ|y) (Robert and Casella 2004). The assessment of this convergence relies on the Central Limit Theorem, as described in Robert and Casella (2009, chapter 4).
When the posterior distribution π(θ|y) is not a standard distribution, producing simulations from this distribution may prove challenging. There however exists a simple generalisation of the basic Monte Carlo method that avoids this difficulty. This generalisation stems from an alternative representation of the above integral I, changing both the integrating density and the integrand:

I = ∫ [h(θ)π(θ|y)/g(θ)] g(θ) dθ,     (3.10)
where the support of the posterior distribution π(·|y) is included in the support of g(·). The corresponding importance sampling approximation of I is given by

Î^IS_g = (1/N) Σ_{i=1}^N ω^(i) h(θ^(i)),     (3.11)

where the θ^(i) are generated from g(·) and the ω^(i) are the importance weights,

ω^(i) = π(θ^(i)|y) / g(θ^(i)).

While the representation (3.10) holds in wide generality (the only requirement is the constraint on the support of g(·)), the choice of g(·) is fundamental to provide good approximations of I. Poor choices of g(·) lead to unreliable approximations: for instance, if

∫ h²(θ)ω²(θ)g(θ) dθ

is infinite, the variance of the estimator (3.11) is also infinite (Robert and Casella 2009, chapters 3 and 4) and then (3.11) cannot be used for approximation purposes.
As noted in the above, a common feature in Bayesian integration is that the normalising constant of the posterior distribution cannot be computed in closed form. In that case, ω^(i) and Î^IS_g cannot be used and they are replaced by the unnormalised version

ω̃^(i) = m(y)π(θ^(i)|y)/g(θ^(i))

and by the self-normalised version

Î^SNIS_g = Σ_{i=1}^N ω̃^(i) h(θ^(i)) / Σ_{i=1}^N ω̃^(i),

respectively. The self-normalised estimator also converges to I since N⁻¹ Σ_{i=1}^N ω̃^(i) converges to the normalising constant m(y). The weights (i = 1, . . . , N)

ω̄^(i) = ω̃^(i) / Σ_{j=1}^N ω̃^(j)

are then called normalised weights.
In general, importance sampling techniques require a rather careful tuning to be of any use, especially in large dimensions. For instance, using a normal distribution centred at the ML estimator of β and scaled by the asymptotic variance of the ML estimator does provide a (very) good approximation in the probit case (Marin and Robert 2010) but this is rarely the case. While MCMC methods (Section 3.5.3) are a ready-made solution to this problem, given that they can break the global distribution into distributions with smaller dimensions, the recent literature has witnessed a fundamental extension of importance sampling techniques that adaptively calibrates some importance functions towards more similarity with the target density (Cappé et al. 2004, 2008; Del Moral et al. 2006; Douc et al. 2007). This extension is called sequential Monte Carlo (SMC) because it evolves along a time axis either through the target distributions – as in regular sequential statistical problems where the sample size increases – or through the importance functions. Another common name is population Monte Carlo (PMC), following Iba (2000), because this technique produces populations rather than points.⁹ Although the idea has connections

⁹ This simulated population is then used to devise new and, it is hoped, improved importance (or proposal) functions at the next time step.
with the earlier particle filter literature (Gordon et al. 1993; Doucet et al. 2001), the main principle of this method is to build a sequence of increasingly better proposal distributions through a sequence of simulated samples (which thus behave like populations). The performance criterion may be the entropy divergence from the target distribution or the variance of the corresponding estimator of a fixed integral. Given that the validation of the technique is still based on sampling importance resampling principles, the resulting dependence on past samples can be arbitrarily complex, while the approximation to the target remains valid (unbiased) at each iteration. A quick description of the SMC algorithm is as follows. Starting from an initial sample, (θi,0 ) 1 ≤ i ≤ N, generated from q0 , importance weights ωi,0 = π(θi,0 |y)/q0 (θi,0 ) are computed to resample the θi,0 . A new importance function qi,1 is constructed for each resampled θ˜ i,0 and a new importance sample is then generated, with weights ωi,t = π(θi,t |y)/qi,t (θi,t ), the processes being iterated until some stability is achieved. [The weights can be modified towards more stability, as in Del Moral et al. (2006).] While the SMC algorithm does appear as a repeated (or sequential) sampling importance resampling algorithm (Rubin 1988), the major modification is the open choice of the new proposals qit , since qit can depend on all past simulated samples as well as on the index of the currently simulated value. For instance, in Capp´e et al. (2008), mixtures of standard kernels are used with an update of the weights and of the parameters of those kernels at each iteration in such a way that the entropy distance of the corresponding importance sampling estimator to the target are decreasing from one iteration to the other. [Adopting a different goal, Andrieu et al. (2010) integrate the particle updating within an MCMC setting.]
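As a minimal illustration of the self-normalised estimator above (not of the adaptive SMC/PMC schemes themselves), the R sketch below estimates the posterior mean of the Cauchy-location example of Section 3.2.4 with a Student-t importance function; the proposal and its scale are arbitrary choices of ours:

set.seed(5)
x <- c(-4.3, 3.2)
target <- function(mu)                         # unnormalised posterior from Section 3.2.4
  exp(-mu^2 / 20) / ((1 + (x[1] - mu)^2) * (1 + (x[2] - mu)^2))
N <- 1e5
theta <- 3 * rt(N, df = 3)                     # proposal g: scaled t_3
w <- target(theta) / (dt(theta / 3, df = 3) / 3)     # unnormalised importance weights
w <- w / sum(w)                                # normalised weights
sum(w * theta)                                 # SNIS estimate of E[mu | D_n]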
3.5.3 MCMC methods
Given the difficulties involved in constructing an efficient importance function in complex settings, and in particular in large dimensions, an alternative of considerable interest is offered by the MCMC methods (Gelfand and Smith 1990; Robert and Casella 2004, 2009; Marin and Robert 2007). Those methods try to overcome some of the limitations of regular Monte Carlo strategies by simulating a Markov chain with stationary (and limiting) distribution the target distribution.10 There exist fairly generic ways of producing such chains, including the Metropolis–Hastings and Gibbs algorithms defined below. Besides the theoretical fact that stationarity of the target distribution is enough to justify a simulation method by Markov chain generation, the idea at the core of MCMC algorithms is that local exploration, when properly weighted, can lead to a valid (global) representation of the distribution of interest. This includes for instance using only component-wise (and hence small-dimensional) simulations – that escape (to some extent) the curse of dimensionality – as in the Gibbs sampler. The Metropolis–Hastings algorithm truly is the generic MCMC method in that it offers a straightforward and universal solution to the problem of simulating from an arbitrary11 posterior distribution π(θ|x) ∝ f (y|θ) π(θ): starting from an arbitrary point θ0 , the corresponding Markov chain explores the surface of this posterior distribution using an instrumental Markov proposal θ ∼ q(θ|θ (t−1) ) and, in principle, it progressively visits the whole range of the possible values of θ. This internal Markov kernel should be irreducible with respect to the target distribution [that is, the Markov chain associated with the proposal q(·|·) should be able to visit the whole support of the target distribution in a finite number of steps]. The reason why the resulting chain does converge to the target distribution despite the arbitrary choice of q(θ|θ (t−1) ) is that the proposed values θ are 10 The theoretical foundations of MCMC algorithms are both sound and simple: as stressed by Tierney (1994) and Mengersen and Tweedie (1996), the existence of a stationary distribution almost immediately validates the principle of a simulation algorithm based on a Markov chain, i.e. does not create a fundamental difference with the i.i.d. case. 11 The only restriction is that this function is known up to a normalising constant.
sometimes rejected by an accept-reject step associated with the probability

\[
\min\left\{1,\ \frac{\pi(\theta')\, f(y\mid\theta')\, q(\theta^{(t-1)}\mid\theta')}{\pi(\theta^{(t-1)})\, f(y\mid\theta^{(t-1)})\, q(\theta'\mid\theta^{(t-1)})}\right\}.
\]

When the proposed value θ′ is rejected, the chain is extended as θ^(t) = θ^(t−1). A rather common choice for q(θ′|θ^(t−1)) is the random walk proposal, q(θ′|θ^(t−1)) = g(θ′ − θ^(t−1)) with a symmetric function g, which provides a simplified acceptance probability

\[
\min\left\{1,\ \frac{\pi(\theta')\, f(y\mid\theta')}{\pi(\theta^{(t-1)})\, f(y\mid\theta^{(t-1)})}\right\}.
\]

This particular choice ensures that values θ′ that are more likely than the current θ^(t−1) for the posterior distribution are always accepted, while values that are less likely are sometimes accepted. Finding the proper scale for the random walk is not always straightforward and asymptotic normal approximations to the posterior distribution may be very inefficient proposals. While the Metropolis–Hastings algorithm recovers better from facing large-dimensional problems than standard importance sampling techniques, this still is a strong limitation to its use in large-dimensional set-ups. Considering again the probit model of Section 3.5.1, the random walk proposal using the asymptotic covariance matrix \(\hat{\Sigma}\) of the ML estimator as the scale, β′ ∼ N_2(β^(t−1), \(\hat{\Sigma}\)), produces very decent performances, as described in Marin and Robert (2007). In contrast, the alternative Gibbs sampler is an attractive algorithm that caters to large-dimensional problems because it naturally fits the hierarchical structures often present in Bayesian models and more generally in graphical and latent variable models. The fundamental strength (and appeal) of the Gibbs sampler is its ability to break a joint target distribution like π(θ_1, . . . , θ_p|y) into the corresponding conditional distributions π_i(θ_i|y, θ_{−i}) (i = 1, . . . , p) and to simulate successively from these low-dimensional targets:

\[
\theta_1^{(t)} \sim \pi_1\big(\theta_1 \mid y,\ \theta_{-1}^{(t-1)}\big), \quad
\theta_2^{(t)} \sim \pi_2\big(\theta_2 \mid y,\ \theta_1^{(t)},\ \theta_{-(1:2)}^{(t-1)}\big), \quad
\ldots, \quad
\theta_p^{(t)} \sim \pi_p\big(\theta_p \mid y,\ \theta_{1:(p-1)}^{(t)}\big).
\]
While this algorithm may seem restricted to hierarchical multidimensional models, the special case of the slice sampler (Robert and Casella 2004, chapter 8) shows that the Gibbs sampler applies in a wide variety of models. See Robert and Casella (2004, chapters 9 and 10) for a detailed coverage of the implementation and tuning of the Gibbs sampler. The difficulty with the model explosion in variable selection (Section 3.5.1) can be handled naturally via a Gibbs sampler that considers each variable indicator (variable in the model or not) conditional on the others, as illustrated in Marin and Robert (2007, chapter 4). Mixing Metropolis–Hastings and Gibbs algorithms often results in better performances like faster convergence of the resulting Markov chain, the former algorithm being often used for global exploration of the target and the latter for local improvement. A classic hybrid algorithm replaces a nonavailable Gibbs update by a Metropolis–Hastings step. Another hybrid solution alternates Gibbs and Metropolis–Hastings proposals. The corresponding algorithms are valid: they produce ergodic Markov chains with the posterior target as stationary distribution.
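The following R sketch illustrates the random walk Metropolis–Hastings scheme described above on a generic unnormalised log-posterior; the bivariate normal target and the proposal scale are hypothetical choices made for illustration, not the probit example of Marin and Robert (2007).

```r
# Random walk Metropolis-Hastings for a generic unnormalised log-posterior.
# A hypothetical bivariate target stands in for log pi(theta) + log f(y | theta).
log_target <- function(theta) sum(dnorm(theta, mean = c(1, -2), sd = 1, log = TRUE))

T <- 10000
theta <- matrix(NA, nrow = T, ncol = 2)
theta[1, ] <- c(0, 0)                    # arbitrary starting point
scale <- 0.8                             # random walk scale (to be tuned)

for (t in 2:T) {
  prop <- theta[t - 1, ] + rnorm(2, 0, scale)        # symmetric proposal
  log_alpha <- log_target(prop) - log_target(theta[t - 1, ])
  if (log(runif(1)) < log_alpha) {
    theta[t, ] <- prop                   # accept the proposed value
  } else {
    theta[t, ] <- theta[t - 1, ]         # reject: repeat the current value
  }
}

mean(theta[-(1:1000), 1])   # posterior mean of the first component after burn-in
```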
Learning about the specificities of the target distribution while running an MCMC algorithm and thus tuning the proposal accordingly, i.e. constructing an adaptive MCMC procedure, is difficult because this cancels the Markov property of the original method and thus jeopardizes convergence. Similarly, using an on-line scaling of the algorithm against the empirical acceptance rate in order to reach a golden number like 0.234 (Roberts and Rosenthal 2009) is inherently flawed in that the attraction of a modal region may give a false sense of convergence and may thus lead to a choice of too small a scale, simply because other modes will fail to be visited during the scaling experiment. However, there are algorithms that preserve ergodicity (convergence to the target) while implementing adaptivity. For instance, Roberts and Rosenthal (2007) consider basic ergodicity properties of adaptive MCMC algorithms under minimal assumptions, using coupling constructions. They prove convergence in distribution and a weak law of large numbers. Moreover, Roberts and Rosenthal (2009) investigate the use of adaptive MCMC algorithms to automatically tune the Markov chain parameters during a run. Examples include the adaptive Metropolis multivariate algorithm of Haario et al. (2001), Metropolis-within-Gibbs algorithms for nonconjugate hierarchical models, regionally adjusted Metropolis algorithms, and logarithmic scalings. Roberts and Rosenthal (2009) present some computer simulation results that indicate that the algorithms perform very well compared with nonadaptive algorithms, even in high dimensions.
3.5.4
Approximate Bayesian computation techniques
As mentioned above, there exist situations where the likelihood function f(y|θ) is too expensive or impossible to calculate, but where simulations from the density f(y|θ) can be produced in a reasonable time. An illustration is given by inverse problems where computing the function f(y|θ) for a given pair (y, θ) involves solving a complex numerical equation. In such cases, it is almost impossible to use the computational tools presented in the previous sections to sample from the posterior distribution π(θ|y). Approximate Bayesian computation (ABC) is an alternative to such techniques that only requires being able to sample from the likelihood f(·|θ). It was first proposed for population genetic models (Beaumont et al. 2002) but applies in much wider generality (see e.g. Chapter 18). The algorithm produces samples from the joint distribution π(θ)f(z|θ) and only accepts the simulations such that z is close enough to the observation y, ρ(η(z), η(y)) ≤ ε, where η is a summary statistic, ρ is a distance and ε is a tolerance level – all three being arbitrarily chosen by the statistician. This 'likelihood-free algorithm', also called ABC, thus samples from the marginal in z of the following joint distribution:

\[
\pi_\varepsilon(\theta, z \mid y) \propto \pi(\theta)\, f(z \mid \theta)\, \mathbb{I}_{\rho(\eta(z), \eta(y)) \le \varepsilon}.
\]

The idea behind ABC (Beaumont et al. 2002) is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution:

\[
\pi_\varepsilon(\theta \mid y) = \int \pi_\varepsilon(\theta, z \mid y)\, \mathrm{d}z \approx \pi(\theta \mid y)
\]

if ε is small enough (but not too small, so as to keep some simulations within the sample). In many examples, including phylogenetics, this is the only solution that can be implemented (Cornuet et al. 2008), at the expense of shying away from exact simulation. Both Grelaud et al. (2009) and Ratmann et al. (2009) consider the use of this approximation step to run model comparison in otherwise intractable models. Various sequential Monte Carlo algorithms have been constructed as extensions to the original ABC method. For instance, Beaumont et al. (2009) proposed an ABC version of the PMC algorithm presented above. The key idea is to decompose the difficult issue of sampling from π_ε(θ, z|y) into a series of simpler subproblems. The algorithm begins at time 0 sampling from π_{ε_0}(θ, z|y) with a large value ε_0, which means simulating almost from the prior, then simulating from an increasingly difficult sequence of target distributions π_{ε_t}(θ, z|y), that is, with ε_t < ε_{t−1}. Illustrations of this technique are provided in Beaumont et al. (2009) both at the methodological and at the application level.
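A minimal R sketch of the basic ABC rejection scheme described above is given below. It assumes a toy model in which the data are i.i.d. normal with unknown mean θ, the summary statistic η is the sample mean and ρ is the absolute difference; the prior, tolerance and number of simulations are hypothetical tuning choices.

```r
# Basic ABC rejection sampler for a toy normal-mean model.
set.seed(1)
y <- rnorm(50, mean = 2, sd = 1)         # observed data (simulated here)
eta_obs <- mean(y)                        # summary statistic eta(y)

N   <- 1e5                                # number of prior simulations
eps <- 0.05                               # tolerance level epsilon

theta <- runif(N, -10, 10)                # draws from a flat prior pi(theta)
# simulate only the summary of the pseudo-data z | theta:
# the sample mean of 50 N(theta, 1) observations is N(theta, 1/sqrt(50))
eta_z <- rnorm(N, mean = theta, sd = 1 / sqrt(length(y)))
keep  <- abs(eta_z - eta_obs) <= eps      # rho(eta(z), eta(y)) <= eps

abc_sample <- theta[keep]                 # approximate draws from pi_eps(theta | y)
length(abc_sample); mean(abc_sample)
```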
Acknowledgements
The authors are most grateful to David Balding for his advice during the preparation of this chapter. This work was partly supported by the Agence Nationale de la Recherche (ANR, 212 rue de Bercy, 75012 Paris, France) through the 2009–12 projects Big'MC and EMILE.
References Albert J 2009 Bayesian Computation with R, 2nd edn. Springer-Verlag, New York. Andrieu C, Doucet A and Holenstein R 2010 Particle Markov chain Monte Carlo (with discussion). J. R. Statist. Soc. B 72(3), 269–342 Beaumont M, Zhang W and Balding D 2002 Approximate Bayesian computation in population genetics. Genetics 162, 2025–2035. Beaumont M, Cornuet JM, Marin JM and Robert C 2009 Adaptive approximate Bayesian computation. Biometrika 96(4), 983–990. Berger J, Ghosh J and Mukhopadhyay N 2003 Approximations to the Bayes factor in model selection problems and consistency issues. J. Statist. Plan. Inference 112, 241–258. Berry S and Berry D 2004 Accounting for multiplicities in assessing drug safety: A three-level hierarchical mixture model. Biometrics 60, 418–426. Bishop YMM, Fienberg SE and Holland PW 1975 Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge, MA. Capp´e O, Guillin A, Marin JM and Robert C 2004 Population Monte Carlo. J. Comput. Graph. Statist. 13(4), 907–929. Capp´e O, Douc R, Guillin A, Marin JM and Robert C 2008 Adaptive importance sampling in general mixture classes. Statist. Comput. 18, 447–459. Casella G and Berger R 2001 Statistical Inference, 2nd edn. Wadsworth, Belmont, CA. Chambaz A and Rousseau J 2008 Bounds for Bayesian order identification with application to mixtures. Ann. Statist. 36, 938–962. Chen M, Shao Q and Ibrahim J 2000 Monte Carlo Methods in Bayesian Computation. Springer-Verlag, New York. Chib S, Nadari F and Shephard N 2002 Markov chain Monte Carlo methods for stochastic volatility models. J. Econometrics 108, 281–316. Cornuet JM, Santos F, Beaumont MA, Robert CP, Marin JM, Balding DJ, Guillemaud T and Estoup A 2008 Inferring population history with DIYABC: a user-friendly approach to Approximate Bayesian Computation. Bioinformatics 24(23), 2713–2719. Del Moral P, Doucet A and Jasra A 2006 Sequential Monte Carlo samplers. J. R. Statist. Soc. B 68(3), 411–436. Douc R, Guillin A, Marin JM and Robert C 2007 Convergence of adaptive mixtures of importance sampling schemes. Ann. Statist. 35(1), 420–448. Doucet A, de Freitas N and Gordon N 2001 Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York. Efron B, Tibshirani R, Storey J and Tusher V 2001 Empirical bayes analysis of a microarray experiment. J. Am. Statist. Assoc. 96, 185–198. Gelfand A and Smith A 1990 Sampling based approaches to calculating marginal densities. J. Am. Statist. Assoc. 85, 398–409. Gelman A 2008 Objections to Bayesian statistics. Bayesian Anal. 3(3), 445–450. Gopalan R and Berry D 1998 Bayesian multiple comparisons using dirichlet process priors. J. Am. Statist. Assoc. 93, 1130–1139. Gordon N, Salmond J and Smith A 1993 A novel approach to non-linear/non-Gaussian Bayesian state estimation. IEEE Proc. Radar Signal Process. 140, 107–113. Grelaud A, Marin JM, Robert C, Rodolphe F and Tally F 2009 Likelihood-free methods for model choice in Gibbs random fields. Bayesian Anal. 3(2), 427–442. Haario H, Saksman E and Tamminen J 2001 An adaptive Metropolis algorithm. Bernoulli 7(2), 223–242.
Iba Y 2000 Population-based Monte Carlo algorithms. Trans. Jpn Soc. Artif. Intell. 16(2), 279–286. Jacquier E, Polson N and Rossi P 1994 Bayesian analysis of stochastic volatility models (with discussion). J. Bus. Econ. Stat. 12, 371–417. Jaynes E 2003 Probability Theory. Cambridge University Press, Cambridge. Jeffreys H 1939 Theory of Probability, 1st edn. The Clarendon Press, Oxford. Lehmann E and Casella G 1998 Theory of Point Estimation, revised edn Springer-Verlag, New York. Liu J 2001 Monte Carlo Strategies in Scientific Computing. Springer-Verlag, New York. MacKay DJC 2002 Information Theory, Inference & Learning Algorithms. Cambridge University Press, Cambridge. Madigan D and Raftery A 1994 Model selection and accounting for model uncertainty in graphical models using Occam’s window. J. Am. Statist. Assoc. 89, 1535–1546. Marin J and Robert C 2007 Bayesian Core. Springer-Verlag, New York. Marin J and Robert C 2010 Importance sampling methods for Bayesian discrimination between embedded models. In Frontiers of Statistical Decision Making and Bayesian Analysis (eds Chen MH, Dey D, M¨uller P, Sun D and Ye K), pp. 513–553. Springer-Verlag, New York. McCullagh P and Nelder J 1989 Generalized Linear Models. Chapman and Hall, New York. Mengersen K and Tweedie R 1996 Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist. 24, 101–121. M¨uller P, Parmigiani G, Robert C and Rousseau J 2004 Optimal sample size for multiple testing: the case of gene expression microarrays. J. Am. Statist. Assoc. 99, 990–1001. M¨uller P, Parmigiani G and Rice K 2008 FDR and Bayesian multiple comparisons rules. In Bayesian Statistics 8: Proceedings of the Eighth International Meeting (eds Bernardo JM, Berger JO, Dawid AP and Smith AFM). Oxford University Press, Oxford. Ratmann O, Andrieu C, Wiuf C and Richardson S 2009 Model criticism based on likelihood-free inference, with an application to protein network evolution. PNAS 106, 1–6. Robert C 2007 The Bayesian Choice, paperback edn. Springer-Verlag, New York. Robert C and Casella G 2004 Monte Carlo Statistical Methods, 2nd edn. Springer-Verlag, New York. Robert C and Casella G 2009 Introducing Monte Carlo Methods with R. Springer-Verlag, New York. Robert C and Wraith D 2009 Computational methods for Bayesian model choice. In The 29th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (ed. Goggans PM), pp. 251–262. American Institute of Physics, Melville, NY. Robert C, Chopin N and Rousseau J 2009 Theory of probability revisited (with discussion). Statist. Sci. 24(4), 574–576. Roberts G and Rosenthal J 2007 Coupling and ergodicity of adaptive Markov Chain Monte Carlo algorithms. J. Appl. Prob. 44(2), 458–475. Roberts G and Rosenthal J 2009 Examples of adaptive MCMC. J. Comp. Graph. Stat. 18, 349–367. Rubin D 1988 Using the SIR algorithm to simulate posterior distributions. In Bayesian Statistics 3: Proceedings of the Third Valencia International Meeting, June 1–5, 1987 (eds Bernardo J, Degroot M, Lindley D and Smith A). Clarendon Press, Oxford. Schervish M 1995 Theory of Statistics. Springer-Verlag, New York. Scott J and Berger J 2006 An exploration of aspects of Bayesian multiple testing. J. Statist. Plan. Inference 136, 2144–62. Sørensen D and Gianola D 2002 Likelihood, Bayesian, and MCMC Methods in Qualitative Genetics. Springer-Verlag, New York. Templeton A 2008 Statistical hypothesis testing in intraspecific phylogeography: nested clade phylogeographical analysis vs. 
approximate Bayesian computation. Mol. Ecol. 18(2), 319–331. Tierney L 1994 Markov chains for exploring posterior distributions (with discussion). Ann. Statist. 22, 1701–1786. Wraith D, Kilbinger M, Benabed K, O. C, Cardoso JF, G. F, Prunet S and Robert C 2009 Estimation of cosmological parameters using adaptive importance sampling. Phys. Rev. D 80(2), 023507. Zellner A 1986 On assessing prior distributions and Bayesian regression analysis with g-prior distribution regression using Bayesian variable selection. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, pp. 233–243. North-Holland/Elsevier, Amsterdam.
4
Data Integration: Towards Understanding Biological Complexity
David Gomez-Cabrero and Jesper Tegner
Unit of Computational Medicine, Department of Medicine, Karolinska Institutet, Stockholm, Sweden
Systems biology will fully develop its potential when researchers are able to make use of all the available data. To fulfill this goal, two major challenges must be overcome: (i) how to store data and knowledge and make them accessible; and (ii) how to integrate and analyze data from different sources and of different types. Hence, this chapter is divided into two sections. The first section describes public storage resources such as experimental data repositories, ontologies and knowledge databases; however, we will not discuss the (relevant) topic of the requirements for a data warehouse capable of managing several databases in an interoperable manner (Kimball and Ross 2002). The second section reviews available tools and algorithms that are useful for integrating different types of data sets. Since a comprehensive enumeration is beyond the scope of this chapter, we present some representative examples. Throughout the chapter we list integrative tools based on open source code and on public repositories and databases (see Table 4.1). The relevant R packages (Ihaka and Gentleman 1996) are highlighted in bold (see Table 4.2).
4.1
Storing knowledge: Experimental data, knowledge databases, ontologies and annotation
Quantitative measurements of biological entities are data; organized data (e.g. by relational connections) become Information, and we gain Knowledge by the appropriate collection and analysis of Information. Biological research needs to find ways to store Data, Information and Knowledge efficiently; here the term 'efficient' denotes ways that allow integration among the different data types.
Experimental data are stored in data repositories. Information is stored in Information Resources such as Knowledge Databases and Ontologies. This section describes each of those terms, gives key examples of each, and explains the process of annotation (which acts as the bridge between experimental data and Information). Throughout the section we use the protein p53 to provide relevant examples. p53 is a tumor suppressor and a nuclear transcription factor that accumulates in response to cellular stress, including DNA damage and oncogene activation, thereby guarding genome stability and normal cell growth. It was discovered in 1979 and was initially considered an oncogene (DeLeo et al. 1979; Lane and Crawford 1979; Linzer and Levine 1979), but it did not receive the full attention of the research community until several years later, when it was classified as a tumor suppressor that is mutated in most cancer cells (Jay et al. 1981; Mowat et al. 1985; Baker et al. 1989). The importance of this protein is highlighted by the 54 530 entries in PubMed (http://www.ncbi.nlm.nih.gov/pubmed, searching for 'p53', August 2010), of which 6857 are reviews.
4.1.1
Data repositories
Over the years, the necessity for efficiently storing data has become clear. Sequence analysis provides us with a clarifying example: initially all new sequences obtained were submitted to journals, which complicated their (efficient) use by researchers. To deal with this problem a database of sequences (the EMBL Nucleotide Sequence Data Library), to which all researchers would be able to submit their own sequences, was established at the European Molecular Biology Laboratory (EMBL); it was the first sequence database and its importance and utility rapidly became clear. Other initiatives, such as GenBank, founded in 1982 and later absorbed into the National Center for Biotechnology Information (NCBI), followed and extended this pioneering experience. The requirements on such resources have grown rapidly in recent years: from microarrays to RNA-Seq, the amount of data generated daily is overwhelming, and it clearly points out the necessity of storage and analytical tools in order to use the data efficiently. Several efforts have been made over the years to provide ordered access to all publicly available data so that comparative studies can be performed. A relevant example is found in microarrays; two of the most relevant microarray repositories are: (i) ArrayExpress (Parkinson et al. 2007; http://www.ebi.ac.uk/arrayexpress), developed by the European Bioinformatics Institute (EBI); and (ii) the Gene Expression Omnibus (GEO), developed by the National Center for Biotechnology Information (NCBI) (Edgar et al. 2002; Barrett et al. 2007; http://www.ncbi.nlm.nih.gov/geo). At GEO the basic repository unit is a sample record, which describes the condition, manipulation and abundance measurement of each element derived from it; samples can be grouped into Series by submitters and/or Data Sets by GEO curators. Both repositories are regularly updated. A query of 'p53' within ArrayExpress (August 2010) returns 142 different experiments and 5948 microarray assays. For each experiment listed, data are available in Raw format (usually CEL files) and/or in Processed format; however, it is recommended to upload data in Raw format, as new (and hopefully better) methods to pre-process the data are continuously being developed. Each experiment has a unique identifier. For instance, E-GEOD-10795 is the identifier for an experiment that aims to elucidate the impact of TAF6d on cell death and gene expression; we also observe related information such as links to publications (see Lesne and Benecke 2008; Wilhelm et al. 2008), files (e.g. data archives, experiment design and array design), a description of the experiment and links to other databases. Among the links there is a reference to the same data set available at GEO, where its identifier is GSE10795; following this link we observe information similar to that found in ArrayExpress. A p53 query in GEO returns 31 data sets and 257 series. Repositories can store different types of data in different formats for the same defined measure, or different types of quantitative data for the same system. For example, microarrays can be used to measure as many different things as (i) mRNA expression, (ii) exon differential expression (to compare isoforms) (Clark et al. 2007), (iii) DNA–protein interactions as part of the ChIP-chip assay (Horak and Snyder 2002) and (iv) SNPs (Hacia et al. 1999). The need to integrate these different data sets highlights the need for standards; as an example, the Microarray
Gene Expression Data (MGED) Society defined the Minimum Information About a Microarray Experiment (MIAME) (Brazma et al. 2001), which corresponds to the minimum information that must be reported about a microarray experiment to enable its unambiguous interpretation and reproduction. There are also repositories specific to certain diseases or biological systems: RefDIC (Hijikata et al. 2007) is a public compendium of quantitative mRNA/protein profile data obtained from microarray and two-dimensional gel electrophoresis based proteome experiments specifically for immune cells; the Oncomine initiative collects and standardizes all published (and publicly available) cancer microarray data (http://www.oncomine.org), as described in Rhodes et al. (2007). There are also repositories for most other biological data types; for instance, the Protein Data Bank (http://www.rcsb.org/pdb/home/home.do) includes mass spectrometry experiments. Data repositories continuously face new challenges. For instance, as High Throughput Sequencing (HTS) techniques provide tools that will replace microarrays (e.g. RNA-Seq provides a more precise assessment of differential mRNA expression), new data types and new storage structures become necessary. Another necessary characteristic of repositories is 'updatability': repositories should be able to accommodate new types of data and, where possible, link them to previous ones, as this allows the integration of old and new data types.
4.1.2
Knowledge Databases
Experimental data need to be transformed into information. This information needs to be stored in an ordered way in (public) databases, named Knowledge Databases (KDs). Researchers use KDs to test their experimental results or to validate their hypotheses. Because KDs present the processed data extracted from experiments and are specific to different types of entities (e.g. genes, proteins or diseases) or even to variations of the same topic (e.g. DNA sequence: transcripts or SNPs), they need specific structures for each case. However, relevant characteristics (and challenges) that all KDs share are:
1. Interconnection: every KD provides a narrow view of a certain topic; to provide a better integrative view, interconnection between KDs is needed.
2. Public access: all information in major KDs must be publicly available. This policy is enforced by entities such as the International Nucleotide Sequence Database Collaboration (INSD, http://www.insdc.org/policy.html): '...no use restrictions or licensing requirements will be included in any sequence data records, and no restrictions or licensing fees will be placed on the redistribution or use of the database by any party...'; 'All database records submitted to the INSD will remain permanently accessible as part of the scientific record'.
3. Updatability: KDs must remain flexible enough to be able to include new data types.
4. Standardization: in order to meet the Interconnection and Updatability challenges it is necessary to define information storage standards.
KDs represent the state of knowledge in biology. However, KDs are based on current knowledge, with different degrees of certainty.
4.1.2.1
p53 KD tour
As there is not space to enumerate all KDs we present the most important ones in Table 4.1. However, we find it useful to provide an example: we describe several p53 queries among different KDs. We begin by searching p53 at the NCBI, which allows the user to search over different databases. A search of ‘p53 Homo sapiens’ in the Gene Database redirects the query to Entrez Gene Database (Maglott et al. 2007) and returns ‘TP53’
Table 4.1 Repositories (R), Knowledge Databases (KDs) and Ontologies (O)

Name | Type | Field | Reference/URL
ArrayExpress | R | Microarray | Parkinson et al. (2007)
CCDS | KD | Gene and Proteins | Pruitt et al. (2009)
ChEBI | O | Chemical Entities | Matos et al. (2009)
dbSNP | R | SNPs | Wheeler et al. (2007)
DIP | KD | Protein Interaction | Salwinski et al. (2004)
Entrez Gene | KD | Gene | Maglott et al. (2007)
FlyBase | KD | Fly Genome | http://flybase.org
GEO | R | High-throughput | Barrett et al. (2007)
GO | O | Gene | The Gene Ontology Consortium (2000)
HUGO | KD | Gene | http://www.genenames.org
JASPAR | R | PWM | Portales-Casamar et al. (2009)
KEGG | KD | Pathways | Kanehisa et al. (2006)
MGI | R+KD | Mouse Genome | http://www.informatics.jax.org
Negatome | KD | Protein Interaction | Smialowski et al. (2010)
OBO | O | Ontologies | Smith et al. (2007)
Oncomine | R | Cancer, microarray | Rhodes et al. (2007)
PDB | KD | Protein & Metabolomic | http://www.rcsb.org/pdb/home/home.do
PubMed | R | Journals | http://www.ncbi.nlm.nih.gov/pubmed
RefDIC | R | Immune system | Hijikata et al. (2007)
RefSeq | KD | Genes and Proteins | http://www.ncbi.nlm.nih.gov/RefSeq/
TRANSFAC | R | PWM | Windenger (2008)
UniProt | KD | Protein | The Uniprot Consortium (2010)
that links to http://www.ncbi.nlm.nih.gov/gene/7157 as the first result. Entrez Gene is an NCBI database for gene-specific information that focuses on genomes 'that have been completely sequenced, that have an active research community to contribute gene-specific information or that are scheduled for intense sequence analysis' and provides unique integer identifiers (GeneIDs) for genes and other loci. In H. sapiens the p53 GeneID is 7157; this identifier will help if any query is redirected to any other NCBI database. The official symbol, i.e. the official name provided by a nomenclature authority, is TP53 and it was provided by the HUGO Gene Nomenclature Committee (HGNC; http://www.genenames.org/aboutHGNC.html). Other information available is: (1) TP53 has other aliases (P53, LFS1, TRP53, FLJ92943) that can be used to identify the gene in older references, (2) TP53 is a protein coding gene and (3) its HGNC identifier is HGNC:11998. The Entrez Gene database integrates information from the RefSeq database (Pruitt et al. 2007; http://www.ncbi.nlm.nih.gov/RefSeq/), where RefSeq is 'a curated non-redundant collection of sequences representing genomes, transcripts and proteins' that integrates information from multiple sources, thereby adding descriptions such as coding regions, conserved domains, gene and protein product names and (again) database cross-references. From a TP53 Entrez Gene database query the following RefSeq information can be obtained: (1) the RefSeqGene identifier (for well-characterized genes to be used as reference standards) and (2) the different transcripts and proteins, whose identifier names begin with NM_ and NP_, respectively. Among the transcripts and proteins the query returns the cellular tumor antigen p53 isoform a, which has two related transcript+protein pairs: (i) NM_000546.4 + NP_000537.3 and (ii) NM_001126112.1 + NP_001119584.1; in both cases there is a common reference, CCDS11118.1, to a consensus coding sequence (CCDS; Pruitt et al. 2009). The CCDS database annotates identical proteins on the reference mouse and human genomes with a stable identifier (CCDS ID), ensuring consistency between the NCBI, Ensembl (Flicek et al. 2010) and UCSC genome browsers.
Following the links to Ensembl we arrive at Ensembl.org, a project that generates databases for chordates. The identifier for 'TP53 H. sapiens' is ENSG00000141510, and again on this page the original source, HGNC, and the accession number in it, 11998, are shown. The web page provides the location of the gene (Chromosome 17: 7,565,257-7,590,856 reverse strand, GRCh37 human assembly) and the different transcripts related to the gene, the length of each of them and the related protein products. There are also references to the CCDS entries if they are available. The CCDS database is only one example of how different entities collaborate in standardizing annotations. Another major example is the INSD (http://www.insdc.org/index.html), which combines the efforts of the DNA Data Bank of Japan, GenBank (Benson et al. 2009) and the European Nucleotide Archive (European Molecular Biology Laboratory) to collect and disseminate DNA and RNA sequence data. Other types of knowledge databases are not focused on individual terms but instead target physical interactions. Reactome (Matthews et al. 2009) is a curated knowledge database of biological pathways that includes cross-references to other biological databases such as Ensembl and Entrez Gene. A search for 'p53' within Reactome returns 123 terms (e.g. the 'Transcriptional activation of p53 responsive genes' pathway, uniquely identified as REACT_202.2). The link to this pathway includes information about the preceding (e.g. 'Stabilization of p53', REACT_309.2) and following (e.g. 'Translocation of p27 to the nucleoplasm', REACT_9043.1) events. KEGG (Kanehisa et al. 2006) is a collection of manually drawn pathway maps covering metabolism and cellular processes. A p53 query in KEGG PATHWAYS returns the KEGG 'p53 signaling pathway', identified as map04115; it includes (i) information about the evidence used to develop the pathway, (ii) related pathways and (iii) links to other databases.
4.1.3
Ontologies
KDs present the processed data extracted from experiments. As we have observed for p53, we can obtain the sequence of the gene, the different exons and introns and other relevant data, but we are missing answers to relevant questions such as 'Does p53 work alone or is it included in a gene module?' and 'Is p53 involved in any metabolomic pathway?'. These questions need to be addressed from a different perspective than that of a KD. Therefore it was necessary to develop ways of storing relational information; Biological Ontologies (BOs) are one response to this need. BOs represent the entities of biomedical interest and their relations and categories. Ontologies can be domain-specific [e.g. Chemical Entities of Biological Interest, ChEBI (Matos et al. 2009)] or level-specific [e.g. Gene Ontology (GO) has biological process, cellular component and molecular function levels, see The Gene Ontology Consortium (2000)]. Ontologies can overlap, and can reuse elements from other ontologies. Ontologies are tools used to (i) integrate different meta-data, answering questions such as the existence of groups of entities and the possible hierarchical orders and relationships between them; and (ii) provide resource interoperability. In order to fulfill these tasks there are some prerequisites: high quality, free availability and redistributability. Open Biomedical Ontologies (OBO) is a collection of controlled vocabularies (ontologies) freely available to the biomedical community. Within OBO, the OBO Foundry (Smith et al. 2007) regulates the development of new ontologies by defining principles. Many new ontologies are defined to communicate with already available ones; e.g. the PRotein Ontology (Natale et al. 2007) includes connections to GO, the OBO Disease Ontology and several others. Until federated biomedical ontologies reach maturity, various bridges are being created between existing ontologies. Two relevant examples are: (1) the Unified Medical Language System, developed by the US National Library of Medicine, whose Metathesaurus integrates more than 1.4 million concepts from over one hundred terminologies (http://www.bioontology.org/); and (2) the 'Minimal Information Requested In the Annotation of biochemical Models' (MIRIAM) (Laibe and Le Novère 2007), which presents a set of guidelines for the annotation and curation of processes in computational systems biology models. MIRIAM Resources are being developed to support the use of Uniform Resource Identifiers, a useful tool for interoperability.
4.1.3.1
p53 ontology tour
As for KDs, we provide an example of ontology characteristics and structure by querying p53 in GO. GO contains a specific and curated (selected, collected and maintained by expert users) vocabulary for (i) the entities within the ontology, (ii) terms related to entities (such as genes pointing to a biological process) and (iii) terms related to the description of the entities. It is organized in three domains (cellular component, molecular function and biological process) and each domain is structured as a directed acyclic graph (DAG). The main page http://www.geneontology.org/ acts as a web browser that allows searching in the GO database. A 'p53' query filtered by 'H. sapiens' in the biological process domain returns the term classified with the symbol 'TP53' and with the name 'Cellular tumor antigen p53'. Within the link to this term it is possible to retrieve information about the gene product (which offers different synonyms such as 'p53'), the peptide sequence, the sequence information and links to different Knowledge and Experimental Databases such as DIP, EMBL, GenBank, UniProt and PDB. Most importantly, the 'TP53 H. sapiens' page shows links to 60 different terms in the GO database (e.g. TP53 is related to the GO biological process term apoptosis, GO:0006915). All relations between genes and GO entities must be evidence based. There are two types of evidence: (a) Experimental Evidence, which can be inferred from (i) Direct Assay, (ii) Physical Interaction, (iii) Mutant Phenotype, (iv) Genetic Interaction and (v) Expression Pattern; and (b) Computational Analysis Evidence, which can be further classified as evidence inferred from (i) Sequence or Structural Similarity (Sequence Orthology, Sequence Alignment or Sequence Model), (ii) Genomic Context and (iii) Reviewed Computational Analysis. Evidence can be assigned by curators or by automated methods; in all cases a clear trace of how the relation was generated must be provided. The relation between GO:0006915 and the gene symbol 'TP53' is classified as Inferred from Direct Assay; it was assigned by UniProtKB and the supporting reference is provided as PMID:7720704 [the PubMed identifier for Eizenberg et al. (1995)]. Further exploration of the term GO:0006915 provides: (i) a definition of the term ('A form of programmed cell death that begins when...') and a reference (PMID:18846107), (ii) the relations to other GO terms in the DAG structure, such as 'apoptosis' is a 'programmed cell death' (GO:0012501), (iii) external references (e.g. links to Reactome), and (iv) a list of genes related to apoptosis (there are 1130 gene product associations).
4.1.4
Annotation
Annotation is the process of assigning properties to a given bioentity or the process of relating bioentities. For instance, if the entity is a gene the annotation process can (i) assign the gene to a gene set, (ii) classify the gene as constitutive or not constitutive and (iii) link the gene to other genes it regulates. Annotation is therefore a necessary process in the creation and updating of Knowledge and Ontology databases. Annotation is based on evidence (as we observed previously in GO) that can be classified as Experimental Evidence or Computational Analysis Evidence. In this section we present some of the methods used to annotate databases and, at the end, we include a subsection that briefly reviews the R tools available.
4.1.4.1
Annotation by similarity
The classic idea of annotation is based on the notion of 'similarity': elements that are similar in one aspect may be similar in other aspects, therefore we can describe (functionally annotate) one gene by those genes that are similar to it; however, the term 'similar' is specified differently in different approaches. Following this idea, high-throughput data have become a tool to functionally annotate genes and proteins (Kasif and Steffen 2010) by: (i) automated prediction of the function of genes based on homology and sequence similarity to genes of known function; (ii) organization of proteins (and genes) into clusters [PFAM (Finn et al. 2008) and the US National Center for Biotechnology Information Protein Clusters (Klimke et al. 2009)]; and (iii) extending (i) by including further information such as phylogenetic profiles, coexpression, chromosomal gene clustering
and gene fusion. This information can be integrated with machine learning algorithms that are able to predict gene functions (Jansen et al. 2003). Automated annotation is a very active research field.
4.1.4.2
Annotation by Protein Binding Sites
Protein Binding Site (PBS) annotation is based on the identification of those sequences of nucleotides (protein-binding motifs) where a given Transcription Factor (TF; proteins that bind to promoter and/or enhancer regions of a gene, regulating its expression) would bind. Position Weight Matrices (PWMs) store information regarding which sequences are bound by a given TF; the PWM associated with a TF is a 4-row, n-column matrix in which, for a nucleotide sequence of length n, each column i gives the probability of each of the four nucleotides occurring at position i of a bound sequence. TRANSFAC (semi-public; Matys et al. 2003; Windenger 2008) and JASPAR (public; Portales-Casamar et al. 2009; http://jaspar.genereg.net) are two PWM repositories. A search for TP53 in the JASPAR database returns the MA0106.1 identifier, a H. sapiens zinc-coordinating transcription factor that pertains to the Loop-Sheet-Helix family and whose PWM is provided.
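The R sketch below illustrates how a PWM can be used to scan a sequence for a putative binding site; the matrix values and the example sequence are hypothetical and are not the actual JASPAR MA0106.1 profile.

```r
# Scan a DNA sequence with a toy position weight matrix (PWM).
# Each column gives the probability of A, C, G, T at that motif position.
pwm <- matrix(c(0.7, 0.1, 0.1, 0.1,    # position 1
                0.1, 0.1, 0.7, 0.1,    # position 2
                0.1, 0.7, 0.1, 0.1,    # position 3
                0.1, 0.1, 0.1, 0.7),   # position 4
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

score_window <- function(window, pwm) {
  # log-likelihood of the window under the PWM versus a uniform background
  probs <- vapply(seq_along(window), function(j) pwm[window[j], j], numeric(1))
  sum(log(probs / 0.25))
}

dna <- strsplit("TTAGCTAGCATGACT", "")[[1]]
w   <- ncol(pwm)
scores <- sapply(1:(length(dna) - w + 1),
                 function(i) score_window(dna[i:(i + w - 1)], pwm))
which.max(scores)   # best-scoring putative binding position
```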
4.1.4.3
Annotation by Temporal Series
The rationale for using Temporal Series (TS) in annotation is that bioentities sharing the same expression pattern over time are assumed to be regulated together and are therefore expected to belong to the same functional group (cluster). Several statistical tools have been developed for the analysis of temporal series. Here we review some of the tools developed for microarray data analysis; however, the main ideas can be extended to other data types (which also involve a small number of samples and a huge number of variables). The clustering methods can be classified by (i) the nature of the clusters they identify and (ii) the searching strategy. The searching algorithms can be grouped into three sets [from Krishna et al. (2010)]: (i) Pointwise distance based methods: grouping genes by minimizing an objective function generated by the distance (measure of similarity or dissimilarity) between pairs of genes. A description of this set can be found in Chapters 2 and 7 (see k-means and hierarchical clustering). (ii) Feature based clustering methods: grouping genes by using the general shape (local or global characteristics) of an expression profile; they can therefore detect more complicated relations such as time-shifted or inverted profiles. (iii) Model based clustering methods: based on statistical mixture models, which consider the data to be generated from a finite mixture of underlying probability distributions, so that each component corresponds to a different cluster. A very useful tool of this group is MaSigPro (Conesa et al. 2006), a statistical procedure for multi-series time-course microarray experiments. It is available as a Bioconductor package and from http://www.ivia.es/centrogenomica/bioinformatics.htm. Recent methodologies combine different strategies, as in Krishna et al. (2010), where the authors combine pairwise distances (based on the Granger distance) with network clustering.
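As an illustration of the pointwise distance based approach (i), the following R sketch clusters simulated gene expression time profiles with k-means after standardizing each gene; the simulated data and the choice of four clusters are purely hypothetical.

```r
# Cluster simulated time-course profiles (genes x time points) with k-means.
set.seed(42)
n_genes <- 200; n_times <- 12
time <- seq(0, 2 * pi, length.out = n_times)

# two hypothetical temporal patterns plus noise
patterns <- rbind(sin(time), cos(time))
expr <- patterns[sample(1:2, n_genes, replace = TRUE), ] +
        matrix(rnorm(n_genes * n_times, sd = 0.3), n_genes, n_times)

# standardize each gene so that clustering reflects profile shape, not level
expr_std <- t(scale(t(expr)))

km <- kmeans(expr_std, centers = 4, nstart = 25)
table(km$cluster)                       # cluster sizes
matplot(time, t(km$centers), type = "l",
        xlab = "time", ylab = "mean standardized expression")
```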
4.1.4.4
Experimental Design to improve annotation
Many experiments are designed to increase our knowledge of the relationships between bioentities. We discuss (and extend our previous approach to) PWMs as an example. The generation of JASPAR and TRANSFAC PWMs is widely questioned because: (i) those matrices are constructed using a median of 18 individual sequences and are therefore expected to capture only a subset of the permissible range of binding sites; (ii) the accuracy of PWM models has been questioned (Benos et al. 2002); (iii) there are many examples in which transcription factors bind sets of sequences that cannot be described by standard PWMs (Chen and Schwartz 1995); and (iv) it is possible that PWMs are specific to different conditions (Berger et al. 2008). A more original approach that makes use of high-throughput technologies is the Protein Binding Microarray (PBM) (Mukherjee et al. 2004); the authors used PBMs containing 41 944 60-mer probes in which all possible 10-base sequences were represented to analyze the
DNA-binding specificity. The specific construction of the microarrays provides a way to robustly estimate the binding preference of each protein to all 8-mers (Berger et al. 2006). Berger et al. (2006) provided a database of newly generated PWMs that does not solve all the problems stated but gives a major improvement.
4.1.4.5
Annotation by Text-Mining
Researchers whose experiments highlight new relationships between elements are encouraged to insert this information in publicly available curated databases. However, because this is not always the case and because of the amount of information stored in journals, text-mining tools have been developed (Renear and Palmer 2009; Attwood et al. 2010). Text-mining is the process of extracting information from text; it can therefore be used as a tool for annotation in biology, where text is extensively available in journals. Some relevant text-mining tools are:
1. iHOP (information Hyperlinked Over Proteins) (http://www.ihop-net.org/; Hoffman and Valencia 2004). It allows the search for biomedical terms that are mentioned in the same sentence as a gene or protein of reference; the query returns all such biomedical terms and, for each of them, the references and sentences in which both the term and the gene/protein were mentioned.
2. PubGene (www.pubgen.org; Jenssen et al. 2001). It extends the query options offered by iHOP; it also generates a network that relates all single elements returned by the query and provides the statistical significance of every relation in the network. For instance, a TP53 H. sapiens query identifies tp53 in 10 265 documents and returns a list of elements (BAX, BCL2, CDKN1A, MKI67 and TCEAL1) organized within a network.
3. GRAIL (http://www.broadinstitute.org/mpg/grail/; Raychaudhuri et al. 2009). It integrates published scientific text with SNPs. Given a set of SNPs or genomic regions, a set of relevant genes is generated. From this gene set GRAIL searches the literature for similarities among the associated genes. This tool can be understood as a SNP-to-gene-set selection tool, where the SNPs usually come from the output of genome-wide association studies.
4.1.4.6
Annotation in Bioconductor
Bioconductor uses the R programming language to develop ‘tools for the analysis and comprehension of highthroughput genomic data’ (www.bioconductor.org). It contains more than 380 packages and it is periodically updated. Within Bioconductor there are many tools that allow researchers to use annotations and ontologies. Each package is updated periodically and full descriptions of them can be found at the website. Regarding annotation, there is a set of resources that allows programmers and users to map between probes, genes, proteins, pathways and ontology terms. Bioconductor has built-in representations of major ontology databases and data resources as: 1. GO: GO.db is a set of annotation maps that describes the entire Gene Ontology. GO, within each of its categories, is conceived as a DAG; within GO.db there is a set of datasets that specify those relations. This package is updated biannually. 2. KEGG: KEGG.db package provides information about the latest version of the KEGG pathway databases. It is updated biannually and it maps KEGG identifiers and elements within them to other databases such as GO terms or Entrez Gene. 3. Microarrays: there are packages that annotate the different microarray platforms and versions to different gene identifiers. An example is the classical Affymetrix Mouse Genome 430 2.0 Array, where in
Table 4.2 R packages for Data Integration

Package | Description | Reference
CCA | Canonical correlation analysis | González et al. (2008)
Gostats (B) | Tools for interacting with GO and microarray data (including functional enrichment) | Falcon and Gentleman (2007)
GSEA | Functional enrichment by Gene Set Enrichment Analysis | Subramanian et al. (2005)
IntegrOmics | Integration of different types of omic data | Lê Cao et al. (2009)
LRPath | Functional Set Enrichment by logistic regression | Sartor et al. (2009)
MaSigPro (B) | Analysis of multi-series time-course microarray experiments | Conesa et al. (2006)
RankAggreg | Tool that allows the combination of ordered lists using Rank aggregation | Pihur et al. (2009)

(B), package included in Bioconductor.
Bioconductor the annotation data (mouse4302.db, which provides mappings between manufacturer identifiers and other identifiers such as Entrez Gene and Ensembl), the cdf file (mouse4302.cdf, used to convert between (x,y)-coordinates on the chip and single-number indices and back) and the probe sequence data (mouse4302probe, the probe sequences in a data-frame R object) are available. 4. Full genomes: there are packages that contain the different sequenced genomes, such as those of H. sapiens, Mus musculus and Saccharomyces cerevisiae. Most of the previous packages depend on the AnnotationDbi package, which provides the user interface and database connection code for annotation data packages using SQLite data storage.
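As a brief illustration of how these annotation packages can be combined in R, the sketch below maps a couple of Affymetrix probe set identifiers to gene symbols and GO terms; it assumes that the mouse4302.db and GO.db packages are installed, and the specific probe set identifiers are placeholders chosen only for illustration.

```r
# Map Affymetrix Mouse Genome 430 2.0 probe sets to gene symbols and GO terms.
library(mouse4302.db)
library(GO.db)

probes <- c("1417409_at", "1460251_at")   # hypothetical probe set IDs

# probe set -> official gene symbol
symbols <- unlist(mget(probes, mouse4302SYMBOL, ifnotfound = NA))

# probe set -> GO annotations (per probe, a list whose names are GO identifiers)
go_annot <- mget(probes, mouse4302GO, ifnotfound = NA)
go_ids   <- unique(unlist(lapply(go_annot, names)))

# GO identifier -> human-readable term name, via GO.db
go_terms <- Term(go_ids)

symbols
head(data.frame(GOID = go_ids, term = go_terms, row.names = NULL))
```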
4.2
Data integration in biological studies
This section reviews relevant examples of integrating Data repositories, KDs and Ontologies. We divide this section into two parts. We first review the integration of different experimental data types; the second part reviews the integration of meta-data and experimental data. Finally, we review how network and visualization tools have been used in data integration.
4.2.1
Integration of experimental data
In order to provide a unifying view of a biological system from all the available experimental data, techniques must be developed to overcome the problem of integrating different data types and different experimental conditions. From the literature we can extract two approaches. One considers generic (R) tools to integrate different omic data types. For instance, IntegrOmics (Lê Cao et al. 2009) is a package developed to integrate different datasets, even if they are of different types. To deal with the problem of the large number of elements and the reduced number of measures (p >> n), the authors developed and implemented two different approaches: (i) a regularized canonical correlation analysis (CCA) (González et al. 2008) for the case p >> n (González et al. 2009) and (ii) a sparse partial least squares regression (Lê Cao et al. 2008) to simultaneously integrate and select variables using Lasso penalization. RankAggreg (Pihur et al. 2009) provides two methods (a Cross-Entropy method and a Genetic Algorithm) to combine ordered lists using rank aggregation; the strength of this approach is that rank aggregation allows the combination of lists from different sources (e.g. data types). A second approach is to review practical cases whose methodologies can be standardized. Below we provide some key examples.
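To illustrate the canonical correlation idea underlying such tools, the R sketch below applies base R's cancor() to two simulated data blocks measured on the same samples (e.g. transcripts and metabolites); it is an unregularized illustration only, with fewer variables than samples, whereas regularized or sparse variants such as those in IntegrOmics are needed when p >> n. The data are hypothetical.

```r
# Canonical correlation between two simulated omic blocks measured on the
# same samples (unregularized illustration on hypothetical data).
set.seed(7)
n <- 60
common <- rnorm(n)                                   # shared biological signal
X <- matrix(rnorm(n * 5), n, 5) + common             # e.g. 5 transcripts
Y <- matrix(rnorm(n * 4), n, 4) + common             # e.g. 4 metabolites

cc <- cancor(scale(X), scale(Y))
cc$cor                   # canonical correlations between the two blocks
round(cc$xcoef[, 1], 2)  # weights defining the first canonical variate of X
```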
4.2.1.1
Microarrays sampled from different experimental designs
Microarray repositories are a major resource, as thousands of microarray experiments have been stored over the last decade. However, the challenges in using this resource in an integrative way are: (i) experimental designs are as a rule very different (e.g. different animal models and different experimental conditions); (ii) the necessarily large number of comparisons (usually more than 10 000), which will result in many false-positives unless appropriately stringent thresholds are employed (see Chapter 2); and (iii) different sources of variability (Jarvinen et al. 2004; Thompson et al. 2004); for example, in some cases variability between technologies (such as the use of different microarray technologies) is lower than variability between laboratories (the same experiment, with the same technology, in different laboratories) (Wang et al. 2005). Two main approaches have been considered to deal with these challenges. One approach is to avoid individual-level comparison between datasets and use only data summaries. Oncomine (Rhodes et al. 2007) is a cancer microarray database that integrates a web-based data-mining platform. Researchers are expected first to select appropriately among the datasets available in the database and then to use meta-analysis to identify the genes that are significantly over-expressed or under-expressed across multiple independent studies. However, other approaches have been evaluated that avoid the 'selection' step: in Submap, Hoshida et al. (2007) developed a method for integrating and comparing data from different datasets. The method begins with a set of datasets and a predefined gene grouping within each dataset; it then compares the relationships between the different groups by Gene Set Enrichment Analysis (see Chapter 7) and summarizes the relations between the different dataset clusterings in a matrix. This method has been validated against different sets and is robust to different DNA microarray platforms and laboratories. A second approach is to exploit the low probability that multiple transcripts follow a complex pattern of expression across dozens or hundreds of conditions by chance. Therefore, if such sets exist they may constitute coherent and biologically meaningful transcriptional units. However, transcriptional units must be validated by the use of other techniques and experimental designs. Chaussabel et al. (2008) designed a methodology to identify transcriptional modules formed by genes co-ordinately expressed in multiple microarray disease datasets. They tested the methodology on microarray datasets from blood samples and used the obtained modules to provide a set of biomarkers that were able to indicate disease progression in patients with lupus erythematosus. Following the same idea but considering a predefined set of genes, Nilsson et al. (2009) developed a large-scale computational screen to identify mitochondrial proteins whose transcripts consistently co-express with the core machinery of heme biosynthesis. The idea is that interesting genes are those that are correlated with the reference gene set only when the gene set is acting as a functional unit. The authors succeeded in proving (by experimental validation) that several top-ranked genes not previously related to heme biosynthesis and mitochondrial iron homeostasis were actually related. A third approach is the use of clustering algorithms, which are of major relevance in integrating datasets from different samples.
However, the algorithm classification provided in Chapters 2 and 7 needs to be extended by the nature of the clustering. Methods can be further classified as: (i) one-way clustering, to find either gene clusters or sample clusters; (ii) two-way clustering, to find both gene clusters and sample clusters in a combined approach; and (iii) bi-clustering methods, in which gene clusters are defined only over a sample cluster that is found simultaneously (Getz et al. 2000; Hägg et al. 2009).
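As a generic illustration of the meta-analysis route described above (combining evidence per gene across independent studies, rather than the specific Oncomine or Submap procedures), the following R sketch combines simulated per-gene p-values from several studies using Fisher's method; all counts and p-values are hypothetical.

```r
# Combine per-gene p-values from independent studies with Fisher's method.
set.seed(3)
n_genes <- 1000; n_studies <- 4

# simulated p-values: most genes null, the first 50 differentially expressed
p <- matrix(runif(n_genes * n_studies), n_genes, n_studies)
p[1:50, ] <- matrix(rbeta(50 * n_studies, 0.2, 1), 50, n_studies)

fisher_stat <- -2 * rowSums(log(p))                  # chi-squared with 2k d.f.
p_meta <- pchisq(fisher_stat, df = 2 * n_studies, lower.tail = FALSE)

# adjust for the large number of comparisons
p_adj <- p.adjust(p_meta, method = "BH")
sum(p_adj < 0.05)    # number of genes called significant across studies
```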
4.2.1.2
Comparing and/or integrating different technologies
When a new experimental technique is developed it is tested against the well known results from previous techniques; this allows an evaluation of the weaknesses and strengths of new methodologies. Therefore validation can be considered as Data Integration because different data types should be compared. We compare
High Throughput Sequencing (HTS, see Chapter 7) versus microarrays in Chromatin Immunoprecipitation (ChIP) analysis. ChIP is a procedure used to determine whether a given protein binds to or is localized to a specific DNA region in vivo. Briefly, it works as follows: first a target [such as a protein or a chromatin mark, see for instance Barski et al. (2007) and Wang et al. (2008)] is selected; then an antibody that attaches to the selected target is added to the sample and is used to purify the selected DNA. The ChIP assay returns an amount of 'marked' DNA for further study; the two major methodologies applied to analyze it are HTS (ChIP-Seq) and microarrays (ChIP-chip). HTS technology needs to map all reads to the reference genome (generating a coverage) and then statistical tools are used to search for regions that are differentially enriched [such as MACS (Zhang et al. 2008) for TF binding sites, and SICER (Zang et al. 2009) for histone modification profiling] and are therefore considered to contain marks. Park (2009) compares ChIP-Seq and ChIP-chip. The conclusion is that comparative analysis between technologies must take into account that: (i) microarrays only measure over predefined regions and no new region will be found, whereas in ChIP-Seq new binding regions can be discovered; (ii) in microarrays there is cross-hybridization between probes and nonspecific targets, whereas in ChIP-Seq some GC bias can be present; (iii) the amount of DNA required is higher in ChIP-chip analysis; and (iv) amplification steps are less necessary in ChIP-Seq. ChIP profiles are definitely better defined in ChIP-Seq; however, if a nucleotide window is selected, there is a correlation between the profiles obtained by the two technologies.
4.2.1.3
Genetics and Epigenetics
The importance of epigenetic modifications (Flintoft 2010) has been extensively shown in recent studies such as Barski et al. (2007), Jothi et al. (2008), Schones et al. (2008) and Wang et al. (2008, 2009), in which the CD4+ T cell was extensively studied. The conclusion is that models of gene expression need to account for epigenetic modifications, as they modify the availability of genes for transcription. Recently, Karli et al. (2010) showed that it is possible to predict the expression level of a single gene by using a maximum of three histone modifications as predictors. Gene expression was measured by normalized microarray data, while for histone modifications the log tag number from ChIP-Seq experiments (one per modification) was used. The authors were able to export the models to other cell types on which they had not been trained, showing a clear validation of their initial assumptions. Recently, Artyomov et al. (2010) developed the first mathematical model that considers genetic and epigenetic regulatory networks, describing the transformations resulting from expression of reprogramming factors. However, the major approaches that use epigenetic data are currently based on statistical models.
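The following R sketch illustrates the type of statistical model involved: a linear regression of expression on a few histone modification signals, using simulated data as a stand-in for the normalized microarray values and ChIP-Seq log tag counts described above; the marks, coefficients and sample sizes are hypothetical.

```r
# Predict expression from three histone modification signals (simulated data).
set.seed(11)
n_genes <- 500
H3K4me3  <- rnorm(n_genes)        # hypothetical log tag counts
H3K27ac  <- rnorm(n_genes)
H3K36me3 <- rnorm(n_genes)
expression <- 1.2 * H3K4me3 + 0.8 * H3K27ac + 0.5 * H3K36me3 + rnorm(n_genes)

fit <- lm(expression ~ H3K4me3 + H3K27ac + H3K36me3)
summary(fit)$r.squared     # variance in expression explained by the marks

# the fitted model can then be applied to signals from another cell type
new_marks <- data.frame(H3K4me3 = rnorm(10), H3K27ac = rnorm(10),
                        H3K36me3 = rnorm(10))
predict(fit, newdata = new_marks)
```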
4.2.1.4
Transcriptomics and Metabolomics
Jozefczuk et al. (2010) compare and integrate gene expression and metabolomic measurements over different stress conditions and over a period of time. For five conditions (oxidative stress, glucose-lactose diauxic shift, heat, cold and an unperturbed culture as control), Gas Chromatography Mass Spectrometry (GC-MS) measurements were made in three different samples, for three different technical replicates and for 12 different time points. Microarray analysis was performed on three samples each, without technical replicates, at two time points under each condition, except for the oxidative stress condition, for which 12 samples were measured. Individually, both types of data lead to the same type of conclusions: in both cases, metabolites and mRNA expression, it is possible to group the time profiles by the type of perturbation. However, the authors show that the profiles generated by the metabolomic data are much more specific (to the perturbations) than the profiles generated by the transcriptomic data, showing that even though the two omics are related, metabolites were more sensitive to the perturbations. Also, condition-dependent associations between metabolites and transcripts were identified by co-clustering and canonical correlation analysis on combined metabolite and
transcript datasets. Therefore the authors were able to confirm existing models for co-regulation between gene expression and metabolites.
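Canonical correlation analysis of this kind can be sketched in a few lines. The example below uses synthetic metabolite and transcript matrices that share a low-dimensional 'perturbation' signal; the data, dimensions and package choice (scikit-learn) are assumptions for illustration only, not the analysis of Jozefczuk et al. (2010).

```python
# Minimal sketch of canonical correlation analysis between a metabolite matrix and a
# transcript matrix (synthetic data).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)
n_samples = 36                                    # e.g. conditions x time points x replicates

latent = rng.normal(size=(n_samples, 2))          # shared "perturbation" signal
metabolites = latent @ rng.normal(size=(2, 20)) + 0.3 * rng.normal(size=(n_samples, 20))
transcripts = latent @ rng.normal(size=(2, 50)) + 0.3 * rng.normal(size=(n_samples, 50))

cca = CCA(n_components=2)
cca.fit(metabolites, transcripts)
met_scores, tr_scores = cca.transform(metabolites, transcripts)

# Correlation of the paired canonical variates indicates shared structure
for k in range(2):
    r = np.corrcoef(met_scores[:, k], tr_scores[:, k])[0, 1]
    print(f"canonical correlation {k + 1}: {r:.2f}")
```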
4.2.2 Ontologies and experimental data
A key example of the integration of knowledge and ontology databases with experimental gene expression data is gene-set analysis. The basic idea is to identify predefined (biologically relevant) gene sets (PGSs) enriched with differentially expressed (DE) genes. Sets associated with Gene Ontology terms (The Gene Ontology Consortium 2000) and KEGG pathways (Kanehisa et al. 2006) are commonly used. Given two experimental conditions (control and disease), the identification of enriched GO biological terms points to biological processes that may be involved in disease development. A description of this technique and some variations are included in Chapter 7.
4.2.3 Networks and visualization software as integrative tools
Since the seminal paper on network analysis (Barabási and Albert 1999) and subsequent applications to biological systems (Jeong et al. 2000, 2001), there has been a revolution in how we understand and analyze molecular biology data. One basic idea that makes networks a powerful tool is that any set of biological entities and their relations can be analyzed through them. More importantly, biological entities of different types can be compared through them. For instance, Figure 4.1 shows a set of entities (named A, B, C and D) that can be regarded as genes or as their related proteins, depending on the context. In Figure 4.1(a) each node denotes a gene; there is an edge between two genes if both belong to the same module in a clustering analysis of transcriptomic data sets (see Section 4.2.1). Figure 4.1(b) considers each node as a protein and shows a link between two nodes if a physical interaction [such as binding, see Stelzl et al. (2005)] has been experimentally validated, for example by yeast two-hybrid assays. In Figure 4.1(c) each node denotes both a gene and its related protein; a directed edge (x, y) is drawn if protein x binds to the promoter region of gene y, as predicted for instance by PWMs. We can compare networks by observing which relations are unique to each network and which are common to all, and by comparing network properties (such as the degree distribution and the distribution of shortest paths; see Chapters 14 and 15 for greater detail). All networks can also be merged for further analysis [see Figure 4.1(d)]. Chapters in Part C develop these ideas and show their integrative power. However, networks need to be visualized. One key visualization tool is Cytoscape, an open-source software package that allows visualization of molecular interaction networks (Shannon et al. 2003) and includes tools to integrate these interactions with experimental data (Cline et al. 2007). As a visualization tool it includes: (i) data integration, supporting many standards such as the Simple Interaction Format, GML, BioPAX, SBML and OBO, and allowing the import of data files; (ii) a visualization manager (VizMapper); and (iii) network analysis tools. One of the major strengths of Cytoscape is the number of available plug-ins; two plug-ins of interest are BiNGO (Maere et al. 2005), which assesses the over-representation of Gene Ontology categories in biological networks, and CABIN (Collective Analysis of Biological Interaction Networks) (Singhal and Domicob 2007), which enables the analysis and integration of interaction evidence obtained from multiple sources. Cytoscape is able to import data files generated by R.
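As a toy illustration of this kind of cross-comparison and merging, the sketch below builds three small networks over the hypothetical entities A-D of Figure 4.1 and then intersects and merges them; the edges and the use of the networkx package are assumptions chosen purely for illustration (Cytoscape or R would serve equally well).

```python
# Minimal sketch of using networks as an integrative layer (hypothetical edges only).
import networkx as nx

coexpression = nx.Graph([("A", "B"), ("B", "C")])     # clustering-derived modules
physical = nx.Graph([("A", "B"), ("C", "D")])          # e.g. yeast two-hybrid evidence
regulatory = nx.DiGraph([("A", "C"), ("A", "D")])      # TF -> promoter binding

# Relations supported by more than one evidence type
shared = set(coexpression.edges()) & set(physical.edges())
print("edges common to co-expression and physical networks:", shared)

# Merge all evidence types into a single undirected summary network
merged = nx.compose(coexpression, physical)
merged.add_edges_from((u, v) for u, v in regulatory.edges())
print("merged degree distribution:", dict(merged.degree()))
```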
4.3 Concluding remarks
Biological sciences are in a revolutionary phase, fueled by recent advances in technologies for measuring biological entities and states. New data types are being produced and the volume of data that requires storage is rapidly increasing. This situation presents new challenges, not only in terms of data storage but, perhaps primarily, in terms of data integration. Furthermore, data integration is becoming a must, as it is generally accepted that no single data type provides a complete view of any biological system.
Figure 4.1 Examples of gene and protein networks. (a) Gene association by clustering algorithms. (b) Protein association by physical interaction such as binding. (c) Node association by transcription factor binding to a promoter sequence. (d) Network merging
We have shown that KDs and ontologies are storage structures that provide an organized view of current knowledge, but these structures need to be developed further to increase their scope and integration. On the other hand, the use of the stored experimental data and/or knowledge will require the development of generic (statistical) tools able to integrate different data types; we can expect both metabolomics and lipidomics to increase in volume and quality, and thereby to increase the need for integrative tools beyond current transcriptomics and proteomics applications. Equally important to the development and use of standards and new tools for data integration will be the development of tools for scientific visualization. This area and its application to systems biology are still rather underdeveloped, with the exception of Cytoscape and a few others. This opens up the possibility of exciting projects involving computer scientists with expertise in visualization, computational biologists, and experimental and medical researchers. Finally, we consider that data integration will be the key challenge in systems biology; statistics, networks, mathematical analysis and data structures will be key technologies in the success of this integrative approach.
References
Attwood TK, Kell DB, McDermott P, et al. 2010 Utopia documents: linking scholarly literature with research data. Bioinformatics 26, i568–i574.
Artyomov MN, Meissner A and Chakraborty AK 2010 A model for genetic and epigenetic regulatory networks identifies rare pathways for transcription factor induced pluripotency. PLoS Computational Biology 6(5), 1–14. Baker SJ, Fearon ER, Nigro JM, et al. 1989. Chromosome 17 deletions and p53 gene mutations in colorectal carcinomas. Science 244, 217–221. Barab´asi AL and Albert R. 1999 Emergence of scaling in random networks. Science 286, 509–512. Barrett T, Troup DB, Wilhite SE, et al. 2007 NCBI GEO: mining tens of millions of expression profiles–database and tools update. Nucleic Acids Research 35, 760–765. Barski A, Cuddapah S, Cui K et al. 2007 High-resolution profiling of histone methylations in the human genome. Cell 129(4), 823–837. Benson DA, Karsch-Mizrachi I, Lipman DJ, et al. 2009 GenBank. Nucleic Acids Research 37, 26–31. Benos PV, Bulyk ML and Stormo GD. 2002 Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Research 30, 4442–4451. Berger MF, Philippakis AA, Qureshi AM, et. al. 2006 Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nature Biotechnology 24, 1429–1435. Berger MF, Badis G, Gehrke AR, et al. 2008 Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell 133, 1266–1276. Brazma A, Hingamp P, Quackenbush J, et al. 2001 Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nature Genetics 29, 365–371. Chaussabel D, Quinn C, Shen J, 2008 A modular analysis framework for blood genomics studies: application to systemic lupus erythematosus. Immunity 29, 150–164. Chen CY and Schwartz RJ 1995 Identification of novel DNA binding targets and regulatory domains of a murine tinman homeodomain factor, nkx-2.5. Journal of Biological Chemistry 270, 15628–15633. Clark TA, Schweitzer AC, Chen TX, et al. 2007 Discovery of tissue-specific exons using comprehensive human exon microarrays. Genome Biology 8, 64. Cline MS, Smoot M, Cerami E, et al. 2007 Integration of biological networks and gene expression data using Cytoscape. Nature Protocols 2, 2366–2382. Conesa A, Nueda MJ, Ferrer A and Talo M 2006 maSigPro: a method to identify significantly differential expression profiles in time-course microarray experiments. Bioinformatics 22(9), 1096–1102. DeLeo AB, Jay G, Appella E, et al. 1979 Detection of a transformation related antigen in chemically induced sarcomas and other transformed cells of the mouse. Proceedings of the National Academy of Sciences of the United States of America 76, 2420–2424. Edgar R, Domrachev M and Lash AE 2002 Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 30, 207–210. Eizenberg O, Faber-Elman A, Gottlieb E, et al. 1995 Direct involvement of p53 in programmed cell death of oligodendrocytes. EMBO Journal 14(6), 1136–44. Falcon S and Gentleman R 2007 Using GOstats to test genelists for GO term association. Bioinformatics 23(2), 257–258. Finn RD, Tate J, Mistry J, et al. 2008 The Pfam protein families database. Nucleic Acids Research 36, 281–288. Flicek P, Aken BL, Ballester B, et al. 2010 Ensembl’s 10th year. Nucleic Acids Research 38, 557–562. Flintoft L 2010 Complex disease: Adding epigenetics to the mix. Nature Reviews Genetics 11, 94–95. Getz G, Levine E and Domany E 2000 Coupled two-way clustering analysis of gene microarray data. 
Proceedings of the National Academy of Sciences of the United States of America 97, 12079–12084. Gonz´alez I, Dejean S, Martin PGP and Baccini A 2008 CCA: an R package to extend canonical correlation analysis. Journal of Statistical Software 23, 1–14. Gonz´alez I, Dejean S, Martin PGP, et al. 2009 Highlighting relationships between heteregeneous biological data through graphical displays based on regularized canonical correlation analysis. Journal of Biological Systems 17, 173–199. Hacia JG, Fan J, Ryder O, et al. 1999 Determination of ancestral alleles for human single nucleotide polymorphisms using high-density oligonucleotide arrays. Nature Letter 22, 164–167. H¨agg S, Skogsberg J, Lundstr´om J, et al. 2009 Multi-organ expression profiling uncovers a gene module in coronary artery disease involving transendothelial migration of leukocytes and LIM domain binding 2: the Stockholm Atherosclerosis Gene Expression (STAGE) Study. PLoS Genetics 5(12), e1000754.
80 Handbook of Statistical Systems Biology Hijikata A, Kitamura H, Kimura Y, et al. 2007 Construction of an open-access database that integrates cross-reference information from the transcriptome and proteome of immune cells. Bioinformatics. 23(21), 2934–2941. Hoffmann R and Valencia A 2004 A gene network for navigating the literature. Nature Genetics 36, 664. Horak CE and Snyder M 2002 ChIP-chip: a genomic approach for identifying transcription factor binding sites. Methods in Enzymology 350, 469–83. Hoshida Y, Brunet JP, Tamayo P, et al. 2007 Subclass mapping: identifying common subtypes in independent disease data sets. PLoS ONE 2(11), e1195. Ihaka R and Gentleman R 1996 R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics 5, 299–314. Jansen R, Yu H, Greenbaum D, et al. 2003 A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302, 449–453. Jarvinen A, Hautaniemi S, Edgren H, et al. 2004 Are data from different gene expression microarray platforms comparable? Genomics 83, 1164–1168. Jay G, Khoury G, DeLeo AB, et al. 1981. p53 transformation-related protein: detection of an associated phosphotransferase activity. Proceedings of the National Academy of Sciences of the United States of America 78, 2932–2936. Jenssen T, Laegreid A, Komorowski J and Hovig E 2001 A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics 28, 21–28. Jeong H, Tombor B, Albert R, et al. 2000 The large-scale organization of metabolic networks. Nature 407, 651–654. Jeong H, Mason S, Barab´asi AL and Oltvai ZN 2001 Lethality and centrality in protein networks. Nature 411, 41–42. Jothi R, Cuddapah S, Barski A, et al. 2008 Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Research 36(16), 5221–5231. Jozefczuk S, Klie S, Catchpole G, et al. 2010 Metabolomic and transcriptomic stress response of Escherichia coli. Molecular Systems Biology 6, 364. Kanehisa M, Goto S, Hattori M, et al. 2006 From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Research 34, 354–357. Karli R, Chung HR, Lasserrea J, et al. 2010 Histone modification levels are predictive for gene expression. Proceedings of the National Academy of Sciences of the United States of America 107(7), 2926–2931. Kasif S and Steffen M 2010 Biochemical networks: the evolution of gene annotation. Nature Chemical Biology 6(1), 4–5. Kimball R and Ross M 2002 The Data Warehouse Toolkit. The Complete Guide to Dimensional Modeling., John Wiley & Sons, Ltd, New York. Klimke W, Agarwala R, Badretdin A, et al. 2009 The National Center for Biotechnology Information’s Protein Clusters Database. Nucleic Acids Research 37, 216–223. Krishna R, Li C and Buchanan-Wollaston V 2010 A temporal precedence based clustering method for gene expression microarray data. BMC Bioinformatics 11, 68. Laibe C and Le Nov´ere N 2007 MIRIAM Resources: tools to generate and resolve robust cross-references in Systems Biology. BMC Systems Biology 1, 58. Lane DP and Crawford LV 1979. T antigen is bound to a host protein in SV40-transformed cells. Nature 278, 261–263. Lˆe Cao KA, Rossouw D, Robert-Grani C and Besse P 2008 A sparse PLS for variable selection when integrating Omics data. Statistical Applications in Genetics and Molecular Biology 7, Article 35. Lˆe Cao K, Gonzalez I and Dejean S 2009 IntegrOmics: an R package to unravel relationships between two omics datasets. Bioinformatics 21, 2855–2856. 
Lesne A and Benecke A 2008 Feature context-dependency and complexity-reduction in probability landscapes for integrative genomics. Theoretical Biology and Medical Modelling 5, 21. Linzer DI and Levine AJ 1979 Characterization of a 54K dalton cellular SV40 tumor antigen present in SV40-transformed cells and uninfected embryonal carcinoma cells. Cell 17, 43–52. Maere S, Heymans K and Kuiper M 2005 BiNGO: a Cytoscape plugin to assess over representation of Gene Ontology categories in biological networks. Bioinformatics 21, 3448–3449. Maglott D, Ostell J, Pruitt KD and Tatusova T 2007 Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research 35, 26–31. Matthews L, Gopinath G, Gillespie M, et al. 2009 Reactome knowledge base of human biological pathways and processes. Nucleic Acids Research 37, 619–22.
Matos P, Alcantara R, Dekker A, et al. 2009 Chemical Entities of Biological Interest: an update. Nucleic Acids Research 38, 249–254. Matys V, Fricke E, Geffers R, et al. 2003 TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Research 31(1), 374–378. Mowat M, Cheng A, Kimura N, et al. 1985 Rearrangements of the cellular p53 gene in erythroleukaemic cells transformed by Friend virus. Nature 314, 633–636. Mukherjee S, Berger MF, Jona G, et al. 2004 Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nature Genetics 36, 1331–1339. Natale DA, Arighi CN, Barker WC, et al. 2007 Framework for a protein ontology. BMC Bioinformatics, 27(8),S1. Nilsson R, Schultz IJ, Pierce EL, et al. 2009 Discovery of genes essential for heme biosynthesis through large-scale gene expression analysis. Cell Metabolism 10, 119–130. Park PJ 2009 ChIP-seq: advantages and challenges of a maturing technology. 2009 Nature Reviews Genetcs 10(10), 669–680. Parkinson H, Kapushesky M, Shojatalab M, et al. 2007 ArrayExpress–a public database of microarray experiments and gene expression profiles. Nucleic Acids Research 35, 747–750. Pihur V, Datta S and Datta S 2009 RankAggreg, an R package for weighted rank aggregation. BMC Bioinformatics 10, 62. Portales-Casamar E, Thongjuea S, Kwon AT, et al. 2009 JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Research 1–6. Pruitt KD, Tatusova T and Maglott DR 2007 NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, 35, 61–65. Pruitt KD, Harrow J, Harte RA, et al. 2009 The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Research 19(7), 1316–23. Raychaudhuri S, Plenge RM, Rossin EJ, et al. 2009 Identifying relationships among genomic disease regions: predicting genes at pathogenic SNP associations and rare deletions. PLOS Genetics 5(6), e1000534. Renear AR and Palmer CL 2009 Strategic reading, ontologies, and the future of scientific publishing. Science 325(5942), 828–832. Rhodes DR, Kalyana-Sundaram S, Mahavisno V, et al. 2007 Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia, 9(2), 166–180. Salwinski L, Miller CS, Smith AJ, et al. 2004 The Database of Interacting Proteins: 2004 update. Nucleic Acids Research 32, 449–451. Sartor MA, Leikauf GD and Medvedovic M 2009 LRpath: a logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics 25(2), 211–217. Schones DE, Cui K, Cuddapah S, et al. 2008 Dynamic regulation of nucleosome positioning in the Human Genome. Cell 132(5), 887–898. Shannon P, Markiel A, Ozier O, et al. 2003 Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research 13(11), 2498–2504. Singhal M and Domicob K. 2007 CABIN: Collective Analysis of Biological Interaction Networks. Computational Biology and Chemistry 31, 222–225 Smialowski P, Pagel P, Wong P, et al. 2010 The Negatome database: a reference set of non-interacting protein pairs Nucleic Acids Research 38, 540–544. Smith B, Ashburner M, Rosse C, et al. 2007 The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology 25(11), 1251–1255. Stelzl U, Worm U, Lalowski M, et al. 
2005 A human protein-protein interaction network: a resource for annotating the proteome. Cell 122, 957–968. Subramanian A, Tamayo P, Mootha VK, et al. 2005 Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102(43), 15545–15550. The Gene Ontology Consortium 2000 Gene Ontology: tool for the unification of biology. Nature Genetics 25, 25–29. The UniProt Consortium 2010 The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Research 38, 142–148. Thompson KL, Afshari CA, Amin RP, et al. 2004 Identification of platform-independent gene expression markers of cisplatin nephrotoxicity. Environmental Health Perspectives 112, 488–494.
Wang H, He X, Band M, et al. 2005 A study of inter-lab and inter-platform agreement of DNA microarray data. BMC Genomics 6(1), 71. Wang Z, Zang C, Rosenfeld JA, et al. 2008 Combinatorial patterns of histone acetylations and methylations in the human genome. Nature Genetics 40(7), 897–903. Wang Z, Zang C, Cui K, et al. 2009 Genome-wide mapping of HATs and HDACs reveals distinct functions in active and inactive genes. Cell 138(5), 1019–1031. Wheeler DL, Barrett T, Benson DA, et al. 2007 Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 35, D5–12. Wilhelm E, Pellay F, Benecke A and Bell B 2008 TAF6d controls apoptosis and gene expression in the absence of p53. PloS One 3(7), e2721. Wingender E 2008 The TRANSFAC project as an example of framework technology that supports the analysis of genomic regulation. Briefings in Bioinformatics 9(4), 326–332. Zang C, Schones DE, Zeng C, et al. 2009 A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics 25, 1952–1958. Zhang Y, Liu T, Meyer CA, et al. 2008 Model-based Analysis of ChIP-Seq (MACS). Genome Biology 9(9), 137.
5
Control Engineering Approaches to Reverse Engineering Biomolecular Networks
Francesco Montefusco1,2, Carlo Cosentino2 and Declan G. Bates1
1 College of Engineering, Mathematics and Physical Sciences, University of Exeter, UK
2 School of Computer and Biomedical Engineering, Università degli Studi Magna Græcia di Catanzaro, Catanzaro, Italy
The last decade has witnessed a tremendous growth in interdisciplinary research on the application of systems and control engineering techniques to biological problems. A fundamental challenge in this field is the development of appropriate modelling and simulation frameworks for biological networks at the molecular level: gene regulatory, metabolic, signal transduction and protein–protein interaction networks provide a rich field of application for mathematicians, engineers and computer scientists. The reason for this renewed appeal can be largely ascribed to recent breakthroughs in the field of biotechnology, such as cDNA microarrays and oligonucleotide chips [1, 2], which have made high-throughput and quantitative experimental measurements of biological systems much easier and cheaper to make. The availability of such an overwhelming amount of data, however, poses a new challenge: how to reverse engineer the biological systems (especially at the molecular level) starting from the measured response to external perturbations (e.g. drugs, signalling molecules, pathogens) and changes in the environmental conditions (e.g. change in the concentration of nutrients or in the temperature level). In this chapter, we provide an overview of several promising techniques, based on dynamical systems identification theory, for reverse engineering the topology of biomolecular interaction networks from this kind of experimental data.
5.1 Dynamical models for network inference
A standard approach to model the dynamics of biomolecular interaction networks is by means of a system of ordinary differential equations (ODEs) that describes the temporal evolution of the various compounds [3, 4].
Typically, the network is modelled as a system of rate equations in the form

$$ \dot{x}_i(t) = f_i(x(t), p(t), u(t)), \qquad (5.1) $$
for i = 1, . . . , n with x = (x1 , . . . , xn )T ∈ Rn , where the state variables xi denote the quantities of the different compounds present in the system (e.g. mRNA, proteins, metabolites) at time t, fi is the function that describes the rate of change of the state variable xi and its dependence on the other state variables, p is the parameter set and u is the vector of external perturbation signals. The level of detail and the complexity of these kinetic models can be adjusted, through the choice of the rate functions fi , by using more or less detailed kinetics, i.e. specific forms of fi (linear or specific types of nonlinear functions). Moreover, it is possible to adopt a more or less simplified set of entities and reactions, e.g. choosing whether to take into account mRNA and protein degradation, or delays for transcription, translation and diffusion time [3]. The use of systems of ODEs as a modelling framework for biological networks presents several challenges to control engineering approaches to network reconstruction, which typically stem from classical system identification procedures. The problem is typically tackled in two steps: the first step is to determine the model structure, i.e. the mathematical form of fi , that is most appropriate to describe the experimental dynamics; the second step aims to compute the values of the parameters that yield the best fit of the experimental data. The two steps are strictly interconnected: the number and type of parameters is defined by the model structure; on the other hand, the interpolation of the experimental data provides important hints about how to modify the model structure to get better results. When the order of the system increases, nonlinear ODE models quickly become intractable in terms of parametric analysis, numerical simulation and especially for identification purposes. If the nonlinear functions fi are allowed to take any form, indeed, determination of the network topology becomes impossible. A more sensible approach, therefore, is to use equations composed of as few mathematical terms as possible. Even assuming that the model structure is perfectly known, each equation in the model requires knowledge of one or more parameter values (thermodynamic constants, rate constants), which are difficult to estimate using current data production techniques. This fact, along with the low number of measurements, typically renders the ODE system not uniquely identifiable from the data at hand. Due to the above issues, although biomolecular networks are characterized by complex nonlinear dynamics, many network inference approaches are based on linear models or are limited to very specific types of nonlinear functions. In what follows, we will illustrate some of the most significant advances that have been achieved in the development of effective network reconstruction methods based on dynamical systems identification. As will become clear, the reverse engineering methods are closely related to the choice of model structure. Therefore, before illustrating the reverse engineering methods, we briefly introduce the most common model structures which may be chosen within this identification framework. 5.1.1
Linear models
The dynamical evolution of a biological network can be described, at least for small excursions of the relevant quantities from the equilibrium point, by means of linear systems, made up of ODEs in the continuous-time case, or difference equations in the discrete-time case (see [5–10] and references therein). We consider the continuous-time LTI model

$$ \dot{x}(t) = A x(t) + B u(t), \qquad (5.2) $$

where $x(t) = (x_1(t), \ldots, x_n(t))^T \in \mathbb{R}^n$, the state variables $x_i$, $i = 1, \ldots, n$, denote the quantities of the different compounds present in the system (e.g. mRNA concentrations for gene expression levels), $A \in \mathbb{R}^{n \times n}$ is the dynamic matrix and $B \in \mathbb{R}^{n \times 1}$ is a vector that determines the direct targets of external perturbations
$u(t) \in \mathbb{R}$ (e.g. drugs, overexpression or downregulation of specific genes), which are typically induced during in vitro experiments. Note that the derivative (and therefore the evolution) of $x_i$ at time t is directly influenced by the value $x_j(t)$ iff $A_{ij} \neq 0$. Moreover, the type (i.e. promoting or inhibiting) and extent of this influence can be associated with the sign and magnitude of the element $A_{ij}$, respectively. Thus, if we consider the state variables as quantities associated with the nodes of a network, the matrix A can be considered as a compact numerical representation of the network topology. Therefore, the topological reverse engineering problem can be recast as the problem of identifying the dynamical system (5.2). A possible criticism of this approach could be raised with respect to the use of a linear model, which is certainly inadequate to capture the complex nonlinear dynamics of certain molecular reactions. However, this criticism would be reasonable only if the aim was to identify an accurate model of large changes in the states of a biological system over time, and this is not the case here. If the goal is simply to describe the qualitative functional relationships between the states of the system when the system is subjected to perturbations, then a first-order linear approximation of the dynamics represents a valid choice of model. Indeed, a large number of approaches to network inference and model parameter estimation have recently appeared in the literature which are based on linear dynamical models [5, 8, 9, 11, 12]. In addition to their conceptual simplicity, the popularity of such approaches arises in large part from the existence of many well established and computationally appealing techniques for the analysis and identification of this class of dynamical system.
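Before moving to nonlinear models, the following minimal sketch shows how data generated by a small hypothetical network of the form (5.2) can be used to recover its (discretized) connectivity by ordinary least squares, anticipating the identification methods of Section 5.2; the network, sampling time and noiseless setting are assumptions made purely for illustration.

```python
# Minimal sketch: generate noiseless data from a hypothetical 3-node LTI network (5.2)
# and recover the discrete-time connectivity by least squares.
import numpy as np
from scipy.linalg import expm

A = np.array([[-1.0,  0.0,  0.0],   # A[i, j]: effect of node j on node i
              [ 0.9, -1.2,  0.0],   # node 1 activates node 2
              [ 0.0, -0.7, -0.8]])  # node 2 represses node 3
B = np.array([[1.0], [0.0], [0.0]]) # perturbation acts directly on node 1 only

Ts, h = 0.1, 60                      # sampling time and number of samples
Ad = expm(A * Ts)                    # zero-order-hold discretization
Bd = np.linalg.solve(A, (Ad - np.eye(3)) @ B)

x = np.zeros((3, h + 1))
u = np.ones(h)                       # constant perturbation (e.g. a drug treatment)
for k in range(h):
    x[:, k + 1] = Ad @ x[:, k] + (Bd * u[k]).ravel()

# Least squares estimate of [Ad Bd] from the sampled trajectories
Phi = np.vstack([x[:, :-1], u[None, :]])   # regressors: past states and input
Theta = x[:, 1:] @ np.linalg.pinv(Phi)     # one row of [Ad Bd] per node
print("estimated [Ad | Bd]:\n", np.round(Theta, 3))
```

With noiseless data and a controllable network, the estimate coincides with the true discretized matrices, whose sparsity pattern reflects the network topology.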
5.1.2 Nonlinear models
In the following, we introduce several nonlinear models which have also been used for the purposes of network inference. As discussed above, general nonlinear ODE models quickly become intractable for identification purposes and thus, in order to overcome this limitation, alternative modelling approaches have been devised which exploit the particular dynamical characteristics of biological networks. In general the dynamics of a biomolecular network of m species (x) and r reactions (v) can be described by a system of ODEs in the form

$$ \frac{dx_i}{dt} = \sum_{j=1}^{r} n_{ij} v_j, $$

for $i = 1, \ldots, m$, where $n_{ij}$ is the stoichiometric coefficient of the ith species in the jth reaction ($n_{ij} > 0$ for products and $n_{ij} < 0$ for reactants) and $v_j$ is the rate of the jth reaction. For the sake of simplicity, we assume that the changes of concentrations are only due to reactions (i.e. we neglect the effect of convection or diffusion). We can then define the stoichiometric matrix $N = (n_{ij})$, for $i = 1, \ldots, m$ and $j = 1, \ldots, r$, in which columns correspond to reactions and rows to concentration variations. Therefore, the mathematical description of a biomolecular network can be given in matrix form as

$$ \frac{dx}{dt} = N v. \qquad (5.3) $$

The reaction rate $v_j$ is typically a polynomial (e.g. mass-action kinetics) or rational (e.g. Michaelis–Menten or Hill functions) function of the concentrations of the chemical species taking part in a reaction. A particular case is when these reaction rates are approximated through power-law terms, yielding the so-called S-system models.
5.1.2.1 Polynomial and rational models
The behaviour of a biomolecular network can be described by a system of differential equations obtained from the reaction mechanism by the law of mass action: the rate of an elementary reaction (a reaction that proceeds through only one transition state, that is one mechanistic step) is proportional to the product of the concentrations of the participating molecules. For example, for the following interaction

$$ A + B \xrightarrow{k} C, $$

the rate of change of protein C with respect to time can be described as a polynomial function

$$ \frac{d[C]}{dt} = k[A][B], $$

where [A], [B] and [C] denote the concentrations of the molecules A, B and C, respectively, and k represents the rate constant that depends on reaction conditions such as temperature, pH, solvents, etc. To occur at significant rates, almost all biological processes in the cell need enzymes, proteins that catalyse chemical reactions. Consider the simplest enzymatic reaction, in which there is a reversible association between an enzyme E and a substrate S, yielding an intermediate enzyme–substrate complex C which irreversibly breaks down to form a product P:

$$ S + E \;\underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}}\; C \;\xrightarrow{k_2}\; P + E, \qquad (5.4) $$

where $k_1$, $k_{-1}$ and $k_2$ are the relative reaction constants. By applying the law of mass action we obtain

$$ \frac{d[S]}{dt} = -k_1[E][S] + k_{-1}[C] $$
$$ \frac{d[E]}{dt} = -k_1[E][S] + (k_{-1} + k_2)[C] $$
$$ \frac{d[C]}{dt} = k_1[E][S] - (k_{-1} + k_2)[C] $$
$$ \frac{d[P]}{dt} = k_2[C], $$

where [E], [S], [C] and [P] denote the concentrations of the relative proteins, with the initial conditions $([S], [E], [C], [P]) = ([S_0], [E_0], 0, 0)$ at time $t = 0$. Note that, by summing the second and third expressions, the total amount of free and bound enzyme is invariant over time,

$$ \frac{d[E]}{dt} + \frac{d[C]}{dt} = 0 \;\Rightarrow\; [E](t) + [C](t) = E_0, $$

and thus we obtain the simplified model described by

$$ \frac{d[S]}{dt} = -k_1 E_0 [S] + (k_1[S] + k_{-1})[C] $$
$$ \frac{d[C]}{dt} = k_1 E_0 [S] - (k_1[S] + k_{-1} + k_2)[C], $$

with the initial conditions $([S], [C]) = ([S_0], 0)$ at time $t = 0$. As the formation of the complex C is very fast, it may be considered to be at the equilibrium state ($d[C]/dt = 0$), and we obtain

$$ [C] = \frac{E_0 [S]}{[S] + K_m} \;\Rightarrow\; \frac{d[P]}{dt} = -\frac{d[S]}{dt} = \frac{v_{max} [S]}{[S] + K_m}, $$

where $v_{max} = k_2 E_0$ is the maximum reaction velocity and $K_m = (k_{-1} + k_2)/k_1$ is known as the Michaelis–Menten constant. Clearly this kinetic term is a rational function. This type of rate law exhibits saturation at high substrate concentrations, a well known behaviour of enzymatic reactions. Empirically, for many reactions the rate of product formation follows sigmoidal kinetics with the substrate concentration. In 1910, Hill devised an equation to describe the cooperative binding (i.e. the affinity of a protein for its ligand changes with the amount of ligand already bound) of oxygen to haemoglobin, which for enzyme–substrate reactions takes the form

$$ \frac{d[P]}{dt} = -\frac{d[S]}{dt} = \frac{v_{max} [S]^n}{[S]^n + K_m^n}, $$

where n is the Hill coefficient. From the Hill equation we see that in the absence of cooperativity n = 1; n > 1 is called positive cooperativity and n < 1 negative cooperativity. The Michaelis–Menten reaction (5.4), in the form of the stoichiometric model (5.3), is given by

$$ x = \begin{pmatrix} [E] \\ [S] \\ [ES] \\ [P] \end{pmatrix}, \quad v = \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix}, \quad N = \begin{pmatrix} -1 & 1 & 1 \\ -1 & 1 & 0 \\ 1 & -1 & -1 \\ 0 & 0 & 1 \end{pmatrix}, $$

where $v_1$, $v_2$ and $v_3$ represent the reaction rates of [ES] complex formation, [ES] dissociation and [P] production, respectively. The relation (5.3) hides the underlying chemical network structure that we are trying to identify. Hence, in the following, we introduce the notation used in chemical reaction network theory: N and v are decomposed into the so-called bookkeeping matrix Y, which maps the space of complexes into the space of species, the concentration vector of the different complexes $\Psi(x)$ and the matrix $A_k$, which defines the network structure. For the Michaelis–Menten reaction (5.4), the vector of complexes is given by

$$ \Psi = \begin{pmatrix} [E][S] \\ [ES] \\ [E][P] \end{pmatrix}. $$

The matrix Y is determined in the following way: the elements of the ith row tell us in which complexes species i appears and how often; equivalently, the entries in the jth column tell us how much of each species makes up complex j. Thus, for (5.4),

$$ Y = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}. $$

Matrix K is the transpose of the weighted adjacency matrix of the digraph representing the chemical reaction network; that is, entry $K_{ij}$ is nonnegative and corresponds to the rate constant associated with the reaction from complex j to complex i. The so-called kinetic matrix $A_k$ is given by $A_k = K - \mathrm{diag}(K^T e)$, where $e = (1, \ldots, 1)^T \in \mathbb{R}^n$ and n is the number of complexes. For (5.4),

$$ K = \begin{pmatrix} 0 & k_{-1} & 0 \\ k_1 & 0 & 0 \\ 0 & k_2 & 0 \end{pmatrix}, \qquad A_k = \begin{pmatrix} -k_1 & k_{-1} & 0 \\ k_1 & -(k_{-1} + k_2) & 0 \\ 0 & k_2 & 0 \end{pmatrix}. $$

The set of nonlinear ODEs given by (5.1) can then be rewritten as [13]

$$ \frac{dx}{dt} = Y A_k \Psi(x), \qquad (5.5) $$
where $\ln \Psi(x) = Y^T \ln x$. Then, from experimental data (in addition, Y is often known), the identification problem corresponds to the reconstruction of the network structure given by $A_k$. An example of a mathematical model of a biological pathway, based on the mass-action law for protein interactions and the saturating rate law for transcriptional reactions using a Hill-type function, is presented in [14], where the authors constructed a model of the aryl hydrocarbon receptor (AhR) signal transduction pathway. Rational terms do not necessarily have to arise from the law of mass action, but can also be used as a phenomenological description of some biological events exhibiting a sigmoidal response. For instance, in [15] the authors modelled a gene network, consisting of genes, mRNA and proteins, by the following ODE system:

$$ \frac{d[x_i]}{dt} = m_i \cdot f_i(y) - \lambda_i^{RNA}\, x_i $$
$$ \frac{d[y_i]}{dt} = r_i\, x_i - \lambda_i^{Prot}\, y_i, $$

where $m_i$ is the maximum transcription rate, $r_i$ the translation rate, and $\lambda_i^{RNA}$ and $\lambda_i^{Prot}$ are the mRNA and protein degradation rates, respectively. $f_i(\cdot)$ is the so-called input function of gene i, which determines the relative activation of the gene, modulated by the binding of transcription factors (TFs) to cis-regulatory sites, and is approximated using Hill-type terms.
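As a concrete illustration of this class of model, the sketch below simulates a hypothetical two-gene version of the mRNA/protein system above, in which the protein of gene 1 represses transcription of gene 2 through a Hill-type input function; all parameter values are illustrative assumptions, not those of [15].

```python
# Minimal sketch of the mRNA/protein model with a Hill-type input function
# (hypothetical two-gene system: protein 1 represses transcription of gene 2).
import numpy as np
from scipy.integrate import solve_ivp

m = np.array([1.0, 1.5])          # maximum transcription rates
r = np.array([2.0, 2.0])          # translation rates
lam_rna, lam_prot = 0.5, 0.2      # degradation rates
K, n_hill = 1.0, 2                # Hill threshold and coefficient

def rhs(t, z):
    x, y = z[:2], z[2:]           # x: mRNAs, y: proteins
    f = np.array([1.0,                                         # gene 1: constitutive
                  K**n_hill / (K**n_hill + y[0]**n_hill)])     # gene 2: repressed by protein 1
    dx = m * f - lam_rna * x
    dy = r * x - lam_prot * y
    return np.concatenate([dx, dy])

sol = solve_ivp(rhs, (0, 50), np.zeros(4), dense_output=True)
print("steady state (mRNA1, mRNA2, prot1, prot2):", np.round(sol.y[:, -1], 3))
```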
5.1.2.2 S-systems
Power-law models have been developed as an alternative approach for modelling reactions following nonideal kinetics in various biological systems [16]. The basic concept underlying power-law models is the approximation of classical ODE models by means of a uniform mathematical structure. Michaelis–Menten kinetics of the form

$$ v = v(X) = \frac{v_{max}[X]}{K_m + [X]} $$

can be approximated by power-law functions as $v \approx \alpha [X]^g$. A particular class of power-law models are the S-systems (synergistic systems), where the rate of change of a state variable is equal to the difference of two products of variables raised to noninteger powers:

$$ \frac{d[X_i]}{dt} = \alpha_i \prod_{j=1}^{n} X_j^{g_{i,j}} - \beta_i \prod_{j=1}^{n} X_j^{h_{i,j}}, \qquad (5.6) $$

for $i = 1, \ldots, n$, where the first term represents the net production and the second term the net removal rate for the ith species, $\alpha_i$ and $\beta_i$ are multiplicative parameters called rate constants, and $g_{i,j}$ and $h_{i,j}$ are exponential parameters called kinetic orders for the production and degradation terms, respectively. By changing to a logarithmic scale, the relation (5.6) becomes a linear system that is much more tractable to analyse than the original nonlinear system. However, the generalized aggregation may introduce a loss of accuracy, the model may conceal important structural features of the network, and it is not able to describe many important biochemical effects such as saturation and sigmoidicity [17].
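The tractability gained by the logarithmic change of variables can be seen directly: at steady state, (5.6) reduces to a linear system in the logarithms of the concentrations. The sketch below solves this linear system for a hypothetical two-variable S-system (all rate constants and kinetic orders are made-up illustrative values) and verifies the result against the nonlinear right-hand side.

```python
# Minimal sketch: steady state of a hypothetical two-variable S-system obtained by
# solving a linear system in logarithmic coordinates.
import numpy as np

alpha = np.array([2.0, 1.0])               # production rate constants
beta = np.array([1.0, 1.5])                # degradation rate constants
G = np.array([[0.0, -0.5],                 # kinetic orders of the production terms
              [0.8,  0.0]])
H = np.array([[0.6, 0.0],                  # kinetic orders of the degradation terms
              [0.0, 0.4]])

# At steady state: log(alpha) + G y = log(beta) + H y, with y = log(X)
y = np.linalg.solve(G - H, np.log(beta) - np.log(alpha))
X = np.exp(y)

# Check that the S-system right-hand side vanishes at the computed steady state
rhs = alpha * np.prod(X**G, axis=1) - beta * np.prod(X**H, axis=1)
print("steady state X:", np.round(X, 3), " residual:", np.round(rhs, 10))
```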
5.2 Reconstruction methods based on linear models
The general problem of reverse engineering a biological interaction network from experimental data may be tackled via methods based on dynamical linear systems identification theory. The basic step of the inference process consists of estimating, from experimental measurements (either steady-state or time-series data), the weighted connectivity matrix A and the exogenous perturbation vector B of the in silico network model (5.2). To this end, one may use regression algorithms based on the classical Least Squares Estimator (LSE), extensions of the LSE algorithm such as the Constrained Total Least Squares (CTLS) technique, or efficient convex optimization procedures cast in the form of linear matrix inequalities (LMIs).
5.2.1 Least squares
In this subsection, approaches based on the classical Least Squares (LS) method are presented. The problem tackled consists of identifying the original network topology starting from the experimental measurements of the temporal evolution of each node.
5.2.1.1 Identification of the connectivity matrix by least squares methods
Assuming that h + 1 experimental observations, x(k) ∈ Rn , k = 0, . . . , h, are available, we can recast the problem in the discrete-time domain as
X := x(h) . . . x(1) = , (5.7) where
ˆ B ˆ , = A
:=
x(h − 1) . . . x(0) . u(h − 1) . . . u(0)
Since we are dealing with a linear model, it is possible to separately estimate each row, i, , of the connectivity matrix to be identified. Let Z = T ∈ Rh×(n+1) , Xi = (xi (h), . . . , xi (1))T ∈ Rh and ˆ T ∈ Rn+1 . The problem to be solved in a standard LS setting can then be formulated as follows β= i, Xi = Z · β.
(5.8)
Now, if we assume that the measurements are noisy, relation (5.8) can be written in the following form Xi + Xi = (Z + Z) · β, where
⎛
1vh−1 · · · nvh−1 ⎜ . . . . . ... Z = ⎜ ⎝ . 1v0 · · · nv0
(5.9)
⎞ 0 .. ⎟ h×(n+1) , .⎟ ⎠∈R 0
Xi = (ivh · · · iv1 )T ∈ Rh , and ivj is the ith noise component of vj , for i = 1, . . . , n and for j = 0, . . . , h. Z and Xi are unknown terms caused by the noise in the data. Although the exact values of the correction terms, Z and Xi , are not known, the structure, i.e. how the noise appears in each element, is known. If the unknown terms are ignored,
90 Handbook of Statistical Systems Biology
then the problem is solved by the standard LS method as follows. The optimal estimator for the regression coefficients vector is given by
T Z Z βLS = ZT Xi , which admits the unique solution
−1 T Z Xi βLS = ZT Z if the number of samples is greater or equal than the number of regression coefficients, that is h ≥ n + 1, and the matrix Z has full rank, n + 1. When Z does not have full rank, there are infinitely many LSEs of β, in the form
† βLS = ZT Z ZT Xi ,
†
where ZT Z is a generalized inverse of ZT Z [18]. 5.2.2
Methods based on least squares
In this section, we introduce some methods for reverse-engineering biomolecular networks based on linear models and least squares regression. In this field, important contributions have been produced by the groups of Gardner and Collins, that devised NIR (Network Identification by multiple Regression, [5]) and MNI (Microarray Network Identification, [19]) algorithms, and by the group of di Bernardo, that devised TSNI (Time-Series Network Identification, [20, 21]). A common motif of these approaches is the use of linear ODE models and multivariate regression to identify the targets of exogenous perturbations in a biomolecular network. The NIR algorithm has been developed for application with perturbation experiments on gene regulatory networks. The direct targets of the perturbation are assumed to be known and the method uses only the steady-state gene expression. Under the steady-state assumption [˙x(t) = 0 in (5.2)] the problem to be solved is n
aij xj = −bi u,
(5.10)
j=1
The LS formula is used to compute the network structure, that is the rows ai, of the connectivity matrix, from the gene expression profiles (xj , j = 1, . . . , n) following each perturbation experiment; the genes that are directly affected by the perturbation are expressed through a nonzero element in the B vector. NIR is based on a network sparsity assumption: only k (maximum number of incoming edges per gene) out of the n elements on each row are different from zero. For each possible combination of k out of n weights, the k coefficients for each gene are computed such as to minimize the interpolation error. The maximum number of incoming edges, k, can be varied by the user. An advantage of NIR is that k can be tuned so as to avoid underdetermined problems. Indeed, if one has Ne different (independent) perturbation experiments, the exact solution to the regression problem can be found for k ≤ Ne , at least in the ideal case of zero noise. The MNI algorithm, similarly to NIR, uses steady-state data and is based on relation (5.10), but it does not require a priori knowledge of the specific target gene for each perturbation. The algorithm employs an iterative procedure: first, it predicts the targets of the treatment using a full network model; subsequently, it translates the predicted targets into constraints on the model structure and repeats the model identification to improve the reconstruction. The procedure is iterated until certain convergence criteria are met. The TSNI algorithm uses time-series data, instead of steady-state values, of gene expression following a perturbation. It identifies the gene network (A), as well as the direct targets of the perturbations (B), by
applying the LS to solve the linear equation (5.2). Note that, to solve (5.2), it is necessary to measure the derivative values, which are never available. Numerical estimation of the derivative is not a suitable option, since it is well known to yield considerable amplification of the measurement noise. The solution implemented by TSNI consists in converting (5.2) to the corresponding discrete-time system x(k + 1) = Ad x(k) + Bd u(k), As discussed above (see Section 5.2.1) this problem admits a unique globally optimal solution if h ≥ n + p, where h is the number of data points, n is the number of state variables and p the number of perturbations. To increase the number of data points, after using a cubic smoothing spline filter, a piecewise cubic spline interpolation is performed. Then a Principal Component Analysis (PCA) is applied to the data-set in order to reduce its dimensionality and the problem is solved in the reduced dimension space. In order to compute the continuous-time system’s matrices, A and B, from the corresponding discretized Ad and Bd , respectively, the following bilinear transformation is applied [22]: 2Ad − I Ts A d + I B = (Ad + I)ABd ,
A=
where I ∈ Rn×n is the square identity matrix and Ts is the sampling interval. Bonneau et al. devised another algorithm, named Inferelator [23], which uses regression and variable selection to infer regulatory influences for genes and/or gene clusters from mRNA and/or protein expression levels. Finally, the group of K.-H. Cho have developed a number of algorithms based on dynamical linear systems theory and convex optimization to infer biological regulatory networks from time-series measurements. In [9] the identification procedure leads to a convex optimization problem with regularization [24] in order to achieve a sparse network and to take into account any a priori information on the network structure. In [8] an optimization method was presented which allows the inference of gene regulatory networks from time-series gene expression data taking into account time-delays and noise in the measurement data. 5.2.3
Dealing with noise: CTLS
To write (5.9) in a more compact form, we make the following definitions: C := (Z
Xi ) , Xi ) .
C := ( Z Then the relation (5.9) is written as
C + C
β
= 0.
−1
(5.11)
An extension of the LS algorithm, namely the Total Least Squares (TLS) technique, was developed to solve exactly this problem by finding the correction term C. The TLS problem is then posed as follows [25]: min C2F v,β
s.t.
C + C
β −1
= 0,
(5.12)
where || · ||F is the Frobenius norm defined by ||A||F = tr(AAT ) for a matrix A in which tr(AAT ) is the trace of the matrix, i.e. the sum of the diagonal terms. When the smallest singular value of (Z Xi ) is not repeated, the solution of the TLS problem is given by: −1 Z T Xi , (5.13) βTLS = ZT Z − λ2 I where λ is the smallest singular value of (Z Xi ). The TLS solution has a correction term, λ2 , at the inverse of the matrix, compared with the LSE. This reduces the bias in the solution, which is caused by the noise in C. The TLS solution can be also computed using the singular value decomposition as follows ([26], p. 503): C = U V T , where U V is the singular value decomposition of the matrix C, U ∈ Rh×h and V ∈ R(n+2)×(n+2) are unitary matrices and ∈ Rh×(n+2) is a diagonal matrix with k = min(h, n + 2) non negative singular values, σi , arranged in descending order along its main diagonal, the other entries are zero. The singular values are the positive square roots of the eigenvalues of CT C. Let V = [ν1 · · · νn νn+1 νn+2 ], where νi ∈ Rn+2 is the ith column of the matrix V . Then, the solution is given by βTLS νn+2 , (5.14) =− ν 1 where ν is the last element of νn+2 . Numerically, this is a more robust method than computing the inverse of a matrix. The TLS solution, (5.13) or (5.14), is not optimal when the two noise terms in Z and Xi are correlated, since one of the main assumptions in this method is that the two noise terms are independent of each other. If there is some correlation between them, this knowledge can be used to improve the solution by using the CTLS technique [27]. In the case of the problem in the form of (5.9), the two noise terms are obviously correlated because Z is a function of the noise term from the sampling time k equal to 0 to h − 1 and Xi is a function of the noise term from k equal to 1 to h. To use the structural information in C, first the minimal set of noise is defined as follows: v = (vTh · · · vT0 )T ∈ Rn(h+1)×1 . If v is not white random noise, a whitening process using Cholesky factorization is performed [27]. Here, v is assumed to be white noise and this whitening process is not necessary. Consider each column of C, i.e.
C = C1 · · · Cn Cn+1 Cn+2 , where Ci is the ith column vector of C. More specifically Ci = (ivh−1 · · · iv0 )T ∈ Rh , Cn+1 = 0h×1 ,
i = 1, . . . , n,
Cn+2 = Xi = (ivh · · · iv1 )T ∈ Rh .
(5.15)
Each Ci can be written as Ci = Gi v for i = 1, . . . , n, n + 1, n + 2. To obtain the explicit form for each Gi , we first define the following column vector of all zero elements, but one, the ith element, equal to 1: ei = (0 · · · 0
1
0 . . . 0)T ∈ Rn
i = 1, . . . , n.
For i equal to 1,
T
C1 = 1vh−1 · · · 1v0 = vTh−1 e1 · · · vT0 e1 T
0 n×h = vT = 0h×n (Ih ⊗ e1 )T v. Ih ⊗ e1
Likewise for the ith column of C,
Ci = 0h×n (Ih ⊗ ei )T v
and hence
Gi = 0h×n (Ih ⊗ ei )T
for i = 1, . . . , n . Also, from (5.15) Gn+1 = 0h×n(h+1) ,
Gn+2 = (Ih ⊗ ei )T 0h×n . Since C can be written as
C = G1 v . . . Gn v Gn+1 v Gn+2 v ,
then the TLS problem can be recast as follows [27]: min v2 v,β
s.t. C + G1 v . . . Gn v Gn+1 v Gn+2 v
β = 0. −1
(5.16)
This is called the CTLS problem. With the following definition, Hβ :=
n
ai,j Gj + b1 Gn+1 − Gn+2 =
j=1
n+1
βj Gj − Gn+2 ,
(5.17)
j=1
where βj for j = 1, . . . , n is the jth element of the ith row of A and βn+1 is the ith element of the vector B, (5.16) can be written in the following form: β C + Hβ v = 0. −1 Solving for v, we get
v=
† −Hβ C
β −1
,
†
(5.18)
where Hβ is the pseudoinverse of Hβ . Hence, the original constrained minimisation problem, (5.16), is transformed into an unconstrained minimization problem as follows:
T †T † T β 2 . (5.19) min v = min β −1 C Hβ Hβ C v,β β −1 Now, we introduce two assumptions, which make the formulation simpler.
(1) The number of measurements are always strictly greater than the number of unknowns, i.e. we only consider the overdetermined case, explicitly h + 1 > n + 2, that is h > n + 1. (2) Hβ is full rank. †
Then the pseudoinverse Hβ is given by −1 † Hβ = HβT Hβ HβT and the unconstrained minimization problem can be further simplified as follows: −1
T T β T min β −1 C Hβ Hβ C . β −1
(5.20)
The starting guess for β used in the above optimization problem is simply the value returned by the solution of the standard LS problem. The problem to be solved is to find the values of n(n + 1) parameters of a linear model that yield the best fit to the observations in the LS sense. Hence, as assumed above, if the number of observations are always strictly greater than the number of explanatory variables, that is h > n + 1, then the problem admits an optimal or at least sub-optimal solution. In the other case, h ≤ n + 1, the interpolation problem is undetermined, and thus there exist infinitely many values of the optimization variables that equivalently fit the experimental measurements. In this case, several expedients can be adopted: first, it is possible to exploit clustering techniques to reduce the number of nodes and smoothing techniques to increase the number of samples, in order to satisfy the constraint h > n + 1. Furthermore, adopting a bottom–up reconstruction approach (i.e. starting with a blank network and increasingly adding new edges) may help in overcoming the dimensionality problem: in this case, indeed, the number of edges incident to each node (and therefore the number of explanatory variables) is iteratively increased and can be limited to satisfy the above constraint. Finally, the introduction of sign constraints on the optimization variables, derived from qualitative prior knowledge of the network topology (as described below), results in a significant reduction of the solution space. Note that the methods illustrated above can be extended if p experiments are performed around the same equilibrium point. Then the terms in the relation (5.9) are constructed as follows: Z=
ζ1 · · · ζp Ip ⊗ 11×h
T ,
T
Xi = χi 1 · · · χi p ,
where ⎞ x1 k (h − 1) · · · x1 k (0) ⎜ .. .. ⎟ .. n×h ζk = ⎜ . . . ⎟ ⎠∈R , ⎝ ⎛
xn k (h − 1) · · · xn k (0) is the kth set of experimental data, χi k = (xik (h), . . . , xik (1)) ∈ Rh , for k = 1, . . . , p, Ip ∈ Rp×p is the identity matrix, 1 ∈ R1×h is a vector of ones and ⊗ is the Kronecker product. The unknown β is given by β = aˆ i,
1 p bˆi · · · bˆi
T
∈ Rn+p
The correction terms, Z and Xi , are given by 1 ζ · · · ζ p , Z = 0p×ph where
⎛
1kvh−1 ⎜ ⎜ . ζ k = ⎜ .. ⎝ nkvh−1
Xi = χi 1 · · · χi p ,
⎞ · · · 1kv0 ⎟ . . .. ⎟ . . ⎟ ∈ Rn×h , ⎠ · · · nkv0
χi k = (ikvh · · · ikv1 )T ∈ Rh , for k = 1, . . . , p and 0 ∈ Rp×ph is a matrix of zeros. 5.2.3.1
PACTLS algorithm
In this subsection we describe the PACTLS algorithm, a method devised for the reverse engineering of partially known networks from noisy data. PACTLS uses the CTLS technique to optimally reduce the effects of measurement noise in the data on the reliability of the inference results, while exploiting qualitative prior knowledge about the network interactions with an edge selection heuristic based on mechanisms underpinning scale–free network generation, i.e. network growth and preferential attachment (PA). The algorithm allows prior knowledge about the network topology to be taken into account within the CTLS optimization procedure. Since each element of A can be interpreted as the weight of the edge between two nodes of the network, this goal can be achieved by constraining some of the optimization variables to be zero and others to be strictly positive (or negative), and using a constrained optimization problem solver, e.g. the nonlinear optimisation function fmincon from the MATLAB Optimization Toolbox, to solve (5.20). Similarly, we can impose a sign constraint on the i-th element of the input vector, bi , if we a priori know the qualitative (i.e. promoting or repressing) effect of the perturbation on the ith node. Alternatively, an edge can be easily pruned from the network by setting to zero the corresponding entry in the minimization problem. ˆ and B ˆ are not actually the estimates of A and B in (5.2), Note that, since the system evolution is sampled, A but rather of the corresponding matrices of the discrete-time system obtained through the Zero-Order-Hold (ZOH) discretization method ([28], p. 676) with sampling time Ts from system (5.2), that is x(k + 1) = Ad x(k) + Bd u(k),
(5.21)
where x(k + 1) is a shorthand notation for x(kTs + Ts ), x(k) for x(kTs ), u(k) for u(kTs ), and Ts Ad = eATs , Bd = eAτ dτ B. 0
In general, the sparsity patterns of Ad and Bd differ from those of A and B. However, if the sampling time is suitably small, (A)ij = 0 implies that (Ad )ij exhibits a very low value, compared with the other elements on the same row and column, and the same applies for Bd and B (see Section 5.2.5 for a detailed discussion). Therefore, in order to reconstruct the original sparsity pattern of the continuous-time system’s matrices, one can set to zero the elements of the estimated matrices whose values are below a certain threshold; this is the basic principle underpinning the edges selection strategy, as described next.
So far we have described a method to add/remove edges and to introduce constraints on the sign of the associated weights in the optimization problem. The problem remains of how to devise an effective strategy to select the nonzero entries of the connectivity matrix. The initialization network for the devised algorithm has only self-loops on every node, which means that the evolution of the i-th state variable is always influenced by its current value. This yields a diagonal ˆ (0) . Subsequently, new edges are added step-by-step to the network according to the initialization matrix, A following iterative procedure: ¯ is computed by solving (5.20) for each row, without setting any optimization variable (1) A first matrix, A, to zero. The available prior information is taken into account at this point by adding the proper sign constraints on the corresponding entries of A before solving the optimization problem, as explained in ¯ is not representative of the previous subsection. Since it typically exhibits all nonzero entries, matrix A the network topology, but is rather used to weight the relative influence of each entry on the system’s dynamics. This information will be used to select the edges to be added to the network at each step. Each ¯ is normalized with respect to the values of the other elements in the same row and column, element of A ˜ whose elements are defined as which yields the matrix A, ˜ ij =
A
¯ ij A . ¯ ,j · A ¯ i, 1/2 A
˜ (k) is computed, (2) At the kth iteration, the edges ranking matrix G (k)
˜ (k) = G ij
˜ ij |p |A j , n (k) ˜ p |Ail | l=1
(5.22)
l
where (k)
(k)
pj =
Kj n (k) Kl
(5.23)
l=1
(k)
is the probability of inserting a new edge starting from node j and Kl is the number of outgoing ˜ (k) are connections from the lth node at the kth iteration. The μ(k) edges with the largest scores in G selected and added to the network; μ(·) is chosen as a decreasing function of k, that is μ(k) = n/k. Thus, the network grows rapidly at the beginning and is subsequently refined by adding smaller numbers of nodes at each iteration. The form of the function p(·) stems from the so–called PA mechanism, which states that in a growing network new edges preferentially start from popular nodes (those with the highest connectivity degree, i.e. the hubs). By exploiting the mechanisms of network growth and PA, we are able to guide the network reconstruction algorithm to increase the probability of producing a network with a small number of hubs and many poorly connected nodes. Note also that, for each edge, the probability of incidence is blended with the edge’s weight estimated at point (1); therefore, the edges with larger estimated weights have a higher chance to be selected. This ensures that the interactions exerting greater influence on the network dynamics have a higher probability of being selected. ˆ (k) is defined by adding the entries selected at point (2) to those (3) The structure of nonzero elements of A selected up to iteration k − 1 (including those derived by a priori information), and the set of inequality
constraints is updated accordingly; then (5.20), with the additional constraints, is solved for each row to compute Â(k).
(4) The residuals generated by the identified model are compared with the values obtained at the previous iterations; if the norm of the vector of residuals has decreased, in the last two iterations, at least by a factor r with respect to the value at the first iteration, then the procedure iterates from point (2), otherwise it stops and returns the topology described by the sparsity pattern of Â(k−2). The factor r is inversely correlated with the number of edges inferred by the algorithm; on the other hand, using a smaller value of r raises the probability of obtaining false positives. By conducting numerical tests for different values of r, we have found that setting r = 0.1 yields a good balance between the various performance indices.

Concerning the input vector, we assume that the perturbation targets and the qualitative effects of the perturbation are known; thus the pattern (but not the values of the nonzero elements) of B̂ is preassigned at the initial step and the corresponding constraints are imposed in all the subsequent iterations.
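A minimal sketch of the ranking/selection step (2) is given below (Python/NumPy). The function and variable names, the handling of already-selected edges, and the rounding of μ(k) = n/k are assumptions made for illustration rather than the published PACTLS code:

```python
import numpy as np

def rank_and_select_edges(A_tilde, out_degree, k, existing_mask):
    """One growth step of the edge-selection strategy (sketch):
    rank candidate edges by |A_tilde[i, j]| blended with a preferential-
    attachment probability p_j, then pick the mu(k) = n/k best new edges."""
    n = A_tilde.shape[0]
    p = out_degree / out_degree.sum()                 # PA probability per source node
    weights = np.abs(A_tilde) * p[np.newaxis, :]      # numerator of the ranking score
    G = weights / (weights.sum(axis=1, keepdims=True) + 1e-12)  # row-normalized scores
    G[existing_mask] = -np.inf                        # do not re-select existing edges
    mu = max(1, n // k)                               # decreasing number of edges per step
    best = np.argsort(G, axis=None)[::-1][:mu]
    return [np.unravel_index(idx, G.shape) for idx in best]
```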
5.2.4 Convex optimization methods
In this section, two methods for identifying a linear network model by means of a convex optimization procedure cast in the form of LMIs are illustrated [10, 29]. The distinctive feature of these methods is that they easily enable exploitation of any qualitative prior knowledge which may be available about the underlying biology, thus significantly increasing the inference performance. The first algorithm is based on an iterative procedure which starts from a fully connected network; the edges are subsequently pruned according to a maximum parsimony criterion. The pruning algorithm terminates when the estimation error exceeds an assigned threshold. The second algorithm, named CORE-Net (Convex Optimisation algorithm for Reverse Engineering biological interaction Networks), is also based on the mechanisms of growth and PA, as in the algorithm described in Section 5.2.3.1. A similar approach based on convex optimization was devised by Julius et al. [30], where a method for identifying genetic regulatory networks using expression profiles from genetic perturbation experiments is presented. A stability constraint is added to the convex optimization problem and the l1-norm is used in the cost function in order to obtain a sparse connectivity matrix.
5.2.4.1 Identification of the connectivity matrix via LMI-based optimization
Assuming that h + 1 experimental observations, x(k) ∈ R^n, k = 0, . . . , h, are available, we can recast the problem in the discrete-time domain as shown in (5.7). Our aim is to reconstruct the matrix A and vector B from the experimental values x(k), k = 0, . . . , h. Writing the sampled-data model in the compact form Ψ = ΞΦ, where Ξ collects the unknown discrete-time system matrices and Φ, Ψ are the data matrices associated with the measurements, the identification problem can be transformed into that of minimizing the norm of Ψ − ΞΦ, and thus we can state the following problem.

Problem 5.1 Given the sampled data set x(k), k = 0, . . . , h, and the associated matrices Φ, Ψ, find

\min \varepsilon \quad \text{s.t.} \quad (\Psi - \Xi\Phi)^T (\Psi - \Xi\Phi) < \varepsilon I.     (5.24)

Note that condition (5.24) is quadratic in the unknown matrix variable Ξ. In order to obtain a linear optimization problem we convert it to the equivalent condition

\begin{pmatrix} -\varepsilon I & (\Psi - \Xi\Phi)^T \\ (\Psi - \Xi\Phi) & -I \end{pmatrix} < 0.     (5.25)
The equivalence between (5.24) and (5.25) is readily derived by applying the following lemma.

Lemma 5.1 Let M ∈ R^{n×n} be a square symmetric matrix partitioned as

M = \begin{pmatrix} M_{11} & M_{12} \\ M_{12}^T & M_{22} \end{pmatrix},     (5.26)

assume that M_{22} is nonsingular and define the Schur complement of M_{22}, Δ := M_{11} − M_{12} M_{22}^{-1} M_{12}^T. The following statements are equivalent:

(i) M is positive (negative) definite;
(ii) M_{22} and Δ are both positive (negative) definite.

Proof. Recall that M is positive (negative) definite if ∀x ∈ R^n, x^T M x > 0 (< 0); moreover it can be decomposed as ([31], p. 14)

M = \begin{pmatrix} M_{11} & M_{12} \\ M_{12}^T & M_{22} \end{pmatrix} = \begin{pmatrix} I & M_{12} M_{22}^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} Δ & 0 \\ 0 & M_{22} \end{pmatrix} \begin{pmatrix} I & M_{12} M_{22}^{-1} \\ 0 & I \end{pmatrix}^T.

The latter is a congruence transformation ([32], p. 568), which does not modify the sign definiteness of the transformed matrix; indeed, for any nonsingular C ∈ R^{n×n} and any P ∈ R^{n×n},

P positive (negative) definite ⇒ x^T C^T P C x = z^T P z > 0 (< 0) for all x ∈ R^n, where z = Cx.

Therefore M is positive (negative) definite if and only if M_{22} and Δ are both positive (negative) definite.
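The statement of Lemma 5.1 is easy to check numerically; the short fragment below (an illustration added here, not part of the original text) verifies it for a random positive definite matrix:

```python
import numpy as np

# Quick numerical illustration of Lemma 5.1: for a symmetric positive definite M,
# the block M22 and the Schur complement of M22 are positive definite as well.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6))
M = X @ X.T + 6 * np.eye(6)                       # symmetric positive definite test matrix
M11, M12, M22 = M[:3, :3], M[:3, 3:], M[3:, 3:]
delta = M11 - M12 @ np.linalg.inv(M22) @ M12.T    # Schur complement of M22

print(np.all(np.linalg.eigvalsh(M) > 0))          # True
print(np.all(np.linalg.eigvalsh(M22) > 0))        # True
print(np.all(np.linalg.eigvalsh(delta) > 0))      # True
```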
Problem 5.1 with the inequality constraint in the form (5.25) is a generalized eigenvalue problem ([33], p. 10), and can be easily solved using efficient numerical algorithms, such as those implemented in the Matlab LMI Toolbox [34]. A noteworthy advantage of the proposed convex optimization formulation is that the approach can be straightforwardly extended to the case when multiple experimental datasets are available for the same biological network. In this case, there are several matrix pairs (Φ_k, Ψ_k), one for each experiment: the problem can be formulated again as in (5.24), but using a number of constraints equal to the number of experiments, that is

\min \varepsilon \quad \text{s.t.} \quad (\Psi_k - \Xi\Phi_k)^T (\Psi_k - \Xi\Phi_k) < \varepsilon_k I, \qquad k = 1, \ldots, N_e,

where N_e is the number of available experiments. As discussed in the previous section, the devised technique is based on the assumption that the sparsity pattern of the dynamical matrix and input matrix of system (5.2) can be recovered through the estimation of the corresponding matrices of the associated sampled-data discrete-time system (5.21) (see Section 5.2.5 for a detailed discussion).
Except for the LMI formulation, the problem is identical to the one tackled by classical linear regression, i.e. finding the values of the n(n + 1) parameters of a linear model that yield the best fit of the observations in the LS sense. Hence, if the number of observations, n(h + 1), is greater than or equal to the number of explanatory variables, that is h ≥ n, the problem admits a unique globally optimal solution. In the other case, h < n, the interpolation problem is underdetermined, thus there exist infinitely many values of the optimization variables that equivalently fit the experimental measurements. In the latter case, as discussed previously in Section 5.2.3, several expedients (clustering techniques, bottom-up reconstruction approach, prior knowledge, etc.) can be adopted to resolve the underdetermination. The key advantage of the LMI formalism is that it makes it possible to take into account prior knowledge about the network topology by forcing some of the optimization variables to be zero and others to be strictly positive (or negative), by adding the inequality Aij > 0 (< 0) to the set of LMIs. Similarly, we can impose a sign constraint on the ith element of the input vector, bi, if we a priori know the qualitative (i.e. promoting or repressing) effect of the perturbation on the ith node. Also, an edge can be easily pruned from the network by setting to zero the corresponding entry in the matrix optimization variable in the LMIs.
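A hedged sketch of how such a constrained identification can be posed in a modern convex-optimization library is given below. It uses Python/cvxpy rather than the Matlab LMI Toolbox cited in the text; the names Phi, Psi and Xi, the small positive bound replacing the strict inequality, and the choice of solver are assumptions for illustration. Minimizing the largest singular value of the residual is equivalent to minimizing ε in (5.24), and the solver internally represents this objective through a Schur-complement LMI of the form (5.25):

```python
import cvxpy as cp

def identify_with_priors(Phi, Psi, zero_entries=(), positive_entries=()):
    """Sketch of Problem 5.1 with prior-knowledge constraints on entries of
    the unknown parameter matrix Xi (assumed data layout: Psi ~ Xi @ Phi)."""
    n = Psi.shape[0]
    m = Phi.shape[0]
    Xi = cp.Variable((n, m))                     # unknown discrete-time parameter matrix

    constraints = []
    for (i, j) in zero_entries:                  # a priori absent interactions
        constraints.append(Xi[i, j] == 0)
    for (i, j) in positive_entries:              # a priori promoting interactions
        constraints.append(Xi[i, j] >= 1e-6)     # strict '>' relaxed for the solver

    # sigma_max(Psi - Xi Phi) is the square root of the optimal eps in (5.24)
    prob = cp.Problem(cp.Minimize(cp.sigma_max(Psi - Xi @ Phi)), constraints)
    prob.solve(solver=cp.SCS)
    return Xi.value
```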
5.2.4.2 Top-down approach
The first algorithm based on an LMI approach consists of an iterative procedure that starts from a fully connected network. Subsequently, the edges are pruned according to a maximum parsimony criterion. The pruning algorithm terminates when the estimation error of the identified model with respect to the experimental data set exceeds an assigned threshold. The following basic idea underpins the pruning algorithm: given the identified (normalized) connectivity matrix, the edges connecting nonadjacent (in the original network) nodes have lower weights than the others. This is a straightforward mathematical translation of the reasonable assumption that indirect interactions are weaker than direct ones. In addition to the loose connectivity assumption, our algorithm is also capable of directly exploiting information about some specific interactions that are a priori known, taking into account both the direction of the influence and its type (promoting or repressing). The reconstruction algorithm is structured as follows:

(1) A first system is identified by solving Problem 5.1, and adding all the known sign constraints.
(2) Let Â(k) = {Â_ij}_{i,j=1,...,n} be the matrix computed at the kth step; in order to compare the values of the identified coefficients the matrix has to be normalized, so let us define the normalized matrix Ã(k) = {Ã_ij}_{i,j=1,...,n}, where each element

\tilde{A}_{ij} = \frac{\hat{A}_{ij}}{\left( \|\hat{A}_{\cdot,j}\| \cdot \|\hat{A}_{i,\cdot}\| \right)^{1/2}}

is obtained by dividing the original value by the norms of its row, Â_{i,·}, and column, Â_{·,j}.
(3) The normalized matrix is analysed to choose the coefficients to be nullified at the next identification step; various rules can be adopted at this step in order to define the threshold below which a given coefficient is nullified. Good performance has been obtained by setting, for each coefficient, two thresholds proportional to the mean values of its row and column. Then the element is nullified only if its absolute value is lower than the minimum of the two thresholds. This rule reflects the idea that an arc is a good candidate for elimination if its weight is low compared with the other arcs arriving to and starting from the same node.
(4) After choosing the coefficients to be nullified, a new LMI problem is cast, eliminating the corresponding optimization variables, and a new solution is computed.
(5) The evolution of the identified system is compared with the experimental data: if the estimation error exceeds a prefixed threshold then the algorithm stops, otherwise another iteration starts from point (2).

The algorithm requires tuning two optimization parameters: (i) the threshold value used in the pruning phase, which affects the number of coefficients eliminated at each step; (ii) the upper bound defining the admissible estimation error, which determines the algorithm termination. The first parameter influences the connectivity of the final reconstructed network: the greater its value, the lower the number of connections; thus it is termed the specificity parameter. The algorithm terminates when either it does not find new arcs to remove or the estimation error becomes too large. Concerning the input vector, in this phase we do not take into account the effects of external perturbations; then B̂ = 0 and the linear systems that represent the in-silico networks are not subject to exogenous inputs.
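A compact sketch of this pruning loop is given below. The routine `identify(Phi, Psi, zero_entries)` is assumed to return the estimated parameter matrix (for instance, the cvxpy sketch shown earlier); the use of plain row/column means as thresholds and the stopping test on the residual norm are simplifying assumptions, not the authors' exact implementation:

```python
import numpy as np

def topdown_prune(identify, Phi, Psi, max_error, n_iter=50):
    """Sketch of the top-down pruning loop: identify, normalize, nullify weak
    entries, re-identify, and stop when the estimation error grows too large."""
    zeros = set()
    A = identify(Phi, Psi, zero_entries=zeros)
    for _ in range(n_iter):
        # normalize each entry by the geometric mean of its row and column norms
        scale = np.sqrt(np.outer(np.linalg.norm(A, axis=1), np.linalg.norm(A, axis=0)))
        A_norm = np.abs(A) / np.where(scale > 0, scale, 1.0)
        # candidate arcs: below both the row-mean and the column-mean threshold
        row_thr = A_norm.mean(axis=1, keepdims=True)
        col_thr = A_norm.mean(axis=0, keepdims=True)
        weak = (A_norm < np.minimum(row_thr, col_thr)) & (A != 0)
        new_zeros = {(int(i), int(j)) for i, j in zip(*np.nonzero(weak))}
        if not new_zeros - zeros:
            break                                    # no new arcs to remove
        zeros |= new_zeros
        A_new = identify(Phi, Psi, zero_entries=zeros)
        if np.linalg.norm(Psi - A_new @ Phi) > max_error:
            break                                    # estimation error too large
        A = A_new
    return A
```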
5.2.4.3 Bottom-up approach – CORE-Net algorithm
The CORE-Net algorithm, based on the convex optimization problem cast in the LMI form as shown above, is specifically suited to infer gene regulatory networks (GRNs) exhibiting a scale-free topology. The procedure adopts a bottom-up reconstruction approach that allows the iterative increment of the number of edges, exploiting the growth and PA mechanisms, as for the PACTLS algorithm illustrated in Section 5.2.3. A key point underlying CORE-Net (as for PACTLS) is the experimental observation that metabolic [35], protein–protein interaction [36] and transcriptional networks [37], as well as many other genomic properties [38], typically have a small number of hubs and many poorly connected nodes. A plausible hypothesis for the emergence of such a feature, as shown in [39], is the PA mechanism during network growth and evolution, where, when a new node is added to the network, it is more likely to be connected with one of the few hubs than with one of the many other loosely connected nodes. In large networks, this evolution rule may generate particular degree distributions, e.g. the well-known power-law distribution of scale-free networks. The technique devised implements the PA mechanism within the reconstruction process, therefore mimicking the evolution of a biological network to improve the inference performance. Note also that in the case of partially known interaction networks, it is very likely that the available prior knowledge about the connections of the most widely studied genes/proteins in the literature will often coincide with the hubs of the network, and in such cases we can expect that the PA mechanism will provide significant advantages in the network reconstruction process. It is worth noting that, differently from the top-down approach, where the algorithm starts from a fully connected network and the edges are subsequently pruned according to a maximum parsimony criterion, CORE-Net iteratively increments the number of edges, thus helping to overcome the underdetermination problem, as discussed previously, and to decrease the computational burden.
5.2.5 Sparsity pattern of the discrete-time model
The techniques described in the previous sections are based on the assumption that the sparsity pattern of the dynamical matrix and input matrix of system (5.2) can be recovered through the estimation of the corresponding matrices of the associated sampled-data discrete-time system (5.21). Here we want to validate this hypothesis, by analysing the relationship between the dynamical matrices of the continuous-time and discrete-time systems. For the sake of simplicity, in what follows we will assume that A has n distinct real negative eigenvalues, λ_i, with |λ_i| < |λ_{i+1}|, i = 1, . . . , n − 1 (the case of nondiagonalizable matrices is beyond the scope of the present work and will not be treated here); it is therefore possible to find a nonsingular matrix P such that A = PDP^{-1}, with D = diag(λ_1, . . . , λ_n). Then, the matrix A_d can be rewritten as ([32], p. 525)

A_d = I + A T_s + \frac{(A T_s)^2}{2!} + \frac{(A T_s)^3}{3!} + \cdots = P\, \mathrm{diag}\left(e^{\lambda_1 T_s}, \ldots, e^{\lambda_n T_s}\right) P^{-1}.     (5.27)

If the sampling time is properly chosen, such as to capture all the dynamics of the system, then T_s ≪ τ_i := 1/|λ_i|, i = 1, . . . , n, which implies |λ_i| T_s ≪ 1. Therefore the following approximation holds

e^{\lambda_i T_s} = \sum_{k=0}^{\infty} \frac{(\lambda_i T_s)^k}{k!} \approx 1 + \lambda_i T_s.

From this approximation and (5.27), we obtain A_d ≈ I + A T_s. As for the input matrix B, the following approximation holds

B_d = A^{-1}\left(e^{A T_s} - I\right) B \approx A^{-1} (A T_s) B = B T_s.

Note that the sparsity patterns of I + A T_s and B T_s are identical to those of A and B, respectively; only the diagonal entries of A are different, but these are always assumed to be free optimization parameters in our algorithm. What can be concluded from the previous calculations is that, in general, (A)_{ij} = 0 does not imply (A_d)_{ij} = 0; however, one can reasonably expect (A_d)_{ij} to be much lower than the other elements on the ith row and jth column, provided that T_s is much smaller than the characteristic time constants of the system dynamics (the same applies for B and B_d). Such considerations can be readily verified by means of numerical tests. The algorithms presented in this section are based on these arguments; indeed each algorithm chooses at each step only the largest elements of the (normalized) estimated A_d and B_d matrices, and it is therefore expected to disregard the entries corresponding to zeros in the continuous-time matrices.
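The following short numerical test (an illustration added here, with an arbitrary example matrix) shows the behaviour described above: for a small sampling time, the exact discretization stays close to I + A·T_s, and the entries of A_d lying on the zero pattern of A remain much smaller than those corresponding to genuine interactions:

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[-2.0,  0.0,  0.8,  0.0],
              [ 0.5, -1.5,  0.0,  0.0],
              [ 0.0,  0.0, -3.0,  0.7],
              [ 0.0, -0.6,  0.0, -2.5]])
Ts = 0.01
Ad = expm(A * Ts)

print(np.abs(Ad - (np.eye(4) + A * Ts)).max())               # O(Ts^2) approximation error
off_diag = ~np.eye(4, dtype=bool)
print(np.abs(Ad[(A == 0) & off_diag]).max())                 # entries on the zero pattern of A
print(np.abs(Ad[(A != 0) & off_diag]).min())                 # much larger: genuine interactions
```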
5.2.6 Application examples
In this section we provide some results of the PACTLS, CORE-Net [10] and TSNI [21] algorithms. In particular we show the performance obtained by these approaches for inferring gene regulatory networks from experimental gene expression data. PACTLS and CORE-Net have been used to reconstruct a cell cycle regulatory subnetwork in Saccharomyces cerevisiae from experimental microarray data. We considered the model proposed by [40] for transcriptional regulation of cyclin and cyclin/CDK regulators and the model proposed by [41], where the main regulatory circuits that drive the gene expression program during the budding yeast cell cycle are considered. The network is composed of 27 genes: 10 genes that encode for transcription factor proteins (ace2, fkh1, swi4, swi5, mbp1, swi6, mcm1, fkh2, ndd1, yox1) and 17 genes that encode for cyclin and cyclin/CDK regulatory proteins (cln1, cln2, cln3, cdc20, clb1, clb2, clb4, clb5, clb6, sic1, far1, spo12, apc1, tem1, gin4, swe1 and whi5). The microarray data have been taken from [42], selecting the data set produced by the alpha factor arrest method. Thus, the raw data set consists of 27 genes and 18 data points. A smoothing algorithm has been applied in order to filter the measurement noise and to increase by interpolation the number of observations. The gold standard regulatory network comprising the chosen 27 genes has been drawn from the BioGRID database [43], taking into account the information of [40] and [41]: the network consists of 119 interactions, not including the self-loops, yielding a value of the sparsity coefficient, defined by η = 1 − #edges/(n² − n), equal to 0.87.
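The text does not specify which smoothing algorithm was used; the fragment below is therefore only a generic illustration of the preprocessing step (a smoothing spline per gene, resampled on a finer grid, with the smoothing factor chosen arbitrarily):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def smooth_and_resample(t, y, n_points=100, smoothing=0.5):
    """Generic sketch of the preprocessing step described in the text:
    fit a smoothing spline to each gene's expression profile to filter
    measurement noise, then resample it on a finer time grid."""
    t_fine = np.linspace(t.min(), t.max(), n_points)
    smoothed = np.empty((y.shape[0], n_points))
    for g in range(y.shape[0]):                      # one profile per gene
        spline = UnivariateSpline(t, y[g], s=smoothing)
        smoothed[g] = spline(t_fine)
    return t_fine, smoothed
```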
Figure 5.1 Results for the cell cycle regulatory subnetwork of S. cerevisiae assuming different levels of prior knowledge (PK = 10, 20, 30 and 40%): PPV versus Sn curves for PACTLS, BANJO and a random edge-selection baseline
Figure 5.1 shows the results obtained by PACTLS assuming four different levels of prior knowledge (PK) from 10 to 40% of the network. The performance is evaluated by using two common statistical indices (see [44], p. 138):

• Sensitivity (Sn), defined as

Sn = TP / (TP + FN),

which is the fraction of actually existing interactions (TP, true positives; FN, false negatives) that the algorithm infers, also termed Recall; and
• Positive Predictive Value (PPV), defined as

PPV = TP / (TP + FP),

which measures the reliability of the interactions (FP, false positives) inferred by the algorithm, also named Precision.

To compute these performance indices, the weight of an edge is not considered, but only its existence, so the network is considered as a directed graph. The performance of PACTLS is compared with one of the most popular statistical methods for network inference, dynamic Bayesian networks. For these purposes we used the software BANJO (BAyesian Network inference with Java Objects), a tool developed by Yu et al. [45], that performs network structure inference for static and dynamic Bayesian networks (DBNs). The performance of both approaches is compared in Figure 5.1. In order to further validate the inference capability of the algorithms, Figure 5.1 also shows the results obtained by a random selection of the edges, based on a binomial distribution: given any ordered pair of nodes, the existence of a directed edge between them is assumed true with probability pr and false with probability 1 − pr. By varying the parameter pr in [0, 1], the random inference algorithm produces the results shown as the solid curves on the (PPV, Sn) plot in Figure 5.1.
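The two indices are straightforward to compute from a pair of boolean adjacency matrices; the sketch below (an illustration, with the exclusion of self-loops assumed because the gold standard above does not include them) makes the definitions concrete:

```python
import numpy as np

def sn_ppv(inferred, gold):
    """Sensitivity (recall) and positive predictive value (precision) for two
    boolean adjacency matrices of a directed graph, ignoring self-loops."""
    off = ~np.eye(gold.shape[0], dtype=bool)
    tp = np.sum(inferred & gold & off)
    fn = np.sum(~inferred & gold & off)
    fp = np.sum(inferred & ~gold & off)
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return sn, ppv
```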
The performance of PACTLS is consistently and significantly better than that of the method based on DBNs: the distance of the PACTLS results from the random curve is almost always larger than that obtained with the BANJO software, which is not able to achieve significant Sn levels, probably due to the low number of time points available. Moreover, the results show that the performance of PACTLS improves progressively when the level of prior knowledge increases. Figure 5.2 shows the regulatory subnetwork inferred by CORE-Net, assuming 50% of the edges are a priori known. Seven functional interactions, which are present in the gold standard network, have been correctly inferred. Moreover, seven other functional interactions have been returned, which are not present in the gold standard network. To understand whether the latter should be classified as TP or FP, the authors have manually mined the literature and the biological databases, and uncovered the following results:
Figure 5.2 Gene regulatory subnetwork of S. cerevisiae inferred by CORE-Net with 50% of the edges a priori known (thin solid edges). Results according to the gold-standard network drawn from the BioGRID database: TP, thick solid edge; FN, dotted edge; FP, thick dashed edge. The thick green dashed edges are not present in the BioGRID database, however they can be classified as TP according to other sources. The FP thick orange dashed edges are indirect interactions mediated by ndd1. No information has been found regarding the interactions denoted by the thick red dash-dot edges
• The interaction between mbp1 and gin4 is reported by the YEASTRACT database [46]: mbp1 is reported to be a transcription factor for gin4.
• A possible interaction between fkh2 and swi6 is also reported by the YEASTRACT database: fkh2 is reported to be a potential transcription factor for swi6.
• The interaction between clb1 and swi5 appears in Figure 1 in [41], where the scheme of the main regulatory circuits of the budding yeast cell cycle is described.
Thus, these three interactions can be classified as TP as well and are reported as green dashed edges in Figure 5.2. Concerning the other inferred interactions, two of them can be explained by the indirect influence of swi6 on fkh1 and fkh2, which is mediated by ndd1: in fact, the complexes SBF (Swi4p/Swi6p) and MBF (Mbp1p/Swi6p) both regulate ndd1 [40], which can have a physical and genetic interaction with fkh2. Moreover, fkh1 and fkh2 are forkhead family transcription factors which positively influence the expression of each other. Thus, the inferred interactions are not actually between adjacent nodes of the network and have to be formally classified as FP (these are reported as orange dashed edges in Figure 5.2). Concerning the last two interactions, that is clb2→apc1 and mcm1→tem1, since we have not found any information on them in the literature, in the absence of further experimental evidence they have to be classified as FP (reported as red dash-dot edges in Figure 5.2). The results obtained in this example confirm that the exploitation of prior knowledge together with the PA mechanism significantly improves the inference performance.

The TSNI algorithm has been tested for inferring a nine-gene subnetwork of the DNA-damage response pathway (SOS pathway) in the bacterium Escherichia coli. The network is composed of nine genes: lexA, SsB, dinI, umuDC, rpoD, rpoH, rpoS, recF, recA. The gene expression profiles for all nine genes were obtained by computing the average of the three replicates for each time point following treatment with norfloxacin, a known antibiotic that acts by damaging the DNA. In order to assess the performance of TSNI, the inferred network is compared with the one identified in [5] by NIR and with the one obtained from the interactions known in the literature among these nine genes (43 interactions, apart from the self-feedback). In Figure 5.3 the results obtained by the TSNI algorithm for the E. coli time-series data are shown by plotting the average of r_nz (the ratio between the identified correct nonzero coefficients with correct sign and the true number of nonzero coefficients) versus r_z (the ratio between the identified correct zero coefficients and the true number of zero coefficients), obtained by comparing the predicted network of TSNI with the network obtained from the literature. The cross on the plot shows the values of r_nz and r_z obtained by comparing the network predicted by NIR in [5] with the network from the literature (NIR predicted 22 connections). When the information that there should be five connections for each gene is used, as in [5], and thus four elements in each row of the A matrix to be identified are set to zero, then TSNI finds 20 connections correctly (corresponding to the diamond in Figure 5.3). The results of TSNI are similar to those of NIR, even if for TSNI only a single perturbation experiment and five time points are used, whereas for NIR nine different perturbation experiments are used and the matrix B is assumed to be known.
5.3 Reconstruction methods based on nonlinear models
Many models for describing biological networks are nonlinear ODE systems, involving polynomial or rational functions or power-law approximations. This section deals with the reverse engineering approaches which are available for these types of nonlinear models.
Figure 5.3 Results for the SOS pathway example obtained by TSNI and compared with NIR
5.3.1 Approaches based on polynomial and rational models
In the following the method developed by August and Papachristodoulou [13] is illustrated. The procedure is applied for reverse engineering chemical reaction networks with mass action kinetics (polynomial functions) and gene regulatory networks with dynamics described by Hill functions (rational) from time-series data. 5.3.1.1
Chemical reaction networks
We consider the dynamical systems described as x˙ (t) = Af (x),
(5.28)
where x(t) = (x1 (t), . . . , xn (t))T ∈ Rn , the state variables xi , i = 1, . . . , n, represent the levels of the different species present in the system, A ∈ Rn×m is the unknown matrix representing the connectivity structure of the different species and f:∈ Rn →∈ Rm is the vector of known functions which satisfy appropriate smoothness conditions to ensure local existence and uniqueness of solutions. Therefore, the relation (5.28) is linear in the unknown parameters and the main objective is to devise a procedure for reconstruction of the topology of the network represented by the matrix A, from the given experimental data. The chemical reaction network model, described by the relation (5.5), is equivalent to the dynamical system in the form (5.28), with A = YAk and f (x) = (x). Note that the unknown parameters, which determine the network structure, are in A.
106
Handbook of Statistical Systems Biology
For reverse engineering the matrix A, we consider the corresponding discrete-time system of (5.28) by using the Euler discretization: x(tk+1 ) = x(tk ) + (tk+1 − tk )Af (x(tk )).
(5.29)
This problem can be solved in a standard LS setting as shown in Section 5.2.1, but in the presence of extra constraints on the entries of A it does not have a closed-form solution; indeed, it is necessary to take into account the problem of noise in the data and the sparseness of the network, features that cannot be obtained by solving a LS problem. The problem can, however, be formulated as a Linear Program (LP), a convex optimization problem for which efficient algorithms are available that can treat large data sets efficiently and also deal with uncertainties in data or model parameters: min vec(A) 1 s.t.
− μ− x(tk+1 ) + xˆ (tk ) + (tk+1 − tk )Af (ˆx(tk )) ≤ μ+ k ≤ −ˆ k, μ+ k
≥
0, μ− k
(5.30)
≥ 0, ∀k, k = 1, . . . , p − 1,
− where vec(A) ∈ Rnm is a vector containing the entries of A, xˆ is the set of measurements, μ+ k and μk are scalars which are as small as possible for all k to ensure that the data are in close Euler-fit with the model, thus making the approximation error as small as possible. Minimizing the l1 -norm then allows the computation of a sparse connectivity matrix [24, 30, 47]. In addition, by modelling the uncertainty in the measurements as
x˜ (tk ) − (k) ≤ xˆ (tk ) ≤ x˜ (tk ) + (k), f˜ (tk ) − δ(k) ≤ fˆ (ˆx(tk )) ≤ f˜ (tk ) + δ(k), (k), δ(k), x˜ (tk ), f˜ (tk ) ≥ 0, ∀k, Aij ≥ 0, then a robust formulation of the LP (5.30) is given as min vec(A) 1 s.t.
− μ− x(tk+1 ) − (k + 1) + x˜ (tk ) − (k) + (tk+1 − tk )A(f˜ (tk ) − δ(k)), k ≤ −˜ − x˜ (tk+1 ) + (k + 1) + x˜ (tk ) + (k) + (tk+1 − tk )A(f˜ (tk ) + δ(k)) ≤ μ+ , k
− (k), δ(k), x˜ (tk ), f˜ (tk ), μ+ k , μk ≥ 0, ∀k, k = 1, . . . , p − 1, Aij ≥ 0, ∀i, j.
5.3.1.2
(5.31)
Gene regulatory networks
Consider the model of the gene regulatory network described as x˙ i = γi + fi (x) − di xi , n n bij xj ij fi (x) =
j=1 n
m 1+ kij xj ij j=1
(5.32)
,
where γi and di > 0 are the basal transcription and degradation/dilution rates, respectively, fi are activation (nij = mij > 0) and repression (nij = 0, mij > 0) Hill input functions [48], and kij and bij denote the contribution of the different transcription factors on the transcription rate. The model (5.32) can be reformulated and cast in a form that allows identification using Linear Programming. Consider the corresponding discrete-time system of (5.32) xi (tk+1 ) = xi (tk ) + t(γi + fi (xi (tk )) − di xi (tk )),
(5.33)
Reverse Engineering Biomolecular Networks 107
where t = tk+1 − tk . If bij , kij and mij are unknown then (5.33) is not affine in the unknown parameters as is the case in (5.29). We rewrite (5.33) as follows: m n˜ kij xj ij ) + t bij xj ij + tbi = 0, (5.34) (xi (tk )(1 − tdi ) − xi (tk+1 ) + tγi )(1 + j
where, for nij = n¯ ij = 0, bi =
j
n¯
bij x¯ j ij =
j
bij , whereas, for nij > 0, n˜ ij = nij . For all i, j, let an entry to
j
matrix B be bij for which nij > 0, and let an entry of matrix K be kij . As before, given a set of measurements, xˆ , we can approximate the structure of the gene regulatory network determined by bij , bi and kij if the Hill coefficients mij and nij are known and the basal production and degradation rates are known or considered uncertain but within a known range. For instance, we can try to recover B, K through the following LP: min vec(B b K) 1 s.t.
⎛
− μ < (ˆxi (tk )(1 − tdi ) − xˆ i (tk+1 ) + tγi ) ⎝1 + + t
⎞ m kij xˆ j ij ⎠
j n bij xˆ j ij
+ tbi < μ,
j
μ > 0, bij , kij , bi ≥ 0, ∀i, j, k, 0 ≤ 1i ≤ γi ≤ 2i , 0 ≤ ε1i ≤ di ≤ ε2i , ∀i,
(∗),
(5.35)
where the last requirements (∗) represent the case of uncertain production and degradation rates. Note that as for (5.32) kij = 0, if and only if bij = 0 or bi = 0, ∀i, j.
(5.36)
In the following case, the solution of (5.35) violates (5.36), that cannot be implemented in a LP: • • •
If kij = / 0, bij = 0 and bi = 0, then the production of Xi is not influenced by Xj , i.e. it is the same case as when kij = 0. If bij = / 0 and kij =0, then Xj enhances the production of Xi , i.e. it is the same case as when kij = / 0. If bi = / 0 and kij = 0 ∀i, then the production of Xi is not affected by Xj , i.e. it is the same case when bi = 0.
5.3.2 Approaches based on S-systems
Several numerical techniques have been proposed in the literature for inferring S-systems from time series measurement; most of them use computationally expensive meta-heuristics such as Genetic Algorithms (GAs), Simulated Annealing (SA), artificial neural networks function approximation or global optimization methods (see [49] and references therein). Akutsu et al. [50] developed a simple approach based on a linear programming method, named SSYS-1. n n Assume that dXdti (t) > 0 in (5.6) at time t. By taking the logarithm of each side of αi Xj gi,j > βi Xj hi,j , j=1
j=1
we have log αi +
n j=1
gi,j log Xj (t) > log βi
n j=1
hi,j log Xj (t).
(5.37)
108
Handbook of Statistical Systems Biology
Since Xj (t) are known, this is a linear inequality if we consider log αi and log βi as parameters. The same is valid in the case of dXdti (t) < 0. Therefore, solving these inequalities by using LP, we can determine the parameters. However, the parameters are not determined uniquely even if many data points are given, because the inequality can be rewritten as (log αi − log βi ) +
n
(gi,j − hi,j ) log Xj (t) > 0.
(5.38)
j=1
Therefore, only the relative ratios of log αi − log βi and gi,j − hi,j are computed. However this information is useful for qualitative understanding of S-systems. Since it seems that gi,j = / hi,j holds for most (i, j), the fact that |gi,j − hi,j | is not small means that Xi is influenced by Xj . An approach based on alternating regression (AR) was proposed as a fast deterministic method for S-system parameter estimation with low computational cost [51]. A method, inspired by AR and based on multiple linear regression and sequential quadratic programming (SQP) optimization, is proposed in [49] for identification of S-systems models from time-series, when no information about the network topology is known. Furthermore, the algorithm is extended to the optimization of network topologies with constraints on metabolites and fluxes. This method is based on the substitution of differentials with estimated slopes and the minimization of the differences between two vectors obtained from multiple linear regression (MLR) equations. Consider the following relation Si (tn ) = PTi (tn ) − DTi (tn ), n = 1, . . . , N, PTi (tn ) = αi
M
Xj (tn )gi,j , DTi (Tn ) = βi
j=1
M
(5.39)
Xj (tn )hi,j ,
j=1
where Si (tn ) represents the estimated slope of metabolite i at time tn , PTi (tn ), DTi (tn ) the production and degradation term vectors, respectively, and N is the number of time points and M the number of species (e.g. metabolites). Because PTi must be positive, (5.39) written as log PTi = log(STi + DTi )
(5.40)
LVpi = γi ,
(5.41)
or in matrix form as
where ⎛
⎞ 1 log(X1 (t1 )) · · · log(XM (t1 )) ⎜. ⎟ .. .. .. ⎟, . L =⎜ . . . ⎝. ⎠ 1 log(X1 (tN )) · · · log(XM (tN )) Vpi = [log αi gi1 gi2 . . . giM ], γi = log(STi + DTi ).
(5.42)
By MLR, the production parameter vector Vpi can be obtained as Vpi = (LT L)−1 LT γi
(5.43)
Reverse Engineering Biomolecular Networks 109
and substituting (5.43) in (5.41) we obtain L(LT L)−1 LT γi = γi .
(5.44)
Note that γi must be an eigenvector of the matrix W = L(LT L)−1 LT , with an eigenvalue equalling 1. Several standard algorithms to calculate the eigenvector of the matrix W directly were implemented, but none of them returned a satisfactory result. Therefore the task was reformulated as a minimization problem for the logarithm of the squared residuals between the right- and left-hand sides in (5.44), to define this problem in matrix form with the cost function: F = log((γi − γˆ i )T (γi − γˆ i ))
(5.45)
where γ̂_i = W γ_i.

5.3.3 A case-study
In the following, the applicability of the approach devised in [13] is demonstrated by reconstructing the glycolytic pathway of Lactococcus lactis, using the same experimental data as in [52]. L. lactis is a bacterium used in the dairy industry for the production of cheese and buttermilk, mainly because of its capacity to convert about 95% of the milk sugar lactose (Lact) to lactic acid. The glycolytic pathway (or glycolysis) consists of chemical reactions that convert glucose (Glu) into pyruvate (Pyru). In the first step, glucose is converted into glucose-6-phosphate (G6P). A conversion of G6P into fructose-1,6-bisphosphate (FBP) follows, which is then converted sequentially to glyceraldehyde-3-phosphate (Ga3P), 3-phosphoglyceric acid (3-PGA) and phosphoenolpyruvate (PEP). Additionally, glucose and PEP are converted directly to pyruvate and G6P. In [52], since measurement data for the intermediate Ga3P were unavailable, an additional rate denoting depletion of FBP was included. A simplified description of the pathway is illustrated in Figure 5.4. In [13] the following complexes which participate in the chemical reaction network are assumed: Glu, G6P, FBP, 2×3PGA, 2×PEP, 2×Pyru and Lact. Therefore for the glycolytic pathway in the form of system (5.5)
Figure 5.4 The glycolysis of L. lactis
the matrix Y , the vectors of the species (x) and of the functions (f = ) are given by ⎛ ⎞ ⎛ ⎞ ⎛ ⎞ [Glu] 1 0 0 0 0 0 0 [Glu] ⎜ [G6P] ⎟ ⎜ [G6P] ⎟ ⎜0 1 0 0 0 0 0⎟ ⎜ ⎟ ⎜ ⎜ ⎟ ⎟ ⎜ ⎟ ⎜ ⎜ ⎟ ⎟ ⎜ [FBP] ⎟ ⎜ [FBP] ⎟ ⎜0 0 1 0 0 0 0⎟ ⎟ ⎜ ⎟ ⎜ ⎜ ⎟ ⎟ ⎜ ⎟ ⎜ ⎜ 2⎟ Y = ⎜0 0 0 2 0 0 0⎟, x = ⎜[3PGA]⎟, f (x) = ⎜[3PGA] ⎟ . ⎟ ⎜ ⎟ ⎜ ⎜ ⎟ ⎜0 0 0 0 2 0 0⎟ ⎜ [PEP] ⎟ ⎜ [PEP]2 ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜0 0 0 0 0 2 0⎟ ⎜ [Pyru] ⎟ ⎜ [Pyru]2 ⎟ ⎝ ⎠ ⎝ ⎝ ⎠ ⎠ [Lact] 0 0 0 0 0 0 1 [Lact] For reverse engineering the network topology described by the matrix Ak one of the possible LPs to be solved is the following: given Y ⎛ ⎜ min ⎜ ⎝
⎞
xˆ (t1 ) − xˆ (t2 ) + (t2 − t1 )Af (ˆx(t1 )) .. .
⎟ ⎟ 1 +α vec(A) 1 ⎠
xˆ (tp−1 ) − xˆ (tp ) + (tp − tp−1 )Af (ˆx(tp−1 ))
s.t.
A = YAk Aki,j ≥ 0, i = / j, ∀i, j, eT Ak = 0,
(5.46)
where α is a nonnegative constant that allows the sparsity of A to be regulated explicitly. In [13], (5.46) is solved for α = 0, α = 2, and α = 3, and the corresponding pathway is shown in Figure 5.5, where we can see that the sparse reaction topology was almost completely reconstructed. Note that a gradual increase of α, for 3 ≤ α ≤ 75, does not change the network structure.
Figure 5.5 Reconstructed reaction topology for the glycolysis of L. lactis: two reactions only obtained for ˛ = 0 and ˛ = 2 are marked with ˛2 , one for ˛ = 0 with ˛0 , and one for ˛ = 3 with ˛3
References [1] P. A. Brown and D. Botstein, ‘Exploring the new world of the genome with DNA microarrays,’ Nature Genetics, vol. 21, no. 1, pp. 33–37, 1999. [2] R. J. Lipschutz, S. P. A. Fodor, T. R. Gingeras, and D. J. Lockhart, ‘High density synthetic oligonucleotide arrays,’ Nature Genetics, vol. 21, no. 1, pp. 20–24, 1999. [3] F. d’Alch´e Buc and V. Schachter, Modeling and simulation of biological networks. In Proceedings of the International Symposium on Applied Stochastic Models and Data Analysis (ASMDA’05), Brest, France, 2005, pp. 167–179. [4] M. Hecker, S. Lambeck, S. Toepfer, E. van Someren, and R. Guthke, ‘Gene regulatory network inference: Data integration in dynamic models – A review,’ BioSystems, vol. 96, pp. 86–103, 2009. [5] T. S. Gardner, D. di Bernardo, D. Lorenz, and J. J. Collins, ‘Inferring genetic networks and identifying compound mode of action via expression profiling,’ Science, vol. 301, pp. 102–105, 2003. [6] D. di Bernardo, T. S. Gardner, and J. J. Collins, ‘Robust identification of large genetic networks,’ in Proc. Pacific Symposium on Biocomputing (PSB’04), Hawaii, USA, 2004, pp. 486–497. [7] K.-H. Cho, J.-R. Kim, S. Baek, H.-S. Choi, and S.-M. Choo, ‘Inferring biomolecular regulatory networks from phase portraits of time-series expression profiles,’ FEBS Letters, vol. 580, no. 14, pp. 3511–3518, 2006. [8] S. Kim, J. Kim, and K.-H. Cho, ‘Inferring gene regulatory networks from temporal expression profiles under timedelay and noise,’ Computational Biology and Chemistry, vol. 31, no. 4, pp. 239–245, 2007. [9] S. Han, Y. Yoon, and K. H. Cho, ‘Inferring biomolecular interaction networks based on convex optimization,’ Computational Biology and Chemistry, vol. 31, no. 5–6, pp. 347–354, 2007. [10] F. Montefusco, C. Cosentino, and F. Amato, ‘CORE–Net: Exploiting Prior Knowledge and Preferential Attachment to Infer Biological Interaction Networks,’ IET Systems Biology, vol. 4, no. 5, pp. 296–310, 2010. [11] R. Guthke, U. Moller, M. Hoffman, F. Thies, and S. Topfer, ‘Dynamic network reconstruction from gene expression data applied to immune response during bacterial infection,’ Bioinformatics, vol. 21, pp. 1626–1634, 2005. [12] J. Kim, D. G. Bates, I. Postlethwaite, P. Heslop-Harrison, and K. H. Cho, ‘Linear time-varying models can reveal non-linear interactions of biomolecular regulatory networks using multiple time-series data,’ Bioinformatics, vol. 24, pp. 1286–1292, 2008. [13] E. August and A. Papachristodoulou, ‘Efficient, sparse biological network determination,’ BMC Systems Biology, vol. 3:25, 2009. [14] J. Gim, H.-S. Kim, J. Kim, M. Cho, J.-R. Kim, Y. J. Chung, and K.-H. Cho, ‘A system-level investigation into the cellular toxic response mechanism mediated by AhR signal transduction pathway,’ Bioinformatics, vol. 26, no. 17, pp. 2169–2175, 2010. [15] D. Marbach, R. J. Prill, T. Schaffter, C. Mattiussi, D. Floreano, and G. Stolovitzky, ‘Revealing strengths and weaknesses of methods for gene network inference,’ Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 14, pp. 6286–6291, 2010. [16] M. A. Savageau, A critique of the enzymologists test tube. In Fundamentals of Medical Cell Biology. Academic Press, New York, 1992, pp. 45–108. [17] R. Heinrich and S. Schuster, The Regulation of Cellular Systems. Chapman & Hall, New York, 1996. [18] D. C. Montgomery and G. C. Runger (Eds), Applied Statistics and Probability for Engineers. John Wiley & Sons, Ltd, New York, 2003. [19] D. di Bernardo, M. J. Thompson, T. S. Gardner, S. E. 
Chobot, E. L. Eastwood, A. P. Wojtovich, S. J. Elliott, S. E. Schaus and J. J. Collins, ‘Chemogenomic profiling on a genomewide scale using reverse-engineered gene networks,’ Nature Biotechnology, vol. 23, pp. 377–383, 2005. [20] M. Bansal, V. Belcastro, A. Ambesi-Impiombato, and D. di Bernardo, ‘How to infer gene regulatory networks from expression profiles,’ Molecular Systems Biology, vol. 3, p. 78, 2007. [21] M. Bansal, G. D. Gatta, and D. di Bernardo, ‘Inference of gene regulatory networks and compound mode of action from time course gene expression profiles,’ Bioinformatics, vol. 22, no. 7, pp. 815–822, 2006. [22] L. Ljung, System Identification: Theory for the User. Prentice Hall, Upper Saddle River, NJ, 1999. [23] R. Bonneau, D. J. Reiss, P. Shannon, M. Facciotti, L. Hood, N. S. Baliga, and V. Thorsson, ‘The Inferelator: an algorithm for learning parsimonious regulatory networks from systems–biology data sets de novo,’ Genome Biology, vol. 7, p. R36, 2006.
[24] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, Cambridge, 2004. [25] G. H. Golub and C. F. V. Loan, ‘An analysis of the total least squares problem,’ SIAM Journal on Numerical Analysis, vol. 17, no. 6, pp. 883–893, 1980. [26] S. Skogestad and I. Postlethwaite, Multivariable Feedback Control: Analysis and Design. Chichester, New York, Brisbane, Toronto, John Wiley & Sons Ltd, Chichester, 1996. [27] T. Abatzoglou, J. M. Mendel, and G. A. Harada, ‘The constrained total least squares technique and its application to harmonic superresolution,’ IEEE Transactions on Signal Processing, vol. 39, pp. 1070–1087, 1991. [28] G. F. Franklin, J. D. Powell, and A. Emami-Naeini, Feedback Control of Dynamic Systems. Prentice Hall, Upper Saddle River, NJ, 2002. [29] C. Cosentino, W. Curatola, F. Montefusco, M. Bansal, D. di Bernardo, and F. Amato, ‘Linear matrix inequalities approach to reconstruction of biological networks,’ IET Systems Biology, vol. 1, no. 3, pp. 164–173, 2007. [30] A. Julius, M. Zavlanos, S. Boyd, and G. Pappas, ‘Genetic network identification using convex programming,’ IET Systems Biology, vol. 3, no. 3, pp. 155–166, 2009. [31] K. Zhou, Essentials of Robust Control. Prentice Hall, Upper Saddle River, NJ, 1998. [32] C. D. Meyer, Matrix Analysis and Applied Linear Algebra. SIAM Press, Philadelphia, PA, 2000. [33] S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan, (Eds.), Linear Matrix Inequalities in System and Control Theory. SIAM Press, Philadelphia, PA, 1994. [34] P. Gahinet, A. Nemirovski, A. J. Laub, and M. Chilali, LMI Control Toolbox. The Mathworks, Inc., Natick, MA, 1995. [35] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A. L-Barab´asi, ‘The large-scale organization of metabolic networks,’ Nature, vol. 407, pp. 651–654, 2000. [36] A. Wagner, ‘The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes,’ Molecular Biology and Evolution, vol. 18, pp. 1283–1292, 2001. [37] D. E. Featherstone and K. Broadie, ‘Wrestling with pleiotropy: genomic and topological analysis of the yeast gene expression network,’ Bioessays, vol. 24, pp. 267–274, 2002. [38] N. M. Luscombe, J. Qian, Z. Zhang, T. Johnson, and M. Gerstein, ‘The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties,’ Genome Biology, vol. 3 (8):research0040.1– 1.0040.7, 2002. [39] R. Albert and A. L. Barab´asi, ‘Topology of evolving networks: local events and universality,’ Physical Review Letters, vol. 85, no. 24, pp. 5234–5237, 2000. [40] I. Simon, J. Barnett, N. Hannett, C. T. Harbison, N. J. Rinaldi, T. Volkert, J. J. Wyrick, J. Zeitlinger, D. K. Gifford, T. S. Jaakkola, and R. A. Young, ‘Serial regulation of transcriptional regulators in the yeast cell cycle,’ Cell, vol. 106, no. 1, pp. 697–708, 2001. [41] J. B¨ahler, ‘Cell–cycle control of gene expression in budding and fission yeast,’ Annual Review of Genetics, vol. 39, no. 1, pp. 69–94, 2005. [42] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Andres, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher, ‘Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,’ Molecular Biology of the Cell, vol. 9, no. 1, pp. 3273–3297, 1998. [43] C. Stark, B. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers, ‘BioGRID: a General Repository for Interaction Datasets,’ Nucleic Acids Research, vol. 34, no. 1, pp. D535–D539, 2006. [44] D. Olson and D. 
Delen (Eds), Advanced Data Mining Techniques. Springer, Berlin, 2008. [45] J. Yu, V. Smith, P. Wang, A. Hartemink, and E. Jarvis, ‘Advances to Bayesian network inference for generating causal networks from observational biological data,’ Bioinformatics, vol. 20, pp. 3594–3603, 2009. [46] M. C. Teixeira , P. Monteiro, P. Jain, S.Tenreiro, A. R. Fernandes, N. P. Mira, M. Alenquer, A. T. Freitas, A. L. Oliveira and I. Sá-Correia, ‘The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae,’ Nucleic Acids Research, vol. 34, no. 1, pp. D446–D451, 2006. [47] M. Zavlanos, A. Julius, S. Boyd, and G. Pappas, Identification of stable genetic networks using convex programming. In Proceedings of the 2008 American Control Conference, Seattle, WA, USA, 2008 , pp. 2755–2760. [48] U. Alon, An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman & Hall/CRC, Boca Raton, FL, 2006.
Reverse Engineering Biomolecular Networks 113 [49] M. Vilela, I. C. Chou, S. Vinga, A. T. Vasconcelos, E. Voit, and J. Almeida, ‘Parameter optimization in S-system models,’ BMC Systems Biology, vol. 2, p. 35, 2008. [50] T. Akutsu, S. Miyano, and S. Kuhara, ‘Inferring qualitative relations in genetic networks and metabolic pathways,’ Bioinformatics, vol. 16, no. 8, pp. 727–734, 2000. [51] I. Chou, H. Martens, and E. Voit, ‘Parameter estimation in biochemical systems models with alternating regression,’ Theoretical Biology and Medical Modelling, vol. 3, no. 1, p. 25, 2006. [52] A. R. Neves, R. Ventura, N. Ansour, C. Shearman, M. J. Gasson, C. Maycock, A. Ramos, and H. Santos, ‘Is the glycolytic flux in Lactococcus lactis primarily controlled by the redox charge? Kinetics of NAD+ and NADH pools determined in vivo by 13 C NMR,’ The Journal of Biological Chemistry, vol. 277, no. 31, pp. 28 088–28 098, 2002.
6 Algebraic Statistics and Methods in Systems Biology
Carsten Wiuf
Bioinformatics Research Centre, Aarhus University, Denmark
6.1 Introduction
Algebra in statistics is as old as statistics itself; examples include the use of linear algebra in theory for Gaussian variables and the use of group theory in experimental designs and rank statistics (Diaconis 1988). However, the term algebraic statistics is only about ten years old and was coined by Pistone et al. (2000) to denote the use of algebraic geometry in statistics. Algebraic geometry is the study of (ratios of) polynomials and the solutions to systems of polynomial equations. Many common statistical procedures involve finding solutions to polynomial equations and it is therefore not surprising that algebraic geometry might be used in statistical methodology. Recent developments in computational algebra and their implementation in computer software have facilitated the application of algebraic geometry in statistics and probability, and generated what is now understood as algebraic statistics. The field of algebraic statistics is still young, but very active, and the full range of its use is still to be seen. A historic account is given in Riccomagno (2009) and an editorial survey in Fienberg (2007). The main areas of research in algebraic statistics include graphical models (Geiger et al. 2004; Garcia et al. 2005), contingency tables and conditional independency tests (Diaconis and Sturmfels 1998; Drton et al. 2008; Dobra et al. 2009), and experimental designs (Pistone et al. 2000; Drton et al. 2008). Most theory and applications are concerned with discrete modelling on a finite state space, where the relationship to the algebraic structures is wellestablished (Pistone et al. 2000; Pachter and Sturmfels 2005; Drton et al. 2008). It is therefore not surprising that algebraic statistics is often encountered in connection with discrete graphical models and computational biology of sequence data. In recent years, several textbooks on algebraic statistics have been published, see for example Pistone et al. (2000), Pachter and Sturmfels (2005), Drton et al. (2008) and Gibilisco et al. (2010); in addition many lecture
and workshop notes are freely available on the internet. These provide a solid background in polynomial algebra and algebraic geometry and their relationship to statistics and probability. This chapter relies to large extent on these general and expository presentations. As mentioned above algebraic statistics has already found its use in computational biology (Pachter and Sturmfels 2005). It is mainly through the extensive use of discrete (finite) state space models with latent variables such as Hidden Markov Models (HMMs) and more generally graphical models. Systems biology is also rich in graphical and other finite state-space models (see Chapters 11–13 and 19 for some examples), but has not yet seen the same use of algebraic statistics as computational biology. Thus, there is an anticipated potential for algebraic statistics in systems biology too. Here, systems biology is seen as the study of (dynamical) systems of different components and their interactions and dependencies. Additionally, systems biology is rich in deterministic models, for example models of pathways and reaction networks based on ordinary differential equations or Boolean Networks (Chapters 16 and 17). The algebraic techniques used in algebraic statistics have also been applied in the deterministic setting [see Manrai and Gunawardena (2008), Craciun et al. (2009), and Veliz-Cuba et al. (2010) for examples], and biochemical theory has been linked and developed in the context of computational algebra (Gatermann 2001; Gatermann and Huber 2002). The use of these techniques appears to be spreading (Laubenbacher and Sturmfels 2009). Part of this chapter will therefore be concerned with results obtained through these methods and how they might be applied in a statistical setting.
6.2 Overview of chapter
In relation to systems biology, in particular with reference to graphical modelling, algebraic techniques have been used for the following:

(i) parameter inference;
(ii) derivation of model invariants;
(iii) hypothesis tests under log-linear models;
(iv) reverse engineering of networks.
The above list is not exhaustive, but provides some examples. Here parameter inference is used in a statistically nonstandard way. It refers to a range of different activities such as finding all parameter values ω that result in the same probabilites Pω (X = x) or identifying all ω in a HMM that have the same optimal hidden path (Viterbi sequence). Gr¨obner bases and Newton polytopes are algebraic tools used here. Finding model invariants is about reducing a system of equations to a smaller set, for example by means of Gr¨obner bases, that are characteristic of the model. Model invariants have been investigated in phylogenetics (Cavender and Felsenstein 1987; Evans and Speed 1993; Sturmfels and Sullivant 2005; Allman and Rhodes 2007), where they can be used to statistically distinguish between different phylogenetic trees. In systems biology, stochastic models often have very large state spaces, which makes the task of finding invariants daunting. In Manrai and Gunawardena (2008), invariants for a system of protein phosphorylation are derived in a deterministic model, which is a simpler task than in a similar stochastic model. The invariants predict that (perfect) observations of protein concentrations should lie on a plane (Example 6.4.4). If the system is adequately described by the deterministic model overlaid with some random noise, then the invariants can be used statistically to distinguish between different qualitative behaviours of the system. Log-linear models are well-studied. They are frequently used in systems biology and might provide one of the best examples of the use of algebraic techniques in statistics. One of the first theorems of algebraic statistics relates to exact tests for log-linear models and how a Markov chain Monte Carlo (MCMC) sampler
can be constructed to sample from the conditional distribution of the data given the minimal sufficient statistic (Diaconis and Sturmfels 1998). A characterization of the state space conditional on the sufficient statistic is given in terms of Markov bases (see Section 6.7) and these are explicitly used in the construction of the MCMC sampler. The best example might be Fisher’s exact test and conditional sampling from a contingency table with given fixed marginals under the assumption of independence of rows and columns. Generally, reverse engineering refers to estimation of a dependency structure underlying the data, such as a tree, network, or set of functions. In the present context it relates to finding algebraic functions that fit noise-free data perfectly and one way of doing this is discussed (Laubenbacher and Stigler 2004). The data could be time series mesurements of different components in a system. Here, reverse engineering deviates from the standard usage in two senses; first of all, uncertainty in the data is ignored and secondly, the set of functions is generally very large such that several best fit solutions can be found. In the last section some concluding remarks and perspectives are offered.
6.3 Computational algebra
A system is not only represented by its individual constituents but also by the interactions and dependencies between constituent parts. Over the years, many different types of mathematical models have been proposed for modelling biological systems; for example Boolean Networks (Kauffman 1969), Bayesian Nets (Chapters 12 and 13), Petri Nets (Wilkinson 2006; Chaouiya et al. 2008), and logical networks (de Jong 2002; Laubenbacher 2005). Some perspectives on the role of algebra in modelling can be found in Mishra (2007). Symbolic representations in terms of graphs and diagrams are used such that constraining/defining relations are represented by corresponding graphical/diagrammatic entities. Typically, a mathematical model of a system has many parameters and variables. A priori there is little possibility to fix parameters to certain ‘known’ values, let alone restricting them to certain parts of the parameter space. In general, this provides a challenge for statistical analysis as well as for mathematical treatment. Computational algebra, using concepts from abstract algebra such as Gr¨obner bases and Newton polytopes, allows for symbolic manipulation of equations involving constants, parameters and variables, irrespective of their values and numbers (Laubenbacher and Sturmfels 2009). Computer programs like Mathematica and Maple, as well as noncommercial programs like Macauley2 (Grayson and Stillman 1996), Singular (Decker et al. 2010), Maxima (Maxima 2009), and CoCoa (CoCoATeam), have very efficient algebraic routines for handling symbolic equations and computations, and these programs are key to the successful application of algebraic methods in statistics and systems biology. In many cases, manual manipulations are not possible due to the number of equations and terms, and numeric manipulations only allow for the exploration of a limited part of the state and parameter spaces. Besides, this does not provide general insight into the nature of the system. To apply algebraic methods we are restricted to consider statistical and mathematical models with certain polynomial parameterizations and constraints. However, there are many of these. Some well-known examples are given below. Example 6.3.1 (Binomial model) Let X be a binomial random variable with probabilities m−j mm − j m j m−j θ (1 − θ) = (−1)i θ j+i , pj = gj (θ) = j j i
(6.1)
i=0
for j = 0, . . . , m, and θ in (0, 1). The probabilities pj are given as polynomials gj (θ) in one unknown, θ, and with rational (binomial) coefficients, and therefore pj are in the so-called polynomial ring Q[θ] of all
polynomials in one unknown and with rational coefficients. In this case, the polynomials gj are of degree m and have terms of lower order. On the other hand, one can see that the following relations are satisfied

$$p_{j-1}\,p_{j+1} = p_j^2\,\frac{j(m-j)}{(j+1)(m-j+1)}, \qquad (6.2)$$

and that any model fulfilling these relations is a binomial model for some θ. Thus, writing

$$f_j(z_0,\ldots,z_m) = z_{j-1}z_{j+1} - z_j^2\,\frac{j(m-j)}{(j+1)(m-j+1)},$$
we find that for any binomial model, the probabilities p = (p0, . . . , pm) are positive zero-points (solutions) of some polynomials (the fj) in m + 1 unknowns (z0, . . . , zm). The set of binomial models is therefore given as $V_{\mathrm{Bin}} = \{p \mid f_j(p) = 0\}$ with the constraint that the probabilities sum to one. This constraint is also a polynomial relation. A set defined as the common zero-points of some polynomials is called a variety. In Section 6.4, two different definitions of being an algebraic statistical model are given. The first (Definition 6.4.1) corresponds to defining the binomial models by Equation (6.2), whereas the second (Definition 6.4.2) corresponds to defining the binomial models by Equation (6.1).

Example 6.3.2 (Contingency table) Consider an r × s table assuming independence of rows and columns. The probabilities pij, i = 1, . . . , r, and j = 1, . . . , s, are given as polynomials in r + s unknowns,

$$p_{ij} = g_{ij}(p_1,\ldots,p_r,q_1,\ldots,q_s) = p_i q_j, \qquad (6.3)$$

with the constraints $\sum_i p_i = \sum_j q_j = 1$, or alternatively, as polynomial constraints on the probabilities pij,

$$p_{ij} = \sum_{k=1}^{s} p_{ik} \sum_{l=1}^{r} p_{lj}. \qquad (6.4)$$
In the first case, the polynomials gij are in the polynomial ring Q[p1, . . . , pr, q1, . . . , qs] of all polynomials in r + s unknowns and with rational coefficients. All nonzero coefficients are one and all polynomials are of degree two. In the second case, the polynomial constraints may be rewritten as

$$0 = p_{ij}p_{kl} - p_{il}p_{kj}, \qquad (6.5)$$

with the constraint that $\sum_{i,j} p_{ij} = 1$ (which follows from pij = pi qj). The right-hand sides are polynomials of degree two in the polynomial ring Q[p11, . . . , prs] with rs unknowns, again with all nonzero coefficients being one. Similarly to Example 6.3.1, the probability distributions of the contingency table have been characterized in two ways. The first defines the distributions through a polynomial parameterization (6.3) and corresponds to Definition 6.4.2. The other defines them as the zero-points of polynomials of the form (6.5) and corresponds to Definition 6.4.1. The variety of zero-points is given by $V_{2\times 2} = \{p \mid p_{ij}p_{kl} - p_{il}p_{kj} = 0, \text{ for } i, k = 1, \ldots, r \text{ and } j, l = 1, \ldots, s\}$.
Algebraically, this variety is well-studied and goes under the name of the Segre variety. Contingency tables are discussed further in Section 6.7. In the last example, the variety $V_{2\times 2}$ is the same as that derived from Equation (6.4), but given by a different set of polynomials. The polynomials are examples of model invariants, algebraic relations that are fulfilled for all probability distributions in the model. In general, it is desirable to find simple model invariants, because these might have biological interpretations and might be used in connection with statistical inference and model selection. The algebraic apparatus for finding model invariants is essentially Gröbner bases, which will be considered in Section 6.6.
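As a small illustration (not from the original text), the invariants of the form (6.5) can be checked symbolically for the independence parameterization (6.3) using a computer algebra system; the following sketch assumes the Python package sympy is available and uses the 2 × 3 table as an example.

```python
# Illustrative sketch: the invariants p_ij p_kl - p_il p_kj of Equation (6.5)
# vanish identically under the independence parameterization (6.3).
import sympy as sp

p1, p2 = sp.symbols('p1 p2', positive=True)          # row parameters
q1, q2, q3 = sp.symbols('q1 q2 q3', positive=True)   # column parameters

# Cell probabilities p_ij = p_i q_j for a 2 x 3 table
P = [[p1*q1, p1*q2, p1*q3],
     [p2*q1, p2*q2, p2*q3]]

for i in range(2):
    for k in range(2):
        for j in range(3):
            for l in range(3):
                assert sp.expand(P[i][j]*P[k][l] - P[i][l]*P[k][j]) == 0
print("All invariants p_ij p_kl - p_il p_kj vanish on the independence model.")
```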
6.4 Algebraic statistical models
In the literature, different definitions are given for a model to be an algebraic statistical model; Pachter and Sturmfels (2005) define it through a polynomial parameterization, whereas Sullivant (2008) defines it implicitly through polynomial constraints (a variety); see also Drton and Sullivant (2007) and Drton et al. (2008) for further discussion. In Examples 6.3.1 and 6.3.2, both definitions were given. As already discussed, for solving and manipulating equations, there is a large body of algebraic theory and methods available. This is the main reason for seeking models with these special characteristics and differs from how other broad classes of statistical models typically are defined. For example, graphical models are defined through conditional independence relations which represent explicit statistical constraints or properties of the model, and exponential families are defined through assumptions about the form of the density function. As we will see later, both these classes of models are related to algebraic statistical models.

6.4.1 Definitions
Consider a probability distribution defined on a finite state space, $\mathcal{X} = \{x_1, \ldots, x_m\}$, and let X be a stochastic variable or vector that takes values in $\mathcal{X}$. Define pi = Pr(X = xi). Then the vector of probabilities p = (p1, . . . , pm) is a point in the simplex $\Delta_m$,

$$\Delta_m = \Big\{p \in \mathbb{R}^m \;\Big|\; p_i \geq 0 \text{ for all } i, \text{ and } \sum_{i=1}^{m} p_i = 1\Big\}.$$

In the examples, I might refrain from the general notation outlined here whenever it is convenient. In our context we define a statistical model M for a stochastic variable to be a subset of $\Delta_m$, i.e. $M \subseteq \Delta_m$. For example, Examples 6.3.1 and 6.3.2 both constitute statistical models.

Definition 6.4.1 An algebraic statistical model M is a set of probability distributions of the form $M = V(F) \cap \Delta_m$, where $F \subseteq \mathbb{Q}[p]$ (or $\mathbb{R}[p]$) is a set of polynomials with rational (real) coefficients and V(F) is the variety $V(F) = \{p \mid f(p) = 0 \text{ for all } f \text{ in } F\}$.

In the parlance of statistics, the simplex $\Delta_m$ is the full, unrestricted model and the variety V(F) imposes certain constraints on the probability distributions. These constraints must be given as polynomial constraints in the variables p1, . . . , pm. Thus, not every statistical model is an algebraic statistical model. However, many statistical models in systems biology are algebraic statistical models in the above sense.
There is another class of algebraic models, which is referred to as parametric algebraic statistical models (Drton et al. 2008; Sullivant 2008).

Definition 6.4.2 A parametric algebraic statistical model M is a set of probability distributions of the form $M = \phi(\Theta)$, where $\Theta \subseteq \mathbb{R}^d$ is a semi-algebraic set (the parameter space) and $\phi : \Theta \to \Delta_m$ a rational function.
A rational function is a ratio of two polynomials; in particular a polynomial is a rational function. A semi-algebraic set is a subset of $\mathbb{R}^n$ defined by a finite number of polynomial equations and inequalities, or any finite union of such sets (Cox et al. 2007). The technicalities will not be of importance here. All standard models are defined on semi-algebraic parameter sets.

Definitions 6.4.1 and 6.4.2 do not necessarily agree. The set $\phi(\Theta)$ does not have to be a variety, or more precisely, the intersection of a variety with the simplex (cf. Definition 6.4.1), though in many cases it is. It is desirable for it to be a variety because one can then directly apply results about varieties. To remedy this, one might consider the smallest variety that contains $\phi(\Theta)$, i.e. a set of polynomials whose zeros include the set $\phi(\Theta)$. This variety can be much bigger than $\phi(\Theta)$, implying that the variety contains probability distributions that are not in $\phi(\Theta)$, and thus the two definitions are distinct.

Example 6.4.1 (Example 6.3.1 continued) The binomial model is an algebraic statistical model as well as a parametric algebraic statistical model. In the latter case $\Theta_{\mathrm{Bin}} = (0, 1)$ and $\phi_{\mathrm{Bin}}(\theta) = (g_0(\theta), \ldots, g_m(\theta))$ for θ in (0, 1), whereas in the former case, $F = F_{\mathrm{Bin}}$.

Example 6.4.2 (Example 6.3.2 continued) A 2 × 2 contingency table is a parametric algebraic statistical model with $\Theta_{2\times 2} = \Delta_r \times \Delta_s$ and $\phi_{2\times 2}(\omega) = (g_{11}(\omega), g_{12}(\omega), \ldots, g_{rs}(\omega))$, and an algebraic statistical model with $F = F_{2\times 2}$.
In the two examples given above it seems natural to define the model through Definition 6.4.2, and subsequently establish that Definition 6.4.1 also applies, i.e. that $\phi(\Theta)$ is a variety. However, in some cases, as will be seen in the next subsection, Definition 6.4.1 is a natural starting point.

6.4.2 Further examples
Graphical models provide an example of a class of models that by definition form algebraic statistical models. However, if the graphical model contains hidden variables, then it is not an algebraic statistical model in the sense of Definition 6.4.1 (Pachter and Sturmfels 2004a,b).

Example 6.4.3 (Graphical models) An undirected graphical model M (see Chapter 11; Lauritzen 1996) defined on a finite state space forms an algebraic statistical model. The probability distributions in M are defined via conditional independence relations that cause the probabilities to factorize in certain ways (Lauritzen 1996). For example, the conditional independence relation X1 ⊥⊥ X2 | X3 (X1 and X2 are conditionally independent given X3) corresponds to the relation P(X1 = x1, X2 = x2 | X3 = x3) = P(X1 = x1 | X3 = x3) P(X2 = x2 | X3 = x3),
which might be written as

$$p_{x_1 \bullet x_3}\, p_{\bullet x_2 x_3} = p_{x_1 x_2 x_3}\, p_{\bullet\bullet x_3}, \qquad (6.6)$$

where $p_{x_1 x_2 x_3} = P(X_1 = x_1, X_2 = x_2, X_3 = x_3)$ and • indicates summation over the respective values of the variables. The conditional independence relations might also be given in the form of cross-product relations,

$$p_{x_1 x_2 x_3}\, p_{x_1' x_2' x_3} = p_{x_1' x_2 x_3}\, p_{x_1 x_2' x_3}, \qquad (6.7)$$

which follows straightforwardly from the definition of conditional independence (Lauritzen 1996). Using Equation (6.6) or (6.7), the probability distributions can be written as the positive zero-points of a set of polynomials, hence as a variety. For example, Equation (6.7) corresponds to the quadratic polynomial

$$f(p) = p_{x_1 x_2 x_3}\, p_{x_1' x_2' x_3} - p_{x_1' x_2 x_3}\, p_{x_1 x_2' x_3}.$$
If all variables are observed, that is if all terms px1x2x3 correspond to a possible observable outcome, then f = 0 defines a polynomial constraint on the probabilities and thus fits into Definition 6.4.1. However, if some variables are hidden, as for example is the case in a HMM, then the probabilities of the observed variables are partly parameterized by the probabilities of the hidden variables (Pachter and Sturmfels 2005). There is a large body of work on graphical models. In the algebraic statistics literature, they have been treated rigorously (Dobra 2003; Dobra and Sullivant 2004; Geiger et al. 2004), directed as well as undirected graphical models, and Bayesian Nets (Garcia et al. 2005), and also in relation to log-linear models (Section 6.7) and exponential families in general (Drton et al. 2008).

Equilibrium distributions for probabilistic Boolean Networks might be characterized in a similar way. The same is true for stochastic Petri Nets if they have a bounded state space. Thus these systems will also form algebraic statistical models. Many systems are conveniently described by stochastic Petri Nets (Wilkinson 2006) and I will give one example below as an illustration of an algebraic statistical model. The example will be taken up again later in connection with model invariants (see Section 6.6).

Example 6.4.4 (Two-site phosphorylation cycle) Phosphorylation acts as a way to activate and inactivate proteins. In this way, cellular processes can be controlled and signals can be mediated in pathways. Most proteins can be phosphorylated but in many cases it is unknown how many phosphorylation steps are involved in transferring a signal, in what order and how they are carried out. The topic has attracted considerable interest in the mathematical literature, where it has been shown that different modes of transferring a signal might lead to radically different and complex systems behaviour, see for example Huang and Ferrell (1996), Markevich et al. (2004) and Manrai and Gunawardena (2008). Statistical and mathematical ways of distinguishing between qualitative aspects of systems are therefore of interest. Here, as an example, a two-site phosphorylation cycle (Huang and Ferrell 1996; Manrai and Gunawardena 2008) is considered as a stochastic Petri Net. A stochastic Petri Net is defined by a set of chemical species Ci, i = 1, . . . , K, and a set of reactions between the species, such as C1 + C2 → C3. A state of the system is a vector describing the number of each species. The system jumps in a Markovian fashion between states according to rates that depend on these numbers and the form of the reactions [for details, see e.g. Cornish-Bowden (2004) for the deterministic setting and Wilkinson (2006) for the stochastic setting]. The protein has two sites which can be phosphorylated in a distributive or a processive way (Figure 6.1). In general it is difficult to know experimentally whether one or the other mode is the preferred mode of transfer, and it is therefore desirable to be able to distinguish the two modes mathematically. The cycle has the following reactions:

$$S_u + X \leftrightarrow Y_u^X \rightarrow S_v + X, \qquad (6.8)$$
Figure 6.1 Illustration of the system in Example 6.4.4. The unphosphorylated protein S00 can be phosphorylated at one or two sites by the kinase E, and likewise already-phosphorylated proteins can be dephosphorylated by the phosphatase F. Both sites being (un)phosphorylated at the same time (as indicated by the dashed line) might be possible in some systems while not in others. If the dashed line is not present, the system is referred to as distributive, and otherwise as processive. Each bidirectional arrow corresponds to several reactions in Equation (6.8). For X = E (arrow in forward direction) and u = 00, there are four (five, if the system is processive) reactions as the intermediate complex might split up in either 01 or 10 (or 11, if processive), for 01 and 10 there are only three reactions as the product is always 11. Similarly for X = F (arrow in backward direction). Thus there are 20 [= (4 + 3 + 3) · 2] rate constants for a processive system and 22 [= (5 + 3 + 3) · 2] for a distributive system
where Su and Sv are protein substrates, phosphorylated at none (00), one (10, 01) or two (11) sites, as in Figure 6.1. The symbol X denotes a kinase E (catalysing phosphorylation) or a phosphatase F (catalysing dephosphorylation), and YuX is the intermediate complex formed between substrate and kinase/phosphatase. The complex might be written as Su·X to indicate binding of X to Su. In this particular case, the number of each species is bounded by constant total numbers, S, E, and F, where

$$S = \sum_{u} [S_u] + \sum_{u} [Y_u], \qquad E = [E] + \sum_{u \neq 11} [Y_u], \qquad F = [F] + \sum_{u \neq 00} [Y_u], \qquad (6.9)$$
and [C] denotes the number of species C. Hence, the state space is finite. The first sum counts the number of molecules that involve substrates, the second how many involve kinase and the last how many involve phosphatase. The three equations are referred to as conservation laws. The rate of each reaction is given according to the law of mass-action (see Chapter 18; Cornish-Bowden 2004; Wilkinson 2006). For example the reaction S00 + E → Y00 happens at rate λ[S00][E], where λ is some positive kinetic constant. Each reaction corresponds to a specific change of the current state. Let n = (n1, n2, . . . , nK) be the current state of the system and let n1 = [S00], n2 = [E] and n3 = [Y00]. For the particular reaction the current state changes to the state n′ = (n1 − 1, n2 − 1, n3 + 1, n4, . . . , nK), because one S00 and one E form one Y00. The rate $q_{nn'}$ for the transition between the two states is the same as the reaction rate, that is $q_{nn'} = \lambda[S_{00}][E]$. Thus, the dynamics of the system is governed by a Markov rate matrix Q with entries $q_{nn'}$ for general states n and n′. All entries depend on the rate constants (λ1, λ2, . . .) and the states (n1, n2, . . .), and the entries are linear polynomials in λ1, λ2, . . . with rational (integer) coefficients, similarly to the example λ[S00][E]. In total there are 20 rate constants for the processive system, whereas the distributive system has 22 (Figure 6.1). The number of states depends on the total amounts in Equation (6.9) and is potentially large. For a comprehensive treatment of stochastic Petri Nets, see Wilkinson (2006). In this particular case the system is recurrent (from any state, one can always get back to the same state again), and a unique equilibrium distribution p exists fulfilling the linear equation system pQ = 0 (Wilkinson 2006). By solving the linear equation system for p, a parametric algebraic statistical model as in Definition 6.4.2 is obtained, where the equilibrium probability pn for state n is given as a rational function φ depending on
the parameters λ1, λ2, . . . only. The function φ maps $\{(\lambda_i)_i \mid \lambda_i > 0\} \subset \mathbb{R}^{20}$ (or $\mathbb{R}^{22}$) into $\mathbb{R}^N$, where N is the number of states in the system. The two different molecular mechanisms, processive and distributive phosphorylation, give rise to different equilibrium distributions and different algebraic invariants. If we seek to characterize systems by their invariants, then the high-dimensionality of the state-space, potentially thousands of states, puts a severe restriction on symbolic manipulations. Some resolution to this problem is offered in Example 6.6.2, where the model is further discussed.
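To make the construction pQ = 0 concrete, the following sketch (not from the original text) computes the equilibrium distribution of a small, made-up three-state rate matrix numerically; it assumes numpy and scipy are available and is only meant to illustrate the linear-algebra step, not the phosphorylation system itself.

```python
# Illustrative sketch (toy 3-state example, not the phosphorylation system):
# the equilibrium distribution p of a finite Markov rate matrix Q solves
# pQ = 0 with the entries of p summing to one.
import numpy as np
from scipy.linalg import null_space

# Toy rate matrix; off-diagonal entries are transition rates, rows sum to zero.
Q = np.array([[-2.0,  1.0,  1.0],
              [ 0.5, -1.5,  1.0],
              [ 1.0,  2.0, -3.0]])

# pQ = 0  is equivalent to  Q^T p^T = 0, so p spans the null space of Q^T.
p = null_space(Q.T)[:, 0]
p = p / p.sum()               # normalize to a probability distribution
print(p, p @ Q)               # p @ Q is numerically the zero vector
```

For the phosphorylation model, Q would depend symbolically on the rate constants λ1, λ2, . . ., which is what makes a symbolic (rather than numeric) solution of pQ = 0 so demanding.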
6.5 Parameter inference
In the algebraic statistics literature parameter inference covers distinct but inter-related activities (Pachter and Sturmfels 2004a,b, 2005). Pachter and Sturmfels (2004a) list two problems:

(i) Find all parameter values that result in the same probability distribution.
(ii) Given an observation x (from a graphical model) and hidden (unobserved) data h, find all parameter values ω such that h is the most likely explanation of the observation x.

They point out that these problems are not frequently encountered in classical parameter inference although (i) is related to the classical problem of parameter identifiability. (Under what circumstances can the parameters of a model be estimated from observations?) No details are discussed here, but note that the algebra involved relates to that of Problem (ii). Problem (ii) is in part motivated by a similar problem in alignment theory, where one asks the question of identifying all weights (match, mismatch, and gap penalties) that lead to the same optimal alignment. However, for a HMM, the statistical uncertainty in the optimal hidden path might be better addressed with other statistical measures (Cappé et al. 2005). To illustrate Problem (ii) consider a simple example, adapted from Pachter and Sturmfels (2004b), of a HMM with three hidden nodes (on/off = 0/1) and for each hidden node the dependent variable has two possible outcomes (up/down = u/d; Figure 6.2). In the context of systems biology, the hidden nodes could represent
Figure 6.2 A HMM with three hidden nodes (light blue) and corresponding output variables (dark blue). The hidden nodes have two states (0/1), and likewise the output variables (u/d). In total there are eight different hidden paths, and eight different combinations for the output variables
Table 6.1 Exponents of the transition parameters $t_{hh'}$ for the eight hidden paths. The exponents in each column sum to two, since there are two transitions in each path

                            Hidden paths, h
Transitions    000   001   010   011   100   101   110   111
00              2     1     0     0     1     0     0     0
01              0     1     1     1     0     1     0     0
10              0     0     1     0     1     1     1     0
11              0     0     0     1     0     0     1     2
dependent transcription factors, each regulating a gene, or the same transcription factor for each gene, but where additional transcription factors are required for activation of the genes. The additional transcription factors are implicitly modelled through the underlying Markov chain. The a posteriori probability $P_\omega(h|x)$ of a path h = (h1, h2, h3) given an observation x = (x1, x2, x3) is a rational function in the unknown parameters (Cappé et al. 2005),

$$P_\omega(h|x) \propto t_{h_1 h_2}\, t_{h_2 h_3}\, e_{h_1 x_1}\, e_{h_2 x_2}\, e_{h_3 x_3}, \qquad (6.10)$$
where the $t_{hh'}$ denote transition probabilities, the $e_{hx}$ emission probabilities, ω = (t00, t01, t10, t11, e0d, e0u, e1d, e1u) is the vector of transition and emission probabilities, and the initial distribution is assumed uniform. The transitions between states are summarized in Table 6.1. For example, the 2 in the first column of Table 6.1 indicates that there are two instances of the transition 00 in the path 000 (from hidden node 1 to 2, and from 2 to 3) and no other types of transitions. The optimal path, i.e. the path with the highest a posteriori probability, also called the Viterbi sequence, can be found from a geometric object called a Newton polytope (Pachter and Sturmfels 2005). The Newton polytope is the convex hull spanned by the vectors containing how many times each parameter occurs in (6.10). For example, x = (u, u, d) and h = (0, 0, 0) give the eight-dimensional vector (2, 0, 0, 0, 1, 2, 0, 0) (with entries ordered as in ω). It follows from the expression of the posterior probability,

$$P_\omega(h|x) \propto t_{00}\, t_{00}\, e_{0u}\, e_{0u}\, e_{0d} = t_{00}^2\, t_{01}^0\, t_{10}^0\, t_{11}^0\, e_{0d}^1\, e_{0u}^2\, e_{1d}^0\, e_{1u}^0.$$
Each vertex of the polytope corresponds to a particular hidden path. Note that there are only eight vectors, because the exponents of the emission probabilities follow from h (x is fixed). For illustration, a simpler Newton polytope in two dimensions is shown in Figure 6.3. The Newton polytope is in $\mathbb{R}^8$, but it is only four-dimensional because of constraints on the exponents. For example, the exponents in Table 6.1 sum to two, thereby reducing the dimension of the polytope. In this case, each vertex of the polytope corresponds to an optimal hidden path for some choice of parameters. The parameter space is divided into eight regions, such that each region contains the parameters that make a given hidden path optimal. These regions are derived from the Newton polytope. At the boundaries between regions, two or more hidden paths are equally likely. The geometric object dividing $\mathbb{R}^8$ into regions is called the fan of the Newton polytope (Pachter and Sturmfels 2005). Again, see Figure 6.3 for a simpler situation. Thus, in this case, any hidden path is optimal for some parameter choice, and no hidden path can be excluded a priori. In general, the Newton polytope is spanned by fewer vertices (optimal hidden paths) than the possible number of hidden paths. The possible number grows exponentially with the length of the HMM (in our case, the length is n = 3, hence $2^3 = 8$ paths), but the number of vertices in the polytope only grows polynomially (Pachter and Sturmfels 2004a,b). Hence, most hidden paths cannot be optimal for a given observed sequence for any choice of parameters.
Figure 6.3 (a) The Newton polytope (light blue quadrilateral with boundary indicated by darker blue lines) of the polynomial $f = \sum_{i=1}^{5} f_i$, where $f_1 = x_1^0 x_2^0 = 1$, $f_2 = x_1^2 x_2^0 = x_1^2$, $f_3 = x_1^0 x_2^1 = x_2$, $f_4 = x_1^3 x_2^2$, and $f_5 = x_1^1 x_2^1 = x_1 x_2$. The parameters are x1 and x2 and these are the coordinates on the x- and y-axes, respectively. In the context of a HMM, each polynomial would correspond to the posterior probability of a hidden path. There are five monomials (polynomials consisting of a single term) giving rise to five points, (0, 0), (2, 0), (0, 1), (3, 2), and (1, 1), but only four of these form corners in the polytope; all except (1, 1). The polytope is the convex hull of the five points. The normal fan is illustrated with red lines; its form follows from that of the polytope with red lines being perpendicular to the boundary segments of the polytope (Pachter and Sturmfels 2005). One region of the polytope is shaded in grey. (b) Here the different regions of the normal fan are put together. Each region contains the parameters (x1, x2) for which a given polynomial is larger than the other polynomials. The shaded region [corresponding to the shaded region in (a)] is optimal for f3, corresponding to the point (0, 1). Polynomial f5, corresponding to the point (1, 1), is not optimal for any choice of parameters. At the boundaries (red lines) between regions two polynomials are equally optimal.
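The geometry of Figure 6.3 can be reproduced computationally; the sketch below (not from the original text, and assuming scipy is available) builds the convex hull of the five exponent vectors and confirms that (1, 1) is not a vertex, so the corresponding monomial can never be optimal.

```python
# Illustrative sketch of the Newton polytope in Figure 6.3: the exponent
# vectors of the five monomials, and the subset that forms the vertices
# of their convex hull.
import numpy as np
from scipy.spatial import ConvexHull

# Exponent vectors of f1, ..., f5 in the parameters (x1, x2)
points = np.array([[0, 0], [2, 0], [0, 1], [3, 2], [1, 1]])

hull = ConvexHull(points)
print("Vertices of the Newton polytope:")
print(points[hull.vertices])   # (1, 1) does not appear among the vertices
```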
6.6 Model invariants
Model invariants are algebraic relations that are fulfilled for all probability distributions in the model. Hence they are a characteristic of the model. The model invariants fi, i = 1, 2, . . ., take the form 0 = fi(p), see e.g. Example 6.3.2. In general, the set of model invariants is large, because the sum of two invariants is again an invariant, fi(p) + fj(p), and an invariant multiplied by an arbitrary polynomial (h) is also an invariant, h(p)fi(p). A set of polynomials with these properties is called an ideal. The ideal of model invariants is often called the vanishing ideal. Among all the possible invariants we seek to select a small number that characterize the model. As we saw in Example 6.3.2, these might be chosen in different ways. One way is to use Gröbner bases, in particular reduced Gröbner bases, which are finite sets of polynomials that generate a given ideal and fulfill certain algebraic requirements [the precise definition of a Gröbner basis can be found in Pachter and Sturmfels (2005) and Cox et al. (2007)]. See Figure 6.4 for an illustration. A reduced Gröbner basis is not unique but depends on how the unknown variables, p1, p2, . . ., are ordered. Irrespective of the order, all reduced Gröbner bases of an ideal have the same number of elements and thus this number is a property of the ideal (similarly to a basis for a vector space, for example). The nonuniqueness relates to division with remainder, e.g. the polynomial 2pq can be written as

$$2pq = p(2q - p) + p^2 \qquad \text{and} \qquad 2pq = -2q(2q - p) + 4q^2,$$

such that division by 2q − p does not produce a unique result. When we consider polynomials in one unknown only (say, p), then the remainder is unique and always has smaller degree than the divisor. To obtain a similar result for polynomials in several unknowns the terms have to be ordered, such that it can be decided whether the remainder $p^2$ is smaller or larger than the divisor. This implies that to find good invariants one might have
Figure 6.4 Illustration of the idea of a reduced Gröbner basis generating an ideal of polynomials. The ideal is given by the large black ellipse and two reduced Gröbner bases are shown as the smaller (overlapping) red and blue circles. They have the same number of elements (indicated by dots), but are based on different orderings of variables. Each Gröbner basis generates the whole ideal by repeated application of addition and multiplication of polynomials (as described in the first paragraph of Section 6.6), but from different starting sets. Eventually all polynomials in the ideal will be generated from the polynomials in the Gröbner basis by means of these operations. For example, if $p_1$ and $p_2^4$ are in the basis, then also $g_1 = p_1 + p_2^4$ and $g_2 = p_1^7 p_2^4$ are in the ideal generated from the basis. Subsequently, also $g_1 + g_2$, and so forth. A Gröbner basis always exists due to Hilbert's basis theorem, see Cox et al. (2007)
to order the terms in different ways and find a Gröbner basis for each of them. As soon as an order is chosen, the Gröbner basis and the remainder upon division are unique. Hence, a polynomial is in the ideal generated by the basis if and only if the remainder is zero. There are efficient algorithms for finding reduced Gröbner bases, for example Buchberger's algorithm (Pachter and Sturmfels 2005). It is implemented in many computer packages for symbolic computation. The algorithm takes as input an order and a finite set of polynomials F and provides a reduced Gröbner basis as output.

Example 6.6.1 (A reduced Gröbner basis for Example 6.3.2) Consider the contingency table with r = 2 and s = 3, and assume the polynomials are ordered p11 < p12 < p13 < p21 < p22 < p23. This order is known as the lexicographic order. Using the set of polynomials in Equation (6.5) as input to Buchberger's algorithm yields that the set is also a reduced Gröbner basis. It has three elements, p11 p22 − p12 p21,
p11 p23 − p13 p21 ,
and
p22 p13 − p23 p12 .
[The remaining polynomials in Equation (6.5) are either identical to zero or identical to one of the three polynomials above.] It is an example of a toric ideal; ideals that occur in relation to the toric, or log-linear, models in Section 6.7. If the set of polynomials in Equation (6.4) is used as input instead, the same reduced Gröbner basis is found, since the basis is unique when the order is fixed.

Example 6.6.2 (Two-site phosphorylation cycle, continued) Example 6.4.4 shows a small network, but the state space is still prohibitively large. In nature, the total numbers of molecules could be as high as $10^3$–$10^4$. This implies that manipulation of pQ is doomed to be computationally inefficient and any reduction of the equations seems difficult. Hence, it is unlikely that we could find good model invariants; keep in mind that we also have to choose an ordering of the variables. However, the stochastic system in Example 6.4.4 has a deterministic counterpart. If the number of molecules is high, the stochastic fluctuations might be ignored and the system described by ordinary differential equations using concentrations rather than actual numbers.
In Manrai and Gunawardena (2008), the deterministic system is studied. As shown in Figure 6.1, the system is studied assuming two different modes, a distributive mode and a processive mode that allows both sites to be phosphorylated or dephosphorylated at the same time. The system is governed by a system of differential equations; as an example, the concentration of S00 changes according to the equation

$$\frac{d[S_{00}]}{dt} = -\kappa_1 [E][S_{00}] + \kappa_2 [Y_{00}] + \kappa_3 [Y_{01}] + \kappa_4 [Y_{10}],$$

if the system is distributive, whereas an additional term $\kappa_5 [Y_{11}]$ is added in the case of processivity. The first two terms correspond to reactions involving E and the last two (three) correspond to reactions involving F. Further, the constants κi are unknown rate constants. In steady state, or equilibrium, the differential equations equate to zero. There is one equation for each species, that is 12 in total for both the distributive system and the processive system, which is considerably less than in Example 6.4.4, where the number of equations characterizing the equilibrium distribution of the Markov chain is N (the number of states). In addition, there are three conservation laws, Equation (6.9). Manrai and Gunawardena (2008) used Gröbner bases to show that if the system is distributive, then the points with coordinates

$$\left(\frac{[S_{01}]^2}{[S_{00}][S_{11}]},\; \frac{[S_{01}][S_{10}]}{[S_{00}][S_{11}]},\; \frac{[S_{10}]^2}{[S_{00}][S_{11}]}\right) \qquad (6.11)$$
all lie on a plane that depends only on the parameters and not on any total amounts [Equation (6.9)] or other concentrations. Here [Su] denotes substrate concentration at steady state. Equation (6.11) is a simple model invariant. In a controlled experiment it might be possible to vary the total amount of enzyme and substrate and measure the steady state concentrations (Manrai and Gunawardena 2008; Cooper and Hausman 2009). In this way one would obtain a series of (noisy) data points around a plane for a distributive system and one could test whether there is a systematic discrepancy of the data points from the plane. If the system is processive, the points would not fall on a plane (Manrai and Gunawardena 2008). The deterministic and the stochastic systems both have the same number of rate constants. However, to test for distributivity, it is not necessary to estimate the rate constants explicitly, since one can rely on a generic description of the plane. However, there are differences between the stochastic and the deterministic systems. Deterministic systems might give rise to multiple different steady states, some might be stable whereas others not. What steady state the system ends up in depends on the initial concentrations of chemical species. If the system is modelled by a stochastic Petri Net, there will only be one equilibrium distribution, but the system might move slowly in the state space and be trapped in some parts of the space before moving on to another part, corresponding to the multiple steady states of the deterministic system. To obtain an estimate of the equilibrium distribution, it is therefore necessary to track the system over a long period of time.
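Returning to Example 6.6.1, the reduced Gröbner basis computed there by Buchberger's algorithm can be reproduced with general-purpose software as well as with the dedicated computer algebra systems mentioned in Section 6.3. The sketch below (not from the original text) assumes the Python package sympy; the symbol order passed to groebner plays the role of the variable order in the example.

```python
# Illustrative sketch: the reduced Groebner basis of Example 6.6.1,
# computed from the 2x2 minors of the 2 x 3 table of probabilities p_ij.
from sympy import symbols, groebner

p11, p12, p13, p21, p22, p23 = symbols('p11 p12 p13 p21 p22 p23')

minors = [p11*p22 - p12*p21,
          p11*p23 - p13*p21,
          p12*p23 - p13*p22]

# Lexicographic term order with respect to the listed symbols.
G = groebner(minors, p11, p12, p13, p21, p22, p23, order='lex')
for g in G.exprs:
    print(g)   # three binomials, matching Example 6.6.1 up to sign and order
```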
6.7 Log-linear models
In the algebraic statistical literature log-linear models are also called toric models. Here the statistically more familiar name, log-linear model, is used. They have been studied intensively in the algebraic statistics literature (Diaconis and Sturmfels 1998; Dobra 2003; Dobra et al. 2009; Gibilisco et al. 2010). Here only one aspect will be discussed, namely an MCMC procedure to obtain a sample from a conditional distribution given the sufficient statistics. This might be used in various contexts, including hypothesis testing and hypergeometric sampling.
A log-linear model takes the form

$$\log(p_j) = c(\theta) + h_j + \sum_{i=1}^{d} a_{ij}\theta_i,$$

where A = {aij}i,j is a d × m matrix of non-negative integers, hj, j = 1, . . . , m, are constants independent of θ ∈ R^d and c(θ) is a normalizing constant. By taking exponentials on both sides and substituting ωi for exp(θi), it follows that any log-linear model is a parameterized algebraic statistical model. A log-linear model is an exponential family, but not all exponential family models are algebraic models. In particular, the entries of A are constrained to be non-negative integers, because by exponentiation only these result in polynomial expressions. As aij might be interpreted as an interaction term, log-linear models are closely related to graphical models and hierarchical models in general, and therefore they are of relevance for systems biology (see Chapter 11; Drton and Sullivant 2007). If a series of observations u = (u1, . . . , um) from a log-linear model is obtained, where uj denotes the number of times xj is observed, then the likelihood has a multinomial form

$$\binom{u_\bullet}{u_1,\ldots,u_m}\prod_{j=1}^{m} p_j^{u_j} = \binom{u_\bullet}{u_1,\ldots,u_m}\, c(\theta)^{u_\bullet}\, e^{\sum_{j=1}^{m} u_j h_j}\, \prod_{i=1}^{d} \omega_i^{(Au)_i}.$$
Here (Au)i denotes the ith entry of Au. Since the likelihood only depends on ω through Au, the minimal sufficient statistic of a log-linear model is Au, and thus the distribution of u given Au does not contribute information about ω (or θ). This has been utilized to design hypothesis tests (or goodness-of-fit tests) for log-linear models (Diaconis and Sturmfels 1998). Fisher's exact test is one example of this.

Example 6.7.1 (Example 6.3.2 continued) Assume two independent criteria as in Example 6.3.2 with r = 2 and s = 3. The example can be put in the form of a log-linear model with A given by

$$A = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 \end{pmatrix},$$

where the first two rows correspond to the row criteria and the last three to the column criteria. Any observation v with Av = Au has the same row and column sums as the observation u; that is, the difference v − u must be in the kernel of the matrix A. The conditional distribution of u given Au is a hypergeometric distribution of the form
$$\frac{\binom{u_{11}+u_{21}}{u_{11}}\binom{u_{12}+u_{22}}{u_{12}}\binom{u_{13}+u_{23}}{u_{13}}}{\binom{u_{\bullet\bullet}}{u_{11}+u_{12}+u_{13}}},$$

where uij denotes the number of observations in row i and column j. It is known as Fisher's exact test. The conditional distribution does not depend on any parameter and deviations from the hypergeometric law are taken as evidence that the two criteria are not independent. Only if r and s are small, and likewise the row and column sums, is it possible to calculate Fisher's exact test numerically. The statistical software package R (http://www.r-project.org/) provides an option to simulate p-values for r or s larger than 2.
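As a quick numerical check (not from the original text, assuming numpy), the matrix A of Example 6.7.1 applied to a flattened table indeed returns its row and column sums, and a ±1 swap move of the kind used in the Markov bases discussed below lies in the kernel of A, so it preserves the sufficient statistic.

```python
# Illustrative sketch: A collects the margins of a 2 x 3 table (flattened
# row by row), and a +/- swap move lies in the kernel of A.
import numpy as np

A = np.array([[1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1],
              [1, 0, 0, 1, 0, 0],
              [0, 1, 0, 0, 1, 0],
              [0, 0, 1, 0, 0, 1]])

u = np.array([[3, 1, 4],
              [2, 2, 5]])           # an arbitrary 2 x 3 table of counts
print(A @ u.flatten())              # row sums (8, 9) and column sums (5, 3, 9)

b = np.array([[ 1, -1, 0],
              [-1,  1, 0]])         # swap move on the first two columns
print(A @ b.flatten())              # the zero vector: the move preserves Au
```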
To calculate Fisher's exact test one needs to enumerate all v with Av = Au, which in general is prohibitively expensive. Alternatively, one might resort to simulations. Perhaps the first theorem of algebraic statistics is a characterization of the space of vectors with the property Av = Au (Diaconis and Sturmfels 1998). To elaborate on this, we need the notion of a Markov basis. It is a finite set {b1, . . . , bK} belonging to the kernel of A, such that for every pair u, v ∈ N^m of observations with Au = Av,

$$v_j = u_j + \sum_{i=1}^{k} b_{ij} \qquad (6.12)$$

for some k, and for each 1 ≤ k′ ≤ k,

$$v'_j = u_j + \sum_{i=1}^{k'} b_{ij} \geq 0. \qquad (6.13)$$
In other words, starting with u any other vector of observations with the same value of the sufficient statistics can be generated, and each step (k′) in generating v is a valid observation (v′). The Markov basis is independent of the observation vector. Markov bases relate to ideals generated by polynomials (Section 6.6). Diaconis and Sturmfels (1998) show that B = {b1, . . . , bK} is a Markov basis if the polynomials

$$p^{b^+} - p^{b^-} = \prod_{i,j} p_{ij}^{b_{ij}^+} - \prod_{i,j} p_{ij}^{b_{ij}^-}, \qquad b \in B,$$
generate the set of model invariants (the vanishing ideal) described in Section 6.6. Here $b_{ij}^+ = \max\{b_{ij}, 0\}$ and $b_{ij}^- = \max\{-b_{ij}, 0\}$. A Markov basis leads to a Metropolis–Hastings scheme to simulate from the conditional distribution:

(i) Initialization: Set v := u, where u is an observation.
(ii) Choose bi randomly in the Markov basis B.
(iii) Set v := v + bi with probability min(1, pv+bi / pv).
(iv) Continue until the desired number of MCMC samples has been obtained.
Gröbner bases might be used to find candidates for Markov bases; however, currently there is no general test that allows one to check if a Gröbner basis fulfils the two defining characteristics of a Markov basis, Equations (6.12) and (6.13) (Dobra 2003; Dobra et al. 2009). Some theoretical work has been done to determine Markov bases and their properties. Dobra and Sullivant (2004) have developed a divide-and-conquer algorithm for significantly reducing the time needed to find a Markov basis when the underlying independence graph is not decomposable, and Dobra (2003) provides formulae that fully identify Markov bases for decomposable random graphs and shows how to use these formulae to dynamically generate random moves.

Example 6.7.2 (Example 6.3.2 continued) Consider the r × s contingency table from Example 6.3.2. In Diaconis and Sturmfels (1998) it is shown that the following moves lead to a Markov basis. Choose two rows, i and l, at random (out of r rows) and two columns, j and k, at random (out of s columns); these intersect in four entries:

(i, j)  (i, k)
(l, j)  (l, k).
Modify the table by adding or subtracting one from the entries according to the rules

+ −        − +
− +   or   + −

with probability 1/2 each. If a move leads to negative entries, then it is discarded and a new move is proposed. Formally, the Markov basis is given by B = {b_{ijkl}, b′_{ijkl}}, where b_{ijkl} = e_{ij} − e_{ik} − e_{lj} + e_{lk} and b′_{ijkl} = −e_{ij} + e_{ik} + e_{lj} − e_{lk} for i, l = 1, . . . , r and j, k = 1, . . . , s, where e_{ij} is a vector with one in entry (i, j) and zero otherwise.
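The Metropolis–Hastings scheme above is easy to implement for these swap moves. The following sketch (not from the original text, assuming numpy) targets the hypergeometric conditional distribution of Example 6.7.1, for which $p_v$ is proportional to $1/\prod_{ij} v_{ij}!$, so the acceptance ratio $p_{v+b}/p_v$ only involves the four cells touched by a move; treating a move that would create negative entries as a rejected proposal is a simplification of "discard and propose a new move".

```python
# Illustrative sketch of the Markov-basis MCMC for an r x s table with
# fixed row and column sums, targeting the hypergeometric distribution.
import numpy as np

rng = np.random.default_rng(0)

def markov_basis_mcmc(u, n_steps=10000):
    v = u.copy().astype(int)
    r, s = v.shape
    samples = []
    for _ in range(n_steps):
        i, l = rng.choice(r, size=2, replace=False)   # two distinct rows
        j, k = rng.choice(s, size=2, replace=False)   # two distinct columns
        sign = rng.choice([1, -1])                    # one of the two swap moves
        new = v.copy()
        new[i, j] += sign; new[i, k] -= sign
        new[l, j] -= sign; new[l, k] += sign
        if new.min() >= 0:
            # p_{v+b}/p_v for p_v proportional to 1 / prod(v_ij!)
            if sign == 1:
                num, den = v[i, k] * v[l, j], (v[i, j] + 1) * (v[l, k] + 1)
            else:
                num, den = v[i, j] * v[l, k], (v[i, k] + 1) * (v[l, j] + 1)
            if rng.random() < num / den:
                v = new
        samples.append(v.copy())
    return samples

u = np.array([[3, 1, 4],
              [2, 2, 5]])
tables = markov_basis_mcmc(u)     # all sampled tables share u's margins
```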
6.8 Reverse engineering of networks
Reverse engineering of networks is an important topic in systems biology (see Chapters 12 and 13). It is closely related to model selection as we want to make qualitative statements about the presence or absence of links between nodes in a network. In relation to the methods discussed in this chapter reverse engineering has come to mean something related though different (Laubenbacher and Stigler 2004; Jarrah et al. 2007). Consider a polynomial dynamical system (PDS) (Laubenbacher and Stigler 2004; Veliz-Cuba et al. 2010) $f = (f_1, \ldots, f_m) : K^m \to K^m$, where each coordinate function is a polynomial in x = (x1, . . . , xm), that is fi is in the polynomial ring K[x], where K is an arbitrary field (Cox et al. 2007). Natural choices for K are $F_2 = \{0, 1\}$ (genes being on/off) or $F_3 = \{-1, 0, 1\}$ (genes being up/down regulated and unchanged). By iterating f, the dynamics of a time-discrete and time-homogeneous system is described in a deterministic way; that is, at time t the system is described by the function $f^t = f \circ f \circ \ldots \circ f$ (t times). The PDS might have steady states, i.e. points such that f(x) = x, or show oscillating behaviour, that is the system returns to the same state again after a series of iterations. Jarrah et al. (2007) use Gröbner basis theory to obtain an algorithm for selecting PDSs that fit the observed data perfectly. Details are not given here, but a few issues are mentioned:

(i) The algorithm assumes the observations are noise free. For example, by discretizing microarray expression data into a small number of states some noise might be removed at the cost of reducing the amount of information in the data.
(ii) A given input always results in the same output, i.e. transitions are not allowed to be stochastic. This also implies that the algorithm can be applied to each fi separately.
(iii) For given data, $s^t = (s^t_1, \ldots, s^t_m)$, where t denotes time, the algorithm selects a PDS such that $f(s^t) = s^{t+1}$ for all t and determines the ideal of functions g that vanish on the data, i.e. $g(s^t) = 0$, and hence h = f + g is also a solution.

Thus according to (iii), the algorithm produces many solutions and not just a single solution. Subsequently, one might choose one solution based on, for example, algebraic criteria (Jarrah et al. 2007), though this seems statistically questionable. From a statistical point of view, all solutions appear to have equal value unless they are described by different numbers of parameters. From a biological point of view, some solutions might be considered more plausible than others. The different solutions might naturally differ on new data that were not included in the original data used for training.
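To make the notion of a PDS concrete, the sketch below (not from the original text) defines a toy system over $F_2$ with three coordinates, enumerates the whole state space to find its steady states, and iterates the map; the particular update polynomials are made up purely for illustration.

```python
# Illustrative sketch: a toy polynomial dynamical system f : F_2^3 -> F_2^3.
# Addition and multiplication are taken modulo 2; the polynomials are arbitrary.
from itertools import product

def f(x):
    x1, x2, x3 = x
    f1 = (x2 * x3) % 2             # x1' = x2 x3
    f2 = (x1 + x3) % 2             # x2' = x1 + x3
    f3 = (x1 * x2 + x3) % 2        # x3' = x1 x2 + x3
    return (f1, f2, f3)

# Enumerate the state space and find the steady states f(x) = x
states = list(product([0, 1], repeat=3))
print([x for x in states if f(x) == x])

# Iterating f gives the deterministic, time-discrete dynamics
x = (1, 0, 1)
trajectory = [x]
for _ in range(5):
    x = f(x)
    trajectory.append(x)
print(trajectory)
```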
6.9 Concluding remarks
Algebraic statistics is a new research area and the full potential is still to be seen. It has not had a big impact yet on systems biology but there is an increasing awareness of algebraic statistics and algebraic techniques in systems biology. Thus, systems biology might benefit from the use of algebraic techniques in the future. Many discrete models in systems biology are algebraic statistical models, such as discrete graphical models and stochastic Petri Nets (Example 6.4.4). Thus, in principle, the algebraic apparatus for manipulating polynomials is available for these classes of models. However, the typical state space is high-dimensional and typically bigger than can be handled by current algebraic computational algorithms. This is a severe limitation which must be overcome in order to study stochastic systems like Example 6.4.4 with these techniques.

Systems biology is rich in deterministic systems, for example systems described by ordinary differential equations (Example 6.6.2) or polynomial dynamical systems (Section 6.8). These systems give rise to polynomial equations and can therefore be studied by techniques from algebraic geometry. The modelling equations are typically much simpler than the similar equations for a stochastic system (Examples 6.4.4 and 6.6.2). As the deterministic system might be considered the limit of a stochastic system with many molecules (replacing counts of molecules with their concentrations), it might be possible to gain insight about a stochastic system from its deterministic counterpart. In particular, it might be possible to derive model invariants (Section 6.6) that are characteristic of the system.

Many systems are not fully understood in the sense that some aspects are (partly) unknown. That could for example be whether some reaction occurs or not (as in Example 6.4.4), whether mass-action applies, or the details of a reaction are unknown (is an intermediate complex formed?). These differences could easily lead to different qualitative behaviours, such as whether there are multiple steady states or oscillations. Model invariants might be used to distinguish statistically between different versions of a system; that is, as a model selection tool or a goodness-of-fit test. However, as discussed in Example 6.6.2, there are differences between a stochastic system and its deterministic counterpart and more work is required to bring the algebra in the two systems closer together.

Continuous space models are important in systems biology, but have not yet received much attention in algebraic statistics; see however Drton and Sullivant (2007) and Drton et al. (2008) for some examples in connection with Gaussian variables and exponential families. Continuous space models also show algebraic structure and a full algebraic theory should be developed for these as well in order to have a statistical theory fully based on algebraic principles. Deterministic models, such as the one in Example 6.4.4, would be more useful in relation to analysis of real data if Gaussian noise is added, and for example gene regulatory systems are conveniently described by graphical models with continuous variables, rather than discretized variables.

Some issues have not been discussed because they have less direct contact to systems biology. For example, maximum likelihood estimation has not been discussed since it is a general statistical issue. See for example Pachter and Sturmfels (2005) for an introduction to maximum likelihood estimation from an algebraic perspective.
Another issue that has only been discussed briefly is model selection. There is a rich literature on algebraic notions of dimensionality (Garcia 2004; Geiger et al. 2004), which might be useful in connection with model selection, where the model with the lowest dimension (degrees of freedom) often is chosen.
References

Allman ES and Rhodes JA 2007 Molecular phylogenetics from an algebraic viewpoint. Statistica Sinica 17, 1299–1316.
Cappé O, Moulines E and Rydén T 2005 Inference in Hidden Markov Models, Springer.
Cavender JA and Felsenstein J 1987 Invariants of phylogenies: a simple case with discrete states. Journal of Classification 4, 57–71.
Chaouiya C, Remy E and Thieffry D 2008 Petri net modelling of biological regulatory networks. Journal of Discrete Algorithms 6, 165–177.
CoCoATeam. CoCoA: a system for doing computations in commutative algebra. Available at http://cocoa.dima.unige.it/.
Cooper GM and Hausman RE 2009 The Cell: a Molecular Approach, 5th Edn, Sinauer Associates Inc.
Cornish-Bowden A 2004 Fundamentals of Enzyme Kinetics, 3rd Edn, Portland Press.
Cox D, Little JL and O'Shea D 2007 Ideals, Varieties, and Algorithms: an Introduction to Computational Algebraic Geometry and Commutative Algebra, 3rd Edn, Springer.
Craciun G, Dickenstein A, Shiu A and Sturmfels B 2009 Toric dynamical systems. Journal of Symbolic Computation 44, 1551–1565.
Decker W, Greuel G-M, Pfister G and Schönemann H 2010 Singular 3-1-1: A computer algebra system for polynomial computations. Available at http://www.singular.uni-kl.de/.
Diaconis P 1988 Group Representations in Probability and Statistics, Institute of Mathematical Statistics.
Diaconis P and Sturmfels B 1998 Algebraic algorithms for sampling from conditional distributions. The Annals of Statistics 26, 363–397.
Dobra A 2003 Markov bases for decomposable graphical models. Bernoulli 9, 1093–1108.
Dobra A and Sullivant S 2004 A divide-and-conquer algorithm for generating Markov bases of multi-way tables. Computational Statistics 19, 347–366.
Dobra A, Fienberg SE, Rinaldo A, Slavkovic A and Zhou Y 2009 Algebraic statistics and contingency table problems: log-linear models, likelihood estimation, and disclosure limitation. The IMA Volumes in Mathematics and its Applications 149, 1–26.
Drton M and Sullivant S 2007 Algebraic statistical models. Statistica Sinica 17, 1273–1297.
Drton M, Sturmfels B and Sullivant S 2008 Lectures on Algebraic Statistics (Oberwolfach Seminars), Birkhäuser Verlag.
Evans SN and Speed TP 1993 Invariants of some probability models used in phylogenetic inference. Annals of Statistics 21, 355–377.
Fienberg SE 2007 Expanding the statistical toolkit with algebraic statistics (editorial). Statistica Sinica 17, 1261–1272.
Garcia LD 2004 Algebraic statistics in model selection. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, ACM International Conference Proceeding Series 70, 177–184.
Garcia LD, Stillman M and Sturmfels B 2005 Algebraic geometry of Bayesian networks. Journal of Symbolic Computation 39, 331–355.
Gatermann K 2001 Counting stable solutions of sparse polynomial systems in chemistry. In: Green E et al. (eds) Symbolic Computation: Solving Equations in Algebra, Geometry and Engineering, vol. 286, pp. 53–69, American Mathematical Society.
Gatermann K and Huber B 2002 A family of sparse polynomial systems arising in chemical reaction systems. Journal of Symbolic Computation 33, 275–305.
Geiger D, Meek C and Sturmfels B 2004 On the toric algebra of graphical models. The Annals of Statistics 34, 1464–1492.
Gibilisco P, Riccomagno E, Rogantin MP and Wynn HP 2010 Algebraic and Geometric Methods in Statistics, Cambridge University Press.
Grayson DR and Stillman ME 1996 Macaulay2, a software system for research in algebraic geometry. Available at http://www.math.uiuc.edu/Macaulay2/.
Huang CY and Ferrell JE 1996 Ultrasensitivity in the mitogen-activated protein kinase cascade. Proceedings of the National Academy of Sciences, USA 17, 10078–10083.
Jarrah AS, Laubenbacher R, Stigler B and Stillman M 2007 Reverse-engineering of polynomial dynamical systems. Advances in Applied Mathematics 39, 477–489.
de Jong H 2002 Modelling and simulation of genetic regulatory systems: a literature review. Journal of Computational Biology 9, 67–103.
Kauffman SA 1969 Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology 22, 437–467.
Laubenbacher R 2005 Algebraic models in systems biology. Proceedings of the First International Conference on Algebraic Biology, 33–40.
Laubenbacher R and Stigler B 2004 A computational algebra approach to the reverse engineering of gene regulatory networks. Journal of Theoretical Biology 229, 523–537.
Laubenbacher R and Sturmfels B 2009 Computer algebra in systems biology. American Mathematical Monthly 116, 882–891.
Lauritzen S 1996 Graphical Models, Oxford University Press.
Manrai AK and Gunawardena J 2008 The geometry of multisite phosphorylation. Biophysical Journal 95, 5533–5543.
Markevich NI, Hoek JB and Kholodenko BN 2004 Signaling switches and bistability arising from multisite phosphorylation in protein kinase cascades. Journal of Cell Biology 164, 353–359.
Maxima 2009 Maxima, a computer algebra system, Version 5.18.1. Available at http://maxima.sourceforge.net/.
Mishra B 2007 Algebraic systems biology: theses and hypotheses. Lecture Notes in Computer Science 4545, 1–14.
Pachter L and Sturmfels B 2004a Tropical geometry of statistical models. Proceedings of the National Academy of Sciences, USA 101, 16132–16137.
Pachter L and Sturmfels B 2004b Parametric inference for biological sequence analysis. Proceedings of the National Academy of Sciences, USA 101, 16138–16143.
Pachter L and Sturmfels B (eds) 2005 Algebraic Statistics for Computational Biology, Cambridge University Press.
Pistone G, Riccomagno E and Wynn HP 2000 Algebraic Statistics: Computational Commutative Algebra in Statistics, Chapman & Hall/CRC.
Riccomagno E 2009 A short history of algebraic statistics. Metrika 69, 397–418.
Sturmfels B 1996 Gröbner Bases and Convex Polytopes, American Mathematical Society.
Sturmfels B and Sullivant S 2005 Toric ideals of phylogenetic invariants. Journal of Classification 12, 457–481.
Sullivant S 2008 Statistical models are algebraic varieties. Available at http://www.math.harvard.edu/~seths/lecture1.pdf.
Veliz-Cuba A, Jarrah AS and Laubenbacher R 2010 Polynomial algebra of discrete models in systems biology. Bioinformatics 26, 1637–1643.
Wilkinson D 2006 Stochastic Modelling for Systems Biology, Chapman & Hall/CRC.
Part B TECHNOLOGY-BASED CHAPTERS
7 Transcriptomic Technologies and Statistical Data Analysis

Elizabeth Purdom¹ and Sach Mukherjee²
¹ Department of Statistics, University of California at Berkeley, USA
² Department of Statistics, University of Warwick, Coventry, UK
The ability to simultaneously measure the abundance of all mRNA in a sample has transformed many areas of molecular biology. Generating a snapshot of mRNA activity is now a routine experiment and has led many biologists to focus not on the activity of a single gene but rather on a global picture of their interactions. We describe the technologies that provide such data and the statistical tools that are commonly used in their analysis.
7.1 Biological background
Transcription is the process by which the information encoded in DNA is transcribed into messenger RNA (mRNA) within the nucleus and then transferred into the cytosol; there the mRNA is translated into proteins which are central to cellular structure and function. Transcription of DNA into mRNA is regulated by transcription factors that bind to the promoter regions in the DNA and initiate the copying of DNA into mRNA. Careful control of the production of transcription factors, the physical access of transcription factors to promoter regions, and the recruitment of transcription factors to the DNA are all ways in which transcription is regulated within a cell. In many eukaryotic organisms, the mature mRNA product does not emanate from a single contiguous portion of DNA. Rather the DNA encoding the mRNA transcript includes additional regions, called introns, that are not part of the final mRNA product but are removed during the process of transcription. The entire DNA region, including the introns, is copied into pre-mRNA products, and the introns are removed in a process called splicing. In higher organisms, the introns can be removed from the pre-mRNA in various ways
to create different mRNA transcripts, a process referred to as alternative splicing. In this way, a single region of DNA encoding a gene can actually encode for multiple mRNA transcripts or isoforms and ultimately different proteins. Alternative splicing is a part of the tightly controlled process regulating mRNA production. Estimates of expression, then, are estimates of the quantity of a mRNA transcript in a sample. Many estimates of mRNA transcription levels do not differentiate between different isoforms from a single gene, and for this reason measurements of the levels of mRNA are usually given as gene expression estimates, where the level of expression of a gene is that of all of its component isoforms. Expression estimates are traditionally evaluated on the log-scale or in terms of fold-changes as this is more biologically relevant for comparison than absolute differences in the number of mRNA transcripts. In addition, log transformations are also natural from a statistical viewpoint in stabilizing the variance of the estimates. mRNA transcription and regulation are themselves of great biological interest and serve as a motivation for measurement of mRNA levels. A further reason for the interest in mRNA expression levels is as a proxy for measurements of protein levels. Like DNA, mRNA is a long chain of nucleotide units and is therefore relatively easy to isolate and measure by taking advantage of the complementary hybridization properties of its nucleotides. Proteins, on the other hand, are much more difficult to isolate and quantify. Furthermore, the current methods for quantifying proteins do not scale such that all proteins in a cell can be quantified at the same time in a single experiment; in contrast, the technologies that we describe below can provide such measurements for mRNA. Though mRNA expression levels are often used as a substitute for protein levels, most studies show only moderate correlation between mRNA and protein products (Futcher et al. 1999; Gygi et al. 1999; Greenbaum et al. 2003; Nie et al. 2006; Lu et al. 2007). Different possible biological mechanisms could contribute to the disparity, such as differences in protein half-lives and post-transcriptional regulation, as well as technological errors in measuring mRNA and protein levels. While the role of mRNA as a proxy for protein concentrations is imperfect, genome-wide measurements of mRNA levels have nonetheless transformed studies of molecular biology by providing a glimpse of thousands of transcriptional events at the same time.
7.2 Technologies for genome-wide profiling of transcription

7.2.1 Microarray technology
Microarray technology (Schena et al. 1995) was the first technology that allowed the simultaneous measurement of the abundance of thousands of different mRNA. A microarray is a small glass slide, or chip, printed with a grid of microscopic spots or features. Each feature contains copies of a specific nucleotide sequence, known as a probe, bound to the slide. The probes are chosen so as to hybridize with a specific DNA target of interest. For quantification of mRNA, the mRNA of a sample is converted to cDNA and fluorescently labeled. The resulting sample of cDNA is then added to the microarray, and those cDNA strands matching a probe on the array will hybridize to the corresponding spot on the array. The microarray is then scanned to measure the fluorescence of the individual spots on the array. The intensity of the fluorescence at a specific spot measures the amount of cDNA that hybridized to the probes on that spot. The thousands of probes on a gene expression microarray are chosen so as to uniquely query a specific gene, and thus the fluorescence intensity at a spot is a measure of the amount of corresponding gene expression [for more details see Heller (2002), Hobman et al. (2007) and Russell et al. (2009)]. An important distinction between commonly used microarrays is the labeling strategy for input cDNA samples. One-color (or one-channel) arrays have a single cDNA sample added to each array, and the fluorescent intensities are compared across microarrays. Two-color (or two-channel) arrays instead hybridize two samples to each array, each sample labeled with a different fluorescent. In this case, it is then standard to compare the ratio of the intensities of the two samples from the same array [though see Kerr and Churchill (2001) and
Hoen et al. (2004) who analyze the two channels separately]; this relative intensity is then compared across multiple arrays. Unless there is a natural pairing of the data (such as a tumor and normal sample from the same patient), the usual assignment of samples to two-color arrays is to pair each sample of interest against the same common reference sample, often the Universal Human Reference (UHR). More sophisticated experimental designs can be used to assign samples across the two-channel arrays in more efficient ways (Kerr and Churchill 2001; Yang and Speed 2002) but the use of a standard reference for one of the channels facilitates the comparison of samples across different laboratories and large database queries. As a result, two-color and one-color arrays often use the same number of arrays to analyze the same number of samples, but report different kinds of measurements (absolute intensities versus ratios of intensities).
7.2.2 mRNA expression estimates from microarrays
If only a single probe matches to a gene, which is the case with many longer-oligonucleotide arrays, then the intensity of this probe provides the sole estimate of gene expression for each of the samples. However, Affymetrix arrays, which use shorter oligonucleotides, have many more probes on the array and are therefore designed to have multiple probes from the same gene, each of which queries a different portion of the gene. These multiple probes, which constitute a probeset, must be combined together to produce a final estimate of gene expression, a process of summarization. Different sequence composition means that probes have different binding affinities and different propensities to cross-hybridize with other parts of the genome. The most effective summarization methods try to estimate and remove these and other systematic differences between the probes by using multiple chips from the same experiment, such as Robust Multichip Analysis (RMA) (Irizarry et al. 2003) and the dChip package (Li and Wong 2001). As the number of probes per array has increased, even arrays with longer oligonucleotides now frequently have more than one distinct probe for many genes, though there are fewer specific techniques for combining these estimates. Differences in arrays done at different times or laboratories usually result in samples with substantially different intensity measurements. Because of this, there is a need to correct the intensity measurements before either summarizing the probes into a single gene measurement or comparing intensity levels across samples. The process of normalization adjusts the intensity levels to make the bulk of intensity levels the same across samples. The assumption is that while individual probes may vary across samples due to true biological differences, the overall distribution properties of the probe intensities should remain the same. Common techniques are quantile normalization (Bolstad et al. 2003) and loess smoothing (Dudoit et al. 2002), both of which adjust the intensity measurements so that the distribution of corrected intensity levels matches that of a common reference, usually the median across all samples, though in different ways depending on the method of comparing the distributions (see Steinhoff and Vingron (2006) for a review).
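To illustrate the idea behind quantile normalization, the following minimal sketch in R forces every sample to share the same empirical distribution of intensities. The matrix of log2 intensities is simulated and the function name quantile_normalize is our own, not that of any particular package; production analyses would typically use the implementations in Bioconductor.

set.seed(1)
X <- matrix(rnorm(5000 * 4, mean = 8, sd = 2), nrow = 5000)   # probes x samples, log2 intensities

quantile_normalize <- function(X) {
  ranks  <- apply(X, 2, rank, ties.method = "first")   # rank of each probe within its sample
  sorted <- apply(X, 2, sort)                           # each column sorted
  ref    <- rowMeans(sorted)                            # reference distribution (mean of sorted values)
  apply(ranks, 2, function(r) ref[r])                   # map each value to the reference quantile of the same rank
}

Xn <- quantile_normalize(X)
apply(Xn, 2, quantile, probs = c(0.25, 0.5, 0.75))      # all columns now share the same quartiles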
7.2.3 High throughput sequencing (HTS)
Recent developments in HTS technologies have made HTS an option for measuring mRNA levels in individual samples. Because of its recent adoption, there are not yet mature processing pipelines for obtaining mRNA estimates, and methods for estimation, normalization, and quality control assessment are still active areas of research. Moreover, the technologies are rapidly changing and improving. While the details of the sequencing depend on the technology, generally the input DNA to be sequenced (mRNA converted into cDNA in the case of mRNA sequencing) is fragmented into smaller pieces and a random sample of these fragments is selected to be sequenced, usually resulting in a sample of tens of millions of fragments (see Figure 7.1). Then for each fragment selected, a portion of that fragment is sequenced. What portion is sequenced depends on the technology, but usually the portion sequenced is at the end of the fragment. If 'paired-end' sequencing is chosen, then both ends of the fragment are sequenced. 'Strobe reads' (Li et al. 2008) sequence many large portions interspersed across the input fragment DNA. Continuing improvements in technology are simultaneously increasing both the length and number of reads from a single run, so that millions of reads of hundreds or thousands of base pairs will soon be possible. The resulting sequences or reads are then analyzed to determine from where the input fragment originated, a process of aligning or mapping the reads to the genome [see Li and Homer (2010) for a review of algorithms that map reads to the genome]. In the case of mRNA sequencing, the sequence must be related back to the individual mRNA transcripts. In the presence of splicing, a read may span noncontiguous regions of the genome, and the possible set of such junctions is generally not completely known in advance. As a result, aligning reads to the genome will miss some valid reads. One simple solution is to provide a set of possible junctions based on known annotation and limit further analysis to those reads that match the known annotation. More sophisticated methods try to determine gapped alignments for reads that did not map to the genome (Bona et al. 2008; Trapnell et al. 2009; Bryant et al. 2010).

Figure 7.1 Overview of RNA-seq. A RNA fraction of interest is selected, fragmented and reverse transcribed. The resulting cDNA can then be sequenced using any of the current ultra-high-throughput technologies to obtain ten to a hundred million reads, which are then mapped back onto the genome. The reads are then analyzed to calculate expression levels. Figure and caption reproduced from Pepke et al. (2009)
7.2.4 mRNA expression estimates from HTS
With HTS data we can define estimates for any region of the genome, such as individual exons or collections of exons that form a gene. The basic measurement of the level of expression of a generic region of the genome is the number of sequenced fragments that have been aligned to the region. This results in data from each region in the form of counts of fragments. The counts must be adjusted to be properly compared across regions and samples. In particular, because the mRNA is fragmented before sequencing, longer mRNA transcripts contribute more fragments to the sequencing and thus will have higher counts relative to a shorter mRNA transcript expressed at the same level in the original sample. Similarly, the total number of sequenced fragments differs between samples, so that a more deeply sequenced sample will have higher counts despite the same level of expression in the original samples. The need for normalization was noted early by Mortazavi et al. (2008), who developed the measure 'reads per kilobase of exon per million mapped reads' (RPKM), given by dividing the original number of fragments Fij mapping to a genomic region j in a sample i by the total length of the region and the total number of counts in the sample, which is then scaled by a constant:

$$\mathrm{RPKM}_{ij} = \frac{10^9\, F_{ij}}{L_j N_i} \qquad (7.1)$$

where Lj is the length of region j and Ni is the total number of mapped fragments in sample i.
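A minimal sketch of the calculation in (7.1) for a small hypothetical count matrix is given below; the gene names, lengths and library sizes are invented for illustration. The last two lines sketch the idea, discussed next, of replacing Ni by a data-derived scale (here an upper-quartile-based value, loosely in the spirit of Bullard et al. 2010), rather than reproducing any published procedure exactly.

counts <- matrix(c(500, 300, 20, 900, 250, 10), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))
L <- c(geneA = 2000, geneB = 1500, geneC = 800)       # region lengths in base pairs (hypothetical)
N <- colSums(counts)                                   # total mapped fragments per sample

rpkm <- 1e9 * sweep(sweep(counts, 1, L, "/"), 2, N, "/")   # equation (7.1)

theta <- apply(counts, 2, quantile, probs = 0.75)          # data-derived per-sample scale
theta <- theta / mean(theta) * mean(N)                     # rescaled to the magnitude of the library sizes
uq_normalized <- 1e9 * sweep(sweep(counts, 1, L, "/"), 2, theta, "/")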
It has been demonstrated (Bullard et al. 2010; Robinson and Oshlack 2010) that the RPKM normalization can be insufficient to provide comparable expression estimates across different biological samples, and can result in estimates of differential expression that are biased compared with the qRT-PCR gold-standard on the same samples (Bullard et al. 2010). Bullard et al. (2010) and Robinson and Oshlack (2010) both propose normalization procedures that replace Ni in (7.1) with a data-derived parameter θi, and Bullard et al. (2010) report better sensitivity in detecting differential expression as a result. Another source of error in these estimates is that different regions can have different chances of being sequenced, which can have important implications in estimating the underlying abundance of mRNA transcripts. One such source of bias is the sequence composition of the fragment or read: fragments with relatively high GC content in the sequenced region are more likely to be sequenced (Dohm et al. 2008; Li et al. 2010b), though sequences with extremely high GC content are not well sequenced. Fragments with certain sequence patterns in the first 6–12 bases are also more frequently sequenced due to bias in the random hexamer priming (Hansen et al. 2010). Another source of bias is due to the position of the fragment in the transcript: the ends of transcripts tend to be less frequently sequenced. The bias tends to be most noticeable for the 5' ends of the transcript, probably because most mRNA-Seq protocols extract mRNA by using the 3' poly-A tails so as to remove ribosomal RNA. A final source of error in expression estimates is the inability to map some reads uniquely to the genome because the sequenced portion of the fragment matches multiple locations in the genome. The most common practice is to ignore such reads in downstream analysis; however, this means that the expression level of a region that contains portions repeated at other places in the genome will have a reduced number of reads and thus a reduced expression estimate. Some proposals have been introduced to probabilistically 'share' such multiply mapped reads (Faulkner et al. 2008; Mortazavi et al. 2008; Li et al. 2010a). Furthermore, the most common mRNA protocols result in reads that are not strand-specific, meaning that if there are genes on both strands of the DNA that overlap, it is not possible to know from which strand the read originated.
7.2.4.1 Expression estimates in the presence of alternative splicing
The above exposition assumes that a region has an expression level that is the same for all positions in the region. For higher organisms, the presence of alternative splicing results in the individual exons of a gene having different levels of expression depending on which isoforms of the gene are included. In particular,
not all exons give an estimate of μg, the total amount of expression across all isoforms, but rather of some unknown proportion of μg. As a result, the simple estimate based on the total number of sequenced fragments that fall within the boundaries of a gene will underestimate μg in the presence of alternative splicing. On the other hand, because most reads cannot be identified as originating uniquely from a particular isoform of a gene, estimates of individual isoform expression levels, λt, cannot be based on just counting sequenced fragments. Several methods have been proposed to estimate λt (Jiang and Wong 2009; Richard et al. 2010; Salzman et al. 2010; Trapnell et al. 2010). They are generally based on noting that for an observed read sequence x, the observed frequency of that sequence, Fx, is the sum of Fx(t), the number of times that sequence x originated from transcript t: $F_x = \sum_t F_x(t)$. The frequency Fx(t) is generally unobservable, but its distribution can be modeled, for example as $F_x(t) \sim \mathrm{Poisson}(\theta L_t \lambda_t)$, where Lt is the length of the transcript and θ is the normalization parameter for the sample, described above. Therefore the likelihood of Fx is a convolution of the Fx(t) and involves only the unknown parameters λt, which can be estimated by maximizing the joint likelihood across all reads x or by other likelihood-based techniques. Estimates of λt can then give estimates of μg if overall gene expression is desired. These methods are still new and their behavior not widely explored. Further adaptations will undoubtedly improve the estimation techniques, as well as additional changes to address advances in the technology. As isoform-specific estimates become widely available and increasingly reliable, it remains to be seen how they will be incorporated into the downstream analyses that form the focus of the remainder of this chapter. The biologically relevant measure of expression for many analyses may be that of the individual isoforms, rather than the overall gene expression. From a statistical vantage point, the impact of replacing gene estimates with isoform estimates in what follows is not yet studied. Expression levels of isoforms from the same gene are likely to have strong biological correlation, even more so than expected among genes, and in many cases should probably be studied as a block rather than treated as individual inputs. Furthermore, as long as the sequencing technology is such that a significant portion of the reads cannot be uniquely identified to a transcript, then all of the reads from a gene region are necessary to create efficient estimates of transcript abundance. For this reason, the estimates of the expression of isoforms from the same gene will be statistically correlated, even if the original expression levels were biologically independent. In the discussion that follows regarding further analysis on mRNA expression levels, the methods have all been developed assuming microarray estimates of gene expression and our discussion will make the same assumption.
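The following toy sketch conveys the flavor of such likelihood-based estimation for a hypothetical two-isoform gene; it is not a reimplementation of any of the cited methods. Reads are grouped into three classes (unique to isoform 1, unique to isoform 2, or compatible with both), the class rates are assumed linear in the unknown abundances λ, and the joint Poisson likelihood is maximized numerically. The counts, effective-length matrix A and normalization constant θ are all invented for illustration.

counts <- c(unique1 = 120, unique2 = 30, shared = 200)   # read counts per compatibility class
A <- rbind(unique1 = c(1000,   0),                        # effective length of each isoform
           unique2 = c(   0, 400),                        # contributing to each read class
           shared  = c( 800, 800))
theta <- 1e-3                                             # sample normalization constant (assumed known)

negloglik <- function(log_lambda) {
  mu <- theta * as.vector(A %*% exp(log_lambda))          # expected counts per read class
  -sum(dpois(counts, mu, log = TRUE))                     # negative joint Poisson log-likelihood
}

fit <- optim(c(0, 0), negloglik)                          # optimize on the log scale to keep lambda positive
lambda_hat <- exp(fit$par)                                # estimated isoform abundances
round(lambda_hat, 2)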
7.3 Evaluating the significance of individual genes
A common analysis of genome-wide transcriptional studies is to identify individual genes that might explain the observed sample phenotypes, for example the transcriptional differences between diseased and healthy patients. The general strategy is to perform a statistical test, per gene, of a relevant null hypothesis, resulting in statistics and their corresponding p-values for thousands of genes. Assume that for a particular gene g, following normalization and summarization of the raw intensity values, we have gene-expression estimates $Z_{1g}, \ldots, Z_{ng}$ for the n samples. For clarity, we will drop the index g when not needed. For microarray data, these may be absolute measures with single channel data, or relative measures with two-channel data.
7.3.1 Common approaches for significance testing
There are clearly a wide variety of possible experimental designs and sample phenotypes to test. A common design is comparison of two groups, with n1 and n2 samples, in which case the standard t-statistic is often
used, although the Wilcoxon rank-sum test is sometimes used as a nonparametric alternative for small sample sizes. With multiple groups to compare, standard tests like ANOVA F-tests are often used to test for any difference between the groups, as well as appropriate t-statistics for comparing contrasts between groups. As just these two examples make clear, a general framework for dealing with different designs is to analyze a linear model per gene, with the estimated gene expression as outcome: $Z_i = x_i^T \beta$, where $x_i$ would consist of indicator variables if the samples are divided into groups. Clearly this strategy generalizes to other situations, where per-gene models can be developed using standard statistical models, and the relevant parameters evaluated for statistical significance. One common example is time-course data, where the input mRNA is collected from samples, or groups of samples, measured across time in order to observe the dependence of gene expression on time. Then a general model is $Z_i(t) = \mu_i(t) + \epsilon_i(t)$, where $\mu_i(t)$ is a model of the expression of the ith sample at time t, and could include grouping variables, for example. The samples are often not independent if the same samples are measured across time, in which case the errors $\epsilon_i(t)$ are correlated, requiring estimation of the covariance between observations (Storey et al. 2005; Tai and Speed 2006). There is a wide variety of hypotheses of possible interest with time course data. For example, if there is only a single sample (i = 1), then a possible null hypothesis is that of no change, μ(t) = μ for all t. For multiple samples divided into two or more groups, a possible null hypothesis is that μi(t) is the same for samples in the same group. In the case of a single sample, the data resemble standard time series data (see Diggle 1990), and various standard time-series and functional analysis methods have been proposed to flexibly estimate μ(t), such as Fourier analysis (Spellman et al. 1998) or B-splines (Luan and Li 2004; Storey et al. 2005). However, because microarray experiments generally have few time points, there is rarely enough data to accurately fit such models [see Bar-Joseph (2004) and Tai and Speed (2005) for a review]. Other settings can suggest a statistical model in which the gene expression is a predictor variable and the problem is that of predicting an observed phenotype. A common example is the clinical setting, where the survival time of a patient is the outcome to be predicted by the gene expression levels. In this case, a common modeling approach is a Cox proportional hazards model (Cox 1972). When predicting an outcome like survival, however, it is common to include many genes together for an integrated model, rather than a gene-by-gene analysis [see Schumacher et al. (2007) for a review of survival analysis with microarrays]. Because the number of genes is much larger than the number of samples, this creates many complications which we discuss in Section 7.5.
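A minimal sketch of such gene-by-gene testing for a simulated two-group design is shown below; the matrix Z and the group labels are hypothetical. The second block expresses the same comparison as a per-gene linear model, which generalizes directly to multiple groups or additional covariates.

set.seed(2)
n1 <- 4; n2 <- 4
group <- factor(rep(c("control", "treated"), c(n1, n2)))
Z <- matrix(rnorm(1000 * (n1 + n2)), nrow = 1000)          # genes x samples, log expression

# two-group comparison: ordinary t-test per gene
pvals_t <- apply(Z, 1, function(z) t.test(z ~ group)$p.value)

# the same comparison via a per-gene linear model
pvals_lm <- apply(Z, 1, function(z) {
  fit <- lm(z ~ group)
  summary(fit)$coefficients["grouptreated", "Pr(>|t|)"]    # p-value for the group effect
})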
7.3.2 Moderated statistics
Because the models must be fit per gene across thousands of genes, simple models and hypotheses are usually preferred. Microarray experiments often have a small number of observations (n), limiting the feasibility of fitting a complex model. Modeling assumptions are also often difficult to check across thousands of genes. Furthermore, if fitting the model requires lengthy optimization procedures, the computation time for fitting the model can be too onerous; if the model further requires cross-validation for accurate estimation of regularization parameters then the computational costs quickly multiply. Of greater concern, however, is often the large number of parameters that must be estimated with a small amount of data. For example, a simple t-test comparing two groups requires estimating four parameters per gene, resulting in thousands of parameters. A frequent problem is that by natural fluctuation, some genes will have severe underestimates of standard deviation, which result in inflated t-statistics. In time series data, the problem becomes even more severe because the covariance matrix between observations must be estimated for each gene. A standard remedy is to adopt an empirical Bayes approach for estimation of the variance (Lönnstedt and Speed 2002; Newton and Kendziorski 2003; Smyth 2004b). The resulting 'moderated' estimates will draw on the entire observed distribution of the individual estimates of standard deviation to adjust the estimates to
be more similar to the bulk, increasing small estimates of variance and decreasing large ones, with a larger impact for smaller numbers of samples. For the common example of the linear model, the basic hierarchical model of Smyth (2004b) is implemented in the widely used limma package in R (Smyth 2005) and assumes a prior distribution on the inverse of the gene variance,

$$\frac{1}{\sigma_g^2} \sim \frac{1}{d_0 s_0^2}\,\chi^2_{d_0},$$

where $d_0$ and $s_0^2$ are parameters of the prior distribution. The resulting posterior mean (or estimate) of $\sigma_g^2$ given the observed sample variance of the data, $s_g^2$, is

$$\tilde{s}_g^2 = E\left(\sigma_g^2 \mid s_g^2\right) = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g},$$

which has the effect of pulling the individual estimate of variance, $s_g^2$, toward that of the prior distribution. The prior parameters $s_0$ and $d_0$ are unknown, but by using the distribution of the observed $s_g$, they can be estimated rather than assumed, which is why this method of estimation is known as empirical Bayes estimation. A moderated t-statistic is given by replacing $s_g$ with $\tilde{s}_g$. As the sample size increases (increasing $d_g$) the effect of the moderation will be minimal, but for small sample sizes there can be a large difference in the ordering of genes. In the linear model described above, the sampling distribution of the resulting t-statistics can be calculated to give p-values (Smyth 2004b), but in other cases the sampling distribution is unknown and the resulting moderated test statistics can only give a more reliable ranking of the genes without p-values. One such example is moderated test statistics proposed for time series data (Tai and Speed 2006). For nonparametric approaches, also related to controlling the false discovery rate (FDR), see Efron et al. (2001) and Datta and Datta (2005). Fully Bayesian approaches have been proposed as well, see Baldi and Long (2001), Newton et al. (2001) and Ibrahim et al. (2002).
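A minimal sketch of the shrinkage formula above is given below for a simulated two-group comparison. The prior values d0 and s0^2 are fixed here purely for illustration; in practice they are estimated from the observed distribution of the s_g^2, as is done automatically by lmFit() followed by eBayes() in the limma package.

set.seed(3)
n1 <- 3; n2 <- 3; G <- 2000
group <- rep(0:1, c(n1, n2))
Z <- matrix(rnorm(G * (n1 + n2)), nrow = G)                        # genes x samples

mean_diff <- rowMeans(Z[, group == 1]) - rowMeans(Z[, group == 0])
s2 <- apply(Z, 1, function(z) {                                    # pooled per-gene variance
  (sum((z[group == 0] - mean(z[group == 0]))^2) +
   sum((z[group == 1] - mean(z[group == 1]))^2)) / (n1 + n2 - 2)
})
dg <- n1 + n2 - 2
d0 <- 4; s0_sq <- 0.05                                             # prior values, assumed for illustration

s2_tilde <- (d0 * s0_sq + dg * s2) / (d0 + dg)                     # shrunken ('moderated') variance
t_mod <- mean_diff / sqrt(s2_tilde * (1/n1 + 1/n2))                # moderated t-statistic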
7.3.3 Statistics for HTS
The above significance tests generally make an underlying assumption of normality or more generally independence of the mean and variance of the estimate of gene expression when expressed on a log-scale. For HTS data, the observed gene estimates are frequently counts. Instead of taking a log transformation, the most common per-gene models rely on assuming a distribution for the counts, such as a Poisson distribution, and using the distribution to estimate important parameters. With this assumption, a generalized linear model (GLM) (McCullagh and Nelder 1989) can be constructed in a similar manner to the linear model described above, where the expected value is modeled through a link function g as $g(E(Z_i)) = x_i^T \beta$, where g depends on the assumed distribution of the $Z_i$ (for example, g = log if $Z_i$ is assumed Poisson). Wald t-statistics can be calculated for testing if $\beta_j = 0$; such statistics are based on asymptotic assumptions which can have poor behavior in practice (Pawitan 2001). An alternative is a likelihood ratio (LR) statistic, the ratio of the likelihood under the null $\beta_j = 0$ to the likelihood making no assumptions on $\beta_j$. While the $\chi^2$ distribution of the LR statistics is also asymptotic, LR statistics generally have better behavior than the t-statistics, and Bullard et al. (2010) demonstrate their improved performance for the example of HTS. The Poisson distribution often fails to account for the amount of biological variability seen in populations, where many unobserved variables can increase the variability compared with that expected from only the observed variables. For this reason, other types of distributions provide additional flexibility to account for additional variability. Robinson and Smyth (2007) and Anders and Huber (2010) suggest negative binomial
models, including an empirical Bayes method for creating moderated test statistics. As the sample size increases, normal approximations for the log of the counts will start to be valid, allowing standard testing methods as well. An important feature of these count models, particularly the Poisson, is that the power of the test (i.e. the probability that it detects deviations from the null) is dependent on the gene expression level. This is because more highly expressed genes will have more reads, which means more data for estimating the gene expression. The accuracy of the gene estimates will be greater for highly expressed genes, and thus detection of deviations from the null will be easier. This is in contrast to microarray data, where the standard error is independent of the level of expression of the gene (assuming the gene was expressed enough to be detected by the microarray). This means lists of genes found to be significant with HTS must be interpreted with this caveat in mind. The size of the detected difference can in many cases be quite small, because even small differences in highly expressed genes can reach significance. This can be a particular problem when evaluation of a gene list is implicitly also an evaluation of what is not on the gene list, for example what functional categories of genes are over-represented (Section 7.4). For distributions like the negative binomial, which allow for variability between samples, the relationship between expression level and significance is more moderated, and moderately expressed genes can be found significant more easily than with a simple Poisson.
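As an illustration of the Poisson GLM with a likelihood ratio test, the sketch below fits a per-gene model with the log library size as an offset; the counts, library sizes and group labels are simulated. Negative binomial analogues of this analysis are provided by the edgeR and DESeq Bioconductor packages associated with the references above.

set.seed(4)
group   <- factor(rep(c("A", "B"), each = 3))
libsize <- c(2e6, 2.5e6, 1.8e6, 2.2e6, 3e6, 2.1e6)            # total mapped reads per sample
y <- rpois(6, lambda = libsize * 1e-5)                         # counts for one gene

fit1 <- glm(y ~ group + offset(log(libsize)), family = poisson)   # full model
fit0 <- glm(y ~ 1     + offset(log(libsize)), family = poisson)   # null model (no group effect)
lr <- 2 * (logLik(fit1) - logLik(fit0))                            # likelihood ratio statistic
p_value <- pchisq(as.numeric(lr), df = 1, lower.tail = FALSE)      # asymptotic chi-squared p-value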
7.3.4 Multiple testing corrections
Standard hypothesis testing theory finds a theoretical cut-off for a test statistic that guarantees, under certain stated assumptions, that the probability of the test rejecting the null hypothesis H0 when H0 is in fact true (a Type I error) is less than or equal to a predesignated level α ∈ (0, 1). In the case of microarray experiments, we perform hypothesis tests on G genes, where G is usually in the thousands. While the theory guarantees that the error rate of each individual test has been controlled at level α, the error rate of the overall results of the experiment (e.g. a list of significant genes) can increase well beyond that point. There are several notions of experiment-wide error rate (see Chapters 2 and 3 for a more detailed discussion). The most traditional notion is family-wise error rate (FWER) – the probability of at least a single incorrect rejection of the null. A more relevant measure for gene expression platforms, with thousands of hypothesis tests, is the number of genes wrongly rejected in relationship to the total number of genes chosen for further study. By far the most common choice of this type of error rate in practice is the FDR (Benjamini and Hochberg 1995), defined as the expected value of the proportion of wrongly rejected null hypotheses. Unlike the FWER, methods that control the FDR tend to be independent of the number of genes considered, which also makes the results much more robust to choices of the number of genes considered (but see Section 7.3.5 regarding filtering of genes). The assumptions regarding the joint distribution of the test statistics needed in order to prove that a method controls the FDR are generally greater than those needed for methods controlling the FWER, and it is generally not possible to know in practice if the assumptions are satisfied. Regardless, in practice FDR corrections have been found to give reasonable control of false positives. We briefly overview the two different approaches to creating multiple testing procedures that are common in the analysis of gene expression data; see Chapter 2 for further discussion. Throughout, assume we are testing the null hypothesis H0g for each gene g at level αg with a corresponding test statistic Tg for g = 1, . . . , G. The classical frequentist approach for controlling type I error rates of an experiment adjusts the level αg (γ) used in testing the individual H0g so as to guarantee that the experiment-wide type I error rate is less than a level γ. Equivalently, the adjustments can be made directly to the raw p-values to create adjusted p-values that have the property that the individual adjusted p-values can be evaluated in the standard way, that is reject the null hypothesis H0g if the adjusted p-value is less than γ, and doing so will guarantee that the resulting experiment has error rates less than γ. Different procedures vary in their choice of αg (γ) and as a result differ
theoretically for the assumptions necessary to ensure that their method of choosing the set of αg(γ) from the data will guarantee that the experiment-wide error rate is less than γ. One important difference between such methods is whether the choice of αg(γ) is based on the marginal distribution of the test statistic of gene g or the joint distribution of the test statistics across genes; note, however, that even if a method does not use the joint distribution of the test statistics to choose αg(γ), the proof of control of the experiment-wide error rate may rest on assumptions of the joint distributions of the genes. Another distinction between correction methods is whether the value αg(γ) is the same for all genes (single-step methods) or can monotonically decrease (or increase) based on the rank of the gene g (step-wise methods), in which case the adjustment to a test g depends on its order in the list and the hypotheses are adjusted successively. Whether the successive adjustment begins with the most significant or the least significant results in step-down or step-up procedures, respectively. Corresponding adjusted p-values are found by inverting the procedure and finding the minimum γ for which the gth hypothesis would be rejected under the procedure. There are two common step-up methods for providing adjusted p-values that control the FDR (Benjamini and Hochberg 1995; Benjamini and Yekutieli 2001); see Dudoit et al. (2003) and Reiner et al. (2003) for a review. Another approach for controlling the FDR is the empirical Bayes method (Efron et al. 2001; Storey 2003). In this approach, the distribution of the test statistics across genes is assumed to be a simple mixture of the distribution expected if the null hypothesis is true ($F_0$) and that expected if the null hypothesis is false ($F_1$), with a prior probability $\pi_0$ of a null hypothesis being true. Then if the $T_g$ are either independent or satisfy certain dependence assumptions, the FDR is given by $\Pr(H = 0 \mid T \in \Gamma_\alpha)$, where $\Gamma_\alpha$ is the rejection region for the test based on the statistics T (where the FDR is defined as the expected rate of false rejections conditional on at least one hypothesis being rejected). By using empirical Bayes techniques to estimate $\pi_0$ and the $F_i$ (see Chapter 2), this relationship allows for control of the FDR by appropriately picking the rejection region $\Gamma_\alpha$, i.e. by changing α. Storey (2003) defines a q-value as the Bayesian analogue of a p-value, $q_g = \inf_{\{\Gamma_\alpha : T_g \in \Gamma_\alpha\}} \Pr(H = 0 \mid T \in \Gamma_\alpha)$, i.e. the minimum posterior probability of the null hypothesis over all rejection regions $\Gamma_\alpha$ containing $T_g$. Storey and Tibshirani (2003) and Efron (2008) give reviews of the empirical Bayes approach to multiple testing. Connections between the empirical Bayes approach and stepwise methods are detailed in Efron and Tibshirani (2002), Storey (2002) and Dudoit and van der Laan (2008). The problem of multiple testing is generally much more complicated than discussed here. In the first place, many studies have multiple questions being asked. A simple example is in a two-way ANOVA, where there can be interactions as well as multiple main-effects to be evaluated per gene, or replicated time-series data, where multiple time points can be of interest. Yekutieli et al. (2006) and Yekutieli (2008) discuss controlling the FDR in a complex study where families of hypothesis tests will be considered and the hypotheses that will be tested depend on the results of a previous family of hypotheses, e.g. testing interaction effects only for those genes found to have significant main effects.
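The step-up procedure of Benjamini and Hochberg (1995) is available in base R through p.adjust(), as sketched below for a simulated vector of p-values; the crude pi0 estimate in the last line is only an illustration of the empirical Bayes idea of estimating the proportion of true nulls (using the fraction of p-values above 0.5), not any specific published estimator.

set.seed(5)
p <- c(runif(1900), rbeta(100, 0.2, 1))       # mostly null p-values plus some signal (simulated)
p_bh <- p.adjust(p, method = "BH")            # Benjamini-Hochberg adjusted p-values
sum(p_bh < 0.05)                              # number of genes declared significant at FDR 0.05
pi0_hat <- min(1, 2 * mean(p > 0.5))          # crude estimate of the proportion of true null hypotheses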
Issues of multiple testing often provoke larger questions about reproducibility of research, since each published (and unpublished) paper could be considered an independent experiment at level α, yet the body of published work is not corrected for multiple testing. Ioannidis (2005) is an example of such concern, but see also the critical response of Goodman and Greenland (2007). However, the experiments are not independent, because the hypothesis tests considered at each stage of research depend critically on the results of previous hypothesis testing already done, and as all work in multiple testing correction makes clear [and that of Yekutieli et al. (2006) and Yekutieli (2008) on hierarchical hypothesis testing in particular], the needed amount of correction for dependent hypothesis tests is often not equal to correcting for all hypotheses independently. Moreover, as Tukey (1991) points out, G experiments done by different groups at separate time points have a very different underlying reliability than G hypotheses tests done on a single dataset produced by one
laboratory, and thus the focus on controlling the errors per experiment makes sense for this reason. Ultimately, high-throughput genomic experiments are hypothesis-generating studies – rarely conclusive on their own – and as such the role of multiple testing corrections is to help to quantify the future potential of a gene in a way that is comparable across experiments.
7.3.5 Filtering genes
Multiple testing corrections guarantee only that the FDR is below a certain rate γ (assuming the theoretical assumptions are satisfied). Depending on the (unknown) relationships of the genes, the FDR may be well below γ in which case the methods can be conservative (i.e. it may be at least theoretically possible to increase the power without increasing the nominal FDR). Researchers have continually proposed strategies to try to improve the power of the analysis by removing or pre-filtering genes which are not promising, for example genes that are not expressed or have low variability, before performing hypothesis testing and multiple testing corrections. Alternatively, it is also common after hypothesis testing and multiple testing corrections to focus on those significant genes that also show the largest effect size (e.g. largest differences between two groups). This is commonly done through a volcano plot, in which the adjusted p-values are plotted against the effect size and those genes in the upper corners of the plot are considered the most promising candidates for future investigation (see Figure 7.2). However, the impact of such filtering on the validity of the resulting p-values is problematic and depends on the filtering proposed. The filter is by definition data-dependent and thus can act as a form of hypothesis test itself in which case filtering can increase the true experiment error rate (either FWER or FDR) above the reported γ. An extreme example is filtering based on the test statistic itself – i.e. only consider genes with large test statistics – which has the effect of implicitly performing tests on all of the genes but only correcting
the hypotheses found to be significant. It is clear that this will increase power (more non-null hypotheses will get rejected), but the reported FDR will be meaningless. Filters based on the fold change between conditions are similarly problematic (van Iterson et al. 2010). Even nonspecific filters (i.e. ones that do not use group information to perform the filtering) can result in invalid p-values (Bourgon et al. 2010a). Also note that while there is always a trade-off in controlling type I and type II errors, the difficulty posed by filtering is that it is not really a trade-off because there is no guarantee for control of either type of error. Let Fg be the filter statistic for a gene and Tg be the test statistic. Then standard pre-filtering approaches remove genes based on Fg < c, for example. Then the testing procedure is based on the null distribution of Tg, but in fact the test statistics after filtering should be considered to be distributed Tg | Fg > c, and thus the joint null distribution of Fg and Tg is relevant. Pre-filtering will only give valid p-values if Tg and Tg | Fg have the same distribution under the null hypothesis. Bourgon et al. (2010a) study two common filter statistics, the empirical mean [$F_g = \bar{Y}_g$] and variance [$F_g = \widehat{\mathrm{var}}_g(Y)$], in the setting of comparing differences between two groups. They show that when the data are first filtered using as a filter statistic the empirical mean or variance, the standard p-values based on the two-group test statistic Tg are still valid. van Iterson et al. (2010) observe the same behavior in their simulation studies. Bourgon et al. (2010a) also give a discussion of when the assumptions for control of the FDR, which depend on the joint distribution of the p-values across the genes, remain satisfied. The effectiveness of filtering can be unclear because a filter always has a chance of removing from consideration some genes that are in fact not null. The variance filter appears to increase power of detection, while a filter based on just the mean expression can actually worsen the power (Bourgon et al. 2010a,b; Talloen et al. 2010; van Iterson et al. 2010). Another problem in filtering comes with the use of moderated statistics. Since the distribution of estimated variances is used to estimate the moderated statistics, the use of only genes that pass a variance filter is problematic since only a selected portion of the distribution is used. Moderated statistics already have an implicit correction for variance, in the sense that low variance genes are adjusted to have higher variances (and thus lower t-statistics) and conversely for high variance genes. Unlike a direct filter on variance, the stringency of this correction decreases with larger sample sizes and thus increases in power using moderated test statistics will only be seen with small samples [see Bourgon et al. (2010a) for more on moderated test statistics and filtering]. Pre-filtering on fold change is a specific filter because it makes use of the group/condition labels, and it gives invalid adjusted p-values and no control of FDR (Bourgon et al. 2010a; van Iterson et al. 2010). Ad hoc post-filtering, such as with volcano plots, is common and generally thought to produce more reliable and reproducible results (Patterson et al. 2006), but again there is no guarantee on controlling the FDR for those genes selected after filtering.

Figure 7.2 Typical volcano scatter plot of the log p-value (y-axis) versus log fold change (x-axis) for each gene based on a t-test of 3 BCR/ABL and 3 control subjects. Horizontal line indicates significance at a level 0.01. Genes with both large p-values and large fold change are in the upper corners of the plot and are often viewed as better candidates for further study. Lightly shaded points are those that did not pass a variance filter cut-off. For this sample size, the post-filtering by log fold change and pre-filtering by variance resulted in essentially equivalent decision rules. Reproduced from Bourgon et al. (2010a)
In the context of finding significant differences between two groups, filtering on fold change and filtering on overall variance (which is non-specific and still results in valid p-values) can have similar results because there is a direct relationship between fold change and overall variance (Bourgon et al. 2010a),

$$\frac{(\bar{Y}_1 - \bar{Y}_2)^2}{S^2} = \frac{n(n-1)}{n_1 n_2}\,\frac{T^2}{T^2 + n - 1}$$

where T is the standard t-statistic based on a pooled estimate of variance and $S^2$ is the overall variance ignoring the grouping. Thus, filtering on variance induces a lower bound on the fold change, a bound that depends on the significance result. For large samples this lower bound depends more heavily on the p-value, so that more significant genes require a higher fold change to also pass the variance filter, but for small sample sizes the lower bound is almost constant so that variance filtering is roughly equivalent to post-filtering on fold change (see Figure 7.2). An alternative approach is to use methods that directly account for the fold-change filter, for example the work of McCarthy and Smyth (2009) that explicitly tests for the null hypothesis that
the fold change is greater than a cut-off β using the empirical Bayes framework of moderated test statistics (Section 7.3.2).
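The sketch below illustrates the kind of nonspecific variance filter discussed above (one that ignores the group labels), applied before per-gene t-tests and Benjamini-Hochberg correction; the expression matrix, group labels and the choice to keep the most variable half of the genes are all hypothetical.

set.seed(6)
group <- factor(rep(c("A", "B"), each = 4))
Z <- matrix(rnorm(5000 * 8), nrow = 5000)                    # genes x samples, log expression

v <- apply(Z, 1, var)                                        # overall variance, ignoring the groups
keep <- v > quantile(v, 0.5)                                 # nonspecific filter: most variable half
p <- apply(Z[keep, ], 1, function(z) t.test(z ~ group)$p.value)
p_adj <- p.adjust(p, method = "BH")                          # FDR correction over the filtered genes only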
7.4 Grouping genes to find biological patterns
The hypothesis testing approaches described above score genes one-at-a-time using marginal statistics. However, genes may act in concert to influence biological response, for example in a pathway or regulatory module. This motivates the use of statistical methods that can model the joint activity of genes. The importance of modeling the joint effects of groups of genes, whilst dealing with concerns regarding model complexity and predictive power, has motivated a diverse range of statistical formulations. These approaches go beyond marginal statistics and individual-gene analyses with the aim of identifying and modeling the influences of groups of genes on phenotypes of interest.
7.4.1 Gene-set analysis
When genes are biologically related, for example as members of a pathway or regulatory module, it may be the case that none of the genes stand out individually, but the group of genes taken together are salient in some sense. The idea of testing genes for significance not on a gene-by-gene basis, but at the level of sets of genes, is called gene set analysis, and the sets are called gene sets. Thus, gene set analysis can be thought of as a systematic approach to grouping together and interpreting gene-level test statistics, automating and extending what is often done by examination of gene lists. Gene sets, comprising biologically related genes, are usually predefined by the user [although methods do exist to automatically discover related groups of genes, e.g. Zou and Hastie (2005)]. These sets may be defined in a broad variety of ways, including, but not limited to: shared genomic location; shared regulatory motifs; shared pathway membership or regulatory network proximity; and co-expression under perturbation. The original gene set analysis was proposed by Mootha et al. (2003) and further developed in Subramanian et al. (2005). This original approach has subsequently been built on by a number of authors [including Efron and Tibshirani (2007), Jiang and Gentleman (2007), Irizarry et al. (2009)], leading to a number of methods which share the underlying gene set formulation. The Bioconductor package limma (Smyth 2004a) incorporates a gene set analysis which permits a number of different choices for enrichment score Sj. Here, following Efron and Tibshirani (2007), we introduce gene set analysis at a general level, and then describe specific methods that arise as special cases. As above, let Z denote a gene expression data matrix with entries Zig indexed by gene g ∈ {1 . . . p} and sample i ∈ {1 . . . n}. The corresponding class labels are Yi ∈ {0, 1}. We denote gene-level test statistics tg. Each predefined gene set j is a subset Aj ⊂ {1 . . . p} of gene indices with Nj = |Aj| members. The gene set methods described here score gene sets using statistics Sj = S({tg, g ∈ Aj}) derived from gene-level test statistics. The score Sj is referred to as the enrichment score and captures the extent to which members of the corresponding gene set have gene-level test statistics that are unusually large or small. Gene sets are ranked under the scores Sj, and sets whose enrichment scores fall above a threshold are reported as significant.
7.4.1.1 Gene set analysis using z-scores
A z-score may be used as a simple enrichment score (Irizarry et al. 2009):

$$S_j = \sqrt{N_j}\,\bar{t}, \qquad \bar{t} = \frac{1}{N_j}\sum_{g \in A_j} t_g \qquad (7.2)$$

where $t_g$ denotes a gene-level test statistic.
If the tg are (approximately) standard normal and independent under the null, then the above enrichment score is itself (approximately) standard normal. In that case, gene-set p-values are easily obtained, removing the need for computationally demanding approaches such as permutation tests. The p-values so obtained can then be corrected for multiple comparisons using the procedures described above. Irizarry et al. (2009) show that this gene set analysis performs well on a number of microarray datasets. The simplicity and speed of this approach make it an attractive default choice.
7.4.1.2 Gene set analysis using χ2 statistics
The enrichment statistic (7.2) does not detect changes in scale: that is, it does not permit detection of effects that leave the mean unchanged. For example, if half the genes in a gene set are up-regulated and half down-regulated, (7.2) will not detect the effect, since there is no change at the level of the mean. For this scenario, e.g. where gene sets may show interesting behavior that is not reflected at the level of the mean, a χ2 statistic may be used (Irizarry et al. 2009):

$$S_j = \sum_{g \in A_j} (t_g - \bar{t})^2 \qquad (7.3)$$

This gives $S_j \sim \chi^2_{|A_j|-1}$, from which p-values are easily obtained.
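Both statistics are easy to compute directly, as in the sketch below; the gene-level t-statistics and the gene set definitions are simulated purely for illustration.

set.seed(7)
t_stats <- rnorm(5000)                               # gene-level test statistics (hypothetical)
gene_sets <- list(setA = sample(5000, 40),           # hypothetical gene set definitions
                  setB = sample(5000, 25))

set_scores <- function(idx, t) {
  tbar <- mean(t[idx]); N <- length(idx)
  z    <- sqrt(N) * tbar                             # equation (7.2)
  chi2 <- sum((t[idx] - tbar)^2)                     # equation (7.3)
  c(p_z    = 2 * pnorm(-abs(z)),                     # two-sided normal p-value for the z-score
    p_chi2 = pchisq(chi2, df = N - 1, lower.tail = FALSE))
}
sapply(gene_sets, set_scores, t = t_stats)           # p-values per gene set, to be FDR-corrected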
7.4.1.3 Gene set enrichment analysis (GSEA)
The approach originally proposed in Mootha et al. (2003) and Subramanian et al. (2005) is known as GSEA. GSEA uses a Kolmogorov–Smirnov type statistic to compare members and nonmembers of a gene set and thereby give the score Sj. To score a given gene set Aj, GSEA starts with a list of genes ranked under the gene-level test statistics, e.g. by association with class labels. The algorithm then walks down the ranked list, increasing a cumulative sum when it encounters gene set members g ∈ Aj and decreasing it for nonmembers g ∉ Aj, with the contribution to the enrichment score weighted by the absolute value of the corresponding gene-level test statistic. The score is then given by the maximum deviation from zero of the running sum. For full details see Subramanian et al. (2005). In GSEA, p-values are obtained by a permutation analysis. Specifically, the sample labels Yi are randomly permuted B times and the scores Sj computed: the resulting B values are used as an empirical null distribution for computation of p-values [in Subramanian et al. (2005) B = 1000 is used]. The methods for controlling FDR, described above, are then used to account for multiple comparisons.
7.4.1.4 Problems using HTS
As discussed in Section 7.3.3, the tests applied to HTS data have varying levels of power depending on the level of expression of the gene, particularly with simple count distributions like the Poisson. From the point of view of gene set analysis, different levels of power mean that the genes are not equally likely to be declared differentially expressed, even if their differences in expression in the input samples were the same. The presumption of gene set analysis is that one can analyze those genes found significant (and by implication those not found significant) and look for patterns based on the biological function of the genes. For HTS, biological functions whose genes are more highly expressed will tend to be found over-represented, even if there is little biological difference in the samples. Young et al. (2010) offer a correction to Gene Ontology analysis that specifically corrects for the difference in significance of genes of different length.
7.4.2 Dimensionality reduction
The high-dimensionality of transcriptomic data can hinder visualization and exploratory analysis and obscure biologically interesting, systems-level regularities. Dimensionality reduction methods are therefore widely used to obtain low dimensional representations that are more amenable to visualization and biological interpretation than the original data. These methods can also be used to pre-process data prior to further statistical analysis. This has the benefit of allowing subsequent model fitting to proceed in a lower dimensional space. However, it is important to recognize the estimation involved in carrying out dimensionality reduction itself and the variance this contributes to the overall analysis.
7.4.2.1 Principal component analysis (PCA)
Principal component analysis gives a linear projection of the gene expression data into a lower dimensional space. A gene expression vector Z ∈ Rp is projected into a d-dimensional space via a d × p projection matrix A to give a reduced dimensional representation Y = AZ, Y ∈ Rd. The target dimension d is usually chosen to be much smaller than the number of genes p and for d ≤ 3 yields projected data Yi which can be readily visualized. The rows of the matrix A are constrained to be mutually orthogonal and of unit length and the matrix A chosen to maximize the variance in the projected space, i.e. to retain as much of the data variation as possible. The variance maximizing projection is obtained by setting the rows of A to be the d eigenvectors of the sample covariance matrix $\hat{\Sigma}_Z$ corresponding to the d largest eigenvalues. These eigenvectors are known as the principal components. The projection so defined is also optimal with respect to the least squares reconstruction error associated with approximating the original data points by their projections; for details see Hastie et al. (2009). Each of the p axes in the original gene expression space corresponds to a gene. Since the principal components are themselves vectors in the gene expression space they can be thought of as new axes, each formed as a linear combination of the genes. From this point of view, the principal components can be interpreted as composite genes (often called 'eigengenes' or 'metagenes') which summarize the overall variation in the data (Bild et al. 2005).
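A minimal sketch of PCA on a simulated log-expression matrix is shown below using base R's prcomp(); the matrix and the choice of two components are purely illustrative. The rotation (loadings) returned by prcomp() can be read as the 'eigengene' weights over the genes.

set.seed(8)
Z <- matrix(rnorm(2000 * 10), nrow = 2000)              # genes x samples (hypothetical)

pca <- prcomp(t(Z), center = TRUE, scale. = FALSE)      # prcomp expects samples in rows
Y <- pca$x[, 1:2]                                       # d = 2 dimensional representation of the samples
plot(Y, xlab = "PC1", ylab = "PC2")                     # low-dimensional visualization
summary(pca)$importance[, 1:3]                          # proportion of variance explained by leading PCs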
7.4.2.2 Related methods
While PCA provides a reduced dimensional representation of the data it relies on estimation of the data covariance which can itself be challenging in the 'large p, small n' gene expression setting. The covariance estimate used to define the projection may itself have large variance and this in turn may render the low-dimensional projections and eigengenes obtained highly sensitive to perturbation of the data. Sparse, regularized formulations (Zou et al. 2006) can therefore be useful in rendering PCA robust for gene expression data. A number of other dimensionality reduction methods have been applied to transcriptomic data. Like PCA, several widely used approaches model data Z as linear combinations of unobserved or latent factors comprising a vector Y, that is, in matrix notation, Z = WY, where both W and Y are unknown. Independent component analysis (ICA) seeks to render the components of vector Y as close to mutually independent as possible. This is only possible under certain constraints: W is identifiable up to scaling and permutation of its rows if at most one of the components of the vector Y is Gaussian (Hastie et al. 2009). PCA can be formally described as a Gaussian latent variable model [for details see Tipping and Bishop (1999)]. The non-Gaussianity of ICA is a key difference between ICA and PCA and necessitates going beyond second moments. Information theory can be used to formulate ICA as an optimization problem. Specifically, the matrix W is sought to minimize the Kullback–Leibler divergence between the joint density p(Y) and the 'independence' product density $\prod_j p(Y_j)$
(because the Gaussian distribution has the maximum entropy of any density with a given variance, this turns out to be equivalent to maximizing the non-Gaussianity of the latent factors). ICA can be useful in settings in which it is reasonable to assume that independent biological processes contribute to observed gene expression and has been applied to gene expression data in Lee and Batzoglou (2003). For details regarding the method see Chen and Bickel (2006), Hastie et al. (2009) and the references therein. Non-negative matrix factorization (NNMF) is also a linear approach, but W and Y are constrained to have non-negative entries, and thus the input data Z must also be non-negative. For gene expression data, this implies that Z is a matrix of gene expression estimates without a log transformation, which differs from all other methods described in this chapter. NNMF has been applied to gene expression data to uncover 'metagenes' and cluster samples (Brunet et al. 2004). The requirement of non-negativity of the matrices is a strong constraint. It is most appropriate when seeking to model gene expression directly, without first log-transforming the data. However, it should be noted that when NNMF is carried out using non-log-transformed gene expression data, highly-expressed, large-variance genes may contribute greatly to the final result.
7.4.3 Clustering
Clustering is a form of unsupervised learning in which samples are partitioned into clusters, such that within-cluster similarity is large relative to between-cluster similarity. Clustering can serve to shed light on unrecognized heterogeneity or highlight regularities over groups of genes or samples and is often an important first step in transcriptomic analyses. It is also useful in detecting batch effects or other quality control problems. For gene expression data Z clustering may be used to group genes or samples or both simultaneously. For notational simplicity we focus on the clustering of samples, but note that the methods described can be readily applied to the clustering of genes by simply considering the transpose of data matrix Z. Chapter 2 gives an overview of clustering methods, and we give only a brief description of them here and relevant details about their application to transcriptional studies. All of these methods assume the number of clusters is known; we discuss choosing the number of clusters below.

K-means. K-means clustering tries to minimize the average distance of the observations in a cluster to their mean, and finds the optimal clustering by an iterative procedure in which the computation of cluster means alternates with cluster assignment [see Hastie et al. (2009) for details]. The algorithm gives only a local minimum, and should therefore be restarted from many random initializations. The algorithm can be generalized by using other notions of similarity and other definitions of the center of the cluster, in which case it is often called K-centres or K-medoids. However, in this case the cluster characterization step involves searching over all members of each cluster to minimize within-cluster dissimilarity; this requires quadratic time in cluster size. Thus, it can be difficult to obtain good clusters in practice, and particularly to do so robustly and rapidly in applications with a large number of objects to be clustered. A recent message passing-based algorithm called Affinity Propagation (Frey and Dueck 2007) offers an efficient and deterministic alternative to K-centres and has been used for gene expression data by Leone et al. (2007) and Kiddle et al. (2010).

Hierarchical clustering. Hierarchical clustering, rather than searching for a fixed number of clusters, progressively groups together samples or groups of samples to create a hierarchical structure of nested clusters: each level of the hierarchy is a clustering of the samples which is itself a product of the merging of two smaller clusters, see Chapter 2 for more details. To obtain the desired K clusters is simply a process of moving down the hierarchy until the level with exactly K clusters is reached [but see Kettenring (2006) for a discussion of the nonintuitive cluster results that can occur from such a 'cutting' procedure]. Hierarchical clustering creates a nested sequence of clusters that can be visualized as a binary tree, with each of the samples at the tip of the tree. By clustering both the genes and samples with hierarchical clustering, the rows and columns of the matrix Z can be reordered in the order of the tips of the binary tree and the entries
of the matrix illustrated by a color scale, with the binary tree shown on the side to also note the relationship between the genes or samples. Such a visualization is known as a heatmap, and it is a ubiquitous choice for visualization of microarray data, see Figure 7.3 (Eisen et al. 1998; Friendly 2009).

Mixture models. Mixture model clustering assumes that the density of Z can be written as a K-component mixture model; often the assumption is that the mixture components are Gaussian, but note that mixture models can be formed using any component densities. Gaussian mixture models (GMMs) are used for clustering by identifying density components with clusters: this gives a model-based approach for clustering. The EM algorithm can be used to fit a GMM where two iterative steps are carried out: (i) an 'E-step' in which data vectors Zi are assigned to clusters/components k based on the current estimates of density parameters; and (ii) an 'M-step' in which the parameters are re-estimated under the assignments from the E-step. The assignment in the E-step is 'soft' in the sense that data vectors are not simply assigned exclusively to their current best-match cluster, but rather apportioned among clusters using the current cluster-specific densities. While it can be shown that EM never decreases the likelihood, the algorithm depends on initial conditions and gives only a local maximum of the likelihood. For this reason, like K-means, EM is usually re-initialized many times and the estimates corresponding to the highest likelihood found are returned.
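The sketch below illustrates K-means and hierarchical clustering of samples, and a heatmap, using base R functions on a simulated matrix; the data, the choice of two clusters and the use of average linkage are all illustrative assumptions.

set.seed(9)
Z <- matrix(rnorm(500 * 12), nrow = 500)           # genes x samples (hypothetical)
Z[, 7:12] <- Z[, 7:12] + 1                         # shift half the samples to create structure

km <- kmeans(t(Z), centers = 2, nstart = 25)       # K-means on samples, with many random restarts
km$cluster                                         # cluster assignments

d  <- dist(t(Z))                                   # Euclidean distances between samples
hc <- hclust(d, method = "average")                # hierarchical clustering
cutree(hc, k = 2)                                  # 'cut' the tree to obtain 2 clusters

heatmap(Z[1:100, ])                                # heatmap of a subset of genes, with dendrograms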
7.4.3.1 Estimating the number of clusters
The task of determining the number of clusters K can be regarded as a form of model selection. A widely used approach is based on examining how overall model fit changes with K. For clustering methods, such as GMMs, that are based on explicit probability models, model selection may also be carried out more formally, using information criteria or Bayesian approaches (see Chapter 2). For techniques that are not model based, the selection of the number of clusters is normally done by carrying out clustering for a number of values of K and storing the final value J(K) of the cost function in each case. The cost function J(K) should decrease with K (because the increase in K results in increasing fit to the data), but a kink in the plot of J(K) against K can suggest a good choice of K. The cost function J(K) is often not the same cost function as is used in the clustering algorithm for determining the clusters. For example, the gap statistic (Tibshirani et al. 2001) defines J(K) as the difference between the pooled within-cluster sum of squares of the observed data and that expected if the data were uniformly distributed. Dudoit and Fridlyand (2002) choose K by randomly dividing the data, assessing the prediction strength of the clustering created on one half of the data on the other half, and repeating for many such partitions; J(K) is then either the average prediction strength over partitions, or the difference in observed average prediction strength from that expected under another model. Another commonly used measure is based on the silhouette statistic (Rousseeuw 1987), in which a single sample is evaluated for how well it fits in its assigned cluster compared with all other clusters; J(K) based on the silhouette statistic is often the mean of the silhouette statistic over all samples (the individual silhouette statistics per sample are also useful for finding samples that are not well clustered). See Dudoit and Fridlyand (2002) for a review of common choices of J(K) for choosing the number of clusters.
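The silhouette-based strategy can be sketched as follows; the code assumes scikit-learn, takes Z as a samples-by-genes array, and uses J(K) equal to the mean silhouette over samples (the function name and the range of K are illustrative choices).

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k_by_silhouette(Z, k_range=range(2, 11), seed=0):
    """Run K-means for each candidate K and record the mean silhouette
    statistic J(K); a large value suggests a well-separated clustering."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=50, random_state=seed).fit_predict(Z)
        scores[k] = silhouette_score(Z, labels)
    return scores

# Example usage: pick the K maximizing the mean silhouette
# scores = choose_k_by_silhouette(Z)
# best_k = max(scores, key=scores.get)
```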
7.4.3.2 Other approaches to clustering
Resampling approaches may be used to test the reliability of clusters, create more robust clusters, and estimate the number of clusters. Consensus clustering (Monti et al. 2003) and bagging (Dudoit and Fridlyand 2003) exploit resampling approaches to find clusterings that are stable under perturbation of the data and which may therefore be less likely to represent artifactual structure.
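A minimal sketch of the resampling idea behind consensus clustering is given below: samples are repeatedly subsampled and re-clustered, and the frequency with which pairs of samples co-cluster is recorded. It assumes NumPy and scikit-learn; the function name, the use of K-means as the base clusterer and the subsampling fraction are illustrative assumptions rather than the specific algorithm of Monti et al. (2003).

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(Z, k, n_resamples=100, frac=0.8, seed=0):
    """Estimate an n x n consensus matrix: entry (a, b) is the proportion
    of resamples containing both samples in which they were assigned to
    the same cluster. Values near 0 or 1 indicate stable structure."""
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    together = np.zeros((n, n))
    drawn = np.zeros((n, n))
    for _ in range(n_resamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(Z[idx])
        for a_pos, a in enumerate(idx):
            for b_pos, b in enumerate(idx):
                drawn[a, b] += 1
                together[a, b] += labels[a_pos] == labels[b_pos]
    return together / np.maximum(drawn, 1)
```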
The methods described above apply to clustering either genes or samples, i.e. along one dimension of the data matrix. However, it may be the case that certain samples cluster together, but only when a subset of the genes is considered, or vice versa. Then, biclustering approaches may be applied to simultaneously cluster both genes and samples. This effectively involves finding submatrices, each comprising a subset of samples and a subset of genes, that are internally related in some sense. A review of biclustering approaches for gene expression is presented in Madeira and Oliveira (2004), and the Plaid model (Lazzeroni and Owen 2002) is a popular choice.

Time-course data present some special challenges for clustering genes, since accounting for temporal structure is then critical in defining biologically meaningful clusters. Methods specifically aimed at time-course data are available, including Heard et al. (2006) and Kiddle et al. (2010).
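Returning to the biclustering idea above, the sketch below illustrates it using spectral biclustering from scikit-learn; this is not the Plaid model of Lazzeroni and Owen (2002), merely a convenient stand-in, and it assumes the entries of Z are non-negative (e.g. suitably transformed intensities). The function name and cluster numbers are arbitrary.

```python
from sklearn.cluster import SpectralBiclustering

def bicluster(Z, n_row_clusters=3, n_col_clusters=4, seed=0):
    """Simultaneously assign samples (rows) and genes (columns) of Z to
    biclusters; submatrices defined by one row label and one column label
    play the role of the internally related blocks described above."""
    model = SpectralBiclustering(n_clusters=(n_row_clusters, n_col_clusters),
                                 random_state=seed).fit(Z)
    return model.row_labels_, model.column_labels_
```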
7.5 Prediction of a biological response
In many settings, we seek to build predictive models of the relationship between gene expression and a biological response of interest. Genes may act in concert to influence biological response. Joint activity of this kind is especially relevant for systems-focused studies, and discovering subsets of genes which work together to drive a response is often an important step in moving towards detailed, mechanistic or network-orientated analyses. However, in the context of genome-wide transcriptional assays, the total number of genes under study (p) is usually large relative to sample size (n), often orders of magnitude larger. The resulting 'large p, small n' setting poses special challenges for inference, with issues of over-fitting, model selection and validation playing an important role.

7.5.1 Variable selection
Statistical variable selection offers a natural framework for the discovery of subsets of influential predictors from gene microarray data and for building parsimonious yet predictive models. Some important desiderata of variable selection for systems-level transcriptomic studies include:
• Predictive power.
• Sparsity. It may be desirable to treat the response as a function of only a small number of genes, that is to employ formulations which are sparse in the sense of depending on only a few predictors. Sparse formulations can offer substantial statistical gains, especially in settings in which a few genes are dominant influences on the response.
• Stability. Small perturbations to the data should not lead to completely different predictor subsets or corresponding model coefficients.
• Interpretability. Results should be interpretable for the purposes of experimental follow-up or downstream computational analyses, e.g. at the level of networks or mechanistic models.
• Computational efficiency.
Figure 7.3 Typical heatmap based on hierarchical clustering of gene expression matrix Z (log ratio of experimental to reference mRNA), where both the samples and genes were clustered (the dendrogram for the genes is not shown). Each row represents a separate gene on the microarray and each column a separate mRNA sample. The samples are normal and malignant lymphocytes and are color coded according to their known category (see upper right key). Certain categories of genes that clustered together are highlighted. The ratios were depicted according to the color scale shown at the bottom. Reproduced from Alizadeh et al. (2000).
Clearly, these desiderata are inter-related: for example, sparse models, being parsimonious, may be both stable and biologically interpretable. Here we discuss two broad approaches for variable selection which have enjoyed much development in recent years and are well-suited for systems-level studies: penalized likelihood methods, including the Lasso (Least absolute shrinkage and selection operator) and its variants, and methods based on Bayesian model selection and averaging. Many other methodologies are available in the statistical and machine learning literature. However, we note that many classical approaches, including stepwise selection and ridge regression, can be unsatisfactory with respect to one or more of the desiderata above (Hesterberg et al. 2008), especially in the 'large p, small n' regime that is characteristic of transcriptomic studies.

7.5.1.1 Penalized likelihood and Lasso
Consider the classical linear model:
$$Y_i = Z_i\beta + \epsilon_i \qquad (7.4)$$
where β = (β1, . . . , βp)^T are regression coefficients and the errors εi are independent N(0, σ²). The classical ordinary least squares (OLS) estimator for the coefficients can be obtained by maximum likelihood, or equivalently by minimizing the residual sum of squares, and is given by β̂ = (Z^T Z)^{-1} Z^T Y, where as before Z is the n × p gene expression data matrix and Y = (Y1, . . . , Yn)^T a vector of associated response values. However, this estimator suffers from a number of shortcomings. The estimator β̂ may have low bias, but high variance, which in turn affects predictive power. The fact that all predictors are included in the model hampers biological interpretability and usefulness as a guide for further experimental or computational work. These shortcomings are greatly exacerbated in the 'large p, small n' regime (and indeed without some form of regularization or variable selection the conventional estimator is not even directly applicable when p > n).

Penalized likelihood methods address these shortcomings by maximizing a modified likelihood function that is subject to a penalty on the parameters. The Lasso is a specific penalized likelihood approach for regression, originally due to Tibshirani (1996), which imposes an ℓ1 penalty on the parameters β. Specifically, the Lasso estimator is given by:
$$\hat\beta = \underset{\beta}{\operatorname{argmin}} \left\{ \sum_{i=1}^{n} \Big( Y_i - \sum_{j=1}^{p} \beta_j Z_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \qquad (7.5)$$
$$= \underset{\beta}{\operatorname{argmin}} \left\{ \| Y - Z\beta \|_2^2 + \lambda \|\beta\|_1 \right\} \qquad (7.6)$$
where ‖·‖q denotes the ℓq norm of its argument. (The sum-of-squares term in the expression above arises from the log likelihood under the Normal linear model; for other models the penalty is applied to the appropriate log-likelihood.) The penalty term results in simultaneous shrinkage and variable selection. The latter effect is due to certain properties of the ℓ1 penalty which encourage solutions in which certain coefficients βj are exactly zero, such that the corresponding terms vanish from the model (7.4). For details see Tibshirani (1996). The penalty parameter λ controls the extent to which sparse solutions are favored, with larger values of λ corresponding to sparser solutions with fewer nonzero coefficients βj.

For high-dimensional transcriptomic data, obtaining the Lasso estimate from (7.6) poses a non-trivial optimization problem. Indeed, the original algorithm proposed by Tibshirani (1996) is not applicable in the p > n case. A subsequent convex programming approach (Osborne et al. 2000) deals with the p > n case but
is not sufficiently fast for routine use in microarray studies (Segal et al. 2003). However, an algorithm known as least angle regression or LARS (Efron et al. 2004) can be used to very rapidly compute Lasso estimates, even in high-dimensional settings. Moreover, LARS provides Lasso estimates for all values of the penalty parameter λ (i.e. the entire solution path), which in turn greatly facilitates tuning of the parameter by empirical approaches such as cross-validation.

Lasso for Groups of Genes. A number of variants of the Lasso have been proposed which deal with special structure among the predictors and are well suited to specific applications in the analysis of transcriptomic data. Here we outline two approaches which deal with groups of genes and thereby complement the gene-set analyses described above. For a fuller survey of Lasso extensions we direct the interested reader to a review by Hesterberg et al. (2008).

Known groups of genes. In some settings, it may be appropriate to select or exclude a group of genes together, for example when the genes are expected to be mutually correlated on account of shared biological function. The 'group Lasso' (Yuan and Lin 2006) handles predefined groups of predictors of this kind by modifying the penalty term to encourage sparsity at the level of the groups included rather than simply the number of genes. For full details we refer the reader to Yuan and Lin (2006). An extension to the logistic regression case is presented in Meier et al. (2008).

Unknown groups of genes. The Elastic Net (Zou and Hastie 2005) is a Lasso variant that deals with the possibility of groups of genes, but does not require pre-specification of the groups. Like the Lasso it encourages sparse solutions, but it also encourages grouping of predictors. This is achieved by means of a double penalty: in addition to the ℓ1 penalty of the Lasso, a second, ℓ2 penalty (i.e. a ridge-type penalty) is imposed, which leads to a grouping effect [see Zou and Hastie (2005) for details]. A LARS-type algorithm ('LARS-EN') is available, which provides an efficient way to obtain full solution paths and thereby facilitates choice of tuning parameters by cross-validation and related methods.
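In practice, Lasso and Elastic Net fits with cross-validated penalty parameters are available in standard software. The sketch below uses scikit-learn (the R package glmnet is a common alternative); Z and y are assumed to be a NumPy array of expression values and a continuous response, and the function name is illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV

def sparse_regression(Z, y, seed=0):
    """Fit the Lasso and the Elastic Net along their regularization paths,
    choosing lambda (and, for the Elastic Net, the l1/l2 mixing parameter)
    by 5-fold cross-validation. Returns the fitted models and the indices
    of genes given nonzero Lasso coefficients."""
    lasso = LassoCV(cv=5, random_state=seed).fit(Z, y)
    enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 0.95], cv=5,
                        random_state=seed).fit(Z, y)
    selected = np.flatnonzero(lasso.coef_)
    return lasso, enet, selected
```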
Bayesian variable selection
Variable selection can also be carried out within a Bayesian framework (Chapter 3). This is done within a model selection framework, with inference carried out via the posterior distribution over models corresponding to gene subsets. T Let γ = γ1 , . . . , γp , γ ∈ {0, 1}p specify which genes are included in the model; that is, gene j is included in the model if and only if γj = 1. We use γ to denote both the inclusion indicator vector and the model it specifies. We are interested in the posterior distribution over models P(γ | Y, Z). From Bayes’ theorem we have P(γ | Y, Z) ∝ p(Y | γ, Z)P(γ)
(7.7)
where p(Y | γ, Z) is known as the marginal likelihood and P(γ) is a prior over models. The marginal likelihood is obtained by integrating out model parameters (i.e. regression coefficients and variance) and thereby accounts for model complexity, automatically penalizing models with many parameters. Prior specification depends on the precise regression formulation and the goals of analysis. A default, conjugate choice for the regression coefficients and variances in the Normal linear model (7.4) is Normal inverse-gamma; this gives the marginal likelihood in closed form [for details of standard conjugate formulations for Bayesian regression see Denison et al. (2002), in particular Appendix B therein]. The model prior P(γ) can be used to encourage sparsity (Brown et al. 2002; Mukherjee et al. 2009); to incorporate known biology, for example by assigning each variable a prior probability of being included (Ai-Jun and Xin-Yuan 2010); or simply chosen as uniform over the model space.
Following prior specification, the posterior distribution (7.7) may be used to carry out variable selection. Maximizing the posterior over models gives a single, 'best' model:
$$\gamma^* = \underset{\gamma}{\operatorname{argmax}}\; P(\gamma \mid Y, Z) \qquad (7.8)$$
The model γ* is known as the maximum a posteriori or MAP model. The selected gene subset is then {j : j ∈ γ*}. However, considering a single model may in some cases be misleading. In small-sample settings, for example, the posterior distribution may be diffuse with many gene subsets having similar posterior probability. An alternative approach is to average over models and thereby calculate posterior inclusion probabilities for each gene (Brown et al. 2002; Denison et al. 2002). This is done by considering the total probability over all models containing a given gene j:
$$P(\gamma_j = 1 \mid Y, Z) = \sum_{\gamma : \gamma_j = 1} P(\gamma \mid Y, Z). \qquad (7.9)$$
These inclusion probabilities provide an interpretable way to score the importance of each gene in determining the response, but in contrast to marginal statistics they do so via joint models specified at the level of subsets. Direct evaluation of (7.9) entails enumeration of all possible models γ. However, for p genes, the full model space is of size 2^p, precluding a direct approach in all but the smallest problems. Instead, Markov chain Monte Carlo may be used to sample from the posterior over models and thereby estimate the inclusion probabilities; see Chapter 3 and George and McCulloch (1993), Nott and Green (2004) and Mukherjee et al. (2009).
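For intuition, the sketch below computes approximate posterior inclusion probabilities by enumerating small models and weighting them by exp(−BIC/2) as a stand-in for the marginal likelihood under a uniform model prior; this is a rough approximation, not the conjugate Normal inverse-gamma treatment described above, and it is only feasible once p has been reduced to a handful of candidate genes. NumPy and scikit-learn are assumed; the function name and the cap on model size are illustrative.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression

def inclusion_probabilities(Z, y, max_size=3):
    """Enumerate all models with up to max_size predictors, weight each by
    exp(-BIC/2) (a crude surrogate for the marginal likelihood), and return
    approximate posterior inclusion probabilities for each gene."""
    n, p = Z.shape
    models, log_weights = [], []
    for size in range(max_size + 1):
        for subset in combinations(range(p), size):
            if size == 0:
                resid, k = y - y.mean(), 1
            else:
                fit = LinearRegression().fit(Z[:, subset], y)
                resid, k = y - fit.predict(Z[:, subset]), size + 1
            bic = n * np.log(np.mean(resid ** 2)) + k * np.log(n)
            models.append(set(subset))
            log_weights.append(-0.5 * bic)
    w = np.exp(np.array(log_weights) - max(log_weights))
    w /= w.sum()
    return np.array([sum(w[m] for m, mod in enumerate(models) if j in mod)
                     for j in range(p)])
```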
7.5.2 Estimating the performance of a model
One of the most important aspects of building a predictive model is assessing how well it generalizes to the wider population of samples not seen in the specific study under consideration (generalization error or test error). Instead of actually estimating the generalization error, we may want to compare the error of two models so as to pick the best model. Clearly, the performance of the model on the observed n samples is a measure of the performance of the model (training error). However, the training error is a notoriously poor estimate of the expected error of a model on new observations. The main problem is that inclusion of additional variables in the model, or an increasingly flexible model, results in a better prediction of the observed samples because noisy and non-replicable aspects of the data begin to be predicted. Alternatives to the training error, such as the AIC or BIC, directly try to estimate the error on new data for certain kinds of procedures, and the resulting predictions of error penalize for extra parameters (Hastie et al. 2009). Variable reduction methods such as the Lasso, described above, generally help improve the generalization error by penalizing large numbers of parameters, but often require choice of a penalty parameter, such as λ, and therefore also need a method for choosing between different models (i.e. which value of λ to use). In addition, when the number of variables p is large, there are a large number of different possible models to consider simply by choosing different variables. Even if the model is restricted a priori to have a set number of variables, the large number of such choices means that by chance there is likely to be a set of variables that has low training error on the observed data.

There are two commonly used methods to evaluate the generalization error in genomic studies (see Chapter 2). The simplest method is the hold-out method, where the observations are divided in two: one portion (the training set) is used to fit the model, and the performance of the model is evaluated on the unused portion (the test set) as a final step. A problem with this approach is that it means that much less data is used to fit the model, and in many settings there may not be enough data to have reasonably sized test and training sets. Another problem with this approach is that the estimates of error on the test set are only valid if it is done
as a final step. If the test data is used iteratively – for example if the first model has poor performance on the test data, and so a new approach is tried – then error rates for the new approach based on the test data are no longer valid estimates of generalization error because the test data have become part of the model building procedure. For this reason, relying on a test and training set is generally inflexible in the setting of genomic studies. A more robust method is found in the resampling method known as cross-validation (CV), the most common being K-fold CV. In this case, the data are divided into K portions, each subsample is successively withheld and the model fit on the remaining K − 1 portions, and the model is evaluated on the withheld subsample of data. The average error across the K subsamples is the estimate of generalization error. The cross-validation procedure makes clear that the generalization error of a model involves the random process of fitting the model, and thus all of the steps involved in fitting the model must be included. For example, if a dimension reduction step is used to pick a smaller set of genes before fitting a model, the CV procedure must include the variable reduction step as well (so that the model validated on each CV subsample will consist of different sets of variables each time). Failure to include the entire procedure in the CV can result in severe underestimates of the generalization error (Simon et al. 2003; Tibshirani 2005; Hastie et al. 2009). CV is also used to choose between models or regularization parameters by calculating the prediction error on several models or parameters, and choosing the model with the smallest estimate of generalization error. It is important to note that while this is a standard method for picking a model, the generalization error of the best model, as estimated by the CV used to choose between the models, will be itself an underestimate of generalization error of the best model. Instead a double CV should be performed to get accurate estimates of prediction error (Dudoit and Fridlyand 2005; Varma and Simon 2006), for example to determine how well the model can separate diseased from healthy patients. Finally, the CV estimate of generalization error is just an estimate, and as such can be highly variable with small numbers of samples (Braga-Neto and Dougherty 2004; Isaksson et al. 2008).
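The point that every data-dependent step must sit inside the cross-validation loop can be made concrete with a pipeline; the sketch below, which assumes scikit-learn and a continuous response y, performs a double CV in which an initial gene-filtering step and the choice of the Lasso penalty are refitted within each outer fold. The function name, the filter statistic and the number of retained genes are illustrative choices.

```python
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

def double_cv_error(Z, y, n_genes=200, outer_folds=5, seed=0):
    """Estimate generalization error with the entire model-building
    procedure (gene filtering + penalty selection) repeated inside each
    outer cross-validation fold, avoiding the selection bias described above."""
    model = make_pipeline(SelectKBest(f_regression, k=n_genes),
                          LassoCV(cv=5, random_state=seed))
    outer = KFold(n_splits=outer_folds, shuffle=True, random_state=seed)
    scores = cross_val_score(model, Z, y, cv=outer,
                             scoring="neg_mean_squared_error")
    return -scores.mean()
```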
References Ai-Jun Y and Xin-Yuan S 2010 Bayesian variable selection for disease classification using gene expression data. Bioinformatics 26, 215–222. Alizadeh AA, Eisen MB, Davis RE et al. 2000 Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature 403(6769), 503–511. Anders S and Huber W 2010 Differential expression analysis for sequence count data. Genome Biology 11, R106. Baldi P and Long AD 2001 A bayesian framework for the analysis of microarray expression data: regularized t -test and statistical inferences of gene changes. Bioinformatics (Oxford, England) 17(6), 509–19. Bar-Joseph Z 2004 Analyzing time series gene expression data. Bioinformatics 20(16), 2493. Benjamini Y and Hochberg Y 1995 Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57(1), 289–300. Benjamini Y and Yekutieli D 2001 The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics 29(4), 1165–1188. Bild A, Yao G, Chang J, Wang Q et al. 2005 Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 439(7074), 353–357. Bolstad B, Irizarry R, Astrand M and Speed T 2003 A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185. Bona FD, Ossowski S, Schneeberger K and Ratsch G 2008 Optimal spliced alignments of short sequence reads. Bioinformatics 24(16), i174. Bourgon R, Gentleman R and Huber W 2010a Independent filtering increases detection power for high-throughput experiments. Proceedings of the National Academy of Sciences of the United States of America 107(21), 9546–9551.
Bourgon R, Gentleman R and Huber W 2010b Reply to Talloen et al.: Independent filtering is a generic approach that needs domain specific adaptation. Proceedings of the National Academy of Sciences of the United States of America 107(46), E175. Braga-Neto U and Dougherty E 2004 Is cross-validation valid for small-sample microarray classification? Bioinformatics 20(3), 374. Brown PJ, Vannucci M and Fearn T 2002 Bayes model averaging with selection of regressors. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64, 519–536. Brunet J, Tamayo P, Golub T and Mesirov J 2004 Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences of the United States of America 101(12), 4164–4169. Bryant D, Shen R, Priest H et al. 2010 Supersplat–spliced RNA-seq alignment. Bioinformatics 26(12), 1500. Bullard JH, Purdom E, Hansen KD and Dudoit S 2010 Evaluation of statistical methods for normalization and differential expression in mrna-seq experiments. BMC Bioinformatics 11, 94. Chen A and Bickel P 2006 Efficient independent component analysis. The Annals of Statistics 34(6), 2825–2855. Cox DR 1972 Regression models and life tables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 34(2), 187–220. Datta S and Datta S 2005 Empirical bayes screening of many p-values with applications to microarray studies. Bioinformatics 21(9), 1987. Denison D, Holmes C, Mallik B and Smith A 2002 Bayesian Methods for Nonlinear Classification and Regression. John Wiley & Sons, Ltd. Diggle PJ 1990 Time Series: a Biostatistical Introduction. Oxford University Press. Dohm JC, Lottaz C, Borodina T and Himmelbauer H 2008 Substantial biases in ultra-short read data sets from highthroughput DNA sequencing. Nucleic Acids Research 36(16), e105. Dudoit S and Fridlyand J 2002 A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3(7), RESEARCH0036. Dudoit S and Fridlyand J 2003 Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19(9), 1090– 1099. Dudoit S and Fridlyand J 2005 Classification in microarray experiments. In Statistical Analysis of Gene Expression Microarray Data (ed. Speed TP). Chapman & Hall, pp. 93–158. Dudoit S and van der Laan MJ 2008 Multiple Testing Procedures with Applications to Genomics. Springer-Verlag. Dudoit S, Yang Y, Callow M and Speed T 2002 Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12(1), 111–140. Dudoit S, Shaffer J and Boldrick J 2003 Multiple hypothesis testing in microarray experiments. Statistical Science 18(1), 71–103. Efron B 2008 Microarrays, empirical bayes and the two-groups model. Statistical Science 23(1), 1–22. Efron B and Tibshirani R 2002 Empirical bayes methods and false discovery rates for microarrays. Genetic Epidemiology 23(1), 70–86. Efron B and Tibshirani R 2007 On testing the significance of sets of genes. Annals of Applied Statistics 1(1), 107–129. Efron B, Tibshirani R, Storey J and Tusher V 2001 Empirical bayes analysis of a microarray experiment. Journal of the American Statistical Association 96(456), 1151–1160. Efron B, Hastie T, Johnstone I and Tibshirani R 2004 Least angle regression. Annals of Statistics 32(2), 407–451. Eisen MB, Spellman PT, Brown PO and Botstein D 1998 Cluster analysis and display of genome-wide expression patterns. 
Proceedings of the National Academy of Sciences of the United States of America 95(25), 14863–14868. Faulkner GJ, Forrest ARR, Chalk AM et al. 2008 A rescue strategy for multimapping short sequence tags refines surveys of transcriptional activity by cage. Genomics 91(3), 281–288. Frey B and Dueck D 2007 Clustering by passing messages between data points. Science 315(5814), 972–976. Friendly M 2009 The history of the cluster heat map. The American Statistician 63(2), 179–184. Futcher B, Latter GI, Monardo P et al. 1999 A sampling of the yeast proteome. Molecular and Cell Biology 19(11), 7357–7368. George EI and McCulloch RE 1993 Variable selection via Gibbs sampling. Journal of the American Statistical Association 88, 881–889.
Transcriptomic Technologies and Statistical Data Analysis 159 Goodman S and Greenland S 2007 Why most published research findings are false: problems in the analysis. PLoS Medicine 4(4), e168. Greenbaum D, Colangelo C, Williams K and Gerstein M 2003 Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biology 4(9), 117. Gygi SP, Rochon Y, Franza BR and Aebersold R 1999 Correlation between protein and mRNA abundance in yeast. Molecular and Cell Biology 19(3), 1720–1730. Hansen KD, Brenner SE and Dudoit S 2010 Biases in illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research 38(12), e131. Hastie T, Tibshirani R and Friedman J 2009 The Elements of Statistical Learning, 2nd edn. Springer-Verlag. Heard N, Holmes C and Stephens D 2006 A quantitative study of gene regulation involved in the immune response of anopheline mosquitoes: An application of bayesian hierarchical clustering of curves. Journal of the American Statistical Association 101(473), 18–29. Heller MJ 2002 DNA microarray technology: devices, systems, and applications. Annual Review of Biomedical Engineering 4, 129–53. Hesterberg T, Choi N, Meier L and Fraley C 2008 Least angle and l1 penalized regression: a review. Statistics Surveys 2(2008), 61–93. Hobman J, Jones A and Constantinidou C 2007 Introduction to microarray technology. In Microarray Technology Through Applications (eds Falciani F and Falciani F). Taylor and Francis, pp. 1–51. Hoen PT, Turk R, Boer J, Sterrenburg E, Menezes RD, Ommen GJV and Dunnen JD 2004 Intensity-based analysis of two-colour microarrays enables efficient and flexible hybridization designs. Nucleic Acids Research 32(4), e41. Ibrahim J, Chen MH and Gray R 2002 Bayesian models for gene expression with dna microarray data. Journal of the American Statistical Association 97(457), 88–99. Ioannidis JPA 2005 Why most published research findings are false. PLoS Medicine 2(8), e124. Irizarry RA, Hobbs B, Collin F et al. 2003 Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2), 249–264. Irizarry RA, Wang C, Zhou Y and Speed TP 2009 Gene set enrichment analysis made simple. Statistical Methods in Medical Research 18(6), 565–575. Isaksson A, Wallman M, Goransson H and Gustafsson M 2008 Cross-validation and bootstrapping are unreliable in small sample classification. Pattern Recognition Letters 29, 1960–1965. Jiang H and Wong WH 2009 Statistical inferences for isoform expression in RNA-seq. Bioinformatics 25(8), 1026–1032. Jiang Z and Gentleman R 2007 Extensions to gene set enrichment. Bioinformatics 23(3), 306–313. Kerr MK and Churchill GA 2001 Statistical design and the analysis of gene expression microarray data. Genetical Research 77(02), 123–128. Kettenring J 2006 The practice of cluster analysis. Journal of Classification 23(1), 3–30. Kiddle S, Windram O, McHattie S et al. 2010 Temporal clustering by affinity propagation reveals transcriptional modules in Arabidopsis thaliana. Bioinformatics 26(3), 355–362. Lazzeroni L and Owen A 2002 Plaid models for gene expression data. Statistica Sinica 12(1), 61–86. Lee S and Batzoglou S 2003 Application of independent component analysis to microarrays. Genome Biology 4(11), R76. Leone M, Sumedha and Weigt M 2007 Clustering by soft-constraint affinity propagation: applications to gene-expression data. Bioinformatics 23(20), 2708–2715. Li B, Ruotti V, Stewart RM et al. 2010a RNA-seq gene expression estimation with read mapping uncertainty. 
Bioinformatics 26(4), 493. Li C and Wong WH 2001 Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proceedings of the National Academy of Science of the United States of America 98(1), 31–36. Li H and Homer N 2010 A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics 11(5), 473–483. Li H, Ruan J and Durbin R 2008 Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 18(11), 1851–1858. Li J, Jiang H and Wong WH 2010b Modeling non-uniformity in short-read rates in RNA-seq data. Genome Biology 11(5), R50. L¨onnstedt I and Speed T 2002 Replicated microarray data. Statistica Sinica 12(1), 31–46.
Lu P, Vogel C, Wang R et al. 2007 Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nature Biotechnology 25(1), 117–124. Luan Y and Li H 2004 Model-based methods for identifying periodically expressed genes based on time course microarray gene expression data. Bioinformatics 20(3), 332. Madeira S and Oliveira A 2004 Biclustering algorithms for biological data analysis: a survey. IEEE Transactions on Computational Biology and Bioinformatics 24–45. McCarthy D and Smyth G 2009 Testing significance relative to a fold-change threshold is a treat. Bioinformatics 25(6), 765. McCullagh P and Nelder J 1989 Generalized Linear Models. Chapman & Hall. Meier L, Van De Geer S and B¨uhlmann P 2008 The group Lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(1), 53–71. Monti S, Tamayo P, Mesirov J and Golub T 2003 Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning 52(1), 91–118. Mootha V, Lindgren C, Eriksson K et al. 2003 PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics 34(3), 267–273. Mortazavi A, Williams BA, McCue K et al. 2008 Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5(7), 621. Mukherjee S, Pelech S, Neve R et al. 2009 Sparse combinatorial inference with an application in cancer biology. Bioinformatics 25(2), 265–271. Newton M and Kendziorski C 2003 Parametric empirical bayes methods for microarrays. In The Analysis of Gene Expression Data Methods and Software (eds Parmigiani G, Garrett E, Irizarry R and Zeger S). Springer, pp. 254–271. Newton MA, Kendziorski CM, Richmond CS et al. 2001 On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology 8(1), 37–52. Nie L, Wu G and Zhang W 2006 Correlation of mRNA expression and protein abundance affected by multiple sequence features related to translational efficiency in desulfovibrio vulgaris: a quantitative analysis. Genetics 174(4), 2229. Nott DJ and Green PJ 2004 Bayesian variable selection and the Swendsen-Wang algorithm. Journal of Computational and Graphical Statistics 13, 141–157. Osborne M, Presnell B and Turlach B 2000 On the Lasso and its dual. Journal of Computational and Graphical Statistics 9(2), 319–337. Patterson TA, Lobenhofer EK, Fulmer-Smentek SB et al. 2006 Performance comparison of one-color and two-color platforms within the microarray quality control (maqc) project. Nature Biotechnology 24(9), 1140. Pawitan Y 2001 In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press. Pepke S, Wold B and Mortazavi A 2009 Computation for chip-seq and RNA-seq studies. Nature Methods 6, S22. Reiner A, Yekutieli D and Benjamini Y 2003 Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 19(3), 368. Richard H, Schulz M, Sultan M et al. 2010 Prediction of alternative isoforms from exon expression levels in RNA-seq experiments. Nucleic Acids Research 38(10), e112. Robinson M and Oshlack A 2010 A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11(3), R25–R25. Robinson M and Smyth G 2007 Moderated statistical tests for assessing differences in tag abundance. 
Bioinformatics 23(21), 2881. Rousseeuw P 1987 Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65. Russell S, Meadows L and Russell R 2009 Microarray Technology in Practice. Elsevier. Salzman J, Jiang H and Wong WH 2010 Statistical modeling of RNA-SEQ data. Technical Report BIO-252, Division of Biostatistics, Stanford University. Schena M, Shalon D, Davis RW and Brown PO 1995 Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270(5235), 467–470. Schumacher M, Binder H and Gerds T 2007 Assessment of survival prediction models based on microarray data. Bioinformatics 23(14), 1768.
Transcriptomic Technologies and Statistical Data Analysis 161 Segal M, Dahlquist K and Conklin B 2003 Regression approaches for microarray data analysis. Journal of Computational Biology 10(6), 961–980. Simon R, Radmacher M, Dobbin K and McShane L 2003 Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute 95(1), 14. Smyth G 2004a Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3(1), 1–25. Smyth GK 2004b Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3, Article3. Smyth GK 2005 Limma: linear models for microarray data. In Bioinformatics and Computational Biology Solutions using R and Bioconductor (eds Gentleman R, Carey V, Dudoit S and Irizarry RA) Springer, pp. 397–420. Spellman P, Sherlock G, Zhang M et al. 1998 Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9(12), 3273. Steinhoff C and Vingron M 2006 Normalization and quantification of differential expression in gene expression microarrays. Brief Bioinformatics 7(2), 166. Storey J 2002 A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64(3), 479–498. Storey J 2003 The positive false discovery rate: a bayesian interpretation and the q-value. The Annals of Statistics 31(6), 2013–2035. Storey JD and Tibshirani R 2003 Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America 100(16), 9440–9445. Storey JD, Xiao W, Leek JT et al. 2005 Significance analysis of time course microarray experiments. Proceedings of the National Academy of Sciences of the United States of America 102(36), 12837–12842. Subramanian A, Tamayo P, Mootha V et al. 2005 Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102(43), 15545. Tai Y and Speed T 2006 A multivariate empirical bayes statistic for replicated microarray time course data. The Annals of Statistics 34(5), 2387–2412. Tai YC and Speed TP 2005 Statistical analysis of microarray time course data. In DNA Microarrays (ed. Nuber U). Taylor and Francis, pp. 257–280. Talloen W, Hochreiter S, Bijnens L et al. 2010 Filtering data from high-throughput experiments based on measurement reliability. Proceedings of the National Academy of Sciences of the United States of America 107(46), E173–E174. Tibshirani R 1996 Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 58(1), 267–288. Tibshirani R 2005 Immune signatures in follicular lymphoma. New England Journal of Medicine 352(14), 1496–1497; author reply 1496–1497. Tibshirani R, Walther G and Hastie T 2001 Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(2), 411–423. Tipping M and Bishop C 1999 Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61(3), 611–622. 
Trapnell C, Pachter L and Salzberg S 2009 Tophat: discovering splice junctions with RNA-seq. Bioinformatics 25(9), 1105. Trapnell C, Williams BA, Pertea G et al. 2010 Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology 28(5), 511. Tukey J 1991 The philosophy of multiple comparisons. Statistical Science 6(1), 100–116. van Iterson M, Boer J and Menezes R 2010 Filtering, fdr and power. BMC Bioinformatics 11(1), 450. Varma S and Simon R 2006 Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 7, 91. Yang YH and Speed T 2002 Design issues for cDNA microarray experiments. Nature Reviews Genetics 3(8), 579. Yekutieli D 2008 Hierarchical false discovery rate–controlling methodology. Journal of the American Statistical Association 103, 309–316.
Yekutieli D, Reiner Benaim A, Benjamini Y et al. 2006 Approaches to multiplicity issues in complex research in microarray analysis. Statistica Neerlandica 60(4), 414–437. Young MD, Wakefield MJ, Smyth GK and Oshlack A 2010 Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology 11(2), R14. Yuan M and Lin Y 2006 Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(1), 49–67. Zou H and Hastie T 2005 Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2), 301–320. Zou H, Hastie T and Tibshirani R 2006 Sparse principal component analysis.Journal of Computational and Graphical Statistics 15(2), 265–286.
8 Statistical Data Analysis in Metabolomics Timothy M. D. Ebbels1 and Maria De Iorio2
1 Department of Surgery and Cancer, Imperial College, London, UK
2 Department of Epidemiology and Biostatistics, Imperial College, London, UK

8.1 Introduction
Metabolomics, also known as metabonomics or metabolic profiling [1, 2], is the study of the complement of small molecule metabolites present in cells, biofluids and tissues. Metabolites (biomolecules with masses below about 1500 Da) are the key molecules that sustain life, providing the energy and raw materials of the cell. As such, any disregulation or malfunction of the system, due for example to disease or stress, will usually cause changes in metabolite levels, which can be monitored by spectroscopic techniques. Metabolites form one component of the multilevel-systems approach to biology: they interact dynamically with macromolecules in a host of other levels and multiple spatial scales. For example, metabolite levels both regulate and are regulated by genes through the transcriptional and protein levels of biomolecular organisation, and in some cases may be instrumental in cross-species interactions, for example in symbiosis or parasitism. In contrast to proteins and nucleic acids, the metabolic profile is directly sensitive to the environment of the system, for example being strongly influenced by drugs and diet. In metabolomics, the aim is to obtain a global profile of metabolite levels with as little bias as possible, which is the so-called untargeted approach. This poses a severe challenge to analytical chemical procedures that are traditionally optimised for specific target molecules or chemical classes. In order to capture the widest set of metabolites, multiple chemical analytical technologies must be employed. Each technology results in large and complex datasets requiring extensive processing and statistical modelling to draw biological inferences. In this chapter we present some of the most widely used statistical techniques for analysing metabolomic data. We highlight the challenges, including those still to be addressed, and begin by describing the technologies used and how they influence the statistical characteristics of the data. In the first part of the chapter, we discuss the range of analytical tools used to acquire metabolic profiles and the associated characteristics of the data. In the second part, we focus on statistical approaches that can be applied independently of the analytical
platform. See Chapter 24 for a more in-depth discussion of the problems of metabolite identification arising from mass spectrometry platforms.
8.2 Analytical technologies and data characteristics

8.2.1 Analytical technologies
Nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry (MS) are the most commonly used analytical technologies in metabolomics. This is due to their ability to assay the vast diversity of metabolites, while simultaneously yielding characteristic structural information for each molecule which leads to a very high degree of specificity. They are also highly reproducible, sensitive and cost effective on a per-sample basis. Due to the complicating effects of ionisation suppression, MS is usually preceded by a chromatography step which physically separates different metabolites prior to mass analysis. Both liquid and gas chromatography (LC and GC) are widely used. Each technology has its own strengths and weaknesses with regard to the types of molecules which can be detected, so that a comprehensive metabolome analysis requires multiple analytical technologies. The sensitivity of analytical techniques is always limited, meaning that many metabolites will lie below the limit of detection. This contributes to the fact that the total number of detected analytes is unknown and will vary between different biological conditions. NMR exploits the fact that nuclei having a magnetic moment (such as the proton, 1 H) will resonate when subject to a radio frequency pulse. The resonant frequency of each nucleus depends on its chemical environment, i.e. its position in the chemical structure of the metabolite. Thus the pattern of resonances is characteristic and can reveal the identity of each metabolite. The frequency position of a peak in NMR is called its chemical shift and peaks are further split into patterns termed multiplets by interactions between the spins of nearby nuclei. Furthermore, the integrated intensity of each peak or multiplet is directly proportional to the number of nuclei observed, or, assuming a constant volume of sample, the metabolite concentration. Several different magnetic nuclei are commonly used for biological applications including 1 H, 13 C, 31 P, 19 F and 15 N but 1 H is by far the most widely used due to its high NMR sensitivity and its ubiquity amongst biomolecules. Figure 8.1(a) shows a typical 1-D 1 H NMR spectrum of normal rat urine, obtained in a few minutes on a modern spectrometer. It is immediately obvious that there are a large number of resonances, many of them overlapping. One way that peak overlap can be partially alleviated is to conduct 2-D NMR experiments which disperse peaks along two chemical shift axes. 2-D experiments take longer to acquire, are generally less sensitive, may not be as quantitative and require more involved preprocessing than their 1-D counterparts. For these reasons they have not yet been as widely used for high throughput profiling applications. Therefore, in this chapter we concentrate on 1-D spectra. In MS, metabolites are first ionised and then differentiated based on their mass to charge ratio (m/z). Methods typically used in metabolomics, such as electrospray ionisation (ESI), usually result in a single charge for each molecule so that m/z can be effectively read as the mass of the molecule. There are a plethora of mass analysis methods available, but those most common in untargeted metabolomics include time of flight (ToF), Orbitrap and Fourier transform ion cyclotron resonance (FTICR) techniques. ToF-MS accelerates ions to a constant energy and weighs them by measuring the time they take to traverse a known distance; lighter molecules travel faster and have lower flight times than heavier molecules. 
In contrast, Orbitrap and FTICR instruments use an electric or magnetic field to induce ions into circular motion, the frequency of which is inversely related to the m/z. Both types of instrument have high mass accuracy (a few ppm for ToF, around 1-2 ppm for Orbitrap and sub ppm for FTICR). Since the number of combinations of common elements is limited, and each combination has a different mass, this allows elemental composition of unknown molecules to be narrowed down with high confidence. The mass analyser may be preceded by a quadrupole stage which can be used to induce fragmentation of ions by collisions with an inert gas. The pattern of fragmentation is characteristic of the molecular structure analogous to the pattern of resonances in NMR. Quadrupoles can also be used, typically
Figure 8.1 (a) A 1-D 1 H NMR spectrum of rat urine. Various metabolite resonances are indicated and the left half of the spectrum has been magnified in intensity relative to the right half to show detail. NMND, N-methyl nicotinamide; NMNA, N-methyl nicotinic acid; PAG, phenyl acetyl glycine; 2-OG, 2-oxoglutarate. (b) Portion of a 2-D LC-MS spectrum of human urine showing the complex array of peaks
in triple-quadrupole (QQQ) instruments, to perform highly sensitive and accurate quantification for targeted metabolomics applications. Ultimately, mass spectrometers are ion counting devices, and the ion count is in principle proportional to the number of molecules being ionised at the source, and thus the metabolite concentration. As mentioned above, MS is typically preceded by a chromatographic separation step. The idea is to physically separate different metabolites and avoid more than one species competing for ion current at the same time. This effect, called ionisation suppression, can destroy the otherwise linear relationship between ion count and molecular concentration. In chromatography, the analyte mixture is carried by a mobile phase (liquid or gas) through a column packed with small solid particles called the stationary phase. The chemical composition of both phases strongly influences the types of metabolites that are best separated and thus a comprehensive
analysis requires multiple phases and column types. GC-MS requires that metabolites be volatilised prior to analysis which can be achieved through chemical derivatisation. However, for most metabolites, chromatography using a liquid mobile phase requires less sample preparation. From the statistical point of view, both approaches are similar and we will concentrate on LC-MS data in this chapter. Figure 8.1(b) shows part of a typical LC-MS spectrum of human urine plotted as a 2-D surface with chromatographic retention time (RT) and m/z axes. A large number of peaks can be seen, revealing the inherent high sensitivity and resolution of the technique. In comparison with LC- and GC-MS, NMR spectroscopy provides a more uniform global profile of metabolites, has higher reproducibility, provides more direct structural information and requires less sample preparation. However, it is several orders of magnitude less sensitive in standard profiling modes and MS approaches are required to cover the majority of the metabolome. Thus both techniques are often used in parallel to provide a multifaceted view of the metabolic status of the system under study.
8.2.2 Preprocessing

8.2.2.1 NMR
In NMR, the raw data are obtained as an exponentially decaying complex sinusoid signal, known as the free induction decay (FID). This is usually multiplied by a weighting function, known as an apodisation function, before being Fourier transformed into a frequency domain complex spectrum. The idea of apodisation is to improve the signal-to-noise ratio of the data, for example by suppressing noisy regions at the end of the FID using an exponential weighting function. However, this improvement in signal-to-noise must be balanced against a consequent reduction in resolution of the resultant spectrum. In the ideal case, all NMR peaks correspond to complex Lorentzians with their real part corresponding to the standard real Lorentzian (or Cauchy) function and are said to be in absorption mode. In practice, small time delays at the start of data acquisition and other effects can lead to differences in phase between resonances across the spectrum, and thus a phase correction must be applied. A linear phase correction is usually sufficient to bring all resonances into absorption mode. Next the chemical shift scale is calibrated by setting the shift of a reference resonance to a known value. In metabolomics a commonly used internal reference is trimethylsilyl propanoic acid (TSP) but any peak with a known chemical shift may be used. In some spectra, the baseline can be distorted due to instrumental effects which are irrelevant to the purpose of the study. Although large distortions are not common with modern spectrometers and acquisition protocols, they do occasionally occur and must be corrected. Numerous algorithms are available, for example, fitting low order polynomials to regions known to be free of resonances. All the above are considered ‘standard’ preprocessing steps which would be applied to any NMR spectrum before it could be analysed for the molecular information contained. For an excellent guide to NMR data processing see Hoch and Stern [3]. There are now several software solutions, both from manufacturers and also freeware, which are able to implement these operations in an automatic way, such that hundreds or thousands of 1-D NMR spectra can be processed in a few minutes. Once the spectra have been preprocessed the analyst is faced with the problem of how to make them amenable to statistical analysis. There are currently two contrasting approaches. In the first, known metabolites are quantified using peak fitting approaches, resulting in a data table of the concentration of each metabolite in each sample. Statistical models may then be built and directly interpreted in terms of changes in concentration of these known metabolites. This approach suffers two major disadvantages. First, it is currently not possible to automatically and accurately quantify metabolites from spectra of diverse samples, and secondly, since only known compounds are fitted, new biomarkers can never be discovered in this way. The difficulty of automatic quantification is a key bottleneck and is due to a number of factors. In 1-D 1 H NMR there is a high degree of overlap between peaks from different metabolites. Thus any quantification algorithm needs to
Figure 8.2 Peak shift and overlap in NMR spectra. (a) Multiple spectra overlaid. (b) Peaks from several metabolites that are heavily overlapped in 1-D spectra
solve a complex deconvolution problem. However, in some regions the overlap is so great that many different combinations of peaks can adequately fit the spectral profile and thus the problem is unidentifiable. The challenge is compounded by small but significant shifts in position of individual peaks due to varying pH or ionic concentration between samples, leading to varying degrees of overlap in different samples. Figure 8.2(a) plots an overlay of NMR spectra from different samples, showing how peaks, known to be from the same metabolite by their multiplet patterns, may shift from spectrum to spectrum. This significantly confounds the already difficult problem of peak overlap illustrated in Figure 8.2(b). Further sources of difficulty include the high dynamic range within a single spectrum (e.g. 103 –105 ), differential relaxation leading to different peak area–concentration relationships for each peak, chemical noise and other effects of the complex biological matrix. Nonetheless, several approaches to metabolite quantification in complex spectra have been proposed [4, 5], though none is completely automated or foolproof. An alternative to peak quantification is to apply statistical modelling to the spectra themselves. In this case, each column of the data table corresponds to the spectral intensity at a known chemical shift. If the statistical model identifies features of interest, these must be interpreted by the spectroscopist, with the object of associating them with a known metabolite. This approach has the advantage that all spectral information is used in the statistical modelling process and thus it is possible to discover new biomarkers this way. It also means that statistical modelling may proceed rapidly once the spectra are acquired, without the validation and manual inspection often required for peak fitting. This option does, however, require more specialist spectroscopic knowledge in the interpretation of statistical models and this limits its utility to the wider bioinformatics and biostatistics community. It is also affected negatively by peak shifts which can cause apparent changes in intensity even when there is no change in metabolite concentration. The latter can be mitigated to some degree by binning the spectra into adjacent regions of integrated intensity, although this approach is no longer widely used because of the concomitant loss of resolution. A more sophisticated solution is to use peak alignment algorithms that attempt to align all peaks from the same compound across a batch of samples. There are many such algorithms [6, 7] and although they can be very effective, mismatching peaks and/or alignment artifacts do arise, and thus a degree of manual validation is required. When peak fitting is not used, one must also remove various other artifacts which may occur in the spectrum, for example residuals from an imperfectly suppressed water resonance, or internal standard peaks, which should not be included in statistical modelling.
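The 'standard' preprocessing steps described above (apodisation, Fourier transformation and phase correction) can be sketched for a single 1-D FID as follows; the code assumes NumPy, a complex-valued FID array and a known dwell time, and the function name, the zero filling and the zero-order-only phase correction are simplifying assumptions rather than a full processing pipeline.

```python
import numpy as np

def process_fid(fid, dwell_time, lb=1.0, phi0=0.0):
    """Minimal 1-D NMR preprocessing: exponential apodisation (line
    broadening lb in Hz), zero filling to twice the length, Fourier
    transformation and zero-order phase correction. Returns the real
    (absorption-mode) spectrum."""
    n = fid.size
    t = np.arange(n) * dwell_time
    apodised = fid * np.exp(-np.pi * lb * t)
    zero_filled = np.concatenate([apodised, np.zeros(n)])
    spectrum = np.fft.fftshift(np.fft.fft(zero_filled))
    return (spectrum * np.exp(1j * phi0)).real

# Example with a synthetic decaying sinusoid standing in for one resonance
# dt = 1e-4
# t = np.arange(4096) * dt
# fid = np.exp(2j * np.pi * 200 * t - t / 0.3)
# spec = process_fid(fid, dt, lb=0.5)
```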
8.2.2.2 LC-MS
Raw LC-MS data are usually recorded as a list of RT, m/z, ion count triples. The data are thus considered to be 2-D (two independent coordinates) and usually appear in RT order, one mass spectral scan at a time. Depending on the instrument and acquisition parameters, files may be very large (up to several hundred Mb per sample). It is important to note that the instrument software usually performs a degree of preprocessing before the data are made available to the analyst. The most common operation is centroiding, in which peaks are detected, deconvolved, and summarised by a single ion count at the centroid of the m/z peak in the mass dimension. Thus, for centroided data, a single ion chromatographic peak may correspond to several list entries, one for each mass spectral scan. Another feature is that LC-MS peaks are asymmetric, because the mass spectrometer has a much higher resolution than the chromatography. This is illustrated in Figure 8.1(b) where it can be seen that the peaks have a much larger width in the RT dimension than the m/z dimension. Unlike in NMR, LC-MS data are not usually analysed in this ‘raw’ format, owing to their greater complexity and the presence of many confounding effects, partly resulting from the lower reproducibility of the chromatography. Thus the data are further processed including the following steps: peak detection, deconvolution, retention time alignment, and peak integration. In addition to fragment ions mentioned above, a single metabolite may give rise to multiple ions due to the presence of isotopologues (molecules of a single chemical compound containing different numbers of heavy isotopes), adducts (ions formed by the addition of different charged species, e.g. [M+H]+ , [M+Na]+ where M is the metabolite of interest), or dimers (ions formed from two molecules of the metabolite). In GC-MS, multiple peaks may also be observed due to incomplete derivatisation, multiple derivatives of the same molecule or by-product formation. Ideally, these peaks would be grouped to the same metabolite feature by preprocessing software, but this is usually not the case. It is also usually necessary to deconvolve peaks that overlap in the RT dimension (co-eluting metabolites). One of the most difficult challenges in LC-MS data processing is that of RT alignment and peak matching. Peaks from the same metabolite may shift in the RT dimension over the course of, and between, analytical runs, and these must be matched together for the final data table. Many algorithms have been proposed [8–10] and, although there is much room for improvement, the results are usually good enough to be useful for further analysis. Finally, the intensity of each peak must be estimated, usually by a simple integration of the ion count across the RT dimension, possibly with additional noise reduction, smoothing and/or baseline correction steps.
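As a toy illustration of the peak-matching step, the sketch below matches features between two runs using fixed m/z and RT tolerances; real alignment algorithms [8–10] are considerably more sophisticated. NumPy is assumed, feature lists are taken to be arrays of (m/z, RT) pairs, and the function name and tolerance values are arbitrary.

```python
import numpy as np

def match_features(ref, other, mz_tol=0.01, rt_tol=0.5):
    """Match each feature in `other` to the nearest reference feature lying
    within both the m/z and RT tolerances; unmatched features receive -1.
    ref, other: arrays of shape (k, 2) with columns (m/z, RT)."""
    matches = np.full(len(other), -1)
    for i, (mz, rt) in enumerate(other):
        dmz = np.abs(ref[:, 0] - mz)
        drt = np.abs(ref[:, 1] - rt)
        ok = (dmz < mz_tol) & (drt < rt_tol)
        if ok.any():
            dist = (dmz / mz_tol) ** 2 + (drt / rt_tol) ** 2
            dist[~ok] = np.inf
            matches[i] = int(np.argmin(dist))
    return matches
```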
8.2.2.3 Platform independent issues
One major data processing issue which applies to any data type is that of normalisation. The purpose of normalisation is to make profiles comparable with each other by removing variation that is not of interest, typically corresponding to uncontrolled experimental factors. For example, variation in the weight of tissue, or the overall concentration of urine, is hard to control and this will strongly influence the overall signal obtained in both NMR and MS. Other factors such as variable instrument response and calibration may also introduce unwanted variation. Often these effects modify the intensities by a constant factor for each sample and thus may be compensated for by applying a single multiplicative normalisation factor to each sample profile. There are numerous algorithms for calculation of the normalisation factor, including those based on physical parameters (e.g. normalisation to the known concentration of an internal standard compound, or to sample dry weight) and purely statistical methods (e.g. normalisation to total integrated intensity, or median fold change [11]). The most appropriate normalisation technique will depend on the goal of the analysis, for example whether individual relative concentrations are required or if multivariate pattern recognition techniques are to be employed.
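A minimal sketch of two common normalisation schemes is given below (Python/NumPy); the median fold change variant follows the general idea of a probabilistic quotient against a median reference profile, though details differ between implementations [11].

```python
import numpy as np

def total_intensity_normalise(X):
    """Scale each row (sample) so that its summed intensity equals 1."""
    return X / X.sum(axis=1, keepdims=True)

def median_fold_change_normalise(X, eps=1e-12):
    """Divide each sample by the median of its fold changes relative to a
    reference profile (here the median spectrum across all samples)."""
    reference = np.median(X, axis=0)
    quotients = (X + eps) / (reference + eps)
    factors = np.median(quotients, axis=1, keepdims=True)
    return X / factors

# Synthetic example: samples are dilutions of a common profile plus noise
rng = np.random.default_rng(0)
profile = rng.lognormal(sigma=1.0, size=100)
dilution = np.array([3.0, 1.0, 0.5, 1.5, 0.8])
X = dilution[:, None] * profile * rng.lognormal(sigma=0.05, size=(5, 100))
Xn = median_fold_change_normalise(X)
print(np.round(Xn.sum(axis=1) / Xn.sum(axis=1).mean(), 2))  # roughly equal totals
```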
A final step before proceeding to statistical analysis is to consider scaling or transformation of the data. Metabolomic data often show heteroscedasticity, in that the variance changes with the intensity of the measurement. This applies to both analytical measurement error and also biological variation. For example, NMR peaks typically exhibit a variance that is proportional to their intensity, both in technical and biological replicates [12]. Log transformation leads to variables with constant variance, but can be problematic when variables may have zero or negative values due to finite detection limits and the effects of noise. In such cases, the generalised log transform [13], which does not diverge for zero or negative values, can be used. Other commonly used transforms are unit variance (each variable is divided by its standard deviation) and Pareto (each variable is divided by the square root of its standard deviation).
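The transformations above can be sketched as follows (Python/NumPy); the offset parameter of the generalised log transform is data dependent, and the value used here is purely illustrative.

```python
import numpy as np

def glog(X, lam=1.0):
    """Generalised log transform; remains defined for zero or negative
    values (lam is data dependent and purely illustrative here)."""
    return np.log((X + np.sqrt(X ** 2 + lam)) / 2.0)

def unit_variance_scale(X):
    """Mean-centre each column and divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pareto_scale(X):
    """Mean-centre each column and divide by the square root of its
    standard deviation."""
    return (X - X.mean(axis=0)) / np.sqrt(X.std(axis=0, ddof=1))

rng = np.random.default_rng(1)
X = rng.lognormal(size=(10, 50))
print(glog(X).shape, np.round(unit_variance_scale(X).std(axis=0, ddof=1)[:3], 2))
```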
8.3 Statistical analysis
The aims of statistical analysis in metabolic profiling will depend on the objectives of the study. We are often interested in detecting structure and identifying important features of the data. The main goals are usually to cluster individuals based on their metabolic profiles and understand differences among groups of subjects. Unsupervised statistical methods are usually employed to this end. Another goal is to determine whether there is a significant difference between groups in relation to a phenotype of interest and identify which metabolites discriminate phenotype groups. In this case regression and classification methods are employed and we often work in the small n, large p inference problem [14], that is, each individual datum (metabolic profile) consists of a large vector of inter-related (dependent) observations, yet the number of samples in the study is relatively small. This poses a substantial challenge to statistical analyses. Here, we briefly review some of the current techniques used in the statistical analysis of metabolomic data. We will denote with X the n×p matrix of metabolic profiles, where each row is a p-dimensional vector of spectral variables representing one metabolic profile from one biological sample. Usually p ≈ 10²–10⁴. Each column of X corresponds to a single metabolic variable which plays the role of predictor variable, and we denote the n-dimensional vector of phenotypic responses with y. The methods described in this chapter are exemplified using data from an investigation of the metabolic response to caloric restriction in laboratory rats, taken from the COMET project [15]. Briefly, 18 Sprague–Dawley rats were housed in metabolism cages and their urine sampled daily over the course of 7 days. Half the animals were fed ad libitum (the control group), while half had their diet restricted to 50% of the normal food intake (the restricted group). Half the animals in each group were killed at the end of Day 2 for clinical chemistry and histopathology evaluation, while the other half were killed at the end of the study. Urine samples were immediately frozen and stored at −40 °C. Samples were thawed and prepared according to a standard protocol [16] prior to being analysed by ¹H NMR at an observation frequency of 600 MHz. Initial analysis showed the restricted group to reach a steady metabolic state at Day 2. In this chapter we therefore combine the data corresponding to Days 2–7, resulting in 34 observations per treatment group. The spectra were processed by high resolution binning (bin width 0.001 ppm) with median fold change normalisation to yield a dataset of 8280 integrated spectral bins. Further experimental details may be found in Ebbels et al. [16].
8.3.1 Unsupervised methods
Unsupervised methods (see Chapter 2) are often used in chemometrics as exploratory tools to identify group structure in the data and spectral features which differentiate between groups. Therefore, determination of variable importance and thus biomarker discovery is a crucial aspect of the analysis. Principal component analysis (PCA) is the most commonly used multivariate statistical technique in metabolic profiling. PCA is a well-known method of dimension reduction: linear combinations of the columns
Figure 8.3 PCA of the caloric restriction dataset. (a) Scores plot for PC1 versus PC2. (b) Loadings on PC1 between 2.3 and 3.1 ppm
of X are built by maximising the variance of each component, while ensuring orthogonality between all components. PCA allows for a more compact representation of the data. The strong correlations between variables in metabolic profile data allow the majority of the variance to be summarised by a few principal components. Visualising data in a lower number of dimensions aids the processes of identifying patterns, highlighting group similarities and differences, outlier detection, and understanding other sources of variation within the dataset (for instance experimental drift). Figure 8.3 shows a PCA of the (unscaled) caloric restriction data. Figure 8.3(a) shows the scores plot – the projection of the data into the space of the first two components. Differences between the two treatment groups are clearly seen with the calorie-restricted group easily differentiated from the controls. One control sample, with approximate coordinates (−10,0), appears to be an outlier and maps with the calorie-restricted group. Figure 8.3(b) shows a portion of the PCA loadings for the first component. The positive loadings indicate that the two metabolites 2-oxoglutarate (2.41 and 3.00 ppm) and citrate (2.55 and 2.68 ppm) are higher in controls than the calorie-restricted group. The citrate resonances also reveal a common pattern of adjacent positive and negative values indicative of peak shift. Cluster analysis identifies subgroups or clusters in multivariate data, in such a way that individuals belonging to the same cluster resemble each other according to some similarity measure, whereas objects in different clusters are dissimilar. Hierarchical clustering is a widely used technique in the analysis of ‘omics’ data; the resulting hierarchy of clusters is more informative than an unstructured set of clusters. See Chapter 2 for a description of the basic hierarchical agglomeration algorithm. A key choice is of the definition of ‘distance’ between two clusters, called the linkage function. For example the Ward linkage is the incremental withingroup sum of squares (i.e. the increase when two clusters are merged), while centroid and single (or nearest neighbour) linkages are other common choices. Divisive hierarchical algorithms use a top-down approach, with all observations initially in one cluster, and splits performed recursively as one moves down the hierarchy. The results of hierarchical clustering are usually presented in a dendrogram. Model-based clustering is based on the assumption that the data are generated by a mixture of underlying probability distributions. The self-organising map (SOM) is another unsupervised clustering and visualisation tool that has been widely used in analysis of omic data. A regular array of nodes in a low (typically two) dimensional space – the map – is built with each node corresponding to a p-dimensional reference or ‘codebook’ vector, which defines the prototype for that position on the map. The map plots the similarities of the data by grouping similar
profiles together, with adjacent nodes being more similar than distant nodes; the SOM nodes model the density of data in the p-dimensional input space. New observations can be classified by finding the reference vector to which they are most similar, giving their predicted position on the map.
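As a concrete illustration of the dimension-reduction step, the following sketch computes PCA scores and loadings from the singular value decomposition of the column-centred data matrix (Python/NumPy; the synthetic two-group data are only meant to mimic the structure of a metabolic profiling dataset, not the caloric restriction data themselves).

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via singular value decomposition of the column-centred data.

    Returns (scores, loadings, explained_variance_ratio)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    evr = (s ** 2) / (s ** 2).sum()
    return scores, loadings, evr[:n_components]

# Synthetic two-group example: n samples x p spectral variables
rng = np.random.default_rng(2)
X = rng.normal(size=(68, 500))
X[:34, :10] += 2.0            # group difference confined to ten variables
scores, loadings, evr = pca(X, n_components=2)
print(scores.shape, loadings.shape, np.round(evr, 3))
```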
8.3.2 Supervised methods
Metabolic data collected in case-control or cohort studies of well-defined disease phenotypes can be used to identify metabolites associated with disease status or, in general, of any phenotype under study. Once identified, associated metabolites can be included in predictive models for the estimation of disease risk in individuals for whom the disease status is unknown. In practice, many association studies of phenotypes with metabolic variables involve the analysis of thousands of spectral variables, depending on the analytical platform and data preprocessing. This poses many challenges to the statistical analysis since traditional multiple linear or logistic regression fail when p > n. Moreover, the predictor space shows a high degree of collinearity since the spectral signature of each metabolite comprises many spectral variables, and the levels of metabolites belonging to the same pathway or biochemical process are often correlated. This results in unstable maximum likelihood estimates of regression coefficients. A straightforward approach consists of testing each spectral variable independently using standard univariate tests (such as the t-test or univariate linear/logistic regression). However, this introduces a multiple testing problem requiring very stringent significance thresholds when the number of spectral variables is large, and fails to take into account the combined effect of multiple metabolites. By simultaneously considering the effects of all metabolites, a weak signal for a univariate test may be enhanced because other causal effects are included, leaving less residual variation unexplained. More importantly, false associations can be weakened by including in the model a stronger signal from a true causal metabolite. Multivariate regression methods consider many, or all, spectral variables simultaneously. Partial least squares (PLS) regression is by far the most popular technique in metabolomics for exploring class differences and highlighting explanatory metabolites. PLS regression overcomes some of the limitations of PCA, in which the linear combinations of the original variables chosen describe as much of the variation in X as possible, without reference to the phenotype and so useful predictive information may be discarded as noise. PLS is also a latent variable approach that assumes the data can be well approximated by a low-dimensional subspace of the p-dimensional predictor space. In PLS, the linear combinations of the predictor variables are constructed to have simultaneously high variance and high correlation with the response. The X and y spaces are both modelled by a set of latent variables summarised by scores and loadings. The objective of PLS is to maximise the covariance between the scores in the X and y spaces. As for PCA, components explaining small amounts of variance are discarded from the model. In many applications the response variable y is categorical, representing, for example, class membership (e.g. control/dosed, affected/unaffected individuals, etc.) and PLS Discriminant Analysis (PLS-DA) is routinely used. In this case the goal is to sharpen the separation between groups of observations, by rotating PCA components such that a maximum separation among classes is obtained. The variables that carry the class separating information can then be identified. There are many extensions of the original PLS algorithm, for example to include multiple response variable. Figure 8.4 shows the results of applying PLS-DA to the caloric restriction data. 
The scores show that, save for the single outlier, one component is sufficient to separate the two groups. This can be contrasted with the PCA of Figure 8.3 where two latent variables were required. The PLS weights can be interpreted similarly to the loadings in PCA and here again indicate the influence of 2-oxoglutarate and citrate on the class separation. O-PLS [17] and O2-PLS [18] are extensions to the PLS algorithm which decompose the variation of the predictor variables into two parts: variation orthogonal (uncorrelated) to the response and variation correlated to the response. Spectra often contain systematic variation unrelated to the response, for example due to
Figure 8.4 PLS modelling of the caloric restriction data. (a) Scores of the one-component model. (b) PLS weights between 2.3 and 3.1 ppm
experimental effects (such as a temperature drift in the spectrometer) or systematic biological variation (e.g. differences in diets between human subjects which are not easily controlled). It is important to be able to distinguish different sources of variation in X as this leads to more robust statistical models and better predictions for new samples. While providing predictive accuracy similar to that of conventional PLS, O-PLS and O2-PLS provide methods to distinguish the variation in X orthogonal to y which may allow the researcher to explain its presence. This approach can lead to more parsimonious models, which can improve model interpretation. Recent years have seen an increase in popularity of Bayesian methods in metabolomics as they can address many of the statistical challenges posed by this type of data in a coherent probabilistic framework (see Chapter 3). For example, Bayesian models for NMR data have been described in [19–21]. Dou and Hodgson [22] show how Bayesian inference and Gibbs sampling techniques can be applied to spectral analysis and parameter estimation for both single- and multiple-frequency signals, while more recently Rubtsov and Griffin [23] propose a Bayesian model to jointly detect and estimate metabolite resonances from NMR spectra of complex biological mixtures. In their approach the time-domain signal is modelled as a sum of exponentially damped sinusoidal components and a Reversible Jump Markov Chain Monte Carlo technique is devised to perform full posterior inference. We believe that there is much scope for the development of Bayesian methods in this area and anticipate that this framework will become more widely used in future metabolomics studies.
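For illustration, one common way to run a PLS-DA of the kind described above is to regress a dummy-coded class variable on X with a PLS model; the sketch below uses scikit-learn's PLSRegression on synthetic data and is not the implementation used for Figure 8.4.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Synthetic two-class data: rows are samples, columns spectral variables
rng = np.random.default_rng(3)
X = rng.normal(size=(68, 500))
y = np.repeat([0, 1], 34)            # 0 = control, 1 = restricted
X[y == 1, :20] += 1.0                # class difference in twenty variables

pls = PLSRegression(n_components=1, scale=False)
pls.fit(X, y)

scores = pls.transform(X)            # latent-variable scores, shape (n, 1)
weights = pls.x_weights_[:, 0]       # weight of each spectral variable
y_pred = (pls.predict(X).ravel() > 0.5).astype(int)
print("training accuracy:", (y_pred == y).mean())
print("top discriminatory variables:", np.argsort(np.abs(weights))[-5:])
```

In practice the number of components and the predictive performance would be assessed by cross-validation rather than on the training data as done here.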
8.3.3 Metabolome-wide association studies
Metabolome-wide association studies (MWAS) aim to model variations in the metabolic profiles of individuals across different phenotype groups and to identify biomarkers of disease risk. As the metabolic profile is strongly influenced by lifestyle and environmental variations, which are also major contributors to disease risk, metabolic profiling can capture latent information on the interactions between lifestyle, environmental exposures and genetic variation. It thus provides a powerful approach to understanding underlying biochemical pathways and mechanisms linking such exposures to disease. The MWAS approach involves high-throughput spectroscopic screening of thousands of specimens, capturing comprehensive data on a wide variety of molecular markers, leading to data that are multivariate, noisy and highly collinear. The approach was pioneered by
Holmes et al. [24] who demonstrated the potential of metabolic profiling for the discovery of novel metabolites associated with blood pressure. Similarly to genome-wide association studies (GWAS), MWAS involve high-throughput technologies and allow the discovery of novel associations and the generation of new hypotheses. They also share many of the statistical challenges. These include high dimensionality, correlation of predictors (linkage disequilibrium in GWAS), and population structure (some metabolite concentrations might be higher in some populations due, for example, to diet). These issues have been extensively addressed in the statistical genetics literature [25], but MWAS pose new problems. The profiles consist of several thousand individual measurements at different chemical shifts or masses. In addition to the problems of collinearity and dimensionality, metabolic profiles are highly dynamic and are affected by the entirety of internal and external influences on an organism. Additionally, the spectral signature of a single metabolite comprises many signals that often overlap and may shift between individuals. Finally, the most important challenge is the problem of unidentified signals; while thousands of signals may be detected, typically only a very small fraction will be identified with a known chemical structure. Knowledge of the chemical identity of the metabolite features is necessary for correct interpretation of the results of statistical analysis. A key problem in MWAS is the determination of statistically significant relationships between spectral variables and phenotype, while minimising the risk of false positive associations at adequate power. Chadeau-Hyam et al. [26] define the metabolome-wide significance level (MWSL) as the threshold required to control the family-wise error rate (FWER) when performing univariate two-sample t-tests to detect association between each individual spectral variable and phenotype. Using a large database of human metabolic profiles [24], they estimate the MWSL for ¹H NMR spectral data using a permutation approach, under the null hypothesis of no association with a putative binary phenotype. They show that these estimates are stable across populations and robust to intra-individual variability, and find that the MWSL primarily depends on sample size and spectral resolution (i.e. the number of spectral variables in a spectrum). These authors also estimate the effective number of tests, which is the number of independent tests that would generate the same false positive rate, to be equal to 35% of the number of spectral variables regardless of spectral resolution and sample size. This leads, for example, to an estimated MWSL of 2 × 10⁻⁵ and 4 × 10⁻⁶ for an FWER of 0.05 and 0.01, respectively, at medium spectral resolution (7100 variables). The MWSL provides a practical p-value threshold to guide selection of discriminatory metabolite peaks in spectral data and aids the design of future MWAS. Figure 8.5 shows the result of an MWAS of dietary status for the caloric restriction data. A univariate t-test approach was applied to identify spectral variables discriminating between the control and restricted groups. Several variables reach the MWSL which, in this example with 8280 spectral variables, is equal to 1.72 × 10⁻⁵, corresponding to an FWER of 0.05. A portion of the mean control spectrum is shown with the spectral variables reaching metabolome-wide significance highlighted.
The peaks from 2-oxoglutarate (2-OG) and citrate are clearly seen, as in the PCA and PLS-DA analyses of Figures 8.3 and 8.4. Figure 8.5 also highlights the difficulty inherent in working with data in spectral form, especially when formal statistical significance thresholds are required. There are several significant variables which correspond to low and overlapped peaks, many of which cannot be identified. In addition, not all peaks of a given metabolite may reach the significance threshold (as is the case for the citrate peaks at 2.68 ppm, in this case mainly due to peak shift).
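The general idea of a permutation-based metabolome-wide significance level can be sketched as follows (Python with SciPy); this is a simplified illustration of the approach of Chadeau-Hyam et al. [26], not a reproduction of their procedure, and the synthetic data and permutation count are illustrative.

```python
import numpy as np
from scipy import stats

def mwsl_permutation(X, y, n_perm=1000, fwer=0.05, seed=0):
    """Estimate a per-test p-value threshold controlling the FWER.

    For each permutation of the class labels, record the smallest p-value
    over all spectral variables; the threshold is the `fwer` quantile of
    these minima."""
    rng = np.random.default_rng(seed)
    min_p = np.empty(n_perm)
    for b in range(n_perm):
        yb = rng.permutation(y)
        _, p = stats.ttest_ind(X[yb == 0], X[yb == 1], axis=0)
        min_p[b] = p.min()
    return np.quantile(min_p, fwer)

# Observed per-variable tests and selection at the estimated threshold
rng = np.random.default_rng(4)
X = rng.normal(size=(68, 2000))
y = np.repeat([0, 1], 34)
X[y == 1, :5] += 1.5
threshold = mwsl_permutation(X, y, n_perm=200)
_, p_obs = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
print("estimated threshold:", threshold)
print("significant variables:", np.where(p_obs < threshold)[0])
```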
8.3.4 Metabolic correlation networks
An important feature of metabolomic data is that a significant number of metabolite levels are highly interrelated. These correlations do not necessarily occur between metabolites that are neighbours in a metabolic pathway (e.g. substrates and products of the same enzyme), but are the result of both direct enzymatic conversions and indirect cellular regulations of transcriptional and biochemical processes. For example, two or more molecules involved in the same pathway may present high intermolecular correlation and exhibit a similar
Figure 8.5 The metabolome-wide significance level (MWSL). (a) The mean control spectrum (dashed line) with spectral variables reaching the MWSL highlighted in red. (b) Manhattan plot of the data from the caloric restriction study for the same spectral region as earlier figures. Log10 p-values are multiplied by the sign of the effect, so that positive values indicate an increase in the restricted group as compared with controls. The red dashed lines indicate the MWSL.
response to a stimulus. The primary interest in the analysis of metabolite correlations stems from the fact that the observed pattern provides information about the physiological state of a metabolic system and complex metabolite relationships [27]. The working assumption is that a network of metabolic correlations represents a fingerprint of the underlying biochemical network in a given biological state, which then can be related to changes between different genotypes, phenotypes or experimental conditions [28, 29]. Moreover, correlation networks can give insight into the functional and regulatory relations between metabolites, by comparing metabolic correlation networks to known biochemical pathways. Being aware of the complexity of biological systems, the goal is not to infer the physical interaction network itself but to develop new hypotheses of interdependence between biochemical components that can shed light on the observed phenotypic variations. Correlation networks are widely used in transcriptomics and proteomics to explore and visualise high-dimensional data. Some interesting studies involving association networks also exist in metabolomics [30, 31], however, these are not extensive and are usually limited to simple correlation/partial correlation networks, often based on arbitrary thresholds. This type of approach, although easy to implement, might lead to spurious network estimation especially when p>n. In recent studies the use of Gaussian graphical models [32] and Bayesian Networks [33] is becoming more widespread (Chapter 11). In Figure 8.6 we show the full and partial correlation networks for the caloric restriction data. Here, 16 metabolites have been quantified through semi-automated integration of the NMR resonances. The full correlation network is built by calculating Pearson correlation coefficients between each pair of metabolites for each condition and drawing an edge between metabolites for which the corresponding p-value is below a specified cut-off (here 0.01). In general, significance thresholds in network analysis are determined by using Bonferroni correction or false discovery rate methods. We also present the partial correlation network for the
Figure 8.6 Correlation networks for the caloric restriction data. (Left) Controls (100% diet); (right) 50% diet. Metabolites whose mean concentration changes significantly between the two groups according to a t-test (p<0.01) are shown in pink. The size of each node is proportional to its number of edges. Metabolites are labelled by their name and the resonance used. DMA, dimethylamine; PAG, phenyl acetyl glycine; TMAO, trimethylamine N-oxide; DMG, dimethylglycine; MA, methylamine; NMNA, N-methyl nicotinic acid; NMND, N-methyl nicotinamide
two dietary regimes. Partial correlations are conditional on all other metabolites and therefore can possibly determine to what extent metabolite correlations are direct and do not originate via intermediate variables. Besides elucidating the origins of metabolite interactions, we expect that metabolite patterns identified by network techniques will reveal a level of information beyond that derived from the analysis of individual
metabolite concentrations. Topological differences in metabolic correlation networks might prove useful to complement findings of subtle differences in variances and averages of metabolite concentration levels in order to discriminate between different groups. At the metabolite level, it is plausible that a transition to a different physiological state may not only involve changes in the mean levels of the metabolite concentrations, but may additionally arise due to changes in their interactions. Metabolites showing no difference in mean concentration between states may still show alteration of pair-wise correlation with other metabolites [28]. Network analysis typically involves independent estimation of the two networks under each condition, mainly based on correlation measures, and subsequent comparison of their topological properties, such as connectivity patterns observed in each state [34] and shortest path differences across states [29], usually based on arbitrary thresholds. In Figure 8.6, metabolites whose mean concentration levels change between states are highlighted in pink. We also notice topological differences between the networks under the two food regimes. The full correlation networks are more densely connected than those built from partial correlations, indicating that many of the edges may come from indirect relationships. This type of approach, although easy to implement, may lead to spurious network estimation because each network contains considerable noise due to limited sample sizes and does not consider the overall complexity of the system. This calls for development of statistical methods specifically aimed at capturing differences in network topology between states.
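A minimal sketch of building full and partial correlation networks from a table of quantified metabolites is given below (Python with NumPy/SciPy); the 0.01 cut-off mirrors the threshold used for Figure 8.6, and the precision-matrix route to partial correlations is only sensible when the number of samples exceeds the number of metabolites, as it does here.

```python
import numpy as np
from scipy import stats

def correlation_network(M, alpha=0.01):
    """Edges between metabolite pairs whose Pearson correlation has p < alpha.

    M : n samples x m quantified metabolite concentrations.
    Returns a list of (i, j, r) tuples."""
    n, m = M.shape
    edges = []
    for i in range(m):
        for j in range(i + 1, m):
            r, p = stats.pearsonr(M[:, i], M[:, j])
            if p < alpha:
                edges.append((i, j, r))
    return edges

def partial_correlations(M):
    """Partial correlation of each pair given all other metabolites,
    obtained from the inverse of the covariance matrix (requires n > m)."""
    prec = np.linalg.inv(np.cov(M, rowvar=False))
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

rng = np.random.default_rng(5)
M = rng.normal(size=(34, 16))
M[:, 1] = 0.8 * M[:, 0] + 0.2 * rng.normal(size=34)   # one correlated pair
print(len(correlation_network(M)), np.round(partial_correlations(M)[0, 1], 2))
```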
8.3.5 Simulation of metabolic profile data
In any area of science, simulation of synthetic data sets is a key component in the development of new approaches to data analysis. While there is much publicly available data in other areas of systems biology,
Figure 8.7 Real (a) and simulated (b) 1H NMR spectra of human urine as produced by MetAssimulo [35]
Figure 8.8 Intra- and inter-metabolite correlations simulated by MetAssimulo [35]. (a) Detail of the autocorrelation matrix indicating various metabolites. (i) Taurine versus creatinine; (ii) taurine versus taurine; (iii) taurine versus creatinine; (iv) creatinine versus citrate; (v) taurine versus citrate; (vi) creatinine versus citrate. (b) Full autocorrelation matrix. The box marks the region expanded in (a). The region 4.5–6.0 ppm was excluded due to residual water suppression artifacts
particularly transcriptomics, this is not the case in metabolomics, and this gives data simulation an added importance. Yet, until recently there were few software packages available and none which could simulate the full complexity of spectral data from NMR or MS platforms. Muncey et al. [35] have developed MetAssimulo, a MATLAB package for NMR metabolic profile simulation which simulates biofluid spectra by convolving a library of standard spectra generated from NMR spectroscopy of pure metabolites. The idea behind MetAssimulo is to draw spectral templates of individual metabolites from a local database and to combine them with metabolite concentration information specified by the user, or alternatively automatically downloaded from the Human Metabolome Database (www.hmdb.ca). The software allows the user to specify a wide range of spectral properties including a realistic model for the noise and for peak shifts by incorporating pH information. Moreover, the user can create groups of individuals (for example cases/controls) by specifying different means and standard deviations for the concentration of each metabolite in the two groups. Modelling of realistic intra- and inter-metabolite correlations is also possible. These features allow MetAssimulo to create realistic metabolic profiles containing large numbers of metabolites. Figure 8.7 compares a simulated spectrum of normal human urine with one from a real urine sample, illustrating a high degree of realism in the simulated data. The simulated spectrum emulates the real spectrum with high concentration metabolites such as creatinine, creatine and citrate clearly visible, while the expanded insets show how the similarity extends even to low-level metabolites such as N-methyl nicotinic acid. Figure 8.8 demonstrates the intra- and inter-metabolite correlation structure in a sample of simulated spectra. The following inter-metabolite correlations were designed into the simulation: (citrate versus creatinine, r = −0.7), (citrate versus taurine, r= −0.4) and (creatinine versus taurine, r = 0.8). These can clearly be identified and are labelled on the plot. As expected, strong positive intra-metabolite correlations are seen, for example between the two resonances from taurine at 3.25 and 3.42 ppm, labelled (ii) in the plot. Other strong correlations are also observed in the plot, primarily due to further intra-metabolite relationships. MetAssimulo is the first software package able to simulate NMR metabolic profiles with such complexity and flexibility and provides an invaluable tool for the bioinformatics and biostatistics community to develop new data analysis
techniques and to test hypotheses and experimental designs. MetAssimulo is freely available for academic use at http://cisbic.bioinformatics.ic.ac.uk/metassimulo/.
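The basic idea of template-based spectral simulation can be illustrated with the toy sketch below (Python/NumPy): spectra are built as concentration-weighted sums of per-metabolite peak templates, with random peak shifts and additive noise. This is not MetAssimulo itself, and the peak lists and parameters are illustrative assumptions.

```python
import numpy as np

ppm = np.linspace(0, 10, 5000)

def lorentzian(ppm, centre, width=0.01):
    return 1.0 / (1.0 + ((ppm - centre) / width) ** 2)

def simulate_spectrum(templates, concentrations, shift_sd=0.005,
                      noise_sd=0.01, rng=None):
    """Sum concentration-weighted metabolite templates with random peak
    shifts and additive noise (toy model only)."""
    rng = rng or np.random.default_rng()
    spectrum = np.zeros_like(ppm)
    for name, peaks in templates.items():
        conc = concentrations[name]
        for centre, height in peaks:
            shifted = centre + rng.normal(0, shift_sd)
            spectrum += conc * height * lorentzian(ppm, shifted)
    return spectrum + rng.normal(0, noise_sd, size=ppm.shape)

# Hypothetical peak lists: (chemical shift in ppm, relative height)
templates = {"citrate": [(2.55, 1.0), (2.68, 1.0)],
             "creatinine": [(3.05, 1.0), (4.05, 0.7)]}
concs = {"citrate": 2.0, "creatinine": 5.0}
spec = simulate_spectrum(templates, concs, rng=np.random.default_rng(6))
print(spec.shape, round(spec.max(), 2))
```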
8.4 Conclusions
The field of metabolomics continues to expand and there is currently a gap between the ability of analytical technology to generate large volumes of complex data and the statistical techniques required to distil information of biological relevance. This is particularly true of methods which can reliably and automatically deconvolve and quantify metabolites from analytical spectra in the presence of peak overlap and shifts. This review has not attempted to be a comprehensive summary of all statistical methods which could be applied to metabolomic data. However, we have highlighted some of the statistical challenges arising from the interplay of chemical, biological and environmental factors influencing metabolic status, and described some common techniques used to address them. Many current methods have shortcomings because they do not capture the full complexity of the underlying data generation process. Yet, we believe that there is considerable scope for biostatisticians and bioinformaticians to develop methods able to fully exploit the wealth of biochemical information latent in metabolic systems.
Acknowledgements
The authors would like to acknowledge Beatriz Valcarcel, Harriet Muncey and Paul Benton for help preparing figures for this chapter. The members of the Consortium for Metabonomic Toxicology are thanked for access to the caloric restriction data. This work was in part supported by BBSRC grant BB/E20372/1.
References [1] Nicholson, J.K., J.C. Lindon, and E. Holmes, ‘Metabonomics’: understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data. Xenobiotica 1999, 29(11), 1181. [2] Raamsdonk, L.M., B. Teusink, D. Broadhurst, N. Zhang, A. Hayes, M.C. Walsh, J.A. Berden, K.M. Brindle, D.B. Kell, J.J. Rowland, H.V. Westerhoff, K. van Dam, and S.G. Oliver, A functional genomics strategy that uses metabolome data to reveal the phenotype of silent mutations. Nature Biotechnology 2001, 19(1), 45–50. [3] Hoch, J.C. and A.S. Stern, NMR Data Processing. New York: Wiley-Liss, 1996. [4] Weljie, A.M., J. Newton, P. Mercier, E. Carlson, and C.M. Slupsky, Targeted profiling: quantitative analysis of 1 H NMR metabolomics data. Analytical Chemistry 2006, 78(13), 4430–42. [5] Crockford, D.J., H.C. Keun, L.M. Smith, E. Holmes, and J.K. Nicholson, Curve-fitting method for direct quantitation of compounds in complex biological mixtures using 1H NMR: application in metabonomic toxicology studies. Analytical Chemistry 2005, 77(14), 4556–62. [6] Forshed, J., R.J. Torgrip, K.M. Aberg, B. Karlberg, J. Lindberg, and S.P. Jacobsson, A comparison of methods for alignment of NMR peaks in the context of cluster analysis. Journal of Pharmaceutical and Biomedical Analysis 2005. 38(5), 824–32. [7] Veselkov, K., J. Lindon, T. Ebbels, V. Volynkin, D. Crockford, E. Holmes, D. Davies, and J. Nicholson, Recursive segment-wise peak alignment of biological 1H NMR spectra for improved metabolic biomarker recovery. Analytical Chemistry 2009, 81(1), 56–66.
[8] Smith, C.A., E.J. Want, G. O‘Maille, R. Abagyan, and G. Siuzdak, XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical Chemistry 2006, 78(3), 779–87. [9] Katajamaa, M. and M. Oresic, Processing methods for differential analysis of LC/MS profile data. BMC Bioinformatics 2005, 6, 179. [10] Aberg, K.M., R.J. Torgrip, J. Kolmert, I. Schuppe-Koistinen, and J. Lindberg, Feature detection and alignment of hyphenated chromatographic-mass spectrometric data. Extraction of pure ion chromatograms using Kalman tracking. Journal of Chromatography, A 2008, 1192(1), 139–46. [11] Dieterle, F., A. Ross, G. Schlotterbeck, and H. Senn, Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in H-1 NMR metabonomics. Analytical Chemistry 2006. 78(13), 4281–90. [12] Ebbels, T.M., J.C. Lindon, and M. Coen, Processing and modeling of nuclear magnetic resonance (NMR) metabolic profiles. Methods in Molecular Biology 2011, 708, 365–88. [13] Durbin, B.P., J.S. Hardin, D.M. Hawkins, and D.M. Rocke, A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 2002, 18(Suppl. 1), S105–10. [14] West, M., Bayesian factor regression models in the ‘Large p, Small n’ paradigm. Bayesian Statistics 2003, 7, 733–42. [15] Lindon, J.C., H.C. Keun, T.M. Ebbels, J.M. Pearce, E. Holmes, and J.K. Nicholson, The Consortium for Metabonomic Toxicology (COMET): aims, activities and achievements. Pharmacogenomics 2005, 6(7), 691–9. [16] Ebbels, T.M.D., H.C. Keun, O. Beckonert, E. Bollard, J.C. Lindon, E. Holmes, and J.K. Nicholson, Prediction and classification of drug toxicity using probabilistic modeling of temporal metabolic data: The Consortium on Metabonomic Toxicology Screening Approach. Journal of Proteome Research 2007, 6(11), 4407–22. [17] Trygg, J. and S. Wold, Orthogonal projections to latent structures (O-PLS). Journal of Chemometrics 2002, 16(3), 119–28. [18] Trygg, J. and S. Wold, O2-PLS, a two-block (X-Y) latent variable regression (LVR) method with an integral OSC filter. Journal of Chemometrics 2003, 17(1), 53–64. [19] Bretthorst, G.L., Bayesian analysis 1. Parameter estimation using quadrature NMR models. Journal of Magnetic Resonance 1990, 88(3), 533–51. [20] Bretthorst, G.L., Bayesian analysis 2. Signal detection and model selection. Journal of Magnetic Resonance 1990, 88(3), 552–70. [21] Bretthorst, G.L., Bayesian analysis 3. Applications to NMR signal detection, model selection, and parameter estimation. Journal of Magnetic Resonance 1990, 88(3), 571–95. [22] Dou, L.X. and R.J.W. Hodgson, Bayesian inference and Gibbs sampling in spectral analysis and parameter estimation. 2. Inverse Problems 1996, 12(2), 121–37. [23] Rubtsov, D.V. and J.L. Griffin, Time-domain Bayesian detection and estimation of noisy damped sinusoidal signals applied to NMR spectroscopy. Journal of Magnetic Resonance 2007, 188, 367–79. [24] Holmes, E., R.L. Loo, J. Stamler, M. Bictash, I.K. Yap, Q. Chan, T. Ebbels, M. De Iorio, I.J. Brown, K.A. Veselkov, M.L. Daviglus, H. Kesteloot, H. Ueshima, L. Zhao, J.K. Nicholson, and P. Elliott, Human metabolic phenotype diversity and its association with diet and blood pressure. Nature 2008, 453(7193), 396–400. [25] Balding, D.J., A tutorial on statistical methods for population association studies. Nature Reviews Genetics 2006, 7(10), 781–91. [26] Chadeau-Hyam, M., T.M. Ebbels, I.J. Brown, Q. Chan, J. Stamler, C.C. Huang, M.L. 
Daviglus, H. Ueshima, L. Zhao, E. Holmes, J.K. Nicholson, P. Elliott, and M. De Iorio, Metabolic profiling and the metabolome-wide association study: significance level for biomarker identification. Journal of Proteome Research 2010, 9(9), 4620–7. [27] Camacho, D., A. de la Fuente, and P. Mendes, The origin of correlations in metabolomics data. Metabolomics 2005, V1(1), 53. [28] Steuer, R., Review: on the analysis and interpretation of correlations in metabolomic data. Briefings in Bioinformatics 2006, 7(2), 151–8. [29] Muller-Linow, M., W. Weckwerth, and M.T. Hutt, Consistency analysis of metabolic correlation networks. BMC Systems Biology 2007, 1, 44.
[30] Cloarec, O., M.E. Dumas, A. Craig, R.H. Barton, J. Trygg, J. Hudson, C. Blancher, D. Gauguier, J.C. Lindon, E. Holmes, and J. Nicholson, Statistical total correlation spectroscopy: an exploratory approach for latent biomarker identification from metabolic 1 H NMR data sets. Analytical Chemistry 2005, 77(5), 1282. [31] Saude, E.J., I.P. Obiefuna, R.L. Somorjai, F. Ajamian, C. Skappak, T. Ahmad, B.K. Dolenko, B.D. Sykes, R. Moqbel, and D.J. Adamko, Metabolomic biomarkers in a model of asthma exacerbation urine nuclear magnetic resonance. American Journal of Respiratory and Critical Care Medicine 2009, 179(1), 25–34. [32] Krumsiek, J., K. Suhre, T. Illig, J. Adamski, and F.J. Theis, Gaussian graphical modeling reconstructs pathway reactions from high-throughput metabolomics data. BMC Systems Biology 2011, 5(1), 21. [33] Green, M.L. and P.D. Karp, A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 2004, 5, 76. [34] Weckwerth, W., M.E. Loureiro, K. Wenzel, and O. Fiehn, Differential metabolic networks unravel the effects of silent plant phenotypes. Proceedings of the National Academy of Sciences of the United States of America 2004, 101(20), 7809–14. [35] Muncey, H., R. Jones, M. De Iorio, and T. Ebbels, MetAssimulo: simulation of realistic NMR metabolic profiles. BMC Bioinformatics 2010, 11(1), 496.
9 Imaging and Single-Cell Measurement Technologies
Yu-ichi Ozaki¹ and Shinya Kuroda²,³
¹ Laboratory for Cell Signaling Dynamics, RIKEN Quantitative Biology Center, Kobe, Japan
² Department of Biophysics and Biochemistry, University of Tokyo, Japan
³ CREST, University of Tokyo, Japan
9.1 Introduction
This chapter discusses the generation of detailed biological data that form the substrate of contemporary systems biology analysis. Imaging-based single-cell measurement techniques provide a large amount of informative data. In this chapter, we introduce some measurement techniques and biochemical technologies, often used in combination, by focusing primarily on the study of cellular signal transduction in cultured cells. Knowing how the data are obtained and what the data mean is crucially important in order to elucidate the biological messages in the data. This chapter also presents the challenges facing the statistical approaches that are reviewed in other chapters.
9.1.1 Intracellular signal transduction
The purpose of signal transduction studies is to understand the mechanisms of cellular functions as a consequence of interactions among functional proteins. Because signal transduction analysis has evolved from biochemistry, the primary assay techniques used formerly were activity measurements of enzymes extracted from cells and organs. When it became apparent that the enzymes' activity is regulated by their phosphorylation states as well as by complex formation with specific regulatory proteins, cellular functions became better understood in terms of chains of enzymatic reactions and molecular interactions. A chain of molecular interactions forms a signaling pathway, and divergences and convergences of such pathways form a signaling network that underlies cellular behavior. Therefore, any changes in the total amount, the phosphorylation (or other post-translational modification) state, and the intracellular localization of proteins resulting from
molecular interactions can be regarded as a mechanism for transducing a signal through the network. Nowadays, it is becoming increasingly common to elucidate the structure and dynamics of intracellular signaling networks by system-level approaches. Investigating the systemic behavior of the network gives clues for further analysis of as yet unknown functions of individual elements in the network. As mentioned earlier, the measurable targets in an analysis of a signaling network cover a wide variety of cellular signaling events. These include: (1) change in the total amount of proteins, messenger RNA (mRNA) and small molecules known as secondary messengers, such as cyclic adenosine monophosphate (cAMP), ions and phospholipids; and (2) post-translational modifications such as phosphorylation which influence the transcriptional activity of promoters as well as physical properties of the molecule, such as intracellular localization, diffusion coefficient and so on. Virtually all of these signaling events can be investigated through imaging techniques.
9.1.2 Lysate-based assay and single-cell assay
All newly identified proteins have been discovered by the use of some in vitro biochemical techniques. Typically, those proteins were isolated from a cell lysate by a biochemical activity-targeted extraction method; such methods include a wide variety of fractionation methods as well as affinity purification. Afterwards, the protein undergoes both in vivo and in vitro assays to characterize its biochemical activity as well as its role in cellular function. A single-cell assay, primarily an imaging-based assay, is particularly helpful at this stage to characterize spatiotemporal dynamics of the protein in an in vivo environment. A single cell is too small to be studied by most biochemical techniques but measurements obtained by a lysate-based assay represent the average properties of signaling events in a cell population, which is typically composed of more than 10⁵ cells. The averaging obscures heterogeneity in the cell population derived from cell-to-cell variation in response to stimuli, as well as from population distributions caused, for example, by autonomous oscillations. In such a situation, a single-cell assay can measure the properties of distribution of the cell population directly. Another advantage of a single-cell assay is the quantitative information it holds for data analysis. Many imaging-based measurements provide temporally highly resolved data of cellular events at the level of many individual cells. This data richness can offer important advantages for statistical analysis but it also offers considerable challenges.
9.1.3 Live cell and fixed cell
Live cell imaging aims to visualize ongoing cellular signaling events with the aid of modern cell-biology technology. It is ideal to observe continuous changes in the signaling events in living cells and to describe the dynamics underlying cellular functions. The problem with this approach is that it requires functional probes that reflect intracellular signaling events while only interfering minimally. This important point will be discussed in a later section. On the other hand, fixed cells can provide a snapshot of the cellular status in many individual cells at some fixed time after a stimulus has been applied to the cells; they thus provide end-point data of a process; and studying several populations sampled at different times thus allows us to reconstruct the dynamics statistically. The advantage of fixed cell-based imaging is that it is compatible with many more biochemical analyses. For example, immunostaining can be widely applied to monitor signaling events because antibodies that target a wide variety of signaling molecules are currently available. In practice, phosphorylation-specific antibodies are commonly used as an indicator for the activity of signaling pathways. Therefore, live and fixed cell imaging should be mutually complementary.
9.2 Measurement techniques
There are many different techniques for measuring activities of signaling molecules. Each technique has inherent advantages and disadvantages, and should be chosen according to the purpose of an analysis.
Figure 9.1 Experimental procedure of Western blot analysis
Considerations with regard to the quality of the resulting data include numerical accuracy, reproducibility, spatial and temporal resolution, and biological credibility. Cost and hours of labor are also practical considerations.
9.2.1 Western blot analysis
Western blot analysis is the most common technique used for protein quantification (Figure 9.1). Cell lysates are subjected to gel electrophoresis, and then the separated proteins are transferred to and immobilized on a nitrocellulose membrane. The protein of interest is detected with a labeled antibody and then visualized by autoradiography or chemiluminescence detection with an X-ray film. In comparison with other antibody-based assays, Western blot analysis is superior in signal specificity because it identifies a certain protein by its molecular weight in addition to the specificity of the antibody to the protein. Western blot analysis is widely regarded as the most convenient and quantitative assay; however, it is laborious and hard to automate. For quantitative measurements, the concentration of a target protein in a lysate must be calculated from the standard curve that is obtained by applying purified proteins of known concentrations to the same gel. This is necessary because of the nonlinearity of the sensitivity curve of the X-ray film and of the densitometry that is commonly used in traditional Western blotting to quantify the amount of antigen in a specific protein band. However, it is uncommon to perform such a rigorous quantification using standard curves because of the limitation of the number of lanes as well as the limited availability of the standard protein. Instead, qualitative differences among samples are often argued by showing the density of protein bands without quantification, because the density increases monotonically as the concentration increases. However, modern Western blot analysis utilizes a fluorochrome-labeled antibody and a charge-coupled device (CCD) imager to acquire a fluorescence signal directly, thereby achieving improved signal linearity and dynamic range.
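To illustrate quantification against a standard curve, the sketch below fits a calibration line to hypothetical densitometry readings of a dilution series and inverts it for an unknown band (Python/NumPy). All numbers are invented for the example; a real film response is nonlinear outside a limited range, so a nonlinear calibration model is often required in practice.

```python
import numpy as np

# Hypothetical densitometry readings for a dilution series of purified protein
standard_amount = np.array([0.5, 1.0, 2.0, 4.0, 8.0])          # ng per lane
standard_density = np.array([1200, 2300, 4500, 8800, 16900])    # arbitrary units

# Straight-line fit through the standards (adequate only in the linear range)
slope, intercept = np.polyfit(standard_amount, standard_density, 1)

def estimate_amount(density):
    """Invert the calibration line to estimate the amount of antigen."""
    return (density - intercept) / slope

sample_density = 6100.0
print("estimated amount: %.2f ng" % estimate_amount(sample_density))
```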
9.2.2 Immunocytochemistry
Immunocytochemistry (or narrowly defined immunostaining) is a technique to visualize specific proteins within fixed cells using an antibody–antigen reaction (Figure 9.2). Usually cells are seeded and stained on a
Figure 9.2 Experimental procedure of immunocytochemical analysis
coverslip so that they can be readily observed with a microscope. Intracellular proteins are fixed with cross-linking agents such as formalin or fixed with acid or organic solvents by denaturing and precipitating the proteins. Owing to the fixation process, the intracellular localization and signaling context of proteins are preserved; however, not all antibodies that can be used for Western blotting also work with immunostaining because of interference with accessibility to antigens caused by the cross-linking agent and/or steric hindrance. In addition, because protein detection by immunostaining is attributable to the specificity of antibodies, this technique requires high-quality antibodies with less nonspecific binding to yield a better staining result. For quantitative analysis, the results from immunostaining must be carefully compared with other quantification methods such as Western blot analysis to check biological credibility. Historically, immunostaining has been considered qualitative, but not quantitative, and has rarely been utilized for quantification. This perception is related to problems of reproducibility in staining results such as signal intensity and background level. Moreover, unlike other quantitative assays, such as Western blotting, immunostaining does not allow placement of an internal standard within a sample, making it difficult to make comparisons between samples. However, it has been shown that immunostaining could be as accurate quantitatively as Western blotting by optimizing staining parameters such as the fixation method and the antibody dilution rate, and by controlling reaction times strictly in all staining procedures (Ozaki et al. 2010). It is recommended to use automated sample preparation systems to perform such highly reproducible staining.
9.2.3 Flow cytometry
Flow cytometry analyzes a heterogeneous population of microscopic particles by the light scattering and the fluorescent intensity of each individual particle, capturing physical and chemical features of the particle such as volume, internal complexity and the amount of fluorochrome. This technique is suitable for discriminating components in a crude sample such as a blood sample owing to its multi-parameter measurement capability. It is routinely utilized as a clinical laboratory test, especially for diagnosis of leukemia by staining leukocytes for the cell surface antigens. It is applicable to both living and fixed cells; however, it cannot obtain the
Figure 9.3 Principles of flow cytometry
time-lapse data of living cells because it observes each individual cell only once (Figure 9.3). For fixed cells, flow cytometry can quantify an intracellular signaling event by immunostaining with fluorescently labeled antibodies. Because sample preparation in immunostaining requires centrifugal sedimentation and resuspension in each staining and washing step, staining levels between samples tend to vary considerably. Furthermore, this technique is not amenable to automation. However, variation in staining level of each individual cell in the same sample is extremely small because all cells are deemed to be evenly treated throughout the staining steps. Flow cytometry is also utilized for signal transduction studies, especially when the focus of interest is the distribution of a signal in a cell population. For example, cell cycle related studies often utilize flow cytometry to quantify a cell cycle response that is difficult to achieve by lysate-based assays. However, flow cytometry can be used only for suspended cells; an adherent cell culture must be detached into a single-cell suspension prior to analysis (e.g. putting live cells in a proteinase treatment for a few minutes at 37 °C). Detaching cells can cause some harm; therefore, measurements must be interpreted cautiously when they involve quantifying a signaling event that changes rapidly, such as protein phosphorylation. An advantage of flow cytometry is its support for polychromatic analysis. Flow cytometer set-ups involve relatively simple optics, no movable parts and use of single element photodetectors; therefore, it is easy to configure sequential excitation and simultaneous spectroscopic detection for each light source without loss of functionality. Figure 9.4 illustrates the optics of a commercially available flow cytometer that can discriminate up to 18 different fluorochromes within a cell. The main application of the polychromatic analysis is immunophenotyping of a blood sample; however, there are a few epoch-defining studies of intracellular signal networks exploiting the informative measurements with the aid of statistical analysis (see below).
9.2.4 Fluorescent microscope
Given appropriate excitation light, a fluorochrome produces an inherent fluorescence whose spectrum shifts from the excitable wavelength side to the long-wavelength regime, a physical phenomenon that is called Stokes
Figure 9.4 The laser geometry and fluorescence-emission detection system of a 19-parameter flow cytometer. (a) Schematic representation showing that this instrument is equipped with three diode-pumped solid-state lasers. Each laser line can be independently adjusted through a series of steering mirrors. The lasers are set in the following delay timing order: blue, violet, red then green, with the blue laser as the reference laser. (b) Photograph of the octagon assembly for detection of green-laser-excited emissions. (c) Graphical representation of the assembly showing that it is a unique collection system, composed of a series of photomultiplier tubes (PMTs 1-8) and optics in an octagonal arrangement. Photons enter this system and are systematically reflected to the next detector by a dichroic mirror or transmitted through a band-pass filter and counted in that PMT. Dichroic-filter characteristics are shown. Each detector quantifies light that is mainly, but not solely, emitted by a single fluorochrome. Cy5PE, cyanine-5-PE; PE, phycoerythrin; TRPE, Texas Red-PE. Reprinted by permission from Macmillan Publishers Ltd: Nature Reviews Immunology 4, 648–55, copyright (2004)
shift. For observation of only the fluorescence signal, an optical system using a fluorescent microscope was designed to allow fluorescence to pass through while eliminating the scattered excitation light from the object (Figure 9.5). Because fluorescence is far weaker than excitation light, a combination of three optical devices is used to prevent excitation light from contaminating the observation wavelength band. The combination of optical devices is selected according to the excitation/emission spectrum of the fluorochrome to be observed. The fluorescence is often detected by a cooled CCD camera that has high sensitivity and low background noise.
Figure 9.5 Schematic view of a light path in an epifluorescent microscope
With the use of such an optimally configured optical system, the detected fluorescent intensity is virtually proportional to the amount of the fluorochrome. Confocal microscopes (e.g. laser scanning microscopes) are also commonly used, especially to visualize cellular microstructures, taking advantage of their high-contrast and depth-sectioning capability. However, these microscopes are not yet widely utilized for quantitative measurement, possibly because of the lack of a paradigm for such use and the troublesome handling of three-dimensional stack images. Although polychromatic staining using more than four colors is rarely performed, the laser scanning microscope potentially can be configured to perform polychromatic analysis similar to that performed with a polychromatic flow cytometer. Therefore, it is desirable to generate new applications to exploit such an informative measurement.
9.2.5 Live cell imaging
In general, cells are optically transparent and can hardly be seen without adding some sort of visual emphasis that reveals the cellular structures. The upper panel in Figure 9.6 shows an example of phase-contrast microscopy that optically reveals cellular morphology based on the difference in the refraction factors between the medium and the cellular structures. In most cases, live cell imaging utilizes fluorescent biomarkers to visualize cellular events: namely, fluorescent probes targeted at some cellular events are introduced into the study cells, and then successive changes of the biomarkers, such as changes in their localization, concentration, and sometimes fluorescence spectrum, are observed with a fluorescent microscope. Therefore, the outcomes of live cell imaging rely on the functionality of the probes and are also restricted by the availability of the probes. In addition, there are some concerns about introducing exogenous probes into living cells. Once a movie has been taken, time-lapse data of cellular events are extracted by image analysis. Typically, live cell imaging observes a few to a few dozen cells of interest, and fluorescence intensities in a region of interest are scored manually for each cell.
Figure 9.6 Multimode microscopy of a dividing PC12 cell. Near-simultaneous differential interference contrast (upper) and fluorescence (lower) imaging was used to visualize mitosis
9.2.6 Fluorescent probes for live cell imaging
9.2.6.1 Fluorescent protein
Fluorescent proteins, typified by green fluorescent protein (GFP), have been utilized extensively in various studies of intracellular dynamics of signaling molecules. In practice, the DNA sequence that encodes a target protein conjugated with a fluorescent protein is transduced into cells, and then the fusion protein is synthesized by the process used with endogenous proteins. The fluorescence signal is visualized in situ with a fluorescent microscope, thereby identifying the dynamics of cellular distribution of the target protein upon stimulation. Fluorescent proteins are also used to investigate promoter activity by monitoring expression levels of the fluorescent reporter that is fused downstream to the target promoter. In addition, fluorescent proteins are used to quantify intracellular trafficking and diffusion constants in vivo with the use of a photobleaching technique or photoconvertible fluorescent proteins that allow photochemical switching of fluorescence capability (Habuchi et al. 2005).
9.2.6.2 Fluorescence resonance energy transfer (FRET) probe
Although fluorescent proteins such as GFPs are useful for visualization of proteins, information achieved by these probes is usually limited to the localization and expression level of the proteins. For direct monitoring of signaling activities such as phosphorylation and molecular interactions, the FRET mechanism provides an expedient method for developing functional probes that are responsive to a conformational change in the target proteins caused by those reactions (Figure 9.7). Briefly, the excitation energy of a donor fluorochrome is transferred to an acceptor fluorochrome through the FRET mechanism when the donor and acceptor are located in proximity (typically within 10 nm), leading to fluorescence emission of the acceptor fluorochrome. Because the efficacy of FRET is very sensitive to the distance between the two fluorochromes, a conformational change in the genetically engineered fusion protein that equips two different fluorescent proteins on each terminal, or interaction of a pair of fusion proteins that contain different fluorescent proteins, is detectable as changes in the emission spectrum. The main shortcoming of the FRET probe is that it requires extensive trials and
Figure 9.7 Scheme showing how FRET from blue fluorescent protein (BFP) to GFP measures calcium ions. Copyright (1997–2011) A. Miyawaki, all rights reserved
selection to develop a functional probe. Although knowledge for the development of FRET probes has been accumulating, it can take more than 6 months to develop and validate a probe.
9.2.6.3 Fluorescent dye
Small-molecule fluorescent compounds are also used to monitor the inner states of cells. For example, Hoechst dye is a cell-permeable DNA intercalator that is used to monitor DNA content as well as chromatin dynamics in the living cell. Other reporters for monitoring vital cellular activities, such as calcium ion, ATP and cAMP concentrations, and pH balance, have also been developed.
9.2.6.4 Bioluminescent probe
Phototoxicity sometimes causes an unexpected issue in fluorescent microscopy during a long-running observation. Therefore, the bioluminescent reporter assay is preferable in situations such as studies of the circadian clock and cell cycle analysis. In practice, firefly luciferase is preferentially used as the luminescent reporter. In addition, compared with the typical fluorescence assay, the bioluminescence assay shows a better signal-to-noise ratio, owing to its low background signal, although the photon yield of bioluminescence is much less than that of fluorescence. This high signal-to-noise ratio enables reduced expression levels of the luciferase gene, eliminating the concern of unnecessary hindrance of native cellular functions.
9.2.6.5 Validation of probes
The functionality of the newly developed probes should be examined carefully. For example, it is well known that GFP-tagging sometimes alters the localization specificity of the genuine protein, owing to an increase in the molecular weight (about 27 kDa) or the tendency of GFP to form dimers. In promoter assays, the expression level of the reporter gene may reflect the activity of the target promoter but does not necessarily reflect the expression level of the target gene. This happens because the net protein expression level is also
affected by the stability of the protein and the mRNA, which are regulated by stabilizing/destabilizing motifs in their structures. Similarly, interpretation of the FRET signal should consider both the activation process and the inactivation process to mimic the signaling event of the target protein. Therefore, the measured signals of these probes must be compared with those obtained by other techniques such as Western blotting and immunostaining, at least in some comparable situations, to validate the physiological functionality of the probes. In addition, introducing biomarkers into living cells may cause inevitable disturbances in cellular signaling processes. For example, over-expression of fully functional protein sometimes results in overemphasized signal transmission or constitutive activation of the signaling pathway downstream. By contrast, over-expression of proteins that malfunction can result in impaired signal transmission caused by dominant-negative activity. Therefore, the introduction level of exogenous proteins, as well as of small-molecule indicator compounds, must be as low as possible in order to avoid disturbing intrinsic signaling, while retaining enough signal to be observed. However, it is not always easy to fulfill such stringent conditions, and these kinds of effects may even go unrecognized owing to shortcomings of the validation method, failure to treat the measurements derived by live cell imaging as quantitative, or simple failure to consider such side effects.
9.2.6.6 Introduction of probes
There are many different available experimental methods for introducing probes into living cells. Cell-permeable compounds and cell-penetrating peptide-tagged proteins are readily taken up by cells. Cell-impermeable compounds are introduced physically by electroporation or microinjection, or chemically by lipofection, in which compounds are encapsulated in an artificial liposome and then endocytosed into cells by liposome ingestion. If the probe to be introduced is a protein, it is good practice to introduce the nucleic acid that encodes the protein instead of introducing the protein itself, which will ensure a stable expression level regardless of the stability of the protein. This technique is called transfection. Use of virus vectors, which utilize mechanisms of viral infection for DNA transfection, yields very high levels of transfection efficiency; however, the process is rather laborious and it can take several weeks to prepare the vector, compared with other transfection methods. Whatever transfection method is used, it is accompanied by intrinsic variability in transfection efficiency as well as the number of DNA copies introduced into a cell. This means that the expression level of the transfected probe involves artificial variation, so the cells should be treated as a heterogeneous population, potentially confounding statistical analysis. If this type of population is inconvenient for the analysis, one can develop a stable cell line (i.e. a genetically identical cell population). Usually, the expression of transfected genes is transient because the transfected DNA is not integrated into the genomic DNA; hence, the transfected DNA decays through cell proliferation. However, genomic integration occurs in a very small percentage of cells, and the transfected gene is perpetuated in daughter cells. Therefore, by sub-cloning these cells, one can derive a cell line in which the probe is stably expressed. Again, this sub-cloning process takes at least 1 or 2 months.
9.2.7 Image cytometry
In a broad sense, image cytometry is a technique that quantifies cellular information from microscopic images with the aid of image processing. It is applicable to both living and fixed cells as long as some sort of cellular images are supplied with an appropriate visualization technique. However, although recent advances in image processing allow quantification of a variety of cellular information from microscopic images, it is strongly advisable to provide subsidiary staining, which facilitates distinguishing thousands of cells within an image. Therefore, a narrowly defined image cytometry technique constitutes a high-throughput assay system, if used in combination with high-throughput sample preparation and image acquisition followed by fully
automated image processing, a process called high content screening (HCS) (Figure 9.8). Image cytometry usually utilizes a microtiter plate of 96 or more wells in which an entire experiment is performed. A dozen microscopes dedicated to HCS have been released because it is difficult to manage that much image acquisition by hand. Most of the microscopes do not have an eyepiece, and acquired images can be shown only on the screen display. Consequently, modern image cytometry does not require the eyes of an experimenter, which distinguishes image cytometry from ordinary fluorescent microscopy. Image cytometry usually observes a few dozen to several thousand cells in each sample, collecting data on fluorescent intensity as well as morphological features of individual cells. When the process is applied to fixed and then immunostained cells, the resulting data are rather like those of flow cytometry; however, image cytometry can be applicable to adherent cells without detaching them from the bottom surface of the well. In addition, image cytometry is compatible with automated immunostaining because each solution-phase reaction can be performed in place, where cells are attached, with the use of a liquid handling system. Furthermore, with the integration of a carbon dioxide incubator into the liquid handling system, all experimental procedures, including cell culture, stimulation and immunostaining, can be conducted inside the system, allowing configuration of a fully automated assay system for high-throughput and highly reproducible analysis. Such innovative changes in experimental procedures promote qualitative and quantitative expansion of biological data, resulting in the need for new statistical analysis methods to analyze such high-throughput datasets.
9.2.8 Image processing
The aim of single-cell measurement is not to visualize a certain signaling event but to quantify the event; therefore, some sort of image processing is needed. Image processing of microscopic images has been extensively studied and is still developing. Aside from that, it is generally true that the wide variety of optical configurations and the variety of materials to be observed require specific image processing for each application. Many advanced types of image-processing algorithms, including object recognition, feature extraction and kinetic analysis, have been developed for a host of other image-oriented fields such as astronomy, clinical imaging, remote sensing and so on. However, such algorithms are not directly applicable to microscopic imaging owing to its inherent difficulties (i.e. low signal-to-noise ratio, ambiguous cell boundaries and artificial bias). At the same time, because artifacts in quantitative measurements are largely related to sample preparation rather than errors in object recognition, a very advanced type of image-processing algorithm is not needed at present.
9.2.8.1 Image compensation
Microscopic imaging is normally plagued by systematic errors derived from the optical system; these are, for example, non-flatness of the illumination light, stray light, aberration of the objective lens, light scattering derived from out-of-focus objects and fluorescence photobleaching. Therefore, it requires a set of image compensation processes to obtain a truly useful quantitative measurement. These systematic errors can be reduced by use of compensation algorithms, which normally implement an inverse operation of the error model with the aid of reference samples of a known signal. In addition, some spatial/temporal filtering such as a median filter is also used to reduce random errors to facilitate subsequent object recognition.
9.2.8.2 Segmentation
Image segmentation is a procedure that separates objects from the background. Image segmentation mainly involves two different approaches: histogram-based methods and edge-detection-based methods (Figure 9.9). The former uses the intensity of pixels, while the latter uses the spatial gradient of the intensity. In fluorescent images, cells are recognized as bright objects in the dark background; thus, a pixel-by-pixel histogram typically shows a bimodal distribution. In this case, histogram-based segmentation algorithms are used to find a threshold
Figure 9.8 Image cytometry analysis and equipment for HCS. (a) 96-well microtiter plate. (b) Automated microscope for HCS (Thermo Fisher Scientific Inc.). (c) Example of subsidiary staining (left and middle) and identified cells (right). (d) A configuration of the laboratory automation system for HCS (Beckman Coulter, Inc.)
that maximizes a separation criterion for the two components. However, those algorithms do not always work well owing to low signal-to-noise ratio and non-flatness of background levels. Adaptive thresholding techniques that vary the threshold over different image regions are often used to adapt to the spatially changing background level. Edge-detection-based methods, such as the snake algorithm and its variants, are used to
Figure 9.9 Different methods of image segmentation
extract bordered cells (e.g. in phase-contrast imaging, cells are recognized as dark objects surrounded by a bright halo in a gray background). Briefly, candidates for cell contours are extracted with the use of a differentiation filter, followed by binarization and thinning operations; then the cell contour is reconstituted by using a dynamic programming algorithm to align each fragment around a nucleus. Another commonly used edge-detection-based method is the watershed algorithm, which implicitly finds the boundary by moving a watermark that specifies a part of the object. The watershed algorithm is also used to separate contiguous cells because it can find a valley between two bright objects.
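As a concrete illustration of the histogram-based approach, the following is a minimal sketch using the scikit-image library; the input file name (nuclei.tif) and the use of Otsu's criterion as the separation measure are assumptions made for the example, and a real pipeline would typically add background compensation and watershed splitting of touching cells.

```python
# Minimal histogram-based segmentation sketch (illustrative only).
# Assumes a single-channel fluorescence image of nuclei in 'nuclei.tif'.
import numpy as np
from skimage import io, filters, measure

img = io.imread('nuclei.tif')                # 2-D intensity array

# Histogram-based thresholding: Otsu's method maximizes the
# between-class separation of foreground and background pixels.
threshold = filters.threshold_otsu(img)
mask = img > threshold

# Label connected foreground regions; each label approximates one nucleus.
labels = measure.label(mask)

# Per-object features (area, intensity) for downstream statistical analysis.
props = measure.regionprops(labels, intensity_image=img)
areas = np.array([p.area for p in props])
print(f'{labels.max()} objects; median area = {np.median(areas):.0f} px')
```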
9.2.8.3 Cell tracking
For live cell imaging, cell tracking is indispensable to obtain time series data of individual cells. However, this process is in essence a difficult problem because cells migrate minutely by changing their shape randomly (i.e. they are often in close contact with each other or even partially overlap). Furthermore, cellular events such as cell division and apoptosis, which happen over a longer period, should be considered for precise cell tracking. On the other hand, cell tracking requires careful selection of segmentation algorithms. For example, a snake algorithm generally requires initial values of the cell contours, but when time-lapse imaging data is used, the values of the estimated cell contours at the previous time point can be used as the initial values. Figure 9.10 shows an example of cell tracking by the extraction of cell boundaries and centroids from phasecontrast image data (Li et al. 2008).
Figure 9.10 Tracking mitotic and apoptotic MG-63 cells. (a) Six frames of a sequence with cell boundaries and centroids overlaid. Question marks indicate cells in intermediate stages (either mitotic or apoptotic). For daughter cells, the label of the parent is shown. (b) A spatiotemporal plot of the corresponding cell trajectories. The tick marks on the bounding box indicate the time instants of the six frames shown in (a) and the triangle indicates frame 61. Reprinted from Medical Image Analysis 12(5), Kang Li, Eric D. Miller, Mei Chen, Takeo Kanade, Lee E. Weiss, Phil G. Campbell, Cell population tracking and lineage construction with spatiotemporal context. 546–66, Copyright (2008), with permission from Elsevier
9.2.8.4 Feature extraction
The morphology of a cell and its intracellular structures are presented by a set of pixels, and features such as cell area and signal intensity in individual cells should be extracted and digitalized for statistical analysis. There is no standard set of features for extraction, and features to be extracted depend on the cell lines, cellular functions and molecules. These involve geometric features such as length of contour and curvature of a cell and intracellular structures, area of a specific region and texture. For intracellular localization, features involve the intensity ratio between cytosolic and nuclear intensities, relative distance from the plasma membrane and distribution of signal intensity in individual cells. Also, there may be features that are difficult to characterize, such as irregular morphology. In such a case, a supervised learning algorithm can be used; the supervisory signals will be some typical wild-type cells and irregular cells that were identified visually by skilled experts. When irregular cells can be identified, a machine learning algorithm can be useful; however, it is difficult to find such irregularities by unsupervised learning.
9.3 Analysis of single-cell measurement data
Single-cell measurement data can be used for time series and distribution analyses. Discussion of some examples follows.
9.3.1 Time series (mean, variation, correlation, localization)
We have developed quantitative image cytometry technology that allows measurement of time series or population data on the phosphorylation and localization of intracellular signaling molecules in fixed cells (Ozaki et al. 2010). Such time series data have often been obtained by Western blot analysis, but this technique cannot obtain precise time series data and simultaneous distributions of two different signaling activities or molecules inside a single cell. We prepared the quadplex-stained fixed cells, identified individual cells and quantified double staining signals. Figure 9.11 shows the time series of mitogen-activated protein kinase kinase (MEK) and extracellular signal-regulated kinase (ERK) phosphorylation in response to nerve growth
Figure 9.11 Statistics of single-cell measurement data. (a) Distributions of phosphorylated ERK in response to various concentrations of NGF stimulation showing that the distributions are close to a lognormal distribution. (b–d) Time courses of statistics in response to NGF stimulation: average phosphorylation levels of ERK (blue) and MEK (pink) (b); correlation coefficient between phosphorylated MEK and ERK (c); and standard deviations of phosphorylated MEK and ERK (d)
Figure 9.12 Bayesian network modeling with single-cell data. (A) Schematic of Bayesian network inference using multidimensional flow cytometry data. Nine different perturbation conditions were applied to sets of individual cells. Multiparameter flow cytometer simultaneously recorded levels of 11 phosphoproteins and phospholipids in individual cells in each perturbation dataset. This data conglomerate was subjected to Bayesian network analysis,
factor stimulation in PC12 cells. Although the detailed statistical modeling analysis remains challenging, the mean, variation and correlation of MEK and ERK distribution suggest the possible existence of an internal negative feedback loop within the MEK and ERK cascade. This highlights the potential for simultaneous single-cell measurements of two different signaling activities to find a novel aspect of cellular signaling. This study also demonstrates that image cytometry using immunostaining can be quantitatively comparable with Western blot analysis.
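To illustrate the kind of summaries discussed here, the sketch below computes per-time-point means, standard deviations and the MEK–ERK correlation from simulated single-cell data; the array layout, sample sizes and lognormal parameters are invented for the example and do not correspond to the published measurements.

```python
# Sketch: per-time-point statistics from single-cell measurements.
# pMEK and pERK are assumed to be arrays of shape (n_timepoints, n_cells)
# holding phosphorylation levels of MEK and ERK for each cell.
import numpy as np

rng = np.random.default_rng(0)
n_timepoints, n_cells = 10, 500
pMEK = rng.lognormal(mean=1.0, sigma=0.4, size=(n_timepoints, n_cells))
pERK = rng.lognormal(mean=1.2, sigma=0.5, size=(n_timepoints, n_cells))

mean_mek = pMEK.mean(axis=1)          # average phosphorylation per time point
sd_erk = pERK.std(axis=1)             # cell-to-cell variation per time point

# Pearson correlation between pMEK and pERK across cells at each time point
corr = np.array([np.corrcoef(pMEK[t], pERK[t])[0, 1]
                 for t in range(n_timepoints)])
print(corr.round(2))
```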
9.3.2 Bayesian network modeling with single-cell data
Sachs et al. (2005) inferred causal cellular signaling networks derived from multiparameter single-cell data obtained by use of polychromatic flow cytometry (Figure 9.12). They used nine different perturbation conditions applied to a set of human T cell populations, and simultaneously obtained levels of 11 phosphoproteins and phospholipids in individual cells for each perturbation condition. The data were subjected to Bayesian network analysis, which extracts an influence diagram that reflects underlying dependencies and causal relationships in the signaling network. They elucidated most of the traditionally reported signaling relationships and predicted novel interpathway network causalities, which they verified experimentally.
9.3.3 Quantifying sources of cell-to-cell variation
Using genetically identical cells, Colman-Lerner et al. (2005) showed that cell-to-cell variation of the expression level of fluorescent probes is dominated by initial differences between cells, rather than by stochastic noise in transcription and translation. In the plot in Figure 9.13, the major axis of distribution in the observed cell population indicates cell-to-cell variation of two different fluorescent probes under the regulation of the same promoter, and the minor axis indicates cell-to-cell variation of two different fluorescent probes derived from the stochasticity in transcription and translation. This observation clearly indicates that cell-to-cell variation mainly depends on the initial differences between cells. The authors further demonstrated that about half of the cell-to-cell variation is due to pre-existing differences in the cell-cycle position of individual cells at the time of pathway induction. They used live cell imaging analysis to dissect the sources of cell-to-cell variation, but this is an area that continues to develop rapidly and calls out for the development of more technology, as well as offering a rich field for the application of modern statistical tools and methodologies.
9.4 Summary
Here we introduced typical techniques for biological image analysis with a focus on single-cell measurements, and have summarized critical points for analyzing and interpreting the obtained data. The fundamental ← Figure 9.12 (Continued) which extracts an influence diagram reflecting dependencies and causal relationships in the underlying signaling network. (B) Bayesian networks for hypothetical proteins X, Y, Z and W. (a) In this model, X influences Y, which, in turn, influences both Z and W. (b) The same network as in (a), except that Y was not measured in the dataset. (C) Simulated data that could reconstruct the influence connections in (B) (this is a simplified demonstration of how Bayesian networks operate). Each dot in the scatter plots represents the amount of two phosphorylated proteins in an individual cell. (a) Scatter plot of simulated measurements of phosphorylated X and Y shows correlation. (b) Interventional data determine directionality of influence. X and Y are correlated under no manipulation (blue dots). Inhibition of X affects Y (yellow dots) and inhibition of Y does not affect X (red dots). Together, this indicates that X is consistent with being an upstream parent node. (c) Simulated measurements of Y and Z. (d) A noisy but distinct correlation is observed between simulated measurements of X and Z. From Science 2005, 308(5721): 523–9. Reprinted with permission from AAAS
Figure 9.13 Quantifying sources of cell-to-cell variation. (A) The mating pheromone response system and analytical framework for decomposition. The diagram shows proteins in the yeast cell membrane, cytoplasm (cyt) and nucleus (nuc). Events in the blue box are classified as the pathway subsystem. The binding of a-factor to the receptor Ste2 causes dissociation of the heterotrimeric G protein a-subunit Gpa1 from the Ste4-Ste18 dimer (bg-subunits). Ste4 recruits the scaffold protein Ste5 to the membrane, and Ste5 binds the MEKK Ste11, the MEK Ste7 and the MAPK Fus3. The PAK kinase Ste20 initiates the MAPK cascade by activating Ste11, which activates Ste7, which in turn activates the MAP kinases Fus3 and Kss1. Phosphorylated Fus3 and Kss1 leave Ste5 and translocate to the nucleus, where they activate the transcription factor Ste12. At a given concentration of a-factor, the amount of activated Ste12 on the promoter is the ’pathway subsystem output’ P. Events in the red box are classified as the expression subsystem, quantified by E. E includes transcription initiation, mRNA elongation and processing, nuclear export and cytoplasmic protein translation. The total system output - the amount of fluorescent reporter protein y produced in any cell i - depends on P, E, a-factor concentrations and the duration of stimulation DT. To measure cell-to-cell variation in the population we used the normalized variance h2, which is decomposed into separate additive terms that represent different sources of cell-to-cell variation. (B) Type I experiment, measuring gene expression noise (g). In strains containing two identical a-factor-responsive promoters driving the YFP and CFP reporter genes, the same pathway (blue box) and expression machinery (red box) controls the production of reporter proteins. The authors stimulated TCY3096 cells with a high concentration (20 nM) of a-factor and collected YFP and CFP images after 3 h. Each cell is represented by a single symbol showing its YFP and CFP signals (in FU or fluorescent units). The uncorrelated variation between YFP and CFP can be seen as the width of the minor axis, which is orthogonal to the 458 diagonal major axis (lines in black); it is caused only by stochastic variation in gene expression (g). The authors used the orthogonal scatter as a measure of h2 (g), here 0.002ˆ0.0001. (C) Type II experiment, measuring variation in pathway subsystem output (P) and expression capacity (E). In strains containing different promoters driving the YFP and CFP reporter genes, different subsystems (blue boxes) regulate the activity of the DNA-bound transcription factors, but the subsystem enabling expression of the reporter genes (red box) is the same. The authors stimulated TCY3154 cells as in the type I experiment in (B).
technology for single-cell measurements is rapidly developing, and the quality and quantity of such data are increasing. We did not, however, introduce all available technologies in this chapter. A combination of different complementary technologies will allow us to obtain data that are ideally suitable for systems biology analysis. However, to obtain such data requires more complex experimental procedures, and those who analyze such large-scale data may also need knowledge of such experimental backgrounds. Thus far, analyses using the distribution of a cell population are limited, and combining advanced technologies still presents technical difficulties. One reason for these difficulties is that the importance of biological knowledge, which is obtained only by the use of such advanced technology, is not yet widely recognized. Improving these methods and bringing them into mainstream biomedical research requires simultaneous advances in both experimental technology and data analysis. Having knowledge in both fields is a first step, and we hope that this chapter will help those who make such a step.
Acknowledgements
We wish to acknowledge funding by the CREST of the Japan Science and Technology Agency (JST) and the Strategic International Cooperative Program (Research Exchange Type) (JST).
References
Colman-Lerner, A., Gordon, A., Serra, E., et al. (2005). Regulated cell-to-cell variation in a cell-fate decision system. Nature 437(7059): 699–706.
Habuchi, S., Ando, R., Dedecker, P., et al. (2005). Reversible single-molecule photoswitching in the GFP-like fluorescent protein Dronpa. Proc Natl Acad Sci USA 102(27): 9511–6.
Li, K., Miller, E.D., Chen, M., et al. (2008). Cell population tracking and lineage construction with spatiotemporal context. Med Image Anal 12(5): 546–66.
Ozaki, Y., Uda, S., Saito, T.H., et al. (2010). A quantitative image cytometry technique for time series or population analyses of signaling networks. PLoS One 5(4): e9955.
Sachs, K., Perez, O., Pe'er, D., et al. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science 308(5721): 523–9.
← Figure 9.13 (Continued) Variation in expression capacity affected only the correlated variation (the dispersion of points along the major axis, or the 45° diagonal). The uncorrelated variation (the dispersion of points along the minor axis) is due to the gene expression noise measured from type I experiments and to cell-to-cell variations in the pathway subsystems for each promoter. Reprinted by permission from Macmillan Publishers Ltd: Nature 437(7059): 699–706, copyright (2005)
10 Protein Interaction Networks and Their Statistical Analysis
Waqar Ali, Charlotte Deane and Gesine Reinert
Department of Statistics, University of Oxford, UK
10.1 Introduction
A major aim of post-genomic biology is to provide a complete systems-level snapshot of the cells in an organism. This requires not only a detailed description of the different components of a cellular system, but also a deep understanding of how those components interact with each other. During the last few years large scale genome sequencing projects and advances in protein analysis technologies have gathered a huge dataset of the components of living systems. The focus now is on developing successful models of the interactions of these components that can explain real living systems. Protein–protein interactions (PPI) are the corner-stone of most biological processes taking place in cells. Recent advances in high throughput interaction detection techniques have led to the elucidation of substantial parts of the entire protein interaction set for several species. These large datasets of interactions can be conveniently represented in the form of networks, where the nodes represent proteins and edges represent interactions. Given these data, some pertinent research questions are:
1. How do these networks work? How can a network be manipulated in order to prevent, say, tumour growth?
2. How did these biological networks evolve? Could mutation affect whole parts of the network at once?
3. How similar are these networks? How much can we infer from the PPI network of one organism for those of other organisms?
4. How are these networks linked with other networks, such as gene interaction networks?
5. What are the building principles of these networks? How are resilience and flexibility achieved?
In order to attack these questions we first focus on:
1. How to best describe networks?
2. How to compare networks from related organisms?
3. How to model network evolution?
4. How to find relevant sub-structures of a network?
5. How to predict functions from networks?
6. How to infer and validate edges?
In the following sections, we first introduce proteins and their interactions (Section 10.2), experimental techniques to detect them and the high rate of false-positives and false-negatives. Section 10.3 covers network descriptions and model fitting. Approaches for the comparison of networks across species are discussed in Section 10.4, and Section 10.6 discusses evolution in networks. Community detection, and the identification of relevant sub-structures in a network, is found in Section 10.7, and Section 10.8 contains prediction using network structure as well as validation and inferring of edges. We conclude with a brief overview of current and future research directions.
10.2 Proteins and their interactions
Proteins are the most versatile macromolecules in living systems and serve crucial functions in most biological processes. For example, they function as catalysts, transport and store other molecules such as oxygen, provide mechanical support and immune protection, and control growth and differentiation. Proteins are linear polymers built of monomer units called amino acids. They fold up into three-dimensional structures that are thought to be determined by the sequence of amino acids in the protein polymer (Berg et al. 2006).
10.2.1 Protein structure and function
Due to interactions between the chemical groups on amino acids, a few characteristic patterns occur frequently within folded proteins. These recurring shapes are called the secondary structure, and they occur repeatedly as they are particularly stable (Brändén and Tooze 1991). The two most commonly occurring secondary structures are the alpha helix and the beta strand. These are both highly regular local substructures (Figure 10.1). The term tertiary structure is used to refer to the three-dimensional structure of a single protein molecule. This
Figure 10.1 Protein structure: (a) secondary structure elements; and (b) tertiary structure
final shape is determined by a variety of bonding interactions between the amino acids. The tertiary structure of a protein is thought to determine its functionality. Some proteins also possess quaternary structure which involves the association of two or more polypeptide chains into a multi-subunit or oligomeric protein. The polypeptide chains of an oligomeric protein may be identical or different. From the biological perspective, the function of a protein is the most important characteristic, which in turn is determined to a large extent by its structure. Although proteins can often be classified into functional groups, many proteins can carry out multiple functions dependent on the cellular context. Some major classifications include enzymes, antibodies, transport proteins, hormones, signalling proteins and structural proteins (Berg et al. 2006). Proteins can interact with each other and with other macromolecules to form complex assemblies. The proteins within these assemblies often act synergistically to generate capabilities not afforded by the individual component proteins. These assemblies include macromolecular machines that carry out the accurate replication of DNA, the transmission of signals within cells and many other essential processes.
10.2.2 Protein–protein interactions
Most proteins function through interaction with other molecules, and often these are other proteins. There is an important distinction between transient and obligate protein interactions. Many proteins exist as parts of permanent obligate complexes such as multi-subunit enzymes, which may often fold and bind simultaneously. Other interactions are fleeting encounters between single proteins or larger complexes. These include enzyme–inhibitor, hormone–receptor, and signaling–effector types of interactions. This distinction is not always well understood, and the classification is sometimes difficult (Mintseris and Weng 2005). The interactions between proteins are important for many biological functions and operate at almost every level of cell function including in the structure of subcellular organelles, the transport machinery across the various biological membranes, packaging of chromatin, the network of submembrane filaments, muscle contraction, signal transduction and regulation of gene expression (Huthmacher et al. 2008). Thus, the elucidation of protein interactions is a central problem in biology today. Unless we understand the complex interaction patterns of the tens of thousands of proteins that constitute our proteome, we cannot hope to even attempt to efficiently combat some of the most important diseases, let alone gain an integrated understanding of the living cell.
10.2.3 Experimental techniques for interaction detection
Given their importance, there has been a surge in studies of protein interactions during the last decade. Some of the initial experiments focused on small and specific sets of interactions of interest to a particular research group, and were characterised by repeated observations. However, the sheer scale of the number of possible interactions that proteins in a cell may undergo soon made researchers worldwide realise that there are probably more different possible interactions than there are researchers in the field. Thus, high throughput approaches for the elucidation of protein–protein interactions have rapidly gained appreciation. A few of the most popular and widely used experimental techniques are summarised below. These approaches differ widely in the quality and quantity of interaction data reported. Moreover, large scale studies using these methods show little overlap with each other.
10.2.3.1 Yeast two-hybrid system
The two-hybrid system is a genetic method that uses transcriptional activity as a measure of protein–protein interaction (Chien et al. 1991). Two hybrid proteins are created: one is a bait protein of interest fused to a DNA-binding domain and the other is a prey protein fused to a transcription activation domain. These two hybrids are then expressed in a cell containing one or more reporter genes. If the bait and prey proteins interact,
this can be detected by expression of the reporter genes. While the assay has been generally performed in yeast cells, it works similarly in mammalian cells. If all proteins in a genome are treated as prey and bait in pairwise tests, all possible interactions can be probed. The main criticism applied to the yeast two-hybrid screen of protein–protein interactions is the possibility of a high number of false positive (and false negative) identifications. The exact rate of false positive results is not known, but estimates range between 35 and 70% (Hart et al. 2006).
10.2.3.2 Tandem affinity purification
The two-hybrid system uses binary combinations to explore the interaction space of a set of proteins. A different strategy to solve this problem is to purify all protein complexes from a living cell, subsequently characterising their constituent parts. This is the strategy that lies at the heart of tandem affinity purification (TAP)-tagging approaches. First the nucleotide sequence encoding the TAP tag is inserted at the end of the open reading frame to be investigated. A column with immunoglobulin beads would retain the TAP-tagged protein and associated complexed proteins. The complex is then purified and separated into its constituent protein parts and analysed on a mass spectrometer (Puig et al. 2001). With the help of software, peptide sequences and protein identities are obtained from mass spectrometry. Compared with the yeast two-hybrid system, TAP is thought to have lower false negative rates (15%), and a false positive rate of 35% (Hart et al. 2006).
10.2.3.3 Co-immunoprecipitation
One of the most common and rigorous demonstrations of protein–protein interaction is the co-immunoprecipitation (Co-IP) of suspected complexes from cell extracts. Co-IP confirms interactions utilising a whole cell extract where proteins are present in their native conformation in a complex mixture of cellular components that may be required for successful interactions. An antibody specific to the bait protein is used to extract the complex of interest. This complex is purified and then evaluated using SDS-PAGE followed by Western blotting with specific antibodies (Phizicky and Fields 1995). Although very accurate, Co-IP can only determine the interaction between one pair of proteins at a time.
10.2.4 Computationally predicted data-sets
Parallel to experimental efforts, a number of computational methods have been developed for the prediction of protein interactions. Complete genome sequencing projects provide the vast amount of information needed for these analyses. The methods utilise the genomic and biological context of genes in complete genomes to predict functional linkages between proteins. Given that experimental techniques remain expensive, time-consuming, and labour-intensive, these methods represent an important advance in proteomics. One of the first methods for predicting protein–protein interactions from the genomic context of genes utilises the idea of co-localisation, or gene neighbourhood. Such methods exploit the notion that genes which physically interact (or are functionally associated) will be kept in close physical proximity to each other on the genome (Tamames et al. 1997; Overbeek et al. 1999; Bowers et al. 2004). This method has been successfully used to identify new members of metabolic pathways (Dandekar et al. 1998). Another method exploits the co-occurrence of homologous pairs of genes across multiple genomes. The fact that a pair of genes remains together across many disparate species represents a concerted evolutionary effort that suggests that these genes are functionally associated or physically interacting. The analysis of phylogenetic context in this fashion has been termed phylogenetic profiling (Pellegrini et al. 1999). This method has been used not only to infer physical interaction, but also to predict the cellular localisation of gene products (Marcotte et al. 2000; Bowers et al. 2005).
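As a toy illustration of the phylogenetic profiling idea described above, the sketch below scores protein pairs by the similarity of their presence/absence patterns across genomes; the profile matrix and the Jaccard similarity cut-off are invented for the example and are not part of any published method.

```python
# Toy phylogenetic profiling: proteins with similar presence/absence
# patterns across genomes are flagged as candidate functional partners.
import numpy as np
from itertools import combinations

# Rows = proteins, columns = genomes (1 = homologue present, 0 = absent).
profiles = {
    'A': np.array([1, 1, 0, 1, 1, 0, 1, 1]),
    'B': np.array([1, 1, 0, 1, 1, 0, 1, 0]),
    'C': np.array([0, 0, 1, 0, 0, 1, 0, 1]),
}

def jaccard(u, v):
    """Fraction of genomes containing both proteins among those containing either."""
    both = np.logical_and(u, v).sum()
    either = np.logical_or(u, v).sum()
    return both / either if either else 0.0

for p, q in combinations(profiles, 2):
    s = jaccard(profiles[p], profiles[q])
    if s >= 0.75:                      # arbitrary illustrative cut-off
        print(f'{p}-{q}: candidate functional link (Jaccard = {s:.2f})')
```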
Methods using the analysis of gene fusion across complete genomes have also been proposed (Enright et al. 1999; Marcotte et al. 1999). A gene fusion event represents the physical fusion of two separate parent genes into a single multifunctional gene. This is the ultimate form of gene co-localisation as interacting genes are not just kept in close proximity on the genome, but are also physically joined into a single entity (Skrabanek et al. 2008). These events are detected by cross-species sequence comparison and provide a way to computationally detect functional and physical interactions between proteins. Although the method is not generally applicable to all genes, it has been shown to have an accuracy as high as 90% and has been successfully applied to a large number of genomes, including eukaryotes (Enright and Ouzounis 2001). It must be noted that all of these methods use some experimental data sources and as a result, they all suffer from the limitations of experimental approaches and incompleteness of observed data. Moreover, many of these techniques detect functional associations between proteins that do not necessarily indicate physical interactions.
10.2.5 Protein interaction databases
As a consequence of the experimental and computational approaches providing data about interacting proteins on a genome- and proteome-wide scale, several research groups have made an important effort in designing and setting up databases. The interaction data in these databases usually results from the integration of diverse datasets. Public databases of protein interactions include:
• Biomolecular Interaction Network Database–BIND (Bader et al. 2001);
• Database of Interacting Proteins–DIP (Xenarios et al. 2002);
• General Repository for Interaction Datasets–GRID (Breitkreutz et al. 2003);
• Molecular Interactions Database–MINT (Zanzoni et al. 2002);
• Search Tool for the Retrieval of Interacting Genes/Proteins–STRING (von Mering et al. 2003);
• Human Protein Reference Database–HPRD (Keshava Prasad et al. 2009).
The structure and type of data that these databases contain is similar, but not identical. Most of these databases contain protein–protein interaction data only, though MINT and BIND also feature interactions involving non-protein entities such as promoter regions and mRNA transcripts. DIP is probably the most highly curated database of protein interactions. Curation in DIP is carried out manually by experts and also automatically using computational approaches. The sheer volume of interaction data available in these databases poses many challenges along with opportunities. On the one hand, such large scale data can enable one to infer large scale properties of cellular systems. On the other hand, the data have to be presented and analysed in a manageable framework.
10.2.6 Error in PPI data
Recent estimates suggest that the full yeast protein–protein interaction network contains 37 800–75 500 interactions and the human network 154 000–369 000 (Hart et al. 2006), but owing to a high false negative rate, current experimental datasets are roughly only 10–50% complete. Analysis of yeast, worm, and fly data indicates that 25–45% of the reported interactions are likely false positives (Huang et al. 2007). Membrane proteins have higher false discovery rates on average, and signal transduction proteins have lower rates. The overall false negative rate ranges from 75% for worm to 90% for fly, which arises from a roughly 50% false negative rate due to statistical under-sampling and a 55–85% false negative rate due to proteins that appear to be systematically lost from the assays (Huang et al. 2007).
Error rates for large-scale PPI datasets can be estimated computationally using methods like the expression profile reliability (EPR) index and paralogous verification method (PVM) (Deane et al. 2002). The EPR index estimates the biologically relevant fraction of protein interactions detected in a high throughput screen. It does so by comparing the RNA expression profiles for the proteins whose interactions are found in the screen with expression profiles for known interacting and noninteracting pairs of proteins. PVM judges an interaction likely if the putatively interacting pair has paralogs that also interact. More recent methods such as IG1, IG2 (Saito et al. 2002, 2003) and IRAP (Chen et al. 2005) use network structure in assessing individual interaction reliabilities. Current PPI networks are, therefore, a sample of the complete network. Biases in sampling could lead to even more drastic differences between the complete network and the subsample that we observe. Even data derived from high throughput studies are not an unbiased sample of the complete network; rather, they are biased toward proteins from particular cellular environments, toward more ancient, conserved proteins and toward highly expressed proteins (von Mering et al. 2002). Current interaction maps represent the first steps on the way to accurate networks, and should continue to improve in accuracy and sensitivity with time.
10.2.7 The interactome concept and protein interaction networks
The compendium of all molecular interactions present in cells is called the interactome. When used in terms of proteomics, interactome refers to the entire set of protein–protein interactions for a species. Due to limitations of current knowledge, the experimentally and computationally determined set of protein interactions available in databases is a subset of the real interactome. Still, the sheer number of known protein interactions makes even the simplest analysis a difficult task. It has therefore become routine to represent these data in the form of protein interaction networks. A protein interaction network can summarise large amounts of interaction data in the form of graphs, with proteins as nodes and interactions as edges. The networks are undirected, and may be weighted. The weights of the edges could represent the confidence level for the interaction (typically based on the experimental or computational method used to detect that particular interaction). A distinct advantage of such a representation is the visual and computational ease in detecting higher level structures in interaction data. For instance, many biological processes are a result of more than two proteins acting in sequential pathways or simultaneously forming multiprotein complexes, which can be identified relatively easily in a network.
10.3 Network analysis
Computational analysis of PPI networks makes extensive use of graph theoretical techniques and concepts from literature on random networks. In the following sections, we introduce some basic terminology and present several methods as well as models for network analysis.
10.3.1 Graphs
A graph consists of nodes (also called vertices) and edges (also called links). Nodes may possess characteristics which are of interest (such as protein structure or function). Edges may possess different weights, depending on for example, the strength of the interaction or its reliability. Mathematically, we abbreviate a graph as G = (V, E), where V is the set of nodes and E is the set of edges. We use the notation |S| to denote the number of elements in the set S. Then |V | is the number of nodes, and |E| is the number of edges in the graph G. If u and v are two nodes and there is an edge from u to v, then we write that (u, v) ∈ E, and we say that v is a neighbour of u; and u and v are adjacent. If both endpoints of an edge are the same, then the edge is a loop.
Figure 10.2 Frequency of node degrees (k) in the yeast DIP network (accessed July 2010)
In general in PPI networks, we exclude self-loops as well as multiple edges between two nodes. Edges may be directed or undirected; here we shall mainly deal with undirected edges. While some authors assume that, in contrast to graphs, networks are connected, here we make no such assumption and use the terms graph and network interchangeably.
10.3.2 Network summary statistics
Although large networks are typically high dimensional and complex objects, many of their important properties can be captured by calculating relatively simple summary statistics. One of the most basic is the average degree. The degree deg(v) of a single node v is the number of edges which are adjacent to v. The average degree of a graph is then the average of its node degrees. In protein interaction networks it has been found that the vast majority of nodes have low degrees, whereas a few nodes are highly connected (Figure 10.2). This apparent similarity to the power law distribution prompted the popular classification of PPI networks as scale-free (Jeong et al. 2001; Barabási and Oltvai 2004), although subsequent studies have challenged this view (Tanaka et al. 2005; de Silva et al. 2006; Lima-Mendez and van Helden 2009). The clustering coefficient for a graph measures the tendency of the formation of tightly connected groups of nodes. Two versions of the clustering coefficient are in use: the global clustering coefficient is defined as the number of closed triplets of nodes in the network divided by the total number of triplets. The local clustering coefficient is defined for single nodes and is defined as the number of links existing between the neighbours of the node divided by the total number of possible links; for node i with k_i neighbours in its set N_i of neighbours, the clustering coefficient C_i is

C_i = \frac{2\,|\{(v_j, v_k) \in E : v_j, v_k \in N_i\}|}{k_i(k_i - 1)}.
The shortest path length and average shortest path length of a graph are also commonly used summary statistics, where path length is defined as the number of edges traversed to reach a target node from a source node. Other popular summaries include the betweenness of an edge (or node) which counts the proportion of all the shortest paths in the network which pass through this edge (or node). For a comprehensive review of graph summary statistics, see for example Luciano et al. (2007). A comparison of yeast and human interaction networks indicates very similar clustering and path-length statistics, despite the difference in size (Table 10.1). To judge whether these differences are significant, network models are needed (see Section 10.3.4).
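These summaries can be computed directly from an interaction list with standard graph software. The sketch below uses the networkx Python library and assumes a two-column edge-list file (ppi_edges.txt) of interacting protein pairs; it illustrates the definitions above rather than reproducing the published numbers in Table 10.1.

```python
# Sketch: summary statistics of a PPI network from an edge list.
# Assumes 'ppi_edges.txt' has one interaction per line: "proteinA proteinB".
import networkx as nx

G = nx.read_edgelist('ppi_edges.txt')
G.remove_edges_from(list(nx.selfloop_edges(G)))  # PPI convention: drop self-loops

n, m = G.number_of_nodes(), G.number_of_edges()
avg_degree = 2.0 * m / n                     # each edge contributes to two degrees
avg_clustering = nx.average_clustering(G)    # mean of local clustering coefficients

# The average shortest path is only defined on a connected graph,
# so restrict attention to the largest connected component.
largest_cc = max(nx.connected_components(G), key=len)
avg_path = nx.average_shortest_path_length(G.subgraph(largest_cc))

print(f'nodes={n}, edges={m}, <k>={avg_degree:.2f}, '
      f'C={avg_clustering:.4f}, <l>={avg_path:.2f}')
```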
Table 10.1 Summary statistics for yeast DIP and human HPRD protein interaction networks. Data downloaded from DIP and HPRD on 15 July 2010

Summary statistic            Yeast DIP    Human HPRD
Nodes                        4823         12937
Edges                        17471        43496
Avg. degree                  6.10         6.72
Avg. clustering coefficient  0.1283       0.1419
Avg. shortest path           4.14         4.40

10.3.3 Network motifs
In addition to considering general graph summary statistics, it has proven fruitful to describe the smaller-scale structure of networks in terms of subgraphs and motifs. Given a graph G = (V, E), a subgraph G_S = (V_S, E_S) consists of a subset of nodes V_S ⊆ V and a subset of edges E_S ⊆ E connecting the nodes of V_S in the original graph (see Figure 10.3). The subgraph induced by V_S is the subgraph G_S that includes all the edges of G which connect the vertices of V_S. A motif is commonly defined as a subgraph with a fixed number of nodes and a given topology that appears more often in a graph than expected by chance (Figure 10.3). The overrepresentation of a subgraph is established on the basis of its frequency compared with the average frequency of the same subgraph in a set of random networks (either based on a suitable model or generated by shuffling the edges of the original network while keeping the same degree distribution). A motif of size k, i.e. containing k nodes, is called a k-motif. As the number of possible k-motifs grows very fast with k, only small size k-motifs have been studied in PPI networks. The two most commonly studied motifs in the context of PPI networks are cliques, i.e. complete subgraphs, and k-cores, i.e. subgraphs in which every node has degree at least k. The enumeration of cliques and k-cores in particular has been used as a method of detecting protein complexes and functionally related proteins in protein interaction networks. Apart from PPI networks, motifs have also been found to be present in gene regulatory, metabolic and transcription networks (Alon 2006). There is a large body of literature on network models in different fields from physics to sociology to the internet and biology (Wasserman and Faust 1995; Alm and Arkin 2003; Dorogovtsev and Mendes 2003). In the next section, we present some random network models used to model protein interaction networks. Unfortunately, for some research questions none of these models provide a good fit for the data (Rito et al. 2010). However, there may be research areas or new data which make these models relevant.
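The sketch below indicates how such counts can be obtained in practice with networkx, assuming the same kind of edge-list file as above; the number of randomised networks and the choice of k = 5 for the k-core are arbitrary illustrative settings.

```python
# Sketch: triangle counts, maximal cliques and k-cores, with a
# degree-preserving randomisation as the reference for motif detection.
import networkx as nx

def count_triangles(G):
    # nx.triangles counts triangles per node; each triangle is counted 3 times.
    return sum(nx.triangles(G).values()) // 3

G = nx.read_edgelist('ppi_edges.txt')        # assumed two-column edge list
observed = count_triangles(G)

max_clique_size = max(len(c) for c in nx.find_cliques(G))
core5 = nx.k_core(G, k=5)                    # subgraph where every node has degree >= 5

# Null distribution: shuffle edges while keeping the degree sequence fixed.
null = []
for _ in range(100):
    R = G.copy()
    nx.double_edge_swap(R, nswap=10 * R.number_of_edges(), max_tries=10**7)
    null.append(count_triangles(R))

print(f'observed triangles: {observed}, null mean: {sum(null) / len(null):.1f}')
print(f'largest clique: {max_clique_size}, 5-core size: {core5.number_of_nodes()}')
```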
10.3.4 Models of random networks
In order to judge whether a network summary is ‘unusual’ or whether a motif is ‘frequent’, there is an underlying assumption of randomness in the network. To understand mechanisms which could explain the formation of networks, mathematical models have been suggested. The main models, discussed elsewhere in
Figure 10.3 Some examples of motifs: (a) line; (b) triangle; (c) square; and (d) 5-node clique
this volume, are firstly Bernoulli or Erdős–Rényi (ER) random graphs (Erdős and Rényi 1960), with a finite node set and independent identically distributed edges; a variant is the random graph model G(n, m) with n nodes and m edges chosen uniformly at random from all \binom{n}{2} = \frac{1}{2}n(n-1) possible edges. Barabási–Albert (BA) models (Barabási and Albert 1999) start with a small complete graph; new nodes attach to existing nodes with probability proportional to (a power of) the degree of the existing node, resulting in an asymptotic power-law degree distribution. ER mixture graphs, also known as latent block models in social science (Nowicki and Snijders 2001), assume that nodes are of different types, edges are independent, and the probability for an edge varies depending on the type of the nodes at its endpoints. Another set of models are exponential random graph (p*) models, where all edges of the network are modelled simultaneously, making it easy to incorporate dependence. A variation of the ER model is an ER graph with fixed degree distribution, abbreviated to ER-DD. For a given real graph as input, an ER-DD graph is constructed to have not only the same number of nodes and edges as the input graph, but also the same degree distribution. Finally, geometric random graph models (GEOdD) have also been proposed (Penrose 2003), which are constructed by dropping n nodes randomly uniformly into the unit square (or more generally according to some arbitrary specified density function on d-dimensional Euclidean space) and adding edges to connect any two nodes distant at most r from each other. The above models were initially proposed in nonbiological contexts. While studies suggest that they are able to reproduce some coarse properties of biological networks, it is difficult to relate their growth mechanisms to real biological systems. This has led to the proposal of models specifically aimed at protein interaction networks. For instance, although the Barabási and Albert class of models proposes a preferential attachment rule resulting in a power-law degree distribution observed in protein interaction networks, the underlying reason for preferential attachment is unclear. A more biologically plausible mechanism is gene duplication and divergence (DD) (Ispolatov et al. 2005a), where nodes are randomly selected and copied along with their links (Figure 10.4). In the DD model, the degree of a node increases mainly by having duplicate genes as its neighbours. Therefore, the preferential attachment rule is achieved implicitly, with highly connected nodes having more chance to have duplicate genes as their neighbours. DD models have been shown to closely model the degree distribution observed in real protein networks (Evlampiev and Isambert 2007). The DD model is also shown to generate hierarchically modular networks under certain conditions. If self-interactions (homo-oligomers) are taken into consideration, the DD model gives rise to networks with patterns of clustering and abundance of cliques similar to those found in natural networks (Ispolatov et al. 2005b). While the basic DD model remains by far the most widely accepted one in protein interaction network literature, some recent studies have proposed enhancements such as mixture models combining DD and preferential attachment (Ratmann et al. 2007). Alternatives to the DD model have also been investigated,
Figure 10.4 Duplication divergence model. The duplicate node (blue) loses some of the original links and creates new links
including a crystal growth model that captures the age-dependency of interaction density in the yeast interaction network along with hierarchical modularity (Kim and Marcotte 2008).

10.3.5 Parameter estimation for network models
In most of these models it is necessary to estimate parameters. In ER graphs, where the unknown parameter is the edge probability, this probability can be estimated using standard maximum likelihood, yielding the graph density as an estimate. The graph density is the number of edges that are present in the network, divided by the total possible number of edges in the network. In the G(n, m) version, once the number of nodes and the number of edges are observed, no parameters are to be estimated. In BA models, the parameters include the power exponent for the node degree, as occurring in the probability for an incoming node to connect to some node already in the network, and the size of the initial complete graph. Estimation depends on the precise model formulation – the general BA model does not specify the joint distribution of edges. In exponential random graphs, unless the network is very small, maximum likelihood estimation quickly becomes numerically infeasible. Instead, Markov chain Monte Carlo (MCMC) estimation is employed. Unfortunately, in exponential random graph models it is known that in some small parameter regions the stationary distribution of the Markov chain is not unique. To assess the model fit, the distribution of a network statistic under the model of choice can be used to see whether the observed value of the network statistic is unusual. This could be carried out by establishing the (asymptotic) distribution of the network statistic of choice and finding the p-value of the observed statistic, or by using Monte Carlo tests. For example, Lin and Reinert (2011) showed that in ER graphs and generalisations which include ER mixture graphs, the number of triangles and the number of squares are asymptotically Poisson distributed if the edge probabilities are small, and asymptotically normally distributed when the edge probabilities are moderate, with nontrivial covariance matrix.

As full-likelihood based parameter estimation is often computationally intractable for even moderate-sized networks and relatively simple evolutionary models, studies of protein network growth models have mostly been restricted to comparing the observed degree distribution to a probability model for the degree distribution. Ratmann et al. (2007) developed a novel, model-based approach for Bayesian inference on biological network data that centres on Approximate Bayesian Computation (ABC; see Section 10.3.6). Instead of computing the intractable likelihood of the protein network topology, their method summarises key features of the network and then uses an MCMC algorithm to approximate the posterior distribution of the model parameters. This was used to fit a mixture model that captures network evolution by combining duplication divergence with preferential attachment, to data from Helicobacter pylori and Plasmodium falciparum. Fitting this model using ABC indicated that gene duplication has played a larger part in the PPI network evolution of the eukaryote than in the prokaryote.
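As a minimal illustration of this kind of model assessment, the sketch below estimates the ER edge probability by the observed graph density and then runs a Monte Carlo test asking whether the observed number of triangles is unusual under the fitted model. The choice of triangles as the test statistic, the number of replicates and the toy input graph are assumptions made for this sketch.

```python
import networkx as nx

def er_density_mle(G):
    """Maximum likelihood estimate of the ER edge probability: the graph density."""
    n = G.number_of_nodes()
    return G.number_of_edges() / (n * (n - 1) / 2)

def triangle_count(G):
    return sum(nx.triangles(G).values()) // 3

def monte_carlo_pvalue(G, n_rep=200, seed=0):
    """Empirical p-value for the observed triangle count under the fitted ER model."""
    p_hat = er_density_mle(G)
    observed = triangle_count(G)
    exceed = 0
    for i in range(n_rep):
        H = nx.gnp_random_graph(G.number_of_nodes(), p_hat, seed=seed + i)
        if triangle_count(H) >= observed:
            exceed += 1
    return (exceed + 1) / (n_rep + 1)   # add-one correction for a valid Monte Carlo p-value

G = nx.gnp_random_graph(200, 0.02, seed=42)   # stand-in for an observed network
print(er_density_mle(G), monte_carlo_pvalue(G))
```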
10.3.6 Approximate Bayesian Computation
In standard Bayesian inference the posterior distribution for a parameter θ is given by P(θ|D) ∝ P(D|θ)π(θ). Here π is the prior distribution for θ, D are the data, and P(D|θ) is the likelihood of the data D given the parameter θ. When simulating from sufficiently complex models and large datasets, the probability of happening upon a simulation run that yields precisely the same dataset as the one observed will be very small, often unacceptably so. This is especially true in the case of network data, where it is nearly impossible to simulate a network with exactly the same topology as the dataset. The explicit evaluation of the likelihood P(D|θ) is avoided in ABC approaches by considering distances between observations and data simulated
from a model with parameter θ. Rather than considering the data itself, we consider a summary statistic of the data, S(D), and use a distance Δ(S(D), S(X)) between the summary statistics of real and simulated data, D and X, respectively. The generic ABC approach to infer the posterior probability of a parameter θ is as follows:

1. Sample a candidate parameter vector θ* from some proposal distribution.
2. Simulate a dataset X from the model with parameter θ*.
3. If Δ(S(D), S(X)) < ε then accept θ* as a sample from the posterior.

For ε sufficiently small, the ABC procedure should deliver a good approximation to the true posterior, in particular if the summary statistic S is a sufficient statistic of the probability model. If sufficient statistics do not exist or are hard to obtain, setting up a satisfying and efficient ABC approach can be challenging. The generic procedure outlined above can be computationally inefficient, but ABC procedures can be combined with standard computational approaches used in Bayesian inference such as MCMC and sequential Monte Carlo. In these frameworks ABC can be used to tackle otherwise computationally intractable problems. For a review of ABC see, for example, Beaumont (2010).
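The three steps above translate almost line for line into code. The sketch below is a rejection sampler for the edge probability of an ER network; the prior, the tolerance ε and the two summary statistics (density and mean clustering coefficient) are arbitrary choices made for illustration, not those used in any of the studies cited in this chapter.

```python
import math
import random
import networkx as nx

def summaries(G):
    """Summary statistic S(.): graph density and average clustering coefficient."""
    return (nx.density(G), nx.average_clustering(G))

def distance(s1, s2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s1, s2)))

def abc_rejection(G_obs, n_draws=2000, eps=0.01, seed=0):
    """ABC rejection sampler for the edge probability theta of an ER model."""
    rng = random.Random(seed)
    s_obs = summaries(G_obs)
    n = G_obs.number_of_nodes()
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, 0.1)                                  # 1. draw from the prior
        X = nx.gnp_random_graph(n, theta, seed=rng.randrange(10**9))   # 2. simulate a dataset
        if distance(s_obs, summaries(X)) < eps:                        # 3. accept if close enough
            accepted.append(theta)
    return accepted

G_obs = nx.gnp_random_graph(150, 0.03, seed=1)   # stand-in for observed data
posterior_sample = abc_rejection(G_obs)
print(len(posterior_sample), sum(posterior_sample) / max(len(posterior_sample), 1))
```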
10.3.7 Threshold behaviour in graphs
It is widely believed that there is a correspondence between topological motifs or subgraphs in PPI networks and biologically relevant functional modules (Bachman and Liu 2009; Spirin and Mirny 2003; Hartwell et al. 1999). Thus rigorous theoretical studies of the conditions under which certain subgraphs might arise are of great interest. For a graph G(n, m) with n nodes and m edges, many theoretical properties change dramatically in a narrow range of m, which leads to the concept of threshold functions (Erdős and Rényi 1960). Let G_{n,f(n)} be a family of random graphs induced by n nodes and a function f(n) that gives the edges according to the specific model. If Q is a graph property, P(Q) denotes the probability that G_{n,f(n)} has, or belongs to, Q. We say that almost every graph in G_{n,f(n)} has the property Q if P(Q) → 1 as n → ∞. For a given monotone increasing property Q (such as the appearance of a certain subgraph), we define a threshold function t(n) for Q as any function which satisfies P(Q) → 0 if f(n)/t(n) → 0 and P(Q) → 1 if f(n)/t(n) → ∞. Threshold functions for the ER model are not unique, although they are so within certain factors (Bollobas 2001). For the ER model G(n, M(n)), with f(n) = M(n), it is possible to show that the threshold function for the property of containing a fixed, nonempty graph F is n^{2−2/m}, where m = m(F) is the maximum average degree of F (Bollobas 2001). For the ER model G(n, M(n)) it is possible, and more informative, to calculate the graph density such that the expected number of copies of a given subgraph F is approximately 1. For a subgraph on v vertices with e edges, the approximate expected count for the subgraph under the ER model G(n, M(n)) with \rho = \rho(n) = M(n)/\binom{n}{2} is

E(\text{number of occurrences}) = \lambda := \binom{n}{v} \rho^e (1 - \rho)^{\binom{v}{2} - e} \sim n^v \rho^e / v!,

for small ρ. When the number of occurrences is well approximated by a Poisson process, as is the case for balanced graphs, P(at least one occurrence of the subgraph) ∼ 1 − e^{−λ} ∼ λ when λ is small, and hence the threshold function and the expectation formula coincide. Threshold functions for other models are not so well understood, but Goel et al. (2005) have shown that every monotone graph property has a threshold in geometric random graphs, generalising a similar result by Friedgut and Kalai (1996) for ER graphs. One can, nonetheless, calculate approximate threshold values for
the appearance of induced graphlets with k vertices. This is based on the fact that for a random geometric graph placed in R^d with n vertices and a radius r, the k-vertex subgraph count satisfies a Poisson limit when the product n^k r^{d(k−1)} tends to a finite constant (Penrose 2003). Choosing r such that n^k r^{d(k−1)} = 1 then gives an approximate threshold value. To translate this value r into a graph density, we use that the radius r can be related to the expected average degree α by using the gamma function Γ(x) (Dall and Christensen 2002),

r = \frac{1}{\sqrt{\pi}} \left( \frac{\alpha}{n}\, \Gamma\!\left(\frac{d + 2}{2}\right) \right)^{1/d}.    (10.1)

Using r such that n^k r^{d(k−1)} = 1 and solving for α in (10.1) approximates the threshold graph density ρ as \rho = (\alpha n / 2)/\binom{n}{2}. While threshold behaviour has been almost exclusively studied in theoretical models so far, in a recent paper Rito et al. (2010) show that PPI networks are situated in a region of graph density close to the threshold behaviour in ER and GEO3D models. Yeast has about 6600 protein-coding genes and is predicted to have about 25 000–35 000 interactions (Stumpf et al. 2008); such a network would have a graph density between 0.0011 and 0.0016. For humans, estimates of about 25 000 genes (Human Genome Project) and 650 000 interactions (Stumpf et al. 2008) would also lead to graph densities around 0.002. Both these networks would be placed in the threshold region for the appearance of specific types of motifs under the ER model. As the authors subsequently show, GDDA (see Section 10.4.1), which is a widely used measure of fit between interaction datasets and theoretical models, is unstable in this very region. It is conjectured that this instability in model fitting could be a consequence of the threshold for specific motifs appearing in the networks.
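The approximation λ ≈ n^v ρ^e / v! also gives a quick way to locate the ER threshold density for a motif, by solving λ = 1 for ρ. The sketch below does this for a few small motifs at roughly yeast- and human-sized protein sets; the motif list and network sizes are used purely for illustration.

```python
from math import factorial

def expected_count(n, rho, v, e):
    """Approximate expected number of copies of a motif with v vertices and e edges
    in an ER graph with n nodes and density rho (valid for small rho)."""
    return n ** v * rho ** e / factorial(v)

def threshold_density(n, v, e):
    """Density rho at which the expected motif count is approximately 1."""
    return (factorial(v) / n ** v) ** (1.0 / e)

motifs = {"triangle": (3, 3), "square": (4, 4), "5-node clique": (5, 10)}
for n in (6600, 25000):                      # roughly yeast- and human-sized protein sets
    for name, (v, e) in motifs.items():
        print(n, name, threshold_density(n, v, e))
```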
10.4 Comparison of protein interaction networks
So far we have discussed analysis techniques for single networks. The availability of interaction data for multiple species also opens up the opportunity for comparative techniques. Current research in the comparison of networks follows two generally separate streams: (1) comparing experimental networks with theoretical models in order to assess the fit; and (2) comparison of experimental networks across multiple species to identify conservation at the systems level. Here we concentrate on using subgraph counts for the first problem, and network alignment for the second problem. Indeed the second problem itself is also referred to as network alignment as it is essentially a graph-matching problem. While much work has been done and many algorithms have been proposed recently for network alignment, some of which we discuss later in this section, few measures exist for assessing the agreement between experimental data and theoretical network models.

10.4.1 Network comparison based on subgraph counts
One possible way to compare empirical and model generated networks is by quantifying their similarity in terms of abundance of specific classes of subgraphs. GraphCrunch (Milenkovic et al. 2008) is an open source software tool that compares large real-world networks with random graph models. These are automatically generated to have the same number of nodes and edges (to within 1%) as those of the real-world network being compared. (This is approximate; with a 12-star as input, GraphCrunch generates ER-DD graphs with 10, 11 and 12 edges.) As well as many global standard properties, the software supports the local statistics RGF-distance (Przulj et al. 2004) and GDDA (Przulj 2007). RGF-distance compares the frequencies of the appearance of all 3- to 5-node subgraphs in two networks. The networks being compared by GraphCrunch always have the same number of nodes as well as edges, and thus the frequencies of occurrence of the only 1-node subgraph, a node, and the only 2-node subgraph, an edge, are also taken into account by this measure. GDDA uses orbit degree distributions, which are based on the automorphism orbits of the
Figure 10.5 Some subgraphs and their automorphism orbits (Rito et al. 2010)
29 subgraphs on 2–5 vertices, as follows. Automorphisms are edge-preserving bijections from a graph to itself, and together they form a permutation group. An automorphism orbit is a set of nodes that are mapped onto one another by this group (Figure 10.5). Within the 29 subgraphs, 73 different orbits can be found and each one will have an associated orbit degree distribution. An orbit i from subgraph G_j has orbit degree k in the graph G if there are k copies of G_j in G which involve orbit i. Let d_G^j(k) be the sample distribution of the node counts for a given orbit degree k in a graph G and for a particular automorphism orbit j. This sample distribution is then scaled by 1/k in order that large degrees do not dominate the score, and normalised to give a total sum of 1,

N_G^j(k) = \frac{d_G^j(k)/k}{\sum_{l=1}^{\infty} d_G^j(l)/l}.

The comparison D^j(G, H) of two graphs G and H with respect to orbit j is simply the Euclidean distance between the two scaled and normalised vectors N, which is rescaled by 1/\sqrt{2} to lie between 0 and 1, as pointed out in Przulj (2010); the resulting expression is

D^j(G, H) = \frac{1}{\sqrt{2}} \left( \sum_{k=1}^{\infty} \left[ N_G^j(k) - N_H^j(k) \right]^2 \right)^{1/2}.
This is then turned into an agreement by subtracting from 1, and the agreements are combined into a single value by taking the arithmetic mean over all j, yielding the GDDA,

\mathrm{GDDA} = \frac{1}{73} \sum_{j=0}^{72} \left( 1 - D^j(G, H) \right).
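Written out directly, the GDDA calculation is short. In the sketch below the 73 orbit degree distributions are assumed to be supplied (in practice they come from a graphlet-counting tool such as GraphCrunch), and the toy input format is an assumption made for illustration.

```python
import math

def scaled_orbit_distribution(d):
    """Scale an orbit degree distribution d[k] by 1/k and normalise it to sum to 1."""
    scaled = {k: v / k for k, v in d.items() if k >= 1}
    total = sum(scaled.values())
    return {k: v / total for k, v in scaled.items()} if total > 0 else {}

def orbit_distance(dG, dH):
    """D^j(G, H): Euclidean distance between the scaled distributions, rescaled by 1/sqrt(2)."""
    NG, NH = scaled_orbit_distribution(dG), scaled_orbit_distribution(dH)
    ks = set(NG) | set(NH)
    return (1 / math.sqrt(2)) * math.sqrt(
        sum((NG.get(k, 0.0) - NH.get(k, 0.0)) ** 2 for k in ks))

def gdda(orbits_G, orbits_H):
    """Arithmetic mean of the agreements 1 - D^j over the 73 orbits j = 0..72."""
    return sum(1 - orbit_distance(orbits_G.get(j, {}), orbits_H.get(j, {}))
               for j in range(73)) / 73

# toy input: orbit j -> {orbit degree k: number of nodes with that orbit degree}
orbits_G = {j: {1: 50, 2: 20, 3: 5} for j in range(73)}
orbits_H = {j: {1: 45, 2: 25, 3: 8} for j in range(73)}
print(gdda(orbits_G, orbits_H))
```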
According to Przulj (2007), the score expected for a perfect fit can be assessed by comparing networks of the same random model type; for example, generate a network from an ER model with a given number of edges and nodes, and compare it to other randomly generated ER graphs with the same parameters. Przulj (2007) found the mean GDDA of comparing ER versus ER, ER-DD versus ER-DD or GEO3D versus GEO3D to be 0.84 ± 0.07, where 0.07 denotes one standard error. This was updated in Przulj (2010), where the highest score for two GEO3D networks was found to be 0.95 ± 0.002. However, Rito et al. (2010) found that GDDA values show not only striking differences amongst different model types but also a pronounced dependency on the number of vertices of the network. For a specific graph, drawn from one model type and with a fixed number of vertices, they also observed a strong dependency of the GDDA score on graph density when comparing with graphs of the same type and with the same number of vertices (Figure 10.6). Furthermore, these dependencies are not monotone. They propose a new protocol where several same model versus model comparisons with roughly the same number of vertices and edges should be carried out in order to assess the best obtainable score for this specific case. GDDA should then be calculated between the query network and graphs generated from the model. Model fit can be evaluated by gauging the differences between the distributions of agreement scores resulting from query network versus model and model versus model comparisons, for example by using a Monte Carlo test.

10.4.2 Network alignment
Research in cross-species network comparison, or network alignment, has been spurred on by the introduction of the interolog concept. An interolog is a conserved interaction between a pair of proteins which have interacting orthologs in another organism, where orthologs are proteins descended from a common ancestor. The evidence for the existence of such protein interactions that are conserved across species is increasing. Proteins in the same pathway have been found to be present or absent in a genome as a group (Pellegrini et al. 1999; Kelley et al. 2003), and many protein interactions in the yeast network have also been identified for the corresponding protein orthologs in Caenorhabditis elegans (worm); see Matthews et al. (2001). These discoveries have led to research directed at identifying conserved complexes and functional modules through network alignment, analogous to traditional sequence alignment (Dandekar et al. 1999; Ogata et al. 2000; Kelley et al. 2004). Given two or more networks, the aim of network alignment algorithms is to identify sets of interactions that are conserved across the networks. This alignment is achieved by first identifying a mapping between the nodes of two or more networks based on some biological information, usually sequence similarity. This step is followed by the actual alignment process, incorporating concepts from graph matching where the goal is to maximise the overlap in the interaction patterns of mappable nodes (Figure 10.7). The premise is that patterns of interactions which are conserved across species have biological significance and hence are more likely to correspond to real protein complexes or functional modules. The large and ever-increasing size of interaction datasets (typically >5000 nodes and >25 000 edges), combined with the fact that graph matching is an NP-hard problem, makes network alignment computationally challenging. One of the earliest network alignment algorithms, NetworkBlast (Sharan and Ideker 2006), carries out alignment by first defining an alignment graph where each node represents a set of orthologous proteins based on sequence similarity. The edges in the alignment graph represent conserved interactions. A search is then usually carried out over this alignment graph for high-scoring subgraphs. NetworkBlast has been used to
Figure 10.6 Dependency of GDDA for model versus model comparisons on the number of vertices and edges of a network. GDDA of ER versus ER (a) and GEO3D versus GEO3D (b) graphs with 500, 1000 and 2000 vertices are plotted against graph density (Rito et al. 2010). The arrows indicate thresholds for the appearance of the subgraphs G1 up to G29 for ER, and for the appearance of triangles in GEO3D
perform three-way comparisons of yeast, worm and fly, which yielded conserved modules displaying good overlap with MIPS (Mewes et al. 2004) complexes. Graemlin (Flannick et al. 2006) uses progressive pairwise alignments to compare multiple networks. Graemlin's probabilistic formulation of the topology-matching problem eliminates restrictions on the possible architecture of conserved modules such as those imposed by NetworkBlast. However, it requires parameter learning through a training set of known alignments. The sensitivity of the method was assessed by counting the number of KEGG (Kyoto Encyclopaedia of Genes and Genomes; Kanehisa and Goto 2000) pathways in
Figure 10.7 Pairwise network alignment of two graphs G and G′. Dotted red lines indicate homology
the alignments. The KEGG coverage of the alignment results was between 21 and 39%. In terms of speed, it far outperforms NetworkBlast, with a running time approximately linear in the number of networks. Other alignment algorithms have tried to take into account the evolutionary forces shaping the interaction networks. For example, MaWISH (Koyuturk et al. 2006) implements a duplication divergence model to carry out pairwise network alignment. More recently an evolution-based multiple network alignment algorithm, CAPPI (Dutkowski and Tiuryn 2007), was developed which tries to reconstruct the ancestral network for the input species and maps it back onto the extant networks to identify common modules. Graemlin 2.0 (Flannick et al. 2008) is also a multiple network aligner, with a scoring function that can use evolutionary events. Some network alignment methods proposed recently include IsoRank (Singh et al. 2008) and IsoRankN (Liao et al. 2009), GNA and PATH (Zaslavskiy et al. 2009) and DOMAIN (Guo and Hartemink 2009). DOMAIN is the first algorithm to introduce protein domains into the network alignment problem and uses a novel direct-edge-alignment paradigm to directly detect equivalent interaction pairs across species. It should be noted that global network alignment methods such as IsoRank and GNA do not directly address the conserved module detection problem, and focus on finding the best node-to-node match across the entire networks.

10.4.3 Using functional annotation for network alignment
A common theme of the studies described above for protein interaction network alignment is the use of protein sequence similarity to map orthologous proteins across different species. However, this does not necessarily provide a complete picture of orthologous relationships in the context of interaction networks. When aligning networks from species that are very distant in evolutionary terms, the proteins may not display enough sequence similarity to achieve a reasonable degree of mapping. This would result in a severely restricted alignment graph that may miss biologically conserved regions in the networks. Ali and Deane (2009) explored the possibility of using a different measure of protein similarity. Since the goal of alignment is to extract modules that correspond to specific biological processes, functional similarity of proteins across networks was employed to aid alignment. To be useful for network alignment, a subjective concept like functional similarity must be expressed in a quantitative form that reflects the closeness in the biological functions of the proteins being compared. Functional annotation of proteins is an ongoing scientific activity and one of the most widely used resources is Gene Ontology or GO (Ashburner et al. 2000). GO offers substantial coverage of major protein databases and provides a species-independent, structured set of terms describing gene products. A simple measure of functional
similarity was used by Ali and Deane (2009), which is based on the most specific and hence most informative GO annotation of each protein. For simplicity they focused only on the Biological Process category of GO, the method being identical for the other top categories of Molecular Function and Cellular Component. Let there be a total of N proteins in the dataset under consideration and let the GO functional annotation of each protein A be defined as a set of terms S_A. Define a multiset of size n as a pair (S, σ) where σ : S → ℕ, with the conditions

S = \bigcup_{A \in N} S_A, \qquad \sum_{y \in S} \sigma(y) = n.
Here, σ is a function that maps a GO term to the number of times it occurs in the dataset. Terms having fewer proteins annotated to them occur less frequently in the dataset and are thus classified as more specific. For any two proteins A and B with annotation sets S_A and S_B, the functional similarity score (funsim) was then calculated as follows:

\mathrm{funsim}(A, B) = \max_{t \in S_A \cap S_B} \left( 1 - \frac{\sigma(t)}{n} \right).

The above scoring scheme assigns higher functional similarity to protein pairs that share more specific GO annotations. It should be noted that other, more sophisticated scoring schemes for functional similarity based on GO are possible. Several measures of functional similarity have been proposed in recent years making use of the information content of GO terms as well as the semantics (is a, part of) of the GO relationships (Resnik 1995; Schlicker et al. 2006). In the study by Ali and Deane (2009), function-based alignment using the above similarity score was successful in uncovering a larger number of proteins participating in conserved interactions than sequence-based alignment. The human network used for the study contained 9305 proteins and 35 458 interactions while the yeast network contained 4941 proteins and 17 387 interactions. As shown in Table 10.2, the number of conserved interactions discovered in the human network (aligned to yeast) increased from 612 (between 457 unique proteins) to 1034 (between 727 unique proteins). Moreover, the two sets share only 58 proteins (<15%), indicating that the interactions targeted by the two methods are nearly disjoint. Sets of conserved interactions detected using function based alignment also displayed higher overlap with experimentally detected complexes in the MIPS database. As can be seen, despite the development of increasingly sophisticated network alignment methods, conserved interactions detected between model organisms constitute only a tiny fraction of existing datasets. Taking into account that some of the detected conserved interactions will be due purely to chance, such low evolutionary conservation at the interactome level is quite surprising and warrants more attention.
Table 10.2 Comparison of sequence and function based alignment of the yeast and human PPI networks

Alignment method    No. interactions    No. proteins    MIPS coverage    MIPS accuracy    Functional coherence
Seq. based          612                 457             96               0.18             0.36
Func. based         1034                727             126              0.24             0.51
MaWISH              596                 543             83               0.1              0.32
Results are also compared with sequence based alignment using the MaWISH algorithm.
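A minimal implementation of the funsim score defined above might look as follows; the toy annotation sets are invented for illustration, and the convention that proteins sharing no terms score 0 is an assumption of this sketch.

```python
from collections import Counter

def make_funsim(annotations):
    """Build sigma (term occurrence counts over the dataset) and return a pairwise scorer.

    annotations: dict protein -> set of (most specific) GO terms.
    """
    sigma = Counter(t for terms in annotations.values() for t in terms)
    n = sum(sigma.values())                # total number of term occurrences in the dataset

    def funsim(A, B):
        shared = annotations[A] & annotations[B]
        if not shared:
            return 0.0                     # assumption: no shared terms gives score 0
        return max(1 - sigma[t] / n for t in shared)   # rarer shared terms score higher

    return funsim

# toy annotations; in practice these would be GO Biological Process terms per protein
annotations = {
    "P1": {"GO:0006412"}, "P2": {"GO:0006412", "GO:0006457"},
    "P3": {"GO:0006457"}, "P4": {"GO:0008152"},
}
funsim = make_funsim(annotations)
print(funsim("P1", "P2"), funsim("P1", "P4"))
```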
10.5 Evolution and the protein interaction network
Intimately related to the problem of cross-species network alignment is the study of the likely factors underlying network evolution. Many models have been proposed for the growth of the protein interaction network, some of which (namely DD and geometric evolution) are discussed in Section 10.3.4. In terms of underlying mechanisms, evolution at the network level is thought to be a consequence of protein evolution. Errors in replication can result in a change in copy number of proteins, from individual genes being duplicated or lost (Zhang 2003), to the whole genome being duplicated (Kasahara 2007; Scannell et al. 2007). After a gene duplication event, divergence of function is possible. There are two main competing models for such divergence: subfunctionalisation (partitioning of ancestral function between gene duplicates) and neofunctionalisation (the de novo acquisition of function by one duplicate). Whichever model is chosen, this functional divergence at the protein level can manifest itself in the form of diverging interaction patterns at the network level. While evolution in PPI networks opens up several research questions, including the co-evolution of interacting proteins and constraints imposed by interactions on protein evolution (Lewis et al. 2010b), here we focus on one issue: whether network divergence after speciation can explain the low conservation observed in species-level network alignments.

10.5.1 How evolutionary models affect network alignment
Alignments of protein interaction networks have found little conservation among several species (Ali and Deane 2009). This is in sharp contrast to the genome level sequence alignments, where even evolutionarily distant organisms share significant portions of their gene repertoire. While this could be a consequence of the incompleteness of interaction datasets and presence of error, an intriguing prospect is that the process of network evolution is sufficient to erase any evidence of conservation. Ali and Deane (2010) tested this hypothesis using models of network evolution and also investigated the role of error in the results of network alignment. Using the DD and geometric evolution models, pairs of networks were grown from a single ancestor to the size of current interaction datasets for model species. Pairwise network alignment was then carried out using several different network alignment algorithms, and a distance metric based on summary statistics was used to assess the fit between experimental and simulated network alignments (Figure 10.8). Results indicated that network evolution alone is unlikely to account for the poor-quality alignments given by real data. Alignments of simulated networks undergoing evolution are considerably (four to five times) larger than real alignments. The authors also compared several error models in their ability to explain this discrepancy. For a given error model with a single rate parameter θ, they estimate the posterior density for θ using the following ABC algorithm:

1. Draw (θ_1, θ_2) ∼ Uniform[0, 1]^2.
2. Simulate error in networks (NW1, NW2) using error models M(θ_1), M(θ_2).
3. Align (NW1, NW2) and compute the summary vector S from the alignment.
4. Calculate the distance d(S, D), where D is the summary vector for the real alignment.
5. Accept θ if d(S, D) ≤ δ.
In the algorithm d is a scaled Euclidean distance between two summary vectors defined as
d(\mathbf{x}, \mathbf{y}) = \sqrt{ \sum_{i=1}^{N} \frac{(x_i - y_i)^2}{\sigma_i^2} }.
Figure 10.8 Measuring the effect of error and evolution on network alignment (Ali and Deane 2010)
Using this set-up, posterior estimates of false negative rates for the yeast and human protein networks vary from 20 to 60%, depending on whether current incomplete proteome sampling is taken into account or not (Figure 10.9). It was also found that false positives appear to affect network alignments little compared with false negatives, indicating that incompleteness, not spurious links, is the major challenge for interactome-level comparisons.
10.6 Community detection in PPI networks
In parallel with network comparisons and model fitting, another area with obvious potential benefits is the automatic detection of functional modules and complexes in PPI data. Discovering these structures can not only help us better understand the hierarchical organisation of cellular systems but may also assist in better targeting/design of drugs. Thus a very active area of research in the computational analysis of PPI networks is related to graph clustering approaches. Within the broader networks literature, a great many algorithms have been proposed that locate dense regions in a network, called communities or clusters [reviewed in Porter et al. (2009) and Fortunato (2010)]. A community is loosely defined as a group of nodes that are more closely associated with one another than with the rest of the network. In the context of biological networks, such communities are potentially good candidates for functional modules, and many studies report running one of the myriad algorithms for detecting community
Figure 10.9 Posterior density for error parameters in Homo sapiens (HN) and Saccharomyces cerevisiae (YN) PPI networks (Ali and Deane 2010)
structure on PPI networks (Bu et al. 2003; Pereira-Leal et al. 2004; Dunn et al. 2005; Luo et al. 2007; Li et al. 2008). Having located communities, such studies then attempt to assess their functional homogeneity by searching for terms in a structured vocabulary – usually GO or MIPS – that are significantly over-represented within communities. If such terms exist, the identified communities are said to be ‘enriched’ for biological function. In many studies such enriched communities are found, and hence are plausible candidates for biological modules.

10.6.1 Community detection methods
A much used approach is the algorithm by Newman and Girvan (2004). It involves simply calculating the betweenness of all edges in the network and removing the one with highest betweenness, and repeating this process until no edges remain. If two or more edges tie for highest betweenness then one can either choose one at random to remove, or simultaneously remove all of them. As a guide to how many communities a network should be split into, they use the modularity. For a division with g groups, define a g × g matrix e whose component e_{ij} is the fraction of edges in the original network that connect nodes in group i to those in group j. Then the modularity is defined to be

Q = \sum_{i} e_{i,i} - \sum_{i,j,k} e_{i,j} e_{k,i},
that is, the fraction of all edges that lie within communities minus the expected value of the same quantity in a graph where the nodes have the same degrees but edges are placed at random. A value of Q = 0 indicates that the community structure is no stronger than would be expected under random shuffling. The Potts method (Reichardt and Bornholdt 2006) partitions the proteins into communities at many different values of a resolution parameter, thus finding communities at different scales within the network. The method
seeks a partition of nodes into communities that minimises a quality function (‘energy’):

H = - \sum_{ij} J_{ij}(\lambda)\, \delta(s_i, s_j),
where s_i is the community of node i, δ is the Kronecker delta, λ is the resolution parameter, and the interaction matrix J_{ij}(λ) gives an indication of how much more connected two nodes are than one would expect at random (i.e. in comparison with some null hypothesis). The energy H is thus given by a sum of elements of J for which the two nodes are in the same community. Other methods include the Markov clustering algorithm or MCL (van Dongen 2000), which simulates random walks on networks to isolate dense regions, and MCODE (Bader and Hogue 2003), which uses high density k-cores to search for potential complexes. Here we have only mentioned community detection techniques based purely upon network structure, though there are several studies in the literature where network data are supplemented by additional biological information such as gene expression (Segal et al. 2003) and phenotypic sensitivity (Tanay et al. 2004) to achieve potentially more meaningful graph partitions.
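For a given partition, the modularity Q defined above can be computed directly from the matrix e, as in the sketch below; the karate club network and its two known factions are used only as a convenient test case.

```python
from collections import defaultdict
import networkx as nx

def modularity(G, communities):
    """Newman-Girvan modularity Q = sum_i e_ii - sum_i (sum_j e_ij)^2,
    with the partition given as a dict node -> community label."""
    m = G.number_of_edges()
    # each edge contributes 1/(2m) to e[(ci, cj)] and 1/(2m) to e[(cj, ci)]
    e = defaultdict(float)
    for u, v in G.edges():
        ci, cj = communities[u], communities[v]
        e[(ci, cj)] += 1 / (2 * m)
        e[(cj, ci)] += 1 / (2 * m)
    groups = set(communities.values())
    a = {i: sum(e[(i, j)] for j in groups) for i in groups}
    return sum(e[(i, i)] for i in groups) - sum(a[i] ** 2 for i in groups)

G = nx.karate_club_graph()
partition = {v: G.nodes[v]["club"] for v in G}   # the two known factions as communities
print(modularity(G, partition))
```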
10.6.2 Evaluation of results
Community detection algorithms are generally evaluated in terms of some quality function defined over the output clusters. A commonly used measure of ‘goodness’ is functional homogeneity, measuring how similar the proteins in a cluster are in terms of their biological function. Given a measure of functional similarity between pairs of proteins, one way to express the homogeneity of a cluster is

H(C) = \frac{\sum_{i,j \in C} \mathrm{Similarity}(i, j)}{|C|(|C| - 1)};
where H(C) represents the homogeneity of a cluster C by the average pairwise protein similarity within C. Modules can also be statistically evaluated using the p-value from the hypergeometric distribution, which is defined as

p = 1 - \sum_{i=0}^{k-1} \frac{\binom{|X|}{i} \binom{|N| - |X|}{n - i}}{\binom{|N|}{n}},
where |N| is the total number of proteins, |X| is the number of proteins in a reference function, n is the number of proteins in an identified module, and k is the number of proteins in common between the function and the module. This is the probability that at least k proteins in a module of size n are included in a reference function of size |X| assuming that all proteins are investigated independently and have the same probability to be included in |X|. A low p-value indicates evidence for the hypothesis that the module corresponds to the function. Lewis et al. (2010a) used the Potts method to study the biological relevance of, and the relationship between, communities detected in the yeast protein network at different resolutions. They found that the large communities present at small values of the resolution parameter λ are not judged to be functionally homogeneous. As λ is increased, larger numbers of proteins occur in functionally homogeneous communities, peaking in the range 1.5 < log(λ) < 2. At log(λ) = 1.5, the mean community size is 73 proteins, and the majority of proteins (3071 out of 4980) are in functionally homogeneous communities (Figure 10.10).
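The hypergeometric p-value above is available directly from scipy, as the following sketch shows; the numbers plugged in at the end are invented for illustration.

```python
from scipy.stats import hypergeom

def enrichment_pvalue(total_proteins, function_size, module_size, overlap):
    """P(at least `overlap` of the `module_size` proteins fall in a reference
    function of size `function_size`), out of `total_proteins` in all.
    Equivalent to 1 minus the sum of the hypergeometric pmf from 0 to overlap - 1."""
    return hypergeom.sf(overlap - 1, total_proteins, function_size, module_size)

# illustrative numbers: 4980 proteins, a reference function with 150 members,
# a detected module of 40 proteins, 12 of which carry the function
print(enrichment_pvalue(4980, 150, 40, 12))
```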
Figure 10.10 Communities identified in the yeast protein interaction network (Lewis et al. 2010b). When the resolution parameter λ is very small, all nodes are assigned to the same community. As λ is increased (viewing the network at progressively closer distances), more structure is revealed. The figures on the left-hand side show visualisations of the networks’ partition into communities at four different values of λ. Each circle represents a community, with size proportional to the number of proteins in that community, positioned at the mean position of its constituent nodes. The shade of the connecting lines is proportional to the number of links between two communities
10.7 Predicting function using PPI networks
Given the relative sparsity of protein functional annotations for even well-studied organisms, predicting protein function using PPI networks has received a lot of attention in recent years. Protein functions may be predicted on the basis of module detection algorithms. If an unknown protein is included in a functional module, it is expected to contribute toward the function that the module represents. Several topology-based approaches that predict protein function on the basis of PPI networks have also been introduced. The simplest method (majority vote) assigns the function of an unknown protein as the function of the majority of its neighbours (Schwikowski et al. 2000). The assumption that the function of a protein is independent of all other proteins, given the functions of its immediate neighbours, has led to several methods using Markov random fields for function prediction (Deng et al. 2003; Letovsky and Kasif 2003). The number of common neighbours of the known protein and of the unknown protein has also been taken as the basis for the prediction of function (Lin et al. 2006). Here we describe the method of Chen et al. (2007) to consider how such prediction of characteristics can be achieved using network structure. As characteristics they considered structure (7 categories) and function (24 categories). From the PPI network, they built an upcast set of category–category interactions. A category–category interaction is constructed by two characteristic categories from two interacting proteins. Denote the set of all categories the characteristic can assume by S. For a protein x, S(x) is the set of categories that protein x is classified into. If two proteins x and y interact, the category–category interactions they generate are the edges between any two characteristic categories, a and b (a ∈ S(x), b ∈ S(y)), from each of the two proteins (denoted by a ∼ b). The upcast set of category–category interactions is a collection of all category–category interactions extracted from the protein–protein interaction network (Figure 10.11).
Figure 10.11 Upcast sets of characteristic pairs. In this example, we consider only a single characteristic (e.g. protein function), so that the characteristic vector for a protein is a 1-vector. There are three single-category proteins and one two-category protein in the protein interaction network (left), which result in an upcast set of six characteristic pairs (Chen et al. 2007)
Let f(a ∼ b) denote the relative frequency of the category–category interaction {a ∼ b} among all category–category interactions. The score F(a, x) for the query protein x with annotated neighbours B(x), to be in a specific category a, is proportional to the product C(a, x) of the relative frequencies f of observing category a for all category–category interactions of x’s neighbours in the prior database;

C(a, x) = \prod_{n \in B(x)} \prod_{b \in S(n)} f(a \sim b),

and is defined by

F(a, x) := \frac{C(a, x)}{\sum_{k \in S} C(k, x)}.
The protein x is then predicted to possess the characteristic category, or categories, with the highest score. This frequency method can be extended to include two or more protein characteristics in the prediction of a specific protein characteristic; it is then called the enhanced frequency method (see Table 10.3 for results).
Table 10.3 The accuracies of function prediction using Majority Vote (MV), Chen’s method (F) and Chen’s enhanced method (EF). The enhanced method combines structure and function into a category vector for category–category interactions. The accuracy is calculated as the ratio between the number of correctly predicted proteins and all predicted proteins

Organism (DIP)      Predicted proteins    MV      F       EF
D. melanogaster     1275                  0.53    0.67    0.69
C. elegans          85                    0.38    0.55    0.71
S. cerevisiae       1618                  0.67    0.61    0.67
E. coli             154                   0.69    0.69    0.70
M. musculus         32                    0.59    0.88    0.81
H. sapiens          274                   0.79    0.90    0.89
D. melanogaster, Drosophila melanogaster; E. coli, Escherichia coli; M. musculus, Mus musculus. A function prediction is counted as correct if one of the best three predicted categories is correct. Values in italic show where the result outperforms MV with statistical significance, based on a Z-test.
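A bare-bones version of this frequency method is sketched below. The toy network, the category sets and the pseudo-frequency used for unseen category pairs are all assumptions made for this illustration and do not reproduce the results in Table 10.3.

```python
from collections import Counter
import networkx as nx

def category_pair_frequencies(G, cats):
    """Relative frequency f(a ~ b) of each category-category pair in the upcast set."""
    counts = Counter()
    for x, y in G.edges():
        for a in cats.get(x, ()):
            for b in cats.get(y, ()):
                counts[frozenset((a, b))] += 1
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}

def predict_category(G, cats, f, query, all_categories):
    """Normalised score F(a, x) for every category a, using the annotated neighbours of x."""
    scores = {}
    for a in all_categories:
        c = 1.0
        for n in G.neighbors(query):
            for b in cats.get(n, ()):                      # product over neighbour categories
                c *= f.get(frozenset((a, b)), 1e-6)        # pseudo-frequency for unseen pairs
        scores[a] = c
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}

G = nx.Graph([("x", "p"), ("x", "q"), ("p", "q"), ("q", "r")])
cats = {"p": {"kinase"}, "q": {"kinase", "transport"}, "r": {"transport"}}
f = category_pair_frequencies(G.subgraph(["p", "q", "r"]), cats)
scores = predict_category(G, cats, f, "x", {"kinase", "transport"})
print(max(scores, key=scores.get), scores)
```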
Figure 10.12 Structure prediction using different priors (Chen et al. 2007)
One can use the above model to see how well one network would predict characteristics in another network. The model was implemented to use pattern frequencies from pooled interactions in other organisms as well. Protein interactions were grouped into prokaryotes, including E. coli and H. pylori, and eukaryotes, including C. elegans, S. cerevisiae, D. melanogaster, M. musculus and H. sapiens, and a final global pooled dataset including all interactions. As shown in Figure 10.12, in the prediction of structure, higher accuracy was gained when predicting eukaryotes using a prior database from eukaryotes only. On the other hand, the use of pooled interactions does not significantly improve predictions. When predicting function, the predictions deteriorated significantly when using prokaryotes as prior data. The prediction results thus clearly differentiated between using prior databases of eukaryotes and prokaryotes. These results indicate that the type of prior data may be more important than the data quantity and that there might be network similarity within kingdoms not observed across kingdoms. We conclude that while there are network similarities within kingdoms, interaction networks in prokaryotes and eukaryotes may be different.
10.8 Predicting interactions using PPI networks
As discussed earlier, current PPI networks are highly incomplete even for model organisms. Existing PPI networks from experimental datasets can be useful resources on which to base the prediction of new interactions or the identification of reliable interactions. Deng et al. (2002) used evolutionarily conserved domains defined in the Pfam database (Finn et al. 2010) and applied a maximum likelihood estimation method to infer interacting domains that are consistent with observed protein interactions. They estimated the probabilities of interactions between every pair of domains and measured the accuracies of their predictions at the protein level. Using the inferred domain–domain interactions, they predict interactions between proteins. Liu et al. (2005) extended this approach by integrating large-scale PPI data from three organisms to estimate the probabilities of domain– domain interactions. They found that the integrated analysis provides more reliable inference of protein interactions than the analysis from a single organism. Jonsson et al. (2006) predicted interactions by integrating experimental PPI data from many species and translating it into the reference frame of the rat. The putative rat
protein interactions were given confidence scores based on their homology to proteins that were experimentally observed to interact. Continuing from the previous section, we describe a method by Chen et al. (2008), who used ideas from logistic regression to develop a score to predict and to validate protein interactions, based on protein characteristics and the PPI network.

10.8.1 Tendency to form triangles
The score is based on the following observation from exploratory data analysis. Let a, b, c ∈ S be three categories, and let N be the set of proteins in the protein interaction network. Assume that all of a, b and c are indeed observed in the proteins, so that, using the notation from Section 10.7,

\sum_{x,y,z \in N} 1(a \in S(x),\, b \in S(y),\, c \in S(z)) > 0,

and

\sum_{x,y \in N} 1(a \in S(x),\, b \in S(y)) > 0.
Here 1(A) denotes the indicator function; it takes on the value 1 if A is satisfied, and is 0 otherwise. For each type of category–category pair {a, c} with a fixed category b, the ratio of probabilities r_abc = P(a ∼ c | a ∼ b ∼ c)/P(a ∼ c) is estimated by

\hat{r}_{abc} = \frac{\hat{P}(a \sim c \mid a \sim b \sim c)}{\hat{P}(a \sim c)} = \frac{\hat{P}(a \sim c,\; a \sim b \sim c)}{\hat{P}(a \sim c)\left[\hat{P}(a \sim b \sim c \sim a) + \hat{P}(a \sim b \sim c \nsim a)\right]},
where P̂(a ∼ c) is the proportion of pairs of proteins x, y in N, with characteristics such that a ∈ S(x), c ∈ S(y), which interact, relative to all pairs of proteins with such characteristics. Similarly, P̂(a ∼ b ∼ c ∼ a) is the proportion of protein triplets, with given characteristics, which form a triangle, and P̂(a ∼ b ∼ c ≁ a) is the proportion of protein triplets, with given characteristics, which form a line (but not a triangle). For each organism (protein interaction network), r̄ is the average of r̂_abc over all a, b, c ∈ S,
\bar{r} = \frac{\sum_{a,b,c \in S} \hat{r}_{abc}}{\tfrac{1}{2}|S|^2(|S| + 1)}.
If r̄ < 1, the existence of the interacting partner tends to decrease the chance of interaction. If r̄ > 1, the interaction is more likely if two proteins have a common interacting partner. The average ratios of conditional probabilities from different organisms are estimated in Table 10.4. From Table 10.4 we conclude that there is some evidence, albeit weak, for a tendency to form triangles in the upcast networks. This tendency can be exploited for prediction, as shown next.
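For a small network, r̂_abc can be estimated by direct enumeration, as in the sketch below; the random graph and the category assignment are invented, so the output bears no relation to the values reported in Table 10.4.

```python
from itertools import permutations
import networkx as nx

def r_hat(G, cats, a, b, c):
    """Estimate r_abc = P(a~c | a~b~c) / P(a~c) by brute-force enumeration."""
    # P(a~c): fraction of (x, z) pairs with a in S(x) and c in S(z) that interact
    pairs = [(x, z) for x, z in permutations(G.nodes(), 2)
             if a in cats.get(x, ()) and c in cats.get(z, ())]
    p_ac = sum(G.has_edge(x, z) for x, z in pairs) / len(pairs)

    # P(a~c | a~b~c): among paths x~y~z with the right categories,
    # the fraction that are closed into a triangle by the edge x~z
    triangles = lines = 0
    for x, y, z in permutations(G.nodes(), 3):
        if (a in cats.get(x, ()) and b in cats.get(y, ()) and c in cats.get(z, ())
                and G.has_edge(x, y) and G.has_edge(y, z)):
            if G.has_edge(x, z):
                triangles += 1
            else:
                lines += 1
    return (triangles / (triangles + lines)) / p_ac

G = nx.gnp_random_graph(60, 0.1, seed=3)
cats = {v: {("A", "B", "C")[v % 3]} for v in G}    # arbitrary category assignment
print(r_hat(G, cats, "A", "B", "C"))
```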
10.8.2 Using triangles for predicting interactions
Within the triplet interactions, the odds of observing triangles versus lines around the query protein pair are assessed. The triangle rate score tri(x, y) for the protein pair {x, y} is defined as the odds of observing triangles versus lines among the triangles and lines in its neighbourhood.
Table 10.4 Estimates of r̄ from triplets: function

Organisms           No. obs. pairs^a    No. obs. triples^b    r̄       SE
D. melanogaster     110                 534                   48.3     67.6
S. cerevisiae       214                 1850                  55.9     91.36
E. coli             76                  494                   16.1     9.91
H. sapiens          60                  350                   76.8     125.25

^a Number of different pairs {a, c} forming triples {a ∼ b ∼ c}.
^b Total number of different triples {a ∼ b ∼ c}.
^c 5% level of significance based on Z-tests.
∗ Organism showing tendency of formation of triangles (r̄ > 1^c).
This scoring method was compared with the Deng, Liu and Jonsson scores using receiver operating characteristic (ROC) curves (Figure 10.13) and, using a subsampling scheme, the areas under the ROC curves were tested for significant difference. The results show that the triangle rate score outperforms both the domain-based and homology-based scores. The success of this method provides a good argument for representing PPI data as networks, as the triangle information is crucial. In the above two sections, we have presented in detail two related methods for predicting protein characteristics and protein interactions based on network data. It should be noted that there is a vast number of similar as well as very different computational prediction techniques reported in the literature [see Rentzsch and Orengo (2009) and Browne et al. (2010) for recent reviews].
Figure 10.13 ROC curves, 1 − specificity versus sensitivity, for predicting yeast protein interactions using domain-based approaches (Deng’s score and Liu’s score), a homology-based approach (Jonsson’s score plus paralogs) and Chen’s network-based approach (the triangle rate score)
10.9 Current trends and future directions
We conclude this chapter with a brief overview of current research trends in the analysis of protein networks and some areas which we believe will generate significant activity in the near future.

10.9.1 Dynamics
Studies of large-scale biological networks are gradually shifting from the analysis of their organisational principles and guilt-by-association predictions of the function of individual network components towards examining cell dynamics. In such studies, experimentally determined static networks are often used as scaffolds for modelling of dynamical changes in the system. Information about dynamics can be provided, for example, by measurements of gene expression at different time points or in different conditions. Han et al. (2004) examined the extent to which hubs in the yeast interactome are co-expressed with their interaction partners. They defined hubs as proteins with degree of at least 5. Based on the averaged Pearson correlation coefficient (avPCC) of expression over all partners, they concluded that hubs fall into two distinct classes: those with a low avPCC (which they called date hubs) and those with a high avPCC (so-called party hubs). They inferred that these two types of hubs play different roles in the modular organisation of the network: party hubs are thought to coordinate single functions performed by a group of proteins that are all expressed at the same time, whereas date hubs are described as higher-level connectors between groups that perform varying functions and are active at different times or under different conditions. The validity of the date/party hub distinction has since been debated in some recent papers (Batada et al. 2006, 2007; Wilkins and Kummerfeld 2008). Agarwal et al. (2010) used an interaction data set from the Online Predicted Human Interaction Database or OPHID (Brown and Jurisica 2005) and found that the form of the distribution of hub avPCC does not support bimodality and is not robust to methodological changes (Figure 10.14). Expression data can also be utilised to infer causality as well as information flow within cellular networks. A particularly illuminating source of dynamic data comes from knock-out experiments, where a gene is perturbed or removed from a genetic background and the expression levels of all other genes are measured.
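The avPCC statistic at the heart of this debate is straightforward to compute from an interaction network and an expression matrix. The sketch below uses a simulated network and random expression profiles together with the degree-at-least-5 hub definition quoted above, so the output is purely illustrative.

```python
import numpy as np
import networkx as nx

def hub_avpcc(G, expression, min_degree=5):
    """Averaged Pearson correlation of expression between each hub and its partners.

    expression: dict node -> 1D array of expression values across conditions.
    """
    result = {}
    for node in G:
        partners = list(G.neighbors(node))
        if len(partners) < min_degree:     # only hubs, defined here as degree >= 5
            continue
        pccs = [np.corrcoef(expression[node], expression[p])[0, 1] for p in partners]
        result[node] = float(np.mean(pccs))
    return result

rng = np.random.default_rng(0)
G = nx.barabasi_albert_graph(300, 3, seed=0)          # toy network with hubs
expression = {v: rng.normal(size=20) for v in G}      # toy profiles over 20 conditions
avpcc = hub_avpcc(G, expression)
print(len(avpcc), float(np.mean(list(avpcc.values()))))
```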
Figure 10.14 Probability density plots of the distribution of hub avPCC values for human interaction data from OPHID. Gene expression data from GeneAtlas (Su et al. 2004), normalised using (a) MAS5 (Hubbell et al. 2002) and (b) GCRMA (Wu et al. 2004). From Agarwal et al. (2010).
Yeang et al. (2004) developed a probabilistic approach for explaining observed gene expression changes due to a knock-out by inferring molecular cascades of flow through the interaction network. These molecular cascades correspond to paths beginning from the knock-out gene and ending at the gene whose expression has changed. RNA interference (RNAi) screens are a powerful technique for obtaining perturbation data in higher organisms. Here, a known gene from a pathway of interest is chosen as a reporter gene, other genes in the genome are systematically knocked down using RNAi, and the effect on the reporter is measured. As RNA-based loss-of-function screens are increasingly being applied with automated image analysis to detect effects on specific processes or phenotypes (Moffat et al. 2006; Jones et al. 2009), these types of network analyses are likely to be relevant to the study of a broad range of interesting biological questions.

10.9.2 Integration with other networks
Until recently, the large-scale modelling of network dynamics has been focused on individual network types. However, within a cell, all network types are interrelated and the dynamics of any individual network have an impact on the behaviour of other networks. Several recent studies have begun to address the challenge of coupling large-scale dynamical models for different network types to obtain one consistent dynamical network. Such methods have been spearheaded by approaches to combine metabolic and regulatory networks. For example, to obtain a combined model of metabolic and regulatory networks, Covert et al. (2001) used flux-balance analysis to model the metabolic network component while the transcriptional regulatory network was modelled as a Boolean network. The genes in the transcriptional network were assigned Boolean (binary) values indicating whether or not a given gene is being expressed. An iterative procedure was applied to ensure that the combined model satisfies both the metabolic and the regulatory constraints. A subsequent study used mixed integer linear programming (a general optimisation framework for capturing problems with both discrete and continuous variables) to couple such metabolic and regulatory models (Shlomi et al. 2007). Wang and Chen (2010) propose an approach for integrating transcription regulation and protein–protein interactions using dynamic gene-expression data. They start with candidate gene regulatory and signalling networks obtained from genome-scale data. These candidate networks are then pruned and combined, utilising gene-expression data at multiple time points, to obtain an integrated and focused network under a specific condition of interest. This area is in its infancy; integrated networks may shed new light on all the research questions tackled above.

10.9.3 Limitations of models, prediction and alignment methods
While there is an expanding literature on methods for predicting protein characteristics and interactions using network information, there are several shortcomings which could hamper the practical application of such techniques. Many methods for protein function prediction using network neighbours fail to cope with the relatively sparse level of annotation currently available (Freschi 2009). Even given sufficient annotations, the network structure itself can be quite heterogeneous, and too few links in some regions can make prediction either impossible or highly unreliable. For instance, the performance of methods which use a triangle score for interaction prediction (Chen et al. 2008) can be negatively affected in networks or network regions with low density. Similarly, network alignment methods are now sophisticated enough to cope with the computationally hard problem of matching multiple large networks, but such algorithms tend to be purely graph-theory based. Unlike sequence alignment, where rigorous models based on evolutionary theory are present to explain the results, network alignment is very often based on heuristic arguments. This is perhaps partly a consequence of the fact that evolution at the network level is poorly understood at the moment and we have no universally applicable
models for PPI networks. Future research should be able to shed more light on the mechanisms shaping interaction networks through evolutionary timescales and make it possible to incorporate these principles in comparative and predictive studies. A related research question could be the classification of existing as well as more refined network models according to the type of problem they best tackle.

10.9.4 Biases, error and weighting
Apart from techniques employing diverse biological information, some serious issues related to the nature of interaction data itself also need to be resolved. Current experimental techniques, such as yeast two-hybrid and coaffinity purification, sample subsets of the interaction data space (Stumpf et al. 2005). As mentioned earlier, these subsets show very limited overlap. Moreover, interaction data are nonbinary by nature for any multicomponent complex; their conversion to binary pair interactions is nontrivial and relies on processing protocols that may introduce further biases in the final screening output (Wodak et al. 2009; Yu et al. 2009). It is vital that we understand to what extent observed discrepancies between different networks reflect sampling biases of their experimental methods, as opposed to topological features due to biological functionality. While some studies have focused on the possible effect of biases, uncertainty and incompleteness on network inference and comparison (Stumpf and Wiuf 2005; Fernandes et al. 2010), we feel there is a great need for directly modelling these factors as part of predictive studies.

10.9.5 New experimental sources of PPI data
Finally, it is expected that new sources of PPI data will help relieve some of the issues discussed above. Sanderson (2009) pointed out that 20–30% of human genes encode membrane proteins and currently very little protein interaction data exist for these proteins. This is true for both intracellular and extracellular membrane protein interactions. Nonetheless, exciting new experimental techniques are now being developed to probe these so-called dark regions of the interactome. These include the membrane yeast two-hybrid (MYTH) assays (Stagljar et al. 1998), the AVidity-based EXtracellular Interaction Screen (AVEXIS) system (Bushell et al. 2008) and the yeast-adapted version of the DiHydroFolate Reductase (DHFR) Protein fragment Complementation Assay (PCA) (Tarassov et al. 2008). It will be interesting to see how network topology differs amongst the different techniques and whether these new datasets, when combined with existing data, have an influence on the overall network topology. It is important to bear in mind that most of these assays still rely on protein over-expression and, since PPIs are dependent on both relative affinity and protein concentration, validating PPI data is still a major challenge. Different techniques might, however, help to build a system of increasing confidence, by assigning higher scores to interactions verified by several independent techniques.
References

Agarwal S, Deane CM, Porter MA and Jones NS 2010 Revisiting date and party hubs: novel approaches to role assignment in protein interaction networks. PLoS Computational Biology 6(6), e1000817.
Ali W and Deane CM 2009 Functionally guided alignment of protein interaction networks for module detection. Bioinformatics 25(23), 3166–3173.
Ali W and Deane CM 2010 Evolutionary analysis reveals low coverage as the major challenge for protein interaction network alignment. Molecular BioSystems 6, 2296–2304.
Alm E and Arkin AP 2003 Biological networks. Current Opinion in Structural Biology 13(2), 193–202.
Alon U 2006 An Introduction to Systems Biology: Design Principles of Biological Circuits, 1st edn. Chapman and Hall/CRC.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM and Sherlock G 2000 Gene Ontology: tool for the unification of biology. Nature Genetics 25(1), 25–29.
Bachman P and Liu Y 2009 Structure discovery in PPI networks using pattern-based network decomposition. Bioinformatics 25(14), 1814–1821.
Bader G and Hogue C 2003 An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4(1), 2.
Bader GD, Donaldson I, Wolting C, Ouellette BFF, Pawson T and Hogue CWV 2001 BIND: The Biomolecular Interaction Network Database. Nucleic Acids Research 29(1), 242–245.
Barabási AL and Albert R 1999 Emergence of scaling in random networks. Science 286(5439), 509–512.
Barabási AL and Oltvai ZN 2004 Network biology: understanding the cell’s functional organization. Nature Reviews Genetics 5(2), 101–113.
Batada NN, Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hurst LD and Tyers M 2006 Stratus not altocumulus: a new view of the yeast protein interaction network. PLoS Biology 4(10), e317.
Batada NN, Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hurst LD and Tyers M 2007 Still stratus not altocumulus: further evidence against the date/party hub distinction. PLoS Biology 5(6), e154.
Beaumont M 2010 Approximate Bayesian Computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics 41, 379–406.
Berg JM, Tymoczko JL and Stryer L 2006 Biochemistry, 6th edn. W. H. Freeman.
Bollobas B 2001 Random Graphs. Cambridge University Press.
Bowers P, Pellegrini M, Thompson M, Fierro J, Yeates T and Eisenberg D 2004 Prolinks: a database of protein functional linkages derived from coevolution. Genome Biology 5(5), R35.
Bowers PM, O’Connor BD, Cokus SJ, Sprinzak E, Yeates TO and Eisenberg D 2005 Utilizing logical relationships in genomic data to decipher cellular processes. FEBS Journal 272(20), 5110–5118.
Brändén C and Tooze J 1991 Introduction to Protein Structure. Garland Publishing.
Breitkreutz BJ, Stark C and Tyers M 2003 The GRID: the General Repository for Interaction Datasets. Genome Biology 4(3), R23.
Brown KR and Jurisica I 2005 Online Predicted Human Interaction Database. Bioinformatics 21(9), 2076–2082.
Browne F, Zheng H, Wang H and Azuaje F 2010 From experimental approaches to computational techniques: a review on the prediction of protein–protein interactions. Advances in Artificial Intelligence 2010, Article ID 924529, 15 pages. doi:10.1155/2010/924529.
Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N, Li G and Chen R 2003 Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Research 31(9), 2443–2450.
Bushell KM, Söllner C, Schuster-Boeckler B, Bateman A and Wright GJ 2008 Large-scale screening for novel low-affinity extracellular protein interactions. Genome Research 18(4), 622–630.
Chen J, Hsu W, Lee ML and Ng SK 2005 Discovering reliable protein interactions from high-throughput experimental data using network topology. Artificial Intelligence in Medicine 35(1–2), 37–47.
Chen PY, Deane CM and Reinert G 2008 Predicting and validating protein interactions using network structure. PLoS Computational Biology 4(7), e1000118.
Chen PY, Deane CM and Reinert G 2007 A statistical approach using network structure in the prediction of protein characteristics. Bioinformatics 23(17), 2314–2321.
Chien CT, Bartel PL, Sternglanz R and Fields S 1991 The two-hybrid system: a method to identify and clone genes for proteins that interact with a protein of interest. Proceedings of the National Academy of Sciences of the United States of America 88(21), 9578–9582.
Covert MW, Schilling CH and Palsson B 2001 Regulation of gene expression in flux balance models of metabolism. Journal of Theoretical Biology 213(1), 73–88.
Dall J and Christensen M 2002 Random geometric graphs. Physical Review E 66(1), 016121.
Dandekar T, Snel B, Huynen M and Bork P 1998 Conservation of gene order: a fingerprint of proteins that physically interact. Trends in Biochemical Sciences 23(9), 324–328.
Dandekar T, Schuster S, Snel B, Huynen M and Bork P 1999 Pathway alignment: application to the comparative analysis of glycolytic enzymes. Biochemistry Journal 343(1), 115–124.
Deane CM, Salwi´nski, Xenarios I and Eisenberg D 2002 Protein interactions. Molecular & Cellular Proteomics 1(5), 349–356. Deng M, Mehta S, Sun F and Chen T 2002 Inferring domain–domain interactions from protein–protein interactions. Genome Research 12(10), 1540–1548. Deng M, Zhang K, Mehta S, Chen T and Sun F 2003 Prediction of protein function using protein–protein interaction data. Journal of Computational Biology 10(6), 947–960. de Silva E, Thorne T, Ingram P, Agrafioti I, Swire J, Wiuf C and Stumpf M 2006 The effects of incomplete protein interaction data on structural and evolutionary inferences. BMC Biology 4(1), 39. Dorogovtsev SN and Mendes JFF 2003 Evolution of Networks: from Biological Nets to the Internet and WWW. Oxford University Press. Dunn R, Dudbridge F and Sanderson C 2005 The use of edge-betweenness clustering to investigate biological function in protein interaction networks. BMC Bioinformatics 6(1), 39. Dutkowski J and Tiuryn J 2007 Identification of functional modules from conserved ancestral protein–protein interactions. Bioinformatics 23(13), i149–158. Enright A and Ouzounis C 2001 Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions. Genome Biology 2(9), research0034.1–research0034.7. Enright AJ, Iliopoulos I, Kyrpides NC and Ouzounis CA 1999 Protein interaction maps for complete genomes based on gene fusion events. Nature 402(6757), 86–90. Erd˝os P and R´enyi A 1960 On the evolution of random graphs. Publication of the Mathematical Institute of the Hungarian Academy of Sciences 5, 17–61. Evlampiev K and Isambert H 2007 Modeling protein network evolution under genome duplication and domain shuffling. BMC Systems Biology 1(1), 49. Fernandes LP, Annibale A, Kleinjung J, Coolen ACC and Fraternali F 2010 Protein networks reveal detection bias and species consistency when analysed by information-theoretic methods. PLoS ONE 5(8), e12083. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer ELL, Eddy SR and Bateman A 2010 The Pfam protein families database. Nucleic Acids Research 38(Suppl. 1), D211–D222. Flannick J, Novak A, Srinivasan BS, McAdams HH and Batzoglou S 2006 Grmlin: general and robust alignment of multiple large interaction networks. Genome Research 16(9), 1169–1181. Flannick J, Novak AF, Do CB, Srinivasan BS and Batzoglou S 2008 Automatic parameter learning for multiple network alignment. RECOMB 214–231. Fortunato S 2010 Community detection in graphs. Physics Reports 486(3–5), 75–174. Freschi V 2009 A graph-based semi-supervised algorithm for protein function prediction from interaction maps. In Learning and Intelligent Optimization (ed. Sttzle T), vol. 5851 of Lecture Notes in Computer Science. Springer, pp. 249–258. Friedgut E and Kalai G 1996 Every monotone graph property has a sharp threshold. Proceedings of the American Mathematical Society 124(10), 2993–3002. Goel A, Rai S and Krishnamachari B 2005 Monotone properties of random geometric graphs have sharp thresholds. Annals of Applied Probability 15, 2535–2552. Guo X and Hartemink AJ 2009 Domain-oriented edge-based alignment of protein interaction networks. Bioinformatics 25(12), i240–1246. Han JDD, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, Dupuy D, Walhout AJ, Cusick ME, Roth FP and Vidal M 2004 Evidence for dynamically organized modularity in the yeast protein–protein interaction network. Nature 430(6995), 88–93. 
Hart GT, Ramani A and Marcotte E 2006 How complete are current yeast and human protein-interaction networks? Genome Biology 7(11), 120. Hartwell LH, Hopfield JJ, Leibler S and Murray AW 1999 From molecular to modular cell biology. Nature 402(6761 Suppl.), C47–C52. Huang H, Jedynak BM and Bader JS 2007 Where have all the interactions gone? Estimating the coverage of two-hybrid protein interaction maps. PLoS Computational Biology 3(11), e214. Hubbell E, Liu WM and Mei R 2002 Robust estimators for expression analysis. Bioinformatics 18(12), 1585–1592.
Protein Interaction Networks 231 Huthmacher C, Gille C and Holzhtter HG 2008 A computational analysis of protein interactions in metabolic networks reveals novel enzyme pairs potentially involved in metabolic channeling. Journal of Theoretical Biology 252(3), 456–464. Ispolatov I, Krapivsky PL and Yuryev A 2005a Duplication–divergence model of protein interaction network. Physical Review E 71(6), 061911. Ispolatov I, Krapivsky PL, Mazo I and Yuryev A 2005b Cliques and duplication–divergence network growth. New Journal of Physics 7(1), 145. Jeong H, Mason SP, Barab´asi AL and Oltvai ZN 2001 Lethality and centrality in protein networks. Nature 411(6833), 41–42. Jones TR, Carpenter AE, Lamprecht MR, Moffat J, Silver SJ, Grenier JK, Castoreno AB, Eggert US, Root DE, Golland P and Sabatini DM 2009 Scoring diverse cellular morphologies in image-based screens with iterative feedback and machine learning. Proceedings of the National Academy of Sciences of the United States of America 106(6), 1826–1831. Jonsson P, Cavanna T, Zicha D and Bates P 2006 Cluster analysis of networks generated through homology: automatic identification of important protein communities involved in cancer metastasis. BMC Bioinformatics 7(1), 2. Kanehisa M and Goto S 2000 KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28(1), 27–30. Kasahara M 2007 The 2r hypothesis: an update. Current Opinion in Immunology 19(5), 547–552. Kelley BP, Sharan R, Karp RM, Sittler T, Root DE, Stockwell BR and Ideker T 2003 Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proceedings of the National Academy of Sciences of the United States of America 100(20), 11394–11399. Kelley BP, Yuan B, Lewitter F, Sharan R, Stockwell BR and Ideker T 2004 PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Research 32(Suppl. 2), W83–W88. Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, Balakrishnan L, Marimuthu A, Banerjee S, Somanathan DS, Sebastian A, Rani S, Ray S, Harrys Kishore CJ, Kanth S, Ahmed M, Kashyap MK, Mohmood R, Ramachandra YL, Krishna V, Rahiman BA, Mohan S, Ranganathan P, Ramabadran S, Chaerkady R and Pandey A 2009 Human Protein Reference Databas–2009 update. Nucleic Acids Research 37(Suppl. 1), D767–D772. Kim WK and Marcotte EM 2008 Age-dependent evolution of the yeast protein interaction network suggests a limited role of gene duplication and divergence. PLoS Computational Biology 4(11), e1000232. Koyuturk M, Kim Y, Topkara U, Subramaniam S, Szpankowski W and Grama A 2006 Pairwise alignment of protein interaction networks. Journal of Computational Biology 13(2), 182–199. Letovsky S and Kasif S 2003 Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics 19(Suppl. 1), i197–i204. Lewis A, Jones N, Porter M and Deane CM 2010a The function of communities in protein interaction networks at multiple scales. BMC Systems Biology 4(1), 100. Lewis ACF, Saeed R and Deane CM 2010b Predicting protein–protein interactions in the context of protein evolution. Molecular BioSystems 6, 55–64. Li M, Wang J and Chen J 2008 A graph-theoretic method for mining overlapping functional modules in protein interaction networks. In Bioinformatics Research and Applications (eds. Mandoiu I, Sunderraman R and Zelikovsky A), vol. 4983 of Lecture Notes in Computer Science. Springer, pp. 208–219. 
Liao CS, Lu K, Baym M, Singh R and Berger B 2009 IsoRankN: spectral methods for global alignment of multiple protein networks. Bioinformatics 25(12), i253–258. Lima-Mendez G and van Helden J 2009 The powerful law of the power law and other myths in network biology. Molecular BioSystems 5, 1482–1493. Lin C, Jiang D and Zhang A 2006 Prediction of protein function using common-neighbors in protein-protein interaction networks. IEEE 6th Symposium on BioInformatics and BioEngineering . IEEE Computer Society, pp. 251–260. Lin K and Reinert G 2011 Joint vertex degrees in an inhomogeneous random graph model. ArXiv e-prints. Liu Y, Liu N and Zhao H 2005 Inferring protein–protein interactions through high-throughput interaction data from diverse organisms. Bioinformatics 21(15), 3279–3285. Luciano, Rodrigues FA, Travieso G and Boas VPR 2007 Characterization of complex networks: a survey of measurements. Advances in Physics 56(1), 167–242.
Luo F, Yang Y, Chen CF, Chang R, Zhou J and Scheuermann RH 2007 Modular organization of protein interaction networks. Bioinformatics 23(2), 207–214. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO and Eisenberg D 1999 Detecting protein function and protein– protein interactions from genome sequences. Science 285(5428), 751–753. Marcotte EM, Xenarios I, van der Bliek AM and Eisenberg D 2000 Localizing proteins in the cell from their phylogenetic profiles. Proceedings of the National Academy of Sciences of the United States of America 97(22), 12115–12120. Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S and Vidal M 2001 Identification of Potential Interaction Networks Using Sequence-Based Searches for Conserved Protein–Protein Interactions or ‘Interologs’. Genome Research 11(12), 2120–2126. Mewes HW, Amid C and Arnold R 2004 MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Research 32, D41–44. Milenkovic T, Lai J and Przulj N 2008 Graphcrunch: a tool for large network analyses. BMC Bioinformatics 9(1), 70. Mintseris J and Weng Z 2005 Structure, function, and evolution of transient and obligate protein–protein interactions. Proceedings of the National Academy of Sciences of the United States of America 102(31), 10930–10935. Moffat J, Grueneberg DA, Yang X, Kim SY, Kloepfer AM, Hinkle G, Piqani B, Eisenhaure TM, Luo B, Grenier JK, Carpenter AE, Foo SY, Stewart SA, Stockwell BR, Hacohen N, Hahn WC, Lander ES, Sabatini DM and Root DE 2006 A lentiviral rnai library for human and mouse genes applied to an arrayed viral high-content screen. Cell 124(6), 1283–1298. Newman MEJ and Girvan M 2004 Finding and evaluating community structure in networks. Physics Review E 69(2), 026113. Nowicki K and Snijders TAB 2001 Estimation and prediction for stochastic block structures. Journal of the American Statistical Association 96(455), 1077–1087. Ogata H, Fujibuchi W, Goto S and Kanehisa M 2000 A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Research 28(20), 4021–4028. Overbeek R, Fonstein M, D’Souza M, Pusch GD and Maltsev N 1999 The use of gene clusters to infer functional coupling. Proceedings of the National Academy of Sciences of the United States of America 96(6), 2896–2901. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D and Yeates TO 1999 Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proceedings of the National Academy of Sciences of the United States of America 96(8), 4285–4288. Penrose M 2003 Random Geometric Graphs. Oxford University Press. Pereira-Leal JB, Enright AJ and Ouzounis CA 2004 Detection of functional modules from protein interaction networks. Proteins: Structure, Function, and Bioinformatics 54(1), 49–57. Phizicky E and Fields S 1995 Protein–protein interactions: methods for detection and analysis. Microbiological Reviews 59(1), 94–123. Porter MA, Onnela JP and Mucha PJ 2009 Communities in networks. Notices of the American Mathematical Society 56(9). Przulj N 2007 Biological network comparison using graphlet degree distribution. Bioinformatics 23(2), e177–e183. Przulj N 2010 Biological network comparison using graphlet degree distribution. Bioinformatics 26(6), 853–854. Przulj N, Corneil DG and Jurisica I 2004 Modeling interactome: scale-free or geometric? Bioinformatics 20(18), 3508–3515. 
Puig O, Caspary F, Rigaut G, Rutz B, Bouveret E, Bragado-Nilsson E, Wilm M and Sraphin B 2001 The tandem affinity purification (tap) method: a general procedure of protein complex purification. Methods 24(3), 218–229. Ratmann O, Jrgensen O, Hinkley T, Stumpf M, Richardson S and Wiuf C 2007 Using likelihood-free inference to compare evolutionary dynamics of the protein networks of H. pylori and P. falciparum PLoS Computational Biology 3(11), e230. Reichardt J and Bornholdt S 2006 Statistical mechanics of community detection. Physical Review E 74(1), 016110. Rentzsch R and Orengo CA 2009 Protein function prediction – the power of multiplicity. Trends in Biotechnology 27(4), 210–219. Resnik P 1995 Using information content to evaluate semantic similarity in a taxonomy. Proceedings of the 14th International Joint Conference on Artificial Intelligence 448–453.
Protein Interaction Networks 233 Rito T, Wang Z, Deane CM and Reinert G 2010 How threshold behaviour affects the use of subgraphs for network comparison. Bioinformatics 26(18), i611–i617. Saito R, Suzuki H and Hayashizaki Y 2002 Interaction generality, a measurement to assess the reliability of a protein– protein interaction. Nucleic Acids Research 30(5), 1163–1168. Saito R, Suzuki H and Hayashizaki Y 2003 Construction of reliable protein–protein interaction networks with a new interaction generality measure. Bioinformatics 19(6), 756–763. Sanderson CM 2009 The Cartographers toolbox: building bigger and better human protein interaction networks. Briefings in Functional Genomics & Proteomics 8(1), 1–11. Scannell DR, Butler G and Wolfe KH 2007 Yeast genome evolution: the origin of the species. Yeast 24, 929–942. Schlicker A, Domingues F, Rahnenfuhrer J and Lengauer T 2006 A new measure for functional similarity of gene products based on gene ontology. BMC Bioinformatics 7(1), 302. Schwikowski B, Uetz P and Fields S 2000 A network of protein–protein interactions in yeast. Nature Biotechnology 18(12), 1257–1261. Segal E, Wang H and Koller D 2003 Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics 19(Suppl. 1), i264–i272. Sharan R and Ideker T 2006 Modeling cellular machinery through biological network comparison. Nature Biotechnology 24, 427–433. Shlomi T, Eisenberg Y, Sharan R and Ruppin E 2007 A genome-scale computational study of the interplay between transcriptional regulation and metabolism. Molecular Systems Biology 3, Article no. 101. Singh R, Xu J and Berger B 2008 Global alignment of multiple protein interaction networks with application to functional orthology detection. Proceedings of the National Academy of Sciences of the United States of America 105(35), 12763– 12768. Skrabanek L, Saini H, Bader G and Enright A 2008 Computational prediction of protein–protein interactions. Molecular Biotechnology 38, 1–17. Spirin V and Mirny LA 2003 Protein complexes and functional modules in molecular networks. Proceedings of the National Academy of Sciences of the United States of America 100(21), 12123–12128. Stagljar I, Korostensky C, Johnsson N and te Heesen S 1998 A genetic system based on split-ubiquitin for the analysis of interactions between membrane proteins in vivo. Proceedings of the National Academy of Sciences of the United States of America 95(9), 5187–5192. Stumpf MPH and Wiuf C 2005 Sampling properties of random graphs: the degree distribution. Physical Review E 72(3), 036118. Stumpf MPH, Wiuf C and May RM 2005 Subnets of scale-free networks are not scale-free: sampling properties of networks. Proceedings of the National Academy of Sciences of the United States of America 102(12), 4221–4224. Stumpf MPH, Thorne T, de Silva E, Stewart R, An HJ, Lappe M and Wiuf C 2008 Estimating the size of the human interactome. Proceedings of the National Academy of Sciences 105(19), 6959–6964. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR and Hogenesch JB 2004 A gene atlas of the mouse and human protein-encoding transcriptomes. Proceedings of the National Academy of Sciences of the United States of America 101(16), 6062–6067. Tamames J, Casari G, Ouzounis C and Valencia A 1997 Conserved clusters of functionally related genes in two bacterial genomes. Journal of Molecular Evolution 44(1), 66–73. 
Tanaka R, Yi TM and Doyle J 2005 Some protein interaction data do not exhibit power law statistics. FEBS Letters 579(23), 5140–5144. Tanay A, Sharan R, Kupiec M and Shamir R 2004 Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proceedings of the National Academy of Sciences of the United States of America 101(9), 2981–2986. Tarassov K, Messier V, Landry CR, Radinovic S, Molina MMS, Shames I, Malitskaya Y, Vogel J, Bussey H and Michnick SW 2008 An in vivo map of the yeast protein interactome. Science 320(5882), 1465–1470. van Dongen SM 2000 Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S and Bork P 2002 Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417(6887), 399–403.
von Mering C, Huynen M, Jaeggi D, Schmidt S, Bork P and Snel B 2003 STRING: a database of predicted functional associations between proteins. Nucleic Acids Research 31(1), 258–261. Wang YC and Chen BS 2010 Integrated cellular network of transcription regulations and protein–protein interactions. BMC Systems Biology 4(1), 20. Wasserman S and Faust K 1995 Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences). Cambridge University Press. Wilkins MR and Kummerfeld SK 2008 Sticking together? Falling apart? Exploring the dynamics of the interactome. Trends in Biochemical Sciences 33(5), 195–200. Wodak SJ, Pu S, Vlasblom J and Sraphin B 2009 Challenges and rewards of interaction proteomics. Molecular & Cellular Proteomics 8(1), 3–18. Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F and Spencer F 2004 A Model-based background adjustment for oligonucleotide expression arrays. Journal of the American Statistical Association 99(468), 909–917. Xenarios I, Salwnski L, Duan XJ, Higney P, Kim SM and Eisenberg D 2002 DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research 30(1), 303–305. Yeang CH, Ideker T and Jaakkola T 2004 Physical network models. Journal of Computational Biology 11(2–3), 243–262. Yu X, Ivanic J, Wallqvist A and Reifman J 2009 A novel scoring approach for protein co-purification data reveals high interaction specificity. PLoS Computational Biology 5(9), e1000515. Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M and Cesareni G 2002 Mint: a molecular interaction database. FEBS Letters 513(1), 135–140. Zaslavskiy M, Bach F and Vert JP 2009 Global alignment of protein-protein interaction networks by graph matching methods. Bioinformatics 25(12), i259–1267. Zhang J 2003 Evolution by gene duplication: an update. Trends in Ecology and Evolution 18, 292–298.
Part C NETWORKS AND GRAPHICAL MODELS
11 Introduction to Graphical Modelling
Marco Scutari 1 and Korbinian Strimmer 2
1 UCL Genetics Institute (UGI), University College London, UK
2 Institute for Medical Informatics, Statistics and Epidemiology (IMISE), University of Leipzig, Germany
The aim of this chapter is twofold. In the first part (Sections 11.1, 11.2 and 11.3) we will provide a brief overview of the mathematical and statistical foundations of graphical models, along with their fundamental properties, estimation and basic inference procedures. In particular we will develop Markov networks (also known as Markov random fields) and Bayesian networks, which are the subjects of most past and current literature on graphical models. In the second part (Section 11.4) we will review some applications of graphical models in systems biology.
11.1 Graphical structures and random variables
Graphical models are a class of statistical models which combine the rigour of a probabilistic approach with the intuitive representation of relationships given by graphs. They are composed of two parts:
1. a set X = {X1, X2, . . . , Xp} of random variables describing the quantities of interest. The statistical distribution of X is called the global distribution of the data, while the components it factorises into are called local distributions.
2. a graph G = (V, E) in which each vertex v ∈ V, also called a node, is associated with one of the random variables in X (they are usually referred to interchangeably). Edges e ∈ E, also called links, are used to express the dependence structure of the data (the set of dependence relationships among the variables in X) with different semantics for undirected graphs (Diestel 2005) and directed acyclic graphs (Bang-Jensen and Gutin 2009).
The scope of this class of models and the versatility of its definition are well expressed by Pearl (1988) in his seminal work ‘Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference’: Graph representations meet our earlier requirements of explicitness, saliency, and stability. The links in the graph permit us to express directly and quantitatively the dependence relationships, and the graph topology displays these relationships explicitly and preserves them, under any assignment of numerical parameters.
The nature of the link outlined above between the dependence structure of the data and its graphical representation is given again by Pearl (1988) in terms of conditional independence (denoted with ⊥⊥P) and graphical separation (denoted with ⊥⊥G).

Definition 11.1.1 A graph G is a dependency map (or D-map) of the probabilistic dependence structure P of X if there is a one-to-one correspondence between the random variables in X and the nodes V of G, such that for all disjoint subsets A, B, C of X we have

A ⊥⊥P B | C =⇒ A ⊥⊥G B | C.   (11.1)

Similarly, G is an independency map (or I-map) of P if

A ⊥⊥P B | C ⇐= A ⊥⊥G B | C.   (11.2)

G is said to be a perfect map of P if it is both a D-map and an I-map, that is

A ⊥⊥P B | C ⇐⇒ A ⊥⊥G B | C,   (11.3)
and in this case P is said to be isomorphic to G. Note that this definition does not depend on a particular characterisation of graphical separation, and therefore on the type of graph used in the graphical model. In fact both Markov networks (Whittaker 1990) and Bayesian networks (Pearl 1988), which are by far the two most common classes of graphical models treated in the literature, are defined as minimal I-maps even though the former use undirected graphs and the latter use directed acyclic graphs. Minimality requires that, if the dependence structure P of X can be expressed by multiple graphs, we must use the one with the minimum number of edges; if any further edge is removed then the graph is no longer an I-map of P. Being an I-map guarantees that two disjoint sets of nodes A and B found to be separated by C in the graph (according to the characterisation of separation for that type of graph) correspond to independent sets of variables. However, this does not mean that every conditional independence relationship present in P is reflected in the graph; this is true only if the graph is also assumed to be a dependency map, making it a perfect map of P. In Markov networks graphical separation [which is called undirected separation or u-separation in Castillo et al. (1997)] is easily defined due to the lack of direction of the links. Definition 11.1.2 If A, B and C are three disjoint subsets of nodes in an undirected graph G, then C is said to separate A from B, denoted A ⊥ ⊥G B | C, if every path between a node in A and a node in B contains at least one node in C. In Bayesian networks separation takes the name of directed separation (or d-separation) and is defined as follows (Korb and Nicholson 2009).
[Figure 11.1: panels labelled 'separation (undirected graphs)' and 'd-separation (directed acyclic graphs)', each drawn over the nodes A, B and C.]
Figure 11.1 Graphical separation, conditional independence and probability factorisation for some simple undirected and directed acyclic graphs. The undirected graph is a simple 3 node chain, while the directed acyclic graphs are the converging, serial and diverging connections (collectively known as fundamental connections in the theory of Bayesian networks)
Definition 11.1.3 If A, B and C are three disjoint subsets of nodes in a directed acyclic graph G, then C is said to d-separate A from B, denoted A ⊥⊥G B | C, if along every path between a node in A and a node in B there is a node v satisfying one of the following two conditions:
1. v has converging edges (i.e. there are two edges pointing to v from the adjacent nodes in the path) and none of v or its descendants (i.e. the nodes that can be reached from v) are in C.
2. v is in C and does not have converging edges.
A simple application of these definitions is illustrated in Figure 11.1. We can see that in the undirected graph on the top A and B are separated by C, because there is no edge between A and B and the path that connects them contains C; so we can conclude that A is independent from B given C according to Definition 11.1.2. As for the three directed acyclic graphs, which are called the converging, serial and diverging connections, we can see that only the last two satisfy the conditions stated in Definition 11.1.3. In the converging connection C has two incoming edges (which violates the second condition) and is included in the set of nodes we are conditioning on (which violates the first condition). Therefore we can conclude that C does not d-separate A and B and that, according to the definition of I-map, we cannot say that A is independent of B given C.
A fundamental result descending from the definitions of separation and d-separation is the Markov property (or Markov condition), which defines the decomposition of the global distribution of the data into a set of local distributions. For Bayesian networks it is related to the chain rule of probability (Korb and Nicholson 2009); it takes the form

P(X) = ∏_{i=1}^{p} P(Xi | ΠXi)   for discrete data and   (11.4)
f(X) = ∏_{i=1}^{p} f(Xi | ΠXi)   for continuous data,   (11.5)
so that each local distribution is associated with a single node Xi and depends only on the joint distribution of its parents ΠXi. This decomposition holds for any Bayesian network, regardless of its graph structure. In Markov networks on the other hand local distributions are associated with the cliques (maximal subsets of nodes in which each element is adjacent to all the others) C1, C2, . . ., Ck present in the graph; so

P(X) = ∏_{i=1}^{k} ψi(Ci)   for discrete data and   (11.6)
f(X) = ∏_{i=1}^{k} ψi(Ci)   for continuous data.   (11.7)
The functions ψ1, ψ2, . . . , ψk are called Gibbs' potentials (Pearl 1988), factor potentials (Castillo et al. 1997) or simply potentials, and are non-negative functions representing the relative mass of probability of each clique. They are proper probability or density functions only when the graph is decomposable or triangulated, that is when it contains no induced cycles other than triangles. With any other type of graph inference becomes very hard, if possible at all, because ψ1, ψ2, . . . , ψk have no direct statistical interpretation. Decomposable graphs are also called chordal (Diestel 2005) because any cycle of length at least four has a chord (a link between two nodes in a cycle that is not contained in the cycle itself). In this case the global distribution factorises again according to the chain rule and can be written as

P(X) = ∏_{i=1}^{k} P(Ci) / ∏_{i=1}^{k} P(Si)   for discrete data and   (11.8)
f(X) = ∏_{i=1}^{k} f(Ci) / ∏_{i=1}^{k} f(Si)   for continuous data,   (11.9)

where Si are the nodes of Ci which are also part of any other clique up to Ci−1 (Pearl 1988). A trivial application of these factorisations is illustrated again in Figure 11.1. The Markov network is composed of two cliques, C1 = {A, C} and C2 = {B, C}, separated by R1 = {C}. Therefore according to Equation (11.6) we have

P(X) = P(A, C) P(B, C) / P(C) = P(A | C) P(B | C) P(C).   (11.10)

In the Bayesian networks we can see that the decomposition of the global distribution results in three local distributions, one for each node. Each local distribution is conditional on the set of parents of that particular node. For example, in the converging connection we have that ΠA = {∅}, ΠB = {∅} and ΠC = {A, B}, so according to Equation (11.4) the correct factorisation is

P(X) = P(A) P(B) P(C | A, B).   (11.11)

On the other hand, in the serial connection we have that ΠA = {∅}, ΠB = {C} and ΠC = {A}, so

P(X) = P(A) P(C | A) P(B | C).   (11.12)
The diverging connection can be shown to result in the same factorisation, even though the nodes have different sets of parents than in the serial connection. Another fundamental result descending from the link between graphical separation and probabilistic independence is the definition of the Markov blanket (Pearl 1988) of a node Xi , the set that completely separates Xi from the rest of the graph. Generally speaking it is the set of nodes that includes all the knowledge needed to
[Figure 11.2: panel (a) a Bayesian network and panel (b) the corresponding Markov network; the legend marks the parents, children, children's other parents, neighbours and Markov blanket of node A.]
Figure 11.2 The Markov blanket of the node A in a Bayesian network (a) and in the corresponding Markov network given by its moral graph (b). The two graphs express the same dependence structure, so the Markov blanket of A is the same
do inference on Xi , from estimation to hypothesis testing to prediction, because all the other nodes are conditionally independent of Xi given its Markov blanket. In Markov networks the Markov blanket coincides with the neighbours of Xi (all the nodes that are connected to Xi by an edge); in Bayesian networks it is the union of the children of Xi , its parents, and its children’s other parents (see Figure 11.2). In both classes of models the usefulness of Markov blankets is limited by the sparseness of the network. If edges are few compared with the number of nodes the interpretation of each Markov blanket becomes a useful tool in understanding and predicting the behaviour of the data. The two characterisations of graphical separation and of the Markov properties presented above do not appear to be closely related, to the point that these two classes of graphical models seem to be very different in construction and interpretation. There are indeed dependency models that have an undirected perfect map but not a directed acyclic one, and vice versa [see Pearl (1988), pages 126–127 for a simple example of a dependency structure that cannot be represented as a Bayesian network]. However, it can be shown (Castillo et al. 1997; Pearl 1988) that every dependency structure that can be expressed by a decomposable graph can be modelled both by a Markov network and a Bayesian network. This is clearly the case for the small networks shown in Figure 11.2, as the undirected graph obtained from the Bayesian network by moralisation (connecting parents which share a common child) is decomposable. It can also be shown that every dependency model expressible by an undirected graph is also expressible by a directed acyclic graph, with the addition of some auxiliary nodes. These two results indicate that there is a significant overlap between Markov and Bayesian networks, and that in many cases both can be used to the same effect.
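These relationships are easy to compute programmatically. The following minimal Python sketch (our own illustration on a toy graph, not code taken from the chapter) recovers the Markov blanket of a node from the parent sets of a directed acyclic graph and checks that it coincides with the node's neighbours in the moral graph, as in Figure 11.2.

```python
from itertools import combinations

def markov_blanket(parents, node):
    """Markov blanket of `node` in a DAG given as a dict mapping each node
    to the set of its parents: parents, children and the children's other parents."""
    children = {v for v, pa in parents.items() if node in pa}
    spouses = set().union(*(parents[c] for c in children)) if children else set()
    return (set(parents[node]) | children | spouses) - {node}

def moral_graph(parents):
    """Moralisation: keep every edge (ignoring direction) and connect
    parents that share a common child; edges are returned as frozensets."""
    edges = set()
    for child, pa in parents.items():
        edges.update(frozenset({p, child}) for p in pa)
        edges.update(frozenset(pair) for pair in combinations(pa, 2))
    return edges

def neighbours(edges, node):
    return {v for e in edges if node in e for v in e} - {node}

# toy DAG: A and B are parents of C, which in turn is a parent of D
parents = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}
print(markov_blanket(parents, "A"))           # {'B', 'C'}
print(neighbours(moral_graph(parents), "A"))  # the same set in the moral graph
```

The equality of the two outputs illustrates the point made above: the Markov blanket of A in the directed graph is exactly its neighbourhood in the corresponding moral graph.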
11.2 Learning graphical models
Fitting graphical models is called learning, a term borrowed from expert systems and artificial intelligence theory, and in general requires a two-step process. The first step consists in finding the graph structure that encodes the conditional independencies present in the data. Ideally it should coincide with the minimal I-map of the global distribution, or it should at least
identify a distribution as close as possible to the correct one in the probability space. This step is called network structure or simply structure learning (Korb and Nicholson 2009; Koller and Friedman 2009), and is similar in approaches and terminology to model selection procedures for classical statistical models. The second step is called parameter learning and, as the name suggests, deals with the estimation of the parameters of the global distribution. This task can be easily reduced to the estimation of the parameters of the local distributions because the network structure is known from the previous step. Both structure and parameter learning are often performed using a combination of numerical algorithms and prior knowledge on the data. Even though significant progress has been made on performance and scalability of learning algorithms, an effective use of prior knowledge and relevant theoretical results can still speed up the learning process severalfold and improve the accuracy of the resulting model. Such a boost has been used in the past to overcome the limitations on computational power, leading to the development of the so-called expert systems [for real-world examples see the MUNIN (Andreassen et al. 1989), ALARM (Beinlich et al. 1989) and Hailfinder (Abramson et al. 1996) networks]; it can still be used today to tackle larger and larger problems and obtain reliable results.
11.2.1 Structure learning
Structure learning algorithms have seen a steady development over the past two decades thanks to the increased availability of computational power and the application of many results from probability, information and optimisation theory. Despite the (sometimes confusing) variety of theoretical backgrounds and terminology they can all be traced to only three approaches: constraint-based, score-based and hybrid. Constraint-based algorithms use statistical tests to learn conditional independence relationships (called constraints in this setting) from the data and assume that the graph underlying the probability distribution is a perfect map to determine the correct network structure. They have been developed originally for Bayesian networks, but have been recently applied to Markov networks as well [see for example the Grow-Shrink algorithm (Margaritis 2003; Bromberg et al. 2009), which works with minor modifications in both cases]. Their main limitations are the lack of control of either the family-wise error rate (Dudoit and van der Laan 2007) or the false discovery rate (Efron 2008) and the additional assumptions needed by the tests themselves, which are often asymptotic and with problematic regularity conditions. Score-based algorithms are closer to model selection techniques developed in classical statistics and information theory. Each candidate network is assigned a score reflecting its goodness of fit, which is then taken as an objective function to maximise. Since the number of both undirected graphs and directed acyclic graphs grows more than exponentially in the number of nodes (Harary and Palmer 1973) an exhaustive search is not feasible in all but the most trivial cases. This has led to an extensive use of heuristic optimisation algorithms, from local search (starting from an initial network and changing one edge at a time) to genetic algorithms (Russell and Norvig 2009). Convergence to a global maximum however is not guaranteed, as they can get stuck into a local maximum because of the noise present in the data or a poor choice in the tuning parameters of the score function. Hybrid algorithms use both statistical tests and score functions, combining the previous two families of algorithms. The general approach is described for Bayesian networks in Friedman et al. (1999b), and has proved one of the top performers to date in Tsamardinos et al. (2006). Conditional independence tests are used to learn at least part of the conditional independence relationships from the data, thus restricting the search space for a subsequent score-based search. The latter determines which edges are actually present in the graph and, in the case of Bayesian networks, their direction. All these structure learning algorithms operate under a set of common assumptions, which are similar for Bayesian and Markov networks:
• There must be a one-to-one correspondence between the nodes of the graph and the random variables included in the model; this means in particular that there must not be multiple nodes which are functions of a single variable.
• There must be no unobserved (also called latent or hidden) variables that are parents of an observed node in a Bayesian network; otherwise only part of the dependency structure can be observed, and the model is likely to include spurious edges. Specific algorithms have been developed for this particular case, typically based on Bayesian posterior distributions or the EM algorithm (Dempster et al. 1977); see for example Binder et al. (1997), Friedman (1997) and Elidan and Friedman (2005).
• All the relationships between the variables in the network must be conditional independencies, because they are by definition the only ones that can be expressed by graphical models.
• Every combination of the possible values of the variables in X must represent a valid, observable (even if really unlikely) event. This assumption implies a strictly positive global distribution, which is needed to have uniquely determined Markov blankets and, therefore, a uniquely identifiable model. Constraint-based algorithms work even when this is not true, because the existence of a perfect map is also a sufficient condition for the uniqueness of the Markov blankets (Pearl 1988).
Some additional assumptions are needed to properly define the global distribution of X and to have a tractable, closed-form decomposition:
• Observations must be stochastically independent. If some form of temporal or spatial dependence is present it must be specifically accounted for in the definition of the network, as in dynamic Bayesian networks (Koller and Friedman 2009). They will be covered in Section 11.4.4 and Chapter 12.
• If all the random variables in X are discrete or categorical both the global and the local distributions are assumed to be multinomial. This is by far the most common assumption in the literature, at least for Bayesian networks, because of its strong ties with the analysis of contingency tables (Agresti 2002; Bishop et al. 2007) and because it allows an easy representation of local distributions as conditional probability tables (see Figure 11.3).
• If on the other hand all the variables in X are continuous the global distribution is usually assumed to follow a multivariate Gaussian distribution, and the local distributions are either univariate or multivariate Gaussian distributions. This assumption defines a subclass of graphical models called graphical Gaussian models (GGMs), which overlaps both Markov (Whittaker 1990) and Bayesian networks (Neapolitan 2004). A classical example from Edwards (2000) is illustrated in Figure 11.4.
• If both continuous and categorical variables are present in the data there are three possible choices: assuming a mixture or conditional Gaussian distribution (Edwards 2000; Bøttcher 2004), discretising continuous attributes (Friedman and Goldszmidt 1996) or using a nonparametric approach (Bach and Jordan 2003).
The form of the probability or density function chosen for the local distributions determines which score functions (for score-based algorithms) or conditional independence tests (for constraint-based algorithms) can be used by structure learning algorithms. Common choices for conditional independence tests are:
• Discrete data: Pearson's χ² and the G² tests (Edwards 2000; Agresti 2002), either as asymptotic or permutation tests. The G² test is actually a log-likelihood ratio test (Lehmann and Romano 2005) and is equivalent to mutual information tests (Cover and Thomas 2006) up to a constant.
• Continuous data: Student's t, Fisher's Z and the log-likelihood ratio tests based on partial correlation coefficients (Legendre 2000; Neapolitan 2004), again either as asymptotic or permutation tests. The log-likelihood ratio test is equivalent to the corresponding mutual information test as before.
[Figure 11.3: the ASIA network over the nodes 'visit to Asia?', 'smoking?', 'tuberculosis?', 'lung cancer?', 'bronchitis?', 'either tuberculosis or lung cancer?', 'positive X-ray?' and 'dyspnoea?', each shown with its conditional probability table.]
Figure 11.3 Factorisation of the ASIA Bayesian network from Lauritzen and Spiegelhalter (1988) into local distributions, each with its own conditional probability table. Each row contains the probabilities conditional on a particular configuration of parents
[Figure 11.4: the MARKS network over the nodes mechanics, vectors, algebra, analysis and statistics, together with its cliques.]
Figure 11.4 The MARKS graphical Gaussian network from Edwards (2000) and its decomposition into cliques, the latter characterised by their partial correlation matrices
Score functions commonly used in both cases are penalised likelihood scores such as the Akaike and Bayesian information criteria [AIC and BIC, see Akaike (1974) and Schwarz (1978), respectively], posterior densities such as the Bayesian Dirichlet and Gaussian equivalent scores [BDe and BGe, see Heckerman et al. (1995) and Geiger and Heckerman (1994), respectively] and entropy-based measures such as the Minimum Description Length (MDL) by Rissanen (2007). The last important property of structure learning algorithms, one that sometimes is not explicitly stated, is their inability to discriminate between score equivalent Bayesian networks (Chickering 1995). Such models have the same skeleton (the undirected graph resulting from ignoring the direction of every edge) and the same v-structures (another name for the converging connection illustrated in Figure 11.1), and therefore they encode the same conditional independence relationships because every d-separation statement that is true for one of them also holds for all the others. This characterisation implies a partitioning of the space of the possible networks into a set of equivalence classes whose elements are all I-maps of the same probability distribution. The elements of each of those equivalence classes are indistinguishable from each other without additional information, such as a nonuniform prior distribution; their factorisations into local distributions are equivalent. Statistical tests and almost all score functions (which are in turn called score equivalent functions), including those detailed above, are likewise unable to choose one model over an equivalent one. This means that learning algorithms, which base their decisions on these very tests and scores, are only able to learn which equivalence class the minimal I-map of the dependence structure belongs to. They are usually not able to uniquely determine the direction of all the edges present in the network, which is then represented as a partially directed graph (see Figure 11.5).
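To make the scoring step concrete, the sketch below (our own Python code, assuming pandas and fully observed discrete data; it is not taken from the chapter) computes the BIC score of a Bayesian network as the sum of the penalised log-likelihoods of its multinomial local distributions. Because BIC is score equivalent, the structures A → B and B → A in the toy example receive identical scores.

```python
import numpy as np
import pandas as pd

def bic_score(data, parents):
    """BIC of a discrete Bayesian network given as a dict node -> list of parents:
    sum over nodes of log P(node | parents) minus 0.5 * (free parameters) * log(n)."""
    n, score = len(data), 0.0
    for node, pa in parents.items():
        r_i = data[node].nunique()                        # states of the node
        if pa:
            counts = data.groupby(list(pa) + [node]).size()
            totals = counts.groupby(level=list(range(len(pa)))).transform("sum")
            q_i = data.groupby(list(pa)).ngroups          # observed parent configurations
        else:
            counts, totals, q_i = data[node].value_counts(), n, 1
        loglik = float(np.sum(counts * np.log(counts / totals)))
        score += loglik - 0.5 * (r_i - 1) * q_i * np.log(n)
    return score

# toy data: B is a noisy copy of A; the two DAGs below are score equivalent
rng = np.random.default_rng(0)
a = rng.integers(0, 2, 500)
b = np.where(rng.random(500) < 0.2, 1 - a, a)
data = pd.DataFrame({"A": a, "B": b})
print(bic_score(data, {"A": [], "B": ["A"]}))   # A -> B
print(bic_score(data, {"B": [], "A": ["B"]}))   # B -> A, same score
```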
Figure 11.5 Two score equivalent Bayesian networks (on the left and in the middle) and the partially directed graph representing the equivalence class they belong to (on the right). Note that the direction of the edge D → E is set because its reverse E → D would introduce two additional v-structures in the graph; for this reason it is called a compelled edge (Pearl 1988)
11.2.2 Parameter learning
Once the structure of the network has been learned from the data the task of estimating and updating the parameters of the global distribution is greatly simplified by the application of the Markov property. Local distributions in practise involve only a small number of variables; furthermore their dimension usually does not scale with the size of X (and is often assumed to be bounded by a constant when computing the computational complexity of algorithms), thus avoiding the so-called curse of dimensionality. This means that each local distribution has a comparatively small number of parameters to estimate from the sample, and that estimates are more accurate due to the better ratio between the size of parameter space and the sample size. The number of parameters needed to uniquely identify the global distribution, which is the sum of the number of parameters of the local ones, is also reduced because the conditional independence relationships encoded in the network structure fix large parts of the parameter space. For example in GGMs partial correlation coefficients involving (conditionally) independent variables are equal to zero by definition, and joint frequencies factorise into marginal ones in multinomial distributions. However, parameter estimation is still problematic in many situations. For example it is increasingly common to have sample size much smaller than the number of variables included in the model; this is typical of microarray data, which have a few tens or hundreds of observations and thousands of genes. Such a situation, which is called ‘small n, large p’, leads to estimates with high variability unless particular care is taken both in structure and parameter learning (Sch¨afer and Strimmer 2005a; Castelo and Roverato 2006; Hastie et al. 2009). Dense networks, which have a high number of edges compared with their nodes, represent another significant challenge. Exact inference quickly becomes unfeasible as the number of nodes increases, and even approximate procedures based on Monte Carlo simulations and bootstrap resampling require large computational resources (Korb and Nicholson 2009; Koller and Friedman 2009). Numerical problems stemming from floating point approximations (Goldberg 1991) and approximate numerical algorithms (such as the ones used in matrix inversion and eigenvalue computation) should also be taken into account.
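As a small illustration of parameter learning for a single local distribution (our own Python sketch, assuming pandas; the uniform Dirichlet pseudo-count is one common regularisation choice, not the only one), the conditional probability table of a discrete node given its parents can be estimated from observed counts, with the pseudo-count guarding against sparsely observed parent configurations:

```python
import numpy as np
import pandas as pd

def estimate_cpt(data, node, parents, prior=1.0):
    """Posterior mean estimate of P(node | parents) under a uniform
    Dirichlet prior; returns one row per parent configuration."""
    states = sorted(data[node].unique())
    if parents:
        counts = pd.crosstab([data[p] for p in parents], data[node])
    else:
        counts = data[node].value_counts().to_frame().T
    counts = counts.reindex(columns=states, fill_value=0) + prior
    return counts.div(counts.sum(axis=1), axis=0)

# toy example: estimate P(C | A, B) from data simulated for a converging connection
rng = np.random.default_rng(2)
a, b = rng.integers(0, 2, (2, 50))
c = np.where(rng.random(50) < 0.1, 1 - (a & b), a & b)
data = pd.DataFrame({"A": a, "B": b, "C": c})
print(estimate_cpt(data, "C", ["A", "B"]))
```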
11.3 Inference on graphical models
Inference procedures for graphical models focus mainly on evidence propagation and model validation, even though other aspects such as robustness (Cozman 1997) and sensitivity analysis (G´omez-Villegas et al. 2008) have been studied for specific settings. Evidence propagation (another term borrowed from expert systems literature) studies the impact of new evidence and beliefs on the parameters of the model. For this reason it is also referred to as belief propagation or belief updating (Pearl 1988; Castillo et al. 1997), and has a clear Bayesian interpretation in terms of
Introduction to Graphical Modelling 247
posterior and conditional probabilities. The structure of the network is usually considered fixed, thus allowing a scalable and efficient updating of the model through its decomposition into local distributions. In practise new evidence is introduced by either altering the relevant parameters (soft evidence) or setting one or more variables to a fixed value (hard evidence). The former can be thought of as a model revision or parameter tuning process, while the latter is carried out by conditioning the behaviour of the network on the values of some nodes. The process of computing such conditional probabilities is also known as conditional probability query on a set of query nodes (Koller and Friedman 2009), and can be performed with a wide selection of exact and approximate inference algorithms. Two classical examples of exact algorithms are variable elimination (optionally applied to the clique tree form of the network) and Kim and Pearl’s Message Passing algorithm. Approximate algorithms on the other hand rely on various forms of Monte Carlo sampling such as forward sampling (also called logic sampling for Bayesian networks), likelihood-weighted sampling and importance sampling. Markov chain Monte Carlo methods such as Gibbs sampling are also widely used (Korb and Nicholson 2009). Model validation on the other hand deals with the assessment of the performance of a graphical model when dealing with new or existing data. Common measures are the goodness-of-fit scores cited in the Section 11.2.1 or any appropriate loss measure such as misclassification error (for discrete data) and the residual sum of squares (for continuous data). Their estimation is usually carried out using either a separate testing dataset or cross validation (Pe˜na et al. 2005; Koller and Friedman 2009) to avoid negatively biased results. Another nontrivial problem is to determine the confidence level for particular structural features. In Friedman et al. (1999a) this is accomplished by learning a large number of Bayesian networks from bootstrap samples drawn from the original dataset and estimating the empirical frequency of the features of interest. Scutari (2011) has recently extended this approach to obtain some univariate measures of variability and perform some basic hypothesis testing. Both techniques can be applied to Markov networks with little to no change. Tian and He (2009) on the other hand used a nonuniform prior distribution on the space of the possible structures to compute the exact marginal posterior distribution of the features of interest.
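As an illustration of the approximate algorithms mentioned above, the following Python sketch (our own toy example with made-up conditional probability tables) answers a conditional probability query on the converging connection A → C ← B by likelihood-weighted sampling, and compares the estimate with the exact value obtained from the factorisation in Equation (11.11).

```python
import numpy as np

rng = np.random.default_rng(3)

# made-up local distributions for the converging connection A -> C <- B
p_a, p_b = 0.3, 0.6
p_c1 = {(0, 0): 0.05, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9}   # P(C = 1 | A, B)

def estimate_p_a1_given_c1(n_samples=50_000):
    """Estimate P(A = 1 | C = 1) by likelihood weighting: sample the
    unobserved nodes from their local distributions and weight each
    sample by the likelihood of the hard evidence C = 1."""
    weights = np.zeros(2)
    for _ in range(n_samples):
        a = int(rng.random() < p_a)
        b = int(rng.random() < p_b)
        weights[a] += p_c1[(a, b)]
    return weights[1] / weights.sum()

# exact answer from P(X) = P(A) P(B) P(C | A, B), summing B out
num = p_a * ((1 - p_b) * p_c1[(1, 0)] + p_b * p_c1[(1, 1)])
den = sum(pa * pb * p_c1[(a, b)]
          for a, pa in ((0, 1 - p_a), (1, p_a))
          for b, pb in ((0, 1 - p_b), (1, p_b)))
print(estimate_p_a1_given_c1(), num / den)   # the two values should agree closely
```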
11.4 Application of graphical models in systems biology
In systems biology graphical models are employed to describe and to identify interdependencies among genes and gene products, with the eventual aim of better understanding the molecular mechanisms of the cell. In medical systems biology the specific focus lies on disease mechanisms mediated by changes in the network structure. For example, a general assumption in cancer genomics is that there are different active pathways in healthy compared with affected tissues. A common problem in the practical application of graphical models in systems biology is the high dimensionality of the data compared with the small sample size. In other words, there are a large number p of variables to be considered whereas the number of observations n is small due to ethical reasons and cost factors. Typically, the number of parameters in a graphical model grows with some power of the number of variables. Hence, if the number of genes is large, the parameters describing the graphical model (e.g. edge probabilities) quickly outnumber the data points. For this reason graphical modelling in systems biology almost always requires some form of regularised inference, such as Bayesian inference, penalised maximum likelihood or other shrinkage procedures.

11.4.1 Correlation networks
The simplest graphical models used in systems biology are relevance networks (Butte et al. 2000), which are also known in statistics as correlation graphs. Relevance networks are constructed by first estimating the
correlation matrix for all p(p − 1)/2 pairs of genes. Subsequently, the correlation matrix is thresholded at some prespecified level, say at |rij | < 0.8, so that weak correlations are set to zero. Finally, a graph is drawn in order to depict the remaining strong correlations. Technically, correlation graphs visualise the marginal (in)dependence structure of the data. Assuming the latter are normally distributed, a missing edge between two genes in a relevance network is indicative of marginal stochastic independence. Because of their simplicity, both in terms of interpretation as well as computation, correlation graphs are enormously popular, not only for analysing gene expression profiles but also many other kinds of omics data (Steuer 2006).
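The construction is simple enough to sketch in a few lines of Python (our own code, assuming NumPy; the 0.8 cut-off follows the example threshold above and is an arbitrary choice).

```python
import numpy as np

def relevance_network(expr, threshold=0.8):
    """Adjacency matrix of a correlation (relevance) network.

    expr is an (n, p) matrix of expression values (n samples, p genes);
    two genes are joined when their absolute Pearson correlation is at
    least the chosen threshold."""
    corr = np.corrcoef(expr, rowvar=False)
    adj = np.abs(corr) >= threshold
    np.fill_diagonal(adj, False)        # no self-loops
    return adj

# toy data: 40 samples, 5 genes, genes 0 and 1 strongly co-expressed
rng = np.random.default_rng(4)
expr = rng.normal(size=(40, 5))
expr[:, 1] = expr[:, 0] + 0.3 * rng.normal(size=40)
print(relevance_network(expr).astype(int))
```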
11.4.2 Covariance selection networks
The simplest graphical model that considers conditional rather than marginal dependencies is the covariance selection model (Dempster 1972), also known as the concentration graph or GGM (Whittaker 1990). In a GGM the graph structure is constructed in the same way as in a relevance network; the only difference is that the presence of an edge is determined by the value of the corresponding partial correlation (the correlation between any two genes once the linear effect of all other p − 2 genes has been removed) instead of the marginal correlation used above. Partial correlations may be computed in a number of ways, but the most direct approach is by inversion and subsequent standardisation of the correlation matrix (the inverse of the covariance matrix is often called a concentration matrix). Specifically, it can be shown that if an entry in the inverse correlation matrix is zero then the partial correlation between the two corresponding genes also vanishes. Thus, under the normal data assumption a missing edge in a GGM implies conditional independence. Partial correlation graphs derived from genomic data are often called gene association networks, to distinguish them from correlation-based relevance networks. Despite their mathematical simplicity, it is not trivial to learn GGMs from high-dimensional 'small n, large p' genomic data (Schäfer and Strimmer 2005c). There are two key problems. First, inferring a large-scale correlation (or covariance) matrix from relatively few data is an ill-posed problem that requires some sort of regularisation; otherwise the correlation matrix is singular and therefore cannot be inverted to compute partial correlations. Secondly, an effective variable selection procedure is needed to determine which estimated partial correlations are not significant and which represent actual linear dependencies. Typically, GGM model selection involves assumptions concerning the sparsity of the actual biological network. The first applications of covariance selection models to genomic data were either restricted to a small number of variables (Waddell and Kishino 2000), used as a preprocessing step in cluster analysis to reduce the effective dimension of the model (Toh and Horimoto 2002), or employed low-order partial correlations as an approximation to fully conditioned partial correlations (de la Fuente et al. 2004). However, newer inference procedures for GGMs are directly applicable to high-dimensional data. A Bayesian regression-based approach to learn large-scale GGMs is given in Dobra et al. (2004). Schäfer and Strimmer (2005b) introduced a large-scale model selection procedure for GGMs using false discovery rate multiple testing with an empirically estimated null model. Schäfer and Strimmer (2005a) also proposed a James–Stein-type shrinkage correlation estimator that is both computationally and statistically efficient even in larger dimensions, specifically for use in network inference. An example of a GGM reconstructed with this algorithm from Escherichia coli data is shown in Figure 11.6. Methods for estimating large-scale inverse correlation matrices using different variants of penalised maximum likelihood are discussed by Li and Gui (2006), Banerjee et al. (2008) and Friedman et al. (2008). Most recently, Andrei and Kendziorski (2009) considered a modified GGM that allows the specification of interactions (i.e. multiplicative dependencies) among genes, and Krämer et al. (2009) conducted an extensive comparison of regularised estimation techniques for GGMs.
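The direct inversion route can be sketched in a few lines of R; the simulated data below have more samples than genes so that the correlation matrix is invertible, whereas for 'small n, large p' data a regularised (e.g. shrinkage) estimator would have to replace solve().

```r
# Partial correlations by inversion and standardisation of the correlation
# matrix; data are illustrative (n = 50 samples, p = 8 genes).
set.seed(1)
expr <- matrix(rnorm(50 * 8), nrow = 50)
omega <- solve(cor(expr))     # concentration (inverse correlation) matrix
pcor <- -cov2cor(omega)       # standardise and change the sign off the diagonal
diag(pcor) <- 1
round(pcor, 2)                # entries near zero correspond to missing GGM edges
```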
Figure 11.6 Partial correlation graph inferred from E. coli data using the algorithm described in Schäfer and Strimmer (2005a,b). Dotted edges indicate negative partial correlation
11.4.3 Bayesian networks
Both gene relevance and gene association networks are undirected graphs. In order to learn about directed conditional dependencies, Bayesian network inference procedures have been developed for static (and later also for time course) microarray data. The application of Bayesian networks to learn large-scale directed graphs from microarray data was pioneered by Friedman et al. (2000), and has been reviewed more recently in Friedman (2004). The high dimensionality of the model means that inference procedures are usually unable to identify a single best Bayesian network, settling instead on a set of models that explain the data about equally well. In addition, as discussed in Section 11.2.1, all Bayesian networks belonging to the same equivalence class have the same score and therefore cannot be distinguished on the basis of the probability distribution of the data. For this reason it is often important to incorporate prior biological knowledge into the inference process of a Bayesian network. A Bayesian approach based on the use of informative prior distributions is described in Mukherjee and Speed (2008). The efficiency of Bayesian networks, GGMs and relevance networks in recovering biological regulatory networks has been studied in an extensive and realistic set-up in Werhli et al. (2006). Not surprisingly, the amount of information contained in gene expression and other high-dimensional data is often too low to allow for accurate reconstruction of all the details of a biological network. Nonetheless, both GGMs and Bayesian networks are able to elucidate some of the underlying structure.
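As a concrete illustration of score-based structure learning for continuous data, the sketch below uses the bnlearn package as one convenient open-source implementation; the package choice and the simulated data are assumptions of this example rather than tools discussed in the text above.

```r
# Score-based learning of a Gaussian Bayesian network (hill-climbing search).
library(bnlearn)
set.seed(1)
expr <- as.data.frame(matrix(rnorm(100 * 6), nrow = 100,
                             dimnames = list(NULL, paste0("g", 1:6))))
dag <- hc(expr)     # greedy search over DAGs, scored by (Gaussian) BIC
cpdag(dag)          # the equivalence class: some edges cannot be oriented from data alone
```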
11.4.4 Dynamic Bayesian networks
The extension of Bayesian networks to the analysis of time course data is provided by dynamic Bayesian networks, which explicitly account for time dependencies in their definition. The incorporation of temporal aspects is important for systems biology, as it makes it possible to draw conclusions about causal relations. Dynamic Bayesian networks are often restricted to linear systems, with two special (yet still very general) models: the vector-autoregressive (VAR) model and state-space models. The main difference between the two is that the latter includes hidden variables that are useful for implicit dimensionality reduction. The VAR model was first applied to genomic data by Fujita et al. (2007) and Opgen-Rhein and Strimmer (2007b). A key problem with this kind of model is that it is very parameter-rich, and therefore hard to estimate efficiently and reliably. Opgen-Rhein and Strimmer (2007b) proposed a shrinkage approach, whereas Fujita et al. (2007) employed Lasso regression for sparse VAR modelling. A refinement of the latter approach based on elastic net penalised regression is described in Shimamura et al. (2009). In all VAR models the estimated coefficients can be interpreted in terms of Granger causality (Opgen-Rhein and Strimmer 2007b). State-space models are an extension of the VAR model, and include lower-dimensional latent variables to facilitate inference. The dimension of the latent variables is usually chosen in the order of the rank of the data matrix. Husmeier (2003), Perrin et al. (2003) and Rangel et al. (2004) were the first to study genomic data with dynamic Bayesian networks and to propose inference procedures suitable for use with microarray data. Bayesian learning procedures are discussed in Lähdesmäki and Shmulevich (2008). A general state-space framework that allows nonstationary time course data to be modelled is given in Grzegorczyk and Husmeier (2009). Rau et al. (2010) present an empirical Bayes approach to learning dynamical Bayesian networks and apply it to gene expression data.
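The sketch below fits a first-order VAR model gene by gene with ordinary least squares purely for illustration; this only works when time points outnumber genes, and for genomic data the shrinkage or penalised approaches cited above would be used instead. The data are simulated.

```r
# Toy first-order VAR fit: regress each gene at time t on all genes at t-1.
set.seed(1)
p <- 4; n <- 60
X <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("g", 1:p)))
past <- X[-n, , drop = FALSE]   # X(t-1)
now  <- X[-1, , drop = FALSE]   # X(t)
A_hat <- t(sapply(1:p, function(i) coef(lm(now[, i] ~ past))[-1]))
rownames(A_hat) <- colnames(X)
round(A_hat, 2)  # entry (i, j): estimated effect of gene j at t-1 on gene i at t (Granger sense)
```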
11.4.5 Other graphical models
Bayesian networks are graphical models where all edges are directed, whereas GGMs represent undirected conditional dependencies in multivariate data. On the other hand, chain graphs can include directed as well as undirected dependencies in the same graph. One heuristic approach to infer an approximating chain graph from high-dimensional genomic data is described in Opgen-Rhein and Strimmer (2007a).
For reasons of simplicity, and to further reduce the number of parameters to be estimated, many graphical models used in systems biology only describe linear dependencies (GGM, VAR, state-space models). Attempts to relax such linearity assumptions include entropy networks (Meyer et al. 2007; Hausser and Strimmer 2009) and copula-based approaches (Kim et al. 2008). Finally, sometimes time-discrete models such as dynamic Bayesian networks are not appropriate to study the dynamics of molecular processes. In these cases stochastic differential equations (Wilkinson 2009) often represent a viable alternative. It is also important to keep in mind that, given the small sample size of omics data, the most complex graphical model is not necessarily the best choice for an analyst (Werhli et al. 2006).
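One simple way to go beyond linear dependence, in the spirit of the entropy networks mentioned above, is to discretise the expression values and compute a plug-in mutual information estimate for each gene pair. The sketch below does this for a single nonlinear pair with illustrative data and bin numbers; in practice the dedicated estimators of Meyer et al. (2007) or Hausser and Strimmer (2009) would be preferable.

```r
# Plug-in mutual information for one gene pair after discretisation.
set.seed(1)
x <- rnorm(100)
y <- x^2 + rnorm(100, sd = 0.5)         # a nonlinear dependence with near-zero correlation
bx <- cut(x, breaks = 4)
by <- cut(y, breaks = 4)
pxy <- table(bx, by) / length(x)        # joint cell frequencies
px <- rowSums(pxy); py <- colSums(pxy)  # marginal frequencies
mi <- sum(pxy * log(pxy / outer(px, py)), na.rm = TRUE)
mi   # computed over all gene pairs, large values suggest an edge
```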
References Abramson B, Brown J, Edwards W, Murphy A and Winkler RL 1996 Hailfinder: a Bayesian System for Forecasting Severe Weather. International Journal of Forecasting 12(1), 57–71. Agresti A 2002 Categorical Data Analysis, 2nd edn. John WIley & Sons, Ltd. Akaike H 1974 A New Look at the Statistical Identification Model. IEEE Transactions on Automatic Control 19, 716–723. Andreassen S, Jensen F, Andersen S, Falck B, Kjærulff U, Woldbye M, Sørensen A, Rosenfalck A and Jensen F 1989 MUNIN – An Expert EMG Assistant. In Computer-Aided Electromyography and Expert Systems (ed. Desmedt JE), pp. 255–277. Elsevier. Andrei A and Kendziorski C 2009 An Efficient Method for Identifying Statistical Interactors in Gene Association Networks. Biostatistics 10, 706–718. Bach FR and Jordan MI 2003 Learning Graphical Models with Mercer Kernels. In Advances in Neural Information Processing Systems (eds Becker S, Thrun S and Obermayer K), vol. 15, pp. 1009–1016. MIT Press. Banerjee O, El Ghaoui L and d’Aspremont A 2008 Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. Journal of Machine Learning Resesearch 9, 485–516. Bang-Jensen J and Gutin G 2009 Digraphs: Theory, Algorithms and Applications, 2nd edn. Springer. Beinlich I, Suermondt HJ, Chavez RM and Cooper GF 1989 The ALARM Monitoring System: a Case Study with Two Probabilistic Inference Techniques for Belief Networks. Proceedings of the 2nd European Conference on Artificial Intelligence in Medicine, pp. 247–256. Springer-Verlag. Binder J, Koller D, Russell S and Kanazawa K 1997 Adaptive Probabilistic Networks with Hidden Variables. Machine Learning 29(2–3), 213–244. Bishop YMM, Fienberg SE and Holland PW 2007 Discrete Multivariate Analysis: Theory and Practice. Springer. Bøttcher SG 2004 Learning Bayesian Networks with Mixed Variables. PhD thesis Department of Mathematical Sciences, Aalborg University. Bromberg F, Margaritis D and Honavar V 2009 Efficient Markov Network Structure Discovery Using Independence Tests. Journal of Artificial Intelligence Research 35, 449–485. Butte AJ, Tamayo P, Slonim D, Golub TR and Kohane IS 2000 Discovering Functional Relationships Between RNA Expression and Chemotherapeutic Susceptibility Using Relevance Networks. Proceedings of the National Academy of Sciences of the United States of America 97, 12182–12186. Castelo R and Roverato A 2006 A Robust Procedure for Gaussian Graphical Model Search from Microarray Data with p Larger than. Journal of Machine Learning Research 7, 2621–2650. Castillo E, Guti´errez JM and Hadi AS 1997 Expert Systems and Probabilistic Network Models. Springer. Chickering DM 1995 A Transformational Characterization of Equivalent Bayesian Network Structures. Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence 87–98. Cover TM and Thomas JA 2006 Elements of Information Theory, 2nd edn. John WIley & Sons, Ltd. Cozman F 1997 Robustness Analysis of Bayesian Networks with Local Convex Sets of Distributions. Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence 108–11. de la Fuente A, Bing N, Hoeschele I and Mendes P 2004 Discovery of Meaningful Associations in Genomic Data Using Partial Correlation Coefficients. Bioinformatics 20, 3565–3574. Dempster AP 1972 Covariance Selection. Biometrics 28, 157–175.
Dempster AP, Laird NM and Rubin DB 1977 Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 39, 1–39. Diestel R 2005 Graph Theory, 3rd edn. Springer. Dobra A, Hans C, Jones B, Nevins JR, Yao G and West M 2004 Sparse Graphical Models for Exploring Gene Expression Data. Journal of Multivariate Analysis 90, 196–212. Dudoit S and van der Laan MJ 2007 Multiple Testing Procedures with Applications to Genomics. Springer. Edwards DI 2000 Introduction to Graphical Modelling, 2nd edn. Springer. Efron B 2008 Microarrays, Empirical Bayes and the Two-Groups Model. Statistical Science 23(1), 1–47. Elidan G and Friedman N 2005 Learning Hidden Variable Networks: the Information Bottleneck Approach. Journal of Machine Learning Research 6, 81–127. Friedman J, Hastie T and Tibshirani R 2008 Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics 9, 432–441. Friedman N 1997 Learning belief networks in the presence of missing values and hidden variables. Proceedings of the 14th International Conference on Machine Learning (ICML97) 125–133. Friedman N 2004 Inferring Cellular Networks Using Probabilistic Graphical Models. Science 303, 799–805. Friedman N and Goldszmidt M 1996 Discretizing Continuous Attributes While Learning Bayesian Networks. Proceedings of the 13th International Conference on Machine Learning (ICML96) 157–165. Friedman N, Goldszmidt M and Wyner A 1999a Data Analysis with Bayesian Networks: a Bootstrap Approach. Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence 206–215. Friedman N, Nachman I and Pe´er D 1999b Learning Bayesian Network Structure from Massive Datasets: The ‘Sparse Candidate’ Algorithm. Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence 206–21. Friedman N, Linial M, Nachman I and Pe’er D 2000 Using Bayesian Networks to Analyze Gene Expression Data. Journal of Computational Biology 7, 601–620. Fujita A, Sato JR, Garay-Malpartida HM, Yamaguchi R, Miyano S, Sogayar MC and Ferreira CE 2007 Modeling Gene Expression Regulatory Networks with the Sparse Vector Autoregressive Model. BMC Systems Biology 1, 39. Geiger D and Heckerman D 1994 Learning Gaussian Networks. Technical Report MSR-TR-94-10. Microsoft Research. Goldberg D 1991 What Every Computer Scientist Should Know About Floating Point Arithmetic. ACM Computing Surveys 23(1), 5–48. G´omez-Villegas MA, Ma´ın P and Susi R 2008 Extreme inaccuracies in Gaussian Bayesian networks. Journal of Multivariate Analysis 99(9), 1929–1940. Grzegorczyk M and Husmeier D 2009 Non-Stationary Continuous Dynamic Bayesian Networks. Advances in Neural Information Processing Systems (NIPS) 22, 682–690. Harary F and Palmer EM 1973 Graphical Enumeration. Academic Press. Hastie T, Tibshirani R and Friedman J 2009 The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer. Hausser J and Strimmer K 2009 Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks. Journal of Machine Learning Resesearch 10, 1469–1484. Heckerman D, Geiger D and Chickering DM 1995 Learning Bayesian Networks: the Combination of Knowledge and Statistical Data. Machine Learning 20(3), 197–243. Husmeier D 2003 Sensitivity and Specificity of Inferring Genetic Regulatory Interactions from Microarray Experiments with Dynamic Bayesian Networks. Bioinformatics 19, 2271–2282. 
Kim JM, Jung YS, Sungur EA, Han KH, Park C and Sohn I 2008 A Copula Method for Modeling Directional Dependence of Genes. BMC Bioinformatics 9, 225. Koller D and Friedman N 2009 Probabilistic Graphical Models: Principles and Techniques. MIT Press. Korb K and Nicholson A 2009 Bayesian Artificial Intelligence, 2nd edn. Chapman and Hall. Kr¨amer N, Sch¨afer J and Boulesteix AL 2009 Regularized Estimation of Large-Scale Gene Association Networks using Graphical Gaussian Models. BMC Bioinformatics 10, 384. L¨ahdesm¨aki H and Shmulevich I 2008 Learning the Structure of Dynamic Bayesian Networks from Time Series and Steady State Measurements. Machine Learning 71, 185–217.
Introduction to Graphical Modelling 253 Lauritzen SL and Spiegelhalter D 1988 Local Computation with Probabilities on Graphical Structures and their Application to Expert Systems (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 50(2), 157–224. Legendre P 2000 Comparison of Permutation Methods for the Partial Correlation and Partial Mantel Tests. Journal of Statistical Computation and Simulation 67, 37–73. Lehmann EL and Romano JP 2005 Testing Statistical Hypotheses, 3rd edn. Springer. Li H and Gui J 2006 Gradient Directed Regularization for Sparse Gaussian Concentration Graphs, with Applications to Inference of Genetic Networks. Biostatistics 7, 302–317. Margaritis D 2003 Learning Bayesian Network Model Structure from Data. PhD thesis, School of Computer Science, Carnegie-Mellon University. Available as Technical Report CMU-CS-03-153. Meyer PE, Kontos K, Lafitte F and Bontempi G 2007 Information-Theoretic Inference of Large Transcriptional Regulatory Networks. EURASIP Journal on Bioinformatics and Systems Biology 2007, Article ID 79879. Mukherjee S and Speed TP 2008 Network Inference using Informative Priors. Proceedings of the National Academy of Sciences of the United States of America 105, 14313–14318. Neapolitan RE 2004 Learning Bayesian Networks. Prentice Hall. Opgen-Rhein R and Strimmer K 2007a From Correlation to Causation Networks: a Simple Approximate Learning Algorithm and its Application to High-Dimensional Plant Gene Expression Data. BMC Systems Biology 1, 37. Opgen-Rhein R and Strimmer K 2007b Learning Causal Networks from Systems Biology Time Course Data: an Effective Model Selection Procedure for the Vector Autoregressive Process. BMC Bioinformatics 8 (Suppl. 2), S3. Pearl J 1988 Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann. Pe˜na JM, B¨orkegren J and Tegn´er J 2005 Learning Dynamic Bayesian Network Models via Cross-Validation. Pattern Recognition Letters 26(14), 2295–2308. Perrin BE, Ralaivola L, Mazurie A, Bottani S, Mallet J and d’Alch´e Buc F 2003 Gene Networks Inference Using Dynamic Bayesian Networks. Bioinformatics 19(Suppl. 2), ii138–ii148. Rangel C, Angus J, Ghahramani Z, Lioumi M, Sotheran E, Gaiba A, Wild DL and Falciani F 2004 Modeling T-Cell Activation Using Gene Expression Profiling and State Space Modeling. Bioinformatics 20, 1361–1372. Rau A, Jaffr´ezic F, Foulley JL and Doerge RW 2010 An Empirical Bayesian Method for Estimating Biological Networks from Temporal Microarray Data. Statistical Applications in Genetics and Molecular Biology 9(1), Article 9. Rissanen J 2007 Information and Complexity in Statistical Modeling. Springer. Russell SJ and Norvig P 2009 Artificial Intelligence: a Modern Approach, 3rd edn. Prentice Hall. Sch¨afer J and Strimmer K 2005a A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics. Statistical Applications in Genetics and Molecular Biology 4, 32. Sch¨afer J and Strimmer K 2005b An Empirical Bayes Approach to Inferring Large-Scale Gene Association Networks. Bioinformatics 21, 754–764. Sch¨afer J and Strimmer K 2005c Learning Large-Scale Graphical Gaussian Models from Genomic Data. In Science of Complex Networks: from Biology to the Internet and WWW (eds Mendes JFF, Dorogovtsev SN, Povolotsky A, Abreu FV and Oliveira JG), vol. 776, AIP Conference Proceedings, pp. 263–276. American Institute of Physics. Schwarz G 1978 Estimating the Dimension of a Model. 
The Annals of Statistics 6(2), 461–464. Scutari M 2011 Measures of Variability for Graphical Models. PhD thesis, School of Statistical Sciences, University of Padova. Shimamura T, Imoto S, Yamaguchi R, Fujita A, Nagasaki M and Miyano S 2009 Recursive Regularization for Inferring Gene Networks from Time-Course Gene Expression Profiles. BMC Systems Biology 3, 41. Steuer R 2006 On the Analysis and Interpretation of Correlations in Metabolomic Data. Briefings in Bioinformatics 151, 151–158. Tian J and He R 2009 Computing Posterior Probabilities of Structural Features in Bayesian Networks. Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence 538–547. Toh H and Horimoto K 2002 Inference of a Genetic Network by a Combined Approach of Cluster Analysis and Graphical Gaussian Modeling. Bioinformatics 18, 287–297. Tsamardinos I, Brown LE and Aliferis CF 2006 The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning 65(1), 31–78.
Waddell PJ and Kishino H 2000 Cluster Inference Methods and Graphical Models Evaluated on NCI60 Microarray Gene Expression Data. Genome Informatics 11, 129–140. Werhli AV, Grzegorczyk M and Husmeier D 2006 Comparative Evaluation of Reverse Engineering Gene Regulatory Networks with Relevance Networks, Graphical Gaussian Models and Bayesian Networks. Bioinformatics 22, 2523–2531. Whittaker J 1990 Graphical Models in Applied Multivariate Statistics. John Wiley & Sons, Ltd. Wilkinson DJ 2009 Stochastic Modelling for Quantitative Description of Heterogeneous Biological Systems. Nature Reviews Genetics 10(2), 122–133.
12 Recovering Genetic Network from Continuous Data with Dynamic Bayesian Networks
Gaëlle Lelandais¹ and Sophie Lèbre²
¹ DSIMB, INSERM, University of Paris Diderot and INTS, Paris, France
² LSIIT, University of Strasbourg, France

12.1 Introduction
12.1.1 Regulatory networks in biology
A central goal of molecular biology is to understand the regulation of mRNA and protein synthesis and their reactions to external or internal cellular signals. Cell functioning requires the tight coordination of multiple regulatory systems, including mechanisms controlling the initiation of gene transcription, RNA splicing, mRNA transport, translation initiation, post-translational protein modifications and the degradation of mRNA and protein. Altogether, these regulatory controls between individual cellular components constitute large regulatory networks made of interlocking positive and negative feedback loops. These networks reliably and robustly coordinate the molecular and biochemical processes inside a cell, while remaining flexible in order to respond to physiological and environmental changes. Today, the elucidation of the dynamic behaviour of regulatory networks represents one of the most significant challenges in systems biology. Functional genomics has yielded experimental techniques that allow interactions between cellular components to be elucidated on a large scale. An example is the use of DNA microarrays to monitor gene expression (transcriptome experiments) or to identify DNA sequences that interact with a particular protein (ChIP-chip experiments). More recently, deep-sequencing technology makes it possible to precisely identify and quantify the DNA or RNA fragments present in a cell population. Although a large amount of experimental data exists, a comprehensive understanding of the gene regulation involved in cellular processes has not yet been achieved. One of the
main obstacles is the difficulty of choosing the appropriate experimental data, together with the appropriate computational approaches.
12.1.2 Objectives and challenges
The elucidation of regulatory networks is clearly a challenging problem. Biological processes are controlled by multiple interactions over time, between hundreds of cellular components (genes, mRNA, proteins, etc.). Ultimately, the goal of methodologies for inferring regulatory networks is to model and recover regulatory interactions such as 'protein i activates (or inhibits) the transcription of gene j'. It is also attractive to be able to decipher more complex scenarios such as auto-regulation, feed-forward or multicomponent regulatory loops. Regulatory networks are usually described mathematically by representing each gene (or gene product) as a node and the regulatory associations as edges between nodes. There exist many approaches to reconstruct regulatory networks. One of the main challenges is the difficulty of striking a balance between model complexity and the available experimental data. The ideal model must be sufficiently complex to describe the system accurately, while dealing with a number of cellular components that is extremely large compared with the number of experimental observations generally available. Initially, regulatory networks were described using static modelling approaches such as correlation networks, graphical Gaussian models or Bayesian networks. Whilst each of these approaches has its merits, none of them is sufficient in and of itself. On the one hand, each of these methodologies is of great interest when analysing expression dependency between genes. On the other hand, they implicitly assume that the topology of the network, i.e. the sets of nodes and edges, stays constant over time, whereas the changing nature of regulatory controls between cellular components is beyond doubt. The development of methodologies to infer temporal changes in biological networks therefore represents a further step toward a complete description of regulatory networks. In that respect, reverse engineering of gene regulatory networks using Dynamic Bayesian Network (DBN) models has received particular attention. In this chapter, procedures for DBN modelling from continuous data are first described. Then methodologies to infer changes in the structure of networks are introduced and, finally, results obtained on simulated data are presented.
12.2 Reverse engineering time-homogeneous DBNs
Up to now, many dynamic approaches have been proposed to model genetic networks, such as Boolean networks (Kauffman 1969; Akutsu et al. 1999), differential equations (Chen et al. 1999) or neural networks (Weaver et al. 1999). Among others, DBNs have received great interest in the field of systems biology. DBNs were first introduced for the analysis of genetic time series by Friedman et al. (1998) and Murphy and Mian (1999). In this section, Bayesian networks and DBNs are first introduced. Then a simple approach for modelling DBNs from continuous time series using a linear model, together with various reverse engineering procedures, is detailed. A significant advantage of using continuous data is that all the information carried by the data is used directly. In particular, the tortuous choice of a threshold for discretizing the data is not needed. For DBN inference approaches that are specific to discrete data, see Chapter 13.
12.2.1 Genetic network modelling with DBNs
12.2.1.1 Some theory about Bayesian networks
Let us first quickly recall some theory about Bayesian networks. Bayesian networks model directed relationships between genes (Friedman et al. 2000; De Jong 2002) and are one of the available approaches for
graphical modelling (see Chapter 11). Based on a probabilistic measure, a Bayesian network is a model representation defined by a directed acyclic graph (DAG), i.e. a graph G that does not contain cycles. Let us call the 'parents' of a node Xi in graph G, denoted by pa(Xi, G), the set of variables having an edge pointing towards the node Xi. A Bayesian network is entirely defined by a DAG G and the set of conditional probability distributions of each variable given its parents in G. To summarize, a stochastic process X admits a Bayesian network representation according to a DAG G whenever its probability distribution factorizes as a product of the conditional probability distributions of each variable given its parents in G, i.e.

P(X) = \prod_{i=1}^{p} P(X_i \mid pa(X_i, G)).    (12.1)
One of the major advantages of Bayesian networks is that conditional independence between variables can be derived from the 'moral graph'. The moral graph Gm is obtained from G by first marrying the parents, i.e. drawing an undirected edge between each pair of parents of each variable Xi, and then deleting the directions of the original edges in G. Conditional independencies are then derived from Gm as follows: whenever every path (succession of edges) from node X1 to node X3 intersects node X2 in the moral graph Gm, the variables X1 and X3 are conditionally independent given variable X2 [for more details see the directed global Markov property in Lauritzen (1996)]. Moreover, it should be noted that the interpretation of the edges and their directions must be done carefully. Some differing DAG structures can be equivalent in terms of dependence between the represented variables [see Figure 12.1(b)]. The interpretation of the edges in a DAG G defining a Bayesian network is not intuitive, but this modelling is still very useful for deriving conditional independencies in static modelling. However, the acyclicity constraint in static Bayesian networks is a serious restriction given the expected structure of genetic networks. Dynamic modelling methodologies, such as the DBNs presented in the next section, have the major advantage of allowing the modelling of cyclic motifs and of taking into account time dependencies between gene measurements. Moreover, as presented in the next subsection, the interpretation of the edges of a DBN is straightforward.
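The moralisation step can be sketched with a small adjacency-matrix computation in R; the four-node DAG below is an arbitrary illustration (not one of the chapter's examples) chosen so that two nodes share a child and therefore get 'married'.

```r
# Moral graph Gm of a DAG G given as an adjacency matrix ([i, j] = 1 means i -> j).
g <- c("X1", "X2", "X3", "X4")
dag <- matrix(0, 4, 4, dimnames = list(g, g))
dag["X1", "X3"] <- 1          # X1 -> X3
dag["X2", "X3"] <- 1          # X2 -> X3 (X1 and X2 are both parents of X3)
dag["X3", "X4"] <- 1          # X3 -> X4
moral <- (dag + t(dag)) > 0   # drop the directions of the original edges
for (child in seq_along(g)) { # 'marry the parents' of every node
  pa <- which(dag[, child] == 1)
  if (length(pa) > 1) moral[pa, pa] <- TRUE
}
diag(moral) <- FALSE
moral   # note the added undirected edge X1 - X2, since X1 and X2 share the child X3
```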
12.2.1.2 DBN modelling
In DBNs, each gene is not represented by a single node, but by several nodes respectively associated with each time point measurement. Also, the interactions are assumed to be time-delayed. Therefore, a dynamic network is obtained by unfolding in time an initial cyclic motif [Figure 12.1(a) and (c)]. Note that the direction according to time guarantees the acyclicity of this dynamic network, and the resulting DAG allows the definition of a Bayesian network (Lèbre 2009). In the DAG defining a DBN, an edge is drawn between two successive variables, for example from X1(t − 1) to X2(t) in Figure 12.1(c), whenever these two variables are conditionally dependent given the remaining past variables. This property is derived from the theory of graphical models for DAGs [see Lauritzen (1996) and Chapter 11]. The nature (activation or inhibition) of the regulation in the biological motif does not appear in the DAG; nonetheless it can be derived from the sign (positive or negative) of the model parameter estimates. Up to now, various DBN representations based on different probabilistic models have been proposed in the literature: discrete (Ong et al. 2002; Zou and Conzen 2005), multivariate autoregressive (AR) process (Opgen-Rhein and Strimmer 2007), State Space or Hidden Markov (Perrin et al. 2003; Rangel et al. 2004; Wu et al. 2004; Beal et al. 2005), and non-parametric additive regression (Imoto et al. 2002, 2003; Kim et al. 2004; Sugimoto and Iba 2004). See also Kim et al. (2003) for a review of such models. To summarize, sufficient conditions for a model to admit a DBN representation are presented below, and a concrete interpretation in terms of dependencies between variables is given by using the theory of graphical models for DAGs. The following results are detailed in Lèbre (2009).
Figure 12.1 Graphical representation for Bayesian networks, DBNs and time-varying DBNs. (a) Regulatory motif among three genes to model. Crucially, regulatory interactions do not persist over the whole time course considered here, but are turned ‘on’ and ‘off’ at different time points. The labels on the edges indicate at what times an edge points to or influences the expression of the target gene. (b) Because Bayesian networks are constrained to have a DAG structure, they cannot contain loops or cycles. Therefore the motif in (a) can only be imperfectly represented using a conventional Bayesian network formalism which does not take temporal ordering into account; if X3 is statistically independent of X1 provided X2 is known, two alternative Bayesian network representations can be given, P(X1 , X2 , X3 ) = P(X3 |X2 ).P(X2 |X1 ).P(X1 ) and P(X1 , X2 , X3 ) = P(X3 |X2 ).P(X1 |X2 ).P(X2 ). (c) For the sake of clarity, variable Xi (t) is denoted by Xit here. If time-course expression measurements are available, it is possible to unravel the feedback cycles and loops over time. Such time-homogenous DBNs represent the interactions by assuming that at each given time, all the parental nodes come from the previous time point. (d) Example of time-varying DBNs
Let us consider a DBN based on a DAG G [e.g. like the DAG of Figure 12.1(c)] which describes exactly the conditional dependencies between variables observed at successive time points (t − 1 and t) given all other variables observed at the earlier time point (t − 1).

Assumption 12.2.1 The stochastic process X(t) is first-order Markovian.

Assumption 12.2.2 For all t > 0, the random variables X(t) = (X1(t), . . . , Xi(t), . . . , Xp(t)) observed at time t are conditionally independent given the random variables X(t − 1) observed at the previous time t − 1.

Assumption 12.2.3 The set of measurements (Xi(1), . . . , Xi(n))_{i=1..p} of the genes under study forms a set of linearly independent vectors.

First, the observed process X is assumed to be first-order Markovian (Assumption 12.2.1). That is, given the history of the gene measurements (X(1), ..., X(t − 1)), the measurement for a gene at some time t only depends
on the gene measurements observed at the previous time (t − 1). Then the variables observed simultaneously are assumed to be conditionally independent given the past of the process (Assumption 12.2.2). In other words, time measurements are assumed to be close enough so that a gene measurement Xi(t) at time t is better explained by the previous time measurements X(t − 1) than by some measurements (Xj(t))_{j≠i} at time t. Briefly, Assumptions 12.2.1 and 12.2.2 allow the existence of a DBN representation according to a DAG G that only contains edges pointing from a variable observed at some time (t − 1) towards a variable observed at the next time t (no edges between simultaneously observed variables). All in all, in order to restrict the dimension, this DBN model assumes a constant time delay for all interactions (defined by the time point sampling). It is possible to add simultaneous interactions, or a longer time delay, by allowing the existence of edges between variables observed either at the same time t or with a longer time delay (i.e. from t − 2 to t). However, the dimension of the model increases exponentially with the number of authorized time delays, which can hardly be afforded given the number of time points. Finally, DAG G is unique whenever the set of measurements for the p genes is linearly independent (Assumption 12.2.3), that is whenever none of the profiles can be written as a linear combination of the others. This assumption is reasonable when the set of genes under study does not contain duplicates. Whenever these three assumptions are satisfied, the probability distribution of the process allows a DBN representation, as shown in the following theorem.

Theorem 12.2.1 Whenever Assumptions 12.2.1, 12.2.2 and 12.2.3 are satisfied, the probability distribution of the process allows a DBN representation according to DAG G whose edges describe exactly the conditional dependencies between successive variables (Xj(t − 1), Xi(t))_{i,j=1...p} given the past variables (Xk(t − 1))_{k=1...p}.

Note, however, that dynamic modelling is very dependent on the sampling of the time point measurements. An interaction that actually occurs at a time-scale shorter than the time sampling may not be detectable from the data, or could be misinterpreted. The choice of the time delay must therefore be made very carefully. In particular, if the time delay between two successive time points is too large, considering a static Bayesian network might be relevant. A large majority of genetic time series contain no or very few repeated measurement(s) for each gene at a given time point. Hence, to carry out an estimation, it is often assumed that the process is homogeneous across time (Assumption 12.2.4).

Assumption 12.2.4 The process is homogeneous across time: any edge is present during the whole process.

This amounts to considering that the system is governed by the same rules during the whole experiment. Then (n − 1) repeated measurements are observed for each gene at two successive time points. Note that this is a strong assumption, which is not always satisfied but is often used for the estimation procedure when the number of measurements is too small compared with the number of genes. Section 12.3 will introduce recent approaches for nonhomogeneous DBN inference, which allow the interactions to change with time.
12.2.2 DBN for linear interactions and inference procedures
12.2.2.1 Multivariate AR model
A common and simple particular case of the DBN modelling assumes linearity of the dependency relationships, i.e. a multivariate AR model. Let p be the number of observed genes and n the number of time measurements for each gene. In this study, the discrete-time stochastic process X = {Xi(t); 1 ≤ i ≤ p, 1 ≤ t ≤ n} is considered, taking real values and describing the measurements of the p genes at n time points. The gene measurements at time t are then assumed to satisfy:

\forall t \geq 2, \quad X(t) = A X(t-1) + B + \varepsilon(t) \quad \text{with} \quad \varepsilon(t) \sim N(0, \Sigma),    (12.2)
where X(t) = (Xi(t))_{1≤i≤p} and N(0, Σ) is the multivariate normal distribution centered at 0 with diagonal covariance matrix Σ. Note that diagonality of Σ ensures that the process describing the temporal evolution of gene measurements – here a first-order AR process – can be represented by a DAG as in Figure 12.1(c), i.e. no edges between nodes at the same time, and where the edges from time t − 1 to time t are defined by the set of non-zero coefficients in matrix A. Furthermore, the error in the measurement of gene i does not affect the measurements of the other genes, and off-diagonal elements in Σ can be set to 0. Based on Assumption 12.2.4, the coefficient matrix A = (a_{ij})_{1≤i,j≤p} — which is the adjacency matrix of the genetic network — and the column vector B = (b_i)_{1≤i≤p} — which is the baseline gene measurement that does not depend on the parent genes — are assumed constant in time. Moreover, conditional on the past of the process, the random vector X(t) only depends on the random vector observed at time (t − 1), so Assumption 12.2.1 is satisfied. Assumption 12.2.2 is satisfied whenever the error covariance matrix Σ is diagonal (Lèbre 2009). Considering non-correlated measurement errors between distinct genes is a strong assumption, especially since microarray data contain several sources of noise. Nevertheless, assuming Σ to be diagonal is still reasonable after a normalization procedure. As an illustration, any AR process whose error covariance matrix Σ is diagonal and whose matrix A has the following form (where the a notations refer to non-zero coefficients),

A = \begin{pmatrix} a_{11} & a_{12} & 0 \\ a_{21} & 0 & 0 \\ 0 & a_{32} & 0 \end{pmatrix}    (12.3)

admits a DBN representation according to the dynamic network of Figure 12.1(a) (p = 3). Thus the non-zero coefficients correspond to the edges of the network. For instance, according to the AR model defined by matrix A, the non-zero coefficient a_{12} stands for the edge from gene 2 toward gene 1 in the motif of Figure 12.1(a).
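To make the model concrete, the short R sketch below simulates a three-gene AR process with the sparsity pattern of Equation (12.3); the coefficient values, the baseline B and the noise level are arbitrary illustrative choices.

```r
# Simulate the first-order AR process of Equation (12.2) for p = 3 genes,
# with the zero pattern of matrix A in Equation (12.3).
set.seed(1)
p <- 3; n <- 50
A <- matrix(c(0.5, -0.7, 0,
              0.6,  0.0, 0,
              0.0,  0.8, 0), nrow = p, byrow = TRUE)  # row i: regression of gene i
B <- rep(0, p)
sigma <- 0.3                              # common noise standard deviation
X <- matrix(0, nrow = p, ncol = n)
X[, 1] <- rnorm(p)                        # arbitrary initial condition
for (t in 2:n) {
  X[, t] <- A %*% X[, t - 1] + B + rnorm(p, sd = sigma)
}
which(A != 0, arr.ind = TRUE)             # non-zero entries = edges j -> i of the network
```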
12.2.2.2 Inference procedures
This section is dedicated to the inference of a DBN representation for an AR model, which is directly applicable to continuous data. Recovering a DBN defining an AR model given by Equation (12.2) amounts to pointing out the non-zero coefficients of the AR matrix A. Based on the time homogeneity assumption (Assumption 12.2.4), the repeated time measurements can be used to perform linear regression for each gene i according to the linear regression model

X_i(t) = b_i + \sum_{j=1}^{p} a_{ij} X_j(t-1) + \varepsilon_i(t) \quad \text{where} \quad \varepsilon_i(t) \sim N(0, \sigma_i^2).    (12.4)

However, the standard theory for estimating the regression coefficients can be exploited only when n ≫ p (which ensures that the sample covariance matrix is positive definite) (Lauritzen 1996). The use of regularized estimators is then absolutely essential. The dimension reduction approaches discussed here improve estimation efficiency and allow the 'curse of dimensionality' inherent to genomic data (n ≪ p) to be handled. Murphy (2001) proposes several Bayesian structure learning procedures for Bayesian networks (static or dynamic) in the open-source Matlab package BNT (Bayes Net Toolbox); see Chapter 3 for an introduction to Bayesian inference. A standard procedure for network inference, first proposed by Meinshausen et al. (2006), is the Lasso (Least absolute shrinkage and selection operator). This constrained estimation procedure tends to produce some coefficients that are exactly zero. Variable selection is then straightforward: only non-zero coefficients define significant dependence relationships. Lasso estimation can be carried out with the LARS software, developed by Efron et al. (2004) for the R and Splus programming languages. The R package SIMoNe (Statistical Inference
for MOdular NEtworks) by Chiquet et al. (2009) implements various DBN inference methods based on the Lasso regression with an additional grouping effect for multiple data (including three variants: intertwined, group-Lasso and cooperative-Lasso). An efficient estimator of the covariance matrix can be obtained by 'shrinking' the empirical correlations between gene measurements towards 0 and the empirical variances towards their median. The shrinkage approach improves the global estimation precision of the regression coefficients in comparison with standard methods. Then, by ordering the edges according to decreasing coefficients, edge selection can be carried out. Multiple testing correction can be performed with the local False Discovery Rate (FDR) approach introduced by Schäfer and Strimmer (2005). 'Shrinkage' estimates of the regression coefficients can be derived via the R package GeneNet. Another powerful approach to infer DBNs, proposed in Lèbre (2009), is based on the consideration of first-order conditional dependencies. The R package G1DBN allows the inference of DBNs through a two-step procedure: first, recover the first-order partial dependence DAG G(1); and secondly, infer the full-order dependence DAG G (representing the DBN), which is included in G(1), by classic linear regression.
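As a minimal illustration of Lasso-based edge selection, the sketch below regresses each gene on all genes at the previous time point. The glmnet package is used here simply as one widely available Lasso implementation, standing in for the LARS, SIMoNe or G1DBN tools mentioned above; the data are simulated as in the previous sketch.

```r
# Gene-by-gene Lasso regression to recover the non-zero entries of A.
library(glmnet)
set.seed(1)
p <- 3; n <- 50
A <- matrix(c(0.5, -0.7, 0, 0.6, 0, 0, 0, 0.8, 0), nrow = p, byrow = TRUE)
X <- matrix(0, p, n); X[, 1] <- rnorm(p)
for (t in 2:n) X[, t] <- A %*% X[, t - 1] + rnorm(p, sd = 0.3)

X_past <- t(X[, 1:(n - 1)])   # predictors: measurements at time t-1
X_now  <- t(X[, 2:n])         # responses:  measurements at time t
A_hat <- matrix(0, p, p)
for (i in 1:p) {
  fit <- cv.glmnet(X_past, X_now[, i])                    # penalty chosen by cross-validation
  coefs <- as.matrix(coef(fit, s = "lambda.1se"))[, 1]    # intercept b_i followed by p coefficients
  A_hat[i, ] <- coefs[-1]
}
round(A_hat, 2)   # non-zero entries are the selected edges j -> i
```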
12.3 Going forward: how to recover structure changes with time
The time homogeneity assumption (Assumption 12.2.4) is useful for the inference of DBNs when the number of measurements is much smaller than the number of genes. However, it is very unlikely that a biological process remains homogeneous over time. In particular, when observing the response to some external stress (drug addition, glucose starvation, . . . ), the experimentalist is precisely looking for changes over time. In that case, inferring a time-homogeneous DBN makes it possible to point out the main interactions (which can still be recovered when considering all time measurements), but it would be more accurate to recover a time-varying network. This has been made possible by several recent developments, especially when the number of time measurements is larger than the number of genes. Indeed, thanks to technical developments, longer time series measurements have become available. The idea is to build a DBN which is divided into time segments: the network is homogeneous (see previous section) along each segment [see Figure 12.1(d)]. The definition of such a time-varying DBN then only requires the introduction of a set of time points (referred to as changepoints) delimiting the segments. The segment decomposition can be common to all variables or variable specific (each variable has its own segmentation, and a new segment starts when the incoming edges and/or coefficients change for the variable under consideration). In both cases, the dimension is much larger than for the time-homogeneous network described in the previous section: both the number of changepoints delimiting the segments and the number of edges within each segment remain unknown. Serious attempts to reconstruct dynamic networks whose topology changes with time started with Yoshida et al. (2005) and Talih and Hengartner (2005) in the financial field, where long time series are often available (daily measurements over months or years). Although promising, these approaches assume that there is a fixed (user-specified) number of distinct networks or segments, and the switching between segments is modelled via a stochastic transition matrix that requires the estimation of many parameters. More recently, methods in which the number of distinct segments is determined a posteriori have been proposed, but there are still some limitations. Fujita et al. (2007) developed a multivariate AR model to estimate time-varying genetic networks. Notably, only the values of the network parameters change over time, meaning that the global topology of the network remains constant. Xuan and Murphy (2007) introduced an iterative procedure based on a similar modelling, which switches between a convex optimization approach for determining a suitable candidate graph structure and a dynamic programming algorithm for calculating the segmentation of the time course into distinct segments, i.e. the sets of successive time points for which the graph structure remains unchanged.
This time, the number of segments is explicitly determined by the inference procedure, but a mathematical assumption is needed: the graph structure must be decomposable, which cannot be verified in practice. Robinson and Hartemink (2008) introduced a Markov chain Monte Carlo (MCMC) sampler (see Chapter 3) for the inference of nonstationary DBNs. The latter offers the attractive feature that the network structure within a temporal segment depends on the structure of the contiguous segments. However, a preliminary discretization of the data is required, which incurs an inevitable loss of information. Moreover, as for all the approaches cited above, the time segmentation is common to all genes under study, meaning that all the genes of the network change their inputs simultaneously. In reality, however, it would rather be expected that each gene (or at most a subset of genes) has its own characteristic pattern. To that end, Ahmed and Xing (2009) introduced a machine learning algorithm (called TESLA) to infer time-evolving networks (that are gene specific), by solving a set of temporally smoothed l1-regularized logistic regression problems via convex optimization techniques. However, the optimization procedure depends on two penalty coefficients (chosen to preserve network sparsity and smoothness of changes), so that many runs are needed to compare the best results obtained with each possible pair of penalty coefficients. This section introduces the ARTIVA (AutoRegressive TIme VArying) procedure of Lèbre et al. (2010), which is designed to address the issues raised above:
• the changepoint vector is gene specific;
• continuous data can be treated directly (discretization is not mandatory);
• no prior information is needed (the number of changepoints is unknown and penalty coefficients are not necessary).
A combination of efficient and robust methods is used: DBNs to model directed interactions between genes, and reversible-jump Markov chain Monte Carlo (RJMCMC) for inferring simultaneously the times when the network changes and the resulting network topologies.
12.3.1 ARTIVA network model
In order to strike a balance between model refinement and the amount of information available to infer the model parameters, the ARTIVA model delimits temporal segments for each gene where the influence factors and their weights can be assumed homogeneous.
12.3.1.1 Multiple changepoints
Based on the time-homogeneous AR model defined in Section 12.2.1, the ARTIVA network model is defined by adding a segmentation vector for each gene. For each gene i, an unknown number k_i of changepoints defines k_i + 1 non-overlapping segments. Segment h = 1, .., k_i + 1 starts at changepoint ξ_i^{h−1} and stops before ξ_i^h, where ξ_i = (ξ_i^0, ..., ξ_i^h, ..., ξ_i^{k_i+1}) with ξ_i^{h−1} < ξ_i^h. To delimit the bounds, ξ_i^0 = 2 and ξ_i^{k_i+1} = n + 1. The set of changepoints is denoted by ξ = {ξ_i}_{1≤i≤p}, where each vector ξ_i has length |ξ_i| = k_i + 2. This changepoint process induces a partition of the time series, with different structures M_i^h associated with the different segments h ∈ {1, . . . , k_i + 1}. Identifiability is satisfied by ordering the changepoints based on their position in the time series.
12.3.1.2 Regression model
For all genes i, the random variable X_i(t) refers to the measurement of gene i at time t. Within any segment h, the measurement of gene i depends on the p gene measurements at the previous time point through a regression model defined by (i) a set of s_i^h parents denoted by M_i^h ⊆ {1, . . . , p}, |M_i^h| = s_i^h, and (ii) a set of parameters ((a_{ij}^h)_{j∈0..p}, σ_i^h); a_{ij}^h ∈ R, σ_i^h > 0. For all j ≠ 0, a_{ij}^h = 0 if j ∉ M_i^h. For all genes i and for all time points t in segment h (ξ_i^{h−1} ≤ t < ξ_i^h), the random variable X_i(t) depends on the p variables {X_j(t − 1)}_{1≤j≤p} according to

X_i(t) = a_{i0}^h + \sum_{j \in M_i^h} a_{ij}^h X_j(t-1) + \varepsilon_i(t)    (12.5)

where the noise ε_i(t) is assumed to be Gaussian with mean 0 and variance (σ_i^h)^2, ε_i(t) ∼ N(0, (σ_i^h)^2). Let us define a_i^h = (a_{ij}^h)_{j∈0..p}.
12.3.2 ARTIVA inference procedure and performance evaluation
When recovering a time-varying network, the dimension of the model remains unknown: not only is the number of edges within each segment unknown, but so is the number of changepoints. Under such circumstances, a natural approach is to use the RJMCMC procedure introduced by Green (1995) (see Chapter 3). This section introduces the ARTIVA network model inference procedure proposed in Lèbre et al. (2010), which combines the RJMCMC procedure for regression model selection of Andrieu and Doucet (1999) with the multiple changepoint process sampling of Green (1995).
12.3.2.1 Priors
To that end, prior distributions must be set for the various parameters. The k_i + 1 segments are delimited by k_i changepoints. In order to encourage sparsity, the number of changepoints k_i is distributed a priori as a truncated Poisson random variable with mean λ and maximum k = n − 2:

P(k_i \mid \lambda) \propto \frac{\lambda^{k_i}}{k_i!}\, \mathbf{1}_{\{k_i \leq k\}}.    (12.6)
Conditional on k_i changepoints, the changepoint position vector ξ_i = (ξ_i^0, ξ_i^1, ..., ξ_i^{k_i+1}) takes non-overlapping integer values, which are taken to be uniformly distributed a priori: there are (n − 2) possible positions for the k_i changepoints, and the prior for vector ξ_i is uniform over all possible choices of k_i changepoint positions among them. For all genes i and all segments h, the number s_i^h of parents for node i follows a truncated Poisson distribution with mean Λ and maximum s = 5. Following Andrieu and Doucet (1999), conditional on s_i^h, the prior for the parent set M_i^h is a uniform distribution over all parent sets with cardinality s_i^h. The overall prior on the network structures is given by marginalization:

P(M_i^h \mid \Lambda) = \sum_{s_i^h = 1}^{s} P(M_i^h \mid s_i^h)\, P(s_i^h \mid \Lambda).    (12.7)
Conditional on the parent set M_i^h of size s_i^h, the s_i^h + 1 regression coefficients, denoted by a_{M_i^h} = (a_{i0}^h, (a_{ij}^h)_{j∈M_i^h}), are assumed zero-mean multivariate Gaussian with covariance matrix (σ_i^h)^2 Σ_{M_i^h}. Finally, the conjugate prior for the variance (σ_i^h)^2 is the inverse gamma distribution, P((σ_i^h)^2) = IG(υ_0, γ_0). When no prior knowledge is available, the hyper-hyperparameters for shape, υ_0 = 0.5, and scale, γ_0 = 0.05, can be set to fixed values that give a vague distribution. The terms λ and Λ can be interpreted as the expected numbers of changepoints and parents, and are themselves sampled from a gamma distribution.
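The sketch below draws one gene's segmentation and per-segment parent sets from these priors; λ and Λ are fixed to illustrative values rather than sampled from their gamma hyperpriors, and, following Equation (12.7), the number of parents per segment is drawn between 1 and 5.

```r
# Draw a gene-specific segmentation and parent sets from the ARTIVA priors.
set.seed(1)
n <- 30; p <- 10                  # time points and candidate parent genes
lambda <- 1; Lambda <- 1          # expected numbers of changepoints and parents (illustrative)
k_max <- n - 2; s_max <- 5
# truncated Poisson prior for the number of changepoints k_i
k_prob <- dpois(0:k_max, lambda); k_prob <- k_prob / sum(k_prob)
k_i <- sample(0:k_max, 1, prob = k_prob)
# changepoint positions drawn uniformly among the n - 2 admissible time points
xi_i <- sort(sample(3:n, k_i))
seg_bounds <- cbind(start = c(2, xi_i), end = c(xi_i, n + 1))
# for each segment, a truncated Poisson number of parents and a uniform parent set
s_prob <- dpois(1:s_max, Lambda); s_prob <- s_prob / sum(s_prob)
parents <- lapply(seq_len(k_i + 1), function(h) {
  s_h <- sample(1:s_max, 1, prob = s_prob)
  sort(sample(1:p, s_h))
})
seg_bounds
parents
```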
12.3.2.2 RJMCMC inference
Starting from time-course genetic data, ARTIVA performs a gene-by-gene analysis and infers simultaneously (i) the topology of the genetic network and (ii) how it changes over time.
From the prior, the random variable X_i^h for gene i in segment h follows a Gaussian distribution. Let us denote the observed time series by x. From Bayes' theorem, the posterior distribution P(k, ξ, M, a, σ, λ, Λ | x) is given by the following equation, where all prior distributions have been defined above:

P(k, \xi, M, a, \sigma, \lambda, \Lambda \mid x) \propto P(\lambda) P(\Lambda) \prod_{i=1}^{p} \Big[ P(k_i \mid \lambda)\, P(\xi_i \mid k_i) \prod_{h=1}^{k_i+1} P(M_i^h \mid \Lambda)\, P([\sigma_i^h]^2)\, P(a_i^h \mid M_i^h, [\sigma_i^h]^2)\, P(x_i^h \mid \xi_i^{h-1}, \xi_i^h, M_i^h, a_i^h, [\sigma_i^h]^2) \Big]    (12.8)

An attractive feature of the chosen model is that the marginalization over the parameters a and σ in the posterior distribution of (12.8) is analytically tractable:

P(k, \xi, M, \lambda, \Lambda \mid x) = \int P(k, \xi, M, a, \sigma, \lambda, \Lambda \mid x)\, da\, d\sigma    (12.9)
= P(\lambda) P(\Lambda) \prod_{i=1}^{p} \int P(k_i, \xi_i, M_i, a_i, \sigma_i \mid \lambda, \Lambda, x)\, da_i\, d\sigma_i    (12.10)
= P(\lambda) P(\Lambda) \prod_{i=1}^{p} P(k_i, \xi_i, M_i \mid \lambda, \Lambda, x).    (12.11)
Indeed, for each gene i, the joint posterior distribution for the parameters (k_i, ξ_i, M_i, a_i, σ_i), conditional on the hyperparameters (λ, Λ) and data x, is integrated over the parameters a_i (normal distribution) and σ_i (inverse gamma distribution) to obtain an expression for the posterior density of the parameters describing the network structure (k_i, ξ_i, M_i) conditional on the hyperparameters (λ, Λ) and data x [see Lèbre et al. (2010) for computational details]. The number of changepoints and their locations, k, ξ, the network structure M and the hyperparameters λ, Λ can then be sampled from the posterior distribution P(k, ξ, M, λ, Λ | x) with an RJMCMC scheme, which is outlined in Figure 12.2. The ARTIVA procedure is based on four moves: changepoint birth, changepoint death, changepoint shift and network structure change within segments. The latter move is adapted from Andrieu and Doucet (1999) and is divided into three possible moves: edge birth, edge death and regression parameter update. Each move is accepted with a certain probability [see Lèbre et al. (2010) for a complete description], leading to an estimation of the posterior distribution of the ARTIVA model. Iterations must be pursued until convergence is obtained. Note that the generation of the regression model parameters (a_i^h, σ_i^h) is optional and only used when an estimation of their posterior distribution is desired. Indeed, a changepoint birth or death acceptance is performed without generating the regression model parameters for the modified segment; the acceptance probability of the move thus does not depend on the regression model parameters but only on the network topology in the segments delimited by the changepoint involved in the move.

Figure 12.2 Schematic illustration of the ARTIVA procedure for recovering an ARTIVA network. With probabilities b, d and v, the birth, death or shift of a changepoint (CP), respectively, is proposed; with probability w the update of the regression model describing interactions for a gene within a segment is considered. Varying the number of changepoints or the number of edges (network topology) corresponds to a change in the dimension of the state-space and is dealt with by using Green's RJ-MCMC formalism (Green 1995). Proposed shifts in changepoint positions are accepted according to a standard Metropolis–Hastings step. Because of conservation of probability, the probabilities of choosing each move satisfy b + d + v + w = 1, as do the probabilities of the three within-segment moves.

12.3.2.3 Performance evaluation
In order to evaluate the performance of the ARTIVA procedure, simulations are run to assess the impact of three major factors on the algorithm's performance: noise in the data, minimal length of segments, and number of proposed parent genes. The simulation procedure for a given target gene involves three main steps. First, the structure of the dynamic network is defined. This consists of randomly setting the number and the localization of the changepoints delimiting the segments. Then the parent genes and the corresponding coefficients are chosen for each segment. Once the network is defined, synthetic data can be generated from this network model (a small sketch following this protocol is given after the list below). The values of the parent genes are first generated randomly (uniformly drawn from [−2, −0.1] ∪ [0.1, 2]) and subsequently used to calculate the target gene value according to the AR model presented in Section 12.3.1. Noise is finally added (to represent experimental variability). Because the simulations are based on the assumptions of the ARTIVA model (an AR model), correct results are expected under ideal conditions (such as the absence of noise). Therefore, this simulation protocol evaluates the ARTIVA performance and studies the influence of the following parameters:
• The amount of noise in the data: for all segments h of gene i, the noise ε_i(t) is drawn from a Gaussian distribution N(0, (σ_i^h)^2) with standard deviation σ_i^h ranging from 0.2 to 1.8.
• The size of the temporal segments: segment sizes vary from 10 to 60 time points.
• The number of possible parent genes.
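A minimal R sketch of this three-step protocol for a single target gene follows; the number of changepoints, the number of parents per segment and the way the parent profiles are generated are illustrative choices, while the coefficient range follows the text above.

```r
# Simulate one target gene from a piecewise-stationary AR model:
# (1) draw changepoints, (2) draw segment-specific parents and coefficients,
# (3) generate the target values and add Gaussian noise.
set.seed(1)
n <- 100; p <- 20; sigma <- 0.5
xi <- c(2, sort(sample(10:(n - 10), 2)), n + 1)          # 2 changepoints -> 3 segments
parent_data <- matrix(runif(p * n, -1, 1), nrow = p)     # parent gene profiles (illustrative)
target <- numeric(n)
for (h in 1:(length(xi) - 1)) {
  pa   <- sample(1:p, 3)                                           # parents of segment h
  coef <- runif(3, 0.1, 2) * sample(c(-1, 1), 3, replace = TRUE)   # coefficients in [-2,-0.1] U [0.1,2]
  for (t in xi[h]:(xi[h + 1] - 1)) {
    target[t] <- sum(coef * parent_data[pa, t - 1]) + rnorm(1, sd = sigma)
  }
}
plot(target, type = "l", xlab = "time", ylab = "simulated target gene")
```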
The ability of the ARTIVA procedure to recover changepoints was evaluated via the Positive Predictive Value (PPV) and the Sensitivity,

\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad \mathrm{Sensitivity} = \frac{TP}{TP + FN},    (12.12)
80
100
40 20
0 0.5
1.0
1.5
Noise standard deviation
Changepoint PPV Changepoint Sensitivity Model PPV Model Sensitivity
0
40 20
60
60
80
80 60 40 20 0
Sensitivity / PPV (%)
100
Handbook of Statistical Systems Biology 100
266
10
20
30
40
Segment size (measurements #)
5
15
25
35
# of possible parents
Figure 12.3 Performance of ARTIVA on synthetic data. The default values of the parameters used for the simulation study are: number of time points n = 100; number of changepoints k ∼ U({1, .., 3}); maximal number of edges = 5; parent to target coefficient ∼ U([−2, −0.1] ∪ [0.1, 2]); segment size ∼ U({10, .., n}); noise ∼ N(0, 0.5); total number of simulations = 200 for each condition. The values of noise intensity, segment size and number of parent genes change according to the parameter under study (all other parameters were set to default). In each condition, the ability of ARTIVA to detect all true phase changepoints and model edges (Sensitivity) and to detect only true positives (PPV) was calculated. Overall, this simulation study allows us to gain confidence in ARTIVA results (around 80% PPV and Sensitivity) for a given set of parameters (noise standard error on the order of the mean value of the regression coefficients, more than 10 measurements in a segment and fewer than 20 parent genes)
length = 100 time points were randomly generated. All other parameters were set to their default values (see Figure 12.3). The changepoint sensitivity is greater than 80% when the noise standard error reaches σi = 1. As noise increases further, the ability of the algorithm to recover changepoints decreases in terms of sensitivity, but still, the changepoint sensitivity remains greater than 70% when the noise standard deviation reaches σi = 1.2 (a value that is larger than the mean value of the regression coefficients, uniformly sampled from [−2; −0.1] ∪ [0.1; 2]). Nevertheless, the changepoint sensitivity is still greater than 80% even when noise reaches σi = 0.8. The number of measurements for each segment also plays an important role for the changepoint detection sensitivity. Indeed, during a segment reduced to a single time point, there are only r repeated measurements to estimate the AR models. Interestingly, the ARTIVA algorithm here succeeds in finding the correct segmentation with a sensitivity value of 79% for a segment of size 10 time points (default noise standard error σi = 0.5). With segments of size 15, the changepoint sensitivity is greater than 90%. For all noise levels considered here the changepoint PPV is greater than 95%; furthermore, changepoint PPV appears to be stable and not to be affected by the segment size either. The edge detection was evaluated when the correct changepoint segmentation was recovered. Once the correct changepoints are recovered, neither noise nor short segments appear to strongly affect the detection
of edges. The edge sensitivity deteriorates for extreme situations only: it drops to about 50% when the segment length is 25 or when the number of proposed parents is 20. In all other cases, the edge sensitivity is greater than 75% and the edge PPV is greater than 95%. Simulation studies such as the one performed here do, of course, only provide partial insight into an algorithm's performance and robustness. They are nevertheless essential to gain confidence in the performance of novel algorithms and to develop an understanding of their likely limitations. Together these results serve to illustrate the robustness of the ARTIVA inference procedure. In particular, ARTIVA can deal with some of the generic problems encountered in real experimental data: it still performs well when the noise standard deviation is on the order of the mean value of the regression coefficients, when the number of measurements per segment is reduced to 10 or when the number of possible parents reaches 20. At some point the ARTIVA algorithm misses some changepoints, but the PPV remains very large, meaning that great confidence can be given to the changepoints with a high posterior probability. Also, in a recent study (Husmeier et al. 2010), the ARTIVA algorithm outperformed state-of-the-art approaches for time-varying DBN inference (Robinson and Hartemink 2008; Ahmed and Xing 2009) at changepoint detection in real data for the muscle development of Drosophila melanogaster.
12.4 Discussion and Conclusion
The ARTIVA approach allows the description and reverse engineering of the dynamic aspects of molecular networks. Such time-varying networks provide a middle ground between networks homogeneous in time and explicit dynamical models. Inferring such systems is a considerable statistical challenge. In contrast to classical network reverse engineering approaches such as time homogeneous DBNs, ARTIVA also allows the construction of more complex hypotheses where interactions may depend on time. As no particular constraint is imposed on the changepoint positions or on the succession of network topologies across segments, the ARTIVA model appears to be highly flexible. The results are not a priori directed toward any particular interactions between genes. This flexibility can be extremely valuable, especially when no information regarding the studied biological process is available.

The rapid accumulation of data obtained with different experimental approaches provides the opportunity to acquire a more comprehensive picture of all the interactions between cellular components. To understand the biology of the studied systems better, the trend is clearly towards the aggregation of multiple sources of information. A natural extension of the ARTIVA procedure is to incorporate data originating from different sources in the model. In particular, protein/DNA interaction data (ChIP-chip or ChIP-seq experiments) could be exploited by replacing the uniform prior for the edges with a prior favouring edges that correspond to the experimentally identified interactions (Werhli and Husmeier 2007; Mukherjee and Speed 2008).

Also, ARTIVA assumes independent network topologies within successive segments and can identify very different interactions between two segments, even if the time delay between the segments is very short. This assumption is appropriate in the case of biological models in which transcriptional regulations are highly dynamic. However, when considering systems that evolve more smoothly, or in the case of datasets with a small number of time points, incorporating a regularization scheme into ARTIVA in order to favour only slight changes from one segment to the next can improve the algorithm performance. Such an approach has been initiated by Robinson and Hartemink (2008) for discretized data and in Grzegorczyk and Husmeier (2009) with a regularization scheme between model coefficients within segments (based on a common network structure for all segments). For an introduction to the use of prior knowledge in a Bayesian scheme, see Chapter 13. More recently, inter-time information sharing between segments was added to the ARTIVA procedure in Husmeier et al. (2010). Various segment information sharing schemes are currently being studied and improvements are still ongoing.
References
Ahmed A and Xing EP 2009 Recovering time-varying networks of dependencies in social and biological studies. Proceedings of the National Academy of Sciences of the United States of America 106(29), 11878–11883.
Akutsu T, Miyano S and Kuhara S 1999 Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pacific Symposium on Biocomputing, pp. 17–28.
Andrieu C and Doucet A 1999 Joint Bayesian model selection and estimation of noisy sinusoids via reversible jump MCMC. IEEE Transactions on Signal Processing 47(10), 2667–2676.
Beal M, Falciani F, Ghahramani Z, Rangel C and Wild D 2005 A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics 21, 349–356.
Chen T, He H and Church G 1999 Modelling gene expression with differential equations. Pacific Symposium on Biocomputing, pp. 29–40.
Chiquet J, Smith A, Grasseau G, Matias C and Ambroise C 2009 SIMoNe: Statistical Inference for MOdular NEtworks. Bioinformatics 25(3), 417–418.
De Jong H 2002 Modeling and simulation of genetic regulatory systems: a literature review. Journal of Computational Biology 9, 67–103.
Efron B, Hastie T, Johnstone I and Tibshirani R 2004 Least angle regression. Annals of Statistics 32(2), 407–499.
Friedman N, Linial M, Nachman I and Pe'er D 2000 Using Bayesian networks to analyse expression data. Journal of Computational Biology 7(3–4), 601–620.
Friedman N, Murphy K and Russell S 1998 Learning the structure of dynamic probabilistic networks. Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 139–147.
Fujita A, Sato J, Garay-Malpartida H, Morettin P, Sogayar M and Ferreira C 2007 Time-varying modeling of gene expression regulatory networks using the wavelet dynamic vector autoregressive method. Bioinformatics 23(13), 1623–1630.
Green PJ 1995 Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732.
Grzegorczyk M and Husmeier D 2009 Non-stationary continuous dynamic Bayesian networks. In Advances in Neural Information Processing Systems (NIPS), vol. 22 (eds Bengio Y, Schuurmans D, Lafferty J, Williams CKI and Culotta A), NIPS Foundation, pp. 682–690.
Husmeier D, Dondelinger F and Lèbre S 2010 Inter-time segment information sharing for non-homogeneous dynamic Bayesian networks. In Advances in Neural Information Processing Systems, vol. 23 (eds Lafferty J, Williams CKI, Shawe-Taylor J, Zemel RS and Culotta A), NIPS Foundation, pp. 901–909.
Imoto S, Goto T and Miyano S 2002 Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Pacific Symposium on Biocomputing 7, 175–186.
Imoto S, Kim S, Goto T, Aburatani S, Tashiro K, Kuhara S and Miyano S 2003 Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. Journal of Bioinformatics and Computational Biology 2, 231–252.
Kauffman S 1969 Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology 22, 437–446.
Kim S, Imoto S and Miyano S 2003 Inferring gene networks from time series microarray data using dynamic Bayesian networks. Briefings in Bioinformatics 4(3), 228.
Kim S, Imoto S and Miyano S 2004 Dynamic Bayesian network and nonparametric regression for nonlinear modeling of gene networks from time series gene expression data. Biosystems 75(1–3), 57–65.
Lauritzen SL 1996 Graphical Models. Oxford University Press.
Lèbre S 2009 Inferring dynamic genetic networks with low order independencies. Statistical Applications in Genetics and Molecular Biology 8(1), 1–38.
Lèbre S, Becq J, Devaux F, Lelandais G and Stumpf M 2010 Statistical inference of the time-varying structure of gene-regulation networks. BMC Systems Biology 4(130), 1–16.
Meinshausen N and Bühlmann P 2006 High dimensional graphs and variable selection with the lasso. Annals of Statistics 34, 1436–1462.
Mukherjee S and Speed TP 2008 Network inference using informative priors. Proceedings of the National Academy of Sciences of the United States of America 105(38), 14313–14318.
Murphy K 2001 The Bayes Net Toolbox for MATLAB. Computing Science and Statistics 33, 331–350.
Murphy K and Mian S 1999 Modelling gene expression data using dynamic Bayesian networks. Technical Report, MIT Artificial Intelligence Laboratory.
Ong IM, Glasner JD and Page D 2002 Modelling regulatory pathways in E. coli from time series expression profiles. Bioinformatics 18(Suppl. 1), S241–S248.
Opgen-Rhein R and Strimmer K 2007 Learning causal networks from systems biology time course data: an effective model selection procedure for the vector autoregressive process. BMC Bioinformatics 8(Suppl. 2), S3.
Perrin BE, Ralaivola L, Mazurie A, Bottani S, Mallet J and d'Alché-Buc F 2003 Gene networks inference using dynamic Bayesian networks. Bioinformatics 19(Suppl. 2), S138–S148.
Rangel C, Angus J, Ghahramani Z, Lioumi M, Sotheran E, Gaiba A, Wild DL and Falciani F 2004 Modeling T-cell activation using gene expression profiling and state-space models. Bioinformatics 20(9), 1361–1372.
Robinson J and Hartemink A 2008 Non-stationary dynamic Bayesian networks. In Advances in Neural Information Processing Systems, vol. 21 (eds Koller D, Schuurmans D, Bengio Y and Bottou L), NIPS Foundation, pp. 1369–1376.
Schäfer J and Strimmer K 2005 A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology 4(1), Article 32.
Sugimoto N and Iba H 2004 Inference of gene regulatory networks by means of dynamic differential Bayesian networks and nonparametric regression. Genome Informatics 15(2), 121–130.
Talih M and Hengartner N 2005 Structural learning with time-varying components: tracking the cross-section of financial time series. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(3), 321–341.
Weaver D, Workman CT and Stormo GD 1999 Modelling regulatory networks with weight matrices. Proceedings of the Pacific Symposium on Biocomputing, pp. 112–123.
Werhli A and Husmeier D 2007 Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge. Statistical Applications in Genetics and Molecular Biology 6, Article 15.
Wu FX, Zhang WJ and Kusalik AJ 2004 Modeling gene expression from microarray expression data with state-space equations. Pacific Symposium on Biocomputing, pp. 581–592.
Xuan X and Murphy KP 2007 Modeling changing dependency structure in multivariate time series. Proceedings of the ACM International Conference Series 227, 1055–1062.
Yoshida R, Imoto S and Higuchi T 2005 Estimating time-dependent gene networks from time series microarray data by dynamic linear models with Markov switching. Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference, pp. 289–298.
Zou M and Conzen SD 2005 A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 21(1), 71–79.
13 Advanced Applications of Bayesian Networks in Systems Biology
Dirk Husmeier1, Adriano V. Werhli2 and Marco Grzegorczyk3
1 Biomathematics and Statistics Scotland, Edinburgh, UK
2 Centro de Ciências Computacionais, Universidade Federal do Rio Grande, Rio Grande, Brazil
3 Department of Statistics, TU Dortmund University, Dortmund, Germany
13.1 Introduction
Bayesian networks have received increasing attention from the computational biology community as models of gene regulatory networks, following up on pioneering work by (8) and (17). Several tutorials on Bayesian networks have been published (27;19;21). We therefore only qualitatively recapitulate some aspects that are of relevance to the present chapter, and refer the reader to the above tutorials for a thorough and more rigorous introduction.
13.1.1 Basic concepts
The structure M of a Bayesian network is defined by a directed acyclic graph (DAG) indicating how different variables of interest, X1 , . . . , XN , represented by nodes,‘interact’. The word ‘interact’ has a causal connotation, which is ultimately of interest to the biologist, but has to be taken with caution in this context, as explained shortly. The edges of a Bayesian network are associated with conditional probabilities, defined by a functional family and their parameters. The interacting quantities are associated with random variables, which represent some measured quantities of interest, like relative gene expression levels or protein concentrations. We denote the set of all the measurements of all the random variables as the data, represented by the letter D. As a consequence of the acyclicity of the network structure, the joint probability of all the random variables can be factorized into a product of lower-complexity conditional probabilities according to conditional
independence relations defined by the graph structure M. Under certain regularity conditions, the parameters associated with these conditional probabilities can be integrated out analytically. This allows us to compute the marginal likelihood or evidence P(D|M), which captures how well the network structure M explains the data D. More formally, denote by M a DAG (network structure) composed of the interacting variables (nodes) X1 , . . . , XN . An edge pointing from Xi to Xj indicates that the realization of Xj is conditionally dependent on the realization of Xi (and vice versa). The parent node set of node Xn in M, πn = πn (M), is the set of all nodes from which an edge points to node Xn in M. The conditional dependence of Xn on πn is described by the conditional probability (density) P(Xn |πn , θ n,πn ), which lies in a chosen function family, and depends on the parameters θ n,πn . Consider a set of experimental conditions t, 1 ≤ t ≤ T . Let Xn (t) represent the realization of Xn in experimental condition t. We are given a dataset D, where Dn,t and D(πn ,t) are the tth realizations Xn (t) and πn (t) of Xn and πn , respectively. The DAG structure of a Bayesian network defines the following unique and unambiguous expansion of the joint probability: P(D|M, θ) =
∏_{n=1}^{N} ∏_{t=1}^{m} P(Xn(t) = Dn,t | πn(t) = D(πn,t), θn,πn)    (13.1)
where θ is the total parameter vector, composed of node-specific subvectors θn,πn, which specify the local conditional distributions in the factorization. Let P(θ) denote the prior distribution on the parameters. We assume a priori parameter independence: P(θ|M) = ∏n P(θn,πn|M). We also assume parameter modularity: P(θn,πn|M) = P(θn,πn|M′) := P(θn,πn) for two network structures M and M′, in which Xn has the same parents πn, but which otherwise may be different. From Equation (13.1) and under the assumptions of parameter independence and parameter modularity, the marginal likelihood is given by

P(D|M) = ∫ P(D|M, θ) P(θ|M) dθ = ∏_{n=1}^{N} Ψ(Dn^πn)    (13.2)

Ψ(Dn^πn) = ∫ ∏_{t=1}^{m} P(Xn(t) = Dn,t | πn(t) = D(πn,t), θn,πn) P(θn,πn) dθn,πn    (13.3)
where Dπn n := {(Dn,t , Dπn ,t ) : 1 ≤ t ≤ m} is the subset of data pertaining to node Xn and parent set πn = πn (M). For certain choices of the conditional distribution P(Xn |πn , θ n,πn ) and the prior distribution P(θn,πn ), the integral in Equation (13.3) has a closed form solution. One choice is the multinomial distribution for P(Xn |πn , θ n,πn ), and a conjugate Dirichlet prior for P(θn,πn ). The detailed derivations can be found in (19), and the resulting expression is called the BDe score. The other option is to choose P(Xn |πn , θ n,πn ) to be a linear Gaussian distribution, and P(θ n,πn ) to be the corresponding conjugate prior, i.e. a normal Wishart distribution. Details of the derivations are available in (9), and the resulting expression is called the BGe score. Note that the BDe score requires a discretization of the data, which will inevitably result in some information loss. The BGe score, on the other hand, is based on a linear model, which will not be appropriate for modelling nonlinear regulatory interactions. We will discuss a relaxation of this restriction in Section 13.3. We are ultimately interested in learning a network of causal relations between interacting nodes. While such a causal network forms a valid Bayesian network, the inverse relation does not hold: when we have learned a Bayesian network from the data, the resulting graph does not necessarily represent the correct causal graph. One reason for this discrepancy is the existence of unobserved nodes. When we find a probabilistic dependence between two nodes, we cannot necessarily conclude that there exists a causal interaction between
them, as this dependence could have been brought about by a common yet unobserved regulator. However, even under the assumption of complete observation the inference of causal interaction networks is impeded by symmetries within so-called equivalence classes, which consist of networks that define the same conditional dependence structure and should therefore have the same evidence scores P(D|M). For a simple example, consider two conditionally dependent nodes, say A and B, where the two networks related to the two possible directions of the edge, A → B and A ← B, are equivalent. There are three ways to break the symmetries of the equivalence classes. One approach is to use active interventions, like gene knockouts and over-expressions. When knocking out gene A affects gene B, while knocking out gene B does not affect gene A, then A → B will tend to have a higher evidence than A ← B. For more details, see (37) and (43). An alternative way to break the symmetries, discussed in Section 13.2, is to use prior information. When genes A and B are conditionally dependent, and we have prior knowledge that A is a transcription factor that regulates genes in the functional category that B belongs to, then we will presumably favour A → B over A ← B. To formalize this notion, we score networks by the posterior probability P(M|D) ∝ P(D|M)P(M)
(13.4)
where P(D|M) is the evidence, and P(M) is the prior distribution over network structures; the latter distribution captures the biological knowledge that we have prior to measuring the data D. While different graphs might have identical scores in light of the data, P(D|M), symmetries can be broken by the inclusion of prior knowledge, P(M), and these two sources of information are systematically integrated into the posterior distribution P(M|D). Finally, when we have temporal rather than static data, the symmetry is broken by the direction of time. A relation C(t) → E(t + 1) indicates that a cause at time t, C(t), leads to a future effect at time t + 1, E(t + 1). Reverting the edge would imply that a modification of the future effect E(t + 1) allows the cause in the past, C(t), to be changed – in obvious contradiction to the temporal direction of causality. This idea leads to the concept of dynamic Bayesian networks (DBNs), as discussed in the next subsection.
13.1.2 Dynamic Bayesian networks
A principled feature of Bayesian networks is the DAG structure of the graph and the acyclicity constraint, which rules out directed feedback loops. Since feedback is an essential control mechanism common in many biological systems, this poses a serious limitation of the model. In the present section, we describe the concept of DBNs, which allows us to overcome this limitation for temporal data. Consider Figure 13.1(a), which shows a simple network consisting of two genes. One of the genes is involved in a feedback loop, ruling out the application of DAGs. However, interactions between genes are usually such that the first gene is transcribed and translated into protein, which then has some influence on the transcription
Figure 13.1 State space graph and corresponding dynamic Bayesian network. (a) Recurrent state space graph containing two nodes. Node X has a recurrent feedback loop and acts as a regulator of node Y. (b) The same graph unfolded in time
of the second gene. This implies that the interaction is not instantaneous, but that its effect happens with a time delay after its cause. The same applies to the feedback loops of genes acting back on themselves. We can therefore unfold the recurrent network of Figure 13.1(a) in time to obtain the directed, acyclic network of Figure 13.1(b). The latter is again a proper DAG and corresponds to a DBN. The equations of the previous section have to be modified as follows. Given a node Xn(t), parents can only be taken from the previous time point, πn(t − 1). Hence, P(Xn(t) = Dn,t | πn(t) = D(πn,t), θn,πn) in Equations (13.1) and (13.3), where Dn^πn := {(Dn,t, Dπn,t) : 1 ≤ t ≤ m} is the subset of data pertaining to node Xn and parent set πn, has to be replaced by P(Xn(t) = Dn,t | πn(t − 1) = D(πn,t−1), θn,πn), where Dn^πn := {(Dn,t, Dπn,t−1) : 2 ≤ t ≤ m}. For further details, see (7), (35) and (20). To avoid an explosion of the model complexity, parameters are tied such that the transition probabilities between time slices t − 1 and t are the same for all t. The true dynamic process is thus approximated by a homogeneous Markov model. We will discuss a relaxation of this assumption in Section 13.3.
13.1.3 Bayesian learning of Bayesian networks
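To make the time-unfolding concrete, here is a small, purely illustrative Python sketch (the function name and data layout are our own, not part of any published implementation) that assembles, for one target node and a candidate parent set, the lagged pairs {(Dn,t, Dπn,t−1) : 2 ≤ t ≤ m} on which a DBN score would be evaluated.

```python
import numpy as np

def lagged_pairs(data, target, parents):
    """data: array of shape (N, m), one row per node, one column per time point.
    Returns (y, X): y contains D[target, t] for t = 2..m and X contains the
    corresponding parent values one time step earlier, D[parents, t-1]."""
    y = data[target, 1:]          # target realizations at t = 2..m
    X = data[parents, :-1].T      # parent realizations at t = 1..m-1 (one row per transition)
    return y, X

# Toy example: node 0 regulates node 1 with a one-step delay.
rng = np.random.default_rng(0)
m = 6
x = np.cumsum(rng.normal(size=m))
yvals = np.zeros(m)
yvals[1:] = 0.8 * x[:-1] + 0.1 * rng.normal(size=m - 1)
data = np.vstack([x, yvals])
y, X = lagged_pairs(data, target=1, parents=[0])
print(y.shape, X.shape)           # (5,) (5, 1)
```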
Our objective is to find the network structure M that maximizes P(M|D). Unfortunately, the number of structures increases super-exponentially with the number of nodes. Also, in systems biology, where we aim to learn complex interaction patterns involving many components, the amount of information from the data and the prior is usually not sufficient to render the distribution P(M|D) sharply peaked at a single graph. Instead, the distribution is usually diffusely spread over a large set of networks. Summarizing this distribution by a single network would not be appropriate. Instead, we aim to sample network structures from the posterior distribution P(M|D) so as to obtain a typical collection of high-scoring networks and, thereby, capture intrinsic inference uncertainty. Direct sampling from this distribution is usually intractable, though. Hence, we resort to a Markov chain Monte Carlo (MCMC) scheme (31), which under fairly general regularity conditions is theoretically guaranteed to converge to the posterior distribution of Equation (13.4). Given a network structure Mold , a new network structure Mnew is proposed from a proposal distribution Q(Mnew |Mold ), which is then rejected or accepted according to the standard Metropolis–Hastings scheme (18) with the following acceptance probability: A = min
{ [P(D|Mnew) P(Mnew)] / [P(D|Mold) P(Mold)] × Q(Mold|Mnew) / Q(Mnew|Mold), 1 }    (13.5)
The procedure is iterated until some convergence criterion (2) is satisfied. The functional form of the proposal distribution Q(Mnew |Mold ) depends on the chosen type of proposal moves. A simple and straightforward approach, proposed by (31), is based on three edge-based proposal operations: creating, deleting, or inverting an edge. The computation of the Hastings factor Q(Mold |Mnew )/Q(Mnew |Mold ) is, for instance, discussed in (21). More sophisticated approaches, with improved mixing and convergence of the Markov chains, are based on node orders (6) or more complex edge reversal operations (12).
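The following Python sketch outlines structure MCMC with single-edge moves and the acceptance rule of Equation (13.5). It is schematic rather than a faithful reproduction of any published code: the scoring functions are placeholders supplied by the user, and acyclicity and fan-in checks are omitted.

```python
import math, random

def neighbours(M):
    """All graphs reachable from adjacency matrix M by adding or deleting one edge
    (acyclicity and fan-in constraints are omitted in this sketch)."""
    out = []
    n = len(M)
    for i in range(n):
        for j in range(n):
            if i != j:
                M2 = [row[:] for row in M]
                M2[i][j] = 1 - M2[i][j]
                out.append(M2)
    return out

def structure_mcmc(M, log_evidence, log_prior, n_steps, rng=random):
    """Metropolis-Hastings over structures: log_evidence(M) ~ log P(D|M), log_prior(M) ~ log P(M)."""
    samples = []
    for _ in range(n_steps):
        nbrs = neighbours(M)
        M_new = rng.choice(nbrs)
        log_r = (log_evidence(M_new) + log_prior(M_new)
                 - log_evidence(M) - log_prior(M)
                 + math.log(len(nbrs)) - math.log(len(neighbours(M_new))))  # Hastings factor
        if math.log(rng.random()) < min(0.0, log_r):
            M = M_new
        samples.append(M)
    return samples

# Usage with a vacuous score (uniform posterior over structures):
flat = lambda M: 0.0
chain = structure_mcmc([[0, 0], [0, 0]], flat, flat, n_steps=10)
```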
13.2 Inclusion of biological prior knowledge
The number of experimental conditions is usually limited for practical reasons. For instance, a biologist will hardly be able to procure the funding for, say, a thousand microarray experiments, and in the practical data
analysis, we have to make do with a few dozen conditions. This limited dataset size will usually lead to a diffuse posterior distribution in network structure space, if the posterior distribution is dominated by the likelihood. A way to counteract this tendency is to include informative prior knowledge, as described in Section 13.1. For instance, a biologist aiming to infer regulatory networks in a crop plant, say barley, might want to include knowledge about related regulatory networks in a model plant like Arabidopsis. The bulk of the analysis may be based on gene expression profiles from microarray experiments, but we may have information about transcription factor binding motifs in the upstream sequences of the promoters of some genes, which will give us prior cues about potential gene regulation patterns. It is certainly desirable to include this prior knowledge in the analysis. However, it is also important to automatically assess how useful this prior knowledge is. How confident are we that the over-representation of certain sequence motifs in the promoter is indicative of genuine transcription factor binding? To what extent is a signalling pathway in a model plant similar to corresponding pathways in crops, and is there any corresponding pathway in the first place? We would therefore like to have a mechanism in place that automatically trades off different sources of prior knowledge against each other and against the data, so as to assess their relative relevance. This is achieved with the hierarchical Bayesian model discussed below. The method was proposed in (41;42), based on work by (22). A similar approach was developed in (34). 13.2.1
The ‘energy’ of a network
A network M is represented by a binary adjacency matrix, where each entry Mij can be either 0 or 1. A zero entry, Mij = 0, indicates the absence of an edge between node Xi and node Xj . Conversely if Mij = 1 there is a directed edge from node Xi to node Xj . We define the biological prior knowledge matrix B to be a matrix in which the entries Bij ∈ [0, 1] represent our knowledge about interactions among nodes as follows. If entry Bij = 0.5, we do not have any prior knowledge about the presence or absence of the edge between node Xi and node Xj . If 0 ≤ Bij < 0.5, we have prior evidence that there is no edge between node Xi and node Xj . The evidence is stronger as Bij is closer to 0. If 0.5 < Bij ≤ 1, we have prior evidence that there is a directed edge pointing from node Xi and node Xj . The evidence is stronger as Bij is closer to 1. Note that despite their restriction to the unit interval, the Bij are not probabilities in a stochastic sense. To obtain a proper probability distribution over networks, we have to introduce an explicit normalization procedure, as will be discussed shortly. Having defined how to represent a network M and the biological prior knowledge B, we can now define a regularization term to quantify the deviation between a given network M and the biological prior knowledge that we have at our disposal: E(M) =
Σ_{i,j=1}^{N} |Bij − Mij|    (13.6)
where N is the total number of nodes in the studied domain. The regularization term E is zero for a perfect match between the prior knowledge B and the actual network structure M, while increasing values of E indicate an increasing mismatch between B and M. Following (22), we call this measure the ‘energy’ E, borrowing the name from the statistical physics literature. If we have two different sources of prior knowledge, represented by separate prior knowledge matrices B(k) , k ∈ {1, 2}, e.g. one extracted from pathway databases, like KEGG (Kyoto Encyclopedia of Genes and Genomes), and the other derived from binding motifs in the upstream sequences of the genes, then we get two regularization (‘energy’) functions: E1 (M) =
Σ_{i,j=1}^{N} |B(1)ij − Mij|,    E2(M) = Σ_{i,j=1}^{N} |B(2)ij − Mij|    (13.7)
The scheme can be generalized to an arbitrary number of disparate sources of prior knowledge. However, for clarity of exposition, we will limit the subsequent discussion to the case of two sources of prior knowledge, noting that an extension to more sources is straightforward.
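A minimal sketch of the energy computation of Equations (13.6) and (13.7), assuming the adjacency and prior-knowledge matrices are held as NumPy arrays (the two-node example values are invented for illustration only):

```python
import numpy as np

def energy(M, B):
    """E(M) = sum_ij |B_ij - M_ij|, Equation (13.6): mismatch between a binary
    adjacency matrix M and a prior-knowledge matrix B with entries in [0, 1]."""
    return np.abs(np.asarray(B, dtype=float) - np.asarray(M, dtype=float)).sum()

# The prior strongly favours the edge 0 -> 1; including it lowers the energy.
B = np.array([[0.5, 0.9],
              [0.5, 0.5]])
M_with_edge = np.array([[0, 1],
                        [0, 0]])
M_without = np.zeros((2, 2), dtype=int)
print(energy(M_with_edge, B), energy(M_without, B))   # 1.6 versus 2.4
```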
13.2.2 Prior distribution over network structures
To integrate the prior knowledge expressed by Equation (13.7) into the inference procedure, we follow (22) and define the prior distribution over network structures M to take the form of a Gibbs distribution: P(M|β1 , β2 ) =
e^{−[β1 E1(M) + β2 E2(M)]} / Z(β1, β2)    (13.8)
where the ‘energy’ E(M) was defined in Equation (13.7), β1 and β2 are hyperparameters, and the denominator is a normalizing constant that is usually referred to as the partition function: Z(β1 , β2 ) =
Σ_{M∈G} e^{−[β1 E1(M) + β2 E2(M)]}    (13.9)
Note that the summation extends over the set of all possible network structures, G. The hyperparameters β1 and β2 can be interpreted as factors that indicate the strength of the influence of the respective source of biological prior knowledge relative to the data. For β1 , β2 → 0, the prior distribution defined in Equation (13.8) becomes flat and uninformative about the network structure. Conversely, for βk → ∞, the prior distribution becomes sharply peaked at the network structure with the lowest ‘energy’ E(M), k ∈ {1, 2}. For DBNs, we can exploit the modularity of Bayesian networks and compute the sum in Equation (13.9) efficiently. Note that E(M) in Equation (13.7) can be rewritten as follows: E1 (M) =
Σ_{n=1}^{N} E1(n, πn[M]);    E2(M) = Σ_{n=1}^{N} E2(n, πn[M])    (13.10)
where πn [M] is the set of parents of node Xn in the graph M, and we have defined: E1 (n, πn ) =
Σ_{i∈πn} (1 − B(1)in) + Σ_{i∉πn} B(1)in    (13.11)

E2(n, πn) = Σ_{i∈πn} (1 − B(2)in) + Σ_{i∉πn} B(2)in    (13.12)
Inserting Equation (13.10) into Equation (13.9), we obtain: Z=
Σ_{M∈G} e^{−[β1 E1(M) + β2 E2(M)]}
= Σ_{π1} · · · Σ_{πN} e^{−[β1(E1(1,π1) + · · · + E1(N,πN)) + β2(E2(1,π1) + · · · + E2(N,πN))]}
= ∏_{n} Σ_{πn} e^{−[β1 E1(n,πn) + β2 E2(n,πn)]}    (13.13)
Here, the summation in the last equation extends over all parent configurations πn of node Xn, which in the case of a fan-in restriction1 is subject to constraints on their cardinality. Note that the essence of Equation (13.13) is a dramatic reduction in the computational complexity. Rather than summing over the whole space of network structures, whose cardinality increases super-exponentially with the number of nodes N, we only need to sum over all parent configurations of each node; the complexity of this operation is polynomial in N, with the degree of the polynomial determined by the maximum fan-in restriction. The reason for this simplification is the fact that any modification of the parent configuration of a node in a DBN leads to a new valid DBN by construction. This convenient feature does not apply to static Bayesian networks, though, where modifications of a parent configuration πn may lead to directed cyclic structures, which are invalid and hence have to be excluded from the summation in Equation (13.13). The detection of directed cycles is a global operation. This destroys the modularity inherent in Equation (13.13), and leads to a considerable explosion of the computational complexity. Note, however, that Equation (13.13) still provides an upper bound on the true partition function. When densely connected graphs are ruled out by a fan-in restriction, as commonly done – see e.g. (8), (6) and (20) – then the number of cyclic terms that need to be excluded from Equation (13.13) can be assumed to be relatively small. We can then expect the bound to be rather tight, and use it to approximate the true partition function.

1 A fan-in restriction defines the maximum number of parents each node can have. In the simulations discussed later in this chapter, a fan-in restriction of three was assumed, as commonly applied by other authors; see e.g. (8), (6) and (20).
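The modular computation of Equation (13.13) can be sketched as follows; this illustrative Python code is our own (self-loops are ignored and the toy prior matrices are random), and it simply sums over all parent sets of each node up to a fan-in limit and multiplies the node-wise results.

```python
from itertools import combinations
import numpy as np

def node_energy(B, n, parent_set):
    """Per-node energy E(n, pi_n) as in Equations (13.11)-(13.12), for one prior matrix B."""
    parents = set(parent_set)
    return sum((1.0 - B[i, n]) if i in parents else B[i, n]
               for i in range(B.shape[0]) if i != n)

def partition_function(B1, B2, beta1, beta2, max_fan_in=3):
    """Modular (DBN-style) evaluation of Z(beta1, beta2), Equation (13.13)."""
    N = B1.shape[0]
    Z = 1.0
    for n in range(N):
        candidates = [i for i in range(N) if i != n]
        z_n = 0.0
        for size in range(max_fan_in + 1):
            for pa in combinations(candidates, size):
                z_n += np.exp(-(beta1 * node_energy(B1, n, pa)
                                + beta2 * node_energy(B2, n, pa)))
        Z *= z_n
    return Z

# Toy example: one informative and one vacuous prior matrix over three nodes.
rng = np.random.default_rng(1)
B1 = rng.uniform(size=(3, 3))
B2 = np.full((3, 3), 0.5)
print(partition_function(B1, B2, beta1=1.0, beta2=0.0))
```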
13.2.3 MCMC sampling scheme
Having defined the prior probability distribution over network structures, our next objective is to extend the MCMC scheme of Equation (13.5) to sample both the network structure and the hyperparameters from the posterior distribution. Starting from the prior distributions on the hyperparameters, P(β1 ) and P(β2 ), our objective is to sample network structures and hyperparameters from the posterior distribution P(M, β1 , β2 |D). We sample a new network structure Mnew from the proposal distribution Q(Mnew |Mold ), and new hyperparameters from the proposal distributions R(β1new |β1old ) and R(β2new |β2old ). We then accept this move according to the standard Metropolis–Hastings update rule (18) with the following acceptance probability: A = min
{ [P(D, Mnew, β1new, β2new) Q(Mold|Mnew) R(β1old|β1new) R(β2old|β2new)] / [P(D, Mold, β1old, β2old) Q(Mnew|Mold) R(β1new|β1old) R(β2new|β2old)], 1 }    (13.14)
From the conditional independence relations intrinsic to the model, this expression can be expanded as follows: A = min
{ [P(D|Mnew) P(Mnew|β1new, β2new) P1(β1new) P2(β2new)] / [P(D|Mold) P(Mold|β1old, β2old) P1(β1old) P2(β2old)] × [Q(Mold|Mnew) R1(β1old|β1new) R2(β2old|β2new)] / [Q(Mnew|Mold) R1(β1new|β1old) R2(β2new|β2old)], 1 }    (13.15)
To increase the acceptance probability and, hence, mixing and convergence of the Markov chain, it is advisable to break the move up into three submoves. Submove 1: sample a new network structure Mnew from the proposal distribution Q(Mnew|Mold) for fixed hyperparameters β1 and β2. Submove 2: sample a new hyperparameter β1new from the proposal distribution R(β1new|β1old) for fixed hyperparameter β2 and fixed network structure M. Submove 3: sample a new hyperparameter β2new from the proposal distribution R(β2new|β2old) for fixed hyperparameter β1 and fixed network structure M. Assuming uniform prior distributions P(β1) and P(β2) as well as symmetric proposal distributions R(β1new|β1old) and R(β2new|β2old), the corresponding acceptance probabilities are given by the following expressions:

A(Mnew|Mold) = min { [P(D|Mnew) P(Mnew|β1, β2) Q(Mold|Mnew)] / [P(D|Mold) P(Mold|β1, β2) Q(Mnew|Mold)], 1 }    (13.16)

A(β1new|β1old) = min { P(M|β1new, β2) / P(M|β1old, β2), 1 }    (13.17)

A(β2new|β2old) = min { P(M|β1, β2new) / P(M|β1, β2old), 1 }    (13.18)
These submoves are iterated until some convergence criterion (2) is satisfied.
13.2.4 Practical implementation
(41) chose the prior distribution of the hyperparameters P(β) to be the uniform distribution over the interval [0, MAX]. The proposal probability for the hyperparameters R(βnew|βold) was chosen to be a uniform distribution over a moving interval of length 2l ≪ MAX, centred on the current value of the hyperparameter. Consider a hyperparameter βnew to be sampled in an MCMC move given that we have the current value βold. The proposal distribution is uniform over the interval [βold − l, βold + l] with the constraint that βnew ∈ [0, MAX]. If the sampled value βnew happens to lie outside the allowed interval, the value is reflected back into the interval. The respective proposal probabilities can be shown to be symmetric and therefore to cancel out in the acceptance probability ratio. In our simulations, we set the upper limit of the prior distribution to be MAX = 30, and the length of the sampling interval to be l = 3. Convergence and mixing can be further improved by adapting l during the burn-in phase. To test for convergence of the MCMC simulations, various methods have been developed; see (2) for a review. The results reported below are taken from (41), who applied the simple scheme used in (6): each MCMC run was repeated from independent initializations, and consistency in the marginal posterior probabilities of the edges was taken as indication of sufficient convergence. For the applications reported in Section 13.2.5, this led to the decision to run the MCMC simulations for a total number of 5 × 10^5 steps, of which the first half were discarded as the burn-in phase.
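A sketch of the reflected uniform proposal described above (illustrative only; the default values follow the settings l = 3 and MAX = 30 quoted in the text):

```python
import random

def propose_beta(beta_old, l=3.0, beta_max=30.0, rng=random):
    """Uniform proposal on [beta_old - l, beta_old + l], reflected back into [0, beta_max].
    With this reflection the proposal density is symmetric, so it cancels out in the
    Metropolis-Hastings acceptance ratio (Section 13.2.4)."""
    beta_new = rng.uniform(beta_old - l, beta_old + l)
    if beta_new < 0.0:
        beta_new = -beta_new                      # reflect at the lower boundary
    if beta_new > beta_max:
        beta_new = 2.0 * beta_max - beta_new      # reflect at the upper boundary
    return beta_new
```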
13.2.5 Empirical evaluation on the Raf signalling pathway
The present section describes an empirical evaluation of the sampling scheme, which we carried out for the study reported in (41).
13.2.5.1 Data
(39) applied intracellular multicolour flow cytometry experiments to quantitatively measure protein concentrations. Data were collected after a series of stimulatory cues and inhibitory interventions targeting specific proteins in the Raf pathway. Raf is a critical signalling protein involved in regulating cellular proliferation in human immune system cells. The dysregulation of the Raf pathway is implicated in carcinogenesis, and this pathway has therefore been extensively studied in the literature (3;39); see Figure 13.2 for a representation of the gold-standard network from (39). Note that the abundance of cytometry data substantially exceeds that of
[Figure 13.2 appears here: the network over the 11 proteins raf, mek, erk, akt, p38, jnk, pka, pkc, pip2, pip3 and plcg.]
Figure 13.2 Raf signalling pathway. The graph shows the Raf signalling network, taken from Sachs et al. (2005). Nodes represent proteins, edges represent interactions and arrows indicate the direction of signal transduction
currently available gene expression data from microarrays. We therefore pursued the approach taken in (43) and downsampled the data to a sample size representative of current microarray experiments (100 exemplars). In our experiments we used five datasets with 100 measurements each, obtained by randomly sampling subsets from the original observational data of (39). Details about the standardization of the data can be found in (43).
13.2.5.2 Prior knowledge
We extracted biological prior knowledge from the KEGG pathways database (23;24;25). KEGG pathways represent current knowledge of the molecular interaction and reaction networks related to metabolism, other cellular processes, and human diseases. As KEGG contains different pathways for different diseases, molecular interactions and types of metabolism, it is possible to find the same pair of genes2 in more than one pathway. We therefore extracted all pathways from KEGG that contained at least one pair of the 11 proteins/phospholipids included in the Raf pathway. We found 20 pathways that satisfied this condition. From these pathways, we computed the prior knowledge matrix, introduced in Section 13.2.1, as follows. Define by Uij the total number of times a pair of genes i and j appears in a pathway, and by uij the number of times the genes are connected by a (directed) edge in the KEGG pathway. The elements Bij of the prior knowledge matrix are then defined by Bij = uij/Uij. If a pair of genes is not found in any of the KEGG pathways, we set the respective prior association to Bij = 0.5, implying that we have no information about this relationship.

2 We use the term 'gene' generically for all interacting nodes in the network. This may include proteins encoded by the respective genes.
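As an illustration of this construction, the following sketch builds B from a list of pathways; the input format (pairs of member sets and directed edge lists) and the toy example are our own conventions, not the KEGG data structures.

```python
import numpy as np

def prior_matrix(pathways, genes):
    """B_ij = u_ij / U_ij, where U_ij counts pathways containing both genes and
    u_ij counts pathways with a directed edge i -> j; unseen pairs default to 0.5."""
    idx = {g: k for k, g in enumerate(genes)}
    N = len(genes)
    U = np.zeros((N, N))
    u = np.zeros((N, N))
    for members, edges in pathways:
        present = [idx[g] for g in members if g in idx]
        for i in present:
            for j in present:
                if i != j:
                    U[i, j] += 1
        for a, b in edges:
            if a in idx and b in idx:
                u[idx[a], idx[b]] += 1
    B = np.full((N, N), 0.5)
    seen = U > 0
    B[seen] = u[seen] / U[seen]
    return B

# Hypothetical toy input: one pathway with the edge raf -> mek, another with mek -> erk.
genes = ["raf", "mek", "erk"]
pathways = [({"raf", "mek"}, [("raf", "mek")]),
            ({"raf", "mek", "erk"}, [("mek", "erk")])]
print(prior_matrix(pathways, genes))
```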
13.2.5.3 Comparison with other models
The objective of the empirical evaluation is to assess the viability of the Bayesian inference scheme and to estimate by how much the network reconstruction results improve as a consequence of combining the (downsampled) cytometry data with prior knowledge from the KEGG pathway database. To this end, we compared the results obtained with the Bayesian hierarchical model described in the present chapter with the results reported in (43), where we had evaluated the performance of Bayesian networks and graphical Gaussian models (GGMs) without the inclusion of prior knowledge. We applied GGMs as described in (40). For comparability with (43), we used Bayesian networks with the family of linear Gaussian distributions, for which the marginal likelihood P(D|M) of Equation (13.2) is given by the BGe score, as discussed in Section 13.1.1. Note that the cytometry data of (39) are not taken from a time course; hence, Bayesian networks were treated as static rather than dynamic models.
13.2.5.4 Discriminating between different priors
We wanted to test whether the proposed Bayesian inference method can discriminate between different sources of prior knowledge and automatically assess their relative merits. To this end, we complemented the prior from the KEGG pathway database with a second prior, for which the entries in the prior knowledge matrix B were chosen completely at random. Hence, this second source of prior knowledge is vacuous and does not include any useful information for reconstructing the regulatory network. Figure 13.3(b) shows the estimated posterior distributions of two hyperparameters: β1 , the hyperparameter associated with the prior knowledge extracted from the KEGG pathways; and β2 , the hyperparameter associated with the vacuous prior. It is seen that the hyperparameter associated with the KEGG prior, β1 , takes on substantially larger values than the hyperparameter associated with the vacuous prior, β2 . This suggests that the proposed method successfully
Figure 13.3 Inferring hyperparameters from the cytometry data of the Raf pathway. (a) The horizontal axis represents the value of β1, the hyperparameter associated with the prior knowledge from KEGG. The vertical axis represents the area under the ROC curve (AUC) for the undirected graph evaluation (UGE). The results for the directed graph evaluation (DGE) were similar. The graph plotted against the horizontal axis shows the mean AUC score for fixed values of β1, obtained by sampling network structures from the posterior distribution with MCMC. The results were averaged over five datasets of 100 protein concentrations each, independently sampled from the observational cytometry data of Sachs et al. (2005). The error bars show the respective standard deviations. The graph plotted against the vertical axis shows trace plots of β1 obtained with the MCMC scheme described in Section 13.2.3. (b) shows the corresponding posterior probability densities, estimated from the MCMC trajectories with a Parzen estimator, using a Gaussian kernel whose standard deviation was set automatically by the MATLAB function ksdensity.m. The solid line refers to the hyperparameter associated with the prior knowledge extracted from the KEGG pathways (β1). The dashed line refers to completely random and hence vacuous prior knowledge (β2). The data, on which the inference was based, consisted of 100 concentrations of the 11 proteins in the Raf pathway, subsampled from the observational cytometry data of Sachs et al. (2005).
discriminates between the two sources of prior information and effectively suppresses the influence of the vacuous one.
13.2.5.5 Learning the hyperparameters
While the study described in the previous subsection suggests that the proposed Bayesian inference scheme succeeds in suppressing irrelevant prior knowledge, we were curious to see whether the hyperparameter associated with the relevant prior (from KEGG) was optimally inferred. To this end, we chose a large set of fixed values for β1, while keeping the hyperparameter associated with the vacuous prior fixed at zero: β2 = 0. For each fixed value of β1, we sampled Bayesian networks from the posterior distribution with MCMC, and evaluated the network reconstruction accuracy using the evaluation criteria described in Section 13.2.5.6. We compared these results with the proposed Bayesian inference scheme, where both hyperparameters and networks are simultaneously sampled from the posterior distribution with the MCMC scheme discussed in Section 13.2.3. The results are shown in Figure 13.3(a). They suggest that the inferred values of β1 are close to those that achieve the best network reconstruction accuracy, and that the approximation of the partition function of Equation (13.9) by Equation (13.13) leads only to a small bias.
13.2.5.6 Reconstructing the regulatory network
While the true network is a directed graph, our reconstruction methods may lead to undirected, directed, or partially directed graphs.3 To assess the performance of these methods, we applied two different criteria. The first approach, referred to as the undirected graph evaluation (UGE), discards the information about the edge directions altogether. To this end, the original and learned networks are replaced by their skeletons, where the skeleton is defined as the network in which two nodes are connected by an undirected edge whenever they are connected by any type of edge. The second approach, referred to as the directed graph evaluation (DGE), compares the predicted network with the original directed graph. A predicted undirected edge is interpreted as the superposition of two directed edges, pointing in opposite directions. The application of any of the machine learning methods considered in our study leads to a matrix of scores associated with the edges in a network. For Bayesian networks sampled from the posterior distribution with MCMC, these scores are the marginal posterior probabilities of the edges. For GGMs, these are partial correlation coefficients. Both scores define a ranking of the edges. This ranking defines a receiver operating characteristic (ROC) curve, where the relative number of true positive (TP) edges is plotted against the relative number of false positive (FP) edges. We proceeded with two different evaluation procedures. The first approach is based on integrating the ROC curve so as to obtain the area under the curve (AUC), with larger areas indicating, overall, a better performance. The second approach is based on the selection of an arbitrary threshold on the edge scores, from which a specific network prediction is obtained. Following our earlier study (43), we set the threshold such that it led to a fixed count of 5 FPs. From the predicted network, we determined the number of correctly predicted (TP) edges, and took this score as our second figure of merit. The results are shown in Figure 13.4. The proposed Bayesian inference scheme clearly outperforms the methods that do not include the prior knowledge from the KEGG database (Bayesian networks and GGMs). It also clearly outperforms the prediction that is solely based on the KEGG pathways alone without taking account of the cytometry data. The improvement is significant for all four evaluation criteria: AUC and TP scores for both directed (DGE) and undirected (UGE) graph evaluations. This suggests that the network 3 GGMs
are undirected graphs. While Bayesian networks are, in principle, directed graphs, partially directed graphs may result as a consequence of equivalence classes, which were briefly mentioned in Section 13.1. For more details, see e.g. chapter 2 in (21).
Advanced Applications of Bayesian Networks 281
Figure 13.4 Reconstruction of the Raf signalling pathway with different machine learning methods. The figure evaluates the accuracy of inferring the Raf signalling pathway from cytometry data and prior information from KEGG. Two evaluation criteria were used. Panel (a) shows the results in terms of the area under the ROC curve (AUC scores), while (b) shows the number of predicted true positive (TP) edges for a fixed number of 5 spurious edges. Each evaluation was carried out twice: with and without taking the edge direction into consideration (UGE, undirected graph evaluation; DGE, directed graph evaluation). Four machine learning methods were compared: Bayesian networks without prior knowledge (BNs), graphical Gaussian models without prior knowledge (GGMs), Bayesian networks with prior knowledge from KEGG (BN-Prior), and prior knowledge from KEGG only (Only Prior). In the latter case, the elements of the prior knowledge matrix (introduced in Section 13.2.1) were computed as described in Section 13.2.5.2. The histogram bars represent the mean values obtained by averaging the results over five datasets of 100 protein concentrations each, independently sampled from the observational cytometry data of (39). The error bars show the respective standard deviations
reconstruction accuracy can be substantially improved by systematically integrating expression data with prior knowledge about pathways, as extracted from the literature or databases like KEGG.
13.3
Heterogeneous DBNs
As discussed in Section 13.1.2, classical DBNs are based on the homogeneous Markov assumption and cannot deal with heterogeneity in temporal processes. The present chapter describes a relaxation of this assumption based on node-specific multiple changepoint processes; this is a summary of the work presented in (13;14). Related ideas were independently developed in (28), (1), (26) and (38). 13.3.1
Motivation: Inferring spurious feedback loops with DBNs
Feedback loops and recurrent structures are essential to the regulation and stable control of complex biological systems. As discussed in Section 13.1.2, the application of dynamic as opposed to static Bayesian networks is promising in that, in principle, these feedback loops can be learned. However, we show that the widely applied BGe score (discussed in Section 13.1.1) is susceptible to incurring spurious feedback loops, which are a consequence of nonlinear regulation and autocorrelation in the data. We propose a nonlinear generalization
282
Handbook of Statistical Systems Biology
of the BGe model, based on a mixture model, and demonstrate that this approach successfully represses spurious feedback loops. When the objective is to infer regulatory networks from time series, as is typically the case in systems biology, the restriction of the model to linear processes can result in the prediction of spurious feedback loops. Consider the simple example shown in Figure 13.1. The graph shows two interacting nodes. Node X is a regulator of node Y , and it also has a regulatory feedback loop acting back on itself. Node Y is regulated by node X, but does not contain a feedback loop. Figure 13.1 shows both the state space representation, i.e. the recurrent graph, and the corresponding DBN. Note that the latter is a valid DAG obtained by the standard procedure of unfolding the state space graph in time. First assume that the data generation processes are linear and hence consistent with the BGe model assumption, e.g. X(t + 1) = X(t) + c + σx · φX (t) and Y (t + 1) = w · X(t) + m + σy · φY (t) where w, m, c, σx , σy are constants, and φ. (.) are independent and identically distributed (i.i.d.) normally distributed random variables. Under fairly general regularity conditions, the marginal likelihood and, hence, the BGe score is a consistent estimator. This implies that the correct model structure will be learned as m → ∞, where m is the dataset size. Next, consider the scenario of a nonlinear regulatory influence that X exerts on Y : X(t + 1) = X(t) + c + σx · φX (t),
Y (t + 1) = f (X(t)) + σy · φY (t)
(13.19)
for some nonlinear function f (.). This nonlinear function cannot be modelled with a linear Bayesian network based on the BGe model. Consequently, the prediction of Y (t + 1) from X(t) will tend to be poor. Note that for sufficiently small noise levels, the Y (t)’s will exhibit a strong autocorrelation, by virtue of the autocorrelation of the X(t)’s, and the regulatory influence of X(t) on Y (t + 1). If the latter regulatory influence cannot be learned owing to the linear restriction of our model, the next best explanation is a direct modelling of the autocorrelation between the Y (t)’s themselves. This autocorrelation corresponds to a feedback loop of Y acting back on itself in the state-space graph, or, equivalently, an edge from Y (t) to Y (t + 1) in the DBN. We would therefore conjecture that the linear restriction of the Bayesian network model may result in the prediction of spurious feedback loops and, hence, to the reconstruction of wrong network structures. Ruling out feedback loops altogether will not provide a sufficient remedy for this problem, as some nodes – X in the example above – will exhibit regulatory feedback loops (e.g. in molecular biology: transcription factors regulating their own transcription), and it is generally not known in advance where these nodes are. 13.3.2
A nonlinear/nonhomogeneous DBN
To obtain a nonlinear DBN, we generalize an adapted version of Equation (13.2) (see Section 13.1.2) with a mixture model:
P(D|M, V, K, θ) =
Kn m N δVn (t),k P Xn (t) = Dn,t |πn (t − 1) = D(πn ,t−1) , θ kn,πn
(13.20)
n=1 t=2 k=1
where δVn (t),k is the Kronecker delta, V is a matrix of latent variables Vn (t), Vn (t) = k indicates that the realization of node Xn at time t, Xn (t), has been generated by the kth component of a mixture with Kn components, and K = (K1 , . . . , Kn ). Note that the matrix V divides the data into several disjoined subsets, each of which can be regarded as pertaining to a separate BGe model with parameters θ kn,πn . The probability model defined in Equation (13.20) is a straightforward generalization of the mixture model proposed in (16), with local probability distributions P(Xn |πn , θ kn,πn ), and allocation vectors Vn that are node-specific,
Advanced Applications of Bayesian Networks 283
i.e. different nodes can have different assignments. This free allocation of the latent variables can, in principle, approximate any probability distribution arbitrarily closely. However, the price of this flexibility is that the computational complexity increases exponentially with the number of time points m. In the present chapter, we therefore summarize the work of (14), where the assignment of data points to mixture components was changed from a free allocation to a changepoint process. This modification reduces the complexity of the latent variable space from exponential to polynomial complexity, and incorporates our prior belief that, in a time series, adjacent time points are likely to be assigned to the same component. From Equation (13.20), the marginal likelihood conditional on the latent variables V is given by P(D|M, V, K) =
P(D|M, V, K, θ)P(θ)dθ =
Kn N
(Dπn n [k, Vn ])
(13.21)
n=1 k=1
(Dπn n [k, Vn ]) =
m δVn (t),k P Xn (t) = Dn,t |πn (t − 1) = D(πn ,t−1) , θ kn,πn
t=2 k P(θ n,πn )dθ kn,πn
(13.22)
Equation (13.22) is similar to Equation (13.3), appropriately adapted to DBNs (see Section 13.1.2), except that it is restricted to the subset Dπn n [k, Vn ] := {(Dn,t , Dπn ,t−1 ) : Vn (t) = k, 2 ≤ t ≤ m}. Hence when the regularity conditions defined in (9) are satisfied, then the expression in equation (13.22) has a closed-form solution: it is given by Equation (24) in (9) restricted to the subset of the data that has been assigned to the kth mixture component (or kth segment). The joint probability distribution of the proposed heterogeneous changepoint DBN model is given by: P(M, V, K, D) = P(D|M, V, K) · P(M) · P(V|K) · P(K) Kn N πn (Dn [k, Vn ]) P(Vn |Kn ) · P(Kn ) · = P(M) · n=1
(13.23)
k=1
In the absence of genuine prior knowledge about the regulatory network structure, we assume for P(M) a uniform distribution on graphs, subject to a fan-in restriction of |πn | ≤ 3. As prior probability distributions on the node-specific numbers of mixture components Kn , P(Kn ), we take i.i.d. truncated Poisson distributions with shape parameter λ = 1, restricted to 1 ≤ Kn ≤ KMAX(we set KMAX = 10 in our simulations). The prior distribution on the latent variable vectors, P(V|K) = N n=1 P(Vn |Kn ), is implicitly defined via the changepoint process as follows. We identify Kn with Kn − 1 changepoints bn = {bn,1 , . . . , bn,Kn −1 } on the continuous interval [2, m]. For notational convenience we introduce the pseudo changepoints bn,0 = 2 and bn,Kn = m. For node Xn the observation at time point t is assigned to the kth component, symbolically Vn (t) = k, if bn,k−1 ≤ t < bn,k . Following (11) we assume that the changepoints are distributed as the evennumbered order statistics of L := 2(Kn − 1) + 1 points u1 , . . . , uL uniformly and independently distributed on the interval [2, m]. The motivation for this prior, instead of taking Kn uniformly distributed points, is to encourage a priori an equal spacing between the changepoints, i.e. to discourage mixture components (i.e. segments) that contain only a few observations. The even-numbered order statistics prior on the changepoint locations bn induces a prior distribution on the node-specific allocation vectors Vn . Deriving a closed-form expression is involved. However, the MCMC scheme we discuss in the next section does not sample Vn directly, but is based on local modifications of Vn based on changepoint birth, death and reallocation moves. All that is required for the acceptance probabilities of these moves are P(Vn |Kn ) ratios, which are straightforward to compute.
13.3.3 MCMC inference
We now describe an MCMC algorithm to obtain a sample {M_i, V_i, K_i}_{i=1,...,I} from the posterior distribution P(M, V, K|D) ∝ P(M, V, K, D) of Equation (13.23). We combine the structure MCMC algorithm of (10) and (31) with the changepoint model used in (11), and draw on the fact that conditional on the allocation vectors V, the model parameters can be integrated out to obtain the marginal likelihood terms Ψ(D_n^{π_n}[k, V_n]) in closed form, as shown in the previous section. Note that this approach is equivalent to the idea underlying the allocation sampler proposed in (36). The resulting algorithm is effectively an RJMCMC (reversible jump MCMC) scheme (11) in the discrete space of network structures and latent allocation vectors, where the Jacobian in the acceptance criterion is always 1 and can be omitted. With probability p_G = 0.5 we perform a structure MCMC move on the current graph M_i and leave the latent variable matrix and the numbers of mixture components unchanged, symbolically: V_{i+1} = V_i and K_{i+1} = K_i. A new candidate graph M_{i+1} is randomly drawn out of the set of graphs N(M_i) that can be reached from the current graph M_i by deletion or addition of a single edge. The proposed graph M_{i+1} is accepted with probability:

A(M_{i+1}|M_i) = min{ 1, [P(D|M_{i+1}, V_i, K_i) / P(D|M_i, V_i, K_i)] · [P(M_{i+1}) / P(M_i)] · [|N(M_i)| / |N(M_{i+1})|] }        (13.24)

where |.| is the cardinality, and the marginal likelihood terms have been specified in Equation (13.21). The graph is left unchanged, symbolically M_{i+1} := M_i, if the move is not accepted. With the complementary probability 1 − p_G we leave the graph M_i unchanged and perform a move on (V_i, K_i), where V_n^i is the latent variable vector of X_n in V_i, and K_i = (K_1^i, . . . , K_N^i). We randomly select a node X_n and change its current number of components K_n^i via a changepoint birth or death move, or its latent variable vector V_n^i by a changepoint re-allocation move. The changepoint birth (death) move increases (decreases) K_n^i by 1 and may also have an effect on V_n^i. The changepoint reallocation move leaves K_n^i unchanged and may have an effect on V_n^i. Under fairly mild regularity conditions (ergodicity), the MCMC sampling scheme converges to the desired posterior distribution if the acceptance probabilities for the three changepoint moves (K_n^i, V_n^i) → (K_n^{i+1}, V_n^{i+1}) are chosen of the form min(1, R), see (11), with

R = [ ∏_{k=1}^{K_n^{i+1}} Ψ(D_n^{π_n}[k, V_n^{i+1}], M) ] / [ ∏_{k=1}^{K_n^{i}} Ψ(D_n^{π_n}[k, V_n^{i}], M) ] × A × B        (13.25)
where A = P(V_n^{i+1}|K_n^{i+1}) P(K_n^{i+1}) / [P(V_n^{i}|K_n^{i}) P(K_n^{i})] is the prior probability ratio, and B is the inverse proposal probability ratio. The exact form of these factors depends on the move type; see the supplementary material of (14) for details. We note that an MCMC algorithm based on Equation (10) in (6) in combination with the dynamic programming scheme proposed in (5) has the prospect of improving the convergence and mixing of the Markov chain, and we have investigated this scheme in Grzegorczyk and Husmeier (2011).
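To fix ideas, a skeletal MATLAB implementation of the structure move and its acceptance step, Equation (13.24), might look as follows. This is a hedged sketch of our own; the helper functions propose_edge_change, log_marginal_likelihood, log_prior_graph and neighbourhood_size are placeholders that we assume the user supplies, e.g. by evaluating the closed-form segment-wise terms Ψ(·) described above:

% One structure MCMC move on the graph M, with V and K held fixed (sketch only).
Mprop = propose_edge_change(M);                      % add or delete a single edge (assumed helper)
logA  = log_marginal_likelihood(Mprop, V, K) ...     % log P(D | M', V, K)   (assumed helper)
      - log_marginal_likelihood(M, V, K) ...
      + log_prior_graph(Mprop) - log_prior_graph(M) ...   % uniform graph prior, fan-in <= 3
      + log(neighbourhood_size(M)) - log(neighbourhood_size(Mprop));
if log(rand) < min(0, logA)
    M = Mprop;                                       % accept the proposed graph
end                                                  % otherwise M is left unchanged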
13.3.4 Simulation results
We briefly summarize the results of our study in (13), where we evaluated the heterogeneous DBN on synthetic data sets generated from the simple network of Figure 13.1. The dynamics of the system are given by the nonlinear state-space equations [see Equation (13.19)], with the nonlinear function f(.) = sin(.). We generated 40 observations by applying Equation (13.19), setting the drift term c = 2π/41 to ensure that the complete period [0, 2π] of the sinusoid is involved. Figure 13.5 shows the marginal posterior probabilities of the four possible edges in the two-node network of Figure 13.1 and the nonlinear state space process of Equation (13.19). The results in Figure 13.5(a) were obtained with the linear BGe model and show a clear propensity for inferring the spurious self-loop Y → Y,
Figure 13.5 Marginal edge posterior probabilities for the synthetic network. Both panels are laid out as matrices, whose cells correspond to the standard deviations σ_X and σ_Y of the noise in X and Y. All histograms show averages (means/standard deviations) from 20 independent data instantiations. The histograms show the posterior probabilities of the edges in the simple network of Figure 13.1, as obtained with a standard DBN using the BGe score (a) and the proposed heterogeneous DBN (b). Each histogram contains four bars, which represent the average posterior probabilities of the four possible edges: (left) self-loop X → X (true); (centre left) X → Y (true); (centre right) self-loop Y → Y (false); and (right) Y → X (false). It is seen that BGe has a high propensity for learning the spurious feedback loop Y → Y, while the heterogeneous DBN tends to learn an increased probability of the correct edge X → Y (centre left bars)
in confirmation of our earlier conjecture (see Section 13.3.1). Compare this with the results for the proposed heterogeneous DBN, shown in Figure 13.5(b). Here, the spurious self-loop Y → Y is suppressed in favour of the correct edge X → Y. There are two noise regimes in which the spurious self-loop Y → Y has a marginal posterior probability that is higher than or equal to that of the correct edge X → Y. One noise regime is where both noise levels in X and Y are low [top left corner in Figure 13.5(a) and (b)]. Here, the autocorrelation of Y is so high that the spurious self-loop Y → Y is still favoured over the true edge X → Y; this is a consequence of the fact that the functional dependence of Y(t + 1) on X(t) is only learned approximately (namely approximated by a mixture/changepoint model). The second regime is where both noise levels are high [bottom right corners in Figure 13.5(a) and (b)]. High noise in Y blurs the functional dependence of Y(t + 1) on X(t), while high noise in X leads to a high rate of misclassification of the latent variables and, consequently, a deterioration of the model accuracy; this is a consequence of the fact that latent variables are not allocated individually, as in (16), but according to a changepoint process. However, in the majority of noise scenarios, the marginal posterior probability of the correct edge X → Y is significantly higher than that of the self-loop Y → Y. This suggests that the proposed heterogeneous DBN is successful at suppressing spurious feedback loops.
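For completeness, the data generation can be sketched as follows. Since Equation (13.19) is not reproduced in this section, the precise state-space form below is an assumption on our part (a random walk with drift for X, and Y driven by X through f(.) = sin(.)), with the drift c = 2π/41 taken from the text:

% Hedged sketch of the synthetic data of Section 13.3.4 (assumed form of Eq. 13.19).
m    = 41;                            % 41 time points, i.e. 40 observed transitions
c    = 2*pi/41;                       % drift term, so that X traverses the period [0, 2*pi]
sigX = 0.25;  sigY = 0.25;            % one of the noise regimes shown in Figure 13.5
X = zeros(m,1);  Y = zeros(m,1);
for t = 1:m-1
    X(t+1) = X(t) + c + sigX*randn;   % assumed dynamics for X
    Y(t+1) = sin(X(t)) + sigY*randn;  % Y depends on X through the nonlinearity sin(.)
end
D = [X'; Y'];                         % data matrix passed to the (heterogeneous) DBN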
13.3.5 Results on Arabidopsis gene expression time series
In (14), we applied our method to microarray gene expression time series related to the study of circadian regulation in plants. Arabidopsis thaliana seedlings, grown under artificially controlled ϑ-hour-light/ϑ-hour-dark cycles, were transferred to constant light and harvested at 13 time points in τ-hour intervals. From these
Figure 13.6 Results on the Arabidopsis gene expression time series. (a, b) Average posterior probability of a changepoint (vertical axis) at a specific transition time plotted against the transition time (horizontal axis) for two selected circadian genes [(a) LHY; (b) TOC1]. The vertical dotted lines indicate the boundaries of the time series segments, which are related to different entrainment conditions and time intervals. (c) Predicted regulatory network of nine circadian genes in Arabidopsis thaliana. Empty circles represent morning genes. Shaded circles represent evening genes. Edges indicate predicted interactions with a marginal posterior probability greater than 0.5
seedlings, RNA was extracted and assayed on Affymetrix GeneChip oligonucleotide arrays. The data were background-corrected and normalized according to standard procedures (RMA rather than GCRMA was used, for reasons discussed in (29)) using the GeneSpring© software (Agilent Technologies). We combined four time series, which differed with respect to the pre-experiment entrainment condition and the time intervals: ϑ ∈ {10h, 12h, 14h}, and τ ∈ {2h, 4h}. The data, with detailed information about the experimental protocols, can be obtained from (4), (33) and (16). We focused our analysis on nine circadian genes (i.e. genes involved in circadian regulation): LHY, TOC1, CCA1, ELF4, ELF3, GI, PRR9, PRR5 and PRR3. We combined all four time series into a single set. The objective was to test whether the proposed heterogeneous DBN would detect the different experimental phases. Since the gene expression values at the first time point of a time series segment have no relation with the expression values at the last time point of the preceding segment, the corresponding boundary time points were appropriately removed from the data (a proper mathematical treatment is given in the supplementary material of (14)). This ensures that for all pairs of consecutive time points a proper conditional dependence relation determined by the nature of the regulatory cellular processes is given. Figure 13.6 shows the marginal posterior probability of a changepoint for two selected genes (LHY and TOC1). It is seen that the three concatenation points are clearly detected. There is a slight difference between the heights of the posterior probability peaks for LHY and TOC1. This deviation indicates that the two genes are affected by the changing experimental conditions (entrainment, time interval) in different ways and thus provides a useful tool for further exploratory analysis. Figure 13.6(c) shows the gene interaction network that is predicted when keeping all edges with marginal posterior probability greater than 0.5. There are two groups of genes. Empty circles in Figure 13.6(c) represent morning genes (i.e. genes whose expression peaks in the morning), shaded circles represent evening genes (i.e. genes whose expression peaks in the evening). There are several directed edges pointing from the group of morning genes to the evening genes, mostly originating from gene CCA1. This result is consistent with the findings in (32), where the morning genes were found to activate the evening genes, with CCA1 being a central regulator. Our reconstructed network also contains edges pointing into the opposite direction, from the evening genes back to the morning genes. This finding is also consistent with (32), where the evening genes were found to inhibit the morning genes via a negative
feedback loop. In the reconstructed network, the connectivity within the group of evening genes is sparser than within the group of morning genes. This finding is consistent with the fact that following the light–dark cycle entrainment, the experiments were carried out in constant-light condition, resulting in a higher activity of the morning genes overall. Within the group of evening genes, the reconstructed network contains an edge between GI and TOC1. This interaction has been confirmed in (30). Hence while a proper evaluation of the reconstruction accuracy is currently infeasible – like (38) and many related studies, we lack a gold standard owing to the unknown nature of the true interaction network – our study suggests that the essential features of the reconstructed network are biologically plausible and consistent with the literature.
13.4 Discussion
We have discussed two important problems related to the application of Bayesian networks in computational systems biology: the systematic integration of biological prior knowledge, discussed in Section 13.2, and the relaxation of the homogeneity assumption for DBNs, discussed in Section 13.3. We have evaluated the integration of biological prior knowledge on the problem of reconstructing the Raf protein signalling network. To this end, we have combined protein concentrations from cytometry experiments with prior knowledge from the KEGG pathway database. The findings of our study clearly demonstrate that the proposed Bayesian inference scheme outperforms various alternative methods that either take only the cytometry data or only the prior knowledge from KEGG into account (Figure 13.4). We have inspected the values of the sampled hyperparameters. Encouragingly, we have found that their range was close to the optimal value that maximizes the network reconstruction accuracy (Figure 13.3). A small systematic deviation would be expected owing to the approximation we have made for computing the partition function of the prior; see Equations (13.9) and (13.13). The modelling of heterogeneity is closely related to the modelling of nonlinear regulatory interactions. As discussed in Section 13.1.1, the standard scoring schemes either incur an information loss as a consequence of data discretization (BDe), or they are restricted to a linear model (BGe). The model proposed in Section 13.3 combines aspects of a mixture model and a piecewise linear model. This provides a natural nonlinear generalization of the BGe model while avoiding the data discretization intrinsic to the BDe score. A restriction is that the assignment of data points to mixture components is based on a multiple changepoint process. This corresponds to a piecewise linear process in the time domain, rather than the domain of explanatory variables. In principle it is straightforward to relax this restriction; we just have to replace the changepoint process by a free allocation of the latent variables, as in (16). The disadvantage is a substantial increase in the computational complexity. Moreover, the results presented in Section 13.3.4 are encouraging and demonstrate that the heterogeneous changepoint-process DBN successfully avoids artifacts that are incurred with the linear BGe model. This suggests that the approximation of a nonlinear regulation process by a piecewise linear process in the time domain is effective provided the temporal processes are sufficiently smooth. The proposed heterogeneous DBN assumes that the network structure remains constant, and that only the network parameters are allowed to change. This assumption is appropriate for most cellular processes on short timescales, where it is the strength of the regulatory interactions rather than their structure that changes with time. However, in situations where we expect major structural changes to occur at the organizational level of an organism, e.g. during embryogenesis or morphogenesis, the assumption of a rigid structure might be too restrictive. A more flexible model, in which the network structure in addition to the parameters is allowed to change, is discussed in Chapter 12.
Acknowledgements

Dirk Husmeier is supported by the Scottish Government Rural and Environment Research and Analysis Directorate (RERAD) and under the EU FP7 project ‘Timet’. Marco Grzegorczyk is supported by the Graduate School ‘Statistische Modellbildung’ of the Department of Statistics, TU Dortmund University.
References Ahmed, A. and Xing, E. P. (2009) Recovering time-varying networks of dependencies in social and biological studies. Proceedings of the National Academy of Sciences, 106, 11878–11883. Cowles, M. K. and Carlin, B. P. (1996) Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association, 91, 883–904. Dougherty, M. K., Muller, J., Ritt, D. A., Zhou, M., Zhou, X. Z., Copeland, T. D., Conrads, T. P., Veenstra, T. D., Lu, K. P. and Morrison, D. K. (2005) Regulation of Raf-1 by direct feedback phosphorylation. Molecular Cell, 17, 215–224. Edwards, K. D., Anderson, P. E., Hall, A., Salathia, N. S., Locke, J. C., Lynn, J. R., Straume, M., Smith, J. Q. and Millar, A. J. (2006) Flowering locus C mediates natural variation in the high-temperature response of the Arabidopsis circadian clock. The Plant Cell, 18, 639–650. Fearnhead, P. (2006) Exact and efficient Bayesian inference for multiple changepoint problems. Statistics and Computing, 16, 203–213. Friedman, N. and Koller, D. (2003) Being Bayesian about network structure. Machine Learning, 50, 95–126. Friedman, N., Murphy, K. and Russell, S. (1998) Learning the structure of dynamic probabilistic networks. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 139–147. Friedman, N., Linial, M., Nachman, I. and Pe’er, D. (2000) Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7, 601–620. Geiger, D. and Heckerman, D. (1994) Learning Gaussian networks. Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, 235–243. Giudici, P. and Castelo, R. (2003) Improving Markov chain Monte Carlo model search for data mining. Machine Learning, 50, 127–158. Green, P. (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711–732. Grzegorczyk, M. and Husmeier, D. (2008) Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move. Machine Learning, 71, 265–305. Grzegorczyk, M. and Husmeier, D. (2009) Avoiding spurious feedback loops in the reconstruction of gene regulatory networks with dynamic Bayesian networks. In Kadirkamanathan, V., Sanguinetti, G., Girolami, M., Niranjan, M. and Noirel, J. (eds), Pattern Recognition in Bioinformatics, pp. 113–124. Springer, Berlin. Grzegorczyk, M. and Husmeier, D. (2009) Non-stationary continuous dynamic Bayesian networks. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. K. I. and Culotta, A. (eds), Advances in Neural Information Processing Systems (NIPS), vol. 22, pp. 682–690. Grzegorczyk M. and Husmeier D. (2011) Non-homogeneous dynamic Bayesian networks for continuous data. Machine Learning, 83(3), 355–419. Grzegorczyk, M., Husmeier, D., Edwards, K., Ghazal, P. and Millar, A. (2008) Modelling non-stationary gene regulatory processes with a non-homogeneous Bayesian network and the allocation sampler. Bioinformatics, 24, 2071–2078. Hartemink, A. J., Gifford, D. K., Jaakkola, T. S. and Young, R. A. (2001) Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pacific Symposium on Biocomputing, 6, 422–433. Hastings, W. K. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109. Heckerman, D. (1999) A tutorial on learning with Bayesian networks. In Jordan, M. I. (ed.), Learning in Graphical Models, Adaptive Computation and Machine Learning, pp. 301–354. 
MIT Press, Cambridge, MA. Husmeier, D. (2003) Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics, 19, 2271–2282.
Husmeier, D., Dybowski, R. and Roberts, S. (2005) Probabilistic Modeling in Bioinformatics and Medical Informatics. Advanced Information and Knowledge Processing. Springer, New York. Imoto, S., Higuchi, T., Goto, T., Kuhara, S. and Miyano, S. (2003) Combining microarrays and biological knowledge for estimating gene networks via Bayesian networks. Proceedings of the IEEE Computer Society Bioinformatics Conference, CSB’03, 104–113. Kanehisa, M. (1997) A database for post-genome analysis. Trends in Genetics, 13, 375–376. Kanehisa, M. and Goto, S. (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28, 27–30. Kanehisa, M., Goto, S., Hattori, M., Aoki-Kinoshita, K., Itoh, M., Kawashima, S., Katayama, T., Araki, M. and Hirakawa, M. (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Research, 34, D354–357. Kolar, M., Song, L. and Xing, E. (2009) Sparsistent learning of varying-coefficient models with structural changes. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. K. I. and Culotta, A. (eds), Advances in Neural Information Processing Systems (NIPS), vol. 22, pp. 1006–1014. Krause, P. J. (1998) Learning probabilistic networks. Knowledge Engineering Review, 13, 321–351. Lèbre, S. (2007) Stochastic process analysis for Genomics and Dynamic Bayesian Networks inference. PhD thesis, Université d'Évry-Val-d'Essonne, France. Lim, W., Wang, K., Lefebvre, C. and Califano, A. (2007) Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks. Bioinformatics, 23, i282–i288. Locke, J., Southern, M., Kozma-Bognar, L., Hibberd, V., Brown, P., Turner, M. and Millar, A. (2005) Extension of a genetic network model by iterative experimentation and mathematical analysis. Molecular Systems Biology, 1 (online). Madigan, D. and York, J. (1995) Bayesian graphical models for discrete data. International Statistical Review, 63, 215–232. McClung, C. R. (2006) Plant circadian rhythms. Plant Cell, 18, 792–803. Mockler, T., Michael, T., Priest, H., Shen, R., Sullivan, C., Givan, S., McEntee, C., Kay, S. and Chory, J. (2007) The diurnal project: diurnal and circadian expression profiling, model-based pattern matching and promoter analysis. Cold Spring Harbor Symposia on Quantitative Biology, 72, 353–363. Mukherjee, S. and Speed, T. P. (2008) Network inference using informative priors. Proceedings of the National Academy of Sciences, 105, 14313–14318. Murphy, K. P. and Milan, S. (1999) Modelling gene expression data using dynamic Bayesian networks. Technical report, MIT Artificial Intelligence Laboratory. http://www.ai.mit.edu/∼murphyk/Papers/ismb99.ps.gz. Nobile, A. and Fearnside, A. (2007) Bayesian finite mixtures with an unknown number of components: the allocation sampler. Statistics and Computing, 17, 147–162. Pournara, I. and Wernisch, L. (2004) Reconstruction of gene networks using Bayesian learning and manipulation experiments. Bioinformatics, 20, 2934–2942. Robinson, J. W. and Hartemink, A. J. (2009) Non-stationary dynamic Bayesian networks. In Koller, D., Schuurmans, D., Bengio, Y. and Bottou, L. (eds), Advances in Neural Information Processing Systems (NIPS), vol. 21, pp. 1369–1376. Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A. and Nolan, G. P. (2005) Protein-signaling networks derived from multiparameter single-cell data. Science, 308, 523–529. Schäfer, J. and Strimmer, K.
(2005) A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4, Article 32. Werhli, A. and Husmeier, D. (2007) Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge. Statistical Applications in Genetics and Molecular Biology, 6. Werhli, A. V. and Husmeier, D. (2008) Gene regulatory network reconstruction by Bayesian integration of prior knowledge and/or different experimental conditions. Journal of Bioinformatics and Computational Biology, 6, 543–572. Werhli, A. V., Grzegorczyk, M. and Husmeier, D. (2006) Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks. Bioinformatics, 22, 2523–2531.
14 Random Graph Models and Their Application to Protein–Protein Interaction Networks

Desmond J. Higham1 and Nataša Pržulj2
1 Department of Mathematics and Statistics, University of Strathclyde, Glasgow, UK
2 Department of Computing, Imperial College London, UK

14.1 Background and motivation
Research in the analysis of large-scale biological datasets is a modern, evolving topic, driven by recent technological advances in experimental techniques [1–3]. It presents many fascinating challenges and offers computational scientists the possibility of contributing directly to biological understanding and therapeutics. This chapter focuses on a class of data that is particularly straightforward when viewed from a mathematical perspective. However, when studying such data it is necessary to be aware of the many simplifications and compromises that were involved in reaching such a streamlined summary of biological reality. A protein–protein interaction (PPI) network takes the form of a large, sparse and undirected graph. Nodes represent proteins and edges represent observed physical interaction between protein pairs. Figure 14.1 gives a simple schematic picture of how a set of physical interactions might arise between eight types of protein, labelled A to H. Figure 14.2 then shows the resulting (unweighted, undirected) PPI network. More realistically, for a particular organism a PPI network will involve several thousand proteins, and tens of thousands of edges. We can, of course, represent a PPI network in terms of its adjacency matrix W ∈ R^{N×N}, where w_{ij} = w_{ji} = 1 if proteins i and j have been observed to interact and w_{ij} = w_{ji} = 0 otherwise. Because a protein molecule may interact with another instance of the same type of protein, it is perfectly natural to have self-loops, so w_{ii} = 1 is allowed. In Figures 14.3 and 14.4 we show the adjacency matrices for two of the earliest publicly available datasets. Here a dot represents a nonzero element in the matrix. A large and rapidly expanding collection of PPI datasets is now available in the public domain; although some experience is needed in order to manipulate the databases and extract appropriate information. High-throughput techniques such as yeast two-hybrid (Y2H), tandem affinity purification (TAP), and mass spectrometric protein
Figure 14.1 Schematic representation of the physical basis for protein–protein interactions
complex identification (HMS-PCI) produce a rich source of experimental PPI data for many organisms [5–13]. However, these data are known to admit very high levels of uncertainty [14–18]. Y2H and TAP data have false negative rates estimated to be in the ranges of 43–71% and 15–50%, respectively [19]. False positive rates for Y2H could be as high as 64% and for TAP experiments they could be as high as 77% [19]. These errors are likely to be heavily biased, so that certain parts of the network may be much more accurately recorded. In addition to technical false positives arising from experimental limitations, depending on what the data are used for it may also be relevant to consider biological false positives–interactions that can be observed experimentally but do not occur in vivo since, for example,
• The two types of protein are expressed at different times in the cell’s life cycle, and hence never exist simultaneously.
• The two types of protein remain in distinct subcellular compartments, or tissues, and hence never become physically close.
We also note that the ability of two proteins to interact may depend strongly on the specific environment present in the cell, and hence a PPI network is best regarded as an aggregation of one or more snapshots of the interactions taking place over a range of circumstances.
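Returning to the adjacency matrix representation introduced above, the interactions sketched in Figures 14.1 and 14.2 can be encoded in a few lines of MATLAB (our own illustration; the node numbering A = 1, ..., H = 8 is an arbitrary choice):

% Build a sparse, symmetric adjacency matrix W from a list of interacting pairs.
N     = 8;                                     % proteins A..H of Figure 14.1
edges = [1 3; 1 4; 2 3; 2 4; ...               % {A,B} interact with {C,D}
         3 6; 3 7; 3 8; 4 6; 4 7; 4 8; ...     % {C,D} interact with {F,G,H}
         4 5];                                 % D interacts with E
W = sparse(edges(:,1), edges(:,2), 1, N, N);
W = spones(W + W');                            % symmetrize: wij = wji = 1
% A self-interacting protein would simply contribute a diagonal entry wii = 1.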
Figure 14.2 Network arising from the interactions suggested by Figure 14.1
Figure 14.3 Nonzero structure in the adjacency matrix for the yeast PPI network from [4]
Figure 14.4 Nonzero structure in the adjacency matrix for the yeast PPI network from [5]
14.2 What do we want from a PPI network?
PPI network data have the potential to address two types of question. Fine detail issues involve particular proteins or small groups of proteins, such as
• Are there any other proteins like protein A?
• What is the biological function of protein B?
• Which proteins act together to perform biological function C?
• What happens if protein D is removed?
• Is this link a false positive?
Big picture issues involve global aspects of the network, such as
• What is a good null model if we wish to test for the statistical significance of a pattern that has been discovered in a PPI network?
• How can we conveniently summarize a PPI network in terms of a small number of parameters?
• How can we quantify the difference between the PPI networks of two different organisms?
• What are the most likely candidates for the many false positives and false negatives?
• How did the PPI network evolve into its current state?
• How can we generate realistic synthetic network datasets on which to test computational algorithms?
Properties that capture useful information about a network include:
Pathlength. The pathlength between i and j is the shortest number of edges needed to traverse a path from i to j. Pathlength information may be summarized via the maximum or average pathlength over all pairs i, j.
Degree. The degree of the ith node is ∑_{k=1}^{N} w_{ik}. It is common to plot the degree distribution

P(k) := (number of nodes with degree k) / N.
The accuracy and usefulness of early reports that PPI networks exhibit a so-called scale-free degree distribution, where P(k) ∼ k^{−γ} with the exponent γ often being between 2 and 3, have more recently been questioned [20–22]; indeed the issue of how to test for a power law distribution is itself an interesting challenge [23].
Clustering coefficient. Suppose node k has ν neighbours. The clustering coefficient (or curvature) of node k is found by counting how many pairs of these neighbours are themselves connected, and dividing this number by the maximum possible number of connections, ν(ν − 1)/2. Typically the average clustering coefficient over all nodes is reported.
Using these types of measure, it is possible to categorize and compare networks, but of course, each such measure can only capture a single broad feature. A more systematic approach based on graphlet frequencies, monitoring the relative abundance of various subgraphs, has been adopted in [24–26]. Much of the graph theory oriented research on PPI networks has evolved out of the more general field of Network Science, driven by physicists, mathematicians and computer scientists [1,27]. Our aim in this chapter is to sample from this large field, describing some aspects of our recent PPI work, and related studies by others, that we feel are likely to be of interest to the intended audience. We focus on the use of network models–procedures for generating the links in a network–that (a) have some biological motivation (so that the phrase ‘model’ is relevant in the sense of applied mathematics) and (b) allow us to address some or all
of the fine detail and big picture questions mentioned above, even in the presence of uncertainty. As well as random models, where links are generated probabilistically, we will also consider some related deterministic counterparts that can be calibrated to a given network. We also note at this stage that many other sources of information are available about proteins, and their underlying genes, ranging from sequence data to Gene Ontology (GO) representations of the cellular component, molecular function and biological process aspects of gene products. The basic approach that we consider here is to extract as much information as possible from the PPI network alone, and then, when it is feasible, use extra sources of information as a means to validate the findings. However, combining and simultaneously computing with heterogeneous sources of protein data remains a key long term goal in this area.
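To make the degree and clustering coefficient measures of Section 14.2 concrete, the following short MATLAB sketch (ours) computes both quantities from an adjacency matrix W of the kind constructed in Section 14.1; self-loops are removed first so that they do not distort either measure:

A   = W - diag(diag(W));                            % strip self-loops
deg = full(sum(A, 2));                              % node degrees
Pk  = histc(deg, 0:max(deg)) / numel(deg);          % empirical degree distribution P(k)
cc  = zeros(numel(deg), 1);
for k = 1:numel(deg)
    nb = find(A(k, :));                             % neighbours of node k
    v  = numel(nb);
    if v > 1
        cc(k) = full(sum(sum(A(nb, nb)))) / (v*(v-1));   % connected neighbour pairs / max possible
    end
end
avg_clustering = mean(cc);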
14.3 PPI network models
14.3.1 Lock and key
Figure 14.1 is intended to reflect the fact that proteins are physical objects that bind when appropriate regions come together. The existence of such complementary binding domains is consistent with the underlying biochemistry, although only a relatively small percentage have been discovered so far. Some examples are described as ‘bumps’ and ‘holes’ in [28]. Motivated by this concept, Thomas et al. [29] suggested the idea of creating a network by assigning binding domains at random and then forming a link if and only if two proteins share a complementary pair. Their aim was to show that this class of models can reproduce the type of degree distribution found in real PPI networks. As we can see from Figure 14.1, if a network is formed entirely from the pairwise attraction of complementary binding domains then it may be broken down into complete bipartite subgraphs (we recall that a bipartite graph consists of two sets of nodes, {X_1, X_2, . . .} and {Y_1, Y_2, . . .}, where links of the form X_i ↔ X_j or Y_i ↔ Y_j are absent and all links of the form X_i ↔ Y_j are present). In the case of Figure 14.1, the relevant pairs of node sets are {A, B} and {C, D}, {C, D} and {F, G, H}, and {D} and {E}. Morrison et al. [30] used a coloured lock/key analogy to represent complementary binding domains. For example, in Figure 14.1, we may imagine that
• A and B have a red key, and C and D have the matching red lock.
• C and D have a blue key, and F, G and H have the matching blue lock.
• D has a green key and E has the matching green lock.
(Because the links are undirected, the roles of locks and keys are interchangeable.) It is then possible to consider the problem of reverse engineering this bipartite lock/key structure in a real PPI network, which corresponds in biological terms to the task of identifying proteins that have common or complementary binding domains. Under the simplifying assumption that there exists a lock/key pair in the network such that any protein with this lock/key
• does not have the matching key/lock,
• will only interact with a protein having the matching key/lock, and
• only has a fixed proportion 0 ≤ θ ≤ 1 of its lock/key matches recorded as interactions,
it was shown in [30] that the adjacency matrix, W, has a pair of eigenvalues

λ = ±θ √(locksum × keysum)

with eigenvectors

√keysum · ind[lock] ± √locksum · ind[key].
Here locksum denotes the number of locks of this type present in the network and ind[lock] ∈ R^N is an indicator vector that has 1 in its ith position if protein i has the lock and 0 otherwise, with keysum and ind[key] defined analogously. It is important to note that this lock/key assumption applies only to those proteins involved in that bipartite subnetwork–the rest of the PPI network could have any structure. This is attractive from the point of view that certain parts of a PPI network may be known to high accuracy, even when the whole network has high false positive and false negative rates. Furthermore, it is possible to appeal to the least-squares type variational properties of spectral methods in order to argue that this type of information is likely to be robust under perturbations [31]; so ‘approximately bipartite’ subgraphs may also be revealed through the spectrum. As a simple test, we note that for the original PPI network of [5], asking MATLAB [32] to compute the four most positive and most negative eigenvalues produces the following:

>> [U,D] = eigs(W,8,'BE');
>> diag(D)
ans =
   -6.4614
   -5.1460
   -4.1557
   -4.1270
    4.3778
    5.4309
    5.8397
    7.4096
For the extreme eigenvalues, −6.4614 and 7.4096, Figure 14.5 plots the sum and the difference of each component in the corresponding eigenvectors. It follows that indices with significant positive/negative entries are candidates for common keys/locks in a bipartite subnetwork. This approach can be summarized in the following pseudocode:
• Calculate eigenvalues/vectors of W.
• Group into ≈ ±λ pairs.
• For each pair with eigenvectors u_a and u_b:
  – choose a threshold, K;
  – |u_a + u_b|_i ≥ K means protein i has lock;
  – |u_a − u_b|_i ≥ K means protein i has key.
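The pseudocode above translates almost directly into MATLAB; the following sketch (ours) is one way to do it, where the pairing tolerance tol and the threshold K are illustrative choices, and the labelling of locks versus keys may swap because eigenvectors are only defined up to sign:

[U, D] = eigs(W, 8, 'BE');                 % extreme eigenvalues, as in the listing above
lam    = diag(D);
tol    = 0.05;  K = 0.1;                   % illustrative tolerance and threshold (our choices)
for a = 1:numel(lam)
    for b = a+1:numel(lam)
        if abs(lam(a) + lam(b)) < tol      % approximately a (+lambda, -lambda) pair
            s = abs(U(:,a) + U(:,b));      % large entries: candidate locks
            d = abs(U(:,a) - U(:,b));      % large entries: candidate keys
            locks = find(s >= K);
            keys  = find(d >= K);
            fprintf('pair (%.2f, %.2f): %d locks, %d keys\n', ...
                    lam(a), lam(b), numel(locks), numel(keys));
        end
    end
end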
Using the spectral information displayed in Figure 14.5, an approximate bipartite subnetwork was identified in [30] involving the proteins listed in Table 14.1. In this case 54 out of a possible 85 interactions were present between the two groups and only two of the possible 146 within-group interactions. Using extra information, it was then found that the group of five key proteins all contain a known binding domain called SH3, which is involved in linking cytoskeletal dynamics and trafficking of vesicles. The proteins of the ‘lock’ group, on the other hand, were found to form part of the actin cortical patch assembly mechanism of vesicle endocytosis [33]. These proteins were also identified as part of a larger group by a clustering method in a separate study [34], which did not mention the highly relevant interaction with SH3 domain proteins. Overall, there is a
Figure 14.5 Sum and difference of components of extreme eigenvectors for the PPI network of [5]
clear biological explanation for the bipartite substructure, which was discovered solely from the interaction topology. In fact, in this example, the ‘lock’ and ‘key’ analogy might be more usefully reinterpreted as ‘bus’ and ‘ticket’–the five SH3 proteins are available for transporting the 17 proteins in the other group. Of course, the main payoff from this type of algorithm occurs in the discovery of new information–when only certain proteins in a group have known properties, information about others in the group may be imputed

Table 14.1 Key and lock proteins arising from the spectral computation in Figure 14.5
Key: BZ11, YSC84, SLA1, YFR024C, RVS167
Lock: KTR3, YPL246C, YPR171W, YKR030W, SNA3, YOR284W, YGR268C, APP1, SYS1, BOS1, VPS73, YMR253C, YLR064W, YJR083C, YMR192W, ACF2, MMS4
Figure 14.6 Approximate bipartite subnetwork discovered in Drosophila melanogaster
using ‘guilt-by-association’. Many other examples of approximate bipartite subnetworks were discovered in [30] and given putative functional annotations. For example, Figure 14.6 shows an interaction structure that, based on annotations for some of the proteins involved, is suggested as a module participating in cell cycle transcriptional regulation in Drosophila melanogaster. To conclude this subsection we mention two further points:
1. The type of approximate bipartite structure shown in Figure 14.6 could not be found by direct application of traditional clustering techniques. Indeed the two groups of proteins have very few inter-group interactions–what defines the groups is the high level of commonality among their neighbours.
2. In theory, any network can be broken down into overlapping bipartite subnetworks simply by considering each edge as a separate structure. For this reason, any pattern that is discovered should be tested for statistical significance, both topologically (is this size and density of bipartivity likely to arise by chance?) and biologically (is this group of proteins enriched in terms of one or more GO annotations?).

14.3.2 Geometric networks
We may construct a geometric random graph by first placing N nodes independently and uniformly at random in R^d – more precisely, the ith node is given coordinates x^[i], where each component of x^[i] is independent and identically distributed with uniform (0,1) distribution. Next, for some specified radius, ε, nodes i and j are connected if and only if ‖x^[i] − x^[j]‖ ≤ ε, where ‖·‖ is some vector norm, which we will always take to be the Euclidean norm. Figure 14.7(a) illustrates the node placement procedure in the d = 2 dimensional unit square. Here we have N = 100 nodes. Using
Figure 14.7 (a) Geometric random graph with N = 100 nodes and radius ε = 0.25. (b) The corresponding adjacency matrix. (c) Nodes projected into the unit square from an MDS-based algorithm. (d) Relative error in the recovered adjacency matrix as a function of ε
a radius of ε = 0.25 produces a graph whose adjacency matrix is shown in Figure 14.7(b). Of course, the resulting graph is simply the usual list of nodes and edges–information about the underlying physical locations of the nodes is not part of the final mathematical object. There is a large amount of literature devoted to the theoretical study of geometric random graphs; see [35] for a comprehensive treatment. But only relatively recently has it emerged that they can be useful practical models in biology. Pržulj and co-workers [26,36] have shown that suitably calibrated two- and three-dimensional versions give surprisingly accurate reproductions of many features of real biological networks. Geometric graph models are biologically intuitive in the following sense. Biological entities, such as genes and proteins, exist in some multidimensional biochemical space. It is believed that genomes evolve through a series of gene (and sometimes entire genome) duplications and mutations. In geometric terms, when a gene gets duplicated, its child begins at the same position in space as the parent and then natural selection acts either to eliminate one, or to move the pair apart in space so that they still share some interacting partners (and thus biological functionality) due to their proximity, but also have some different interacting partners. The results of such evolutionary processes are naturally modelled by geometric graphs [36]. Currently, it is difficult even to hypothesize about the nature or dimensionality of this space. It is possible that the dimensionality stands
not only for the complementarity of binding domains and biochemical surroundings of proteins in the cell, but that it also has to do with unicellularity and multicellularity of eucaryotic organisms. This is because we notice that the fit of geometric graphs to PPI networks does not improve after dimension 4 for unicellular organisms and after dimension 9 for multicellular ones. Just as the assumptions behind a lock-and-key model can be directly tested by reverse engineering the locks and keys in a real PPI network, the possible geometric structure can also be probed. If we were given the actual pairwise Euclidean distances between all pairs of nodes in a geometric graph, then multidimensional scaling (MDS) [37] could be used to recover their locations (up to the obvious invariants of reflection and rotation). In the case of a PPI network, we only have edge information. The idea of using pathlength as a proxy for Euclidean distance was considered in [38]. MDS may then be applied in order to project the nodes into the low-dimensional space. The idea is illustrated in Figure 14.7. Figure 14.7(c) shows the location of the 100 nodes after MDS has been applied to the pathlength-based distances. We may then reconstruct a network by choosing a radius, ε. Figure 14.7(d) shows how the norm of the relative error ‖W − W_ε‖_2/‖W‖_2 behaves as a function of ε. Here, W is the original adjacency matrix and W_ε is the adjacency matrix that arises when we construct a geometric graph from the relocated nodes with a radius of ε. We see that the error is smallest around the ‘correct’ value of ε = 0.25 that was used to generate the data. Using ε = 0.25, Figure 14.8 shows the position and number of correct negatives (blank), correct positives (solid circle), false negatives (inverted triangle) and false positives (open circle) in the recovered adjacency matrix.
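The experiment of Figure 14.7 is easy to reproduce in outline. The sketch below (our own code, using only base MATLAB) generates a geometric random graph, computes pathlength-based distances by breadth-first search, applies classical MDS, rescales the projected nodes into the unit square, and records the relative error as ε varies; the capping of unreachable pairs and the rescaling step are pragmatic choices that we assume here:

N = 100;  d = 2;  eps0 = 0.25;
X = rand(N, d);                                            % node positions in the unit square
DX = sqrt(max(bsxfun(@plus, sum(X.^2,2), sum(X.^2,2)') - 2*(X*X'), 0));
W  = double(DX <= eps0);  W(1:N+1:end) = 0;                % geometric random graph

P = inf(N);                                                % all-pairs pathlengths by BFS
for s = 1:N
    dist = inf(N,1); dist(s) = 0; queue = s;
    while ~isempty(queue)
        v = queue(1); queue(1) = [];
        nb = find(W(v,:) & isinf(dist)');
        dist(nb) = dist(v) + 1;  queue = [queue, nb];
    end
    P(:, s) = dist;
end
P(isinf(P)) = max(P(~isinf(P))) + 1;                       % cap unreachable pairs (assumption)

J = eye(N) - ones(N)/N;                                    % classical MDS via double centring
B = -0.5 * J * (P.^2) * J;
[V, E] = eig((B + B')/2);
[e, idx] = sort(diag(E), 'descend');
Y = V(:, idx(1:2)) * diag(sqrt(max(e(1:2), 0)));           % 2-D embedding
Y = bsxfun(@rdivide, bsxfun(@minus, Y, min(Y)), max(Y) - min(Y));   % rescale to unit square

DY = sqrt(max(bsxfun(@plus, sum(Y.^2,2), sum(Y.^2,2)') - 2*(Y*Y'), 0));
epsgrid = 0.05:0.01:0.5;  err = zeros(size(epsgrid));
for i = 1:numel(epsgrid)
    We = double(DY <= epsgrid(i));  We(1:N+1:end) = 0;
    err(i) = norm(W - We) / norm(W);                       % relative error, cf. Figure 14.7(d)
end
plot(epsgrid, err)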
Figure 14.8 Correct negatives (blank; 4091), correct positives (solid circle; 610), false negatives (inverted triangle; 102) and false positives (open circle; 147) in the recovered adjacency matrix, using ε = 0.25 on the example in Figure 14.7
Figure 14.9 ROC curve for the MDS-based algorithm on the example from Figure 14.7
Perhaps a more realistic measure of the success of this algorithm is given via a receiver operating characteristic (ROC) curve. Here, as ε varies, we compute

Sensitivity := TP/(TP + FN)   and   Specificity := TN/(TN + FP),
where TP, FN, TN and FP stand for the number of true positives, false negatives, true negatives and false positives, respectively. In our context, increasing ε will increase sensitivity at the expense of lowering specificity. The ideal is to have both sensitivity and specificity values equal to one. We can judge the outcome by plotting 1 − specificity against sensitivity, and checking whether the curve has values close to (0, 1) in the x-y plane. It is also common to compute the area under such a ROC curve. Figure 14.9 shows the results for the example from Figure 14.7. The value of the curve at the ‘correct’ radius of ε = 0.25 is marked with an asterisk. In this case computing the area under the ROC curve gave 0.965. For comparison, Figure 14.10 shows a ROC curve arising when edges between nodes are predicted by a coin flip–for a fixed value of p we go through the network inserting each possible edge with independent probability p. As p increases from zero we trace out a ROC curve, which in this instance gives an area of 0.48. By contrast, Figure 14.11 shows the corresponding results that arise when we start with a nongeometric network. Here we applied the algorithm to a classical Erdős–Rényi style random graph, where edges were created independently. In this case there is no geometric structure to uncover, and the area under the ROC curve is 0.67. Tests with the algorithm on data from 19 real PPI networks in [38] found strong evidence for two-dimensional geometric structure; for example a high quality yeast network had an area under the ROC curve of 0.89. Overall, current PPI networks are essentially indistinguishable from noisy geometric graphs. The geometric model was then used in [39] as the basis for an algorithm to predict false negatives and thus guide future biological experiments. The main tenet is that proteins placed sufficiently close together in the geometric embedding are strong candidates for interacting pairs; so a ‘missing’ edge between two such proteins can be flagged as a potential error. The resulting algorithm predicted 251 new interactions in the human PPI network, 70.2% of which correspond to protein pairs sharing at least 1 GO annotation term.
Figure 14.10 ROC curve when we try to predict edges using coin flips in the example from Figure 14.7
Moreover, 13 of those predicted interactions were then validated by considering updated and alternative PPI databases.
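Returning to the ROC construction described earlier, the quantities can be computed directly from the embedding. The sketch below (ours) continues from the MDS example above, reusing W and the embedded distance matrix DY, and sweeps ε to trace out the curve and estimate the area beneath it:

mask  = triu(true(N), 1);                       % each protein pair counted once
truth = W(mask) > 0;                            % true edges
sens = [];  fpr = [];
for eps = 0:0.01:1.5
    We   = double(DY <= eps);  We(1:N+1:end) = 0;
    pred = We(mask) > 0;
    TP = sum(pred & truth);   FN = sum(~pred & truth);
    TN = sum(~pred & ~truth); FP = sum(pred & ~truth);
    sens(end+1) = TP / (TP + FN);               %#ok<AGROW>
    fpr(end+1)  = 1 - TN / (TN + FP);           %#ok<AGROW>
end
auc = trapz(fpr, sens);                         % area under the ROC curve
plot(fpr, sens), xlabel('1-Specificity'), ylabel('Sensitivity')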
14.4 Range-dependent graphs
In a geometric random graph model of PPI interaction, the physical location of the proteins is specified probabilistically and then the interactions follow deterministically. By contrast, Grindrod [40] proposed a
Figure 14.11 ROC curve when the MDS-based algorithm is applied to a random graph with no underlying geometric structure
Figure 14.12 Illustration of range-dependent interaction probabilities
random graph model for PPI networks where the proteins are located deterministically but the interactions then follow probabilistically. More precisely, Grindrod’s range-dependent random graph (RDRG) model begins with nodes on a one-dimensional integer lattice and assigns edges between pairs of nodes with a probability that decays with their lattice distance, as illustrated in Figure 14.12. We may express this more formally as follows:

Definition 14.4.1 For a given decay function, f, that maps from {1, 2, . . . , N − 1} to [0, 1], the RDRG model generates an edge between nodes i and j with independent probability f(|j − i|).

Grindrod focused on the case of geometric decay, where f(k) = αλ^{k−1} for constants α, λ ∈ [0, 1] and showed that a generating function method could then be used to calculate the clustering coefficient and other key network properties. In this section we will simplify to f(k) = λ^k. For λ = 0.85 and N = 100, an instance of such a RDRG is shown in Figure 14.13(a). The motivation for the RDRG model is that it should be possible to place proteins in order according to their similarity, so that proteins close together in this ordering are more likely to interact. In practice, of course, even if such an arrangement existed, the proteins would not be supplied in such a convenient order. In Figure 14.13(b) we show the same network with the nodes arbitrarily shuffled, so as to hide the range-dependent structure. This raises the related inverse problem of computing an appropriate protein reordering given the interaction network. To tackle this problem, Grindrod used the likelihood function

L_lin(p) := ∏_{edge p_i ↔ p_j} λ^{|p_i − p_j|} ∏_{no edge p_i ↔ p_j} (1 − λ^{|p_i − p_j|}),        (14.1)
corresponding to a node ordering where i → p_i, and looked at maximizing over all orderings via genetic search. Higham [41] showed that spectral reordering algorithms offer a quicker and more effective solution. To describe such an algorithm, we introduce the normalized Laplacian, L := D^{−1/2}(D − W)D^{−1/2}, where D is the N × N diagonal matrix, diag(d_1, . . . , d_N), containing the vertex degrees d_i = ∑_{j=1}^{N} w_{ij}. Denoting the eigenvalue/eigenvector pairs by {λ_i, v^[i]}, it is known that for a connected network the eigenvalues may be ordered 0 = λ_1 < λ_2 ≤ · · · ≤ λ_N ≤ 2. A linear spectral reordering algorithm may then be summarized as follows:
1. Compute a subdominant eigenvector x := v^[2].
2. Construct a permutation vector p according to
   p_i ≤ p_j ⟺ x_i ≤ x_j.
Figure 14.13(c) scatter plots the components of v[2] and v[3] , and Figure 14.13(d) shows the network reordered by this spectral algorithm. (We note that spectral methods are invariant to node permutation, so the same answer arises when the algorithm is applied to the original or the shuffled network.)
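As an illustration of Definition 14.4.1 and the reordering algorithm just described (our own sketch, with λ = 0.85 and N = 100 chosen to match Figure 14.13), the following code generates an RDRG with f(k) = λ^k, shuffles the nodes, and recovers an ordering from the subdominant eigenvector of the normalized Laplacian:

N = 100;  lambda = 0.85;
[I, J] = meshgrid(1:N, 1:N);
Pedge  = lambda .^ abs(I - J);                 % range-dependent edge probabilities f(|i-j|)
U      = triu(rand(N) < Pedge, 1);             % sample each pair once, above the diagonal
W      = double(U | U');                       % symmetric 0/1 adjacency matrix
perm   = randperm(N);  Ws = W(perm, perm);     % shuffle to hide the structure

dvec = sum(Ws, 2);  dvec(dvec == 0) = 1;       % guard against isolated nodes (rare here)
Dih  = diag(1 ./ sqrt(dvec));
Lnorm = eye(N) - Dih * Ws * Dih;               % normalized Laplacian D^{-1/2}(D - W)D^{-1/2}
[V, E] = eig((Lnorm + Lnorm')/2);
[~, idx] = sort(diag(E), 'ascend');
x = V(:, idx(2));                              % subdominant eigenvector v^[2]
[~, order] = sort(x);                          % node ordering induced by x
spy(Ws(order, order))                          % reordered adjacency matrix, cf. Figure 14.13(d)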
Figure 14.13 (a) RDRG with λ = 0.85 and N = 100. (b) Shuffled version. (c) Components of the eigenvectors v^[2] and v^[3]. (d) Linear spectral reordering. (e) Periodic spectral reordering
In [42], the authors define a periodic version of a range-dependent random graph, abbreviated to pRDRG, by using periodic lattice distance. In this case, for example, nodes 1 and N are a unit distance apart; in the linear RDRG their separation distance would be N − 1. This may be defined formally as follows:

Definition 14.4.2 For a given decay function, f, that maps from {1, 2, . . . , N − 1} to [0, 1], the pRDRG model generates an edge between nodes i and j with independent probability f(min{|j − i|, N − |j − i|}).

Figure 14.14(a) illustrates an instance of a pRDRG in the case where N = 100 and f(k) = λ^k with λ = 0.85, with a shuffled version shown in Figure 14.14(b). Analogously to (14.1), under the pRDRG model, for a given network and an ordering i → p_i we have a likelihood of

L_per(p) := ∏_{edge p_i ↔ p_j} λ^{min(|p_i − p_j|, N−|p_i − p_j|)} ∏_{no edge p_i ↔ p_j} (1 − λ^{min(|p_i − p_j|, N−|p_i − p_j|)}).        (14.2)
Figure 14.14 (a) pRDRG with λ = 0.85 and N = 100. (b) Shuffled version. (c) Components of the eigenvectors v^[2] and v^[3]. (d) Linear spectral reordering. (e) Periodic spectral reordering
The following periodic spectral reordering algorithm for a pRDRG was then derived in [42].
1. Compute an eigenvector pair v^[2] and v^[3].
2. Let θ_i = tan^{−1}(v_i^[2] / v_i^[3]).
3. Construct a permutation vector p according to
   p_i ≤ p_j ⟺ θ_i ≤ θ_j.
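A minimal counterpart of this periodic algorithm (our sketch, reusing Ws, V and idx from the linear reordering example above; we use atan2 rather than tan^{−1} so that the angle covers the full circle) is:

v2 = V(:, idx(2));  v3 = V(:, idx(3));
theta = atan2(v3, v2);                 % angle of each (v^[2]_i, v^[3]_i) pair
[~, order] = sort(theta);              % order the nodes around the circle
spy(Ws(order, order))                  % cf. Figures 14.13(e) and 14.14(e)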
Figures 14.13(e) and 14.14(e) show the networks reordered according to this periodic-oriented algorithm. We see from the two figures that both algorithms are able to reveal the relevant structure when it is present in the network–the linear algorithm recovers the RDRG structure and the periodic algorithm recovers the pRDRG version. Now that we have two related models, it is reasonable to ask which one best describes real PPI data. The approach taken in [42] was first to calibrate the two models by choosing the decay parameter, λ, so that the number of edges in the PPI network matches the expected number of edges in the model. Then, after computing the likelihoods (14.1) and (14.2) given by the corresponding spectral reordering algorithms, we
Table 14.2 Results for linear versus periodic range-dependent reordering

Network        Number of proteins   Number of interactions   L
Y11000         2137                 10816                    −1.39 × 10^−2
Y2455          573                  2097                     −1.25 × 10^−2
YItoCore       417                  511                      −2.83 × 10^−2
YUetz          473                  543                      −1.23 × 10^−2
YItoCoreUetz   970                  1229                     −9.03 × 10^−3
hStelzlH       314                  727                      −3.69 × 10^−2
hStelzlHM      1076                 2116                     −5.62 × 10^−4
hStelzlHML     1411                 2594                     1.31 × 10^−3
hRual          1686                 3359                     1.50 × 10^−2
hBIND          1818                 2725                     −9.93 × 10^−3
hMINT          1446                 2896                     −1.23 × 10^−2
WCore          1218                 1902                     −1.80 × 10^−2
WZhSt          2060                 18000                    −1.21 × 10^−2
may evaluate the normalized log-likelihood ratio

L = [2 / (N(N − 1))] log( L_lin(p^lin) / L_per(p^per) ).        (14.3)
Table 14.2 summarizes some results from [42] where the following 13 PPI networks were analysed.
• Two yeast PPI networks described in [43]: a network defined by the top 11 000 interactions (denoted Y11000 in Table 14.2) and its high confidence subnetwork (Y2455). Three further yeast PPI networks are the ‘core’ from [7], the network from [5] and the union of both, denoted YItoCore, YUetz and YItoCoreUetz, respectively.
• Human PPI networks used included three networks of different confidence level: high (hStelzlH), high and medium (hStelzlHM) and high, medium and low (hStelzlHML) from [11] and a network from [12] (hRual). Two further human networks were downloaded from databases BIND and MINT [44,45] (hBIND and hMINT).
• The worm ‘core’ network (Wcore) from [46] and the worm PPI network (WZhSt) from [47].
14.5
Summary
This chapter has focused on the development of random graph models, and their deterministic counterparts, which lead to algorithms that can be directly used and tested on real PPI networks. They have the potential to reveal new biological insights at a local level, for example:
306 • • •
• Are these two proteins similar?
• Is this a ‘long range’ interaction?
• Which other proteins share this binding domain?
Also, at a global level, for example: • • •
• Can all proteins be usefully embedded in a low-dimensional space?
• Is the network inherently periodic?
• Can these two networks be summarized with a similar set of model parameters?
As the experimental data become more comprehensive and less noisy, this type of model-based, data-driven network analysis will have an increasingly significant role to play. We have shown by example that new insights can be gleaned directly from the network topology alone, but a key future challenge is to develop computational tools that can seamlessly combine interaction network data and other quantitative information, from the fully discrete sequence and pathway level to real-valued gene expression measurements.
15 Modelling Biological Networks via Tailored Random Graphs

Anthony C. C. Coolen¹,², Franca Fraternali², Alessia Annibale¹, Luis Fernandes² and Jens Kleinjung³

¹ Department of Mathematics, King's College London, UK
² Randall Division of Cell and Molecular Biophysics, King's College London, UK
³ Division of Mathematical Biology, National Institute for Medical Research, London, UK
15.1 Introduction
The nature of biological research has changed irreversibly in the last decade. Experimental advances have generated unprecedented amounts of genetic and structural information at molecular levels of resolution, and to understand the biological systems that are now being studied, knowing the list of parts no longer suffices. The parts have become too complicated and too numerous. It is often not even clear how best to represent the new high-dimensional experimental data in a way that helps us make sense of them. We need to integrate our knowledge of the individual biological events and components in mathematical and computational models that can capture the complexity of the data at a systems level. Data collected on metabolic, regulatory or signalling processes in the cell are usually represented in the form of networks, with nodes representing dynamical variables (metabolites, enzymes, RNA or protein concentrations) and links between nodes representing pairs of variables that have been observed to interact with each other. Some of these networks are directed (e.g. metabolic and gene regulatory networks, GRNs), and some are undirected (e.g. protein–protein interaction networks, PPINs). The idea behind this representation is that functional properties of complex cellular processes will have fingerprints in the structure of their networks. Most of the observed biological networks, however, are too complex to allow for direct interpretations; to proceed we need precise tools for quantifying their topologies. Ensembles of tailored random graphs with controlled topological properties are a natural and rigorous language for describing biological networks. They suggest precise definitions of structural features, they allow us to classify networks and obtain precise (dis)similarity measures, they provide ‘null models’ for
hypothesis testing, and they can serve as efficient proxies for real networks in process modelling. In this chapter we explain the connection between biological networks and tailored random graphs, we show how this connection can be exploited, and we discuss exact algorithms for generating such graphs numerically.
15.2 Quantitative characterization of network topologies
We consider networks with $N$ nodes ('vertices') labelled by roman indices. For each node pair $(i,j)$ we write $c_{ij}=1$ if a link ('edge') $j\to i$ is present, and $c_{ij}=0$ if not. The set of $N^2$ link variables $\{c_{ij}\}$ specifies the network in full, and is abbreviated as $\mathbf{c}$. We limit ourselves in this chapter to nondirected networks (such as PPINs), where $c_{ij}=c_{ji}$ for all $(i,j)$, and we assume that $c_{ii}=0$ for all $i$. We denote the set of all such nondirected networks as $G=\{0,1\}^{\frac{1}{2}N(N-1)}$. The specific biological networks we are interested in tend to be large, containing of the order of $N\sim 10^4$ nodes, but with a small average number $\langle k\rangle = N^{-1}\sum_{ij}c_{ij}$ of links per node. The current estimate for e.g. the human PPIN is $\langle k\rangle\sim 7$.

15.2.1 Local network features and their statistics
To characterize networks quantitatively a natural first step is to inspect simple local quantities and their distributions (Albert and Barabási 2002; Newman 2003; Dorogovtsev et al. 2008), such as the degrees $k_i(\mathbf{c})=\sum_j c_{ij}$ (the number of partners of node $i$) or the clustering coefficients $C_i(\mathbf{c})=[\sum_{j\neq k}c_{ij}c_{ik}c_{jk}]/[\sum_{j\neq k}c_{ij}c_{ik}]$ (the fraction of the partners of $i$ that are themselves connected). For instance, the distribution of the $N$ degrees,¹

$$p(k|\mathbf{c}) = \frac{1}{N}\sum_i \delta_{k,k_i(\mathbf{c})} \qquad (15.1)$$

gives us a simple and transparent characterization of the network's topology. Often we would in addition like to capture correlations between local properties of different nodes, especially between connected nodes, which prompts us to define also distributions such as

$$W(k,k'|\mathbf{c}) = \frac{1}{N\langle k\rangle}\sum_{ij} \delta_{k,k_i(\mathbf{c})}\, c_{ij}\, \delta_{k',k_j(\mathbf{c})} \qquad (15.2)$$

i.e. the fraction of connected node pairs $(i,j)$ with degrees $(k,k')$. From (15.2) follows the assortativity (Newman 2002), the overall correlation between the degrees of connected nodes:

$$a(\mathbf{c}) = \frac{\langle kk'\rangle_w - \langle k\rangle_w^2}{\langle k^2\rangle_w - \langle k\rangle_w^2} \qquad (15.3)$$

with the short-hand $\langle f(k,k')\rangle_w = \sum_{kk'} f(k,k')\,W(k,k'|\mathbf{c})$, and where we used the symmetry of $W(k,k'|\mathbf{c})$. Both (15.1) and (15.2) are global measures of network structure, and they provide complementary information. They have the advantage of not depending explicitly upon the network size $N$ (such a dependence would be undesirable, as most biological datasets are known to represent incomplete samples and the sizes of available biological networks continue to increase). However, (15.1) and (15.2) are not independent, since

$$W(k|\mathbf{c}) = \sum_{k'} W(k,k'|\mathbf{c}) = \frac{k\, p(k|\mathbf{c})}{\langle k\rangle} \qquad (15.4)$$

¹ Here $\delta_{ab}=1$ if $a=b$, and $\delta_{ab}=0$ if $a\neq b$.
Figure 15.1 Windows on the Homo sapiens PPIN data in Prasad et al. (2009), with $N=9306$ proteins and $\langle k\rangle = 7.53$ interactions per node on average. (a) Data shown as a nondirected network, with proteins as nodes and physical interactions as links. (b) Degree distribution $p(k)$, defined in (15.5). (c) Rescaled joint degree statistics of connected nodes $\Pi(k,k')=W(k,k')/W(k)W(k')$, following (15.6). One would have $\Pi(k,k')=1$ for all $(k,k')$ if connected nodes had uncorrelated degrees, so deviations from light green suggest nontrivial structural properties. Reproduced from Fernandes et al. (2010).
More generally one could have for each node $i$ a list $\mathbf{k}_i(\mathbf{c})=(k_i^1(\mathbf{c}),\ldots,k_i^r(\mathbf{c}))$ of local quantities. Choosing e.g. $k_i^{\ell}(\mathbf{c})=(\mathbf{c}^{\ell+1})_{ii}$ would give $k_i^2(\mathbf{c})=\sum_{jm}c_{ij}c_{jm}c_{mi}$ (the number of length-3 paths through $i$),² followed by counters of longer loops. The choice $k_i^{\ell}(\mathbf{c})=\sum_j(\mathbf{c}^{\ell})_{ij}$ would give observables that count the number of paths through each node of a given length (open or closed).³ The distributions (15.1, 15.2) would then generalize to

$$p(\mathbf{k}|\mathbf{c}) = \frac{1}{N}\sum_i \delta_{\mathbf{k},\mathbf{k}_i(\mathbf{c})} \qquad (15.5)$$

$$W(\mathbf{k},\mathbf{k}'|\mathbf{c}) = \frac{1}{N\langle k\rangle}\sum_{ij} \delta_{\mathbf{k},\mathbf{k}_i(\mathbf{c})}\, c_{ij}\, \delta_{\mathbf{k}',\mathbf{k}_j(\mathbf{c})} \qquad (15.6)$$

with (15.5) giving the overall fraction of nodes in the network with local properties $\mathbf{k}$, and (15.6) giving the fraction of connected node pairs $(i,j)$ with local properties $(\mathbf{k},\mathbf{k}')$. In this chapter we will mainly work with structure characterizations of the above form. However, we note that there are alternatives. One is the network spectrum $\varrho(\mu|\mathbf{c})=N^{-1}\sum_i\delta[\mu-\mu_i(\mathbf{c})]$, where the $\mu_i(\mathbf{c})$ are the eigenvalues of the matrix $\mathbf{c}$; from it one obtains the joint distribution of loops of all lengths, see e.g. Kühn (2008) and Rogers et al. (2010) and references therein. Another alternative is the spectrum of the network Laplacian $\mathbf{L}(\mathbf{c})$, a matrix defined as $L_{ij}(\mathbf{c})=k_i(\mathbf{c})\delta_{ij}-c_{ij}$; it contains information on sub-graph statistics, modularity, and typical path distances, see e.g. Mohar (1991) and references therein.

² From $k_i^1(\mathbf{c})$ and $k_i^2(\mathbf{c})$ the clustering coefficient $C_i(\mathbf{c})$ follows via $C_i(\mathbf{c})=k_i^2(\mathbf{c})/k_i^1(\mathbf{c})[k_i^1(\mathbf{c})-1]$.
³ These so-called generalized degrees were, to our knowledge, first proposed in Skantzos (2005).

15.2.2 Examples

Figure 15.1 illustrates the topology characterization (15.1, 15.2) for the PPIN of H. sapiens. It is clear that from the network image itself [Figure 15.1(a)] one cannot extract much useful information. Instead we characterize the network structure hierarchically by measuring increasingly sophisticated degree-related quantities.
The zero-th level is to measure the average degree $\langle k\rangle$. The next level is measuring the degree distribution $p(k|\mathbf{c})$ (15.1). Finally, we collect degree statistics of connected nodes by probing $W(k,k'|\mathbf{c})$ (15.2). To aid our interpretation of the latter distribution we plot, rather than $W(k,k'|\mathbf{c})$ itself, the ratio

$$\Pi(k,k'|\mathbf{c}) = \frac{W(k,k'|\mathbf{c})}{W(k|\mathbf{c})\,W(k'|\mathbf{c})} \qquad (15.7)$$

where the marginals $W(k|\mathbf{c})$ follow via (15.4). A weak Gaussian smoothening is applied to $\Pi(k,k')$, to prevent trivial pathologies in calculations. Since $W(k|\mathbf{c})$ can be written directly in terms of $p(k|\mathbf{c})$, only the ratio $\Pi(k,k'|\mathbf{c})$ can reveal topological information (if any) that is not already contained in the degree distribution. Any significant deviation from $\Pi(k,k'|\mathbf{c})=1$ tells us that the network wiring contains nontrivial regularities beyond those encoded in the degree distribution, which manifest themselves in either a higher (red) or a lower (blue) than expected tendency of degree pairs $(k,k')$ to interact. In dealing with real data one should be aware, however, that these are generally incomplete samples of a true underlying biological network, and that sampling impacts on the shape of distributions such as (15.5, 15.6) (Han et al. 2005; Stumpf and Wiuf 2005). Both links and nodes can be under-sampled, and the sampling is likely to be biased (Hakes et al. 2008). For instance, PPIN datasets are influenced by experimentalists' focus on proteins that are 'interesting' or easy to measure, by experimental protocol, and even by data processing.
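As a practical companion to the definitions (15.1)–(15.4) and (15.7), the following minimal sketch (our own illustration, not taken from the chapter) computes these degree statistics from a 0/1 adjacency matrix with numpy; variable names are ours, and the weak Gaussian smoothening mentioned above is omitted.

```python
# Illustrative sketch: degree statistics of Section 15.2 from an adjacency matrix.
import numpy as np

def degree_statistics(c):
    """c: symmetric 0/1 adjacency matrix with zero diagonal (nondirected graph)."""
    N = c.shape[0]
    k = c.sum(axis=1).astype(int)            # degrees k_i(c)
    k_mean = k.sum() / N                      # <k> = N^-1 sum_ij c_ij
    kmax = k.max()

    # p(k|c): fraction of nodes with degree k  (15.1)
    p = np.bincount(k, minlength=kmax + 1) / N

    # W(k,k'|c): fraction of connected node pairs with degrees (k,k')  (15.2)
    W = np.zeros((kmax + 1, kmax + 1))
    ii, jj = np.nonzero(c)                    # all ordered pairs with c_ij = 1
    np.add.at(W, (k[ii], k[jj]), 1.0)
    W /= N * k_mean

    # assortativity a(c)  (15.3), with <f>_w the average over W(k,k')
    ks = np.arange(kmax + 1, dtype=float)
    kk_w = (ks[:, None] * ks[None, :] * W).sum()
    k_w = (ks[:, None] * W).sum()
    k2_w = (ks[:, None] ** 2 * W).sum()
    a = (kk_w - k_w ** 2) / (k2_w - k_w ** 2)

    # Pi(k,k'|c) = W(k,k') / [W(k) W(k')]  (15.7), with marginal W(k) = k p(k)/<k>
    Wk = ks * p / k_mean
    denom = Wk[:, None] * Wk[None, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        Pi = np.where(denom > 0, W / denom, 0.0)
    return p, W, a, Pi
```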
15.3 Network families and random graphs

15.3.1 Network families, hypothesis testing and null models

Any quantitative characterization of networks via increasingly detailed structure measurements, of which $\langle k\rangle$, $p(k)$ and $W(k,k')$ are specific examples, induces automatically a hierarchical classification of all networks into families, see Figure 15.2. This is not a deep insight, but it does aid our formulation of practical concepts and questions. Let us denote with $G[p]\subseteq G$ the subset of networks with topological features characterized by a given choice for the function $p(k)$ (see Figure 15.2), and let us write the number of networks in that subset as $|G[p]|$. Similarly, let us denote with $G[p,\Pi]\subseteq G[p]$ the subset of networks with topological features characterized by specified choices for both $p(k)$ and $\Pi(k,k')$, with $|G[p,\Pi]|$ giving the number of such networks.
Figure 15.2 Classification of networks into hierarchically organized families. Starting from the set of all nondirected graphs of size $N$, we first classify networks according to their average connectivities, followed by class subdivision for each $\langle k\rangle$ according to the degree distribution $p(k)$, and by further class subdivision for each $p(k)$ according to the degree correlation profile $\Pi(k,k')=W(k,k')/W(k)W(k')$, etc. By construction, the subsets will decrease in size with each subdivision, i.e. the specification of networks becomes increasingly detailed and prescriptive.
It is then natural to define network comparison, observation interpretation, and hypothesis testing along the following lines:

• A network with features $\{p,\Pi\}$ is more complex than a network with features $\{p',\Pi'\}$ if $|G[p,\Pi]| < |G[p',\Pi']|$. The rationale is this: the smaller the number of networks with given features $\{p,\Pi\}$, i.e. the smaller the associated compartment in Figure 15.2, the more difficult it will be to find or construct a network with these specific features.
• Measuring a value $\Omega$ for some observable $\Omega(\mathbf{c})$ in a network $\mathbf{c}\in G[\ldots]$ is trivial for the family $G[\ldots]$ if most of the networks $\mathbf{c}\in G[\ldots]$ exhibit $\Omega(\mathbf{c})=\Omega$. Especially in large networks, where usually $|G[p,\Pi]|\ll|G[p]|$, an observation may be nontrivial for $G[p]$ but trivial once we limit ourselves to $G[p,\Pi]$. For instance, in the set of all graphs with average degree $\langle k\rangle$ the vast majority will have Poissonian degree statistics, so observing $\langle k^2\rangle > 2\langle k\rangle^2$ becomes highly unlikely for large $N$. Yet, once we limit ourselves further to networks with power-law degree distributions the previously unlikely event becomes ordinary.
• To test a hypothesis that an observation $\Omega(\mathbf{c}')=\Omega$ in a network $\mathbf{c}'$ is atypical, we must define a null hypothesis in terms of one of the above sets $G[\ldots]$. The $p$-value of the test is then the probability to observe $\Omega(\mathbf{c})=\Omega$ (or a more extreme value) if we pick graphs $\mathbf{c}$ randomly from $G[\ldots]$ (a simple Monte Carlo version of such a test is sketched below). In analogy with our previous observations, an observation may have a very small $p$-value (and be interpreted as important) if we choose a large and diverse family of networks, say $G[p]$, but may be recognized as trivial once we limit ourselves to the subset $G[p,\Pi]$ to which $\mathbf{c}'$ belongs. In the latter case we would say that the observation $\Omega(\mathbf{c}')=\Omega$ is a strict consequence of its degree correlations, as measured by $\Pi(k,k')$.
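The Monte Carlo test referred to in the last bullet point can be sketched as follows (our own hedged illustration; `sample_null` is an assumed user-supplied routine that draws graphs from the chosen family $G[\ldots]$, and a one-sided alternative is assumed):

```python
# Hedged sketch of a Monte Carlo p-value against a constrained null model.
import numpy as np

def mc_p_value(c_observed, observable, sample_null, n_samples=1000, rng=None):
    """observable: graph -> number; sample_null: rng -> random graph from G[...]."""
    rng = np.random.default_rng(rng)
    omega_obs = observable(c_observed)
    # count null graphs whose observable is at least as extreme as the observation
    count = sum(observable(sample_null(rng)) >= omega_obs for _ in range(n_samples))
    return (count + 1) / (n_samples + 1)     # add-one correction avoids p = 0
```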
Many important questions relating to quantifying network structures and to interpretation of observations apparently involve calculating averages over constrained sets of randomly generated networks, and counting the number of networks with specific structural features (Milo et al. 2002; Foster et al. 2007; Holme and Zhao 2007). This is the connection between biological networks and tailored random graph ensembles.

15.3.2 Tailored random graph ensembles
Random graph ensembles (Erdős and Rényi 1959; Molloy and Reed 1995; Watts and Strogatz 1998; Barabási and Albert 1999) give us a mathematical framework within which to make our ideas precise, and allow us to apply methods from information theory and statistical mechanics (Park and Newman 2004; Garlaschelli and Loffredo 2008). Random graph ensembles are defined by a set of allowed graphs, here taken to be (a subset of) $G$, and a measure $p(\mathbf{c})$ that tells us how likely each $\mathbf{c}\in G$ is to be generated. The ensembles found in the previous section were all of the following form. We prescribed as constraints the values for specific observables, i.e. $\Omega_\mu(\mathbf{c})=\Omega_\mu$ for $\mu=1\ldots p$, and demanded that only graphs that met the constraints were included, each with uniform weight:

$$p_h(\mathbf{c}|\mathbf{\Omega}) = Z_h^{-1}(\mathbf{\Omega})\,\delta_{\mathbf{\Omega}(\mathbf{c}),\mathbf{\Omega}}, \qquad Z_h(\mathbf{\Omega}) = \sum_{\mathbf{c}\in G}\delta_{\mathbf{\Omega}(\mathbf{c}),\mathbf{\Omega}} \qquad (15.8)$$

with $\mathbf{\Omega}=(\Omega_1,\ldots,\Omega_p)$. Such ensembles have maximum Shannon entropy (Cover and Thomas 1991), given the imposed 'hard' constraints, so in an information-theoretic sense the only structural information built into our graphs is that imposed by the constraints. Alternatively, one could relax the constraints and instead of
$\mathbf{\Omega}(\mathbf{c})=\mathbf{\Omega}$ for all $\mathbf{c}\in G$ demand that these constraints are satisfied on average, i.e. that $\sum_{\mathbf{c}\in G}p(\mathbf{c})\mathbf{\Omega}(\mathbf{c})=\mathbf{\Omega}$. Maximizing Shannon's entropy under the latter 'soft' constraint gives the so-called exponential family

$$p_s(\mathbf{c}|\mathbf{\Omega}) = Z_s^{-1}(\mathbf{\Omega})\, e^{\sum_\mu \omega_\mu(\mathbf{\Omega})\Omega_\mu(\mathbf{c})}, \qquad Z_s(\mathbf{\Omega}) = \sum_{\mathbf{c}\in G} e^{\sum_\mu \omega_\mu(\mathbf{\Omega})\Omega_\mu(\mathbf{c})} \qquad (15.9)$$

where the parameters $\omega_\mu(\mathbf{\Omega})$ must be solved from the equations $\sum_{\mathbf{c}\in G} p_s(\mathbf{c}|\mathbf{\Omega})\Omega_\mu(\mathbf{c}) = \Omega_\mu$. In contrast to (15.8), not all graphs generated by (15.9) will exhibit the properties $\mathbf{\Omega}(\mathbf{c})=\mathbf{\Omega}$, but for observables $\mathbf{\Omega}(\mathbf{c})$ that are macroscopic in nature one will generally find even in (15.9) deviations from the 'hard' condition $\mathbf{\Omega}(\mathbf{c})=\mathbf{\Omega}$ to tend to zero as $N$ becomes large. The normalization factor in (15.8) equals the number of graphs with the property $\mathbf{\Omega}(\mathbf{c})=\mathbf{\Omega}$, and can also be written in terms of the Shannon entropy of this ensemble via $Z_h(\mathbf{\Omega}) = \exp[-\sum_{\mathbf{c}\in G} p_h(\mathbf{c}|\mathbf{\Omega})\log p_h(\mathbf{c}|\mathbf{\Omega})]$. For 'soft' constrained ensembles (15.9) all graphs $\mathbf{c}\in G$ could in principle emerge. However, some are much more likely than others and one can define in that case more generally an effective number of graphs $N(\mathbf{\Omega})$, via the connection with entropy. So we have more generally for either ensemble

$$N[\mathbf{\Omega}] = e^{S[\mathbf{\Omega}]}, \qquad S[\mathbf{\Omega}] = -\sum_{\mathbf{c}\in G} p(\mathbf{c}|\mathbf{\Omega})\log p(\mathbf{c}|\mathbf{\Omega}) \qquad (15.10)$$

and one must expect for large $N$ and macroscopic observables $\mathbf{\Omega}(\mathbf{c})$ that the leading order of $S(\mathbf{\Omega})$ does not depend on whether we use (15.8) or (15.9). For instance, if we choose as our constraining observable only the average degree $N^{-1}\sum_{ij}c_{ij}$, we obtain the following maximum entropy ensembles (upon rewriting the measure for the soft constraint ensemble, and after solving its Lagrange parameter equation):

$$p_h(\mathbf{c}|\langle k\rangle) = Z_h^{-1}(\langle k\rangle)\, \delta_{\sum_{ij}c_{ij},\,N\langle k\rangle} \qquad (15.11)$$

$$p_s(\mathbf{c}|\langle k\rangle) = \prod_{i<j}\Big[\frac{\langle k\rangle}{N}\,\delta_{c_{ij},1} + \Big(1-\frac{\langle k\rangle}{N}\Big)\,\delta_{c_{ij},0}\Big] \qquad (15.12)$$

The latter is the well known Erdős–Rényi random graph ensemble (Erdős and Rényi 1959). Both are tailored to the production of random graphs with average connectivity $\langle k\rangle$, and are otherwise strictly unbiased. Specializing further, in the spirit of Figure 15.2, we next constrain the full degree sequence $\mathbf{k}=(k_1,\ldots,k_N)$ (equivalent to fixing the degree distribution $p(k)$, apart from node permutation). We then obtain the following maximum entropy ensembles:

$$p_h(\mathbf{c}|\mathbf{k}) = Z_h^{-1}(\mathbf{k}) \prod_i \delta_{\sum_j c_{ij},\,k_i} \qquad (15.13)$$

$$p_s(\mathbf{c}|\mathbf{k}) = \prod_{i<j}\Big[\frac{e^{\omega_i+\omega_j}}{1+e^{\omega_i+\omega_j}}\,\delta_{c_{ij},1} + \frac{1}{1+e^{\omega_i+\omega_j}}\,\delta_{c_{ij},0}\Big] \qquad (15.14)$$

with the $\{\omega_i\}$ to be solved from the $N$ equations $k_i = \sum_{\ell\neq i}(1+e^{-\omega_i-\omega_\ell})^{-1}$. Both ensembles (15.13, 15.14) are tailored to the production of random graphs with degrees $\mathbf{k}$, and are otherwise strictly unbiased. Specializing further to the level where, for instance, all degrees $\mathbf{k}$ as well as the joint distribution $W(k,k')$ (15.2) are prescribed gives us the graph ensembles

$$p_h(\mathbf{c}|\mathbf{k},W) = \frac{\delta_{\mathbf{k},\mathbf{k}(\mathbf{c})}}{Z_h(\mathbf{k},W)} \prod_{kk'} \delta_{\sum_{ij}\delta_{k,k_i(\mathbf{c})}c_{ij}\delta_{k',k_j(\mathbf{c})},\,N\langle k\rangle W(k,k')} \qquad (15.15)$$

$$p_s(\mathbf{c}|\mathbf{k},W) = \frac{1}{Z_s(\mathbf{k},W)}\, e^{\sum_{i<j} c_{ij}\left[\omega_i+\omega_j+\psi(k_i(\mathbf{c}),k_j(\mathbf{c}))+\psi(k_j(\mathbf{c}),k_i(\mathbf{c}))\right]} \qquad (15.16)$$
with the $\{\omega_i\}$ and $\{\psi(k,k')\}$ to be solved from the equations $k_i = \sum_{\mathbf{c}\in G} p_s(\mathbf{c}|\mathbf{k},W)\sum_j c_{ij}$ and $\sum_{\mathbf{c}\in G} p_s(\mathbf{c}|\mathbf{k},W)\sum_{ij}\delta_{k,k_i(\mathbf{c})}c_{ij}\delta_{k',k_j(\mathbf{c})} = N\langle k\rangle W(k,k')$, respectively. Increasing the complexity of our observables gives us increasingly sophisticated tailored random graphs, which share more and more features with the biological networks one aims to study or mimic, but the price paid is mathematical and computational complexity. For instance, for realistic network sizes it will be hard to solve the equations for the Lagrange parameters reliably in ensembles such as (15.16) by numerical sampling of the space $G$. Even for $N=1000$ this space already contains $2^{\frac{1}{2}N(N-1)}\approx 10^{150\,364}$ graphs (to put this number into perspective, there are estimated to be only around $10^{82}$ atoms in the universe).
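For the simpler degree-constrained ensemble (15.14) the Lagrange parameters can be found numerically without sampling $G$ at all. The following is a minimal sketch (our own, not the authors' code) using a damped fixed-point iteration; convergence is not guaranteed for every degree sequence, and more robust solvers exist.

```python
# Hedged sketch: fit the Lagrange parameters {omega_i} of ensemble (15.14) from
# k_i = sum_{l != i} 1/(1 + exp(-omega_i - omega_l)), then draw one graph.
import numpy as np

def fit_soft_degree_ensemble(k, n_iter=2000, damping=0.5):
    k = np.asarray(k, dtype=float)
    omega = np.log(k / np.sqrt(k.sum()) + 1e-12)       # Chung-Lu-style starting guess
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(omega[:, None] + omega[None, :])))
        np.fill_diagonal(p, 0.0)
        expected_k = p.sum(axis=1)
        omega += damping * (np.log(k + 1e-12) - np.log(expected_k + 1e-12))
    return omega

def sample_soft_degree_graph(omega, rng=None):
    rng = np.random.default_rng(rng)
    N = len(omega)
    p = 1.0 / (1.0 + np.exp(-(omega[:, None] + omega[None, :])))
    c = (rng.random((N, N)) < p).astype(int)
    c = np.triu(c, 1)                                    # keep i < j only
    return c + c.T                                       # symmetrize, zero diagonal
```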
15.4 Information-theoretic deliverables of tailored random graphs

If we choose our ensembles carefully, and our networks are sufficiently large, it is possible to proceed analytically. We want to define our ensembles up to the complexity limit where the various sums over all $\mathbf{c}\in G$ can in leading order in $N$ still be calculated mathematically. There is little point limiting ourselves to (15.11, 15.12), since they typically generate graphs with Poissonian degree distributions which are very different from our biological networks (Barabási and Albert 1999). So we focus on (15.13, 15.14) and (15.15, 15.16). Since we know the degrees of our biological networks, and since it is not hard to handle hard degree constraints, we choose our ensembles such that $\mathbf{k}=\mathbf{k}(\mathbf{c})$ for all $\mathbf{c}$. However, incorporating the wiring information contained in $W(k,k')$ or in the ratio $\Pi(k,k')=W(k,k')/W(k)W(k')$ is easier using a soft constraint. The following choice was studied in detail in Annibale et al. (2009):
$$p(\mathbf{c}|\mathbf{k},\Pi) = \frac{\delta_{\mathbf{k},\mathbf{k}(\mathbf{c})}}{Z(\mathbf{k},\Pi)} \prod_{i<j}\Big[\frac{\langle k\rangle}{N}Q(k_i,k_j)\,\delta_{c_{ij},1} + \Big(1-\frac{\langle k\rangle}{N}Q(k_i,k_j)\Big)\,\delta_{c_{ij},0}\Big] \qquad (15.17)$$

with $Q(k,k') = \Pi(k,k')\,kk'/\langle k\rangle^2$. In fact, in Annibale et al. (2009) the degree distribution $p(k)$ was constrained, instead of the sequence $\mathbf{k}$, giving the modestly different starting point

$$p(\mathbf{c}|p,\Pi) = \sum_{\mathbf{k}}\Big[\prod_i p(k_i)\Big]\, p(\mathbf{c}|\mathbf{k},\Pi) \qquad (15.18)$$

The ensemble (15.17) is tailored to the production of random graphs with degrees $\mathbf{k}$ (via a 'hard' constraint) and with joint degree statistics of connected nodes characterized by $\Pi(k,k')$ (via a 'soft' constraint), and is otherwise unbiased. It is the maximum entropy ensemble if we constrain the values of all degrees and the expectation value of the joint distribution $W(k,k')$ in (15.2). For $N\to\infty$ the 'soft' deviations from $W(k,k'|\mathbf{c})=W(k,k')$ will vanish.

15.4.1 Network complexity
We saw that the complexity of graphs with given properties $\mathbf{\Omega}(\mathbf{c})=\mathbf{\Omega}$ is related to the number of graphs that exist with these properties. This number is expressed in terms of the Shannon entropy of the random graph ensemble $p(\mathbf{c}|\mathbf{\Omega})$ via (15.10). It turns out that for the ensemble (15.18) one can calculate analytically the leading orders in $N$ of the entropy, via statistical mechanical techniques (e.g. path integrals and saddle-point integration), and thus avoid the need for numerical sampling to find Lagrange parameters. The result is an exact expression for the Shannon entropy $S[p,\Pi]$ and for the effective number of graphs $N[p,\Pi]$ with degree distribution $p(k)$ and degree statistics of connected nodes given by $\Pi(k,k')$:

$$S[p,\Pi]/N = S_0 - C[p,\Pi] + \epsilon_N, \qquad N[p,\Pi] = e^{S[p,\Pi]} \qquad (15.19)$$
in which $\epsilon_N$ represents a finite size correction term that will vanish as $N\to\infty$, and $S_0$ is the Shannon entropy per node one would have found for the trivial ensembles (15.11, 15.12), namely

$$S_0 = \tfrac{1}{2}\langle k\rangle\big(\log[N/\langle k\rangle]+1\big) \qquad (15.20)$$
The most interesting term in (15.19) is $C[p,\Pi]$, which tells us precisely when and how the imposition of the structural properties $p(k)$ and $\Pi(k,k')$ reduces the space of compatible graphs, in the spirit of Figure 15.2. It contains two non-negative contributions:

$$C[p,\Pi] = \sum_k p(k)\log\frac{p(k)}{\pi(k)} + \frac{1}{2\langle k\rangle}\sum_{kk'} p(k)p(k')\,kk'\,\Pi(k,k')\log\Pi(k,k') \qquad (15.21)$$

Here $\pi(k) = e^{-\langle k\rangle}\langle k\rangle^{k}/k!$ is the Poissonian degree distribution with average degree $\langle k\rangle$ one would have found
Information-theoretic dissimilarity
Information-theory also provides measures for the dissimilarity between networks cA and cB , that take account of the probabilistic nature of network data by being formulated in terms of the associated random graph measures p(c|pA , A ) and p(c|pB , B ). One has a choice of definitions, but most are very similar and even identical when the underlying distributions become close. One of the simplest formulae is Jeffreys’ divergence, which after a simple rescaling leads to the following distance between networks cA and cB : p(c|p , ) p(c|p , ) 1 A A B B DAB = + p(c|pB , B ) log p(c|pA , A ) log 2N p(c|pB , B ) p(c|pA , A ) c∈G (15.22) Again the sums over all graphs in G can be calculated in leading order in the system size, and after taking N → ∞ the end result is once more surprisingly simple, explicit and transparent: p (k) 1 p (k) 1 A B pA (k) log pB (k) log + DAB = 2 pB (k) 2 pA (k) k k (k, k ) 1 A + pA (k)pA (k )kk A (k, k ) log 4kA B (k, k ) kk (k, k ) 1 B pB (k)pB (k )kk B (k, k ) log + 4kB A (k, k ) kk 1 1 + pA (k)k log ρAB (k) + pB (k)k log ρBA (k) (15.23) 2 2 k
k
The first line gives the degree statistics contribution to the dissimilarity of networks A and B. The second and third lines reflect wiring details beyond those imposed by the degree sequences. Line four is an interference term, involving quantities ρAB (k) to be solved from a simple equation that is derived in Roberts et al. (2010).
Modelling Biological Networks 317
Figure 15.3 PPIN datasets corresponding to 11 species (nine eukaryotic and six bacteria). NP, number of proteins; NI, number of interactions; PCG, number of protein coding genes; kmax , maximum degree; and DM, detection method. Most data were derived from high-throughput yeast two-hybrid (Y2H) or affinity purificationmass spectrometry (AP-MS) experiments. We added a recent protein complementation assay (PCA) dataset and several consolidated datasets that combine high-throughput experimental data with literature mining. The Ito et al. (2001) data were divided in a high confidence set (core) and a low confidence set, as suggested by the authors. The Collins et al. (2007) data consist of the raw purifications in Krogan et al. (2006) and Gavin et al. (2002, 2006) but re analysed differently. We also included two commonly used yeast datasets: the Han et al. (2004) network (a consolidated dataset referred to as the ‘Filtered Yeast Interactome’, consisting of experimentally determined and in silico predicted interactions), and the von Mering et al. (2002) dataset (assembled from two catalogues of yeast protein complexes, the MIPS and Yeast Protein Database catalogue)
The derivation of (15.23) from (15.22), apart from the interference term, is found in Annibale et al. (2009). In contrast to dissimilarity measures of networks that are based on link overlap, the measure (15.23) is based strictly on macroscopic measures and has a precise information-theoretic basis.4
15.5
Applications to PPINs
In contrast to genomic data, the available proteome data are still far from complete and of limited reproducibility (Hart et al. 2006; Stumpf et al. 2008). It is therefore vital that we understand the origin of the discrepancies between observed PPINs. Here we explore the use of information-theoretic random graph based tools for PPIN characterization and comparison, as described in the preceding sections, to shed light on this problem. Figure 15.3 shows various PPIN datasets, colour coded according to their experimental detection method. To get some feeling for these data we show in Figure 15.4 the degree correlations of connected nodes as measured 4 Link-by-link overlap is not a good measure of the (dis)similarity between two networks, just as the size in bits of a file does not generally
give its true information content.
Handbook of Statistical Systems Biology H.pylori (Y2H 2001)
k’
40
C.jejuni (Y2H 2007)
Pi
40
35
35
1.8
30
1.5
30
1.5
25
1.2
25
1.2
20
0.9
20
0.9
15
15
0.6
10
0.3
5 1
0.0
0.0 1 5 10 15 20 25 30 35 40 k
1 5 10 15 20 25 30 35 40 k E.coli (AP-MS 2006)
Pi
M.loti (Y2H 2008)
40
Pi
35
1.8
30
1.5
30
1.5
25
1.2
25
1.2
20
0.9
20
0.9
15
15
0.6
10
0.0
T.pallidum (Y2H 2008)
0.0 1 5 10 15 20 25 30 35 40 k Synechocystis (Y2H 2007)
Pi
40 1.8
35
0.3
5 1
1 5 10 15 20 25 30 35 40 k 40
0.6
10
0.3
5 1
1.8
35
k’
k’
40
0.6
10
0.3
5 1
Pi
1.8
35
30
1.5
30
1.5
25
1.2
25
1.2
20
0.9
20
0.9
15
0.6
10
0.3
5 1
0.0 1 5 10 15 20 25 30 35 40 k
k’
k’
Pi
1.8
k‘
318
15
0.6
10
0.3
5 1
0.0 1 5 10 15 20 25 30 35 40 k
Figure 15.4 Rescaled degree statistics $\Pi(k,k')$ of connected nodes in bacterial PPINs. Since one would have found $\Pi(k,k')=1$ for all $(k,k')$ if connected nodes had uncorrelated degrees, deviations from light green suggest nontrivial structural features. Reproduced from Fernandes et al. (2010).
by (15.7) for the bacterial species in Figure 15.3. There appears to be nontrivial information in the degree correlations, giving rise to diverse patterns for different species. The most closely related bacteria in Figure 15.3 are Helicobacter pylori and Campylobacter jejuni, which both belong to the Campylobacterales genus, yet this is not reflected in their degree correlations. Similarly, comparing Escherichia coli,C. jejuni, Treponema pallidum and H. pylori, all belonging to the Proteobacteria Phylum family (the majority of gram-negative
Figure 15.5 Rescaled degree statistics $\Pi(k,k')$ of connected nodes in yeast PPINs. Since one would have found $\Pi(k,k')=1$ for all $(k,k')$ if connected nodes had uncorrelated degrees, deviations from light green suggest nontrivial structural features. Reproduced from Fernandes et al. (2010).
bacteria), does not reveal a consistent pattern either. More worryingly, fully consistent degree correlation fingerprints are not even observed for datasets of the same species. This is seen in Figure 15.5 which shows the degree correlations for yeast, the focus of most large-scale PPIN determinations so far, displayed in chronological order of experimental determination.
A hint at a possible explanation emerges if one compares only plots that refer to the same experimental technique. The degree correlation patterns then appear more similar, differing mostly in the strengths of the deviations from the random level, which increase roughly with the time of publication of the dataset. Compare e.g. Saccharomyces cerevisiae II (core) to S. cerevisiae XII (both obtained via Y2H), and S. cerevisiae VIII to S. cerevisiae X (both obtained via AP-MS). The interactions reported in S. cerevisiae X were derived from the raw data of two AP-MS datasets (S. cerevisiae VIII and S. cerevisiae IX), but processed using a different scoring and clustering protocol. The AP-MS datasets generally show stronger degree correlation patterns than the Y2H ones (this is also observed for H. sapiens data), although the regions where the main deviations from the random level occur are different.

15.5.1 PPIN assortativity and wiring complexity
To assess the statistical significance of observed differences in degree correlation patterns we measure for the different datasets two quantities that are strongly dependent upon the degree correlations: the assortativity (15.3) and the wiring complexity, i.e. the second term of (15.21). We test the observed values against similar observations in appropriate null models. The latter are graphs generated randomly and with uniform probabilities from the set $G[p]$: all nondirected networks with size and degree distribution identical to those of the networks under study, but without degree correlations. For large $N$ all graphs in $G[p]$ will have $\Pi(k,k')=1$ for all $(k,k')$, and zero wiring complexity and assortativity. Any nonzero value reflects finite size effects, possibly complemented by imperfect equilibration during the generation of the null models (the numerical generation of such graphs is the subject of a subsequent section). In Figure 15.6(a) we plot the assortativities of our PPIN datasets (original), together with those of their null models (reshuffled). Most sets have slightly negative assortativity values, i.e. a weak preference for interactions between nodes with different degrees. The main deviant from this trend is S. cerevisiae X, with a strong positive assortativity. This is consistent with Figure 15.5, where this dataset indeed exhibits high values of $\Pi(k,k')$ along the main diagonal, signalling a preference for interactions between nodes with similar degrees. The assortativities of the null models are expected to be closer to zero than those of the PPINs. This is indeed true for the majority of cases, and we may therefore conclude that the structures observed for $\Pi(k,k')$ in our PPIN data (as in Figures 15.4 and 15.5) cannot all be attributed to finite size fluctuations, and are hence statistically significant. The wiring complexity per node is the second term in (15.21). It measures topological information contained in a network's degree correlations, beyond that contained in the degree distribution alone. However, given the considerable differences between the average connectivities in Figure 15.3, and the likelihood that these reflect sampling variability (if anything), we choose to measure and plot instead the wiring complexity per link, i.e.

$$\tilde{C}[p,\Pi]_{\mathrm{wiring}} = \frac{1}{2\langle k\rangle^2}\sum_{kk'} p(k)p(k')\,kk'\,\Pi(k,k')\log\Pi(k,k') \qquad (15.24)$$

(related to $C[p,\Pi]_{\mathrm{wiring}}$ via division by $\langle k\rangle$). In Figure 15.6(b) we plot this quantity for our PPIN datasets (original), together with those of the corresponding null models (reshuffled). Interestingly, the AP-MS networks tend to have higher wiring complexities than the Y2H ones. The wiring complexities of the null models are expected to be closer to zero than those of the real PPINs, and this is again borne out by the data. Once more we conclude that the structures observed for $\Pi(k,k')$ in our PPIN data (as in Figures 15.4 and 15.5) cannot be attributed to finite size fluctuations; they are statistically significant.
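The 'reshuffled' null models used here are degree-preserving randomizations of the original networks. A minimal sketch of this idea, using networkx's accept-all double edge swap, is given below (our own illustration; as Section 15.6 explains, accept-all swapping is only approximately unbiased, and an exact treatment requires the mobility-corrected acceptance rule given there; the file name is hypothetical).

```python
# Hedged sketch: degree-preserving 'reshuffled' null model via accept-all edge swaps.
import networkx as nx

def reshuffled_null_model(G, swaps_per_edge=10, seed=0):
    H = G.copy()
    n_swaps = swaps_per_edge * H.number_of_edges()
    nx.double_edge_swap(H, nswap=n_swaps, max_tries=100 * n_swaps, seed=seed)
    return H

# Example usage: compare an observed assortativity with its null distribution.
# G = nx.read_edgelist("ppin.tsv")                      # hypothetical input file
# a_obs = nx.degree_assortativity_coefficient(G)
# a_null = [nx.degree_assortativity_coefficient(reshuffled_null_model(G, seed=s))
#           for s in range(100)]
```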
15.5.2 Mapping PPIN data biases
We saw that an efficient information-theoretic measure for the dissimilarity between two networks $\mathbf{c}_A$ and $\mathbf{c}_B$ is given by the Jeffreys' divergence between the probability measures of the associated random graph ensembles.
Data Integration
Figure 15.6 Assortativity (a) and wiring complexity per link (b) for the biological PPIN in Figure 15.3 and their null models. Apart from having size and degree distributions identical to their biological counterparts, the null models are strictly random, and would for large N have zero assortativity and wiring complexity. Reproduced from Annibale et al. (2009) and Fernandes et al. (2010)
(b)
Figure 15.7 Network comparison by dendrogram clustering using the distance measure (15.23). (a) Dendrogram for the full PPIN collection in Figure 15.3. (b) Dendrogram for PPINs of S. cerevisiae only. Branch colours indicate the different experimental detection techniques. The integration datasets (S. cerevisiae V and S. cerevisiae VII were excluded from (b), since they are based on a variety of techniques. Reproduced from Fernandes et al. (2010).
If we work at the level of the sets G[p, ], the result is formula (15.23). Had we characterized networks only according to their degree distributions we would have worked with the sets G[p], and would have found only the first line of (15.23). Given the observed assortativies and wiring complexities of our data, relative to those of null models, we take the degree correlations to be significant. We may then study the relations between the different biological networks by calculating their pairwise distances via (15.23), use the distance table to cluster the datasets, and show the result in the form of a dendrogram. This gives Figure 15.7, which is quite revealing.5 Those data sets which were most strongly criticized in the past for having worryingly small overlaps, e.g. the Y2H data sets S. cerevisiae I versus II and H. sapiens I versus II, are now unambiguously found to be topologically similar. However, our collection of PPINs group primarily by detection method; for the presently available PPIN datasets, any biological similarities are overshadowed by methodological biases. This is particularly evident in the bottom subgroup [pink leaves in Figure 15.7(a)], which clusters almost exclusively Y2H datasets and comprises a wide range of species. The methodological biases are also obvious in the intra-species comparison of S. cerevisiae shown in Figure 15.7(b). The largest sub group distance within this tree is the one between two AP-MS datasets that have been post-processed differently (the top two within the green box). Also, the single PCA network is separated from the AP-MS and Y2H subgroups. We conclude: (i) PPINs of the same species and measured via the same experimental method are statistically similar, and more similar than networks measured via the same method but for different species; and (ii) PPINs measured via the same experimental 5 If one repeats this exercise using only the first line of (15.23), the partitioning in different techniques is less evident. The degree–degree
correlations apparently contribute a valuable amount of information to PPIN comparisons.
method cluster together, revealing a bias introduced by the methods that is seen to overrule species-specific information.
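The dendrograms of Figure 15.7 can be produced with standard hierarchical clustering once the pairwise distances (15.23) have been computed. The following is our own sketch (not the authors' pipeline), assuming a precomputed square distance matrix `D` and a list of dataset `labels`.

```python
# Hedged sketch: dendrogram from a matrix of pairwise network distances such as (15.23).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt

def plot_network_dendrogram(D, labels):
    condensed = squareform(np.asarray(D), checks=False)   # condensed distance vector
    Z = linkage(condensed, method="average")              # average-linkage clustering
    dendrogram(Z, labels=labels, orientation="left")
    plt.xlabel("Distance")
    plt.tight_layout()
    plt.show()
```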
15.6 Numerical generation of tailored random graphs
We now turn to the question of how to generate numerically graphs from ensembles such as (15.17). It is not difficult to build algorithms that sample the space of all graphs with a given degree sequence; the difficulty lies in generating each graph with the correct probability (Bender and Canfield 1978; Chung and Lu 2002; Stauffer and Barbosa 2005). One popular algorithm (Newman et al. 2001) is limited to the case where graphs are to be generated with equal probabilities, i.e. to (15.13) or (15.17) with $\Pi(k,k')=1$. This method cannot generate graphs with degree correlations. A second popular method for generating random graphs with a given degree sequence is 'edge swapping', which involves successions of ergodic graph randomizing moves of a type that leave the degrees $\mathbf{k}$ invariant (Seidel 1973; Taylor 1981). However, we will see that naive accept-all edge swapping can cause sampling biases which render this protocol unsuitable for generating null models. The reason is that the number of edge swaps that can be executed is not a constant; it depends on the graph $\mathbf{c}$ at hand.
15.6.1 Generating random graphs via Markov chains
A general and exact method for generating graphs from the set $G[\mathbf{k}]=\{\mathbf{c}\in G\,|\,\mathbf{k}(\mathbf{c})=\mathbf{k}\}$ randomly, with specified probabilities $p(\mathbf{c})=Z^{-1}\exp[-H(\mathbf{c})]$, was developed in Coolen et al. (2009). It has the form of a Markov chain, namely a discrete time stochastic process

$$\forall \mathbf{c}\in G[\mathbf{k}]: \qquad p_{t+1}(\mathbf{c}) = \sum_{\mathbf{c}'\in G[\mathbf{k}]} W(\mathbf{c}|\mathbf{c}')\,p_t(\mathbf{c}') \qquad (15.25)$$
Here $p_t(\mathbf{c})$ is the probability of observing graph $\mathbf{c}$ at time $t$ in the process, and $W(\mathbf{c}|\mathbf{c}')$ is the one-step transition probability from graph $\mathbf{c}'$ to $\mathbf{c}$. For any set $\Phi$ of ergodic⁶ reversible elementary moves $F: G[\mathbf{k}]\to G[\mathbf{k}]$ we can choose transition probabilities of the form

$$W(\mathbf{c}|\mathbf{c}') = \sum_{F\in\Phi} q(F|\mathbf{c}')\Big\{\delta_{\mathbf{c},F\mathbf{c}'}\,A(F\mathbf{c}'|\mathbf{c}') + \delta_{\mathbf{c},\mathbf{c}'}\,\big[1-A(F\mathbf{c}'|\mathbf{c}')\big]\Big\} \qquad (15.26)$$
The interpretation is as follows. At each step a candidate move $F\in\Phi$ is drawn with probability $q(F|\mathbf{c}')$, where $\mathbf{c}'$ denotes the current graph. This move is accepted (and the transition $\mathbf{c}'\to\mathbf{c}=F\mathbf{c}'$ executed) with probability $A(F\mathbf{c}'|\mathbf{c}')\in[0,1]$, which depends on the current graph $\mathbf{c}'$ and on the proposed new graph $F\mathbf{c}'$. If the move is rejected, which happens with probability $1-A(F\mathbf{c}'|\mathbf{c}')$, the system stays in $\mathbf{c}'$. We may always exclude from $\Phi$ the identity operation. One can prove that the process (15.25) will converge towards the equilibrium measure $p_\infty(\mathbf{c})=Z^{-1}\exp[-H(\mathbf{c})]$ upon making in (15.26) the choices

$$q(F|\mathbf{c}) = I_F(\mathbf{c})/n(\mathbf{c}) \qquad (15.27)$$

$$A(\mathbf{c}|\mathbf{c}') = \frac{n(\mathbf{c}')\,e^{-\frac{1}{2}[H(\mathbf{c})-H(\mathbf{c}')]}}{n(\mathbf{c}')\,e^{-\frac{1}{2}[H(\mathbf{c})-H(\mathbf{c}')]} + n(\mathbf{c})\,e^{\frac{1}{2}[H(\mathbf{c})-H(\mathbf{c}')]}} \qquad (15.28)$$

⁶ So we can go from any initial graph $\mathbf{c}\in G[\mathbf{k}]$ to any final graph $\mathbf{c}'\in G[\mathbf{k}]$ by a finite number of moves $F\in\Phi$.
Here $I_F(\mathbf{c})=1$ if the move $F$ can act on graph $\mathbf{c}$, with $I_F(\mathbf{c})=0$ otherwise, and $n(\mathbf{c})$ denotes the total number of moves that can act on a graph $\mathbf{c}$ (the 'mobility' of state $\mathbf{c}$):

$$n(\mathbf{c}) = \sum_{F\in\Phi} I_F(\mathbf{c}). \qquad (15.29)$$
15.6.2 Degree-constrained graph dynamics based on edge swaps
We apply the above result to the case where our moves are edge swaps, which are the simplest graph moves that preserve all node degrees. They act on quadruplets of nodes and their mutual links, so we define the set $Q=\{(i,j,k,\ell)\in\{1,\ldots,N\}^4\,|\,i<j<k<\ell\}$ of all ordered node quadruplets. The possible edge swaps to act on $(i,j,k,\ell)$ are the following, with thick lines indicating existing links and thin lines indicating absent links that will be swapped with the existing ones, and where (IV, V, VI) are the inverses of (I, II, III):

[Diagram: the six elementary edge swaps I–VI acting on the node quadruplet $(i,j,k,\ell)$.]
We group the edge swaps into the three pairs (I,IV), (II,V), and (III,VI), and label all three resulting auto-invertible operations for each ordered quadruple $(i,j,k,\ell)$ by adding a subscript $\alpha$. Our auto-invertible edge swaps are from now on written as $F_{ijk\ell;\alpha}$, with $i<j<k<\ell$ and $\alpha\in\{1,2,3\}$. We define associated indicator functions $I_{ijk\ell;\alpha}(\mathbf{c})\in\{0,1\}$ that detect whether (1) or not (0) the edge swap $F_{ijk\ell;\alpha}$ can act on state $\mathbf{c}$, so

$$I_{ijk\ell;1}(\mathbf{c}) = c_{ij}c_{k\ell}(1-c_{i\ell})(1-c_{jk}) + (1-c_{ij})(1-c_{k\ell})c_{i\ell}c_{jk} \qquad (15.30)$$
$$I_{ijk\ell;2}(\mathbf{c}) = c_{ij}c_{k\ell}(1-c_{ik})(1-c_{j\ell}) + (1-c_{ij})(1-c_{k\ell})c_{ik}c_{j\ell} \qquad (15.31)$$
$$I_{ijk\ell;3}(\mathbf{c}) = c_{ik}c_{j\ell}(1-c_{i\ell})(1-c_{jk}) + (1-c_{ik})(1-c_{j\ell})c_{i\ell}c_{jk} \qquad (15.32)$$

If $F_{ijk\ell;\alpha}$ can indeed act, i.e. if $I_{ijk\ell;\alpha}(\mathbf{c})=1$, this edge swap will operate as follows:

$$F_{ijk\ell;\alpha}(\mathbf{c})_{qr} = 1-c_{qr} \quad\text{for } (q,r)\in S_{ijk\ell;\alpha} \qquad (15.33)$$
$$F_{ijk\ell;\alpha}(\mathbf{c})_{qr} = c_{qr} \quad\text{for } (q,r)\notin S_{ijk\ell;\alpha} \qquad (15.34)$$

where

$$S_{ijk\ell;1} = \{(i,j),(k,\ell),(i,\ell),(j,k)\}, \qquad S_{ijk\ell;2} = \{(i,j),(k,\ell),(i,k),(j,\ell)\} \qquad (15.35)$$
$$S_{ijk\ell;3} = \{(i,k),(j,\ell),(i,\ell),(j,k)\} \qquad (15.36)$$
Insertion of these definitions into the general recipe (15.26, 15.27, 15.28) then gives

$$W(\mathbf{c}|\mathbf{c}') = \sum_{i<j<k<\ell}\sum_{\alpha=1}^{3} \frac{I_{ijk\ell;\alpha}(\mathbf{c}')}{n(\mathbf{c}')} \left[\frac{\delta_{\mathbf{c},F_{ijk\ell;\alpha}\mathbf{c}'}\,e^{-\frac{1}{2}[E(F_{ijk\ell;\alpha}\mathbf{c}')-E(\mathbf{c}')]} + \delta_{\mathbf{c},\mathbf{c}'}\,e^{\frac{1}{2}[E(F_{ijk\ell;\alpha}\mathbf{c}')-E(\mathbf{c}')]}}{e^{-\frac{1}{2}[E(F_{ijk\ell;\alpha}\mathbf{c}')-E(\mathbf{c}')]} + e^{\frac{1}{2}[E(F_{ijk\ell;\alpha}\mathbf{c}')-E(\mathbf{c}')]}}\right] \qquad (15.37)$$
with $E(\mathbf{c})=H(\mathbf{c})+\log n(\mathbf{c})$. The graph dynamics algorithm described by (15.37) is the following. Given an instantaneous graph $\mathbf{c}'$: (i) pick uniformly at random a quadruplet $(i,j,k,\ell)$ of sites, (ii) if at least one of the
three edge swaps $\mathbf{c}'\to F_{ijk\ell;\alpha}(\mathbf{c}')$ is possible, select one of these uniformly at random and execute it with an acceptance probability

$$A(\mathbf{c}|\mathbf{c}') = \left[1 + e^{E(F_{ijk\ell;\alpha}\mathbf{c}')-E(\mathbf{c}')}\right]^{-1} \qquad (15.38)$$

then return to (i). For this Markov chain recipe to be practical we finally need a formula for the mobility $n(\mathbf{c})$ of a graph. This could be calculated (Coolen et al. 2009), giving:⁷

$$n(\mathbf{c}) = \frac{1}{4}\Big(\sum_i k_i\Big)^2 + \frac{1}{4}\sum_i k_i - \frac{1}{2}\sum_i k_i^2 - \frac{1}{2}\sum_{ij} k_i c_{ij} k_j + \frac{1}{4}\mathrm{Tr}(\mathbf{c}^4) + \frac{1}{2}\mathrm{Tr}(\mathbf{c}^3) \qquad (15.39)$$
Naive 'accept-all' edge swapping would correspond to choosing $E(\mathbf{c})=0$ in (15.37), and upon equilibration it would give the biased graph sampling probabilities $p_\infty(\mathbf{c}) = n(\mathbf{c})/\sum_{\mathbf{c}'} n(\mathbf{c}')$. The graph mobility is seen to act as an entropic force, which can only be neglected if (15.39) is dominated by its first three terms; it was shown that a sufficient condition for this to be the case is $\langle k^2\rangle k_{\max}/\langle k\rangle^2 \ll N$. For networks with narrow degree sequences this condition would hold, and naive edge swapping would be acceptable. However, one has to be careful with scale-free degree sequences, where both $\langle k^2\rangle$ and $k_{\max}$ diverge as $N\to\infty$.
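To make the procedure concrete, the sketch below (our own, deliberately simple and unoptimized) implements the quadruplet-based proposal and the acceptance rule (15.38) for the special case of the uniform target $H(\mathbf{c})=0$ on $G[\mathbf{k}]$, where $E(\mathbf{c})=\log n(\mathbf{c})$ and a proposed swap is accepted with probability $n(\mathbf{c}')/[n(\mathbf{c}')+n(F\mathbf{c}')]$. A practical implementation would update the mobility (15.39) incrementally rather than recomputing it, and would encode a nontrivial $H(\mathbf{c})$ to target measures such as (15.17).

```python
# Hedged sketch: mobility-corrected edge-swap dynamics (15.37)-(15.39), uniform target.
import numpy as np

def mobility(c):
    """n(c) of (15.39) for a 0/1 symmetric adjacency matrix with zero diagonal."""
    k = c.sum(axis=1).astype(float)
    c2 = c @ c
    return (0.25 * k.sum() ** 2 + 0.25 * k.sum() - 0.5 * (k ** 2).sum()
            - 0.5 * k @ c @ k + 0.25 * np.trace(c2 @ c2) + 0.5 * np.trace(c2 @ c))

def canonical_edge_swap(c, n_accepted=10_000, rng=None):
    rng = np.random.default_rng(rng)
    c = c.copy()
    N = len(c)
    n_cur = mobility(c)
    # index patterns of the three auto-invertible swaps on a quadruplet (i<j<k<l):
    # first two pairs are exchanged against the last two, in either direction.
    swaps = [((0, 1), (2, 3), (0, 3), (1, 2)),   # pair (I, IV)
             ((0, 1), (2, 3), (0, 2), (1, 3)),   # pair (II, V)
             ((0, 2), (1, 3), (0, 3), (1, 2))]   # pair (III, VI)
    accepted = 0
    while accepted < n_accepted:
        q = np.sort(rng.choice(N, 4, replace=False))         # random quadruplet
        feasible = []
        for spec in swaps:
            pairs = [(q[a], q[b]) for a, b in spec]
            first, second = pairs[:2], pairs[2:]
            if all(c[x, y] for x, y in first) and not any(c[x, y] for x, y in second):
                feasible.append((first, second))              # remove first, add second
            elif all(c[x, y] for x, y in second) and not any(c[x, y] for x, y in first):
                feasible.append((second, first))
        if not feasible:
            continue                                          # no executable swap here
        remove, add = feasible[rng.integers(len(feasible))]
        c_new = c.copy()
        for x, y in remove:
            c_new[x, y] = c_new[y, x] = 0
        for x, y in add:
            c_new[x, y] = c_new[y, x] = 1
        n_new = mobility(c_new)
        if rng.random() < n_cur / (n_cur + n_new):            # acceptance rule (15.38)
            c, n_cur = c_new, n_new
            accepted += 1
    return c
```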
15.6.3 Numerical examples
It is easy to construct example degree distributions where taking into account the entropic effects caused by nontrivial graph mobilities is vital in order to generate correct graph sampling probabilities. Here we show an example of the Markov chain (15.37) generating upon equilibration⁸ graphs with controlled degree correlation structures of the form

$$\Pi(k,k') = \frac{(k-k')^2}{[\beta_1 - \beta_2 k + \beta_3 k^2]\,[\beta_1 - \beta_2 k' + \beta_3 k'^2]} \qquad (15.40)$$
(with the parameters $\beta_i$ following from (15.4)). An initial graph $\mathbf{c}_0$ was constructed with a non-Poissonian degree distribution and trivial relative degree correlations $\Pi(k,k'|\mathbf{c}_0)\approx 1$, corresponding to the flat ensemble (15.13); see Figure 15.8. After iterating until equilibrium the Markov chain (15.37) with move acceptance rates tailored to approaching (15.17) as an equilibrium measure, one finds indeed values for the degree correlations in very good agreement with their target values (see the bottom panels of Figure 15.8).
15.7 Discussion
In this chapter we have discussed the fruitful connection between biological signalling networks and the theory of random graph ensembles with tailored structural features. We focused on two aspects of this connection: how random graph ensembles can be used to generate rigorous information-theoretic formulae with which to quantify the complexities and (dis)similarities of observed networks, and how to generate numerically tailored random graphs with controlled macroscopic structural properties, to serve e.g. as null models in hypothesis testing. We limited ourselves here to nondirected graphs, in view of space limitations and since these have so far been the focus of most research papers; similar analyses can be (and are being) undertaken for directed ones. The quantitative study of cellular signalling networks is still in its infancy, and as our mathematical
⁷ Here $\mathrm{Tr}\,\mathbf{A} = \sum_i A_{ii}$.
⁸ The algorithm ran for a duration of 75 000 accepted edge swaps, and measurements of Hamming distances confirmed that with this duration the dynamics achieved maximum distance between initial and final graphs.
Figure 15.8 Results of canonical edge-swap Markov chain dynamics tailored to generating random graphs with the non-uniform measure (15.17). (Top left) Degree distribution of the (randomly generated) initial graph $\mathbf{c}_0$, with $N=4000$ and $\langle k\rangle=5$. (Top right) Relative degree correlations $\Pi(k,k'|\mathbf{c}_0)$ of the initial graph. (Bottom left) The target relative degree correlations (15.40) chosen in (15.17). (Bottom right) Colour plot of the relative degree correlations $\Pi(k,k'|\mathbf{c}_{\mathrm{final}})$ in the final graph $\mathbf{c}_{\mathrm{final}}$, measured after 75 000 accepted moves of the Markov chain (15.37). Reproduced from Coolen et al. (2009).
tools continue to improve one can envisage many future research directions. These include e.g. the (biased) network sampling problem, where one could perhaps use the new information-theoretic formulae to predict unobserved nodes, and the study of integrated signalling networks that combine transcription and protein– protein interaction information. At the mathematical level the main new challenge to be confronted is to develop tools similar to the ones discussed in this chapter for measures of network structure that involve the statistics of loops. If observables include loop counters one cannot simply extend the existing mathematical techniques (sums over all graphs can no longer be made to factorize by existing manipulations); radically new ideas are required.
References Albert R and Barab´asi AL 2002 Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 47–96. Annibale A, Coolen ACC, Fernandes LP, et al. 2009 Tailored graph ensembles as proxies or null models for real networks I: tools for quantifying structure. J. Phys. A: Math. Theor. 42, 485001. Arifuzzaman M, Maeda M, Itoh A, et al. 2006 Large-scale identification of protein-protein interaction of Escherichia coli K-12. Genome Res. 16, 686–691. Barab´asi AL and Albert R 1999 Emergence of scaling in random networks. Science 286, 509–512. Bender E and Canfield E 1978 The asymptotic of labelled graphs with given degree sequences. J. Comb. Theory, Ser. A 24, 296–307. Bianconi G, Coolen ACC and P´erez Vicente CJ 2008 Entropies of complex networks with hierarchically constrained topologies. Phys. Rev. E 78, 016114. Chung F and Lu L 2002 The average distances in random graphs with given expected degrees. Proc. Natl. Acad. Sci. USA 99, 15879–15882. Collins SRR, Kemmeren P, Chu X, et al. 2007 Towards a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol. Cell Proteomics 6, 439–450. Coolen ACC, De Martino A and Annibale A 2009 Constrained Markovian dynamics of random graphs. J. Stat. Phys. 136, 1035–1067. Cover TM and Thomas JA 1991 Elements of Information Theory. John Wiley & Sons, Ltd. Dorogovtsev SN, Goltsev AV and Mendes JFF 2008 Critical phenomena in complex networks. Rev. Mod. Phys. 80, 1275–1335. Erd¨os P and R´enyi A 1959 On random graphs I. Publ. Math. 6, 290–297. Ewing RM, Chu P, Elisma F, et al. 2007 Large-scale mapping of human protein-protein interactions by mass spectrometry. Mol. Syst. Biol. 3, 89. Fernandes LP, Annibale A, Kleinjung J, et al. 2010 Protein networks reveal detection bias and species consistency when viewed through information–theoretic glasses. PLoS ONE 5, e12083. Foster JG, Foster DV, Grassberger P and Paczuski M 2007 Link and subgraph likelihoods in random undirected networks with fixed and partially fixed degree sequences. Phys. Rev. E 76, 046112. Garlaschelli D and Loffredo MI 2008 Maximum likelihood: extracting unbiased information from complex networks. Phys. Rev. E 78, 015101. Gavin AC, B¨osche M, Krause R, et al. 2002 Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415, 141–147. Gavin AC, Aloy P, Grandi P, et al. 2006 Proteome survey reveals modularity of the yeast cell machinery. Nature 440, 631–636. Hakes L, Pinney JW, Robertson DL and Lovell SC 2008 Protein-protein interaction networks and biology-what’s the connection? Nature Biotechnol. 26, 69–72. Han JDJ, Bertin N, Hao T, et al. 2004 Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 430, 88–93. Han JDJ, Dupuy D, Bertin N, et al. 2005 Effect of sampling on topology predictors of protein-protein interaction networks. Nature Biotechnol. 23, 839–844. Hart GT, Ramani AK, and Marcotte EM 2006 How complete are current yeast and human protein-interaction networks? Genome Biol. 7, 120. Ho Y, Gruhler A, Heilbut A, et al. 2002 Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415, 180–183. Holme P and Zhao J 2007 Exploring the assortativity-clustering space of a network’s degree sequence. Phys. Rev. E 75, 046111. Ito T, Chiba T, Ozawa R, et al. 2001 A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA 98, 4569–4574. Krogan NJ, Cagney G, Yu H, et al. 
2006 Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440, 637–643. K¨uhn R 2008 Spectra of sparse random matrices. J. Phys. A: Math. Theor. 41, 295002.
328
Handbook of Statistical Systems Biology
Lacount DJ, Vignali M, Chettier R, et al. 2005 A protein-protein interaction network of the malaria parasite plasmodium falciparum. Nature 438, 103–107. Milo R, Shen-Orr S, Itzkovitz S, et al. 2002 Network motifs: simple building blocks of complex networks. Science 298, 824–827. Mohar B 1991 The Laplacian spectrum of graphs. In Graph Theory, Combinatorics, and Applications, vol 2 (eds Alavi Y, Chartrand G, Oellermann OR, and Schwenk AJ), pp. 871–898. John Wiley & Sons, Ltd. Molloy M and Reed B 1995 A critical point for random graphs with a given degree sequence. Random Structures and Algorithms 6, 161–180. Newman MEJ 2002 Assortative mixing in networks. Phys. Rev. Lett. 89, 208701. Newman MEJ 2003 Handbook of Graphs and Networks: from the Genome to the Internet (eds Bornholdt S and Schuster HG). Wiley-VCH. Newman MEJ, Strogatz SH and Watts DJ 2001 Random graphs with arbitrary degree distribution and their applications. Phys. Rev. E 64, 026118. Park J and Newman MEJ 2004 Statistical mechanics of networks. Phys. Rev. E 70, 066117. Parrish JR, Yu J, Liu G, et al. 2007 A proteome-wide protein-protein interaction map for Campylobacter jejuni. Genome Biol. 8, R131. P´erez-Vicente CJ and Coolen ACC 2008 Spin models on random graphs with controlled topologies beyond degree constraints. J. Phys. A: Math. Theor. 41, 255003. Prasad TSK, Goel R, Kandasamy K, et al. 2009 Human protein reference database 2009 update. Nucleic Acids Res. 37, D767-D772. Rain JC, Selig L, De Reuse H, et al. 2001 The protein-protein interaction map of Helicobacter pylori. Nature 409, 211–215. Roberts ES, Coolen ACC and Schlitt T 2010 Tailored graph ensembles as proxies or null models for real networks II: results on directed graphs. J. Phys. A: Math. Theor. submitted. Rogers T, P´erez Vicente C, Takeda K and P´erez Castillo I 2010 Spectral density of random graphs with topological constraints. J. Phys. A: Math. Theor. 43, 195002. Rual JFF, Venkatesan K, Hao T, et al. 2005 Towards a proteome-scale map of the human protein–protein interaction network. Nature 437, 1173–1178. Sato S, Shimoda Y, Muraki A, et al. 2007 A large-scale protein-protein interaction analysis in Synechocystis sp. PCC6803. DNA Res. 14, 207–216. Seidel JJ 1973 A survey of two-graphs. In Colloquio Internazionale sulle Teorie Combinatorie, vol. I, pp. 481–511. Accademia Nazionale dei Lincei. Shimoda Y, Shinpo S, Kohara M, et al. 2008 A large scale analysis of protein-protein interactions in the nitrogen-fixing bacterium Mesorhizobium loti. DNA Res. 15, 13–23. Simonis N, Rual JF, Carvunis AR, et al. 2008 Empirically controlled mapping of the Caenorhabditis elegans proteinprotein interactome network. Nature Methods 6, 47–54. Skantzos NS 2005 Unpublished research report. Stark C, Breitkreutz BJ, Reguly T, et al. 2006 Biogrid: a general repository for interaction datasets. Nucleic Acids Res. 34, D535–D539. Stauffer AO and Barbosa VC 2005 A study of the edge switching Markov-Chain method for the generation of random graphs. arXiv:0512105. Stelzl U, Worm U, Lalowski M, et al. 2005 A human protein-protein interaction network: a resource for annotating the proteome. Cell 122, 957–968. Stumpf MPH and Wiuf C 2005 Sampling properties of random graphs: the degree distribution. Phys. Rev. E 72, 036118. Stumpf MPH, Thorne T, de Silva E, et al. 2008 Estimating the size of the human interactome. Proc. Natl. Acad. Sci. USA 105, 6959–6964. Taylor R 1981 Constrained switchings in graphs. In Combinatorial Mathematics VIII, vol. 884 (ed. McAvaney KL), pp. 314–336. Springer. 
Tarassov K, Messier V, Landry CR, et al. 2008 An in vivo map of the yeast protein interactome. Science 320, 1465–1470. Titz B, Rajagopala SV, Goll J, et al. 2008 The binary protein interactome of Treponema pallidum – the syphilis spirochete. PLoS ONE 3, e2292.
Modelling Biological Networks 329 Uetz P, Giot L, Cagney G, et al. 2000 A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403, 623–627. von Mering C, Krause R, Snel B, et al. 2002 Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417, 399–403. Watts DJ and Strogatz SH 1998 Collective dynamics of ‘small-world’ networks. Nature 393, 440–442. Yu H, Braun P, Yildirim MA, et al. 2008 High-quality binary protein-protein interaction map of the yeast interactome network. Science 322, 104–110.
Part D DYNAMICAL SYSTEMS
16 Nonlinear Dynamics: A Brief Introduction
Alessandro Moura and Celso Grebogi
Institute for Complex Systems and Mathematical Biology, University of Aberdeen, UK
16.1 Introduction
Biological systems are governed by complex interactions among their constituent parts, and these interactions result in nonlinear dynamics. This is attested by the ubiquity of feedback and feedforward mechanisms in biology, on scales ranging from whole multicellular organisms to single-cell dynamics. The nonlinearity of biological interactions has a number of crucially important consequences for the dynamics of biological systems, including the emergence of multistability and limit cycle behaviour. This chapter is dedicated to another characteristic class of phenomena that can appear in nonlinear dynamical systems: chaos. Chaotic dynamics is common in nonlinear systems, and is characterised by irregular behaviour of trajectories in phase space, which are impossible to predict for long times [1–5]. Despite this irregularity, chaotic dynamics is governed by a few simple principles which have deep connections with concepts in statistical physics and information theory. In this short introduction to chaos in nonlinear dynamics, our focus will be on these aspects of the theory, which help explain why chaos theory is applicable to such a broad range of phenomena. We will avoid technicalities and try to convey what we consider to be the key ideas in the area, without attempting to be mathematically rigorous. We will emphasise especially the connections between these different ideas, and how they are all ultimately a consequence of the sensitivity of the dynamics to initial conditions which is the hallmark of chaos. We direct the readers interested in more in-depth coverage of the topics presented here to references [1–5]. We start by defining precisely the single most characteristic feature of chaos – the sensitivity of trajectories to initial conditions, as measured by the Lyapunov exponent (Section 16.2). We then see how the combination of the exponential separation of close trajectories with bounded motion leads to a dynamics which can be characterised by probabilities, resembling statistical physics (Section 16.3). This leads to the definition of entropy for a chaotic system, which allows us to quantify the information generated by a chaotic system (Section 16.4). Finally, we see how the description of the dynamics of chaotic systems can be drastically
simplified without any loss of information, by means of symbolic dynamics, and we derive some important results on the complex structure of orbits in chaotic systems (Section 16.5).
16.2 Sensitivity to initial conditions and the Lyapunov exponent
Chaotic systems are characterised by a sensitivity to initial conditions, which means that very close initial conditions lead to trajectories which quickly separate, and end up going into completely distinct motions. The separation takes place exponentially fast in chaotic systems. More precisely, for a pair of initial conditions that are separated by a distance δ_0 defined in an appropriate metric in the phase space, the distance δ(t) at future times t increases as δ(t) ∼ δ_0 e^{λt} for sufficiently large t, where λ is the Lyapunov exponent, first introduced by Oseledec [6]:

    λ = lim_{δ_0→0, t→∞} (1/t) ln[δ(t)/δ_0].    (16.1)

Now let us interpret δ_0 as our limit of resolution in distinguishing initial conditions – for example, suppose we have a measurement accuracy limit which makes it impossible to distinguish two initial conditions separated by a distance less than δ_0. If the system has a positive Lyapunov exponent λ, within a time of the order of

    τ ∼ 1/λ,    (16.2)

it will be impossible to predict the state of the system. τ is called the Lyapunov time. The existence of a positive Lyapunov exponent imposes severe restrictions on the predictability of chaotic systems, even though they are governed by totally deterministic laws. To see this, suppose we want to improve the prediction time of a certain chaotic system of interest by increasing the accuracy δ_0 of the measurement of the initial conditions. Assume that due to a technological breakthrough our measurement is now ten orders of magnitude more accurate. This would mean that we managed to decrease δ_0: δ_0 → 10^{−10} δ_0. Let ε be the greatest error we consider acceptable in our prediction. We can then define the prediction time T as the time it takes for the initial error δ_0 to increase to ε; in other words,

    T = (1/λ) ln(ε/δ_0).    (16.3)

The above expression means that decreasing δ_0 by a factor 10^{−10} would increase the predictability time by an additional λ^{−1} · 10 ln 10. If, for example, ε/δ_0 = 10^3 before the increase in accuracy, this would represent only about a fourfold increase in the prediction time. So we would have to increase the accuracy by ten orders of magnitude to buy a fourfold increase in the prediction time! The problem here is that we would need to increase the accuracy exponentially fast in order to obtain a linear increase in the quality of our prediction as measured by T, which makes increasing T indefinitely a hopeless proposition. This means that in practice, for times much larger than the Lyapunov time, chaotic systems are unpredictable.
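As a concrete illustration of how a Lyapunov exponent, and the prediction horizon it implies, can be estimated in practice, the following Python sketch applies the time average underlying Equation (16.1) to the logistic map x_{n+1} = r x_n(1 − x_n) with r = 4. The map, the parameter value and the trajectory length are illustrative choices made here, not prescriptions from the chapter.

```python
import numpy as np

def lyapunov_logistic(r=4.0, x0=0.2, n_steps=100_000, n_transient=1000):
    """Estimate the Lyapunov exponent of the logistic map x -> r*x*(1-x).

    The exponent is taken as the long-time average of ln|f'(x)| along a
    trajectory, the discrete-time analogue of Equation (16.1).
    """
    x = x0
    # Discard an initial transient so that the average is taken on the attractor.
    for _ in range(n_transient):
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(n_steps):
        total += np.log(abs(r * (1.0 - 2.0 * x)))  # ln|f'(x)|, with f'(x) = r(1 - 2x)
        x = r * x * (1.0 - x)
    return total / n_steps

if __name__ == "__main__":
    lam = lyapunov_logistic()
    print(f"estimated Lyapunov exponent: {lam:.4f}")       # close to ln 2 = 0.693 for r = 4
    print(f"Lyapunov time ~ {1.0 / lam:.2f} iterations")   # time for errors to grow by a factor e
```

For r = 4 the estimate converges to ln 2 ≈ 0.69 per iteration, so halving the uncertainty in the initial condition buys only about one additional iteration of reliable prediction, in line with the discussion above.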
16.3 The natural measure
In many cases the trajectories of a dynamical system are confined to a bounded region in phase space; this is the usual situation in models of biological systems. For dissipative systems, volume in phase space is contracted by the motion, which implies that trajectories eventually converge to sets of zero volume (or measure) in phase space, called attractors. If the system is chaotic, however, this zero-volume set cannot be a simple fixed point or limit cycle, since the positive Lyapunov exponent of chaotic systems means that nearby orbits separate exponentially. This stretching characterised by the Lyapunov exponent is combined with the folding caused
by the confined nature of the motion and the contraction caused by the dissipative dynamics, resulting in attractors with very complex geometries – strange attractors [7], as they are known. It can be shown that this stretching and folding mechanism generates attractors with fractal properties – that is, nonsmooth manifolds with nontrivial structure in arbitrarily small scales. How are orbits organised within attractors? The exponential separation of initially close orbits suggests that trajectories starting from a neighbourhood of any given point in the attractor can eventually get arbitrarily close to any other point of the attractor. This property is called topological transitivity, and it is a crucial feature of chaotic dynamics. It implies the existence of dense orbits in the attractor. An orbit is dense if every neighbourhood of every point in the attractor has a nonempty intersection with the orbit. Topological transitivity suggests that single trajectories may sample the whole attractor, in much the same way that in classical statistical mechanics the state of a gas is imagined to sample all the phase space region it has access to – the energy shell in the microcanonical ensemble. There we pass from a description of the system based on trajectories to one based on probabilities by invoking the idea that the system samples the accessible phase space such that the average time it spends in a certain volume of the phase space is proportional to that volume, independently of the initial conditions. This (unproved) assumption of classical statistical physics allows us to assign probabilities to given regions of the phase space, and forget about trajectories, simplifying enormously the task of making predictions about the system. We can follow a similar path in chaotic systems if dynamics is ergodic [8], that is, if it is such that almost all trajectories of the system spend the same fraction of time in any given region of phase space, independently of their initial conditions. By ‘almost all’ we mean all initial conditions except possibly for a set of volume zero (more precisely, of Lebesgue measure zero). To make this concept more precise, let us define a partition of the phase space as a set of N disjoint sets {W_i}_{i=1}^N such that the attractor is contained in the union of all N sets. Consider a certain initial condition x_0 which generates a certain trajectory, and define p_i^{(x_0)} as the fraction of time the trajectory spends in the set W_i. If the dynamics is ergodic, the p_i^{(x_0)} are independent of x_0 except for a set of zero volume. We can therefore write the fractions as simply p_i. Interpreting p_i as the probability that we find the system in the region W_i of phase space, we make a conceptual leap similar to the one made in statistical mechanics. But are chaotic systems really ergodic? Ergodicity requires a loss of memory: trajectories should ‘forget’ about their initial conditions after a sufficiently long time for ergodicity to hold. This is strongly suggested by the existence of positive Lyapunov exponents and exponential separation of trajectories – as we have seen previously, for times much longer than the Lyapunov time trajectories can be expected to be found anywhere, regardless of where they started from. Ergodicity has been proved rigorously in some simple systems, and has been verified numerically to high accuracy in many others, and it is widely believed that it is a general property of chaotic systems.
The combination of topological transitivity (the property that a single orbit can explore the whole of the attractor) and ergodicity (the fact that orbits do this exploration independently of their initial conditions) ensures that one can assign uniquely a probability p(A) to any region A of phase space; the function p is called the natural measure of the dynamics [9], and plays a crucial role in dynamical systems theory.
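The natural measure can be approximated numerically by following one long trajectory and recording how often it visits each cell of a partition. The Python sketch below does this for the logistic map with r = 4 (an illustrative choice, not part of the original text); it relies on the ergodicity assumption discussed above, taking the visit frequencies of a single generic trajectory as the probabilities p_i.

```python
import numpy as np

def natural_measure_histogram(r=4.0, x0=0.3, n_steps=1_000_000, n_bins=50):
    """Approximate the natural measure of the logistic map by visit frequencies.

    Ergodicity is assumed: the fraction of time a single long trajectory spends
    in a cell W_i of the partition is used as the probability p_i = p(W_i).
    """
    x = x0
    counts = np.zeros(n_bins)
    for _ in range(1000):                      # discard an initial transient
        x = r * x * (1.0 - x)
    for _ in range(n_steps):
        x = r * x * (1.0 - x)
        counts[min(int(x * n_bins), n_bins - 1)] += 1
    return counts / n_steps                    # estimated cell probabilities p_i

if __name__ == "__main__":
    n_bins = 50
    p = natural_measure_histogram(n_bins=n_bins)
    # For r = 4 the natural measure is known exactly: the probability of [a, b] is
    # (2/pi)*(arcsin(sqrt(b)) - arcsin(sqrt(a))), so the estimate can be checked.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    exact = (2.0 / np.pi) * (np.arcsin(np.sqrt(edges[1:])) - np.arcsin(np.sqrt(edges[:-1])))
    print("largest deviation of estimated from exact cell probability:",
          float(np.max(np.abs(p - exact))))
```

The deviation shrinks as the trajectory is made longer, which is a simple numerical check of the ergodic sampling of the attractor by a single orbit.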
16.4 The Kolmogorov–Sinai entropy
Now that we can assign probabilities to sets of states for chaotic systems, we can make use of all the concepts used in statistical physics. In particular, the entropy S({W_i}) of a given partition {W_i} is defined as

    S({W_i}) = S^{(1)} = − Σ_i p_i ln p_i.    (16.4)
p_i = p(W_i) is the probability of finding the system in the region W_i of phase space at a certain time t, say at t = 0. Now let us define p_{ij} as the probability that the system is observed in region W_i at time t = 0, and then in region W_j at a later time t = Δt. Thus p_{ij} is given by

    p_{ij} = p(W_{ij}) = p(W_i ∩ M_{Δt}^{−1} W_j),    (16.5)

where M_{Δt}^{−1} W_j denotes the set of initial conditions such that they are found in W_j at time Δt (M_{Δt}^{−1} is the pre-image of the mapping M_{Δt} which advances the system’s state by a time interval Δt). The sets W_{ij} = W_i ∩ M_{Δt}^{−1} W_j constitute a more refined partition than the original partition {W_i}, with more regions of smaller measure. We can define an entropy S^{(2)}({W_i}) = S({W_{ij}}) for this new partition:

    S^{(2)} = S({W_{ij}}) = − Σ_{ij} p_{ij} ln p_{ij}.    (16.6)
Borrowing from Shannon’s information theory [10], S^{(1)} can be interpreted as the amount of information we have about the system, given that we know in which region W_i the system is at t = 0. Similarly, S^{(2)} is a measure of the amount of information we have if we know in which of the N regions of the partition the system is at times t = 0 and t = Δt. So the difference S^{(2)} − S^{(1)} = ΔS^{(2)} is the amount of information we have gained by taking note of where the system goes one time step in the future. We can define the rate of ‘information production’ H({W_i}) by a chaotic system during a time interval Δt, for a given partition {W_i}, as

    H({W_i}) = lim_{n→∞} ΔS^{(n)}.    (16.7)
The Kolmogorov–Sinai entropy H is defined by choosing a partition {W_i} which maximises the information gain [11]:

    H = sup_{{W_i}} H({W_i}).    (16.8)
When we introduced the concept of the separation of trajectories, the Lyapunov exponent was interpreted as a measure of the amount of time it takes for a chaotic system to become unpredictable. There chaos was seen as a creator of uncertainty. Now we are talking about chaos as a generator of information. How are these two points of view reconciled? The answer is that the exponential separation of trajectories makes it very hard to infer the future state of the system from the present state (beyond the Lyapunov time), but the same phenomenon enables one to infer the past state from the present one with exponentially increasing accuracy, as shown in the discussion about the ever finer partitions W_{ij···k} above. Suppose our partition {W_i} represents the accuracy with which we can measure the state of the system. If we observe the system at time t = 0 and see that it is in region W_1, for example, we only know that the initial condition is in that region. But after observing a further time step and seeing that the system is in, say, W_3, we now know that the initial condition must be in W_1 ∩ M_{Δt}^{−1} W_3, which is a smaller region than W_1; so we have increased the accuracy of our estimation of the initial condition. A further observation will increase our accuracy even more, and so on. Because of the exponential separation, the measure of the region in which we estimate the initial condition to lie must clearly decrease exponentially; otherwise the Kolmogorov–Sinai entropy would be zero from Equation (16.8).
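A rough numerical counterpart of this construction is to code a long trajectory with a two-cell partition and compare the entropies of successively longer symbol blocks: the gain from block length n to n + 1 estimates the information produced per time step. The sketch below does this for the logistic map with the partition {[0, 1/2), [1/2, 1]}, which for r = 4 is a generating partition; all numerical choices here are illustrative, not taken from the chapter.

```python
import numpy as np
from collections import Counter

def symbolic_trajectory(r=4.0, x0=0.3, n_steps=200_000):
    """Code a logistic-map trajectory with the two-cell partition {[0,1/2), [1/2,1]}."""
    x, symbols = x0, []
    for _ in range(n_steps):
        x = r * x * (1.0 - x)
        symbols.append(1 if x >= 0.5 else 0)
    return symbols

def block_entropy(symbols, n):
    """Shannon entropy (in nats) of the empirical distribution of length-n blocks."""
    blocks = Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))
    total = sum(blocks.values())
    p = np.array([c / total for c in blocks.values()])
    return float(-(p * np.log(p)).sum())

if __name__ == "__main__":
    s = symbolic_trajectory()
    # The entropy gained per step, S^(n+1) - S^(n), estimates the KS entropy,
    # which is ln 2 = 0.693 per iteration for the fully chaotic logistic map.
    for n in range(1, 6):
        gain = block_entropy(s, n + 1) - block_entropy(s, n)
        print(f"block length {n} -> {n+1}: entropy gain = {gain:.4f}")
```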
16.5 Symbolic dynamics
For a given partition {W_i}_{i=1}^N, any trajectory of the system can be mapped to a bi-infinite sequence of symbols

    ··· a_{−2} a_{−1} . a_0 a_1 a_2 ···,    (16.9)
where the symbols a_i ∈ {1, 2, . . . , N} correspond to the successive positions of the trajectory with respect to the partition {W_i}. Taking the index i = 0 to refer to time t = 0, a sequence ··· 21.31 ··· would mean that at time t = 0 the system is in W_3, at time t = Δt it will be in W_1, at time t = −2Δt it was in W_2, and so on. We can imagine that the infinite sequence (16.9) is approached by a limit process where symbols are added successively to the specification of an orbit. So we would start with a_{−1}.a_0, and then add one more time step in the past and in the future and make the sequence a_{−2}a_{−1}.a_0a_1, and so forth. Let us denote by A_{ij} all the orbits in the system which correspond to the sequence i.j, that is, to a_{−1} = i and a_0 = j. There are many trajectories satisfying this. Now consider the set A_{kijl}, corresponding to the sequence ki.jl; this is a subset of A_{ij}, of smaller measure. As we increase the number of symbols we prescribe, we narrow down the set of possible trajectories more and more. In the limit of infinitely many symbols, we could expect to narrow it down to a single orbit. This is indeed the case if we choose the appropriate partitions, called Markov partitions [2]. Using Markov partitions, we establish a one-to-one mapping between the orbits in the system and the set of all allowed symbol sequences. Advancing one time step Δt corresponds to moving the dot in Equation (16.9) one position to the right:

    ··· a_{−2} a_{−1} . a_0 a_1 a_2 ··· ⟶ ··· a_{−2} a_{−1} a_0 . a_1 a_2 ···.    (16.10)
This procedure of mapping the original dynamics of the system to a dynamics in the symbol space is referred to as symbolic dynamics [12]. A number of properties of chaotic dynamics can be very clearly seen using symbolic dynamics. For simplicity in the following discussion, we will assume that we have a Markov partition with only two symbols, which we will designate by 0 and 1. We will assume, furthermore, that all possible transitions are allowed – that is, any symbol (0 or 1) can succeed any symbol. This means that all possible combinations of 0 and 1 are allowed sequences, and correspond to unique trajectories. The conclusions which follow do not depend on these assumptions; they are only made to simplify the arguments. It is clear that because of the one-to-one mapping between symbol sequences and trajectories, all repeated symbol sequences correspond to periodic orbits – for example, ··· 01.0101 ··· corresponds to a period-2 orbit, while ··· 001.001 ··· and ··· 101.101 ··· are period-3 orbits. It follows immediately from this that there are infinitely many periodic orbits in the system, with orbits of arbitrarily high periods. There are more ways of constructing period-3 symbol sequences than period-2, and in general, it can be easily verified that the number of periodic orbits increases exponentially with the period. These are all general properties of chaotic systems. What about the sequences which never repeat themselves? These correspond to aperiodic orbits, and represent ‘most’ of the orbits in the system, even though there are infinitely many periodic orbits. This can be seen by the following argument. First we show that we can map every symbol sequence to a point in the unit square. To do this, we define the coordinate x on the square to be the real number with the base-2 expansion 0.a_0 a_1 a_2 ···. That is,

    x = (1/2) a_0 + (1/2²) a_1 + (1/2³) a_2 + ···.    (16.11)

Similarly, we define the y coordinate by

    y = (1/2) a_{−1} + (1/2²) a_{−2} + (1/2³) a_{−3} + ···.    (16.12)
Clearly 0 ≤ x, y ≤ 1. Periodic orbits then correspond to points in the square with rational coordinates, which are associated with repeated digit sequences. So the set of all points with irrational coordinates are the aperiodic orbits. But we know from analysis that the set of points with irrational coordinates has full measure – that is, area 1 – while the set of points with rational coordinates has total area zero. So ‘almost all’ orbits in the system are aperiodic. The set of periodic orbits, however, is very important nevertheless, since it is dense: there are (infinitely many) periodic orbits arbitrarily close to any other orbit in the system, and any finite-time
trajectory in the system can be approximated with arbitrary precision by a periodic orbit. In all chaotic systems, periodic orbits are the ‘skeleton’ of the dynamics, and all the important dynamical features of a system can be understood in terms of the periodic orbits.
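The correspondence between symbol sequences and points of the unit square set up by Equations (16.11) and (16.12) is easy to experiment with numerically. The short sketch below (an illustration added here, not part of the original text) truncates the sums at a finite number of symbols and uses exact rational arithmetic to show that a periodic sequence lands on a point with rational coordinates.

```python
from fractions import Fraction

def coords_from_symbols(future, past):
    """Map a truncated bi-infinite binary sequence to (x, y) via (16.11)-(16.12).

    'future' holds a_0, a_1, a_2, ... and 'past' holds a_-1, a_-2, ....
    Exact rational arithmetic is used so that periodic sequences can be
    recognised as rational points of the unit square.
    """
    x = sum(Fraction(a, 2 ** (k + 1)) for k, a in enumerate(future))
    y = sum(Fraction(a, 2 ** (k + 1)) for k, a in enumerate(past))
    return x, y

def periodic(block, n_repeats):
    """Repeat a finite block to imitate a periodic symbol sequence."""
    return list(block) * n_repeats

if __name__ == "__main__":
    # The period-2 orbit ...0101.0101... (a_0 = 0, a_1 = 1, ...; a_-1 = 1, a_-2 = 0, ...):
    x, y = coords_from_symbols(periodic([0, 1], 20), periodic([1, 0], 20))
    print("truncated period-2 orbit maps to (x, y) =", float(x), float(y))
    # The truncations converge to the rational point (1/3, 2/3), as expected for
    # a repeating binary expansion; aperiodic sequences give irrational limits.
```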
16.6 Chaos in biology
Chaos theory and dynamical systems theory in general have close ties with biology, and many of the most paradigmatic examples in chaos are inspired by biology. For example, the logistic map is one of the best-known examples of one-dimensional mappings exhibiting chaotic behaviour, and it was suggested by May as an idealised discrete-time model for population growth in a single-species ecosystem with limited resources [13]. In fact, chaos is ubiquitous in ecology and population dynamics, including in particular epidemiology [14]. Chaos is also relevant for many other areas in biology, from the dynamics of whole organs in multicellular organisms, such as the heart, to that of single cells, such as neurons, and even subcellular processes, such as intracellular calcium oscillations [14], to name just a few. The nonlinearity of the dynamical interactions between the various components of biological systems makes chaos a likely phenomenon, and we expect chaos theory to continue to be relevant for biology in the future.
References
[1] Eckmann, J.-P., Ruelle, D. (1985) Rev. Mod. Phys. 57, 617.
[2] Alligood, K. T., Sauer, T. D., Yorke, J. A. (1996) Chaos: an Introduction to Dynamical Systems, Springer-Verlag, New York, NY.
[3] Devaney, R. L. (2003) An Introduction to Chaotic Dynamical Systems, 2nd edn, Westview Press, Boulder, CO.
[4] Ott, E. (2002) Chaos in Dynamical Systems, Cambridge University Press, New York, NY.
[5] Strogatz, S. (2000) Nonlinear Dynamics and Chaos, Perseus Publishing, Cambridge, MA.
[6] Oseledec, V. I. (1968) Trans. Moscow Math. Soc. 19, 197.
[7] Ruelle, D., Takens, F. (1971) Comm. Math. Phys. 20, 167.
[8] Birkhoff, G. D. (1931) Proc. Natl Acad. Sci. USA 17, 656.
[9] Bowen, R., Ruelle, D. (1975) Invent. Math. 79, 181.
[10] Shannon, C. E. (1948) Bell Syst. Tech. J. 27, 379.
[11] Kolmogorov, A. K. (1958) Dokl. Akad. Nauk. SSSR 98, 525.
[12] Williams, S. (ed.) (2004) Proc. Symp. Appl. Math. 60.
[13] May, R. M. (1976) Nature 261, 459.
[14] Murray, J. D. (2001) Mathematical Biology, Springer-Verlag, New York, NY.
17 Qualitative Inference in Dynamical Systems
Fatihcan M. Atay¹ and Jürgen Jost¹,²
¹ Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
² Santa Fe Institute for the Sciences of Complexity, USA
17.1 Introduction
Dynamical systems are mathematical models of processes where the state of a system changes in time according to some rule depending on its present state, external inputs, and possibly involving certain parameters. Mathematicians and physicists also say that such a system evolves in time. The word ‘evolution’, however, here does not carry the implications that it possesses in biology, namely, the interaction of selection, mutation and inheritance. Thus, a dynamical system can, but need not, arise from an optimization process; it can, but need not, possess some stochastic aspects; and it can, but need not, describe the state of a population of agents. All biological systems change in time, and therefore, dynamical systems are a ubiquitous class of biological models. The time scales can range from the millisecond scale of neuronal dynamics to the millions of years for population genetics. A fundamental difference between models is whether they are continuous or discrete. Here, the states of the system, the time of the model, and the space on which the model lives can all be taken as either discrete or continuous. Table 17.1 provides some examples. When one wants to model a biological process, one then has to decide about the level of detail or abstraction that is appropriate. In principle, continuous models can capture finer aspects than discrete ones, but other considerations may also make discrete versions preferable:
• Coarser models may capture the important qualitative aspects of a specific biological process well enough while their computational costs could be significantly lower.
• Irrelevant quantitative details provided by continuous models may obscure insight into the important qualitative aspects.
Table 17.1 Dynamical systems with discrete (d) or continuous (c) time, space, or states

Space  Time  States  Examples
d      d     d       Cellular automata, Boolean networks
d      d     c       Function iterations: x(n + 1) = F(x(n)), with n ∈ N
d      c     c       Ordinary differential equations: dx(t)/dt = f(x(t)), with t ∈ R+
c      c     c       Partial differential equations: ∂x(u,t)/∂t = ∂²x(u,t)/∂u² (t ∈ R+, u ∈ R)
• Detailed models may depend in a sensitive and perhaps unstable way on certain crucial parameters. Thus, when those parameters are not known with very high precision, the model predictions cannot be trusted. We shall discuss this issue in detail below. In any case, in such situations, one may prefer a coarser, but more robust model.
• In general, partial differential equations (PDEs) are mathematically more difficult to analyse than ordinary differential equations (ODEs). In particular, PDE models often develop singularities. These may correspond to real phenomena, like shock waves, turbulence, etc., but in other cases they could simply be artifacts of the model. In particular, biological systems rarely become singular in a serious manner.
Concerning modelling of space, we also have the following considerations:
• Instead of taking the spatial extension of a biological system and the possibility that the state could be a function of the location inside the system, one may simply look at aggregate or collective quantities. For instance, instead of considering the concentration of some molecule across the cell, one might wish to study the total amount of that molecule in the cell as a whole. Instead of looking at the spatial distribution of the members of a population in an ecosystem, one may only be interested in their total number in that system.
• Many biological systems are constituted of discrete, interacting units or agents. Therefore, instead of a system with a continuous spatial variable u, one may rather consider a discrete index i denoting the corresponding unit or agent. The resulting ODE then becomes vector valued, that is, in more down-to-earth notation, we get a system

    dx^i(t)/dt = f^i(x^1(t), . . . , x^m(t))    (17.1)

where as always t ∈ R denotes time, and the index i = 1, . . . , m stands for the different units. The case m = 1 is called the scalar case, and in that case, we do not need an index for x. Note that the arguments of f^i, that is, the dynamical rule for the ith unit on the right-hand side may involve all the x^j, not only the one with j = i. This reflects the fact that the units are interacting, that is, the future states of unit i depend not only on its own present state, but also on those of other units.
Equation (17.1) is a system of ODEs. This type of model is perhaps the most useful and widely used for biological systems. At the same time, there is a substantial body of mathematical theory and computational methods available for systems of ODEs. Therefore, models of the form (17.1) will play the most prominent role in our survey of qualitative aspects of dynamical systems. The alternative to an ODE system is often a time-discrete system of the form

    x^i(n + 1) = F^i(x^1(n), . . . , x^m(n)),    (17.2)

where n ∈ N now is discrete time. Equation (17.2) simply is the iteration of the map F = (F^1, . . . , F^m) whereas (17.1) yields the dynamical flow generated by the rule f = (f^1, . . . , f^m).
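To make the notation of (17.1) and (17.2) concrete, the following sketch integrates a small two-unit system of the form (17.1) with an off-the-shelf solver and then records the solution only at integer times, which is exactly the passage from the continuous flow to a time-discrete system of type (17.2). The particular right-hand side (a pair of damped, mutually coupled units) is a made-up illustration, not a model from this chapter.

```python
import numpy as np
from scipy.integrate import solve_ivp

def f(t, x):
    """Right-hand side f = (f^1, f^2) of a small system of type (17.1).

    Each unit's rate of change depends on its own state and on the other
    unit's state, i.e. the units interact.
    """
    x1, x2 = x
    return [-x1 + 0.5 * np.tanh(x2),   # f^1(x^1, x^2)
            -x2 + 0.5 * np.tanh(x1)]   # f^2(x^1, x^2)

if __name__ == "__main__":
    t_grid = np.arange(0.0, 10.0 + 1e-9, 1.0)          # observe at integer times only
    sol = solve_ivp(f, (0.0, 10.0), [2.0, -1.0], t_eval=t_grid, rtol=1e-8)
    # The sampled values x(0), x(1), x(2), ... form a time series; they are the
    # iterates of the time-1 map applied to the initial value.
    for t, x1, x2 in zip(sol.t, sol.y[0], sol.y[1]):
        print(f"t = {t:4.1f}   x^1 = {x1:+.5f}   x^2 = {x2:+.5f}")
```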
In fact, the two types are not so different from each other as the system (17.1) also induces a system of type (17.2). For that purpose, we simply define F (x) := x(T ) where x(t) is the solution of (17.1) with x(0) = x,
(17.3)
called the time-T map. In particular, by iteration, we see that for a solution of (17.2), we have

    x(nT) = F^n(x(0)),    (17.4)
i.e. we simply iterate the map F n times in order to get from the initial values to the values at time nT. Fixed points of the map F correspond to period-T solutions of (17.1). However, not every system of type (17.2) is induced in this manner by one of type (17.1). Often, one observes a dynamical system of type (17.1) not for all values of t, but only at certain discrete ones, for instance integer ones t ∈ N. Thus again, the observations are taken for the time-discrete system (17.2) with F given by (17.3). Such observations of dynamical systems at integer times (or other discrete values of t) are called time series (see Section 17.7). Therefore, for time series analysis, the concepts to be developed below for dynamical systems are relevant and useful. Dynamical systems may often, and perhaps will typically, depend on certain parameters λ_1, . . . , λ_k:

    dx^i(t)/dt = f^i(x^1(t), . . . , x^m(t); λ_1, . . . , λ_k).    (17.5)
These parameters can correspond to the values of certain external variables that are not part of the process and that can be assumed to stay constant during the time scale under consideration. These parameters then need to be measured or otherwise determined. They can also represent control parameters through which one hopes to steer the system. An important question then is how the short- or long-term behaviour of the dynamical systems depends on the values of these parameters. This is the question of structural stability that will be discussed in more detail below. Also, the system may be nonautonomous, that is, the dynamical rule might depend explicitly on time t, i.e. the model is of the form

    dx^i(t)/dt = f^i(t, x^1(t), . . . , x^m(t); λ_1, . . . , λ_k)    (17.6)
In principle (17.6) can be put into the form (17.5) by treating t in the argument of f^i as an additional variable, i.e. x^{m+1} = t, and augmenting the system (17.6) with the equation dx^{m+1}/dt = 1, which yields again a system of the form (17.5). Nevertheless, often analytical tools are employed that are specifically useful for systems with explicit time dependence such as (17.6). For simplicity, we do not consider (17.6) in detail. A system of the form (17.1), that is, without explicit dependence of the dynamical rule f on t is called autonomous. The nice thing about autonomous systems is that they are invariant under time shifts. That is, when x(t) is a solution, so is x(t + s) for any s ∈ R. The dynamical rule, be it (17.1), (17.2), (17.5) or (17.6), must be supplemented by initial or starting conditions to obtain solutions. Usually, one assumes that the system starts to run, or the observations of the system commence, or the state is known at time 0, that is,

    x^i(0) = x_0^i    (17.7)
for given values x_0^1, . . . , x_0^m in the state space of the system. For autonomous systems, the choice of the starting time at 0 is completely arbitrary. We could have as well taken any t_0 ∈ R, as the solution x(t) of (17.1) with
x(t_0) = x_0 is equal to y(t − t_0), where y(t) is the solution with y(0) = x_0. Under rather general assumptions,¹ given an initial condition (17.7), our dynamical systems possess a unique solution x(t) for all times t ≥ 0. Thus, the future states of the system are determined by the dynamical rule together with the initial conditions (17.7). Furthermore, the solution x(t) at time t will depend continuously on the initial condition x(0) as well as on the parameters λ_i of the system. However, it should be noted that this continuous dependence is for finite time t and says nothing about the asymptotic behaviour of the system as t → ∞; furthermore, it gives no indication of how sensitively the solutions depend on initial conditions, a concept that lies at the basis of chaotic behaviour. Hence, as for the parameters, one may also ask how sensitively the short- or long-term behaviour of the system, that is, the solution x(t) for t > 0, will depend on those initial values. This is the question of dynamical stability. The simplest models are linear, that is (ignoring parameters and the like for the sake of simplicity),

    dx^i(t)/dt = Σ_{j=1,...,m} A_{ij} x^j(t) + b^i    (17.8)
for constants A_{ij}, b^i. The solutions of such models are simple exponential or circle laws, and they are of rather limited use in themselves because they do not account for nonlinear interactions between the units. Biologically useful and insightful models are typically nonlinear. Frequently occurring nonlinearities are polynomial, sinusoidal, exponential, etc. Nevertheless, a mastery of linear systems turns out to be invaluable to understand the local behaviour of the system near particular solutions, to determine asymptotic properties such as stability, and to study local bifurcations that are gateways to essential nonlinear phenomena. Finally, some models include time delays. These are most easily discussed for discrete time models of the form (17.2). The state of the system at time n + 1 will then not only depend on the state at time n, but also on those at previous times. For example, the system

    x(n + 1) = F(x(n), x(n − 1), . . . , x(n − k))    (17.9)
has a memory that extends k time steps into the past. By defining an augmented state (x^1, . . . , x^{k+1}) with x^i(n) = x(n − i + 1), we can express the time evolution of the system in the form of (17.2), namely

    x^1(n + 1) = F(x^1(n), . . . , x^{k+1}(n)),
    x^i(n + 1) = x^{i−1}(n),  for i = 2, . . . , k + 1.
Hence the delays have increased the dimension of the state-space, although the dimension has remained finite in this discrete-time setting. This can also be seen by noting that k + 1 initial conditions are needed to solve (17.9). The continuous-time case is similar, but yields an infinite-dimensional state-space, even in a simple scalar equation like

    dx/dt = f(x(t), x(t − τ))    (17.10)
with delay τ. Now the whole set of values x(t) for t ∈ [−τ, 0] (or any other interval of length τ, since the system is autonomous) are needed to solve the equation forward in time. Hence, the initial condition is no longer a set of numbers but a whole function on the interval t ∈ [−τ, 0], which shows that the appropriate state-space is an (infinite-dimensional) function space. Although most of the methods presented in this chapter
can be extended to equations with delays, the infinite dimensionality causes practical problems that are best dealt with more specialized methods. The interested reader is referred to classical texts [1,2] or to Atay [3] for a recent account of contemporary problems.
¹ For instance when the right-hand sides of the equations are differentiable functions and the system evolves in a compact invariant set. Further smoothness assumptions may be needed as the situation calls for, e.g., when dealing with bifurcations in Section 17.4. Such conditions will be assumed to hold throughout this chapter without being explicitly stated.
So far we have admitted any value x ∈ R^m for the observables that are changing in time according to our dynamical rule. Thus, in general the state space S of the dynamical system is taken as R^m. In practice, many dynamical systems are confined to some smaller state space S, that is, the rules f or F may only be defined for arguments in some S ⊂ R^m, and the values x(t) may always stay in that set S when the initial values x(0) lie in S.
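One way to see in a computation that the state of a delay equation such as (17.10) is a whole function on [−τ, 0] is to integrate it with a simple fixed-step scheme that carries the recent history along as a buffer. The sketch below uses the delayed logistic (Hutchinson) form f(x, x_τ) = x(1 − x_τ) purely as an illustration; the step size, delay and constant history are arbitrary choices, and specialised delay-equation solvers would be preferable in serious work.

```python
import numpy as np

def euler_dde(f, history, tau, t_end, dt=0.01):
    """Integrate dx/dt = f(x(t), x(t - tau)) with the explicit Euler method.

    'history' is a function giving x(t) on [-tau, 0]: the initial condition of a
    delay equation is a whole function, not a single number.
    """
    n_delay = int(round(tau / dt))      # number of steps spanning one delay
    n_steps = int(round(t_end / dt))
    # Pre-fill the buffer with the history function evaluated on [-tau, 0].
    buffer = [history(-tau + k * dt) for k in range(n_delay + 1)]
    ts, xs = [0.0], [buffer[-1]]
    for k in range(n_steps):
        x_now = buffer[-1]
        x_delayed = buffer[-1 - n_delay]            # value one delay in the past
        buffer.append(x_now + dt * f(x_now, x_delayed))
        ts.append((k + 1) * dt)
        xs.append(buffer[-1])
    return np.array(ts), np.array(xs)

if __name__ == "__main__":
    # Delayed logistic (Hutchinson) equation dx/dt = x(t) * (1 - x(t - tau)).
    ts, xs = euler_dde(lambda x, xd: x * (1.0 - xd), history=lambda t: 0.5,
                       tau=2.0, t_end=50.0)
    print("x at t = 10, 20, 30, 40, 50:",
          [round(float(xs[np.searchsorted(ts, t)]), 3) for t in (10, 20, 30, 40, 50)])
```

For this delay the solution settles into sustained oscillations around 1, a behaviour the undelayed logistic equation cannot produce.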
17.2 Basic solution types
In this section, we shall discuss systems of the form (17.1) or (17.5). Essentially analogous results hold for (17.2) whereas (17.6) or (17.10) may lead to additional complications that we shall not explore here. For notational conciseness we write (17.1) in vector form as

    dx(t)/dt = f(x(t))    (17.11)
where x(t) = (x^1(t), . . . , x^m(t)) ∈ R^m and f : R^m → R^m is defined by f(x) = (f^1(x^1, . . . , x^m), . . . , f^m(x^1, . . . , x^m)). A solution x(t) of (17.11) is called an orbit or a trajectory of the dynamical system. A point ξ = (ξ^1, . . . , ξ^m) ∈ R^m is called a rest point (or fixed point or equilibrium point) if

    f(ξ) = 0.    (17.12)
dx = 0, dt
(17.13)
Thus, when x(t) = ξ for such a rest point, then
that is, the solution will stay at that rest point forever. A good strategy for analysing a dynamical system is to first identify its rest points. Even though one is typically interested in the system only for finite times t, for the analysis of the system it is a basic procedure to consider the asymptotic behaviour of a solution x(t) = (x1 (t), . . . , xm (t)) for t → ∞. And in turn, in order to understand that asymptotic behaviour, we need some concepts. Definition 17.2.1 A subset I ⊂ Rm is a (forward) invariant set for the flow (17.11) if x(t) ∈ I for all t ≥ 0 whenever x(0) ∈ I.
(17.14)
(Thus, any trajectory that starts in I at time t = 0 will stay in I forever.) A compact connected invariant set A is called an attractor if it possesses an invariant open neighbourhood U such that for every x0 ∈ U the solution x(t) of (17.5) with x(0) = x0 satisfies dist(x(t), A) → 0 for t → ∞.
(17.15)
The largest such U is called the basin of attraction of A. The reader should be warned that there are certain subtleties about the definition of an attractor, but we cannot go into those issues here. Here, (17.15) implies that an attractor is an invariant subset that is asymptotically
stable in the sense that if the system state is perturbed slightly away from the attractor, that is, when x(t_0) ∈ U \ A for some t_0, then asymptotically the solution will return to A. The behaviour of the solutions of dynamical systems can then be roughly classified by the different types of attractors. Note that a system can have more than one attractor, and which of them is approached will then depend on the initial condition.

1. An attractor consisting of a single point, a so-called steady state. For instance, any solution of

    dx/dt = −x(t)    (17.16)
asymptotically approaches the point 0. Every such point attractor is a rest point, but not conversely, because rest points need not be asymptotically stable. For instance, the scalar system

    dx/dt = λ − x²    (17.17)

for λ > 0 has rest points ξ = ±√λ, but only ξ = √λ is attracting. In fact, if x(t) > √λ, then dx/dt < 0, and if −√λ < x(t) < √λ, then dx/dt > 0, and therefore, the solution approaches ξ = √λ whenever −√λ < x_0 < ∞. For x(t) < −√λ, we have dx/dt < 0, and a solution with x_0 < −√λ asymptotically goes to −∞. Likewise, the logistic equation

    dx/dt = x(a − bx)    (17.18)
with a, b > 0 has two fixed points, an unstable one at ξ = 0 and a stable one at ξ = a/b, by a reasoning similar to the preceding case. The logistic equation models growth processes with capacity constraints. Thus, whenever we start at x(0) > 0, the process will asymptotically converge to that stable fixed point. In general, when ξ is a rest point, i.e. f(ξ) = 0, we want to check its stability, that is, whether it is an attractor or not. For that purpose, we Taylor expand around ξ, i.e.

    f(ξ + s) = Σ_i s^i ∂f(ξ)/∂ξ^i + (1/2) Σ_{i,j} s^i s^j ∂²f(ξ)/∂ξ^i ∂ξ^j + . . .

for ξ = (ξ^1, . . . , ξ^m), s = (s^1, . . . , s^m) [in the scalar case, this becomes the more familiar f(ξ + s) = s f′(ξ) + (1/2) s² f″(ξ) + . . . ] and insert this into our system (17.11). By keeping only those terms that are linear in s for |s| ≪ 1, we obtain the system

    ds(t)/dt = As.    (17.19)
Here, A = f′(ξ) is the Jacobian matrix [∂f^i/∂x^j] of the first derivatives of f = (f^1, . . . , f^m) evaluated at the rest point ξ. The rest point then is stable if all eigenvalues of A have negative real parts. It is unstable if at least one of them has a positive real part. When the largest real part is zero, no general statement can be made, and a more detailed analysis is necessary. This is the important linearization principle, which says that the local behaviour near a rest point of a nonlinear system is qualitatively similar to that of its linearization provided that the eigenvalues of the Jacobian have nonzero real parts. For instance, a pair of complex conjugate eigenvalues with positive or negative real part indicates the presence of, respectively, unstable or stable spirals around a fixed point. When the Jacobian has eigenvalues with real part zero, more care is needed since the conclusions cannot be translated to the nonlinear system. Such critical cases are discussed further in Section 17.4. Here the linear system provides the starting point of the analysis, which then needs to be amended by taking the nonlinear terms into account. For example, a complex conjugate pair with vanishing real part indicates the presence of periodic solutions in the linear system, in fact an infinite number of them
with the same period but with different amplitudes, since any linear combination of solutions of (17.19) is also a solution of the same equation. The nonlinear system, on the other hand, will under general conditions have a single periodic solution, which may or may not be stable. The Hopf bifurcation, which will be treated in Section 17.4, refers to the birth of such periodic solutions in nonlinear systems whose linearizations have complex conjugate eigenvalues with zero real part. Of course, fixed point attractors are usually not the most interesting ones because the solution will just stay there forever.

2. We consider a two-dimensional system and write (x, y) in place of (x^1, x^2) for simplicity of notation:

    dx/dt = y − x(x² + y² − λ)
    dy/dt = −x − y(x² + y² − λ)    (17.20)

For λ > 0 it has an attracting periodic orbit ξ(t) = √λ sin t, η(t) = √λ cos t. Periodic attractors, also called limit cycles, form a very important class of attractors for biological and other dynamical systems as they correspond to repetitive rhythmic behaviour.

3. More generally, a system may have quasiperiodic attractors of the shape of a torus, that is, a product of circles. Here, the system is attempting such periodic behaviour in several directions simultaneously, and as a consequence, it might never exactly return to any previous state, but it will eventually and regularly return to the vicinity of every point on such a quasiperiodic attractor. We omit the precise definition.

4. Chaotic attractors, in contrast, are geometrically very irregular; typically, they have a fractal structure. Near such an attractor, the system behaves unstably in some, but stably in other directions. Again, we cannot provide the details here.

As an example of a system with periodic oscillations, we consider the Lotka model for reaction kinetics. Here, we have chemical reactions

    A + X →^{k_1} 2X,    X + Y →^{k_2} 2Y,    Y →^{k_3} B    (17.21)
where A, X, Y, B are the chemical substances involved in the reactions, and the k_i are the reaction rates. Lower case letters will denote the concentrations of the chemicals. A is assumed to be maintained at a constant concentration a. The other concentrations are determined by the reactions according to the law of mass action:

    dx/dt = k_1 a x − k_2 x y = x(κ_1 − k_2 y)  with κ_1 := k_1 a
    dy/dt = k_2 x y − k_3 y = y(−k_3 + k_2 x).    (17.22)
(The concentration of B is not of interest for the dynamics because it is not an input for any reaction.) The system has two fixed points. One of them is (0, 0), which is not stable according to the above criterion [see (17.19)]. The other one is

    ξ = k_3/k_2,    η = κ_1/k_2.    (17.23)
Here, our linearization criterion does not apply as we get eigenvalues with zero real part. We consider

    V(x, y) := k_2 (ξ log x − x + η log y − y)    (17.24)
Figure 17.1 Trajectories in the Lotka model
which satisfies

    dV(x(t), y(t))/dt = ((k_3 − k_2 x)/x) dx/dt + ((κ_1 − k_2 y)/y) dy/dt = 0    (17.25)
by (17.23) and (17.24). Therefore, V (x(t), y(t)) has to be constant along any trajectory. Thus, each trajectory is confined to a level curve V (x, y) = constant. V attains its unique maximum at the fixed point (ξ, η). Therefore, the other level curves are circles around this fixed point. Thus, any trajectory moves along such a circle and is therefore periodic (Figure 17.1).
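The conservation of V along trajectories, and the zero-real-part eigenvalues at the interior rest point that make the linearization criterion inconclusive, are easy to confirm numerically. The sketch below integrates (17.22) for one arbitrary choice of rate constants (a = k_1 = k_2 = k_3 = 1, hence κ_1 = 1) and monitors V from (17.24); the parameter values are illustrative only, not taken from the chapter.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Lotka reaction kinetics (17.22) with a = k1 = k2 = k3 = 1, hence kappa1 = 1.
K1, K2, K3 = 1.0, 1.0, 1.0
KAPPA1 = K1 * 1.0                      # kappa1 = k1 * a with a = 1

def rhs(t, z):
    x, y = z
    return [x * (KAPPA1 - K2 * y), y * (-K3 + K2 * x)]

def V(x, y):
    """Conserved quantity (17.24) with xi = k3/k2 and eta = kappa1/k2."""
    xi, eta = K3 / K2, KAPPA1 / K2
    return K2 * (xi * np.log(x) - x + eta * np.log(y) - y)

if __name__ == "__main__":
    sol = solve_ivp(rhs, (0.0, 30.0), [0.5, 0.5], rtol=1e-10, atol=1e-12,
                    dense_output=True)
    values = [V(*sol.sol(t)) for t in np.linspace(0.0, 30.0, 7)]
    print("V along the trajectory:", np.round(values, 6))        # numerically constant
    # Jacobian of the right-hand side at the interior rest point (xi, eta) = (1, 1):
    J = np.array([[KAPPA1 - K2 * 1.0, -K2 * 1.0],
                  [K2 * 1.0, -K3 + K2 * 1.0]])
    print("eigenvalues at the rest point:", np.linalg.eigvals(J))  # purely imaginary
```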
17.3 Qualitative behaviour
In this section, we make some simple qualitative observations. We start with a scalar equation

    dx/dt = f(x),    (17.26)
like (17.16) or (17.17). The first observation is that in order to get confinement, that is, a solution always stays or moves into some bounded region, we need that

    f(x) < 0 for x ≫ 0,  and  f(x) > 0 for x ≪ 0.    (17.27)
For instance, f(x) = −x satisfies this condition. The condition simply ensures that x(t) decreases when it is large, because its derivative then is negative, and it increases when it gets very negative, because its derivative will be positive then. We next look at the case m = 2 and write (x, y) in place of (x^1, x^2). We consider the system

    dx/dt = a(x) + b(y)
    dy/dt = c(x) + d(y)    (17.28)
and again ask the simple question what the signs of the functions a, b, c, d can tell us about the behaviour. First of all, when a and d satisfy conditions of type (17.27), then again they drive the corresponding quantities towards bounded regions, this time as long as the other variable does not interfere with this.
In regions where b(y) and c(x) are positive (negative), they act towards increasing (inhibiting) x and y, respectively. Let us look at a more concrete situation,

    dx/dt = −x + b(y)
    dy/dt = −y + c(x)    (17.29)
with

    0 < b(y) < K_1,  lim_{y→∞} b(y) = 0,  db(y)/dy < 0,    (17.30)
    0 < c(x) < K_2,  lim_{x→∞} c(x) = 0,    (17.31)
    dc(x)/dx < 0.    (17.32)
A fixed point (ξ, η) is stable when the product b′(η)c′(ξ) of the derivatives of b and c is < 1 there, because then the eigenvalues −1 ± √(b′c′) of the linearization of the right-hand side at the fixed point,

    ( −1       b′(η) )
    ( c′(ξ)    −1    )    (17.33)

are both negative. In this case, the rest point is stable according to the criterion given when discussing (17.19). In contrast, when the product of the derivatives is > 1, the rest point is unstable. Therefore, when we want to have a situation where two stable rest points coexist with an unstable one, the first derivatives have to go up and down. Therefore, the second derivatives have to change sign somewhere. Let us look at the concrete example

    dx/dt = −x + α_1/(1 + y^β)
    dy/dt = −y + α_2/(1 + x^γ),    (17.34)
with positive parameters α_1, α_2, β, γ. Here, b′(y) = −α_1 β y^{β−1}/(1 + y^β)² and b″(y) = (α_1 β y^{β−2}/(1 + y^β)³)((β + 1)y^β − β + 1). The size of the first derivative can be controlled by α_1, and when β > 1, the second derivative changes sign. Analogously for c. The biological interpretation of this system is that x and y are the concentrations of two repressors. Since b and c here [as already in the general system (17.29), see (17.30), (17.32)] are positive decreasing functions, the growth of each repressor decreases the other’s growth rate. For instance, a genetic toggle switch in Escherichia coli is modelled in this manner.
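A quick numerical check of the bistability mechanism just described is to locate the rest points of (17.34) and apply the b′(η)c′(ξ) < 1 criterion at each of them. The sketch below does this for one symmetric parameter choice (α_1 = α_2 = 4, β = γ = 3); these values are illustrative assumptions made here, chosen only so that two stable rest points coexist with an unstable one.

```python
import numpy as np
from scipy.optimize import fsolve

ALPHA1, ALPHA2, BETA, GAMMA = 4.0, 4.0, 3.0, 3.0

def b(y):  return ALPHA1 / (1.0 + y**BETA)
def c(x):  return ALPHA2 / (1.0 + x**GAMMA)
def db(y): return -ALPHA1 * BETA * y**(BETA - 1) / (1.0 + y**BETA)**2
def dc(x): return -ALPHA2 * GAMMA * x**(GAMMA - 1) / (1.0 + x**GAMMA)**2

def rest_point_equations(z):
    x, y = z
    return [-x + b(y), -y + c(x)]        # right-hand side of (17.34) set to zero

if __name__ == "__main__":
    # Start the root finder from several guesses to pick up all three rest points.
    for guess in [(0.1, 3.9), (1.3, 1.3), (3.9, 0.1)]:
        xi, eta = fsolve(rest_point_equations, guess)
        product = db(eta) * dc(xi)       # stability criterion: b'(eta) c'(xi) < 1
        kind = "stable" if product < 1.0 else "unstable"
        print(f"rest point ({xi:.3f}, {eta:.3f}):  b'c' = {product:.3f}  -> {kind}")
```

With these parameters the two asymmetric rest points come out stable and the symmetric one unstable, which is the toggle-switch behaviour described in the text.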
17.4 Stability and bifurcations
When introducing the notion of an attractor, we have already talked about stability. An attractor A was stable in the sense that when a trajectory starts close to A, then it never moves away from A, but rather asymptotically approaches A. More generally, we can ask whether solutions x_1(t), x_2(t) with initial values satisfying

    |x_1(0) − x_2(0)| ≤ ε    (17.35)
for some small ε > 0 satisfy

    |x_1(t) − x_2(t)| ≤ ε for all t > 0    (17.36)

or even

    |x_1(t) − x_2(t)| ≤ ε e^{−βt} for all t > 0 and some β > 0,    (17.37)
that is, whether they become exponentially closer to each other. Of course, in general, this is not possible for arbitrary initial values (17.35) as the trajectories may approach different attractors. For instance, consider

    dx/dt = λx − x³    (17.38)

for λ > 0. It possesses the two fixed point attractors ±√λ. Moreover ξ = 0 is an unstable rest point in the sense that when x(0) > 0, then x(t) asymptotically approaches √λ, while for x(0) < 0, it goes to −√λ, and thus, in either case, it moves away from ξ = 0, but in different directions according to whether the trajectory starts to the right or to the left of this point. This example tells us that we expect dynamical stability in the sense of (17.36) or (17.37) only inside a single basin of attraction of some attractor, but not across different such basins. In fact, however, for many dynamical systems, the so-called chaotic ones, we never find such dynamical stability. In other words, a chaotic dynamical system is characterized by the property that even for very close initial conditions, the corresponding solutions will eventually move far away from each other. Thus, when we can determine the initial condition only with some finite precision, we are not able to make any prediction about the long-term behaviour of a trajectory. In other words, the trajectories of chaotic dynamical systems depend very sensitively on initial conditions, and arbitrarily small perturbations may have drastic long-term effects. There exist formal quantities that can be used to determine whether a system is chaotic. These are the so-called Lyapunov exponents. The Lyapunov exponent can be most easily formulated for discrete time dynamical systems (17.2). In the scalar case, we then have a single Lyapunov exponent

    λ(x) = lim_{n→∞} (1/n) Σ_{ν=0}^{n−1} log |F′(x(ν))|    (17.39)
for the trajectory with x(0) = x and where, of course, F′ denotes the derivative of F. In fact, under the assumption of ergodicity, to be discussed below, this Lyapunov exponent does not depend on the initial value x. When the system is not ergodic, at least λ(x) is typically constant inside each basin of attraction. λ > 0 then intuitively means that on average, along a trajectory, the derivative of F satisfies |F′| > 1, that is, the dynamical system is expanding and amplifies differences. Thus, we do not have dynamical stability, and the system is by definition chaotic. For the vector case, that is, when x has m > 1 components, we get m Lyapunov exponents λ_1 ≥ λ_2 ≥ · · · ≥ λ_m. The largest one can essentially be computed like (17.39), by always taking the derivative of F in that direction where it is largest. The higher Lyapunov exponents then have to be taken in directions that are orthogonal to that one. While the first Lyapunov exponent can be easily evaluated numerically, the numerical approximation of the other ones is more difficult because of that orthogonality condition. We shall discuss the approximate computation of Lyapunov exponents from time series data in Section 17.7. The number of positive Lyapunov exponents, that is, the number of independent expanding directions of a system can be taken as its degree of chaoticity. In particular, a system is chaotic when the largest Lyapunov exponent is positive, and it is sometimes called hyperchaotic if it has two or more positive Lyapunov exponents.
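For a system with more than one dimension, the full spectrum is commonly obtained by propagating a set of tangent vectors with the Jacobian and repeatedly re-orthonormalising them (for example with QR decompositions), which implements the orthogonality condition mentioned above. The sketch below applies this standard procedure to the Hénon map, a classical two-dimensional example that is not discussed in this chapter; parameters and iteration counts are illustrative choices.

```python
import numpy as np

def henon_lyapunov_spectrum(a=1.4, b=0.3, n_steps=100_000):
    """Estimate both Lyapunov exponents of the Henon map by the QR method.

    A pair of tangent vectors is propagated with the Jacobian of the map and
    re-orthonormalised at every step; the exponents are the average logarithms
    of the stretching factors (the diagonal entries of R).
    """
    x, y = 0.1, 0.1
    Q = np.eye(2)
    sums = np.zeros(2)
    for _ in range(1000):                          # discard a transient
        x, y = 1.0 - a * x * x + y, b * x
    for _ in range(n_steps):
        J = np.array([[-2.0 * a * x, 1.0],
                      [b, 0.0]])                   # Jacobian of the map at (x, y)
        Q, R = np.linalg.qr(J @ Q)
        sums += np.log(np.abs(np.diag(R)))
        x, y = 1.0 - a * x * x + y, b * x
    return sums / n_steps

if __name__ == "__main__":
    lam = henon_lyapunov_spectrum()
    print("Lyapunov exponents:", np.round(lam, 3))         # roughly (0.42, -1.62)
    print("sum of exponents:", round(float(lam.sum()), 3),
          " (should equal ln 0.3 = -1.204, the log of the Jacobian determinant)")
```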
Figure 17.2 Bifurcation diagram for dx/dt = λ − x²:
λ < 0: no fixed point;
λ = 0: x = 0 is a fixed point, neither attracting nor repelling;
λ > 0: x = ±√λ are fixed points, x = √λ attracting, x = −√λ repelling
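The picture summarised in Figure 17.2 above can be reproduced by a direct computation: for each value of λ one solves λ − x² = 0 for the rest points and classifies each by the sign of the derivative of the right-hand side, which is the scalar version of the linearization criterion (17.19). The short sketch below tabulates this; it is an added illustration, not part of the original text.

```python
import numpy as np

def saddle_node_branches(lambdas):
    """Rest points of dx/dt = lambda - x**2 together with their stability.

    A rest point xi is stable when d/dx (lambda - x^2) = -2*xi < 0, i.e. xi > 0.
    """
    rows = []
    for lam in lambdas:
        if lam < 0:
            rows.append((lam, "no rest point"))
        elif lam == 0:
            rows.append((lam, "x = 0 (neither attracting nor repelling)"))
        else:
            r = np.sqrt(lam)
            rows.append((lam, f"x = +{r:.3f} stable, x = -{r:.3f} unstable"))
    return rows

if __name__ == "__main__":
    for lam, description in saddle_node_branches([-0.5, -0.1, 0.0, 0.1, 0.5, 1.0]):
        print(f"lambda = {lam:+.2f}: {description}")
```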
In time series analysis, as explained in Section 17.7, one samples the trajectory of a dynamical system at discrete, say integer, times. Therefore, one only has the data from a single trajectory. Typically, one does not know the underlying dynamical rule. Therefore, methods have been developed to estimate Lyapunov exponents and other important quantities from observations of integer time samples of trajectories. An important question that is addressed by such methods then is whether the underlying system that may give rise to quite irregular time series data is simply stochastic or a deterministic, but possibly chaotic system. We will revisit time series in Section 17.7. The preceding was about dynamical stability, that is, about the sensitivity of the dependence on the initial values, or about whether the system asymptotically returns to the unperturbed state when its trajectory is slightly perturbed. There is another issue of stability, concerning the sensitivity of the dependence on the system parameters. That is, we are asking what happens when we perturb the dynamical rule itself. Will the system still behave essentially as before, or will it become qualitatively different? This is the issue of structural stability. First of all, this leads us to the issue of bifurcations. A bifurcation of a parameter dependent dynamical system occurs at those parameter values where the qualitative behaviour changes. We can see the simplest types of bifurcations in the preceding examples. The simplest one is (17.17). We have seen that (17.17) possesses two rest points, a stable and an unstable one, for λ > 0. In contrast, there is no rest point at all for λ < 0; in that parameter regime, every trajectory will go to −∞. Therefore, the qualitative behaviour of the system changes when λ varies from negative to positive values, that is, at λ = 0 (see Figure 17.2). This type of bifurcation is called a saddle-node bifurcation. Its qualitative features are shown in the bifurcation diagram in Figure 17.3, which depicts the solution branches near the rest point against the bifurcation
Figure 17.3 A saddle-node bifurcation occurring at the value λ = 0. For each λ > 0 there are two rest points, one of which is stable and the other one is unstable, whereas for λ < 0 there are no rest points. The solid and dashed curves represent, respectively, the branch of stable and unstable rest points that emerge as λ passes through zero
Figure 17.4 Bifurcation diagram for dx/dt = λx − x³:
λ < 0: x = 0 is an attracting fixed point;
λ = 0: x = 0 is an attracting fixed point;
λ > 0: x = ±√λ are attracting fixed points, while x = 0 is repelling
parameter, with the convention that solid and dashed curves represent stable and unstable solutions, respectively. The next example is (17.38). We recall that, for λ > 0, this system possesses two stable and one unstable rest points, the latter one at ξ = 0. In contrast, for λ < 0, there is a single solution of λx − x³ = 0, that is, a single rest point, ξ = 0, but this rest point is now stable, because λx − x³ < 0 for x > 0 and λx − x³ > 0 for x < 0 (see Figure 17.4). So, again a bifurcation occurs at λ = 0, which is called a pitchfork bifurcation. To conclude our discussion of the most basic bifurcations, we recall that (17.20), again for λ > 0, possesses an attracting periodic orbit. Also, the origin (ξ, η) = (0, 0) is an unstable rest point. Again, there is a bifurcation at λ = 0, as for λ < 0, there is no periodic orbit while the origin now is an attracting fixed point. This is an (Andronov)–Hopf bifurcation. The bifurcation is shown in Figures 17.5 and 17.6. Here we distinguish between supercritical and subcritical bifurcations, depicted in Figures 17.5 and 17.6, respectively. The example of (17.20) is a supercritical case, where the solution remains in the vicinity of the rest point regardless of the sign of the bifurcation parameter λ. By contrast, in the subcritical bifurcation, the periodic solutions that emerge are unstable and coexist with the stable rest point. Then, as the rest point loses its stability, the trajectories leave the neighbourhood of the rest point and the system behaviour shows a sharp transition as it moves towards a distant part of the state-space. Note that whether the bifurcation is super- or subcritical cannot be determined from the linear part of the equations, and nonlinear terms must be taken into account. Finally, we note that
Figure 17.5 Supercritical Andronov–Hopf bifurcation (x plotted against λ). The fixed point is stable for λ < 0, and as λ increases through zero, it loses its stability (dashed line) and a branch of stable periodic solutions of small amplitude (solid curve) emerges. An identical picture arises also for a supercritical pitchfork bifurcation, where all solution branches represent fixed points.
Figure 17.6 Subcritical Andronov–Hopf bifurcation (x plotted against λ). The branch of unstable periodic solutions exists for λ < 0, when the fixed point is stable. As λ is increased through zero the fixed point loses its stability. Unlike the supercritical case, there is now no local solution branch to attract the trajectories and the system typically ends up at another attractor far away, which often manifests itself in a sudden and significant change in behaviour.
Finally, we note that the pictures for (super- and subcritical) pitchfork bifurcations are the same as Figures 17.5 and 17.6, where all branches represent equilibrium solutions. Thus, at a bifurcation point, a dynamical system is not structurally stable. It is important to point out, however, that the saddle-node and the Hopf bifurcation themselves are structurally stable in the sense that they still occur when our system depending on the parameter λ is perturbed through further parameters. In other words, these are typical, or as one can also say, generic bifurcations, in the sense that they necessarily occur for certain types of parameter constellations, and they are therefore quite ubiquitous in parameter-dependent dynamical systems. By contrast, the pitchfork bifurcation mentioned above is not generic—changing the right-hand side of (17.38) slightly, say, by adding a small constant, changes the bifurcation picture qualitatively. We return to the example (17.22), but we now give it a different interpretation, coming from population dynamics, that is, from a completely different biological scale. This shows us that dynamical systems can be rather general formal models that may arise in very different applications. We consider two interacting populations from different species, with sizes denoted by x and y. The dynamics resulting from the interactions, the Lotka–Volterra model, is

$$\frac{dx}{dt} = x(a_1 - b_{12}\, y) \qquad (17.40)$$
$$\frac{dy}{dt} = y(-a_2 + b_{21}\, x) \qquad (17.41)$$
All coefficients are assumed to be positive. The first population is a prey or host population fed upon by the second one, a predator or parasite population. Therefore, the interaction leads to a negative effect, with strength −b12 , for the prey population (which increases with rate a1 when left alone), but has a positive effect, with strength b21 for the predators (which decrease, with coefficient −a2 in the absence of prey). Of course, this system is the same as (17.22), as only the coefficients have different names. Therefore, again, we get periodic oscillations. When there is a lot of prey, the predators grow fast and eventually decrease the prey population until there is so little prey left that the predators decrease for lack of food. When the predators are few enough, the prey population can recover, and the cycle starts again. The dynamics here is not structurally
stable, however. We change (17.40) to
$$\frac{dx}{dt} = x(a_1 - b_{12}\, y - b_1 x) \qquad (17.42)$$
where the new term may express competition for food among the prey. We then find the fixed point (ξ, η) = (a_1/b_1, 0). When y(t) = 0, it is attractive for x(t) because (17.42) then reduces to the logistic equation (17.18). This fixed point is also attractive for y(t) if
$$a_2 b_1 - a_1 b_{21} > 0 \qquad (17.43)$$
as is seen by the linearization criterion [see (17.19)]. Thus, the predators go extinct as η = 0 at this fixed point. This happens for an arbitrarily small value of b_1, and therefore the periodic oscillations found for b_1 = 0 are not structurally stable. In fact, if
$$a_2 b_1 - a_1 b_{21} < 0 \qquad (17.44)$$
then we have an attractive fixed point (ξ, η) with ξ and η both > 0. Therefore, the two populations will asymptotically converge to some equilibrium where neither of them is extinct. Again, there are no longer any periodic oscillations, but rather spirals towards the equilibrium. From a practical point of view, the linearization principle and bifurcation analysis complete the first steps of the study of dynamical systems, which can be summarized as follows. First we identify the fixed points of the system, and then calculate the Jacobian matrix at each one. In the case where the Jacobian has no eigenvalues on the imaginary axis, it determines the stability of the fixed point: if all eigenvalues have negative real parts then the fixed point is asymptotically stable, if there is an eigenvalue with positive real part then it is unstable. In the critical case when there is an eigenvalue with zero real part, we move on to bifurcation analysis. Then, generically, we have either a zero eigenvalue corresponding to a saddle-node bifurcation, which is associated with the appearance and disappearance of fixed points, or a pair of purely imaginary complex conjugate eigenvalues corresponding to a Hopf bifurcation, which is associated with the appearance and disappearance of small amplitude periodic solutions near the fixed point. The analysis then proceeds further by studying the low-order nonlinear terms in the equation to obtain more detailed information about the bifurcation, such as whether it is super- or subcritical. We also briefly summarize the bifurcation analysis of discrete-time systems such as (17.2) following similar steps. We start by writing (17.2) in vector form
$$x(n+1) = F(x(n)) \qquad (17.45)$$
with x(n) = (x_1(n), . . . , x_m(n)) ∈ R^m and F(x) = (F^1(x_1, . . . , x_m), . . . , F^m(x_1, . . . , x_m)). A rest point ξ ∈ R^m of the system is a fixed point of F, that is, ξ = F(ξ). Hence if the system starts from the initial condition x(0) = ξ then x(n) = ξ for all n ∈ N_+. Similar to the derivation leading to (17.19), linearization shows that small perturbations s about ξ obey the equation
$$s(n+1) = A\, s(n) \qquad (17.46)$$
where A is the Jacobian matrix [∂F i /∂xj ] evaluated at ξ. The fixed point ξ is asymptotically stable if all eigenvalues of A have moduli less than one, i.e. they are inside the unit circle in the complex plane, and it is unstable if A has an eigenvalue with modulus greater than one. When A has an eigenvalue with modulus exactly one, then we have the critical case corresponding to bifurcations, for which the generic cases can be classified as (a) an eigenvalue equal to 1, (b) an eigenvalue equal to −1, and (c) a pair of complex conjugate eigenvalues on the unit circle. Case (a) is the analogue of the saddle-node bifurcation we have seen above, and case (b) is called a flip or period-doubling bifurcation. Typically a period-two orbit is born as a fixed
point loses its stability through a flip bifurcation. Since a period-two solution of F is a fixed point of the second iterate F ◦ F, the same analysis can be repeated for the stability of the new solution. An interesting case is when the new solution also undergoes a flip bifurcation and the process repeats. Such successive flip bifurcations can lead to complicated behaviour known as the period-doubling route to chaos, which can even be observed in a simple scalar map with a quadratic nonlinearity. Finally, case (c), which results from a pair of complex conjugate eigenvalues of unit modulus, is called a Neimark–Sacker bifurcation and generically results in the birth of a closed invariant curve near the fixed point. We remark that the bifurcations considered here are codimension-1 bifurcations, which can be obtained by changing a single appropriate system parameter. By simultaneously varying several parameters, one can also study codimension-2 and higher-codimension bifurcations corresponding to more intricate dynamics. Finally, the bifurcations considered in this section deal with local behaviour near a simple invariant set, namely a fixed point, and hence are examples of local bifurcations, as opposed to global bifurcations that refer to interactions of invariant sets with each other as system parameters are changed. The importance of the concept of stability can hardly be overestimated for qualitative inference of dynamical systems. Indeed, uncertainties are always present in the model or in the measured parameters. When a solution is asymptotically stable or unstable, continuous dependence on initial conditions and parameter values is crucial to ensure that the calculated behaviour does not change qualitatively with small changes in parameters, and hence must be present in all similar physical systems. In the case of bifurcations, the genericity of the bifurcation guarantees that a similar bifurcation occurs for some nearby parameter value in the physical system, even though our parameters might only be imperfect estimates of the real ones. For example, the bifurcation parameter might be a constant external input level, and the possibility or impossibility of particular generic bifurcations tells us how the system can respond to varying input levels.
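To make these first steps concrete, the short sketch below (not from the text; the parameter values are assumptions chosen purely for illustration) locates a fixed point of the competitive Lotka–Volterra system (17.41)–(17.42) numerically and classifies its stability from the eigenvalues of a finite-difference Jacobian.

```python
# A minimal sketch of the first steps described above: locate a fixed point and
# classify it via the eigenvalues of the Jacobian, for the competitive
# Lotka-Volterra system (17.41)-(17.42). Parameter values are assumptions.
import numpy as np
from scipy.optimize import fsolve

a1, a2, b12, b21, b1 = 1.0, 0.5, 0.4, 0.3, 0.1   # assumed coefficients

def vector_field(u):
    x, y = u
    return np.array([x * (a1 - b12 * y - b1 * x),
                     y * (-a2 + b21 * x)])

def jacobian(u, eps=1e-6):
    # crude finite-difference Jacobian
    J = np.zeros((2, 2))
    f0 = vector_field(u)
    for j in range(2):
        du = np.zeros(2); du[j] = eps
        J[:, j] = (vector_field(u + du) - f0) / eps
    return J

# start the root finder near the coexistence equilibrium
fp = fsolve(vector_field, x0=np.array([a2 / b21, a1 / b12]))
eigvals = np.linalg.eigvals(jacobian(fp))
print("fixed point:", fp)
print("eigenvalues:", eigvals)
print("asymptotically stable" if np.all(eigvals.real < 0) else "not stable")
```

With these assumed values, condition (17.44) holds, and the eigenvalues have negative real parts with a nonzero imaginary part, consistent with the spiralling approach to equilibrium described above.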
17.5 Ergodicity
When we have a dynamical system, we can average functions on the state space S in two different ways. On one hand, we can take the average of a function φ with respect to some probability measure μ [that μ is a probability measure simply means that ∫_S dμ(ξ) = 1], that is, the spatial average,
$$\int_S \phi(\xi)\, d\mu(\xi), \qquad (17.47)$$
or we can take the average along a trajectory x(t) of the dynamical system, that is, the temporal average
$$\lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N-1} \phi(x(n)) \quad\text{or}\quad \lim_{T\to\infty} \frac{1}{T} \int_0^T \phi(x(t))\, dt. \qquad (17.48)$$
In order to make these two averages comparable, we assume that the measure μ is invariant under the dynamical system, that is,
$$\int_S \phi(\xi(t))\, d\mu(\xi(t)) = \int_S \phi(\xi(0))\, d\mu(\xi(0)). \qquad (17.49)$$
That means that when we replace every point ξ = ξ(0) in our state space by its dynamical image ξ(t), the solution of the dynamical system at time t with initial value ξ(0), the average of a function is not affected. Actually, the relationship between dynamical systems and invariant measures is of interest in both directions. We can either start with a dynamical system and try to find all its invariant measures, or we can start with some measure μ and consider those dynamical systems that leave it invariant.
In any case, comparing the spatial and the temporal averages leads to a fundamental concept.

Definition 17.5.1 A dynamical system is called ergodic with respect to an invariant probability measure μ on its state space S when the spatial and the temporal average agree for all functions φ on S, that is,
$$\lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N-1} \phi(x(n)) = \int_S \phi(\xi)\, d\mu(\xi) \qquad (17.50)$$
for μ-almost all initial values x(0). In particular, for an ergodic dynamical system, a typical trajectory will sample all regions of its state space. The reason is that if there were some subset S_0 of S with μ(S_0) > 0 with x(t) ∉ S_0 for all t > 0 and all trajectories with x(0) ∈ S_1 for some other subset S_1 of S with μ(S_1) > 0, then (17.50) would fail for the function φ with φ(ξ) = 1 for ξ ∈ S_0, φ(ξ) = 0 for ξ ∉ S_0. In particular, we can sample the entire state space by just following a single trajectory of the dynamical system. Thus, a typical run of the dynamical system explores the entire state space. When the dynamical system is not ergodic, there is a simple mathematical trick: we break the state space S up into pairwise disjoint components S_i such that the dynamical system is ergodic on every S_i. For instance, when the system possesses several attractors, we consider the dynamics on each corresponding basin of attraction separately, with the corresponding invariant measure being supported on the attractor in question. Thus, when taking time series observations, we have to look at a few runs of the system with different initial values in order to explore the different ergodic components.
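As a concrete illustration (an assumed example, not from the text), the sketch below compares the temporal average (17.48) with the spatial average (17.47) for the logistic map x(n + 1) = 4x(n)(1 − x(n)), whose invariant density on (0, 1) is 1/(π√(x(1 − x))); for an ergodic system the two numbers should agree.

```python
# Ergodicity check for the logistic map x -> 4x(1-x): compare the temporal
# average along one trajectory with the spatial average against the known
# invariant density 1 / (pi * sqrt(x(1-x))).  (Assumed illustration.)
import numpy as np

def time_average(phi, x0=0.2, n_steps=10**6):
    x, total = x0, 0.0
    for _ in range(n_steps):
        x = 4.0 * x * (1.0 - x)          # iterate the map
        total += phi(x)
    return total / n_steps

def space_average(phi, n_grid=10**6):
    # midpoint-rule quadrature against the invariant density
    x = (np.arange(n_grid) + 0.5) / n_grid
    density = 1.0 / (np.pi * np.sqrt(x * (1.0 - x)))
    return np.sum(phi(x) * density) / n_grid

phi = lambda x: x**2
print("temporal average:", time_average(phi))    # approximately 0.375
print("spatial  average:", space_average(phi))   # approximately 0.375
```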
17.6 Timescales
In the beginning, we have distinguished dependent dynamical variables x = (x_1, . . . , x_m) that change by some dynamical rule and parameters λ = (λ_1, . . . , λ_k) that were assumed to stay constant or to vary independently of the dynamical system itself, so as, for instance, to induce bifurcations. In many applications, however, it is more natural to subject also the λs to some dynamical rule, but one operating on a slower timescale. We shall therefore now describe a general setting of dynamical systems with two different timescales. Since the λs now will also become dependent dynamical variables, we change the notation and denote the slow variables by z. Formally, the system then is
$$\frac{dx}{dt} = f(x, z), \qquad \frac{dz}{dt} = \epsilon\, g(x, z) \qquad (17.51)$$
with 0 < ε ≪ 1.
When we let ε → 0, we formally recover a system with constant parameters, called the layer system
$$\frac{dx}{dt} = f(x, z), \qquad \frac{dz}{dt} = 0. \qquad (17.52)$$
If we rather want to focus on the slow parameter dynamics, we instead introduce a fast time scale τ := εt and rewrite (17.51) as
$$\epsilon \frac{dx}{d\tau} = f(x, z), \qquad \frac{dz}{d\tau} = g(x, z) \qquad (17.53)$$
with 0 < ε ≪ 1.
When we then let ε → 0, we are led to the reduced problem
$$0 = f(x, z), \qquad \frac{dz}{d\tau} = g(x, z). \qquad (17.54)$$
So far, however, these limits were purely formal, and the question is when the dynamical system is governed by the fast dynamics (17.52) and when the slow dynamics (17.54) takes over. As an example, we consider the van der Pol oscillator
$$\epsilon \frac{dx}{d\tau} = z - \frac{x^3}{3} + x =: z - \phi(x), \qquad \frac{dz}{d\tau} = -x. \qquad (17.55)$$
This system has a single fixed point at (0, 0). The matrix of the linearization A at this fixed point is
$$\begin{pmatrix} 1 & 1 \\ -1 & 0 \end{pmatrix},$$
which has the eigenvalues (1 ± √(1 − 4))/2. Thus, according to the linearization criterion, see the discussion around (17.19), it is unstable. Since, on the other hand, the system is two-dimensional and the dynamics always stays in a bounded region, as the term −x³/3 always pushes it back when |x| gets too large, we should expect the presence of a stable periodic trajectory surrounding (0, 0). So far, our reasoning has not yet involved ε. Now, when ε is small, by the first equation in (17.55), x changes rapidly except when z ≈ φ(x). Thus, qualitative reasoning yields the following asymptotic picture as ε → 0. Wherever the dynamics starts, it quickly moves towards the curve z = φ(x). Then the slow dynamics dz/dτ = −x takes over as long as z is such that z = φ(x) is a stable rest point of dx/dτ = z − φ(x). Again, we apply our linearization criterion and find for the linearization A at such a rest point (ξ, ζ) with ζ = ξ³/3 − ξ the matrix
$$\begin{pmatrix} 1 - \xi^2 & 1 \\ -1 & 0 \end{pmatrix},$$
which has the eigenvalues ((1 − ξ²) ± √((1 − ξ²)² − 4))/2. Thus, we have stability for |ξ| > 1, but stability is lost when ξ = ±1, ζ = ∓2/3 is reached. At such a point, the solution leaves the curve z = φ(x) and the fast dynamics takes over. x changes rapidly while z stays approximately constant until the dynamics reaches another branch of z = φ(x). From (1, −2/3), it thus jumps to (−2, −2/3). Since this is a stable part of z = φ(x), again the slow dynamics takes over until (−1, 2/3) is reached, where the branch again loses stability. The fast dynamics lets the solution jump to (2, 2/3), which again is stable. So, we continue on this branch of z = φ(x) by the slow dynamics until we reach again (1, −2/3), whence the process repeats itself periodically. This is depicted in Figure 17.7. The point of the above is that in order to understand the interaction of the fast and slow time scales qualitatively, we cannot simply put ε = 0, but we rather need to look at the asymptotics for ε → 0, ε > 0.
Figure 17.7 The dynamics of the van der Pol oscillator in the limit ε → 0 (axes x and z; the cubic curve shown is z = x³/3 − x).
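The following numerical sketch (the value of ε and the integration settings are assumptions) integrates (17.55) for small ε and reproduces the behaviour described above: slow drift along the stable branches of z = x³/3 − x punctuated by fast jumps in x.

```python
# Assumed numerical illustration of the slow-fast van der Pol system (17.55):
# for small epsilon the trajectory drifts slowly along the stable branches of
# z = x^3/3 - x and jumps rapidly in x at the fold points.
import numpy as np
from scipy.integrate import solve_ivp

eps = 0.01                                # assumed small time-scale separation

def vdp(tau, u):
    x, z = u
    return [(z - (x**3 / 3.0 - x)) / eps,  # fast variable x
            -x]                            # slow variable z

sol = solve_ivp(vdp, (0.0, 20.0), [0.5, 0.0], method="Radau",
                max_step=0.01)
x, z = sol.y
print("x range:", x.min(), x.max())   # roughly between -2 and 2
print("z range:", z.min(), z.max())   # roughly between -2/3 and 2/3
```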
17.7 Time series analysis
So far we have assumed that the equations governing the dynamics of our system are known, either precisely or approximately. There are also cases where one only has a set of measurements without knowing the dynamics that have produced it. The knowledge that the data come from some dynamical system can still allow us to obtain useful information about the system through the ideas presented in this chapter. In particular, we can construct a useful picture of the attractor and calculate several quantitative measures associated with it. Suppose we have a time sequence of r scalar measurements x(1), x(2), . . . , x(r − 1), x(r) from a deterministic autonomous system, which itself may be high-dimensional. We first choose some positive integer k and pass to the so-called delay coordinates by constructing augmented vectors
$$y(n) = (x(n), x(n-1), \ldots, x(n-k+1)) \in \mathbb{R}^k \qquad (17.56)$$
for n = k, . . . , r. (Note the similarity of this construction to the one used in Section 17.1 when we discussed delay equations.) If k is chosen sufficiently large (typically one more than twice the dimension of the attractor) then the sequence of the y(n)s yields an embedding of the dynamics into a k-dimensional space, in the sense that we obtain a picture that is consistent with properties of a dynamical system [4]. Choosing a smaller k will make some points appear artificially close to each other, much like the overlapping of geometrical features of a three-dimensional object when projected onto a two-dimensional plane. Hence, by progressively increasing the value of k until the number of such false neighbours decreases, one can obtain a suitable embedding even though the actual dimension of the system or the attractor is unknown. (One often observes finite-dimensional attractors even when dealing with infinite-dimensional deterministic systems, so that the described process ends at a finite k. Stochastic dynamics, however, essentially fill all embedding dimensions no matter how large k is chosen, which also gives a method to distinguish a chaotic time series from a purely random one.) An additional point to consider is that if the system is sampled too frequently compared with the time scale of the dynamics, then the values x(n) and x(n − 1) will be highly correlated, which decreases the information content of the vector in (17.56). In such a case it may be more useful to construct the augmented vectors with a larger separation of data points, say,
$$y(n) = (x(n), x(n-\tau), x(n-2\tau), \ldots, x(n-k\tau))$$
for some suitable integer τ. For appropriate choices of the embedding dimension k and the embedding delay τ, a faithful representation of the geometry of the system trajectories and the attractor can be obtained [8]. Moreover, the method has uses also for systems which are not completely autonomous or when the data are contaminated by measurement noise. The reader is referred to Kantz and Schreiber [5] for more details. Once the data are embedded into the appropriate space through delay coordinates, it is possible to construct pictures of the attractor, perform noise reduction, predict future values of the system, calculate quantitative measures of the dynamics, and do similar analyses to study the system in detail. For instance, for prediction of the system state at time n + 1, we consider the vector y(n) and find another point y(p), p < n, that is close to it in the reconstructed space, say ‖y(p) − y(n)‖ < ε in the norm of R^k for some small threshold ε. Now, based on determinism and the property of continuous dependence on initial conditions, we can take the next point in time y(p + 1) as a predictor of y(n + 1). In fact, if we find several points that are ε-close to y(n), we can take the average of all such predictors to obtain an improved estimate that reduces the effect of noise on the prediction. Similarly, to calculate the largest Lyapunov exponent λ of the system from time series, consider two points y(p) and y(q) that are near each other, ‖y(p) − y(q)‖ < ε. By the definition of the Lyapunov exponent, we expect that the difference ‖y(p + t) − y(q + t)‖ grows locally like e^{λt} in time, that is
$$\lambda \approx \frac{1}{t} \log \frac{\|y(p+t) - y(q+t)\|}{\|y(p) - y(q)\|}$$
as long as the two trajectories stay close. Hence, following nearby trajectories in the reconstructed space R^k over a few time steps and averaging the separation rates over all nearby pairs yields an estimate of the largest Lyapunov exponent. The remaining exponents can be obtained by calculating separations along directions orthogonal to that corresponding to the largest exponent; see Wolf et al. [6] for details. Since nearby points in the reconstructed space play a prominent role in the analysis, it is sometimes useful to use graphical depictions that bring the neighbourhood information to the foreground. For this purpose, a recurrence plot is often used, which is a two-dimensional array whose ith row and jth column contains, say, a black pixel if ‖y(i) − y(j)‖ < ε and a white pixel otherwise [7]. The resulting patterns correspond to the dynamical features of the system; for instance, a periodic attractor manifests itself in the form of parallel lines and a chaotic attractor results in short line segments, the lengths of which are inversely related to the largest Lyapunov exponent. Recurrence plots can also give visual indication when the embedding dimension and delay are incorrectly chosen [8]. The references list both the sources of our chapter and texts that contain additional material or a more detailed treatment of specific topics [9–19]. The titles indicate the contents of the treatises, so that we do not need to describe them in detail here.
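The sketch below (an assumed data source and assumed thresholds, not from the text) illustrates the delay-coordinate construction and the nearest-neighbour estimate of the largest Lyapunov exponent, using a scalar series generated by the chaotic logistic map, whose true largest exponent is log 2 ≈ 0.693.

```python
# Delay embedding and a crude largest-Lyapunov-exponent estimate for an assumed
# scalar series from the chaotic logistic map x -> 4x(1-x). Nearby pairs in the
# reconstructed space are followed for a few steps and their separation rates
# averaged; the estimate should be positive and of the order of log 2 ~ 0.693.
import numpy as np

r_len, x = 5000, 0.3
series = np.empty(r_len)
for n in range(r_len):
    series[n] = x
    x = 4.0 * x * (1.0 - x)

k, tau = 2, 1                              # assumed embedding dimension and delay
Y = np.column_stack([series[(k - 1 - i) * tau: r_len - i * tau] for i in range(k)])

eps, follow, rates = 1e-3, 4, []
for p in range(0, len(Y) - follow, 3):
    for q in range(p + 10, len(Y) - follow, 7):   # crude thinned neighbour search
        d0 = np.linalg.norm(Y[p] - Y[q])
        if 0.0 < d0 < eps:
            dT = np.linalg.norm(Y[p + follow] - Y[q + follow])
            rates.append(np.log(dT / d0) / follow)
print("pairs used:", len(rates))
print("estimated largest Lyapunov exponent:", np.mean(rates))
```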
References
[1] J. K. Hale, Theory of Functional Differential Equations, Springer, 1977
[2] V. Kolmanovskii and A. Myshkis, Applied Theory of Functional Differential Equations, Kluwer, 1992
[3] F. M. Atay (ed.), Complex Time-Delay Systems, Springer, 2010
[4] F. Takens, Detecting strange attractors in turbulence, in Dynamical Systems and Turbulence, Lecture Notes in Mathematics, 898:366–381, Springer-Verlag, 1987
[5] H. Kantz and T. Schreiber, Nonlinear Time Series Analysis, Cambridge University Press, 1997
[6] A. Wolf, J. B. Swift, H. L. Swinney, and J. A. Vastano, Determining Lyapunov exponents from a time series, Physica D, 16:285–317, 1985
[7] J. P. Eckmann, S. O. Kamphorst, and D. Ruelle, Recurrence plots of dynamical systems, Europhysics Letters, 4:973–977, 1987
[8] F. M. Atay and Y. Altıntas, Recovering smooth dynamics from time series with the aid of recurrence plots, Physical Review E, 59:6593–6598, 1999
[9] B. Aguda and A. Friedman, Models of Cellular Regulation, Oxford University Press, 2008
[10] J. Hofbauer and K. Sigmund, The Theory of Evolution and Dynamical Systems, Cambridge University Press, 1988
[11] J. Hofbauer and K. Sigmund, Evolutionary Games and Population Dynamics, Cambridge University Press, 1998
[12] J. Jost, Partial Differential Equations, Springer, 2007
[13] J. Jost, Dynamical Systems, Springer, 2005
[14] J. Jost, Mathematical methods in biology and neurobiology, to appear
[15] D. Kaplan and L. Glass, Understanding Nonlinear Dynamics, Springer, 1995
[16] E. Klipp, R. Herwig, A. Kowald, C. Wierling, and H. Lehrach, Systems Biology in Practice, Wiley-VCH, 2005
[17] C. Koch, Biophysics of Computation, Oxford University Press, 1999
[18] J. D. Murray, Mathematical Biology, Springer, 1989
[19] T. Trappenberg, Fundamentals of Computational Neuroscience, Oxford University Press, 2010
18 Stochastic Dynamical Systems
Darren J. Wilkinson, School of Mathematics and Statistics, Newcastle University, Newcastle-upon-Tyne, UK
18.1 Introduction
Previous chapters have shown that computational systems biology (Kitano, 2002) is concerned with developing dynamic simulation models of biological processes. Such models are useful for developing a quantitative understanding of the process, for testing current understanding of the mechanisms, and to allow in silico experimentation that would be difficult or time consuming to carry out on the real system in the laboratory. We have seen how continuous deterministic models can be developed, typically using an assumption of mass-action chemical kinetics leading to systems of ordinary differential equations (ODEs). In recent years there has been increasing recognition of the importance of modelling intrinsic stochasticity in intracellular biological processes, not captured by the traditional approaches (Wilkinson, 2009). In this chapter we will see how the theory of stochastic chemical kinetics forms the basis of a more realistic class of models than continuous deterministic alternatives, which models cellular dynamics using a Markov jump process (Wilkinson, 2006). We will also consider the challenges associated with fitting such models to experimental data.
18.2 Origins of stochasticity
18.2.1 Low copy number
The deterministic approach to kinetics fails to capture the discrete and stochastic nature of chemical kinetics at low concentrations. As many intracellular processes involve reactions at extremely low concentrations, such discrete stochastic effects are often relevant for systems biology models. Biochemical species are present in a system as an integer number of molecules, and this number changes abruptly and discretely as individual chemical reaction events take place. Such events are intrinsically stochastic, driven by the Brownian motion of
molecules in the system. The random fluctuations which arise from these fundamental statistical mechanical processes are often referred to as intrinsic noise (Swain et al., 2002).
18.2.2 Other sources of noise and heterogeneity
Intrinsic stochasticity in biochemical processes due to random molecular interactions is by no means the only source of noise in biological systems. Extrinsic noise is a term often used to describe other sources of noise in a biological system, including the unmodelled or poorly understood effects of biochemical species and pathways on the subsystem of interest. For example, a particular biochemical reaction of interest may be modulated by an unknown enzyme whose levels are fluctuating independently of the reaction and system of interest. This might lead to a reaction whose rate appears to vary (randomly) over time. Further, when looking across populations of cells, variations in initial conditions, environmental variations, and genuine cell to cell heterogeneity all lead to additional sources of apparent ‘noise’.
18.3 Stochastic chemical kinetics
18.3.1 Reaction networks
For mass-action stochastic kinetic models, it is assumed that the state of the system at a given time is represented by the number of molecules of each reacting chemical 'species' present in the system at that time, and that the state of the system is changed at discrete times according to one or more reaction 'channels'. We assume that there are u species denoted X_1, . . . , X_u, and v reactions, R_1, . . . , R_v. Each reaction R_i is of the form
$$p_{i1} X_1 + \cdots + p_{iu} X_u \longrightarrow q_{i1} X_1 + \cdots + q_{iu} X_u, \qquad i = 1, \ldots, v.$$
Here p_ij denotes the number of molecules of X_j that will be consumed by reaction R_i, and q_ij the number of molecules produced. Let P be the v × u matrix formed from the p_ij and Q be the corresponding matrix of the q_ij. We can write the entire reaction system in matrix/vector form as PX −→ QX. The matrices P and Q are typically sparse, and this fact can be exploited in computational algorithms. The u × v matrix S = (Q − P)^T is the stoichiometry matrix of the system, and is especially important in computational analysis of stochastic kinetic models, as its columns encode the change of state in the system caused by the different reaction events. This matrix is also sparse, due to the sparsity of P and Q. Let X_jt denote the number of molecules of X_j at time t and X_t = (X_1t, . . . , X_ut)^T.
18.3.2 Markov jump process
Consider first a bimolecular reaction of the form X + Y −→ Z. What this reaction means is that a molecule of X is able to react with a molecule of Y (to form a molecule of type Z) if the pair happen to collide with one another (with sufficient energy), while moving around randomly, driven by Brownian motion. Considering a single pair of such molecules in a container of fixed volume, it is possible to use statistical mechanical arguments to understand the hazard of molecules colliding. Under fairly weak assumptions regarding the container and its contents (essentially that it is small or well stirred, and in thermal equilibrium), it can be rigorously demonstrated that the collision hazard is constant, provided the volume is fixed and the temperature is constant. A comprehensive treatment of this issue is given in Gillespie
(1992), to which the reader is referred for further details. However, the essence of the argument is that as the molecules are uniformly distributed throughout the volume and this distribution does not depend on time, then the probability that the molecules are within reaction distance is also independent of time. For the case of this bimolecular reaction, it is assumed that the probability of a given pair of molecules reacting in a time interval of length dt is c dt for some constant c. But supposing that there are x molecules of type X and y of type Y, then there are xy independent pairs of molecules that could potentially react, and hence the probability of a reaction of this type occurring in a time interval of length dt is cxy dt. Thus the probability of a reaction occurring depends only on the current state of the system (the number of molecules of each type), and not on the entire history of the process – this is the so-called Markov property. Further, the changes in state occur at discrete times, punctuating finite periods of no change. These are the defining characteristics of a Markov jump process (Allen, 2003). Now let us consider the general case. We assume that reaction R_i has hazard (or rate law, or propensity) h_i(X_t, c_i), where c_i is a rate parameter. That is, the probability of a reaction of type R_i occurring in a time interval of length dt is h_i(X_t, c_i)dt. For the reasons given above, we assume that there is no explicit dependence on time in this hazard function. We put c = (c_1, . . . , c_v)^T and h(X_t, c) = (h_1(X_t, c_1), . . . , h_v(X_t, c_v))^T in order to simplify notation. Then the system evolves as a Markov jump process with independent reaction hazards for each reaction channel. Further, for mass-action stochastic kinetics, using arguments similar to that given above for the special case of bimolecular reactions, the algebraic form of each rate law is given as
$$h_i(X_t, c_i) = c_i \prod_{j=1}^{u} \binom{X_{jt}}{p_{ij}}, \qquad i = 1, \ldots, v.$$
Hence, given a reaction network structure, the vector of reaction rate constants, c, determines the stochastic behaviour of the system. Explicitly, if the current time is t and the current state of the system is X_t, then the set of possible states at time t + dt is {X_t} ∪ {X_t + S^{(i)} | i = 1, . . . , v}, where S^{(i)} is the ith column of S. Then for each i = 1, . . . , v we have
$$\Pr(X_{t+dt} = X_t + S^{(i)}) = h_i(X_t, c_i)\, dt,$$
and consequently
$$\Pr(X_{t+dt} = X_t) = 1 - \sum_{i=1}^{v} h_i(X_t, c_i)\, dt.$$
18.3.2.1 Chemical master equation
Now that we have described the stochastic kinetic model as a Markov jump process, it is natural to seek mathematical and computational descriptions of the process that will allow us to study and understand the process in greater detail. The conventional way to do this is to develop a set of coupled ODEs for the time evolution of the probability distribution for the system state. In the statistical physics literature this set of equations is known as the chemical master equation (Gillespie, 1992), but those from a probabilistic background will recognise them as Kolmogorov’s forward equations (Allen, 2003). We can give a simple derivation as follows. Assume that the initial state of the system is x0 at time t = 0, that is X0 = x0 , and suppose we are interested in the state of the system at some future time t > 0, Xt . Clearly Xt is a random variable, as the process is stochastic, but we can consider the probability distribution for Xt , and denote this p(·, t). That is, for every reachable state x, we have Pr(Xt = x|X0 = x0 ) = p(x, t).
We derive a set of differential equations for this probability distribution by considering p(x, t + dt), and its relationship to p(x, t). In order to have X_{t+dt} = x, the most likely explanation is that X_t = x also, but it is also possible that X_t was in the set {x − S^{(i)} | i = 1, . . . , v} and that the system jumped to x from one of those other states. Using the probability of not moving (derived above), we can expand p(x, t + dt) using the theorem of total probability as
$$p(x, t+dt) = p(x, t)\left(1 - \sum_{i=1}^{v} h_i(x, c_i)\, dt\right) + \sum_{i=1}^{v} p(x - S^{(i)}, t)\, h_i(x - S^{(i)}, c_i)\, dt.$$
A simple rearrangement gives the chemical master equation as
$$\frac{d}{dt}\, p(x, t) = \sum_{i=1}^{v} \left[ p(x - S^{(i)}, t)\, h_i(x - S^{(i)}, c_i) - p(x, t)\, h_i(x, c_i) \right].$$
Although there are cases where this set of differential equations can be solved in order to give explicit forms for the probability density p(·, t) (McQuarrie, 1967), in general it is not possible to do this, either analytically or computationally, and therefore other approaches to analysis are required.
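For systems whose reachable state space is very small, or can reasonably be truncated, the master equation can nevertheless be integrated numerically. The sketch below is an assumed illustration, not an example from the text: an immigration–death process (∅ → X at rate c1, X → ∅ at rate c2·x), truncated at N molecules, with the matrix exponential used to propagate p(·, t).

```python
# Assumed sketch: numerical solution of the chemical master equation for an
# immigration-death process on the truncated state space {0,...,N}.
import numpy as np
from scipy.linalg import expm

c1, c2, N = 10.0, 1.0, 60              # assumed rates and truncation level

# generator A, with column x giving the flows out of (diagonal) and into states
A = np.zeros((N + 1, N + 1))
for x in range(N + 1):
    if x < N:
        A[x + 1, x] += c1              # immigration: x -> x+1
        A[x, x] -= c1
    if x > 0:
        A[x - 1, x] += c2 * x          # death: x -> x-1
        A[x, x] -= c2 * x

p0 = np.zeros(N + 1); p0[0] = 1.0      # start with zero molecules
p_t = expm(A * 5.0) @ p0               # p(., t) at t = 5, solving dp/dt = A p
mean_t = np.sum(np.arange(N + 1) * p_t)
print("mean copy number at t = 5:", mean_t)   # approaches c1/c2 = 10 for large t
```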
18.3.2.2 Gillespie algorithm
The Markov process associated with a stochastic chemical kinetic system is typically nonlinear with unbounded state space. Consequently the models are generally analytically intractable, but realisations of the model can be simulated exactly using a computer, using a discrete event simulation algorithm, known in this context as the Gillespie algorithm (Gillespie, 1977). In a given reaction system with v reactions, we know that the hazard for a type i reaction is h_i(x, c_i), so the combined hazard for a reaction of some type occurring is
$$h_0(x, c) \equiv \sum_{i=1}^{v} h_i(x, c_i).$$
It can be shown that the time to the next reaction is Exp(h_0(x, c)), and also that this reaction will be a random type, picked with probabilities proportional to the h_i(x, c_i), independent of the time to the next event. That is, the reaction type will be i with probability h_i(x, c_i)/h_0(x, c). Using the time to the next event and the event type, the state of the system can be updated, and simulation can continue. In the context of chemical kinetics, this standard discrete event simulation procedure is known as the 'Gillespie algorithm' (or 'Gillespie's direct method', or sometimes even just the 'stochastic simulation algorithm', SSA), after Gillespie (1977). The algorithm can be summarised as follows:
1. Initialise the system at t = 0 with rate constants c_1, c_2, . . . , c_v and initial numbers of molecules for each species, x_1, x_2, . . . , x_u.
2. For each i = 1, 2, . . . , v, calculate h_i(x, c_i) based on the current state, x.
3. Calculate h_0(x, c) ≡ Σ_{i=1}^{v} h_i(x, c_i), the combined reaction hazard.
4. Simulate the time to the next event, t′, as an Exp(h_0(x, c)) random quantity.
5. Put t := t + t′.
6. Simulate the reaction index, j, as a discrete random quantity with probabilities h_i(x, c_i)/h_0(x, c), i = 1, 2, . . . , v.
7. Update x according to reaction j. That is, put x := x + S^{(j)}, where S^{(j)} denotes the jth column of the stoichiometry matrix S.
8. Output x and t.
9. If t < T_max, return to step 2.
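For concreteness, the sketch below (an assumed example network with made-up rate constants, not taken from the text) implements the direct method for a simple stochastic predator–prey-type system with reactions X1 → 2X1, X1 + X2 → 2X2 and X2 → ∅.

```python
# A minimal sketch of Gillespie's direct method for an assumed three-reaction
# network: R1: X1 -> 2X1, R2: X1 + X2 -> 2X2, R3: X2 -> 0.
import numpy as np

rng = np.random.default_rng(42)
S = np.array([[1, -1,  0],             # stoichiometry matrix: columns = reactions
              [0,  1, -1]])
c = np.array([1.0, 0.005, 0.6])        # assumed rate constants

def hazards(x, c):
    # mass-action hazards for the three reactions above
    return np.array([c[0] * x[0], c[1] * x[0] * x[1], c[2] * x[1]])

def gillespie(x0, t_max):
    t, x = 0.0, np.array(x0, dtype=float)
    times, states = [t], [x.copy()]
    while t < t_max:
        h = hazards(x, c)
        h0 = h.sum()
        if h0 <= 0.0:                  # no further reactions possible
            break
        t += rng.exponential(1.0 / h0) # time to next event
        j = rng.choice(3, p=h / h0)    # reaction index
        x += S[:, j]                   # update state
        times.append(t); states.append(x.copy())
    return np.array(times), np.array(states)

times, states = gillespie(x0=[50, 100], t_max=30.0)
print(states[-1])                      # state at the end of the simulated path
```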
Thus, even in situations where mathematical analysis of the stochastic kinetic model is intractable, it is often straightforward to simulate exact independent realisations of the process on a computer. Studying such realisations can give insight into the statistical properties of the process.
18.3.2.3 Random time change representation
We have examined the classical mathematical representation of a stochastic kinetic model – the so-called 'chemical master equation'. We have also examined the standard computational approach to generating realisations from the process – the Gillespie algorithm. There is a significant disconnect between these two representations. The chemical master equation does not directly relate to sample paths of the stochastic process, and this is a significant drawback in practice. An alternative mathematical representation of this Markov jump process can be constructed, known as the random time change representation (Kurtz, 1972), which turns out to be very helpful for mathematical analysis of the system. Let R_it denote the number of reactions of type R_i in the time window (0, t], and then define R_t = (R_1t, . . . , R_vt)^T. It should be clear that X_t − X_0 = SR_t (this is known as the state updating equation). Now for i = 1, . . . , v, define N_i(t) to be the count functions for v independent unit Poisson processes (so that N_i(t) ∼ Po(t), ∀t). Then
$$R_{it} = N_i\!\left( \int_0^t h_i(X_\tau, c_i)\, d\tau \right).$$
Putting N((t_1, . . . , t_v)^T) = (N_1(t_1), . . . , N_v(t_v))^T, we can write
$$R_t = N\!\left( \int_0^t h(X_\tau, c)\, d\tau \right)$$
to get
$$X_t - X_0 = S\, N\!\left( \int_0^t h(X_\tau, c)\, d\tau \right),$$
the random time-change representation of the Markov jump process. This simple mathematical representation of the process is typically an insoluble stochastic integral equation, but nevertheless lends itself to a range of analyses, and in many situations turns out to be more useful than the chemical master equation, due to its direct connection to the sample paths of the process. In particular, the time change representation is useful for analysing and deriving exact and approximate algorithms for simulating realisations of the stochastic process. It can be used to justify Gillespie's direct method, but also suggests related algorithms, such as Gillespie's first reaction method (Gillespie, 1976) and the next reaction method of Gibson and Bruck (2000). It also provides a natural way of understanding asymptotic properties, scaling limits and approximations. See Ball et al. (2006) for applications of this representation to analysis of approximate system dynamics. The time change representation also aids understanding of some recent fast hybrid stochastic simulation algorithms, such as those described in Haseltine and Rawlings (2002) or Salis and Kaznessis (2005).
18.3.2.4 Structural properties
The state updating equation Xt − X0 = SRt is key to understanding why the stoichiometry matrix, S, is such an important concept in biochemical network theory, irrespective of the modelling paradigm being adopted. In particular, there are important structural properties of the network which follow directly from this equation that are completely independent of any assumptions regarding the system dynamics. For example, for any
u-vector α, α^T X_t represents a linear combination of the state variables. But it is clear from the state updating equation that if S^T α = 0 (so that α^T S = 0^T), then the linear combination of state variables will remain constant over time. Such linear combinations are known as conservation laws, and it can be important for numerical algorithms to identify such laws in order to ensure that they are preserved by numerical procedures. v-vectors β satisfying Sβ = 0 also have a useful interpretation. The null spaces of S and S^T therefore give useful insight into global properties of the system, and can be identified using the singular value decomposition (SVD) of S (Golub and Van Loan, 1996).
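As an illustration of the last point, the following sketch (an assumed enzyme–substrate example, not from the text) extracts conservation laws from the null space of S^T using the SVD.

```python
# Assumed sketch: finding conservation laws from the stoichiometry matrix via the
# SVD, for the network  E + S -> ES,  ES -> E + S,  ES -> E + P.
# Rows of S: species (E, S, ES, P); columns: the three reactions.
import numpy as np

S = np.array([[-1,  1,  1],
              [-1,  1,  0],
              [ 1, -1, -1],
              [ 0,  0,  1]], dtype=float)

# left singular vectors associated with (near-)zero singular values span the
# null space of S^T, i.e. vectors alpha with alpha^T S = 0 (conservation laws)
U, sv, Vt = np.linalg.svd(S)
rank = np.sum(sv > 1e-10)
conservation_laws = U[:, rank:]        # each column is one conservation law alpha
print(conservation_laws.T @ S)         # approximately the zero matrix
```

For this assumed network the two laws recovered correspond to conservation of total enzyme (E + ES) and of total substrate material (S + ES + P).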
18.3.3 Diffusion approximation
The chemical Langevin equation (CLE) is a diffusion approximation to the true Markov jump process (Gillespie, 2000). The approximation can be derived as a high concentration limit of the Markov jump process, but here we will give a simple derivation based on replacing the increments of the process with continuous increments having the same mean and variance. Start with the time change representation and approximate N_i(t) ≈ t + W_i(t), where W_i(t) is an independent Wiener process (or Brownian motion) for each i. Using a little stochastic calculus (Øksendal, 2003) we find
$$X_t - X_0 = S\, N\!\left( \int_0^t h(X_\tau, c)\, d\tau \right) \simeq S\left( \int_0^t h(X_\tau, c)\, d\tau + W\!\left( \int_0^t h(X_\tau, c)\, d\tau \right) \right)$$
$$\Rightarrow\quad dX_t = S h(X_t, c)\, dt + S\, \mathrm{diag}\!\left\{ \sqrt{h_i(X_t, c)} \right\} dW_t,$$
where W_t is the same dimension as h(·, ·). Now since Var(dX_t) = S diag{h_i(X_t, c)} S^T dt we can rewrite the above stochastic differential equation (SDE) as
$$dX_t = S h(X_t, c)\, dt + \sqrt{ S\, \mathrm{diag}\{ h(X_t, c) \}\, S^T }\; dW_t,$$
where W_t is now the same dimension as X_t. We adopt the usual statistical convention that the square root of a symmetric matrix A is any matrix B satisfying BB^T = A (the Cholesky factor is a common choice). This Ito SDE (Øksendal, 2003) describes a Markov process continuous in time (with continuous but nondifferentiable sample paths), and represents the diffusion process most closely matching the dynamics of the true discrete stochastic kinetic model. Although this approximation is good only when molecular concentrations are relatively high, it is widely used, and can give useful insight into process dynamics even in situations where the accuracy of the approximations can potentially break down. See Gillespie (2000) for a more physically motivated derivation and Ball et al. (2006) for more sophisticated approaches to approximating the Markov jump process. One attractive feature of SDE representations of biochemical network models is that they can often be simulated on a computer much more rapidly than an equivalent discrete stochastic kinetic model. As for ODE models, simulation typically proceeds based on simulation from an approximate time discretisation of the model. To understand the simplest such scheme, first start with the SDE for an essentially arbitrary diffusion process dX_t = μ(X_t)dt + σ(X_t)dW_t. For 'small' time steps Δt, the increments of the process can be expressed in terms of the Euler–Maruyama discretisation (analogous to the Euler discretisation of an ODE)
$$X_{t+\Delta t} - X_t \equiv \Delta X_t = \mu(X_t)\,\Delta t + \sigma(X_t)\,\Delta W_t,$$
where ΔW_t ∼ N(0, I Δt). This approximation becomes arbitrarily accurate as Δt −→ 0, and so this provides an intuitive interpretation of the stochastic process that the SDE represents. A system at time t can therefore be stepped to t + Δt by adding an increment sampled from a N(μ(X_t)Δt, σ(X_t)σ(X_t)^T Δt) distribution. Just as there exist higher order methods for simulating ODE models, there exist similar schemes for SDE models, such as the Milstein method. However, such methods are less widely used for SDEs than higher-order methods for ODEs, due to the complexity of implementation. SDEs therefore tend to be most commonly simulated using the above Euler–Maruyama method in conjunction with a conservative choice of Δt > 0. See Kloeden and Platen (1992) for further details of numerical schemes for SDE models.
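A minimal sketch of the Euler–Maruyama scheme applied to the CLE is given below, reusing the assumed predator–prey-type network from the Gillespie sketch earlier; the form of the CLE with a v-dimensional Brownian motion is used so that no matrix square root is required. All parameter values are illustrative assumptions.

```python
# Euler-Maruyama for the chemical Langevin equation of the assumed network used
# in the Gillespie sketch, written as dX = S h dt + S diag{sqrt(h_i)} dW with a
# Brownian increment per reaction channel.
import numpy as np

rng = np.random.default_rng(1)
S = np.array([[1, -1, 0],
              [0, 1, -1]], dtype=float)
c = np.array([1.0, 0.005, 0.6])          # assumed rate constants

def hazards(x):
    return np.array([c[0] * x[0], c[1] * x[0] * x[1], c[2] * x[1]])

def euler_maruyama(x0, t_max, dt=0.01):
    n_steps = int(t_max / dt)
    x = np.array(x0, dtype=float)
    path = np.empty((n_steps + 1, 2)); path[0] = x
    for n in range(n_steps):
        h = np.maximum(hazards(x), 0.0)      # guard against negative hazards
        dW = np.sqrt(dt) * rng.normal(size=3)
        x = x + S @ h * dt + S @ (np.sqrt(h) * dW)
        path[n + 1] = x
    return path

path = euler_maruyama(x0=[50.0, 100.0], t_max=30.0)
print(path[-1])
```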
18.3.4 Reaction rate equations
The reaction rate equations (RREs) are the continuous deterministic ODEs that have traditionally been used for modelling biochemical network dynamics. Again, these can be formally derived as a high-concentration limit of the Markov process (Kurtz, 1972), but here we will briefly note that they follow directly from the CLE by setting the noise term to zero to get dX_t = S h(X_t, c) dt, more commonly written as
$$\frac{dx_t}{dt} = S h(x_t, c), \quad \text{or} \quad \dot{x}_t = S h(x_t, c).$$
The use of ODE models of this form has been discussed in previous chapters of this volume, and will not be pursued further here.
18.3.5 Modelling extrinsic noise
Whilst we can rely on the theory of stochastic chemical kinetics for modelling of intrinsic noise arising from the discreteness of (bio)chemical molecules, there is no corresponding physical theory for modelling of extrinsic noise in a system, yet this can be just as important. Extrinsic noise is a fairly loosely defined term which essentially represents all noise and heterogeneity in the system not explicitly associated with the intrinsic stochasticity of discrete chemical kinetics. It is typically associated with processes not explicitly modelled, which nevertheless have an impact on the dynamics of the system under study. For example, it is often the case that a chemical reaction will be modelled as mass-action, using a constant rate parameter, despite the fact that the reaction is in fact modulated by a (possibly unknown) enzyme that has not been modelled, and hence cannot be included explicitly in the reaction mechanism. One way to model such effects is to allow some or all of the model's rate parameters (the c_i) to vary stochastically over time. For example, one could model one or more rate parameters using a geometric Ornstein–Uhlenbeck process, which has a nice representation as an SDE (Øksendal, 2003). It could also make sense to correlate these processes to model common sources of variation. These stochastic processes for the rate parameters can then be coupled to a stochastic (or deterministic) model of the system dynamics. The coupling will depend on the type of the biochemical network model. The simplest case is coupling a set of SDEs for rate parameters to the CLE for the biochemical network. This simply extends the dimension of the SDE model, and does not really complicate simulation or analysis further. Coupling SDEs for rate constants to a Markov jump process gives rise to a hybrid discrete-continuous Markov process model which can be simulated using hybrid simulation approaches such as those described in Haseltine and Rawlings (2002). Coupling SDEs for rate parameters to the deterministic RRE model is also possible, and represents one way to introduce time-varying coefficients into ODE models. The coupled model will be a hypoelliptic diffusion (Pokern et al., 2009), having rough sample paths for the time-varying parameters and smooth (once-differentiable) sample paths for the chemical species. See Shahrezaei et al. (2008) for further discussion of introducing extrinsic noise into biochemical network models, including the use of coloured noise.
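The following sketch (assumed parameter values, not from the text) simulates a single rate parameter following a geometric Ornstein–Uhlenbeck process by applying Euler–Maruyama to the SDE for its logarithm; the resulting positive, randomly time-varying rate could then be coupled to any of the dynamic models above.

```python
# Assumed sketch of a geometric Ornstein-Uhlenbeck process for a rate "constant":
# log(c_t) mean-reverts to log(c_bar), so c_t stays positive while fluctuating.
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, c_bar = 0.5, 0.3, 1.0   # assumed reversion rate, volatility, mean level
dt, n_steps = 0.01, 5000

log_c = np.empty(n_steps)
log_c[0] = np.log(c_bar)
for n in range(n_steps - 1):
    drift = -theta * (log_c[n] - np.log(c_bar))
    log_c[n + 1] = log_c[n] + drift * dt + sigma * np.sqrt(dt) * rng.normal()

c_t = np.exp(log_c)                   # the time-varying rate parameter
print(c_t.min(), c_t.max())           # always positive, fluctuating around c_bar
```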
It is worth emphasising that although randomly time-varying rate parameters have probably received the most attention as a mechanism for introducing extrinsic noise into biochemical network models, they are by no means the only possible mechanism. A much simpler possibility for modelling cell populations is to regard each cell as having fixed rate constants, but to allow each cell to have different fixed parameters, with those fixed parameters being drawn from a population distribution. This is very simple for simulation purposes, as one simulates a cell by first picking a set of rate constants and then simulating time course dynamics as normal. It is also natural to combine this approach with mechanisms for dealing with random or uncertain initial state conditions [see chapter 7 of Wilkinson (2006)]. It is difficult to provide general advice on modelling extrinsic noise, as it will inevitably be highly problem dependent, but the lack of specific guidance does not suggest that this source of noise should be ignored, as several studies have suggested that extrinsic noise can often dominate intrinsic noise in real biological systems (Swain et al., 2002).
18.4 Inference for Markov process models
Although this chapter has covered several different stochastic representations of biochemical networks and noisy system dynamics, all of the models discussed have been Markovian, meaning that knowledge of the current state of the system makes the past and future states of the system conditionally independent. This is a very powerful property, which is exploited by all methods for analysing and computationally simulating such systems. It also turns out to be a key property when attention turns to issues of inference, parameter estimation and model assessment and identification. Indeed, given the range of models covered, the Markov property will turn out to be the only theoretical property of the models that we are able to exploit, in general. The models that have been described in this chapter will often contain aspects which are to some extent uncertain. The most obvious uncertain features in many cases will be the reaction rate constants. Although rate constants for many enzymatic reactions have been determined in vitro, many important reaction rates are not at all well understood, and the in vivo effective rates of most reactions are not known with any great certainty. It is therefore highly desirable to make inferences for reaction rates using time course experimental data of some description. Although this sounds very obvious and straightforward, it turns out to be exceptionally challenging, from both experimental and computational viewpoints. The experimental issues associated with obtaining high-resolution time course data on single cells are somewhat beyond the scope of this chapter. Suffice it to say that experiments leading to such data are still far from trivial to carry out, and that even when they are successful, they will typically lead to simultaneous measurements on only one or two proteins. Since these measurements will be noisy, and many models will contain dozens of reacting species, we typically find ourselves in an extremely data-poor scenario from the viewpoint of statistical parameter inference. The computational issues are no less daunting. Estimating parameters of stochastic models using partial data is not a simple matter of least-squares fitting of simulated sample paths to experimental measurements. By their very nature, stochastic models give different results each time that they are simulated, and so there is no reason to expect that any one given realisation will match up with the experimental data, even if all of the parameters are exactly 'correct'. The statistical concept of likelihood provides the best way of understanding how consistent the experimental data are with the assumed model parameters. However, the likelihood of discrete time course data is typically intractable for the kinds of models that have been considered in this chapter, and so computationally intensive methods of inference turn out to be necessary.
18.4.1 Likelihood-based inference
At some level, inference for complex Markov process models is not fundamentally more difficult than for many other high-dimensional nonlinear statistical models. Let us begin by thinking specifically about the discrete stochastic kinetic model considered first in this chapter. Given complete information about the trajectory of
the process over a given fixed time window, the likelihood of the process can be computed exactly. If we observe the process x = {x(t) : t ∈ [0, T]} where x(t) represents the values of X_t for one particular (observed) realisation of the stochastic process, we can determine from the reaction structure the time and type of the n reaction events occurring in the time interval (0, T]. Suppose that the ith reaction event is (t_i, ν_i), i = 1, . . . , n. Also define t_0 = 0, t_{n+1} = T. Let r_j be the total number of type j events occurring (so $n = \sum_{j=1}^{v} r_j$). Then the complete-data likelihood for the observed sample path is
$$L(c; x) \equiv \Pr(x \mid c) = \left\{ \prod_{i=1}^{n} h_{\nu_i}\big(x(t_{i-1}), c_{\nu_i}\big) \right\} \exp\left\{ -\int_0^T h_0(x(t), c)\, dt \right\}.$$
See chapter 10 of Wilkinson (2006) for further details. Note that the integral occurring in the above equation is just a finite sum, so there are no approximations required to evaluate it (though as usual, it is numerically advantageous to actually work with the log of the likelihood). Note however, that there can often be significant computational challenges associated with storing complete sample paths and evaluating associated likelihoods for nontrivial models. There are simplifications which arise for rate laws of the form h_i(x, c_i) = c_i g_i(x) (true for basic mass-action stochastic kinetic models), as then the complete-data likelihood factorises as
$$L(c; x) = \prod_{j=1}^{v} L_j(c_j; x)$$
where
$$L_j(c_j; x) = c_j^{r_j} \exp\left\{ -c_j \int_0^T g_j(x(t))\, dt \right\}, \qquad j = 1, \ldots, v.$$
If there is interest in maximum likelihood estimation, these component likelihoods can be directly maximised by differentiation to get
$$\hat{c}_j = \frac{r_j}{\int_0^T g_j(x(t))\, dt}, \qquad j = 1, \ldots, v.$$
From the viewpoint of Bayesian inference (to be discussed), the component likelihoods are also semi-conjugate to priors of the form c_j ∼ Γ(a_j, b_j) and hence can be combined to get full-conditional posterior distributions of the form
$$c_j \mid x \sim \Gamma\!\left( a_j + r_j,\; b_j + \int_0^T g_j(x(t))\, dt \right).$$
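As a simple illustration of these formulae, the sketch below (using an assumed immigration–death system, ∅ → X and X → ∅, with made-up 'true' rate constants, matching the truncated master-equation example earlier) simulates a completely observed sample path and computes the maximum likelihood estimates from the event counts and the finite sums replacing the integrals.

```python
# Assumed sketch: complete-data maximum likelihood estimation for the
# immigration-death system 0 -> X (rate c1) and X -> 0 (rate c2*x), using
# c_hat_j = r_j / Integral_0^T g_j(x(t)) dt over a simulated sample path.
import numpy as np

rng = np.random.default_rng(7)
c_true = np.array([10.0, 1.0])          # assumed 'true' rate constants
g = lambda x: np.array([1.0, x])        # h_j(x, c_j) = c_j g_j(x)

T, t, x = 100.0, 0.0, 0.0
r = np.zeros(2)                         # event counts r_j
integral = np.zeros(2)                  # accumulates Integral g_j(x(t)) dt
while True:
    h = c_true * g(x)
    h0 = h.sum()
    dt = rng.exponential(1.0 / h0)
    if t + dt > T:
        integral += g(x) * (T - t)      # final partial interval up to T
        break
    integral += g(x) * dt
    j = rng.choice(2, p=h / h0)
    r[j] += 1
    x += 1.0 if j == 0 else -1.0
    t += dt

print("true c:", c_true)
print("MLE  c:", r / integral)          # should be close to the true values
```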
18.4.2 Partial observation and data augmentation
All of the inferential complications arise from the fact that, in practice, we cannot hope to observe the system perfectly over any finite time window. Observations of the system state will typically occur at discrete times, will usually be partial (not all species in the model will be measured), and will often be subject to measurement error. This data-poor scenario leads to a challenging missing-data problem. Consider first the best-case scenario – perfect observation of the system at discrete times. Conditional on discrete-time observations, the Markov process breaks up into a collection of independent bridge processes that are not analytically tractable. This means that discrete time transition probabilities cannot be derived, which in turn means that the likelihood of the model cannot be computed. Such problems become even more severe in more data-poor scenarios, which are typical. In one way or another, almost all methods developed to overcome this problem involve some kind of data augmentation scheme which stochastically imputes key features of the process not
observed, but which if known, would render the problem more tractable. Although it is possible to develop such schemes from the viewpoint of maximum likelihood, it is much more natural and convenient within a Bayesian framework for computational statistical inference. The inferential framework we are working in concerns partially observed Markov process models, sometimes known as POMP models (He et al., 2010). In general we think about a continuous-time process X = {X_s | s ≥ 0} (for now, we suppress dependence on any model parameters). Most of what follows would also apply to discrete time Markov process models, but there are additional simplifications in that case. In order to keep the notation simple, we will consider the case where observations occur at integer times, but the extension to arbitrary observation times is trivial. With this in mind, it is helpful to think about breaking the process up into intervals corresponding to the inter-observation periods, so that for t = 1, 2, . . . we have X_t = {X_s | t − 1 < s ≤ t}. Sample-path likelihoods such as π(x_t | x_{t−1}) can sometimes (but not always) be computed (and are usually computationally problematic even when they can), but discrete time transitions such as π(x_t | x_{t−1}) are typically intractable. Observations on the system are partial, occurring at integer times D = {d_t | t = 1, 2, . . . , T}, where
$$d_t \mid X_t = x_t \sim \pi(d_t \mid x_t), \qquad t = 1, \ldots, T,$$
and we assume that π(dt |xt ) can be evaluated directly, corresponding to some relatively simple measurement error model. Maximum likelihood approaches to POMP model inference are discussed in He et al. (2010), but here we will concentrate on computational Bayesian approaches. The standard technique for attempting parameter inference for such complex partially observed stochastic system would typically involve constructing some kind of Markov chain Monte Carlo (MCMC) algorithm on a state space augmented with at least some aspects of the unobserved Markov process (Gamerman, 1997). Most ‘obvious’ MCMC algorithms will attempt to impute (at least) the skeleton of the Markov process: X0 , X1 , . . . , XT , but this will typically require evaluation of the intractable discrete time transition likelihoods. It is this intractability of the likelihoods that is at the root of the difficultly for developing efficient inferential algorithms for problems of this type. There are two related strategies which can be used which do not involve approximating the likelihoods in some way. The first, and most commonly adopted approach is data augmentation, which attempts to stochastically ‘fill in’ the entire process, typically exploiting the fact that the sample path likelihoods are often tractable. This strategy works in principle, provided that the sample path likelihoods are tractable, but is difficult to automate, and is exceptionally computationally intensive due to the need to store and evaluate likelihoods of continuous sample paths. The second approach, closely related to data augmentation, is the so-called likelihood-free approach (Marjoram et al., 2003), which is sometimes also known as the plug and play approach (He et al., 2010). This method exploits the fact that it is possible to forward simulate from π(xt |xt−1 ) (typically by simulating from π(xt |xt−1 )), even if it cannot be evaluated. Likelihood-free is really just a special kind of augmentation strategy whereby the augmentation scheme is structured in such a way as to ensure that the sample path likelihoods drop out of the Metropolis–Hastings acceptance ratios. 18.4.3
18.4.3 Data augmentation MCMC approaches
For a standard data augmentation approach to the development of MCMC algorithms, the key problem is filling in the inter-observation sample paths, x_t, conditionally on the available data (and model parameters). The problem changes according to how partial the data is. In the best-case scenario of perfect observation of the system state at integer times, one has to fill in the sample paths conditional on the observed system state at the ends of each interval. Considering just one interval, we need to explore the vectors of reaction counts r_t consistent with x_{t+1} − x_t = S r_t (where S is the stoichiometry matrix of the reaction network), in addition to all sample paths consistent with r_t. Both reversible jump and block-updating strategies are possible – see Boys et al. (2008) for details – but these standard MCMC techniques do not scale well to large, complex models with very large numbers of reaction events.
18.4.4 Likelihood-free approaches
One of the problems with the classical data augmentation approaches to inference in realistic data-poor scenarios is the difficulty of developing algorithms to explore a huge (often discrete) state space with a complex likelihood structure that makes conditional simulation difficult. Such problems arise frequently, and in recent years interest has increasingly turned to methods which avoid some of the complexity of the problem by exploiting the fact that we are easily able to forward-simulate realisations of the process of interest. Methods such as likelihood-free MCMC (LF-MCMC) (Marjoram et al., 2003) and Approximate Bayesian Computation (ABC) (Beaumont et al., 2002) are now commonly used to tackle problems which would be extremely difficult to solve otherwise. A likelihood-free approach to this problem can be constructed as follows. Let π(x|c) denote the (complex) likelihood of the simulation model. Let π(D|x, τ) denote the (simple) measurement error model, giving the probability of observing the data D given the output of the stochastic process and some additional parameters, τ. Put θ = (c, τ), and let π(θ) be the prior for the model parameters. Then the joint density can be written π(θ, x, D) = π(θ)π(x|θ)π(D|x, θ). Suppose that interest lies in the posterior distribution π(θ, x|D). A Metropolis–Hastings scheme can be constructed by proposing a joint update for θ and x as follows. Supposing that the current state of the Markov chain is (θ, x), first sample a proposed new value for θ, θ′, by sampling from some (essentially) arbitrary proposal distribution f(θ′|θ). Then, conditional on this newly proposed value, sample a proposed new sample path, x′, by forwards simulation from the model π(x′|θ′). Together the newly proposed pair (θ′, x′) is accepted with probability min{1, A}, where

A = [π(θ′)/π(θ)] × [f(θ|θ′)/f(θ′|θ)] × [π(D|x′, θ′)/π(D|x, θ)].
Crucially, the potentially problematic likelihood term, π(x|θ), does not occur in the acceptance probability, due to the fact that a sample from it was used in the construction of the proposal. Note that choosing an independence proposal of the form f(θ′|θ) = π(θ′) leads to the simpler acceptance ratio

A = π(D|x′, θ′)/π(D|x, θ).
This ‘canonical’ choice of proposal also lends itself to more elaborate schemes, as we will consider shortly. This ‘vanilla’ LF-MCMC scheme should perform reasonably well provided that D is not high-dimensional, and there is sufficient ‘noise’ in the measurement process to make the probability of acceptance non-negligible. However, in practice D is often of sufficiently large dimension that the overall acceptance rate of the scheme is intolerably low. In this case it is natural to try to ‘bridge’ between the prior and the posterior with a sequence of intermediate distributions. There are several ways to do this, but here it is most natural to exploit the Markovian nature of the process and consider the sequence of posterior distributions obtained as each additional time point is observed. For notational simplicity consider equispaced observations at integer times and define the data up to time t as D_t = {d_1, . . . , d_t}. Similarly, define sample paths x_t ≡ {x_s | t − 1 < s ≤ t}, t = 1, 2, . . ., so that x = {x_1, x_2, . . .}. The posterior at time t can then be computed inductively as follows:
1. Assume at time t we have a (large) sample from π(θ, x_t|D_t) (for time 0, initialise with a sample from the prior).
2. Run an MCMC algorithm which constructs a proposal in two stages:
(a) First sample (θ′, x_t′) ∼ π(θ, x_t|D_t) by picking a member of the current sample at random and perturbing θ slightly (sampling from a kernel density estimate of the distribution).
(b) Next sample x_{t+1}′ by forward simulation from π(x_{t+1}|θ′, x_t′).
(c) Accept/reject (θ′, x_{t+1}′) with probability min{1, A} where

A = π(d_{t+1}|x_{t+1}′, θ′) / π(d_{t+1}|x_{t+1}, θ).
3. Output the sample from π(θ, x_{t+1}|D_{t+1}), put t := t + 1, return to step 2.
Consequently, for each observation d_t, an MCMC algorithm is run which takes as input the current posterior distribution prior to observation of d_t and outputs the posterior distribution given all observations up to d_t. As d_t is typically low-dimensional, this strategy usually leads to good acceptance rates. It is worth emphasising the generality of this algorithm. Although we are here thinking mainly of applying it to discrete stochastic kinetic models, it is applicable to any Markov process discretely observed with error, including any of the Markovian models considered in this chapter. It is also trivially adaptable to nonuniform observations, and to observation of multiple independent time courses (the posterior distribution from one time course can be used to form the prior distribution for the next). It is also adaptable to data from multiple (different) models which share many parameters – an important scenario in systems biology, where data are often available on multiple genetic mutants. The sequential likelihood-free algorithm described above can be implemented in a reasonably generic manner. The resulting algorithms are very powerful, but exceptionally computationally intensive. It is therefore natural to want to exploit powerful remote computing resources connected to a local machine via the Internet. CaliBayes (http://www.calibayes.ncl.ac.uk/) is an example of such a remote facility. Simulation models (either deterministic or stochastic) are encoded using the Systems Biology Markup Language (SBML) (Hucka et al., 2003), and these are sent to the remote server together with a large sample from the prior distribution and the experimental data. When the computations are completed, a large sample from the posterior distribution is returned to the user. The CaliBayes system uses a service-oriented architecture (SOA), and makes use of modern web-service technology – further details are provided in Chen et al. (2010). The forward simulation of SBML models is carried out using third-party simulators such as COPASI (Hoops et al., 2006), FERN (Erhard et al., 2008) or BASIS (Kirkwood et al., 2003), and these may be specified by the user. An R package (calibayesR) which provides a user-friendly interface to most of the CaliBayes services is available from R-forge (http://r-forge.r-project.org/). Application of this likelihood-free approach to a problem concerning bacterial decision making is discussed in Wilkinson (2011).
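To make the sequential scheme concrete, the following Python sketch (not part of the original text) shows how one sweep of steps 1–3 might be organised for a single new observation. The forward simulator simulate, the measurement density obs_density and the simple Gaussian perturbation kernel (standing in for the kernel density estimate) are hypothetical placeholders for the model-specific components, and parameter values are assumed to be NumPy arrays.

```python
import numpy as np

def sequential_lf_mcmc_step(particles, d_next, simulate, obs_density,
                            n_iters=5000, kernel_sd=0.05, rng=None):
    """One sweep of the sequential likelihood-free MCMC scheme.

    particles   : list of (theta, x_t) pairs approximating pi(theta, x_t | D_t)
    d_next      : the next observation d_{t+1}
    simulate    : function(theta, x_t) -> x_{t+1}, forward simulation of the
                  Markov process across one inter-observation interval
    obs_density : function(d, x, theta) -> pi(d | x, theta), the measurement
                  error density (assumed cheap to evaluate)
    Returns a list of (theta, x_{t+1}) pairs approximating
    pi(theta, x_{t+1} | D_{t+1}).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(particles)
    # Initialise the chain from a randomly chosen member of the current sample
    theta, x_t = particles[rng.integers(n)]
    x_next = simulate(theta, x_t)
    dens = obs_density(d_next, x_next, theta)
    out = []
    for _ in range(n_iters):
        # Stage (a): pick a particle at random and perturb theta slightly,
        # a crude stand-in for sampling a kernel density estimate
        theta_p, x_t_p = particles[rng.integers(n)]
        theta_p = theta_p + kernel_sd * rng.standard_normal(np.shape(theta_p))
        # Stage (b): forward simulate the next inter-observation interval
        x_next_p = simulate(theta_p, x_t_p)
        dens_p = obs_density(d_next, x_next_p, theta_p)
        # Stage (c): accept with probability min{1, dens_p / dens}; only the
        # measurement error densities appear, since the intractable transition
        # density has cancelled from the ratio
        if rng.uniform() * dens < dens_p:
            theta, x_next, dens = theta_p, x_next_p, dens_p
        out.append((theta, x_next))
    return out
```

Repeating this sweep observation by observation gives the inductive scheme described above.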
18.4.5 Approximate Bayesian computation
There is a close connection between LF-MCMC methods and those of approximate Bayesian computation (ABC). Consider first the case of a perfectly observed system, so that there is no measurement error model. Then there are model parameters θ described by a prior π(θ), and a forwards-simulation model for the data D, defined by π(D|θ). It is clear that a simple algorithm for simulating from the desired posterior π(θ|D) can be obtained as follows. First simulate from the joint distribution π(θ, D) by simulating θ′ ∼ π(θ) and then D′ ∼ π(D|θ′). This gives a sample (θ′, D′) from the joint distribution. A simple rejection algorithm which rejects the proposed pair unless D′ matches the true data D clearly gives a sample from the required posterior distribution. However, in many problems this will lead to an intolerably high rejection rate. The ‘approximation’ is to accept values provided that D′ is ‘sufficiently close’ to D. In the simplest case, this is done by forming a (vector of) summary statistic(s), s(D′) (ideally a sufficient statistic), and accepting provided that |s(D′) − s(D)| < ε for some suitable choice of metric and ε (Beaumont et al., 2002). However, in certain circumstances this ‘tolerance’, ε, can be interpreted as a measurement error model (Wilkinson, 2008), and for problems involving
a large amount of data, ABC may be applied sequentially (Sisson et al., 2007). Sequential ABC approaches have been applied to systems biology problems by Toni et al. (2009). Further, it is well known that ABC approaches can be combined with MCMC to get approximate LF-MCMC schemes (Marjoram et al., 2003).
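As a concrete illustration of the basic rejection form of ABC described above, the short Python sketch below (not from the original text) accepts parameter draws whose simulated summary statistics fall within a tolerance ε of the observed ones; prior_sample, simulate and summary are hypothetical, model-specific placeholders.

```python
import numpy as np

def abc_rejection(prior_sample, simulate, summary, data, eps, n_accept, rng=None):
    """Basic ABC rejection sampler.

    prior_sample : function(rng) -> theta, a draw from the prior pi(theta)
    simulate     : function(theta, rng) -> a simulated dataset D'
    summary      : function(D) -> vector of summary statistics s(D)
    Returns parameter values approximately distributed according to
    pi(theta | |s(D') - s(D)| < eps).
    """
    rng = np.random.default_rng() if rng is None else rng
    s_obs = np.asarray(summary(data))
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_sample(rng)
        d_sim = simulate(theta, rng)
        # Euclidean distance between summaries; other metrics could be used
        if np.linalg.norm(np.asarray(summary(d_sim)) - s_obs) < eps:
            accepted.append(theta)
    return accepted
```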
18.4.6 Particle MCMC
The LF-MCMC scheme presented earlier does not scale well to very large problems, involving very many parameters and/or very many time points. Practically speaking, too many samples must be used in order to obtain good coverage of the posterior distribution. ABC, similarly, is only an approximate method. The recently proposed particle MCMC methods of Andrieu et al. (2010) offer the possibility of developing MCMC schemes which have the exact posterior of interest as the equilibrium distribution of the Markov chain. Particle MCMC algorithms combine conventional MCMC approaches (Gamerman, 1997) with sequential Monte Carlo (SMC) algorithms (Doucet et al., 2001) in order to exploit the best features of both approaches. Furthermore, likelihood-free implementations of some of the suggested algorithms are possible. The particle marginal Metropolis–Hastings (PMMH) algorithm (Andrieu et al., 2010) has a natural likelihood-free implementation which is in many ways analogous to the LF-MCMC method presented earlier, but with better numerical stability properties, obtained at greater computational cost. The PMMH algorithm proceeds by first sampling a proposed new set of model parameters, θ′, from an essentially arbitrary proposal distribution f(θ′|θ), and then running a continuous time bootstrap particle filter (which is a likelihood-free SMC algorithm) for the system state in order to sample a proposed new system state, x′, in addition to computing an approximate marginal likelihood π̃(D|θ′), required for the Metropolis–Hastings acceptance ratio of the MCMC algorithm. The method remains exact despite the use of an approximate marginal likelihood estimate. This is an application of the pseudo-marginal approach to ‘exact approximate’ MCMC first proposed by Beaumont (2003) and examined in detail in Andrieu and Roberts (2009). Since the PMMH method runs a particle filter at each iteration of an MCMC algorithm it is computationally very expensive. However, in the case of very challenging inferential scenarios, this large computational cost may be worth paying, as it may still offer the best possibility for obtaining reliable parameter inferences.
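The following Python sketch (again, not part of the original text) indicates how a likelihood-free PMMH scheme of this kind might be structured. The model-specific functions init_state, simulate, obs_density, log_prior and propose are hypothetical placeholders, the proposal is assumed to be symmetric so that its density cancels from the acceptance ratio, and a simple discrete-time bootstrap filter stands in for the continuous time filter described above.

```python
import numpy as np

def bootstrap_log_marginal(theta, data, init_state, simulate, obs_density,
                           n_particles=200, rng=None):
    # Likelihood-free bootstrap particle filter: forward-simulate particles
    # through each inter-observation interval and weight them by the
    # measurement error density. The product of the average weights is an
    # unbiased estimate of pi(D | theta); its log is returned.
    rng = np.random.default_rng() if rng is None else rng
    particles = [init_state(theta, rng) for _ in range(n_particles)]
    log_ml = 0.0
    for d in data:
        particles = [simulate(theta, x, rng) for x in particles]
        w = np.array([obs_density(d, x, theta) for x in particles])
        if w.sum() == 0.0:
            return -np.inf  # all particles inconsistent with this observation
        log_ml += np.log(w.mean())
        # Multinomial resampling proportional to the weights
        idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
        particles = [particles[i] for i in idx]
    return log_ml

def pmmh(n_iters, theta0, data, init_state, simulate, obs_density,
         log_prior, propose, n_particles=200, rng=None):
    # Particle marginal Metropolis-Hastings: a Metropolis-Hastings chain over
    # theta in which the intractable likelihood is replaced by the particle
    # filter estimate. Provided the estimate is unbiased and is carried over
    # between iterations, the chain still targets the exact posterior
    # (the pseudo-marginal property).
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    ll = bootstrap_log_marginal(theta, data, init_state, simulate,
                                obs_density, n_particles, rng)
    samples = []
    for _ in range(n_iters):
        theta_p = propose(theta, rng)  # assumed symmetric, e.g. a random walk
        ll_p = bootstrap_log_marginal(theta_p, data, init_state, simulate,
                                      obs_density, n_particles, rng)
        if np.log(rng.uniform()) < ll_p - ll + log_prior(theta_p) - log_prior(theta):
            theta, ll = theta_p, ll_p
        samples.append(theta.copy())
    return samples
```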
18.4.7 Iterative filtering
Iterative filtering (Ionides et al., 2006) is a likelihood-free (plug-and-play) particle filtering technique designed to lead to maximum likelihood estimates for model parameters using time course data. It has much in common with the computational Bayesian approaches in that it uses forwards simulation from the model in conjunction with particle filtering in order to avoid the need to explicitly calculate discrete time transition densities for complex Markov processes. Interested readers are referred to Ionides et al. (2006) for further details. It is worth noting that there is an R package on CRAN called pomp (King et al., 2008) which implements iterative filtering for POMP models, in addition to several other likelihood-free methods, including a simple variant of the PMMH algorithm discussed previously.
18.4.8 Stochastic model emulation
All of the likelihood-free methods for inference described above require at least many thousands, and typically many millions, of simulated trajectories from the model of interest. For a simple model that is fast to simulate, this is tolerable. However, for more complex models, this requirement may make the algorithms impractical to run. In this case, either a completely different approach to inference needs to be taken, or a way needs to be found to make simulation from the model of interest much faster. For discrete stochastic chemical kinetic models, there is a great deal of interest in developing fast, approximate, ‘hybrid’ stochastic simulation algorithms which attempt to separate the timescales of different species in the model in such a way as to allow
the treatment of some quantities as continuous and others as discrete, so that the system state can be advanced across many reaction events in a single algorithmic step. For very large and complex biochemical reaction networks, this kind of approach can potentially speed up simulation by at least an order of magnitude at the expense of a negligible loss of accuracy (Salis and Kaznessis, 2005). In some cases this improvement in simulation speed will be sufficient to make running the inference algorithms possible, but in others it will not. In this case, more drastic approximations are required. For stochastic kinetic models, one possibility is to approximate the system with a CLE in order to obtain an SDE model. Inference for SDE models is considered in the next section. Alternatively, it may be preferable to develop a fast stochastic model emulator. This is typically a simple, empirically fitted statistical model for the output of the system state at the next observation time point given the state of the system at the current time point. Such emulators are easiest to develop if the state of the system is not especially high dimensional, and the measured output is low dimensional. The emulator is typically fitted to some offline runs of the simulator on a carefully selected set of inputs. Generating these offline runs can be slow. However, once fitted, the emulator can be sampled at very low computational cost, which makes it ideal for embedding into computationally intensive inference algorithms. There are many ways to approach this problem, but some simple strategies have been explored to good effect in Henderson et al. (2009); Henderson et al. (2010).
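To give a flavour of the idea, the sketch below (not from the original text, and much simpler than the emulation strategies explored in the references just cited) fits a linear-Gaussian emulator of the simulator's one-step map to a set of offline runs and then samples from it; X_now and X_next are hypothetical arrays holding the simulator state at successive observation times, and the state is assumed to be at least two-dimensional.

```python
import numpy as np

def fit_linear_emulator(X_now, X_next):
    # Least-squares fit of the conditional mean of the next state given the
    # current state, with a Gaussian residual model: x_next ~ A x + b + L z.
    design = np.hstack([X_now, np.ones((X_now.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, X_next, rcond=None)
    A, b = coef[:-1].T, coef[-1]
    resid = X_next - design @ coef
    cov = np.atleast_2d(np.cov(resid, rowvar=False))
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(cov.shape[0]))
    return A, b, L

def emulate_step(A, b, L, x, rng):
    # Sampling the emulator is essentially free compared with running the
    # stochastic simulator itself, so it can be embedded in inference loops.
    return A @ x + b + L @ rng.standard_normal(len(x))
```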
18.4.9 Inference for stochastic differential equation models
Stochastic differential equation models (such as the CLE) are one particular type of Markov process. They tend to be relatively fast to simulate (for example, using an Euler–Maruyama method), so it is often possible to carry out inference using one of the likelihood-free methods described earlier. However, SDEs have special structure and a degree of computational tractability which mean that it is often possible to develop more efficient inferential algorithms based on more conventional data augmentation approaches. An SDE conditioned on two end-points is known as a diffusion bridge, and techniques for efficient simulation of diffusion bridges can be used as part of a MCMC scheme to develop data augmentation algorithms which fill in the entire unobserved Markov process. Since the diffusion bridges are analytically intractable, computational methods for simulation are required, such as the modified diffusion bridge method of Durham and Gallant (2002). Efficient MCMC algorithms based on such techniques are described in Golightly and Wilkinson (2006b,a, 2008, 2010). Even greater computational efficiency can be gained if one is prepared to make further approximations, such as the linear noise approximation (Van Kampen, 1992). This approximation renders the stochastic part of the process analytically tractable, and therefore allows the development of significantly more efficient inferential algorithms (Komorowski et al., 2009).
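As a reminder of how cheap forward simulation of an SDE can be, the following sketch (not from the original text) implements the Euler–Maruyama scheme on a fixed time grid; drift and diffusion are placeholders for the model-specific functions, and the state is assumed to be a vector. For the CLE one would take drift(x) = S h(x, c) and diffusion(x) = S diag(√h(x, c)), with S the stoichiometry matrix and h the vector of reaction hazards.

```python
import numpy as np

def euler_maruyama(drift, diffusion, x0, t_grid, rng=None):
    # Simulate dX = drift(X) dt + diffusion(X) dW on the supplied time grid.
    # diffusion(x) should return a matrix b(x) so that the increment has
    # covariance b(x) b(x)^T dt.
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        dt = t1 - t0
        dw = rng.standard_normal(x.shape) * np.sqrt(dt)
        x = x + drift(x) * dt + diffusion(x) @ dw
        path.append(x.copy())
    return np.array(path)
```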
18.5 Conclusions
Stochasticity is ubiquitous in biology. If we are to accurately model biological systems, we must develop stochastic models, probably including both intrinsic and extrinsic sources of noise. Only by doing so can we properly capture heterogeneity in cell populations. Similarly, if we are to fit biological models to data that arise from noisy biological systems, our models must also reflect sources of noise in the data, or we will often find irreconcilable differences between our models and the experimental data which cannot be ‘fixed’ by simply adding a layer of measurement error on top of the model. However, developing stochastic models is more challenging than creating deterministic models. Fortunately, stochastic chemical kinetic theory provides a framework for model building that leads directly to a Markov jump process model from a specification of elementary biochemical reactions. This provides a simple and convenient starting point for beginning to understand intrinsic stochasticity in biochemical networks. Simulation of such processes on a computer is
straightforward, though often computationally intensive. Various approximations and additional sources of noise and uncertainty may be introduced into the models, and this inevitably complicates the development of efficient simulation algorithms and analysis. The fitting of stochastic models to experimental data is both theoretically and computationally demanding. Fortunately, a great deal of progress has been made in the development of computationally intensive Bayesian inference algorithms for Markovian stochastic models in recent years. Likelihood-free approaches are particularly attractive, as they offer a very flexible and general approach to many inferential problems which naturally arise in computational systems biology.
Acknowledgements
This work was funded by the Biotechnology and Biological Sciences Research Council through grants BBF0235451, BBSB16550 and BBC0082001. It was also partially supported by a visit to the Statistical and Applied Mathematical Sciences Institute (SAMSI) programme on the Analysis of Object Data (AOD).
References Allen LJS 2003 Stochastic Processes with Applications to Biology. Pearson Prentice Hall, Upper Saddle River, NJ. Andrieu C and Roberts GO 2009 The pseudo-marginal approach for efficient Monte Carlo computations. Annals of Statistics 37(2), 697–725. Andrieu C, Doucet A and Holenstein R 2010 Particle Markov chain Monte Carlo methods (with discussion). Journal of the Royal Statistical Society: Series B 72(3), 269–342. Ball K, Kurtz TG, Popovic L and Rempala G 2006 Asymptotic analysis of multiscale approximations to reaction networks. Annals of Applied Probability 16(4), 1925–1961. Beaumont MA 2003 Estimation of population growth or decline in genetically monitored populations. Genetics 164, 1139–1160. Beaumont MA, Zhang W and Balding DJ 2002 Approximate Bayesian computation in population genetics. Genetics 162(4), 2025–2035. Boys RJ, Wilkinson DJ and Kirkwood TBL 2008 Bayesian inference for a discretely observed stochastic kinetic model. Statistics and Computing 18(2), 125–135. Chen Y, Lawless C, Gillespie CS, Wu J, Boys RJ and Wilkinson DJ 2010 CaliBayes and BASIS: integrated tools for the calibration, simulation and storage of biological simulation models. Briefings in Bioinformatics 11(3), 278–289. Doucet A, de Freitas N and Gordon N (eds) 2001 Sequential Monte Carlo Methods in Practice. Springer, New York. Durham GB and Gallant RA 2002 Numerical techniques for maximum likelihood estimation of continuous time diffusion processes. Journal of Business and Economic Statistics 20, 279–316. Erhard F, Friedel CC and Zimmer R 2008 FERN – a Java framework for stochastic simulation and evaluation of reaction networks. BMC Bioinformatics 9, 356. Gamerman D 1997 Markov Chain Monte Carlo. Texts in Statistical Science. Chapman and Hall, New York. Gibson MA and Bruck J 2000 Efficient exact stochastic simulation of chemical systems with many species and many channels. Journal of Physical Chemistry A 104(9), 1876–1889. Gillespie DT 1976 A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. Journal of Computational Physics 22, 403–434. Gillespie DT 1977 Exact stochastic simulation of coupled chemical reactions. Journal of Physical Chemistry 81, 2340–2361. Gillespie DT 1992 A rigorous derivation of the chemical master equation. Physica A 188, 404–425. Gillespie DT 2000 The chemical Langevin equation. Journal of Chemical Physics 113(1), 297–306. Golightly A and Wilkinson DJ 2006a Bayesian sequential inference for nonlinear multivariate diffusions. Statistics and Computing 16, 323–338.
Golightly A and Wilkinson DJ 2006b Bayesian sequential inference for stochastic kinetic biochemical network models. Journal of Computational Biology 13(3), 838–851. Golightly A and Wilkinson DJ 2008 Bayesian inference for nonlinear multivariate diffusion models observed with error. Computational Statistics and Data Analysis 52(3), 1674–1693. Golightly A and Wilkinson DJ 2010 Markov chain Monte Carlo algorithms for SDE parameter estimation. In Learning and Inference for Computational Systems Biology (ed. Lawrence ND). MIT Press, Cambridge, MA, pp. 253–275. Golub GH and Van Loan CF 1996 Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore, MD. Haseltine EL and Rawlings JB 2002 Approximate simulation of coupled fast and slow reactions for stochastic chemical kinetics. Journal of Chemical Physics 117(15), 6959–6969. He D, Ionides EL and King AA 2010 Plug-and-play inference for disease dynamics: measles in large and small populations as a case study. Journal of the Royal Society Interface 7(43), 271–283. Henderson DA, Boys RJ, Krishnan KJ, Lawless C and Wilkinson DJ 2009 Bayesian emulation and calibration of a stochastic computer model of mitochondrial DNA deletions in substantia nigra neurons. Journal of the American Statistical Association 104(485), 76–87. Henderson DA, Boys RJ, Proctor CJ and Wilkinson DJ 2010 Linking systems biology models to data: a stochastic kinetic model of p53 oscillations. In Handbook of Applied Bayesian Analysis (eds O’Hagan A and West M). Oxford University Press, Oxford, pp. 155–187. Hoops S, Sahle S, Gauges R, Lee C, Pahle J, Simus N, Singhal M, Xu L, Mendes P and Kummer U 2006 COPASI – a complex pathway simulator. Bioinformatics 22(24), 3067–3074. Hucka M, Finney A, Sauro HM, Bolouri H, Doyle JC, Kitano H, Arkin AP, Bornstein BJ, Bray D, Cornish-Bowden A, Cuellar AA, Dronov S, Gilles ED, Ginkel M, Gor V, Goryanin II, Hedley WJ, Hodgman TC, Hofmeyr JH, Hunter PJ, Juty NS, Kasberger JL, Kremling A, Kummer U, Novere NL, Loew LM, Lucio D, Mendes P, Minch E, Mjolsness ED, Nakayama Y, Nelson MR, Nielsen PF, Sakurada T, Schaff JC, Shapiro BE, Shimizu TS, Spence HD, Stelling J, Takahashi K, Tomita M, Wagner J and Wang J 2003 The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19(4), 524–531. Ionides EL, Breto C and King AA 2006 Inference for nonlinear dynamical systems. Proceedings of the National Academy of Sciences of the United States of America 103, 18438–18443. King AA, Ionides EL and Bret´o CM 2008 pomp: statistical inference for partially observed Markov processes. Available at http://cran.r-project.org/web/packages/pomp. Kirkwood TBL, Boys RJ, Gillespie CS, Proctor CJ, Shanley DP and Wilkinson DJ 2003 Towards an e-biology of ageing: integrating theory and data. Nature Reviews Molecular Cell Biology 4(3), 243–249. Kitano H 2002 Computational systems biology. Nature 420(6912), 206–210. Kloeden PE and Platen E 1992 Numerical Solution of Stochastic Differential Equations. Springer Verlag, New York, NY. Komorowski M, Finkenst¨adt B, Harper CV and Rand DA 2009 Bayesian inference of biochemical kinetic parameters using the linear noise approximation. BMC Bioinformatics 10, 343. Kurtz TG 1972 The relationship between stochastic and deterministic models for chemical reactions. The Journal of Chemical Physics 57(7), 2976–2978. Marjoram P, Molitor J, Plagnol V and Tavare S 2003 Markov chain Monte Carlo without likelihoods. 
Proceedings of the National Academy of Sciences of the United States of America 100(26), 15324–15328. McQuarrie DA 1967 Stochastic approach to chemical kinetics. Journal of Applied Probability 4, 413–478. Øksendal B 2003 Stochastic Differential Equations: an Introduction with Applications, 6th edn. Springer-Verlag, Heidelberg. Pokern Y, Stuart AM and Wiberg P 2009 Parameter estimation for partially observed hypoelliptic diffusions. Journal of the Royal Statistical Society: Series B 71(1), 49–73. Salis H and Kaznessis Y 2005 Accurate hybrid stochastic simulation of a system of coupled chemical or biochemical reactions. Journal of Chemical Physics 122, 054103. Shahrezaei V, Ollivier JF and Swain PS 2008 Colored extrinsic fluctuations and stochastic gene expression. Molecular Systems Biology 4, 196. Sisson SA, Fan Y and Tanaka MM 2007 Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences of the United States of America 104(6), 1760–1765.
Stochastic Dynamical Systems 375 Swain PS, Elowitz MB and Siggia ED 2002 Intrinsic and extrinsic contributions to stochasticity in gene expression. Proceedings of the National Academy of Sciences of the United States of America 99(20), 12795–12800. Toni T, Welch D, Strelkowa N, Ipsen A and Stumpf MPH 2009 Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface 6(31), 187–202. Van Kampen NG 1992 Stochastic Processes in Physics and Chemistry. North-Holland, Amsterdam. Wilkinson DJ 2006 Stochastic Modelling for Systems Biology. Chapman & Hall/CRC Press, Boca Raton, FL. Wilkinson DJ 2009 Stochastic modelling for quantitative description of heterogeneous biological systems. Nature Reviews Genetics 10, 122–133. Wilkinson DJ 2011 Parameter inference for stochastic kinetic models of bacterial gene regulation: a Bayesian approach to systems biology (with discussion). In Bayesian Statistics 9 (ed. Bernardo JM). Oxford Science Publications, Oxford, pp. 679–705. Wilkinson RD 2008 Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Technical report, Sheffield University.
19
Gaussian Process Inference for Differential Equation Models of Transcriptional Regulation
Neil Lawrence¹, Magnus Rattray¹, Antti Honkela² and Michalis Titsias³
¹ Department of Computer Science and Sheffield Institute for Translational Neuroscience, University of Sheffield, UK; ² Helsinki Institute for Information Technology, University of Helsinki, Finland; ³ School of Computer Science, University of Manchester, UK
19.1 Introduction
Many systems biology models consist of interacting sets of differential equations. Given full knowledge of the functions involved in these systems, traditional methods of system identification could be used to parameterize the models. A particularly challenging aspect of biological systems is the scarcity of the data. Some of the model components may be unobservable, and those that are observed are not normally sampled at a high rate, particularly not in the context of high throughput experiments such as gene expression microarray and RNA sequencing. This means that important sources of information are missing. In this chapter we consider a two-pronged approach to dealing with this problem. In particular we look at modeling missing components of the model through a basis function approximation to the functions that cannot be observed. We look at generalized linear models as a potential approach to mapping these missing nonlinear functions. Unfortunately, as the complexity of these models increases, so does the number of parameters. We therefore consider Bayesian approaches to dealing with the increasing number of parameters. As we increase the complexity of the model further, we find that these generalized linear models converge on a particular probabilistic process known as a Gaussian process. We then explore the use of Gaussian processes in biological modeling with a particular focus on transcriptional networks and gene expression data. We will review simple cascade models of translation and transcription that can be used for genome-wide target identification. We then introduce nonlinearities into the differential equations. This means that we need to resort to sampling methods, such as Markov chain Monte
Carlo (MCMC), to perform inference in the systems. We briefly review an efficient approach to sampling and demonstrate its use in modeling transcriptional regulation.
19.1.1 A simple systems biology model
Gene expression is governed by transcription factors. To analyze expression, we need to form a mathematical idealization of this process. The exact mechanism behind transcription is not yet fully understood, but a coarse mathematical representation of the system would assume that the rate of production of a given gene’s mRNA is dependent on the amount of transcription factor in the cell. When collating gene expression from many cells in a tissue we might model the rate of mRNA production through an ordinary differential equation (ODE). Barenco et al. (2006) described just such a model of gene expression,

dm_j(t)/dt = b_j + s_j p(t) − d_j m_j(t),    (19.1)
where the rate of production of the jth gene's mRNA is dependent on a basal rate of transcription, b_j, a sensitivity, s_j, to the governing transcription factor concentration, p(t), and a rate of decay, d_j. In this simple model each of the genes is assumed to be governed by a single input transcription factor and this is known as a single input module motif (Alon 2006). Short of assuming that the mRNA concentration is linearly related to the transcription factor concentration1 this is perhaps the simplest model of transcriptional regulation that can be composed. It has a closed form solution for the mRNA concentration that depends on the transcription factor (TF) concentration. If we assume that the system starts at t = 0 then we have

m_j(t) = a_j e^{−d_j t} + b_j/d_j + s_j e^{−d_j t} ∫_0^t e^{d_j u} p(u) du,    (19.2)

where the initial value for the mRNA concentration is given by m_j(0) = b_j/d_j + a_j. The simple model described above can be extended through nonlinear response, active degradation of the mRNA, stochastic effects, and so on. However, it clearly illustrates the main issue we wish to address in this chapter. Namely, how do we deal with the fact that the transcription factor concentration, p(t), may be unobservable? Our preliminary studies in this area were inspired by the work of Barenco et al. (2006) who (ignoring the transient term) reordered the equation so that for each observation time, t_1, . . . , t_n, they had

p(t_k) = s_j^{−1} ( dm_j(t_k)/dt − b_j + d_j m_j(t_k) ).
They then created pseudo-observations of the production rates of the mRNA, dm_j(t_i)/dt, through fitting polynomials to their time series and computing gradients at each time point. This gave them an estimate for the TF concentration and they used Bayesian sampling to estimate the model parameters. Khanin et al. (2006) considered a similar equation with a nonlinear repression response. They and Rogers et al. (2007), who also considered single input motifs, dealt with the missing function by fitting a piecewise constant approximation to it. In this chapter our focus will be on an alternative approach: modeling the missing function with a generalized linear model. 1 This would be equivalent to assuming very high decay in the simple model we describe. Models like this are widely used. One can see clustering to find coregulated targets as making this assumption. The assumption is made more explicitly in some genome-wide analysis models (Sanguinetti et al. 2006a,b).
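As a simple illustration of equation (19.2), the sketch below (not part of the original text, and assuming SciPy is available) evaluates m_j(t) for an arbitrary TF profile by numerical quadrature; the pulse-shaped p(t) and the parameter values are made up purely for illustration.

```python
import numpy as np
from scipy.integrate import quad

def mrna_concentration(t, p, a_j, b_j, s_j, d_j):
    # Equation (19.2): m_j(t) = a_j e^{-d_j t} + b_j / d_j
    #                         + s_j e^{-d_j t} * integral_0^t e^{d_j u} p(u) du
    integral, _ = quad(lambda u: np.exp(d_j * u) * p(u), 0.0, t)
    return a_j * np.exp(-d_j * t) + b_j / d_j + s_j * np.exp(-d_j * t) * integral

# Example: a bell-shaped pulse of transcription factor activity
p = lambda u: np.exp(-(u - 4.0) ** 2)
print(mrna_concentration(6.0, p, a_j=0.0, b_j=0.1, s_j=1.0, d_j=0.5))
```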
Figure 19.1 Some functions based on a simple basis set with three members. Location parameters of the basis functions are set to τ_k = 2, 4, 6 and the timescale ℓ_k of the bases is set to 1. Functions derived from these bases are shown for different weights. Each weight was sampled from a standard normal density
19.2 Generalized linear model
For our purposes a generalized linear model can be seen as constructing a function through a weighted sum of nonlinearities. Formally, we assume that a function, f (t) can be represented as
f(t) = Σ_{k=1}^{M} w_k φ_k(t),
where the basis function has a nonlinear form. One possibility is a basis derived from a Gaussian form,
φ_k(t) = (1/√(πℓ_k²)) exp( −(t − τ_k)²/ℓ_k² ).    (19.3)
Here τ_k represents a location parameter which gives the center of the basis function and ℓ_k represents a timescale parameter which gives the width of the basis function (or the timescale over which it is active). As the distance between t and τ_k increases the basis function approaches zero. This is therefore sometimes described as a ‘local’ basis function. In Figure 19.1 we show a set of bases and some nonlinear functions that can be derived from them using different weights. Our aim is to introduce this representation of the function into our model of the system. We assume that the transcription factor concentration can be represented through basis functions so we have,
p(t) = Σ_{k=1}^{M} w_k φ_k(t).
Taking the simple linear differential equation model given in (19.1) we can substitute in our representation of the transcription factor concentration and recover

dm_j(t)/dt = b_j + s_j Σ_{k=1}^{M} w_k φ_k(t) − d_j m_j(t).
Substituting this into the solution for mj (t) we have
m_j(t) = a_j e^{−d_j t} + b_j/d_j + s_j Σ_{k=1}^{M} w_k e^{−d_j t} ∫_0^t e^{d_j u} φ_k(u) du,    (19.4)
where we have been able to pull the weighted sum over the basis functions outside the integral. Our solution for the mRNA concentration is now also expressed as a generalized linear model. Now the basis set consists of a transient term, a_j e^{−d_j t}, a constant term, b_j/d_j, and a weighted sum of convolutions of our original basis. For
some choices for φ_k(t) the integral in (19.4) will be tractable. In particular, if we use the Gaussian form shown in (19.3) we can show that

e^{−d_j t} ∫_0^t e^{d_j u} φ_k(u) du = (1/2) e^{−d_j(t − τ_k)} e^{d_j² ℓ_k²/4} [ erf( (t − τ_k − d_j ℓ_k²/2)/ℓ_k ) − erf( −(τ_k + d_j ℓ_k²/2)/ℓ_k ) ],

where erf(·) is the error function defined as

erf(x) = ∫_0^x (2/√π) exp(−z²) dz.
For fixed parameters, aj , sj , dj , bj there is a deterministic relationship between the output mRNA concentration, mj (t) and a governing TF concentration, p(t). In Figure 19.2 we show functions that correspond to high, medium and low decay rates that result from solving the differential equation for the different TF concentrations 3 3 shown in Figure 19.1. By setting aj , bj j=1 to zero and sj j=1 to one we ensure that the differences between all the solutions are arising only from the different responses to the underlying basis from Figure 19.1. The solution for each individual basis function is shown in Figure 19.2(a). The solution for each different set of weights from Figure 19.1 is then simply the weighted sum of the relevant weights multiplied by the convolved basis functions. 19.2.1
Fitting basis function models
M Given a basis set, {φk (t)}M k=1 , a set of weights, {wk }k=1 , and the parameters of the differential p equation, p {bj , dj , sj }j=1 , we can solve the differential equations for the mRNA concentrations, mj (t) j=1 . We can determine all these parameters of the model (including the parameters of the basis functions) by fitting through maximum likelihood. If we assume Gaussian noise of variance σ 2 we can compute the log likelihood of an observed dataset of mRNA concentrations, perhaps obtained from a gene expression microarray,
log p(y|w) = −
p n 1 p (mj (ti ) − yi,j )2 , log 2πσ 2 − 2 2 2σ j=1 i=1
where the gene expression data has been collected at times {ti } and has values given by yi,j for the jth gene. We have used the notation mj (ti ) to denote the value of the predicted mRNA concentration for the jth gene at the ith time point. Bearing in mind that this is dependent on the parameters of the jth differential equation and the basis functions, we can maximize this log likelihood with respect to all these parameters and the noise variance. Here, in our representation of the likelihood p(y|w), we have only made explicit the conditioning on the vector of weights, w = [w1 , . . . , wM ] , as we will mainly focus on maximization with respect to this vector. With such a probabilistic formulation we can also exploit the PUMA (Milo et al. 2003; Liu et al. 2005) framework for propagating uncertainty through microarray analysis and associate an individual variance σi,j with each gene expression observation, yi,j . Working with the convolved basis set we have t φj,k (t) = sj e−dj t edj u φk (u)du. 0
Gaussian Process Inference
Figure 19.2 The bases in Figure 19.1 have been pushed through the differential equation to give the convolved basis shown in (a). We took aj = 0, bj = 0 and sj = 1 for j = 1, 2, 3. Decays were then set to be low, d1 = 0.01, medium, d2 = 0.1, and high, d3 = 1. In (b)–(d) we see mRNA concentrations derived from these bases with different weights, wk . Weights match those shown in Figure 19.1. In each row the left-hand plot is the result of convolving with a low decay rate (d1 = 0.01), the middle plot a medium decay rate (d2 = 0.1) and the right-hand plot a higher decay rate (d3 = 1)
381
382
Handbook of Statistical Systems Biology
We can then write a particular mRNA concentration as mj (t) = ai e−dj t +
bj + w φj (t) dj
where w is a vector of the values {wk }M k=1 and φ (t) is a vector valued function representing the values of the M (t) basis functions φj,k . Ignoring the basal rate and ai for the moment, we can write the log likelihood as k=1
p(y|w) = −
p log 2πσ 2 2 ⎛
⎞ p p n n 1 ⎝ 2 ⎠ − 2 −w φj (ti )φj (ti ) w + 2 w φj (ti )yi,j + yi,j 2σ j=1 i=1
j=1 i=1
which can be maximized with respect to w by finding a stationary point of the log likelihood, ⎡ ⎤−1 p p n n w=⎣ φj (ti )φj (ti ) ⎦ φj (ti )yi,j . j=1 i=1
j=1 i=1
If we require a lot of flexibility for our model of the TF concentration, p(t), we can use a large number of basis functions, M. However, as we increase the number of basis functions we may find that we are no the maximum likelihood solution for w. If M > np then the matrix given by n able to compute plonger will not be invertible. A solution to this problem is to handle w through Bayesian φ (t )φ (t ) i i j i=1 j j=1 inference [see Lawrence and Rattray (2010) for a brief introduction]. In Bayesian inference, parameters are treated with a prior distribution, in this case p(w), and rather than being maximized they are integrated out: p(y) =
p(y|w)p(w)dw.
The quantity p(y) is called the marginal likelihood, integrated likelihood or evidence of the model. It is important for model comparison (MacKay 2003). If we place a zero mean Gaussian prior density over w,
1 exp − w C−1 w p(w) = √ 1 2 2π |C| 2 1
we can compute the marginal likelihood of the data analytically and we find that it is given by −1 1 2 p(y) = √ y , 1 exp − 2 y K + σ I 2π K + σ 2 I 2 1
where y is a ‘stacked’ version of the data matrix, i.e. it is composed of the columns from y stacked on top of one another. The matrix K gives the covariance between the different mRNA concentrations at different times. It is structured as a block matrix with p × p blocks, each block of size n × n. The diagonal blocks give covariances of a single gene output, while off-diagonal blocks give cross covariances between different outputs. In general we can define each element of K to be ki,j (t, t ) = φi (t) Cφj (t ). This denotes the covariance between the i and jth mRNA concentrations computed between times t and t .
Gaussian Process Inference
19.2.1.1
383
Relationship with basis of p(t)
The model as we described is dependent on our choice of basis function for p(t). An interesting feature is that we can represent all the elements of the marginal likelihood’s covariance matrix through the inner product of the basis, k0,0 (t, t ) = φ(t) Cφ(t ): t t ki,j (t, t ) = sj si e−di t−dj t edi u edj u φ(u) Cφ(u ) du du. (19.5) 0 0 k0,0 (u,u )
In fact this equation holds in general: as long as k0,0 (t, t ) is a positive definite function, it represents an inner product of a basis set. If we consider a spherical prior for w, so C = γI, then we can write k0,0 (t, t ) = γφ(t) φ(t ). Now, instead of maximizing over all different weight parameters, w, we only need to find γ. Through marginalization we have reduced the number of parameters in the system by M − 1. A problem remains though, each basis function has a center, τk , and these locations represent an additional M parameters that also need to be determined. We now show how these parameters can also be eliminated. We will consider a limit that allows us to effectively take the number of basis functions, M → ∞. 19.2.2
An infinite basis
Now we have eliminated the parameters w we are left to decide the number of basis functions, M, and the location of these basis functions, {τk }M k=1 . First we will explore what happens when we place those basis functions at uniform intervals over time so we have M (t − a − τ · k)2 1 wk √ exp − , p(t) = 2 π2 k=1
where we have set the location parameter of each φk (t) to τk = a + τ · k. We showed that the marginal likelihood of the model is entirely dependent on the inner product between basis vectors at different times. For the basis at uniform intervals this can be written as ! " M ! " t 2 + t 2 − 2 (a + τ · k) t + t + 2 (a + τ · k)2 γ exp − . k0,0 t, t = 2 π 2 k=1
More basis functions allow greater flexibility in our model of the TF concentration. We can increase the number of basis functions we use in the interval between a and b by decreasing the interval τ. However, to do this without increasing the expected variance of the resulting TF concentration we need to scale down the variance of the prior distribution for w. This can be done by setting the variance parameter to be proportional to the interval, so we take γ = ατ. Now we have basis functions where the location of the leftmost basis function, k = 1, is τ1 = a + τ and the rightmost basis is τM = b so that b = a + τ · M. The fixed interval distance between a and b is therefore given by b − a = (M − 1)τ. We are going to increase the number of basis functions by taking the limit as τ → 0. This will take us from a discrete system to a continuous system. In this limit the number of basis functions becomes infinite because we have M = limτ→0 b−a τ + 1. In other words we are moving from a fixed number of basis functions to infinite basis functions. The inner product between the basis functions becomes !
k0,0 t, t
"
α = 2 π
b a
! " t 2 + t 2 − 2τ t + t + 2τ 2 exp − dτ. 2
384
Handbook of Statistical Systems Biology
Completing the square gives ! " α k0,0 t, t = 2 π
⎛
b a
2 + t 2 + 2 τ − t ⎜ exp ⎝−
1 2
!
t + t
"2
2
−
1 2
!
t + t
"2 ⎞ ⎟ ⎠ dτ
# b ! " t + t 2 α 2 (t − t )2 2 k0,0 t, t = √ exp − exp − 2 τ − dτ, 22 π2 a 2 2π2 # %b $ ! " 2 (t − t )2 1 t + t α exp − , 1 + erf τ− k0,0 t, t = √ 22 2 2 2 2π2 a Performing the integration leads to !
k0,0 t, t
"
! "2 & # t − t t + t 2 1 b− =√ exp − erf 22 2 2 2 2π2 ' # t + t 2 , a − − erf 2 2 α
Finally, if we take the limit as a → −∞ and b → ∞ (i.e. we have infinite basis functions distributed across the entire real line) then the square bracketed term on the right becomes 2 and we have !
k0,0 t, t
"
! "2 t − t = √ exp − , 22 2π2 α
(19.6)
which is known as the squared exponential covariance function (despite the fact that it is not a squared exponential, but an exponentiated quadratic). The analysis above shows that if we take a one-dimensional fixed basis function model, we can increase the number of basis functions to infinity and distribute them evenly across the real line. In this way we lose the requirement to specify M location parameters and further reduce the number of parameters in the system. Without the Bayesian approach we had 2(M + 1) + 4p parameters. The combination of the Bayesian approach and the use of infinite basis functions leads to 4p + 3 parameters without any loss of model flexibility. In fact the resulting model is more flexible as it is allowing for infinite basis functions. Instead of specifying the basis function directly, we now specify the covariance of the marginal through a positive definite function k0,0 (t, t ). The procedure for moving from inner products, φ(t) φ(t ), to covariance functions, k0,0 (t, t ), is sometimes known as kernelization (Sch¨olkopf and Smola 2001) due to the fact that the covariance function has the properties of a Mercer kernel. A function of two variables can be seen as a Mercer kernel when a symmetric matrix of values from k0,0 (t, t ) computed for a vector of times t is always positive semi-definite. In other words, if ki,j = k0,0 (ti , tj ) is the element from the ith row and jth column of K0,0 and ti is the ith element from t, we should have that K0,0 is a positive definite matrix for any vector of inputs t. This same property is what allows it to be used as a covariance function: covariances must be positive definite. The matrix K0,0 specifies the covariance between instantiations of the function p(t) at the times given by t. Mercer’s theorem says that underlying all such positive definite functions there is always a (possibly infinite) feature space, φ(t), which can be used to construct the covariance function. For our example the relationship between the feature space and this covariance function emerges naturally through considering a Bayesian approach to a fixed basis function model. The resulting model is known as a Gaussian process (O’Hagan 1978). This perspective of converting from a parametric model to a covariance function is one way of introducing Gaussian processes
Gaussian Process Inference
385
[see also Williams (1997) and Rasmussen and Williams (2006)]. For an alternative introduction which ignores the parametric interpretation see Lawrence et al. (2010). 19.2.3
Gaussian processes
Using the marginal likelihood of the model we described instead of the model parameterized by w can be understood as taking a Gaussian process perspective on the model for p(t). Gaussian processes are powerful models for nonlinear functions. They allow us to specify the characteristics of a function through specifying its covariance function. Given a covariance function for the Gaussian process over p, the covariances between the target genes can also be computed from (19.5). Substituting k0,0 (t, t ) = φ(t) Cφ(t ) we have t t −di t−dj t di u ki,j (t, t ) = sj si e e edj u k0,0 (u, u )du du. 0
0
We can also compute the cross covariance between p(t) and the ith output gene, t k0,i (t, t ) = si e−di t edi u k0,0 (u, t )du. 0
This arises in the joint distribution of p(t) and ⎡
p {mi (t)}i=1
which is also Gaussian with covariance ⎤ K0,1 . . . K0,p K1,1 . . . K1,p ⎥ ⎥ ⎥ .. . . .. ⎥ . . . ⎦
K0,0 ⎢K ⎢ 1,0 K=⎢ ⎢ .. ⎣ . Kp,0 Kp,1 . . . Kp,p
where the matrix Ki,j is computed using ki,j (t, t ) for the relevant observation times for the ith and jth function. In a parametric model, given observations of the mRNA concentrations, we would compute the posterior distribution for w and use it to compute expectations of p(t) and other gene concentrations. In the Gaussian process setting it may be that there are infinitely many parameters in w which makes it difficult to express distributions over this space. Instead, we simply condition on the observations in the probabilistic model. Our data are taken to be composed of noise corrupted observations of the mRNA concentrations (for example gene expression measurements). yj (ti ) = mj (ti ) + i,j where !is a corruptive noise term which would be drawn independently, perhaps from a Gaussian "
i,j ∼ N 0, σi2 I for each observation. We augment this matrix with any potential observations of the TF concentration, y0 (ti ) = p(ti ) + i,0 . An observation of the TF may be in the form of a constraint that the TF concentration is known to be zero at time zero: p(0) = 0, σ02 = 0. Or in some cases it could be a direct measurement. This gives us the full dataset which we place in a stacked vector storing all the values, ) ( y = y0 , y1 , . . . , yp . Note that the vectors of observations, y0 , y1 ... need not be the same length or contain observations taken at the same times.
386
Handbook of Statistical Systems Biology
The joint distribution for the corrupted observations of the mRNA concentrations and protein is given by a Gaussian process with covariance K, y ∼ N (μ, K). The mean vector, μ, here is obtained by computing the mean function, μj (t) = aj e−dj t + djj for the relevant times t. The mean vector can be removed by placing a zero mean Gaussian prior over aj and bj . In this case we obtain a Gaussian process with a zero mean function and a modified covariance function, ! " y ∼ N 0, K b
(t, t ) = k (t, t ) but for i > 0, j > 0 we have where k0,0 0,0 α bj −δj (t+t ) + ki,j (t, t ) = ki,j (t, t ) + δi,j αaj e , dj
where δi,j is the Kronecker delta function2 and αaj and αbj are the variances of the priors over the aj and bj parameters, respectively. These may be shared across all genes: αaj = αa and αbj = αb . Similarly we can use a shared noise variance σj2 = σ 2 . This reduces the number of parameters sought to 2p + 5. The parameters of the model can be found by minimizing the negative log likelihood, 1 (19.7) E (θ) = − log p(y|θ) 2 1 1 = log |K| + y K−1 y + const. (19.8) 2 2 with respect to the parameters θ = sj , dj , α, αa , αb , , σ 2 . This requires us to compute the gradient of the covariance matrix with respect to each parameter, dK dθi and combine it with the gradient of the negative log likelihood, 1 1 dE (θ) = − K−1 + K−1 yy K−1 . dK 2 2 The resulting gradients of the negative log likelihood can be used in a gradient based minimizer to find a local minimum. We often make use of scaled conjugate gradients (Møller 1993), but other optimizers such as quasi-Newton approaches (Zhu et al. 1997) or conjugate gradients can also be used. If there are few points in the time series the parameters may be badly determined. As a diagnostic the curvature of the log likelihood can be computed or alternatively we can place appropriate prior distributions over the parameters and the gradients of the corresponding joint distribution can be used in a hybrid Monte Carlo, also known as a Hamiltonian Monte Carlo, algorithm [see e.g. MacKay (2003) for details on hybrid Monte Carlo] to obtain samples from the posterior distribution for θ. The framework we have described allows us to determine, from a set of known targets, the parameters of a set of linear ODEs that best govern those targets, assuming a single regulator. By retaining linear differential equations and specifying the TF concentration through a generalized linear model we can marginalize many of the model parameters. The use of Gaussian processes and prior distributions for the basal rate and aj further reduce the number of parameters in the model to 2p + 5. In Section 19.3 we describe an approach to ranking candidate targets of a given transcription factor based on this approach. However, we may be interested in more general modeling formulations. For example, our current model: a generalized linear model with Gaussian 2
The Kronecker delta is defined as one if i = j and zero otherwise.
Gaussian Process Inference
387
prior on the weights, gives us a prior distribution for the TF concentration that can become negative. However, even the simple fix of modeling the TF as a Gaussian process in log space, ! " log p(t) ∼ N μ0 , K0,0 , renders the elegant solution given above intractable. In the generalized linear model formulation the weights, w, now appear inside the exponent, M p(t) = e k=1 wk φk (t) , denying us the ability to bring the weights outside the integral when convolving. Even such a simple modification forces us to seek approximations to deal with this intractability. In this review we focus on the sampling approximations we have considered (Titsias et al. 2009, 2011), although we have also explored Laplace’s approximation3 (see Lawrence et al. 2007, 2010; Gao et al. 2008) and variational approximations could also be applicable. 19.2.4
Sampling approximations
Regardless of how we specify the prior distribution for the TF concentration, for a given sample from this function, p(i) (t) we should be able to solve the system of differential equations. For linear differential equations this requires solving the convolution integral, and for nonlinear differential equations we need to apply numerical methods such as Runge–Kutta. Irrespective of the manner in which we find the solution, the sample (i)
p(i) (t) is associated with a set of solutions mj (t)
p
j=1
for a given parameterization of the system θ (i) .
Whilst in practice we cannot sample the full function p(i) (t) we will assume that we can obtain points, p, from the function that are so closely spaced relative to the timescale of p(i) (t) that we are not losing any information. However, because the prior over p(t) is smooth then elements of the vector p that are close neighbors in time will be very strongly correlated. This can present a serious obstacle to efficient sampling in these systems. Under MCMC, for such a sample to be accepted it must respect the smoothness constraints imposed by the prior. However, random draws from the prior are highly unlikely to be accepted as they will result in values for mj (t) that do not match the data. The key challenge is to develop a sampling approach that respects the constraints imposed by the prior and can rapidly explore the space of values of p(t) that are plausible under the data y. With this task in mind a control point strategy for sampling from Gaussian processes was developed (Titsias et al. 2009, 2011).
19.3
Model based target ranking
We used Gaussian process inference over the linear activation model in (19.1) to identify targets of two TFs, Mef2 and Twist, regulating mesoderm and muscle development in Drosophila (Honkela et al. 2010). These TFs are thought to be primarily regulated by differential expression at the mRNA level and therefore their measured mRNA expression levels are highly informative about the protein concentration in the nucleus. We therefore include a model of translation from the TF mRNA concentration f(t) to the TF protein concentration p(t):

dp(t)/dt = f(t) − δ p(t),    (19.9)

3 Laplace's approximation involves second-order Taylor expansion around the mode of the posterior distribution.
where δ is the decay rate of the TF protein. The differential equation can be solved to give

p(t) = e^{−δt} ∫_0^t f(v) e^{δv} dv,
and we see that the TF protein concentration p(t) is a linear function of the TF expression f (t). There is also a linear relationship between the TF protein and the regulated target gene expression levels mj (t) under the linear activation model [recall (19.2)] and therefore mj (t) is also a linear function of both p(t) and f (t). We place a Gaussian process prior on f (t) and the linear models of translation and activation define a joint Gaussian process over mj (t) and p(t) as well as f (t). If we use the squared exponential covariance in (19.6) for f (t) then all the terms in the covariance function for the multivariate Gaussian process {f, p, mj } can be calculated analytically (Honkela et al. 2010). The parameters of the TF mRNA covariance and the differential equation models parameterize the covariance function: θ = [δ, α, , {bj , sj , dj }]. These are estimated by maximizing the likelihood by gradient-based optimization [recall (19.8)]. In this example we assume that aj = 0 for all genes. After estimating the model parameters we use the likelihood as a score to rank genes according to their fit to the model. Before the final ranking weakly expressed genes need to be filtered because they often attain high likelihoods from any TF with an uninformative model. This can be accomplished using average z-scores computed using the variance information from PUMA preprocessing. In Figure 19.3 we show examples of the model fit to the data for putative targets of the TF Twist. We fitted two different classes of models. Figure 19.3(a) shows examples where we fit independent models to three different genes. Since the models are fitted independently for each gene there is no reason for the models to infer a consistent TF protein concentration profile. We refer to this as the single-target model approach. While somewhat unrealistic, this approach is very attractive due to its computational advantage: fitting of independent models is thus trivially parallelizable. In Figure 19.3(b) we fit one model to the same three genes by sharing the TF protein profile across target genes. This conforms more with our belief that the TF protein profile should be consistent across targets. The model is therefore less flexible as can be observed in the example where gene FBgn0003486 cannot be fitted by the multiple-target model. The multiple-target model relies on the assumption that the whole set of targets considered is genuine. We used the five top-ranked targets identified by the single-target models as a ‘training set’ in the multiple-target approach, adding in each putative target one-by-one to the training set and using the likelihood of the resulting six-target model for ranking. In practice one might also use additional biological prior knowledge to choose a small set of confident targets. Figure 19.4 shows evaluation of the model-based ranking results using data from a genome-wide Chromatinimmunoprecipitation (ChIP-chip) experiment (Zinzen et al. 2009) to determine whether there is evidence of TF binding within 2000 base pairs of a putative target. As well as showing results for the single-target and multiple-target Gaussian process methods, we also show the performance obtained using evidence of differential expression in mutant embryos (knock-outs), correlation with TF expression (correlation) and an alternative maximum likelihood model-based approach [quadrature, see Honkela et al.(2010) for details]. In the global validation we score all genes showing significant variability in the expression data. 
In the focused evaluation we only consider genes with annotated expression in mesoderm or muscle tissue according to the in situ hybridization data (Tomancak et al. 2002). We find that the model-based approach provides a significant advantage over the simpler correlation-based approach for Twist, whose targets display a diverse range of temporal profiles (Figure 19.3). For Mef2 the correlation-based approach works well. We observe that most of the predicted Mef2 targets have profiles very similar to Mef2 itself, in which case the model-based approach would not be expected to provide an advantage. The single-target Gaussian process approach is found to outperform other methods in most cases (Honkela et al. 2010).
twi mRNA (input)
twi mRNA (input)
twi mRNA (input)
twi mRNA (input)
Inferred twi protein
Inferred twi protein
Inferred twi protein
Inferred twi protein
FBgn0003486 mRNA FBgn0033188 mRNA FBgn0035257 mRNA
2 4 6 8 10 12 Time (h)
2 4 6 8 10 12 Time (h) (a)
2 4 6 8 10 12 Time (h)
FBgn0003486 mRNA FBgn0033188 mRNA FBgn0035257 mRNA
2 4 6 8 10 12 Time (h)
2 4 6 8 10 12 Time (h)
2 4 6 8 10 12 Time (h)
(b)
Gaussian Process Inference
Figure 19.3 Examples of the model fit for two different classes of Gaussian process model fitted to potential targets of the TF Twist (from Honkela et al. 2010). (a) Three independent single-target models for likely targets. Red marks denote observed expression levels from Tomancak et al. (2002) with 2 SD error bars. The inferred posterior means of the functions are shown in blue and the shaded regions denote 2 SD posterior confidence intervals. (b) A joint multiple-target model for the same set of target genes as in (a). Note that the multiple-target models used in evaluation have more targets than the one shown here
389
390
80
60
* ***** * *** *** ********* ****** *** ******
80
60
20
20
20
0
0
0
0
0
20
20
0 20 100 250 Top N to consider
**
40
* ***
40
*
*** *** ***
******
40
60
**
60
Single−target GP Multiple−target GP Single−target quadrature Knock−outs Correlation Filtered Random
80
* *
60
******
60
* ****
80
**
80
20
20 100 250 Top N to consider
Focused knock−outs: twi Global knock−outs: mef2 Focused knock−outs: mef2 100 100 100
80
40
40
20 100 250 Top N to consider
*** ****** *** *** *** *** ***
Global knock−outs: twi 100
20 100 250 Top N to consider
*** ***
***
20
** * **** *** *** *** ***** *** *** *** *** *** ***
Focused ChIP: mef2 100
40
20 100 250 Top N to consider
Relative enrichment (%)
Global ChIP: mef2 100
40
**
40
60
****** *****
60
******* ** *** *** ** ***
80
*** *** ***
80
* ** *** * *** *** ***
Relative enrichment (%)
Focused ChIP: twi 100
20
0 20 100 250 Top N to consider
0 20 100 250 Top N to consider
20 100 250 Top N to consider
Figure 19.4 Evaluation results of different rankings [from Honkela et al. (2010)] as discussed in the text showing the relative frequency of positive predictions among N top-ranking targets (‘global’ evaluations) or among N top genes with annotated expression in mesoderm or muscle tissue (‘focused’ evaluations). The dashed line denotes the frequency in the full population and the dash-dot line within the population considered in focused evaluation. The first row shows the frequency of targets with ChIP-chip binding within 2000 base pairs of the gene while the second row shows the frequency of predicted targets with significant differential expression in TF knockouts. p-values of results significantly different from random are denoted by: *** p < 0.001; ** p < 0.01; * p < 0.05. Comparison with knockout ranking is obviously omitted for knockout validation
Handbook of Statistical Systems Biology
Global ChIP: twi 100
Gaussian Process Inference
19.4
391
Multiple transcription factors
The models discussed in previous sections are based on the idealized assumption that the gene regulation process is driven by a single trascription factor. However, gene regulation in complex biological systems can involve several TFs. More precisely, to initiate trascription of a gene, multiple TFs may be required to simultaneously bind the related DNA sequence. Thus, in order to take into account the effect of multiple TFs, we need to generalize the basic single-TF model described previously. Furthermore, to deal with the nonnegativity of the mRNA and protein TF functions, along with saturation effects that are naturally encountered in their dynamics, it is essential to assume biologically plausible nonlinearities. The basic single-TF ODE model from Equation (19.1) can be generalized to include multiple TFs according to ! " dmj (t) = bj + sj G p1 (t), . . . , pI (t); wj , wj0 − dj mj (t), dt
(19.10)
where pi (t), i = 1, . . . , I, are the TF protein functions. The function G(·) represents a nonlinear response that allows the TFs to competitively or co-operatively activate or repress the transcription. A reasonable assumption for this nonlinear response is to follow a sigmoidal form, similarly to the Michaelis–Menten and hill-climbing functions used in single input motifs (Alon 2006). For instance, we can choose G(p1 (t), . . . , pI (t); wj , wj0 ) =
1 I
1 + e−wj0 −
i=1
,
(19.11)
wji log pi (t)
which is the standard sigmoid function that takes values in [0, 1] and receives as inputs the logarithm of the TF activities. Here, wj0 is a real-valued bias parameter and the I-dimensional real-valued vector wj = [wj1 . . . wjI ]T represents the interaction weights between the jth target gene and the I TFs. These parameters quantify the strength of the network links between TFs and genes in the underlying regulatory network. Specifically, when wji = 0 the link between the jth gene and the ith TF is absent while when wji is negative or positive the TF acts as a repressor or activator, respectively. Notice that by estimating the values of the interaction weights, we can infer the network links between TFs and genes. This can lead to a very flexible framework for target identification which generalizes the method described in Section 19.3. The multiple TF model can contain a much larger number of unknown parameters and unobserved protein functions compared with the single input motif model. Given the scarcity of the data, it is therefore essential that inference is carried out by a fully Bayesian approach using MCMC. This poses significant computational challenges since we need to infer via sampling all the unknown quantities which are the protein functions and the remaining model parameters. We have developed an MCMC approach that infers the TF profiles from a small set of training data, associated with a moderate-sized network, and then it performs target identification at a genome-wide scale. This Bayesian approach is based on placing GP priors on the logged TF concentrations and suitable priors on the remaining model parameters (i.e. interaction weights and kinetic parameters) and then simulate from the posterior distribution by applying suitable Metropolis–Hastings updates. When observations of the TF mRNAs are available, the above framework can be combined with the protein translation ODE model presented in Section 19.3. This can further facilitate the estimation of the TF profiles and resolve identifiability problems caused by the scarcity of the data and experimental conditions. We applied the multiple TF dynamical model to an artifical dataset consisting of time-series mRNA mesearuments associated with 1000 genes and two TFs. The network links between TFs and genes are unknown apart from a small set of 20 genes which are assumed to have known connections. This small set of genes is used to infer the TF concentration functions. Figure 19.5(a) and (b) shows the inferred TF profiles where blue lines
392
30
10
9
9
25
8
8
7
7
20
6
6
5
15
5
4
4 10
3 2
3 2
5
1
0
1 0
2
4
6
8
10
0
0
2
4
(a)
6
8
10
0
10
10
9
9
8
8
7
7
4
3
3
3
2
2
2
1
1
1
8
10
6
8
10
7
4
6
10
6
5
(d)
8
8
6
4
6
9
5
2
4
(c)
6
0
2
(b)
5 4
0
2
4
6
(e)
8
10
0
2
4
(f)
Figure 19.5 (a, b) Inferred TFs. The estimated means are plotted as the blue solid lines and the shaded areas represent 95% uncertainty around the estimated means. The red solid lines show the ground-truth TFs that generated the data. (c–f) Several examples of how the multiple TF model predicts gene mRNA functions. The red crosses denote observed measurements, while blue lines and shaded areas represented noise-free (without adding the observation noise from the likelihood) Bayesian model predictions of the mRNA functions
Handbook of Statistical Systems Biology
10
Gaussian Process Inference TF2
1
1
0.8
0.8
0.6
0.6
TP rate
TP rate
TF1
0.4 Non−linear multi−TF (AUC 0.60) Linear single−TF (AUC 0.55)
0.2
0 0
(a)
0.2
0.4
0.6
FP rate
393
0.8
0.4 Non−linear multi−TF (AUC 0.80) Linear single−TF (AUC 0.53)
0.2
0
1
(b)
0
0.2
0.4
0.6
0.8
1
FP rate
Figure 19.6 Receiver operating characteristic (ROC) curves for predicting the network connections in the synthetic data. (a) shows the performance when predicting the first TF, while (b) shows the performance for predicting the second TF. Red curves show the results by using the multiple TF model and blue lines the results by using the linear single-TF model. In both plots, random prediction is represented by the diagonal black line
represent the inferred means, shaded areas represent 95% uncertainty and the red lines are the actual profiles that generated the data. Figure 19.5(c)–(f) gives several examples on how the model fits the mRNA expression data. Specifically, Figure 19.5 (c) shows a gene that is activated only by the first TF while the second TF is inactive. Figure 19.5(d) and (e) shows two genes that are regulated [activated in (d) and repressed in (e)] only by the second TF. Figure 19.5(f) shows a gene that is jointly regulated by the two TFs so that it is repressed by the first TF and activated by the second one. Figure 19.6 shows ROC curves of predictive performance for identifying the network links between the two TFs and the 980 test genes. The performance of a linear single-TF model is also displayed and is clearly inferior to the performance of the multiple TF model.
19.5
Conclusion
The difficulty of measuring every biochemical species involved in cellular interaction networks will mean that we will often be faced with the problem of missing functions. In this chapter we have reviewed an elegant approach to dealing with such missing functions; we started with parameteric generalized linear models for dealing with the missing functions, and extended them through Bayesian treatments and considering the limit of infinite basis functions to arrive at Gaussian process models. We reviewed a simple model of transcription and translation and showed how it can be used to rank putative targets of transcription factors. The Gaussian process framework provides an elegant framework for analytically dealing with missing functions if the differential equation responds in a linear way to the driving function. However, if we consider nonlinear responses, which we must necessarily do if we wish to constrain the concentration of the missing functions to be positive, the analytic approach is no longer tractable. We reviewed an efficient sampling-based approach that allows us to consider nonlinear responses and can be applied to the situation where there are multiple transcription factors regulating the targets of interest. Both approaches can be applied for genome-wide ranking of putative targets. Together these approaches present a powerful set of diagnostic tools for unraveling the transcriptional interactions that underpin a biological time series.
394
Handbook of Statistical Systems Biology
References Alon, U. 2006. An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman and Hall/CRC, London. Barenco, M., Tomescu, D., Brewer, D., Callard, R., Stark, J. and Hubank, M. 2006. Ranked prediction of p53 targets using hidden variable dynamic modeling. Genome Biology, 7(3):R25. Gao, P., Honkela, A., Rattray, M. and Lawrence, N. D. 2008. Gaussian process modelling of latent chemical species: applications to inferring transcription factor activities. Bioinformatics, 24:i70–i75. Honkela, A., Girardot, C., Gustafson, E. H., Liu, Y.-H., Furlong, E. E. M., Lawrence, N. D. and Rattray, M. 2010. Modelbased method for transcription factor target identification with limited data. Proceedings of the National Academy of Sciences of the United States of America, 107(17):7793–7798. Khanin, R., Viciotti, V. and Wit, E. 2006. Reconstructing repressor protein levels from expression of gene targets in E. Coli. Proceedings of the National Academy of Sciences of the United States of America, 103(49):18592–18596. Lawrence, N. D. and Rattray, M. 2010. A brief introduction to Bayesian inference. In Learning and Inference in Computational Systems Biology (eds N. D. Lawrence, M. Girolami, M. Rattray and G. Sanguinetti), MIT Press, Cambridge, MA, chapter 5. Lawrence, N. D., Sanguinetti, G. and Rattray, M. 2007. Modelling transcriptional regulation using Gaussian processes. In Advances in Neural Information Processing Systems (eds B. Sch¨olkopf, J. C. Platt and T. Hofmann), MIT Press, Cambridge, MA, vol. 19, pp. 785–792 Lawrence, N. D., Rattray, M., Gao, P. and Titsias, M. K. 2010. Gaussian processes for missing species in biochemical systems. In Learning and Inference in Computational Systems Biology (eds N. D. Lawrence, M. Girolami, M. Rattray and G. Sanguinetti), MIT Press, Cambridge, MA, chapter 9. Liu, X., Milo, M., Lawrence, N. D. and Rattray, M. 2005. A tractable probabilistic model for Affymetrix probe-level analysis across multiple chips. Bioinformatics, 21(18):3637–3644. MacKay, D. J. C. 2003. Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge. Milo, M., Fazeli, A., Niranjan, M. and Lawrence, N. D. 2003. A probabilistic model for the extraction of expression levels from oligonucleotide arrays. Biochemical Transations, 31(6):1510–1512. Møller, M. F. 1993. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4):525–533. O’Hagan, A. 1978. Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society, B, 40:1–42. Rasmussen, C. E. and Williams, C. K. I. 2006. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA. Rogers, S., Khanin, R. and Girolami, M. 2007. Model based identification of transcription factor activity from microarray data. BMC Bioinformatics, 8(Suppl. 2):S2. Sanguinetti, G., Lawrence, N. D. and Rattray, M. 2006a. Probabilistic inference of transcription factor concentrations and gene-specific regulatory activities. Bioinformatics, 22(22):2275–2281. Sanguinetti, G., Rattray, M. and Lawrence, N. D. 2006b. A probabilistic dynamical model for quantitative inference of the regulatory mechanism of transcription. Bioinformatics, 22(14):1753–1759. Sch¨olkopf, B. and Smola, A. J. 2001. Learning with Kernels. MIT Press, Cambridge, MA. Titsias, M.K., Lawrence, N. D. and Rattray, M. 2009. Efficient sampling for Gaussian process inference using control variables. In Advances in Neural Information Processing Systems (eds D. Koller, D. Schuurmans, Y. 
Bengio and L. Botton), MIT Press, Cambridge, MA, vol. 21, pp. 1681–1688. Titsias, M. K., Rattray, M. and Lawrence, N. D. 2011. Markov chain Monte Carlo algorithms for Gaussian processes. In Bayesian Time Series Models (eds D. Barber, A. T. Cemgil and S. Chiappa), Cambridge University Press, Cambridge, chapter 14. Tomancak, P., Beaton, A., Weiszmann, R., Kwan, E., Shu, S., Lewis, S. E., Richards, S., Ashburner, M., Hartenstein, V., Celniker, S. E. and Rubin, G. M. 2002. Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biology, 3(12):RESEARCH0088 Williams, C. K. I. 1997. Regression with Gaussian processes. In Mathematics of Neural Networks: Models, Algorithms and Applications (eds S. W. Ellacott, J. C. Mason and I. J. Anderson), Kluwer, Dordrecht, pp. 378–382. Zhu, C., Byrd, R. H. and Nocedal, J. 1997. L-BFGS-B: Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization. ACM Transactions on Mathematical Software, 23(4):550–560. Zinzen, R. P., Girardot, C., Gagneur, J., Braun, M. and Furlong, E. E. M. 2009. Combinatorial binding predicts spatiotemporal cis-regulatory activity. Nature, 462(7269):65–70.
20 Model Identification by Utilizing Likelihood-Based Methods Andreas Raue and Jens Timmer Institute of Physics and Freiburg Institute for Advanced Studies (FRIAS) and Centre for Biological Systems Analysis (ZBSA), University of Freiburg, Germany
Biotechnological progress provided the possibility to decipher the human genome, yielding a multiplicity of genes representing specific functional components such as proteins or siRNA (International Human Genome Sequencing Consortium, 2004; Gerstein et al., 2007). The functionality that can be observed in nature is mainly not arising from genes directly but from the interactions of these components. Therefore, knowing the whole genome gives an inventory of the functional components rather than a manual or explanation of functionality. The remaining task, putting together the henceforward known pieces to understand how functionality actually arises from the interplay of these components, is enormous. It is nevertheless essential to be able to understand how failures in this functionality, such as in cancer, arise and how they can be treated. To this end it is necessary to investigate the properties of reaction networks. In this chapter, we want to elucidate how dynamical modeling and likelihood-based model identification strategies can be utilized to reveal properties of the underlying biological system. The purpose of modeling is not to rebuild the biological system in all of its complexity. A model is rather a tool that helps to understand and to handle the complexity of the processes occurring in biology. In this context, ordinary differential equations (ODEs) are often used as a mathematical description to investigate the dynamics of molecular processes. It is assumed that diffusion is fast compared with the reaction rates of the protein interactions and the spatial extent of the cell. Tools to build ODE models for complex reaction networks are available (Schmidt and Jirstrand, 2006; Maiwald and Timmer, 2008). The advantage of building and utilizing a model is that all the available knowledge about the biological processes has to be formulated explicitly. The model can then be tested in a statistical framework by comparison with experimental data. There are two main applications:
Handbook of Statistical Systems Biology, First Edition. Edited by Michael P. H. Stumpf, David J. Balding and Mark Girolami. © 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd.
396
Handbook of Statistical Systems Biology
(i) Concurring biological hypothesis such as presence or absence of a reaction can be formulated and yield different model structures. By comparing the agreement of the output of each candidate model with the experimentally measured data, using appropriate model selection techniques, some hypothesis may be falsified. For an application, see Swameye et al. (2003) where nuclear cycling of STAT protein is identified. (ii) Once a promising candidate model is selected the model can be used to study the dynamical behavior of the system that cannot be accessed by experiments directly. For example, the estimated values of rate constants can be investigated, the dynamics of unobserved model components can be predicted or the model response to altered external conditions such as altered network structure or different external stimulations can be studied. For an application, see Becker et al. (2010) where the dynamics of erythropoietin receptor (EpoR) trafficking and interaction was investigated. Finally, systems properties such as robustness can be investigated (Kollmann et al. 2005). Both applications require the reliable estimation of beforehand unknown model parameters such as rate constant or initial concentrations. The predicted dynamics depend intrinsically on these estimated parameters. At this point it is important to consider the uncertainties in the parameter estimation procedure: the measurement uncertainties translate to parameter uncertainties, and parameter uncertainties translate to uncertainties in the predicted model dynamics. Due to technical limitations the reaction network under investigation is often only partly accessible by experiments. This means that not all molecular species incorporated in a model can be measured directly. For example Western blotting, a widely used technique to measure the dynamics of protein concentration, depends on the availability and specificity of anti-bodies for protein detection. Furthermore, the measurement errors of these techniques are usually large. Given a certain amount and quality of experimental data measured under specific experimental conditions, it is therefore not assured that model parameters can be estimated unambiguously. Frequently, experimental data are limited considering the size of the model which results in parameters that are nonidentifiable (Raue et al. 2009). Even identifiable parameters can only be determined within confidence intervals, which contain the true value of the parameter with a desired probability (Lehmann and Leo 1983). Consequently, if model parameters are not well determined the predicted model dynamics are also not well determined and the questions that should be answered by the model might not be addressable. Therefore, it is important to have detailed knowledge of parameter uncertainties to be able to assess the precision of model predictions realistically. On the other hand, the analysis of the uncertainties itself provides the potential to improve predictability of a model, i.e. by resolving parameter nonidentifiabilities using experimental design techniques. In the following, the above mentioned concepts will be introduced from a data-based perspective, using likelihood-based methods and an illustrative example.
20.1
ODE models for reaction networks
Reaction networks can be modelled using a system of ODEs (Wolkenhauer 2008). To adequately describe the underlying biological process with this mathematical framework spatial effects have to be neglected by assuming that diffusion is fast compared with the reaction rates of the protein interactions and the spatial extent of a cell. Intrinsic stochasticity can be neglected, if the copy number of proteins is sufficiently large, i.e. for a protein copy number >10 (Taniguchi et al. 2010). Moreover, stochastic effects are suppressed, if the dynamics of protein concentration is modelled as average over a cell population. This applies for many measurement techniques such as obtained by Western blotting or by quantitative reverse transcription polymerase chain reaction (qRT-PCR).
Model Identification by Utilizing Likelihood-Based Methods
397
A model M describing n species concentrations xi in a reaction network by ODEs can be written as x˙ (t, θ) = f (x(t, p ), u (t), θ) x(0, θ) = w(θ) y (t, θ) = g (x(t, θ), θ) +
(20.1) (20.2) (20.3)
with internal model states x, an externally given stimulus u , model parameters θ, a m-dimensional mapping g of the internal model states to the model outputs y , possibly involving additional parameters θ such as scaling and offset parameters. The measurement noise ∼ N(0, σ 2 ) is assumed to be known and normally distributed. Equation (20.1) represents the dynamics of protein concentration, where usually each component of the right-hand side is a sum of reaction fluxes v of several protein interactions, as will be introduced in the next section. The initial conditions of the dynamical system given in Equation (20.2) can also depend on the parameters θ. The model outputs y described by Equation (20.3) represent the experimentally measurable quantities. For partially observed models, the dimension m of the model outputs is smaller than the dimension n of the internal model states. 20.1.1
Rate equations
Consider a reaction network consisting of protein interactions rj with j = 1, ..., and corresponding reaction flux vj (x, u , θ) that can be dependent on species concentrations x, the stimulus u and on parameters θ. In this context, the parameters θ involved in the dynamics, i.e. the reaction rate equations, are often called rate constants. Together with the stoichiometry matrix N, Equation (20.1) can be expressed as ⎞ ⎛ N11 N12 x˙ 1 ⎜ x˙ ⎟ ⎜ N N ⎜ 2 ⎟ ⎜ 21 22 ⎟ ⎜ x˙ = ⎜ .. ⎜ .. ⎟ = ⎜ .. ⎝ . ⎠ ⎝ . . x˙ n Nn1 Nn1 ⎛
⎞ ⎛ ⎞ v1 . . . N1 ⎟ ⎜ . . . N2 ⎟ ⎜ v2 ⎟ ⎟ ⎟ · ⎜ ⎟ = f (x, u , θ). ⎟ ⎜ .. ⎟ .. . ⎠ ⎝ . ⎠ Nn v
One element Nij of the stoichiometry matrix reflects whether and how often the species xi occurs as reactant or product in the reaction rj with flux vj . For Nij = 0, reaction rj does not affect species xi . For Nij > 0, reaction rj produces Nij copies of species xi and for Nij < 0, reaction rj consumes |Nij | copies of species xi . In a graphical representation of a reaction network, each nonzero element of N corresponds to an edge connecting two molecular species. Some commonly used rate equations modeling the reaction fluxes v are: •
First-order mass-action kinetics for reactions x1 → x2 v(x1 , θ1 ) = θ1 · x1
•
(20.4)
describing processes such as translocation of a species between compartments. Second-order mass-action kinetics for reactions x1 + x2 → x3 v(x1 , x2 , θ1 ) = θ1 · x1 · x2 describing processes such as complex formation.
(20.5)
398 •
Handbook of Statistical Systems Biology x2
Michaelis–Menten kinetics for simplified enzymatic reactions x1 → x3 v(x1 , x2 , θ1 , θ2 ) =
θ1 · x1 · x2 θ2 + x1
(20.6)
describing saturation effects for rate-limiting enzyme concentration x2 . For high substate concentrations x1 θ2 the reaction rate saturates v(x2 , θ1 ) = θ1 · x2 .
(20.7)
If the substrate concentration x1 θ2 , saturation can be neglected by approximating Equation (20.6) by v(x1 , x2 , θ3 ) = θ3 · x1 · x2 •
(20.8)
An inhibitor x2 can decrease the flux of a reaction x1 → x3 . The reaction flux can be approximated by v(x1 , x2 , θ1 , θ2 ) =
•
with θ3 = θ1 /θ2 .
θ1 · x1 1 + θ2 · x2 .
(20.9)
Similarly, x2 increasing the flux of a reaction x1 → x3 can be approximated by v(x1 , x2 , θ1 , θ2 ) = θ1 · x1 · (1 + θ2 · x2 ).
(20.10)
For more details about enzyme kinetics and the modelling of inhibition, see Segel (1993).
20.2
Parameter estimation
The set of parameters θ fully specifies the dynamical behavior of an ODE model M and its link to the experimental data. Parameters in reaction networks such as rate constants, initial species concentrations and concentration scaling or offset factors are usually positive definite, θ ∈ R+ . To avoid the natural lower bound of zero, logarithmic parameter values should be used. The agreement of experimental data with the predicted model output is measured by an objective function, commonly the weighted sum of squared residuals χ (θ) = 2
dk m k=1 l=1
†
ykl − yk (tl , θ) σkl
2 (20.11)
†
where experimental data ykl for each model output k = 1, . . . , m is measured at time points tl with l = 1, . . . , dk . The measurement noise is assumed to be normally distributed and its magnitude σkl is assumed to be known. It is important to note that the noise of concentration measurements is usually log-normally † distributed (Kreutz et al. 2007). Therefore, the experimental data ykl and the predicted model output yk (tl , θ) for parameters θ should be log-transformed as well. The parameters can be estimated by θˆ = arg min χ2 (θ) , (20.12) which corresponds to maximum likelihood estimation (MLE) of θ for normally distributed measurement noise, because χ2 (θ) = const. − 2 · log(L(θ))
(20.13)
Model Identification by Utilizing Likelihood-Based Methods
where L(θ) =
dk m
k=1 l=1
⎛
1 2 2πσkl
1 exp ⎝− 2
†
ykl − yk (tl , θ) σkl
2 ⎞ ⎠
399
(20.14)
is the likelihood function. Due to the non linear nature of the ODE model of Equation (20.1) with respect to the model parameters, the minimization of χ2 (θ) in Equation (20.12) cannot be solved analytically. Therefore, numerical algorithms have to be utilized to minimize/optimize the objective function χ2 (θ). There are many different algorithmic approaches to solve the problem given in Equation (20.12). Local optimizers. Beginning from an initially guessed parameter set θ0 , a local optimizer tries to reduce χ2 (θ) successively, see for example the Levenberg–Marquardt algorithm in Press et al. (1990). For ODE models, suppling analytic derivatives that guide these algorithms to the minimum of χ2 are essential. The numerical reasons for this and practical considerations for the calculation will be given in the next section. If experimental data are insufficient considering the complexity of the model, and due to the non linearity of the model with respect to the parameters, the problem given in Equation (20.12) may not be well behaved, i.e. the objective function may exhibit several local minima. In this case, a local optimizer can get stuck and does not yield the best possible parameter set. Global optimizers. In contrast, global optimizers try to keep track of several possible solutions or allow for steps in increasing direction of χ2 temporarily (Egea et al. 2007). Despite the terminology ‘global’, it is important to note that there is no algorithm that can guarantee to find the global optimum. Applying successive runs of a local optimizer from different initial parameter sets yields effectively a global optimizer. 20.2.1
Sensitivity equations
A local or effectively global optimization algorithm that efficiently estimates θ numerically by minimizing χ2 (θ) in Equation (20.12), needs reliable estimates of the sensitivities, i.e. the derivative of the model outputs y with respect to the parameters θ, ∂y ∂g ∂x = · . (20.15) ∂θ ∂x ∂θ where ∂g/∂x is given by Equation (20.3) analytically. In the case of ODE models, the analytical calculation of the remaining sensitivities of the internal model states ∂x / ∂θ is hampered, because Equation (20.1) cannot, in general, be solved analytically. Usually numerical ODE solvers have to be applied. The most obvious way to calculate ∂x / ∂θ numerically is by finite difference (FD) approximation ∂x x(θ) − x(θ + h · ej ) ≈ (20.16) ∂θj h where ej is the jth unit vector and h should be sufficiently small. In the context of ODE models, this approach is problematic because the integrator used to solve Equation (20.1) numerically might use different step sizes to obtain x(θ) and x(θ + h · ej ) and only guarantees an approximate solution given a certain solver tolerance level. The numerical error especially introduced for small values of h can not be controlled in this way and leads to discontinuities of ∂x / ∂θ along the time axis. An elegant method to calculate the sensitivities ∂x / ∂θ numerically is via integration of the sensitivity equations (SE). The calculation d ∂x ∂ ∂ ∂f ∂x ∂f = x˙ = f = · + (20.17) dt ∂θ ∂θ ∂θ ∂x ∂θ ∂θ
400
Handbook of Statistical Systems Biology SE
1
1
0.8
0.8
0.6
0.6
∂ x /∂θ
∂ x /∂θ
FD
0.4
0.4
0.2
0.2
0
0
0
10
20
(a)
30 time
40
50
60
0
10
20
30 time
40
50
60
(b)
Figure 20.1 Two sensitivities ∂xi / ∂j calculated by finite differences (FD) in (a) and by the sensitivity equations (SE) in (b) illustrating the numerical problems in the case of FD
yields a differential equation for ∂x / ∂θ where the right-hand side is given analytically, i.e. ∂f /∂x and ∂f /∂θ can be calculated from Equation (20.1) (Leis and Kramer 1988). These n · (n + #θ) additional differential equations can be simultaneously solved with the original ODE system. The additional initial conditions are given by ∂x ∂w (0) = (20.18) ∂θ ∂θ using Equation (20.2). The resulting enlarged ODE system, including the original model dynamics and the dynamics of the sensitivities, can be efficiently solved because the Jacobian of the sensitivity subsystem is composed of the Jacobian of the original system, see Hindmarsh et al. (2005) for a computationally efficient implementation. Figure 20.1 shows a comparison of sensitivities calculated by FD [Figure 20.1(a)] and by the SE [Figure 20.1(b)] illustrating the numerical problems in the case of FD. 20.2.2
Testing hypothesis
Once the optimal parameter values, i.e. the best model fit, are obtained from the estimation procedure for all candidate models, one needs to judge whether the models are in agreement with experimental data and whether one of the models explains the experimental data significantly better. Various approaches exist, see Cox and Hinkley (1974) or Seber and Wild (2003). Here, two commonly used methods derived from the theory of MLE will be explained. 20.2.2.1
Pearson’s χ2 -test
2 distribution with degrees of The limit distribution of the objective function given in Equation (20.11) is the χdf m m freedom df = k=1 dk − #θ, where k=1 dk is the total amount of experimental data and #θ is the number 2 distribution with df degrees of freedom is of parameters. The expectation value and the variance of the χdf 2 2 = df and Var χdf = 2 · df, (20.19) χdf
respectively, see Figure 20.2. Thus, a model that adequately represents the experimental data should have a 2 distribution, see the dashed line in value of the objective function not larger than a given quantile of the χdf Figure 20.2.
Model Identification by Utilizing Likelihood-Based Methods
401
PDF(χ2df = 5)
0.2
0.1
0
0
2
4
6
8
10 χdf2 = 5
12
14
16
18
20
Figure 20.2 Probability density function of the df2 distribution with df = 5 degrees of freedom. The solid line indicates the expectation value. The dashed line indicates the 95% quantile that is utilized for a test with a significance level of ˛ = 0.05
Please note, that this test yields a reliable conclusion only if the magnitude of the measurement noise σ 2 is assessed realistically. Furthermore, the actual degrees of freedom consumed by a nonlinear model might be slightly different from #θ. This can be checked by simulation studies. 20.2.2.2
Likelihood-ratio test
Consider a model M∗ and an extended model M containing df additional parameters. For this test the null hypothesis is that an enlarged model does not fit the experimental data significantly better. The models need to be nested, M∗ ⊂ M. If the additional df parameters lead to a significant decrease of the objective function, the hypothesis can be rejected indicating that the larger model M is preferable to the small model M∗ . The test statistic is given by χ2 (M∗ ) − χ2 (M) =: χ2
∼
2 χdf .
(20.20)
2 distributed. The Here, the decrease of the objective function χ2 due to the additional parameters is χdf hypothesis is rejected with a significance level of α, if
pval =
∞
χ2
2 PDF(χdf ) dχ2
≤
α
(20.21)
2 ) denotes the probability density function of the χ2 distribution. This indicates that the where PDF(χdf df smaller model M∗ is dismissed in favour of the enlarged model M due to a significant decrease of the objective function. If pval > α the hypothesis cannot be rejected and the smaller model cannot be dismissed. 2 distribution with degrees of freedom df = 5 is illustrated by the A significance level α = 0.05 for the χdf dashed line in Figure 20.2.
20.2.3
Confidence intervals
For brevity, χ2 (θ) will be termed likelihood in the following. Assume a model M that sufficiently describes the available experimental data. A confidence interval [σi− , σi+ ] of a parameter estimate θˆ i to a confidence level 1 − α signifies that the true value θi∗ is located within this interval with probability 1 − α. In the following, asymptotic and finite sample confidence intervals will be introduced.
402
Handbook of Statistical Systems Biology
20.2.3.1
Asymptotic confidence intervals
Confidence intervals can be derived from the curvature of the likelihood, e.g. using the Hessian matrix H = ∇ T ∇ χ2 |θˆ i .
(20.22)
Using the covariance matrix of the parameter estimates C = 2 · H−1 , asymptotic confidence intervals are given by 2 , 1 − α) · C σi± = θˆ i ± Q(χdf (20.23) ii 2 , 1 − α) is the 1 − α quantile of the χ2 distribution with df degrees of freedom (Press et al. where Q(χdf df 1990), see Figure 20.2. The choice of df yields two different types of confidence intervals: df = 1 gives pointwise confidence intervals that hold individually for each parameter, df = #θ being the number of parameters giving simultaneous confidence intervals that hold jointly for all parameters. Asymptotic confidence intervals are a good approximation of the uncertainty of θˆ i , if the amount of experimental data is large compared with #θ and/or the amount of measurement noise is small. They are exact if the model outputs y depend linearly on θ. Even for the simplest reaction network described by Equations (20.1– 20.3), the model outputs y depend nonlinearly on θ. Note that for a ODE system consisting of mass-action reactions, only the right-hand side of the ODE system is linearly dependent on the parameters, its solution x and hence the model outputs y are not. Furthermore, the amount and quality of experimental data is often limited, e.g. protein concentration measurements by Western blotting. Therefore, asymptotic confidence intervals might not be appropriate (Joshi et al. 2006). The ellipsoid in Figure 20.3 indicates asymptotic confidence intervals and illustrates its discrepancy from the actual shape of the likelihood.
χ2 1.2 1 0.8
log10(θ2)
0.6 0.4 0.2 0 −0.2 −0.4 0
0.01
0.02
0.03
0.04
0.05
log10(θ1)
Figure 20.3 Illustrative contour plot of 2 () in logarithmic parameter space. Shaded contour lines from black to white correspond to low to high values of 2 (). The dashed ellipsoid indicates asymptotic confidence intervals; the thick solid line indicates likelihood-based confidence intervals
Model Identification by Utilizing Likelihood-Based Methods
20.2.3.2
403
Finite sample confidence intervals
Confidence intervals can also be derived using a threshold in the likelihood. These so-called likelihood-based confidence intervals define a confidence region ˆ < α } {θ | χ2 (θ) − χ2 (θ)
2 with α = Q(χdf , 1 − α)
(20.24)
whose borders represent confidence intervals (Meeker and Escobar 1995). The threshold α is the 1 − α 2 distribution and represents with df = 1 and df = #θ point-wise, respectively, simultaneous quantile of the χdf confidence intervals to a confidence level 1 − α, see Figure 20.2. Likelihood-based confidence intervals are considered superior to asymptotic confidence intervals for finite samples (Neale and Miller 1997). This is illustrated in Figure 20.3. In the asymptotic case, both approaches are equivalent. 20.2.3.3
Coverage
In order to evaluate the appropriateness of confidence intervals, usually, coverage rates are studied. The coverage rate CR for a parameter signifies how often its true value θi∗ is actually covered by the confidence interval [σi− , σi+ ]. To this end data is simulated assuming a set of true parameter values. If confidence intervals are appropriate, the coverage rate should reflect the desired level of confidence CR ≈ 1 − α. If the coverage rate is larger than the desired level the confidence intervals tend to be more conservative than required, if it is smaller the actual uncertainty in the estimates is underestimated. ˆ For likelihood-based confidence intervals that are based on Equation (20.24), the difference χ2 (θ ∗ ) − χ2 (θ) ˆ For nonlinear models and small data corresponds to the amount of over-fitting for the estimated parameters θ. ˆ may differ from the χ2 distribution (Press et al. 1990). samples the actual distribution of χ2 (θ ∗ ) − χ2 (θ) df For instance the distribution can be shifted, if the actual degrees of freedom df consumed by the nonlinear model differs from the number of model parameters. Since the deviation of the distribution is dependent on ˆ should always be verified by simulation studies. If the specific application, the distribution of χ2 (θ∗ ) − χ2 (θ) deviations are observed, the threshold α should be adjusted according to the generated distribution to obtain appropriate coverage rates for the confidence intervals.
20.3
Identifiability
A parameter θi is identifiable, if the confidence interval [σi− , σi+ ] of its estimate θˆ i is finite. Two phenomena accounting for parameters to be nonidentifiable will be discussed here. Structural nonidentifiability is related to the model structure, independent of experimental data. This is intensively discussed in the literature (Cobelli and DiStefano III 1980). In contrast practical nonidentifiability also takes into account the amount and quality of experimental data that was used for parameter calibration (Raue et al. 2009). In the following, identifiability will be introduced from the perception of parameter estimation, i.e. in a data-based way. 20.3.1
Structural nonidentifiability
A structural nonidentifiability arises from the structure of the model itself, independent of measurement noise . Assume a model defined by Equations (20.1–20.3), where the functions f , w and g describing the model dynamics and the model outputs are given, the external stimulation u is known and perfect measurements with σ 2 = 0 are assumed. The crucial question is whether the model parameters are uniquely identifiable from the experimentally observed quantities y in the given setting. The formal solution of y may contain a redundant parameterization, due to a dimension reducing mapping g of internal model states x to model outputs y in Equation (20.3). The redundancy can be characterized by
404
Handbook of Statistical Systems Biology (a): structural nonidentifiability
(b): practical nonidentifiability
(c): identifiability
θ2
10
5
0
0
5 θ1
10 0
5 θ1
10 0
5 θ1
10
Figure 20.4 Contour plots of 2 () for a two-dimensional parameter space, shown in nonlogarithmic scale for illustrative purposes. Shaded contour lines from black to white correspond to low to high values of 2 (). Thick solid lines display likelihood-based confidence regions and asterisks the optimal parameter estimates ˆ . (a) A structural nonidentifiability along the functional relation h() = 1 · 2 − 10 = 0 (dashed line). The likelihoodbased confidence region is infinitely extended. (b) A practical nonidentifiability. The likelihood-based confidence region is infinitely extended for 1 → +∞ and 2 → +∞, and lower confidence bounds can be derived. (c) Both parameters are identifiable. Adapted from Raue et al. (2009)
sub ) = 0 between a subset of parameters θsub ⊂ θ. These redundant parameters θsub functional relations h(θ without changare structurally nonidentifiable and may be varied according to the functional relations h ing the model outputs y . In terms of the objective function χ2 (θ) defined by Equation (20.11), a structural nonidentifiability therefore manifests as iso–χ2 manifold sub ) = 0 θ | h(θ ⇒ χ2 (θ) = const. (20.25) Consequently, the parameter estimates θˆ sub and, respectively, the internal model states x affected by these parameters are not uniquely identified by measurements of y . Confidence intervals of a structurally nonidentifiable parameter θi ∈ θsub are infinite, [−∞ , +∞] in logarithmic parameter space considered here. Hence, θi cannot be estimated at all. A direct detection of a redundant parameterization in the analytic form of y is hampered, because Equation (20.1) cannot, in general, be solved analytically. For a two-dimensional parameter space, χ2 (θ) can be visualized as a landscape. A structural nonidentifiability results in a perfectly flat valley, infinitely extended along the corresponding functional relation, as illustrated in Figure 20.4(a). Since a structural nonidentifiability is independent of the accuracy of available experimental data, it cannot be resolved by a refinement of existing measurements. The only remedy is a qualitatively new measurement which alters the mapping g , usually by increasing the number of observed species. A parameter is structurally identifiable, if a unique minimum of χ2 (θ) with respect to θi exists, see Figure 20.4(b) and (c). 20.3.2
Practical nonidentifiability
A parameter that is structurally identifiable may still be practically nonidentifiable. This can arise due to insufficient amount and quality of experimental data, manifesting in a confidence interval that is infinite. Please note, that the asymptotic confidence interval of a structurally identifiable parameter estimate may be
Model Identification by Utilizing Likelihood-Based Methods
405
large, but is always finite because Cii > 0 [see Equation (20.23)]. Therefore, it is not possible to infer practical nonidentifiability using asymptotic confidence intervals. According to Raue et al. (2009), a parameter estimate θˆ i is practically nonidentifiable, if the likelihoodbased confidence region is infinitely extended in the direction of θi , although the likelihood has a unique minimum for this parameter. This means that the increase in χ2 (θ) stays below the threshold α for a desired confidence level 1 − α in the direction of θi . Similarly as for structural nonidentifiability, this flattening out of the likelihood can continue along a functional relation. The confidence interval of a practically nonidentifiable parameter is not necessarily extended infinitely to both sides. There can be a finite upper or lower bound of the confidence interval [σi− , σi+ ], but either σi− = −∞ or σi+ = +∞ in logarithmic parameter space is considered here. For a two-dimensional parameter space, a practical nonidentifiability can be visualized as a relatively flat valley, which is infinitely extended. The distance of the valley bottom to the lowest point θˆ never exceeds α , as illustrated in Figure 20.4(b). Along a practical nonidentifiability the model outputs y change only negligibly and remain compliant with the given measurement accuracy. Nevertheless, model behavior in terms of internal states x might vary strongly. Improving the detection of typical dynamical behavior by increasing the amount and quality of measured data and/or the choice of measurement time points will ultimately cure a practical nonidentifiability, yielding finite likelihood-based confidence intervals, see Figure 20.4(c). Inferring how to decrease confidence intervals most efficiently is the subject of experimental design, which will be discussed in Section 20.4.1. 20.3.3
Connection of identifiability and observability
The uncertainty of parameter estimates θˆ directly translate to uncertainty of model trajectories. Nonidentifiability describes the phenomenon that parameters might not be determined. Consequently, non observability indicates that model trajectories might not be determined due to nonidentifiability of model parameters. As mentioned in Section 20.3.1, a structural nonidentifiability only affects the observability of the internal model states x. The model outputs y are reproduced exactly for all parameter combinations on the manifold that corresponds to the structural nonidentifiability [see Equation (20.25)]]. As mentioned in Section 20.3.2, a practical nonidentifiability affects the model outputs y only negligibly. This means that the model outputs stay in agreement with the measurement accuracy of the experimental data. Some internal model states x might nevertheless be affected strongly by a practical nonidentifiability and hence might be nonobservable. Also, confidence intervals of parameter estimates translate to confidence intervals of model trajectories. Similar to coverage rates for confidence intervals of parameter estimates, coverage rates for confidence intervals of model trajectories can be assessed. From a different point of view, identifiability can be stated in terms of observability. Additional internal model states xn+i corresponding to parameters θi where i = 1, . . . , #θ can be introduced. The ODE system of Equation (20.1) is then enlarged by equations x˙ n+i = 0 and every incidence of θi is replaced by xn+i in both Equations (20.1) and (20.3). Consequently, showing observability of xn+i implies identifiability of θi .
20.4
The profile likelihood approach
In the remainder of this chapter the profile likelihood approach (Raue et al. 2009) will be presented, which facilitates the numerical implementation of the concepts for data-driven modeling introduced so far. The approach enables to reveal which parameters are identifiable and thus allows to infer which model predictions are feasible. Provided that parameters are identifiable the question that follows is how large their confidence intervals are, indicating how reliable a model prediction is.
406
Handbook of Statistical Systems Biology
The approach is able to detect both structurally and practically nonidentifiable parameters and simultaneously calculates confidence intervals. Since large models are under consideration, it is important that the approach is computationally feasible and its output is interpretable even if the issue depends on a high-dimensional parameter space. Furthermore, the approach can be used for experimental design, to suggest additional measurements that efficiently reduce parameter uncertainties, and for model reduction, to tailor the model complexity to the information content provided by the experimental data. Finally, the usage and benefits of the approach will be illustrated by applying it to the core model of erythropoietin (Epo) and EpoR interaction and trafficking with the corresponding subset of the experimental data from Becker et al. (2010). The idea of the approach is to explore the parameter space for each parameter in the direction of least increase in χ2 (θ). For a structurally nonidentifiable parameter this means following the functional relations sub ) = 0. In the case of a practically nonidentifiable parameter, the aim is to detect directions where the h(θ likelihood flattens out. 2 (Venzon and Moolgavkar 1988; Murphy and van A useful concept for this task is the profile likelihood χPL der Vaart 2000). It can be calculated for each parameter individually by 2 χPL (θi ) = min χ2 (θ) θj =/ i
(20.26)
meaning re-optimization of χ2 (θ) with respect to all parameters θj =/ i , for each fixed value of the parameter θi . Hence, the profile likelihood keeps χ2 (θ) as small as possible alongside θi . Figure 20.5 displays the profile likelihood for the three typical cases introduced in Figure 20.4 showing that the likelihood is explored in the desired way to detect nonidentifiabilities. Structural nonidentifiable parameters are characterized by a flat profile likelihood [see Equation (20.25)]. The profile likelihood of a practically nonidentifiable parameter has a minimum but does not exceed a threshold α for increasing and/or decreasing values of θi , see Section 20.3.2. In contrast, the profile likelihood of an identifiable parameter exceeds α for both increasing and decreasing values of θi . The points of passover represent likelihood-based confidence intervals as defined in Equation (20.24) (Royston 2007). Figure 20.5(b) shows the profile likelihood approaching the borders of the likelihood-based confidence region in the desired 2 (θ ), the functional relations h(θ sub ) = 0 correway. By following the change of parameters θj =/ i along χPL i 2 sponding to a structural nonidentifiability can be recovered, see Figure 20.5(a). An algorithm to calculate χPL was described in Raue et al. (2009). 20.4.1
Experimental design
To improve the certainty of a specific model prediction, it would be valuable to suggest additional measurements that efficiently cure nonidentifiability and narrow the confidence interval of a parameter θi affecting this issue. The set of trajectories along the profile likelihood of θi reveals spots where the uncertainty of θi has the largest impact on the model. Additional measurements at spots of largest variability of the trajectories efficiently accomplish this task. The amplitude of variability of the trajectories at these spots allows to assess the necessary precision of a new measurement, to provide adequate data that are able to improve parameter identification. The impact of new measurements can be evaluated by Monte Carlo simulations. With this aim, the described analysis of the profile likelihood is repeated, taking into account additional simulated data. The resulting change of the profile likelihood and correspondingly the resolution of nonidentifiability and the narrowing of the likelihood-based confidence intervals, justifies the use of new measurements to gain a more confident model prediction.
Model Identification by Utilizing Likelihood-Based Methods (a): structural nonidentifiability
(c): practical nonidentifiability
(e) identifiability
(b): profile likelihood of (a)
(d): profile likelihood of (c)
(f): profile likelihood of (e)
407
θ2
10
5
2 ( ) PL 1
0
0
5 θ1
10 0
5 θ1
10 0
5 θ1
10
2 Figure 20.5 Assessing the identifiability of parameter 1 by the profile likelihood PL (1 ). (a) A structural nonidentifiability along the functional relation h(1 , 2 ) = 1 · 2 − 10 = 0 manifesting in a flat profile likelihood in (b). (c) A practical non identifiability manifesting in a flattening out of the profile likelihood for 1 → ∞ in (d). 2 (1 ) exceeds ˛ . (e) An identifiable parameter A lower confidence bound can be assessed by the point where PL 1 . The profile likelihood approaches a parabola shape yielding a finite confidence interval. (a,c,e) Contour lines shaded from black to white correspond to low to high values of 2 (). Thick contour lines indicate likelihood-based confidence regions and asterisks correspond to the optimal parameters ˆ . Dashed lines indicate the trace of the 2 profile likelihood for parameter 1 in terms of parameter 2 . (b,d,f) Dashed lines indicate the profile likelihood PL of parameter 1 . The thick lines display the threshold ˛ utilized to assess likelihood-based confidence regions for a confidence level ˛
20.4.2
Model reduction
The approach can be used for model reduction by considering a threshold α corresponding to point-wise confidence intervals using df = 1 [see Equation (20.24)]. Assume a parameter θi is practically nonindentifiable for decreasing parameter value. Consider a reduced model M∗ with simplified kinetics concerning θi , e.g. for mass-action kinetics by removing the reaction. The reduced and the original model are nested M∗ ⊂ M. In this case, the threshold α corresponds to a likelihood-ratio test of the reduced model M∗ against the original model M to a significance level α, as introduced in Section 20.2.2. Falling below this threshold for θi being nonindentifiable, the profile likelihood indicates that it is not possible to dismiss the reduced model M∗ in favor of M, based on the available experimental data. Hence, for practical reasons, the smaller model is favorable. 20.4.3
Observability and confidence intervals of trajectories
As mentioned in Section 20.3.3, identifiability of model parameters has a direct impact on the observability of model trajectories. Nonobservability can be investigated by plotting both the internal model trajectories
408
Handbook of Statistical Systems Biology
x and the model outputs y for parameter values sampled along the profile likelihood of a nonidentifiability. The variations reveal which trajectories are nonobservable due to nonidentifiability. In the case of identifiable parameters, the same procedure allows to assess confidence intervals of model trajectories by plotting trajectories for all parameter values sampled along the profile likelihood that are inside the confidence region defined by Equation (20.24) and taking the maximum and minimum trajectory for each time point. Here, it is sufficient to consider a threshold corresponding to point-wise confidence intervals of the parameter estimates, see Section 20.2.3, if a confidence interval for each time point separately is desired. 20.4.4
Application
This section is based on Raue et al. (2010). To demonstrate the usage of identifiability and observability analysis for experimental design, the profile likelihood approach will be applied to the core model of Epo and EpoR interaction and trafficking with the corresponding subset of the experimental data. The network representation of the model is shown in Figure 20.6. Using mass-action kinetics the corresponding ODE system is given by: ˙ = −kon · [Epo] · [EpoR] + kon · kD · [Epo EpoR] + kex · [Epo EpoR i] [Epo] ˙ [EpoR] = −kon · [Epo] · [EpoR] + kon · kD · [Epo EpoR] + kt · Bmax − −kt · [EpoR] + kex · [Epo EpoR i] [Epo ˙EpoR] = +kon · [Epo] · [EpoR] − kon · kD · [Epo EpoR] − ke · [Epo EpoR] ˙ [Epo EpoR i] = +ke · [Epo EpoR] − kex · [Epo EpoR i] − kdi · [Epo EpoR i] − −kde · [Epo EpoR i] ˙ [dEpo i] = +kdi · [Epo EpoR i] ˙ e] = +kde · [Epo EpoR i] [dEpo where [ · ] indicates internal model states x corresponding to species concentration. The experimental set-up and the biological background for the model is described in detail in Becker et al. (2010). Briefly, in erythroid progenitor cells the dynamical properties of EpoR determine how signals encoded in the concentration of the ligand Epo are processed at the receptor level and how subsequently downstream signaling cascades such as the JAK2-STAT5 pathway are activated. This leads to cellular responses such as differentiation and proliferation of erythrocytes. A mathematical model is used to infer the dynamical characteristics of ligand binding and ligand as well as receptor trafficking, because unoccupied EpoR is not directly accessible by experiments. 20.4.4.1
20.4.4.1 Initial set-up
In a first experiment, experimental data were recorded in triplicates in two compartments: in the extracellular medium (y1) and bound to EpoR on the cell membrane (y2),

y1 = scale · ([Epo] + [dEpo_e])
y2 = scale · [Epo_EpoR],

see Figure 20.7(a). After parameter estimation, a good agreement of model output and experimental data yielding a value of the objective function χ² = 6.55 for 16 data points and 10 free parameters is obtained, see Section 20.2.2. To investigate the uncertainty of the parameter estimates the profile likelihood of each parameter was evaluated, as displayed in Figure 20.8(a). The calculation takes about 30 seconds per parameter on a normal office computer. The flatness of the profile likelihood reveals that parameters Bmax, Epo0, kD, kon and scale are structurally nonidentifiable. The change of the other parameters along the profile likelihood of one of these parameters
[Figure 20.6: reaction scheme spanning the extracellular medium, plasma membrane and cytoplasm, with species Epo, EpoR, Epo_EpoR, EpoR_i, dEpo_e and dEpo_i, rate constants kon/koff, kt·Bmax, ke, kex, kdi and kde, and measured quantities y1, y2 and y3.]
Figure 20.6 Model of Epo and EpoR interaction and trafficking. Dashed boxes correspond to the quantities accessible by measurements
reveals a functional relation h linking the five structurally nonidentifiable parameters, see e.g. χ²PL(kon) in Figure 20.9. The corresponding effect on the model trajectories x and y of this concerted change in the parameters can be illustrated by plotting the model trajectories for parameter values along the profile likelihood of one of these parameters, see e.g. χ²PL(kon) in Figure 20.7(a). As expected, the model outputs y are not affected, but the trajectories of internal model states x are shifted by a common factor. Hence, in this case the structural nonidentifiability represents a freedom in the choice of the concentration scale and is a result of missing information about absolute concentration in the experimental set-up. Consequently, all parameters containing concentration in their unit are affected. Similar results were obtained in Raue et al. (2009) for a model published by Swameye et al. (2003).
20.4.4.2 Including absolute concentrations
In order to resolve the structural nonidentifiability, the observability analysis presented in Figure 20.7(a) suggests including information about the absolute concentration of one of the species. For the experiment, the cells were treated with 2100 ± 210 pM of ¹²⁵I-Epo. Therefore, a prior distribution Epo0 ∼ N(2100, 210²) of the parameter describing the initial Epo concentration is assumed by penalizing the objective function of (20.11) with an additional summand (Epo0 − 2100)²/210². Recalculating the profile likelihood verifies that the structural nonidentifiability is resolved, see Figure 20.8(b). Nevertheless, the parameter kex remains practically nonidentifiable. Its upper confidence bound is determined at σ+ = −2.230 but its lower confidence bound is not feasible.
[Figure 20.7: time-course panels for (a) the initial set-up, (b) with information about the initial Epo concentration and (c) with measurements of intracellular Epo. Each panel shows the model outputs y1 and y2 (and y3 in panel c) in act. [cpm] and the internal states Epo, EpoR, Epo_EpoR, Epo_EpoR_i, dEpo_i and dEpo_e in conc. [pM] against time [min].]
Figure 20.7 Dependence of the trajectories of the model outputs y and of internal model states x on uncertainties in the parameter estimates. (a) The concerted change of parameters along the structural nonidentifiability of kon does not affect the model outputs but shifts the trajectories of the internal model states by a common factor. (b) The practical nonidentifiability of kex only slightly affects the model outputs y , staying in agreement with the measurement precision of the experimental data. Nevertheless the trajectories of EpoR, Epo EpoR i and dEpo i are affected. (c) The remaining uncertainties of the identifiable parameters translate to confidence intervals of the model trajectories
Figure 20.8 Profile likelihood χ²PL of the model parameters displayed in combination with the thresholds Δ0.95 yielding with df = 1 confidence intervals that hold for each parameter individually. The optimal parameter value is indicated by an asterisk, if unique. (a) The flatness of the profile likelihood reveals that five parameters are structurally nonidentifiable, given the initial set-up. (b) By including information about the initial Epo concentration the structural nonidentifiability can be resolved. Parameter kex remains practically nonidentifiable. (c) By including measurements of intracellular Epo all parameters are made structurally and practically identifiable
Figure 20.9 Initial set-up. The change of the other parameters along the profile likelihood χ²PL(kon) indicates functional relations between all five structurally nonidentifiable parameters (indicated by the solid lines)
20.4.4.3 Including measurement of intracellular Epo
In order to efficiently resolve the practical nonidentifiability of kex, additional measurements need to be planned. Therefore, the variability of the model trajectories along the profile likelihood χ²PL(kex) is investigated, as displayed in Figure 20.7(b). The available model outputs y1 and y2 show only slight variations and hence are not suitable for refined measurements. However, the trajectories of Epo_EpoR_i and dEpo_i show much larger variations and suggest an additional measurement of ¹²⁵I-Epo inside the cell,

y3 = scale · ([Epo_EpoR_i] + [dEpo_i]),

that was recorded in triplicates as well. After including the additional data, the profile likelihoods of all model parameters indicate their structural and practical identifiability, as displayed in Figure 20.8(c).

20.4.4.4 Confidence intervals
The resulting confidence intervals of the now identifiable parameters are finite and their values are given in Table 20.1. Finally, the remaining uncertainty of the now structurally and practically identifiable parameters can be translated to confidence intervals of the model trajectories. Therefore the trajectories corresponding to all acceptable parameter values according to the profile likelihood are evaluated. The resulting upper and lower confidence bands are displayed in Figure 20.7(c) for the threshold Δ0.95 that is also indicated in Figure 20.8(c) for the derivation of confidence intervals of the model parameters. Based on a confidence threshold with df = 1 the confidence bands have to be interpreted point-wise for each time point individually. In order to ensure the appropriateness of the derived confidence intervals, we performed a simulation study by assuming that the estimated parameter values given in Table 20.1 are the true values. Using these values and the same model outputs, measurement time points and measurement noise as in the original experimental dataset, 450 independent datasets were generated. After estimating the parameters for each of these datasets, the resulting distribution of the over-fitting is in line with the χ²df-distribution with df = 10, the number of estimated parameters, see Figure 20.10(a). This verifies that the threshold Δα utilized to derive the likelihood-based confidence intervals was assessed correctly.
Table 20.1 Individual confidence intervals [σ−, σ+] of the model parameters to a confidence level of 95%. Values are given on a log10 scale. The coverage rates CR of the estimates are in line with the expected values (93.33 and 96.66%)

Name    θ̂        σ−       σ+       unit          CR
Bmax    +2.821   +2.710   +2.932   pM            95.78%
Epo0    +3.322   +3.227   +3.400   pM            92.44%
kD      +2.583   +1.641   +2.993   pM            95.78%
kde     −1.884   −1.941   −1.829   1/min         95.78%
kdi     −2.730   −3.083   −2.535   1/min         95.56%
ke      −1.177   −1.203   −1.150   1/min         97.33%
kex     −2.447   −2.764   −2.225   1/min         96.44%
kon     −4.091   −4.208   −3.973   1/(pM·min)    95.56%
kt      −1.758   −1.828   −1.683   1/min         97.33%
scale   +2.210   +2.133   +2.305   cpm/pM        92.89%

20.4.4.5 Coverage rates
In order to compute coverage rates for the parameter estimates, for each of the simulated datasets 95% individual confidence intervals were calculated. According to the 0.05 and 0.95 quantiles of the binomial distribution, the coverage rates CR for 450 simulated datasets and a 95% confidence level are expected to be between 93.33% and 96.66%. The coverage rates are in line with this expectation, see Table 20.1. Coverage rates can also be assessed for the time point-wise confidence bands on the model trajectories shown in Figure 20.7(c), see Figure 20.10(b). The coverage rates are, in most cases, in line with their expected values described by the 0.05 and 0.95 quantiles of the binomial distribution. Coverage rates that are too low are observed for species Epo at t ≈ 20 minutes, for Epo_EpoR at t > 300 minutes and for Epo_EpoR_i between 150 and 250 min. Here, increased attention is indicated when interpreting the results. The deviations occur because the high-dimensional confidence region in parameter space was not sampled densely but approximated by the profile likelihood.
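The binomial expectation quoted above can be reproduced directly; in the sketch below the boolean array `covered`, recording for each simulated dataset whether the true parameter value fell inside its estimated 95% confidence interval, is filled with placeholder values.

```python
import numpy as np
from scipy.stats import binom

n_datasets, level = 450, 0.95

# Placeholder coverage indicators; in the study they come from refitting each simulated dataset
rng = np.random.default_rng(0)
covered = rng.random(n_datasets) < level

coverage_rate = covered.mean()

# Expected range: 0.05 and 0.95 quantiles of Binomial(450, 0.95), i.e. about 93.33% to 96.66%
lower = binom.ppf(0.05, n_datasets, level) / n_datasets
upper = binom.ppf(0.95, n_datasets, level) / n_datasets
print(f"CR = {coverage_rate:.2%}, expected between {lower:.2%} and {upper:.2%}")
```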
20.5 Summary
The iterative cycle between mathematical modeling and experimentation in systems biology facilitates a better understanding of the highly complex processes that occur in cell biology. We elucidated a likelihood-based framework for model identification that allows one to proceed systematically from the building of an ODE model, to the practical aspects of parameter estimation, to the analysis of the uncertainty of parameter estimates in terms of identifiability and confidence intervals, to the corresponding effect of the uncertainty on the model trajectories in terms of observability, and finally to the design of new experiments that efficiently improve model predictability until the relevant biological question can be assessed reliably. Our argumentation was based on the profile likelihood approach and was illustrated by a realistic example of ligand and receptor interaction and trafficking. The profile likelihood is a powerful concept to infer parameter uncertainties in a high-dimensional parameter space. Since it is a systematic and directed exploration of the parameter space it has less computational cost than sampling parameter space randomly, which becomes intractable for high dimensions. The profile likelihood can be calculated for each parameter separately. Thereby it is possible to restrict the analysis to the parameters relevant for the specific question. Moreover, this allows the approach to be perfectly parallelized, which is a major
Figure 20.10 The appropriateness of confidence intervals was assessed by a simulation study with 450 generated data sets. (a) Comparison of the amount of over-fitting to the expected χ²df-distribution with df = 10. (b) The solid lines indicate the coverage rates CR for the time point-wise confidence bands on the model trajectories shown in Fig. 20.7c. The dashed lines are the desired 95% level of confidence, the gray shades are the expected coverage rates.
benefit for its scalability. The approach can be applied to any parameter estimation problem where a likelihood or a similar objective criterion is available, e.g. partial differential equations, stochastic differential equations or a Bayesian framework. The approach results in easily interpretable plots of profile likelihood versus parameter. It can be automated, but an explicit advantage is that the output can be evaluated visually. This gives insight into a complex and high-dimensional parameter space. Structural nonidentifiabilities, originating from incomplete observation of the internal model states, can also be detected. Practical nonidentifiabilities, arising from a limited amount and quality of experimental data, can be inferred as well. Bridging the gap between identifiability and confidence intervals, the profile likelihood allows one to derive likelihood-based confidence intervals for each parameter. Functional relations between parameters occurring due to nonidentifiabilities can be recovered. The results of the approach can, on the one hand, be used to design new experiments that efficiently resolve nonidentifiabilities and narrow confidence intervals, and on the other hand be used for model reduction. Whether a model that is not well determined should be reduced or additional data should be measured depends on the issue being addressed. Thus, identifiability analysis ensures that the model complexity is tailored to the information content given by the experimental data.
Acknowledgements

We thank Clemens Kreutz, Daniel Kaschek, Seong-Hwan Rho, Thomas Maiwald, Verena Becker, Marcel Schilling, Julie Bachmann and Ursula Klingmüller for their support.
This work was supported by the German Federal Ministry of Education and Research (Virtual Liver, LungSys 0315415E, FRISYS 0313921); the European Union (CancerSys EU-FP7 HEALTH-F4-2008-223188); the Initiative and Networking Fund of the Helmholtz Association within the Helmholtz Alliance on Systems Biology (SBCancer DKFZ II.1) and the Excellence Initiative of the German Federal and State Governments (EXC 294).
References

Becker V, Schilling M, Bachmann J, Baumann U, Raue A, Maiwald T, Timmer J and Klingmueller U 2010 Covering a broad dynamic range: information processing at the erythropoietin receptor. Science 328(5984), 1404–1408.
Cobelli C and DiStefano III J 1980 Parameter and structural identifiability concepts and ambiguities: a critical review and analysis. American Journal of Physiology-Regulatory, Integrative, and Comparative Physiology 239(1), 7–24.
Cox D and Hinkley D 1974 Theoretical Statistics. Cambridge University Press.
Egea J, Rodríguez-Fernández M, Banga J and Martí R 2007 Scatter search for chemical and bio-process optimization. Journal of Global Optimization 37(3), 481–503.
Gerstein M, Bruce C, Rozowsky J, Zheng D, Du J, Korbel J, Emanuelsson O, Zhang Z, Weissman S and Snyder M 2007 What is a gene, post-ENCODE? History and updated definition. Genome Research 17(6), 669–681.
Hindmarsh A, Brown P, Grant K, Lee S, Serban R, Shumaker D and Woodward C 2005 SUNDIALS: suite of nonlinear and differential/algebraic equation solvers. ACM Transactions on Mathematical Software 31(3), 363–396.
International Human Genome Sequencing Consortium 2004 Finishing the euchromatic sequence of the human genome. Nature 431(7011), 931–945.
Joshi M, Seidel-Morgenstern A and Kremling A 2006 Exploiting the bootstrap method for quantifying parameter confidence intervals in dynamical systems. Metabolic Engineering 8(5), 447–455.
Kollmann M, Løvdok L, Bartholomé K, Timmer J and Sourjik V 2005 Design principles of a bacterial signalling network. Nature 438(7067), 504–507.
Kreutz C, Rodriguez MMB, Maiwald T, Seidl M, Blum HE, Mohr L and Timmer J 2007 An error model for protein quantification. Bioinformatics 23(20), 2747–2753.
Lehmann E and Leo E 1983 Theory of Point Estimation. John Wiley & Sons, Ltd.
Leis J and Kramer M 1988 The simultaneous solution and sensitivity analysis of systems described by ordinary differential equations. ACM Transactions on Mathematical Software 14(1), 45–60.
Maiwald T and Timmer J 2008 Dynamical modeling and multi-experiment fitting with PottersWheel. Bioinformatics 24(18), 2037–2043.
Meeker W and Escobar L 1995 Teaching about approximate confidence regions based on maximum likelihood estimation. The American Statistician 49(1), 48–53.
Murphy S and van der Vaart A 2000 On profile likelihood. Journal of the American Statistical Association 95(450), 449–485.
Neale M and Miller M 1997 The use of likelihood-based confidence intervals in genetic models. Behavior Genetics 27(2), 113–120.
Press W, Teukolsky S, Flannery B and Vetterling W 1990 Numerical Recipes: FORTRAN. Cambridge University Press.
Raue A, Kreutz C, Maiwald T, Bachmann J, Schilling M, Klingmüller U and Timmer J 2009 Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics 25(15), 1923–1929.
Raue A, Becker V, Klingmüller U and Timmer J 2010 Identifiability and observability analysis for experimental design in non-linear dynamical models. Chaos 20(4), 045105.
Royston P 2007 Profile likelihood for estimation and confidence intervals. Stata Journal 7(3), 376–387.
Schmidt H and Jirstrand M 2006 Systems Biology Toolbox for MATLAB: a computational platform for research in systems biology. Bioinformatics 22(4), 514–515.
Seber G and Wild C 2003 Nonlinear Regression. John Wiley & Sons, Ltd.
Segel I 1993 Enzyme Kinetics: Behavior and Analysis of Rapid Equilibrium and Steady-state Enzyme Systems. Wiley-Interscience.
Swameye I, Müller T, Timmer J, Sandra O and Klingmüller U 2003 Identification of nucleocytoplasmic cycling as a remote sensor in cellular signaling by databased modeling. Proceedings of the National Academy of Sciences of the United States of America 100(3), 1028–1033.
Taniguchi Y, Choi P, Li G, Chen H, Babu M, Hearn J, Emili A and Xie X 2010 Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science 329(5991), 533.
Venzon D and Moolgavkar S 1988 A method for computing profile-likelihood-based confidence intervals. Applied Statistics 37(1), 87–94.
Wolkenhauer O 2008 Systems Biology. Portland Press.
Part E APPLICATION AREAS
21
Inference of Signalling Pathway Models

Tina Toni¹, Juliane Liepe² and Michael P. H. Stumpf²

¹ Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, USA
² Division of Molecular Biosciences, Imperial College London, UK

21.1 Introduction
Cells need to survive in a perpetually changing environment. Living in such complex conditions, they must constantly detect and process external signals and use this information to regulate their internal dynamics. Their highly evolved signal transduction systems pick up environmental, developmental and physiological cues and trigger a generally well-coordinated response by altering gene and protein expression and metabolic processes. Signal transduction is a process that detects extracellular stimuli and converts them into a specific cellular response. The signal or stimuli (e.g. growth factors or hormones) act as ligands that bind to the outside part of the transmembrane receptors. This normally triggers a change (e.g. posttranslational modifications of proteins such as phosphorylation), in the transmembrane part of the receptor. The signal is then further transferred through specialized networks (e.g. phosphorylation cascades and feedback loops) through the cytoplasm, before it enters the nucleus, often in a form of a specific transcription factor, which triggers the expression of specific target genes. Understanding cell signalling is crucial for understanding crucial cellular processes such as differentiation and apoptosis. Furthermore, cell signalling has extensively been studied in association with diseases, such as cancer, where signal transduction networks are thought to be radically affected, in some cases even rewired. Therefore, inference and modelling of signalling pathways has gained considerable attention among interdisciplinary researchers. There are several levels of detail that can be chosen for inference and modelling; the choice depends on the data available and the aims of modelling (Papin et al. 2005). The simplest level is the basic connectivity level, which models the relatedness between the network components. The molecular species can be denoted by a node, and connected by an edge. The next level of sophistication is the reconstruction of causal relationships, where the edges between nodes also involve direction and describe cause-and-effect relationships. The highest level of detail is captured by quantitative mechanistic dynamic models, which
employ stoichiometric matrices and rate constants, and model how molecular numbers or concentrations change in time. Mechanistic models can be further classified into deterministic and stochastic models. Several good reviews on modelling of signalling pathways are available: Cho and Wolkenhauer (2003) offer an introduction to systems modelling; Papin et al. (2005) provide a basic overview of modelling, data collection and model analysis of signalling pathways; Klipp and Liebermeister (2006) cover dynamic modelling of specific signalling pathways; and Christensen et al. (2007) provide a comprehensive review of graph-theoretical network inference and analysis. Furthermore, a whole range of more specialized reviews are available, for example: analysis of the spatio-temporal dimension of signalling (Kholodenko et al. 2010), modelling and analysis of MAPK signalling (Kolch et al. 2005), and inference of conditional independence-based graphical models (Markowetz and Spang 2007). In this chapter we give an overview of approaches for inferring signalling networks. The specific focus will be on mechanistic models of cell signalling dynamics. The future will, however, present almost unprecedented challenges to the inference of the structure and dynamics of such signalling networks: in biological systems, the constituent parts are distributed across the cell and dependence on space and time will have to be incorporated explicitly. Furthermore, stochastic effects pervade the dynamics of such systems and need to be included as well. We will therefore present a flexible but powerful statistical procedure that can be applied across such a wide range of challenging problems. We will outline an approximate Bayesian computation (ABC) based inference approach for dynamic models; we will furthermore illustrate the use of the ABC parameter estimation algorithm in an application to the Akt signalling pathway, and conclude by a general discussion of problems arising in the context of reverse engineering dynamical systems in biology.
21.2 Overview of inference techniques
We structure the overview of inference approaches according to different types of signalling pathway models and according to different sets of biological data which are used for inference. We start with large-scale networks, most of which have been modelled at the level of basic connectivity and work our way to the mechanistic dynamic models, where we discuss deterministic and stochastic modelling frameworks in more detail. Other chapters in this volume also discuss methodologies that can be employed in the context of signal transduction systems. Large-scale signalling networks have most often been constructed using gene expression data from microarray (or increasingly now RNA-seq) experiments. The structure of these networks can be inferred by the use of clustering algorithms or data analysis methods such as principal component analysis or partial least squares. These large-scale approaches study biological networks through clustering and correlation patterns between molecular players such as genes, proteins or metabolites. The modelling and analysis tools exploit ideas from graph theory, statistics and machine learning. Although these studies are crucially important and provide the basis for further analysis, they are inherently qualitative, focus solely on network structure, and therefore provide only a limited insight into or understanding of the underlying biological processes. These models allow for structural analysis of the network, for example studying the connectivity of the nodes (Tavazoie et al.1999; Brown et al. 2000). An additional data source for learning signalling pathways are protein–protein interaction data, which are obtained from yeast two-hybrid or other experimental techniques, and which can be used for inference in combination with gene expression datasets or alone (Scott et al. 2006; Bebek and Yang 2007). As outlined in Chapter 10, the dynamical nature of protein–protein interaction data is increasingly being recognized and nowhere will this matter more than in the context of signal transduction where the flow of information through the networks or pathways involves numerous but sometimes fleeting interactions between proteins (often dependent on post-translational modifications of one or more of the proteins).
One shortcoming of graphical models and in particular Bayesian networks is that while they infer directed interactions, which could reflect cause and effect relationships(Neapolitan 2004), capturing the intricate feedback structures pervading is not straightforward in this framework. This is a major shortcoming when modelling obviously cyclic biological processes, such as metabolic (e.g. the Krebs cycle) and signalling processes or the cell cycle. These models generally cannot capture the essential dynamics and are therefore unable to fully utilize the information contained in time course experimental data. Time series data allow for inference of dynamic models. One of the commonly used probabilistic graphical models are dynamic Bayesian networks (Husmeier 2003; L`ebre 2010) (which circumvent the inability of modelling the cyclic processes by static Bayesian networks). In addition to graphical models, there has recently been a rise in the number of mechanistic signalling pathway models. Mechanistic models are detailed, quantitative dynamic models, often consisting of ordinary or stochastic differential equations. In order to utilize these models for predictions, the quantitative parameter values, for example kinetic rates, need to be known. A number of parameter inference algorithms have been developed to estimate the parameter values or determine parameter distributions. However, before we are able to estimate the parameters, we need to have confidence in the model structure or topology of the network. In cases when the model structure is unknown, it needs to be inferred. One way of approaching this problem is to propose a set of candidate models and then select the best one using model selection tools. In Section 21.3 we give a brief overview of parameter estimation and model selection algorithms for dynamical systems. Time course biological data are most commonly collected over the population of cells, for example by measuring protein concentrations by Western blots, and such data represent measurements of an ‘average cell’. However, with huge efforts going into advancing the experimental technologies, it is increasingly possible to perform experiments on the scale of a single cell (Spiller et al. 2010). This allows for collection of data such as time courses of abundances of specific proteins inside single cells. Commonly used techniques for collecting single-cell protein concentration data are flow cytometry, which can isolate cells with specified properties, time-lapse microscopy, which can quantify fluorescently labelled proteins, and many others – for a comprehensive review see Spiller et al. (2010). Using such techniques, it has been shown that single-cell behaviour is often qualitatively different than the averaged behaviour over a population of cells (Batchelor et al. 2009; Lee et al. 2009). In other words, averaging the measurements over a population of cells can mask important processes and behaviours and cloud our understanding of the behaviour of individual cells. The function of an organism depends on the events in individual cells, as well as how these cells act in concert. It is therefore important to understand both the single-cell dynamics and the dynamics of cell populations, as well as how they relate to each other through cell-to-cell communication and network mechanisms. Along with technological advances, computational methods are being developed for inference from singlecell data. 
Stochastic models capturing single cell dynamics were initially developed for gene transcription and translation (Mettetal et al. 2006; Ingram et al. 2008). Master equations and stochastic differential equations have been employed to capture the dynamic behaviour as well as the noise arising from low numbers of molecular players participating in the biochemical processes of interest. But these models pose their own, considerable statistical challenges which will need to be met. One of the pioneering works using single-cell data to infer signalling networks was performed by Sachs et al. (2005). They employed molecular interventions to perturb cells and used flow cytometry to collect data simultaneously from thousands of cells; these data were then fed into a Bayesian network inference algorithm which reconstructed the structure, but not necessarily the complete dynamical behaviour of such systems. Apart from Bayesian networks, mechanistic models have also been employed to capture heterogeneity between cells (Waldherr et al. 2009). Wilkinson (2009) argues that the heterogeneity can either be modelled by stochastic differential equations or by deterministic differential equations. The former approach is appropriate
when stochasticity arises due to low number of molecules, while the latter case is more appropriate for systems where molecules are present in large numbers, and where heterogeneity is governed by different cell sizes, numbers of ribosomes, stages in cell cycle, etc. Instead of using stochastic equations, an alternative way of accounting for heterogeneity is by assuming that each cell follows deterministic dynamics, but this dynamics is ruled by slightly different initial conditions and parameter values. We can of course also combine stochastic kinetics with variable initial conditions and rate constants.
21.3 Parameter inference and model selection for dynamical systems
This section gives an overview over methods from frequentist and Bayesian statistics for estimation of parameters, and for choosing which model is best for modelling the underlying system. There are two broad schools of thought in statistical inference: frequentist and Bayesian, and the two different schools are encountered throughout this handbook. In frequentist statistics in a parameter estimation framework one talks about point estimates and confidence intervals around them. The likelihood function is a central concept in statistical inference, and is used in both frequentist and Bayesian settings. The likelihood function is a function of parameters θ, L(θ) = P(D|θ). When P(D|θ) is considered as a function of data D with fixed parameters θ, it is a probability density function (for continuous distributions) or probability mass function (for discrete distributions). When it is considered as a function of θ, then we call it the likelihood function. The established way of obtaining a point estimate of the parameters is by taking a maximum likelihood estimate; i.e. the set of parameters for which L(θ) is maximal. On the other hand, Bayesian statistics is based on probability distributions that are defined more generally than simply based on the frequency of events. Here one aims to obtain the posterior probability distribution over the model parameters, which is proportional to the product of a suitable prior distribution (which summarizes our prior knowledge or beliefs about the system under investigation, e.g. reaction rates cannot be negative, or other biophysical considerations) and the likelihood (the information that is obtained from the data), P(θ|D) =
P(D|θ)P(θ) / P(D) ∝ P(D|θ)P(θ).
Both frequentist and Bayesian statistics offer tools to estimate parameters of ordinary differential equation and master equation models, and for choosing which model has the highest support from the data. Systems biology models include numerous parameters and it is generally impossible to obtain all of these values by experimental measurements alone. Therefore parameter inference (also referred to as model calibration, model fitting or parameter estimation by different authors) algorithms have to be used to estimate these parameter values computationally. A variety of different approaches have been developed and are being used; they all share two main ingredients: a cost function, which specifies the distance between the model and experimental data; and an optimization algorithm, which searches for parameters that optimize the cost function. The most commonly used cost functions in a frequentist approach are the likelihood (one wants to maximize it) and the least squares error (one wants to minimize it). The Bayesian ‘equivalent’ to a cost function is the Bayesian posterior distribution. There are many different kinds of optimization algorithms. Their goal is to explore the landscape defined by a cost function and find the optimum (i.e. minimum or maximum, depending on the type of cost function used).
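As a concrete illustration of a cost function, the following sketch defines a noise-weighted least-squares error between time-course data and the output of a dynamic model; `simulate_model` is a hypothetical stand-in for a numerical ODE solver and all argument names are assumptions, not part of any particular library.

```python
import numpy as np

def least_squares_cost(theta, times, data, sigma, simulate_model):
    """Sum of squared, noise-weighted residuals between model prediction and data."""
    prediction = simulate_model(theta, times)    # hypothetical solver call
    residuals = (data - prediction) / sigma      # weight each point by its measurement error
    return np.sum(residuals ** 2)

# An optimizer then searches the landscape defined by this cost, e.g.
# scipy.optimize.minimize(least_squares_cost, theta0, args=(times, data, sigma, simulate_model))
```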
The simplest are local gradient descent methods [e.g. Newton’s method, Levenberg–Marquardt (Levenberg 1944; Marquardt 1963)]. These methods are computationally efficient and fast, but are only able to find local optima. When the cost function landscape is complex, which is often the case for systems biology models with high-dimensional parameter spaces, these methods are unlikely to find the global optimum, and in this case more sophisticated methods need to be used. Multiple shooting (Peifer and Timmer 2007) performs better in terms of avoiding getting stuck in local optima, but as argued by Brewer et al. (2007), it may perform poorly when measurements are sparse or noisy. A large class of optimization methods are global optimization methods that try to explore complex surfaces as widely as possible; among these, genetic algorithms are particularly well known and have been applied to ordinary differential equation models (Vera et al. 2008). Moles et al. (2003) tested several global optimization algorithms on a 36-parameter biochemical pathway model and showed that the best performing algorithm was a stochastic ranking evolutionary strategy (Runarsson and Yao 2000). Further improvements in computational efficiency of this algorithm were obtained by hybrid algorithms incorporating local gradient search and multiple shooting methods (Rodriguez-Fernandez et al. 2006; Balsa-Canto et al. 2008). However, in practice even so-called global algorithms may get attracted to local optima and therefore several runs from different starting conditions may be required in order to gain confidence in the obtained optima. And, even after a suitable global maximum has been found, these approaches do not provide reliable confidence estimates; these need to be obtained using additional techniques. To obtain an ensemble of good parameter values, an approach based on simulated annealing (Kirkpatrick et al. 1983) and a Monte Carlo search through parameter space can be used (Brown and Sethna 2003; Brown et al. 2004). In a Bayesian setting, Markov chain Monte Carlo (MCMC) methods (Vyshemirsky and Girolami 2008b) [software available (Vyshemirsky and Girolami 2008a)] and Kalman filtering (Quach et al. 2007) have been applied to estimate the posterior distribution of parameters. Bayesian methods do not only estimate confidence intervals, but provide even more information by estimating, in principle at least, the whole posterior parameter distribution. Sometimes a point estimate is also used in a Bayesian setting; it is called a maximum a posteriori (MAP) estimate and is the mode of the posterior distribution. To obtain confidence intervals for a point estimate in a frequentist setting, a range of techniques can be applied that include variance-covariance matrix-based techniques (Bard 1974), profile likelihood (Venzon and Moolgavkar 1988) and bootstrap methods (Efron and Tibshirani 1993). Parameter inference of stochastic dynamic models is harder and less developed, owing to the generally intractable analytical form of the likelihood. Both non-Bayesian (Reinker et al. 2006; Tian et al. 2007) and Bayesian approaches using MCMC have been developed and extensively applied in financial econometrics (Johannes and Polson 2005), and recently in biology (Boys et al. 2008; Gilioli et al. 2008). 
Discrete stochastic master equations can be approximated using the diffusion approximation to obtain the chemical Langevin equation, for which more computationally efficient parameter estimation algorithms have been developed (Golightly and Wilkinson 2005, 2006; Wilkinson 2006). Because the models are stochastic, the probability model is given and not assumed (e.g. in the form of an error model). This makes the interpretation of the likelihood, although certainly not its evaluation, more straightforward. Parameter estimation should be accompanied by identifiability and sensitivity analyses. If a parameter is nonidentifiable, this means it is difficult or impossible to estimate due to either model structure (structural nonidentifiability) or insufficient amount or quality of data (statistical nonidentifiability) (Yue et al. 2006; Hengl et al. 2007; Schmidt et al. 2008). Structurally nonidentifiable parameters should ideally be removed from the model. Sensitivity and robustness analyses study how model output behaves when varying inputs such as parameters (Saltelli et al. 2008) or combinations of parameters. If the model output changes a lot when certain parameters are varied slightly, we say that the model is sensitive to changes in those parameter combinations. Recently, the related concept of sloppiness has been introduced by Sethna and co-workers (Brown and Sethna 2003; Gutenkunst et al. 2007). They call a model ‘sloppy’ when the parameter sensitivity eigenvalues are distributed over many orders of magnitude; those parameter combinations with large eigenvalues are called
‘sloppy’ and those with low eigenvalues ‘stiff’. Sloppy parameters are hard to infer and carry very little discriminatory information about the model. The concepts of identifiability, sloppiness and parameter sensitivity are related: nonidentifiable parameters and sloppy parameters are hard to estimate precisely because they can be varied a lot without having a large effect on model outputs; the corresponding parameter estimates will thus have large variances. A parameter with large variance can, in a sensitivity context, be interpreted as one to which the model is not sensitive if the parameter changes. It may therefore in practice be less important (but also harder) to obtain reliable estimates for such parameters. We will comment on this at the end of this chapter.
21.3.1 Model selection
Model selection methods strive to rank the candidate models, which represent different hypotheses about the underlying system, relative to each other according to how well they explain the experimental data. It is important to keep in mind that the chosen model is not the ‘true’ model, but the best model from the set of candidate models (Box and Draper 1987). It is the best model available for making inferences and predictions from the data. In general, including more parameters in a model allows a better fit to the data to be achieved. In the extreme case where the number of parameters equals or exceeds the number of data points, there is always a way of setting the parameters so that the fit will be perfect. This is called overfitting. Wel famously addressed the question of ‘how many parameters it takes to fit an elephant’ (Wel 1975), which practically suggests that if one takes a sufficiently large number of parameters, a good fit can always be achieved. The other extreme is underfitting, which results from using too few parameters or too inflexible a model. A good model selection algorithm should follow the principle of parsimony, also referred to as Occam’s razor (MacKay 2003), which aims to determine the model with the smallest possible number of parameters that adequately represents the data and what is known about the system under consideration. Occam’s razor has never been mathematically formalized and by itself is therefore best seen as an ideal to aspire to, rather than providing a recipe to follow in practice. Probably the best known method for model selection is (frequentist) hypothesis testing. If models are nested (i.e. one model can be obtained from the other by setting some parameter values to zero), then model selection is generally performed using the likelihood ratio test (Swameye et al. 2003; Timmer and Muller 2004). If two models have the same number of parameters and if there is no underlying biological reason to choose one model over the other, then we choose the one which has a higher maximum likelihood. However, if the numbers of parameters differ, then the likelihood ratio test penalizes overparameterization. If the models are not nested, then model selection becomes more difficult but a variety of approaches have been developed that can be applied in such situations. Bootstrap methods (Efron and Tibshirani 1993; Timmer and Muller 2004) are based on drawing many so-called bootstrap samples from the original data by sampling with replacement, and calculating the statistic of interest (e.g. an achieved significance level of a hypothesis test) for all of these samples. This distribution is then compared with the real data in order to determine if a model can reproduce or resemble the data to within a specified level of consistency. Other model selection methods applicable to non-nested models are based on information-theoretic criteria (Burnham and Anderson 2002) such as the Akaike information criterion (AIC) (Akaike 1973, 1974; Swameye et al. 2003; Timmer and Muller 2004). These methods involve a goodness-of-fit term and a term measuring the parametric complexity of the model. The purpose of this complexity term is to penalize models with a high number of parameters; the criteria by which this term is chosen differs considerably among the methods. While they are based on sound statistical and information theoretic criteria, they are frequently seen as ad hoc criteria.
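For illustration, the sketch below computes the AIC, AIC = 2k − 2 ln L, for two hypothetical candidate models and ranks them; the maximized log-likelihoods and parameter counts are placeholders.

```python
def aic(log_likelihood, n_params):
    # Akaike information criterion: goodness of fit penalized by the number of parameters
    return 2.0 * n_params - 2.0 * log_likelihood

# Illustrative maximized log-likelihoods and parameter counts (placeholders)
candidates = {"model A": (-120.4, 6), "model B": (-118.9, 9)}

scores = {name: aic(logL, k) for name, (logL, k) in candidates.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {score:.1f}")   # the smallest AIC identifies the preferred model
```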
In a Bayesian setting model selection is done through so-called Bayes factors [for a comprehensive review see Kass and Raftery (1995)]. We consider two models, m1 and m2 , and would like to determine which model explains the data D better. The Bayes factor measuring the support of model m1 compared with model m2 , is defined by B12 =
P(D|m1) / P(D|m2) = P(m1|D)P(m2) / (P(m2|D)P(m1)),
where P(D|mi ) is the marginal likelihood, P(mi ) is the prior and P(mi |D) is the marginal posterior distribution of model mi , i = 1, 2. The Bayes factor is a summary of the evidence provided by the data in favour of one statistical model over another.
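A small numerical sketch of this definition follows; the marginal likelihoods are placeholder values (estimating them reliably is itself a hard computational problem), and the rough interpretation scale is that of Kass and Raftery (1995).

```python
import numpy as np

# Placeholder log marginal likelihoods, ln P(D|m1) and ln P(D|m2)
log_evidence_m1 = -215.2
log_evidence_m2 = -218.7

log_B12 = log_evidence_m1 - log_evidence_m2    # ln of the Bayes factor in favour of m1
print(f"B12 = {np.exp(log_B12):.1f}, 2 ln B12 = {2.0 * log_B12:.1f}")
# Kass and Raftery (1995): 2 ln B12 of 2-6 is 'positive', 6-10 'strong',
# and above 10 'very strong' evidence for model m1 over m2.
```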
21.4 Approximate Bayesian computation
In addition to the methods outlined above, all of which have been applied to signal transduction networks, recent times have seen growing interest in problems of such intricacy and complexity that likelihood-based approaches become quickly intractable. Here, if we still want to apply the paraphernalia of a Bayesian framework, so-called ABC approaches offer a way out. They exploit the computational efficiency of modern simulation techniques by replacing calculation of the likelihood with a comparison between the observed data and simulated data. Let θ be a parameter vector to be estimated. Given the prior distribution, P(θ), the goal is to approximate the posterior distribution, P(θ|D0) ∝ f(D0|θ)P(θ), where f(D0|θ) is the likelihood of θ given the data D0. ABC methods have the following generic form:

1. Sample a candidate parameter vector θ∗ from some proposal distribution P(θ).
2. Simulate a dataset D∗ from the model described by a conditional probability distribution f(D|θ∗).
3. Compare the simulated dataset, D∗, with the experimental data, D0, using a distance function, d, and tolerance ε; if d(D0, D∗) ≤ ε, accept θ∗.

The tolerance ε ≥ 0 is the desired level of agreement between D0 and D∗. The output of an ABC algorithm is a sample of parameters from a distribution P(θ|d(D0, D∗) ≤ ε). If ε is sufficiently small, then the distribution P(θ|d(D0, D∗) ≤ ε) will be a good approximation for the posterior distribution, P(θ|D0). Frequently, in an ABC setting we focus on comparing summary statistics of the data rather than the data directly. In the context of signal transduction, which requires time course measurements, we can use the data directly and we will therefore use this approach here. This has the added advantage of avoiding problems with model selection that have recently been pointed out by Robert et al. (2011). The most basic ABC algorithm outlined above is known as the ABC rejection algorithm; however, recently more sophisticated and computationally efficient ABC methods have been developed. They are based on MCMC (ABC MCMC) (Marjoram et al. 2003) and sequential Monte Carlo (ABC SMC) techniques (Sisson et al. 2007; Beaumont et al. 2009; Del Moral et al. 2009; Toni et al. 2009). ABC algorithms can be applied to either deterministic or stochastic models. They can be utilized for parameter estimation, i.e. to estimate the posterior distribution of the multidimensional parameter vector. Another application of ABC is model selection, where the algorithm distinguishes between candidate models
and ranks them on the basis of how well they can explain the experimental data, or, in more precise terms, a posterior distribution over the models is inferred. Along with these uses, ABC also provides information on the sensitivity of parameters. In this section we present an application of ABC for parameter estimation of a deterministic Akt signalling pathway model. For applications of ABC inference to stochastic dynamic models – which proceed analogously – and model selection for dynamical systems we refer the reader to Toni et al. (2009) and Toni and Stumpf (2010). Software implementing these techniques is also available (Liepe et al. 2010).
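The generic scheme above can be written down directly; the sketch below of the basic ABC rejection sampler assumes that a prior sampler, a simulator and a distance function are supplied by the user (all of these names are hypothetical).

```python
import numpy as np

def abc_rejection(n_accept, sample_prior, simulate, distance, data, epsilon):
    """Basic ABC rejection sampler.

    sample_prior : function() -> candidate parameter vector theta*
    simulate     : function(theta) -> simulated dataset D*
    distance     : function(D0, D_sim) -> non-negative distance d(D0, D*)
    epsilon      : tolerance; accepted parameters sample P(theta | d(D0, D*) <= epsilon)
    """
    accepted = []
    while len(accepted) < n_accept:
        theta = sample_prior()                    # step 1: propose from the prior
        simulated = simulate(theta)               # step 2: simulate data under theta
        if distance(data, simulated) <= epsilon:  # step 3: accept if close enough to the data
            accepted.append(theta)
    return np.array(accepted)
```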
21.5 Application: Akt signalling pathway
The Akt pathway (also known as the Protein kinase B pathway) is a mammalian signalling pathway, which regulates cellular processes such as growth, proliferation and apoptosis in response to different growth factors (Manning and Cantley 2007). It has been extensively studied (Brazil and Hemmings 2001; Woodgett 2005) due to its medical importance, since abnormalities in Akt activation can lead to a variety of complex diseases, including type-2 diabetes and cancer.
Figure 21.1 Experimental data measurements of the response of proteins pEGFR, pTrkA, pAkt and pS6 to EGF (a) and NGF (b) stimulation. The concentration of each phosphorylated protein was measured by a Western blot for six stimulation concentrations (see key). Circles denote the measurements at eight time points. Concentrations are given in ng/ml
Figure 21.2 Schematic overview of the Akt signalling pathway model. The EGF or NGF growth factor binding leads to the phosphorylation of its respective receptor (pEGFR or pTrkA), which then activates the signalling cascade. Akt is recruited to the membrane (pEGFR-Akt, pTrkA-Akt) and consequently phosphorylated (pAkt). Phosphorylated Akt creates a complex with the ribosomal protein S6 (pAkt-S6) and phosphorylates it (pS6). Reactions proTrkA ↔ TrkA and proEGFR ↔ EGFR model receptors’ turnover. An ordinary differential equation model assuming mass action kinetics was generated based on these reactions
Even the simple Akt model studied here is an example of a model with a high-dimensional parameter space (19 parameters). This is the first time that the ABC SMC approach is applied to such a high-dimensional problem. Here we test the performance of ABC SMC on the Akt model and explore the limits of the approach. We also aim to demonstrate what can be gained by analysing the whole distribution rather than merely confining analyses to point estimates. We use the Akt model of Fujita et al. (2010) and their experimental data to infer the posterior parameter distribution. We then use this distribution to study the sensitivity of the Akt model to its parameters. The experiments were conducted by stimulating the PC12 cells1 with epidermal growth factor (EGF) and nerve growth factor (NGF); six stimulation concentrations were used for each growth factor, which resulted in twelve different datasets. Figure 21.1 presents the quantitative time course data of the following phosphorylated molecular species: EGF receptor (EGFR), TrkA (one of the subunits of the NGF receptor), Akt (a kinase) and S6 (a ribosomal protein). For a schematic overview and explanation of the Akt pathway model see Figure 21.2. 1
PC12 cells are a cell line derived from a rat tumour and are among the most commonly used experimental cell lines of eukaryotic model systems. Specific responses to different growth factors make them a good model system to study neuronal differentiation (Sasagawa et al. 2005). When treated with EGF PC12 cells divide and proliferate and when treated with NGF they stop dividing and terminally differentiate into nerve cells.
So far, only point parameter estimates have been determined; Fujita et al. obtained them by using a combination of evolutionary programming and the Levenberg–Marquardt algorithm (Hoops et al. 2006). Having the point estimates available, we are now interested in obtaining additional information from the datasets.
21.5.1 Exploring different distance functions
In the limit where ε → 0 we know that the ABC posterior yields a good approximation to the true posterior over model parameters. However, in practical applications to high-dimensional problems this limit may be hard or even impossible to obtain in practice in finite time. To safeguard our analyses we may therefore have to take care in the choice of the appropriate distance measure (Deza and Deza 2006). We illustrate this choice, frequently not considered in the ABC literature, here. We first used the Euclidean distance function:

d1(x, x∗) = ∑_{s=1}^{n} ∑_{i=1}^{Ns} (x_{i,s} − x∗_{i,s})²,
Figure 21.3 Best fits resulting from ABC SMC using both distance functions. (a) The EGF branch. (b) The NGF branch. (Left) Fits using distance function d1 . (Right) Fits using distance function d2 . Distance function d2 results in a better fit to experimental data (Figure 21.1)
Figure 21.3 (Continued)
where n is the number of datasets, Ns is the number of data points in a given dataset (Ns = 8 here for our Akt data), x are the experimental data and x∗ the simulated data. This distance function is most widely used in fitting models to data; however, it has some disadvantages when used on particular datasets. It compares absolute concentration values, and because of that it gives more weight to distances calculated on high concentration data than those calculated on low concentration data. This is an intrinsic problem in signal transduction where the absolute abundances of molecular species can differ by orders of magnitude. Using this distance function, we implicitly assign higher importance to the fit between the ‘red’ compared with the ‘blue’ (see Figure 21.1) simulated and experimental datasets. This way, the parameters responsible for determining the ‘blue simulations’ are not informed by the data. We therefore explore another, weighted distance function

d2(x, x∗) = max_{s=1,...,n} [ (1/max(x_s)) ∑_{i=1}^{Ns} (x_{i,s} − x∗_{i,s})² ].
The best fits obtained by using these two distance functions are presented in Figure 21.3. Comparison with the experimental data in Figure 21.1 shows that the distance function d2 results in better fits to experimental data for small concentrations (blue lines) than the fits obtained by the distance function d1 . We can conclude that the choice of a distance function can influence the quality of fits and it is therefore important to choose a good distance function for the inference in any practical application.
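For completeness, a direct transcription of the two distance functions as written above is sketched below; `x_data` and `x_sim` are assumed to be arrays with one row per dataset (stimulation condition) and one column per time point, and the example values are arbitrary.

```python
import numpy as np

def d1(x_data, x_sim):
    # Unweighted squared-error distance; dominated by the high-concentration datasets
    return np.sum((x_data - x_sim) ** 2)

def d2(x_data, x_sim):
    # Each dataset s is normalized by its maximal measured value before the squared
    # errors are summed; the largest normalized error over the datasets is returned
    per_dataset = np.sum((x_data - x_sim) ** 2, axis=1) / np.max(x_data, axis=1)
    return np.max(per_dataset)

# Arbitrary example: 6 stimulation concentrations, 8 time points each
rng = np.random.default_rng(1)
x_data = rng.random((6, 8)) * 1000.0
x_sim = x_data + rng.normal(scale=10.0, size=x_data.shape)
print(d1(x_data, x_sim), d2(x_data, x_sim))
```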
21.5.2 Posteriors
The obtained posterior parameter distribution is 19-dimensional and one way to visualize it is through two-dimensional projections (Figure 21.4). Several interesting patterns among certain parameter combinations can be observed (Figure 21.5). For example, parameters p1 and p2 exhibit a positive correlation; parameters p12 and p16 form a weak ‘C shape’ suggesting a more complex relationship between them; parameters p10, p11 and p15 have a narrow posterior compared with their respective prior distributions, which indicates that these parameters are more inferable than the others (with respect to the chosen prior distributions); we can also say that they are more informed by the data. Analysing Figure 21.4 and Figure 21.5 already demonstrates the advantage of having the whole posterior (or its ABC version) at hand: it illustrates which parameters can be inferred from present data, and which parameters are not inferred very well. We have to remember that any approach based on optimization will give us only a single point, from which it is impossible to gauge reliability of the parameter estimates, or how informed they are by the data. Instead any such analysis has to be followed up by either a sensitivity (where single parameters are varied and the change in the output is quantified) or robustness (where the parameter space is sampled simultaneously) analysis in order to gain such knowledge. It has been shown that ABC-based approaches (like conventional Bayesian approaches) incorporate such analyses implicitly (Secrier et al. 2009) and we will discuss this in more detail below.
21.5.3 Parameter sensitivity through marginal posterior distributions
We start with an analysis of the marginal parameter distributions obtained by ABC SMC by looking at how the 90% interquantile ranges change across populations. This analysis informs us about parameter inferability and sensitivity of the model to individual parameters, as before. If parameter ranges shrink quickly and considerably, we conclude that the model is sensitive to changes in these parameters. If, on the other hand, parameter ranges stay broad, then the model output does not change much with the parameter, and therefore we conclude that the model is not sensitive to changes in such parameters. The resulting interquantile ranges for the Akt model are presented in Figure 21.6. This analysis is perhaps specific to ABC SMC approaches (Toni et al. 2009) where it can be used on the fly as the sequential Monte Carlo sampler converges through a sequence of intermediate distributions towards the posterior.
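A minimal sketch of this diagnostic is given below; `populations` is assumed to be a list of arrays, one per ABC SMC population, each holding the accepted particles (rows) for all parameters (columns).

```python
import numpy as np

def interquantile_ranges(populations, lower=0.05, upper=0.95):
    """90% interquantile range of every marginal in every ABC SMC population."""
    ranges = []
    for particles in populations:                       # particles: (n_particles, n_params)
        q_low = np.quantile(particles, lower, axis=0)
        q_high = np.quantile(particles, upper, axis=0)
        ranges.append(q_high - q_low)
    return np.array(ranges)                             # shape (n_populations, n_params)

# Parameters whose range barely shrinks relative to the prior range are candidates
# for parameters to which the model output is insensitive.
```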
21.5.4 Sensitivity analysis by principal component analysis (PCA)
Considering the marginals is informative but fails to provide information on correlations between the effects different parameters can have on a system’s dynamics. An easy and informative way to overcome these limitations is to apply PCA to the variance-covariance matrix calculated from the sample approximating the posterior distribution of the Akt parameters. The PCA output informs us which parameter combinations are stiff and sloppy. If we wish to know which individual (or raw) parameters are stiff and sloppy we need to project the eigenvectors of the stiff and sloppy directions onto raw parameters. We tackle this projection in two different ways and compare the results. Assume the output of the PCA is given in the following form: the eigenvalues are given as a vector (λ1 , . . . , λm ) and the eigenvectors as columns in a matrix
$$\begin{pmatrix} x_{1,1} & \cdots & x_{m,1} \\ \vdots & \ddots & \vdots \\ x_{1,n} & \cdots & x_{m,n} \end{pmatrix}$$
Figure 21.4 Projections of the posterior parameter distribution of the Akt model. Narrow uniform prior distributions were chosen by taking the maximum likelihood estimates of Fujita et al. (2010) as a guide. Perturbation kernels and distance functions are defined in the main text. Number of particles N = 500. Number of populations T = 13. Different colours represent accepted particles from different populations: black (1), grey (5), blue (8), green (10), red (12), yellow (13)
Figure 21.5 Selected two-dimensional projections of posterior distribution scatterplots
We can then define the projection of eigenvector a onto a raw parameter b, pa,b, in two ways as
$$p^{(1)}_{a,b} = \frac{x_{a,b}^2}{\sum_{i=1}^{n} x_{a,i}^2} = x_{a,b}^2$$
or
$$p^{(2)}_{a,b} = 2\lambda_a\,|x_{a,b}|, \qquad a = 1, \ldots, m, \; b = 1, \ldots, n,$$
where | | denotes the absolute value of a real number. Projection p(1)a,b represents the square of the eigenvector element that is pointing in the direction of a raw parameter b, while the projection p(2)a,b calculates the length of the intersection of the PCA-determined ellipsoid with the raw parameter b axis. Reassuringly, both projection approaches determine the same sets of parameters as stiff and sloppy combinations (Figure 21.7): the analysis suggests that the stiffest parameters, which are determined by PC 19, are parameters p3, p13 and p16, followed by parameters p9, p10, p12, p1 and p2. We find that it is hard to determine individual sloppy parameters for this model, as PCs 1–5 spread equally across all the parameters (pie charts not shown). These results differ from the sensitivity results that we obtained by looking at the marginal distributions, which reminds us that multidimensional posterior distributions cannot simply be treated as the product of the marginals, and reiterates the need for multivariate sensitivity or robustness analysis methodologies. The parameter sensitivity analysis results can be interpreted in light of data used for inference and the proposed model structure. For example, we found that parameters p13, p15 and p17 are correlated, which can be explained by the fact that they drive the cycle of reactions of complex pAkt-S6 formation, phosphorylation and dephosphorylation of pS6; the correlation between these three parameters is necessary to maintain the circular flow of these three reactions, which maintain a steady state among the molecular species considered
Figure 21.6 The figure shows how quantile ranges of marginal parameter populations change through populations. Each plot shows the 5% and 95% quantiles, the mean and the median for this simulation over each population. The interquantile ranges shrink the most for parameters p10 , p11 , p12 and p15 . This result indicates that the system is the most sensitive to changes in these parameters. Parameters that do not influence the model much (their posterior parameter range is roughly equal to the prior parameter range) are p8 and p14
Figure 21.7 Stiff principal components (PCs) resulting from the PCA analysis using projections p(1)a,b and p(2)a,b. The histogram shows the proportions of variance explained by each PC
here. Note that their respective conditional distributions are informative: if any of the parameters were measured or known more precisely, the other two would be automatically determined to high accuracy. Reassuringly, stiff parameter combinations were found for kinetic rates involved in reactions between measured species, for example p9 , p10 and p16 . These parameters drive the reactions for which the data contain most information. However, parameters describing the production and degradation of the measured receptor complex pEGFR-Akt cannot be described as stiff parameters. The reason probably lies in the nature of the measurement, which is the sum of bound and unbound phosphorylated EGF receptor (pEGFR-Akt and pEGFR) rather than the individual quantities. These data simply do not contain enough information for inference of details of such complex model descriptions. However, it would potentially be possible to reduce the model by summarizing the involved reaction into a single reaction. PCA could be used for this purpose as well – it is after all a dimension reduction technique – but a more productive approach would be to use further in-silico analysis in order to identify experimental scenarios or stimuli that allow estimation of more or of additional parameters (Apgar et al. 2008). In this section we have applied ABC SMC to a model of the Akt pathway with a high-dimensional parameter space and using real data. The obtained posterior parameter distribution has then been used to determine the sensitivity of parameters. We illustrated the possibility of choosing different distance functions, and two different ways of obtaining information about parameter sensitivity.
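The PCA-based sensitivity analysis described above can be sketched in a few lines of Python; the posterior sample below is synthetic, and the two projections follow the definitions of p(1)a,b and p(2)a,b given earlier (taking the formula for p(2) as printed).

import numpy as np

# Synthetic stand-in for a posterior sample: N particles x n parameters.
rng = np.random.default_rng(2)
sample = rng.multivariate_normal(mean=[1.0, 2.0, 0.5],
                                 cov=[[1.0, 0.8, 0.0],
                                      [0.8, 1.0, 0.0],
                                      [0.0, 0.0, 0.01]],
                                 size=1000)

cov = np.cov(sample, rowvar=False)        # variance-covariance matrix of the posterior sample
eigval, eigvec = np.linalg.eigh(cov)      # eigenvalues ascending; columns of eigvec are eigenvectors

# Projection (1): squared eigenvector components (eigenvectors have unit length).
p1 = eigvec ** 2                          # p1[b, a] projects PC a onto raw parameter b

# Projection (2): intersection of the PCA ellipsoid with each raw axis, 2 * lambda_a * |x_{a,b}|.
p2 = 2.0 * eigval[np.newaxis, :] * np.abs(eigvec)

stiffest_pc = 0                           # smallest posterior variance = stiffest direction
print("stiff direction loads on parameters:", np.argsort(p1[:, stiffest_pc])[::-1])
print("proportion of variance per PC:", eigval / eigval.sum())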
21.6 Conclusion
Development of inferential methods has had considerable attention from the statistical as well as computational biology communities. However, inference itself is not the end of the story. The purpose of building a model is that the model can be used for theoretical and computational analysis of the modelled biological system. For example, the model can give insights into understanding complexity that is beyond scientists' intuition. The model can be used for making predictions of a system's behaviour under changed environmental conditions or predicting outcomes of different perturbations or disruptions to a system such as knocking out a certain gene. To name just a few examples, models of signalling pathways have been used for studying the role of the feedback loops, to determine the robustness of the system to different perturbations, to understand how signal amplifies while travelling through the pathway, to study crosstalk between different signalling pathways, and in order to understand how single cells act in concert to form well-coordinated cell populations (Klipp and Liebermeister 2006). In order to obtain such models we have to rely on statistical methods as the underlying molecular mechanisms often prove elusive to modern experimental methodologies: not every quantity of interest can be measured and in most cases we observe only a small fraction of the constituent parts of a system directly. The hidden interactions need to be determined from careful analysis of data in light of mechanistic models, which frequently have to be inferred along the way. This poses enormous statistical challenges. Here we conclude by briefly outlining some of the major challenges:

• Single-cell level data exhibit high levels of cell-to-cell variation. The causes and phenotypic effects of this variability are at present unknown but their biological perspective is obvious: such variation underlies decision-making processes at the cellular level that are closely linked to e.g. spontaneous development of cancer, or the processes deciding when a stem cell divides and how cells differentiate.
• Inference is difficult at the best of times and there has been a dearth of statistical studies looking at the interplay between the dynamical features of a system and our ability to reconstruct such systems from experimental data (Kirk et al. 2008). Several chapters in this handbook provide the reader with the basic concepts from dynamical systems theory that need to be considered when trying to reverse engineer the structure and dynamics of signal transduction systems.
• A particularly intriguing but apparently genuine feature of dynamical systems (Brown and Sethna 2003; Gutenkunst et al. 2007) is the fact that dynamical systems data only allow us to estimate a subset of the parameters of the system (in the absence of e.g. stimulus design). Sloppy and stiff parameters appear to make up the majority of parameters. While this means that some parameters may not be inferable in a statistical sense, it also means that the system output is robust to varying such parameters. This in turn means that inferability and parameter sensitivity/robustness are intricately linked and can fruitfully be considered together in the statistical analysis of dynamical systems in biology. Here the advantage of approaches which focus on distributions rather than point estimates comes to the fore.
• In our example above we have assumed implicitly that the mathematical model is adequate. Clearly, however, any analysis based on an incorrect model is likely to result in exaggerated confidence or credible intervals. What is required are statistical approaches which allow us to determine if a model does fit the data. Such model checking techniques are in many respects related to sensitivity analysis and model invalidation techniques (Anderson and Papachristodoulou 2009). Well established techniques exist for model checking in a Bayesian framework (Gelman et al. 2004), and Ratmann et al. (2009) have developed related methods for ABC inference.
• Ultimately, our models will need to address the spatial dynamics as well. Inference for stochastic, dynamical and spatially structured systems is in its infancy. While ABC-based approaches have the flexibility to deal with such data, their application is computationally expensive and their convergence needs to be carefully monitored. In the near future, however, they will provide perhaps the most obvious solution to such reverse engineering challenges.
We want to conclude with one final caveat that pervades much of systems biology: none of the signalling pathways considered are autonomous, closed systems that can be studied independently of the rest of the cellular machinery. Metabolic processes, protein–protein interactions, regulatory interactions and a host of small-molecule interactions can influence the transduction and processing of signals inside living systems. Identifying which we can and need to incorporate into our analysis – especially when we are considering biomedical interventions – is a scientific challenge of perhaps unprecedented scale, and requires close integration of powerful statistical reasoning techniques with experimental methodology development and biological domain knowledge. Here we only presented a small survey of what is possible at the moment.
References Akaike H 1973 Information theory as an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Akademiai Kiado, Budapest), pp. 267–281. Akaike H 1974 A new look at the statistical model identification. Automatic Control 19, 716–723. Anderson J and Papachristodoulou A 2009 On validation and invalidation of biological models. BMC Bioinformatics 10, 132. Apgar JF, Toettcher JE, Endy D, White FM and Tidor B 2008 Stimulus design for model selection and validation in cell signaling. PLoS Comput Biol 4(2), e30. Balsa-Canto E, Peifer M, Banga JR, Timmer J and Fleck C 2008 Hybrid optimization method with general switching strategy for parameter estimation. BMC Syst Biol 2, 26. Bard Y 1974 Nonlinear Parameter Estimation. Academic Press. Batchelor E, Loewer A and Lahav G 2009 The ups and downs of p53: understanding protein dynamics in single cells. Nat Rev Cancer 9(5), 371–377.
Beaumont MA, Cornuet JM, Marin JM and Robert CP 2009 Old adaptive approximate Bayesian computation. Biometrika 96(4), 983–990. Bebek G and Yang J 2007 Pathfinder: mining signal transduction pathway segments from protein-protein interaction networks. BMC Bioinformatics 8, 335. Box G and Draper N 1987 Empirical Model Building and Response Surfaces. John Wiley & Sons, Ltd, p. 424. Boys R, Wilkinson D and Kirkwood T 2008 Bayesian inference for a discretely observed stochastic kinetic model. Stat Comput 18(2), 125–135. Brazil DP and Hemmings BA 2001 Ten years of protein kinase B signalling: a hard Akt to follow. Trends Biochem Sci 26(11), 657–664. Brewer D, Barenco M, Callard R, Hubank M and Stark J 2007 Fitting ordinary differential equations to short time course data. Philos T R Soc A 366, 519–544. Brown KS and Sethna JP 2003 Statistical mechanical approaches to models with many poorly known parameters. Phys Rev E 68, 021904. Brown KS, Hill CC, Calero GA, Myers CR, Lee KH, Sethna JP and Cerione RA 2004 The statistical mechanics of complex signaling networks: nerve growth factor signaling. Phys Biol 1(3-4), 184–195. Brown M, Grundy W, Lin D, Cristianini N, Sugnet C, Furey T, Ares M and Haussler D 2000 Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA 97(1), 262. Burnham K and Anderson D 2002 Model Selection and Multimodal Inference: a Practical Information Theoretic Approach. Springer Science + Business Media, Inc. Cho K and Wolkenhauer O 2003 Analysis and modelling of signal transduction pathways in systems biology. Biochem Soc Trans 31, 1503–1509. Christensen C, Thakar J and Albert R 2007 Systems-level insights into cellular regulation: inferring, analysing, and modelling intracellular networks. IET Syst Biol 1(2), 61–77. Del Moral P, Doucet A and Jasra A 2009 An adaptive sequential Monte Carlo method for approximate Bayesian computation. Technical report, Imperial College London. Deza E and Deza M 2006 Dictionary of Distances. Elsevier. Efron B and Tibshirani R 1993 An Introduction to the Bootstrap. Chapman & Hall/CRC. Fujita K, Toyoshima Y, Uda S, ichi Ozaki Y, Kubota H and Kuroda S 2010 Decoupling of receptor and downstream signals in the akt pathway by its low-pass filter characteristics. Sci Signal 3(132), ra56. Gelman A, Carlin J, Stern H and Rubin D 2004 Bayesian Data Analysis, 2nd edn. Chapman & Hall. Gilioli G, Pasquali S and Ruggeri F 2008 Bayesian inference for functional response in a stochastic predator-prey system. Bull Math Biol. 70(2), 358–381. Golightly A and Wilkinson D 2005 Bayesian inference for stochastic kinetic models using a diffusion approximation. Biometrics 61, 781–788. Golightly A and Wilkinson D 2006 Bayesian sequential inference for stochastic kinetic biochemical network models. J Comput Biol. 13(3), 838–851. Gutenkunst R, Waterfall J, Casey F, Brown K, Myers C and Sethna J 2007 Universally sloppy parameter sensitivities in systems biology models. PLoS Comput Biol 3(10), e189. Hengl S, Kreutz C, Timmer J and Maiwald T 2007 Data-based identifiability analysis of non-linear dynamical models. Bioinformatics 23(19), 2612–2618. Hoops S, Sahle S, Gauges R, Lee C, Pahle J, Simus N, Singhal M, Xu L, Mendes P and Kummer U 2006 COPASI–a COmplex PAthway SImulator. Bioinformatics 22(24), 3067–3074 Husmeier D 2003 Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic bayesian networks. Bioinformatics 19(17), 2271. 
Ingram PJ, Stumpf MPH and Stark J 2008 Nonidentifiability of the source of intrinsic noise in gene expression from single-burst data. PLoS Comput Biol 4(10), e1000192. Iyengar R 2009 Why we need quantitative dynamic models. Sci Signal 2(64), eg3. Johannes M and Polson N 2005 MCMC methods for continuous-time financial econometrics. In Handbook of Financial Econometrics (eds Y. Ait Sahalia and L. Hansen). Elsevier. Kass R and Raftery A 1995 Bayes factors. J Am Stat Assoc 90, 773–795. Kholodenko BN, Hancock JF and Kolch W 2010 Signalling ballet in space and time. Nat Rev Mol Cell Biol 11(6), 414.
Kirk PDW, Toni T and Stumpf MPH 2008 Parameter inference for biochemical systems that undergo a Hopf bifurcation. Biophys J 95(2), 540–549. Kirkpatrick S, Gelatt C and Vecchi M 1983 Optimization by simulated annealing. Science 220(4598), 671–680. Klipp E and Liebermeister W 2006 Mathematical modeling of intracellular signaling pathways. BMC Neurosci 7(Suppl. 1), S10. Kolch W, Calder M and Gilbert D 2005 When kinases meet mathematics: the systems biology of MAPK signalling. FEBS Lett 579, 1891–1895. L`ebre S, Becq J, Devaux F. Stumpf MPH and Lelandais G 2010 Statistical inference of the time-varying structure of gene-regulation networks. BMC Syst Biol 4, 130. Lee TK, Denny EM, Sanghvi JC, Gaston JE, Maynard ND, Hughey JJ and Covert MW 2009 A noisy paracrine signal determines the cellular nf-kappab response to lipopolysaccharide. Sci Signal 2(93), ra65. Levenberg K 1944 A method for the solution of certain non-linear problems in least squares. Quart Appl Math 2, 164–168. Liepe J, Barnes C, Cule E, Erguler K, Kirk P, Toni T and Stumpf M 2010 Abc-sysbio–approximate bayesian computation in python with gpu support. Bioinformatics 26(14), 1797. MacKay DJC 2003 Information Theory, Inference, and Learning Algorithms. Cambridge University Press, p. 628. Manning BD and Cantley LC 2007 AKT/PKB signaling: navigating downstream. Cell 129(7), 1261–1274. Marjoram P, Molitor J, Plagnol V and Tavare S 2003 Markov chain Monte Carlo without likelihoods. Proc Natl Acad Sci USA 100(26), 15324–15328. Markowetz F and Spang R 2007 Inferring cellular networks–a review. BMC Bioinformatics 8(Suppl. 6), S5. Marquardt D 1963 An algorithm for least-squares estimation of nonlinear parameters. J Soc Indust Appl Math 11(2), 431–441. Mettetal JT, Muzzey D, Pedraza JM, Ozbudak EM and van Oudenaarden A 2006 Predicting stochastic gene expression dynamics in single cells. Proc Natl Acad Sci USA 103(19), 7304–7309. Moles C, Mendes P and Banga J 2003 Parameter estimation in biochemical pathways: a comparison of global optimization methods. Genome Res 13(11), 2467–2474. Neapolitan RE 2004 Learning Bayesian Networks. Prentice Hall, p. 674. Papin J, Hunter T, Palsson B and Subramaniam S 2005 Reconstruction of cellular signalling networks and analysis of their properties. Nat Rev Mol Cell Biol 6(2), 99–111. Peifer M and Timmer J 2007 Parameter estimation in ordinary differential equations for biochemical processes using the method of multiple shooting. IET Syst Biol 1(2), 78–88. Quach M, Brunel N and d’Alch´e Buc F 2007 Estimating parameters and hidden variables in non-linear state-space models based on odes for biological networks inference. Bioinformatics 23(23), 3209–3216. Ratmann O, Andrieu C, Wiuf C and Richardson S 2009 Model criticism based on likelihood-free inference, with an application to protein network evolution. Proc Natl Acad Sci USA 106(26), 10 576–10 581. Reinker S, Altman R and Timmer J 2006 Parameter estimation in stochastic biochemical reactions. IEE Proc Syst Biol 153(4), 168–178. Robert C, Marin JM and Pillai N 2011 Why approximate bayesian computational (abc) methods cannot handle model choice problems. arXiv:1102.4432v1 stat.ME. Rodriguez-Fernandez M, Mendes P and Banga J 2006 A hybrid approach for efficient and robust parameter estimation in biochemical pathways. Biosystems 83, 248–265. Runarsson T and Yao X 2000 Stochastic ranking for constrained evolutionary optimization. IEEE T Evolut Comput 4, 284–294. 
Sachs K, Perez O, Pe’er D, Lauffenburger D and Nolan G 2005 Causal protein-signaling networks derived from multiparameter single-cell data. Science 308(5721), 523. Saltelli A, Ratto M, Andres T, Campolongo F, Cariboni J, Gatelli D, Saisana M and Tarantola S 2008 Global Sensitivity Analysis: the Primer. John Wiley & Sons, Ltd. Sasagawa S, ichi Ozaki Y, Fujita K and Kuroda S 2005 Prediction and validation of the distinct dynamics of transient and sustained ERK activation. Nat Cell Biol 7(4), 365–373. Schmidt H, Madsen MF, Danø S and Cedersund G 2008 Complexity reduction of biochemical rate expressions. Bioinformatics 24(6), 848–854.
Scott J, Ideker T, Karp RM and Sharan R 2006 Efficient algorithms for detecting signaling pathways in protein interaction networks – sciweavers. J Comput Biol 13(2), 133–144. Secrier M, Toni T and Stumpf M 2009 The ABC of reverse engineering biological signalling systems. Mol Biosyst 5, 1925–1935. Sisson SA, Fan Y and Tanaka MM 2007 Sequential Monte Carlo without likelihoods. Proc Natl Acad Sci USA 104(6), 1760–1765. Spiller DG, Wood CD, Rand DA and White MRH 2010 Measurement of single-cell dynamics. Nature 465(7299), 736–745. Swameye I, Muller TG, Timmer J, Sandra O and Klingmuller U 2003 Identification of nucleocytoplasmic cycling as a remote sensor in cellular signaling by databased modeling. Proc Natl Acad Sci USA 100(3), 1028–1033. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ and Church GM 1999 Systematic determination of genetic network architecture. Nat Genet 22(3), 281. Tian T, Xu S, Gao J and Burrage K 2007 Simulated maximum likelihood method for estimating kinetic rates in gene expression. Bioinformatics 23(1), 84–91. Timmer J and Muller T 2004 Modeling the nonlinear dynamics of cellular signal transduction. Int J Bifurcat Chaos 14, 2069–2079. Toni T and Stumpf MPH 2010 Simulation-based model selection for dynamical systems in systems and population biology. Bioinformatics 26(1), 104–110. Toni T, Welch D, Strelkowa N, Ipsen A and Stumpf MPH 2009 Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J. R. Soc Interface 6, 187–202. Venzon D and Moolgavkar S 1988 A method for computing profile-likelihood-based confidence intervals. Appl Stat 37, 87–94. Vera J, Bachmann J, Pfeifer A, Becker V, Hormiga J, Darias N, Timmer J, Klingm¨uller U and Wolkenhauer O 2008 A systems biology approach to analyse amplification in the JAK2-STAT5 signalling pathway. BMC Syst Biol 2, 38. Vyshemirsky V and Girolami M 2008a BioBayes: a software package for Bayesian inference in systems biology. Bioinformatics 24(17), 1933–1934. Vyshemirsky V and Girolami MA 2008b Bayesian ranking of biochemical system models. Bioinformatics 24(6), 833–839. Waldherr S, Hasenauer J and Allgower F 2009 Estimation of biochemical network parameter distributions in cell populations. In Proceedings of the 15th IFAC Symposium on System Identification (SYSID), pp. 1265–1270. Wel J 1975 Least squares fitting of an elephant. Chemtech 128–129. Wilkinson DJ 2006 Stochastic Modelling for Systems Biology. Chapman & Hall/CRC. Wilkinson DJ 2009 Stochastic modelling for quantitative description of heterogeneous biological systems. Nat Rev Genet 10(2), 122. Woodgett JR 2005 Recent advances in the protein kinase B signaling pathway. Curr Opin Cell Biol 17(2), 150–157. Yue H, Brown M, Knowles J, Wang H, Broomhead DS and Kell DB 2006 Insights into the behaviour of systems biology models from dynamic sensitivity and identifiability analysis: a case study of an NF-kappaB signalling pathway. Mol Biosyst 2(12), 640–649.
22
Modelling Transcription Factor Activity

Martino Barenco¹, Daniel Brewer², Robin Callard¹ and Michael Hubank¹

¹ Institute of Child Health, University College London, UK
² Institute of Cancer Research, Sutton, UK
We present here techniques designed to analyse time course gene expression data based on dynamical models of transcript concentration. The data feeding these models are both noisy and costly and are typically produced using microarrays. New technologies such as Next Generation Sequencing show promise in improving the quality of the data but due to the high level of cost, the relative paucity of data at the individual species level will remain for the foreseeable future. This imposes parsimony in model parametrisation. To fully utilise the data in these circumstances, novel techniques are required. Here we present such a procedure that uses the ordinary differential equation (ODE) formulation which is the paradigm of choice in most systems biology applications. The cornerstone of this technique is the estimation of activity profiles, i.e. how the strength of the driver for change of each individual species' gene expression varies in time (an example would be transcription factor activity). We thus demonstrate that despite cost constraints, parameter estimation need not be confined to kinetic constants. In the first part of this chapter, we present the modelling framework and an approximation scheme that turns model integration into a low-dimensional linear problem via a differentiation operator. This approximation scheme has the advantage of making model output quick to produce, which is a boon for any optimisation scheme but particularly in the genomics context, where potentially thousands of species have to be screened. The entries to this differentiation operator, a square matrix, result from rather technical steps which are examined in detail in the second section. In the third section, we present applications of these techniques to real data, the DNA damage response network in human T-cells. The approximation of the ODE model is based on its discretisation and hence the model output is itself discrete: points in time rather than continuous functions. In the fourth part, we present a scheme that can, for presentational purposes, output an arbitrary number of intermediary time points for both the estimated gene expression profile and its activator.
22.1 Integrating an ODE with a differential operator
Our model for the transcription factor concentration is a nonautonomous ODE of the form
$$\frac{dx(t)}{dt} = F(t) - Dx(t) \qquad (22.1)$$
where F(t) is a production term, typically encapsulating one or several transcription factors' activities and thus likely to vary in time. The second term on the right-hand side is a degradation term which we suppose to be proportional to the transcript's concentration x(t). Owing to the linear nature of the model, it is possible to write down the analytical solution to the equation:
$$x(t) = e^{-Dt}\int_0^t F(\tau)\,e^{D\tau}\,d\tau. \qquad (22.2)$$
Note that the nonautonomous term is presented here in a rather generic way and could actually be richer in parameters. For example we could have
$$F(t) = B + Sf(t) \qquad (22.3)$$
where B is a constant term, f(t) represents the activity profile of a transcription factor and S is a scalar representing the sensitivity of a transcript to that transcription factor. Like the degradation rate D, the kinetic constants B and S are specific to the transcript under review while f is nonspecific and could possibly influence several different transcripts. This is the paradigm we used in Barenco et al. (2006). The term F(t) can be more complex, for example by using a Michaelis–Menten formulation:
$$F(t) = B + V\,\frac{g(t)}{K + g(t)}$$
where g(t) would represent the transcription factor concentration near the binding site(s) and V and K are the usual kinetic constants associated with the Michaelis–Menten model. The advantage of this formulation is that it is more realistic because it encodes saturation in the production term. No matter how high the value of g(t) can be, the value of F(t) will never exceed B + V. It does though add a supplementary parameter to the linear model, which might be costly in the context of limited data availability. In particular, it should be noted that if the expression levels do not cover a respectable dynamic range, it would be difficult to arrive at robust values for the kinetic parameters. For a typical dataset, these will be measurements of the transcript concentration at discrete time points t0, . . . , tn which we denote by $\hat{x}_0, \ldots, \hat{x}_n$. [Note, we have defined $\hat{x}_i \equiv \hat{x}(t_i)$.] Fitting the model then consists of trying to find an appropriate parameterisation for the function F and integrating the model using (22.2) or another scheme such as an explicit fourth-order Runge–Kutta. The outcome can then be compared with the data using an appropriate score function (typically, this will be a sum of squared differences between the model and data), and the parameters are varied systematically until an optimum is found. However, parameterising a continuous function such as F (or indeed f and g in the nesting formulations above) is a nontrivial task, prone to be slow to integrate and difficult to scale coherently with respect to the number of time points observed. To overcome this difficulty, a more transparent scheme can be devised by discretising the model and parameterising any function by its values at the observed time points. For example F could be represented by a vector of n+1 parameters F0, . . . , Fn where F(ti) = Fi. This scheme has the advantage of transparency as the number of parameters required is clearly commensurate with the length of the measured time course. However we are left with the issue of having to solve a differential equation [Equation (22.1)] with a discrete nonautonomous term.
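To make the 'classic' fitting route concrete, the following sketch (with an invented activator shape and invented data, not taken from the chapter) parameterises F(t) = B + Sf(t), integrates (22.1) with a Runge–Kutta solver and minimises a sum-of-squares score.

import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

# Hypothetical data: transcript concentration measured every 2 h (values invented).
t_obs = np.arange(0.0, 13.0, 2.0)
x_obs = np.array([20.0, 35.0, 60.0, 75.0, 70.0, 55.0, 45.0])

def f_activity(t):
    # One possible (assumed) continuous parameterisation of the activator profile f(t).
    return np.exp(-0.5 * ((t - 4.0) / 2.0) ** 2)

def simulate(params):
    B, S, D = params
    rhs = lambda t, x: B + S * f_activity(t) - D * x        # equation (22.1) with (22.3)
    sol = solve_ivp(rhs, (t_obs[0], t_obs[-1]), [x_obs[0]], t_eval=t_obs, method="RK45")
    return sol.y[0]

def score(params):
    return np.sum((simulate(params) - x_obs) ** 2)          # sum of squared differences

fit = minimize(score, x0=[5.0, 50.0, 0.3], method="Nelder-Mead")
print("fitted (B, S, D):", fit.x)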
To address this, observe that the left-hand side of (22.1) can be approximated by using discrete time points of the candidate solution. For example, and supposing that the sampling rate is regular, we can write:
$$\left.\frac{dx(t)}{dt}\right|_{t=t_i} \approx \frac{x_{i+1}-x_{i-1}}{t_{i+1}-t_{i-1}}$$
where we have defined xi ≡ x(ti). The approximation scheme above amounts to fitting a quadratic polynomial in t through the points (ti−1, xi−1), (ti, xi), (ti+1, xi+1) and taking as an approximation of the derivative of x(t) at ti the slope of this quadratic at this same time point. It is possible to generalise this scheme to higher order polynomials using Lagrange interpolation (Press et al. 1992), calling the approximating polynomial P:
$$P(t) = \sum_{k=p}^{r} C(k,p,r,t)\,x_k, \qquad (22.4)$$
where
$$C(k,p,r,t) = \prod_{j=p,\,j\neq k}^{r}\frac{t-t_j}{t_k-t_j}. \qquad (22.5)$$
The right-hand side of (22.4) is a polynomial of degree q = r − p passing through the points (tp, xp), . . . , (ti, xi), . . . , (tr, xr) and will serve to arrive at an approximation of the derivative of x(t) at ti:
$$\left.\frac{dx(t)}{dt}\right|_{t=t_i} \approx \left.\frac{dP(t)}{dt}\right|_{t=t_i} = \sum_{k=p}^{r}\left[\frac{d}{dt}C(k,p,r,t)\right]_{t=t_i} x_k.$$
Observe that the function C and therefore its first derivative does not depend on any of the xi s, so that in the formula above the approximation of the first derivative of the expression profile x can be written as a linear combination of xp, . . . , xr. Setting all Aik = 0 for all ks outside of the range [p, . . . , r], we can write:
$$\left.\frac{dx(t)}{dt}\right|_{t=t_i} \approx \sum_{k=0}^{N} A_{ik}\,x_k.$$
The calculation of the nonzero Aik as well as the choice of the range [p, . . . , r] is technical and deferred to subsequent sections. For now we can define the vectors x ≡ (x0, . . . , xn) and F ≡ (F(t0), . . . , F(tn)), so that an approximate version of (22.1) can be written as
$$A.x = F - Dx,$$
whose formal solution is given by
$$x = (A + DI)^{-1}.F \qquad (22.6)$$
where I represents the (n + 1) × (n + 1) identity matrix and ‘.’ is the matrix product, all the rest being term-wise operations. This approach of solving the differential equation, which amounts to solving a system of linear equations, is much faster than solving it through a ‘classic’ Runge–Kutta fourth order scheme. Using a Runge–Kutta scheme
on this type of equation also presupposes that an appropriate parameterisation for the functional form F has been found. In contrast, the approach presented here has the advantage that the function F is parameterised in a straightforward manner, by specifying the values at the sampling points. Thus the parameterisation of the activator is commensurate with the experiment size. Our approach is tailored to a standard systems biology set-up of having time courses with relatively few time points, typically less than twenty. The apparent disadvantages are twofold. First, the differential equation is solved through an approximation. However, we found through simulations that the error made by approximating the differential equation is effectively swamped by the measurement error typically incurred in using expression data. The other perceivable disadvantage pertains to the discretisation itself, i.e. we only have model predictions and estimations of the activity profile at time points specified by the experimental design. However the paradigm behind the discretisation – polynomial interpolation – can be extended to nonsampling time points (see Section 22.4).
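The linear solve of equation (22.6) is a one-liner once A is available; the sketch below uses a toy 3 × 3 operator and invented numbers purely for illustration.

import numpy as np

# Toy differential operator for three time points spaced 2 h apart
# (forward, central and backward differences); the real construction is given in Section 22.2.
A = np.array([[-0.5,  0.5,  0.0],
              [-0.25, 0.0,  0.25],
              [0.0,  -0.5,  0.5]])
F = np.array([10.0, 12.0, 8.0])       # production term at the three time points (invented)
D = 0.4                                # degradation rate (invented)

x = np.linalg.solve(A + D * np.eye(len(F)), F)   # x = (A + D I)^{-1} F, equation (22.6)
print(x)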
22.2 Computation of the entries of the differential operator
The discretisation scheme necessitates the computation of the nonzero entries of a square matrix [see Equation (22.6)]. Individual entries are obtained through differentiation with respect to time of the function C shown in (22.5). A general form1 is given by:
$$\frac{d}{dt}C(k,p,r,t) \equiv E(k,p,r,t) = \left(\sum_{j=p,\,j\neq k}^{r}\frac{1}{t-t_j}\right)\left(\prod_{j=p,\,j\neq k}^{r}\frac{t-t_j}{t_k-t_j}\right) \qquad (22.7)$$
1 To simplify manipulations, use $\prod_{j=p,\,j\neq k}^{r}(t-t_j) = \exp\left(\sum_{j=p,\,j\neq k}^{r}\log(t-t_j)\right)$.
However, the first derivative will be evaluated at a time ti coinciding with one of the time points part of the formula, i.e. i ∈ {p, . . . , r}. Plugging this into the above formula yields two simplified cases:
$$E(k,p,r,t_i) = \sum_{j=p,\,j\neq i}^{r}\frac{1}{t_i-t_j} \qquad (22.8)$$
for k = i and
$$E(k,p,r,t_i) = \frac{1}{t_k-t_i}\prod_{j=p,\,j\neq k,i}^{r}\frac{t_i-t_j}{t_k-t_j} \qquad (22.9)$$
for k ≠ i.
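A direct implementation of (22.8) and (22.9) might look as follows (a sketch; the time points are the 2 h grid used later in the chapter, and the function name is ours):

import numpy as np

def E(k, p, r, i, t):
    """Entries (22.8)/(22.9): d/dt of the Lagrange basis C(k,p,r,t) evaluated at t[i]."""
    if k == i:
        return sum(1.0 / (t[i] - t[j]) for j in range(p, r + 1) if j != i)
    prod = 1.0
    for j in range(p, r + 1):
        if j not in (k, i):
            prod *= (t[i] - t[j]) / (t[k] - t[j])
    return prod / (t[k] - t[i])

t = np.arange(0.0, 13.0, 2.0)            # seven time points spaced every 2 h
row = [E(k, 0, 4, 2, t) for k in range(0, 5)]
print(row)                                # central five-point rule: [1/24, -1/3, 0, 1/3, -1/24]
print(sum(row))                           # rows sum to zero: the derivative of a constant is zero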
22.2.1 Taking into account the nature of the biological system being modelled
These coefficients [(22.8) and (22.9)] could be used straight away but in some cases it is worth refining them further. The biological context in which they were developed was the DNA damage response network. Human T-cells were submitted to punctual γ irradiation and expression levels measured using microarrays post irradiation, in addition to one time point just before irradiation. In modelling terms, the assumption is one of a system in equilibrium perturbed by a pulse at t = 0, with post irradiation measurements representing a transient orbit. In pathway terms, the transcript expression response is several species downstream of the pulse and it is therefore reasonable2 to assume that the first derivative of x(t) is zero at this point in time:
$$\left.\frac{dx(t)}{dt}\right|_{t=0} = 0 \qquad (22.10)$$
Assumption (22.10) will typically be introduced in those cases where t0 is part of the approximation, i.e. where p = 0. To derive the formula for these revised coefficients, we introduce an extra dummy point (t−1, x−1) whose function is to absorb and carry the information contained in (22.10). The values of t−1 and x−1 need not be known as these will cancel out in subsequent manipulations below; we only require t−1 ≠ ti for all i = 0, . . . , r. We denote the coefficients of the first derivative approximation that uses the r + 2 points (t0, x0), . . . , (tr, xr) and (t−1, x−1) by Ẽ(k, 0, r, t). A straightforward extension of (22.8) and (22.9) yields:
$$\tilde{E}(k,0,r,t_i) = \sum_{j=-1,\,j\neq i}^{r}\frac{1}{t_i-t_j} \qquad (22.11)$$
for k = i and
$$\tilde{E}(k,0,r,t_i) = \frac{1}{t_k-t_i}\prod_{j=-1,\,j\neq k,i}^{r}\frac{t_i-t_j}{t_k-t_j} \qquad (22.12)$$
for k ≠ i. Given that the first derivative of x(t) at t = t0 is approximated by
$$\left.\frac{dP(t)}{dt}\right|_{t=t_0} = \tilde{E}(-1,0,r,t_0)\,x_{-1} + \sum_{k=0}^{r}\tilde{E}(k,0,r,t_0)\,x_k,$$
we can then replace the left-hand side with zero [assumption (22.10)] and solve for x−1:
$$x_{-1} = -\sum_{k=0}^{r}\frac{\tilde{E}(k,0,r,t_0)}{\tilde{E}(-1,0,r,t_0)}\,x_k.$$
Plugging the formula above into the approximation of the first derivative of P(t) at ti yields:
$$\left.\frac{dP(t)}{dt}\right|_{t_i} = \sum_{k=0}^{r}\tilde{E}(k,0,r,t_i)\,x_k + \tilde{E}(-1,0,r,t_i)\,x_{-1}$$
$$= \sum_{k=0}^{r}\tilde{E}(k,0,r,t_i)\,x_k - \tilde{E}(-1,0,r,t_i)\sum_{k=0}^{r}\frac{\tilde{E}(k,0,r,t_0)}{\tilde{E}(-1,0,r,t_0)}\,x_k$$
$$= \sum_{k=0}^{r}E_0(k,0,r,t_i)\,x_k,$$
where we have defined
$$E_0(k,0,r,t_i) \equiv \tilde{E}(k,0,r,t_i) - \tilde{E}(-1,0,r,t_i)\,\frac{\tilde{E}(k,0,r,t_0)}{\tilde{E}(-1,0,r,t_0)}. \qquad (22.13)$$
2 In this perturbed equilibrium context, a species immediately downstream of a pulse (a noncontinuous function) would be continuous, nonsmooth at the time of the pulse. For the next species downstream, the function would be smooth and with zero first derivative. Both features do of course hold for species further downstream.
For i = 0 we have by construction and in accordance with (22.10)
$$E_0(k,0,r,t_0) = 0. \qquad (22.14)$$
In the more general case, where i > 0, one plugs (22.11) and (22.12) into (22.13), distinguishing three subcases. For k ≠ 0, i we have
$$E_0(k,0,r,t_i) = \frac{1}{t_k-t_i}\left(\prod_{j=0,\,j\neq i,k}^{r}\frac{t_i-t_j}{t_k-t_j}\right)\frac{t_0-t_i}{t_0-t_k} = E(k,0,r,t_i)\,\frac{t_0-t_i}{t_0-t_k}. \qquad (22.15)$$
For k = i, the formula is
$$E_0(i,0,r,t_i) = \sum_{j=0,\,j\neq i}^{r}\frac{1}{t_i-t_j} + \frac{1}{t_i-t_0} = E(i,0,r,t_i) + \frac{1}{t_i-t_0}. \qquad (22.16)$$
Finally, if k = 0, we have
$$E_0(0,0,r,t_i) = \frac{1}{t_0-t_i}\left(\prod_{j=1,\,j\neq i}^{r}\frac{t_i-t_j}{t_0-t_j}\right)\left(1 + (t_0-t_i)\sum_{j=1}^{r}\frac{1}{t_0-t_j}\right) = E(0,0,r,t_i)\left(1 + (t_0-t_i)\sum_{j=1}^{r}\frac{1}{t_0-t_j}\right) \qquad (22.17)$$
Arriving at these formulae requires rather lengthy manipulations, which were skipped for brevity, but observe first that all references to the dummy point (t−1, x−1) have vanished and secondly, that either of the coefficients E0(. . . ) in (22.15), (22.16) and (22.17) can be obtained as updates of E(. . . ) in (22.11) and (22.12).
22.2.2 Bounds choice for polynomial interpolation
We now have a method to evaluate the slope of a time function at a given point (call this ti ) using its value at that point and others measurements nearby. We are still faced with a choice: which points should be used for this task? Using points that span an interval which does not include the point of interest is possible but a risky option as this would amount to performing polynomial extrapolation, which can be highly inaccurate. It is therefore preferable to pick the time point of interest and neighbouring ones. Picking two extra points, one on either side of ti would amount to fitting a quadratic (three interpolation points), but we found that this set-up tended to underestimate the amplitude between peaks and troughs in a damped oscillatory context, probably because of too coarse an approximation of the first derivative. We found, however, that using two points on either side solved this issue. Therefore, the ideal situation is to be able to use a total of five time points where ti is the median. However, when ti is at, or next to the time course boundaries, this is not possible. In those cases (Barenco et al. 2006), we favoured using bounds p and r to be as close as possible to the ideal situation above. After the
Table 22.1 Index bounds as a function of the index of the approximation time point i

    i               p (lower bound)    r (upper bound)
    0               0                  1
    1               0                  3
    2 ≤ i ≤ n−2     i−2                i+2
    n−1             n−3                n
    n               n−1                n
2006 publication, we noticed that for the last time point this led to instabilities in the estimations and increased confidence intervals. It is therefore preferable in these extreme cases to be more conservative and choose p and r such that the difference |(i − p) − (r − i)| does not exceed 1. The choice of bounds as a function of the approximating time point index i is given in Table 22.1. With the choice of bounds given in Table 22.1 and with seven time points spaced every 2 h starting from t0 = 0 we obtain, by applying (22.11) (diagonal entries) and (22.12) (off-diagonal entries), the following 7 × 7 matrix A:
$$A = \begin{pmatrix}
-\tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 & 0 & 0 & 0 \\
-\tfrac{1}{6} & -\tfrac{1}{4} & \tfrac{1}{2} & -\tfrac{1}{12} & 0 & 0 & 0 \\
\tfrac{1}{24} & -\tfrac{1}{3} & 0 & \tfrac{1}{3} & -\tfrac{1}{24} & 0 & 0 \\
0 & \tfrac{1}{24} & -\tfrac{1}{3} & 0 & \tfrac{1}{3} & -\tfrac{1}{24} & 0 \\
0 & 0 & \tfrac{1}{24} & -\tfrac{1}{3} & 0 & \tfrac{1}{3} & -\tfrac{1}{24} \\
0 & 0 & 0 & \tfrac{1}{12} & -\tfrac{1}{2} & \tfrac{1}{4} & \tfrac{1}{6} \\
0 & 0 & 0 & 0 & 0 & -\tfrac{1}{2} & \tfrac{1}{2}
\end{pmatrix}$$
If we want to further refine the coefficients as explained in Section 22.2.1 the entries in the first three rows of A have to be revised, by replacing the first row with zeros and using formulae (22.16) (diagonal), (22.17) (first column) and (22.15) (remaining entries) in the second and third row. In this case the matrix A is as follows:
$$A = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
-\tfrac{17}{36} & \tfrac{1}{4} & \tfrac{1}{4} & -\tfrac{1}{36} & 0 & 0 & 0 \\
\tfrac{31}{144} & -\tfrac{2}{3} & \tfrac{1}{4} & \tfrac{2}{9} & -\tfrac{1}{48} & 0 & 0 \\
0 & \tfrac{1}{24} & -\tfrac{1}{3} & 0 & \tfrac{1}{3} & -\tfrac{1}{24} & 0 \\
0 & 0 & \tfrac{1}{24} & -\tfrac{1}{3} & 0 & \tfrac{1}{3} & -\tfrac{1}{24} \\
0 & 0 & 0 & \tfrac{1}{12} & -\tfrac{1}{2} & \tfrac{1}{4} & \tfrac{1}{6} \\
0 & 0 & 0 & 0 & 0 & -\tfrac{1}{2} & \tfrac{1}{2}
\end{pmatrix}$$
Note that in either case the entries sum up to zero row-wise which is a good way to check that the formulae have been correctly implemented. It is also worth mentioning that whilst computing the entries of the differential operator might seem complicated and cumbersome, this needs to be done only once in each session, rather than for each function evaluation (for example in the context of parameter estimation). In other words, this scheme is computationally highly efficient.
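Putting the pieces together, the sketch below assembles both versions of the 7 × 7 operator from the coefficient formulae and the bounds of Table 22.1, and uses the row-sum property as a check; it should reproduce the matrices shown above up to rounding. The function names are ours, not from the original text.

import numpy as np

T = np.arange(0.0, 13.0, 2.0)      # seven time points, 2 h apart
n = len(T) - 1

def E(k, p, r, i, t):
    # Plain coefficients, equations (22.8) and (22.9).
    if k == i:
        return sum(1.0 / (t[i] - t[j]) for j in range(p, r + 1) if j != i)
    prod = 1.0
    for j in range(p, r + 1):
        if j not in (k, i):
            prod *= (t[i] - t[j]) / (t[k] - t[j])
    return prod / (t[k] - t[i])

def E0(k, r, i, t):
    # Revised coefficients with p = 0, equations (22.14)-(22.17): zero slope enforced at t[0].
    if i == 0:
        return 0.0                                            # equation (22.14)
    if k == i:
        return E(k, 0, r, i, t) + 1.0 / (t[i] - t[0])         # equation (22.16)
    if k == 0:
        s = sum(1.0 / (t[0] - t[j]) for j in range(1, r + 1))
        return E(0, 0, r, i, t) * (1.0 + (t[0] - t[i]) * s)   # equation (22.17)
    return E(k, 0, r, i, t) * (t[0] - t[i]) / (t[0] - t[k])   # equation (22.15)

def bounds(i):
    # Table 22.1.
    if i == 0:
        return 0, 1
    if i == 1:
        return 0, 3
    if i == n - 1:
        return n - 3, n
    if i == n:
        return n - 1, n
    return i - 2, i + 2

def build_A(zero_slope_at_t0=False):
    A = np.zeros((n + 1, n + 1))
    for i in range(n + 1):
        p, r = bounds(i)
        for k in range(p, r + 1):
            if zero_slope_at_t0 and p == 0:
                A[i, k] = E0(k, r, i, T)          # revised first three rows
            else:
                A[i, k] = E(k, p, r, i, T)
    return A

A_plain = build_A()
A_revised = build_A(zero_slope_at_t0=True)
print(np.abs(A_plain.sum(axis=1)).max())          # row sums are zero up to rounding error
print(np.round(A_revised, 4))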
22.3 Applications
The differential operator whose construction we have just described can be used in a variety of situations. The first one we have already hinted at was presented in Barenco et al. (2006) and named Hidden Variable Dynamic Modelling (HVDM). The focus was on p53, an important transcription factor in the context of the DNA damage response network (and the 'hidden variable'). An algorithm was presented that combined some prior knowledge about the system (known targets of p53), expression data (three replicates of a seven points time course) and the linear model [Equation (22.1) with (22.3)]. In a first training step, we used the expression data of five known p53 targets to fit their optimal three kinetic parameters (B, S, D; all three gene-specific) as well as the activator profile they share (f). The experiment was run in biological replicates and we supposed that the activator profile could differ between replicates (but not, of course, between genes) in order to capture biological variation. On the other hand, kinetic parameters were forced to be constant across replicates. It is worth noting that it was necessary to use more than one gene in this step as using a single gene would have created a situation with the number of parameters exceeding the number of data points.3 To avoid identifiability issues, the parameter set had to be reduced4 and one of the parameters5 was measured independently rather than fitted. Confidence intervals for the estimated parameters were determined using a Markov chain Monte Carlo (MCMC) procedure. In the next screening step, information gleaned from the first step was used to decide on the dependency status of individual genes with respect to p53. The model was fitted for each individual gene in the microarray experiment using the previously determined values for the transcription factor activity [Figure 22.1(a)], so there were only three parameters to determine in each fit (those parameters were the gene's individual kinetic parameters B, S and D). The putative dependency status of a gene on p53 was based on whether there was both a good model fit and the sensitivity parameter S was robustly greater than zero (as measured by the Z-score of that parameter). This screening step produced a list of putative p53 targets, of which roughly half were previously undocumented p53 targets. The accuracy of this list was asserted using an independent experiment. This algorithm has also been implemented as an easy to use R/Bioconductor package (Barenco et al. 2009a). It is worth noting that owing to its efficiency, the algorithm is entirely implemented in that scripting language and there is no need to accelerate the code by using lower level programming languages. Since then, the same framework has been used with a more complicated form for the production term F. In Cromer et al. (2010),
3 For example, with a single time course of N points, there are N parameters to estimate for the activator profile, plus three kinetic parameters for the gene. By using several genes in this step, the activator profile is shared among genes so that one can find itself in the viable situation where the data count exceeds the parameter count.
4 Letting Bo, So and fo be a set of parameters producing an optimal fit, it can be shown that choosing instead Bo + aSo, bSo and g will produce an equivalent fit if fo = a + bg where a and b are scalars. To avoid this situation, we fixed some parameter values, thereby taking those out of the fitting. We let the first time point of the activator be equal to zero (f0 = 0) and for one of the training genes, called the anchoring gene, we let the sensitivity coefficient S be equal to 1.
5 The degradation rate of the anchoring gene. Failing to implement this parameter as a known, independently measured quantity had the effect of returning very wide confidence intervals.
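The screening step can be sketched as a small least-squares fit of (B, S, D) for one gene, given a previously estimated activity profile. In the sketch below the differential operator is a crude stand-in, and both the activity profile and the expression values are invented.

import numpy as np
from scipy.optimize import minimize

t = np.arange(0.0, 13.0, 2.0)
nt = len(t)
A = np.zeros((nt, nt))
A[0, 0], A[0, 1] = -0.5, 0.5                       # forward difference at the first point
A[-1, -2], A[-1, -1] = -0.5, 0.5                   # backward difference at the last point
for i in range(1, nt - 1):
    A[i, i - 1], A[i, i + 1] = -0.25, 0.25         # central differences, spacing 2 h

f = np.array([0.0, 0.8, 1.0, 0.9, 0.6, 0.3, 0.1])                       # assumed activator profile
x_obs = np.array([100.0, 140.0, 180.0, 190.0, 170.0, 150.0, 130.0])     # invented expression data

def model(params):
    B, S, D = params
    F = B + S * f                                   # equation (22.3)
    return np.linalg.solve(A + D * np.eye(nt), F)   # equation (22.6)

def sse(params):
    return np.sum((model(params) - x_obs) ** 2)

fit = minimize(sse, x0=[50.0, 100.0, 0.5], method="Nelder-Mead")
print("kinetic parameters (B, S, D):", fit.x, "SSE:", sse(fit.x))
# A gene would be flagged as a putative target if the fit is good and S is robustly above zero.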
Figure 22.1 (a) Estimation of the transcription factor activity. Red crosses and error bars represent the actual estimation for the first replicate of the experiment whilst the continuous red line is the interpolation obtained from those estimations. The interpolation for replicates 2 and 3 are represented by dotted lines. (b) P21 expression profile. Black crosses and error bars represent the measured signal along with the (95%) measurement error. The red continuous line represents the model output as obtained by polynomial interpolation between the point estimates. (c) Same for DDB2, another gene part of the training set. (d) Same for p53tg1, a gene that was (correctly) predicted to be a p53 target. Notice that in spite of being activated by the same transcription factor, the expression profiles of target genes can be radically different
the production term comprises more than one transcription factor, thereby allowing us to check possible multiple dependencies of individual genes. Both approaches outlined above require some prior knowledge of the system under review, in the form of known targets of the transcription factors of interest. It is also noticeable that the turnover rate D was very
important in shaping individual genes' response to a common activator. This can be seen by contrasting Figure 22.1(b), (c) and (d), which depict very different expression response shapes, entirely caused by different degradation rates (the level and amplitude of the response are, in turn, governed by the basal and sensitivity rates). Interestingly, this would cause a traditional clustering approach, applied directly on the expression data, to fail. This prompted a more exploratory approach that did not assume any prior biological information. The latter was replaced with quantitative information, i.e. the turnover rates of individual transcripts (Barenco et al. 2009b). Combining this with time course data and a reorganised version of (22.1), namely
$$\frac{dx(t)}{dt} + Dx(t) = F(t),$$
it is possible to isolate the production term. The discretised version is given by:
$$(A + DI)x = G \approx F.$$
Assuming the linear formulation (22.3) for F it is then possible to group genes according to the similarity of their individual activator profiles f, where the distance between genes is Pearson's correlation between individual G profiles. Using a graph-based representation, three global activities in a complex response were extracted from the dataset. It is important to note that, in contrast to HVDM, the differential operator is applied to data directly. In other words, there was an overfitting risk since noise was in part used to estimate individual Gs. This potential problem was mitigated by the fact that global activities were obtained through averaging of several individual profiles. Individual gene expression profiles were then refitted to the global activities, so that attribution of individual genes to one or the other of these activities was done in a principled, model-based, manner. Following this, we found that about 80% of the most upregulated genes in the experiment could be correctly6 classified, thereby demonstrating the usefulness of this framework in disentangling a very complex response.
6 To verify that assertion, independent verification experiments were performed.
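A sketch of this exploratory route is given below: the production profiles G = (A + DI)x are computed directly from (synthetic) expression data using measured turnover rates, and genes are compared through the Pearson correlation of their G profiles. The stand-in operator and data are illustrative only.

import numpy as np

t = np.arange(0.0, 13.0, 2.0)
nt = len(t)
A = np.zeros((nt, nt))
A[0, 0], A[0, 1] = -0.5, 0.5
A[-1, -2], A[-1, -1] = -0.5, 0.5
for i in range(1, nt - 1):
    A[i, i - 1], A[i, i + 1] = -0.25, 0.25          # simple stand-in differential operator

rng = np.random.default_rng(3)
n_genes = 20
X = rng.uniform(50.0, 200.0, size=(n_genes, nt))    # invented expression profiles (rows = genes)
D = rng.uniform(0.1, 1.0, size=n_genes)              # measured turnover rates (invented)

G = np.vstack([(A + D[g] * np.eye(nt)) @ X[g] for g in range(n_genes)])   # production profiles

corr = np.corrcoef(G)                                 # Pearson correlation between G profiles
dist = 1.0 - corr                                     # a simple correlation-based distance
print(np.round(dist[:3, :3], 2))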
22.4 Estimating intermediate points
The estimates for the transcription factor activity and the model output are, in the set-up we have described so far, determined only for those time points coinciding with experimental measurements. However, it could be useful to be able, at least for display purposes, to compute values of those time functions for intermediate points, thereby edging closer to a continuous description of the system. The paradigm we used to integrate ODEs – polynomial interpolation – lends itself well to this task; indeed such a continuous description is implicit in the scheme. In the approximations we described, we used a sliding window of measurement points and for intermediate estimations we use matching windows. Let tx be the point at which we want to have an intermediate approximation and ti and ti+1 be the neighbouring measurement points so that ti < tx < ti+1. Rather than choosing between the bounds p and r used for ti or those used for ti+1 (call these p′ and r′) as stated in Table 22.1, we suggest instead to perform a weighted average of the two possible estimates, dependent on the position of tx within the interval [ti, ti+1]. The estimation will make repeated use of the coefficients given in (22.5). Letting y be the function we want to estimate at tx and yj the value we have for this function at t = tj, we obtain:
$$y_x = \frac{t_{i+1}-t_x}{t_{i+1}-t_i}\left(\sum_{k=p}^{r} C(k,p,r,t_x)\,y_k\right) + \frac{t_x-t_i}{t_{i+1}-t_i}\left(\sum_{k=p'}^{r'} C(k,p',r',t_x)\,y_k\right). \qquad (22.18)$$
Since this kind of estimation will be implemented for more than one time point tx, efficiencies can be found that avoid repeating calculations. For example, storing the denominators of all C(k, p, r, tx)'s, which do not depend on tx's value, is one of those. In Section 22.2.1, we have introduced revised coefficients allowing us to force the slope of the fitting polynomial to be zero at the first time point t0. However, the coefficients we gave there were only valid for computing the first derivative of the fitting polynomial at a given measurement point, rather than its actual value. Using steps and tricks similar to those explained in Section 22.2.1 it is possible to arrive at revised coefficients that implicitly force the fitting polynomial to have a first derivative equal to zero at t0, distinguishing two possible cases:
$$C_0(k,0,r,t_x) = C(k,0,r,t_x)\,\frac{t_x - t_0}{t_k - t_0} \qquad (22.19)$$
when k ≠ 0. For k = 0:
$$C_0(0,0,r,t_x) = C(0,0,r,t_x)\left(1 + (t_0 - t_x)\sum_{j=1}^{r}\frac{1}{t_0 - t_j}\right). \qquad (22.20)$$
The revised coefficients in (22.19) and (22.20) can be obtained as updates of the original ones and these revisions are, unsurprisingly, rather similar to the ones in (22.15) and (22.17), respectively. Because these intermediate time points result ultimately from an approximation of (22.1), the continuous representation that is provided is not necessarily consistent with this equation and should therefore be confined to display purposes. Examples are shown in Figure 22.1. Individual gene profiles [Figure 22.1(b)–(d)] were obtained with the updated coefficients (22.19) and (22.20) to satisfy the hypothesis that their expression is a transient orbit, while the transcription factor profiles [Figure 22.1(a)] were calculated using the original coefficients given in (22.5).
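For completeness, a sketch of the intermediate-point scheme of equation (22.18) is given below, reusing the Lagrange coefficients (22.5) and the bounds of Table 22.1; the profile values are invented and the function names are ours.

import numpy as np

def C(k, p, r, tx, t):
    # Lagrange basis polynomial (22.5) evaluated at tx.
    prod = 1.0
    for j in range(p, r + 1):
        if j != k:
            prod *= (tx - t[j]) / (t[k] - t[j])
    return prod

def bounds(i, n):
    # Table 22.1.
    if i == 0:
        return 0, 1
    if i == 1:
        return 0, 3
    if i == n - 1:
        return n - 3, n
    if i == n:
        return n - 1, n
    return i - 2, i + 2

def intermediate(tx, t, y):
    # Weighted average of the two interpolations, equation (22.18).
    n = len(t) - 1
    i = np.searchsorted(t, tx) - 1                 # so that t[i] <= tx < t[i+1]
    p, r = bounds(i, n)
    pp, rr = bounds(i + 1, n)
    w = (t[i + 1] - tx) / (t[i + 1] - t[i])
    est_i = sum(C(k, p, r, tx, t) * y[k] for k in range(p, r + 1))
    est_ip1 = sum(C(k, pp, rr, tx, t) * y[k] for k in range(pp, rr + 1))
    return w * est_i + (1.0 - w) * est_ip1

t = np.arange(0.0, 13.0, 2.0)
y = np.array([0.0, 30.0, 80.0, 100.0, 90.0, 60.0, 40.0])    # invented profile at the sampling points
print([round(intermediate(tau, t, y), 2) for tau in np.arange(1.0, 12.0, 2.0)])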
Acknowledgements

This work was supported by BBSRC (grant BB/E008488/1) and MRC (grant 09MH25). The authors wish to thank Neil Lawrence for suggesting Section 22.4. This chapter is dedicated to the memory of Jaroslav Stark who left us prematurely. Jaroslav helped to shape part of this chapter's content and more importantly, has been a passionate advocate and contributor to Systems Biology in the UK. He and his incisive mind will be sorely missed.
References

Barenco M, Tomescu D, Brewer D, Callard R, Stark J and Hubank M 2006 Ranked prediction of p53 targets using hidden variable dynamic modeling. Genome Biol, 7, R25.
Barenco M, Papouli E, Shah S, Brewer D, Miller CJ and Hubank M 2009a rHVDM: an R package to predict the activity and targets of a transcription factor. Bioinformatics, 25, 419–420.
Barenco M, Brewer D, Papouli E, Tomescu D, Callard R, Stark J and Hubank M 2009b Dissection of a complex transcriptional response using genome-wide transcriptional modelling. Mol Syst Biol, 5, 327.
Cromer D, Christophides GK and Stark J 2010 Hidden variable analysis of transcription factor cooperativity from microarray time courses. IET Syst Biol, 4, 131–144.
Press W, Teukolsky S, Vetterling W and Flannery B 1992 Numerical Recipes in C: The Art of Scientific Computing, pp. 108–110. Cambridge University Press.
23
Host–Pathogen Systems Biology

John W. Pinney

Centre for Bioinformatics, Imperial College London, UK
23.1 Introduction
Host–pathogen interactions have been an important topic in theoretical biology for a number of years, chiefly focusing on models for the genetic basis of the molecular processes of infection and immunity. The gene-for-gene (GFG) model, where each parasite ‘virulence/avirulence’ locus is matched by a host ‘resistance/susceptibility’ locus, was devised over 50 years ago [1] to explain the observations made in crossinfection experiments with flax and its fungal pathogen, flax rust. In animals, the concept of self–nonself recognition is also an important component of the immune response, as represented by the matching alleles (MA) models [2]. These two simple approaches in some senses represent the extreme behaviours of real host– parasite systems; they have been combined in a two-step animal immune model incorporating both pathogen detection (MA model) and immune response (GFG model) [3]. These and other similar genetic interaction models have provided an essential framework for the analysis of the dynamics of host–pathogen systems, resulting in many important theoretical insights. As data become available on the specifics of molecular interactions within and between cells, however, network analysis approaches are starting to make an impact on the ways in which parasitologists and microbiologists approach the analysis of every aspect of host– pathogen biology. Systems biology is being used to improve our fundamental knowledge about the biology of pathogenic organisms, how they exploit and manipulate their diverse environments and negotiate transitions between hosts and between host and vector species. Although intracellular parasite genomes have generally been prioritized for sequencing and have supported many important breakthroughs arising from sequence analysis and comparative genomics, the dependence of the parasite on the host’s metabolic resources and cellular machinery imply that these organisms can only truly be understood within the framework of systems biology and, by extension, suitable models of the relevant host systems. These integrated models have already produced some fascinating glimpses of previously unexpected host–pathogen biology. They can be seen as excellent
examples of the power of the systems biology approach to generate new hypotheses from the synthesis of existing knowledge with new experimental data. An important aspect of this work is the identification of drug targets for novel therapeutic interventions. In addition to molecular targets within the pathogen itself, systems biology allows for the first time the rational evaluation of host molecules as potential targets, where they fulfil some essential role in pathogenicity or the pathogen life cycle. It is well established that host–pathogen interactions should be expected to be one of the major driving forces in evolution, both for the pathogens themselves and for the animals and plants that support them. Genetic interaction models are conceptually simple, yet result in complex and frequently chaotic evolutionary dynamics. One of the goals of host–pathogen systems biology is to implement evolutionary models at a systems level, in sufficient detail to allow the phenotypic effects of a particular genetic change to be estimated. Such models are important because it is reasonable to expect that host–pathogen interactions have had a massive effect on the evolutionary history of all kinds of molecular networks operating at the cellular level and above. The systems biology of infectious diseases is in many respects less developed than that of the major cancers and genetic diseases. Ultimately, however, an evolutionary perspective incorporating host–pathogen interactions may be necessary to understand the origins and mechanisms of many of these noninfectious human diseases. The methods and resources used in host–pathogen systems biology are as varied as the subject matter itself. Many of the research questions relate in some way to networks of molecular interactions, either within host, within pathogen or (most likely) as a biomolecular exchange or other interaction between the two systems (see Figure 23.1). Host–pathogen systems biology therefore makes full use of the standard systems biology
[Figure 23.1 here: panels for pathogen systems, interface systems and host systems, with component labels including the immune system, membrane transporters, microbiota, metabolism, surface glycans, surface receptors, protein interactions, invasion mechanisms, secreted molecules, information processing, gene regulation, cellular remodelling and the genome, linked to evolution, the disease process, drug treatment and drug resistance.]
Figure 23.1 Schematic showing some of the major components of host–pathogen systems and the relationships between them. Each labelled box represents an aspect of the combined system that can be mapped and/or modelled in some way. Although the figure uses the example of an intracellular human pathogen, many of the same systems are relevant to other host–pathogen systems in animals or plants
toolkit of annotated genome sequence data, metabolic, protein interaction and gene regulation networks, high-throughput experimental measurements and computational modelling and simulation. Frequently, however, the networks of interest are only poorly defined in parasitic organisms and must be estimated or assembled from model organism templates and the available evidence.
23.2 Pathogen genomics
The genome sequences of many of the major human pathogens have already been published, with comprehensive online gateways established for data relating to protozoan parasites (http://eupathdb.org), pathogenic bacteria (http://www.patricbrc.org) and viruses (http://www.viprbrc.org). In addition, several fungal plant pathogens of agricultural significance have been sequenced [4]. The completion of each of these pathogen genome sequences is of course a significant scientific achievement; equally, each sequence marks the beginning of the hard work of building up a coherent genome-scale understanding of the biology of the disease. In fact, comparative genomic approaches allow a significant amount of information to be gleaned from the pathogen sequence alone, even in the absence of cellular network models [5]. In bacteria, virulent phenotypes are associated with genomic pathogenicity islands, the mapping of which has been a major undertaking [6]. The integration of sequence data across organisms with reference to these features can give insights into the function and evolution of virulence mechanisms. For example, the genomes of the human pathogens, enteropathogenic Escherichia coli (EPEC) and enterohaemorrhagic E. coli (EHEC) contain a pathogenicity island known as the locus for enterocyte effacement (LEE), also acquired independently by the murine pathogen Citrobacter rodentium [7]. The presence of other key virulence factors in all three genomes is evidence of the convergent evolution of a common infection strategy [8]. An understanding of the key components of this strategy has therefore helped to increase the value of C. rodentium as a model for the genetic basis of virulence in pathogenic E. coli. Bioinformatic resources for pathogen genomics have developed as the number of genome sequences has increased over the last 10 years. There are now several online databases offering collections of pathogen genomes and providing access to genome annotation, proteomic and gene expression data – see for example the community resources coordinated through the US NIAID project, listed at http://pathogenportal.net. As these databases have grown beyond their original designs, the maintenance of coherent information across multiple sequences has become challenging [9] and the adoption of high-throughput sequencing technologies is expected to add further to this informatics burden. Although it is not yet clear how best to store, search or give access to the anticipated high volumes of data, integration of these online resources will undoubtedly become even more important as the applications of systems biology for pathogenic organisms continue to expand.
23.3 Metabolic models
Models of metabolic networks are among the most developed aspects of systems biology and are in general associated with good predictive power. Metabolic modelling has been applied to many pathogenic organisms, often for the purposes of identifying suitable drug targets. However, integrated models of host and pathogen metabolism are also starting to illuminate various parts of organismal biology that have previously been poorly understood. The initial stages of metabolic modelling usually involve the reconstruction of a draft stoichiometric model of the relevant organism’s metabolism, known as a genome-scale metabolic network [10]. Although several automated systems for metabolic reconstruction exist, driven by knowledge-based assignments of enzymatic
function to predicted protein sequences, the preliminary models that result from such analyses are expected to contain numerous errors, and hence a considerable amount of manual curation must also be applied in order to obtain a model that will produce reliable results. In the case of many parasites, the phylogenetic distances over which protein functions must be transferred from the model organisms in which they are characterized are so great that some enzymes may be very difficult to identify correctly [11]. Nevertheless, recent developments in automated metabolic model generation now make it possible to process a genome sequence to obtain a model that is self-consistent and ready for flux balance analysis (FBA) [12]. These automated models provide a starting point, although further checks and validation remain essential before they can be deemed reliable. Genome-scale metabolic reconstructions have been used in a wide variety of analyses [13]. The most widely applied methodology for metabolic network modelling is constraint-based analysis, where information from experimental or other sources is used to constrain the potentially accessible states of the system. This is the basis for FBA, where the relative flux through each metabolic reaction can be estimated, based on knowledge of a viable growth medium, a formula for biomass composition and the assumption that evolution has acted to optimize a given objective function, usually the rate of biomass production [14]. Although these assumptions may be questionable in many cases, FBA can often give reliable predictions of growth rates on different substrates. It has the advantage over more detailed modelling methods that it does not require knowledge of enzyme kinetics, so once a reliable stoichiometric model has been assembled, relatively little additional experimental information is required. One of the chief applications of FBA is in the refinement of these models themselves, since any discrepancies between predicted and observed knockout phenotypes or substrate requirements can indicate misassignment of enzyme function or other deficiencies in the model. As an extension of the genome annotation process, therefore, metabolic modelling can help to identify the functions of pathogen-encoded enzymes or transporters where these are difficult to recognize from sequence alone. Given a particular environment, it is also possible to predict essential genes from the model, which has yielded potential drug targets, for example in Mycobacterium tuberculosis [15]. Despite the successes of FBA, it remains fundamentally limited as an analysis method. One serious difficulty is in its prediction of the effect of partial gene knockdown, such as would be effected on an enzyme by an inhibitory drug. The method known as Metabolic Control Analysis (MCA) has proven useful in an analysis of glycolysis inhibition in Trypanosoma brucei [16]. However, the requirement for kinetic parameter values for each enzyme in the system has so far made this method difficult to apply more widely. The metabolic networks of pathogens and symbionts have been of considerable interest for evolutionary biologists. The evolution of a parasitic or symbiotic lifestyle is associated with the loss of many genes that would be essential for the ancestral, free-living organism, especially where the host environment now provides many of the metabolic intermediates that would otherwise have to be synthesized from other nutrients. In an influential study, P´al et al. 
[17] used FBA as part of a simulation of the evolution of the metabolic network of the aphid intracellular endosymbiont, Buchnera aphidicola, from an ancestor with a metabolism similar to E. coli. By repeatedly deleting an enzyme from the network and testing viability within the host using FBA, they reduced the metabolic network to a minimal set of reactions, each of which was now essential. Since early deletions determined later essentiality, different simulations gave different networks. However, the genes retained agreed well with those found in B. aphidicola, having an accuracy of over 80%. Other researchers have also investigated how the architecture of metabolic networks is affected by interactions between different species. For example, Borenstein and Feldman [18] used an analysis of network topology to show structural differences between the metabolisms of parasitic and free-living bacteria and to derive a measure of host–parasite metabolic complementarity. Although there have already been many useful applications of metabolic modelling for pathogens, the genome-scale metabolic networks that are generally discussed in the literature only capture a part of the relevant information. In addition to our partial knowledge of enzyme kinetic parameters, there is also the question of how enzymes are regulated and when and where they are expressed. In eukaryotic parasites,
the presence of specialized organelles with distinct metabolic functions adds a further level of complexity to metabolic modelling. Many of these parasites undergo massive changes in cellular organization and gene expression as they undergo transition from one life stage to the next. Integrated analysis of metabolic network models with transcriptomic and proteomic data will be required to establish such context-dependent metabolic network models for parasitic organisms. A step in this direction has been taken with a stage-specific model for Trypanosoma cruzi metabolism, where fluxes for enzymes not detected in a proteomic screen of the epimastigote (insect gut) stage were constrained to zero [19]. Beyond analysis of the pathogen metabolism in isolation, considering the combined metabolic system of the host and pathogen provides an important further level of analysis. A combined metabolic model of a Plasmodium falciparum merozoite within a human erythrocyte was constructed by Huthmacher et al. [20], with reference to the extensive transcriptomic data available for blood stage parasites. Using FBA simulations, this model was used to predict drug targets within the merozoite. Comparison with experimentally verified essential enzymes gave an overall prediction accuracy of 79%, substantially higher than that obtained using a previous method based on topological chokepoints. In a similar way, a metabolic model of M. tuberculosis within the human macrophage has also been constructed [21]. Analysis of this integrated model highlighted dramatic shifts in central metabolism between in vitro and in vivo growth (Figure 23.2). Transcriptomic data were also used to derive context-specific networks, predicting further differences between latent, pulmonary and meningeal infection states. This type of metabolic reprogramming is a major factor for the persistence of the bacillus in the human host, hence systems biology is expected to play an important part in the development of new therapies for tuberculosis [22]. The same host–microbe metabolic analyses can be equally well applied to the interactions between the gut epithelium and the commensal microbiota. Although not normally pathogenic in nature, these associations are vital to the health of the host; malfunctions are increasingly being associated with obesity and metabolic syndrome [23]. The relationship between the commensal microbiota, microbial pathogens and the immune system is also thought to be at the root of many inflammatory diseases [24], hence the homeostasis of this complex host–microbe system is emerging as a timely subject for studies of robustness in living systems [25].
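To make the constraint-based analysis used throughout this section concrete, the sketch below sets up flux balance analysis for a deliberately tiny toy network and then performs an in silico knockout by forcing one flux to zero, in the spirit of the essentiality predictions, reductive-evolution simulations and proteomics-constrained models cited above. The network, reaction names and flux bounds are invented for illustration only; genome-scale models contain thousands of reactions and are normally handled with dedicated software rather than a raw linear-programming call.

```python
# Flux balance analysis on a toy network: maximise the biomass flux subject to
# steady-state mass balance (S v = 0) and flux bounds. Illustrative only.
import numpy as np
from scipy.optimize import linprog

# Stoichiometric matrix S: rows = metabolites (A, B), columns = reactions
#   R1: medium -> A,   R2: A -> B,   R3: B -> biomass (the objective flux)
S = np.array([[1.0, -1.0,  0.0],
              [0.0,  1.0, -1.0]])
c = np.array([0.0, 0.0, -1.0])       # linprog minimises, so maximise v_R3 via its negative

def max_biomass(bounds):
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds)
    return res.x[2]

wild_type = [(0.0, 10.0), (0.0, 1000.0), (0.0, 1000.0)]   # uptake through R1 limited by the medium
knockout  = [(0.0, 10.0), (0.0, 0.0),    (0.0, 1000.0)]   # in silico deletion: flux through R2 forced to zero

print(max_biomass(wild_type))   # 10.0 -> growth is possible
print(max_biomass(knockout))    # 0.0  -> R2 is predicted to be essential in this toy model
```

Constraining individual fluxes to zero in this way is also how proteomic evidence was folded into the Trypanosoma cruzi model mentioned above: reactions whose enzymes are not detected in a given life stage are simply excluded from the feasible flux space.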
23.4 Protein–protein interactions
As a consequence of the increasing accessibility of the yeast two-hybrid system and other technologies for querying protein interactions at high-throughput, proteome-wide datasets of protein–protein interactions are becoming more widely used in the study of pathogenic organisms. Comparative analyses of network structure have hinted that the protein interaction networks of parasites may be organized in a different way to those of free-living organisms. The P. falciparum interactome appears to share surprisingly little overlap with those of model organisms [26], and may even contain a ‘rich club’ of highly interconnected proteins [27]. Although the quality and completeness of the data still fall somewhat short of that necessary to prove these assertions conclusively, it is interesting to consider whether there may be reasons for an organism’s lifestyle to affect its repertoire of protein interactions. The observation of low conservation of interactions may indicate a greater plasticity in the rewiring of protein interactions, possibly related to the presence of secreted proteins that must interact with their host targets as well as each other. Below the genome scale, integration of protein interaction data is starting to allow the construction of network models for some of the subsystems that are essential to pathogen survival or virulence, such as bacterial motility [28] or invasion of human cells [29]. For some viral systems, sufficient structural information is now available to link the mechanisms of invasion and transport to the finest details of the protein interactions involved [30].
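The ‘rich club’ mentioned above has a simple operational meaning: for each degree k, one asks how densely the proteins of degree greater than k interact among themselves, and whether that density exceeds what degree-matched random networks would give. The hedged sketch below computes the unnormalized coefficient on a randomly generated toy graph standing in for an interactome; it is illustrative only and makes no claim about the P. falciparum data.

```python
import networkx as nx

# Toy network standing in for a protein interaction network.
G = nx.barabasi_albert_graph(200, 3, seed=1)

# Unnormalised rich-club coefficient: density of links among nodes of degree > k.
rc = nx.rich_club_coefficient(G, normalized=False)
print({k: round(v, 2) for k, v in sorted(rc.items())[-5:]})
# A genuine 'rich club' is claimed only when these densities exceed those of
# degree-preserving randomisations of the same network.
```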
Figure 23.2 Map of the central metabolism of Mycobacterium tuberculosis, showing predicted changes of expression states when comparing in vivo (within macrophage) and in vitro conditions [21]. Reactions predicted to be up-regulated in vivo are labelled in green, while down-regulated reactions are labelled in red
Experimentally determined data on direct protein interactions between host and pathogen are still relatively sparse, although sets of interactions have been presented for a number of host–viral systems [31] (collected in the VirusMINT database [32]) and three bacterial pathogens [33]. Given the current paucity of reliable data, one clear goal of host–pathogen bioinformatics is the development of methods to predict the set of pathogen proteins involved in protein interactions with the host, together with the set of host proteins that are targeted. Taking these studies further, predictions of the bound complexes would also be of considerable value in computational drug design. Today, there are some accurate, sequence-based methods for predicting proteins secreted via specific pathways (for example, by the type 3 secretion system in gram-negative bacteria [34, 35]) or those likely to be displayed on the parasite’s surface. In the absence of such signals, it is at least possible to identify those pathogen-encoded protein domains that show homology to host protein domains with known binding properties. These associations are transferred to produce a list of host–pathogen protein
pairs that may have the potential to bind via a known interface. This interolog approach has been applied with some success, for example in the prediction of host targets of Plasmodium proteins [36], or human SH3 domain targets of viral proline-rich domains [37]. The idea that some host–pathogen protein interactions might have homology to within-host interactions is particularly attractive in the case of eukaryotic parasites, which have many protein domains in common with their hosts. To narrow down lists such as those produced by interolog approaches, we can use evolutionary information to find domain pairs that appear to be conserved in the appropriate species. Coevolution may be evident at interfaces that experience strong selection pressures [38]. Where structures for the endogenous proteins are available, it may also be possible to identify characteristic amino acid changes that will stabilize the predicted complex.
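The domain-transfer (‘interolog’) reasoning described above can be written down very compactly: reduce every protein to its set of domains, and nominate a host–pathogen pair as a candidate interaction whenever it carries a domain pair with a known endogenous interface. In the hedged sketch below all protein and domain identifiers are hypothetical placeholders rather than entries from any real database.

```python
from itertools import product

# Curated domain-domain interactions (toy examples only).
known_ddi = {("SH3", "Pro_rich"), ("PDZ", "PDZ_binding")}

host_domains = {"HostProt1": {"SH3", "Kinase"}, "HostProt2": {"PDZ"}}
pathogen_domains = {"PathProt1": {"Pro_rich"}, "PathProt2": {"Lipase"}}

def candidate_interactions(host, pathogen, ddi):
    """Return (host protein, pathogen protein, domain pair) triples supported by a known interface."""
    hits = []
    for (h, hdoms), (p, pdoms) in product(host.items(), pathogen.items()):
        for dh, dp in product(hdoms, pdoms):
            if (dh, dp) in ddi or (dp, dh) in ddi:
                hits.append((h, p, (dh, dp)))
    return hits

print(candidate_interactions(host_domains, pathogen_domains, known_ddi))
# [('HostProt1', 'PathProt1', ('SH3', 'Pro_rich'))]
```

In practice such candidate lists are long, which is why the filtering by secretion signals, surface localization, conservation and coevolution discussed in this section matters before any predicted pair is taken seriously.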
23.5 Response to environment
The ability to respond to environmental cues in different parts of the host is essential for all pathogens. In some cases this may be as simple as switching on or off the production of a single protein. However, many parasites require massive cellular reorganization at different stages of the life cycle, sometimes with substantial morphological differences at the level of organelles or surface structures. These switches are associated with discrete developmental programmes of gene expression, triggered by a molecular signal or a change in environment. Modelling these signal transduction and gene regulation systems is of interest in drug development, since any disruption of these pathways would severely compromise pathogen viability. Many bacterial pathogens adapt their behaviour to suit different contexts within the host and this capability is usually directly related to virulence. Insights into the regulatory systems controlling such responses can be gained by screening deletion mutants in known virulence-associated regulators for the ability to establish and maintain an infection, as was done in Salmonella by Yoon et al. [39]. Out of 83 regulators tested, they were able to identify 14 that were clearly necessary for systemic mouse infection. By integrating transcriptional profiles of these genes, they constructed a regulatory network model, showing how multiple external stimuli can result in the transcription of the same set of genes required for infection. The phenotypic shifts displayed by other bacteria can be more extreme, as illustrated by the dramatically different expression patterns of Mycobacterium avium subsp. paratuberculosis (MAP) within cow tissues as compared with macrophages [40]. The regulation of such responses is sometimes more complex than direct transcriptional control. For example, uropathogenic E. coli (UPEC) expresses type 1 fimbriae, hair-like surface structures that are important for establishing adhesion to the host cell. The expression of the structural FimA and adhesive FimH protein subunits is under the control of a stochastic process, where individual cells switch phenotype between fimbriate and afimbriate on the inversion of a short chromosomal region known as fimS. This switch is affected by a number of internal and external factors, most importantly temperature, although the relationship between temperature and switching rate is known to be complex and difficult to model. Using automated multiscale abstraction, however, Kuwahara et al. [41] were able to gain useful insights into the modular structure and behaviour of this regulatory system at different temperatures, suggesting a mode of action for the regulatory recombinase, FimB. In eukaryotic pathogens, environmental responses may lead to large changes in cellular morphology. Exposure of Candida albicans to blood plasma causes a switch from the yeast to the hyphal form, and although the fungus is normally cleared quickly from the bloodstream, in systemic infections these hyphae are able to penetrate through the endothelium into other tissues. Our knowledge of the related yeast, S. cerevisiae, allows for a productive application of comparative systems biology in examining transcriptional responses of these fungi to human blood [42] or phagocytosis [43]. One component of the C. albicans stress response known to be related to virulence is the cAMP pathway, for which a detailed dynamic model is now available
in S. cerevisiae [44]. These studies highlight the value in translating the systems biology of model organisms to their pathogenic relatives.
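The fimbriation switch discussed earlier in this section is, at its simplest, a single cell flipping stochastically between two phases at rates that depend on temperature and recombinase activity. The sketch below simulates that caricature exactly with a two-state Gillespie algorithm; the rate constants are arbitrary illustrative numbers, not values taken from the model-abstraction study cited above.

```python
# Two-state stochastic switch (afimbriate <-> fimbriate) simulated exactly.
import random

def simulate_switch(k_on, k_off, t_end, state=0, seed=1):
    """Gillespie simulation of one cell; returns the list of (time, state) switching events."""
    rng = random.Random(seed)
    t, events = 0.0, [(0.0, state)]
    while True:
        rate = k_on if state == 0 else k_off     # only one reaction is possible in each state
        t += rng.expovariate(rate)
        if t > t_end:
            return events
        state = 1 - state                        # inversion of the fimS element flips the phase
        events.append((t, state))

# Illustrative rates only; a larger k_on could stand in for faster ON switching at 37 degrees.
events = simulate_switch(k_on=0.05, k_off=0.02, t_end=1000.0)
print(len(events) - 1, "switching events; long-run ON fraction ~ k_on / (k_on + k_off)")
```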
23.6 Immune system interactions
The host immune system is by definition one of the major interfaces between host and pathogen and most studies of host–pathogen interactions focus on this aspect of the relationship. Most fundamentally, immune evasion is a universal theme for pathogens, which must either disguise their presence from the immune system or possess strategies to deal with recognition and immunity. Immune cells interact directly with pathogens and parasitized cells, usually recognizing epitopes of surface proteins. However, some pathogens target immune cells specifically and are able to downregulate or otherwise subvert these mechanisms, resulting in propagation of the infection. The immune system has long been an attractive subject for computational biologists, notably in the prediction of interactions between the major histocompatibility complex (MHC) and pathogen-derived peptides [45]. From a systems biology perspective, it becomes important to move beyond such sequence-based predictors of immune response towards models of the signalling systems of the cells involved, as have for example been assembled from literature sources for the macrophage [46, 47] and dendritic cells [48]. The interactions between viruses and their host cells centre around control of the signalling molecules of the innate immune pathways, hence these models are helpful in understanding why some viruses can provoke more severe disease states than others [49] (Figure 23.3).
Figure 23.3 Model for the antagonistic immune interactions between the dendritic cell (DC) and Nipah virus [49]. The antagonist proteins P, V, and W encoded by the virus are effective suppressors of the DC immune responses
Several bacterial pathogens are able to establish infections within immune cells. M. tuberculosis is able to survive within the phagosome of an infected macrophage by inhibiting its maturation to the phagolysosome state [50], hence a more detailed understanding of the molecular biology of the phagosome is desirable. In an integrated experimental study, Stuart et al. [51] used proteomics to establish a protein interaction network characteristic of Drosophila phagosomes, then established functions for several components of this interactome using RNAi screening for phagocytosis of Staphylococcus aureus and E. coli. This resulted in a set of functional modules, including discrete sets of proteins associated with the engulfment of gram-negative and gram-positive pathogens. Although Drosophila and other invertebrates lack an adaptive immune system, the importance of fundamental processes such as phagocytosis to host–pathogen interactions, combined with the genetic tractability of nonvertebrate cell cultures, makes them very attractive for studies on the molecular mechanisms of immunity and their subversion by intracellular parasites [52]. For the many human diseases that are borne by insect vectors, a clearer picture of insect immunity can also be important in itself [53]. Immune evasion constitutes another major topic in host–pathogen systems biology, being implicated not only in pathogen survival, but also in many of the pathogenic effects of infection [54]. Clearly, a systems-level understanding of the diverse immune evasion strategies employed by different pathogens would be hugely beneficial for progress in vaccine and drug development. One aspect that has been investigated in a systems context is the use of glycans for immune suppression and camouflage, a strategy that is shared from bacterial pathogens [55] to helminth parasites [56]. Glycan arrays and other emerging technologies for high-throughput glycomics make it possible to screen proteins against multiple carbohydrates in parallel, which is starting to have an impact in immunology research and vaccine development [57]. Some parasites have evolved complex genomic mechanisms associated with immune evasion, allowing them to switch on the expression of only one of a large number of possible versions of a hyper-variable surface protein at each generation and thus escape recognition by acquired immunity. The most well known of these mechanisms belongs to the human malaria parasite P. falciparum, whose var genes code for the PfEMP1 protein that causes cytoadhesion of infected erythrocytes to the endothelium, leading to many of the most serious complications of the disease. With multiple P. falciparum genome sequences now available, computational studies of the natural variation of PfEMP1 have become more informative as a tool for vaccine research [58]. Models of the dynamics of antigen variation, parasitaemia and immune response can also help to reconcile theories about switching mechanisms with epidemiological observations [59, 60]. Moving beyond the cellular level, there are several examples of immune models that consider the totality of the multicellular immune system and its response over time, a particularly appealing subject for systems biology. Such models often operate at multiple scales and require multiple computational techniques, for example in linking the prediction of epitopes to the modelling of T- and B-cell populations following infection or immunization [61]. 
In mechanistic models of real disease processes, however, sparse data mean that parameter estimation and model comparison can remain challenging [62]. Owing to its importance as a major cause of human disease and its long-term persistence in the host, large scale modelling of M. tuberculosis infection has received considerable attention [63]. The dynamics of the immune response are key to a systems biology understanding of tuberculosis persistence, since the slow growth rate of the bacillus is thought to be intrinsic to its immune evasion strategy [64, 65]. Spatial simulations of bacterial and immune cell behaviour have also played an important role in understanding the formation of granulomas of different types [66].
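Most population-level models of infection and immunity of the kind cited above start from a small set of coupled ordinary differential equations. The sketch below is a deliberately minimal pathogen–effector caricature with invented parameter values, not any of the published tuberculosis or agent-based models; it simply illustrates the kind of structure onto which the parameter estimation and model comparison problems mentioned above are applied.

```python
# Minimal pathogen (P) versus immune effector (E) dynamics.
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y, r=0.5, k=1e-3, a=1e-4, d=0.1):
    P, E = y                         # pathogen load and immune effector abundance
    dP = r * P - k * P * E           # replication minus immune killing
    dE = a * P * E - d * E           # antigen-driven proliferation minus turnover
    return [dP, dE]

sol = solve_ivp(rhs, (0.0, 100.0), [1.0, 100.0], t_eval=np.linspace(0.0, 100.0, 6))
print(np.round(sol.y[0], 1))         # pathogen load over time for these invented parameters
```

Even for a model this small, sparse clinical or animal data can leave several parameter combinations equally plausible, which is the estimation and model-comparison difficulty flagged in [62].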
23.7 Manipulation of other host systems
All pathogens manipulate their host environment to some extent, though some have evolved extraordinarily elaborate mechanisms by which to secure access to nutrients or another type of fitness advantage, often causing
harm to the host. Many of these interventions are mediated by pathogen–host protein interactions, although examples are also known of regulation of host genes or metabolic manipulations. Models of such processes can be important in increasing our understanding of the most harmful aspects of pathogen biology and in highlighting key interactions that may be susceptible to chemotherapy. Occasionally, we have enough knowledge of the individual component interactions of a host–pathogen system to start to build up a picture of the host subsystems that are being manipulated. This is the case with the HIV-1 human interaction network, a compilation of current knowledge drawn from approximately 3200 published articles on the molecular biology of HIV infection, both in vitro and in vivo [67, 68] [Figure 23.4(a)]. Using a biclustering method borrowed from microarray expression data analysis, MacPherson et al. [69] recovered a tree describing the relationships between clusters of interactions between host and virus [Figure 23.4(b)]. Assigning biological processes to various sectors of this tree, they showed that it is possible to recover higher level biology from an integrative analysis of host–pathogen interactions. Together, the processes highlighted form a concise natural history of the HIV replication process and a catalogue of the human subsystems that are subverted. Other analyses of the same data [70, 71] have also drawn links with the host factors that have been associated with HIV susceptibility from genome-wide association studies, or in vitro using RNAi screens [72]. One clear and relatively tractable example of bacterial interference in a host system is that of Helicobacter pylori, which is a major risk factor for peptic ulcers and gastric cancer. H. pylori possessing the cag pathogenicity island manipulates host cell signalling by injection of the CagA virulence factor into the host cell cytoplasm, where it modulates the receptor tyrosine kinase c-Met, causing cell scattering and ultimately leading to disease. A logical model for c-Met signal transduction was constructed by Franke et al. [73], who modelled the effects of the pathogen intervention in this pathway. Using in silico knockout of other pathway components, they were also able to identify the human protein PLCγ1 as a key factor in the pathogenic consequences of H. pylori infection and a possible therapeutic target that would be independent of bacterial drug resistance. Clearly, the consequences of infection are not restricted to a single cell or even a single tissue, but can involve a whole-organism response as the host’s own systems attempt to maintain homeostasis. Such a large-scale response was modelled by Saric et al. [74], who investigated the metabolic response in rats to infection by the liver fluke, Fasciola hepatica. Measuring metabolite concentrations in a number of tissues and biofluids, they found that the fluke is able to influence brain neurochemistry, causing a shift from adenosine towards inosine. Since inosine has been shown to have anti-inflammatory properties, this manipulation may in fact be an example of an indirect immune evasion strategy.
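Logical models of the kind used in the c-Met study above treat each signalling component as ON or OFF, propagate the pathogen input through Boolean update rules, and implement an in silico knockout by pinning a node to OFF. The toy sketch below illustrates that workflow with an invented three-node cascade; it is emphatically not the published model, and the wiring is chosen purely to show how a knockout screen singles out a downstream node.

```python
# Toy Boolean model: pathogen input (CagA) or host ligand (HGF) activates a receptor,
# which drives a downstream effector and, finally, a phenotypic output (cell scattering).
def update(state, knockouts=frozenset()):
    """One synchronous update; knocked-out nodes are forced OFF."""
    nxt = {
        "HGF":     state["HGF"],
        "CagA":    state["CagA"],
        "cMet":    state["HGF"] or state["CagA"],
        "PLCg1":   state["cMet"],
        "Scatter": state["PLCg1"],
    }
    return {node: (value and node not in knockouts) for node, value in nxt.items()}

def scatter_at_steady_state(state, knockouts=frozenset(), steps=10):
    for _ in range(steps):
        state = update(state, knockouts)
    return state["Scatter"]

infected = {"HGF": False, "CagA": True, "cMet": False, "PLCg1": False, "Scatter": False}
print(scatter_at_steady_state(infected))                        # True: the injected factor drives scattering
print(scatter_at_steady_state(infected, knockouts={"PLCg1"}))   # False: the in silico knockout blocks it
```

The appeal of the formalism is that an exhaustive knockout screen over every node is computationally cheap, which is how candidate intervention points downstream of the pathogen input are singled out.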
23.8 Evolution of the host–pathogen system
The evolution of pathogens and their virulence is of enormous practical importance in combating infectious disease. Access to genome sequences for multiple strains of the major human pathogens has led to a better understanding of the evolutionary mechanisms responsible for virulence and its maintenance, notably the prevalence of horizontal gene transfer among bacteria and the association of gene loss with adaptation to a pathogenic lifestyle [75]. The comparative genomics of closely related species is well established as a means to identify potential virulence factors. However, the comparison of species at the systems level can illuminate those aspects of evolutionary biology that emerge from the interactions between genetic components, for example the metabolic pathways in the Mycobacteria and their relationship with virulence [76]. The influence of pathogens on the host’s own evolution is also a subject that can be investigated within the framework of systems biology. Parasites have been present since the earliest forms of life, and their presence has shaped many aspects of extant species. Susceptibility to parasite infection is clearly associated with a
Figure 23.4 (a) View of the HIV-1 human protein interaction network [67]. Grey ovals represent HIV-1 proteins. The smaller circles represent human proteins and are coloured according to their known biological processes. Black lines correspond to direct interactions (e.g. binding) and grey lines to indirect interactions (e.g. downregulation). (b) Neighbour joining tree showing HIV-interacting host subsystems, derived from the data shown in (a) using a biclustering approach [69]
reduction in the fitness of the host. Given that both host and parasite populations are subject to genetic change, any loci that impact on these interactions will experience a constantly shifting landscape of selection pressure. These host–parasite interactions, together with the ubiquity of parasite infection, may provide part of the explanation for evolutionary phenomena that are otherwise difficult to understand, such as the maintenance of sexual reproduction in animals and plants (known as the Red Queen hypothesis [77, 78]) and much of the genetic diversity observed in natural populations [79]. From a systems biology perspective, we can imagine pathogens as providing perturbations at specific points in a cellular network. The natural history of host–pathogen interactions is therefore deeply connected to the notion of ‘robustness’ of biochemical systems. This is the starting point for a theoretical study by Salathé and Soyer [80], who investigated how a model signalling pathway evolves under selection pressure from parasite interactions. They found that antagonistic interactions result in selection for robustness against gene loss, which manifests itself as specific architectural patterns in the pathway. These architectures tend to be evolutionarily preserved even after the parasite interactions are removed, implying that real biochemical networks might retain distinct topological features and robustness properties as evidence of ancestral interactions with parasites.
23.9 Towards systems medicine for infectious diseases
One of the most exciting applications of host–pathogen systems biology is in the development of new medical interventions for infectious diseases. In this context, models of the host–pathogen system are being combined with other mathematical models to predict the effects of drug treatment and the evolution of drug resistance. Population dynamics approaches are of considerable interest in both the clinical and public health applications of host–pathogen systems biology, whether for modelling the immune response to infection or for the purposes of understanding the pathogen’s evolution within the patient or population. There are strong links here with control theory, where the goal is to design treatments or interventions that will optimize a given clinical variable. This would be of value in planning anti-HIV drug regimes, where there may be a trade-off between minimizing drug dosages and minimizing viral load [81], or in optimizing the timing of structured treatment interruptions for long-term control of the virus [82]. There is currently a great deal of interest in producing integrated models that combine multiple aspects of host–pathogen systems biology with the aim of facilitating drug development. For example, an integrated pipeline for drug target identification and validation has been developed for M. tuberculosis, incorporating protein interaction and metabolic network analysis to suggest critical proteins, followed by sequence and structure comparison against human proteins to assess potential as a drug target [83]. Following on from a purely constraint-based assessment of targeted enzymes, integration of the metabolic model with a population growth model can provide a more sophisticated way to compare the effects of inhibitors on disease progression [84]. Host–pathogen systems biology aims to provide a greater understanding of the molecular interactions that are essential to pathogen survival and reproduction. With this knowledge comes the ability to identify viable drug targets on the host side of the host–pathogen interface, for example the CCR5 receptor targeted by the HIV entry inhibitor Maraviroc. Since the host proteins do not mutate during treatment, this approach is thought to be less likely to result in drug resistance. From the point of view of the costs associated with drug development, another clear advantage is that it may be possible to repurpose many existing drugs targeting host proteins that are already licensed for other conditions [85].
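Much of the treatment-optimization work cited above sits on top of a small within-host viral dynamics model, with drug action entering as an efficacy factor that scales the infection rate; the control problem is then to choose that efficacy, or an interruption schedule, so as to trade viral load against dosage. The sketch below shows a standard target-cell-limited model with a single constant efficacy parameter; the parameter values are generic illustrative numbers rather than estimates from any of the cited studies.

```python
# Target cells T, infected cells I, free virus V; eps is drug efficacy in [0, 1].
from scipy.integrate import solve_ivp

def hiv(t, y, eps, lam=1e4, d=0.1, beta=2e-6, delta=0.7, p=100.0, c=13.0):
    T, I, V = y
    dT = lam - d * T - (1.0 - eps) * beta * T * V
    dI = (1.0 - eps) * beta * T * V - delta * I   # drug reduces new infections of target cells
    dV = p * I - c * V
    return [dT, dI, dV]

for eps in (0.0, 0.5, 0.9):                       # increasing drug pressure
    sol = solve_ivp(hiv, (0.0, 200.0), [1e5, 0.0, 1e-3], args=(eps,))
    print(eps, round(float(sol.y[2][-1]), 2))     # viral load remaining at the end of the run
```

Embedding such a model in an optimal control loop, as in the dosage-design and treatment-interruption studies cited above, then amounts to choosing the efficacy or treatment schedule over time to minimize a cost that penalizes both viral load and cumulative drug exposure.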
23.10 Concluding remarks
In its broadest sense, the field of host–pathogen systems biology encompasses any application of systems modelling to any aspect of infection or immunity. Here I have attempted to give a flavour of the types of
research questions that have been asked in the context of important human pathogens, the methodologies used and the progress made so far in each direction. However, it is clear that this field will benefit greatly from ongoing advances in the core systems biology methods and increasing access to data on the intrinsic host and pathogen systems, as well as the molecular details of their many interactions.
Acknowledgements
JWP is funded by a Royal Society University Research Fellowship. Orkun Soyer is thanked for helpful suggestions.
References [1] H. H. Flor, Mutations in flax rust induced by ultraviolet radiation, Science, vol. 124, pp. 888–9, 1956. [2] R. K. Grosberg and M. W. Hart, Mate selection and the evolution of highly polymorphic self/nonself recognition genes, Science, vol. 289, pp. 2111–4, 2000. [3] A. F. Agrawal and C. M. Lively, Modelling infection as a two-step process combining gene-for-gene and matchingallele genetics, Proc Biol Sci, vol. 270, pp. 323–34, 2003. [4] J.-R. Xu, Y.-L. Peng, M. B. Dickman, and A. Sharon, The dawn of fungal pathogen genomics, Annu Rev Phytopathol, vol. 44, pp. 337–66, 2006. [5] J. M. Carlton, Genome sequencing and comparative genomics of tropical disease pathogens, Cell Microbiol, vol. 5, pp. 861–73, 2003. [6] S. J. Ho Sui, A. Fedynak, W. W. L. Hsiao, M. G. I. Langille, and F. S. L. Brinkman, The association of virulence factors with genomic islands, PLoS ONE, vol. 4, p. e8094, 2009. [7] W. Deng, Y. Li, B. A. Vallance, and B. B. Finlay, Locus of enterocyte effacement from Citrobacter rodentium : sequence analysis and evidence for horizontal transfer among attaching and effacing pathogens, Infect Immun, vol. 69, pp. 6323–35, 2001. [8] N. K. Petty, R. Bulgin, V. F. Crepin, A. M. Cerde˜no-T´arraga, G. N. Schroeder, M. A. Quail, N. Lennard, C. Corton, A. Barron, L. Clark, A. L. Toribio, J. Parkhill, G. Dougan, G. Frankel, and N. R. Thomson, The Citrobacter rodentium genome sequence reveals convergent evolution with human pathogenic Escherichia coli, J Bacteriol, vol. 192, pp. 525–38, 2010. [9] D. E. Sullivan, J. L. Gabbard, M. Shukla, and B. Sobral, Data integration for dynamic and sustainable systems biology resources: challenges and lessons learned, Chem Biodivers, vol. 7, pp. 1124–41, 2010. [10] A. M. Feist, M. J. Herrg˚ard, I. Thiele, J. L. Reed, and B. Ø. Palsson, Reconstruction of biochemical networks in microorganisms, Nat Rev Micro, vol. 7, pp. 129–43, 2009. [11] J. W. Pinney, B. Papp, C. Hyland, L. Wambua, D. R. Westhead, and G. A. McConkey, Metabolic reconstruction and analysis for parasite genomes, Trends Parasitol, vol. 23, pp. 548–54, 2007. [12] C. S. Henry, M. DeJongh, A. A. Best, P. M. Frybarger, B. Linsay, and R. L. Stevens, High-throughput generation, optimization and analysis of genome-scale metabolic models, Nat Biotech, vol. 28, p. 977, 2010. [13] M. A. Oberhardt, B. Ø. Palsson, and J. A. Papin, Applications of genome-scale metabolic reconstructions, Mol Syst Biol, vol. 5, p. 320, 2009. [14] J. D. Orth and B. Ø. Palsson, Systematizing the generation of missing metabolic knowledge, Biotechnol Bioeng, vol. 107, pp. 403–12, 2010. [15] D. J. V. Beste, T. Hooper, G. Stewart, B. Bonde, C. Avignone-Rossa, M. Bushell, P. Wheeler, S. Klamt, A. Kierzek, and J. Mcfadden, GSMN-TB: a web-based genome-scale network model of Mycobacterium tuberculosis metabolism, Genome Biol, vol. 8, p. R89, 2007. [16] M. A. Albert, J. R. Haanstra, V. Hannaert, J. Van Roy, F. R. Opperdoes, B. M. Bakker, and P. A. Michels, Experimental and in silico analyses of glycolytic flux control in bloodstream form Trypanosoma brucei, J Biol Chem, vol. 280, pp. 28306–15, 2005.
[17] C. P´al, B. Papp, M. Lercher, P. Csermely, S. G. Oliver, and L. D. Hurst, Chance and necessity in the evolution of minimal metabolic networks, Nature, vol. 440, pp. 667–70, 2006. [18] E. Borenstein and M. W. Feldman, Topological signatures of species interactions in metabolic networks, J Comput Biol, vol. 16, pp. 191–200, 2009. [19] S. Roberts, J. Robichaux, A. Chavali, P. Manque, V. Lee, A. Lara, J. Papin, and G. Buck, Proteomic and network analysis characterize stage-specific metabolism in Trypanosoma cruzi, BMC Syst Biol, vol. 3, p. 52, 2009. [20] C. Huthmacher, A. Hoppe, S. Bulik, and H. G. Holzh¨utter, Antimalarial drug targets in Plasmodium falciparum predicted by stage-specific metabolic network analysis, BMC Syst Biol, vol. 4, p. 120, 2010. [21] A. Bordbar, N. E. Lewis, J. Schellenberger, B. Ø. Palsson, and N. Jamshidi, Insight into human alveolar macrophage and M. tuberculosis interactions via metabolic reconstructions, Mol Syst Biol, vol. 6, p. 390, 2010. [22] D. J. V. Beste and J. McFadden, System-level strategies for studying the metabolism of Mycobacterium tuberculosis, Mol BioSyst, vol. 6, pp. 2363–72, 2010. [23] A. Waldram, E. Holmes, Y. Wang, M. Rantalainen, I. D. Wilson, K. M. Tuohy, A. L. McCartney, G. R. Gibson, and J. K. Nicholson, Top-down systems biology modeling of host metabotype-microbiome associations in obese rodents, J Proteome Res, vol. 8, pp. 2361–75, 2009. [24] C. V. Srikanth and B. A. McCormick, Interactions of the intestinal epithelium with the pathogen and the indigenous microbiota: a three-way crosstalk, Interdisciplinary perspectives on infectious diseases, vol. 2008, p. 626827, 2008. [25] H. Kitano and Oda, Robustness trade-offs and host-microbial symbiosis in the immune system, Mol Syst Biol, vol. 2, p. 2006.0022, 2006. [26] S. Suthram, T. Sittler, and T. Ideker, The Plasmodium protein network diverges from those of other eukaryotes, Nature, vol. 438, pp. 108–12, 2005. [27] S. Wuchty, Rich-club phenomenon in the interactome of P. falciparum – artifact or signature of a parasitic life style?, PLoS ONE, vol. 2, p. e335, 2007. [28] S. V. Rajagopala, Titz, Goll, J. R. Parrish, K. Wohlbold, M. T. McKevitt, T. Palzkill, H. Mori, R. L. Finley, and P. Uetz, The protein network of bacterial motility, Mol Syst Biol, vol. 3, p. 128, 2007. [29] A. Wang, S. C. Johnston, J. Chou, and D. Dean, A systemic network for Chlamydia pneumoniae entry into human cells, J Bacteriol, vol. 192, pp. 2809–15, 2010. [30] G. R. Nemerow, L. Pache, V. Reddy, and P. L. Stewart, Insights into adenovirus host cell interactions from structural studies, Virology, vol. 384, pp. 380–8, 2009. [31] S. M. Bailer and J. Haas, Connecting viral with cellular interactomes, Curr Opin Microbiol, vol. 12, pp. 453–9, 2009. [32] A. Chatr-Aryamontri, A. Ceol, D. Peluso, A. Nardozza, S. Panni, F. Sacco, M. Tinti, A. Smolyar, L. Castagnoli, M. Vidal, M. Cusick, and G. Cesareni, VirusMINT: a viral protein interaction database, Nucleic Acids Res, vol. 37, pp. D669–73, 2009. [33] M. D. Dyer, C. Neff, M. Dufford, C. G. Rivera, D. Shattuck, J. Bassaganya-Riera, T. M. Murali, and B. W. Sobral, The human-bacterial pathogen protein interaction networks of Bacillus anthracis, Francisella tularensis, and Yersinia pestis, PLoS ONE, vol. 5, p. e12089, 2010. [34] M. L¨ower and G. Schneider, Prediction of type III secretion signals in genomes of gram-negative bacteria, PLoS ONE, vol. 4, p. e5917, 2009. [35] Y. Yang, J. Zhao, R. L. Morgan, W. Ma, and T. 
Jiang, Computational prediction of type III secreted proteins from gram-negative bacteria, BMC Bioinformatics, vol. 11, Suppl. 1, p. S47, 2010. [36] M. D. Dyer, T. M. Murali, and B. W. Sobral, Computational prediction of host-pathogen protein-protein interactions, Bioinformatics, vol. 23, pp. i159–66, 2007. [37] M. Carducci, L. Licata, D. Peluso, L. Castagnoli, and G. Cesareni, Enriching the viral–host interactomes with interactions mediated by SH3 domains, Amino Acids, vol. 38, pp. 1541–47, 2010. [38] S. C. Lovell and D. L. Robertson, An integrated view of molecular coevolution in protein-protein interactions, Mol Biol Evol, Epub 14 June 2010. [39] H. Yoon, J. E. McDermott, S. Porwollik, M. McClelland, and F. Heffron, Coordinated regulation of virulence during systemic infection of Salmonella enterica serovar Typhimurium, PLoS Pathog, vol. 5, p. e1000306, 2009. [40] H. K. Janagama, E. A. Lamont, S. George, J. P. Bannantine, W. W. Xu, Z. J. Tu, S. J. Wells, J. Schefers, and S. Sreevatsan, Primary transcriptomes of Mycobacterium avium subsp. paratuberculosis reveal proprietary pathways in tissue and macrophages, BMC Genom, vol. 11, p. 561, 2010.
[41] H. Kuwahara, C. J. Myers, and M. S. Samoilov, Temperature control of fimbriation circuit switch in uropathogenic Escherichia coli: quantitative analysis via automated model abstraction, PLoS Comp Biol, vol. 6, p. e1000723, 2010. [42] C. Fradin, M. Kretschmar, T. Nichterlein, C. Gaillardin, C. D’ enfert, and B. Hube, Stage-specific gene expression of Candida albicans in human blood, Mol Microbiol, vol. 47, pp. 1523–43, 2003. [43] L. Rizzetto and D. Cavalieri, A systems biology approach to the mutual interaction between yeast and the immune system, Immunobiology, vol. 215, pp. 762–9, 2010. [44] T. Williamson, J.-M. Schwartz, D. B. Kell, and L. Stateva, Deterministic mathematical models of the cAMP pathway in Saccharomyces cerevisiae, BMC Syst Biol, vol. 3, p. 70, 2009. [45] M. Nielsen and O. Lund, NN-align. An artificial neural network-based alignment algorithm for MHC class II peptide binding prediction, BMC Bioinformatics, vol. 10, p. 296, 2009. [46] S. Raza, K. A. Robertson, P. A. Lacaze, D. Page, A. J. Enright, P. Ghazal, and T. C. Freeman, A logic-based diagram of signalling pathways central to macrophage activation, BMC Syst Biol, vol. 2, p. 36, 2008. [47] S. Raza, N. McDerment, P. A. Lacaze, K. Robertson, S. Watterson, Y. Chen, M. Chisholm, G. Eleftheriadis, S. Monk, M. O’Sullivan, A. Turnbull, D. Roy, A. Theocharidis, P. Ghazal, and T. C. Freeman, Construction of a large scale integrated map of macrophage pathogen recognition and effector systems, BMC Syst Biol, vol. 4, p. 63, 2010. [48] S. Patil, H. Pincas, J. Seto, G. Nudelman, I. Nudelman, and S. C. Sealfon, Signaling network of dendritic cells in response to pathogens: a community-input supported knowledgebase, BMC Syst Biol, vol. 4, p. 137, 2010. [49] J. Seto, L. Qiao, C. A. Guenzel, S. Xiao, M. L. Shaw, F. Hayot, and S. C. Sealfon, Novel Nipah virus immuneantagonism strategy revealed by experimental and computational study, J Virol, vol. 84, pp. 10965–73, 2010. [50] V. Briken, Molecular mechanisms of host-pathogen interactions and their potential for the discovery of new drug targets, Curr Drug Targets, vol. 9, pp. 150–7, 2008. [51] L. M. Stuart, J. Boulais, G. M. Charriere, E. J. Hennessy, S. Brunet, I. Jutras, G. Goyette, C. Rondeau, S. Letarte, H. Huang, P. Ye, F. Morales, C. Kocks, J. S. Bader, M. Desjardins, and R. A. B. Ezekowitz, A systems biology analysis of the Drosophila phagosome, Nature, vol. 445, pp. 95–101, 2007. [52] M. S. Dorer and R. R. Isberg, Non-vertebrate hosts in the analysis of host-pathogen interactions, Microbes Infect, vol. 8, pp. 1637–46, 2006. [53] R. M. Waterhouse, E. V. Kriventseva, S. Meister, Z. Xi, K. S. Alvarez, L. C. Bartholomay, C. Barillas-Mury, G. Bian, S. Blandin, B. M. Christensen, Y. Dong, H. Jiang, M. R. Kanost, A. C. Koutsos, E. A. Levashina, J. Li, P. Ligoxygakis, R. M. Maccallum, G. F. Mayhew, A. Mendes, K. Michel, M. A. Osta, S. Paskewitz, S. W. Shin, D. Vlachou, L. Wang, W. Wei, L. Zheng, Z. Zou, D. W. Severson, A. S. Raikhel, F. C. Kafatos, G. Dimopoulos, E. M. Zdobnov, and G. K. Christophides, Evolutionary dynamics of immune-related genes and pathways in disease-vector mosquitoes, Science, vol. 316, p. 1738, 2007. [54] P. Schmid-Hempel, Parasite immune evasion: a momentous molecular war, Trends Ecol Evol (Amst), vol. 23, pp. 318–26, 2008. [55] E. Kay, V. I. Lesk, A. Tamaddoni-Nezhad, P. G. Hitchen, A. Dell, M. J. Sternberg, S. Muggleton, and B. W. Wren, Systems analysis of bacterial glycomes, Biochem Soc Trans, vol. 38, pp. 1290–3, 2010. [56] I. van Die and R. D. 
Cummings, Glycan gimmickry by parasitic helminths: a strategy for modulating the host immune response?, Glycobiology, vol. 20, pp. 2–12, 2010. [57] B. Lepenies and P. H. Seeberger, The promise of glycomics, glycan arrays and carbohydrate-based vaccines, Immunopharmacol Immunotoxicol, vol. 32, pp. 196–207, 2010. [58] T. S. Rask, D. A. Hansen, T. G. Theander, A. Gorm Pedersen, and T. Lavstsen, Plasmodium falciparum erythrocyte membrane protein 1 diversity in seven genomes – divide and conquer, PLoS Comp Biol, vol. 6, p. e1000933, 2010. [59] M. Recker, S. Nee, P. C. Bull, S. Kinyanjui, K. Marsh, C. Newbold, and S. Gupta, Transient cross-reactive immune responses can orchestrate antigenic variation in malaria, Nature, vol. 429, pp. 555–8, 2004. [60] K. A. Lythgoe, L. J. Morrison, A. F. Read, and J. D. Barry, Parasite-intrinsic factors can explain ordered progression of trypanosome antigenic variation, Proc Natl Acad Sci USA, vol. 104, pp. 8095–100, 2007. [61] N. Rapin, O. Lund, M. Bernaschi, and F. Castiglione, Computational immunology meets bioinformatics: the use of prediction tools for molecular binding in the simulation of the immune system, PLoS ONE, vol. 5, p. e9862, 2010. [62] G. M. Dancik, D. E. Jones, and K. S. Dorman, Parameter estimation and sensitivity analysis in an agent-based model of Leishmania major infection, J Theoret Biol, vol. 262, pp. 398–412, 2010.
[63] D. Young, J. Stark, and D. Kirschner, Systems biology of persistent infection: tuberculosis as a case study, Nat Rev Micro, vol. 6, pp. 520–8, 2008. [64] M. P. Davenport, G. T. Belz, and R. M. Ribeiro, The race between infection and immunity: how do pathogens set the pace?, Trends Immunol, vol. 30, pp. 61–6, 2009. [65] K. Raman, A. G. Bhat, and N. Chandra, A systems perspective of host-pathogen interactions: predicting disease outcome in tuberculosis, Mol. BioSyst., vol. 6, pp. 516–30, 2010. [66] D. E. Kirschner, D. Young, and J. L. Flynn, Tuberculosis: global approaches to a global disease, Curr Opin Biotechnol, vol. 21, pp. 524–31, 2010. [67] R. G. Ptak, W. Fu, B. E. Sanders-Beer, J. E. Dickerson, J. W. Pinney, D. L. Robertson, M. N. Rozanov, K. S. Katz, D. R. Maglott, K. D. Pruitt, and C. W. Dieffenbach, Cataloguing the HIV Type 1 human protein interaction network, AIDS Res Hum Retroviruses, vol. 24, pp. 1497–502, 2008. [68] J. W. Pinney, J. E. Dickerson, W. Fu, B. E. Sanders-Beer, R. G. Ptak, and D. L. Robertson, HIV-host interactions: a map of viral perturbation of the host system, AIDS, vol. 23, pp. 549–54, 2009. [69] J. I. MacPherson, J. E. Dickerson, J. W. Pinney, and D. L. Robertson, Patterns of HIV-1 protein interaction identify perturbed host-cellular subsystems, PLoS Comp Biol, vol. 6, p. e1000863, 2010. [70] F. D. Bushman, N. Malani, J. Fernandes, I. D’Orso, G. Cagney, T. L. Diamond, H. Zhou, D. J. Hazuda, A. S. Espeseth, R. K¨onig, S. Bandyopadhyay, T. Ideker, S. P. Goff, N. J. Krogan, A. D. Frankel, J. A. T. Young, and S. K. Chanda, Host cell factors in HIV replication: meta-analysis of genome-wide studies, PLoS Pathog, vol. 5, p. e1000437, 2009. [71] S. Jaeger, G. Ertaylan, D. van Dijk, U. Leser, and P. Sloot, Inference of surface membrane factors of HIV-1 infection through functional interaction networks, PLoS ONE, vol. 5, p. e13139, 2010. [72] P. An and C. A. Winkler, Host genes associated with HIV/AIDS: advances in gene discovery, Trends Genet, vol. 26, pp. 119–31, 2010. [73] R. Franke, M. M¨uller, N. Wundrack, E.-D. Gilles, S. Klamt, T. K¨ahne, and M. Naumann, Host-pathogen systems biology: logical modelling of hepatocyte growth factor and Helicobacter pylori induced c-Met signal transduction, BMC Syst Biol, vol. 2, p. 4, 2008. [74] J. Saric, J. V. Li, J. Utzinger, Y. Wang, J. Keiser, S. Dirnhofer, O. Beckonert, M. T. A. Sharabiani, J. M. Fonville, J. K. Nicholson, and E. Holmes, Systems parasitology: effects of Fasciola hepatica on the neurochemical profile in the rat brain, Mol Syst Biol, vol. 6, p. 396, 2010. [75] M. J. Pallen and B. W. Wren, Bacterial pathogenomics, Nature, vol. 449, pp. 835–42, 2007. [76] P. Reddy Marri, J. P. Bannantine, and G. B. Golding, Comparative genomics of metabolic pathways in Mycobacterium species: gene duplication, gene decay and lateral gene transfer, FEMS Microbiol Rev, vol. 30, pp. 906–25, 2006. [77] J. Jaenike, An hypothesis to account for the maintenance of sex within populations, Evol. Theor. , vol. 3, pp. 191–4, 1978. [78] M. Salath´e, R. D. Kouyos, and S. Bonhoeffer, The state of affairs in the kingdom of the Red Queen, Trends Ecol Evol (Amst), vol. 23, pp. 439–45, 2008. [79] R. M. May and R. M. Anderson, Epidemiology and genetics in the coevolution of parasites and hosts, Proc R Soc B, vol. 219, pp. 281–313, 1983. [80] M. Salath´e and O. S. Soyer, Parasites lead to evolution of robustness against gene loss in host signaling networks, Mol Syst Biol, vol. 4, p. 202, 2008. [81] S. Bewick, R. Yang, and M. 
Zhang, Embedding evolutionary game theory into an optimal control framework for drug dosage design, Conf Proc IEEE Eng Med Biol Soc, vol. 2009, pp. 6026–9, 2009. [82] B. M. Adams, H. T. Banks, M. Davidian, H. Kwon, H. T. Tran, S. N. Wynne, and E. S. Rosenberg, HIV dynamics: modeling, data analysis, and optimal treatment protocols, J Comput Appl Math, vol. 184, pp. 10–49, 2005. [83] K. Raman, K. Yeturu, and N. Chandra, targetTB: a target identification pipeline for Mycobacterium tuberculosis through an interactome, reactome and genome-scale structural analysis, BMC Syst Biol, vol. 2, p. 109, 2008. [84] X. Fang, A. Wallqvist, and J. Reifman, A systems biology framework for modeling metabolic enzyme inhibition of Mycobacterium tuberculosis, BMC Syst Biol, vol. 3, p. 92, 2009. [85] A. Schwegmann and F. Brombacher, Host-directed drug targeting of factors hijacked by pathogens, Sci Signal, vol. 1, p. re8, 2008.
24 Bayesian Approaches for Mass Spectrometry-Based Metabolomics
Simon Rogers (1), Richard A. Scheltema (2), Michael Barrett (3) and Rainer Breitling (4)
(1) School of Computing Science, University of Glasgow, UK
(2) Proteomics and Signal Transduction, Max Planck Institute for Biochemistry, Martinsried, Germany
(3) Institute of Infection, Immunity and Inflammation, College of Medical, Veterinary and Life Sciences, University of Glasgow, UK
(4) Institute of Molecular, Cell and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, UK
24.1 Introduction
Systems biology has emerged as a vibrant and popular research area since the completion of the first complete genome sequences [1]. This is largely due to the resulting ability to create and analyze genome-wide datasets at various molecular levels, mRNA and protein expression datasets being the most popular. However, protein and mRNA data provide only an incomplete picture of the cellular mechanisms within an organism, with much of the short-term dynamics being lost. As a consequence many of the success stories of systems biology have studied metabolic systems: examples include the early development of metabolic control analysis [2] and the more recent advances in constraint-based metabolic modeling [3]. Modern tools for measurements of the metabolome (the end product of gene expression) represent a key ingredient for a metabolite-centered systems biology [4, 5]. The more detailed picture of cellular operation, made possible with this additional information, is crucial for understanding basic biology and for developing treatments for many diseases (e.g. [6]). Measurement and analysis of metabolites is an established approach within the life sciences. Researchers have been unraveling metabolic networks for a long time (see e.g. [7]); however, recent advances in mass
spectrometry technology have greatly increased the speed and reliability of comprehensive metabolomics measurements [4, 8, 9]. The improved mass spectrometry technology provides the potential for the characterization of the metabolic content of more complex samples at increasing speed and reliability. As with the more traditional post-genomic technologies, there are statistical problems when dealing with the quantity of data that can be produced (some modern mass spectrometers used in metabolomics are capable of producing several gigabytes of data within a single day). However, an additional challenge, not shared by other functional genomics platforms, comes from the absence of the regular sequential/polymeric structure that characterizes other functional biomolecules (DNA, RNA, proteins); only the masses of the molecules and their relative abundance are reported. An additional step is required to map these masses to metabolites (or, more specifically, chemical formulas of metabolites) before it is possible to perform useful analyses on the contents of the samples – searching for patterns and relationships that may be indicative of interesting biological phenomena. Increasingly accurate mass spectrometers help overcome part of the problem, [5, 7, 10], but the technology is still a long way from providing a complete solution. There have been significant research efforts towards developing metabolomic analysis techniques that can be used once the spectra have been processed, in particular through the use of networks (e.g. [11]). Identified and quantified metabolites can be mapped onto metabolic pathway maps, to highlight metabolic processes that are affected by a particular experimental intervention. Identified metabolites that differ dramatically between conditions can be useful as biomarkers. More ambitiously, the integrated analysis of identified metabolic changes and the corresponding changes in enzyme concentrations, fluxes, regulatory activities and gene expression levels leads to a broader understanding of the metabolic system in action. The success of these techniques, however, critically depends on the identification of the detected metabolites.
24.2
The challenge of metabolite identification
The assignment of extracted peaks to chemical formulas, and subsequently to metabolites, in an accurate and timely manner is nontrivial, but vital if large-scale metabolomic analysis is to rival other functional genomics approaches. The simplest approach is to assign a measured mass to the chemical formula with the closest theoretical mass. For example, a peak observed at a mass of 194.08 Da in a coffee bean extract is likely to correspond to a caffeine molecule (exact molecular mass 194.080376 Da). If there is no other potential metabolite with an even closer theoretical mass, it will be assigned correctly. However, this approach fails for highly complex mixtures due to the finite accuracy of the mass spectrometry equipment. Reported masses are often very noisy and do not exactly represent the true masses of the constituents of the sample, especially if the corresponding molecules are present in low quantities and, of course, mass information alone cannot discriminate between molecules of identical composition but different structure. At low masses this is less problematic, as there are few potential formulas within the tolerance of the mass spectrometry equipment. At higher mass this approach breaks down, as several potential formulas may lie close enough to the measured mass to make disambiguation impossible at the mass accuracy of the equipment [12]. This is particularly the case when performing untargeted metabolome screens using liquid chromatography–mass spectrometry, which is potentially capable of picking up novel metabolites of high interest. It is, however, a problem that needs to be overcome: such screens have the potential to detect hundreds or thousands of metabolites – providing an unprecedented snapshot of the physiological state of an organism. It is possible to filter the list of potential formulas using empirical rules (e.g. [13]), but this fails to completely overcome the problem. Alternatively, one can pre-process the sample to make identification easier. One attractive approach [14] proposes creating four samples, each isotopically labeled, with different heavy atoms derived from the same biological source. Each of the four samples is presented to the mass spectrometer,
and their analysis combined. Whilst this undoubtedly improves identification, it does so at a significant experimental and economic cost. Two more economical ways of improving the identification process incorporate additional, freely available information. The first [13] uses isotope profiles to filter potential formulas, while the second [15] exploits metabolic relationships to prioritize particular formulas from a set, dependent upon the existence of related metabolites in the same dataset. The latter method, which operates within a thoroughly probabilistic framework [15], has the simultaneous advantages of properly capturing (and reporting) the uncertainty inherent in the problem and allowing a framework for incorporating several forms of additional data. We suggest that extending the available models to integrate a multitude of different types of additional information in a probabilistic manner will be the foundation of statistical metabolomics as an independent research area.
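To make the limitation concrete, the following minimal sketch implements the naive nearest-mass strategy described above; apart from caffeine, whose exact mass is quoted in the text, the competing entries and the 5 ppm tolerance are invented for illustration and are not taken from any published database.

```python
# Illustrative sketch (not from any published tool): tolerance-based formula lookup.
FORMULA_MASSES = {
    "C8H10N4O2 (caffeine)": 194.080376,  # exact mass quoted in the text
    "candidate_B": 194.0812,             # invented near-isobaric competitor
    "candidate_C": 194.0721,             # invented, further away in mass
}

def candidates(measured_mass, formula_masses, ppm_tol=5.0):
    """Return all formulas whose theoretical mass lies within ppm_tol of the measurement."""
    hits = []
    for formula, mass in formula_masses.items():
        ppm_error = abs(measured_mass - mass) / mass * 1e6
        if ppm_error <= ppm_tol:
            hits.append((formula, mass, ppm_error))
    return sorted(hits, key=lambda h: h[2])

# Two candidates survive the tolerance window, so mass alone cannot decide between them.
print(candidates(194.0805, FORMULA_MASSES))
```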
24.3
Bayesian analysis of metabolite mass spectra
Bayesian statistics provides a natural framework for performing inference from data in the presence of noise [16]. Several studies have previously investigated the use of Bayesian statistics within the broad field of mass spectrometry [17–19]. Rogers et al. [15] give a specific example of the benefit of Bayesian statistics when identifying metabolites in noisy metabolomic data. In particular, the Bayesian framework allows the combination of experimental data (in this case, the pre-processed masses of measured peaks) with prior information (potential chemical relationships between compounds) to produce posterior distributions (and hence probabilities) over identifications of peaks as demonstrated in Figure 24.1. Here, three peaks (masses) have been identified from the experimental data: m1, m2 and m3. The peaks m2 and m3 are easily identified (there is only one possible compound within a reasonable accuracy window). However, m1 has a mass equidistant from two different compounds: without any additional information, it is given a probability of 0.5 of belonging to either [Figure 24.1(a)]. Additional information, distinguishing the two options, can be obtained by introducing the potential chemical connections between masses. In this case, the compounds assigned to m3 and m2 are both potentially related to one of the potential compounds for m1 as either a substrate or product in a potential biochemical reaction [Figure 24.1(b) and (c), respectively]. These relationships provide more evidence in favor of identification as this compound, reflected in the increasing probability. The method exploits the fact that (almost) all metabolites in a biological sample are chemically related to each other in a metabolic network, so that information about one metabolite will influence the conclusions about the identity of all other compounds as well. Additionally, a typical measurement will also contain derivatives of most metabolites, which provide additional evidence influencing the identifications. This can help the identification even when the dataset is sparse, i.e. many metabolites are undetected, as is often the case in metabolomics experiments. The approach uses a statistical technique known as Gibbs sampling to iteratively sample the assignment of a particular mass conditioned on the assignments of all other masses within the sample. This sampling is done from the following conditional distribution (the full details are available in [15]):

p(z_{cm} = 1 \mid \mathbf{Z}, x_m, \mathbf{y}, \delta, \gamma) \propto \mathcal{N}\!\left( \frac{x_m}{y_c} \,\middle|\, 1, \gamma^{-1} \right) \times \frac{\beta_{cm} + \delta}{C\delta + \sum_{c'} \beta_{c'm}}    (24.1)
where zcm is an indicator variable that is set to 1 if mass m is assigned to formula c, Z is a binary matrix holding all of the current assignments, xm is the mth measured mass, y is a vector of the masses of the C potential formulas and δ and γ are user-defined parameters that describe the influence of connections (δ) and the mass accuracy of the spectrometer (γ). The right-hand side of this expression consists of two terms – a Gaussian term that will be high if mass m is close to the theoretical mass of formula c and a connectivity
Figure 24.1 Schematic of the metabolite identification system described in [15] which is based on a Bayesian statistical approach and uses metabolomic relationships to support the identification process
term that will be high if the assignment of mass m to formula c will add many potential connections, i.e. if the assigned compound fits well into the metabolomic network. When used to analyze a recent dataset from protozoan parasites (Trypanosoma brucei, the causative agent of sleeping sickness), it produced assignments which were far more plausible in the context of the genome-predicted T. brucei metabolome [20] than those obtained by just using the compound closest to the experimentally observed mass. Of the 339 peaks produced by the experiment, 14 were assigned differently once connections were taken into account. Although this is not necessarily proof that the new assignments are entirely correct, the probabilistic framework provides useful information that could be used to decide which (if any) of the peaks to subject to a second level of mass spectrometry analysis – peaks that are assigned with the lowest probabilities are likely to provide the highest level of disambiguation when concretely identified. It is clear that the Bayesian approach allowing the inclusion of additional data was very useful; the task for statistical metabolomics is to further enhance the results by expanding the approach.
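The following minimal sketch illustrates a Gibbs update of the form of Equation (24.1) on an invented toy problem; the masses, candidate formulas, connectivity matrix and parameter values are all illustrative assumptions, and the code is a sketch of the general technique rather than the implementation used in [15].

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (invented numbers): 3 measured masses, 4 candidate formulas.
x = np.array([194.0805, 181.0740, 166.0865])             # measured masses
y = np.array([194.0804, 181.0739, 166.0861, 194.0812])   # theoretical formula masses
A = np.array([[0, 1, 0, 0],    # A[c, c2] = 1 if formulas c and c2 could be linked by
              [1, 0, 1, 0],    # a plausible biochemical reaction (substrate/product)
              [0, 1, 0, 0],
              [0, 0, 0, 0]])
M, C = len(x), len(y)
gamma = 1.0 / (3e-6) ** 2      # mass-accuracy precision on the ratio x_m / y_c (about 3 ppm)
delta = 0.1                    # pseudo-count controlling the influence of connections

z = rng.integers(0, C, size=M)          # current assignment of each mass to a formula

def gibbs_sweep(z):
    for m in range(M):
        others = np.delete(np.arange(M), m)
        # beta[c] = number of currently assigned formulas connected to candidate c
        beta = A[:, z[others]].sum(axis=1)
        gauss = np.exp(-0.5 * gamma * (x[m] / y - 1.0) ** 2)    # mass-match term
        conn = (beta + delta) / (C * delta + beta.sum())        # connectivity term
        p = gauss * conn
        p /= p.sum()
        z[m] = rng.choice(C, p=p)
    return z

samples = np.array([gibbs_sweep(z).copy() for _ in range(2000)])
# Posterior probability that the first mass is assigned to each candidate formula:
print([(samples[:, 0] == c).mean() for c in range(C)])
```

In this toy example the first mass is ambiguous on mass alone (two candidates within tolerance), but the connectivity term favors the formula that fits into the small network, mirroring the situation sketched in Figure 24.1.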
24.4
Incorporating additional information
The probabilistic approach introduced above already improves metabolite identification in its simplest form, which can be further enhanced by building models that incorporate more information. A number of sources of a priori knowledge lend themselves to exploitation. These include: 1. Inferred presence from genome-scale metabolic reconstructions. Some metabolites may be more likely than others, as they are inferred to be present in the biological system based on genome-scale metabolic reconstructions or because they have previously been observed in the same organism (or in another related one there is a clear hierarchy of confidence in the prior knowledge). Relative abundance (or relative signal intensity) can also be incorporated in this context: for example, it might be known a priori that the abundance of glutamate in a certain type of sample will be expected to be much larger than that of tryptophan, so that the identification of an extremely abundant signal as tryptophan might be discounted. Of course, various factors influencing the signal intensity, including ion suppression and matrix effects, need to be incorporated in a full description of these relationships. 2. Differences in expected probability of biochemical reactions. Some metabolic transformations are more likely than others, based on (general or specific) biochemical knowledge and metabolic models. It might be that enzymes catalyzing a particular type of reaction have already been described in a closely related species or that a particular type of reaction (e.g. dehydrogenation) is known to be extremely common in any biological system. This prior knowledge can also include thermodynamic predictions about the feasibility of a particular reaction, e.g. based on the group contribution method for estimating thermodynamic properties of metabolites in the absence of experimental data [21]. As these estimates will only be approximations, they add another level of uncertainty that can be embraced within the Bayesian framework. A particular challenge for future research will be the interpretation of the thermodynamic constraints and their influence on reaction feasibility and, consequently, metabolite identification in a genome-scale network context, e.g. using thermodynamics-based metabolic flux analysis [22]. 3. Chromatographic behavior and other biophysical information. Chromatographic information can support or refute certain identifications, as databases of the chromatographic behavior of authentic standards are becoming available. In fact, the combined information of chromatographic retention time and mass is frequently taken as sufficient information for definitive metabolite identification. We believe it would be even better to consider this information as part of a probabilistic scoring method, as liquid chromatography is known to exhibit relatively large fluctuations in retention time. Furthermore, the chromatographic retention time of metabolites can be predicted from elementary properties of their structure (with some uncertainty),
which can be incorporated to refine the identifications: a peak that is quite confidently identified based on mass, isotope pattern and metabolic plausibility may still be discounted (to get a lower posterior probability for a particular identification), if the predicted and observed retention time show a large discrepancy. This can, for example, occur in situations where the peak is a technical derivative (e.g. an ion adduct or chemical fragment) of an actual metabolite and therefore occurs at a shifted retention time. An important advantage of a unified statistical framework is that the retention time prediction and the identification of metabolites can be combined in a single approach, so that the calibration of the retention time prediction is based on confidently identified compounds within the sample, reducing the need for extensive external standards for each variation of machine settings. 4. Information on chemical derivatization. On the other hand, it is becoming increasingly obvious that a large fraction (perhaps more than 60% [23]) of the signals obtained from a complex metabolite mixture are generated from compounds that are technology-created derivatives of the real cellular metabolites. Rigorous incorporation of models describing the processes underlying the creation of these derivatives will make the interpretation of the data less error-prone (the presence of certain types of adducts and fragment masses will exclude a number of potential formulas and structures). In contrast to the previous incorporation of isotope information and metabolic relationships [13, 15], this will require not only statistical developments, but also a better physico-chemical understanding of the processes leading to these derivative signals. 5. Data integration across related samples. Finally, the Bayesian statistical framework allows the combination of data across samples, for instance arguing that compounds with similar properties that are detected repeatedly are likely to be the same metabolite, even if in some samples they are close to the detection limit. This also emphasizes an additional benefit of the approach, which will not only enable the more accurate detection and identification of metabolites but also leads to more reliable estimates of their (relative) abundance in the samples. In summary, there is much information that is already available and could be used to improve mass assignments. The probabilistic framework makes it possible to incorporate this information; the challenge for statisticians working on metabolomics data will be to develop specific statistical models to do this. Once a statistical model is developed in the form of some positive function f(x_m, y_c) that increases as y_c becomes a better candidate for x_m, it is simple to incorporate it into the model described by Equation (24.1) as an additional term on the right-hand side:

p(z_{cm} = 1 \mid \mathbf{Z}, x_m, \mathbf{y}, \delta, \gamma) \propto \mathcal{N}\!\left( \frac{x_m}{y_c} \,\middle|\, 1, \gamma^{-1} \right) \times \frac{\beta_{cm} + \delta}{C\delta + \sum_{c'} \beta_{c'm}} \times \frac{f(x_m, y_c)}{\sum_{c'} f(x_m, y_{c'})}.

This can be done in turn and independently for each of the additional types of evidence outlined above, enabling a very convenient iterative and modular development process.
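As a sketch of how such an additional term might be wired in, the snippet below multiplies a hypothetical retention-time evidence score into the unnormalized assignment weights; the function name, the Gaussian form of the score and all numbers are assumptions made purely for illustration.

```python
import numpy as np

def retention_time_evidence(observed_rt, predicted_rt, sigma=0.5):
    """Hypothetical extra evidence term f(x_m, y_c): a simple Gaussian score on the
    discrepancy between observed and predicted retention time (in minutes). Any other
    positive score (isotope-pattern fit, adduct plausibility, ...) could be plugged in
    the same way."""
    return np.exp(-0.5 * ((observed_rt - predicted_rt) / sigma) ** 2)

def combined_weight(mass_term, connectivity_term, extra_terms):
    """Multiply independent evidence terms into one unnormalized assignment weight,
    mirroring the extra factor added to the right-hand side of Equation (24.1)."""
    w = mass_term * connectivity_term
    for t in extra_terms:
        w = w * t
    return w / w.sum()   # normalize over the candidate formulas

# Example with three candidate formulas for one peak (all numbers invented):
mass_term = np.array([0.80, 0.75, 0.10])
connectivity_term = np.array([0.50, 0.25, 0.25])
rt_term = retention_time_evidence(6.2, np.array([6.1, 9.8, 6.3]))
print(combined_weight(mass_term, connectivity_term, [rt_term / rt_term.sum()]))
```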
24.5
Probabilistic peak detection
Peak identification is not the only statistically relevant challenge in metabolomics. At an earlier step in the data processing, peak detection is equally amenable to a rigorous statistical approach. Correct detection of peaks is, like their subsequent assignment, crucial to the validity of any conclusions drawn from the analysis of the resulting dataset. Current approaches use peak detection techniques and then thresholding on peak quality and/or intensity to produce a list of peaks for further analysis. Within the community using mass spectrometry for proteomic analysis, there have been several attempts at using Bayesian inference in the peak detection step [17–19, 24, 25]. Such approaches provide distributions or probabilities of peaks existing at different masses. We suggest that an important next step will be the integration of this information with
downstream formula assignments to eliminate the need for a threshold to determine what is and what is not a peak. As an example of why this is important, consider a peak that is just below the threshold. It is cut off and subsequently ignored for the remainder of the analysis. However, within the context of other peaks that have been identified and assigned, it might begin to look more interesting. Within the probabilistic approach, these data would never be discounted. They would merely be given a low probability (or confidence) by the detection algorithm that might then be increased a posteriori as other peaks and their relationships are identified. Similarly, something just above the threshold might begin to look less likely as other masses are identified. Investigation of this propagation of uncertainty through the detection process is a key part of our proposed research agenda for statistical metabolomics. The existence of algorithms such as those discussed above means that it is highly feasible that there will soon be progress towards a continuous probabilistic pipeline from raw data to scientific analysis. For example, the probability of peaks being present at different spectral positions could be considered an additional source of prior information that could be fed into the analysis producing both posterior probabilities over the presence of real signal within the spectrum and probabilities over the assignment of these spectral positions to formulas.
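A minimal sketch of this idea, with invented detection probabilities and scores, is given below; it simply carries a peak-presence probability forward as a weight instead of applying a hard cut-off.

```python
# Sketch of the idea discussed above (not an implementation of any published pipeline):
# rather than discarding sub-threshold peaks, keep a detection probability and let it
# weight the downstream formula-assignment evidence. All numbers are invented.
candidate_peaks = [
    {"mz": 194.0805, "p_detect": 0.97},
    {"mz": 348.0704, "p_detect": 0.42},   # would be lost by a hard intensity threshold
    {"mz": 512.1100, "p_detect": 0.05},
]

def effective_evidence(assignment_score, p_detect):
    """Combine the probability that a signal is real with the score of assigning it to a
    formula; a peak 'just below threshold' still contributes, but only weakly."""
    return p_detect * assignment_score

for peak in candidate_peaks:
    print(peak["mz"], effective_evidence(assignment_score=0.8, p_detect=peak["p_detect"]))
```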
24.6
Statistical inference
Because it is unlikely that any of the distributions of interest in statistical metabolomics will come from mathematically convenient families, the field will need sophisticated Bayesian inference tools, most likely Markov chain Monte Carlo (MCMC) techniques. These techniques allow us to draw samples from distributions that we cannot analytically compute. In [15], we used one such technique known as Gibbs sampling. This allowed us to draw samples from the distribution over peak-to-compound assignments (a distribution that would be infeasible to compute analytically due to the enumeration required over all possible sets of assignments). These samples can then be used to approximate the distributions of interest or used to approximate arbitrary expectations with respect to that distribution. The most obvious example of a useful expectation would be the posterior probability that a particular mass (say mass m) is assigned to a particular formula (say formula c): p(z_{mc} = 1 | ...), where z_{mc} is an indicator variable that is assigned the value 1 if indeed mass m is assigned to formula c and the dots to the right of the conditioning bar represent all of the data and user-defined parameters. This is computed very simply by counting the proportion of samples in the output where z_{mc} = 1. A more subtle but potentially equally important example is the posterior probability of coexistence. One of the important features of this approach is the implicit dependence created between all of the assignments. Hence:

p(z_{mc} = 1, z_{m'c'} = 1 \mid \ldots) \neq p(z_{mc} = 1 \mid \ldots)\, p(z_{m'c'} = 1 \mid \ldots).

Indeed, if there is a biochemical connection between formulas c and c', we might expect the left-hand side of this equation to be significantly higher than the right-hand side. Fortunately, it is easy to compute the quantity on the left-hand side as simply the proportion of samples in which these two formulas are both assigned (to masses m and m', respectively). This has useful, intuitive applications for the biological interpretation of the results, as it allows us to examine the consequences of alternative identifications: if mass x1 is assigned to metabolite y1 then we are also likely to observe metabolites y2, . . . , yn; but if mass x2 is actually metabolite y1, we probably see a different set of accompanying metabolites, with a very different biological meaning. The incorporation of diverse additional sources of prior knowledge is likely to require similar techniques and it is all but certain that it will prove challenging to sample from these complex distributions. Some recently explored strategies to overcome the computational challenges associated with sampling from such
complex distributions might be useful in the metabolomics field as well. For example, in [26], the authors show how one can use thermodynamic integration (a technique pioneered by theoretical physicists) to improve the performance of sampling algorithms when confronted with highly multimodal densities that are common in the complex systems studied within systems biology.
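The expectations discussed in this section are straightforward to estimate from sampler output, as the following sketch shows; the sample matrix is invented and stands in for the output of a Gibbs sampler such as the one described above.

```python
import numpy as np

# Sketch: turning Gibbs-sampler output into the expectations described in Section 24.6.
# samples[s, m] holds the formula index assigned to mass m in posterior sample s.
samples = np.array([[0, 2, 1],
                    [0, 2, 1],
                    [3, 2, 1],
                    [0, 2, 2]])

def marginal(samples, m, c):
    """Posterior probability p(z_mc = 1 | ...) as the fraction of samples with that assignment."""
    return float(np.mean(samples[:, m] == c))

def coassignment(samples, m, c, m2, c2):
    """Joint posterior probability that mass m is formula c AND mass m2 is formula c2;
    in general this is not the product of the two marginals."""
    both = (samples[:, m] == c) & (samples[:, m2] == c2)
    return float(np.mean(both))

print(marginal(samples, m=0, c=0))                                   # 0.75
print(marginal(samples, m=0, c=0) * marginal(samples, m=2, c=1))     # 0.5625
print(coassignment(samples, m=0, c=0, m2=2, c2=1))                   # 0.5, showing the dependence
```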
24.7
Software development for metabolomics
In order for methods developed in statistical metabolomics to become used in practice, it is important that they are implemented with a user-friendly interface, freely available and compatible with popular experimental data formats (mzXML, or the more recent mzML). Software development for statistical metabolomics will need to take this into account by implementing the probabilistic framework in a manner that makes it compatible with existing data processing and management software such as mzMine http://mzmine.sourceforge.net, XCMS [6] and mzMatch http://sourceforge.net/ projects/mzmatch/. The mzMine environment provides a full-fledged user interface from which peak extraction and a number of the most popular analysis approaches for determining significant signals can be performed. This has the advantage that relatively inexperienced users can rapidly extract results from their data. In contrast, XCMS relies on the R environment for all data analysis, after the initial peak extraction and matching has been done. Such an approach requires a higher amount of background knowledge, including the R programming language, and is thus not readily accessible for many biologist users. Both XCMS and mzMine offer only limited access to the extracted and matched peak data for independent statistical analysis and rely on monolithic implementations which are difficult to adjust for new statistical
Figure 24.2 Example of a mzMatch pipeline. The mzXML/mzData/mzML standards changed the data analysis landscape by providing a common input format, but do not provide functionality for breaking up a data analysis pipeline into small interchangeable components. This functionality is provided with the introduction of the PeakML file format
approaches. The more recent mzMatch environment, in contrast, is built in a modular fashion (i.e. it consists of small tools which can be combined in any configuration) around an open file format called PeakML for storing the complete peak data (including the full chromatographic trace for each peak with retention time, m/z and intensity information). Such a design can readily be extended with novel downstream analysis software like the proposed probabilistic framework, which can easily access the extracted peak data stored in the PeakML file (Figure 24.2). It is also to be expected that more advanced statistical approaches will ultimately find their way into commercial software for metabolomics analysis, which is provided by the manufacturers of mass spectrometers.
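The snippet below sketches what a small, interchangeable pipeline stage of this kind might look like; the CSV-based toy format and all function names are assumptions for illustration only – the actual mzMatch tools operate on PeakML files and have their own interfaces.

```python
# Schematic sketch of a modular pipeline stage; not the PeakML or mzMatch API.
import csv

def read_peaks(path):
    with open(path, newline="") as fh:
        return [{"mz": float(r["mz"]), "rt": float(r["rt"]), "intensity": float(r["intensity"])}
                for r in csv.DictReader(fh)]

def annotate(peaks, formula_db, ppm_tol=5.0):
    """One self-contained stage: attach candidate formulas to each peak and otherwise pass
    the peaks through unchanged, so other stages can be swapped in before or after it."""
    for p in peaks:
        p["candidates"] = [f for f, mass in formula_db.items()
                           if abs(p["mz"] - mass) / mass * 1e6 <= ppm_tol]
    return peaks

def write_peaks(peaks, path):
    fields = sorted({k for p in peaks for k in p})
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(peaks)
```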
24.8
Conclusion
Large-scale metabolomic analysis is a key tool for the further understanding of biological organisms and their cellular processes. Development of useful statistical techniques to aid these analyses will therefore have far reaching benefits within the metabolomics communities and throughout systems biology. Statistical metabolomics will require the close collaboration of three very diverse groups of researchers: (1) metabolomic practitioners need to contribute not only test datasets, but also specific expertise on prior knowledge that can be incorporated in the statistical analysis, ranging from empirical knowledge of chromatographic behavior and other chemical properties to a more detailed understanding of the processes leading to technical artifacts in the mass spectrometry equipment. In return they would clearly benefit from better techniques to identify the metabolites present within their particular samples, so that they no longer depend on manual curation of the results by expert analytical chemists. (2) Bioinformaticians and Systems Biologists who are working within the field of metabolomics or would like to use metabolomic data need to contribute their expertise on metabolic networks, including biochemically feasible reactions and expected metabolites, and will in turn benefit from better detection and identification, which additionally allows for far more rigorous statistical assessment of the significance and (un-)certainty of particular results. (3) Finally, statistical and machine learning researchers need to develop new computational approaches to deal with the unique challenges posed by the nonstandard distributions and the heterogeneous sources of information encountered in this research; in return they will benefit from being able to use this very dynamic and high-profile research area as a test bed for developing more powerful inference methodology.
References [1] H. Westerhoff and B. Palsson. The evolution of molecular biology into systems biology. Nat Biotechnol, 22(10):1249– 1252, 2004. [2] D. A. Fell. Metabolic control analysis: a survey of its theoretical and experimental development. Biochem J, 286:313– 330, 1992. [3] J. L. Reed, I. Famili, I. Thiele, and B. O. Palsson. Towards multidimensional genome annotation. Nat Rev Genet, 7:130–141, 2006. [4] X. Feng, X. Liu, Q. Luo, and B.-F. Liu. Mass spectrometry in systems biology: an overview. Mass Spectrom Rev, 27(6):635–660, 2008. [5] W. Dunn, D. Broadhurst, M. Brown, P. Baker, C. Redman, L. Kenny, and D. Kell. Metabolic profiling of serum using Ultra Performance Liquid Chromatography and the LTQ-Orbitrap mass spectrometry system. J Chromatogr B, 871(2):288–298, 2008. [6] R. A. Scheltema, S. Decuypere, R. T’kindt, J. C. Dujardin, G. H. Coombs, and R. Breitling. The potential of metabolomics for Leishmania research in the post-genomics era. Parasitology, 137:1291–1302, 2010.
[7] R. Breitling, A. R. Pitt, and M. P. Barrett. Precision mapping of the metabolome. Trends Biotechnol, 24(12):543–548, 2006. [8] D. Ohta, S. Kanaya, and H. Suzuki. Application of Fourier-transform ion cyclotron resonance mass spectrometry to metabolic profiling and metabolite identification. Curr Opin Biotechnol, 21(1):35–44, 2010. [9] A. G. Marshall and C. L. Hendrickson. High-resolution mass spectrometers. Annu Rev Anal Chem, 1(1):579–599, 2008. [10] W. Lu, B. Bennett, and J. Rabinowitz. Analytical strategies for LC-MS-based targeted metabolomics. J Chromatogr B, 871(2):236-242, 2008. [11] F. Jourdan, R. Breitling, M. P. Barrett, and D. Gilbert. MetaNetter: inference and visualization of high-resolution metabolomic networks. Bioinformatics, 24(1):143–145, 2008. [12] T. Kind and O. Fiehn. Metabolomic database annotations via query of elemental compositions: mass accuracy is insufficient even at less than 1 ppm. BMC Bioinformatics, 7:234, 2006. [13] T. Kind and O. Fiehn. Seven golden rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinformatics, 8(1):105, 2007. [14] A. D. Hegeman, C. F. Schulte, Q. Cui, I. A. Lewis, E. L. Huttlin, H. Eghbalnia, A. C. Harms, E. L. Ulrich, J. L. Markley, and M. R. Sussman. Stable isotope assisted assignment of elemental compositions for metabolomics. Anal Chem, 79(18):6912–6921, 2007. [15] S. Rogers, R. A. Scheltema, M. Girolami, and R. Breitling. Probabilistic assignment of formulas to mass peaks in metabolomics experiments. Bioinformatics, 25(4):512–518, 2009. [16] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2004. [17] J. S. Morris, P. J. Brown, R. C. Herrick, K. A. Baggerly, and K. R. Coombes. Bayesian analysis of mass spectrometry proteomics data using wavelet based functional mixture models. Biometrics, 64(2):479–489, 2008. [18] A. Saksena, D. Lucarelli, and I.-J. Wang. Bayesian model selection for mining mass spectrometry data. Neural Networks, 18 (5–6):843-849, 2005. [19] J. Zhang, H. Wang, A. Suffredini, D. Gonzalez, E. Gonzalez, Y. Huang, and X. Zhou. Bayesian peak detection for PRO-TOF MS MALDI data. In IEEE International Conference on Acoustics, Speech and Signal Processing. ICASSP 2008. IEEE, pp. 661–664, 2008. [20] B. Chukualim, N. Peters, C. Fowler, and M. Berriman. TrypanoCyC – a metabolic pathway database for Trypanosoma brucei. BMC Bioinformatics, 9(Suppl. 10):P5, 2008. [21] M. D. Jankowski, C. S. Henry, L. J. Broadbelt, and V. Hatzimanikatis. Group contribution method for thermodynamic analysis of complex metabolic networks. Biophys J, 95(3):1487–1499, 2008. [22] C. S. Henry, L. J. Broadbelt, and V. Hatzimanikatis. Thermodynamics-based metabolic flux analysis. Biophys J, 92(5):1792–1805, 2007. [23] R. A. Scheltema, S. Decuypere, J. C. Dujardin, D. Watson, R. C. Jansen, and R. Breitling A simple data reduction method for high resolution LC/MS data in metabolomics. Bioanalysis, 1:1569–1578, 2009. [24] M. Atlas and S. Datta. A statistical technique for monoisotopic peak detection in a mass spectrum. J Proteomics Bioinform, 2(5):202–216, 2009. [25] X. Kong and C. Reilly. A Bayesian approach to the alignment of mass spectra. Bioinformatics, 25(24):3213–3220, 2009. [26] B. Calderhead and M. Girolami. Estimating Bayes factors via thermodynamic integration and population mcmc. Comput Stat Data Anal, 53(12):4028–4045, 2009.
25
Systems Biology of microRNAs
Doron Betel1 and Raya Khanin2
1 Institute of Computational Biomedicine, Weill Cornell Medical College, New York, USA
2 Bioinformatics Core, Memorial Sloan-Kettering Cancer Center, New York, USA
25.1
Introduction
MicroRNAs are a large class of short noncoding RNAs of 19–24 nucleotides (small RNAs) that negatively regulate messenger RNAs (mRNAs) and proteins levels in the cell. Since their discovery in 2001, microRNAs have been found to play critical roles in many, if not all, key biological processes, such as cell growth, tissue differentiation, cell proliferation, embryonic development, and apoptosis (Esquela-Kerscher and Slack 2006). As such, aberrations in microRNA expression, biogenesis, or sequence mutations have been linked to various diseases, such as cancers (Kumar et al. 2007), cardiovascular disease (Van Rooij and Olson 2007), and more [see Lu et al. (2008) on microRNA-disease association network]. Systems Biology study of any organism, biological process, or complex disease has to include microRNAs as central post-transcriptional regulators. The microRNA research is perhaps one of the rare examples so far when accumulation of experimental data and development of computational techniques have been happening in parallel, feeding each other’s progress and making equal contributions in advancing our understanding of gene regulation. The microRNA field has been growing exponentially as has the scope of computational problems that accompanies this field. In this chapter, we present some of the major computational challenges in microRNA Systems Biology with focus on the most active problems, in particular those that we feel will have significant impact in the future.
25.2
Current approaches in microRNA Systems Biology
Current approaches in microRNA Systems Biology, and Systems Biology in general, can be divided into three groups. Most computational efforts to understand post-transcriptional gene regulation by microRNAs
have focused on target prediction tools [see Bartel (2009) for a recent review on target prediction]. Numerous target prediction algorithms and databases are readily available. However, how microRNAs select their targets and which targets are functional remains a central and unresolved problem that has a major impact on experimental and computational advancements in microRNA biology. One of the major challenges of this problem is the incomplete knowledge of the factors involved in microRNA binding to the target mRNA as well as the interplay with other RNA binding proteins and RNA secondary structure. A second area of computational efforts that is gaining momentum is integration of target prediction methods with other regulatory relationships, such as transcription factors (TFs) and their targets, into a comprehensive network of gene regulations. A regulatory network abstractly represents various cellular interactions, including transcription, metabolic, signaling and protein – protein. The network approach in cellular biology provides a powerful framework for understanding generic principles and functions of biological processes, including gene regulation on transcriptional and post-transcriptional levels. The network approach in microRNA systems biology deals with studying principles of microRNA regulation of gene, protein and metabolic networks, identifying microRNA-mediated modules, and functional regulatory microRNA network motifs (in particular TF-microRNA motifs) from experimental and computational data. Another area in microRNA research deals with kinetic models of post-transcriptional microRNAmediated gene regulation. These models, usually described by ordinary differential equations (ODEs), are used to simulate ‘what if’ scenarios (in silico biology), and enable the inference of kinetic parameters and modes of regulation from high-throughput data. Translating the ODE-based models into stochastic form is used to study extrinsic and intrinsic noise in the cases of low numbers of molecules, and may reveal functionalities of microRNA regulatory circuits. A comprehensive review of all computational aspects in microRNA research is beyond the scope of this chapter. We therefore focus on some interesting applications of statistical techniques, and mathematical modeling, that are used to unravel basic questions in microRNA biology. We start with a very brief summary of some significant experimental findings on gene regulation by microRNAs that guide the computational efforts.
25.3
Experimental findings and data that guide the developments of computational tools
microRNAs regulate gene expression post-transcriptionally, primarily by destabilizing the target mRNA which leads to degradation or by repressing the translation of mRNA to protein. With a few reported exceptions, the effect of microRNA is repressive although the molecular mechanisms are still debated (Filipowicz et al. 2008). A recent report suggests that mRNA degradation is the predominant form of microRNA-mediated regulation (Guo et al. 2010). However, proteomics and other experiments suggest that there is an appreciable microRNA-mediated regulation that decreases the efficiency of mRNA translation. microRNAs act by imperfect base-pairing with complementary sequences on target mRNAs, located primarily but not exclusively in 3 UTRs. To date, only a small number of biologically relevant microRNA sites have been identified in coding sequences (Thomas et al. 2010). It has been computationally predicted (Krek et al. 2005; Lewis et al. 2005) and demonstrated experimentally (Lim et al. 2005; Baek et al. 2008; Selbach et al. 2008) that one type of microRNA may regulate a large number, sometimes in the hundreds, of different types of target mRNAs, and conversely one gene can be targeted by multiple microRNAs (Wu et al. 2010). Thousands of microRNAs have been discovered, and by some estimates they combinatorially regulate at least a third of all human genes (Bartel 2009). The regulating effect of microRNAs is typically studied by conducting microRNA mis-expression experiments and measuring gene expression with microarrays (Lim et al. 2005), protein levels by the mass spectromic method SILAC (Baek et al. 2008) and changes in protein synthesis by pulsed SILAC (pSILAC) (Selbach et al. 2008). There are two general types of microRNA mis-expression experiments: (i) overexpression (or
transfection) of a microRNA to a cell-line or tissue where it is initially not present, or present at low levels; and (ii) knock-down of an abundant microRNA by antagomir or LNA. Another method for reducing microRNA levels is by introducing ‘sponge’ RNAs that contain an optimized complementary binding site to the microRNA of interest and thereby out-compete the endogenous targets (Ebert et al. 2010). High-throughput and detailed studies have isolated the major sequence determinant that mediates microRNA regulation – the 6-mer seed sequence which forms a consecutive Watson – Crick base-pairing between the 3 UTRs of mRNA and positions 2–7 of the microRNA (counted from its 5 end) (Bartel 2009). The extent of the microRNA-mediated regulation of the target is sometimes subtle but may result in a profound phenotype. It depends on the number of seeds, the distances between them, sequence context, secondary structure and the cellular context. High-throughput data from microRNA mis-expressions, sponge experiments, and joint profiling of mRNA and microRNA levels in various tissues, diseases such as cancer, and different conditions, are being analyzed and mined using statistical techniques. Cellular signals and conditions affect the production, longevity, and availability of microRNAs. Many microRNAs are produced in a cell-specific or tissue-specific manner as well as at different developmental stages. The primary microRNA transcript is cleaved into an ∼60–70 nucleotide long hairpin by an RNA cleavage enzyme called Drosha. This precursor is exported to the cytoplasm, where it is cleaved by a second enzyme called Dicer to yield a 22-nucleotide double-stranded RNA (dsRNA). One of the strands becomes the mature microRNA, and it is eventually loaded into an RNA-induced silencing complex (RISC) that guides the RNA silencing (Bartel 2004). Any step during the microRNA biogenesis can potentially be regulated. For example, in embryonic stem cells the microRNA let-7 is actively transcribed. However, the Dicer cleavage maturation step is inhibited resulting in a lack of mature let-7 (Rybak et al. 2008). It was also shown recently that like other RNAs, microRNAs possess differential stability in human cells and this stability can be modulated (Bail et al. 2010). Another interesting, and potentially highly regulated step, is mRNA accumulation in P-bodies in a microRNA-dependent manner (Bhattacharyya et al. 2006). This provides new insights as to how gene expression is regulated, and new challenges for mathematical models that attempt to capture, simulate and predict various scenarios.
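As a simple illustration of seed-based target recognition, the sketch below locates canonical 6-mer seed matches by searching a 3′ UTR for the reverse complement of microRNA positions 2–7; the let-7a sequence is the widely published mature sequence, while the UTR fragment is invented.

```python
# Sketch (illustration only, not a published tool): find canonical 6-mer seed matches.
# The target site must be complementary to microRNA positions 2-7, so we search the
# 3'UTR for the reverse complement of that hexamer.
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def seed_match_positions(mirna, utr):
    seed = mirna[1:7]                          # positions 2-7 (1-based) of the microRNA
    site = seed.translate(COMPLEMENT)[::-1]    # reverse complement = the mRNA site sequence
    return [i for i in range(len(utr) - 5) if utr[i:i + 6] == site]

mirna = "UGAGGUAGUAGGUUGUAUAGUU"                      # mature let-7a sequence
utr = "AAGCUACCUCAAAAAUACAACCUACUACCUCAUUU"           # invented UTR fragment with two sites
print(seed_match_positions(mirna, utr))               # -> [4, 25]
```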
25.4
Approaches to microRNA target predictions
microRNAs mediate their silencing effect primarily by imperfect base-pairing at the 3′ UTRs of the target genes. The primary determinant of target recognition is base-pairing at positions 2–7 of the microRNA – the so-called 'seed region' – to the 3′ UTR of the mRNA target. Most target prediction methods use seed complementarity as the main prediction factor; however, this factor is not always necessary nor sufficient for accurate target prediction as any hexamer sequence has thousands of random occurrences in 3′ UTRs and some targets do not contain perfect seed base-pairing (Hobert 2004). Other factors that also contribute to target prediction, albeit to a lesser extent, include base-pairing at the microRNA's 3′ end, evolutionary conservation, nucleotide composition, secondary structure, multiplicity of target sites, and other RNA-binding motifs (Figure 25.1). The computational challenge is to combine these various factors into a prediction model which will either rank predicted sites appropriately, or classify them as functional or not. Here we provide a high-level overview of some of the common prediction methods that represent the general approaches used for target prediction [for a more detailed review see Bartel (2009)]. Target prediction algorithms can generally be divided into two categories: sequence-based and energy-based approaches. Sequence-based approaches identify the most likely target sites by optimizing sequence complementarity between the microRNA and mRNA with emphasis on perfect seed pairing. Energy-based approaches are similar in that they identify the optimal microRNA–mRNA duplex; however, these are based on minimizing thermodynamic energy terms that are based on the hydrogen bonds and stacking energies of the
Figure 25.1 Features used in microRNA target predictions. microRNA guides an ‘RNA induced silencing complex’ (RISC) towards the 3 UTR of mRNA target leading to downregulation of the target. This example depicts the let-7a (top strand) target site in the 3 UTR of NRAS (bottom strand). The mechanism by which microRNAs identify and regulate their targets is poorly understood; however, the primary determinant is base-pairing between the so-called ‘seed’ region of the microRNA (positions 2–7) and the target sequence. Other factors such as additional pairing at the 3 of the microRNA, the nucleotide composition around the target site, secondary structure accessibility, evolutionary conservation and the presence of additional sites are additional features that are used in prediction models
nucleotides. The two approaches are not distinct since there is a strong correlation between favorable energy scores and sequence complementarity. In addition, some approaches use a combination of both sequence complementarity and energy minimization. The target prediction problem is composed of two primary steps. The first is idenfication of candidate target sites and the second is scoring or sometimes filtering by conservation. One common approach to identify target sites is to use dynamic programming, similar to sequence alignments, which identifies the maximal base-pairing between the microRNA and mRNA target. The miRanda algorithm uses a scaling factor for the alignment scores at the 5 end of the microRNA which results in alignments that are strongly biased towards base-pairing at the 5 end (John et al. 2004). Another approach is to first anchor the alignment at positions with perfect seed pairing and then extend the alignment to the 3 end of the microRNA by an optimization procedure such as dynamic programming or energy terms (Krek et al. 2005; Lewis et al. 2005). Yet another approach attempts to identify over represented sequence motifs in the target mRNAs and match those to known microRNA sequences (Miranda et al. 2006). In the second step of the prediction candidate sites are scored for likelihood of regulatory effect by using a combination of features such as the extent of base-pairing with the target gene, nucleotide composition of the target site, conservation and modeling of the mRNA secondary structure around the target site. Grimson et al. (2007) introduced a linear model termed context score, which is the basis for the scoring values used in the target prediction website TargetScan. In a series of microRNA transfection experiments the authors showed that longer complementarity in the seed region correlates with increased repression of targets. Accordingly, they classified the set of predicted targets to a hierarchy of four seed types representing different seed basepairing configurations. Three ‘contextual’ features, nucleotide composition around the target site, binding at the 3 end of the microRNA, and distance from the ends of the 3 UTR, were found to correlate with the extent of downregulation. The score is based on three regression models, trained for each seed class, that correlate
the contextual feature with the extent of target log expression change. Predicted target sites are scored based on their seed complementarity and the contribution of three context features. A similar study by Nielsen et al. (2007) looked at a similar set of features and used cumulative distribution functions to assess their contribution towards target downregulation. More recently Betel et al. (2010) extended the regression approach to include additional features and unify the four different seed classes to a single model using a support vector regression (SVR) algorithm termed mirSVR (Betel et al. 2010). Given a downregulation value y_i for gene i, the linear model attempts to learn a set of weights w for the feature vector x_i such that:

y_i = \langle \mathbf{w}, \mathbf{x}_i \rangle + b    (25.1)
In the mirSVR approach, the feature vector includes the three context features as well as a description of the microRNA alignment to the mRNA, conservation and modeling of the target site accessibility. A key aspect of this method is the ability to predict and score atypical sites that contain mismatches in the seed region without the cost of inflating the number of predictions. mirSVR scores can also be interpreted as probabilities of log-fold downregulation of the target gene which can be useful for predicting experimental outcome. Most cell types express a number of microRNAs and most genes contain target sites for multiple microRNAs. It is therefore likely that many genes are simultaneously regulated by a number of different microRNAs. Indeed, genome-wide microRNA target identification experiments identified many 3′ UTRs with multiple microRNA binding sites (Hafner et al. 2010; Chi et al. 2009). Prediction methods often sum the scores from individual target sites to generate a combined gene score. The PicTar algorithm uses a hidden Markov model (HMM) for scoring genes that are regulated by multiple microRNAs. The method computes a log-odds score for a gene being regulated by multiple microRNAs. The emission probabilities of the model are defined by the conservation scores of the individual target sites whereas the transition probabilities are the maximum-likelihood values of the nucleotide region being a target site or background sequence. In reality the microRNA target site is likely folded into a secondary RNA structure where the mRNA nucleotides are internally base-paired with each other, thereby blocking the target site from pairing with the microRNA. The formation of a stable microRNA–mRNA duplex involves disrupting this secondary structure, or alternatively removing other proteins bound to the target sites, in order for the microRNA to base-pair with the target sites. Energy-based target prediction methods attempt to model this process by computing an energy term that reflects the thermodynamic difference between the bound microRNA–mRNA duplex and the disruption of the mRNA secondary structure:

\Delta G = \Delta G_{\text{bound}} - \Delta G_{\text{unbound}}    (25.2)
A negative energy difference between the two terms suggests that the bound microRNA–mRNA duplex is more stable than the two unbound molecules and therefore this target site is likely to lead to target downregulation. The PITA (Kertesz et al. 2007) algorithm uses RNAfold and RNAduplex (Hofacker 2003) to model the RNA secondary structure (ΔG_unbound) and microRNA–mRNA duplex (ΔG_bound), respectively. Another approach uses the SFold RNA secondary structure modeling to calculate an accessibility score for the target mRNA (Long et al. 2007). The algorithm first searches for a stretch of four consecutive nucleotides on the target mRNA that are accessible for base-pairing with the microRNA. It then attempts to elongate the hybridization between the microRNA and mRNA by disruption of the local mRNA secondary structure. The total energy change associated with this hybridization is used to determine likely functional sites. A crucial part of these approaches is the choice of algorithm for calculating the energy terms and their parameter settings. In particular, RNA secondary structure modeling can vary substantially depending on the allowed base-pairing distance and the length of the RNA that is folded. Furthermore, these types of modeling are computer-intensive and are not always amenable towards whole genome predictions.
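To illustrate the regression idea behind Equation (25.1), the following sketch fits a simple ridge regression of simulated log-fold changes on a handful of site features; the features, weights and data are invented, and mirSVR itself uses support vector regression with a richer feature set.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented feature columns: seed-pairing extent, 3' pairing, local AU content,
# site accessibility, conservation.
X = rng.uniform(0, 1, size=(200, 5))
true_w = np.array([-1.2, -0.4, -0.6, -0.5, -0.3])          # repressive features -> negative logFC
y = X @ true_w + 0.1 + 0.15 * rng.standard_normal(200)     # simulated downregulation values

# Ridge regression in closed form: w = (X'X + lambda*I)^-1 X'y, with an intercept column.
Xb = np.hstack([X, np.ones((200, 1))])
lam = 1e-2
w_hat = np.linalg.solve(Xb.T @ Xb + lam * np.eye(6), Xb.T @ y)

new_site = np.array([0.9, 0.5, 0.7, 0.8, 0.6, 1.0])        # a strong candidate site (+ intercept)
print("predicted log-fold change:", float(new_site @ w_hat))
```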
25.5
Analysis of mRNA and microRNA expression data
Differences in gene expression, as measured between two conditions by microarrays, can be attributed to differences in transcriptional regulation. Some differences, however, may also be due to the actions of microRNAs that result in different degradation rates of mRNAs, and therefore are indicative of microRNA signature. This has been demonstrated in numerous microRNA mis-expression experiments starting with the paper by Lim et al. (2005). The regulatory effect of a specific microRNA can be identified in mRNA expression data that are measured under experimental conditions in which the microRNA is differentially expressed. The genome-wide effects of microRNAs are most notable upon exogenous overexpression, or inhibition, of microRNA and can also be detected at the protein levels (Selbach et al. 2008; Baek et al. 2008). However, even endogenous changes in microRNA expression levels such as in disease versus normal conditions can manifest in changes in the target mRNA levels. The systematic analysis of mRNA expression profiles can also be used to elucidate the microRNA expression signature by identifying the set of microRNA target sites that are significantly depleted in the highly expressed genes (Sood et al. 2006). This is motivated by the notion that highly expressed genes are likely to be depleted of target sites to escape silencing by the endogenous microRNAs. There is a growing compendium of joint mRNA and microRNA expression profiles measured in different conditions, cell-lines, tissues and various cancer types. Such data can be used to identify the combinatorial regulation of multiple microRNAs on a set of functionally related genes. The analysis of joint profiles will eventually lead to systematic linking of the post-transcriptional regulators to their targets. In this section, we review methods for microRNA signature identification from high-throughput data, and also outline some new approaches for analysis of joint microRNA–mRNA profiles.
25.5.1
Identifying microRNA activity from mRNA expression
Identifying microRNA activity from mRNA expression is based on detecting genome-wide downregulation of the target genes. Methods for identifying microRNA signature in gene expression data consist of two steps: (i) choice of target selection or prediction method; and (ii) application of a statistical test or procedure to determine whether differences between candidate targets and the rest are significant (Sood et al. 2006; Arora and Simpson 2008). We define D_0 as the set of targets predicted by one of the algorithms (see Section 25.4) for microRNA μ or, alternatively, the mRNAs with complementary seeds in their 3′ UTRs. D_1 denotes a reference dataset that can be the predicted targets of another microRNA or all other measured gene expression excluding D_0. Vectors with expression values or fold-changes between conditions I and II are computed for both the D_0 and D_1 sets,

FC^{\mu}_{D_0} = \left\{ fc_{g_i} = \log_2 \frac{expr(g_i^{I})}{expr(g_i^{II})},\; i \in D_0 \right\}, \qquad FC^{\mu}_{D_1} = \left\{ fc_{g_j} = \log_2 \frac{expr(g_j^{I})}{expr(g_j^{II})},\; j \in D_1 \right\}.    (25.3)

The second step involves testing a null hypothesis, H_0, that the expression levels, or fold-changes, of groups D_0 and D_1 are the same,

H_0 : FC^{\mu}_{D_0} = FC^{\mu}_{D_1}    (25.4)

against the alternative in which the microRNA and targets have inverse expression:

H_1 : FC^{\mu}_{D_0} < FC^{\mu}_{D_1} \text{ if } \mu_{I} > \mu_{II}, \quad \text{or} \quad H_1 : FC^{\mu}_{D_0} > FC^{\mu}_{D_1} \text{ if } \mu_{I} < \mu_{II}.    (25.5)
This standard comparison of two numerical vectors can be performed by a number of statistical tests. The Kolmogorov–Smirnov test is commonly used to detect the significance of the log expression change of the predicted target set versus a background set (Grimson et al. 2007; Nielsen et al. 2007; Khan et al. 2009).
The Wilcoxon rank sum test has also been successfully applied to mRNA data from different tissues and microRNA mis-expression experiments. In one such study, Sood et al. (2006) detected well characterized tissue-specific microRNA signatures across a number of different cell types by observing a significant downregulation of their target sets. Various modifications of the above scheme have since been used. The significance of microRNA signature based on target sets from various prediction methods was used to compare target prediction methods (Arora and Simpson 2008). In this approach the Wilcoxon test p-values obtained for various prediction sets, generated by the different prediction algorithms, are used as indicators for accuracy of the prediction. However, it should be noted that while some methods can obtain significant p-values their recall values (i.e. the fraction of correctly identified targets from the full set of true targets) can be low. Other statistical procedures, such as the t-test, ranked ratio test (Arora and Simpson 2008), calculating a microRNA activity score (similar to a Gene Set Enrichment score), and co-inertia analysis (Madden et al. 2010) have also been used.

One of the interesting applications of systematic evaluation of mRNA expression across many microRNA perturbation experiments is the ability to detect subtle changes in target levels that provide novel insights into the microRNA silencing mechanism. One such study by Khan et al. (2009) concluded that transfection of microRNA into cells leads to reduction of target silencing by the endogenous microRNAs. By systematic evaluation of many microRNA transfection experiments the authors show that the mRNA levels of targets of endogenous microRNAs are increased relative to the native state in which no exogenous microRNAs are introduced. This upregulation effect strongly suggests that the transfected microRNA saturates the silencing complex mechanism, which impairs the function of endogenous microRNAs. In another study Arvey et al. (2010) have found that microRNAs with a large number of predicted targets exert less of a regulatory effect than microRNAs with fewer targets. A survey of microRNA and siRNA transfection experiments showed that the extent of downregulation of a specific target depends on the overall size of the microRNA's target pool. Hence, a large number of targets will practically dilute the regulatory effect of a microRNA, resulting in reduced silencing of targets when compared with a microRNA with a smaller target set.

A few methods attempt to identify sequence motifs in 3′ UTRs that are associated with microRNA-mediated regulation. This allows for detection of the active microRNAs without prior knowledge of the expressed microRNAs, by identifying the sequence motifs that are correlated with mRNA downregulation and complementary to the microRNA seed region. One such approach is to use a hypergeometric test to identify the over- or under-represented sequence motifs in a ranked gene list (van Dongen et al. 2008). In a similar fashion to gene set enrichment analysis, the sorted gene list is partitioned into two groups using a sliding scale of log-expression cut-off and the enrichment of the motif is compared between the two sets. In typical microRNA overexpression or inhibition experiments the most significant motifs are those complementary to the mis-expressed microRNA.
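A minimal sketch of the two-sample comparison of Equations (25.3)–(25.5) is shown below, using simulated log2 fold-changes for predicted targets and background genes; in practice the two vectors would be computed from the expression data as described above.

```python
import numpy as np
from scipy import stats

# Simulated fold-change vectors standing in for FC^mu_D0 (targets) and FC^mu_D1 (background).
rng = np.random.default_rng(7)
fc_targets = rng.normal(loc=-0.35, scale=0.6, size=300)
fc_background = rng.normal(loc=0.0, scale=0.6, size=5000)

# One-sided Wilcoxon rank-sum (Mann-Whitney) test: are targets shifted towards lower values?
u_stat, p_wilcoxon = stats.mannwhitneyu(fc_targets, fc_background, alternative="less")

# Kolmogorov-Smirnov test on the two fold-change distributions.
ks_stat, p_ks = stats.ks_2samp(fc_targets, fc_background)

print(f"Wilcoxon one-sided p = {p_wilcoxon:.2e}, KS p = {p_ks:.2e}")
```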
Another approach is an iterative linear regression model that evaluates the correlation (positive or negative) of sequence motifs in the mRNA sequence with the extent of downregulation. This is an iterative procedure, termed miReduce, for finding significant sequence motifs that correlate with changes in mRNA expression which is similar to the Reduce algorithm (Bussemaker et al. 2001). The correlated motifs are chosen one by one, by determining in every iteration which motif contribution brings about the greatest reduction in the difference between the model and the expression data. Each of these motifs can be assigned a p-value, and the procedure continues as long as the p-value is lower than some chosen threshold. In microRNA transfection or inhibition experiments miReduce identified the seed of transfected microRNA as the major motif associated with negative fold-changes (Sood et al. 2006; Selbach et al. 2008). While microRNA seed sequence is indicative of microRNA downregulation it is also true that many genes with predicted seed complementary sites are not subject to microRNA-mediated repression. This indicates that targeting of mRNA may be regulated by additional sequence factors that are not in the predicted target sites. A systematic evaluation of a large panel of microRNA transfection experiments has indeed shown that a
number of AU-rich 3′ UTR sequence motifs are significantly correlated with microRNA-mediated regulation (Jacobsen et al. 2010). This study used a novel nonparametric correlation statistics method that scores sequence motifs based on their over-representation in a ranked list of gene expression. Using a permutation test the method looks to identify sequence motifs that are significantly over-represented in the most upregulated, or downregulated, 3′ UTRs.
25.5.2
Modeling combinatorial microRNA regulation from joint microRNA and mRNA expression data
The single microRNA perturbation experiments are a valuable resource for studying microRNA-mediated regulation on a genome-wide scale. However, each mRNA can be concurrently regulated by multiple microRNAs and each microRNA can have hundreds of mRNA targets. Therefore, to obtain a more realistic modeling of microRNA-mediated regulation there is a need to identify the microRNA–mRNA modules (i.e. the set of microRNAs that regulate a common group of genes). The task of identifying the combined regulation of multiple microRNAs from joint expression profiles is more complex than from microRNA mis-expression experiments in which the identity of the perturbed microRNA is known. However, the combination of correlation statistics and computational target prediction can be used to identify these microRNA–mRNA modules from joint profiling data. Most methods follow a number of steps in the analysis of joint microRNA–mRNA datasets: (i) identifying differentially expressed mRNAs and microRNAs; (ii) inferring the inverse expression relationships between these two classes of RNA; (iii) identifying microRNA–mRNA pairs by target prediction; and (iv) finding co-expressed clusters of microRNAs that co-regulate co-expressed mRNA targets. If the mRNA and microRNA measurements do not come from matched samples, statistical tests such as those described in Section 25.5.1 are used to determine whether the microRNA of interest is regulatory (or active on a global scale). If the joint profiling dataset contains several samples in each condition, different types of correlation measures between expression profiles (e.g. Pearson, Spearman, Euclidean distance) can be used. A microRNA is said to be regulatory between two conditions if the change in expression profile of a microRNA and its predicted target mRNAs is significantly more anticorrelated than with other mRNAs. The null hypothesis is that the correlations of microRNA μ and its targets, D_0^{\mu} = \{\rho(t_i, \mu),\; i \in D_0\}, are the same as those of nontargets, D_1^{\mu} = \{\rho(t_j, \mu),\; j \in D_1\}, where \rho(t_i, \mu) is the expression correlation between microRNA μ and gene t_i across the various samples:

H_0 : D_0^{\mu} = D_1^{\mu} \qquad \text{versus} \qquad H_1 : D_0^{\mu} < D_1^{\mu}.    (25.6)
(25.6)
A microRNA-mRNA pair is considered functional if its correlation coefficient is below a certain threshold that can be fixed a priori, or chosen based on estimated false detection rates associated with a series of thresholds. Peng et al. (2009) defined the false detection rate as the percentage of microRNA–mRNA gene pairs out of the total number of selected pairs that would have the same or better correlations just by chance. These authors have generated simulated datasets by randomly choosing the same number of arrays from real expression data, and randomizing the correspondence among samples (label permuting). This procedure identified the optimal Pearson correlation in which the false discovery rate is low but still provides a reasonable size list of microRNA–mRNA gene pairs. Regulatory pairs can also be identified using an odds-ratio statistic but this requires discretizing expression data causing loss of some information (Jayaswal et al. 2009). Increasing experimental evidence suggests that microRNA regulation is mediated in combinatorial fashion where a number of microRNAs act in concert to regulate a common gene set in contrast to individual microRNA regulation which may have a weak effect (Wu et al. 2010). Several computational methods attempted to infer such microRNA–mRNA modules from joint expression data. The methods vary from linear regression models (Tu et al. 2009), bi-clustering pair-wise correlations between microRNAs and mRNAs (Qin 2008),
a rule induction technique (Tran et al. 2008) to probabilistic methods (Joung et al. 2007; Bonnet et al. 2010) and Bayesian models (Huang et al. 2007). The evidence for microRNA–mRNA regulatory modules suggests that microRNAs have evolved to regulate cellular function by silencing related targets that are involved in common cellular pathways. An important computational challenge is to identify the cellular functional role of microRNAs through their regulatory effects on mRNA targets. The basic intuition is that if the targets of a microRNA are enriched for genes with a specific function, process or pathway, it is reasonable to infer that the microRNA is involved in regulating the same function. A number of algorithms predict microRNA biological function by inference from the process and pathway annotation of the associated mRNA targets. The methods differ in their definition or detection of microRNA–mRNA targets, and in the choice of statistical test for enrichment. The commonly used tests are hypergeometric (Nam et al. 2008; Creighton et al. 2008) and log-likelihood (Gaidatzis et al. 2007). Two recent programs, miRbridge (Tsang et al. 2010) and FAME (functional assignment of microRNAs via enrichment) (Ulitsky et al. 2010), also take into account different biological characteristics of target sites, such as the lengths of 3′ UTRs, conservation and the local context, and produce robust measures for the enrichment tests which are also supported by experimental evidence.
25.6 Network approach for studying microRNA-mediated regulation
Transcription of microRNA genes is subject to transcriptional regulation similar to that of protein-coding genes. In addition, there is specific post-transcriptional regulation that controls microRNA biogenesis and the maturation process (Krol et al. 2010). Many microRNA genes are embedded within larger host genes, either in introns or within exons, where their biogenesis is coupled to the transcriptional regulation of the host gene. In such genomic arrangements microRNAs may be by-products of mRNA splicing. MicroRNAs are also negative regulators of TFs, which can have a significant effect on establishing cellular identity and development (Davis and Hata 2009). The interplay between microRNA transcriptional and biogenesis control and microRNA-mediated regulation enables the cell to exert precise control over gene expression. Construction of reliable microRNA regulatory networks, which include both the expression regulation of microRNAs and their mRNA targets, entails a number of difficult computational challenges, including finding conserved transcription factor binding sites (Wang J. et al. 2010) and identifying reliable microRNA targets. A number of studies have attempted to construct microRNA–TF networks by both experimental and computational approaches. The first genome-scale TF–microRNA transcriptional regulatory network was experimentally mapped in Caenorhabditis elegans (Martinez et al. 2008). In Drosophila, conserved cis-regulatory elements that are enriched in known transcription factor binding sites have been identified in the vicinity of intergenic microRNAs (Wang et al. 2008). A heuristic mining methodology was used to find TFs that regulate co-expressed microRNAs in human samples (Bandyopadhyay and Bhattacharyya 2009). Moreover, it was found that groups of genes are tightly regulated by both TFs and microRNAs (Shalgi et al. 2007). One of the hallmarks of embryonic stem cells is the high level of transcriptional activity initiated by the stem-cell factors Oct4, Sox2 and Nanog. Genome-wide experiments to identify the promoter elements bound by these factors using ChIP-seq and RNA-seq have shown that they co-regulate a number of key microRNA clusters in mouse embryonic stem cells as well as in differentiated lineages (Marson et al. 2008). Various network characteristics, such as the distribution of node connectivity, average path length and clustering coefficient, can be indicative of significant biological properties. For example, it has been shown that TF networks follow a power-law degree distribution in which many TFs have very few targets whereas a few TFs regulate many genes – so-called network hubs – and are required for cell viability. In one of the first studies of the TF–microRNA regulatory network, Shalgi et al. (2007) analyzed the global connectivity degree of a mammalian microRNA–target network generated by computational microRNA
target prediction. They found that many 'hub targets' that are potentially regulated by many microRNAs are enriched for TF functionality and cell cycle regulation. On the other hand, no clear 'microRNA hubs' were identified in either mammalian or C. elegans networks (Shalgi et al. 2007; Martinez et al. 2008). At the same time, the miRbridge (Tsang et al. 2010) and FAME (Ulitsky et al. 2010) methods indicate that coordinated microRNA targeting of closely connected genes is prevalent across pathways. In addition, miRbridge provides the first genome-wide evidence of pervasive cotargeting, in which a handful of microRNAs are involved in a majority of cotargeting relationships. Another network characteristic that is indicative of selective evolutionary pressure is the enrichment of specific network motifs – a defined arrangement of network connectivity between three or more nodes that is presumed to perform a specific function (Alon 2007). In networks that involve microRNAs, several network motifs have been identified and/or predicted from the integration of experimental data with computational algorithms and expression profiling (Shalgi et al. 2007; Tsang et al. 2007; Martinez et al. 2008). Strong correlation between some targets and microRNAs observed in expression data can be explained by the existence of various types of feedforward and feedback loops involving microRNAs and TFs (Tsang et al. 2007). Several other microRNA-mediated regulatory modules have been found in specific instances: a double-negative feedback loop (Johnston et al. 2005), a negative feedback loop (Zhao et al. 2010), a positive feedback loop that involves two transcription factors (E2F and Myc) and a negative microRNA (miR-17-92 cluster) loop (Aguda et al. 2008), and a number of feedforward loops [reviewed by Hornstein and Shomron (2006)]. The over-representation of network motifs is usually determined by a network randomization procedure that preserves the network connectivity properties. While a number of randomization procedures for transcriptional networks exist, more sophisticated algorithms perhaps need to be developed that take into account various characteristics such as general conservation level, 3′ UTR length and GC content (Tsang et al. 2010). The notion of microRNA–TF motifs provides a partial explanation for the fact that some predicted high-scoring microRNA targets exhibit strong expression correlation with their regulatory microRNAs, i.e. incoherent targets. In cases where a TF promotes the expression of both a microRNA and its gene target, the role of the microRNA may be to modulate the levels of its mRNA targets rather than to silence them completely (Flynt and Lai 2008). In addition to being over-represented, a network motif is required to perform a specific function. The systems biology approach is to model these structures using ordinary differential equations (ODEs) to study the average system response to an external stimulus, the number of steady states, and the sensitivity to parameter values. ODE models of network motifs or modules can be constructed from basic principles (see Section 25.7) and studied using the tools of dynamical systems (Mangan and Alon 2003; Alon 2007; Aguda et al. 2008). The stochastic differential equation approach is used to study fine-tuning and noise control.
For example, it was recently demonstrated that the incoherent version of the microRNA-mediated feedforward loop can couple the fine-tuning of target protein levels with efficient noise control, thus conferring precision and stability (Hornstein and Shomron 2006).
25.7 Kinetic modeling of microRNA regulation
In this section we present a kinetic model of post-transcriptional gene regulation that integrates microRNA-mediated regulation. Such models provide valuable insight into the mechanisms by which the cell achieves specific cellular responses and into the role of microRNAs in maintaining homeostasis. We begin by introducing a basic model that incorporates microRNA-mediated regulation into simple mRNA and protein kinetics. While the model is simplistic and incomplete, it is sufficiently descriptive to make useful predictions that have been observed experimentally and in a number of computational studies.
25.7.1 A basic model of microRNA-mediated regulation
Intuitively, the steady-state levels of any molecular species are determined by birth and death processes. In the case of mRNAs, the production rate is the transcription rate, q(t), and the degradation rate is δ. Similarly, for proteins, the rate of production is the translation rate of mRNA to protein, λ, and we assume a constant degradation rate δP. Thus the changes in gene expression, mRNA levels m and protein levels p, can be described by the following ODEs:

dm(t)/dt = q(t) − δ m(t),      (25.7)
dp(t)/dt = λ m(t) − δP p(t).   (25.8)

In models of transcriptional regulation it is commonly assumed that, while the production rate governing mRNAs, q(t), depends on the availability of TF(s), in the absence of microRNA the mRNA degradation and protein translation are first-order processes that occur with constant rates, δ and λ, respectively. When a gene is a target of a specific microRNA, its degradation rate and its rate of translation depend on the level of this microRNA (see Figure 25.2). A plausible representation of microRNA-mediated regulation is given by

δ(miR) = δ0 [1 + d(miR^h)],                  (25.9)
λ(miR) = λ0 / (γ + a miR^h),   h ≥ 1,        (25.10)

where d(miR) represents the increase in the degradation rate due to microRNA-mediated regulation, in addition to the basal degradation rate δ0, and h ≥ 1 is a cooperativity, or Hill, coefficient that accounts for regulation by multiple target sites of the same microRNA. In Equation (25.10), γ is an additive half-saturation constant, so that λ0/γ gives the maximum translation rate (when miR = 0). The variable a represents the two alternative modes of microRNA-mediated regulation, with a = 1 when the microRNA leads to translational arrest and a = 0 when the microRNA leads to mRNA degradation. The protein degradation rate, δP, is assumed to be constant. Saturation-type functions for δ(miR) are perhaps more realistic, but a linear form can still be used for low or medium levels of microRNA. The above common-sense descriptions of complex processes can now be connected together, just like pieces of Lego, to build models of gene regulatory circuits (Aguda et al. 2008).
Figure 25.2 MicroRNA-mediated gene regulation. MicroRNA exerts its downregulating effect by increasing mRNA degradation rate and/or repressing protein translation
25.7.2 Estimating fold-changes of mRNA and proteins in microRNA transfection experiments
A more realistic model should take into account the fact that each microRNA affects hundreds of targets to a different degree. Equations (25.7) and (25.8) can be generalized for each target i with gene-specific kinetic parameters qi, δPi, δ0i. In addition, the parameters di, γi and hi of the microRNA-mediated downregulation, which appear in the functional relations di(miR) ≥ di(miR = 0) and λi(miR) ≤ λi(miR = 0), are also target specific and depend on the sequence and structure of each mRNA–microRNA base-pairing and on other recognition elements that have not yet been identified (Khanin and Higham 2009). Let us illustrate the predictive power of this kinetic model by first considering the equations at steady state and computing fold-changes of mRNAs and proteins at different levels of microRNA (Khanin and Higham 2009):

FCi^mRNA := mi / mi(miR = 0) = 1 / (1 + di(miR)) ≤ 1,    (25.11)

FCi^prot := pi / pi(miR = 0) = [λi(miR)/λi] · [mi / mi(miR = 0)] ≤ mi / mi(miR = 0) < 1.    (25.12)

The inequality (25.12) predicts that downregulation of proteins is greater than, or equal to, that of mRNAs, as has indeed been observed for two-thirds of all measured targets in a study that measured microRNA-dependent downregulation at both mRNA and protein levels (Selbach et al. 2008). Downregulation of a gene target is typically measured as the fold-change in expression level between the state when the microRNA is present and the state when it is low or absent. Equation (25.12) can be rewritten as

log2 FCi^prot = log2 FCi^mRNA + Δi,    (25.13)

where

Δi = log2[λi(miR)] − log2[λi] < 0.    (25.14)

Here Δi represents the direct microRNA-mediated repression of translation. Using Equation (25.10) for the microRNA-regulated rate of translation, Δi can be rewritten as

Δi = −log2[1 + miR^hi / γi],    (25.15)

where hi is the number of seeds acting cooperatively (i.e. the distance between the seeds is in the optimal range). If microRNA action on target i results in relatively large translational repression, then miR^hi / γi ≫ 1, and

Δi ≈ −log2[miR^hi / γi] = −hi log2[miR] + log2[γi].    (25.16)

It follows directly from this equation that translational repression measured as log fold-change is linearly correlated with the number of target sites h, particularly for large negative fold-changes. Indeed, to quantify the effect of microRNA-mediated regulation on the translational rates of proteins, Selbach et al. (2008) measured target downregulation both at the mRNA and protein levels. The average number of seeds (i.e. target sites), plotted as a function of the differences between protein and mRNA fold-changes, had a linear decay towards the regime of equal fold-changes, suggesting that in addition to mediating mRNA downregulation, target sites also mediate direct repression of translation rates for hundreds of genes. Equations (25.13) and (25.16) also suggest that protein downregulation is more pronounced than mRNA downregulation, since the microRNA affects both mRNA stability and translation rates and this effect is proportional to the number of target sites h. Indeed, the correlation coefficients of protein and mRNA downregulation with the number of sites indicate that the protein downregulation is compounded on top of that of the mRNA.
25.7.3 The influence of protein and mRNA stability on microRNA function
The second illustration of the model deals with the effect of the mRNA basal degradation rate on the level of microRNA-mediated downregulation. It follows from Equation (25.13) that high-turnover targets (large δ0i) are less affected by microRNA regulation than more stable mRNAs that are long-lived in the cell. Larsson et al. (2010) examined a comprehensive library of microRNA transfections and showed that indeed high-turnover genes are generally less affected by microRNA levels. In contrast, microRNAs can decrease the rate of protein production by suppressing translation rates, and therefore their regulatory effect on proteins with high turnover will be more pronounced than on more stable proteins. This can be demonstrated by estimating the time-response of different targets to microRNA transfection from the time-dependent model [(25.7) and (25.8)]. Ignoring the potential dilution effect in microRNA levels and therefore in the regulation of the targets (Arvey et al. 2010), and assuming microRNA levels post-transfection remain constant, (25.7) and (25.8) have a closed-form solution for each target mRNA (Khanin and Higham 2009). The timescale for microRNA-mediated repression is determined by the degradation rates of mRNAs, δ(miR), and proteins, δp(miR). High-turnover proteins (large δpi) will change rapidly whereas stable proteins (low δpi) will be affected later. Therefore, the inference of functional mRNA and protein targets from high-throughput microRNA transfections is affected by the target degradation rates.
25.7.4 microRNA efficacy depends on target abundance
The simple model [(25.7) and (25.8)] adequately describes experimental observations from microRNA mis-expression experiments. This model, however, does not consider the kinetics of the microRNA itself, whose transcription is subject to tight control by TFs and whose stability is regulated by various factors (Kai and Pasquinelli 2010). In addition, the microRNA itself can become a limiting factor. A recent report verified from microRNA transfections that the activity of a microRNA is diluted by target mRNA abundance (Arvey et al. 2010). In other words, the activity of a microRNA (or small RNA) can be influenced by the levels of all of its target mRNA molecules. To illustrate this concept, Khanin and Higham (2009) developed a model in which the microRNA–mRNA binding step, and the kinetics of microRNA–mRNA complexes, are explicitly considered. These authors demonstrated target cross-talk, wherein the level of microRNA-mediated repression that is achieved depends on the level of the microRNA itself and on the levels of all of its targets.
25.7.5 Reconstructing microRNA kinetics
In this section we illustrate how the kinetics of a microRNA can be inferred from time-course experiments. Lumping together all effects that cause microRNA levels to decrease, and considering linear degradation of the microRNA with rate δm, the model [(25.7) and (25.8)] is complemented by an equation for microRNA kinetics (see Figure 25.3):

dmiR(t)/dt = s(t) − δm miR(t),    (25.17)

where s(t) is the rate of microRNA transcription, which might depend on various TFs. In cell lines and tissues where a specific microRNA is present, its level can be approximated by s/δm. In microRNA transfection experiments, the microRNA level is maximal at the initial (transfection) time: miR(0) = 1. Assuming no production of the microRNA in a cell line or tissue where it is normally not present, s(t) = 0, the microRNA temporal profile is determined by just one parameter, δm: miR(t) = e^(−δm t). Quantitative values for the microRNA degradation rate are rarely available and are hard to measure. It is, however, possible to infer the microRNA half-life from microRNA mis-expression data (Khanin and Vinciotti 2008). Direct targets of a microRNA constitute a structure that is similar to the SIM observed in transcription networks (Alon 2007). By embedding this microRNA-mediated SIM into a statistical framework, the degradation rate of the microRNA, δm, and the kinetic parameters of the SIM targets {qi, δi0, λi, δpi, δi(miR), λ(miR)}, where i = 1, . . . , N, can be estimated using, for example, a maximum likelihood procedure. Application of the maximum likelihood procedure to time-course miR-124a post-transfection gene expression data indicated that miR-124a is stable, with a half-life of 29 h (Khanin and Vinciotti 2008). This corresponds to an overall decay rate of 0.024 h^−1 and is in perfect correspondence with the recently measured decay rates of several microRNAs derived from cultured cells (Wang K. et al. 2010). A kinetic model that takes into account multiple sites for the same microRNA on the 3′ UTR of the target mRNAs gives a better fit to some mRNA profiles, so the number of active seeds, h, can also be estimated from the data. Khanin and Vinciotti (2008) estimated the effective microRNA-mediated fold-change increase in each target mRNA degradation rate, and the reconstructed basal decay rates of target mRNAs in this study have a very good correspondence with experimental measurements from an independent study (Yang et al. 2003), thereby giving strong support for this modeling approach. These methods, with extended modeling assumptions, and other reconstruction techniques, such as Bayesian inference (Rogers et al. 2007), can be applied to new experimental datasets and will yield kinetic information on microRNA regulation, as well as on microRNA time courses and biogenesis.

Figure 25.3 MicroRNA-mediated single input motif (SIM) with N mRNA targets. MicroRNA is produced and decays similarly to other RNAs in the cell
25.8 Discussion
The accumulation of knowledge and our understanding of microRNA-mediated regulation are data-driven and draw heavily on models and computation. Despite some progress in microRNA research, the major questions regarding microRNA-mediated regulation still remain to be solved. Answering them requires the integration of computational and experimental approaches that generate new hypotheses which can be tested experimentally. The computational challenges include not only the fusion of different data types but also the integration of sequence, network and kinetic approaches that until recently have been developed independently of each other. It is becoming clear that sequence searches alone will not suffice for reliable target prediction; this also requires understanding the processes involved in microRNA biogenesis, microRNA turnover and microRNA–mRNA complex formation. In addition, various types of quantitative information are required, such as microRNA expression, target mRNA and protein expression levels, and the RNA binding proteins and other factors that regulate RNA stability. This should be complemented with the systematic discovery of microRNA–TF modules from which reliable network models can be constructed. Finally, the advancement of the field of Systems Biology requires the development of a 'meta network' model (Martinez and Walhout 2009) that adequately represents different types of nodes (e.g. mRNAs, proteins, transcription factors, microRNAs, other small RNAs, RNA binding proteins) and different types of regulation and interaction (e.g. transcriptional regulation, post-transcriptional regulation, translational regulation, protein–protein interactions). Studies of such multidimensional directed networks clearly require new tools for description, statistical analysis, visualization and useful kinetic modeling.
References

Aguda BD, Kim Y, Piper-Hunter MG, Friedman A and Marsh CB 2008 MicroRNA regulation of a cancer network: consequences of the feedback loops involving miR-17-92, E2F, and Myc. Proceedings of the National Academy of Sciences of the United States of America 105(50), 19678–19683. Alon U 2007 Network motifs: theory and experimental approaches. Nature Reviews Genetics 8(6), 450–615. Arora A and Simpson D 2008 Individual mRNA expression profiles reveal the effects of specific microRNAs. Genome Biology 9(5), R82. Arvey A, Larsson E, Sander C, Leslie CS and Marks DS 2010 Target mRNA abundance dilutes microRNA and siRNA activity. Molecular Systems Biology 6, 363. Baek D, Villén J, Shin C, Camargo FD, Gygi SP and Bartel DP 2008 The impact of microRNAs on protein output. Nature 455(7209), 64–71. Bail S, Swerdel M, Liu H, Jiao X, Goff L, Hart R and Kiledjian M 2010 Differential regulation of microRNA stability. RNA 16(5), 1032. Bandyopadhyay S and Bhattacharyya M 2009 Analyzing miRNA co-expression networks to explore TF-miRNA regulation. BMC Bioinformatics 10, 163. Bartel D 2004 MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116(2), 281–297. Bartel DP 2009 MicroRNAs: target recognition and regulatory functions. Cell 136(2), 215–233. Betel D, Koppal A, Agius P, Sander C and Leslie C 2010 Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites. Genome Biology 11(8), R90. Bhattacharyya S, Habermacher R, Martine U, Closs E and Filipowicz W 2006 Relief of microRNA-mediated translational repression in human cells subjected to stress. Cell 125(6), 1111–1124. Bonnet E, Tatari M, Joshi A, Michoel T, Marchal K, Berx G and Van de Peer Y 2010 Module network inference from a cancer gene expression data set identifies microRNA regulated modules. PloS One 5(4), e10162. Bussemaker H, Li H and Siggia E 2001 Regulatory element detection using correlation with expression. Nature Genetics 27(2), 167–174. Chi SW, Zang JB, Mele A and Darnell RB 2009 Argonaute hits-clip decodes microRNA–mRNA interaction maps. Nature 460(7254), 479–486. Creighton CJ, Nagaraja AK, Hanash SM, Matzuk MM and Gunaratne PH 2008 A bioinformatics tool for linking gene expression profiling results with public databases of microRNA target predictions. RNA 14(11), 2290–2296. Davis BN and Hata A 2009 Regulation of microRNA biogenesis: a miriad of mechanisms. Cell Communication and Signaling 7, 18. Ebert MS and Sharp PA 2010 MicroRNA Sponges: Progress and Possibilities. Springer. Esquela-Kerscher A and Slack FJ 2006 Oncomirs - microRNAs with a role in cancer. Nature Reviews Cancer 6(4), 259–269. Filipowicz W, Bhattacharyya S and Sonenberg N 2008 Mechanisms of post-transcriptional regulation by microRNAs: are the answers in sight? Nature Reviews Genetics 9(2), 102–114. Flynt AS and Lai EC 2008 Biological principles of microRNA-mediated regulation: shared themes amid diversity. Nature Reviews Genetics 9(11), 831–842. Gaidatzis D, van Nimwegen E, Hausser J and Zavolan M 2007 Inference of miRNA targets using evolutionary conservation and pathway analysis. BMC Bioinformatics 8, 69. Grimson A, Farh K, Johnston W, Garrett-Engele P, Lim L and Bartel DP 2007 MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Molecular Cell 27(1), 91–105. Guo H, Ingolia NT, Weissman JS and Bartel DP 2010 Mammalian microRNAs predominantly act to decrease target mRNA levels. Nature 466(7308), 835–840.
Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, Rothballer A, Ascano Jr, M, Jungkamp AC, Munschauer M, Ulrich A, Wardle GS, Dewell S, Zavolan M and Tuschl T 2010 Transcriptome-wide identification of RNA-binding protein and microRNA target sites by par-clip. Cell 141(1), 129–141. Hobert O 2004 Common logic of transcription factor and microRNA action. Trends in Biochemical Sciences 29(9), 462–468. Hofacker IL 2003 Vienna RNA secondary structure server. Nucleic Acids Research 31(13), 3429–3431. Hornstein E and Shomron N 2006 Canalization of development by microRNAs. Nature Genetics 38(S20-4), 462–468.
Huang JC, Babak T, Corson TW, Chua G, Khan S, Gallie BL, Hughes TR, Blencowe BJ, Frey BJ and Morris QD 2007 Using expression profiling data to identify human microRNA targets. Nature Methods 4(12), 1045–1049. Jacobsen A, Wen J, Marks DS and Krogh A 2010 Signatures of RNA binding proteins globally coupled to effective microRNA target sites. Genome Research 20(8), 1010–1019. Jayaswal V, Lutherborrow M, Ma DDF and Hwa Yang Y 2009 Identification of microRNAs with regulatory potential using a matched microRNA-mRNA time-course data. Nucleic Acids Research 37(8), e60. John B, Enright AJ, Aravin A, Tuschl T, Sander C and Marks DS 2004 Human microRNA targets. PLoS Biology 2(11), e363. Johnston RJ, Chang S, Etchberger JF, Ortiz CO and Hobert O 2005 MicroRNAs acting in a double-negative feedback loop to control a neuronal cell fate decision. Proceedings of the National Academy of Sciences of the United States of America 102(35), 12449–12454. Joung JG, Hwang KB, Nam JW, Kim SJ and Zhang BT 2007 Discovery of microRNA–mRNA modules via population-based probabilistic learning. Bioinformatics 23(9), 1141–1147. Kai ZS and Pasquinelli AE 2010 MicroRNA assassins: factors that regulate the disappearance of miRNAs. Nature Structural and Molecular Biology 17(1), 5–10. Kertesz M, Iovino N, Unnerstall U, Gaul U and Segal E 2007 The role of site accessibility in microrna target recognition. Nat Genet 39(10), 1278–84. Khan AA, Betel D, Miller ML, Sander C, Leslie CS and Marks DS 2009 Transfection of small rnas globally perturbs gene regulation by endogenous micrornas. Nature Biotechnology 27(6), 549–555. Khanin R and Vinciotti V 2008 Computational modeling of post-transcriptional gene regulation by micrornas. Journal of Computational Biology 15(3), 305–316. Khanin R and Higham D. 2009 Mathematical and computational modeling of post-transcriptional gene regulation by microRNAs. pp. 197–216. Krek A, Gr¨un D, Poy M, Wolf R, Rosenberg L, Epstein E, MacMenamin P, da Piedade I, Gunsalus K, Stoffel M and Rajewsky N 2005 Combinatorial microRNA target predictions. Nature Genetics 37(5), 495–500. Krol J, Loedige I and Filipowicz W 2010 The widespread regulation of microRNA biogenesis, function and decay. Nature Reviews Genetics 11(9), 597–610. Kumar M, Lu J, Mercer K, Golub T and Jacks T 2007 Impaired microRNA processing enhances cellular transformation and tumorigenesis. Nature Genetics 39(5), 673–677. Larsson E, Sander C and Marks D 2010 mRNA turnover rate limits siRNA and microRNA efficacy. Molecular Systems Biology 6, 433. Lewis B, Burge C and Bartel D 2005 Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120(1), 15–20. Lim L, Lau N, Garrett-Engele P, Grimson A, Schelter J, Castle J, Bartel D, Linsley P and Johnson J 2005 Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature 433(7027), 769–773. Long D, Lee R, Williams P, Chan C, Ambros V and Ding Y 2007 Potent effect of target structure on microRNA function. Nature Structural Molecular Biology 14(4), 287–294. Lu M, Zhang Q, Deng M, Miao J, Guo Y, Gao W and Cui Q 2008 An analysis of human microrna and disease associations. PLoS One 3(10), e3420. Madden SF, Carpenter SB, Jeffery IB, Bj¨orkbacka H, Fitzgerald KA, O’Neill LA and Higgins DG 2010 Detecting microRNA activity from gene expression data. BMC Bioinformatics 11, 257. Mangan S and Alon U 2003 Structure and function of the feed-forward loop network motif. 
Proceedings of the National Academy of Sciences of the United States of America 100(21), 11980–11985. Marson A, Levine SS, Cole MF, Frampton GM, Brambrink T, Johnstone S, Guenther MG, Johnston WK, Wernig M, Newman J, Calabrese JM, Dennis LM, Volkert TL, Gupta S, Love J, Hannett N, Sharp PA, Bartel DP, Jaenisch R and Young RA 2008 Connecting microRNA genes to the core transcriptional regulatory circuitry of embryonic stem cells. Cell 134(3), 521–533. Martinez NJ and Walhout AJM 2009 The interplay between transcription factors and microRNAs in genome-scale regulatory networks. BioEssays 31(4), 435–445. Martinez NJ, Ow MC, Barrasa MI, Hammell M, Sequerra R, Doucette-Stamm L, Roth FP, Ambros VR and Walhout AJM 2008 A C. elegans genome-scale microRNA network contains composite feedback motifs with high flux capacity. Genes and Development 22(18), 2535–2549.
Miranda K, Huynh T, Tay Y, Ang Y, Tam W, Thomson A, Lim B and Rigoutsos I 2006 A pattern-based method for the identification of microRNA binding sites and their corresponding heteroduplexes. Cell 126, 1203–1217. Nam S, Kim B, Shin S and Lee S 2008 miRGator: an integrated system for functional annotation of microRNAs. Nucleic Acids Research 36, D159–D164. Nielsen CB, Shomron N, Sandberg R, Hornstein E, Kitzman J and Burge CB 2007 Determinants of targeting by endogenous and exogenous microRNAs and siRNAs. RNA 13(11), 1894–1910. Peng X, Li Y, Walters KA, Rosenzweig ER, Lederer SL, Aicher LD, Proll S and Katze MG 2009 Computational identification of hepatitis C virus associated microRNA-mRNA regulatory modules in human livers. BMC Genomics 10(1), 373. Qin LX 2008 An integrative analysis of microRNA and mRNA expression–a case study. Cancer Informatics 6, 369–379. Rogers S, Khanin R and Girolami M 2007 Bayesian model-based inference of transcription factor activity. BMC Bioinformatics 8(Suppl. 2), S2. Rybak A, Fuchs H, Smirnova L, Brandt C, Pohl EE, Nitsch R and Wulczyn FG 2008 A feedback loop comprising lin-28 and let-7 controls pre-let-7 maturation during neural stem-cell commitment. Nature Cell Biology 10(8), 987–993. Selbach M, Schwanhaeusser B, Thierfelder N, Fang Z, Khanin R and Rajewsky N 2008 Widespread changes in protein synthesis induced by micrornas. Nature 455(7209), 58–63. Shalgi R, Lieber D, Oren M and Pilpel Y 2007 Global and local architecture of the mammalian microRNA-transcription factor regulatory network. PLoS Computational Biology 3(7), e131. Sood P, Krek A, Zavolan M, Macino G and Rajewsky N 2006 Cell-type-specific signatures of microRNAs on target mRNA expression. Proceedings of the National Academy of Sciences of the United States of America 103(8), 2746–2751. Thomas M, Lieberman J and Lal A 2010 Desperately seeking microRNA targets. Nature Structural and Molecular Biology 17(10), 1169–1174. Tran DH, Satou K and Ho TB 2008 Finding microRNA regulatory modules in human genome using rule induction. BMC Bioinformatics 9 (Suppl. 12), S5. Tsang J, Zhu J and van Oudenaarden A 2007 MicroRNA-mediated feedback and feedforward loops are recurrent network motifs in mammals. Molecular Cell 26(5), 753–767. Tsang JS, Ebert MS and van Oudenaarden A 2010 Genome-wide dissection of microRNA functions and cotargeting networks using gene set signatures. Molecular Cell 38(1), 140–153. Tu K, Yu H, Hua YJ, Li YY, Liu L, Xie L and Li YX 2009 Combinatorial network of primary and secondary microRNA-driven regulatory mechanisms. Nucleic Acids Research 37(18), 5969–5980. Ulitsky I, Laurent LC and Shamir R 2010 Towards computational prediction of microRNA function and activity. Nucleic Acids Research 38(15), 1–13. van Dongen S, Abreu-Goodger C and Enright AJ 2008 Detecting microRNA binding and siRNA off-target effects from expression data. Nature Methods 5(12), 1023–1025. Van Rooij E and Olson E 2007 MicroRNAs: powerful new regulators of heart disease and provocative therapeutic targets. Journal of Clinical Investigation 117(9), 2369–2376. Wang J, Lu M, Qiu C and Cui Q 2010 TransmiR: a transcription factor-microRNA regulation database. Nucleic Acids Research 38, D119–22. Wang K, Zhang S, Weber J, Baxter D and Galas DJ 2010 Export of microRNAs and microRNA-protective protein by mammalian cells. Nucleic Acids Research 38(20), 7248–59. Wang X, Gu J, Zhang MQ and Li Y 2008 Identification of phylogenetically conserved microRNA cis-regulatory elements across 12 Drosophila species. 
Bioinformatics 24(2), 165–171. Wu S, Huang S, Ding J, Zhao Y, Liang L, Liu T, Zhan R and He X 2010 Multiple microRNAs modulate p21Cip1/Waf1 expression by directly targeting its 3’ untranslated region. Oncogene 29(15), 2302–2308. Yang E, van Nimwegen E, Zavolan M, Rajewsky N, Schroeder M, Magnasco M and Darnell JJ 2003 Decay rates of human mRNAs: correlation with functional characteristics and sequence attributes. Genome Research 13(8), 1863–1872. Zhao Z, Boyle TJ, Liu Z, Murray JI, Wood WB and Waterston RH 2010 A negative regulatory loop between microRNA and Hox gene controls posterior identities in C. elegans. PLoS Genetics 6(9), e1001089.
Index
acceptance probability, 62, 264, 273, 276, 277, 325, 369 adjacency matrix, 87, 260, 274, 290, 294, 298, 299 nonzero structure, 292 pair of eigenvalues, 294 plots of PPI networks, 305 Affymetrix GeneChip oligonucleotide arrays, 286 aggregation, 25 classifier, 25–7 variable, 25 Akaike information criterion (AIC), 35, 156, 245, 424 Akt model, 427, 430, 431 interquantile ranges for, 433 Akt signalling pathway model, 420, 427 phosphorylated EGF receptor (pEGFR), 435 posterior distribution scatterplots two-dimensional projections of, 432 posterior parameter distribution, projections of, 431 algebraic geometry, 114, 115, 130 algebraic statistical models, 118 definitions, 118–19 graphical models, 119–20 two-site phosphorylation cycle, 120–2 algebraic statistics, 114 first theorem, characterization of space, 128 in systems biology, 115 algebraic tools, 115 analytical technologies, 164–6 Andronov–Hopf bifurcation, 350, 351 annotation, 66, 67, 70 AnnotationDBi package, 74 in bioconductor, 73–4
experimental design, 72–3 functional annotation for network alignment, 215–16 genome annotation, 453 GO annotation, 216, 297, 300 by protein binding sites, 72 by similarity, 71–2 by temporal series, 72 by text-mining, 73 ANOVA, 141, 144 antagonistic immune interactions dendritic cell (DC) and Nipah virus, 458 AP-MS datasets, 320, 322 approaches based on S-systems, 107–9 approximate algorithms, 247, 363 approximate Bayesian computation (ABC), 57, 63, 209, 210, 369–71, 420, 425, 426 Arabidopsis gene expression time series, 285–7 area under the curve (AUC), 31, 279, 280, 393 ArrayExpress, 67 ARTIVA network model, 262–7, 263 assumption, 265 inference procedure, and performance evaluation, 263–7 multiple changepoints, 262 procedure performance, 264 robustness, 267 schematic illustration, 265 regression model, 262–3 synthetic data, 266 assortativity, 310, 321 PPIN, 320
attractors, 334 chaotic (see chaotic attractors) definition, 343 quasiperiodic (see quasiperiodic attractors) steady state, 344 autonomous systems, 341 autoregressive (AR) process, 257 autoregressive time varying model, see ARTIVA network model averaged Pearson correlation coefficient (avPCC), 226 AViditiybase EXtracellular Interaction Screen (AVEXIS) system, 228 bagging, 25–7, 151 BANJO software, 102, 103 Bayes factor, 49, 51, 425 computation of, 58 range, 52 Bayes framework of moderated test statistics, 147 Bayesian analysis, 53 in action, 41–2 Bayesian approach, 15 inclusion of additional data, 471 Bayesian framework, 54 Bayesian hierarchical model, 278 Bayesian inference, 39, 247, 280, 287, 367, 382, 472, 473, 490 computational challenge offered by, 57 and Gibbs sampling techniques, 172 model-based approach for, 209 in peak identification, 472 posterior distribution for a parameter θ, 209 Bayesian information criterion (BIC), 35, 49, 156, 245 Bayesian methods, in metabolomics, 172 Bayesian modelling, 40, 42, 55 for NMRdata, 172 Bayesian model selector, 57 Bayesian multiple testing, 54–5 Bayesian networks, 239, 240, 256, 260, 282, 421 advanced applications in systems biology, 270–87 advantages, 257 Arabidopsis gene expression time series, results on, 285–7 Bayesian learning, 273 biological prior knowledge, inclusion, 273–81 DAG structure, 271 dynamic Bayesian networks, 272–3 graphical representation for, 258 heterogeneous DBNs, 281–7 inference procedures, 250
MCMC sampling scheme, 276–7 modularity, 275 by moralisation, 241 nonlinear/nonhomogeneous DBN, 282–3 performance, 279 practical implementation, 277 principled feature, 272 separation, 238 simulation results, 284–5 with single-cell data, 197 Bayesian posterior distribution, 243, 422 Bayesian regression analysis, 53 Bayesian sampling, to estimate the model parameters, 377 Bayesian statistical framework, 472 Bayesian statistics, 422 maximum a posteriori (MAP) estimate, 423 use of, 469 Bayesian variable selection, 155–6 Bayes Net Toolbox open-source Matlab package BNT for, 260 Bayes theorem, 40, 44, 155, 264 BGe model, 282 BGe score, 271, 279, 282 binary adjacency matrix, 274 binary classification problems, 23 BinGO, 77 biochemical network models, 364–6 network families, and random graphs, 312–15 network families, hypothesis testing and null models, 312–13 tailored random graph ensembles, 313–15 network topologies, quantitative characterization, 310–12 examples, 311–12 local network features, 310–11 SDE representations, 364 tailored random graphs, information-theoretic deliverables, 315–17 information-theoretic dissimilarity, 316–17 network complexity, 315–16 via tailored random graphs, 309 BioGRID database, 101 bioluminescent probe, 189 biomolecular interaction networks, 83 BioPAX, 77 Bonferroni fallacy, 56 Boolean networks, 116, 120, 256, 340 Boolean (binary) values, 227 boosting algorithms, 26 bootstrap methods, 423, 424
Index Brownian motion, 364 Buchnera aphidicola, 454 CABIN (Collective Analysis of Biological Interaction Networks), 77 Caenorhabditis elegans, 485 genome-scale TF–microRNA transcriptional regulatory network, 485 network alignment, 213 CaliBayes system, 370 cAMP pathway, 457 Candida albicans, environmental responses, 457 CART algorithm, 26 CCDS database, 69, 70 cDNA microarrays, 83 cell dynamics, 226 cell engineering, 4, 5 cell signaling systems, 4–5, 419 dynamics, 420 injection of the CagA, 460 modeling, use for, 5 cell-to-cell communication, 421 cell-to-cell variation, 182, 197, 435 expression level of fluorescent probes, 197 quantifying sources, 197–9 cell tracking, 193–4 cellular information processing, 3 cellular metabolism, manipulation of, 4 cellular regulatory systems, 3, 4 cellular signaling events, 182, 185 cellular signalling networks, 325 central limit theorem, 59 challenges about cell signaling system, 6, 7 combinatorial complexity, 7–8 automated model-guided design of experiments, 6 in development of models, 5 network-free simulation, 8 ODEs used to model kinetics, 7 parameters in models, 6–7 rule-based modeling, 8 stochastic simulation compiler (SSC), 8 tracking information, 7 translating available knowledge, 6 changepoint positions vector, 263 changepoint sensitivity, 266 chaotic attractors, 345 chaotic dynamics, 333, 334 charge-coupled device (CCD) imager, 183 chemical Langevin equation (CLE), 364, 423 chemical master equation, 361–3
chemical reaction networks, 105 chemiluminescence detection, 183 ChIP-chip experiments, 255 ChIP-seq experiments, 267 Cholesky factor, 364 chordal, 240 chromatographic behavior, 471 chromatographic retention time (RT), 166 citrate, 171 Citrobacter rodentium, 453 class comparison, 15, 16 generalized linear mixed models, 19 inference, 18–19 models for dependent data, 16 multiple testing, 19–22 class discovery, 15, 35 geometric methods, 31 hierarchical algorithms, 31–2 inference, 33–4 K means, 32 latent variable models, 32–3 classification problems, 28 binary, 23 in computational biology, 22 classification tree, 24, 25 class prediction, 15 aggregation, 25 building a classifier, 22 classification algorithms, 23–4 classifier aggregation, 25–7 performance assessment, 29–31 regularization, 27–9 variable aggregation, 25 cluster analysis, 170 clustering, 149–51 affinity propagation, 150 Gaussian mixture models (GMMs), 151 gene expression data Z clustering, 150 hierarchical clustering, 150 K-centres, 150 K-means, 150 K-medioids, 150 mixture model clustering, 151 number of clusters, estimation, 151 resampling approaches, 151 time-course data, 153 clustering coefficient, 206, 293, 302, 310, 485 c-Met signal transduction logical model for, 460
codebook vector, 170 coefficient matrix, 260 Collective Analysis of Biological Interaction Networks, 77 collision hazard, 360 COMET project, 169 community detection, in PPI networks, 218–19 detection methods, 219–20 evaluation of results, 220–1 comparative genomic hybridization (CGH), 29 complementary binding domains pairwise attraction, 294 computational algebra, 115, 116–18 contingency table, 117–18 computational issues, 57 approximate Bayesian computation techniques, 63 MCMC methods, 61–3 Monte Carlo methods, 59–61 conditional density, 40 conditional independence, 119, 120, 238, 242, 243, 276 graphical models, 420 relationships, 245, 246 for simple undirected and directed acyclic graphs, 239 conditional probability distributions, 257, 425 confidence intervals, 46–8, 401 asymptotic confidence intervals, 402 coverage, 403 finite sample confidence intervals, 403 confocal microscopes, 187 conservation laws, 364 constraint-based algorithms, 242 contingency tables, 41, 114, 116–19, 125, 128, 243 convex optimization methods, 97, 261 bottom-up approach CORE-Net algorithm, 100 identification of the connectivity matrix, 97 via LMI-based optimization, 97–9 top-down approach, 99 reconstruction algorithm, 99–100 CORE-Net algorithm, 97, 100, 101, 103 covariance matrix efficient estimator, 261 Cox’s proportional hazards model, 141 cross covariances, 382 cross-validation (CV), 157 CTLS technique, 91–5 curse of dimensionality, 246
cyclin/CDK regulators, 101 Cytoscape, 77, 78 data augmentation approach, 367–9, 372 data integration, in biological studies, 74 comparing and/or integrating different technologies, 75–6 epigenetics, 76 experimental data, integration of, 74 genetics, 76 metabolomics, 76–7 microarrays sampled from different experimental designs, 75 transcriptomics, 76–7 data processing, 168, 282, 472 LC-MS data processing, 168 and management software, 474 NMR data processing, 166 normalisation, 168 on proteins, 312 data repositories, 67, 68 dChip package, 137 decomposable graphs, 240 dendrogram clustering, 322 deterministic autonomous system, 357 DHFR protein fragment complementation Assays, 228 diagonal covariance matrix, 260 Dicer cleavage maturation, 479 diffusion approximation, 364–5 dimension reduction, 260 directed acyclic graph (DAG), 257, 270 directed graph evaluation (DGE), 280, 281 directed separation, 238 discrete-time stochastic process, 259, 323 discrete-time systems, 100 bifurcation analysis, 352 discretisation scheme, 443 χ2 distribution of the LR statistics, 142 DNA damage, 440 response network, 443, 447 DNA Data Bank, 70 DNA decay, 190 DNA microarrays, 255 DNA sequences, 68, 188, 255, 391 Drosophila melanogaster approximate bipartite subnetwork, 297 Drosophila phagosomes, protein interaction network characteristic, 459 dynamical models, for network inference, 83–4 linear models, 84–5 nonlinear models, 85
Index polynomial models, 85–8 Power-law models, 88 rational models, 85–8 S-systems (synergistic-systems), 88 dynamical network, 227 dynamical stability, 342, 348, 349 dynamical systems basic solution types, 343–6 with discrete/continuous/time, space/states, 340 ergodicity, 353–4 qualitative behaviour, 346–7 qualitative inference in, 339–43 stability and bifurcations, 347–53 timescales, 354–6 time series analysis, 356–7 dynamic Bayesian networks (DBNs), 102, 243 ARTIVA network model, 262–7 continuous data, recovering genetic network from, 255–67 genetic network modelling with, 256–9 graphical representation for, 258 inference procedures, 260–1 for linear interactions and inference procedures, 259–61 multivariate AR model, 261 regulatory networks in biology, 255–6 reverse engineering time-homogeneous, 258–63 state space graph, 272 electrospray ionisation (ESI), 164 EMBL Nucleotide Sequence Data Library, 67 Ensembl, 70 enteropathogenic Escherichia coli (EPEC), 453 genomics, 453 Entrez Gene, 69 entropy, 61, 335, see also Kolmogorov–Sinai entropy entropy-based measures, see minimum description length (MDL) epidermal growth factor (EGF) receptor (EGFR) experimental data measurements of, 426 equivalence classes, 272 possible networks, 245 Erd¨os–R´enyi style random graph, 300 ergodic dynamical system, 354 ergodicity, 335, 353–4 error covariance matrix, 260 Escherichia coli Partial correlation graph using the algorithm, 249 plotting the average of rnz versus rz , 104 RNAi screening for phagocytosis, 459
SOS pathway, inferring by TSNI and compared with NIR, 104, 105 Euclidian distance function, 32, 299, 428 eukaryotic pathogens, environmental responses, 457 Euler discretization, 106 Euler–Maruyama discretisation, 364 European Nucleotide Archive, 70 evidence propagation studies, 246 expectation-maximization (EM), 33 experimental data, 67–8, 78, 89, 109, 317, 372, 398, 403, 405, 408, 426, 452, 477, 486 ontologies and, 77 expert systems, 241, 242 extracellular signal-regulated kinase (ERK) phosphorylation, 197 extrinsic noise, 360, 365 factorisation, 239, 240 of the ASIA Bayesian network, 244 into local distributions, 245 factor potentials, 240 false discovery rate (FDR), 21, 143, 144 family-wise error rate (FWER), 20, 143, 173, 242 Fasciola hepatica, manipulate host environment, 460 fast hybrid stochastic simulation algorithms, 363 feature extraction by image processing, 194 Fisher’s exact test, 116 fitting basis function models, 380 fixed rate constants, 366 flip/period-doubling bifurcation, 352 flow cytometry, 184–5 fluorescent dye, 189 fluorescent microscope, 185–7 fluorescent proteins, 188 flux balance analysis (FBA), 227, 454 Fourier analysis, 141 Fourier transform ion cyclotron resonance (FTICR), 164 free induction decay (FID), 166 FRET probe, 188, 189 Frobenius norm, 92 F-test, 141 functional assignment of microRNAs via enrichment (FAME), 485 fused lasso regularization, 29 Gaussian distributions, 150, 243, 271 Gaussian equivalent scores, 245
Gaussian processes, in biological modeling, 376 corrupted observations of mRNA and protein, 386 cross covariance between p(t) and the ith output gene, 385 modeling TF as a Gaussian process in log space, 387 negative log likelihood, 386 perspective on the model for p(t), 385 single-target Gaussian process, 388 with a zero mean function and a modified covariance function, 386 Gaussian random fields, 59 GenBank, 67, 70 gene association by clustering algorithms, 78 networks, 248, 250 gene-by-gene analysis, 263 gene clustering based on expression data, 31 Gene Entrez, 70 gene expression data, 17, 54, 91, 149, 227, 376, 380, 420, 482 gene expression microarray, 136, 376, 380 gene expression omnibus (GEO), 67 gene expression program during the budding yeast, 101 gene-for-gene (GFG) model, 451 GeneID, 69 gene networks, 78, 88, 90 gene ontology (GO), 70, 73, 77, 215, 294 generalized linear model (GLM), 19, 142, 376, 379–80, 386, 387, 393 gene regulatory networks (GRNs), 90, 91, 100, 105, 106, 256, 270, 309, 453 generic bifurcations, 351, 353 gene-set analysis, 147 gene set enrichment analysis (GSEA), 148 problems using HTS, 148 using χ2 statistics, 148 using z-scores, 147–8 GeneSpring© software, 286 genetic circuits, 3 genetic interaction models, 452 genetic modifications, 4 genome-scale metabolic network, 453 genome-scale metabolic reconstructions, 454 genome-wide association studies (GWAS), 173 genome-wide target identification, 376 genome-wide transcriptional assays, 153 geometric graph models, 298 geometric random graph, 297, 298 G hypotheses tests, 144 Gibbs distribution, 275
Gibbs’ potentials, 240 Gibbs sampling, 62, 172, 247, 469, 473 Gillespie algorithm, 362–3 Gillespie’s direct method, 363 global optimization algorithms, 423 GML, 77 g-prior distribution, 53 graphical Gaussian models (GGMs), 243, 246, 248, 250, 279, 280, 281 graphical models, 237, 246–7, 250–1 application in systems biology, 247 Bayesian networks, 250 correlation networks, 247–8 covariance selection networks, 248 dynamic Bayesian networks, 250 graphical separation, 238–41 graphical structures, 237 graphlet frequencies, 293 green fluorescent protein (GFP), 188 Gr¨obner bases, 115, 125, 129 grouping genes to find biological patterns, 147 Hastings factor, 273 Helicobacter pylori metabolic models, 455 parameter estimation for network models, 209 heterogeneity modelling, 287 heterogeneous changepoint DBN model, 283 heuristic optimisation algorithms, 242 hidden Markov model (HMM), 33, 115 PicTar algorithm, 481 hidden Markov random field (HMRF), 33 hidden variable dynamic modelling (HVDM), 447 hierarchical agglomeration algorithm, 170 hierarchical clustering, 170 high-resolution time course data, 366 high-scoring networks, 273 high-throughput techniques, 290 tandem affinity purification (TAP), 290 yeast two-hybrid (Y2H), 290 Hilbert’s problems, 3 HIV-1 human interaction network, 460, 461 homogeneous Markov model, 273 Hopf bifurcation, 345, 351, 352 host immune system, 458 host–microbe metabolic analyses, 455 host–pathogen interactions, 451, 452, 458–60, 462 host–pathogen systems biology, 451–62 evolution of, 460–2 goals of, 452 immune system interactions, 458–9
Index infectious diseases, medicine for, 460–2 manipulation of, 459–60 metabolic models, 453–5 methods and resources, 452 pathogen genomics, 453 protein–protein interactions, 455–7 response to environment, 457–8 schematic overview of, 452 host–viral systems, 456 HTS technology for mRNA expression estimates, 68, 137–9 need for normalization, 139 in presence of alternative splicing, 139–40 source of error in expression estimates, 139 HUGO Gene Nomenclature Committee, 69 Human Metabolome Database, 177 human pathogens for evolution of the host–pathogen system, 460 genome sequences of, 453 human T-cells, 443 DNA damage, 440 hybrid algorithms, 242, 423 hybrid discrete-continuous Markov process model, 365 hyperchaotic, 348 identifiability, 403 connection of identifiability and observability, 405 nonidentifiability, 403–5 image compensation, 191 image cytometry, 190–1 image processing, 191 image segmentation, 191–3 different methods, 193 imatinib, 5 immune cell behaviour, 459 immune system, 458 evaluation on Raf signalling pathway, 277 in host–pathogen systems, major components, 452 immunocytochemistry, 183–4 inference ABC inference, 436 ARTIVA inference procedure, 263, 267, 472, 490 Bayesian, 39, 41, 57, 172, 209, 210, 278–80, 367, 382 DBN, 256, 259, 261, 267 of gene regulatory networks, 91 on graphical models, 246–7 maximum likelihood, 33 of mixed linear models, 18 parameter, 122, 422 procedures, 260 RJMCMC inference, 263–4
statistical, 48, 118, 368 for stochastic differential equation models, 372 infinite basis, 383 increasing number of basis functions, 383 inner product between the basis functions, 383–4 kernelization, 384 to specify M location parameters, 384 at uniform intervals, 383 information resources, 67 integrand, 59 integrated completed likelihood (ICL), 35 integrated signalling networks, 326 interactome concept, 205 International Conference on Systems Biology, 3 intracellular biological processes, 359 intracellular signaling events, 182 intracellular signal transduction, 181–2 intrinsic noise, 360 Jacobian matrix, 344, 352 JASPAR, 69, 72 Jeffreys’ divergence, 320 Jeffreys–Lindley paradox, 50 KEGG pathways, 70, 77, 279, 281 database, 73, 278, 279, 280, 287 key and lock proteins, 296 K means algorithm, 32 kNN algorithm, 24 knowledge databases, 68 annotation, 71–4 characteristics, 68 ontologies, 70 p53 KD tour, 68–70 p53 ontology tour, 71 Kolmogorov’s forward equations, 361 Kolmogorov–Sinai entropy, 335–6 Kolmogorov–Smirnov test, 482 Kronecker delta, 282 K¨ullback–Leibler (KL) divergence, 34, 149 Kyoto Encyclopedia of Genes and Genomes (KEGG), 274 Lactococcous lactis, 109 chemical reaction network, 109–10 reconstruction, glycolytic pathway, 109 reverse engineering the network topology, 110 Lagrange interpolation, 442 Laplace approximation, 35, 387 LARS software, 260
Lasso, 154
    estimation, 260
    for groups of genes, 155
latent variable models, 32
layer system, 354
learning graphical models, 241–2
least squares, 89–91
    identification of the connectivity matrix, 89–90
Levenberg–Marquardt algorithm, 428
LF-MCMC methods, 370, 371
likelihood, 33, 366
likelihood-based model, 395–6
likelihood-free approach, 368
likelihood-free MCMC (LF-MCMC), 369
likelihood-free particle filtering technique, 371
likelihood function, 40, 45, 46, 59, 63, 302, 399, 422
likelihood ratio, 48, 49, 142, 243, 305
    test, 401, 407, 424
linear BGe model, 285
linearization
    criterion, 345, 352, 355
    matrix, 355
    principle, 344
linear models, 84–5
    connectivity matrix by least squares methods, 89–90
    via LMI-based optimization, 97–9
    convex optimization methods, 97
    CORE-Net algorithm, 100
    CTLS, 91–5
    methods based on least squares, 90–1
    PACTLS algorithm, 95–7
    reconstruction methods based on, 89
    top-down approach, 99–100
linear noise approximation, 372
linear regression model, 260
liquid chromatography–mass spectrometry, 468
live cell imaging, 182, 187–8
    fluorescent probes, 188–90
local bifurcations, 353
local FDR, 21
locus for enterocyte effacement (LEE), 453
logarithmic scalings, 63
logic sampling, 247
logistic regression algorithm, 23
log-likelihood ratio, 305
log-linear models, 115, 126–9
Lotka model, 345
    trajectories in, 346
Lotka–Volterra model, 351
L2 regularization, 27
L1 regularization function, 28
LS algorithm, 91
Lyapunov exponents, 333, 334, 348, 349, 357
Lyapunov time trajectories, 335
lysate-based assay, 182, 185
machine learning algorithm, 72, 194, 262
    TESLA, 262
magnetic moment, 164
major histocompatibility complex (MHC), 458
MAMC, see Markov chain Monte Carlo (MCMC)
Maraviroc, 462
Markov blankets, 240, 243
Markov chain, 323
Markov chain Monte Carlo (MCMC), 57, 61, 63, 115, 262, 273, 276–7, 283–4, 370–2, 376, 379, 425–6, 473
    algorithm, development, 368
Markov jump process, 360–4
    random time-change representation, 363
Markov partitions, 337
Markov process models, 362, 365, 366, 367
    inference, 366–72
        approximate Bayesian computation, 370–1
        data augmentation MCMC approaches, 368
        iterative filtering, 371
        likelihood-based inference, 366–7
        likelihood-free approaches, 369–70
        partial observation and data augmentation, 367–8
        particle MCMC, 371
        stochastic differential equation models, inference for, 372
        stochastic model emulation, 371–2
Markov property, 63, 239, 246, 257, 361, 366
Markov structure, 34
MARKS graphical Gaussian network, 245
MaSigPro, 72
mass-action stochastic kinetics, 360–1, 367
mass spectrometric protein complex identification (HMS-PCI), 291
mass spectrometry (MS), 164
mass spectrometry analysis, 471
mass to charge ratio (m/z), 164
matching alleles (MA) models, 451
maximum a posteriori (MAP), 35, 156
maximum likelihood estimation (MLE), 18, 33, 398, 400
MCMC, see Markov chain Monte Carlo (MCMC)
membrane yeast two-hybrid (MYTH) assays, 228
metabolic circuits, 5
metabolic control analysis (MCA), 454
metabolic correlation networks, 173–6
metabolic data, 171
metabolic modelling, 453
metabolic networks, 227
    of pathogens and symbionts, 454
metabolite data
    probabilistic peak detection, 472–3
    software development for, 474–5
    statistical inference challenges, 473–4
metabolite identification
    challenges of, 468–9
    data integration, 472
    probabilistic approach, 471
    schematic of, 470
    thermodynamic integration, 474
metabolome-wide association studies (MWAS), 172
    dietary status for caloric restriction data, 173
metabolome-wide significance level (MWSL), 173, 174
metabolomics, 163, 467, 469–71
    analysis techniques, 468
    software development for
        mzMatch pipeline, 474
        mzMine, 474
        mzML, 474
        mzXML, 474
        PeakML, 475
        XCMS, 474
MetAssimulo, 177, 178
Metropolis algorithms, 63
Metropolis–Hastings algorithm, 61, 273, 276, 368
Metropolis-within-Gibbs algorithms, 63
Michaelis–Menten kinetics, 87, 88, 398, 441
microarray gene expression data (MGED), 67–8
microarray repositories, 67, 75
microarray technology, 67, 76, 136–7, 273, 380
    mRNA expression estimates from, 136–7
    quantitative mRNA/protein profile data from, 68
microRNA–mRNA datasets, analysis of, 484
microRNA–mRNA regulatory modules, 485
microRNAs
    biogenesis, 490
    cellular signals, 479
    computational tools, developments of, 478–9
    Dicer cleavage, 479
    efficacy depends on target abundance, 489
    gene expression, data, 482–5
    gene regulation, kinetic modeling of, 486–90
        basic model of, 487
        depends on target abundance, 489
        fold-changes of mRNAs and proteins, 488
        influence of protein and mRNA stability, 489
        6-mer seed sequence, 479
        post-transcriptional, 478
        reconstruction, 489–90
    genome-wide effects of, 482
    mediated regulation, 486–7
    network approach for, 485–6
    regulation, 478
    seed sequence, 483
    sequences, 480
    signature, 482, 483
    single input motif (SIM), 490
    systems biology of, 477–8
    target predictions, 479–81
    3′ UTRs, 479
Milstein method, 365
minimum description length (MDL), 245
miRanda algorithm, 480
miRbridge, 485, 486
MIRIAM Resources, 68, 70
mirSVR approach, 481
mitogen-activated protein kinase (MEK), 197
mixed integer linear programming, 227
mixed model inference, 18
MNI algorithm, 90
model based target ranking, 387–90
    evaluation results, of different rankings, 390
    model fit for two different classes of Gaussian process model, 389
    model of translation from TF mRNA concentration, 387–8
    evaluation using data from ChIP-chip experiment, 388
    TF protein concentration, 388
    mRNA expression levels, 387
modeling of cell signaling systems, 3–5
model invariants, 124–6
modelling transcription factor activity, 440–50
    applications, 447–9
    biological system, nature of, 443–5
    computation of nonzero entries of, 443
    estimating intermediate points, 449–50
    index bounds, approximation time point, 446
    ODE, integration, 441–3
    polynomial extrapolation, 445
    polynomial interpolation, bounds choice for, 445–7
model of gene expression, 377
modified diffusion bridge method, 372
modular protein interaction domains, 4
molecular mechanisms
    of cell, 247
    of cellular information processing, 4
    of immunity and subversion by intracellular parasites, 459
molecular therapeutics, 5
Monte Carlo methods, 59–61
Monte Carlo simulations, 246, 406
moral graph, 257
mRNA
    rate of production, 377
    regulation, 255
mRNA concentration, 84, 377, 379, 380, 382, 385–7
mRNA expression levels, 135
    estimates from microarrays, 137
    HTS technologies, 68, 137–40
mRNA–microRNA base-pairing, 488
multidimensional scaling (MDS), 299
multinomial distribution, 32, 246, 271
multiple testing, 19, 20
multiple transcription factors, 391–3
    basic single-TF ODE model, 391
    inferred TF profiles, 391–3
    multiple TF model, 391
    receiver operating characteristic (ROC) curves, 393
multivariate Gaussian distribution, 243
multivariate regression methods, 171
Mutoss R-project, 22
Mycobacterium avium subsp. paratuberculosis (MAP), response to environment, 457
Mycobacterium tuberculosis, 454
    map of the central metabolism of, 456
    metabolic models, 454
NCBI database, 67, 69
nearest neighbour (NN) algorithm, 24
Neimark–Sacker bifurcation, 353
Network Analysis tools, 77
network merging, 78
network structure, 6, 33, 87, 110, 242, 246, 262, 270, 275, 276, 311, 420
network topology, 84, 94, 99, 209, 228, 265, 454
Newton polytopes, 115
NIR algorithm, 90
NMR sensitivity, 164
node association by transcription factor, 78
noise control stochastic differential equations, 486
noninformative priors, 40, 53
nonlinear dynamics, 333
    chaos in biology, 338
    Kolmogorov–Sinai entropy, 335–6
    Lyapunov exponent, sensitivity to initial conditions, 334
    natural measure, 334–5
    symbolic dynamics, 336–8
nonlinear ODE models, 84, 85
nonlinear regulation process, 287
nonlinear state space process, 285
nonlinear systems
    chaotic dynamics, 333
    linearizations, 345
non-parametric additive regression, 257
nuclear magnetic resonance (NMR) spectroscopy, 164, 166–7
nuisance parameter, 53
null hypothesis, 21, 141
numerical algorithms, 242
ODE models, for reaction networks, 396–7
    dynamics of protein concentration, 397
    nonlinear nature, 399
    rate equations, 397–8
oligonucleotide chips, 83
oncogenes, 4
online predicted human interaction database (OPHID), 226
    probability density plots, 226
ontologies
    and experimental data, 77
    Gene-set analysis, 77
    GO biological terms, 77
    KEGG pathways, 77
Open Biomedical Ontologies (OBOs), 70, 77
ordinary differential equations (ODEs), 5, 83, 340, 359, 377, 395, 440, 478, 486, 487
    integrating with differential operator, 441–3
    polynomial integration, 449
2-oxoglutarate, 170
PACTLS algorithm, 95–7, 101, 102
    performance of, 103
pAkt-S6 formation, 432
parameter estimation, 115, 122–3, 398
    confidence intervals, 401–3
    maximum likelihood estimation (MLE), 398
    objective function, 398
    sensitivity equations, 399–400
parameter learning, 242, 246
partial correlation graphs derived from genomic data, see gene association, networks
partial differential equations (PDEs), 340
partial least squares (PLS) regression, 171
partially observed Markov process (POMP) models, 368
particle marginal Metropolis–Hastings (PMMH) algorithm, 371
particle MCMC methods, 371
partition function, 275
pathogen–host protein interactions, 460
Pearson’s χ²-test, 400–401
penalized likelihood methods, 154
    and Lasso, 154–5
    ordinary least squares (OLS) estimator, 154
penalized logistic regression, 27
performance of a model, 156–7
    cross-validation (CV), 157
    hold-out method, 156–7
    training error, 156
period-doubling route to chaos, 353
pharmacological inhibition, of receptor tyrosine kinase, 5
pharmacological perturbations, 4
phosphorylation-specific antibodies, 182
photobleaching technique, 188
photoconvertible fluorescent proteins, 188
PITA algorithm, 481
p53 KD tour, 68–70
Plasmodium falciparum, 455
    metabolic models, 455
    parameter estimation for network models, 209
Plasmodium proteins
    interolog approach, 457
PLS algorithm, 171
PLS Discriminant Analysis (PLS-DA), 171
PLS modelling of caloric restriction data, 172
PMMH algorithm, 371
point null hypotheses, 49–50
Poissonian degree distribution, 325
Poisson processes, 363
Poisson random variable, 263
polynomial and rational models, approaches based on, 105
    chemical reaction networks, 105–6
    gene regulatory networks, 106–7
polynomial dynamical system (PDS), 129
polynomial equations, 114
POMP models, 368, 371
    maximum likelihood approaches to, 368
population Monte Carlo (PMC), 60
positive predictive value (PPV), 102, 265
posterior distribution, 40–7, 55, 58–63, 62, 155, 156, 209, 264, 272, 274, 276, 279, 280, 369, 370, 385, 425, 430, 432, 469
post-translational modification, 4
power-law approximations, 104
PPI data, 228
PPI networks
    computational analysis of, 205
    extreme eigenvectors, components, 296
    predicting function using, 221–3
PPINs
    applications, 317–20
    collection, 322
    datasets, 312, 317, 320
    of H. sapiens, 311
pRDRG model, 303, 304
    periodic spectral reordering algorithm, 304
precision, 102
predicting interactions, using PPI networks, 223–4, see also PPI networks
    limitations, 227–8
    tendency to form triangles, 224
    using triangles for predicting interactions, 224–5
predictive distribution, 55, 56
principal component analysis (PCA), 169
    sensitivity analysis, 430–5
    stiff principal components (PCs), 434
prior distribution, 42–6
prior knowledge, 102
prior knowledge matrix, 278
probability distribution, 40, 72, 117, 118, 122, 259, 274, 283, 361, 362, 422, 425
profile likelihood approach, 405–6
    applications, 408
    assessing the identifiability of parameter, 407
    experimental design, 406
    model reduction, 407
    observability and confidence intervals of trajectories, 407–8
proline-rich sequence, 4
protein association, 78
protein binding microarray (PBM), 72
protein binding site (PBS) annotation, 72
protein complexes, 4
Protein Data Bank, 68
protein degradation, 84
protein/DNA interaction data, 267
protein interaction data, 204
    genome scale, integration of, 455
protein interaction networks
    approximate Bayesian computation, 209–10
    comparison, 211
        based on subgraph counts, 211–13
        functional annotation for network alignment, 215–16
        network alignment, 213–15
    evolution, 217
        affecting network alignment, 217–18
    integrating transcription regulation using dynamic gene-expression data, 227
    interactome concept and, 205
    models of random networks, 207–9
    network analysis, 205–7
    parameter estimation for network models, 209
    threshold behaviour in graphs, 210–11
Protein kinase B pathway, 426, 427
protein networks, 78
protein phosphorylation, 115, 185
protein–protein interactions (PPI), 83, 200, 202, 420
    error in PPI data, 204–5
    network models, 290, 294–301
        geometric graphs to, 299
        geometric networks, 297–301
        geometric random graph model, 301
        lock and key, 294–7
        random graph models and application, 290
        range-dependent graphs, 301–5
    physical basis, schematic representation, 291
proteins, 201
    experimental techniques, for interaction detection, 202
        co-immunoprecipitation, 203
        tandem affinity purification, 203
        yeast two-hybrid system, 202–3
    function, 201–2
    protein interaction databases, 204
    structure, 201–2
protein synthesis by pulsed SILAC (pSILAC), 478
proteomics, 5, 174, 203, 205, 459, 478
protozoan parasites, dataset, 471
pseudo-Bayes factors, 52
pseudo-marginal approach, 371
p53 targets, 447
p-values, 21, 22, 142, 146
quadratic penalization function, 28
quantitative measurements
    of biological entities, 66
    of proteins, 183
    use of microscopes, 187
quantitative mechanistic dynamic models, 419
quantitative mRNA/protein profile data, 68
quantitative reverse transcription polymerase chain reaction (qRT-PCR), 396
quasiperiodic attractors, 345
query nodes, 247
Raf signalling pathway, 278
    cytometry data, inferring hyperparameters from, 279
    dysregulation, 277
    empirical evaluation on, 277–81
    reconstruction, 281
random inference algorithm, 102
randomness, 40
random time change representation, 363
random variables, 237, 243
range-dependent interaction probabilities, 302
range-dependent random graph (RDRG) model, 302, 303
    separation distance, 303
R/Bioconductor package, 447
reaction networks, based on ordinary differential equations, 115
reaction rate equations (RREs), 365
receiver operating characteristic (ROC) curve, 280, 300–1
receptor tyrosine kinase, pharmacological inhibition of, 5
recurrence plot, 357
regression coefficients, 260, 263, 267
regression model, 52, 262–4
regulatory networks in biology, 255–6
regulatory systems, 5
repositories, 67
restricted maximum likelihood (ReML), 18
reverse engineering, 84, 116, 129
    chemical reaction networks, 105
reversible-jump Markov chain Monte Carlo (RJMCMC), 172, 263, 264, 284
RNA-induced silencing complex (RISC), 479
RNA interference (RNAi), 227
RNA-Seq, 67, 68, 139, 420, 490
robust multichip analysis (RMA), 137
Runge–Kutta fourth order scheme, 442
Saccharomyces cerevisiae, 101
    cell cycle regulatory subnetwork, 102
    gene regulatory subnetwork, 103–4
saddle-node bifurcation, 349
sampling biases, of experimental methods, 228
SBML models, 77, 370
score-based algorithms, 242
score equivalent functions, 245
self-organising map (SOM), 170
sequential Monte Carlo (SMC), 57, 60–1, 63, 210, 371, 425, 430
    ABC-based approaches, 428–30
sequential quadratic programming (SQP), 108
Shannon’s information theory, 336
SH3 binding domain, 4, 295
signal cell measurement data, analysis of, 194
    Bayesian network modeling, 197
    quantifying sources of cell-to-cell variation, 197–9
    time series, 194, 197
signaling protein, 4
signalling pathway models, inference of, 419–35
    Akt signalling pathway, 426
        epidermal growth factor (EGF), 427
        exploring different distance functions, 428–9
        nerve growth factor (NGF), 427
        principal component analysis (PCA), sensitivity analysis, 430–5
        type-2 diabetes and cancer, 426
    dynamical systems, parameter inference, 422–5
        model selection methods, 424–5
        optimization algorithm, 422
        stochastic dynamic models, 423
    inference techniques, overview of, 420–2
signal transduction, 419
SIMoNe (statistical inference for modular networks), 260
simulation
    algorithms and analysis, 373
    based on the assumption of the ARTIVA model, 265
    computational modelling and, 453
    FBA simulations, 455
    hybrid simulation approaches, 365, 371
    MCMC simulations, 277
    of metabolic profile data, 176–8
    method by Markov chain generation, 61
    Monte Carlo simulations, 406
    network-free simulation, 8
    rule-based model, 8
    of SBML models, 370
    spatial simulations of bacterial and immune cell behaviour, 459
    stochastic simulation algorithm, 362, 363
single-cell assay, 182
single-cell dynamics, 421
single-species ecosystem population growth, discrete-time model for, 338
singular value decomposition (SVD), 364
slice sampler, 62
SMC algorithm, see sequential Monte Carlo (SMC)
sparsity coefficient, 101
sparsity pattern, 100
spectral reordering algorithms, 304
splicing, 135
sponge RNAs, 479
Src homology 2 (SH2) domain, 4
Staphylococcus aureus, 459
    RNAi screening for phagocytosis, 459
state-space graph, 282
state-space models, 250
static Bayesian network, 257, 259, 276, 281, 421
static modelling approaches
    Bayesian networks, 256
    correlation networks, 256
    graphical Gaussian models, 256
stochastic approximation EM (SAEM), 34
stochastic block model (SBM), 33
stochastic chemical kinetics, 360–6
    chemical master equation, 361–2
    diffusion approximation, 364–5
    Gillespie algorithm, 362–3
    Markov jump process, 360–4
    modelling extrinsic noise, 365–6
    random time change representation, 363
    reaction networks, 360
    reaction rate equations, 365
    structural properties, 363–4
stochastic differential equation models, 372
stochastic dynamical systems, 359
    Markov process models, inference, 366–72
        approximate Bayesian computation, 370–1
        data augmentation MCMC approaches, 368
        iterative filtering, 371
        likelihood-based inference, 366–7
        likelihood-free approaches, 369–70
        partial observation and data augmentation, 367–8
        particle MCMC, 371
        stochastic differential equation models, inference for, 372
        stochastic model emulation, 371–2
    stochastic chemical kinetics (see stochastic chemical kinetics)
    stochasticity, origins, 359–60
        low copy number, 359–60
        noise and heterogeneity, sources, 360
stochastic EM (SEM), 34
stochastic kinetic model, 363
stochastic model emulator, 372
stochastic process, 257
stochastic simulation algorithm (SSA), 362
stoichiometry matrix, 360, 363
Stokes shift, 185
structural stability, 341
structure learning algorithms, 242
Student’s t distribution, 56
    predictive distribution based on, 56
support vector machine (SVM), 28
support vector regression (SVR) algorithm, 481
symbolic dynamics, 336–8
synthetic network datasets, 293
    marginal edge posterior probabilities for, 285
Systems Biology Graphical Notation (SBGN) project, 5
Systems Biology Markup Language (SBML), 370
target prediction algorithms, 479–80
testing hypotheses, 48
    Bayes factor, 48–9
    Bayesian multiple testing, 54–5
    decisions, 48
    improper priors, ban on, 50–2
    nuisance parameters, 52–4
    point null hypotheses, 49–50
TF–microRNA transcriptional regulatory network, 485
time-discrete system, 341
time homogeneity assumption, 261
time of flight (ToF), 164
time series analysis, 356–7
time-T map, 341
time-varying DBNs
    graphical representation for, 258
time-varying network inference, 267
topological transitivity, 335
total least squares (TLS) technique, 90
transcriptional regulatory network, 227, 376
transcription factor activity, 448
transcription factor (TF) concentration, 377, 478
TRANSFAC PWM, 72
transfection, 190
triple-quadrupole (QQQ) instruments, 164
Trypanosoma brucei
    Bayesian analysis of metabolomic data, 471
    metabolic models, 454
Trypanosoma cruzi
    metabolic models, 455
TSNI algorithm, 90
t-test, 141, 171
tumor suppressor genes, 4–5
type I error, 143
tyrosine kinase, 4
UCSC Genome Browsers, 69
undirected graph evaluation (UGE), 280, 281
uropathogenic E. coli (UPEC), response to environment, 457
validation of probes, 189–90
van der Pol oscillator dynamics, 356
variance components, 16–17
vector-autoregressive (VAR) model, 250
virulence, 460
VirusMINT database, 456
visualization software, as integrative tools, 77
Viterbi algorithm, 35
Wald t-statistics, 142
Ward linkage, 170
Western blot analysis, 183, 184, 197, 203, 396, 402
Wiener process, 364
Wilcoxon rank-sum test, 141, 483
worm ‘core’ network (Wcore), 305
yeast PPI network, 305
    adjacency matrix, 292
YEASTRACT database, 104
yeast two-hybrid (MYTH) assays, 228